high performance computing on graphics processing units: hgpu.org

Posts

Apr, 16

Acceleration of CFD and data analysis using graphics processors

Graphics processing units function well as high performance computing devices for scientific computing. The non-standard processor architecture and high memory bandwidth allow graphics processing units (GPUs) to provide some of the best performance in terms of FLOPS per dollar. Recently these capabilities became accessible for general purpose computations with the CUDA programming environment on NVIDIA […]

CUDA

Apr, 16

Fast GPU-based fluid simulations using SPH

Graphical Processing Units (GPUs) are massive floating-point stream processors, and through the recent development of tools such as CUDA and OpenCL it has become possible to fully utilize them for scientific computing. We have developed an open-source CUDA-based acceleration framework for 3D Computational Fluid Dynamics (CFD) using Smoothed Particle Hydrodynamics (SPH). This paper describes the […]

CUDA

Apr, 16

Heterogeneous Highly Parallel Implementation of Matrix Exponentiation Using GPU

The vision of a supercomputer on every desk can be realized by powerful, highly parallel CPUs, GPUs, or APUs. Graphics processors, once specialized for graphics applications only, are now used for highly computationally intensive general-purpose applications. GFLOPS- and TFLOPS-level performance, once very expensive, has become cheap with GPGPUs. Current […]

OpenCL

Apr, 14

Efficient Hash Tables on the GPU

Advances in GPU architecture have made efficient implementations of hash tables possible, allowing fast parallel constructions and retrievals despite the uncoalesced memory accesses naturally incurred by hashing algorithms. The key is to mitigate the penalty of these accesses by minimizing the number that occur and utilizing the cache (when one is available). Most work done […]

CUDA

Apr, 14

An implementation of the tile QR factorization for a GPU and multiple CPUs

The tile QR factorization provides an efficient and scalable way for factoring a dense matrix in parallel on multicore processors. This article presents a way of efficiently implementing the algorithm on a system with a powerful GPU and many multicore CPUs.

CUDA

Apr, 14

Parallelization of PageRank on Multicore Processors

PageRank is a prominent metric used by search engines for ranking search results. The PageRank of a particular web page is a function of the PageRanks of all the web pages pointing to it. The algorithm works on a large number of web pages and is thus computationally intensive. The need of hardware […]

CUDA

Apr, 14

phiGEMM: a CPU-GPU library for porting Quantum ESPRESSO on hybrid systems

GPU computing has revolutionized HPC by bringing the performance of the supercomputer to the desktop. Attractive price, performance, and power characteristics allow multiple GPUs to be plugged into both desktop machines and supercomputer nodes for increased performance. Excellent performance and scalability can be achieved for some problems using hybrid combinations of multiple GPUs […]

CUDA

Apr, 14

GPU parallel computing: Programming language, debugging tools and data structures

With many cores driven by high memory bandwidth, today’s graphics processing unit (GPU) has evolved into an absolute computing workhorse. More and more scientists, researchers and software developers are using GPUs to accelerate their algorithms and applications. Developing complex programs and software on the GPU, however, is still far from easy with existing tools provided […]

CUDA

Apr, 13

A GPU Memory System Comparison for an Elliptic Test Problem

This paper presents GPU-based solutions to the Poisson equation with homogeneous Dirichlet boundary conditions in two spatial dimensions. This problem has well-understood behavior, yet its computation resembles that of many more complex real-world problems. We analyze the GPU performance using three types of memory access in the CUDA memory model (direct access to global memory, texture access, […]

CUDA

Apr, 13

Software Model Checking for GPGPU Programs, Towards a Verification Tool

The tremendous computing power of GPUs has made them the focus of unprecedented attention for applications other than graphics and gaming. Apart from the highly parallel nature of the programs to be run on GPUs, the sought-after gain in computing power is only achieved with low-level tuning at the thread level […]

CUDA

Apr, 13

Writing a modular GPGPU program in Java

This paper proposes a Java to CUDA runtime program translator for scientific-computing applications. Traditionally, these applications have been written in Fortran or C without using a rich modularization mechanism. Our translator enables those applications to be written in Java and run on GPGPUs while exploiting a rich modularization mechanism in Java. This translator dynamically generates […]

CUDA

Apr, 13

Parallel programming with CUDA

This report documents our master's thesis project, which is about parallel programming with CUDA, the NVIDIA GPU architecture with support for general-purpose computing. The purpose of the thesis is to uncover the qualities of CUDA as a parallel computing platform, determining the possibilities and limitations of its ability to handle different types of algorithms. […]

CUDA

