high performance computing on graphics processing units: hgpu.org

Posts

Oct, 5

Performance Analysis and Optimisation of the OP2 Framework on Many-core Architectures

This paper presents a benchmarking, performance analysis and optimisation study of the OP2 "active" library, which provides an abstraction framework for the parallel execution of unstructured mesh applications. OP2 aims to decouple the scientific specification of the application from its parallel implementation, and thereby achieve code longevity and near-optimal performance through re-targeting the application to […]

CUDA

Oct, 5

GPU accelerated 2-D staggered-grid finite difference seismic modelling

The staggered-grid finite difference (FD) method demands significant computational capability and is inefficient for seismic wave modelling in 2-D viscoelastic media on a single PC. To improve computation speed, a graphics processing unit (GPU) accelerated method was proposed, since modern GPUs have become ubiquitous in desktop computers and offer excellent cost-to-performance parallelism. The […]

OpenGL
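As a rough illustration of the staggered-grid idea mentioned in the abstract above, here is a minimal sketch in Python. It is not the paper's method: it assumes a much simpler 1-D elastic velocity-stress system rather than 2-D viscoelastic media, and the names `staggered_step` and `run`, the rigid boundaries, and all parameter values are hypothetical choices made for this example.

```python
import math


def staggered_step(v, s, rho, modulus, dx, dt):
    """Advance velocity v (at nodes) and stress s (at half-nodes) one step, in place."""
    n = len(v)
    # Velocity at node i uses the stresses at i-1/2 and i+1/2, stored as s[i-1] and s[i].
    # The end nodes are left untouched, which models rigid (v = 0) boundaries.
    for i in range(1, n - 1):
        v[i] += dt / (rho * dx) * (s[i] - s[i - 1])
    # Stress at half-node i+1/2 uses the freshly updated velocities at i and i+1.
    for i in range(n - 1):
        s[i] += dt * modulus / dx * (v[i + 1] - v[i])


def run(n=101, steps=200, rho=1.0, modulus=1.0, dx=1.0, cfl=0.5):
    """Propagate a Gaussian velocity pulse; all values are illustrative assumptions."""
    c = math.sqrt(modulus / rho)   # wave speed
    dt = cfl * dx / c              # time step chosen from the CFL number
    v = [math.exp(-((i - n // 2) / 5.0) ** 2) for i in range(n)]
    v[0] = v[-1] = 0.0
    s = [0.0] * (n - 1)
    for _ in range(steps):
        staggered_step(v, s, rho, modulus, dx, dt)
    return v, s
```

Within each of the two loops, every iteration reads only the other field, so the iterations are independent of one another. That independence is what makes this kind of update a natural fit for one GPU thread per grid point.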

Oct, 5

Applying software-managed caching and CPU/GPU task scheduling for accelerating dynamic workloads

In this talk we address two problems frequently encountered by GPU developers: optimizing memory access for kernels with complex input-dependent access patterns, and mapping the computations to a GPU or a CPU in composite applications with multiple dependent kernels. Both require dynamic adaptation and tuning of execution policies to allow high performance for a wide […]

CUDA

Oct, 5

Parallel Direct Simulation Monte Carlo Computation Using CUDA on GPUs

In this study, computations of the two-dimensional Direct Simulation Monte Carlo (DSMC) method using Graphics Processing Units (GPUs) are presented. An all-device (GPU) computational approach is adopted, in which the entire computation (particle moving, indexing, collisions between particles and state sampling) is performed on the GPU device, leaving the CPU idle. The subsequent application to GPU […]

CUDA

Oct, 5

A Framework for Automated Performance Tuning and Code Verification on GPU Computing Platforms

Emerging multi-core processor designs create a computing paradigm capable of advancing numerous scientific areas, including medicine, data mining, biology, physics, and earth sciences. However, the trends in multi-core hardware technology have advanced far ahead of the advances in software technology and programmer productivity. For the most part, current scientists only leverage multi-core and GPU (Graphical […]

Oct, 5

High-Order Discontinuous Galerkin Methods by GPU Metaprogramming

Discontinuous Galerkin (DG) methods for the numerical solution of partial differential equations have enjoyed considerable success because they are both flexible and robust: They allow arbitrary unstructured geometries and easy control of accuracy without compromising simulation stability. In a recent publication, we have shown that DG methods also adapt readily to execution on modern, […]

CUDA

Oct, 5

Flexible, high performance convolutional neural networks for image classification

We present a fast, fully parameterizable GPU implementation of Convolutional Neural Network variants. Our feature extractors are neither carefully designed nor pre-wired, but rather learned in a supervised way. Our deep hierarchical architectures achieve the best published results on benchmarks for object classification (NORB, CIFAR10) and handwritten digit recognition (MNIST), with error rates of 2.53%, […]

CUDA

Oct, 5

A parallel error diffusion implementation on a GPU

In this paper, we investigate the suitability of the GPU for a parallel implementation of the pinwheel error diffusion. We demonstrate a high-performance GPU implementation by efficiently parallelizing and unrolling the image processing algorithm. Our GPU implementation achieves a 10-30x speedup over a two-threaded CPU error diffusion implementation with comparable image quality. We […]

CUDA
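For readers unfamiliar with error diffusion, the following is a minimal serial sketch in Python. It uses the classic Floyd-Steinberg weights, not the pinwheel variant studied in the paper above, and the function name `floyd_steinberg` and the `threshold` default are hypothetical choices for this example.

```python
def floyd_steinberg(image, threshold=128):
    """Binarise a grayscale image (rows of 0..255 values) by serial error diffusion."""
    height = len(image)
    width = len(image[0]) if height else 0
    # Work on a float copy so the accumulated error does not alter the input.
    work = [[float(value) for value in row] for row in image]
    out = [[0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            old = work[y][x]
            new = 255 if old >= threshold else 0
            out[y][x] = new
            err = old - new
            # Push the quantisation error to not-yet-processed neighbours
            # using the weights 7, 3, 5 and 1 (each divided by 16).
            if x + 1 < width:
                work[y][x + 1] += err * 7 / 16
            if y + 1 < height:
                if x > 0:
                    work[y + 1][x - 1] += err * 3 / 16
                work[y + 1][x] += err * 5 / 16
                if x + 1 < width:
                    work[y + 1][x + 1] += err * 1 / 16
    return out
```

Each pixel depends on error pushed from pixels processed before it, so the loop cannot simply be split across threads. This data dependency is the reason a GPU implementation of error diffusion needs a reorganised traversal rather than a direct port.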

Oct, 4

GPU performance comparison for accelerated radar data processing

Radar is a data-intensive measurement technique often requiring significant processing to make full use of the received signal. However, computing capacity is limited at remote or mobile radar installations, thereby limiting radar data products used for real-time decisions. We used graphics processing units (GPUs) to accelerate processing of high-resolution phase-coded radar data from the […]

OpenCL

Oct, 4

A Massive Data Parallel Computational Framework on Petascale/Exascale Hybrid Computer Systems

Heterogeneous systems are becoming more common on High Performance Computing (HPC) systems. Even using tools like CUDA [1] and OpenCL [2], it is a non-trivial task to obtain optimal performance on the GPU. Approaches to simplifying this task include Merge [3] (a library-based framework for heterogeneous multi-core systems), Zippy [4] (a framework for parallel […]

CUDA, OpenCL

Oct, 4

Architecture-Aware Optimization on a 1600-core Graphics Processor

The graphics processing unit (GPU) continues to make significant strides as an accelerator in commodity cluster computing for high-performance computing (HPC). For example, three of the top five fastest supercomputers in the world, as ranked by the TOP500, employ GPUs as accelerators. Despite this increasing interest in GPUs, however, optimizing the performance of a GPU-accelerated […]

CUDA, OpenCL

Oct, 4

Fine-grained Parallel ILU Preconditioners with Fill-ins for Multi-core CPUs and GPUs

Numerical simulation and its huge computational demands require a close coupling between efficient mathematical methods and their hardware-aware implementation on emerging and highly parallel computing platforms. The paradigm shift towards manycore parallelism not only offers a high potential of computing capabilities but also comes up with urgent challenges in designing scalable, portable, and flexible software […]

OpenCL

* * *

Recent source codes

A Safety Report on GPT-5.2, Gemini 3 Pro, Qwen3-VL, Grok 4.1 Fast, Nano Banana Pro, and Seedream 4.5

DICE: Diffusion Large Language Models Excel at Generating CUDA Kernels

KernelGYM & Dr. Kernel: A distributed GPU environment and a collection of RL training methods to support RL for Kernel Generations

Vortex-Optimized Light-weight Toolchain (VOLT)

SciDef: Automated Definition Extraction from Scientific Literature

bioagent-bench: Benchmark for evaluating LLM agents in bioinformatics

Benchmark suite for LLM inference on NVIDIA consumer GPUs

Theorizer: from the paper Generating Literature-Driven Scientific Discoveries at Scale

Nsight Python: a Python kernel profiling interface based on NVIDIA Nsight Tools

Awesome LLM-Driven Kernel Generation