high performance computing on graphics processing units: hgpu.org

Posts

Jul, 12

High-Performance Symmetric Block Ciphers on Multicore CPU and GPUs

As data protection through encryption grows more important, encryption using general-purpose computation on graphics processing units (GPGPU) has attracted attention as one way to achieve high-speed data protection. GPUs have evolved in recent years into powerful parallel computing devices with a high cost-performance ratio. However, […]

CUDA

Jul, 12

A Note on Particle Filters Applied to DSGE Models

This paper compares the properties of two particle filters – the Bootstrap Filter and the Auxiliary Particle Filter – applied to the computation of the likelihood of artificial data simulated from a basic DSGE model with nominal and real rigidities. Particle filters are compared in terms of speed, quality of the approximation of the probability […]

OpenCL

Jul, 12

Data Partitioning on Heterogeneous Multicore and Multi-GPU Systems Using Functional Performance Models of Data-Parallel Applications

The transition to hybrid CPU/GPU platforms in high performance computing is challenging with respect to efficient utilisation of the heterogeneous hardware and existing optimised software. In recent years, scientific software has been ported to multicore and GPU architectures and should now be reused on hybrid platforms. In this paper, we model the performance of such […]

CUDA

Jul, 11

Fast Algorithms for the Solution of Stochastic Partial Differential Equations

We explore the performance of several algorithms for the solution of stochastic partial differential equations including the stochastic Galerkin method and the stochastic sparse grid collocation method. We also introduce a new method called the adaptive kernel density estimation (KDE) collocation method, which addresses some of the deficiencies present in other stochastic PDE solution methods. […]

CUDA

Jul, 11

Stencil-Aware GPU Optimization of Iterative Solvers

Numerical solutions of nonlinear partial differential equations frequently rely on iterative Newton-Krylov methods, which linearize a finite-difference stencil-based discretization of a problem, producing a sparse matrix with regular structure. Knowledge of this structure can be used to exploit parallelism and locality of reference on modern cache-based multi- and many-core architectures, achieving high performance for computations […]

CUDA

Jul, 11

Invitation to a Standard Programming Interface for Massively Parallel Computing Environment: OpenCL

Multicore/manycore architectures are driving demand for a new programming environment that can utilize the massive number of processors integrated in an LSI. The GPU (Graphics Processing Unit) is one typical hardware environment. Programming environments for GPUs are traditionally vendor- and hardware-specific, which complicates the management of uniform programs that access the computing resources of a massively parallel platform. The recently […]

OpenCL

Jul, 11

Geometric Algebra enhanced Precompiler for C++ and OpenCL

The focus of this work is a simplified integration of algorithms expressed in Geometric Algebra (GA) into modern high-level computer languages, namely C++, OpenCL and CUDA. High runtime performance in terms of GA is achieved using symbolic simplification and code generation by a Precompiler that is directly integrated into CMake-based build toolchains.

CUDA • OpenCL

Jul, 11

A fully parallel, high precision, N-body code running on hybrid computing platforms

We present a new implementation of the numerical integration of the classical, gravitational, N-body problem based on a high order Hermite’s integration scheme with block time steps, with a direct evaluation of the particle-particle forces. The main innovation of this code (called HiGPUs) is its full parallelization, exploiting both OpenMP and MPI in the use […]

OpenCL

Jul, 11

Hybrid Monte Carlo with Wilson Dirac operator on the Fermi GPU

In this article we present our implementation of a Hybrid Monte Carlo algorithm for Lattice Gauge Theory using two degenerate flavours of Wilson-Dirac fermions on a Fermi GPU. We find that using registers instead of global memory speeds up the code by almost an order of magnitude. To map the array variables to scalars, so […]

CUDA

Jul, 10

Exposure Render: An Interactive Photo-Realistic Volume Rendering Framework

The field of volume visualization has undergone rapid development during the past years, both due to advances in suitable computing hardware and due to the increasing availability of large volume datasets. Recent work has focused on increasing the visual realism in Direct Volume Rendering (DVR) by integrating a number of visually plausible but often effect-specific […]

CUDA

Jul, 10

Multi-level Parallelization of Advanced Video Coding on Hybrid CPU/GPU Platform

In this paper we propose a dynamic model for parallel H.264/AVC video encoding on hybrid GPU/CPU systems. The entire inter-loop is parallelized on both CPU and GPU, and a computationally light and efficient model is proposed to dynamically distribute the computational load among simultaneously processing devices. This model includes both dependency-aware task scheduling and load balancing algorithm […]

CUDA

Jul, 10

Runtime Systems and Scheduling Support for High-End CPU-GPU Architectures

In recent years, multi-core CPUs and many-core GPUs have emerged as mainstream and cost-effective means for scaling. Consequently, heterogeneous computing platforms consisting of both CPUs and GPUs are receiving wide attention. Such heterogeneous architectures are pervasive across notebooks, desktops, clusters, supercomputers and cloud environments. While they expose huge potential for […]

CUDA • OpenCL
