high performance computing on graphics processing units: hgpu.org

Posts

Feb, 22

Energy efficiency of mixed precision iterative refinement methods using hybrid hardware platforms

In this paper we evaluate the possibility of using mixed precision algorithms on different hardware platforms to obtain energy-efficient solvers for linear systems of equations. Our test cases arise in the context of computational fluid dynamics. Therefore, we analyze the energy efficiency of common cluster nodes and a hybrid, GPU-accelerated cluster node, when applying a linear […]
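For readers unfamiliar with the idea, mixed precision iterative refinement can be sketched as follows: solve in low precision, compute the residual in high precision, and correct. This is a minimal illustration, not the solver from the paper. The helper names (`f32`, `solve_f32`, `refine`) are hypothetical, single precision is emulated by rounding through `struct`, and a practical implementation would reuse one factorization rather than re-solving on each pass.

```python
import struct

def f32(x):
    """Round a Python float (double) to the nearest single-precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]

def solve_f32(A, b):
    """Gaussian elimination with partial pivoting; every result is rounded to float32."""
    n = len(b)
    M = [[f32(v) for v in row] + [f32(b[i])] for i, row in enumerate(A)]
    for k in range(n):
        p = max(range(k, n), key=lambda i: abs(M[i][k]))
        M[k], M[p] = M[p], M[k]
        for i in range(k + 1, n):
            m = f32(M[i][k] / M[k][k])
            for j in range(k, n + 1):
                M[i][j] = f32(M[i][j] - f32(m * M[k][j]))
    x = [0.0] * n
    for i in reversed(range(n)):
        s = M[i][n]
        for j in range(i + 1, n):
            s = f32(s - f32(M[i][j] * x[j]))
        x[i] = f32(s / M[i][i])
    return x

def refine(A, b, iterations=5):
    """Low-precision solves, corrected with residuals computed in double precision."""
    n = len(b)
    x = solve_f32(A, b)
    for _ in range(iterations):
        r = [b[i] - sum(A[i][j] * x[j] for j in range(n)) for i in range(n)]
        d = solve_f32(A, r)  # the correction only needs low precision
        x = [xi + di for xi, di in zip(x, d)]
    return x

# Small, well-conditioned example system (an assumption for illustration).
A = [[4.0, 1.0], [1.0, 3.0]]
b = [1.0, 2.0]
x_low = solve_f32(A, b)
x_refined = refine(A, b)
```

The expensive solves stay in the cheaper precision, and only the residual is computed in double precision.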

Feb, 22

Performance analysis and optimization of three-dimensional FDTD on GPU using roofline model

The Finite-Difference Time-Domain (FDTD) method is commonly used for electromagnetic field simulations. Recently, successful hardware acceleration using Graphics Processing Units (GPUs) has been reported for large-scale FDTD simulations. In this paper, we present a performance analysis of the three-dimensional (3D) FDTD on GPU using the roofline model. We find that theoretical predictions on maximum performance […]

CUDA
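As background for the roofline model mentioned above, its bound is the minimum of peak compute throughput and memory bandwidth times arithmetic intensity. The sketch below is generic. The function names and the device numbers are hypothetical and are not taken from the paper.

```python
def attainable_gflops(peak_gflops, bandwidth_gb_s, flops_per_byte):
    """Roofline bound: the lower of the compute roof and the memory roof."""
    return min(peak_gflops, bandwidth_gb_s * flops_per_byte)

def ridge_point(peak_gflops, bandwidth_gb_s):
    """Arithmetic intensity (FLOP/byte) above which a kernel is compute-bound."""
    return peak_gflops / bandwidth_gb_s

# Hypothetical device: 1000 GFLOP/s peak, 100 GB/s memory bandwidth.
memory_bound = attainable_gflops(1000.0, 100.0, 0.5)    # low-intensity kernel
compute_bound = attainable_gflops(1000.0, 100.0, 20.0)  # high-intensity kernel
```

Stencil codes such as FDTD typically perform few floating-point operations per byte moved, so they tend to fall on the bandwidth-limited side of the ridge point.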

Feb, 22

Data Structures and Transformations for Physically Based Simulation on a GPU

As general purpose computing on Graphics Processing Units (GPGPU) matures, more complicated scientific applications are being targeted to utilize the data-level parallelism available on a GPU. Implementing physically-based simulation on data-parallel hardware requires preprocessing overhead which affects application performance. We discuss our implementation of physics-based data structures that provide significant performance improvements when used on […]

CUDA

Feb, 22

Parallel power flow solutions using a biconjugate gradient algorithm and a Newton method: A GPU-based approach

This paper presents a new approach to solving the power flow problem on graphics processing units. A Newton method is implemented to solve the set of nonlinear equations of the power flow formulation. A parallel kernel for the biconjugate gradient method solves for the voltage corrections on a graphics card. While […]

Feb, 22

GPU Acceleration of Runge-Kutta Integrators

We consider the use of commodity graphics processing units (GPUs) for the common task of numerically integrating ordinary differential equations (ODEs), achieving speed-ups of up to 115-fold over comparable serial CPU implementations, and 15-fold over multithreaded CPU code with SIMD intrinsics. Using Lorenz ’96 models as a case study, single and double precision benchmarks are […]

CUDA
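To make the setting concrete, here is a serial sketch of the classical fourth-order Runge-Kutta step applied to the Lorenz '96 system. It is a plain CPU reference written for illustration, not the GPU implementation from the paper. The function names and the forcing value of 8.0 are assumptions.

```python
def lorenz96(x, forcing=8.0):
    """Lorenz '96 right-hand side with cyclic indexing."""
    n = len(x)
    return [(x[(i + 1) % n] - x[i - 2]) * x[i - 1] - x[i] + forcing
            for i in range(n)]

def rk4_step(f, x, dt):
    """One classical fourth-order Runge-Kutta step for dx/dt = f(x)."""
    k1 = f(x)
    k2 = f([xi + 0.5 * dt * ki for xi, ki in zip(x, k1)])
    k3 = f([xi + 0.5 * dt * ki for xi, ki in zip(x, k2)])
    k4 = f([xi + dt * ki for xi, ki in zip(x, k3)])
    return [xi + dt / 6.0 * (a + 2.0 * b + 2.0 * c + d)
            for xi, a, b, c, d in zip(x, k1, k2, k3, k4)]

# Perturb the uniform equilibrium slightly and integrate a few steps.
state = [8.0] * 8
state[0] += 0.01
for _ in range(100):
    state = rk4_step(lorenz96, state, 0.01)
```

Each component of the right-hand side depends only on a few neighbours, so the stage evaluations are data-parallel across components and across independent trajectories.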

Feb, 21

Software-Based ECC for GPUs

Commodity off-the-shelf GPUs lack error checking mechanisms for graphics memory, whereas conventional HPC platforms have used hardware-based ECC for DRAMs. To alleviate this reliability concern, we propose a software-based ECC for GPGPU applications. We add small program codes to normal CUDA programs that compute ECCs for data residing in graphics memory so that transient bit-flips […]

CUDA

Feb, 21

Accelerated Root Finding for Computational Finance

A parallel implementation of root finding on an SIMD application accelerator is reported. These are roots of stochastic differential equations in the computational finance domain which require a stochastic simulation to be performed for each evaluation of the pricing function. Experiments show that a speedup of 15X can be achieved over using a stand-alone CPU […]

Feb, 21

An Automated Approach for SIMD Kernel Generation for GPU based Software Acceleration

Graphics Processing Units (GPUs) are highly parallel Single Instruction Multiple Data (SIMD) engines, with extremely high degrees of available hardware parallelism. The task of implementing a software routine on a GPU currently requires significant manual design, iteration and experimentation. This paper presents an automated approach to partition a software application into kernels (which are executed […]

CUDA

Feb, 21

Assembling large mosaics of electron microscope images using GPU

Understanding the neural circuitry of the retina requires us to map the connectivity of individual neurons in large neuronal tissue sections and analyze signal communication across processes from the electron microscopy images. One of the major bottlenecks in the critical path is the image mosaicing process where 2D slices are assembled from scanned microscopy image […]

CUDA

Feb, 21

Accelerating the Stochastic Simulation Algorithm

In order for scientists to learn more about molecular biology, it is imperative that they have the ability to construct accurate models that predict the reactions of species of molecules. Generating these models using deterministic approaches is not feasible as these models may violate some of the assumptions underlying classical differential equations models (e.g., small […]

CUDA
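For context, the stochastic simulation algorithm in its basic serial form (Gillespie's direct method) can be sketched as below. This is a textbook-style illustration, not the accelerated version from the paper. The function signature and the decay example are hypothetical.

```python
import random

def gillespie(initial_state, reactions, t_end, rng):
    """Direct-method SSA.

    reactions is a list of (propensity_function, state_change) pairs.
    Returns a list of (time, state) tuples.
    """
    t = 0.0
    state = list(initial_state)
    trajectory = [(t, tuple(state))]
    while True:
        rates = [propensity(state) for propensity, _ in reactions]
        total = sum(rates)
        if total <= 0.0:
            break  # no reaction can fire any more
        t += rng.expovariate(total)  # exponentially distributed waiting time
        if t > t_end:
            break
        # Choose a reaction with probability proportional to its propensity.
        pick = rng.random() * total
        acc = 0.0
        for rate, (_, change) in zip(rates, reactions):
            acc += rate
            if pick < acc:
                break
        state = [s + c for s, c in zip(state, change)]
        trajectory.append((t, tuple(state)))
    return trajectory

# Hypothetical example: first-order decay A -> (nothing), rate constant 1.0.
decay = [(lambda s: 1.0 * s[0], (-1,))]
path = gillespie([20], decay, t_end=1e9, rng=random.Random(0))
```

Independent trajectories share no state, which is what makes running many realizations in parallel a natural fit for a GPU.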

Feb, 21

Acceleration of Binomial Options Pricing via Parallelizing along time-axis on a GPU

Since the introduction of organized trading of options for commodities and equities, computing fair prices for options has been an important problem in financial engineering. A variety of numerical methods, including Monte Carlo methods, binomial trees, and numerical solution of stochastic differential equations, are used to compute fair prices. Traders and brokerage firms constantly strive […]

CUDA
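As a point of reference, the standard sequential binomial tree for a European call looks roughly like this. The sketch uses the Cox-Ross-Rubinstein parameterization, which is an assumption, since the excerpt does not say which tree the paper uses. It does not implement the paper's time-axis parallelization. The function name and parameters are hypothetical.

```python
import math

def binomial_call(s0, strike, rate, sigma, maturity, steps):
    """European call price on a Cox-Ross-Rubinstein binomial tree."""
    dt = maturity / steps
    up = math.exp(sigma * math.sqrt(dt))
    down = 1.0 / up
    p = (math.exp(rate * dt) - down) / (up - down)  # risk-neutral up probability
    discount = math.exp(-rate * dt)
    # Payoffs at maturity; j counts the number of up moves.
    values = [max(s0 * up**j * down**(steps - j) - strike, 0.0)
              for j in range(steps + 1)]
    # Backward induction: each level depends only on the level after it.
    for level in range(steps, 0, -1):
        values = [discount * (p * values[j + 1] + (1.0 - p) * values[j])
                  for j in range(level)]
    return values[0]

price = binomial_call(100.0, 100.0, 0.05, 0.2, 1.0, 500)
```

Within one time level the node updates are independent, but the levels themselves form a sequential chain.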

Feb, 21

A massively parallel framework using P systems and GPUs

Since the CUDA programming model appeared for general-purpose computation, developers have been able to exploit the power of GPUs (Graphics Processing Units) across many computational domains. Among these domains, P systems, or membrane systems, provide a high-level computational modeling framework that allows, in theory, polynomial-time solutions to NP-complete problems by […]

CUDA

* * *

Recent source codes

A Safety Report on GPT-5.2, Gemini 3 Pro, Qwen3-VL, Grok 4.1 Fast, Nano Banana Pro, and Seedream 4.5

DICE: Diffusion Large Language Models Excel at Generating CUDA Kernels

KernelGYM & Dr. Kernel: A distributed GPU environment and a collection of RL training methods to support RL for Kernel Generations

Vortex-Optimized Light-weight Toolchain (VOLT)

SciDef: Automated Definition Extraction from Scientific Literature

bioagent-bench: Benchmark for evaluating LLM agents in bioinformatics

Benchmark suite for LLM inference on NVIDIA consumer GPUs

Theorizer: from the paper Generating Literature-Driven Scientific Discoveries at Scale

Nsight Python: a Python kernel profiling interface based on NVIDIA Nsight Tools

Awesome LLM-Driven Kernel Generation
