high performance computing on graphics processing units: hgpu.org

Posts

Dec, 18

Implementation of 3D FFTs Across Multiple GPUs in Shared Memory Environments

In this paper, a novel implementation of the distributed 3D Fast Fourier Transform (FFT) on a multi-GPU platform using CUDA is presented. The 3D FFT is the core of many simulation methods, so its fast calculation is critical. The main bottleneck of the distributed 3D FFT is the global data exchange that must be performed. […]

CUDA

Dec, 18

Theoretical and Numerical Analysis of Three Approaches to the GPGPU Application of the Explicit FDTD Method

The Finite-Difference Time-Domain (FDTD) method is a modelling technique for electromagnetic wave propagation. It has a wide range of application domains, for example geophysics, defence, microwave applications such as radar, and biomedicine. FDTD is computationally intensive but has potential for parallelisation. The use of General-Purpose computing on Graphics Processing Units (GPGPU) is examined […]

CUDA

Dec, 18

Accelerating Haskell Array Codes with Algorithmic Skeletons on GPUs

GPUs have been gaining popularity as general purpose parallel processors that deliver a performance to cost ratio superior to that of CPUs. However, programming on GPUs has remained a specialised area, as it often requires significant knowledge about the GPU architecture and platform-specific parallelisation of the algorithms that are implemented. Furthermore, the dominant programming models […]

CUDA

Dec, 18

Single-Pass GPU-Raycasting for Structured Adaptive Mesh Refinement Data

Structured Adaptive Mesh Refinement (SAMR) is a popular numerical technique to study processes with high spatial and temporal dynamic range. It reduces computational requirements by adapting the lattice on which the underlying differential equations are solved to most efficiently represent the solution. Particularly in astrophysics and cosmology such simulations now can capture spatial scales ten […]

OpenGL

Dec, 18

Database Operation Development on the GPU

The performance of database operations has always been an important factor in database research. This has never been more important, as the quantity of data is growing at an alarming rate. This, coupled with the recent growth in the use of graphics processors as general compute processors, has led to many advancements in the field […]

Dec, 16

Acceleration of multivariate analysis techniques in TMVA using GPUs

A feasibility study into the acceleration of multivariate analysis techniques using Graphics Processing Units (GPUs) will be presented. The MLP-based Artificial Neural Network method contained in the TMVA framework has been chosen as a focus for investigation. It was found that the network training time on a GPU was lower than for CPU execution as […]

CUDA

Dec, 16

Accuracy, Memory, and Speed Strategies in GPU-Based Finite-Element Matrix-Generation

This letter presents strategies on how to optimize graphics processing unit (GPU)-based finite-element matrix-generation that occurs in the finite element method (FEM) using higher-order curvilinear elements. The goal of the optimization is to increase the speed of evaluation and assembly of large finite-element matrices on a single GPU while maintaining the accuracy of numerical integration […]

CUDA

Dec, 16

Productive High Performance Parallel Programming with Auto-tuned Domain-Specific Embedded Languages

As the complexity of machines and architectures has increased, performance tuning has become more challenging, leading to the failure of general compilers to generate the best possible optimized code. Expert performance programmers can often hand-write code that outperforms compiler-optimized low-level code by an order of magnitude. At the same time, the complexity of programs has […]

CUDA

Dec, 16

Communication-Avoiding Optimization of Geometric Multigrid on GPUs

Multigrid methods are widely used to accelerate the convergence of iterative solvers for linear systems in a number of different application areas. In this report, we explore communication-avoiding implementations of Geometric Multigrid on Nvidia GPUs. We achieved an overall gain of 1.2x for the whole multigrid algorithm over the baseline implementation. We also provide an insight […]

CUDA

Dec, 16

Circular Hough Transform in OpenCL

In this paper, the details of the circular Hough transform are explained and the performance of three different implementations (CPU, OpenCL and CUDA) is shown. The goal of this project is to contribute to the computer vision literature by porting the circular Hough transform written in CUDA to OpenCL.

CUDA • OpenCL

Dec, 15

Performance study of using the Direct Compute API for implementing Support vector machines on GPUs

Today, graphics processing units (GPUs) are not only able to generate graphical imaging but can also expose their multicore architecture to accelerate computationally heavy general-purpose algorithms that can be adapted to it. The study conducted in this thesis explores the efficiency of using the general purpose graphics processing […]

CUDA • OpenCL

Dec, 15

Advanced Techniques for the Rendering and Visualization of Volumetric Seismic Data

An important part of today’s search for hydrocarbon reservoirs such as oil and gas is the use of seismic methods which measure changes in acoustic impedance to explore the interior of the earth. Similar to medical imaging techniques such as MRI or CT, seismic methods generate image slices (survey lines) through the subsurface geology. By […]

CUDA
