high performance computing on graphics processing units: hgpu.org

Posts

Oct, 14

Accelerating Large Scale Image Analyses on Parallel CPU-GPU Equipped Systems

General-purpose graphical processing units (GPGPUs) have transformed high-performance computing over the past decade. Making great computational power available with reduced cost and power consumption overheads, heterogeneous CPU-GPU-equipped systems have helped to make possible the emerging class of exascale data-intensive applications. Although the theoretical performance achieved by these hybrid systems is impressive, taking practical advantage of […]

CUDA

Oct, 14

CudaDMA: Optimizing GPU Memory Bandwidth via Warp Specialization

As the computational power of GPUs continues to scale with Moore’s Law, an increasing number of applications are becoming limited by memory bandwidth. We propose an approach for programming GPUs with tightly-coupled specialized DMA warps for performing memory transfers between on-chip and off-chip memories. Separate DMA warps improve memory bandwidth utilization by better exploiting available […]

CUDA

Oct, 14

OptiML: An implicitly parallel domain-specific language for machine learning

As the size of datasets continues to grow, machine learning applications are becoming increasingly limited by the amount of available computational power. Taking advantage of modern hardware requires using multiple parallel programming models targeted at different devices (e.g. CPUs and GPUs). However, programming these devices to run efficiently and correctly is difficult, error-prone, and results […]

OpenCL

Oct, 14

Liszt: A Domain Specific Language for Building Portable Mesh-based PDE Solvers

Heterogeneous computers with processors and accelerators are becoming widespread in scientific computing. However, it is difficult to program hybrid architectures and there is no commonly accepted programming model. Ideally, applications should be written in a way that is portable to many platforms, but providing this portability for general programs is a hard problem. By restricting […]

CUDA

Oct, 14

GPU Computing Gems: Jade Edition

This is the second volume of Morgan Kaufmann’s GPU Computing Gems, offering an all-new set of insights, ideas, and practical “hands-on” skills from researchers and developers worldwide. Each chapter gives you a window into the work being performed across a variety of application domains, and the opportunity to witness the impact of parallel GPU computing […]

CUDA

Oct, 14

Towards scalar synchronization in SIMT architectures

An important class of compute accelerators are graphics processing units (GPUs). Popular programming models for non-graphics computation on GPUs, such as CUDA and OpenCL, provide an abstraction of many parallel scalar threads. Contemporary GPU hardware groups 32 to 64 scalar threads as a single warp or wavefront and executes this group of scalar threads in […]

CUDA • OpenCL

Oct, 14

A Heterogeneous Parallel Framework for Domain-Specific Languages

Computing systems are becoming increasingly parallel and heterogeneous, and therefore new applications must be capable of exploiting parallelism in order to continue achieving high performance. However, targeting these emerging devices often requires using multiple disparate programming models and making decisions that can limit forward scalability. In previous work we proposed the use of domain-specific languages […]

OpenCL

Oct, 14

Fast Multipole Method vs. Spectral Method for the Simulation of Isotropic Turbulence on GPUs

This paper presents calculations of homogeneous isotropic turbulence at Re_{lambda} = 100 using both a pseudo-spectral method and a fast multipole vortex method on a 256^3 grid. For the vortex method, both algorithmic and hardware acceleration are applied using a highly parallel fast multipole method (FMM) on GPUs. The spectral method uses the FFTW library […]

CUDA

Oct, 13

Benchmarking Across Platforms: European Option Pricing

Using a popular Monte Carlo workload that implements European option pricing, we tested a variety of architectures, including NVIDIA and AMD GPUs, a ClearSpeed accelerator, and multi-core processors, as well as different programming approaches. We conclude that this particular workload seems better suited to GPU-type architectures than to alternatives such as CPU or […]

CUDA • OpenCL

Oct, 13

Firepile: Run-time Compilation for GPUs in Scala

Recent advances have enabled GPUs to be used as general-purpose parallel processors on commodity hardware for little cost. However, the ability to program these devices has not kept up with their performance. The programming model for GPUs has a number of restrictions that make it difficult to program. For example, software running on the GPU […]

OpenCL

Oct, 13

A rendering method for simulated emission nebulae

Emission nebulae are some of the most beautiful stellar phenomena. The newly formed hot stars inside the nebulae ionize the surrounding gas, making it glow in a variety of colors. The focus of this work is to find a method for interactive rendering of simulated emission nebulae. A rendering program has been developed to render and […]

OpenCL

Oct, 13

* * *

Introduction to GPU Radix Sort
