high performance computing on graphics processing units: hgpu.org

Posts

Oct, 12

Distance Fields Accelerated with OpenCL

An important task in any graphical simulation is the collision detection between the objects in the simulation. It is desirable to have a good general method for collision detection with high performance. This thesis describes an implementation of a collision detection method that uses distance fields to detect collisions. This method is quite robust and […]

OpenCL

Oct, 12

Cinematic Particle Systems with OpenCL

High-particle-count simulations are becoming increasingly crucial in many different aspects of our world today: both in entertainment – within video games, movies, and the like – and in scientific fields, where particle systems are capable of simulating and visualizing many interesting phenomena. This paper will explore the possibility of parallelizing the simulation of these large […]

OpenCL

Oct, 12

Color Correction Acceleration Using a Color Cube and OpenCL

The article deals with the problem of real time color correction on modern but not dedicated video hardware, suggesting a new implementation of fast algorithm for color transformation utilizing 3D look-up tables. We focus on highly parallel nature of the proposed method and employ the GPU to perform the color calculations side-by-side. The paper is […]

OpenCL

Oct, 12

Evaluating performance and portability of OpenCL programs

Recently, OpenCL, a new open programming standard for GPGPU programming, has become available in addition to CUDA. OpenCL can support various compute devices due to its higher abstraction programming framework. Since there is a semantic gap between OpenCL and compute devices, the OpenCL C compiler plays important roles to exploit the potential of compute devices […]

CUDA • OpenCL

Oct, 11

Real-Time Rigid Body Interactions

Rigid body simulations are useful in many areas, most notably video games and computer animation. However, the requirements for accuracy and performance vary greatly between applications. In this project we combine methods and techniques from different sources to implement a rigid body simulation. The simulation uses a particle representation to approximate objects with the intent […]

OpenCL • OpenGL

Oct, 11

Performance Characterization and Optimization of Atomic Operations on AMD GPUs

Atomic operations are important building blocks in supporting general-purpose computing on graphics processing units (GPUs). For instance, they can be used to coordinate execution between concurrent threads, and in turn, assist in constructing complex data structures such as hash tables or implementing GPU-wide barrier synchronization. While the performance of atomic operations has improved substantially on […]

OpenCL

Oct, 11

Performance and Power Analysis of ATI GPU: A Statistical Approach

We present a comprehensive study on the performance and power consumption of a recent ATI GPU. By employing a rigorous statistical model to analyze execution behaviors of representative general-purpose GPU (GPGPU) applications, we conduct insightful investigations on the target GPU architecture. Our results demonstrate that the GPU execution throughput and the power dissipation are dependent […]

OpenCL

Oct, 11

Fast Surface Extraction and Visualization of Medical Images using OpenCL and GPUs

Marching Cubes (MC) is an algorithm that extracts surfaces from volumetric data. It is used extensively in visualization and analysis of medical data from modalities like CT and MR, often after a 3D segmentation of the interesting structures is performed. Traditional implementations of MC on modern CPUs are slow, using several seconds (even minutes) to […]

CUDA • OpenCL

Oct, 11

On the Efficacy of a Fused CPU+GPU Processor (or APU) for Parallel Computing

The graphics processing unit (GPU) has made significant strides as an accelerator in parallel computing. However, because the GPU has resided out on PCIe as a discrete device, the performance of GPU applications can be bottlenecked by data transfers between the CPU and GPU over PCIe. Emerging heterogeneous computing architectures that "fuse" the functionality of […]

OpenCL

Oct, 11

PyPs, a programmable pass manager

As hardware platforms are growing in complexity, compiler infrastructures need more flexibility: due to the heterogeneity of these platforms, compiler phases must be combined in unusual and dynamic ways, and several tools may need to be combined to handle specific parts of the compilation process efficiently. The need for flexibility also appears in iterative compilation […]

CUDA

Oct, 11

High Performance Parallel Design Based on Session Programming

Session programming is a programming model based on the theory of session types, a typing system for pi-calculus. Session types were developed to model structured interaction between processes, and correctly typed processes have the property of communication safety. Session Java (SJ) is a full implementation of session types in Java. In this project, we […]

CUDA

Oct, 11

Static Compilation Analysis for Host-Accelerator Communication Optimization

We present an automatic, static program transformation that schedules and generates efficient memory transfers between a computer host and its hardware accelerator, addressing a well-known performance bottleneck. Our automatic approach uses two simple heuristics: to perform transfers to the accelerator as early as possible and to delay transfers back from the accelerator as late as […]

CUDA

* * *
