high performance computing on graphics processing units: hgpu.org

Posts

Sep, 6

CUDA-based GPU Implementation of Hierarchical Belief Propagation for Fast Stereo Matching

Stereo matching based on the Markov random field model poses a global optimization problem. Solutions to the problem can be inferred by the belief propagation (BP) algorithm. The BP algorithm effectively estimates global solutions, but it takes a very long time to calculate messages. In this paper, we implement the hierarchical BP algorithm on a […]

CUDA

Sep, 6

Electromagnetic effects in capacitively coupled plasma simulated with a PIC-MCC darwin code

To increase the efficiency of plasma-assisted material processing with capacitively coupled plasma discharges, the frequency of the driving field and the spatial size of modern devices tend toward higher values. This can lead to a stronger influence of electromagnetic effects, which in turn can affect the plasma uniformity, one of […]

Sep, 5

Virtual Rheoscopic Fluids

We present a visualization technique for simulated fluid dynamics data that visualizes the gradient of the velocity field in an intuitive way. Our work is inspired by rheoscopic particles, which are small, flat particles that, when suspended in fluid, align themselves with the shear of the flow. We adopt the physical principles of real rheoscopic […]

Sep, 5

Graphical future

The future of computing is something that is very much on the mind of nVidia CEO Jen-Hsun Huang, not least because he thinks his company is going to have a hand in it. As a maker of graphics processing units (GPUs), nVidia has had more of a walk-on role in the PC. If you want […]

Sep, 5

Fast Construction of SAH BVHs on the Intel Many Integrated Core (MIC) Architecture

We investigate how to efficiently build bounding volume hierarchies (BVHs) with surface area heuristic (SAH) on the Intel Many Integrated Core (MIC) Architecture. To achieve maximum performance, we use four key concepts: progressive 10-bit quantization to reduce cache footprint with negligible loss in BVH quality; an AoSoA data layout that allows efficient streaming and SIMD […]

Sep, 5

A CUDA-based parallel implementation of K-nearest neighbor algorithm

Recent developments in Graphics Processing Units (GPUs) have enabled inexpensive high performance computing for general-purpose applications. Owing to its tremendous computing capability, the GPU has emerged as a co-processor to the CPU to achieve high overall throughput. The CUDA programming model provides programmers with adequate C-like APIs to better exploit the parallel power of […]

CUDA

Sep, 5

Real-Time Tone Mapping for High-Resolution HDR Images

High dynamic range rendering attempts to take an HDR image and produce a more realistic representation on a limited range computer monitor. Although several tone mapping operators have been proposed in recent years, no evaluation has yet been undertaken to explore which operator is more suitable for hardware implementation. In this paper, we begin with […]

OpenGL

Sep, 5

DUODECIM – a structure for point scan compression and rendering

In this paper we present a compression scheme for large point scans including per-point normals. For the encoding of such scans we introduce a particular type of closest sphere packing grids, the hexagonal close packing (HCP). HCP grids provide a structure for an optimal packing of 3D space, and for a given sampling error they […]

OpenGL

Sep, 5

Fault table generation using Graphics Processing Units

In this paper, we explore the implementation of fault table generation on a Graphics Processing Unit (GPU). A fault table is essential for fault diagnosis and fault detection in VLSI testing and debug. Generating a fault table requires extensive fault simulation, with no fault dropping, and is extremely expensive from a computational standpoint. Fault simulation […]

CUDA

Sep, 5

Efficient Execution on GPUs of Field-Based Vehicular Mobility Models

Large-scale scenarios of vehicular traffic simulation problems are characterized by complex queuing effects, control mechanisms and other interactions of the traffic on the control and vice versa. While small-sized scenarios are relatively easy to explore and analyze, larger scenarios need specialized treatment for efficient execution. The simulation challenges of speed and scale become pronounced when […]

Sep, 5

Isocube: Exploiting the Cubemap Hardware

This paper proposes a novel six-face spherical map, isocube, that fully utilizes the cubemap hardware built in most GPUs. Unlike the cubemap, the proposed isocube uniformly samples the unit sphere (uniformly distributed), and all samples span the same solid angle (equally important). Its mapping computation contains only a small overhead. By feeding the cubemap hardware […]

OpenGL

Sep, 5

A CUDA Based Implementation of an Image Authentication Algorithm

Image authentication is an important technology for protecting images from malicious tampering and has become an indispensable part of the digital world. Over the last decade, the main schemes used for image authentication have been signatures and watermarking. However, when performed serially, the operations of both methods are time-consuming, and limit the wide use of […]

CUDA

Recent source codes

DITRON: Distributed Compiler based on Triton for Parallel Systems

IntelliKit: Agent-first tooling for AMD hardware

CuTile Benchmark Suite: Performance and Productivity Tradeoffs for GPU Kernel Programming on Blackwell Architecture

MegaTrain: Full Precision Training of 100B+ Parameter Large Language Models on a Single GPU

Device Virtual Machine (DVM)

Agentic Code Optimization via Compiler-LLM Cooperation

AutoKernel: Autoresearch for GPU kernels. Give it any PyTorch model, go to sleep, wake up to optimized Triton kernels

Triton-Sanitizer: A Fast and Device-Agnostic Memory Sanitizer for Triton with Rich Diagnostic Context

LLM.Q: Quantized LLM training in pure CUDA/C++

SOL-ExecBench: Speed-of-Light Benchmarking for Real-World GPU Kernels Against Hardware Limits

Most viewed papers (last 30 days)