high performance computing on graphics processing units: hgpu.org

Posts

Dec, 12

A Fast Similarity Join Algorithm Using Graphics Processing Units

A similarity join operation A ⋈_ε B takes two sets of points A, B and a value ε ∈ ℝ, and outputs pairs of points p ∈ A, q ∈ B, such that the distance D(p,q) < ε. Similarity joins find use in a variety of fields, such as clustering, text mining, and multimedia databases. A […]

CUDA
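
The similarity join defined in the abstract above can be illustrated with a brute-force reference. This is only a minimal sketch of what the operation computes, not the paper's GPU algorithm: the function name `similarity_join` is hypothetical, and Euclidean distance is assumed for D, which the excerpt does not specify.

```python
from math import dist  # Euclidean distance, Python 3.8+


def similarity_join(A, B, eps):
    """Return every pair (p, q), p in A and q in B, with dist(p, q) < eps.

    Brute-force O(|A| * |B|) reference; assumes D is Euclidean distance.
    """
    return [(p, q) for p in A for q in B if dist(p, q) < eps]


# Small illustrative input (made-up points).
A = [(0.0, 0.0), (1.0, 1.0)]
B = [(0.1, 0.0), (5.0, 5.0), (1.0, 1.5)]
pairs = similarity_join(A, B, 0.6)
```

Every pair is tested independently of the others, which is one reason the operation is a plausible candidate for data-parallel hardware.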

Dec, 12

A map reduce framework for programming graphics processors

Recent developments in programmable, highly parallel Graphics Processing Units (GPUs) have enabled high performance general purpose computation. We describe a framework designed for high performance GPU programming, built on Nvidia’s Compute Unified Device Architecture (CUDA) platform. The framework is built around the Map Reduce abstraction, which allows application developers to focus on their application, while […]

CUDA

Dec, 12

CUDA: Scalable parallel programming for high-performance scientific computing

Graphics processing units (GPUs) originally designed for computer video cards have emerged as the most powerful chip in a high-performance workstation. Unlike multicore CPU architectures, which currently ship with two or four cores, GPU architectures are “manycore” with hundreds of cores capable of running thousands of threads in parallel. NVIDIA’s CUDA is a co-evolved hardware-software […]

CUDA

Dec, 12

Deformation modeling using global medial representation structures and evaluation by biset mesh matching

In this paper, we present a novel hybrid deformation model using global mass-spring medial representation structures and local finite element model. We employ the hybrid models, by fully calculating the FEM deformation in the local operation part while only calculating the global deformation by medial representation method. To achieve the real-time requirement of realistic deformable […]

CUDA

Dec, 11

Online Dynamic Graph Drawing

This paper presents an algorithm for drawing a sequence of graphs online. The algorithm strives to maintain the global structure of the graph and thus the user’s mental map, while allowing arbitrary modifications between consecutive layouts. The algorithm works online and uses various execution culling methods in order to reduce the layout time and handle […]

OpenGL

Dec, 11

High performance computing for deformable image registration: Towards a new paradigm in adaptive radiotherapy

The advent of readily available temporal imaging or time series volumetric (4D) imaging has become an indispensable component of treatment planning and adaptive radiotherapy (ART) at many radiotherapy centers. Deformable image registration (DIR) is also used in other areas of medical imaging, including motion corrected image reconstruction. Due to long computation time, clinical applications of […]

CUDA

Dec, 11

Accelerating Reed-Solomon coding in RAID systems with GPUs

Graphical Processing Units (GPUs) have been applied to more types of computations than just graphics processing for several years. Until recently, however, GPU hardware has not been capable of efficiently performing general data processing tasks. With the advent of more general-purpose extensions to GPUs, many more types of computations are now possible. One such computation […]

CUDA

Dec, 11

Fast scan algorithms on graphics processors

Scan and segmented scan are important data-parallel primitives for a wide range of applications. We present fast, work-efficient algorithms for these primitives on graphics processing units (GPUs). We use novel data representations that map well to the GPU architecture. Our algorithms exploit shared memory to improve memory performance. We further improve the performance of our […]

CUDA
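
For readers unfamiliar with the two primitives named in the abstract above, here is a sequential sketch of what scan and segmented scan compute. It is a plain reference under stated assumptions, not the work-efficient GPU algorithms the paper describes: the function names are hypothetical, the scan is the inclusive variant, and a flag of 1 is assumed to mark the first element of a segment.

```python
from itertools import accumulate
import operator


def inclusive_scan(xs, op=operator.add):
    """Inclusive scan: out[i] = xs[0] op xs[1] op ... op xs[i]."""
    return list(accumulate(xs, op))


def segmented_scan(xs, flags, op=operator.add):
    """Inclusive scan that restarts wherever flags[i] is truthy.

    Assumes flags[i] == 1 marks the first element of a new segment.
    """
    out = []
    for x, f in zip(xs, flags):
        if f or not out:
            out.append(x)               # start of a segment: restart
        else:
            out.append(op(out[-1], x))  # continue the running result
    return out


scanned = inclusive_scan([3, 1, 7, 0, 4])
segmented = segmented_scan([3, 1, 7, 0, 4], [1, 0, 1, 0, 0])
```

The loop carries a running value from one element to the next, which is exactly the dependency a parallel scan algorithm has to restructure.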

Dec, 11

H.264/AVC motion estimation implementation on Compute Unified Device Architecture (CUDA)

Due to the rapid growth of graphics processing unit (GPU) processing capability, using GPU as a coprocessor to assist the central processing unit (CPU) in computing massive data becomes essential. In this paper, we present an efficient block-level parallel algorithm for the variable block size motion estimation (ME) in H.264/AVC with fractional pixel refinement on […]

CUDA

Dec, 11

Linear genetic programming GPGPU on Microsoft’s Xbox 360

We describe how to harness the graphics processing abilities of a consumer video game console (Xbox 360) for general programming on graphics processing unit (GPGPU) purposes. In particular, we implement a linear GP (LGP) system to solve classification and regression problems. We conduct inter- and intra-platform benchmarking of the Xbox 360 and PC, using GPU […]

Dec, 11

Real-time stereographic rendering and display of medical images with programmable GPUs

This study explored the power and feasibility of using programmable graphics processing units (GPUs) for real-time rendering and display of large 3D medical datasets on a stereoscopic display workstation. Lung cancer screening CT images were used to develop GPU-based stereo rendering and display. The study was run on a personal computer with a 128 MB […]

OpenGL

Dec, 11

Towards acceleration of fault simulation using graphics processing units

In this paper, we explore the implementation of fault simulation on a Graphics Processing Unit (GPU). In particular, we implement a fault simulator that exploits thread level parallelism. Fault simulation is inherently parallelizable, and the large number of threads that can be computed in parallel on a GPU results in a natural fit for the […]

CUDA

Recent source codes

Agentic Code Optimization via Compiler-LLM Cooperation

MegaTrain: Full Precision Training of 100B+ Parameter Large Language Models on a Single GPU

Device Virtual Machine (DVM)

AutoKernel: Autoresearch for GPU kernels. Give it any PyTorch model, go to sleep, wake up to optimized Triton kernels

SOL-ExecBench: Speed-of-Light Benchmarking for Real-World GPU Kernels Against Hardware Limits

Triton-Sanitizer: A Fast and Device-Agnostic Memory Sanitizer for Triton with Rich Diagnostic Context

LLM.Q: Quantized LLM training in pure CUDA/C++

True 4-Bit Quantized CNN Training on CPU

cuFuzz: A GPU-oriented coverage-guided fuzzer for userland CUDA application

KernelSkill: A Multi-Agent Framework for GPU Kernel Optimization
