high performance computing on graphics processing units: hgpu.org

Posts

Dec, 8

SOMGPU: An unsupervised pattern classifier on Graphical Processing Unit

Graphical processing units (GPUs) have lately been used for general-purpose tasks owing to their implicit parallel nature. One such task is pattern classification. Highly parallel tasks like these suffer from performance loss owing to the sequential nature of the central processing unit (CPU). To match the image processing power of the human brain even […]

Dec, 8

GPU based extraction of moving objects without shadows under intensity changes

This paper proposes a GPU-based algorithm for extracting moving objects in real time. The whole process of the proposed approach is handled on the GPU, which is used for acceleration and increases processing speed dramatically. The method uses the a* and b* components of the CIELAB color space without extracting shadow areas as […]

Dec, 8

Overview of implementation of DARPA GPU program in SAIC

This paper reviews the implementation of DARPA MTO STAP-BOY program for both Phase I and II conducted at Science Applications International Corporation (SAIC). The STAP-BOY program conducts fast covariance factorization and tuning techniques for space-time adaptive process (STAP) Algorithm Implementation on Graphics Processor unit (GPU) Architectures for Embedded Systems. The first part of our presentation […]

Dec, 8

Fast Deformable Registration on the GPU: A CUDA Implementation of Demons

In the medical imaging field, we need fast deformable registration methods especially in intra-operative settings characterized by their time-critical applications. Image registration studies which are based on graphics processing units (GPUs) provide fast implementations. However, only a small number of these GPU-based studies concentrate on deformable registration. We implemented Demons, a widely used deformable image […]

CUDA

Dec, 8

A survey of medical image registration on graphics hardware

The rapidly increasing performance of graphics processors, improving programming support, and an excellent performance-price ratio make graphics processing units (GPUs) a good option for a variety of computationally intensive tasks. Within this survey, we give an overview of GPU-accelerated image registration. We address both GPU-experienced readers with an interest in accelerated image registration, as […]

Dec, 7

The 2011 International Conference on High Performance Computing & Simulation, HPCS 2011

The conference aims to address, explore and exchange information on the state of the art in high performance and large scale computing systems, and their use in modeling and simulation and in data intensive applications. We encourage papers with either an application or a technology flavor (and their multidisciplinary integration). The scope covers architecture, performance, algorithms, middleware, and applications. Work on […]

Dec, 7

Performance evaluation of image processing algorithms on the GPU

The graphics processing unit (GPU), which was originally used exclusively for visualization purposes, has evolved into an extremely powerful co-processor. Meanwhile, through the development of elaborate interfaces, the GPU can be used to process data and deal with computationally intensive applications. The speed-up factors attained compared to the central processing unit (CPU) are […]

CUDA

Dec, 7

Fast support vector machine training and classification on graphics processors

Recent developments in programmable, highly parallel Graphics Processing Units (GPUs) have enabled high performance implementations of machine learning algorithms. We describe a solver for Support Vector Machine training running on a GPU, using the Sequential Minimal Optimization algorithm and an adaptive first and second order working set selection heuristic, which achieves speedups of 9-35x over […]

CUDA

Dec, 7

Bandwidth intensive 3-D FFT kernel for GPUs using CUDA

Most GPU performance “hypes” have focused around tightly-coupled applications with small memory bandwidth requirements e.g., N-body, but GPUs are also commodity vector machines sporting substantial memory bandwidth; however, effective programming methodologies thereof have been poorly studied. Our new 3-D FFT kernel, written in NVIDIA CUDA, achieves nearly 80 GFLOPS on a top-end GPU, being more […]

CUDA

Dec, 7

BSGP: bulk-synchronous GPU programming

We present BSGP, a new programming language for general purpose computation on the GPU. A BSGP program looks much the same as a sequential C program. Programmers only need to supply a bare minimum of extra information to describe parallel processing on GPUs. As a result, BSGP programs are easy to read, write, and maintain. […]

CUDA

Dec, 7

Pangaea: a tightly-coupled IA32 heterogeneous chip multiprocessor

Moore’s Law and the drive towards performance efficiency have led to the on-chip integration of general-purpose cores with special-purpose accelerators. Pangaea is a heterogeneous CMP design for non-rendering workloads that integrates IA32 CPU cores with non-IA32 GPU-class multi-cores, extending the current state-of-the-art CPU-GPU integration that physically “fuses” existing CPU and GPU designs. Pangaea introduces (1) […]

Dec, 7

A single-pass GPU ray casting framework for interactive out-of-core rendering of massive volumetric datasets

We present an adaptive out-of-core technique for rendering massive scalar volumes employing single-pass GPU ray casting. The method is based on the decomposition of a volumetric dataset into small cubical bricks, which are then organized into an octree structure maintained out-of-core. The octree contains the original data at the leaves, and a filtered representation of […]

Recent source codes

DITRON: Distributed Compiler based on Triton for Parallel Systems

IntelliKit: Agent-first tooling for AMD hardware

CuTile Benchmark Suite: Performance and Productivity Tradeoffs for GPU Kernel Programming on Blackwell Architecture

MegaTrain: Full Precision Training of 100B+ Parameter Large Language Models on a Single GPU

Device Virtual Machine (DVM)

Agentic Code Optimization via Compiler-LLM Cooperation

AutoKernel: Autoresearch for GPU kernels. Give it any PyTorch model, go to sleep, wake up to optimized Triton kernels

Triton-Sanitizer: A Fast and Device-Agnostic Memory Sanitizer for Triton with Rich Diagnostic Context

LLM.Q: Quantized LLM training in pure CUDA/C++

SOL-ExecBench: Speed-of-Light Benchmarking for Real-World GPU Kernels Against Hardware Limits

Most viewed papers (last 30 days)