high performance computing on graphics processing units: hgpu.org

Posts

Apr, 17

Communication-Minimizing 2D Convolution in GPU Registers

2D image convolution is ubiquitous in image processing and computer vision problems such as feature extraction. Exploiting parallelism is a common strategy for accelerating convolution. Parallel processors keep getting faster, but algorithms such as image convolution remain memory-bound on parallel processors such as GPUs. Therefore, reducing memory communication is fundamental to accelerating image convolution. […]

CUDA

Apr, 16

Zero-copy I/O processing for low-latency GPU computing

Cyber-physical systems (CPS) aim to monitor and control complex real-world phenomena where the computational cost and real-time constraints could be a major challenge. Many-core hardware accelerators such as graphics processing units (GPUs) promise to enhance computation, leveraging the data parallelism often found in real-world scenarios of CPS, but performance is limited by the overhead of […]

CUDA

Apr, 16

Fast simulation of nonlinear radio frequency ultrasound images in inhomogeneous nonlinear media: CREANUIS

The simulation of ultrasound images is usually based on two main strategies: either a linear convolution or the use of an acoustic model. However, only the linear propagation of the pressure wave is considered in the simulation tools generally used. CREANUIS is a recent simulation tool (freely available on the Internet) which implements the nonlinear […]

Apr, 16

High-dimensional wave atoms and compression of seismic datasets

Wave atoms are a low-redundancy alternative to curvelets, suitable for high-dimensional seismic data processing. This abstract extends the wave atom orthobasis construction to 3D, 4D, and 5D Cartesian arrays, and parallelizes it in a shared-memory environment. An implementation of the algorithm for NVIDIA CUDA-capable graphics processing units (GPUs) is also developed to accelerate computation […]

CUDA

Apr, 16

Novel implementations of recursive discrete wavelet transform for real time computation with multicore systems on chip (SOC)

The discrete wavelet transform (DWT) has been studied and developed in various scientific and engineering fields. Its multi-resolution and local nature facilitates applications that require progressive capture of high-frequency details. However, when dealing with enormous data volumes, performance may degrade drastically. The multi-resolution sub-band encoding provided by DWT enables higher compression ratios, and […]

CUDA

Apr, 15

Fast and Robust 3D Correspondence Matching and Its Application to Volume Registration

This paper presents a fast and accurate volume correspondence matching method using 3D Phase-Only Correlation (POC). The proposed method employs (i) a coarse-to-fine strategy using multi-scale volume pyramids for correspondence search and (ii) high-accuracy POC-based local block matching for finding dense volume correspondence with sub-voxel displacement accuracy. This paper also proposes its GPU implementation to […]

CUDA

Apr, 15

A Many-core Machine Model for Designing Algorithms with Minimum Parallelism Overheads

We propose a model of computation which aims at capturing parallelism overheads (such as communication and synchronization costs) of programs written for modern GPU architectures. We establish a Graham-Brent theorem for this model so as to estimate the running time of programs running on p streaming multiprocessors. We evaluate the benefits of our model with three […]

CUDA

Apr, 15

Shell: A Spatial Decomposition Data Structure for 3D Curve Traversal on Many-core Architectures

Shared memory many-core processors such as GPUs have been extensively used in accelerating computation-intensive algorithms and applications. When porting existing algorithms from sequential or other parallel architecture models to shared memory many-core architectures, non-trivial modifications are often needed in order to match the execution patterns of the target algorithms with the characteristics of many-core architectures. […]

CUDA

Apr, 15

Efficient Partitioning Based Hierarchical Agglomerative Clustering Using Graphics Accelerators with CUDA

We explore the capabilities of today’s high-end graphics processing units (GPUs) on desktop computers to efficiently perform hierarchical agglomerative clustering (HAC) through partitioning of gene expressions. Our focus is to significantly reduce time and memory bottlenecks of the traditional HAC algorithm by parallelization and acceleration of computations without compromising the accuracy of clusters. We use […]

CUDA

Apr, 13

3D Haar-Like Elliptical Features for Object Classification in Microscopy

Object detection and classification are key tasks in computer vision that can facilitate high-throughput image analysis of microscopy data. We present a set of local image descriptors for three-dimensional (3D) microscopy datasets inspired by the well-known Haar wavelet framework. We add orientation, illumination and scale information by assuming that the neighborhood surrounding points of interest […]

Apr, 13

Co-processing SPMD Computation on GPUs and CPUs on Shared Memory System

Heterogeneous parallel systems with multiprocessors and accelerators are becoming ubiquitous due to better cost-performance and energy efficiency. These heterogeneous processor architectures have different instruction sets and are optimized for either task latency or throughput. Challenges arise in programmability and performance when executing SPMD computations on heterogeneous architectures simultaneously. In order to meet these […]

CUDA

Apr, 13

High Performance FFT Based Poisson Solver on a CPU-GPU Heterogeneous Platform

We develop an optimized FFT-based Poisson solver on a CPU-GPU heterogeneous platform for the case when the input is too large to fit in the GPU global memory. The solver involves memory-bound computations such as 3D FFT in which the large 3D data may have to be transferred over the PCIe bus several […]

CUDA

DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.

DeepCompile: A Compiler-Driven Approach to Optimizing Distributed Deep Learning Training

Large Language Model Powered C-to-CUDA Code Translation: A Novel Auto-Parallelization Framework

GigaAPI: a user-space API that simplifies multi-GPU programming, bridging the gap between the capabilities of parallel GPU systems and the ability of developers to harness their full potential

GigaAPI for GPU Parallelization

Coccinelle: a C code transformation engine using SmPL for matches, refactorings, and bug fixing

Advances in Semantic Patching for HPC-oriented Refactorings with Coccinelle

DuoReduce: MLIR's benchmark

Hardware-Assisted Software Testing and Debugging for Heterogeneous Computing

Recent source codes

PELSI: Power-Efficient Layer-Switched Inference

Ouroboros: Virtualized Queues for dynamic memory management

MSCCL++: A GPU-driven communication stack for scalable AI applications

Benchmark compute shader of Unity against InteropUnityCUDA

Data-efficient LLM Fine-tuning for Code Generation

Most viewed papers (last 30 days)