high performance computing on graphics processing units: hgpu.org

Posts

Apr, 18

Supporting Iteration in a Heterogeneous Data Flow Engine

Dataflow execution engines such as MapReduce, DryadLINQ, and PTask have enjoyed success because they simplify development for a class of important parallel applications. These systems sacrifice generality for simplicity: while many workloads are easily expressed, important idioms like iteration and recursion are difficult to express and support efficiently. We consider the problem of extending a […]

Apr, 18

Massively Parallel Suffix Array Queries and On-Demand Phrase Extraction for Statistical Machine Translation Using GPUs

Translation models can be scaled to large corpora and arbitrarily long phrases by looking up translations of source phrases on the fly in an indexed parallel text. However, this is impractical because on-demand extraction of phrase tables is a major computational bottleneck. We solve this problem by developing novel algorithms for general-purpose graphics processing units […]

CUDA

Apr, 18

Computing Privacy-Preserving Edit Distance and Smith-Waterman Problems on the GPU Architecture

This paper presents privacy-preserving, parallel computing algorithms on a graphics processing unit (GPU) architecture to solve the Edit-Distance (ED) and the Smith-Waterman (SW) problems. The ED and SW problems are formulated as dynamic programming (DP) problems, which are solved using Secure Function Evaluation (SFE) to meet privacy protection requirements, based on the semi-honest […]

CUDA
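
The abstract above frames edit distance as a dynamic programming problem. As a rough illustration of that recurrence only, here is a minimal plain-Python sketch. It is not the paper's algorithm: it includes no Secure Function Evaluation and no GPU code, and the function name `edit_distance` is hypothetical.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the textbook DP recurrence.

    Only two rows of the DP table are kept in memory at a time.
    """
    # prev[j] holds the distance between the empty prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        # Distance between a[:i] and the empty string is i deletions.
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # match or substitute
            ))
        prev = curr
    return prev[-1]


if __name__ == "__main__":
    print(edit_distance("kitten", "sitting"))
```

Each cell depends only on its left, upper, and upper-left neighbours, so cells on the same anti-diagonal are independent of one another. That property is what generally makes this kind of recurrence a candidate for parallel hardware.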

Apr, 17

GPU Accelerated Face Detection (thesis)

Graphics processing units have massive parallel processing capabilities, and there is a growing interest in utilizing them for generic computing. One area of interest is computationally heavy computer vision algorithms, such as face detection and recognition. Face detection is used in a variety of applications, for example the autofocus on cameras, face and emotion recognition, […]

OpenCL

Apr, 17

Parallel GPU Architecture Simulation Framework Exploiting Work Allocation Unit Parallelism

GPU computing is at the forefront of high-performance computing, and it has greatly affected current studies on parallel software and hardware design because of its massively parallel architecture. Therefore, numerous studies have focused on the utilization of GPUs in various fields. However, studies of GPU architectures are constrained by the lack of a suitable GPU […]

CUDA

Apr, 17

LU Factorization with Partial Pivoting for a Multi-CPU, Multi-GPU Shared Memory System

LU factorization with partial pivoting is a canonical numerical procedure and the main component of the High Performance Linpack benchmark. This article presents an implementation of the algorithm for a hybrid shared-memory system with standard CPU cores and GPU accelerators. The optimizations include lookahead, dynamic task scheduling, fine-grained parallelism for memory-bound operations, autotuning, […]

CUDA
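
For readers unfamiliar with the procedure named in the abstract above, the following is a minimal sequential sketch of LU factorization with partial pivoting in plain Python. It shows only the textbook algorithm. None of the paper's optimizations (lookahead, task scheduling, autotuning) are represented, and the function name `lu_partial_pivot` and its return convention are assumptions made for illustration.

```python
def lu_partial_pivot(a):
    """Factor a square matrix so that P*A = L*U.

    Returns (lu, perm): lu stores U on and above the diagonal and the
    multipliers of L (unit diagonal implied) below it; row i of P*A is
    row perm[i] of the input matrix.
    """
    n = len(a)
    lu = [[float(v) for v in row] for row in a]  # work on a copy
    perm = list(range(n))
    for k in range(n):
        # Partial pivoting: pick the row with the largest |entry| in column k.
        p = max(range(k, n), key=lambda r: abs(lu[r][k]))
        if lu[p][k] == 0.0:
            raise ValueError("matrix is singular")
        if p != k:
            lu[k], lu[p] = lu[p], lu[k]
            perm[k], perm[p] = perm[p], perm[k]
        # Eliminate entries below the pivot and update the trailing block.
        for i in range(k + 1, n):
            lu[i][k] /= lu[k][k]
            for j in range(k + 1, n):
                lu[i][j] -= lu[i][k] * lu[k][j]
    return lu, perm


if __name__ == "__main__":
    lu, perm = lu_partial_pivot([[0.0, 2.0], [1.0, 3.0]])
    print(lu, perm)
```

The trailing-block update in the innermost loops is where most of the arithmetic happens, which is why implementations typically express it as matrix-matrix products.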

Apr, 17

A Framework for Profiling and Performance Monitoring of Heterogeneous Applications

Heterogeneous computing has become prevalent due to the computing power and low cost of Graphics Processing Units (GPUs). OpenCL provides a programming model where the CPU is the master or host, and compute-intensive portions of an algorithm are offloaded to the GPU. However, the host-device model is very limiting. In this model, data-dependent, run-time optimizations that […]

OpenCL

Apr, 17

Communication-Minimizing 2D Convolution in GPU Registers

2D image convolution is ubiquitous in image processing and computer vision problems such as feature extraction. Exploiting parallelism is a common strategy for accelerating convolution. Parallel processors keep getting faster, but algorithms such as image convolution remain memory-bound on parallel processors such as GPUs. Therefore, reducing memory communication is fundamental to accelerating image convolution. […]

CUDA
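
To make the memory-traffic point in the abstract above concrete, here is a minimal sketch of direct 2D convolution in plain Python ("valid" output region only). It is a naive reference, not the register-level technique the paper describes. The function name `convolve2d_valid` is hypothetical.

```python
def convolve2d_valid(image, kernel):
    """Direct 2D convolution over the region where the kernel fully fits.

    image and kernel are lists of equal-length rows of numbers.
    """
    ih, iw = len(image), len(image[0])
    kh, kw = len(kernel), len(kernel[0])
    oh, ow = ih - kh + 1, iw - kw + 1
    out = [[0.0] * ow for _ in range(oh)]
    for y in range(oh):
        for x in range(ow):
            acc = 0.0
            for ky in range(kh):
                for kx in range(kw):
                    # The kernel is flipped in both axes (true convolution,
                    # as opposed to cross-correlation).
                    acc += image[y + ky][x + kx] * kernel[kh - 1 - ky][kw - 1 - kx]
            out[y][x] = acc
    return out


if __name__ == "__main__":
    img = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
    print(convolve2d_valid(img, [[1, 0], [0, -1]]))
```

In this naive form an interior input pixel is read once for every kernel tap that covers it, up to `kh * kw` times, while only one multiply-add is done per read. That low ratio of arithmetic to memory access is the reason such a loop tends to be limited by memory rather than compute.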

Apr, 16

Zero-copy I/O processing for low-latency GPU computing

Cyber-physical systems (CPS) aim to monitor and control complex real-world phenomena where the computational cost and real-time constraints could be a major challenge. Many-core hardware accelerators such as graphics processing units (GPUs) promise to enhance computation, leveraging the data parallelism often found in real-world scenarios of CPS, but performance is limited by the overhead of […]

CUDA

Apr, 16

Fast simulation of nonlinear radio frequency ultrasound images in inhomogeneous nonlinear media: CREANUIS

The simulation of ultrasound images is usually based on two main strategies: either a linear convolution or the use of an acoustic model. However, only the linear propagation of the pressure wave is considered in the simulation tools generally used. CREANUIS is a recent simulation tool (freely available on the Internet) which implements the nonlinear […]

Apr, 16

High-dimensional wave atoms and compression of seismic datasets

Wave atoms are a low-redundancy alternative to curvelets, suitable for high-dimensional seismic data processing. This abstract extends the wave atom orthobasis construction to 3D, 4D, and 5D Cartesian arrays, and parallelizes it in a shared-memory environment. An implementation of the algorithm for NVIDIA CUDA-capable graphics processing units (GPUs) is also developed to accelerate computation […]

CUDA

Apr, 16

Novel implementations of recursive discrete wavelet transform for real time computation with multicore systems on chip (SOC)

The discrete wavelet transform (DWT) has been studied and developed in various scientific and engineering fields. Its multi-resolution and locality properties suit applications that need to capture high-frequency details progressively. However, when dealing with enormous data volumes, performance may drop drastically. The multi-resolution sub-band encoding provided by the DWT enables higher compression ratios, and […]

CUDA
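
The multi-resolution idea mentioned in the abstract above can be illustrated with a small sketch. The Haar wavelet is used here purely as an assumption for simplicity; the abstract does not say which wavelet the paper uses, and the function name `haar_dwt` is hypothetical.

```python
import math


def haar_dwt(signal, levels):
    """Multi-level 1D Haar DWT (orthonormal scaling).

    Returns (approx, details), where details[0] is the finest level.
    The length of the signal must be divisible by 2**levels.
    """
    approx = [float(v) for v in signal]
    details = []
    scale = 1.0 / math.sqrt(2.0)
    for _ in range(levels):
        if len(approx) < 2 or len(approx) % 2 != 0:
            raise ValueError("signal length must be divisible by 2**levels")
        pairs = range(0, len(approx), 2)
        # Low-pass branch: scaled pairwise sums (coarse approximation).
        low = [(approx[i] + approx[i + 1]) * scale for i in pairs]
        # High-pass branch: scaled pairwise differences (detail coefficients).
        high = [(approx[i] - approx[i + 1]) * scale for i in pairs]
        details.append(high)
        # The next level is applied to the approximation only.
        approx = low
    return approx, details


if __name__ == "__main__":
    print(haar_dwt([4, 6, 10, 12, 8, 6, 5, 5], levels=3))
```

Each level halves the length of the approximation, so the work per level shrinks geometrically. Within a level, every output pair is computed from its own two inputs, independently of the others.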
