high performance computing on graphics processing units: hgpu.org

Posts

Jun, 1

A Data-Parallel Extension to Ruby for GPGPU

We propose Ikra, a data-parallel extension to Ruby for general-purpose computing on graphics processing units (GPGPU). Our approach is to provide a special array class with higher-order methods for describing computation on a GPU. With a static type inference system that identifies code fragments that shall be executed on a GPU and with a skeleton-based […]

CUDA

Jun, 1

Generating Device-specific GPU code for Local Operators in Medical Imaging

To cope with the complexity of programming GPU accelerators for medical imaging computations, we developed a framework to describe image processing kernels in a domain-specific language, which is embedded into C++. The description uses decoupled access/execute metadata, which allow the programmer to specify both execution constraints and memory access patterns of kernels. A source-to-source compiler […]

CUDA • OpenCL

Jun, 1

An open source MATLAB program for fast numerical Feynman integral calculations for open quantum system dynamics on GPUs

This MATLAB program calculates the dynamics of the reduced density matrix of an open quantum system modeled by the Feynman-Vernon model. The user gives the program a vector describing the coordinate of an open quantum system, a Hamiltonian matrix describing its energy, and a spectral distribution function and temperature describing the environment’s influence on it, […]

May, 30

clSpMV: A Cross-Platform OpenCL SpMV Framework on GPUs

The sparse matrix-vector multiplication (SpMV) kernel is a key computation in linear algebra. Most iterative methods are composed of SpMV operations with BLAS1 updates. Therefore, researchers make extensive efforts to optimize the SpMV kernel in sparse linear algebra. With the appearance of OpenCL, a programming language that standardizes parallel programming across a wide variety of […]

OpenCL

May, 30

Large-scale Nanostructure Simulations from X-ray Scattering Data On Graphics Processor Clusters

X-ray scattering is a valuable tool for measuring the structural properties of materials used in the design and fabrication of energy-relevant nanodevices (e.g., photovoltaic, energy storage, battery, fuel, and carbon capture and sequestration devices) that are key to the reduction of carbon emissions. Although today’s ultra-fast X-ray scattering detectors can provide tremendous information on the […]

CUDA

May, 30

Accelerating an imaging spectroscopy algorithm for submerged marine environments using heterogeneous computing

Graphics Processing Units (GPUs) have proven to be highly effective at accelerating processing speed for a large range of scientific and general purpose applications. As data needs increase, and more complex data analysis methods are used, the processing requirements for solving scientific problems also correspondingly increase. The massive parallel processing power of GPUs can be […]

OpenCL

May, 30

X-Device Query Processing by Bitwise Distribution

The diversity of hardware components within a single system calls for strategies for efficient cross-device data processing. For example, existing approaches to CPU/GPU co-processing distribute individual relational operators to the "most appropriate" device. While pleasantly simple, this strategy has a number of problems: it may leave the "inappropriate" devices idle while overloading the "appropriate" device […]

CUDA

May, 30

GPU-accelerated simulation of colloidal suspensions with direct hydrodynamic interactions

Solvent-mediated hydrodynamic interactions between colloidal particles can significantly alter their dynamics. We discuss the implementation of Stokesian dynamics in leading approximation for streaming processors as provided by the compute unified device architecture (CUDA) of recent graphics processors (GPUs). Thereby, the simulation of explicit solvent particles is avoided and hydrodynamic interactions can easily be accounted for […]

CUDA

May, 29

Performance-Analysis-Based Acceleration of Image Quality Assessment

Two stages are commonly employed in modern algorithms of image/video quality assessment (QA): (1) a local frequency-based decomposition, and (2) block-based statistical comparisons between the frequency coefficients of the reference and distorted images. This paper presents a performance analysis of and techniques for accelerating these stages. We specifically analyze and accelerate one representative QA algorithm […]

CUDA

May, 29

COVRA: A compression-domain output-sensitive volume rendering architecture based on a sparse representation of voxel blocks

We present a novel multiresolution compression-domain GPU volume rendering architecture designed for interactive local and networked exploration of rectilinear scalar volumes on commodity platforms. In our approach, the volume is decomposed into a multiresolution hierarchy of bricks. Each brick is further subdivided into smaller blocks, which are compactly described by sparse linear combinations of prototype […]

OpenGL

May, 29

A GPU-Based Track-Repeating Algorithm for Dose Calculation for Photon Radiotherapy

An essential ingredient in radiotherapy is the calculation of the dose to be delivered to the patient. Analytical algorithms are commonly used for such a task; however, their accuracy is not always satisfactory. Monte Carlo techniques provide higher accuracy, but they often require large computational times. Track-repeating algorithms, for example the Fast Dose Calculator, have […]

CUDA

May, 29

Hybrid Update Algorithms for Regular Lattice and Small-World Ising Models on Graphical Processing Units

Local and cluster Monte Carlo update algorithms offer a complex tradeoff space for optimising the performance of simulations of the Ising model. We systematically explore tradeoffs between hybrid Metropolis and Wolff cluster updates for the 3D Ising model using data-parallelism and graphical processing units. We investigate performance for both regular lattices as well as for […]

CUDA

DeepCompile: A Compiler-Driven Approach to Optimizing Distributed Deep Learning Training

GigaAPI for GPU Parallelization

Recent source codes

MSCCL++: A GPU-driven communication stack for scalable AI applications

Benchmark compute shader of Unity against InteropUnityCUDA

Data-efficient LLM Fine-tuning for Code Generation

DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.

Large Language Model Powered C-to-CUDA Code Translation: A Novel Auto-Parallelization Framework

GigaAPI: a user-space API that simplifies multi-GPU programming, bridging the gap between the capabilities of parallel GPU systems and the ability of developers to harness their full potential

Coccinelle: a C code transformation engine using SmPL for matches, refactorings, and bug fixing

DuoReduce: MLIR's benchmark

Shamrock: Multi-GPU hydrodynamics for astrophysics

LLMPerf: GPU Performance Modeling meets Large Language Models

Most viewed papers (last 30 days)