high performance computing on graphics processing units: hgpu.org

Posts

Jun, 1

Towards Distributed Heterogenous High-Performance Computing with ViennaCL

One of the major drawbacks of computing with graphics adapters is the limited available memory for relevant problem sizes. To overcome this limitation for the ViennaCL library, we investigate a partitioning approach for one of the standard benchmark problems in High-Performance Computing (HPC), namely the dense matrix-matrix product. We apply this partitioning approach to problems […]

OpenCL
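
The partitioning idea mentioned in the ViennaCL excerpt above can be illustrated with a blocked dense matrix-matrix product. This is only a minimal CPU-side sketch, not the paper's method or the ViennaCL API: the function name `partitioned_matmul` and the `tile` parameter are hypothetical, and the sketch assumes that the point of partitioning is to keep only a few small blocks resident in limited device memory at a time.

```python
def partitioned_matmul(a, b, tile):
    """Compute C = A * B block by block (hypothetical helper).

    Only one tile-sized block each of A, B and C is touched per inner step,
    which is the property a memory-limited device would rely on.
    """
    n, k, m = len(a), len(b), len(b[0])
    c = [[0.0] * m for _ in range(n)]
    for i0 in range(0, n, tile):          # block row of C
        for j0 in range(0, m, tile):      # block column of C
            for k0 in range(0, k, tile):  # block along the shared dimension
                # On a GPU these three blocks would be the data moved to the
                # device; here they are simply index ranges (assumption).
                for i in range(i0, min(i0 + tile, n)):
                    for j in range(j0, min(j0 + tile, m)):
                        acc = c[i][j]
                        for p in range(k0, min(k0 + tile, k)):
                            acc += a[i][p] * b[p][j]
                        c[i][j] = acc
    return c


if __name__ == "__main__":
    a = [[1, 2, 3], [4, 5, 6]]
    b = [[7, 8], [9, 10], [11, 12]]
    print(partitioned_matmul(a, b, tile=2))
```

The result does not depend on `tile`; the tile size only changes how much data each step needs at once.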

Jun, 1

MPI-ACC: An Integrated and Extensible Approach to Data Movement in Accelerator-Based Systems

Data movement in high-performance computing systems accelerated by graphics processing units (GPUs) remains a challenging problem. Data communication in popular parallel programming models, such as the Message Passing Interface (MPI), is currently limited to the data stored in the CPU memory space. Auxiliary memory systems, such as GPU memory, are not integrated into such data […]

CUDA

Jun, 1

A Data-Parallel Extension to Ruby for GPGPU

We propose Ikra, a data-parallel extension to Ruby for general-purpose computing on graphics processing units (GPGPU). Our approach is to provide a special array class with higher-order methods for describing computation on a GPU. With a static type inference system that identifies code fragments that shall be executed on a GPU and with a skeleton-based […]

CUDA

Jun, 1

Generating Device-specific GPU code for Local Operators in Medical Imaging

To cope with the complexity of programming GPU accelerators for medical imaging computations, we developed a framework to describe image processing kernels in a domain-specific language, which is embedded into C++. The description uses decoupled access/execute metadata, which allow the programmer to specify both execution constraints and memory access patterns of kernels. A source-to-source compiler […]

CUDA • OpenCL

Jun, 1

An open source MATLAB program for fast numerical Feynman integral calculations for open quantum system dynamics on GPUs

This MATLAB program calculates the dynamics of the reduced density matrix of an open quantum system modeled by the Feynman-Vernon model. The user gives the program a vector describing the coordinate of an open quantum system, a Hamiltonian matrix describing its energy, and a spectral distribution function and temperature describing the environment’s influence on it, […]

May, 30

clSpMV: A Cross-Platform OpenCL SpMV Framework on GPUs

The sparse matrix-vector multiplication (SpMV) kernel is a key computation in linear algebra. Most iterative methods are composed of SpMV operations with BLAS1 updates. Therefore, researchers make extensive efforts to optimize the SpMV kernel in sparse linear algebra. With the appearance of OpenCL, a programming language that standardizes parallel programming across a wide variety of […]

OpenCL
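
For readers unfamiliar with the SpMV operation named in the clSpMV excerpt above, a minimal reference version is sketched here. It assumes the compressed sparse row (CSR) layout, which is one common storage format; the excerpt does not say which formats clSpMV uses, and the function name `csr_spmv` is hypothetical. A GPU kernel would typically assign rows to threads rather than loop over them sequentially.

```python
def csr_spmv(values, col_idx, row_ptr, x):
    """Return y = A * x for a matrix A stored in CSR form (hypothetical helper).

    values  : nonzero entries, row by row
    col_idx : column index of each entry in `values`
    row_ptr : row_ptr[r] .. row_ptr[r + 1] delimits the entries of row r
    """
    y = []
    for row in range(len(row_ptr) - 1):
        acc = 0.0
        for k in range(row_ptr[row], row_ptr[row + 1]):
            acc += values[k] * x[col_idx[k]]
        y.append(acc)
    return y


if __name__ == "__main__":
    # CSR encoding of [[10, 0, 0], [0, 20, 30], [0, 0, 40]]
    values = [10, 20, 30, 40]
    col_idx = [0, 1, 2, 2]
    row_ptr = [0, 1, 3, 4]
    print(csr_spmv(values, col_idx, row_ptr, [1, 2, 3]))
```

Each row's dot product is independent of the others, which is what makes the operation a natural fit for data-parallel hardware.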

May, 30

Large-scale Nanostructure Simulations from X-ray Scattering Data On Graphics Processor Clusters

X-ray scattering is a valuable tool for measuring the structural properties of materials used in the design and fabrication of energy-relevant nanodevices (e.g., photovoltaic, energy storage, battery, fuel, and carbon capture and sequestration devices) that are key to the reduction of carbon emissions. Although today’s ultra-fast X-ray scattering detectors can provide tremendous information on the […]

CUDA

May, 30

Accelerating an imaging spectroscopy algorithm for submerged marine environments using heterogeneous computing

Graphics Processing Units (GPUs) have proven to be highly effective at accelerating processing speed for a large range of scientific and general purpose applications. As data needs increase, and more complex data analysis methods are used, the processing requirements for solving scientific problems also correspondingly increase. The massive parallel processing power of GPUs can be […]

OpenCL

May, 30

X-Device Query Processing by Bitwise Distribution

The diversity of hardware components within a single system calls for strategies for efficient cross-device data processing. For example, existing approaches to CPU/GPU co-processing distribute individual relational operators to the "most appropriate" device. While pleasantly simple, this strategy has a number of problems: it may leave the "inappropriate" devices idle while overloading the "appropriate" device […]

CUDA

May, 30

GPU-accelerated simulation of colloidal suspensions with direct hydrodynamic interactions

Solvent-mediated hydrodynamic interactions between colloidal particles can significantly alter their dynamics. We discuss the implementation of Stokesian dynamics in leading approximation for streaming processors as provided by the compute unified device architecture (CUDA) of recent graphics processors (GPUs). Thereby, the simulation of explicit solvent particles is avoided and hydrodynamic interactions can easily be accounted for […]

CUDA

May, 29

Performance-Analysis-Based Acceleration of Image Quality Assessment

Two stages are commonly employed in modern algorithms of image/video quality assessment (QA): (1) a local frequency-based decomposition, and (2) block-based statistical comparisons between the frequency coefficients of the reference and distorted images. This paper presents a performance analysis of and techniques for accelerating these stages. We specifically analyze and accelerate one representative QA algorithm […]

CUDA

May, 29

COVRA: A compression-domain output-sensitive volume rendering architecture based on a sparse representation of voxel blocks

We present a novel multiresolution compression-domain GPU volume rendering architecture designed for interactive local and networked exploration of rectilinear scalar volumes on commodity platforms. In our approach, the volume is decomposed into a multiresolution hierarchy of bricks. Each brick is further subdivided into smaller blocks, which are compactly described by sparse linear combinations of prototype […]

OpenGL

Recent source codes

Kernel Library for LLM Serving

Adaptivity in AdaptiveCpp: Optimizing Performance by Leveraging Runtime Information During JIT-Compilation

Neptune: Advanced ML Operator Fusion for Locality and Parallelism on GPUs

Genten: Software for Generalized Tensor Decompositions by Sandia National Laboratories

Interleaved Learning and Exploration: A Self-Adaptive Fuzz Testing Framework for MLIR

Pinocchio: PINpointing Orbit Crossing Collapsed Hierarchical Objects

KernelCoder: trained on a curated dataset of reasoning traces and CUDA kernel pairs

VibeCodeHPC - Multi Agentic Vibe Coding for HPC

Compile-Time Resource Safety for GPU APIs: A Low-Overhead Typestate Framework

exa-AMD: Exascale Accelerated Materials Discovery
