high performance computing on graphics processing units: hgpu.org

Posts

Jun, 22

Resource Sharing in GPU-Accelerated Windowing Systems

Recent windowing systems allow graphics applications to directly access the graphics processing unit (GPU) for fast rendering. However, application tasks that render frames on the GPU contend heavily with the windowing server that also accesses the GPU to blit the rendered frames to the screen. This resource-sharing nature of direct rendering introduces core challenges of […]

OpenGL

Jun, 22

Exploiting SPMD Horizontal Locality

In this paper, we analyze a particular case of spatial locality (called horizontal locality) inherent to manycore accelerator architectures that employ barrel execution of SPMD kernels, such as GPUs. We then propose an adaptive memory access granularity framework to exploit and enforce horizontal locality in order to reduce the interference among accelerator cores' memory accesses and […]

Jun, 21

Practical parallel imaging compressed sensing MRI: Summary of two years of experience in accelerating body MRI of pediatric patients

For the last two years, we have been experimenting with applying compressed sensing parallel imaging to body imaging of pediatric patients. It is a joint effort by teams from UC Berkeley, Stanford University and GE Healthcare. This paper aims to summarize our experience so far. We describe our acquisition approach: 3D spoiled-gradient-echo with Poisson-disc random undersampling […]

CUDA

Jun, 21

GPU-based acceleration of MPIE/MoM matrix calculation for the analysis of microstrip circuits

In this paper, we present a GPU-based algorithm that accelerates the MoM impedance matrix computation. Based on an efficient quasi-one-dimensional approximation of the reaction integrals, the MPIE formulation for the analysis of microstrip circuits is considered. We use NVIDIA CUDA as the GPU development tool and choose an edge-connected line-fed patch antenna as the reference problem. In […]

CUDA

Jun, 21

GPU acceleration of compton reconstruction for the PEDRO

Compton reconstruction requires the computationally intensive, yet highly parallelizable, task of Cone of Response (CoR) back-projection. The acceleration of CoR back-projection is of significant importance as a faster algorithm allows the user to increase either the size or resolution of the imaging volume. Such acceleration also lends itself to the realization of real-time reconstruction. The […]

OpenCL

Jun, 21

Improved Programming of GPU Architectures through Automated Data Allocation and Loop Restructuring

The programmability of recent graphics processing unit (GPU) architectures has been the main factor driving the dramatic increase in interest in this class of architectures as low-cost accelerators for a wide range of high-performance applications. Current GPU programming models, such as OpenCL and CUDA, still expose too many architectural features, such as the memory hierarchy, […]

CUDA

Jun, 21

GPU-based motion correction of contrast-enhanced liver MRI scans: An OpenCL implementation

Clinical diagnosis and quantification of liver disease have been improved through the development of techniques using contrast-enhanced liver MRI sequences. To qualitatively or quantitatively analyze such image sequences, one first needs to correct for rigid and non-rigid motion of the liver. For motion correction of the liver, we have employed bi-directional local correlation coefficient Demons, […]

OpenCL

Jun, 21

GPU accelerated rotation-based emission tomography reconstruction

Stochastic methods based on Maximum Likelihood Estimation (MLE) provide accurate tomographic reconstruction for emission imaging. Moreover, methods based on MLE allow an accurate physical model of the imaging setup to be included in the reconstruction process, thus enabling quantitative reconstruction of the radio-tracer activity distribution. It has been shown that inclusion of a spatially dependent PSF that […]

CUDA

Jun, 21

Performance evaluation of the multi-device OpenCL FDTD solver

We present results of an evaluation of a multi-device OpenCL FDTD solver. Portability between hardware manufactured by different vendors and also between highly specialized and parallel computing architectures available on the market, i.e. GPUs, multi-core CPUs and devices integrating both technologies in a single-die IC, is the main advantage of this solver. For code execution […]

OpenCL

Jun, 21

Scalable Streaming-Array of Simple Soft-Processors for Stencil Computations with Constant Memory-Bandwidth

Stencil computation is one of the important kernels in scientific computing; however, its sustained performance is limited by memory bandwidth, especially on multi-core microprocessors and GPGPUs, due to its small operational intensity. In this paper, we propose a scalable streaming-array (SSA) of simple soft-processors for high-performance stencil computation on multiple FPGAs. The SSA architecture allows a […]

Jun, 21

Protein alignment algorithms with an efficient backtracking routine on multiple GPUs

BACKGROUND: Pairwise sequence alignment methods are widely used in biological research. The increasing number of sequences is perceived as one of the upcoming challenges for sequence alignment methods in the near future. To overcome this challenge, several GPU (Graphics Processing Unit) computing approaches have been proposed lately. These solutions show a great potential of a […]

CUDA

Jun, 21

Fast, parallel, GPU-based construction of space filling curves and octrees

Space Filling Curves (SFCs) are particularly useful for linearizing data living in two- and three-dimensional spaces and have been used in a number of applications in scientific computing and visualization. Interestingly, octrees, another versatile data structure in computer graphics, can be viewed as multiple SFCs at varying resolutions, albeit with a parent-child relationship. In […]

CUDA

HPCTransCompile: An AI Compiler Generated Dataset for High-Performance CUDA Transpilation and LLM Preliminary Exploration

chemtrain: Training Molecular Dynamics Potentials in JAX

chemtrain-deploy: A parallel and scalable framework for machine learning potentials in million-atom MD simulations

microSYCL: SYCL micro-benchmarks repository

Exploring SYCL as a Portability Layer for High-Performance Computing on CPUs

Recent source codes

Efficient GPU Implementation of Multi-Precision Integer Division

ParEval: A Parallel Code Evaluation Benchmark

FlashSparse: Minimizing Computation Redundancy for Fast Sparse Matrix Multiplications on Tensor Cores

exa-AMD: Exascale Accelerated Materials Discovery

WiLLM: An Open Wireless LLM Communication System

Vcc: the Vulkan Clang Compiler

hpcbench: A set of benchmarking utilities for biomolecular simulation tools

HPCTransCompile: An AI Compiler Generated Dataset for High-Performance CUDA Transpilation and LLM Preliminary Exploration

chemtrain: Training Molecular Dynamics Potentials in JAX

microSYCL: SYCL micro-benchmarks repository
