high performance computing on graphics processing units: hgpu.org

Posts

Sep, 23

Quantifying NUMA and contention effects in multi-GPU systems

As system architects strive for increased density and power efficiency, the traditional compute node is being augmented with an increasing number of graphics processing units (GPUs). The integration of multiple GPUs per node introduces complex performance phenomena including non-uniform memory access (NUMA) and contention for shared system resources. Utilizing the Keeneland system, this paper quantifies […]

CUDA

Sep, 22

Register packing for cyclic reduction: a case study

We generalize a method for avoiding GPU shared communication when dealing with a downsweep pattern. We apply this generalization to Cyclic Reduction, a tridiagonal solver with this pattern. Previously, Cyclic Reduction suffered poor performance when compared to other tridiagonal solvers on the GPU due to performance issues stemming from shared-memory bandwidth bottlenecks and step-efficiency. We […]

CUDA

Sep, 22

On-the-fly elimination of dynamic irregularities for GPU computing

The power-efficient massively parallel Graphics Processing Units (GPUs) have become increasingly influential for general-purpose computing over the past few years. However, their efficiency is sensitive to dynamic irregular memory references and control flows in an application. Experiments have shown great performance gains when these irregularities are removed. But it remains an open question how to […]

CUDA

Sep, 22

Reducing branch divergence in GPU programs

Branch divergence has a significant impact on the performance of GPU programs. We propose two novel software-based optimizations, called iteration delaying and branch distribution, that aim to reduce branch divergence. Iteration delaying targets a divergent branch enclosed by a loop within a kernel. It improves performance by executing loop iterations that take the same branch […]

CUDA

Sep, 22

A case for neuromorphic ISAs

The desire to create novel computing systems, paired with recent advances in neuroscientific understanding of the brain, has led researchers to develop neuromorphic architectures that emulate the brain. To date, such models are developed, trained, and deployed on the same substrate. However, excessive co-dependence between the substrate and the algorithm prevents portability, or at the […]

CUDA

Sep, 22

Acceleration of the speed of tissue characterization algorithm for coronary plaque by employing GPGPU technique

General-purpose computing on graphics processing units (GPGPU) has recently attracted considerable attention. The authors have proposed the multiple k-nearest neighbor (MkNN) classifier for the tissue characterization of coronary plaque, and its characterization performance has been rated highly. The purpose of this paper is to accelerate the MkNN classifier, aiming for it […]

CUDA

Sep, 21

Reconstructing hash reversal based proof of work schemes

Proof of work schemes use client puzzles to manage limited resources on a server and provide resilience to denial of service attacks. Attacks utilizing GPUs to inflate computational capacity, known as resource inflation, are a novel and powerful threat that dramatically increases the computational disparity between clients. This disparity renders proof of work schemes based […]

CUDA

Sep, 21

Fast analysis of molecular dynamics trajectories with graphics processing units-Radial distribution function histogramming

The calculation of radial distribution functions (RDFs) from molecular dynamics trajectory data is a common and computationally expensive analysis task. The rate limiting step in the calculation of the RDF is building a histogram of the distance between atom pairs in each trajectory frame. Here we present an implementation of this histogramming scheme for multiple […]

CUDA

Sep, 21

Scalable, High Performance Fourier Domain Optical Coherence Tomography: Why FPGAs and Not GPGPUs

Fourier Domain Optical Coherence Tomography (FD-OCT) is an emerging biomedical imaging technology featuring ultra-high resolution and fast imaging speed. Due to the complexity of the FD-OCT algorithm, real time FD-OCT imaging demands high performance computing platforms. However, the scaling of real-time FD-OCT processing for increasing data acquisition rates and 3-dimensional (3D) imaging is quickly outpacing […]

CUDA

Sep, 21

Non-deterministic parallelism considered useful

The development of distributed execution engines has greatly simplified parallel programming, by shielding developers from the gory details of programming in a distributed system, and allowing them to focus on writing sequential code [8, 11, 18]. The "sacred cow" in these systems is transparent fault tolerance, which is achieved by dividing the computation into atomic […]

Sep, 21

Parallel graduated assignment algorithm for multiple graph matching based on a common labelling

This paper presents a new parallel algorithm to compute multiple graph matching based on Graduated Assignment. The aim of developing this parallel algorithm is to perform multiple graph matching on a current desktop computer; instead of executing the code on the general-purpose processor, we execute parallel code on the graphics processing unit. Our […]

CUDA

Sep, 21

GPU-based cloud performance for LiDAR data processing

The goal of this paper is to compare the timing and performance results of CPUs and GPUs on local and cloud platforms for processing massive Light Detection and Ranging (LiDAR) topographic data. Locally, we used various multi-core CPU technologies as well as GPU implementations on various NVIDIA graphics cards that support CUDA, whereas a cloud […]

CUDA

Recent source codes

SimSYCL: Synchronous, single-threaded, library-only SYCL implementation for debugging and verification

GPU plugin for PySCF

QArray

Celerity: High-level C++ for Accelerator Clusters

gpu_tracker: Context manager and CLI that tracks the computational-resource-usage of a code block or shell command, particularly the GPU usage

CIFAR-10 Airbench: 94% on CIFAR-10 in 3.29 seconds

LOOPer: a polyhedral compiler for expressing fast and portable data parallel algorithms

OpenMC Monte Carlo Code

Polygeist: C/C++ frontend for MLIR

Parallel Gaussian process with kernel approximation in CUDA
