high performance computing on graphics processing units: hgpu.org

Posts

Sep, 22

On-the-fly elimination of dynamic irregularities for GPU computing

The power-efficient massively parallel Graphics Processing Units (GPUs) have become increasingly influential for general-purpose computing over the past few years. However, their efficiency is sensitive to dynamic irregular memory references and control flows in an application. Experiments have shown great performance gains when these irregularities are removed. But it remains an open question how to […]

CUDA

Sep, 22

Reducing branch divergence in GPU programs

Branch divergence has a significant impact on the performance of GPU programs. We propose two novel software-based optimizations, called iteration delaying and branch distribution, that aim to reduce branch divergence. Iteration delaying targets a divergent branch enclosed by a loop within a kernel. It improves performance by executing loop iterations that take the same branch […]

CUDA

Sep, 22

A case for neuromorphic ISAs

The desire to create novel computing systems, paired with recent advances in neuroscientific understanding of the brain, has led researchers to develop neuromorphic architectures that emulate the brain. To date, such models are developed, trained, and deployed on the same substrate. However, excessive co-dependence between the substrate and the algorithm prevents portability, or at the […]

CUDA

Sep, 22

Acceleration of the speed of tissue characterization algorithm for coronary plaque by employing GPGPU technique

General-purpose computing on graphics processing units (GPGPU) has recently come into the limelight. The authors have proposed the multiple k-nearest neighbor (MkNN) classifier for the tissue characterization of coronary plaque. Its characterization performance has been rated highly. The purpose of this paper is to accelerate the MkNN classifier, aiming for it […]

CUDA

Sep, 21

Reconstructing hash reversal based proof of work schemes

Proof of work schemes use client puzzles to manage limited resources on a server and provide resilience to denial of service attacks. Attacks utilizing GPUs to inflate computational capacity, known as resource inflation, are a novel and powerful threat that dramatically increases the computational disparity between clients. This disparity renders proof of work schemes based […]

CUDA

Sep, 21

Fast analysis of molecular dynamics trajectories with graphics processing units-Radial distribution function histogramming

The calculation of radial distribution functions (RDFs) from molecular dynamics trajectory data is a common and computationally expensive analysis task. The rate-limiting step in the calculation of the RDF is building a histogram of the distances between atom pairs in each trajectory frame. Here we present an implementation of this histogramming scheme for multiple […]

CUDA

Sep, 21

Scalable, High Performance Fourier Domain Optical Coherence Tomography: Why FPGAs and Not GPGPUs

Fourier Domain Optical Coherence Tomography (FD-OCT) is an emerging biomedical imaging technology featuring ultra-high resolution and fast imaging speed. Due to the complexity of the FD-OCT algorithm, real time FD-OCT imaging demands high performance computing platforms. However, the scaling of real-time FD-OCT processing for increasing data acquisition rates and 3-dimensional (3D) imaging is quickly outpacing […]

CUDA

Sep, 21

Non-deterministic parallelism considered useful

The development of distributed execution engines has greatly simplified parallel programming, by shielding developers from the gory details of programming in a distributed system, and allowing them to focus on writing sequential code [8, 11, 18]. The "sacred cow" in these systems is transparent fault tolerance, which is achieved by dividing the computation into atomic […]

Sep, 21

Parallel graduated assignment algorithm for multiple graph matching based on a common labelling

This paper presents a new parallel algorithm to compute multiple graph matching based on Graduated Assignment. The aim of developing this parallel algorithm is to perform multiple graph matching on a current desktop computer but, instead of executing the code on the general-purpose processor, we execute parallel code on the graphics processing unit. Our […]

CUDA

Sep, 21

GPU-based cloud performance for LiDAR data processing

The goal of this paper is to compare the timing/performance results of CPU and GPU on local and cloud platforms for processing massive Light Detection and Ranging (LiDAR) topographic data. Locally, we used various multi-core CPU technologies as well as GPU implementations on various NVIDIA graphics cards that support CUDA, whereas a cloud […]

CUDA

Sep, 21

Multicore performance optimization using partner cores

As the push for parallelism continues to increase the number of cores on a chip, system design has become incredibly complex; optimizing for performance and power efficiency is now nearly impossible for the application programmer. To assist the programmer, a variety of techniques for optimizing performance and power at runtime have been developed, but many […]

Sep, 21

Mint: realizing CUDA performance in 3D stencil methods with annotated C

We present Mint, a programming model that enables the non-expert to enjoy the performance benefits of hand coded CUDA without becoming entangled in the details. Mint targets stencil methods, which are an important class of scientific applications. We have implemented the Mint programming model with a source-to-source translator that generates optimized CUDA C from traditional […]

CUDA

Recent source codes

Coccinelle: a C code transformation engine using SmPL for matches, refactorings, and bug fixing

DuoReduce: MLIR's benchmark

Shamrock: Multi-GPU hydrodynamics for astrophysics

LLMPerf: GPU Performance Modeling meets Large Language Models

Hercules: A Compiler for Productive Programming of Heterogeneous Systems

Celerity Runtime: High-level C++ for Accelerator Clusters

wgpy: WebGL accelerated numpy-compatible array library for web browser

Microbenchmarking OpenMP target offload with Catch2

SUperman: Highly Efficient Permanent Computation Library

TransCL: An Automatic CUDA-to-OpenCL Programs Transformation Framework
