high performance computing on graphics processing units: hgpu.org

Posts

Jan, 5

A Unified Iteration Space Transformation Framework for Sparse and Dense Tensor Algebra

We address the problem of optimizing mixed sparse and dense tensor algebra in a compiler. We show that standard loop transformations, such as strip-mining, tiling, collapsing, parallelization and vectorization, can be applied to irregular loops over sparse iteration spaces. We also show how these transformations can be applied to the contiguous value arrays of sparse […]

CUDA

Jan, 5

Pipelined Training with Stale Weights of Deep Convolutional Neural Networks

The growth in the complexity of Convolutional Neural Networks (CNNs) is increasing interest in partitioning a network across multiple accelerators during training and pipelining the backpropagation computations over the accelerators. Existing approaches avoid or limit the use of stale weights through techniques such as micro-batching or weight stashing. These techniques either underutilize of accelerators or […]

CUDA

Jan, 5

Sparse matrix partitioning for optimizing SpMV on CPU-GPU heterogeneous platforms

Sparse matrix–vector multiplication (SpMV) kernel dominates the computing cost in numerous applications. Most of the existing studies dedicated to improving this kernel have been targeting just one type of processing units, mainly multicore CPUs or graphics processing units (GPUs), and have not explored the potential of the recent, rapidly emerging, CPU-GPU heterogeneous platforms. To take […]

CUDA

Jan, 5

Towards Unified INT8 Training for Convolutional Neural Network

Recently low-bit (e.g., 8-bit) network quantization has been extensively studied to accelerate the inference. Besides inference, low-bit training with quantized gradients can further bring more considerable acceleration, since the backward process is often computation-intensive. Unfortunately, the inappropriate quantization of backward propagation usually makes the training unstable and even crash. There lacks a successful unified low-bit […]

Jan, 5

LLVM-based automation of memory decoupling for OpenCL applications on FPGAs

The availability of OpenCL High-Level Synthesis (OpenCL-HLS) has made FPGAs an attractive platform for power-efficient high-performance execution of massively parallel applications. At the same time, new design challenges emerge for massive thread-level parallelism on FPGAs. One major execution bottleneck is the high number of memory stalls exposed to data-path which overshadows the benefits of data-path […]

OpenCL

Dec, 29

Abstractions for Programming Graphics Processors in High-Level Programming Languages

Software development has long been based on hardware that grows exponentially faster, which has allowed application complexity to increase accordingly. This free lunch is over, however, and traditional CPUs (Central Processing Units) don’t double their performance every couple of years anymore. As a result, compute-intensive applications have increasingly been relying on hardware accelerators like GPUs […]

CUDA

Dec, 29

Porting tree-based hash table compression to GPGPU model checking

To reduce the costs of faulty software, methods to improve software quality are very popular nowadays. One of these methods is model checking: verifying the functional correctness of the model of a hardware or software system. The model implies a state space, which consists of all possible states of the system and all possible transitions […]

CUDA

Dec, 29

Automatic Performance Optimisation of Parallel Programs for GPUs via Rewrite Rules

Graphics Processing Units (GPUs) are now commonplace in computing systems and are the most successful parallel accelerators. Their performance is orders of magnitude higher than traditional Central Processing Units (CPUs) making them attractive for many application domains with high computational demands. However, achieving their full performance potential is extremely hard, even for experienced programmers, as […]

OpenCL

Dec, 29

Accelerating Molecular Docking by Parallelized Heterogeneous Computing – A Case Study of Performance, Quality of Results, and Energy-Efficiency using CPUs, GPUs, and FPGAs

Molecular Docking (MD) is a key tool in computer-aided drug design that aims to predict the binding pose between a small molecule and a macromolecular target. At its core, MD calculates the strength of possible binding poses, and searches for the energetically-stronger ones among those generated during simulation. Automatic Docking (AutoDock) is a widely-used MD […]

OpenCL

Dec, 29

Atmospheric turbulence removal using convolutional neural network

This paper describes a novel deep learning-based method for mitigating the effects of atmospheric distortion. We have built an end-to-end supervised convolutional neural network (CNN) to reconstruct turbulence-corrupted video sequence. Our framework has been developed on the residual learning concept, where the spatio-temporal distortions are learnt and predicted. Our experiments demonstrate that the proposed method […]

Dec, 15

JAX, M.D.: End-to-End Differentiable, Hardware Accelerated, Molecular Dynamics in Pure Python

A large fraction of computational science involves simulating the dynamics of particles that interact via pairwise or many-body interactions. These simulations, called Molecular Dynamics (MD), span a vast range of subjects from physics and materials science to biochemistry and drug discovery. Most MD software involves significant use of handwritten derivatives and code reuse across C++, […]

CUDA

Dec, 15

libmolgrid: GPU Accelerated Molecular Gridding for Deep Learning Applications

There are many ways to represent a molecule as input to a machine learning model and each is associated with loss and retention of certain kinds of information. In the interest of preserving three-dimensional spatial information, including bond angles and torsions, we have developed libmolgrid, a general-purpose library for representing three-dimensional molecules using multidimensional arrays. […]

CUDA

DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.

DeepCompile: A Compiler-Driven Approach to Optimizing Distributed Deep Learning Training

Large Language Model Powered C-to-CUDA Code Translation: A Novel Auto-Parallelization Framework

GigaAPI: a user-space API that simplifies multi-GPU programming, bridging the gap between the capabilities of parallel GPU systems and the ability of developers to harness their full potential

GigaAPI for GPU Parallelization

Coccinelle: a C code transformation engine using SmPL for matches, refactorings, and bug fixing

Advances in Semantic Patching for HPC-oriented Refactorings with Coccinelle

DuoReduce: MLIR's benchmark

Hardware-Assisted Software Testing and Debugging for Heterogeneous Computing

See all packages

* * *

high performance computing on graphics processing units: hgpu.org

Posts

A Unified Iteration Space Transformation Framework for Sparse and Dense Tensor Algebra

Pipelined Training with Stale Weights of Deep Convolutional Neural Networks

Sparse matrix partitioning for optimizing SpMV on CPU-GPU heterogeneous platforms

Towards Unified INT8 Training for Convolutional Neural Network

LLVM-based automation of memory decoupling for OpenCL applications on FPGAs

Abstractions for Programming Graphics Processors in High-Level Programming Languages

Porting tree-based hash table compression to GPGPU model checking

Automatic Performance Optimisation of Parallel Programs for GPUs via Rewrite Rules

Accelerating Molecular Docking by Parallelized Heterogeneous Computing – A Case Study of Performance, Quality of Results, and Energy-Efficiency using CPUs, GPUs, and FPGAs

Atmospheric turbulence removal using convolutional neural network

JAX, M.D.: End-to-End Differentiable, Hardware Accelerated, Molecular Dynamics in Pure Python

libmolgrid: GPU Accelerated Molecular Gridding for Deep Learning Applications

Recent source codes

PELSI: Power-Efficient Layer-Switched Inference

Ouroboros: Virtualized Queues for dynamic memory management

MSCCL++: A GPU-driven communication stack for scalable AI applications

Benchmark compute shader of Unity against InteropUnityCUDA

Data-efficient LLM Fine-tuning for Code Generation

DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.

Large Language Model Powered C-to-CUDA Code Translation: A Novel Auto-Parallelization Framework

GigaAPI: a user-space API that simplifies multi-GPU programming, bridging the gap between the capabilities of parallel GPU systems and the ability of developers to harness their full potential

Coccinelle: a C code transformation engine using SmPL for matches, refactorings, and bug fixing

DuoReduce: MLIR's benchmark

Most viewed papers (last 30 days)