high performance computing on graphics processing units: hgpu.org

Posts

Jan, 5

Pipelined Training with Stale Weights of Deep Convolutional Neural Networks

The growth in the complexity of Convolutional Neural Networks (CNNs) is increasing interest in partitioning a network across multiple accelerators during training and pipelining the backpropagation computations over the accelerators. Existing approaches avoid or limit the use of stale weights through techniques such as micro-batching or weight stashing. These techniques either underutilize accelerators or […]

CUDA

Jan, 5

Sparse matrix partitioning for optimizing SpMV on CPU-GPU heterogeneous platforms

The sparse matrix–vector multiplication (SpMV) kernel dominates the computing cost in numerous applications. Most existing studies dedicated to improving this kernel have targeted just one type of processing unit, mainly multicore CPUs or graphics processing units (GPUs), and have not explored the potential of the recent, rapidly emerging CPU-GPU heterogeneous platforms. To take […]

CUDA

Jan, 5

LLVM-based automation of memory decoupling for OpenCL applications on FPGAs

The availability of OpenCL High-Level Synthesis (OpenCL-HLS) has made FPGAs an attractive platform for power-efficient high-performance execution of massively parallel applications. At the same time, new design challenges emerge for massive thread-level parallelism on FPGAs. One major execution bottleneck is the high number of memory stalls exposed to the data-path, which overshadows the benefits of data-path […]

OpenCL

Jan, 5

Towards Unified INT8 Training for Convolutional Neural Network

Recently, low-bit (e.g., 8-bit) network quantization has been extensively studied to accelerate inference. Besides inference, low-bit training with quantized gradients can bring even more considerable acceleration, since the backward process is often computation-intensive. Unfortunately, inappropriate quantization of backward propagation usually makes training unstable and can even cause it to crash. There is still no successful unified low-bit […]

Dec, 29

Abstractions for Programming Graphics Processors in High-Level Programming Languages

Software development has long been based on hardware that grows exponentially faster, which has allowed application complexity to increase accordingly. This free lunch is over, however, and traditional CPUs (Central Processing Units) don’t double their performance every couple of years anymore. As a result, compute-intensive applications have increasingly been relying on hardware accelerators like GPUs […]

CUDA

Dec, 29

Porting tree-based hash table compression to GPGPU model checking

To reduce the costs of faulty software, methods to improve software quality are very popular nowadays. One of these methods is model checking: verifying the functional correctness of the model of a hardware or software system. The model implies a state space, which consists of all possible states of the system and all possible transitions […]

CUDA

Dec, 29

Automatic Performance Optimisation of Parallel Programs for GPUs via Rewrite Rules

Graphics Processing Units (GPUs) are now commonplace in computing systems and are the most successful parallel accelerators. Their performance is orders of magnitude higher than that of traditional Central Processing Units (CPUs), making them attractive for many application domains with high computational demands. However, achieving their full performance potential is extremely hard, even for experienced programmers, as […]

OpenCL

Dec, 29

Accelerating Molecular Docking by Parallelized Heterogeneous Computing – A Case Study of Performance, Quality of Results, and Energy-Efficiency using CPUs, GPUs, and FPGAs

Molecular Docking (MD) is a key tool in computer-aided drug design that aims to predict the binding pose between a small molecule and a macromolecular target. At its core, MD calculates the strength of possible binding poses, and searches for the energetically-stronger ones among those generated during simulation. Automatic Docking (AutoDock) is a widely-used MD […]

OpenCL

Dec, 29

Atmospheric turbulence removal using convolutional neural network

This paper describes a novel deep learning-based method for mitigating the effects of atmospheric distortion. We have built an end-to-end supervised convolutional neural network (CNN) to reconstruct turbulence-corrupted video sequences. Our framework has been developed on the residual learning concept, where the spatio-temporal distortions are learnt and predicted. Our experiments demonstrate that the proposed method […]

Dec, 15

JAX, M.D.: End-to-End Differentiable, Hardware Accelerated, Molecular Dynamics in Pure Python

A large fraction of computational science involves simulating the dynamics of particles that interact via pairwise or many-body interactions. These simulations, called Molecular Dynamics (MD), span a vast range of subjects from physics and materials science to biochemistry and drug discovery. Most MD software involves significant use of handwritten derivatives and code reuse across C++, […]

CUDA

Dec, 15

libmolgrid: GPU Accelerated Molecular Gridding for Deep Learning Applications

There are many ways to represent a molecule as input to a machine learning model and each is associated with loss and retention of certain kinds of information. In the interest of preserving three-dimensional spatial information, including bond angles and torsions, we have developed libmolgrid, a general-purpose library for representing three-dimensional molecules using multidimensional arrays. […]

CUDA

Dec, 15

Array Languages Make Neural Networks Fast

Modern machine learning frameworks are complex: they are typically organised in multiple layers, each of which is written in a different language, and they depend on a number of external libraries, but at their core they mainly consist of tensor operations. As array-oriented languages provide perfect abstractions to implement tensor operations, we consider a minimalistic […]

CUDA