high performance computing on graphics processing units: hgpu.org

Posts

Apr, 28

A Comparative Study on Exact Triangle Counting Algorithms on the GPU

We implement exact triangle counting in graphs on the GPU using three different methodologies: subgraph matching to a triangle pattern; programmable graph analytics, with a set-intersection approach; and a matrix formulation based on sparse matrix-matrix multiplies. All three deliver best-of-class performance over CPU implementations and over comparable GPU implementations, with the graph-analytic approach achieving the […]

CUDA
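
As a rough illustration of two of the three formulations named in the excerpt above (set intersection and the matrix formulation), here is a minimal CPU-only Python sketch. It is not the paper's GPU implementation: the function names, the edge-list input format, and the dense matrix are assumptions chosen for readability, and a real matrix-based version would use sparse matrix-matrix multiplies.

```python
def count_triangles_intersection(edges):
    """Count triangles by intersecting 'forward' neighbor sets along each edge."""
    # Keep only neighbors with a larger vertex id, so that a triangle a < b < c
    # is found exactly once: at edge (a, b), with c in both forward sets.
    forward = {}
    for u, v in edges:
        if u == v:
            continue  # ignore self-loops
        lo, hi = (u, v) if u < v else (v, u)
        forward.setdefault(lo, set()).add(hi)
        forward.setdefault(hi, set())
    total = 0
    for u, nbrs in forward.items():
        for v in nbrs:
            total += len(nbrs & forward[v])
    return total


def count_triangles_matrix(edges, n):
    """Count triangles as trace(A^3) / 6 for a symmetric 0/1 adjacency matrix A."""
    A = [[0] * n for _ in range(n)]
    for u, v in edges:
        if u != v:
            A[u][v] = A[v][u] = 1
    # For symmetric A, trace(A^3) equals the sum over (i, j) of A[i][j] * (A^2)[i][j].
    closed_walks = 0
    for i in range(n):
        for j in range(n):
            if A[i][j]:
                closed_walks += sum(A[i][k] * A[k][j] for k in range(n))
    # Each triangle yields 6 closed walks of length 3 (3 start vertices, 2 directions).
    return closed_walks // 6


# Hypothetical example input: the complete graph on 4 vertices, which has 4 triangles.
k4 = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3)]
print(count_triangles_intersection(k4), count_triangles_matrix(k4, 4))
```

Both functions should agree on any simple undirected graph; the intersection version does work proportional to the neighbor-set sizes, while the dense matrix version is cubic in the number of vertices and is shown only to make the formula concrete.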

Apr, 28

A Strategy for Automatic Performance Tuning of Stencil Computations on GPUs

We propose and evaluate a novel strategy for tuning the performance of a class of stencil computations on Graphics Processing Units. The strategy uses a machine learning model to predict the optimal way to load data from memory followed by a heuristic that divides other optimizations into groups and exhaustively explores one group at a […]

OpenCL

Apr, 28

Accelerating Blockchain Search of Full Nodes Using GPUs

Blockchain is a distributed ledger system based on a P2P network and originally used for a cryptocurrency system. The P2P network of Blockchain is maintained by full nodes, which are in charge of verifying all the transactions in the network. However, most Blockchain user nodes do not act as full nodes, because the workload of full […]

CUDA

Apr, 28

Automatic generation of CUDA code performing tensor manipulations using C++ expression templates

We present a C++ library, TLoops, which uses a hierarchy of expression templates to represent operations upon tensorial quantities in single lines of C++ code that resemble analytic equations. These expressions may be run as-is, but may also be used to emit equivalent low-level C or CUDA code, which either performs the operations more quickly […]

CUDA

Apr, 25

BrainSlug: Transparent Acceleration of Deep Learning Through Depth-First Parallelism

Project page: BrainSlug: Transparent Neural Network Acceleration (http://www.brainslug.info/)

Neural network frameworks such as PyTorch and TensorFlow are the workhorses of numerous machine learning applications ranging from object recognition to machine translation. While these frameworks are versatile and straightforward to use, the training of and inference in deep neural networks is resource (energy, compute, and […]

CUDA

Apr, 25

A Survey of Techniques for Dynamic Branch Prediction

The branch predictor (BP) is an essential component in modern processors, since high BP accuracy can improve performance and reduce energy by decreasing the number of instructions executed on the wrong path. However, reducing the latency and storage overhead of the BP while maintaining high accuracy presents significant challenges. In this paper, we present a survey of dynamic branch prediction […]

Apr, 22

CANNA: Neural Network Acceleration using Configurable Approximation on GPGPU

Neural networks have been successfully used in many applications. Due to their computational complexity, it is difficult to implement them on embedded devices. Neural networks are inherently approximate and thus can be simplified. In this paper, CANNA proposes a gradual training approximation which adaptively sets the level of hardware approximation depending on the neural network’s […]

OpenCL

Apr, 22

Fast inference of deep neural networks in FPGAs for particle physics

Recent results at the Large Hadron Collider (LHC) have pointed to enhanced physics capabilities through the improvement of the real-time event processing techniques. Machine learning methods are ubiquitous and have proven to be very powerful in LHC physics, and particle physics as a whole. However, exploration of the use of such techniques in low-latency, low-power […]

Apr, 22

CytonRL: an Efficient Reinforcement Learning Open-source Toolkit Implemented in C++

This paper presents an open-source reinforcement learning toolkit named CytonRL. The toolkit implements four recent advanced deep Q-learning algorithms from scratch using C++ and NVIDIA’s GPU-accelerated libraries. The code is simple and elegant, owing to an open-source general-purpose neural network library named CytonLib. Benchmarks show that the toolkit achieves competitive performance on the popular Atari […]

CUDA

Apr, 22

mu-cuDNN: Accelerating Deep Learning Frameworks with Micro-Batching

NVIDIA cuDNN is a low-level library that provides GPU kernels frequently used in deep learning. Specifically, cuDNN implements several equivalent convolution algorithms, whose performance and memory footprint may vary considerably, depending on the layer dimensions. When an algorithm is automatically selected by cuDNN, the decision is performed on a per-layer basis, and thus it often […]

CUDA

Apr, 22

Dissecting the NVIDIA Volta GPU Architecture via Microbenchmarking

Every year, novel NVIDIA GPU designs are introduced. This rapid architectural and technological progression, coupled with a reluctance by manufacturers to disclose low-level details, makes it difficult for even the most proficient GPU software designers to remain up-to-date with the technological advances at a microarchitectural level. To address this dearth of public, microarchitectural-level information on […]

CUDA

Apr, 15

DLL: A Blazing Fast Deep Neural Network Library

Deep Learning Library (DLL) is a new library for machine learning with deep neural networks that focuses on speed. It supports feed-forward neural networks such as fully-connected Artificial Neural Networks (ANNs) and Convolutional Neural Networks (CNNs). It also has very comprehensive support for Restricted Boltzmann Machines (RBMs) and Convolutional RBMs. Our main motivation for this […]

CUDA

