high performance computing on graphics processing units: hgpu.org

Posts

Nov, 7

Source-to-Source Automatic Differentiation of OpenMP Parallel Loops

This paper presents our work toward correct and efficient automatic differentiation of OpenMP parallel worksharing loops in forward and reverse mode. Automatic differentiation is a method to obtain gradients of numerical programs, which are crucial in optimization, uncertainty quantification, and machine learning. The computational cost to compute gradients is a common bottleneck in practice. For […]

Oct, 31

Improving Performance and Energy Efficiency of GPUs through Locality Analysis

The massive parallelism provided by general-purpose GPUs (GPGPUs) possessing numerous compute threads in their streaming multiprocessors (SMs) and enormous memory bandwidths have made them the de-facto accelerator of choice in many scientific domains. To support the complex memory access patterns of applications, GPGPUs have a multi-level memory hierarchy consisting of a huge register file and […]

CUDA

Oct, 31

Colossal-AI: A Unified Deep Learning System For Large-Scale Parallel Training

The Transformer architecture has improved the performance of deep learning models in domains such as Computer Vision and Natural Language Processing. Together with better performance come larger model sizes. This imposes challenges to the memory wall of the current accelerator hardware such as GPU. It is never ideal to train large models such as Vision […]

CUDA

Oct, 31

Mixed precision in Graphics Processing Unit

Modern graphics computing units (GPUs) are designed and optimized to perform highly parallel numerical calculations. This parallelism has enabled (and promises) significant advantages, both in terms of energy performance and calculation. In this document, we take stock of the different applications of mixed precision. We recall the standards currently used in the overwhelming majority of […]

Oct, 31

TorchAudio: Building Blocks for Audio and Speech Processing

This document describes version 0.10 of torchaudio: building blocks for machine learning applications in the audio and speech processing domain. The objective of torchaudio is to accelerate the development and deployment of machine learning applications for researchers and engineers by providing off-the-shelf building blocks. The building blocks are designed to be GPU-compatible, automatically differentiable, and […]

Oct, 31

Bolt: Bridging the Gap between Auto-tuners and Hardware-native Performance

Today’s auto-tuners (e.g., AutoTVM, Ansor) generate efficient tensor programs by navigating a large search space to identify effective implementations, but they do so with opaque hardware details. Thus, their performance could fall behind that of hardware-native libraries (e.g., cuBLAS, cuDNN), which are hand-optimized by device vendors to extract high performance. On the other hand, these […]

CUDA

Oct, 24

The CoRa Tensor Compiler: Compilation for Ragged Tensors with Minimal Padding

There is often variation in the shape and size of input data used for deep learning. In many cases, such data can be represented using tensors with non-uniform shapes, or ragged tensors. Due to limited and non-portable support for efficient execution on ragged tensors, current deep learning frameworks generally use techniques such as padding and […]

CUDA

Oct, 24

Principal Kernel Analysis: A Tractable Methodology to Simulate Scaled GPU Workloads

Simulating all threads in a scaled GPU workload results in prohibitive simulation cost. Cycle-level simulation is orders of magnitude slower than native silicon, the only solution is to reduce the amount of work simulated while accurately representing the program. Existing solutions to simulate GPU programs either scale the input size, simulate the first several billion […]

CUDA

Oct, 24

Monitoring Collective Communication Among GPUs

Communication among devices in multi-GPU systems plays an important role in terms of performance and scalability. In order to optimize an application, programmers need to know the type and amount of the communication happening among GPUs. Although there are prior works to gather this information in MPI applications on distributed systems and multi-threaded applications on […]

CUDA

Oct, 24

OMB-Py: Python Micro-Benchmarks for Evaluating Performance of MPI Libraries on HPC Systems

Python has become a dominant programming language for emerging areas like Machine Learning (ML), Deep Learning (DL), and Data Science (DS). An attractive feature of Python is that it provides easy-to-use programming interface while allowing library developers to enhance performance of their applications by harnessing the computing power offered by High Performance Computing (HPC) platforms. […]

CUDA

Oct, 24

Least Squares on GPUs in Multiple Double Precision

This paper describes the application of the code generated by the CAMPARY software to accelerate the solving of linear systems in the least squares sense on Graphics Processing Units (GPUs), in double double, quad double, and octo double precision. The goal is to use accelerators to offset the cost overhead caused by multiple double precision […]

CUDA

Oct, 17

Homomorphic-Encrypted Volume Rendering

Computationally demanding tasks are typically calculated in dedicated data centers, and real-time visualizations also follow this trend. Some rendering tasks, however, require the highest level of confidentiality so that no other party, besides the owner, can read or see the sensitive data. Here we present a direct volume rendering approach that performs volume rendering directly […]

gpu_tracker: Context manager and CLI that tracks the computational-resource-usage of a code block or shell command, particularly the GPU usage

gpu_tracker: Python package for tracking and profiling GPU utilization in both desktop and high-performance computing environments

* * *

high performance computing on graphics processing units: hgpu.org

Posts

Source-to-Source Automatic Differentiation of OpenMP Parallel Loops

Improving Performance and Energy Efficiency of GPUs through Locality Analysis

Colossal-AI: A Unified Deep Learning System For Large-Scale Parallel Training

Mixed precision in Graphics Processing Unit

TorchAudio: Building Blocks for Audio and Speech Processing

Bolt: Bridging the Gap between Auto-tuners and Hardware-native Performance

The CoRa Tensor Compiler: Compilation for Ragged Tensors with Minimal Padding

Principal Kernel Analysis: A Tractable Methodology to Simulate Scaled GPU Workloads

Monitoring Collective Communication Among GPUs

OMB-Py: Python Micro-Benchmarks for Evaluating Performance of MPI Libraries on HPC Systems

Least Squares on GPUs in Multiple Double Precision

Homomorphic-Encrypted Volume Rendering

Recent source codes

QArray

Celerity: High-level C++ for Accelerator Clusters

CIFAR-10 Airbench: 94% on CIFAR-10 in 3.29 second

gpu_tracker: Context manager and CLI that tracks the computational-resource-usage of a code block or shell command, particularly the GPU usage

LOOPer: a polyhedral compiler for expressing fast and portable data parallel algorithms

OpenMC Monte Carlo Code

Polygeist: C/C++ frontend for MLIR

Parallel Gaussian process with kernel approximation in CUDA

Optical flow algorithms for SYCL

OpenMP5-Offload-OpenMC-Intel-PVC

Most viewed papers (last 30 days)