high performance computing on graphics processing units: hgpu.org

Posts

Mar, 19

Towards a Benchmarking Suite for Kernel Tuners

As computing system become more complex, it is becoming harder for programmers to keep their codes optimized as the hardware gets updated. Autotuners try to alleviate this by hiding as many architecture-based optimization details as possible from the user, so that the code can be used efficiently across different generations of systems. In this article […]

CUDA

Mar, 12

ARK: GPU-driven Code Execution for Distributed Deep Learning

Modern state-of-the-art deep learning (DL) applications tend to scale out to a large number of parallel GPUs. Unfortunately, we observe that the collective communication overhead across GPUs is often the key limiting factor of performance for distributed DL. It under-utilizes the networking bandwidth by frequent transfers of small data chunks, which also incurs a substantial […]

Mar, 12

BenchDirect: A Directed Language Model for Compiler Benchmarks

The exponential increase of hardware-software complexity has made it impossible for compiler engineers to find the right optimization heuristics manually. Predictive models have been shown to find near optimal heuristics with little human effort but they are limited by a severe lack of diverse benchmarks to train on. Generative AI has been used by researchers […]

OpenCL

Mar, 12

A Deep Learning Model for Loop Interchange

Loop interchange is an important code optimization that improves data locality and extracts parallelism. While previous research in compilers has tried to automate the selection of which loops to interchange, existing methods have an important limitation. They use less precise machine models. This is mainly because developing a model to predict whether to interchange two […]

Mar, 12

Bridging Control-Centric and Data-Centric Optimization

With the rise of specialized hardware and new programming languages, code optimization has shifted its focus towards promoting data locality. Most production-grade compilers adopt a control-centric mindset – instruction-driven optimization augmented with scalar-based dataflow – whereas other approaches provide domain-specific and general purpose data movement minimization, which can miss important control-flow optimizations. As the two […]

Mar, 12

Runtime Support for Performance Portability on Heterogeneous Distributed Platforms

Hardware heterogeneity is here to stay for high-performance computing. Large-scale systems are currently equipped with multiple GPU accelerators per compute node and are expected to incorporate more specialized hardware. This shift in the computing ecosystem offers many opportunities for performance improvement; however, it also increases the complexity of programming for such architectures. This work introduces […]

CUDA

Mar, 5

Harmonic CUDA: Asynchronous Programming on GPUs

We introduce Harmonic CUDA, a dataflow programming model for GPUs that allows programmers to describe algorithms as a dependency graph of producers and consumers where data flows continuously through the graph for the duration of the kernel. This makes it easier for programmers to exploit asynchrony, warp specialization, and hardware acceleration. Using Harmonic CUDA, we […]

CUDA

Mar, 5

EvoTorch: Scalable Evolutionary Computation in Python

Evolutionary computation is an important component within various fields such as artificial intelligence research, reinforcement learning, robotics, industrial automation and/or optimization, engineering design, etc. Considering the increasing computational demands and the dimensionalities of modern optimization problems, the requirement for scalable, re-usable, and practical evolutionary algorithm implementations has been growing. To address this requirement, we present […]

CUDA

Mar, 5

DeepAxe: A Framework for Exploration of Approximation and Reliability Trade-offs in DNN Accelerators

While the role of Deep Neural Networks (DNNs) in a wide range of safety-critical applications is expanding, emerging DNNs experience massive growth in terms of computation power. It raises the necessity of improving the reliability of DNN accelerators yet reducing the computational burden on the hardware platforms, i.e. reducing the energy consumption and execution time […]

Mar, 5

RTIndeX: Exploiting Hardware-Accelerated GPU Raytracing for Database Indexing

Data management on GPUs has become increasingly relevant due to a tremendous rise in processing power and available GPU memory. Just like in the CPU world, there is a need for performant GPU-resident index structures to speed up query processing. Unfortunately, mapping indexes efficiently to the highly parallel and hard-to-program hardware is challenging and often […]

CUDA

Mar, 5

Interconnect Bandwidth Heterogeneity on AMD MI250x and Infinity Fabric

Demand for low-latency and high-bandwidth data transfer between GPUs has driven the development of multi-GPU nodes. Physical constraints on the manufacture and integration of such systems has yielded heterogeneous intra-node interconnects, where not all devices are connected equally. The next generation of supercomputing platforms are expected to feature AMD CPUs and GPUs. This work characterizes […]

Feb, 26

Porting numerical integration codes from CUDA to oneAPI: a case study

We present our experience in porting optimized CUDA implementations to oneAPI. We focus on the use case of numerical integration, particularly the CUDA implementations of PAGANI and m-Cubes. We faced several challenges that caused performance degradation in the oneAPI ports. These include differences in utilized registers per thread, compiler optimizations, and mappings of CUDA library […]

CUDA

gpu_tracker: Context manager and CLI that tracks the computational-resource-usage of a code block or shell command, particularly the GPU usage

gpu_tracker: Python package for tracking and profiling GPU utilization in both desktop and high-performance computing environments

high performance computing on graphics processing units: hgpu.org

Posts

Towards a Benchmarking Suite for Kernel Tuners

ARK: GPU-driven Code Execution for Distributed Deep Learning

BenchDirect: A Directed Language Model for Compiler Benchmarks

A Deep Learning Model for Loop Interchange

Bridging Control-Centric and Data-Centric Optimization

Runtime Support for Performance Portability on Heterogeneous Distributed Platforms

Harmonic CUDA: Asynchronous Programming on GPUs

EvoTorch: Scalable Evolutionary Computation in Python

DeepAxe: A Framework for Exploration of Approximation and Reliability Trade-offs in DNN Accelerators

RTIndeX: Exploiting Hardware-Accelerated GPU Raytracing for Database Indexing

Interconnect Bandwidth Heterogeneity on AMD MI250x and Infinity Fabric

Porting numerical integration codes from CUDA to oneAPI: a case study

Recent source codes

SimSYCL: Synchronous, single-threaded, library-only SYCL implementation for debugging and verification

GPU plugin for PySCF

QArray

Celerity: High-level C++ for Accelerator Clusters

gpu_tracker: Context manager and CLI that tracks the computational-resource-usage of a code block or shell command, particularly the GPU usage

CIFAR-10 Airbench: 94% on CIFAR-10 in 3.29 second

LOOPer: a polyhedral compiler for expressing fast and portable data parallel algorithms

OpenMC Monte Carlo Code

Polygeist: C/C++ frontend for MLIR

Parallel Gaussian process with kernel approximation in CUDA

Most viewed papers (last 30 days)