high performance computing on graphics processing units: hgpu.org

Posts

Oct, 6

Verification of GPU Program Optimizations in Lean

Graphics processing units (GPUs) have become of major importance for high-performance computing due to their high throughput. To get the best possible performance, GPU programs are frequently optimized. However, every optimization carries the risk of introducing bugs. In this thesis, we present a framework for the theorem prover Lean to formally verify transformations of GPU […]

Oct, 6

waLBerla: A block-structured high-performance framework for multiphysics simulations

Programming current supercomputers efficiently is a challenging task. Multiple levels of parallelism on the core, on the compute node, and between nodes need to be exploited to make full use of the system. Heterogeneous hardware architectures with accelerators further complicate the development process. waLBerla addresses these challenges by providing the user with highly efficient building […]

CUDA

Oct, 6

Syntix: A Profiling Based Resource Estimator for CUDA Kernels

Trending applications such as AI and data analytics have mandated the use of GPUs in modern datacenters for performance reasons. Current practice is to dedicate GPUs to applications, which limits the number of concurrent users to the number of available GPUs. That use of GPUs contradicts the policy of datacenters to oversubscribe resources and accommodate as […]

CUDA

Oct, 6

MIOpen: An Open Source Library For Deep Learning Primitives

Deep Learning has established itself to be a common occurrence in the business lexicon. The unprecedented success of deep learning in recent years can be attributed to: abundance of data, availability of gargantuan compute capabilities offered by GPUs, and adoption of open-source philosophy by the researchers and industry. Deep neural networks can be decomposed into […]

OpenCL

Sep, 29

Exascale Deep Learning for Scientific Inverse Problems

We introduce novel communication strategies in synchronous distributed Deep Learning consisting of decentralized gradient reduction orchestration and computational graph-aware grouping of gradient tensors. These new techniques produce an optimal overlap between computation and communication and result in near-linear scaling (0.93) of distributed training up to 27,600 NVIDIA V100 GPUs on the Summit Supercomputer. We demonstrate […]

Sep, 29

Futhark Vulkan Backend

This paper describes the effort, challenges, and limitations involved in the implementation of a Futhark compiler variant using the Vulkan API version 1.1 for compiling Futhark programs targeting GPUs. Compared to the existing OpenCL backend with the same purpose, the more modern Vulkan API could offer some performance benefits and may extend the scope of […]

OpenCL

Sep, 29

Heterogeneous Resource-Elastic Management for FPGAs: Concepts, Theory and Implementation

Despite deployment of FPGAs at the edge and cloud data centers due to their performance and energy advantage, FPGA runtime systems commonly tend to support only one-application-at-a-time and cannot adapt to dynamic workloads with reasonable response times. Therefore, this paper proposes the concepts and theory of resource elasticity for FPGA systems to allow a task […]

OpenCL

Sep, 29

Elastic deep learning in multi-tenant GPU cluster

Multi-tenant GPU clusters are common nowadays due to the huge success of deep learning, and training jobs are usually conducted with multiple distributed GPUs. These GPU clusters are managed with various goals, including short job completion time (JCT), high resource utilization, and quick response to small jobs. In this paper, we show that elasticity, which is the ability […]

Sep, 29

ALBERT: A Lite BERT for Self-supervised Learning of Language Representations

Increasing model size when pretraining natural language representations often results in improved performance on downstream tasks. However, at some point further model increases become harder due to GPU/TPU memory limitations, longer training times, and unexpected model degradation. To address these problems, we present two parameter-reduction techniques to lower memory consumption and increase the training speed […]

Sep, 22

Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism

Recent work in unsupervised language modeling demonstrates that training large neural language models advances the state of the art in Natural Language Processing applications. However, for very large models, memory constraints limit the size of models that can be practically trained. Model parallelism allows us to train larger models, because the parameters can be split […]

Sep, 22

Performance and Power Evaluation of AI Accelerators for Training Deep Learning Models

Deep neural networks (DNNs) have become widely used in many AI applications. Yet training a DNN requires a huge amount of computation, and it takes a lot of time and energy to train a satisfactory model. Nowadays, many-core AI accelerators (e.g., GPUs and TPUs) play a key role in training DNNs. However, different many-core processors from […]

Sep, 22

Model-Based Warp-Level Tiling for Image Processing Programs on GPUs

The efficient execution of image processing pipelines on GPUs is an area of active research. The state of the art involves 1) dividing portions of an image into overlapped tiles, where each tile can be processed by a single thread block, and 2) fusing loops together to improve memory locality. However, the state of the art has two limitations: 1) synchronization […]

CUDA
