high performance computing on graphics processing units: hgpu.org

Posts

Oct, 13

Performance Impact of Memory Channels on Sparse and Irregular Algorithms

Graph processing is typically considered to be a memory-bound rather than compute-bound problem. One common line of thought is that more available memory bandwidth corresponds to better graph processing performance. However, in this work we demonstrate that the key factor in the utilization of the memory system for graph algorithms is not necessarily the raw […]

CUDA

Oct, 6

Taichi: A Language for High-Performance Computation on Spatially Sparse Data Structures

3D visual computing data are often spatially sparse. To exploit such sparsity, people have developed hierarchical sparse data structures, such as multilevel sparse voxel grids, particles, and 3D hash tables. However, developing and using these high-performance sparse data structures is challenging, due to their intrinsic complexity and overhead. We propose Taichi, a new data-oriented programming […]

CUDA

Oct, 6

Verification of GPU Program Optimizations in Lean

Graphics processing units (GPUs) have become of major importance for high-performance computing due to their high throughput. To get the best possible performance, GPU programs are frequently optimized. However, every optimization carries the risk of introducing bugs. In this thesis, we present a framework for the theorem prover Lean to formally verify transformations of GPU […]

Oct, 6

waLBerla: A block-structured high-performance framework for multiphysics simulations

Programming current supercomputers efficiently is a challenging task. Multiple levels of parallelism on the core, on the compute node, and between nodes need to be exploited to make full use of the system. Heterogeneous hardware architectures with accelerators further complicate the development process. waLBerla addresses these challenges by providing the user with highly efficient building […]

CUDA

Oct, 6

Syntix: A Profiling Based Resource Estimator for CUDA Kernels

Trending applications such as AI and data analytics have mandated the use of GPUs in modern datacenters for performance reasons. Current practice is to dedicate GPUs to applications, which limits the number of concurrent users to the number of available GPUs. This use of GPUs contradicts the datacenter policy of oversubscribing resources and accommodating as […]

CUDA

Oct, 6

MIOpen: An Open Source Library For Deep Learning Primitives

Deep Learning has established itself to be a common occurrence in the business lexicon. The unprecedented success of deep learning in recent years can be attributed to: abundance of data, availability of gargantuan compute capabilities offered by GPUs, and adoption of open-source philosophy by the researchers and industry. Deep neural networks can be decomposed into […]

OpenCL

Sep, 29

Exascale Deep Learning for Scientific Inverse Problems

We introduce novel communication strategies in synchronous distributed Deep Learning consisting of decentralized gradient reduction orchestration and computational graph-aware grouping of gradient tensors. These new techniques produce an optimal overlap between computation and communication and result in near-linear scaling (0.93) of distributed training up to 27,600 NVIDIA V100 GPUs on the Summit Supercomputer. We demonstrate […]

Sep, 29

Futhark Vulkan Backend

This paper describes the effort, challenges, and limitations involved in the implementation of a Futhark compiler variant using the Vulkan API version 1.1 for compiling Futhark programs targeting GPUs. Compared to the existing OpenCL backend with the same purpose, the more modern Vulkan API could offer some performance benefits and may extend the scope of […]

OpenCL

Sep, 29

Heterogeneous Resource-Elastic Management for FPGAs: Concepts, Theory and Implementation

Despite deployment of FPGAs at the edge and cloud data centers due to their performance and energy advantage, FPGA runtime systems commonly tend to support only one-application-at-a-time and cannot adapt to dynamic workloads with reasonable response times. Therefore, this paper proposes the concepts and theory of resource elasticity for FPGA systems to allow a task […]

OpenCL

Sep, 29

Elastic deep learning in multi-tenant GPU cluster

Multi-tenant GPU clusters are now common owing to the success of deep learning, and training jobs are usually run on multiple distributed GPUs. These clusters are managed with several goals, including short job completion time (JCT), high resource utilization, and quick response to small jobs. In this paper, we show that elasticity, which is the ability […]

Sep, 29

ALBERT: A Lite BERT for Self-supervised Learning of Language Representations

Increasing model size when pretraining natural language representations often results in improved performance on downstream tasks. However, at some point further model increases become harder due to GPU/TPU memory limitations, longer training times, and unexpected model degradation. To address these problems, we present two parameter-reduction techniques to lower memory consumption and increase the training speed […]

Sep, 22

Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism

Recent work in unsupervised language modeling demonstrates that training large neural language models advances the state of the art in Natural Language Processing applications. However, for very large models, memory constraints limit the size of models that can be practically trained. Model parallelism allows us to train larger models, because the parameters can be split […]

* * *

Recent source codes

UniCoder: Unified Visual-to-Code Generation via Symbolic Rewards and Reference-Guided Code Optimization

CuFuzz: An API-Knowledge-Graph Coverage-Driven Fuzzing Framework for CUDA Libraries

AutoPass: Evidence-Guided LLM Agents for Compiler Performance Tuning

Probe-and-Refine Tuning of Repository Guidance for AI Coding Agents

CUDAnalyst (CUDA + Analyst)

CodegenBench

KernelBenchX: A Comprehensive Benchmark for Evaluating LLM-Generated GPU Kernels

CUDA Kernel Fusion Benchmarks

IntelliKit: Agent-first tooling for AMD hardware

DITRON: Distributed Compiler based on Triton for Parallel Systems
