high performance computing on graphics processing units: hgpu.org

Posts

Mar, 30

Advances in Semantic Patching for HPC-oriented Refactorings with Coccinelle

Currently, the most energy-efficient hardware platforms for floating point-intensive calculations (also known as High Performance Computing, or HPC) are graphics processing units (GPUs). However, porting existing scientific codes to GPUs can be far from trivial. This article summarizes our recent advances in enabling machine-assisted, HPC-oriented refactorings with reference to existing APIs and programming idioms available […]

CUDA

Mar, 30

PyGraph: Robust Compiler Support for CUDA Graphs in PyTorch

CUDA Graphs, a recent hardware feature introduced for NVIDIA GPUs, aim to reduce CPU launch overhead by capturing and launching a series of GPU tasks (kernels) as a DAG. However, deploying CUDA Graphs faces several challenges today due to the static structure of a graph. It also incurs performance overhead due to data […]

CUDA

Mar, 30

Efficient allocation of image recognition and LLM tasks on multi-GPU system

This work evaluates the performance of parallelizing the learning and tuning processes for image classification and large language models. For machine learning models in image recognition, various parallelization methods are developed based on different hardware and software scenarios: simple data parallelism, distributed data parallelism, and distributed processing. A detailed description […]

CUDA

Mar, 30

Hardware-Assisted Software Testing and Debugging for Heterogeneous Computing

There is a growing interest in the computer architecture community in incorporating heterogeneity and specialization to improve performance. Developers can write heterogeneous applications that consist of host code and kernel code, where compute-intensive kernels can be offloaded from the CPU to a GPU, FPGA, or quantum computer. However, the high complexity of these systems can pose challenges […]

Mar, 30

TileLink: Generating Efficient Compute-Communication Overlapping Kernels using Tile-Centric Primitives

Large deep learning models have achieved state-of-the-art performance in a wide range of tasks. These models often necessitate distributed systems for efficient training and inference. The fundamental building blocks for distributed model execution are intra-layer parallel operators. The most effective approach to enhancing the performance of intra-layer parallel operators involves overlapping computation with communication. The […]

Mar, 30

Analyzing Modern NVIDIA GPU cores

GPUs are the most popular platform for accelerating HPC workloads, such as artificial intelligence and science simulations. However, most microarchitectural research in academia relies on GPU core pipeline designs based on architectures that are more than 15 years old. This paper reverse engineers modern NVIDIA GPU cores, unveiling many key aspects of their design and […]

Mar, 23

The Shamrock code: I- Smoothed Particle Hydrodynamics on GPUs

We present Shamrock, a performance portable framework developed in C++17 with the SYCL programming standard, tailored for numerical astrophysics on Exascale architectures. The core of Shamrock is an accelerated parallel tree with negligible construction time, whose efficiency is based on binary algebra. The Smoothed Particle Hydrodynamics algorithm of the Phantom code is implemented in Shamrock. […]

CUDA

Mar, 23

Concurrent Scheduling of High-Level Parallel Programs on Multi-GPU Systems

Parallel programming models can encourage performance portability by moving the responsibility for work assignment and data distribution from the programmer to a runtime system. However, analyzing the resulting implicit memory allocations, coherence operations and their interdependencies can quickly introduce delays into the latency-sensitive execution pipeline of a distributed-memory application. In this paper, we show how […]

Mar, 23

Hercules: A Compiler for Productive Programming of Heterogeneous Systems

Modern computing systems increasingly rely on composing heterogeneous devices to improve performance and efficiency. Programming these systems is often unproductive: algorithm implementations must be coupled to system-specific logic, including device-specific optimizations, partitioning, and inter-device communication and synchronization, which requires developing different programs for different system configurations. We propose the Juno language, which represents general purpose […]

CUDA

Mar, 23

ML-Triton, A Multi-Level Compilation and Language Extension to Triton GPU Programming

In the era of LLMs, dense operations such as GEMM and MHA are critical components. These operations are well-suited for parallel execution using a tile-based approach. While traditional GPU programming often relies on low-level interfaces like CUDA or SYCL, Triton has emerged as a DSL that offers a more user-friendly and portable alternative by […]

CUDA

Mar, 23

LLMPerf: GPU Performance Modeling meets Large Language Models

Performance modeling, a pivotal domain in program cost analysis, currently relies on manually crafted models constrained by various program and hardware limitations, especially in the intricate landscape of GPGPU. Meanwhile, Large Language Models (LLMs) have demonstrated their effectiveness in addressing diverse programming challenges. Our work establishes a connection between LLMs and performance modeling, employing the […]

OpenCL

Mar, 10

Mpache: Interaction Aware Multi-level Cache Bypassing on GPUs

Graphics Processing Units (GPUs) are essential for general-purpose applications and commonly leverage multi-level caches to alleviate memory access pressure. However, the default cache management may miss opportunities for optimal performance in different applications. Although existing cache bypassing techniques attempt to address this challenge, these methods predominantly concentrate on a single cache level, thus restricting their potential […]

* * *

Recent source codes

Kernel Library for LLM Serving

Adaptivity in AdaptiveCpp: Optimizing Performance by Leveraging Runtime Information During JIT-Compilation

Neptune: Advanced ML Operator Fusion for Locality and Parallelism on GPUs

Genten: Software for Generalized Tensor Decompositions by Sandia National Laboratories

Interleaved Learning and Exploration: A Self-Adaptive Fuzz Testing Framework for MLIR

Pinocchio: PINpointing Orbit Crossing Collapsed Hierarchical Objects

KernelCoder: trained on a curated dataset of reasoning traces and CUDA kernel pairs

VibeCodeHPC - Multi Agentic Vibe Coding for HPC

Compile-Time Resource Safety for GPU APIs: A Low-Overhead Typestate Framework

exa-AMD: Exascale Accelerated Materials Discovery
