Posts
Aug, 3
GBOTuner: Autotuning of OpenMP Parallel Codes with Bayesian Optimization and Code Representation Transfer Learning
Empirical autotuning methods such as Bayesian optimization (BO) offer a powerful way to optimize the tuning parameters of parallel codes as black boxes. However, BO is expensive because it relies on empirical samples obtained from true evaluations of varying parameter configurations. In this thesis, we present GBOTuner, an autotuning framework for optimizing the […]
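GBOTuner's internals are beyond this excerpt, but the black-box setting it describes can be sketched generically: a BO loop proposes OpenMP runtime settings, runs the program, and feeds the measured runtime back as the objective. The binary name ./stencil, the parameter ranges, and the use of scikit-optimize below are illustrative assumptions, not part of GBOTuner.

```python
# Minimal sketch of black-box tuning of OpenMP runtime parameters with
# Bayesian optimization (scikit-optimize). The binary "./stencil" and the
# parameter ranges are placeholders, not GBOTuner's actual search space.
import os
import subprocess
import time

from skopt import gp_minimize
from skopt.space import Categorical, Integer

space = [
    Integer(1, 64, name="num_threads"),
    Categorical(["static", "dynamic", "guided"], name="schedule"),
]

def objective(params):
    num_threads, schedule = params
    env = dict(os.environ, OMP_NUM_THREADS=str(num_threads), OMP_SCHEDULE=schedule)
    start = time.perf_counter()
    subprocess.run(["./stencil"], env=env, check=True)  # true (empirical) evaluation
    return time.perf_counter() - start                  # BO minimizes wall-clock time

result = gp_minimize(objective, space, n_calls=20, random_state=0)
print("best configuration:", result.x, "runtime:", result.fun)
```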
Aug, 3
Geak: Introducing Triton Kernel AI Agent & Evaluation Benchmarks
The demand for AI-generated GPU kernels is rapidly growing, driven by the need for scalable, hardware-optimized solutions in both industry and academia. As deep learning workloads grow in complexity and diversity, it is imperative to automate low-level kernel development to meet performance and productivity demands. Major cloud providers, semiconductor companies, and research institutions are now […]
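For readers unfamiliar with the target language, the sketch below is an ordinary hand-written Triton vector-add kernel of the kind such agents are asked to generate and evaluate; it is a generic tutorial-style example, not output of Geak or part of its benchmarks.

```python
# A plain Triton vector-add kernel, illustrating what "Triton kernels" are.
import torch
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    pid = tl.program_id(axis=0)                      # one program per block of elements
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements                      # guard the tail block
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)

def add(x, y):
    out = torch.empty_like(x)
    n = out.numel()
    grid = lambda meta: (triton.cdiv(n, meta["BLOCK_SIZE"]),)
    add_kernel[grid](x, y, out, n, BLOCK_SIZE=1024)
    return out
```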
Aug, 3
OpenDwarfs 2025: Modernizing the OpenDwarfs Benchmark Suite for Heterogeneous Computing
As the era of heterogeneous computing evolves, benchmarking tools are vital for measuring performance across diverse architectures. We present OpenDwarfs 2025, a reengineered and modernized version of the OpenDwarfs benchmark suite, originally developed to evaluate the performance of heterogeneous systems using OpenCL. Our comprehensive reengineering process involved addressing compatibility issues with modern compilers, resolving bugs, […]
Aug, 3
Performance Portable Gradient Computations Using Source Transformation
Derivative computation is a key component of optimization, sensitivity analysis, uncertainty quantification, and nonlinear solvers. Automatic differentiation (AD) is a powerful technique for evaluating such derivatives and, in recent years, has been integrated into programming environments such as JAX, PyTorch, and TensorFlow to support the derivative computations needed to train machine learning models, resulting in […]
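The excerpt mentions AD as integrated into JAX, PyTorch, and TensorFlow; the minimal JAX example below shows the user-facing idea (the loss function is illustrative, not from the paper). Note that JAX obtains derivatives by tracing the program, whereas the title refers to a source-transformation approach.

```python
# Reverse-mode AD as exposed by JAX: differentiate a scalar loss w.r.t. its
# first argument. The linear model and data here are purely illustrative.
import jax
import jax.numpy as jnp

def loss(w, x, y):
    pred = x @ w                       # simple linear model
    return jnp.mean((pred - y) ** 2)   # mean squared error

grad_loss = jax.grad(loss)             # gradient w.r.t. w (the first argument)

w = jnp.ones(3)
x = jnp.arange(12.0).reshape(4, 3)
y = jnp.ones(4)
print(grad_loss(w, x, y))              # gradient array with shape (3,)
```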
Jul, 20
Kevin: Multi-Turn RL for Generating CUDA Kernels
Writing GPU kernels is a challenging task that is critical to the efficiency of AI systems. It is also highly iterative: domain experts write code and improve performance through execution feedback. Moreover, it offers verifiable rewards such as correctness and speedup, making it a natural environment for applying Reinforcement Learning (RL). To explicitly incorporate the iterative nature of this […]
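Kevin's exact reward shaping is not given in this excerpt; the sketch below shows one plausible form of the "verifiable rewards" it mentions: zero unless a candidate kernel matches a reference implementation, otherwise the measured speedup. The helper names and tolerance are assumptions.

```python
# Hedged sketch of a verifiable reward for a generated kernel: 0 for crashes
# or wrong results, otherwise speedup over the reference implementation.
import torch

def time_cuda(fn, inputs, iters=20):
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        fn(*inputs)
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters   # milliseconds per call

def kernel_reward(candidate_fn, reference_fn, inputs, atol=1e-4):
    try:
        out = candidate_fn(*inputs)
    except Exception:
        return 0.0                            # crashes earn no reward
    if not torch.allclose(out, reference_fn(*inputs), atol=atol):
        return 0.0                            # incorrect results earn no reward
    return time_cuda(reference_fn, inputs) / time_cuda(candidate_fn, inputs)
```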
Jul, 20
Specx: a C++ task-based runtime system for heterogeneous distributed architectures
Parallelization is needed everywhere, from laptops and mobile phones to supercomputers. Among parallel programming models, task-based programming has demonstrated powerful potential and is widely used in high-performance scientific computing. Not only does it allow efficient parallelization across distributed heterogeneous computing nodes, but it also enables elegant source code structuring by describing hardware-independent algorithms. […]
Jul, 20
Pre-Training LLMs on a budget: A comparison of three optimizers
Optimizers play a decisive role in reducing pre-training times for LLMs and in achieving better-performing models. In this study, we compare three major variants: the de facto standard AdamW, the simpler Lion, developed through an evolutionary search, and the second-order optimizer Sophia. For better generalization, we train with two different base architectures and use a single- and […]
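As a hint of why Lion counts as "simpler", the sketch below writes out a single Lion parameter update (the sign of an interpolated momentum plus decoupled weight decay, with one momentum buffer as the only state) next to the stock AdamW constructor. The hyperparameter values are illustrative defaults, not the study's settings, and Sophia's second-order update is omitted.

```python
# One Lion update step, following the published update rule; contrast with
# AdamW, which keeps two running moments per parameter and rescales by them.
import torch

def lion_step(param, grad, momentum, lr=1e-4, beta1=0.9, beta2=0.99, weight_decay=0.1):
    update = (beta1 * momentum + (1 - beta1) * grad).sign()     # sign of interpolated momentum
    param.add_(update + weight_decay * param, alpha=-lr)        # decoupled weight decay
    momentum.mul_(beta2).add_(grad, alpha=1 - beta2)            # single state buffer
    return param, momentum

# AdamW equivalent in PyTorch (illustrative hyperparameters):
# torch.optim.AdamW(model.parameters(), lr=1e-4, betas=(0.9, 0.95), weight_decay=0.1)
```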
Jul, 20
Dissecting the NVIDIA Blackwell Architecture with Microbenchmarks
The rapid pace of scientific research creates a growing need for compute power, a need that GPUs partly address. This paper presents a microarchitectural analysis of the modern NVIDIA Blackwell architecture, studying GPU performance features with carefully designed microbenchmarks. We unveil key subsystems, including the memory hierarchy, SM execution pipelines, and the SM […]
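The paper's microbenchmarks are not reproduced in this excerpt; as a flavor of the genre, the toy below times a large device-to-device copy with CUDA events to estimate global-memory bandwidth. It is a generic illustration, not the authors' methodology.

```python
# Toy memory-bandwidth microbenchmark: time a 1 GiB device-to-device copy
# with CUDA events and report GB/s (counting read + write traffic).
import torch

def measure_copy_bandwidth(n_bytes=1 << 30, iters=10):
    src = torch.empty(n_bytes, dtype=torch.uint8, device="cuda")
    dst = torch.empty_like(src)
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        dst.copy_(src)
    end.record()
    torch.cuda.synchronize()
    seconds = start.elapsed_time(end) / 1e3
    return 2 * n_bytes * iters / seconds / 1e9

print(f"{measure_copy_bandwidth():.1f} GB/s")
```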
Jul, 20
Using Deep Reinforcement Learning for Automatic Code Optimization in the MLIR Compiler
This work focuses on the use of deep reinforcement learning (DRL) to automate code optimization within modern compiler infrastructures. Code optimization is a critical step in program transformation that aims to improve performance and reduce resource consumption while preserving correctness. Traditional approaches to code optimization rely on manual or heuristic-based methods, which are often time-consuming […]
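The excerpt does not describe the paper's state encoding or action space; the skeleton below only illustrates the general shape of such a setup as a Gymnasium-style environment in which actions choose transformation passes and the reward is relative speedup. The pass names, the 16-dimensional observation, and the apply_pass/benchmark helpers are hypothetical placeholders.

```python
# Skeleton of a Gymnasium-style environment for compiler pass ordering.
# `apply_pass` and `benchmark` are hypothetical callables supplied by the
# user; the zero observation stands in for an unspecified state encoding.
import gymnasium as gym
import numpy as np

class PassOrderingEnv(gym.Env):
    PASSES = ["loop-tile", "loop-unroll", "vectorize", "stop"]

    def __init__(self, program, benchmark, apply_pass):
        self.program, self.benchmark, self.apply_pass = program, benchmark, apply_pass
        self.action_space = gym.spaces.Discrete(len(self.PASSES))
        self.observation_space = gym.spaces.Box(-np.inf, np.inf, shape=(16,), dtype=np.float32)

    def reset(self, *, seed=None, options=None):
        super().reset(seed=seed)
        self.current = self.program
        self.baseline = self.benchmark(self.current)
        return np.zeros(16, dtype=np.float32), {}

    def step(self, action):
        if self.PASSES[action] == "stop":
            return np.zeros(16, dtype=np.float32), 0.0, True, False, {}
        self.current = self.apply_pass(self.current, self.PASSES[action])
        reward = self.baseline / self.benchmark(self.current) - 1.0   # relative speedup
        return np.zeros(16, dtype=np.float32), reward, False, False, {}
```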
Jul, 13
Mutual-Supervised Learning for Sequential-to-Parallel Code Translation
The rise of GPU-based high-performance computing (HPC) has driven the widespread adoption of parallel programming models such as CUDA. Yet the inherent complexity of parallel programming creates demand for automated sequential-to-parallel translation approaches. However, data scarcity poses a significant challenge for machine learning-based sequential-to-parallel code translation. Although recent back-translation methods show promise, they still […]
Jul, 13
KIS-S: A GPU-Aware Kubernetes Inference Simulator with RL-Based Auto-Scaling
Autoscaling GPU inference workloads in Kubernetes remains challenging due to the reactive and threshold-based nature of default mechanisms such as the Horizontal Pod Autoscaler (HPA), which struggle under dynamic and bursty traffic patterns and lack integration with GPU-level metrics. We present KIS-S, a unified framework that combines KISim, a GPU-aware Kubernetes Inference Simulator, with KIScaler, […]
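For context on the "reactive and threshold-based" baseline, the default HPA scaling rule documented by Kubernetes reduces to a single ratio, sketched below; KIScaler's RL-based policy is beyond this excerpt.

```python
# The documented HPA rule: desiredReplicas =
#   ceil(currentReplicas * currentMetricValue / targetMetricValue)
import math

def hpa_desired_replicas(current_replicas, current_metric, target_metric):
    return math.ceil(current_replicas * current_metric / target_metric)

# e.g. 4 pods observing 90% average utilization against a 60% target -> 6 pods
print(hpa_desired_replicas(4, 90, 60))
```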
Jul, 13
Demystifying NCCL: An In-depth Analysis of GPU Communication Protocols and Algorithms
The NVIDIA Collective Communication Library (NCCL) is a critical software layer enabling high-performance collectives on large-scale GPU clusters. Despite being open source with a documented API, its internal design remains largely opaque. The orchestration of communication channels, selection of protocols, and handling of memory movement across devices and nodes are not well understood, making it […]
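As a reminder of the API surface beneath which NCCL makes these decisions, the snippet below issues a single all-reduce through torch.distributed's NCCL backend (assumed to be launched with torchrun so the rendezvous environment variables are set); the channel, algorithm, and protocol selection the paper analyzes happens inside this one call.

```python
# Minimal NCCL all-reduce via torch.distributed; run under torchrun. NCCL's
# internal ring/tree and LL/LL128/Simple choices happen beneath all_reduce.
import os
import torch
import torch.distributed as dist

dist.init_process_group(backend="nccl")
torch.cuda.set_device(int(os.environ.get("LOCAL_RANK", 0)))
rank = dist.get_rank()

x = torch.ones(1 << 20, device="cuda") * rank
dist.all_reduce(x, op=dist.ReduceOp.SUM)
print(f"rank {rank}: sum of ranks = {x[0].item():.0f}")

dist.destroy_process_group()
```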