high performance computing on graphics processing units: hgpu.org

Posts

Aug, 10

AGFT: An Adaptive GPU Frequency Tuner for Real-Time LLM Inference Optimization

The explosive growth of interactive Large Language Models (LLMs) has placed unprecedented demands for low latency on cloud GPUs, forcing them into high-power modes and causing escalating energy costs. Real-time inference workloads exhibit significant dynamic volatility, presenting substantial energy-saving opportunities. However, traditional static or rule-based power management strategies struggle to exploit these opportunities without compromising […]

Aug, 10

Understanding the Landscape of Ampere GPU Memory Errors

Graphics Processing Units (GPUs) have become a de facto solution for accelerating high-performance computing (HPC) applications. Understanding their memory error behavior is an essential step toward achieving efficient and reliable HPC systems. In this work, we present a large-scale cross-supercomputer study to characterize GPU memory reliability, covering three supercomputers – Delta, Polaris, and Perlmutter – […]

Aug, 10

ConTraPh: Contrastive Learning for Parallelization and Performance Optimization

With the advancement of HPC platforms, the demand for high-performing applications continues to grow. One effective way to enhance program performance is through parallelization. However, fully leveraging the powerful hardware of HPC platforms poses significant challenges. Even experienced developers must carefully consider factors such as runtime, memory usage, and thread-scheduling overhead. Additionally, achieving successful parallelization […]

OpenCL

Aug, 10

SIGMo: High-Throughput Batched Subgraph Isomorphism on GPUs for Molecular Matching

Subgraph isomorphism is a fundamental graph problem with applications in diverse domains from biology to social network analysis. Of particular interest is molecular matching, which uses a subgraph isomorphism formulation for the drug discovery process. While subgraph isomorphism is known to be NP-complete and computationally expensive, in the molecular matching formulation a number of domain […]

CUDA

Aug, 10

DGEMM without FP64 Arithmetic – using FP64 Emulation and FP8 Tensor Cores with Ozaki Scheme

Since AI computations require low-precision matrix multiplications, processors with enhanced performance for these operations are increasing along with the growing demand for AI computations. However, it is difficult to use these operations directly for scientific computations. The Ozaki scheme, an accurate matrix multiplication method proposed by Ozaki et al. in 2012, enables FP64 matrix multiplication […]

Aug, 3

NPUEval: Optimizing NPU Kernels with LLMs and Open Source Compilers

Neural processing units (NPUs) are gaining prominence in power-sensitive devices like client devices, with AI PCs being defined by their inclusion of these specialized processors. Running AI workloads efficiently on these devices requires libraries of optimized kernels. Creating efficient kernels demands expertise in domain-specific C++ with vector intrinsics and in-depth knowledge of the target architecture. […]

Aug, 3

GBOTuner: Autotuning of OpenMP Parallel Codes with Bayesian Optimization and Code Representation Transfer Learning

Empirical autotuning methods such as Bayesian optimization (BO) are a powerful approach that allows us to optimize tuning parameters of parallel codes as black-boxes. However, BO is an expensive approach because it relies on empirical samples from true evaluations for varying parameter configurations. In this thesis, we present GBOTuner, an autotuning framework for optimizing the […]

Aug, 3

Geak: Introducing Triton Kernel AI Agent & Evaluation Benchmarks

The demand for AI-generated GPU kernels is rapidly growing, influenced by the need for scalable, hardware-optimized solutions in both industry and academia. As deep learning workloads grow in complexity and diversity, it is imperative to automate low-level kernel development to meet performance and productivity demands. Major cloud providers, semiconductor companies, and research institutions are now […]

Aug, 3

OpenDwarfs 2025: Modernizing the OpenDwarfs Benchmark Suite for Heterogeneous Computing

As the era of heterogeneous computing evolves, benchmarking tools are vital for measuring performance across diverse architectures. We present OpenDwarfs 2025, a reengineered and modernized version of the OpenDwarfs benchmark suite, originally developed to evaluate the performance of heterogeneous systems using OpenCL. Our comprehensive reengineering process involved addressing compatibility issues with modern compilers, resolving bugs, […]

OpenCL

Aug, 3

Performance Portable Gradient Computations Using Source Transformation

Derivative computation is a key component of optimization, sensitivity analysis, uncertainty quantification, and nonlinear solvers. Automatic differentiation (AD) is a powerful technique for evaluating such derivatives, and in recent years, has been integrated into programming environments such as Jax, PyTorch, and TensorFlow to support derivative computations needed for training of machine learning models, resulting in […]

CUDA

Jul, 20

Kevin: Multi-Turn RL for Generating CUDA Kernels

Writing GPU kernels is a challenging task and critical for AI systems’ efficiency. It is also highly iterative: domain experts write code and improve performance through execution feedback. Moreover, it presents verifiable rewards like correctness and speedup, making it a natural environment to apply Reinforcement Learning (RL). To explicitly incorporate the iterative nature of this […]

CUDA

Jul, 20

Specx: a C++ task-based runtime system for heterogeneous distributed architectures

Parallelization is needed everywhere, from laptops and mobile phones to supercomputers. Among parallel programming models, task-based programming has demonstrated a powerful potential and is widely used in high-performance scientific computing. Not only does it allow efficient parallelization across distributed heterogeneous computing nodes, but it also allows for elegant source code structuring by describing hardware-independent algorithms. […]

CUDA

high performance computing on graphics processing units: hgpu.org

Posts

AGFT: An Adaptive GPU Frequency Tuner for Real-Time LLM Inference Optimization

Understanding the Landscape of Ampere GPU Memory Errors

ConTraPh: Contrastive Learning for Parallelization and Performance Optimization

SIGMo: High-Throughput Batched Subgraph Isomorphism on GPUs for Molecular Matching

DGEMM without FP64 Arithmetic – using FP64 Emulation and FP8 Tensor Cores with Ozaki Scheme

NPUEval: Optimizing NPU Kernels with LLMs and Open Source Compilers

GBOTuner: Autotuning of OpenMP Parallel Codes with Bayesian Optimization and Code Representation Transfer Learning

Geak: Introducing Triton Kernel AI Agent & Evaluation Benchmarks

OpenDwarfs 2025: Modernizing the OpenDwarfs Benchmark Suite for Heterogeneous Computing

Performance Portable Gradient Computations Using Source Transformation

Kevin: Multi-Turn RL for Generating CUDA Kernels

Specx: a C++ task-based runtime system for heterogeneous distributed architectures

Recent source codes

OpScanner

Atlas CLI: Machine Learning (ML) Lifecycle & Transparency Manager

transformers_tvm: Implementation of Encoder Decoder transformer on TVM

INT v.s. FP: A framework to compare low-bit integer and float-point formats

AutoDock-GPU: AutoDock for GPUs and other accelerators

NCCLX: collective communication framework

Tutoring LLM into a Better CUDA Optimizer

Adaptivity in AdaptiveCpp: Optimizing Performance by Leveraging Runtime Information During JIT-Compilation

Kernel Library for LLM Serving

Neptune: Advanced ML Operator Fusion for Locality and Parallelism on GPUs

Most viewed papers (last 30 days)