Posts
Nov 9
Characterizing the Performance of Parallel Data-Compression Algorithms across Compilers and GPUs
Different compilers can generate code with notably different performance characteristics – even on the same system. Today, GPU developers have three popular options for compiling CUDA or HIP code for GPUs. First, CUDA code can be compiled by either NVCC or Clang for NVIDIA GPUs. Alternatively, AMD’s recently introduced HIP platform makes porting from CUDA […]
Nov 9
RDMA Point-to-Point Communication for LLM Systems
Emerging Large Language Model (LLM) system patterns, such as disaggregated inference, Mixture-of-Experts (MoE) routing, and asynchronous reinforcement fine-tuning, require flexible point-to-point communication beyond simple collectives. Existing implementations are locked to specific Network Interface Controllers (NICs), hindering integration into inference engines and portability across hardware providers. We present TransferEngine, which bridges the functionality of common NICs […]
Nov 9
AMD MI300X GPU Performance Analysis
The rapid growth of large language models (LLMs) has driven the need for high-performance, scalable GPU hardware capable of efficiently serving models with hundreds of billions of parameters. While NVIDIA GPUs have traditionally dominated LLM deployments due to their mature CUDA software stack and state-of-the-art accelerators, AMD’s latest MI300X GPUs offer a compelling alternative, […]
Nov 2
Scalable GPU-Based Integrity Verification for Large Machine Learning Models
We present a security framework that strengthens distributed machine learning by standardizing integrity protections across CPU and GPU platforms and significantly reducing verification overheads. Our approach co-locates integrity verification directly with large ML model execution on GPU accelerators, resolving the fundamental mismatch between how large ML workloads typically run (primarily on GPUs) and how security […]
Nov 2
Serve Programs, Not Prompts
Current large language model (LLM) serving systems, primarily designed for text completion, are neither efficient nor adaptable for increasingly complex LLM applications due to their inflexible design. We propose a new LLM serving system architecture that serves programs instead of prompts to address this problem. These programs, called LLM Inference Programs (LIPs), allow users to […]
Nov 2
Enhancing Transformer Performance and Portability through Auto-tuning Frameworks
Transformer-based models such as BERT and GPT-2 have become the foundation of many modern applications, yet their execution requires substantial computational and memory resources. To address these challenges, recent advances in compiler technology and hardware accelerators have introduced new opportunities for performance portability. In this work, we evaluate JAX and TVM as high-level frameworks that […]
Nov 2
A Study of Floating-Point Precision Tuning in Deep Learning Operators Implementations
Deep learning (DL) has already played a significant role in numerous fields, making it crucial to ensure the stability of both training and inference in DL systems. The computation of DL models can be viewed as the execution of a series of DL operators, which are essential components that perform the core numerical computations. Therefore, […]
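As a minimal, self-contained illustration of why operator-level precision matters (a toy example, not taken from the paper): the same softmax operator is stable or unstable in float16 depending on whether the standard max-shift rewrite is applied.

```python
import numpy as np

def softmax_naive(x):
    # Direct definition: exp(x) / sum(exp(x)).
    e = np.exp(x)
    return e / e.sum()

def softmax_stable(x):
    # Shift-invariant form: subtracting the max avoids overflow.
    e = np.exp(x - x.max())
    return e / e.sum()

# exp(22) overflows float16 (max ~65504), so the naive form yields inf/inf = nan.
x16 = np.array([20, 21, 22], dtype=np.float16)
print(softmax_naive(x16))   # contains nan
print(softmax_stable(x16))  # well-defined probabilities
```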
Nov 2
INT vs FP: A Comprehensive Study of Fine-Grained Low-bit Quantization Formats
Modern AI hardware, such as Nvidia’s Blackwell architecture, is increasingly embracing low-precision floating-point (FP) formats to handle the pervasive activation outliers in Large Language Models (LLMs). Despite this industry trend, a unified comparison of FP and integer (INT) quantization across varying granularities has been missing, leaving algorithm and hardware co-design without clear guidance. This paper […]
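A rough, hypothetical sketch of the trade-off at stake (not the paper's methodology): with one activation outlier, a symmetric INT4 quantizer spends its whole range on the outlier, while an FP4-style E2M1 value grid keeps some resolution near zero.

```python
import numpy as np

def quant_int4(x):
    # Per-tensor symmetric INT4: map [-max|x|, max|x|] onto integers [-7, 7].
    scale = np.abs(x).max() / 7.0
    return np.clip(np.round(x / scale), -7, 7) * scale

def quant_fp4(x):
    # FP4-like E2M1: snap magnitudes to the representable grid {0, 0.5, ..., 6}.
    grid = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])
    scale = np.abs(x).max() / 6.0  # largest value maps to 6
    mags = np.abs(x) / scale
    snapped = grid[np.argmin(np.abs(mags[:, None] - grid[None, :]), axis=1)]
    return np.sign(x) * snapped * scale

x = np.array([0.01, -0.02, 0.05, 0.03, -0.04, 1.0])  # one outlier
err_int = np.abs(quant_int4(x) - x).mean()
err_fp = np.abs(quant_fp4(x) - x).mean()
print(f"mean abs error  INT4: {err_int:.4f}  FP4: {err_fp:.4f}")
```

Here INT4 rounds every small value to zero, while the FP4 grid preserves one of them; the paper's point is that such comparisons only become meaningful once granularity is controlled for.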
Oct 26
Collective Communication for 100k+ GPUs
The increasing scale of large language models (LLMs) necessitates highly efficient collective communication frameworks, particularly as training workloads extend to hundreds of thousands of GPUs. Traditional communication methods face significant throughput and latency limitations at this scale, hindering both the development and deployment of state-of-the-art models. This paper presents the NCCLX collective communication framework, developed […]
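For readers new to collectives, the bandwidth-optimal ring all-reduce that libraries like NCCL build on can be simulated in a few lines (a single-process sketch for intuition, not NCCLX's actual implementation):

```python
import numpy as np

def ring_allreduce(data):
    """data[r] is rank r's local vector; returns each rank's reduced copy."""
    n = len(data)
    chunks = [list(np.array_split(np.asarray(d, float), n)) for d in data]
    # Phase 1, reduce-scatter: in step s, rank r sends chunk (r - s) mod n to
    # rank r+1. After n-1 steps rank r holds the fully reduced chunk (r+1) mod n.
    for s in range(n - 1):
        for r in range(n):
            c = (r - s) % n
            chunks[(r + 1) % n][c] = chunks[(r + 1) % n][c] + chunks[r][c]
    # Phase 2, all-gather: circulate each finished chunk once around the ring.
    for s in range(n - 1):
        for r in range(n):
            c = (r + 1 - s) % n
            chunks[(r + 1) % n][c] = chunks[r][c]
    return [np.concatenate(ck) for ck in chunks]

ranks = [np.arange(6) * (r + 1) for r in range(4)]  # 4 "GPUs", 6 elements each
out = ring_allreduce(ranks)
print(out[0])  # every rank ends with the elementwise sum
```

Each rank sends only 2(n-1)/n of the data regardless of ring size, which is why the pattern scales; the engineering challenge the post describes is keeping that property at 100k+ endpoints.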
Oct 26
A Compute Graph Simulation and Implementation Framework Targeting AMD Versal AI Engines
We present a framework for developing compute graph-based applications targeting the AI Engine (AIE) array of AMD Versal SoCs. This framework enables users to embed AIE-based dataflow graph prototypes directly within existing C++ applications and automatically transform them into deployable AIE graph projects. It thereby eliminates the need to manually separate host and accelerator codebases, […]
Oct 26
STARK: Strategic Team of Agents for Refining Kernels
The efficiency of GPU kernels is central to the progress of modern AI, yet optimizing them remains a difficult and labor-intensive task due to complex interactions between memory hierarchies, thread scheduling, and hardware-specific characteristics. While recent advances in large language models (LLMs) provide new opportunities for automated code generation, existing approaches largely treat LLMs as […]
Oct 26
Architecting Tensor Core-Based Reductions for Irregular Molecular Docking Kernels
Tensor Cores (TCs) are specialized hardware units designed for efficient matrix multiplication and are widely utilized in deep learning workloads. However, their adoption in more irregular high-performance computing (HPC) applications remains limited. This paper presents a methodology for effectively integrating TCs into a representative HPC application: molecular docking with AutoDock-GPU. The irregular computational patterns and […]
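The core trick this line of work relies on can be shown in miniature: a reduction is recast as a matrix multiply, the one operation Tensor Cores accelerate (a toy NumPy sketch, not the paper's kernels; on hardware the `@` would be an MMA instruction):

```python
import numpy as np

# 3 segments of 4 values each; summing each segment is a reduction.
vals = np.arange(12, dtype=np.float32).reshape(3, 4)
ones = np.ones((4, 1), dtype=np.float32)

# Multiplying by a column of ones reduces every segment in one matmul,
# replacing 3 serial accumulation loops with a single matrix operation.
sums = vals @ ones
print(sums.ravel())  # segment sums: 6, 22, 38
```

The hard part the paper addresses is that irregular workloads like docking do not arrive in such tidy rectangular segments, so data must be packed to fit the MMA shape.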

