high performance computing on graphics processing units: hgpu.org

Posts

Apr, 13

Kernel-Smith: A Unified Recipe for Evolutionary Kernel Optimization

We present Kernel-Smith, a framework for high-performance GPU kernel and operator generation that combines a stable evaluation-driven evolutionary agent with an evolution-oriented post-training recipe. On the agent side, Kernel-Smith maintains a population of executable candidates and iteratively improves them using an archive of top-performing and diverse programs together with structured execution feedback on compilation, correctness, […]

CUDA

Apr, 13

MegaTrain: Full Precision Training of 100B+ Parameter Large Language Models on a Single GPU

We present MegaTrain, a memory-centric system that efficiently trains 100B+ parameter large language models at full precision on a single GPU. Unlike traditional GPU-centric systems, MegaTrain stores parameters and optimizer states in host memory (CPU memory) and treats GPUs as transient compute engines. For each layer, we stream parameters in and compute gradients out, minimizing […]

CUDA

Apr, 13

DVM: Real-Time Kernel Generation for Dynamic AI Models

Dynamism is common in AI computation, e.g., the dynamic tensor shapes and the dynamic control flows in models. Due to the long compilation time, existing runtime compilation damages the model efficiency, while the offline compilers either suffer from the long compilation time and device memory footprint to cover all the possible execution instances of a […]

Apr, 13

Agentic Code Optimization via Compiler-LLM Cooperation

Generating performant executables from high level languages is critical to software performance across a wide range of domains. Modern compilers perform this task by passing code through a series of well-studied optimizations at progressively lower levels of abstraction, but may miss optimization opportunities that require high-level reasoning about a program’s purpose. Recent work has proposed […]

Apr, 13

CuTeGen: An LLM-Based Agentic Framework for Generation and Optimization of High-Performance GPU Kernels using CuTe

High-performance GPU kernels are critical to modern machine learning systems, yet developing efficient implementations remains a challenging, expert-driven process due to the tight coupling between algorithmic structure, memory hierarchy usage, and hardware-specific optimizations. Recent work has explored using large language models (LLMs) to generate GPU kernels automatically, but generated implementations often struggle to maintain correctness […]

CUDA

Mar, 26

DRTriton: Large-Scale Synthetic Data Reinforcement Learning for Triton Kernel Generation

Developing efficient CUDA kernels is a fundamental yet challenging task in the generative AI industry. Recent researches leverage Large Language Models (LLMs) to automatically convert PyTorch reference implementations to CUDA kernels, significantly reducing the engineering efforts. State-of-the-art LLMs, such as GPT-5.2 and Claude-Sonnet-4.5, still struggle in this specific task. To address this challenge, we propose […]

CUDA

Mar, 26

AutoKernel: Autonomous GPU Kernel Optimization via Iterative Agent-Driven Search

Writing high-performance GPU kernels is among the most labor-intensive tasks in machine learning systems engineering. We present AutoKernel, an open-source framework that applies an autonomous agent loop to GPU kernel optimization for arbitrary PyTorch models. Given a model, AutoKernel profiles it to identify computational bottlenecks, ranks them by Amdahl's law impact, and iteratively refines Triton […]

CUDA

Mar, 26

Mixed-precision numerics in scientific applications: survey and perspectives

The explosive demand for artificial intelligence (AI) workloads has led to a significant increase in silicon area dedicated to lower-precision computations on recent high-performance computing hardware designs. However, mixed-precision capabilities, which can achieve performance improvements of up to 8x compared to double-precision in extreme compute-intensive workloads, remain largely untapped in most scientific applications. A growing […]

Mar, 26

High-level Programming of Vulkan-based GPUs Through OpenMP

Modern applications often involve complex, structured or data-parallel computations on large datasets. Traditionally, GPUs have served as the primary accelerators for such tasks, mostly through compute-focused models like CUDA and OpenCL. Vulkan is a more recent cross-platform API, widely adopted for both high-performance graphics and compute. These models require lower-level programming, as developers have to […]

CUDA

•

OpenCL

Mar, 22

LLMQ: Efficient Lower-Precision LLM Training for Consumer GPUs

We present LLMQ, an end-to-end CUDA/C++ implementation for medium-sized language-model training, e.g. 3B to 32B parameters, on affordable, commodity GPUs. These devices are characterized by low memory availability and slow communication compared to datacentre-grade GPUs. Consequently, we showcase a range of optimizations that target these bottlenecks, including activation checkpointing, offloading, and copy-engine based collectives. LLMQ […]

CUDA

Mar, 22

Architecture-Aware LLM Inference Optimization on AMD Instinct GPUs: A Comprehensive Benchmark and Deployment Study

We present a cross-architecture evaluation of production LLM inference on AMD Instinct MI325X GPUs, benchmarking four models spanning 235B to 1 trillion parameters across three architectural families (MoE+MLA, Dense+GQA, MoE+GQA) on an 8-GPU cluster with 2TB aggregate HBM3e using vLLM v0.14.1. Our results demonstrate that architecture-aware optimization is essential: MLA models require block size 1 […]

Mar, 22

SOL-ExecBench: Speed-of-Light Benchmarking for Real-World GPU Kernels Against Hardware Limits

As agentic AI systems become increasingly capable of generating and optimizing GPU kernels, progress is constrained by benchmarks that reward speedup over software baselines rather than proximity to hardware-efficient execution. We present SOL-ExecBench, a benchmark of 235 CUDA kernel optimization problems extracted from 124 production and emerging AI models spanning language, diffusion, vision, audio, video, […]

CUDA

* * *

high performance computing on graphics processing units: hgpu.org

Posts

Kernel-Smith: A Unified Recipe for Evolutionary Kernel Optimization

MegaTrain: Full Precision Training of 100B+ Parameter Large Language Models on a Single GPU

DVM: Real-Time Kernel Generation for Dynamic AI Models

Agentic Code Optimization via Compiler-LLM Cooperation

CuTeGen: An LLM-Based Agentic Framework for Generation and Optimization of High-Performance GPU Kernels using CuTe

DRTriton: Large-Scale Synthetic Data Reinforcement Learning for Triton Kernel Generation

AutoKernel: Autonomous GPU Kernel Optimization via Iterative Agent-Driven Search

Mixed-precision numerics in scientific applications: survey and perspectives

High-level Programming of Vulkan-based GPUs Through OpenMP

LLMQ: Efficient Lower-Precision LLM Training for Consumer GPUs

Architecture-Aware LLM Inference Optimization on AMD Instinct GPUs: A Comprehensive Benchmark and Deployment Study

SOL-ExecBench: Speed-of-Light Benchmarking for Real-World GPU Kernels Against Hardware Limits

Recent source codes

UniCoder: Unified Visual-to-Code Generation via Symbolic Rewards and Reference-Guided Code Optimization

CuFuzz: An API-Knowledge-Graph Coverage-Driven Fuzzing Framework for CUDA Libraries

AutoPass: Evidence-Guided LLM Agents for Compiler Performance Tuning

Probe-and-Refine Tuning of Repository Guidance for AI Coding Agents

CUDAnalyst (CUDA + Analyst)

CodegenBench

KernelBenchX: A Comprehensive Benchmark for Evaluating LLM-Generated GPU Kernels

CUDA Kernel Fusion Benchmarks

IntelliKit: Agent-first tooling for AMD hardware

DITRON: Distributed Compiler based on Triton for Parallel Systems

Most viewed papers (last 30 days)