high performance computing on graphics processing units: hgpu.org


CUDA Agent: Large-Scale Agentic RL for High-Performance CUDA Kernel Generation

Weinan Dai, Hanlin Wu, Qiying Yu, Huan-ang Gao, Jiahao Li, Chengquan Jiang, Weiqiang Lou, Yufan Song, Hongli Yu, Jiaze Chen, Wei-Ying Ma, Ya-Qin Zhang, Jingjing Liu, Mingxuan Wang, Xin Liu, Hao Zhou

ByteDance Seed

arXiv:2602.24286 [cs.LG], (27 Feb 2026)

DOI:10.48550/arXiv.2602.24286

@misc{dai2026cuda,
  title={CUDA Agent: Large-Scale Agentic RL for High-Performance CUDA Kernel Generation},
  author={Weinan Dai and Hanlin Wu and Qiying Yu and Huan-ang Gao and Jiahao Li and Chengquan Jiang and Weiqiang Lou and Yufan Song and Hongli Yu and Jiaze Chen and Wei-Ying Ma and Ya-Qin Zhang and Jingjing Liu and Mingxuan Wang and Xin Liu and Hao Zhou},
  year={2026},
  eprint={2602.24286},
  archivePrefix={arXiv},
  primaryClass={cs.LG},
  url={https://arxiv.org/abs/2602.24286}
}

GPU kernel optimization is fundamental to modern deep learning but remains a highly specialized task requiring deep hardware expertise. Despite strong performance in general programming, large language models (LLMs) remain uncompetitive with compiler-based systems such as torch.compile for CUDA kernel generation. Existing CUDA code generation approaches either rely on training-free refinement or fine-tune models within fixed multi-turn execution-feedback loops, but neither paradigm fundamentally improves the model's intrinsic CUDA optimization ability, so performance gains remain limited. We present CUDA Agent, a large-scale agentic reinforcement learning system that develops CUDA kernel expertise through three components: a scalable data synthesis pipeline; a skill-augmented CUDA development environment with automated verification and profiling that provides reliable reward signals; and reinforcement learning algorithmic techniques that enable stable training. CUDA Agent achieves state-of-the-art results on KernelBench, with faster rates of 100%, 100%, and 92% over torch.compile on the KernelBench Level-1, Level-2, and Level-3 splits, and outperforms the strongest proprietary models, such as Claude Opus 4.5 and Gemini 3 Pro, by about 40% on the hardest Level-3 setting.
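The headline "faster rate" and the idea of verification-plus-profiling rewards can be made concrete with a small sketch. It assumes that the faster rate is the share of benchmark tasks whose generated kernel is both numerically correct and faster than the compiler baseline, and it pairs this with a hypothetical three-level reward (-1 if verification fails, 0 if correct but not faster, +1 if faster) of the kind an automated verify-then-profile environment could emit. `KernelResult`, `allclose`, `reward`, and `faster_rate` are illustrative names, not CUDA Agent's code, and the tolerances are assumptions.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class KernelResult:
    """Hypothetical record of one generated kernel evaluated on one benchmark task."""
    correct: bool        # passed the numerical check against the reference output
    candidate_ms: float  # measured runtime of the generated kernel
    baseline_ms: float   # measured runtime of the compiler baseline


def allclose(out: Sequence[float], ref: Sequence[float],
             rtol: float = 1e-3, atol: float = 1e-5) -> bool:
    """Element-wise check |out - ref| <= atol + rtol * |ref| (tolerances are assumptions)."""
    if len(out) != len(ref):
        return False
    return all(abs(o - r) <= atol + rtol * abs(r) for o, r in zip(out, ref))


def reward(result: KernelResult) -> float:
    """Hypothetical discrete reward: -1 if verification fails, +1 if faster, else 0."""
    if not result.correct:
        return -1.0
    return 1.0 if result.candidate_ms < result.baseline_ms else 0.0


def faster_rate(results: Sequence[KernelResult]) -> float:
    """Share of tasks whose kernel is correct AND faster than the baseline (assumed definition)."""
    if not results:
        raise ValueError("need at least one task result")
    wins = sum(1 for r in results if r.correct and r.candidate_ms < r.baseline_ms)
    return wins / len(results)


if __name__ == "__main__":
    # Four made-up tasks: fast+correct, slow+correct, fast+incorrect, fast+correct.
    demo = [
        KernelResult(True, 0.8, 1.0),
        KernelResult(True, 1.2, 1.0),
        KernelResult(False, 0.5, 1.0),
        KernelResult(True, 0.9, 1.0),
    ]
    print([reward(r) for r in demo])  # [1.0, 0.0, -1.0, 1.0]
    print(faster_rate(demo))          # 0.5
```

With four tasks where two correct kernels beat the baseline, one correct kernel is slower, and one fails verification, the faster rate is 0.5. A real harness would derive `correct` by comparing device outputs against a reference implementation and would take timings from repeated, warmed-up profiling runs rather than single measurements.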

Tags: Code generation, Computer science, CUDA, Deep learning, nVidia, nVidia H20, Package

March 4, 2026 by hgpu

* * *

Recent source codes

UniCoder: Unified Visual-to-Code Generation via Symbolic Rewards and Reference-Guided Code Optimization

CuFuzz: An API-Knowledge-Graph Coverage-Driven Fuzzing Framework for CUDA Libraries

AutoPass: Evidence-Guided LLM Agents for Compiler Performance Tuning

Probe-and-Refine Tuning of Repository Guidance for AI Coding Agents

CUDAnalyst (CUDA + Analyst)

CodegenBench

KernelBenchX: A Comprehensive Benchmark for Evaluating LLM-Generated GPU Kernels

CUDA Kernel Fusion Benchmarks

IntelliKit: Agent-first tooling for AMD hardware

DITRON: Distributed Compiler based on Triton for Parallel Systems

Most viewed papers (last 30 days)

CUDA Agent: Large-Scale Agentic RL for High-Performance CUDA Kernel Generation
