Kevin: Multi-Turn RL for Generating CUDA Kernels
Stanford University, Cognition AI
arXiv:2507.11948 [cs.LG], 16 Jul 2025
@misc{baronio2025kevinmultiturnrlgenerating,
      title={Kevin: Multi-Turn RL for Generating CUDA Kernels},
      author={Carlo Baronio and Pietro Marsella and Ben Pan and Simon Guo and Silas Alberti},
      year={2025},
      eprint={2507.11948},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2507.11948}
}
Writing GPU kernels is a challenging task and critical for AI systems’ efficiency. It is also highly iterative: domain experts write code and improve performance through execution feedback. Moreover, it presents verifiable rewards like correctness and speedup, making it a natural environment in which to apply Reinforcement Learning (RL). To explicitly incorporate the iterative nature of this process into training, we develop a flexible multi-turn RL recipe that addresses unique challenges encountered in real-world settings, such as learning from long trajectories and effective reward attribution across turns. We present Kevin – K(ernel D)evin, the first model trained with multi-turn RL for CUDA kernel generation and optimization. In our evaluation setup, Kevin shows significant gains over its base model (QwQ-32B), improving the correctness of generated kernels (in pure CUDA) from 56% to 82% and the mean speedup over the baseline (PyTorch Eager) from 0.53x to 1.10x, surpassing frontier models like o4-mini (0.78x). Finally, we study its behavior across test-time scaling axes: we find that scaling serial refinement is more beneficial than parallel sampling. In particular, when given more refinement turns, Kevin shows a higher rate of improvement.
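To make the setup described in the abstract concrete, below is a minimal sketch (not the authors' implementation) of a serial-refinement loop with a verifiable reward: a kernel scores zero if incorrect, otherwise its speedup over a PyTorch Eager baseline, and each turn's execution feedback is fed back into the next generation. The `llm.generate` call and the `harness` function are hypothetical placeholders, not APIs from the paper.

```python
import torch

def eager_baseline_time(model_fn, inputs, iters=100):
    # Time the reference PyTorch Eager implementation (the speedup baseline).
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    torch.cuda.synchronize()
    start.record()
    for _ in range(iters):
        model_fn(*inputs)
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters  # milliseconds per call

def kernel_reward(correct, kernel_ms, baseline_ms):
    # Verifiable reward: zero for incorrect kernels, otherwise the measured
    # speedup over the baseline (1.0 = parity, >1.0 = faster than Eager).
    if not correct:
        return 0.0
    return baseline_ms / kernel_ms

def serial_refinement(llm, task_prompt, harness, num_turns=4):
    # Multi-turn (serial) refinement: each turn appends execution feedback
    # (compile errors, correctness, measured speedup) to the next prompt.
    trajectory, feedback = [], ""
    for turn in range(num_turns):
        kernel_src = llm.generate(task_prompt + feedback)            # hypothetical model API
        correct, kernel_ms, baseline_ms, log = harness(kernel_src)   # hypothetical eval harness
        reward = kernel_reward(correct, kernel_ms, baseline_ms)
        trajectory.append((kernel_src, reward))
        feedback = f"\n\n# Turn {turn} feedback:\n{log}\nreward={reward:.2f}\n"
    # Per-turn rewards along this trajectory are what multi-turn credit
    # assignment must attribute during RL training.
    return trajectory
```

At test time, the scaling comparison in the abstract amounts to spending a fixed generation budget either on more turns of a loop like this (serial refinement) or on more independent one-shot samples (parallel sampling) and keeping the best-scoring kernel.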
July 20, 2025 by hgpu