The AI CUDA Engineer: Agentic CUDA Kernel Discovery, Optimization and Composition
Sakana AI, 2025
@article{lange2025cuda,
  title={The AI CUDA Engineer: Agentic CUDA Kernel Discovery, Optimization and Composition},
  author={Lange, Robert Tjarko and Prasad, Aaditya and Sun, Qi and Faldor, Maxence and Tang, Yujin and Ha, David},
  year={2025}
}
Recent advances in Large Language Models have driven large-scale deployment, resulting in ever-growing inference time and energy demand. While manual optimization of low-level code implementations is feasible, it is an arduous task that requires deep expertise to balance the complex interplay of algorithmic, software, and hardware bottlenecks. This report presents the first comprehensive agentic framework for fully automatic CUDA kernel discovery and optimization, enabling frontier large language models to translate torch code to CUDA kernels and then iteratively improve their runtime. We introduce The AI CUDA Engineer, which acts in sequential stages. First, it translates raw PyTorch code into equivalent CUDA kernels. Next, it optimizes their runtime performance using a novel evolutionary meta-generation procedure tailored towards the CUDA ecosystem. Finally, it uses an innovation archive of discovered 'stepping stone' kernels to improve future performance on new tasks. The AI CUDA Engineer can produce CUDA kernels that exceed the performance of torch-native and compiled kernels. Out of the 250 tasks tested, The AI CUDA Engineer successfully optimizes 186 tasks to a median speedup of 1.52x. For operations such as fused 3D convolutions or Diagonal Matrix Multiplication, we show runtime improvements ≥50x over their torch implementations. Alongside this report, we release the best discovered kernels, an accompanying dataset of all discovered kernels, and an interactive webpage for exploration of the results.
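To give intuition for the Diagonal Matrix Multiplication speedup the abstract mentions, here is a minimal NumPy sketch (an illustration of the general fusion idea, not code from the paper's released kernels): multiplying `diag(d) @ B` naively materializes an N×N matrix and performs a full O(N³) matmul, while the fused form is just a broadcasted row-wise scaling in O(N²).

```python
import numpy as np

# Illustrative example (assumed, not from the paper): diagonal matmul fusion.
N = 1024
d = np.random.rand(N)      # diagonal entries of an N x N diagonal matrix
B = np.random.rand(N, N)

# Naive: build the dense diagonal matrix, then run a full matrix multiply.
naive = np.diag(d) @ B

# Fused: scale row i of B by d[i] via broadcasting -- no N x N temporary,
# no cubic-cost matmul. This is the kind of rewrite a fused kernel exploits.
fused = d[:, None] * B

assert np.allclose(naive, fused)
```

The same principle applies on the GPU: a handwritten or discovered CUDA kernel that performs the scaling directly avoids both the extra memory traffic and the redundant multiply-adds of the generic matmul path, which is where order-of-magnitude speedups over an unfused torch expression can come from.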
February 24, 2025 by hgpu