high performance computing on graphics processing units: hgpu.org

hgpu.org » Applications » Computer science » KernelEvolve: Scaling Agentic Kernel Coding for Heterogeneous AI Accelerators at Meta

KernelEvolve: Scaling Agentic Kernel Coding for Heterogeneous AI Accelerators at Meta

Gang Liao, Hongsen Qin, Ying Wang, Alicia Golden, Michael Kuchnik, Yavuz Yetim, Jia Jiunn Ang, Chunli Fu, Yihan He, Samuel Hsia, Zewei Jiang, Dianshi Li, Uladzimir Pashkevich, Varna Puvvada, Feng Shi, Matt Steiner, Ruichao Xiao, Nathan Yan, Xiayu Yu, Zhou Fang, Abdul Zainul-Abedin, Ketan Singh, Hongtao Yu, Wenyuan Chi, Barney Huang, Sean Zhang, Noah Weller, Zach Marine, Wyatt Cook, Carole-Jean Wu, Gaoxiang Liu

KernelEvolve Team, Meta Platforms

arXiv:2512.23236 [cs.LG], (30 Dec 2025)

DOI:10.48550/arXiv.2512.23236

@misc{liao2025kernelevolvescalingagentickernel,

title={KernelEvolve: Scaling Agentic Kernel Coding for Heterogeneous AI Accelerators at Meta},

author={Gang Liao and Hongsen Qin and Ying Wang and Alicia Golden and Michael Kuchnik and Yavuz Yetim and Jia Jiunn Ang and Chunli Fu and Yihan He and Samuel Hsia and Zewei Jiang and Dianshi Li and Uladzimir Pashkevich and Varna Puvvada and Feng Shi and Matt Steiner and Ruichao Xiao and Nathan Yan and Xiayu Yu and Zhou Fang and Abdul Zainul-Abedin and Ketan Singh and Hongtao Yu and Wenyuan Chi and Barney Huang and Sean Zhang and Noah Weller and Zach Marine and Wyatt Cook and Carole-Jean Wu and Gaoxiang Liu},

year={2025},

eprint={2512.23236},

archivePrefix={arXiv},

primaryClass={cs.LG},

url={https://arxiv.org/abs/2512.23236}

}

Download (PDF)

View

Source

1348

views

Making deep learning recommendation model (DLRM) training and inference fast and efficient is important. However, this presents three key system challenges – model architecture diversity, kernel primitive diversity, and hardware generation and architecture heterogeneity. This paper presents KernelEvolve-an agentic kernel coding framework-to tackle heterogeneity at-scale for DLRM. KernelEvolve is designed to take kernel specifications as input and automate the process of kernel generation and optimization for recommendation model across heterogeneous hardware architectures. KernelEvolve does so by operating at multiple programming abstractions, from Triton and CuTe DSL to low-level hardware agnostic languages, spanning the full hardware-software optimization stack. The kernel optimization process is described as graph-based search with selection policy, universal operator, fitness function, and termination rule, dynamically adapts to runtime execution context through retrieval-augmented prompt synthesis. We designed, implemented, and deployed KernelEvolve to optimize a wide variety of production recommendation models across generations of NVIDIA and AMD GPUs, as well as Meta’s AI accelerators. We validate KernelEvolve on the publicly-available KernelBench suite, achieving 100% pass rate on all 250 problems across three difficulty levels, and 160 PyTorch ATen operators across three heterogeneous hardware platforms, demonstrating 100% correctness. KernelEvolve reduces development time from weeks to hours and achieves substantial performance improvements over PyTorch baselines across diverse production use cases and for heterogeneous AI systems at-scale. Beyond performance efficiency improvements, KernelEvolve significantly mitigates the programmability barrier for new AI hardware by enabling automated kernel generation for in-house developed AI hardware.

Tags: AI, AMD Radeon Instinct MI300X, AMD Radeon Instinct MI350X, ATI, Computer science, CUDA, Deep learning, Heterogeneous systems, LLM, nVidia, nVidia A100, nVidia H100, PTX, ROCm, Triton

January 4, 2026 by hgpu

No votes yet.

Please wait...

Your response

You must be logged in to post a comment.

high performance computing on graphics processing units: hgpu.org

KernelEvolve: Scaling Agentic Kernel Coding for Heterogeneous AI Accelerators at Meta

Your response

Recent source codes

AutoPass: Evidence-Guided LLM Agents for Compiler Performance Tuning

Probe-and-Refine Tuning of Repository Guidance for AI Coding Agents

CUDAnalyst (CUDA + Analyst)

CodegenBench

KernelBenchX: A Comprehensive Benchmark for Evaluating LLM-Generated GPU Kernels

CUDA Kernel Fusion Benchmarks

IntelliKit: Agent-first tooling for AMD hardware

DITRON: Distributed Compiler based on Triton for Parallel Systems

CuTile Benchmark Suite: Performance and Productivity Tradeoffs for GPU Kernel Programming on Blackwell Architecture

Agentic Code Optimization via Compiler-LLM Cooperation

Most viewed papers (last 30 days)

KernelEvolve: Scaling Agentic Kernel Coding for Heterogeneous AI Accelerators at Meta

Share this:

Your response

Recent source codes

Most viewed papers (last 30 days)