KernelSkill: A Multi-Agent Framework for GPU Kernel Optimization
School of Computer Science and Engineering, Beihang University, China
arXiv:2603.10085 [cs.LG] (10 Mar 2026)
@misc{sun2026kernelskill,
  title={KernelSkill: A Multi-Agent Framework for GPU Kernel Optimization},
  author={Qitong Sun and Jun Han and Tianlin Li and Zhe Tang and Sheng Chen and Fei Yang and Aishan Liu and Xianglong Liu and Yang Liu},
  year={2026},
  eprint={2603.10085},
  archivePrefix={arXiv},
  primaryClass={cs.LG},
  url={https://arxiv.org/abs/2603.10085}
}
Improving GPU kernel efficiency is crucial for advancing AI systems. Recent work has explored leveraging large language models (LLMs) for GPU kernel generation and optimization. However, existing LLM-based kernel optimization pipelines typically rely on opaque, implicitly learned heuristics within the LLMs to determine optimization strategies. This leads to inefficient trial-and-error and weakly interpretable optimizations. Our key insight is to replace implicit heuristics with expert optimization skills that are knowledge-driven and aware of task trajectories. Specifically, we present KernelSkill, a multi-agent framework with a dual-level memory architecture. KernelSkill operates by coordinating agents with long-term memory of reusable expert skills and short-term memory to prevent repetitive backtracking. On KernelBench Levels 1-3, KernelSkill achieves a 100% success rate and average speedups of 5.44x, 2.82x, and 1.92x over Torch Eager on Levels 1, 2, and 3, respectively, outperforming prior baselines. Code is available.
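The dual-level memory design described in the abstract can be illustrated with a minimal sketch. All names here (`SkillMemory`, `TrajectoryMemory`, `optimize`) and the example skills are hypothetical illustrations, not the paper's actual API: they only show how a long-term store of reusable skills might be coordinated with a short-term record of attempts to avoid repetitive backtracking.

```python
# Hypothetical sketch of a dual-level memory optimization loop.
# None of these class or skill names come from the paper; they illustrate
# the idea of pairing long-term reusable skills with short-term
# trajectory tracking.

class SkillMemory:
    """Long-term memory: reusable expert optimization skills."""
    def __init__(self):
        self.skills = {
            "coalesce_loads": "reorder global loads for coalesced access",
            "tile_shared_mem": "tile the computation through shared memory",
            "unroll_inner": "unroll the innermost loop",
        }

    def retrieve(self, tried):
        # Return skills not yet tried on this kernel.
        return [s for s in self.skills if s not in tried]


class TrajectoryMemory:
    """Short-term memory: records attempts to avoid repeating them."""
    def __init__(self):
        self.tried = set()
        self.best = (None, 1.0)  # (skill, speedup over baseline)

    def record(self, skill, speedup):
        self.tried.add(skill)
        if speedup > self.best[1]:
            self.best = (skill, speedup)


def optimize(kernel, measure, rounds=3):
    """Coordinate both memories: pick an untried skill, measure, record."""
    long_term, short_term = SkillMemory(), TrajectoryMemory()
    for _ in range(rounds):
        candidates = long_term.retrieve(short_term.tried)
        if not candidates:
            break
        # A real agent would rank candidates with an LLM; we take the first.
        skill = candidates[0]
        short_term.record(skill, measure(kernel, skill))
    return short_term.best
```

In this sketch the short-term memory guarantees each skill is tried at most once per kernel, while the long-term memory would persist and grow across tasks.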
March 15, 2026 by hgpu