MusaCoder: Native GPU Kernel Generation with Full-Stack Training on Moore Threads GPU

Kun Cheng, Songshuo Lu, Sicong Liao, Tankun Li, Yafei Zhang, Dong Yang, Qiheng Lv, Hua Wang, Zhi Chen, Yaohua Tang

Moore Threads AI

arXiv:2606.04847 [cs.CV], (3 Jun 2026)

DOI:10.48550/arXiv.2606.04847

@misc{cheng2026musacoder,
  title={MusaCoder: Native GPU Kernel Generation with Full-Stack Training on Moore Threads GPU},
  author={Kun Cheng and Songshuo Lu and Sicong Liao and Tankun Li and Yafei Zhang and Dong Yang and Qiheng Lv and Hua Wang and Zhi Chen and Yaohua Tang},
  year={2026},
  eprint={2606.04847},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2606.04847}
}

Native GPU kernel generation turns high-level tensor programs into executable, efficient low-level code. Existing Large Language Models (LLMs) struggle with this task, while execution-based reinforcement learning suffers from sparse rewards, reward hacking, and training instability. We present MusaCoder, a full-stack training framework for native GPU kernel generation on CUDA and MUSA backends. MusaCoder combines progressive kernel-oriented data synthesis, diversity-preserving rejection fine-tuning, and execution-feedback Reinforcement Learning (RL) through MooreEval, a distributed verifier and reward environment. To stabilize RL, MusaCoder introduces PrimeEcho for first-turn-anchored multi-turn rewards, Buffered Dynamic Retry for recovering signals from all-failed hard samples, and MirrorPop for off-policy sequence filtering. Experiments on KernelBench and a MUSA-ported variant show that MusaCoder outperforms strong open-source and proprietary baselines in both correctness and empirical speedup, with the 9B model matching or exceeding frontier closed-source models and the 27B model establishing a new state of the art. These results demonstrate not only the effectiveness of full-stack execution-feedback training for native kernel generation, but also the capability of Moore Threads GPUs to support the complete LLM post-training stack, providing a practical foundation for large-model training and optimization on emerging accelerators.
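
To make the idea of an execution-feedback reward concrete, here is a minimal sketch of a reward that gates on correctness and then adds a bounded bonus for measured speedup. It is an illustration only, not MooreEval or the reward used by MusaCoder: the function name `execution_reward`, the tolerance, the repeat count, and the reward shape (0.0 on failure, otherwise 1.0 plus a capped speedup bonus) are all assumptions, and plain Python callables stand in for compiled GPU kernels.

```python
import math
import time
from typing import Callable, Sequence

Kernel = Callable[[Sequence[float]], Sequence[float]]


def execution_reward(
    candidate: Kernel,
    reference: Kernel,
    inputs: Sequence[Sequence[float]],
    tol: float = 1e-6,      # assumed numeric tolerance
    repeats: int = 5,       # assumed number of timing repetitions
    max_speedup: float = 10.0,  # assumed cap on the speedup bonus
) -> float:
    """Hypothetical reward: 0.0 if the candidate crashes or is wrong,
    otherwise 1.0 plus a bonus in (0, 1] that grows with speedup."""
    # Correctness gate: any exception or mismatch yields zero reward.
    for x in inputs:
        try:
            got = candidate(x)
        except Exception:
            return 0.0
        want = reference(x)
        if len(got) != len(want):
            return 0.0
        for g, w in zip(got, want):
            if not math.isclose(g, w, rel_tol=tol, abs_tol=tol):
                return 0.0

    def best_time(fn: Kernel) -> float:
        # Best-of-N wall-clock time over all inputs, to reduce timing noise.
        best = float("inf")
        for _ in range(repeats):
            start = time.perf_counter()
            for x in inputs:
                fn(x)
            best = min(best, time.perf_counter() - start)
        return best

    speedup = best_time(reference) / max(best_time(candidate), 1e-12)
    return 1.0 + min(speedup, max_speedup) / max_speedup


if __name__ == "__main__":
    data = [[float(i) for i in range(100)] for _ in range(10)]
    ref = lambda xs: [v * 2.0 for v in xs]
    good = lambda xs: [v + v for v in xs]
    bad = lambda xs: [v * 3.0 for v in xs]
    print(execution_reward(good, ref, data))  # some value between 1.0 and 2.0
    print(execution_reward(bad, ref, data))   # 0.0, fails the correctness gate
```

A reward like this is sparse for hard tasks, since every incorrect candidate scores zero. That sparsity is the kind of difficulty the abstract attributes to execution-based reinforcement learning.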

Tags: Computer science, CUDA, LLM, PyTorch

June 8, 2026 by hgpu
