
Robust LLM Training Infrastructure at ByteDance

Borui Wan, Gaohong Liu, Zuquan Song, Jun Wang, Yun Zhang, Guangming Sheng, Shuguang Wang, Houmin Wei, Chenyuan Wang, Weiqiang Lou, Xi Yang, Mofan Zhang, Kaihua Jiang, Cheng Ren, Xiaoyun Zhi, Menghan Yu, Zhe Nan, Zhuolin Zheng, Baoquan Zhong, Qinlong Wang, Huan Yu, Jinxin Chi, Wang Zhang, Yuhan Li, Zixian Du, Sida Zhao, Yongqiang Zhang, Jingzhe Tang, Zherui Liu, Chuan Wu, Yanghua Peng, Haibin Lin, Wencong Xiao, Xin Liu, Liang Xiang
ByteDance, The University of Hong Kong
arXiv:2509.16293 [cs.LG] (19 Sep 2025)

@misc{wan2025robustllmtraininginfrastructure,
   title={Robust LLM Training Infrastructure at ByteDance},
   author={Borui Wan and Gaohong Liu and Zuquan Song and Jun Wang and Yun Zhang and Guangming Sheng and Shuguang Wang and Houmin Wei and Chenyuan Wang and Weiqiang Lou and Xi Yang and Mofan Zhang and Kaihua Jiang and Cheng Ren and Xiaoyun Zhi and Menghan Yu and Zhe Nan and Zhuolin Zheng and Baoquan Zhong and Qinlong Wang and Huan Yu and Jinxin Chi and Wang Zhang and Yuhan Li and Zixian Du and Sida Zhao and Yongqiang Zhang and Jingzhe Tang and Zherui Liu and Chuan Wu and Yanghua Peng and Haibin Lin and Wencong Xiao and Xin Liu and Liang Xiang},
   year={2025},
   eprint={2509.16293},
   archivePrefix={arXiv},
   primaryClass={cs.LG},
   url={https://arxiv.org/abs/2509.16293}
}

Download (PDF) | View | Source


The training scale of large language models (LLMs) has reached tens of thousands of GPUs and is still continuously expanding, enabling faster training of larger models. Accompanying the expansion of resource scale is the prevalence of failures (CUDA errors, NaN values, job hangs, etc.), which poses significant challenges to training stability. Any large-scale LLM training infrastructure should strive for minimal training interruption, efficient fault diagnosis, and effective failure tolerance to enable highly efficient continuous training. This paper presents ByteRobust, a large-scale GPU infrastructure management system tailored for robust and stable LLM training. It exploits the uniqueness of the LLM training process and gives top priority to detecting and recovering from failures in a routine manner. Leveraging the parallelisms and characteristics of LLM training, ByteRobust enables high-capacity fault tolerance, prompt fault demarcation, and localization with an effective data-driven approach, comprehensively ensuring continuous and efficient training of LLM tasks. ByteRobust is deployed on a production GPU platform with over 200,000 GPUs and achieves a 97% effective training time ratio (ETTR) for a three-month training job on 9,600 GPUs.
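The headline ETTR figure is easiest to read as a simple ratio of productive training time to total wall-clock time. Below is a minimal sketch, assuming the conventional definition of effective training time ratio (productive time divided by total scheduled time) and a hypothetical per-interruption cost breakdown; the paper's exact accounting and the field names here are illustrative, not the authors' implementation.

    from dataclasses import dataclass

    @dataclass
    class Interruption:
        """Hypothetical record of one training interruption."""
        detection_s: float      # time to notice the fault
        diagnosis_s: float      # time to demarcate / localize it
        recovery_s: float       # time to restart and reload a checkpoint
        lost_progress_s: float  # progress since the last checkpoint, now wasted

    def effective_training_time_ratio(total_wall_clock_s: float,
                                      interruptions: list[Interruption]) -> float:
        """ETTR: fraction of wall-clock time spent making real training progress.

        Assumes the conventional definition (productive time / total time);
        the paper's exact bookkeeping may differ.
        """
        wasted = sum(i.detection_s + i.diagnosis_s + i.recovery_s + i.lost_progress_s
                     for i in interruptions)
        return (total_wall_clock_s - wasted) / total_wall_clock_s

    # Example: a three-month job that suffers 50 failures, each costing ~50 minutes
    # of detection, diagnosis, recovery, and lost progress combined.
    three_months_s = 90 * 24 * 3600
    faults = [Interruption(300, 900, 600, 1200) for _ in range(50)]
    print(f"ETTR ~ {effective_training_time_ratio(three_months_s, faults):.3f}")

With these illustrative numbers the ratio comes out around 0.98, which shows why keeping per-failure detection, demarcation, and recovery time short is what drives an ETTR in the reported 97% range at this scale.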