title={Robust LLM Training Infrastructure at ByteDance},
author={Borui Wan and Gaohong Liu and Zuquan Song and Jun Wang and Yun Zhang and Guangming Sheng and Shuguang Wang and Houmin Wei and Chenyuan Wang and Weiqiang Lou and Xi Yang and Mofan Zhang and Kaihua Jiang and Cheng Ren and Xiaoyun Zhi and Menghan Yu and Zhe Nan and Zhuolin Zheng and Baoquan Zhong and Qinlong Wang and Huan Yu and Jinxin Chi and Wang Zhang and Yuhan Li and Zixian Du and Sida Zhao and Yongqiang Zhang and Jingzhe Tang and Zherui Liu and Chuan Wu and Yanghua Peng and Haibin Lin and Wencong Xiao and Xin Liu and Liang Xiang},
The training scale of large language models (LLMs) has reached tens of thousands of GPUs and is still continuously expanding, enabling faster learning of larger models. Accompanying the expansion of the resource scale is the prevalence of failures (CUDA errors, NaN values, job hangs, etc.), which poses significant challenges to training stability. Any large-scale LLM training infrastructure should strive for minimal training interruption, efficient fault diagnosis, and effective failure tolerance to enable highly efficient continuous training. This paper presents ByteRobust, a large-scale GPU infrastructure management system tailored for robust and stable training of LLMs. It exploits the uniqueness of the LLM training process and gives top priority to detecting and recovering from failures in a routine manner. Leveraging the parallelisms and characteristics of LLM training, ByteRobust enables high-capacity fault tolerance and prompt fault demarcation and localization with an effective data-driven approach, comprehensively ensuring continuous and efficient training of LLM tasks. ByteRobust is deployed on a production GPU platform with over 200,000 GPUs and achieves 97% ETTR for a three-month training job on 9,600 GPUs.