title={Robust LLM Training Infrastructure at ByteDance},
author={Borui Wan and Gaohong Liu and Zuquan Song and Jun Wang and Yun Zhang and Guangming Sheng and Shuguang Wang and Houmin Wei and Chenyuan Wang and Weiqiang Lou and Xi Yang and Mofan Zhang and Kaihua Jiang and Cheng Ren and Xiaoyun Zhi and Menghan Yu and Zhe Nan and Zhuolin Zheng and Baoquan Zhong and Qinlong Wang and Huan Yu and Jinxin Chi and Wang Zhang and Yuhan Li and Zixian Du and Sida Zhao and Yongqiang Zhang and Jingzhe Tang and Zherui Liu and Chuan Wu and Yanghua Peng and Haibin Lin and Wencong Xiao and Xin Liu and Liang Xiang},
The training scale of large language models (LLMs) has reached tens of thousands of GPUs and is still continuously expanding, enabling faster learning of larger models. Accompanying the expansion of the resource scale is the prevalence of failures (CUDA errors, NaN values, job hangs, etc.), which poses significant challenges to training stability. Any large-scale LLM training infrastructure should strive for minimal training interruption, efficient fault diagnosis, and effective failure tolerance to enable highly efficient continuous training. This paper presents ByteRobust, a large-scale GPU infrastructure management system tailored for robust and stable training of LLMs. It exploits the uniqueness of the LLM training process and gives top priority to detecting and recovering from failures in a routine manner. Leveraging the parallelisms and characteristics of LLM training, ByteRobust enables high-capacity fault tolerance and prompt fault demarcation and localization with an effective data-driven approach, comprehensively ensuring continuous and efficient training of LLM tasks. ByteRobust is deployed on a production GPU platform with over 200,000 GPUs and achieves 97% ETTR for a three-month training job on 9,600 GPUs.