
Improving Automatic Parallel Training via Balanced Memory Workload Optimization

Yujie Wang, Youhe Jiang, Xupeng Miao, Fangcheng Fu, Xiaonan Nie, Bin Cui
Key Lab of High Confidence Software Technologies (MOE), School of CS, Peking University, Beijing 100871, China
arXiv:2307.02031 [cs.LG], 5 Jul 2023

@misc{wang2023improving,
   title={Improving Automatic Parallel Training via Balanced Memory Workload Optimization},
   author={Yujie Wang and Youhe Jiang and Xupeng Miao and Fangcheng Fu and Xiaonan Nie and Bin Cui},
   year={2023},
   eprint={2307.02031},
   archivePrefix={arXiv},
   primaryClass={cs.LG}
}

Transformer models have emerged as the leading approach for achieving state-of-the-art performance across various application domains, serving as the foundation for advanced large-scale deep learning (DL) models. However, efficiently training these models across multiple GPUs remains a complex challenge due to the abundance of parallelism options. Existing DL systems either require manual efforts to design distributed training plans or limit parallelism combinations to a constrained search space. In this paper, we present Galvatron-BMW, a novel system framework that integrates multiple prevalent parallelism dimensions and automatically identifies the most efficient hybrid parallelism strategy. To effectively navigate this vast search space, we employ a decision tree approach for decomposition and pruning based on intuitive insights. We further utilize a dynamic programming search algorithm to derive the optimal plan. Moreover, to improve resource utilization and enhance system efficiency, we propose a bi-objective optimization workflow that focuses on workload balance. Our evaluations on different Transformer models demonstrate the capabilities of Galvatron-BMW in automating distributed training under varying GPU memory constraints. Across all tested scenarios, Galvatron-BMW consistently achieves superior system throughput, surpassing previous approaches that rely on limited parallelism strategies.
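The abstract describes a dynamic programming search that derives a hybrid parallelism plan under GPU memory constraints. Below is a minimal, hypothetical sketch of such a layer-wise DP: each Transformer layer picks one candidate strategy with an estimated time and memory cost, and the search minimizes total time subject to a per-GPU memory budget. The strategy set, cost tables, and memory model here are illustrative assumptions, not Galvatron-BMW's actual search space or cost model.

# Hypothetical layer-wise DP over parallelism strategies (illustrative only).
from math import inf

def dp_search(layers, strategies, time_cost, mem_cost, mem_budget):
    """layers: number of (assumed identical) Transformer layers.
    strategies: candidate strategy ids (e.g., data/tensor/pipeline combinations).
    time_cost[s], mem_cost[s]: estimated per-layer time and memory of strategy s.
    mem_budget: per-GPU memory budget, in the same integer units as mem_cost.
    Returns (best_time, per-layer plan) or (inf, None) if no plan fits."""
    # best[m] = minimal total time so far using at most m units of memory
    best = [0.0] + [inf] * mem_budget
    choice = [[None] * (mem_budget + 1) for _ in range(layers)]

    for layer in range(layers):
        new_best = [inf] * (mem_budget + 1)
        for m in range(mem_budget + 1):
            if best[m] == inf:
                continue
            for s in strategies:
                nm = m + mem_cost[s]
                if nm > mem_budget:
                    continue
                t = best[m] + time_cost[s]
                if t < new_best[nm]:
                    new_best[nm] = t
                    choice[layer][nm] = (s, m)  # remember strategy and previous state
        best = new_best

    # Backtrack from the cheapest feasible end state to recover the plan.
    end_m = min(range(mem_budget + 1), key=lambda m: best[m])
    if best[end_m] == inf:
        return inf, None
    plan, m = [], end_m
    for layer in reversed(range(layers)):
        s, m = choice[layer][m]
        plan.append(s)
    return best[end_m], list(reversed(plan))

# Example usage with made-up costs for three candidate strategies:
# time, plan = dp_search(layers=24, strategies=[0, 1, 2],
#                        time_cost=[3.0, 2.0, 1.5], mem_cost=[1, 2, 4],
#                        mem_budget=48)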
