Communication-minimizing Asynchronous Tensor Parallelism

Siddharth Singh, Zack Sating, Abhinav Bhatele
Department of Computer Science, University of Maryland, College Park, USA
arXiv:2305.13525 [cs.LG], (22 May 2023)

@misc{singh2023communicationminimizing,
   title={Communication-minimizing Asynchronous Tensor Parallelism},
   author={Siddharth Singh and Zack Sating and Abhinav Bhatele},
   year={2023},
   eprint={2305.13525},
   archivePrefix={arXiv},
   primaryClass={cs.LG}
}


As state-of-the-art neural networks scale to billions of parameters, designing parallel algorithms that can train these networks efficiently on multi-GPU clusters has become critical. This paper presents Tensor3D, a novel three-dimensional (3D) approach to parallelizing tensor computations that strives to minimize the idle time incurred due to communication in parallel training of large multi-billion-parameter models. First, we introduce an intelligent distribution of neural network parameters across GPUs that eliminates the communication required to satisfy the data dependencies of individual layers. Then, we propose a novel overdecomposition of the parallel training process, which lets us overlap a significant share of communication with computation and thereby reduce GPU idle time. Finally, we present a communication model that helps users identify communication-optimal decompositions of the available hardware resources for a given neural network. For a 28B-parameter CNN on 256 A100 GPUs, Tensor3D improves training time by nearly 60% compared to Megatron-LM.
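
To make the idea of searching over decompositions of the available hardware concrete, the following is a minimal sketch that enumerates every way to arrange a fixed number of GPUs as a 3D grid and ranks the candidates with a user-supplied cost function. It is not the paper's communication model: the function names `grid_decompositions` and `rank_decompositions` are hypothetical, the meaning of each grid dimension is left open, and `toy_cost` is an arbitrary placeholder used only to show the interface.

```python
def grid_decompositions(num_gpus):
    """Return every ordered triple (g_x, g_y, g_z) whose product is num_gpus."""
    divisors = [d for d in range(1, num_gpus + 1) if num_gpus % d == 0]
    grids = []
    for g_x in divisors:
        remaining = num_gpus // g_x
        for g_y in divisors:
            if remaining % g_y == 0:
                grids.append((g_x, g_y, remaining // g_y))
    return grids


def rank_decompositions(num_gpus, cost_fn):
    """Sort all 3D grids by cost_fn, cheapest first (ties keep enumeration order)."""
    return sorted(grid_decompositions(num_gpus), key=cost_fn)


def toy_cost(grid):
    # ASSUMPTION: arbitrary placeholder, NOT the communication model from the paper.
    # A real model would estimate communication time from message sizes and bandwidth.
    return sum(grid)


if __name__ == "__main__":
    candidates = rank_decompositions(256, toy_cost)
    print(len(candidates), "candidate grids; cheapest under the toy cost:", candidates[0])
```

Because the cost function is passed in as an argument, a model that accounts for layer shapes and network bandwidth could replace `toy_cost` without changing the enumeration.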
