CoCoNet: Co-Optimizing Computation and Communication for Distributed Machine Learning
University of Massachusetts Amherst, United States
arXiv:2105.05720 [cs.DC], (13 May 2021)
@misc{jangda2021coconet,
  title={CoCoNet: Co-Optimizing Computation and Communication for Distributed Machine Learning},
  author={Abhinav Jangda and Jun Huang and Guodong Liu and Amir Hossein Nodehi Sabet and Saeed Maleki and Youshan Miao and Madanlal Musuvathi and Todd Mytkowicz and Olli Saarikivi},
  year={2021},
  eprint={2105.05720},
  archivePrefix={arXiv},
  primaryClass={cs.DC}
}
Modern deep learning workloads run on distributed hardware and are difficult to optimize: data, model, and pipeline parallelism require developers to thoughtfully restructure their workloads around optimized computation and communication kernels in libraries such as cuBLAS and NCCL. The logical separation between computation and communication leaves performance on the table, with optimization opportunities missed across abstraction boundaries. To explore these opportunities, this paper presents CoCoNet, which consists of a compute language to express programs with both computation and communication, a scheduling language to apply transformations to such programs, and a compiler to generate high-performance kernels. Providing both computation and communication as first-class constructs enables new optimizations, such as overlapping or fusing communication with computation. CoCoNet allowed us to optimize several data-, model-, and pipeline-parallel workloads in existing deep learning systems with very few lines of code. We show significant improvements after integrating novel CoCoNet-generated kernels.
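For intuition, here is a minimal hand-written sketch (in PyTorch, not CoCoNet's actual language or generated code) of the kind of computation/communication overlap that CoCoNet automates: each gradient's all-reduce is launched asynchronously, and the parameter update for one tensor runs while the all-reduces of later tensors are still in flight. The function name, the plain-SGD update, and the averaging by world size are illustrative assumptions.

import torch
import torch.distributed as dist

@torch.no_grad()
def overlapped_allreduce_sgd(params, lr=1e-3):
    """Overlap gradient all-reduce with the SGD parameter update (illustrative)."""
    world_size = dist.get_world_size()

    # Launch one asynchronous all-reduce per gradient tensor (e.g. NCCL backend).
    handles = [
        dist.all_reduce(p.grad, op=dist.ReduceOp.SUM, async_op=True)
        for p in params
    ]

    # Consume results in launch order: while parameter i is being updated,
    # the all-reduces for parameters i+1, i+2, ... continue in the background.
    for p, handle in zip(params, handles):
        handle.wait()
        p.add_(p.grad, alpha=-lr / world_size)

CoCoNet's scheduling language is aimed at deriving this kind of overlapped (and further fused) schedule from a straightforward high-level program, rather than requiring the restructuring to be written by hand.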
May 23, 2021 by hgpu