Elastic deep learning in multi-tenant GPU cluster
Chinese University of Hong Kong, Hong Kong
arXiv:1909.11985 [cs.DC] (26 Sep 2019)
@misc{wu2019elastic,
  title={Elastic deep learning in multi-tenant GPU cluster},
  author={Wu, Yidi and Ma, Kaihao and Yan, Xiao and Liu, Zhi and Cheng, James},
  year={2019},
  eprint={1909.11985},
  archivePrefix={arXiv},
  primaryClass={cs.DC}
}
Multi-tenant GPU clusters are now common due to the huge success of deep learning, and training jobs are usually run on multiple distributed GPUs. These clusters are managed with various goals, including short job completion time (JCT), high resource utilization, and quick response to small jobs. In this paper, we show that elasticity, i.e., the ability to adjust the parallelism (number of GPUs) of a job with low overhead, helps to achieve these goals of GPU cluster management. With elasticity, we can adjust the trade-off between throughput and efficiency, adapt to variations in cluster load, and utilize transient idle resources. Motivated by these benefits, we designed Amoeba, which requires minimal changes to user code and provides a simple API for the scheduler to control the parallelism of jobs. Amoeba is general in that it delegates single-machine execution to existing deep learning frameworks and uses a lightweight control layer for coordination and management. As it is crucial to reduce the overhead of parallelism adjustment, Amoeba adopts key designs including automatic job management, background scaling, and a dynamic data pipeline. Experimental results show that Amoeba introduces negligible overhead to normal training when no parallelism adjustment takes place, and that its scaling cost is around 95% lower than that of naive stop-resume. Moreover, we show that a state-of-the-art GPU cluster scheduler can leverage elasticity with simple modifications and reduce the average JCT by as much as 29% over the case without elasticity.
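
The posting does not include Amoeba's actual interface, but a minimal sketch of what a scheduler-facing elasticity API could look like, under the assumptions described in the abstract, is given below. The class and method names (ElasticJob, scale_to, and the worker/data helpers) are illustrative assumptions, not the paper's implementation; the sketch only shows the idea of growing or shrinking a job's GPU set while keeping existing workers alive instead of stop-resume.

# Hypothetical sketch of a scheduler-facing elasticity API, loosely modeled
# on the description in the abstract. All names here are assumptions made
# for illustration, not Amoeba's actual interface.
from dataclasses import dataclass, field
from typing import List


@dataclass
class ElasticJob:
    """A training job whose GPU parallelism can be adjusted at runtime."""
    job_id: str
    gpus: List[str] = field(default_factory=list)

    def scale_to(self, new_gpus: List[str]) -> None:
        """Adjust parallelism to the given GPU set.

        Naive stop-resume would checkpoint, kill all workers, and relaunch.
        An elastic system instead keeps existing workers running, prepares
        new workers in the background, and only briefly pauses training to
        re-shard the input data and rebuild the communication group.
        """
        added = [g for g in new_gpus if g not in self.gpus]
        removed = [g for g in self.gpus if g not in new_gpus]
        # Background scaling: launch and warm up new workers (load model
        # replicas, allocate buffers) while current workers keep training.
        for gpu in added:
            self._launch_worker(gpu)
        # Brief synchronization point: re-partition the data pipeline across
        # the new worker set, then resume training.
        self._repartition_data(new_gpus)
        for gpu in removed:
            self._retire_worker(gpu)
        self.gpus = new_gpus

    def _launch_worker(self, gpu: str) -> None:
        print(f"[{self.job_id}] warming up worker on {gpu}")

    def _retire_worker(self, gpu: str) -> None:
        print(f"[{self.job_id}] draining worker on {gpu}")

    def _repartition_data(self, gpus: List[str]) -> None:
        print(f"[{self.job_id}] re-sharding data across {len(gpus)} GPUs")


# A scheduler could grow a job onto transient idle GPUs and shrink it back
# when cluster load rises, e.g.:
job = ElasticJob("resnet50-run1", gpus=["gpu0", "gpu1"])
job.scale_to(["gpu0", "gpu1", "gpu2", "gpu3"])  # scale out
job.scale_to(["gpu0", "gpu1"])                  # scale in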
September 29, 2019 by hgpu