Themis: Fair and Efficient GPU Cluster Scheduling for Machine Learning Workloads
University of Wisconsin – Madison
arXiv:1907.01484 [cs.DC], 2 Jul 2019
@misc{mahajan2019themis,
  title={Themis: Fair and Efficient GPU Cluster Scheduling for Machine Learning Workloads},
  author={Mahajan, Kshiteej and Singhvi, Arjun and Balasubramanian, Arjun and Batra, Varun and Chavali, Surya Teja and Venkataraman, Shivaram and Akella, Aditya and Phanishayee, Amar and Chawla, Shuchi},
  year={2019},
  eprint={1907.01484},
  archivePrefix={arXiv},
  primaryClass={cs.DC}
}
Modern distributed machine learning (ML) training workloads benefit significantly from leveraging GPUs. However, substantial contention ensues when multiple such workloads are run atop a shared cluster of GPUs. A key question is how to fairly apportion GPUs across workloads while ensuring overall cluster efficiency. We find that established cluster scheduling disciplines that provide instantaneous fair share of resources are a poor fit because of ML workloads’ unique attributes. ML jobs are typically long running, have coarse-grained tasks that need to be gang-scheduled, and their performance is sensitive to tasks’ relative placement. These properties cannot be captured by existing fair sharing schemes. We propose Themis, a new scheduling framework for ML training workloads. Its GPU allocation policy enforces that ML workloads complete in a finish-time fair manner, a new notion we introduce. To capture placement sensitivity and ensure efficiency, Themis uses a two-level scheduling architecture where ML workloads bid on available resources that are offered in an auction run by a central arbiter. Our auction design allocates GPUs to winning bids by trading off efficiency for fairness in the short term but compensating for finish-time fairness in the long term. Our evaluation on a number of machine learning models shows that Themis can ensure greater fairness while providing more efficient allocations compared to state-of-the-art schedulers.
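To make the finish-time fairness notion concrete, below is a minimal Python sketch, not the authors' implementation: for each job it computes the fairness ratio rho = T_shared / T_independent (finish time under the current shared allocation versus on an independent 1/N share of the cluster, as the paper defines it) and picks the worst-off jobs to receive resource offers, loosely mirroring the offer phase of the two-level design. The Job fields, the offer fraction, and the helper names are illustrative assumptions.

from dataclasses import dataclass

@dataclass
class Job:
    name: str
    t_shared: float       # estimated finish time under current shared allocation (s)
    t_independent: float  # estimated finish time on an independent 1/N cluster share (s)

    @property
    def rho(self) -> float:
        """Finish-time fairness ratio; rho > 1 means the job is behind its fair share."""
        return self.t_shared / self.t_independent

def jobs_to_offer(jobs: list[Job], fraction: float = 0.5) -> list[Job]:
    """Offer spare GPUs to the worst-off (highest-rho) fraction of jobs,
    which would then bid on the offered resources in the arbiter's auction."""
    ranked = sorted(jobs, key=lambda j: j.rho, reverse=True)
    k = max(1, int(len(ranked) * fraction))
    return ranked[:k]

if __name__ == "__main__":
    jobs = [Job("resnet50", t_shared=3600, t_independent=3000),
            Job("bert", t_shared=9000, t_independent=4000)]
    for j in jobs_to_offer(jobs):
        print(f"{j.name}: rho = {j.rho:.2f}")

Restricting offers to the highest-rho jobs is what lets the auction trade short-term efficiency against fairness while still converging to finish-time fairness over the long term.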
July 4, 2019 by hgpu