Themis: Fair and Efficient GPU Cluster Scheduling for Machine Learning Workloads
University of Wisconsin – Madison
arXiv:1907.01484 [cs.DC], 2 Jul 2019
@misc{mahajan2019themis,
  title={Themis: Fair and Efficient GPU Cluster Scheduling for Machine Learning Workloads},
  author={Mahajan, Kshiteej and Singhvi, Arjun and Balasubramanian, Arjun and Batra, Varun and Chavali, Surya Teja and Venkataraman, Shivaram and Akella, Aditya and Phanishayee, Amar and Chawla, Shuchi},
  year={2019},
  eprint={1907.01484},
  archivePrefix={arXiv},
  primaryClass={cs.DC}
}
Modern distributed machine learning (ML) training workloads benefit significantly from leveraging GPUs. However, substantial contention ensues when multiple such workloads are run atop a shared cluster of GPUs. A key question is how to fairly apportion GPUs across workloads while ensuring overall cluster efficiency. We find that established cluster scheduling disciplines that provide instantaneous fair share of resources are a poor fit because of ML workloads’ unique attributes. ML jobs are typically long running, have coarse-grained tasks that need to be gang-scheduled, and their performance is sensitive to tasks’ relative placement. These properties cannot be captured by existing fair sharing schemes. We propose Themis, a new scheduling framework for ML training workloads. Its GPU allocation policy enforces that ML workloads complete in a finish-time fair manner, a new notion we introduce. To capture placement sensitivity and ensure efficiency, Themis uses a two-level scheduling architecture where ML workloads bid on available resources that are offered in an auction run by a central arbiter. Our auction design allocates GPUs to winning bids by trading off efficiency for fairness in the short term but compensating for finish-time fairness in the long term. Our evaluation on a number of machine learning models shows that Themis can ensure greater fairness while providing more efficient allocations compared to state-of-the-art schedulers.
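To make the finish-time fairness notion concrete, below is a minimal Python sketch, not the authors' implementation: for each job it computes the fairness ratio rho = T_shared / T_independent (finish time under the current shared allocation versus on an independent 1/N share of the cluster, as the paper defines it) and picks the worst-off jobs to receive resource offers, loosely mirroring the offer phase of the two-level design. The Job fields, the offer fraction, and the helper names are illustrative assumptions.

from dataclasses import dataclass

@dataclass
class Job:
    name: str
    t_shared: float       # estimated finish time under current shared allocation (s)
    t_independent: float  # estimated finish time on an independent 1/N cluster share (s)

    @property
    def rho(self) -> float:
        """Finish-time fairness ratio; rho > 1 means the job is behind its fair share."""
        return self.t_shared / self.t_independent

def jobs_to_offer(jobs: list[Job], fraction: float = 0.5) -> list[Job]:
    """Offer spare GPUs to the worst-off (highest-rho) fraction of jobs,
    which would then bid on the offered resources in the arbiter's auction."""
    ranked = sorted(jobs, key=lambda j: j.rho, reverse=True)
    k = max(1, int(len(ranked) * fraction))
    return ranked[:k]

if __name__ == "__main__":
    jobs = [Job("resnet50", t_shared=3600, t_independent=3000),
            Job("bert", t_shared=9000, t_independent=4000)]
    for j in jobs_to_offer(jobs):
        print(f"{j.name}: rho = {j.rho:.2f}")

Restricting offers to the highest-rho jobs is what lets the auction trade short-term efficiency against fairness while still converging to finish-time fairness over the long term.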
July 4, 2019 by hgpu