high performance computing on graphics processing units: hgpu.org

Elastic deep learning in multi-tenant GPU cluster

Yidi Wu, Kaihao Ma, Xiao Yan, Zhi Liu, James Cheng

Chinese University of Hong Kong, Hong Kong

arXiv:1909.11985 [cs.DC], (26 Sep 2019)

Multi-tenant GPU clusters are now common thanks to the success of deep learning, and training jobs are usually run on multiple distributed GPUs. These clusters are managed with several goals in mind, including short job completion time (JCT), high resource utilization, and quick response to small jobs. In this paper, we show that elasticity, the ability to adjust the parallelism (number of GPUs) of a job with low overhead, helps achieve these cluster management goals. With elasticity, we can adjust the trade-off between throughput and efficiency, adapt to variations in cluster load, and make use of transiently idle resources, among other benefits. Motivated by these benefits, we designed Amoeba, which requires minimal changes to user code and provides a simple API for the scheduler to control the parallelism of jobs. Amoeba is general in that it delegates single-machine execution to existing deep learning frameworks and uses a lightweight control layer for coordination and management. Because reducing the overhead of parallelism adjustment is crucial, Amoeba adopts several key designs, including automatic job management, background scaling, and a dynamic data pipeline. Experimental results show that Amoeba introduces negligible overhead to normal training when parallelism is not adjusted, and that its scaling cost is significantly lower (by around 95%) than that of naive stop-resume. Moreover, we show that a state-of-the-art GPU cluster scheduler can leverage elasticity with simple modifications and reduce the average JCT by as much as 29% compared with the case without elasticity.
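To make the trade-off between throughput and efficiency concrete, here is a minimal sketch. It is not taken from the paper and does not use Amoeba's API: the cost model (compute time divided across GPUs plus a communication term that grows with the number of GPUs), the constants, and the function names `step_time`, `speedup`, `efficiency`, and `pick_parallelism` are all assumptions made for illustration.

```python
# Toy cost model (an assumption for illustration, not the paper's model):
# one training step takes compute_time / n seconds of computation plus
# comm_cost * (n - 1) seconds of synchronization when run on n GPUs.

def step_time(n_gpus, compute_time=1.0, comm_cost=0.02):
    """Time of one training step on n_gpus under the toy model."""
    return compute_time / n_gpus + comm_cost * (n_gpus - 1)


def speedup(n_gpus, compute_time=1.0, comm_cost=0.02):
    """Throughput relative to a single GPU."""
    return step_time(1, compute_time, comm_cost) / step_time(
        n_gpus, compute_time, comm_cost
    )


def efficiency(n_gpus, compute_time=1.0, comm_cost=0.02):
    """Speedup per GPU: 1.0 means perfect scaling."""
    return speedup(n_gpus, compute_time, comm_cost) / n_gpus


def pick_parallelism(idle_gpus, min_efficiency):
    """Hypothetical scheduler rule: use as many idle GPUs as possible
    while keeping per-GPU efficiency at or above min_efficiency."""
    best = 1
    for n in range(1, idle_gpus + 1):
        if efficiency(n) >= min_efficiency:
            best = n
    return best


if __name__ == "__main__":
    for n in (1, 2, 4, 8):
        print(n, round(speedup(n), 3), round(efficiency(n), 3))
    print("chosen parallelism:", pick_parallelism(idle_gpus=8, min_efficiency=0.8))
```

Under these assumed constants, adding GPUs keeps raising throughput while per-GPU efficiency falls, so a scheduler that can change parallelism cheaply could favor throughput when the cluster is idle and efficiency when it is busy.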

Tags: Computer science, Deep learning, Distributed computing, GPU cluster, nVidia, nVidia GeForce GTX 1080 Ti, Tesla V100

September 29, 2019 by hgpu
