high performance computing on graphics processing units: hgpu.org

hgpu.org » Applications » Computer science » Deep Learning Workload Scheduling in GPU Datacenters: A Survey

Deep Learning Workload Scheduling in GPU Datacenters: A Survey

Zhisheng Ye, Wei Gao, Qinghao Hu, Peng Sun, Xiaolin Wang, Yingwei Luo, Tianwei Zhang, Yonggang Wen

Peking University, China

ACM Computing Surveys, 2023

DOI:10.1145/3638757

BibTeX

Download (PDF)

View

Source

1415

views

Deep learning (DL) has demonstrated its remarkable success in a wide variety of fields. The development of a DL model is a time-consuming and resource-intensive procedure. Hence, dedicated GPU accelerators have been collectively constructed into a GPU datacenter. An efficient scheduler design for a GPU datacenter is crucially important to reduce operational cost and improve resource utilization. However, traditional approaches designed for big data or high performance computing workloads can not support DL workloads to fully utilize the GPU resources. Recently, many schedulers are proposed to tailor for DL workloads in GPU datacenters. This paper surveys existing research efforts for both training and inference workloads. We primarily present how existing schedulers facilitate the respective workloads from the scheduling objectives and resource utilization manner. Finally, we discuss several promising future research directions including emerging DL workloads, advanced scheduling decision making and underlying hardware resources. A more detailed summary of the surveyed paper and code links can be found at our project website.

Tags: Cloud, Computer science, Deep learning, survey, Task scheduling

January 7, 2024 by hgpu

No votes yet.

Please wait...

hpcbench: A set of benchmarking utilities for biomolecular simulation tools

Engineering Supercomputing Platforms for Biomolecular Applications

high performance computing on graphics processing units: hgpu.org

Deep Learning Workload Scheduling in GPU Datacenters: A Survey

Recent source codes

hpcbench: A set of benchmarking utilities for biomolecular simulation tools

HPCTransCompile: An AI Compiler Generated Dataset for High-Performance CUDA Transpilation and LLM Preliminary Exploration

chemtrain: Training Molecular Dynamics Potentials in JAX

microSYCL: SYCL micro-benchmarks repository

XaaS containers

CASS: Cuda-Amd aSSembly

Cluser of smartphones for edge computing application using TensorFlow

SYCL Container

Efficient Graph Embedding at Scale: Optimizing CPU-GPU-SSD Integration

Can Large Language Models Predict Parallel Code Performance?

Most viewed papers (last 30 days)

Deep Learning Workload Scheduling in GPU Datacenters: A Survey

Share this:

Recent source codes

Most viewed papers (last 30 days)