high performance computing on graphics processing units: hgpu.org

Isolated Scheduling for Distributed Training Tasks in GPU Clusters

Xinchi Han, Weihao Jiang, Peirui Cao, Qinwei Yang, Yunzhuo Liu, Shuyao Qi, Shengkai Lin, Shizhen Zhao

Shanghai Jiao Tong University

arXiv:2308.05692 [cs.DC], (10 Aug 2023)

DOI:10.48550/arXiv.2308.05692

@misc{han2023isolated,
  title={Isolated Scheduling for Distributed Training Tasks in GPU Clusters},
  author={Xinchi Han and Weihao Jiang and Peirui Cao and Qinwei Yang and Yunzhuo Liu and Shuyao Qi and Shengkai Lin and Shizhen Zhao},
  year={2023},
  eprint={2308.05692},
  archivePrefix={arXiv},
  primaryClass={cs.DC}
}

Package:

RapidNetSim: a fast and scalable event-driven network simulator

Distributed machine learning (DML) makes it possible to train large neural networks in a reasonable amount of time. However, because computing power is growing much faster than network capacity, network communication has gradually become the bottleneck of DML. Current multi-tenant GPU clusters suffer from network contention caused by hash collisions, which not only increases communication overhead but also creates unfairness and degrades the user experience. In this paper, we first analyse how network contention affects training time in a cluster with 32 NVIDIA V100 GPUs. We then propose vClos, which eliminates network contention by jointly optimizing the network topology and the communication pattern in distributed training. We also propose OCS-vClos, which introduces a layer of optical circuit switches (OCSs) into the leaf-spine network to reduce the network resource fragmentation that the resource allocation strategy in vClos can cause. Testbed experiments and large-scale simulations based on real traces demonstrate the superiority of vClos over existing network resource scheduling strategies.
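To make the hash-collision problem concrete, the following is a minimal illustrative sketch, not the authors' method. The abstract does not name the hashing scheme, so the sketch assumes ECMP-style per-flow hashing in which each flow's identifying tuple is hashed onto one of several equal-cost uplinks. The function names, the tuple layout, and the use of SHA-256 are all hypothetical choices made for illustration.

```python
import hashlib
from collections import Counter


def pick_uplink(flow, num_uplinks):
    """Map a flow tuple to an uplink index by hashing.

    Assumption: ECMP-style per-flow hashing; SHA-256 stands in for
    whatever hash function a real switch would use.
    """
    key = "|".join(map(str, flow)).encode()
    digest = hashlib.sha256(key).digest()
    return int.from_bytes(digest[:8], "big") % num_uplinks


def link_loads(flows, num_uplinks):
    """Count how many flows land on each uplink."""
    return Counter(pick_uplink(f, num_uplinks) for f in flows)


def collision_free_probability(num_flows, num_uplinks):
    """Probability that uniformly random hashing puts every flow
    on a distinct uplink (a birthday-problem calculation)."""
    if num_flows > num_uplinks:
        return 0.0
    p = 1.0
    for i in range(num_flows):
        p *= (num_uplinks - i) / num_uplinks
    return p


if __name__ == "__main__":
    # Hypothetical flows: (src_ip, dst_ip, src_port, dst_port, protocol).
    flows = [("10.0.0.%d" % i, "10.0.1.%d" % i, 50000 + i, 4791, "udp")
             for i in range(8)]
    loads = link_loads(flows, num_uplinks=8)
    print("flows per uplink:", dict(loads))
    print("P(no collision):", collision_free_probability(8, 8))
```

Under the uniform-hashing assumption, eight flows spread over eight uplinks avoid every collision with probability 8!/8^8, roughly 0.24%, so some uplink is almost always shared while others sit idle. Any uplink carrying two or more flows splits its bandwidth between them, which is the kind of contention the abstract describes.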

Tags: Computer science, GPU cluster, Machine learning, Neural networks, nVidia, nVidia V100, Package

August 13, 2023 by hgpu
