
LiteGD: Lightweight and dynamic GPU Dispatching for Large-scale Heterogeneous Clusters

Kunming Zhang, Hanlong Liao, Guoming Tang
The Hong Kong University of Science and Technology (Guangzhou), Guangzhou, Guangdong, China
arXiv:2506.15595 [cs.DC] (18 Jun 2025)

Parallel computing with multiple GPUs has become the dominant paradigm for machine learning tasks, especially those of large language models (LLMs). To reduce the latency incurred by inter-GPU communication, a common practice for parallel tasks has been to allocate GPUs based on their physical proximity. However, this long-standing assumption has notable limitations, particularly in large-scale, heterogeneous GPU clusters where bandwidth distribution among GPUs is irregular. In this paper, we introduce LiteGD, a lightweight and dynamic GPU dispatching system based on a global perspective. To tackle the difficulty of storing massive GPU topology information, LiteGD adopts a computation-aware design that leverages a lightweight Transformer network trained on sampled data. Our customized network structure ensures both transferability and scalability. LiteGD also employs a bidirectional tree search over the data generated in the previous step, which identifies near-optimal GPU dispatching plans while reducing search overhead. We implement and evaluate LiteGD in both real and simulated GPU clusters with homogeneous and heterogeneous interconnects, respectively. Experimental results demonstrate that LiteGD consistently achieves high GPU bandwidth efficacy (approximately 90%) across various cluster configurations and 80% in a real-world H100 cluster, significantly outperforming conventional default and interconnect topology-aware dispatching methods, particularly in large-scale heterogeneous environments.
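
To make the bandwidth-aware dispatching idea concrete, below is a minimal, hypothetical sketch. It is not LiteGD's actual method (the paper uses a lightweight Transformer predictor plus a bidirectional tree search); instead it shows the underlying problem with a simple greedy stand-in: given a pairwise effective-bandwidth matrix for a heterogeneous cluster, pick k GPUs whose bottleneck (minimum pairwise) bandwidth stays high. All names (`dispatch_gpus`, the 16-GPU toy topology) are illustrative assumptions.

```python
# Toy illustration of bandwidth-aware GPU dispatching (hypothetical sketch,
# not LiteGD's algorithm): choose k GPUs from a cluster so that the group's
# bottleneck pairwise bandwidth is as high as possible.
import itertools
import random


def min_pairwise_bandwidth(bw, group):
    """Bottleneck (minimum) bandwidth among all GPU pairs in `group`."""
    return min(bw[i][j] for i, j in itertools.combinations(group, 2))


def dispatch_gpus(bw, k):
    """Greedy dispatch: seed with the best-connected pair, then repeatedly
    add the GPU that keeps the group's bottleneck bandwidth highest."""
    n = len(bw)
    best_pair = max(itertools.combinations(range(n), 2),
                    key=lambda p: bw[p[0]][p[1]])
    group = list(best_pair)
    while len(group) < k:
        candidates = [g for g in range(n) if g not in group]
        group.append(max(candidates,
                         key=lambda g: min_pairwise_bandwidth(bw, group + [g])))
    return sorted(group)


if __name__ == "__main__":
    random.seed(0)
    n = 16  # hypothetical 16-GPU cluster with irregular interconnect bandwidth
    bw = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            # GPUs in the same 4-GPU node get a fast NVLink-like link;
            # inter-node links are slower and irregular (heterogeneous fabric).
            link = 400.0 if i // 4 == j // 4 else random.choice([50.0, 100.0, 200.0])
            bw[i][j] = bw[j][i] = link
    chosen = dispatch_gpus(bw, k=4)
    print("dispatched GPUs:", chosen,
          "bottleneck bandwidth (GB/s):", min_pairwise_bandwidth(bw, chosen))
```

In this toy version the full bandwidth matrix is stored explicitly and searched greedily; the paper's point is that this does not scale to large heterogeneous clusters, which is why LiteGD replaces the explicit topology store with a learned, lightweight predictor and a bidirectional tree search.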

* * *

HGPU group © 2010-2025 hgpu.org

All rights belong to the respective authors

Contact us:

contact@hpgu.org