
FLASH: Fast All-to-All Communication in GPU Clusters

Yiran Lei, Dongjoo Lee, Liangyu Zhao, Daniar Kurniawan, Chanmyeong Kim, Heetaek Jeong, Changsu Kim, Hyeonseong Choi, Liangcheng Yu, Arvind Krishnamurthy, Justine Sherry, Eriko Nurvitadhi
Carnegie Mellon University
arXiv:2505.09764 [cs.DC]

@misc{lei2025flashfastalltoallcommunication,
   title={FLASH: Fast All-to-All Communication in GPU Clusters},
   author={Yiran Lei and Dongjoo Lee and Liangyu Zhao and Daniar Kurniawan and Chanmyeong Kim and Heetaek Jeong and Changsu Kim and Hyeonseong Choi and Liangcheng Yu and Arvind Krishnamurthy and Justine Sherry and Eriko Nurvitadhi},
   year={2025},
   eprint={2505.09764},
   archivePrefix={arXiv},
   primaryClass={cs.DC},
   url={https://arxiv.org/abs/2505.09764}
}


Scheduling All-to-All communications efficiently is fundamental to minimizing job completion times in distributed systems. Incast and straggler flows can slow down All-to-All transfers, and GPU clusters bring additional straggler challenges due to highly heterogeneous link capacities across technologies such as NVLink and Ethernet. Existing schedulers all suffer high overheads relative to theoretically optimal transfers. Classical, simple scheduling algorithms such as SpreadOut fail to minimize transfer completion times; modern optimization-based schedulers such as TACCL achieve better completion times, but with computation times that can be orders of magnitude longer than the transfer itself. This paper presents FLASH, which schedules near-optimal All-to-All transfers with a simple, polynomial-time algorithm. FLASH keeps the bottleneck inter-server network maximally utilized and, in the background, shuffles data between GPUs over fast intra-server networks to mitigate stragglers. We prove that, so long as intra-server networks are significantly faster than inter-server networks, FLASH approaches near-optimal transfer completion times. We implement FLASH and demonstrate that its computational overheads are negligible, yet it achieves transfer completion times comparable to state-of-the-art solver-based schedulers.
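To make the round-based scheduling structure concrete, the following is a toy Python sketch of the classical staggered inter-server all-to-all pattern (in the style of SpreadOut, the baseline the abstract mentions). It is illustrative only and is not the paper's FLASH algorithm; the function name and round layout are assumptions for this sketch.

```python
# Toy sketch (not the paper's algorithm): a SpreadOut-style staggered
# inter-server all-to-all schedule. In round r, server i sends the
# chunk destined for server (i + r) % n, so in every round each server
# transmits exactly once and receives exactly once, keeping every
# inter-server link occupied without incast.

def staggered_all_to_all(n_servers):
    """Return a list of rounds; each round is a list of (src, dst) pairs."""
    rounds = []
    for r in range(1, n_servers):  # r = 0 would be a server's local copy
        rounds.append([(i, (i + r) % n_servers) for i in range(n_servers)])
    return rounds

schedule = staggered_all_to_all(4)
# n - 1 = 3 rounds; in each round, all 4 servers send once and receive once
```

With homogeneous links this round structure is contention-free, but, as the abstract notes, it fails to minimize completion times when link capacities are heterogeneous; FLASH's contribution is to absorb such stragglers by shuffling data over the much faster intra-server fabric in the background.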

* * *


HGPU group © 2010-2025 hgpu.org

All rights belong to the respective authors
