The Landscape of GPU-Centric Communication
Koç University, Turkey
arXiv:2409.09874 [cs.DC], (15 Sep 2024)
@misc{unat2024landscapegpucentriccommunication,
  title={The Landscape of GPU-Centric Communication},
  author={Didem Unat and Ilyas Turimbetov and Mohammed Kefah Taha Issa and Doğan Sağbili and Flavio Vella and Daniele De Sensi and Ismayil Ismayilov},
  year={2024},
  eprint={2409.09874},
  archivePrefix={arXiv},
  primaryClass={cs.DC},
  url={https://arxiv.org/abs/2409.09874}
}
In recent years, GPUs have become the preferred accelerators for HPC and ML applications due to their parallelism and high memory bandwidth. While GPUs boost computation, inter-GPU communication can create scalability bottlenecks, especially as the number of GPUs per node and cluster grows. Traditionally, the CPU managed multi-GPU communication, but advancements in GPU-centric communication now challenge this CPU dominance by reducing its involvement, granting GPUs more autonomy in communication tasks, and addressing mismatches between multi-GPU communication and computation. This paper provides a landscape of GPU-centric communication, focusing on vendor mechanisms and user-level library support. It aims to clarify the complexities and diverse options in this field, define the terminology, and categorize existing approaches within and across nodes. The paper discusses vendor-provided mechanisms for communication and memory management in multi-GPU execution and reviews major communication libraries, their benefits, challenges, and performance insights. It then explores key research paradigms, future outlooks, and open research questions. By extensively describing GPU-centric communication techniques across the software and hardware stacks, we provide researchers, programmers, engineers, and library designers with insights on how best to exploit multi-GPU systems.
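To make "GPU-centric" concrete, below is a minimal, hypothetical CUDA peer-to-peer (P2P) sketch of the kind of vendor mechanism the paper categorizes; it is not code from the paper, and the device IDs and buffer size are arbitrary assumptions. Once peer access is enabled, data moves directly between the two GPUs over NVLink or PCIe, with the CPU only enqueueing the transfer rather than staging the data through host memory.

// Minimal CUDA P2P sketch: copy a buffer directly from GPU 0 to GPU 1
// without staging through host memory. Device IDs and buffer size are
// arbitrary, illustrative choices.
#include <cstdio>
#include <cuda_runtime.h>

#define CHECK(call)                                                      \
    do {                                                                 \
        cudaError_t err = (call);                                        \
        if (err != cudaSuccess) {                                        \
            fprintf(stderr, "CUDA error %s at %s:%d\n",                  \
                    cudaGetErrorString(err), __FILE__, __LINE__);        \
            return 1;                                                    \
        }                                                                \
    } while (0)

int main() {
    const int src_dev = 0, dst_dev = 1;
    const size_t count = 1 << 20;              // 1M floats, arbitrary
    const size_t bytes = count * sizeof(float);

    // Verify that the two GPUs can address each other's memory directly.
    int can01 = 0, can10 = 0;
    CHECK(cudaDeviceCanAccessPeer(&can01, src_dev, dst_dev));
    CHECK(cudaDeviceCanAccessPeer(&can10, dst_dev, src_dev));
    if (!can01 || !can10) {
        printf("P2P not supported between GPU %d and GPU %d\n", src_dev, dst_dev);
        return 0;
    }

    // Enable peer access in both directions.
    CHECK(cudaSetDevice(src_dev));
    CHECK(cudaDeviceEnablePeerAccess(dst_dev, 0));
    CHECK(cudaSetDevice(dst_dev));
    CHECK(cudaDeviceEnablePeerAccess(src_dev, 0));

    // Allocate one buffer on each GPU.
    float *src_buf = nullptr, *dst_buf = nullptr;
    CHECK(cudaSetDevice(src_dev));
    CHECK(cudaMalloc(&src_buf, bytes));
    CHECK(cudaMemset(src_buf, 0x2A, bytes));   // fill with a test pattern
    CHECK(cudaSetDevice(dst_dev));
    CHECK(cudaMalloc(&dst_buf, bytes));

    // Direct GPU-to-GPU copy over NVLink/PCIe; the CPU never touches the data.
    CHECK(cudaMemcpyPeer(dst_buf, dst_dev, src_buf, src_dev, bytes));
    CHECK(cudaDeviceSynchronize());
    printf("Copied %zu bytes from GPU %d to GPU %d via P2P\n",
           bytes, src_dev, dst_dev);

    CHECK(cudaFree(dst_buf));
    CHECK(cudaSetDevice(src_dev));
    CHECK(cudaFree(src_buf));
    return 0;
}

Compile with nvcc and run on a node with at least two P2P-capable GPUs; the paper's broader landscape covers this and other intra-node and inter-node mechanisms as well as the communication libraries built on top of them.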
September 22, 2024 by hgpu