Understanding Data Movement in AMD Multi-GPU Systems with Infinity Fabric

Gabin Schieffer, Ruimin Shi, Stefano Markidis, Andreas Herten, Jennifer Faj, Ivy Peng
KTH Royal Institute of Technology, Stockholm, Sweden
arXiv:2410.00801 [cs.DC] (1 Oct 2024)

@misc{schieffer2024understandingdatamovementamd,
   title={Understanding Data Movement in AMD Multi-GPU Systems with Infinity Fabric},
   author={Gabin Schieffer and Ruimin Shi and Stefano Markidis and Andreas Herten and Jennifer Faj and Ivy Peng},
   year={2024},
   eprint={2410.00801},
   archivePrefix={arXiv},
   primaryClass={cs.DC},
   url={https://arxiv.org/abs/2410.00801}
}

Modern GPU systems are constantly evolving to meet the needs of computing-intensive applications in scientific and machine learning domains. However, there is typically a gap between the hardware capacity and the achievable application performance. This work aims to provide a better understanding of the Infinity Fabric interconnects on AMD GPUs and CPUs. We propose a test and evaluation methodology for characterizing the performance of data movements on multi-GPU systems, stressing different communication options on AMD MI250X GPUs, including point-to-point and collective communication, and memory allocation strategies between GPUs, as well as the host CPU. In a single-node setup with four GPUs, we show that direct peer-to-peer memory accesses between GPUs and utilization of the RCCL library outperform MPI-based solutions in terms of memory/communication latency and bandwidth. Our test and evaluation method serves as a base for validating memory and communication strategies on a system and improving applications on AMD multi-GPU computing systems.
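The kind of measurement the abstract describes (latency and bandwidth of a data movement across transfer sizes) can be illustrated with a minimal sketch. This is not the authors' benchmark: a plain host-memory copy stands in for the GPU-to-GPU and CPU-to-GPU transfers, and the names `measure_copy`, `host_copy`, and `sweep` are hypothetical helpers introduced only for illustration.

```python
import time


def measure_copy(copy_fn, size_bytes, repeats=100):
    """Return (mean seconds per copy, bandwidth in GB/s) for copy_fn.

    copy_fn(src, dst) is any callable that moves size_bytes from src to dst.
    Here it is a host copy; on a real system it would wrap a device transfer
    (assumption: that wrapper blocks until the transfer has completed).
    """
    src = bytearray(size_bytes)
    dst = bytearray(size_bytes)

    copy_fn(src, dst)  # warm-up run, excluded from the timing

    start = time.perf_counter()
    for _ in range(repeats):
        copy_fn(src, dst)
    elapsed = time.perf_counter() - start

    # Guard against a zero reading on timers with coarse resolution.
    seconds_per_copy = max(elapsed / repeats, 1e-12)
    bandwidth_gb_s = size_bytes / seconds_per_copy / 1e9
    return seconds_per_copy, bandwidth_gb_s


def host_copy(src, dst):
    """Stand-in transfer: copy the contents of src into dst in host memory."""
    dst[:] = src


def sweep(copy_fn, sizes):
    """Measure copy_fn at each transfer size; returns {size: (seconds, GB/s)}."""
    return {size: measure_copy(copy_fn, size) for size in sizes}


if __name__ == "__main__":
    # Small sizes are dominated by latency, large sizes by bandwidth.
    for size, (seconds, gb_s) in sweep(host_copy, [1 << 10, 1 << 20, 1 << 24]).items():
        print(f"{size:>10} B  {seconds * 1e6:10.2f} us  {gb_s:8.2f} GB/s")
```

Sweeping the transfer size separates the two regimes a characterization study usually reports: small messages expose the fixed per-transfer latency, while large messages approach the sustained bandwidth of the link.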