Understanding Data Movement in AMD Multi-GPU Systems with Infinity Fabric
KTH Royal Institute of Technology, Stockholm, Sweden
arXiv:2410.00801 [cs.DC] (1 Oct 2024)
@misc{schieffer2024understandingdatamovementamd,
      title={Understanding Data Movement in AMD Multi-GPU Systems with Infinity Fabric},
      author={Gabin Schieffer and Ruimin Shi and Stefano Markidis and Andreas Herten and Jennifer Faj and Ivy Peng},
      year={2024},
      eprint={2410.00801},
      archivePrefix={arXiv},
      primaryClass={cs.DC},
      url={https://arxiv.org/abs/2410.00801}
}
Modern GPU systems are constantly evolving to meet the needs of compute-intensive applications in scientific and machine learning domains. However, a gap typically remains between the hardware's capacity and the performance applications actually achieve. This work aims to provide a better understanding of the Infinity Fabric interconnects on AMD GPUs and CPUs. We propose a test and evaluation methodology for characterizing the performance of data movement on multi-GPU systems, exercising different communication options on AMD MI250X GPUs, including point-to-point and collective communication, as well as memory allocation strategies between GPUs and the host CPU. In a single-node setup with four GPUs, we show that direct peer-to-peer memory accesses between GPUs and the RCCL collective communication library outperform MPI-based solutions in terms of memory/communication latency and bandwidth. Our test and evaluation method serves as a basis for validating memory and communication strategies on a given system and for improving applications on AMD multi-GPU computing systems.
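For context, the direct peer-to-peer path the paper evaluates is exposed through the HIP runtime. Below is a minimal sketch, assuming two GPUs (or MI250X GCDs) visible as devices 0 and 1, of timing repeated peer-to-peer copies over Infinity Fabric with hipMemcpyPeerAsync; the device IDs, buffer size, iteration count, and HIP_CHECK helper are illustrative assumptions, not the authors' benchmark code.

```cpp
// Minimal sketch: timing a direct GPU-to-GPU (peer-to-peer) copy with HIP.
// Build (assumption): hipcc p2p_sketch.cpp -o p2p_sketch
#include <hip/hip_runtime.h>
#include <cstdio>
#include <cstdlib>

#define HIP_CHECK(call)                                                \
  do {                                                                 \
    hipError_t e = (call);                                             \
    if (e != hipSuccess) {                                             \
      fprintf(stderr, "HIP error %s at %s:%d\n",                       \
              hipGetErrorString(e), __FILE__, __LINE__);               \
      exit(EXIT_FAILURE);                                              \
    }                                                                  \
  } while (0)

int main() {
  const int src = 0, dst = 1;        // two devices on the same node (assumed IDs)
  const size_t bytes = 1ull << 28;   // 256 MiB transfer (illustrative size)
  const int iters = 100;

  int canAccess = 0;
  HIP_CHECK(hipDeviceCanAccessPeer(&canAccess, src, dst));
  if (!canAccess) {
    fprintf(stderr, "Peer access not supported between %d and %d\n", src, dst);
    return 1;
  }

  // Enable peer access in both directions and allocate one buffer per device.
  void *srcBuf = nullptr, *dstBuf = nullptr;
  HIP_CHECK(hipSetDevice(src));
  HIP_CHECK(hipDeviceEnablePeerAccess(dst, 0));
  HIP_CHECK(hipMalloc(&srcBuf, bytes));
  HIP_CHECK(hipSetDevice(dst));
  HIP_CHECK(hipDeviceEnablePeerAccess(src, 0));
  HIP_CHECK(hipMalloc(&dstBuf, bytes));

  // Time repeated peer copies with HIP events on the source device.
  HIP_CHECK(hipSetDevice(src));
  hipEvent_t start, stop;
  HIP_CHECK(hipEventCreate(&start));
  HIP_CHECK(hipEventCreate(&stop));
  HIP_CHECK(hipEventRecord(start, 0));
  for (int i = 0; i < iters; ++i)
    HIP_CHECK(hipMemcpyPeerAsync(dstBuf, dst, srcBuf, src, bytes, 0));
  HIP_CHECK(hipEventRecord(stop, 0));
  HIP_CHECK(hipEventSynchronize(stop));

  float ms = 0.0f;
  HIP_CHECK(hipEventElapsedTime(&ms, start, stop));
  double gbps = (double)bytes * iters / (ms * 1e-3) / 1e9;
  printf("Peer-to-peer bandwidth %d -> %d: %.1f GB/s\n", src, dst, gbps);

  HIP_CHECK(hipSetDevice(src));
  HIP_CHECK(hipFree(srcBuf));
  HIP_CHECK(hipSetDevice(dst));
  HIP_CHECK(hipFree(dstBuf));
  return 0;
}
```

For the collective case compared in the paper, RCCL exposes the same API as NCCL, so an analogous sketch would replace the copy loop with ncclAllReduce calls over a communicator spanning the four GPUs.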
October 6, 2024 by hgpu