Joint Training on AMD and NVIDIA GPUs
Zettabyte AI, Inc.
arXiv:2602.18007 [cs.DC] (20 Feb 2026)
@misc{hu2026joint,
  title={Joint Training on AMD and NVIDIA GPUs},
  author={Jon Hu and Thomas Jia and Jing Zhu and Zhendong Yu},
  year={2026},
  eprint={2602.18007},
  archivePrefix={arXiv},
  primaryClass={cs.DC},
  url={https://arxiv.org/abs/2602.18007}
}
As large language models continue to scale, their training demands on compute and system capacity grow rapidly, making single-vendor homogeneous clusters insufficient. This paper presents a technical solution for heterogeneous mixed training in AMD-NVIDIA environments. We first adopt a compatibility-oriented approach based on CPU-Forwarding Communication, with differentiated communication-backend selection across parallel groups and multi-NIC parallel data transfer. To achieve higher performance, we further propose a Device-Direct Communication approach that integrates a CPU-offloading P2P mechanism to enable direct cross-vendor GPU data transfer without host-memory staging. Experiments on LLaMA-8B and Qwen2-7B demonstrate that the proposed Device-Direct Communication approach achieves up to 98% of the throughput of an NVIDIA homogeneous system while preserving training stability and correctness.
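To make the compatibility-oriented approach concrete, the sketch below illustrates how CPU-Forwarding Communication with differentiated backend selection can be expressed in PyTorch: intra-vendor process groups use the vendor-native collective library (NCCL on CUDA, RCCL on ROCm, both exposed under PyTorch's "nccl" backend name), while cross-vendor traffic is staged through host memory and carried by the CPU-side Gloo backend. The rank layout, group split, and helper names here are hypothetical illustrations, not details from the paper, whose implementation is not published in this post.

import torch
import torch.distributed as dist

def init_heterogeneous_groups(rank, world_size):
    # Global group over Gloo: runs on CPU/TCP, so it spans both vendors.
    # Assumes MASTER_ADDR/MASTER_PORT are set in the environment.
    dist.init_process_group(backend="gloo", rank=rank, world_size=world_size)
    # Intra-vendor groups use the vendor-native collective library;
    # PyTorch exposes both NCCL (CUDA) and RCCL (ROCm) as "nccl".
    # The half-and-half rank split is a hypothetical layout, not from the paper.
    nvidia_ranks = list(range(0, world_size // 2))
    amd_ranks = list(range(world_size // 2, world_size))
    nvidia_group = dist.new_group(ranks=nvidia_ranks, backend="nccl")
    amd_group = dist.new_group(ranks=amd_ranks, backend="nccl")
    return nvidia_group, amd_group

def cross_vendor_send(tensor, dst):
    # CPU-forwarding path: stage the GPU tensor in host memory,
    # then transfer it over the Gloo (CPU) backend.
    host_buf = tensor.detach().cpu()
    dist.send(host_buf, dst=dst)

def cross_vendor_recv(shape, dtype, src, device):
    # Receive into pinned host memory, then copy onto the local GPU.
    host_buf = torch.empty(shape, dtype=dtype).pin_memory()
    dist.recv(host_buf, src=src)
    return host_buf.to(device, non_blocking=True)

The host-memory round trip is what the paper's Device-Direct Communication approach removes; a faithful sketch of that path would require the cross-vendor P2P mechanism the paper introduces, so it is not reproduced here.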
March 1, 2026 by hgpu