GPU-Initiated Networking for NCCL
NVIDIA Corporation
arXiv:2511.15076 [cs.DC], 24 Nov 2025
@misc{hamidouche2025gpuinitiatednetworkingnccl,
  title         = {GPU-Initiated Networking for NCCL},
  author        = {Khaled Hamidouche and John Bachan and Pak Markthub and Peter-Jan Gootzen and Elena Agostini and Sylvain Jeaugey and Aamir Shafi and Georgios Theodorakis and Manjunath Gorentla Venkata},
  year          = {2025},
  eprint        = {2511.15076},
  archivePrefix = {arXiv},
  primaryClass  = {cs.DC},
  url           = {https://arxiv.org/abs/2511.15076}
}
Modern AI workloads, especially Mixture-of-Experts (MoE) architectures, increasingly demand low-latency, fine-grained GPU-to-GPU communication with device-side control. Traditional GPU communication follows a host-initiated model, where the CPU orchestrates all communication operations – a characteristic of the CUDA runtime. Although robust for collective operations, applications requiring tight integration of computation and communication can benefit from device-initiated communication that eliminates CPU coordination overhead. NCCL 2.28 introduces the Device API with three operation modes: Load/Store Accessible (LSA) for NVLink/PCIe, Multimem for NVLink SHARP, and GPU-Initiated Networking (GIN) for network RDMA. This paper presents the GIN architecture, design, semantics, and highlights its impact on MoE communication. GIN builds on a three-layer architecture: i) NCCL Core host-side APIs for device communicator setup and collective memory window registration; ii) Device-side APIs for remote memory operations callable from CUDA kernels; and iii) A network plugin architecture with dual semantics (GPUDirect Async Kernel-Initiated and Proxy) for broad hardware support. The GPUDirect Async Kernel-Initiated backend leverages DOCA GPUNetIO for direct GPU-to-NIC communication, while the Proxy backend provides equivalent functionality via lock-free GPU-to-CPU queues over standard RDMA networks. We demonstrate GIN’s practicality through integration with DeepEP, an MoE communication library. Comprehensive benchmarking shows that GIN provides device-initiated communication within NCCL’s unified runtime, combining low-latency operations with NCCL’s collective algorithms and production infrastructure.
November 30, 2025 by hgpu