RDMA Point-to-Point Communication for LLM Systems

Nandor Licker, Kevin Hu, Vladimir Zaytsev, Lequn Chen
Perplexity AI
arXiv:2510.27656 [cs.DC], 31 Oct 2025

@misc{licker2025rdmapointtopointcommunicationllm,
   title={RDMA Point-to-Point Communication for LLM Systems},
   author={Nandor Licker and Kevin Hu and Vladimir Zaytsev and Lequn Chen},
   year={2025},
   eprint={2510.27656},
   archivePrefix={arXiv},
   primaryClass={cs.DC},
   url={https://arxiv.org/abs/2510.27656}
}

Emerging Large Language Model (LLM) system patterns, such as disaggregated inference, Mixture-of-Experts (MoE) routing, and asynchronous reinforcement fine-tuning, require flexible point-to-point communication beyond simple collectives. Existing implementations are locked to specific Network Interface Controllers (NICs), hindering integration into inference engines and portability across hardware providers. We present TransferEngine, which bridges the functionality of common NICs to expose a uniform interface. TransferEngine exposes one-sided WriteImm operations with an ImmCounter primitive for completion notification, without ordering assumptions on the network transport, transparently managing multiple NICs per GPU. We demonstrate peak throughput of 400 Gbps on both NVIDIA ConnectX-7 and AWS Elastic Fabric Adapter (EFA). We showcase TransferEngine through three production systems: (1) KvCache transfer for disaggregated inference with dynamic scaling, (2) RL weight updates completing in 1.3 seconds for trillion-parameter models, and (3) an MoE dispatch/combine implementation beating DeepEP decode latency on ConnectX-7, with the first viable latencies on EFA. These results show that portable point-to-point communication complements collectives while avoiding vendor lock-in.
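The WriteImm/ImmCounter pairing is easiest to see in a sketch. The C++ fragment below illustrates the shape of the interface described in the abstract: a one-sided write carrying an immediate value, paired with a receiver-side counter that signals completion by count rather than by arrival order. All names and signatures here (TransferEngine, write_imm, ImmCounter, wait_for) are illustrative assumptions, not the paper's actual API.

// Illustrative sketch only; names and signatures are assumptions, not the
// actual TransferEngine API from the paper.
#include <atomic>
#include <cstddef>
#include <cstdint>

// Completion notification without transport-ordering assumptions: the
// receiver does not track *which* write landed, only *how many* immediates
// have arrived, so writes may be striped across NICs and complete out of
// order.
class ImmCounter {
  std::atomic<uint64_t> count_{0};
public:
  // Invoked by the completion-polling loop for each received immediate.
  void on_immediate() { count_.fetch_add(1, std::memory_order_release); }
  // Spin until `expected` one-sided writes have landed, in any order.
  void wait_for(uint64_t expected) const {
    while (count_.load(std::memory_order_acquire) < expected) { /* poll */ }
  }
};

struct TransferEngine {
  // One-sided RDMA write carrying a 32-bit immediate. A real implementation
  // would post the NIC-specific one-sided write (e.g. RDMA write with
  // immediate on ConnectX, or the closest EFA equivalent) and may split
  // `len` bytes across several NICs attached to the same GPU.
  void write_imm(int peer, const void* src, uint64_t remote_addr,
                 size_t len, uint32_t imm) {
    (void)peer; (void)src; (void)remote_addr; (void)len; (void)imm;
    // ... post to a queue pair on one of the per-GPU NICs ...
  }
};

Under this model, a sender issues N write_imm calls and the receiver calls wait_for(N); because completion is signaled by count alone, neither side depends on the transport delivering the writes in order.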