
Iris: First-Class Multi-GPU Programming Experience in Triton

Muhammad Awad, Muhammad Osama, Brandon Potter
Advanced Micro Devices, Inc., Santa Clara, CA, USA
arXiv:2511.12500 [cs.DC], 16 Nov 2025

@misc{awad2025irisfirstclassmultigpuprogramming,
  title={Iris: First-Class Multi-GPU Programming Experience in Triton},
  author={Muhammad Awad and Muhammad Osama and Brandon Potter},
  year={2025},
  eprint={2511.12500},
  archivePrefix={arXiv},
  primaryClass={cs.DC},
  url={https://arxiv.org/abs/2511.12500}
}

Multi-GPU programming traditionally requires developers to navigate complex trade-offs between performance and programmability. High-performance implementations typically rely on low-level HIP/CUDA communication libraries that demand substantial engineering effort for even basic overlap patterns, while simpler abstractions often sacrifice performance. We present Iris, a multi-GPU communication library implemented entirely in Python and Triton that eliminates this trade-off. Iris provides tile-based symmetric memory abstractions that naturally align with Triton's programming model, enabling developers to write single-source kernels that seamlessly interleave computation and communication. We demonstrate a taxonomy of compute-communication overlap patterns, from bulk-synchronous to fine-grained workgroup specialization, that can be implemented with minimal code changes in Iris, often requiring just a few additional lines within the same Triton kernel. Our evaluation shows that Iris achieves near-optimal bandwidth utilization in microbenchmarks and delivers up to 1.79x speedup over PyTorch and RCCL for GEMM+All-Scatter workloads, demonstrating that high-level implementations can match or exceed heavily optimized libraries while dramatically simplifying multi-GPU programming.
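As a rough illustration of the single-source style the abstract describes, the sketch below pairs a compute phase with a store into a peer's symmetric heap inside one Triton kernel. The Iris-side names used here (iris.iris, shmem.zeros, shmem.get_heap_bases, shmem.barrier, and the device-side iris.store) are assumptions modeled on the paper's description, not a verified API; the compute phase is a trivial elementwise stand-in for a GEMM tile.

    # Hypothetical sketch of Iris-style single-source compute + communication.
    # API names below are assumptions based on the paper, not a verified interface.
    import torch
    import triton
    import triton.language as tl
    import iris  # assumed: the Iris package presented in the paper

    @triton.jit
    def scale_and_scatter(src, dst, n, cur_rank, peer_rank, heap_bases,
                          BLOCK: tl.constexpr):
        pid = tl.program_id(0)
        offs = pid * BLOCK + tl.arange(0, BLOCK)
        mask = offs < n
        # Compute phase: elementwise scale stands in for a GEMM tile.
        x = tl.load(src + offs, mask=mask) * 2.0
        # Communication phase, in the same kernel: write the tile into the
        # peer's symmetric buffer (assumed Iris device-side store).
        iris.store(dst + offs, x, cur_rank, peer_rank, heap_bases, mask=mask)

    shmem = iris.iris(heap_size=1 << 30)          # assumed: symmetric-heap setup
    rank = shmem.get_rank()                       # assumed host-side accessors
    world = shmem.get_num_ranks()
    n = 4096
    src = shmem.zeros(n, dtype=torch.float32)     # buffers on the symmetric heap
    dst = shmem.zeros(n, dtype=torch.float32)
    src.fill_(1.0)
    grid = (triton.cdiv(n, 1024),)
    scale_and_scatter[grid](src, dst, n, rank, (rank + 1) % world,
                            shmem.get_heap_bases(), BLOCK=1024)
    shmem.barrier()                               # assumed cross-rank sync

If the abstraction works as described, communication is just another memory operation on the symmetric heap, so moving from a bulk-synchronous pattern to finer-grained overlap amounts to relocating the store within the kernel rather than restructuring the program.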