Iris: First-Class Multi-GPU Programming Experience in Triton
Advanced Micro Devices, Inc., Santa Clara, CA, USA
arXiv:2511.12500 [cs.DC] (16 Nov 2025)
@misc{awad2025irisfirstclassmultigpuprogramming,
  title         = {Iris: First-Class Multi-GPU Programming Experience in Triton},
  author        = {Muhammad Awad and Muhammad Osama and Brandon Potter},
  year          = {2025},
  eprint        = {2511.12500},
  archivePrefix = {arXiv},
  primaryClass  = {cs.DC},
  url           = {https://arxiv.org/abs/2511.12500}
}
Multi-GPU programming traditionally requires developers to navigate complex trade-offs between performance and programmability. High-performance implementations typically rely on low-level HIP/CUDA communication libraries that demand substantial engineering effort for even basic overlap patterns, while simpler abstractions often sacrifice performance. We present Iris, a multi-GPU communication library implemented entirely in Python and Triton that eliminates this trade-off. Iris provides tile-based symmetric memory abstractions that naturally align with Triton's programming model, enabling developers to write single-source kernels that seamlessly interleave computation and communication. We demonstrate a taxonomy of compute-communication overlap patterns, from bulk-synchronous to fine-grained workgroup specialization, that can be implemented with minimal code changes in Iris, often requiring just a few additional lines within the same Triton kernel. Our evaluation shows that Iris achieves near-optimal bandwidth utilization in microbenchmarks and delivers up to 1.79x speedup over PyTorch and RCCL for GEMM+All-Scatter workloads, demonstrating that high-level implementations can match or exceed heavily optimized libraries while dramatically simplifying multi-GPU programming.
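To make the "single-source kernel" idea concrete, below is a minimal Triton sketch of the tile-level pattern the abstract describes: each workgroup computes one GEMM tile and then immediately writes ("scatters") it to a destination buffer in the same kernel. This is not Iris's actual API; the store at the end targets a local tensor so the example runs on one GPU, whereas with a symmetric-heap library like Iris the same store could be directed at a peer rank's buffer. All names and the launch configuration here are illustrative assumptions.

```python
# Minimal sketch of a fused GEMM + scatter tile kernel in plain Triton.
# Not the Iris API: the final tl.store stands in for the remote write that
# a symmetric-heap abstraction would perform for the All-Scatter step.
import torch
import triton
import triton.language as tl


@triton.jit
def gemm_scatter_tile(a_ptr, b_ptr, c_ptr,
                      M, N, K,
                      stride_am, stride_ak,
                      stride_bk, stride_bn,
                      stride_cm, stride_cn,
                      BLOCK_M: tl.constexpr, BLOCK_N: tl.constexpr,
                      BLOCK_K: tl.constexpr):
    pid_m = tl.program_id(0)
    pid_n = tl.program_id(1)
    offs_m = pid_m * BLOCK_M + tl.arange(0, BLOCK_M)
    offs_n = pid_n * BLOCK_N + tl.arange(0, BLOCK_N)
    offs_k = tl.arange(0, BLOCK_K)

    a_ptrs = a_ptr + offs_m[:, None] * stride_am + offs_k[None, :] * stride_ak
    b_ptrs = b_ptr + offs_k[:, None] * stride_bk + offs_n[None, :] * stride_bn

    # Compute phase: accumulate one output tile.
    acc = tl.zeros((BLOCK_M, BLOCK_N), dtype=tl.float32)
    for k in range(0, K, BLOCK_K):
        a = tl.load(a_ptrs, mask=(offs_k[None, :] + k) < K, other=0.0)
        b = tl.load(b_ptrs, mask=(offs_k[:, None] + k) < K, other=0.0)
        acc += tl.dot(a, b)
        a_ptrs += BLOCK_K * stride_ak
        b_ptrs += BLOCK_K * stride_bk

    # Communication phase, fused in the same kernel: write the finished tile
    # to the destination buffer. Under a symmetric heap, this pointer could
    # translate to another GPU's memory; here it is an ordinary local store.
    c_ptrs = c_ptr + offs_m[:, None] * stride_cm + offs_n[None, :] * stride_cn
    mask = (offs_m[:, None] < M) & (offs_n[None, :] < N)
    tl.store(c_ptrs, acc, mask=mask)


# Host-side launch on a single device to show the tile/grid mapping.
if __name__ == "__main__":
    M, N, K = 512, 512, 512
    a = torch.randn((M, K), device="cuda", dtype=torch.float32)
    b = torch.randn((K, N), device="cuda", dtype=torch.float32)
    c = torch.empty((M, N), device="cuda", dtype=torch.float32)
    grid = (triton.cdiv(M, 64), triton.cdiv(N, 64))
    gemm_scatter_tile[grid](a, b, c, M, N, K,
                            a.stride(0), a.stride(1),
                            b.stride(0), b.stride(1),
                            c.stride(0), c.stride(1),
                            BLOCK_M=64, BLOCK_N=64, BLOCK_K=32)
    # Loose tolerance since tl.dot may use TF32 accumulation.
    torch.testing.assert_close(c, a @ b, rtol=1e-2, atol=1e-1)
```

In the bulk-synchronous pattern from the paper's taxonomy, the compute and communication phases would instead live in separate kernels separated by a barrier; the point of the fused form above is that moving between the two patterns is a matter of a few lines inside one Triton kernel rather than a rewrite against a separate communication library.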
November 23, 2025 by hgpu