
ParallelKittens: Systematic and Practical Simplification of Multi-GPU AI Kernels

Stuart H. Sul, Simran Arora, Benjamin F. Spector, Christopher Ré
Department of Computer Science, Stanford University
arXiv:2511.13940 [cs.DC], 17 Nov 2025

@misc{sul2025parallelkittenssystematicpracticalsimplification,
   title={ParallelKittens: Systematic and Practical Simplification of Multi-GPU AI Kernels},
   author={Stuart H. Sul and Simran Arora and Benjamin F. Spector and Christopher Ré},
   year={2025},
   eprint={2511.13940},
   archivePrefix={arXiv},
   primaryClass={cs.DC},
   url={https://arxiv.org/abs/2511.13940}
}

Inter-GPU communication has become a major bottleneck for modern AI workloads as models scale and improvements in hardware compute throughput outpace improvements in interconnect bandwidth. Existing systems mitigate this through compute-communication overlap but often fail to meet theoretical peak performance across heterogeneous workloads and new accelerators. Instead of operator-specific techniques, we ask whether a small set of simple, reusable principles can systematically guide the design of optimal multi-GPU kernels. We present ParallelKittens (PK), a minimal CUDA framework that drastically simplifies the development of overlapped multi-GPU kernels. PK extends the ThunderKittens framework and embodies the principles of multi-GPU kernel design through eight core primitives and a unified programming template, derived from a comprehensive analysis of the factors that govern multi-GPU performance: data-transfer mechanisms, resource scheduling, and design overheads. We validate PK on both Hopper and Blackwell architectures. With fewer than 50 lines of device code, PK achieves up to 2.33x speedup for data- and tensor-parallel workloads, 4.08x for sequence-parallel workloads, and 1.22x for expert-parallel workloads.
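The speedups in the abstract come from hiding communication time behind compute time rather than paying for them sequentially. The abstract does not show PK's API, so as a minimal, illustrative cost-model sketch (the timings below are hypothetical, not measurements from the paper):

```python
# Illustrative cost model for compute-communication overlap.
# Not PK's API; the phase times are hypothetical placeholders.
def step_time(t_compute, t_comm, overlapped):
    """Per-step time for one pipeline stage.

    Sequential execution pays for both phases; a fully overlapped
    kernel hides the shorter phase behind the longer one.
    """
    if overlapped:
        return max(t_compute, t_comm)
    return t_compute + t_comm

# Hypothetical communication-bound step: comm takes as long as compute.
t_c, t_m = 1.0, 1.0
sequential = step_time(t_c, t_m, overlapped=False)  # 2.0
overlapped = step_time(t_c, t_m, overlapped=True)   # 1.0
speedup = sequential / overlapped                   # 2.0, the best case

# When one phase dominates, overlap helps less: the attainable
# speedup shrinks toward 1.0 as the ratio becomes lopsided.
print(speedup, step_time(3.0, 1.0, True))
```

This model also shows why overlap alone cannot exceed 2x per step when the two phases are balanced; larger end-to-end gains (such as the 4.08x reported for sequence-parallel workloads) additionally depend on the data-transfer mechanisms and scheduling choices the paper analyzes.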

* * *


HGPU group © 2010-2025 hgpu.org

All rights belong to the respective authors
