ParallelKittens: Systematic and Practical Simplification of Multi-GPU AI Kernels
Department of Computer Science, Stanford University
arXiv:2511.13940 [cs.DC], (17 Nov 2025)
@misc{sul2025parallelkittenssystematicpracticalsimplification,
  title={ParallelKittens: Systematic and Practical Simplification of Multi-GPU AI Kernels},
  author={Stuart H. Sul and Simran Arora and Benjamin F. Spector and Christopher Ré},
  year={2025},
  eprint={2511.13940},
  archivePrefix={arXiv},
  primaryClass={cs.DC},
  url={https://arxiv.org/abs/2511.13940}
}
Inter-GPU communication has become a major bottleneck for modern AI workloads as models scale and improvements in hardware compute throughput outpace improvements in interconnect bandwidth. Existing systems mitigate this through compute-communication overlap but often fail to reach theoretical peak performance across heterogeneous workloads and new accelerators. Instead of operator-specific techniques, we ask whether a small set of simple, reusable principles can systematically guide the design of optimal multi-GPU kernels. We present ParallelKittens (PK), a minimal CUDA framework that drastically simplifies the development of overlapped multi-GPU kernels. PK extends the ThunderKittens framework and embodies the principles of multi-GPU kernel design through eight core primitives and a unified programming template, derived from a comprehensive analysis of the factors that govern multi-GPU performance: data-transfer mechanisms, resource scheduling, and design overheads. We validate PK on both Hopper and Blackwell architectures. With fewer than 50 lines of device code, PK achieves up to 2.33x speedup for data- and tensor-parallel workloads, 4.08x for sequence-parallel workloads, and 1.22x for expert-parallel workloads.
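The compute-communication overlap the abstract refers to can be illustrated with plain CUDA streams and events. This is a generic sketch, not the ParallelKittens API (the paper's eight primitives are not described in the abstract); the chunk count, kernel body, and the assumption that GPUs 0 and 1 are peer-access capable are all illustrative.

```cuda
// Generic compute-communication overlap sketch (NOT the PK API).
// Chunk k's result is sent to a peer GPU on a communication stream
// while chunk k+1 is still being computed on a compute stream.
#include <cuda_runtime.h>

__global__ void compute(float* x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] = x[i] * 2.0f + 1.0f;   // stand-in for real work
}

int main() {
    const int n = 1 << 20, chunks = 4, c = n / chunks;
    float *local, *peer;

    cudaSetDevice(1);
    cudaMalloc(&peer, n * sizeof(float));
    cudaSetDevice(0);
    cudaMalloc(&local, n * sizeof(float));
    cudaDeviceEnablePeerAccess(1, 0);       // assumes GPU 0 can reach GPU 1

    cudaStream_t comp, comm;
    cudaStreamCreate(&comp);
    cudaStreamCreate(&comm);
    cudaEvent_t done[chunks];
    for (int k = 0; k < chunks; ++k)
        cudaEventCreateWithFlags(&done[k], cudaEventDisableTiming);

    for (int k = 0; k < chunks; ++k) {
        compute<<<(c + 255) / 256, 256, 0, comp>>>(local + k * c, c);
        cudaEventRecord(done[k], comp);
        // The copy waits only on its own chunk, so compute on chunk k+1
        // proceeds concurrently with the transfer of chunk k.
        cudaStreamWaitEvent(comm, done[k], 0);
        cudaMemcpyPeerAsync(peer + k * c, 1, local + k * c, 0,
                            c * sizeof(float), comm);
    }
    cudaStreamSynchronize(comm);
    return 0;
}
```

Operator-specific systems hand-tune this kind of pipelining per workload; the paper's claim is that a small set of reusable primitives can express it systematically.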
November 30, 2025 by hgpu