Optimal Software Pipelining and Warp Specialization for Tensor Core GPUs
Stanford University
arXiv:2512.18134 [cs.PL], 19 Dec 2025
@misc{soi2025optimalsoftwarepipeliningwarp,
  title={Optimal Software Pipelining and Warp Specialization for Tensor Core GPUs},
  author={Rupanshu Soi and Rohan Yadav and Fredrik Kjolstad and Alex Aiken and Maryam Mehri Dehnavi and Michael Garland and Michael Bauer},
  year={2025},
  eprint={2512.18134},
  archivePrefix={arXiv},
  primaryClass={cs.PL},
  url={https://arxiv.org/abs/2512.18134}
}
GPU architectures have continued to grow in complexity, with recent incarnations introducing increasingly powerful fixed-function units for matrix multiplication and data movement to accompany highly parallel general-purpose cores. To fully leverage these machines, software must use sophisticated schedules that maximally utilize all hardware resources. Since realizing such schedules is complex, both programmers and compilers routinely employ program transformations, such as software pipelining (SWP) and warp specialization (WS), to do so in practice. However, determining how best to use SWP and WS in combination is a challenging problem that is currently handled through a mix of brittle compilation heuristics and fallible human intuition, with little insight into the space of solutions. To remedy this situation, we introduce a novel formulation of SWP and WS as a joint optimization problem that can be solved holistically by off-the-shelf constraint solvers. We reify our approach in Twill, the first system that automatically derives optimal SWP and WS schedules for a large class of iterative programs. Twill is heuristic-free, easily extensible to new GPU architectures, and guaranteed to produce optimal schedules. We show that Twill can rediscover, and thereby prove optimal, the SWP and WS schedules manually developed by experts for Flash Attention on both the NVIDIA Hopper and Blackwell GPU architectures.
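To make the core idea concrete, below is a minimal sketch (not Twill's actual model) of how pipeline-stage and warp-group assignment for a toy loop body can be posed as a joint constraint-optimization problem and handed to an off-the-shelf solver. The operation names, latency set, and specific constraints are illustrative assumptions, and the example uses the Z3 Python bindings (pip install z3-solver).

# A toy, assumed formulation: jointly choose a software-pipelining stage and a
# warp group for each operation of a loop body, then let Z3 minimize pipeline
# depth. Operation names, latencies, and constraints are illustrative only and
# are not taken from the paper.
from z3 import Int, Optimize, Implies, sat

# Toy loop body: two async loads feed a tensor-core MMA, whose result is stored.
ops = ["load_a", "load_b", "mma", "store"]
deps = [("load_a", "mma"), ("load_b", "mma"), ("mma", "store")]
long_latency = {"load_a", "load_b", "mma"}   # ops whose latency we want hidden
copy_like = {"load_a", "load_b", "store"}    # candidates for a data-movement warp group

opt = Optimize()
stage = {o: Int(f"stage_{o}") for o in ops}  # software-pipelining stage of each op
warp = {o: Int(f"warp_{o}") for o in ops}    # 0 = data-movement warps, 1 = compute warps

for o in ops:
    opt.add(stage[o] >= 0, warp[o] >= 0, warp[o] <= 1)

for p, c in deps:
    # A consumer can never run in an earlier stage than its producer.
    opt.add(stage[c] >= stage[p])
    # Hide long-latency producers by pushing their consumers to a later stage.
    if p in long_latency:
        opt.add(stage[c] >= stage[p] + 1)
    # Cross-warp-group communication needs a buffer, hence a stage boundary.
    opt.add(Implies(warp[p] != warp[c], stage[c] >= stage[p] + 1))

# Warp specialization: keep the tensor-core op out of the data-movement group.
opt.add(warp["mma"] == 1)
for o in copy_like:
    opt.add(warp[o] == 0)

# Minimize pipeline depth, i.e. the number of in-flight stages (buffers).
depth = Int("depth")
for o in ops:
    opt.add(depth >= stage[o])
opt.minimize(depth)

if opt.check() == sat:
    m = opt.model()
    for o in ops:
        print(f"{o}: stage {m[stage[o]]}, warp group {m[warp[o]]}")
    print("pipeline depth:", m[depth].as_long() + 1)

Under these assumed constraints, the solver places the loads, the MMA, and the store in successive stages, with the loads and store in the data-movement warp group and the MMA in the compute group, which resembles the shape of hand-written warp-specialized pipelines; the paper's actual formulation models a much richer space of hardware resources and schedules.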
December 29, 2025 by hgpu