Sawtooth Wavefront Reordering: Enhanced CuTile FlashAttention on NVIDIA GB10
University of Rochester, USA
arXiv:2601.16032 [cs.PF] (22 Jan 2026)
@misc{zhu2026sawtooth,
  title={Sawtooth Wavefront Reordering: Enhanced CuTile FlashAttention on NVIDIA GB10},
  author={Yifan Zhu and Yekai Pan and Chen Ding},
  year={2026},
  eprint={2601.16032},
  archivePrefix={arXiv},
  primaryClass={cs.PF},
  url={https://arxiv.org/abs/2601.16032}
}
High-performance attention kernels are essential for Large Language Models. This paper presents an analysis of the memory behavior of CuTile-based Flash Attention and a technique to improve its cache performance. In particular, our analysis on the NVIDIA GB10 (Grace Blackwell) identifies the main cause of L2 cache misses. Leveraging this insight, we introduce a new programming technique called Sawtooth Wavefront Reordering that reduces L2 misses. We validate it in both CUDA and CuTile, observing a 50% or greater reduction in L2 misses and up to a 60% increase in throughput on GB10.
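The abstract does not spell out the reordering itself, but a plausible reading is that Sawtooth Wavefront Reordering remaps the linear thread-block launch order into a back-and-forth (sawtooth) sweep over the tile grid, so that consecutively scheduled blocks touch key/value tiles that are still resident in L2. The CUDA sketch below only illustrates such an index remap under that assumption; the function name sawtooth_remap and the toy grid dimensions are hypothetical and not taken from the paper.

#include <cstdio>

// Hypothetical sketch, not the paper's implementation: remap a linear CTA index
// into a sawtooth (back-and-forth) traversal of a (row, column) tile grid.
// Consecutive CTAs then revisit neighbouring tiles, which is the kind of
// scheduling change that can improve L2 reuse for attention kernels.
__host__ __device__ inline void sawtooth_remap(int linear_id, int tiles_per_row,
                                               int* row, int* col)
{
    *row = linear_id / tiles_per_row;
    *col = linear_id % tiles_per_row;
    // Flip the column direction on every other row, so the end of one sweep is
    // adjacent to the start of the next.
    if (*row & 1)
        *col = tiles_per_row - 1 - *col;
}

int main()
{
    const int tiles_per_row = 4, rows = 3;  // toy grid for illustration only
    // Print which tile each linearly scheduled CTA would process.
    for (int id = 0; id < tiles_per_row * rows; ++id) {
        int row, col;
        sawtooth_remap(id, tiles_per_row, &row, &col);
        std::printf("CTA %2d -> tile (%d, %d)\n", id, row, col);
    }
    return 0;
}

In a real kernel, the remapped (row, col) pair would replace the direct use of blockIdx when selecting the query tile and head to process; the paper's actual mapping and its CuTile counterpart may differ.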
January 25, 2026 by hgpu