Sawtooth Wavefront Reordering: Enhanced CuTile FlashAttention on NVIDIA GB10
University of Rochester, USA
arXiv:2601.16032 [cs.PF] (22 Jan 2026)
@misc{zhu2026sawtooth,
  title={Sawtooth Wavefront Reordering: Enhanced CuTile FlashAttention on NVIDIA GB10},
  author={Yifan Zhu and Yekai Pan and Chen Ding},
  year={2026},
  eprint={2601.16032},
  archivePrefix={arXiv},
  primaryClass={cs.PF},
  url={https://arxiv.org/abs/2601.16032}
}
High-performance attention kernels are essential for Large Language Models. This paper presents an analysis of the memory behavior of CuTile-based Flash Attention and a technique to improve its cache performance. In particular, our analysis on the NVIDIA GB10 (Grace Blackwell) identifies the main cause of L2 cache misses. Leveraging this insight, we introduce a new programming technique called Sawtooth Wavefront Reordering that reduces L2 misses. We validate it in both CUDA and CuTile, observing a 50% or greater reduction in L2 misses and up to a 60% increase in throughput on GB10.
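The abstract does not spell out the reordering itself, but a plausible reading is that Sawtooth Wavefront Reordering remaps the linear thread-block launch order into a back-and-forth (sawtooth) sweep over the tile grid, so that consecutively scheduled blocks touch key/value tiles that are still resident in L2. The CUDA sketch below only illustrates such an index remap under that assumption; the function name sawtooth_remap and the toy grid dimensions are hypothetical and not taken from the paper.

#include <cstdio>

// Hypothetical sketch, not the paper's implementation: remap a linear CTA index
// into a sawtooth (back-and-forth) traversal of a (row, column) tile grid.
// Consecutive CTAs then revisit neighbouring tiles, which is the kind of
// scheduling change that can improve L2 reuse for attention kernels.
__host__ __device__ inline void sawtooth_remap(int linear_id, int tiles_per_row,
                                               int* row, int* col)
{
    *row = linear_id / tiles_per_row;
    *col = linear_id % tiles_per_row;
    // Flip the column direction on every other row, so the end of one sweep is
    // adjacent to the start of the next.
    if (*row & 1)
        *col = tiles_per_row - 1 - *col;
}

int main()
{
    const int tiles_per_row = 4, rows = 3;  // toy grid for illustration only
    // Print which tile each linearly scheduled CTA would process.
    for (int id = 0; id < tiles_per_row * rows; ++id) {
        int row, col;
        sawtooth_remap(id, tiles_per_row, &row, &col);
        std::printf("CTA %2d -> tile (%d, %d)\n", id, row, col);
    }
    return 0;
}

In a real kernel, the remapped (row, col) pair would replace the direct use of blockIdx when selecting the query tile and head to process; the paper's actual mapping and its CuTile counterpart may differ.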
January 25, 2026 by hgpu