Sawtooth Wavefront Reordering: Enhanced CuTile FlashAttention on NVIDIA GB10

Yifan Zhu, Yekai Pan, Chen Ding
University of Rochester, USA
arXiv:2601.16032 [cs.PF], (22 Jan 2026)

@misc{zhu2026sawtooth,
   title={Sawtooth Wavefront Reordering: Enhanced CuTile FlashAttention on NVIDIA GB10},
   author={Yifan Zhu and Yekai Pan and Chen Ding},
   year={2026},
   eprint={2601.16032},
   archivePrefix={arXiv},
   primaryClass={cs.PF},
   url={https://arxiv.org/abs/2601.16032}
}


High-performance attention kernels are essential for Large Language Models. This paper presents an analysis of the memory behavior of CuTile-based Flash Attention and a technique to improve its cache performance. In particular, our analysis on the NVIDIA GB10 (Grace Blackwell) identifies the main cause of L2 cache misses. Leveraging this insight, we introduce a new programming technique called Sawtooth Wavefront Reordering that reduces L2 misses. We validate it in both CUDA and CuTile, observing a reduction of 50% or more in L2 misses and a throughput increase of up to 60% on GB10.
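
The abstract does not spell out the reordering itself, so the following is only a toy illustration of why a sawtooth (alternating-direction) traversal can cut cache misses, not the paper's kernel. It assumes a fully associative LRU cache and a workload that sweeps the same set of tiles repeatedly, as successive query blocks in attention sweep the same key/value tiles. The function names, tile counts, and cache capacity are all hypothetical.

```python
from collections import OrderedDict


def count_misses(trace, capacity):
    """Count misses of a fully associative LRU cache over a trace of tile ids."""
    cache = OrderedDict()  # keys ordered from least to most recently used
    misses = 0
    for tile in trace:
        if tile in cache:
            cache.move_to_end(tile)  # hit: mark as most recently used
        else:
            misses += 1
            cache[tile] = True
            if len(cache) > capacity:
                cache.popitem(last=False)  # evict the least recently used tile
    return misses


def cyclic_trace(num_tiles, passes):
    """Every pass visits tiles 0..num_tiles-1 in the same forward order."""
    return [t for _ in range(passes) for t in range(num_tiles)]


def sawtooth_trace(num_tiles, passes):
    """Odd-numbered passes run backward, so each pass starts on the tiles
    the previous pass touched last (assumed reading of 'sawtooth')."""
    trace = []
    for p in range(passes):
        if p % 2 == 0:
            trace.extend(range(num_tiles))
        else:
            trace.extend(range(num_tiles - 1, -1, -1))
    return trace


if __name__ == "__main__":
    num_tiles, capacity, passes = 8, 4, 4  # hypothetical sizes
    print("cyclic  :", count_misses(cyclic_trace(num_tiles, passes), capacity))
    print("sawtooth:", count_misses(sawtooth_trace(num_tiles, passes), capacity))
```

In this model, when the cache holds fewer tiles than one pass touches, the cyclic order misses on every access, while the sawtooth order reuses the tiles still resident from the end of the previous pass. Real L2 behavior on GB10 depends on associativity, replacement policy, and concurrent thread blocks, none of which this sketch models.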