GPU Performance Portability needs Autotuning

Burkhard Ringlein, Thomas Parnell, Radu Stoica
IBM Research Europe, Säumerstrasse 4, 8803 Rüschlikon, Switzerland
arXiv:2505.03780 [cs.AR] (30 Apr 2025)

@misc{ringlein2025gpuperformanceportabilityneeds,
   title={GPU Performance Portability needs Autotuning},
   author={Burkhard Ringlein and Thomas Parnell and Radu Stoica},
   year={2025},
   eprint={2505.03780},
   archivePrefix={arXiv},
   primaryClass={cs.AR},
   url={https://arxiv.org/abs/2505.03780}
}

As LLMs grow in complexity, achieving state-of-the-art performance requires tight co-design across algorithms, software, and hardware. Today’s reliance on a single dominant platform limits portability, creates vendor lock-in, and raises barriers for new AI hardware. In this work, we make the case for combining just-in-time (JIT) compilation with kernel parameter autotuning to enable portable, state-of-the-art performance LLM execution without code changes. Focusing on flash attention — a widespread performance-critical LLM kernel — we demonstrate that this approach explores up to 15x more kernel parameter configurations, produces significantly more diverse code across multiple dimensions, and even outperforms vendor-optimized implementations by up to 230%, all while reducing kernel code size by 70x and eliminating manual code optimizations. Our results highlight autotuning as a promising path to unlocking model portability across GPU vendors.
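
The following is a minimal, illustrative sketch (not taken from the paper) of the mechanism the abstract describes: a JIT-compiled GPU kernel whose parameter configuration is selected automatically by an autotuner. It uses Triton's @triton.autotune and @triton.jit decorators on a simple vector-add kernel; the BLOCK_SIZE and num_warps values are arbitrary example configurations, and the paper itself studies flash attention rather than this toy kernel.

   import torch
   import triton
   import triton.language as tl

   # Candidate kernel-parameter configurations. At JIT time the autotuner
   # benchmarks each one for a given problem size (the 'key') and caches
   # the fastest.
   _CONFIGS = [
       triton.Config({"BLOCK_SIZE": bs}, num_warps=w)
       for bs in (256, 512, 1024, 2048)
       for w in (2, 4, 8)
   ]

   @triton.autotune(configs=_CONFIGS, key=["n_elements"])
   @triton.jit
   def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
       pid = tl.program_id(0)
       offs = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
       mask = offs < n_elements
       x = tl.load(x_ptr + offs, mask=mask)
       y = tl.load(y_ptr + offs, mask=mask)
       tl.store(out_ptr + offs, x + y, mask=mask)

   def add(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
       out = torch.empty_like(x)
       n_elements = out.numel()
       # The grid size depends on the BLOCK_SIZE the autotuner picks, so it
       # is expressed as a function of the selected meta-parameters.
       grid = lambda meta: (triton.cdiv(n_elements, meta["BLOCK_SIZE"]),)
       add_kernel[grid](x, y, out, n_elements)
       return out

Because the kernel source is compiled just in time, the same code can be retuned and recompiled for any GPU backend the compiler targets, which is the portability argument the abstract makes.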
