
GPU Performance Portability needs Autotuning

Burkhard Ringlein, Thomas Parnell, Radu Stoica
IBM Research Europe, Säumerstrasse 4, 8803 Rüschlikon, Switzerland
arXiv:2505.03780 [cs.AR] (30 Apr 2025)

As LLMs grow in complexity, achieving state-of-the-art performance requires tight co-design across algorithms, software, and hardware. Today's reliance on a single dominant platform limits portability, creates vendor lock-in, and raises barriers for new AI hardware. In this work, we make the case for combining just-in-time (JIT) compilation with kernel parameter autotuning to enable portable, state-of-the-art LLM execution performance without code changes. Focusing on flash attention, a widespread performance-critical LLM kernel, we demonstrate that this approach explores up to 15x more kernel parameter configurations, produces significantly more diverse code across multiple dimensions, and even outperforms vendor-optimized implementations by up to 230%, all while reducing kernel code size by 70x and eliminating manual code optimizations. Our results highlight autotuning as a promising path to unlocking model portability across GPU vendors.
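To make the JIT-plus-autotuning approach described in the abstract concrete, the following is a minimal sketch using Triton's public autotuning API on a toy vector-add kernel. The kernel, block sizes, and warp counts are placeholders chosen for illustration only; they are not the paper's flash attention kernels or tuning spaces. The mechanism shown (declaring candidate parameter configurations and letting the JIT compiler benchmark and cache the best one per problem size) is the general idea the paper builds on.

```python
# Illustrative sketch: JIT compilation + kernel parameter autotuning with Triton.
# The kernel and configs below are hypothetical examples, not from the paper.
import torch
import triton
import triton.language as tl

# Candidate kernel parameter configurations; the autotuner benchmarks each one
# the first time the kernel is launched for a new value of `n_elements`.
@triton.autotune(
    configs=[
        triton.Config({"BLOCK_SIZE": 128}, num_warps=4),
        triton.Config({"BLOCK_SIZE": 256}, num_warps=4),
        triton.Config({"BLOCK_SIZE": 512}, num_warps=8),
        triton.Config({"BLOCK_SIZE": 1024}, num_warps=8),
    ],
    key=["n_elements"],  # re-tune when the problem size changes
)
@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)

def add(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    out = torch.empty_like(x)
    n_elements = out.numel()
    # The grid depends on whichever BLOCK_SIZE the autotuner selects at launch.
    grid = lambda meta: (triton.cdiv(n_elements, meta["BLOCK_SIZE"]),)
    add_kernel[grid](x, y, out, n_elements)
    return out

if __name__ == "__main__":
    x = torch.randn(1 << 20, device="cuda")
    y = torch.randn(1 << 20, device="cuda")
    torch.testing.assert_close(add(x, y), x + y)
```

Because the tuning space is declared as data rather than hand-written per device, the same source can be re-tuned on a different GPU vendor's hardware without code changes, which is the portability argument the abstract makes.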
