
Enhancing Transformer Performance and Portability through Auto-tuning Frameworks

Patricia Siwinska, Jie Lei, Adrián Castelló, Pedro Alonso-Jordá, Enrique S. Quintana-Ortí
Universitat Politècnica de València, Spain
Research Square, 2025

@article{siwinska2025enhancing,
   title={Enhancing Transformer Performance and Portability through Auto-tuning Frameworks},
   author={Siwinska, Patricia and Lei, Jie and Castell{\'o}, Adri{\'a}n and Alonso-Jord{\'a}, Pedro and Quintana-Ort{\'i}, Enrique S},
   year={2025}
}

Transformer-based models such as BERT and GPT-2 have become the foundation of many modern applications, yet their execution requires substantial computational and memory resources. To address these challenges, recent advances in compiler technology and hardware accelerators have introduced new opportunities for performance portability. In this work, we evaluate JAX and TVM as high-level frameworks that combine a NumPy-like programming model with Just-In-Time (JIT) or Ahead-of-Time (AOT) code optimization and compilation, enabling efficient execution on CPUs and GPUs and, in the case of JAX, on TPUs as well. We present systematic implementations of the core Transformer encoder and decoder blocks in JAX and TVM and compare their automatically optimized code against NumPy and CuPy baselines. Our experimental study covers heterogeneous hardware platforms (AMD CPU, NVIDIA GPUs, and Google TPUs) and multiple arithmetic precisions (FP32, INT8, and INT32). Results show that JAX and TVM deliver significant performance improvements over standard libraries, while reducing the programming effort required to adapt to different hardware. These findings demonstrate the potential of JIT- and AOT-oriented frameworks to serve as a portable and efficient solution for deploying Transformer workloads in diverse computing environments.
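To illustrate the workflow the abstract describes (this is a minimal sketch, not the authors' implementation), the snippet below writes a scaled dot-product attention kernel, one of the core operations inside a Transformer encoder/decoder block, in NumPy-like JAX code and compiles it with jax.jit so XLA can target the available CPU, GPU, or TPU backend. All function names, shapes, and values are illustrative assumptions.

# Hypothetical sketch: NumPy-style attention compiled with jax.jit (not the paper's code).
import jax
import jax.numpy as jnp

def scaled_dot_product_attention(q, k, v):
    # q, k, v: (seq_len, d_model) arrays; computes softmax(Q K^T / sqrt(d)) V
    d = q.shape[-1]
    scores = q @ k.T / jnp.sqrt(d)
    weights = jax.nn.softmax(scores, axis=-1)
    return weights @ v

# jit traces the function once and compiles it with XLA for whichever backend
# (CPU, GPU, or TPU) is available -- the portability mechanism discussed above.
attention_jit = jax.jit(scaled_dot_product_attention)

key_q, key_k, key_v = jax.random.split(jax.random.PRNGKey(0), 3)
q = jax.random.normal(key_q, (128, 64))
k = jax.random.normal(key_k, (128, 64))
v = jax.random.normal(key_v, (128, 64))
out = attention_jit(q, k, v)   # first call compiles; subsequent calls reuse the binary
print(out.shape)               # (128, 64)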