Performant Unified GPU Kernels for Portable Singular Value Computation Across Hardware and Precision
Massachusetts Institute of Technology, Cambridge, MA, USA
arXiv:2508.06339 [cs.DC] (8 Aug 2025)
@misc{ringoot2025performant,
  title={Performant Unified GPU Kernels for Portable Singular Value Computation Across Hardware and Precision},
  author={Evelyne Ringoot and Rabab Alomairy and Valentin Churavy and Alan Edelman},
  year={2025},
  eprint={2508.06339},
  archivePrefix={arXiv},
  primaryClass={cs.DC}
}
This paper presents a portable, GPU-accelerated implementation of a QR-based singular value computation algorithm in Julia. The singular value decomposition (SVD) is a fundamental numerical tool in scientific computing and machine learning, providing optimal low-rank matrix approximations. Its importance has grown further in large-scale machine learning pipelines, including large language models (LLMs), where it enables low-rank adaptation (LoRA). The implemented algorithm is based on the classic two-stage QR reduction, which successively reduces the matrix to band form and then to bidiagonal form. Our implementation leverages Julia’s multiple dispatch and metaprogramming capabilities, integrating with the GPUArrays and KernelAbstractions frameworks to provide a unified, type- and hardware-agnostic function. It supports diverse GPU architectures and data types, and is, to our knowledge, the first GPU-accelerated singular value implementation to support Apple Metal GPUs and half precision. Performance results on multiple GPU backends and data types demonstrate that portability does not require sacrificing performance: the unified function outperforms most linear algebra libraries (MAGMA, SLATE, rocSOLVER, oneMKL) for matrix sizes larger than 1024×1024, and achieves 80%-90% of the performance of cuSOLVER for large matrices.
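The paper's two-stage, GPU-parallel reduction is beyond a short sketch, but the core idea it builds on — reducing a matrix to bidiagonal form with orthogonal (Householder) transforms, which leave the singular values unchanged — can be illustrated with a simplified one-stage, dense CPU sketch in NumPy. This is not the paper's Julia implementation; all function names here are illustrative.

```python
import numpy as np

def _left_reflector(B, k):
    # Zero out B[k+1:, k] with a left Householder reflector H = I - 2vv^T/(v.v).
    x = B[k:, k]
    alpha = -np.copysign(np.linalg.norm(x), x[0])
    v = x.copy()
    v[0] -= alpha
    vn = v @ v
    if vn > 0:
        B[k:, k:] -= np.outer(v, (2.0 / vn) * (v @ B[k:, k:]))

def _right_reflector(B, k):
    # Zero out B[k, k+2:] with a right Householder reflector (H is symmetric,
    # so applying it from the right maps row k to a multiple of e1).
    x = B[k, k+1:]
    alpha = -np.copysign(np.linalg.norm(x), x[0])
    v = x.copy()
    v[0] -= alpha
    vn = v @ v
    if vn > 0:
        B[k:, k+1:] -= np.outer((B[k:, k+1:] @ v) * (2.0 / vn), v)

def bidiagonalize(A):
    """One-stage Golub-Kahan bidiagonalization sketch (assumes m >= n).

    Orthogonal transforms preserve singular values, so the returned
    bidiagonal B has the same singular values as A.
    """
    B = np.array(A, dtype=float)
    m, n = B.shape
    for k in range(n):
        _left_reflector(B, k)
        if k < n - 2:
            _right_reflector(B, k)
    # Keep only the diagonal and superdiagonal (discard roundoff noise).
    return np.triu(np.tril(B, 1))

rng = np.random.default_rng(0)
A = rng.standard_normal((6, 4))
B = bidiagonalize(A)
# The singular values of the bidiagonal B match those of A.
assert np.allclose(np.linalg.svd(B, compute_uv=False),
                   np.linalg.svd(A, compute_uv=False))
```

The paper's two-stage variant first reduces to band form with blocked, GPU-friendly QR factorizations and only then to bidiagonal form, which exposes far more parallelism than the one-reflector-at-a-time loop above.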
August 17, 2025 by hgpu