Utilizing Tensor Cores in Futhark
University of Copenhagen, Department of Mathematical Sciences
University of Copenhagen, 2025
@article{kortbaek2025utilizing,
title={Utilizing Tensor Cores in Futhark},
author={Kortb{\ae}k, Kristoffer August and Lejb{\o}lle, Rune Ejnar Bang},
year={2025}
}
Modern hardware has become more heterogeneous, and with the AI boom, specialized hardware, particularly for performing matrix multiplication, has become readily available. In NVIDIA graphics processing units (GPUs), Tensor Cores allow for efficient execution of matrix multiplication routines, which can significantly speed up AI and deep learning workloads, as well as other programs containing matrix multiplication. However, programming the Tensor Cores is not straightforward and often requires adapting code to restrictions and performance guidelines unique to this hardware. The hardware is made more accessible through application-specific libraries such as cuBLAS and cuDNN, but for more general use, specialized CUDA or PTX code targeting the Tensor Cores must be written. To ease the use of Tensor Cores in general, we propose integrating their use into Futhark, a data-parallel array language with a highly optimizing compiler that generates efficient GPU code. Our main contribution is allowing Futhark programs with matrix multiplication in an intra-group kernel to use the Tensor Cores. We evaluate the output of our modified compiler against handwritten CUDA programs that use the Tensor Cores, and against the stock, unmodified compiler, on benchmarks such as the matrix multiplication routine from LU decomposition in Rodinia [1], FlashAttention-like programs [2], and other matrix multiplication programs. The results show that our modified compiler is still considerably slower than the handwritten implementations, but compared to the stock Futhark compiler we see speedups between 1.9x and 60x.
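To give a sense of why targeting the Tensor Cores directly is not straightforward, the sketch below shows the kind of warp-level CUDA code the abstract alludes to, written against the standard nvcuda::wmma API from mma.h. It is a minimal illustration, not code from the thesis: the kernel name and the single-warp launch configuration are assumptions, and it multiplies one 16x16x16 tile with f16 inputs and f32 accumulation, a fragment shape supported across Tensor Core generations.

#include <mma.h>
#include <cuda_fp16.h>
using namespace nvcuda;

// Computes D = A * B for one 16x16 tile (f16 inputs, f32 accumulation).
// Must be launched with at least one full warp (32 threads); the warp
// cooperatively owns each fragment, so no per-thread indexing appears here.
__global__ void wmma_tile_16x16(const half *A, const half *B, float *D) {
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::row_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> c_frag;

    wmma::fill_fragment(c_frag, 0.0f);           // zero the accumulator
    wmma::load_matrix_sync(a_frag, A, 16);       // leading dimension 16
    wmma::load_matrix_sync(b_frag, B, 16);
    wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);  // one Tensor Core MMA
    wmma::store_matrix_sync(D, c_frag, 16, wmma::mem_row_major);
}

Even this toy kernel exposes the restrictions the abstract mentions: fixed fragment shapes, warp-cooperative ownership of data, and layout and leading-dimension requirements on loads and stores; a compiler that emits such code automatically from an ordinary Futhark matrix multiplication spares the programmer these details.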
December 24, 2024 by hgpu