Utilizing Tensor Cores in Futhark
University of Copenhagen, Department of Mathematical Sciences
University of Copenhagen, 2025
@article{kortbaek2025utilizing,
title={Utilizing Tensor Cores in Futhark},
author={Kortb{\ae}k, Kristoffer August and Lejb{\o}lle, Rune Ejnar Bang},
year={2025}
}
Modern hardware has become more heterogeneous, and with the AI boom, specialized hardware, particularly for performing matrix multiplication, has become readily available. In NVIDIA graphics processing units (GPUs), Tensor Cores allow for efficient execution of matrix multiplication routines, which can significantly speed up AI and deep learning workloads, as well as other programs containing matrix multiplication. However, programming the Tensor Cores is not straightforward and often requires adapting code to restrictions and performance guidelines unique to this hardware. The hardware is made more accessible through application-specific libraries such as cuBLAS and cuDNN, but for more general use, specialized CUDA or PTX code targeting the Tensor Cores must be written. To ease the use of Tensor Cores in general, we propose integrating their use into Futhark, a data-parallel array language with a highly optimizing compiler that generates efficient GPU code. Our main contribution is allowing Futhark programs with matrix multiplication in an intra-group kernel to use the Tensor Cores. We evaluate the output of our modified compiler against handwritten CUDA programs that use the Tensor Cores, and against the stock, unmodified compiler, on benchmarks such as the matrix multiplication routine from LU decomposition in Rodinia [1], FlashAttention-like programs [2], and other matrix multiplication programs. The results show that our modified compiler is still considerably slower than the handwritten implementations, but compared to the stock Futhark compiler we see speedups between 1.9x and 60x.
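To give a sense of why targeting the Tensor Cores directly is not straightforward, the sketch below shows the kind of warp-level CUDA code the abstract alludes to, written against the standard nvcuda::wmma API from mma.h. It is a minimal illustration, not code from the thesis: the kernel name and the single-warp launch configuration are assumptions, and it multiplies one 16x16x16 tile with f16 inputs and f32 accumulation, a fragment shape supported across Tensor Core generations.

#include <mma.h>
#include <cuda_fp16.h>
using namespace nvcuda;

// Computes D = A * B for one 16x16 tile (f16 inputs, f32 accumulation).
// Must be launched with at least one full warp (32 threads); the warp
// cooperatively owns each fragment, so no per-thread indexing appears here.
__global__ void wmma_tile_16x16(const half *A, const half *B, float *D) {
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::row_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> c_frag;

    wmma::fill_fragment(c_frag, 0.0f);           // zero the accumulator
    wmma::load_matrix_sync(a_frag, A, 16);       // leading dimension 16
    wmma::load_matrix_sync(b_frag, B, 16);
    wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);  // one Tensor Core MMA
    wmma::store_matrix_sync(D, c_frag, 16, wmma::mem_row_major);
}

Even this toy kernel exposes the restrictions the abstract mentions: fixed fragment shapes, warp-cooperative ownership of data, and layout and leading-dimension requirements on loads and stores; a compiler that emits such code automatically from an ordinary Futhark matrix multiplication spares the programmer these details.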
December 24, 2024 by hgpu