A Performance Portable Matrix Free Dense MTTKRP in GenTen
The University of Texas at Austin
arXiv:2510.14891 [cs.MS], (16 Oct 2025)
@misc{kosmacher2025performanceportablematrixfree,
  title={A Performance Portable Matrix Free Dense MTTKRP in GenTen},
  author={Gabriel Kosmacher and Eric T. Phipps and Sivasankaran Rajamanickam},
  year={2025},
  eprint={2510.14891},
  archivePrefix={arXiv},
  primaryClass={cs.MS},
  url={https://arxiv.org/abs/2510.14891}
}
We extend the GenTen tensor decomposition package by introducing an accelerated dense matricized tensor times Khatri-Rao product (MTTKRP), the workhorse kernel for canonical polyadic (CP) tensor decompositions, one that is portable and performant on modern CPU and GPU architectures. In contrast to the state-of-the-art matrix-multiply-based MTTKRP kernels used by Tensor Toolbox, TensorLy, and others, which explicitly form Khatri-Rao matrices, we develop a matrix-free element-wise parallelization approach whose memory cost grows with the rank R like the sum of the tensor dimensions, O(R(m+n+k)), compared to matrix-based methods whose memory cost grows like their product, O(Rmnk). For the largest problem we study, a rank-2000 MTTKRP, the smaller growth rate yields a matrix-free memory cost of just 2% of the matrix-based cost, a 50x improvement. In practice, the reduced memory footprint means our matrix-free MTTKRP can compute a rank-2000 tensor decomposition on a single NVIDIA H100 instead of the six H100s required by a matrix-based MTTKRP. We also compare our optimized matrix-free MTTKRP to baseline matrix-free implementations on different devices, showing a 3x single-device speedup on an Intel 8480+ CPU and an 11x speedup on an H100 GPU. In addition to numerical results, we provide fine-grained performance models for an ideal multi-level cache machine, compare analytical performance predictions to empirical results, and provide a motivated heuristic for selecting an algorithmic hyperparameter.
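To make the memory argument above concrete, the following is a minimal NumPy sketch, not GenTen's Kokkos implementation, with illustrative function names and sizes, of a mode-0 dense MTTKRP computed two ways: a matrix-based version that explicitly forms the Khatri-Rao product, with extra memory growing like O(Rnk), and a matrix-free element-wise contraction whose working memory beyond the tensor is only the factors and the result, O(R(m+n+k)).

# Minimal NumPy sketch (illustrative only, not GenTen's Kokkos code) of the
# mode-0 MTTKRP M[i, r] = sum_{j, k} X[i, j, k] * B[j, r] * C[k, r]
# for a dense tensor X of shape (m, n, k) and factor matrices B (n x R), C (k x R).
import numpy as np

def mttkrp_matrix_based(X, B, C):
    # Matrix-based: explicitly form a Khatri-Rao-type product of the factors,
    # ordered to match the C-order unfolding of X below.
    # The (n*k) x R intermediate makes extra memory grow like O(R*n*k).
    m, n, k = X.shape
    R = B.shape[1]
    KR = np.einsum('jr,kr->jkr', B, C).reshape(n * k, R)  # explicit Khatri-Rao matrix
    X0 = X.reshape(m, n * k)                              # mode-0 unfolding (k varies fastest)
    return X0 @ KR

def mttkrp_matrix_free(X, B, C):
    # Matrix-free: element-wise triple contraction with no Khatri-Rao matrix.
    # Working memory beyond X is just the factors and the m x R result,
    # i.e. O(R*(m+n+k)); np.einsum with its default optimize=False evaluates
    # the contraction directly without forming a large intermediate.
    return np.einsum('ijk,jr,kr->ir', X, B, C)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    m, n, k, R = 30, 40, 50, 8  # small illustrative sizes
    X = rng.standard_normal((m, n, k))
    B = rng.standard_normal((n, R))
    C = rng.standard_normal((k, R))
    assert np.allclose(mttkrp_matrix_based(X, B, C), mttkrp_matrix_free(X, B, C))
    print("matrix-based and matrix-free MTTKRP agree")

For large R, the explicit (n*k) x R Khatri-Rao intermediate dwarfs the factor matrices and the result, which mirrors the single-GPU versus six-GPU footprint contrast reported in the abstract.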
October 19, 2025 by hgpu