high performance computing on graphics processing units: hgpu.org

Harnessing Batched BLAS/LAPACK Kernels on GPUs for Parallel Solutions of Block Tridiagonal Systems

David Jin, Alexis Montoison, Sungho Shin

Massachusetts Institute of Technology

arXiv:2509.03015 [cs.MS], (3 Sep 2025)

DOI:10.48550/arXiv.2509.03015

@misc{jin2025harnessingbatchedblaslapackkernels,
  title={Harnessing Batched BLAS/LAPACK Kernels on GPUs for Parallel Solutions of Block Tridiagonal Systems},
  author={David Jin and Alexis Montoison and Sungho Shin},
  year={2025},
  eprint={2509.03015},
  archivePrefix={arXiv},
  primaryClass={cs.MS},
  url={https://arxiv.org/abs/2509.03015}
}

Source code package: TBD-GPU

We present a GPU implementation for the factorization and solution of block-tridiagonal symmetric positive definite linear systems, which commonly arise in time-dependent estimation and optimal control problems. Our method employs a recursive algorithm based on Schur complement reduction, transforming the system into a hierarchy of smaller, independent blocks that can be solved efficiently in parallel using batched BLAS/LAPACK routines. While batched routines have been used in sparse solvers, our approach applies these kernels in a tailored way by exploiting the block-tridiagonal structure, which is known in advance. Performance benchmarks based on our open-source, cross-platform implementation, TBD-GPU, demonstrate the advantages of this tailored use: it achieves substantial speed-ups over state-of-the-art CPU direct solvers, including CHOLMOD and HSL MA57, while remaining competitive with NVIDIA cuDSS. However, the current implementation still calls the batched routines sequentially at each recursion level, and the block size must be large enough to amortize kernel launch overhead.
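The recursive Schur complement idea in the abstract can be illustrated with a small CPU sketch in NumPy. This is not the TBD-GPU implementation: it assumes an odd/even elimination ordering (one possible way to obtain independent blocks), it uses NumPy's stacked `np.linalg.solve` as a stand-in for batched Cholesky kernels, and the names `solve_block_tridiag`, `assemble_dense`, and `random_spd_system` are hypothetical helpers introduced only for this example.

```python
import numpy as np

def solve_block_tridiag(D, E, b):
    """Solve a symmetric block-tridiagonal system A x = b recursively.

    D: (N, n, n) diagonal blocks; E: (N-1, n, n) sub-diagonal blocks with
    E[i] = A[i+1, i]; b: (N, n). Returns x with shape (N, n).
    Assumption: even-indexed blocks are eliminated at each level.
    """
    N, n = b.shape
    if N == 1:
        return np.linalg.solve(D[0], b[0])[None, :]

    De, be = D[0::2], b[0::2]      # eliminated blocks (mutually independent)
    Dk, bk = D[1::2], b[1::2]      # kept blocks (form the reduced system)
    ne, nk = De.shape[0], Dk.shape[0]
    m = ne - 1                     # kept blocks that have a right even neighbour

    # Couplings of even block 2k to its kept neighbours (zero where absent).
    Wl = np.zeros_like(De)         # Wl[k] = A[2k, 2k-1]
    Wr = np.zeros_like(De)         # Wr[k] = A[2k, 2k+1]
    Wl[1:] = E[1::2]
    Wr[:nk] = E[0::2].transpose(0, 2, 1)

    # One stacked ("batched") solve per level: De^{-1} [be | Wl | Wr].
    S = np.linalg.solve(De, np.concatenate([be[:, :, None], Wl, Wr], axis=2))
    ye, Gl, Gr = S[:, :, 0], S[:, :, 1:1 + n], S[:, :, 1 + n:]

    # Schur complement: reduced block-tridiagonal system on the kept blocks.
    Dn = Dk - np.einsum('kji,kjl->kil', Wr[:nk], Gr[:nk])
    Dn[:m] -= np.einsum('kji,kjl->kil', Wl[1:], Gl[1:])
    bn = bk - np.einsum('kji,kj->ki', Wr[:nk], ye[:nk])
    bn[:m] -= np.einsum('kji,kj->ki', Wl[1:], ye[1:])
    En = -np.einsum('kji,kjl->kil', Wr[1:nk], Gl[1:nk])

    xk = solve_block_tridiag(Dn, En, bn)   # recurse on about half the blocks

    # Back-substitute to recover the eliminated blocks.
    xe = ye.copy()
    xe[1:] -= np.einsum('kij,kj->ki', Gl[1:], xk[:m])
    xe[:nk] -= np.einsum('kij,kj->ki', Gr[:nk], xk)

    x = np.empty_like(b)
    x[0::2], x[1::2] = xe, xk
    return x

def assemble_dense(D, E):
    """Build the dense matrix, used here only to check the result."""
    N, n, _ = D.shape
    A = np.zeros((N * n, N * n))
    for i in range(N):
        A[i*n:(i+1)*n, i*n:(i+1)*n] = D[i]
    for i in range(N - 1):
        A[(i+1)*n:(i+2)*n, i*n:(i+1)*n] = E[i]
        A[i*n:(i+1)*n, (i+1)*n:(i+2)*n] = E[i].T
    return A

def random_spd_system(N, n, seed=0):
    """Random diagonally dominant (hence SPD) block-tridiagonal test system."""
    rng = np.random.default_rng(seed)
    M = rng.uniform(-1, 1, (N, n, n))
    D = 0.5 * (M + M.transpose(0, 2, 1)) + 4 * n * np.eye(n)
    E = rng.uniform(-1, 1, (N - 1, n, n))
    return D, E, rng.standard_normal((N, n))

D, E, b = random_spd_system(N=7, n=3)
x = solve_block_tridiag(D, E, b)
x_ref = np.linalg.solve(assemble_dense(D, E), b.ravel()).reshape(b.shape)
assert np.allclose(x, x_ref)
```

Each recursion level issues one stacked solve over all eliminated blocks and roughly halves the number of remaining blocks, which mirrors the two limitations the abstract names: the levels run one after another, and each stacked call only pays off when the blocks are large enough to hide per-call overhead.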

Tags: AMD Radeon Instinct MI300X, ATI, Benchmarking, BLAS, Computer science, CUDA, Factorization, Julia, nVidia, nVidia H200, Package, ROCm

September 7, 2025 by hgpu
