High Performance Matrix Multiplication
arXiv:2509.04594 [cs.PF], (4 Sep 2025)
@misc{davis2025high,
title={High Performance Matrix Multiplication},
author={Ethan Davis},
year={2025},
eprint={2509.04594},
archivePrefix={arXiv},
primaryClass={cs.PF}
}
Matrix multiplication is the foundation of much of the success of high performance technologies like deep learning, scientific simulations, and video graphics. High level programming languages like Python and R rely on highly optimized low level libraries, such as the Basic Linear Algebra Subprograms (BLAS), for core linear algebra operations like matrix multiplication. This paper compares the performance of five matrix multiplication algorithms implemented with CuBLAS, CUDA, BLAS, OpenMP, and C++ Threads. We find statistical significance, with a p-value below 5e-12, supporting the hypothesis that for square matrices whose dimension is at least 10,000, the performance ranking as measured in floating point operations per second (FLOPS), from fastest to slowest, is CuBLAS, CUDA, BLAS, OpenMP, and C++ Threads.
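As a rough illustration of one of the five compared approaches, the sketch below shows an OpenMP-parallel square matrix multiply timed and reported in FLOPS as 2*n^3 divided by the elapsed seconds. This is not the paper's benchmark code; the matrix size, loop ordering, and timing method are illustrative assumptions (the paper's measurements use n of at least 10,000).

#include <chrono>
#include <cstdio>
#include <vector>
#include <omp.h>

int main() {
    const int n = 1024;  // small n for a quick demo; the paper benchmarks n >= 10,000
    std::vector<float> A(n * n, 1.0f), B(n * n, 1.0f), C(n * n, 0.0f);

    auto start = std::chrono::steady_clock::now();

    // i-k-j loop order keeps the innermost accesses contiguous in B and C.
    #pragma omp parallel for
    for (int i = 0; i < n; ++i)
        for (int k = 0; k < n; ++k) {
            float a = A[i * n + k];
            for (int j = 0; j < n; ++j)
                C[i * n + j] += a * B[k * n + j];
        }

    auto stop = std::chrono::steady_clock::now();
    double seconds = std::chrono::duration<double>(stop - start).count();

    // One multiply and one add per inner iteration: 2*n^3 floating point operations.
    double flops = 2.0 * n * n * n / seconds;
    std::printf("n=%d  time=%.3f s  GFLOPS=%.2f\n", n, seconds, flops / 1e9);
    return 0;
}

Compiled with -fopenmp, this gives a baseline FLOPS figure for the CPU-threaded approaches; the CuBLAS, CUDA, and BLAS variants in the paper replace the loop nest with library or kernel calls but can be measured the same way.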
September 14, 2025 by hgpu