high performance computing on graphics processing units: hgpu.org

A GEMM interface and implementation on NVIDIA GPUs for multiple small matrices

Chetan Jhurani, Paul Mullowney

Tech-X Corporation, Boulder

We present an interface and an implementation of the General Matrix Multiply (GEMM) routine for multiple small matrices processed simultaneously on NVIDIA graphics processing units (GPUs). We focus on matrix sizes under 16, although the implementation can easily be extended to larger sizes. For single-precision matrices, our implementation is 30% to 600% faster than the batched cuBLAS implementation distributed in the CUDA Toolkit 5.0 on an NVIDIA Tesla K20c. For example, we obtain 104 GFlop/s and 216 GFlop/s when multiplying 100,000 independent matrix pairs of size 10 and 16, respectively. Similar performance improvements are obtained for other sizes, in single and double precision, for real and complex types, and when the number of matrices is smaller. Besides the implementation itself, our different function interface also plays an important role in the improved performance. Applications of this software include Finite Element computation on GPUs.

Tags: BLAS, CUBLAS, CUDA, Dense linear algebra, GEMM, Linear Algebra, nVidia, Parallel programming, Tesla K20

April 9, 2013 by chetan.jhurani
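
To make the operation described in the abstract concrete, here is a minimal CPU reference sketch of a batched GEMM on small square matrices, written in C++. It is not the authors' GPU implementation or interface. The function names (`batched_gemm_ref`, `batched_gemm_flops`), the column-major layout, and the contiguous back-to-back storage of the batch are all assumptions made for illustration, as is the conventional count of 2n^3 floating-point operations per real matrix product.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// CPU reference for a batched GEMM on small square matrices:
//   C[b] = alpha * A[b] * B[b] + beta * C[b]   for b = 0 .. batch-1.
// Assumed layout (hypothetical, not taken from the paper): every matrix is
// n x n, column-major, and the batch is stored back to back, so matrix b
// starts at offset b * n * n.
void batched_gemm_ref(std::size_t n, std::size_t batch, float alpha,
                      const std::vector<float>& A,
                      const std::vector<float>& B,
                      float beta, std::vector<float>& C) {
    const std::size_t stride = n * n;
    assert(A.size() == batch * stride);
    assert(B.size() == batch * stride);
    assert(C.size() == batch * stride);

    for (std::size_t b = 0; b < batch; ++b) {
        const float* a = A.data() + b * stride;
        const float* bm = B.data() + b * stride;
        float* c = C.data() + b * stride;
        for (std::size_t j = 0; j < n; ++j) {        // column of C
            for (std::size_t i = 0; i < n; ++i) {    // row of C
                float acc = 0.0f;
                for (std::size_t k = 0; k < n; ++k) {
                    acc += a[i + k * n] * bm[k + j * n];
                }
                c[i + j * n] = alpha * acc + beta * c[i + j * n];
            }
        }
    }
}

// Conventional flop count for real GEMM: 2 * n^3 per matrix product
// (one multiply and one add per inner-loop step).
double batched_gemm_flops(std::size_t n, std::size_t batch) {
    return 2.0 * static_cast<double>(n) * n * n * static_cast<double>(batch);
}
```

Under the 2n^3 convention, 100,000 products of size 16 amount to about 8.2e8 floating-point operations, so a rate of 216 GFlop/s would correspond to roughly 3.8 ms for the whole batch. Whether the authors count operations this way is not stated on this page.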
