MAGMA Batched: A Batched BLAS Approach for Small Matrix Factorizations and Applications on GPUs
Innovative Computing Laboratory, University of Tennessee, Knoxville, TN, 37996
ICL Tech Report, August 2016
@techreport{ICL970,
  title    = {MAGMA Batched: A Batched BLAS Approach for Small Matrix Factorizations and Applications on GPUs},
  author   = {Tingxing Dong and Azzam Haidar and Piotr Luszczek and Stanimire Tomov and Ahmad Abdelfattah and Jack Dongarra},
  journal  = {ICL Tech Report},
  year     = {2016},
  month    = {August},
  keywords = {Batched, Bi-diagonalization, gpu, Hydrodynamic}
}
A particularly challenging class of problems arising in many applications, known as batched problems, involves linear algebra operations on many small matrices. To address them, we propose and design batched BLAS (Basic Linear Algebra Subprograms) routines, specifically Level-2 GEMV and Level-3 GEMM. We illustrate how to optimize batched GEMV and GEMM so that they support higher-level batched factorizations (e.g., bi-diagonalization) and other BLAS routines (e.g., forward/backward substitution) at optimal performance on GPUs. Our solutions achieve up to 2.8-3x speedups over the corresponding CUBLAS and MKL solutions, where such comparisons are possible. We apply our batched methodology to a real-world hydrodynamics application by reformulating its tensor operations as batched GEMV and GEMM operations; a 2.5x speedup and a 1.4x greenup (energy-efficiency improvement) are obtained by changing only about 10% of the code. We also accelerate and scale the application on the Titan supercomputer up to 4096 nodes.
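The report's optimized kernels are MAGMA's own; as a minimal sketch of the batched BLAS interface the abstract describes, the following CUDA C program performs many small GEMMs in a single call using cuBLAS's cublasDgemmBatched. The matrix size (32x32), batch count (1000), and data values are illustrative assumptions, not figures from the report.

// Batched GEMM sketch: C_i = A_i * B_i for `batch` independent 32x32 problems.
// Compile with: nvcc batched_gemm.cu -lcublas
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>
#include <cublas_v2.h>

int main(void)
{
    const int n = 32;         /* each matrix is n x n (assumed small, fixed size) */
    const int batch = 1000;   /* number of independent small problems (assumed) */
    const size_t bytes = (size_t)n * n * sizeof(double);

    /* Fill one host matrix with ones and reuse it for every A_i and B_i. */
    double *hA = (double *)malloc(bytes);
    for (int i = 0; i < n * n; ++i) hA[i] = 1.0;

    /* One contiguous device slab per operand... */
    double *dA, *dB, *dC;
    cudaMalloc((void **)&dA, bytes * batch);
    cudaMalloc((void **)&dB, bytes * batch);
    cudaMalloc((void **)&dC, bytes * batch);

    /* ...plus the per-matrix pointer arrays the batched API consumes. */
    double **hptr = (double **)malloc(3 * batch * sizeof(double *));
    for (int i = 0; i < batch; ++i) {
        hptr[i]             = dA + (size_t)i * n * n;
        hptr[batch + i]     = dB + (size_t)i * n * n;
        hptr[2 * batch + i] = dC + (size_t)i * n * n;
        cudaMemcpy(hptr[i],         hA, bytes, cudaMemcpyHostToDevice);
        cudaMemcpy(hptr[batch + i], hA, bytes, cudaMemcpyHostToDevice);
    }
    double **dptr;
    cudaMalloc((void **)&dptr, 3 * batch * sizeof(double *));
    cudaMemcpy(dptr, hptr, 3 * batch * sizeof(double *), cudaMemcpyHostToDevice);

    cublasHandle_t handle;
    cublasCreate(&handle);
    const double alpha = 1.0, beta = 0.0;

    /* One call launches all `batch` multiplications instead of `batch`
       separate kernel launches. */
    cublasDgemmBatched(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                       &alpha,
                       (const double * const *)(dptr),         n,
                       (const double * const *)(dptr + batch), n,
                       &beta,
                       dptr + 2 * batch,                       n,
                       batch);
    cudaDeviceSynchronize();

    /* Spot check: every entry of every C_i should equal n. */
    double c00;
    cudaMemcpy(&c00, hptr[2 * batch], sizeof(double), cudaMemcpyDeviceToHost);
    printf("C[0](0,0) = %.1f (expected %d)\n", c00, n);

    cublasDestroy(handle);
    cudaFree(dA); cudaFree(dB); cudaFree(dC); cudaFree(dptr);
    free(hA); free(hptr);
    return 0;
}

Grouping all the small GEMMs into one call amortizes kernel-launch overhead and exposes enough parallelism to saturate the GPU, which is the central idea behind the batched BLAS approach the report develops.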