high performance computing on graphics processing units: hgpu.org

MAGMA Batched: A Batched BLAS Approach for Small Matrix Factorizations and Applications on GPUs

Tingxing Dong, Azzam Haidar, Piotr Luszczek, Stanimire Tomov, Ahmad Abdelfattah, Jack Dongarra

Innovative Computing Laboratory, University of Tennessee, Knoxville, TN, 37996

ICL Tech Report, 08/2016, 2016

@techreport{ICL970,
  title={MAGMA Batched: A Batched BLAS Approach for Small Matrix Factorizations and Applications on GPUs},
  journal={ICL Tech Report},
  year={2016},
  month={08/2016},
  keywords={Batched, Bi-diagonalization, gpu, Hydrodynamic},
  author={Tingxing Dong and Azzam Haidar and Piotr Luszczek and Stanimire Tomov and Ahmad Abdelfattah and Jack Dongarra}
}

A particularly challenging class of problems arising in many applications, called batched problems, involves linear algebra operations on many small matrices. We proposed and designed batched BLAS (Basic Linear Algebra Subprograms) routines, Level-2 GEMV and Level-3 GEMM, to solve them. We illustrate how to optimize batched GEMV and GEMM to help advanced batched factorizations (e.g. bi-diagonalization) and other BLAS routines (e.g. forward/back substitution) achieve optimal performance on GPUs. Our solutions achieved speedups of up to 2.8-3x compared to CUBLAS and MKL solutions, wherever possible. We applied our batched methodology to a real-world hydrodynamics application by reformulating its tensor operations as batched BLAS GEMV and GEMM operations. Changing 10% of the code yielded a 2.5x speedup and a 1.4x greenup. We accelerated the application and scaled it to 4096 nodes on the Titan supercomputer.
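To make the term "batched" concrete, the following is a minimal reference sketch of what batched GEMM and GEMV compute: the same small operation applied independently to every matrix in a batch. It is pure Python written only to illustrate the semantics; the function names (`gemm`, `gemv`, `batched_gemm`, `batched_gemv`) are hypothetical and are not the MAGMA or CUBLAS API. A GPU implementation such as the one the abstract describes would run the per-matrix work in parallel rather than in a sequential loop.

```python
# Reference semantics only. Names are illustrative assumptions, not a library API.
# Matrices are lists of rows; vectors are flat lists.

def gemm(alpha, a, b, beta, c):
    """Return alpha * (a @ b) + beta * c for one small dense matrix."""
    m, k, n = len(a), len(b), len(b[0])
    return [
        [alpha * sum(a[i][p] * b[p][j] for p in range(k)) + beta * c[i][j]
         for j in range(n)]
        for i in range(m)
    ]

def gemv(alpha, a, x, beta, y):
    """Return alpha * (a @ x) + beta * y for one small dense matrix."""
    return [
        alpha * sum(a[i][j] * x[j] for j in range(len(x))) + beta * y[i]
        for i in range(len(a))
    ]

def batched_gemm(alpha, a_batch, b_batch, beta, c_batch):
    """Apply gemm independently to each (A_i, B_i, C_i) triple in the batch."""
    return [gemm(alpha, a, b, beta, c)
            for a, b, c in zip(a_batch, b_batch, c_batch)]

def batched_gemv(alpha, a_batch, x_batch, beta, y_batch):
    """Apply gemv independently to each (A_i, x_i, y_i) triple in the batch."""
    return [gemv(alpha, a, x, beta, y)
            for a, x, y in zip(a_batch, x_batch, y_batch)]

# A batch of two 2x2 problems.
a_batch = [[[1.0, 2.0], [3.0, 4.0]], [[0.0, 1.0], [1.0, 0.0]]]
b_batch = [[[1.0, 0.0], [0.0, 1.0]], [[5.0, 6.0], [7.0, 8.0]]]
c_batch = [[[0.0, 0.0], [0.0, 0.0]], [[1.0, 1.0], [1.0, 1.0]]]
x_batch = [[1.0, 1.0], [2.0, 3.0]]
y_batch = [[0.0, 0.0], [0.0, 0.0]]

gemm_result = batched_gemm(1.0, a_batch, b_batch, 1.0, c_batch)
gemv_result = batched_gemv(1.0, a_batch, x_batch, 0.0, y_batch)
```

Because every problem in the batch is independent, the outer loop has no data dependencies between iterations, which is what makes the workload a natural fit for a single GPU kernel launch covering the whole batch.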

Tags: BLAS, Computer science, CUDA, Factorization, Fluid dynamics, Linear Algebra, nVidia, Tesla K40

August 23, 2016 by hgpu
