high performance computing on graphics processing units: hgpu.org

hgpu.org » Programming » Algorithms » Batched Matrix Computations on Hardware Accelerators

Batched Matrix Computations on Hardware Accelerators

Azzam Haidar, Tingxing Dong, Piotr Luszczek, Stanimire Tomov, Jack Dongarra

University of Tennessee

EuroMPI/Asia 2015 Workshop, 2015

@conference{haidar2015batched,

title={Batched Matrix Computations on Hardware Accelerators},

booktitle={EuroMPI/Asia 2015 Workshop},

year={2015},

month={09/2015},

address={Bordeaux, France},

author={Azzam Haidar and Piotr Luszczek and Stanimire Tomov and Jack Dongarra}

}

Download (PDF)

View

Source

2485

views

Scientific applications require solvers that work on many small size problems that are independent from each other. At the same time, the high-end hardware evolves rapidly and becomes ever more throughput-oriented and thus there is an increasing need for effective approach to develop energy efficient, high-performance codes for these small matrix problems that we call batched factorizations. The many applications that need this functionality could especially benefit from the use of GPUs, which currently are four to five times more energy efficient than multicore CPUs on important scientific workloads. This paper, consequently, describes the development of the most common, one-sided factorizations: Cholesky, LU, and QR for a set of small dense matrices. The algorithms we present together with their implementations are, by design, inherently parallel. In particular, our approach is based on representing the process as a sequence of batched BLAS routines that are executed entirely on a GPU. Importantly, this is unlike the LAPACK and the hybrid MAGMA factorization algorithms that work under drastically different assumptions of hardware design and efficiency of execution of the various computational kernels involved in the implementation. Thus, our approach is more efficient than what works for a combination of multicore CPUs and GPUs for the problems sizes of interest of the application use cases. The paradigm where upon a single chip (a GPU or a CPU) factorizes a single problem at a time is not at all efficient for in our applications’ context. We illustrate all these claims through a detailed performance analysis. With the help of profiling and tracing tools, we guide our development of batched factorizations to achieve up to two-fold speedup and three-fold better energy efficiency as compared against our highly optimized batched CPU implementations based on MKL library. The tested system featured two sockets of Intel Sandy Bridge CPUs and we compared to a batched LU factorizations featured in the CUBLAS library for GPUs, we achieve as high as 2.5x speedup on the NVIDIA K40 GPU.

Tags: Algorithms, Computer science, CUBLAS, CUDA, Factorization, Linear Algebra, nVidia, Tesla K40

April 12, 2015 by hgpu

No votes yet.

Please wait...

Your response

You must be logged in to post a comment.

* * *

high performance computing on graphics processing units: hgpu.org

Batched Matrix Computations on Hardware Accelerators

Your response

Recent source codes

UniCoder: Unified Visual-to-Code Generation via Symbolic Rewards and Reference-Guided Code Optimization

CuFuzz: An API-Knowledge-Graph Coverage-Driven Fuzzing Framework for CUDA Libraries

AutoPass: Evidence-Guided LLM Agents for Compiler Performance Tuning

Probe-and-Refine Tuning of Repository Guidance for AI Coding Agents

CUDAnalyst (CUDA + Analyst)

CodegenBench

KernelBenchX: A Comprehensive Benchmark for Evaluating LLM-Generated GPU Kernels

CUDA Kernel Fusion Benchmarks

IntelliKit: Agent-first tooling for AMD hardware

DITRON: Distributed Compiler based on Triton for Parallel Systems

Most viewed papers (last 30 days)

Batched Matrix Computations on Hardware Accelerators

Share this:

Your response

Recent source codes

Most viewed papers (last 30 days)