high performance computing on graphics processing units: hgpu.org

hgpu.org » Programming » Algorithms » Communication-avoiding QR decomposition for GPUs

Communication-avoiding QR decomposition for GPUs

Michael Anderson, Grey Ballard, James Demmel, Kurt Keutzer

UC Berkeley: Department of Electrical Engineering and Computer Science, Berkeley, CA USA

IEEE International Parallel & Distributed Processing Symposium (IPDPS), 2011

DOI:10.1109/IPDPS.2011.15

@inproceedings{anderson2011communication,

title={Communication-avoiding QR decomposition for GPUs},

author={Anderson, M. and Ballard, G. and Demmel, J. and Keutzer, K.},

booktitle={Parallel & Distributed Processing Symposium (IPDPS), 2011 IEEE International},

pages={48–58},

year={2011},

organization={IEEE}

}

Download (PDF)

View

Source

1417

views

We describe an implementation of the Communication-Avoiding QR (CAQR) factorization that runs entirely on a single graphics processor (GPU). We show that the reduction in memory traffic provided by CAQR allows us to outperform existing parallel GPU implementations of QR for a large class of tall-skinny matrices. Other GPU implementations of QR handle panel factorizations by either sending the work to a general-purpose processor or using entirely bandwidth-bound operations, incurring data transfer overheads. In contrast, our QR is done entirely on the GPU using compute-bound kernels, meaning performance is good regardless of the width of the matrix. As a result, we outperform CULA, a parallel linear algebra library for GPUs by up to 17x for tall-skinny matrices and Intel’s Math Kernel Library (MKL) by up to 12x. We also discuss stationary video background subtraction as a motivating application. We apply a recent statistical approach, which requires many iterations of computing the singular value decomposition of a tall-skinny matrix. Using CAQR as a first step to getting the singular value decomposition, we are able to get the answer 3x faster than if we use a traditional bandwidth-bound GPU QR factorization tuned specifically for that matrix size, and 30x faster than if we use Intel’s Math Kernel Library (MKL) singular value decomposition routine on a multicore CPU.

Tags: Algorithms, Computer science, Factorization, Linear Algebra, nVidia, nVidia GeForce GTX 480, Tesla C2050

December 4, 2011 by hgpu

No votes yet.

Please wait...

gpu_tracker: Context manager and CLI that tracks the computational-resource-usage of a code block or shell command, particularly the GPU usage

gpu_tracker: Python package for tracking and profiling GPU utilization in both desktop and high-performance computing environments

* * *

high performance computing on graphics processing units: hgpu.org

Communication-avoiding QR decomposition for GPUs

Recent source codes

QArray

Celerity: High-level C++ for Accelerator Clusters

CIFAR-10 Airbench: 94% on CIFAR-10 in 3.29 second

gpu_tracker: Context manager and CLI that tracks the computational-resource-usage of a code block or shell command, particularly the GPU usage

LOOPer: a polyhedral compiler for expressing fast and portable data parallel algorithms

OpenMC Monte Carlo Code

Polygeist: C/C++ frontend for MLIR

Parallel Gaussian process with kernel approximation in CUDA

Optical flow algorithms for SYCL

OpenMP5-Offload-OpenMC-Intel-PVC

Most viewed papers (last 30 days)

Communication-avoiding QR decomposition for GPUs

Share this:

Recent source codes

Most viewed papers (last 30 days)