Framework for Batched and GPU-resident Factorization Algorithms Applied to Block Householder Transformations

Azzam Haidar, Tingxing "Tim" Dong, Stanimire Tomov, Piotr Luszczek, Jack Dongarra
University of Tennessee, USA
ISC High Performance, 2015


   title={Framework for Batched and GPU-resident Factorization Algorithms Applied to Block Householder Transformations},

   booktitle={ISC High Performance},





   address={Frankfurt, Germany},

   author={Azzam Haidar and Tingxing Dong and Stanimire Tomov and Piotr Luszczek and Jack Dongarra}





As modern hardware keeps evolving, an increasingly effective approach to developing energy-efficient, high-performance solvers is to design them to work on many small, independent problems. Many applications already need this functionality, especially on GPUs, which are currently known to be about four to five times more energy efficient than multicore CPUs. We describe the development of the main one-sided factorizations that work on a set of small dense matrices in parallel, and we illustrate our techniques on the QR factorization based on Householder transformations. We refer to this mode of operation as a batched factorization. Our approach is based on representing the algorithms as a sequence of batched BLAS routines for GPU-only execution. Hybrid CPU-GPU algorithms, by contrast, rely heavily on the multicore CPU for specific parts of the workload. To benefit from the GPU's significantly higher energy efficiency, our primary design goal is to avoid the multicore CPU and rely exclusively on the GPU, which also removes the costly CPU-to-GPU communication. Furthermore, we do not use a single streaming multiprocessor (on the GPU) to factorize a single problem at a time. We illustrate how our performance analysis and the use of profiling and tracing tools guided the development and optimization of the batched factorization, achieving up to a 2-fold speedup and 3-fold better energy efficiency compared to our highly optimized batched CPU implementations based on the MKL library (when using two sockets of Intel Sandy Bridge CPUs). Compared to the batched QR factorization featured in the CUBLAS library for GPUs, we achieved up to a 5× speedup on the K40 GPU.
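To make the idea of a batched factorization expressed as a sequence of batched routines more concrete, here is a minimal pure-Python sketch of an unblocked Householder QR applied to a whole batch in lockstep. It is not the authors' GPU implementation: the function names (`batched_house`, `batched_apply`, `batched_qr`) are hypothetical, the inner loops over the batch merely stand in for what would be single batched kernel launches on a GPU, and it assumes every matrix in the batch has the same shape m x n with m >= n.

```python
import math


def batched_house(batch, k):
    """Batched step 1: one Householder vector (v, tau) per matrix for column k.

    The loop over `batch` stands in for a single batched GPU kernel.
    """
    vs, taus = [], []
    for A in batch:
        x = [A[i][k] for i in range(k, len(A))]
        norm = math.sqrt(sum(t * t for t in x))
        if norm == 0.0:
            # Column is already zero: use the identity reflector.
            vs.append([0.0] * len(x))
            taus.append(0.0)
            continue
        # Choose the sign that avoids cancellation in v[0].
        alpha = -math.copysign(norm, x[0])
        v = x[:]
        v[0] -= alpha
        vs.append(v)
        taus.append(2.0 / sum(t * t for t in v))
    return vs, taus


def batched_apply(batch, vs, taus, k):
    """Batched step 2: A[k:, k:] -= tau * v * (v^T A[k:, k:]) for every matrix."""
    for A, v, tau in zip(batch, vs, taus):
        for j in range(k, len(A[0])):
            w = tau * sum(v[i] * A[k + i][j] for i in range(len(v)))
            for i in range(len(v)):
                A[k + i][j] -= w * v[i]


def batched_qr(batch):
    """Return (R_batch, reflectors) without modifying the input batch."""
    work = [[row[:] for row in A] for A in batch]
    n = len(work[0][0])
    reflectors = []
    for k in range(n):
        # Each column step is a short sequence of batched routines
        # that advances all problems together.
        vs, taus = batched_house(work, k)
        batched_apply(work, vs, taus, k)
        reflectors.append((vs, taus))
    return work, reflectors


# Two independent 3 x 2 problems factorized together.
batch = [
    [[4.0, 1.0], [3.0, 2.0], [0.0, 5.0]],
    [[1.0, 2.0], [2.0, 0.0], [2.0, 1.0]],
]
R_batch, reflectors = batched_qr(batch)
```

The outer loop runs over columns rather than over problems, so each step touches every matrix in the batch. That ordering is what lets a GPU keep many small problems in flight at once instead of dedicating the device to one small matrix at a time.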
