Framework for Batched and GPU-resident Factorization Algorithms Applied to Block Householder Transformations
University of Tennessee, USA
ISC High Performance, 2015
@conference{haidar2015framework,
title={Framework for Batched and GPU-resident Factorization Algorithms Applied to Block Householder Transformations},
booktitle={ISC High Performance},
year={2015},
month=jul,
publisher={Springer},
organization={Springer},
address={Frankfurt, Germany},
author={Azzam Haidar and Tingxing Dong and Stanimire Tomov and Piotr Luszczek and Jack Dongarra}
}
As modern hardware keeps evolving, an increasingly effective approach to developing energy-efficient, high-performance solvers is to design them to work on many small, independent problems. Many applications already need this functionality, especially on GPUs, which are currently known to be about four to five times more energy efficient than multicore CPUs. We describe the development of the main one-sided factorizations that work on a set of small dense matrices in parallel, and we illustrate our techniques on the QR factorization based on Householder transformations. We refer to this mode of operation as a batched factorization. Our approach is based on representing the algorithms as a sequence of batched BLAS routines for GPU-only execution. Hybrid CPU-GPU algorithms rely heavily on the multicore CPU for specific parts of the workload, but to benefit from the GPU’s significantly higher energy efficiency, our primary design goal is to avoid the multicore CPU and rely exclusively on the GPU. This also removes the costly CPU-to-GPU communication. Furthermore, we do not use a single streaming multiprocessor (on the GPU) to factorize a single problem at a time. We illustrate how our performance analysis and the use of profiling and tracing tools guided the development and optimization of the batched factorization to achieve up to a 2-fold speedup and 3-fold better energy efficiency compared to our highly optimized batched CPU implementations based on the MKL library (when using two sockets of Intel Sandy Bridge CPUs). Compared to the batched QR factorization featured in the CUBLAS library for GPUs, we achieved up to a 5× speedup on the K40 GPU.
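To make the idea of a batched factorization concrete, the following is a minimal NumPy sketch of an unblocked Householder QR applied to a whole batch of small matrices at once. It is an illustration only and not the authors' GPU implementation: it runs on the CPU, it does not use the block Householder (compact WY) form, and the function name `batched_householder_qr` and the array layout `(batch, m, n)` are assumptions made for this example. Each `einsum` call plays the role of one batched BLAS-like operation applied to every problem in the batch.

```python
import numpy as np

def batched_householder_qr(A):
    """Factor every A[b] (shape m x n, m >= n) as Q[b] @ R[b].

    Hypothetical helper for illustration: one Householder step is applied
    to all matrices in the batch before moving on to the next column.
    """
    R = np.array(A, dtype=float)              # working copy, shape (batch, m, n)
    batch, m, n = R.shape
    Q = np.broadcast_to(np.eye(m), (batch, m, m)).copy()

    for k in range(n):
        x = R[:, k:, k]                        # column k from the diagonal down
        norm_x = np.linalg.norm(x, axis=1)
        # Pick the sign that avoids cancellation when forming v = x - alpha * e1.
        alpha = -np.where(x[:, 0] >= 0, 1.0, -1.0) * norm_x
        v = x.copy()
        v[:, 0] -= alpha
        v_norm = np.linalg.norm(v, axis=1, keepdims=True)
        # Normalize v; an all-zero column yields v = 0, i.e. H = I.
        v = np.divide(v, v_norm, out=np.zeros_like(v), where=v_norm > 0)

        # R[k:, k:] <- (I - 2 v v^T) R[k:, k:] for the whole batch.
        w = np.einsum('bi,bij->bj', v, R[:, k:, k:])
        R[:, k:, k:] -= 2.0 * np.einsum('bi,bj->bij', v, w)

        # Q[:, k:] <- Q[:, k:] (I - 2 v v^T) for the whole batch.
        u = np.einsum('bij,bj->bi', Q[:, :, k:], v)
        Q[:, :, k:] -= 2.0 * np.einsum('bi,bj->bij', u, v)

    # Clear rounding residue below the diagonal.
    return Q, np.triu(R)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    A = rng.standard_normal((1000, 16, 8))     # 1000 independent 16 x 8 problems
    Q, R = batched_householder_qr(A)
    print(np.allclose(Q @ R, A))
```

The loop runs over columns while every array operation spans the batch dimension, which mirrors the abstract's description of expressing the factorization as a sequence of batched routines rather than factorizing one matrix at a time.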
April 12, 2015 by hgpu