Parallelization of the QR Decomposition with Column Pivoting Using Column Cyclic Distribution on Multicore and GPU Processors
Department of Computer Science, University of California, Davis, CA 95616, USA
10th International Meeting on High-Performance Computing for Computational Science (VECPAR 2012), 2012
@article{tomas2012parallelization,
title={Parallelization of the QR Decomposition with Column Pivoting Using Column Cyclic Distribution on Multicore and GPU Processors},
author={Tom{‘a}s, A. and Bai, Z. and Hern{‘a}ndez, V.},
year={2012}
}
The QR decomposition with column pivoting (QRP) of a matrix is widely used for numerical rank revealing in applications. The performance of LAPACK implementation (DGEQP3) of the Householder QRP algorithm is limited by Level 2 BLAS operations required for updating the column norms. In this paper, we propose an implementation of the QRP algorithm using a distribution of the matrix columns in a round-robin fashion for better data locality and parallel memory bus utilization on multicore architectures. Our performance results show a 60% improvement over the routine DGEQP3 of Intel MKL (version 10.3) on a 12 core Intel Xeon X5670 machine. In addition, we show that the same data distribution is also suitable for general purpose GPU processors, where our implementation obtains up to 90 GFlops on a NVIDIA GeForce GTX480. This is about 2 times faster than the QRP implementation of MAGMA (version 1.2.1).
February 3, 2013 by hgpu