Parallelization of the QR Decomposition with Column Pivoting Using Column Cyclic Distribution on Multicore and GPU Processors

Andres Tomas, Zhaojun Bai, Vicente Hernandez
Department of Computer Science, University of California, Davis, CA 95616, USA
10th International Meeting on High-Performance Computing for Computational Science (VECPAR 2012), 2012

@article{tomas2012parallelization,
   title={Parallelization of the QR Decomposition with Column Pivoting Using Column Cyclic Distribution on Multicore and GPU Processors},
   author={Tom{\'a}s, A. and Bai, Z. and Hern{\'a}ndez, V.},
   year={2012}
}

The QR decomposition with column pivoting (QRP) of a matrix is widely used in applications to reveal numerical rank. The performance of the LAPACK implementation (DGEQP3) of the Householder QRP algorithm is limited by the Level 2 BLAS operations required for updating the column norms. In this paper, we propose an implementation of the QRP algorithm that distributes the matrix columns in a round-robin fashion for better data locality and parallel memory bus utilization on multicore architectures. Our performance results show a 60% improvement over the DGEQP3 routine of Intel MKL (version 10.3) on a 12-core Intel Xeon X5670 machine. In addition, we show that the same data distribution is also suitable for general-purpose GPU processors, where our implementation obtains up to 90 GFlops on an NVIDIA GeForce GTX480. This is about 2 times faster than the QRP implementation of MAGMA (version 1.2.1).
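
The round-robin (column cyclic) distribution mentioned above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the worker count `p`, and the lowest-index tie-breaking rule are assumptions made for illustration only. The sketch shows how column `j` maps to worker `j mod p`, and how a pivot column could be chosen by letting each worker scan only its own columns before the local winners are reduced to one.

```python
def cyclic_owner(j, p):
    """Worker that holds column j under a round-robin distribution (assumed mapping)."""
    return j % p


def local_columns(n, p, w):
    """Global indices of the columns held by worker w out of n columns."""
    return list(range(w, n, p))


def select_pivot(col_norms_sq, p, k):
    """Choose the pivot among the trailing columns k..n-1 (requires k < n).

    Each worker looks only at its own columns; the per-worker winners are
    then reduced to a single global pivot. Ties go to the lowest index,
    which is an illustrative choice, not something taken from the paper.
    """
    n = len(col_norms_sq)

    def rank(j):
        # Larger squared norm wins; on a tie the smaller index wins.
        return (col_norms_sq[j], -j)

    candidates = []
    for w in range(p):
        mine = [j for j in local_columns(n, p, w) if j >= k]
        if mine:
            candidates.append(max(mine, key=rank))
    return max(candidates, key=rank)
```

For example, `select_pivot(norms, p, k)` could be called once per factorization step `k` with the current squared column norms. Because the columns are interleaved, the trailing columns `k..n-1` stay spread almost evenly across the workers as `k` grows, which is one plausible reading of why the distribution suits both multicore and GPU processors.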