LU Factorization with Partial Pivoting for a Multi-CPU, Multi-GPU Shared Memory System
Electrical Engineering and Computer Science, University of Tennessee
10th International Meeting on High-Performance Computing for Computational Science (VECPAR), 2012
LU factorization with partial pivoting is a canonical numerical procedure and the main component of the High Performance Linpack benchmark. This article presents an implementation of the algorithm for a hybrid shared-memory system with standard CPU cores and GPU accelerators. The optimizations include lookahead, dynamic task scheduling, fine-grained parallelism for memory-bound operations, autotuning, and a data layout geared towards complex memory hierarchies. Performance in excess of 1 Tflop/s is achieved using four AMD Magny-Cours CPUs and four NVIDIA Fermi GPUs.
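For readers unfamiliar with the underlying algorithm, the following is a minimal sequential sketch of LU factorization with partial pivoting in plain Python. It is a textbook version for illustration only and is not the implementation described in the article: it includes none of the listed optimizations (lookahead, task scheduling, autotuning, tiled data layout) and uses no GPU. The function name `lu_partial_pivot` and the list-of-rows matrix representation are assumptions made for this example.

```python
def lu_partial_pivot(a):
    """Factor a square matrix so that P*A = L*U.

    Returns (lu, perm): lu holds U on and above the diagonal and the
    multipliers of L (unit diagonal implied) below it; perm[i] is the
    index of the original row that ended up in position i.
    """
    n = len(a)
    lu = [[float(x) for x in row] for row in a]  # work on a copy
    perm = list(range(n))

    for k in range(n):
        # Partial pivoting: choose the row with the largest magnitude
        # entry in column k, at or below the diagonal.
        p = max(range(k, n), key=lambda i: abs(lu[i][k]))
        if lu[p][k] == 0.0:
            raise ValueError("matrix is singular")
        if p != k:
            lu[k], lu[p] = lu[p], lu[k]
            perm[k], perm[p] = perm[p], perm[k]

        # Eliminate column k below the pivot and update the trailing
        # submatrix; the multiplier is stored in place of the zero.
        for i in range(k + 1, n):
            lu[i][k] /= lu[k][k]
            m = lu[i][k]
            for j in range(k + 1, n):
                lu[i][j] -= m * lu[k][j]

    return lu, perm
```

Calling `lu_partial_pivot([[2, 1, 1], [4, 3, 3], [8, 7, 9]])` returns the packed factors together with the row permutation. The pivot search and row swap in each step are memory-bound, while the trailing submatrix update dominates the arithmetic, which is the part that high-performance implementations typically cast as matrix-matrix multiplication.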
April 17, 2013 by hgpu