Framework for Batched and GPU-resident Factorization Algorithms Applied to Block Householder Transformations

Azzam Haidar, Tingxing "Tim" Dong, Stanimire Tomov, Piotr Luszczek, Jack Dongarra

University of Tennessee, USA

ISC High Performance, 2015

As modern hardware keeps evolving, an increasingly effective approach to developing energy-efficient, high-performance solvers is to design them to work on many small, independent problems. Many applications already need this functionality, especially on GPUs, which are currently known to be about four to five times more energy efficient than multicore CPUs. We describe the development of the main one-sided factorizations that work on a set of small dense matrices in parallel, and we illustrate our techniques on the QR factorization based on Householder transformations. We refer to this mode of operation as a batched factorization. Our approach is based on representing the algorithms as a sequence of batched BLAS routines for GPU-only execution. Hybrid CPU-GPU algorithms, by contrast, rely heavily on the multicore CPU for specific parts of the workload. To benefit from the GPU’s significantly higher energy efficiency, our primary design goal is to avoid the multicore CPU and rely exclusively on the GPU, which also removes the costly CPU-to-GPU communication. Furthermore, we do not use a single streaming multiprocessor (on the GPU) to factorize a single problem at a time. We illustrate how our performance analysis and the use of profiling and tracing tools guided the development and optimization of batched factorization to achieve up to a 2-fold speedup and 3-fold better energy efficiency compared to our highly optimized batched CPU implementations based on the MKL library (when using two sockets of Intel Sandy Bridge CPUs). Compared to the batched QR factorization featured in the CUBLAS library for GPUs, we achieved up to a 5× speedup on the K40 GPU.
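To make the idea of a batched factorization concrete, the following is a minimal sketch, not the authors' GPU implementation. It assumes NumPy on the CPU as a stand-in for batched BLAS, and it uses plain (unblocked) Householder reflections, where a blocked variant would aggregate several reflectors before updating the trailing matrix. The function name `batched_householder_qr` is hypothetical. Each step of the loop is one operation applied across every matrix in the batch, which is the pattern the abstract describes.

```python
import numpy as np

def batched_householder_qr(A):
    """Unblocked Householder QR applied to a whole batch at once.

    A has shape (batch, m, n) with m >= n.
    Returns Q of shape (batch, m, m) and R of shape (batch, m, n)
    such that Q @ R reproduces A for every matrix in the batch.
    """
    A = np.asarray(A, dtype=np.float64)
    batch, m, n = A.shape
    R = A.copy()
    Q = np.broadcast_to(np.eye(m), (batch, m, m)).copy()

    for k in range(n):
        # Column k from the diagonal down, for every matrix in the batch.
        x = R[:, k:, k]
        norm_x = np.linalg.norm(x, axis=1)

        # Pick the sign that avoids cancellation (treat sign(0) as +1).
        sign = np.where(x[:, 0] >= 0.0, 1.0, -1.0)
        v = x.copy()
        v[:, 0] += sign * norm_x

        # Normalize each reflector; all-zero columns yield v = 0 (identity).
        vnorm = np.linalg.norm(v, axis=1, keepdims=True)
        v = np.divide(v, vnorm, out=np.zeros_like(v), where=vnorm > 0)

        # Trailing update R <- (I - 2 v v^T) R, batched over the first axis.
        w = np.einsum("bi,bij->bj", v, R[:, k:, k:])
        R[:, k:, k:] -= 2.0 * np.einsum("bi,bj->bij", v, w)

        # Accumulate Q <- Q (I - 2 v v^T), again batched.
        u = np.einsum("bij,bj->bi", Q[:, :, k:], v)
        Q[:, :, k:] -= 2.0 * np.einsum("bi,bj->bij", u, v)

    return Q, R


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    A = rng.standard_normal((1000, 32, 16))  # 1000 small independent problems
    Q, R = batched_householder_qr(A)
    print(np.allclose(Q @ R, A))
```

In this sketch the loop runs over columns while the batch dimension is handled inside each array operation. On a GPU, each of those operations would correspond to a batched kernel launch, so the host never needs to factorize the problems one at a time.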

Tags: Algorithms, Computer science, CUBLAS, CUDA, Factorization, nVidia, Tesla K40

April 12, 2015 by hgpu
