Accelerating Sparse Matrix Kernels on Graphics Processing Units
Center for Security, Theory, and Algorithmic Research, International Institute of Information Technology, Hyderabad – 500 032, India
International Institute of Information Technology, 2012
@mastersthesis{kumar2012accelerating,
  title={Accelerating Sparse Matrix Kernels on Graphics Processing Units},
  author={Kumar, M. K.},
  school={International Institute of Information Technology, Hyderabad},
  year={2012}
}
As microprocessor clock speeds have levelled off, general-purpose computing on Graphics Processing Units (GPGPU) has shown promise for the future of High Performance Computing (HPC). This can be largely attributed to the performance per unit cost, the performance per unit watt, and CUDA, a programming model for GPUs. For instance, for under $400 one can buy a GPU that delivers more than a TFLOP of computing power while requiring less than 250 W of power. It is therefore no surprise that 3 of the top 5 supercomputers [71] use GPUs alongside other computing elements. However, this comes at the expense of reinterpreting applications in a highly multithreaded manner. GPUs are good at exploiting massive data parallelism and are well suited for applications with regular memory access patterns and high arithmetic intensity. For several regular applications this approach has been very successful [63, 48]; for example, for dense matrix multiplication GPUs can reach up to 838 GFlops/s. However, most applications in high performance computing are irregular in nature. Examples include list ranking [81], graph algorithms [53], and sparse matrix computations [22], among others. The viability of using GPUs to accelerate these irregular workloads needs to be explored. In recent works, efforts are underway to arrive at efficient implementations of irregular algorithms on GPUs [81, 6] and on other recent multi-core architectures such as the Cell [4, 59, 51]. In this thesis we first accelerate the Lanczos method for large sparse matrices, an algorithm for finding the eigenvectors corresponding to the smallest eigenvalues. We then use this to accelerate two important applications, namely spectral graph partitioning and image segmentation. For these applications our GPU implementation shows speedups of up to 2.9x and 3.27x, respectively, when compared to a highly optimized implementation using the Intel MKL library on a multi-core CPU. We then investigate sparse matrix routines from Sparse BLAS Level 2 and Level 3.
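The core of the Lanczos method mentioned above can be summarized in a few lines: repeated sparse matrix-vector products build an orthonormal Krylov basis and a small tridiagonal matrix whose extremal eigenvalues approximate those of the original matrix. The following is a minimal pure-Python sketch of that iteration (the function names and the dense `matvec` stand-in are ours for illustration; the thesis version runs the matrix-vector products on the GPU over large sparse matrices):

```python
# Minimal Lanczos tridiagonalization sketch. Illustrative only: a dense
# list-of-lists matrix stands in for the sparse matrix, and no GPU
# acceleration or reorthogonalization is attempted.

def matvec(A, v):
    # Dense matrix-vector product; in the thesis this is the SpMV kernel.
    return [sum(a * x for a, x in zip(row, v)) for row in A]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def lanczos(A, q0, m):
    """Run up to m Lanczos steps starting from q0.

    Returns (alphas, betas, V): the diagonal and off-diagonal of the
    symmetric tridiagonal matrix T, and the Krylov basis vectors V.
    The smallest eigenvalues of T converge to those of A.
    """
    n = len(q0)
    nrm = dot(q0, q0) ** 0.5
    v = [x / nrm for x in q0]
    v_prev = [0.0] * n
    beta = 0.0
    alphas, betas, V = [], [], [v]
    for _ in range(m):
        w = matvec(A, v)
        w = [wi - beta * vp for wi, vp in zip(w, v_prev)]
        alpha = dot(w, v)
        w = [wi - alpha * vi for wi, vi in zip(w, v)]
        alphas.append(alpha)
        beta = dot(w, w) ** 0.5
        if beta < 1e-12:
            break  # Krylov space exhausted
        betas.append(beta)
        v_prev, v = v, [wi / beta for wi in w]
        V.append(v)
    return alphas, betas, V
```

Since T is similar to the restriction of A onto the Krylov basis, a full run on a small symmetric matrix preserves the trace, which makes the sketch easy to check.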
The investigated routines include a sparse matrix multiplied with a dense vector, a sparse matrix, or a dense matrix. Sparse Matrix-Vector Multiplication (SpMV) is at the heart of many numerical computations and is notorious for sustaining low fractions of peak performance. This operation finds application in solving systems of linear equations and finding the eigenvalues of a matrix, to name a few, and SpMV becomes the bottleneck in many of these methods. SpMV performance is limited by the available memory bandwidth due to the low computation per data item accessed. SpMV further suffers from the memory access and thread execution constraints posed by the GPU, and also from load imbalance among threads. To address some of these challenges we propose a run-time methodology to choose the right data structure to represent a sparse matrix. Our methodology uses parameters that attempt to balance the load among threads and also identify the right kind of representation to use for storing the sparse matrix. Experimental results show that our technique improves the performance of the SpMV kernel by an average of 25% when compared to the library implementation of SpMV on the GPU (the CUSP library). Sparse matrix-sparse/dense matrix multiplications, spgemm and csrmm, find usage in various matrix formulations of graph problems, among other applications. GPU-based supercomputers are presently experiencing severe performance issues on the Graph500 benchmark, a new HPC benchmark suite focusing on graph algorithms. Considering the difficulties in executing graph problems and the duality between graphs and matrices, computations such as spgemm and csrmm have recently caught the attention of the HPC community. These computations pose challenges such as load balancing, the irregular nature of the computation, and difficulty in predicting the output size. They are even more challenging when combined with the GPU's architectural constraints on memory accesses, limited shared memory, and thread execution.
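To make the SpMV discussion concrete, the following is a small pure-Python sketch of SpMV over the Compressed Sparse Row (CSR) format, the baseline representation the run-time methodology chooses among (the function name and example matrix are ours; this sequential loop is what a GPU kernel parallelizes, typically one or more threads per row, which is exactly where per-row nonzero imbalance becomes a load-balancing problem):

```python
# CSR SpMV sketch: y = A @ x, with A stored as three flat arrays.
# row_ptr[r] .. row_ptr[r+1] delimits the nonzeros of row r;
# col_idx holds their column indices and vals their values.

def csr_spmv(row_ptr, col_idx, vals, x):
    y = []
    for r in range(len(row_ptr) - 1):
        acc = 0.0
        for k in range(row_ptr[r], row_ptr[r + 1]):
            acc += vals[k] * x[col_idx[k]]
        y.append(acc)
    return y

# Example: the 3x3 matrix [[10, 0, 0], [0, 20, 30], [40, 0, 50]].
row_ptr = [0, 1, 3, 5]
col_idx = [0, 1, 2, 0, 2]
vals = [10.0, 20.0, 30.0, 40.0, 50.0]
```

Note the two accesses per nonzero (one value, one gathered element of `x`) against a single multiply-add, which is the low arithmetic intensity that makes SpMV memory-bandwidth bound.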
To address these challenges and perform the operations efficiently on a GPU, we evaluate three possible variations of matrix multiplication (Row-Column, Column-Row, Row-Row) and perform suitable optimizations. Our experiments indicate that the Row-Row formulation outperforms the other formulations. Our Row-Row formulation of the spgemm operation performs up to 6x faster than the highly optimized multi-core implementation in the Intel MKL library. Similarly, our GPU csrmm operation performs up to 5x faster than the corresponding implementation in the cuSPARSE library, which itself outperforms the Intel MKL implementation.
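The Row-Row formulation referred to above computes each row of C = A * B as a sparse combination of rows of B, scaled by the nonzeros of the corresponding row of A (this is Gustavson-style row-wise spgemm). A minimal sequential sketch in pure Python, with both operands in CSR (the function name and the dict-based accumulator are our illustrative choices, not the thesis implementation, which must manage accumulation and unknown output sizes within GPU shared-memory limits):

```python
# Row-Row spgemm sketch: C = A @ B with A and B in CSR form.
# For each row i of A, every nonzero a[i][j] scales row j of B;
# partial products are merged per output column in a dict.

def spgemm_row_row(a_ptr, a_idx, a_val, b_ptr, b_idx, b_val):
    c_ptr, c_idx, c_val = [0], [], []
    for i in range(len(a_ptr) - 1):
        acc = {}  # column -> accumulated value for row i of C
        for k in range(a_ptr[i], a_ptr[i + 1]):
            j, aij = a_idx[k], a_val[k]
            for t in range(b_ptr[j], b_ptr[j + 1]):
                col = b_idx[t]
                acc[col] = acc.get(col, 0.0) + aij * b_val[t]
        for col in sorted(acc):
            c_idx.append(col)
            c_val.append(acc[col])
        c_ptr.append(len(c_idx))
    return c_ptr, c_idx, c_val
```

The sketch also makes the stated challenges visible: the output size (`len(c_idx)`) is unknown until the accumulation finishes, and rows of A with many nonzeros touch many rows of B, producing the load imbalance a GPU mapping has to contend with.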
September 14, 2012 by hgpu