Sparse Matrix Matrix Multiplication on Hybrid CPU+GPU Platforms
Center for Security, Theory and Algorithmic Research (CSTAR), IIIT-Hyderabad, Gachibowli, Hyderabad, India, 500 032
@article{matam2012sparse,
  title={Sparse Matrix Matrix Multiplication on Hybrid CPU+GPU Platforms},
  author={Matam, K.K. and Bharadwaj, S.R.K. and Kothapalli, K.},
  year={2012}
}
Sparse matrix-sparse matrix and sparse matrix-dense matrix multiplications (spgemm and csrmm) find use in the matrix formulations of many graph problems. GPU-based supercomputers currently perform poorly on the Graph500 benchmarks, a new HPC benchmark suite focused on graph algorithms. Given the difficulty of executing graph problems efficiently and the duality between graphs and matrices, computations such as spgemm and csrmm have recently drawn the attention of the HPC community. These computations pose challenges such as load balancing, the irregular nature of the computation, and difficulty in predicting the output size. They become even more challenging when combined with GPU architectural constraints on memory accesses, limited shared memory, and thread execution. To address these challenges and perform the operations efficiently on a GPU, we evaluate three possible formulations of matrix multiplication (Row-Column, Column-Row, Row-Row) and apply suitable optimizations. Our experiments indicate that the Row-Row formulation outperforms the other formulations. We extend the Row-Row formulation to a hybrid CPU+GPU algorithm that utilizes the CPU simultaneously, and we present a heuristic to find the right division of work between the CPU and the GPU; for a subclass of sparse matrices, namely band matrices, this heuristic can identify the right work division. Our hybrid approach outperforms the corresponding pure-GPU and pure-CPU implementations. Our hybrid row-row formulation of the spgemm operation runs up to 6x faster than the optimized multi-core implementation in the Intel MKL library. Similarly, our GPU csrmm operation runs up to 5x faster than the corresponding implementation in the cuSPARSE library, which itself outperforms the Intel MKL implementation.
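For readers unfamiliar with the Row-Row formulation the paper builds on, the core idea (known as Gustavson's algorithm) is that row i of C = A*B is the sum of the rows B[k, :], each scaled by A[i, k], over the nonzeros k in row i of A. A minimal sequential sketch on CSR inputs, with all function and variable names illustrative rather than taken from the paper:

```python
# Row-Row spgemm sketch: both inputs and the output are in CSR form,
# given as (row_ptr, col_idx, vals) triples. A per-row dictionary acts
# as the sparse accumulator; the paper's GPU version must instead cope
# with limited shared memory and unpredictable output size.
def spgemm_row_row(a_ptr, a_col, a_val, b_ptr, b_col, b_val):
    c_ptr, c_col, c_val = [0], [], []
    for i in range(len(a_ptr) - 1):
        acc = {}  # sparse accumulator for row i of C
        for j in range(a_ptr[i], a_ptr[i + 1]):
            k, a_ik = a_col[j], a_val[j]
            # Add a_ik * B[k, :] into the accumulator.
            for t in range(b_ptr[k], b_ptr[k + 1]):
                acc[b_col[t]] = acc.get(b_col[t], 0.0) + a_ik * b_val[t]
        for col in sorted(acc):
            c_col.append(col)
            c_val.append(acc[col])
        c_ptr.append(len(c_col))
    return c_ptr, c_col, c_val
```

Because each output row is produced independently, rows parallelize naturally across CPU cores and GPU thread blocks, which is what makes this formulation amenable to the hybrid work division the paper proposes.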
July 16, 2012 by hgpu