Sparse Matrix Matrix Multiplication on Hybrid CPU+GPU Platforms

hgpu.org » Applications » Computer science » Sparse Matrix Matrix Multiplication on Hybrid CPU+GPU Platforms

Sparse Matrix Matrix Multiplication on Hybrid CPU+GPU Platforms

Kiran Kumar Matam, Siva Rama Krishna Bharadwaj, Kishore Kothapalli

Center for Security, Theory and Algorithmic Research (CSTAR), IIIT-Hyderabad, Gachibowli, Hyderabad, India, 500 032

Center for Security, Theory and Algorithmic Research (CSTAR), 2012

@article{matam2012sparse,

title={Sparse Matrix Matrix Multiplication on Hybrid CPU+GPU Platforms},

author={Matam, K.K. and Bharadwaj, S.R.K. and Kothapalli, K.},

year={2012}

}

Download (PDF)

View

Source

2983

views

Sparse matrix-sparse/dense matrix multiplications, spgemm and csrmm, among other applications find usage in various matrix formulations of graph problems. GPU based supercomputers are presently experiencing severe performance issues on the Graph-500 benchmarks, a new HPC benchmark suite focusing on graph algorithms. Considering the difficulties in executing graph problems and the duality between graphs and matrices, computations such as spgemm and csrmm have recently caught the attention of HPC community. These computations pose challenges such as load balancing, irregular nature of the computation, and difficulty in predicting output size. It is even more challenging when combined with the GPU architectural constraints in memory accesses, limited shared memory, and thread execution. To address these challenges and perform the operations efficiently on a GPU, we evaluate three possible variations of matrix multiplication (Row-Column, Column-Row, Row-Row) and perform suitable optimizations. Our experiments indicate that the Row-Row formulation outperforms the other formulations. We extend the Row-Row formulation to a CPU+GPU hybrid algorithm that simultaneously utilizes the CPU as well. In this direction, we present a heuristic to find the right amount of work division between the CPU and the GPU. For a subclass of sparse matrices, namely band matrices, we present a heuristic that can identify the right amount of work division. Our hybrid approach outperforms the corresponding pure GPU or pure CPU implementations. Our hybrid row-row formulation of the spgemm operation performs up to 6x faster when compared to the optimized multi-core implementation in the Intel MKL library. In a similar fashion, our GPU csrmm operation performs up to 5x faster when compared to corresponding implementation in the cusparse library, which outperforms the Intel MKL library implementation.

Tags: Computer science, CUDA, Matrix multiplication, nVidia, Optimization, Sparse matrix, Tesla C2050

July 16, 2012 by hgpu

No votes yet.

Please wait...

gpu_tracker: Context manager and CLI that tracks the computational-resource-usage of a code block or shell command, particularly the GPU usage

gpu_tracker: Python package for tracking and profiling GPU utilization in both desktop and high-performance computing environments

* * *

high performance computing on graphics processing units: hgpu.org