Sparse Matrix Matrix Multiplication on Hybrid CPU+GPU Platforms

hgpu.org » Applications » Computer science » Sparse Matrix Matrix Multiplication on Hybrid CPU+GPU Platforms

Sparse Matrix Matrix Multiplication on Hybrid CPU+GPU Platforms

Kiran Kumar Matam, Siva Rama Krishna Bharadwaj, Kishore Kothapalli

Center for Security, Theory and Algorithmic Research (CSTAR), IIIT-Hyderabad, Gachibowli, Hyderabad, India, 500 032

Center for Security, Theory and Algorithmic Research (CSTAR), 2012

BibTeX

Download (PDF)

View

Source

3360

views

Sparse matrix-sparse/dense matrix multiplications, spgemm and csrmm, among other applications find usage in various matrix formulations of graph problems. GPU based supercomputers are presently experiencing severe performance issues on the Graph-500 benchmarks, a new HPC benchmark suite focusing on graph algorithms. Considering the difficulties in executing graph problems and the duality between graphs and matrices, computations such as spgemm and csrmm have recently caught the attention of HPC community. These computations pose challenges such as load balancing, irregular nature of the computation, and difficulty in predicting output size. It is even more challenging when combined with the GPU architectural constraints in memory accesses, limited shared memory, and thread execution. To address these challenges and perform the operations efficiently on a GPU, we evaluate three possible variations of matrix multiplication (Row-Column, Column-Row, Row-Row) and perform suitable optimizations. Our experiments indicate that the Row-Row formulation outperforms the other formulations. We extend the Row-Row formulation to a CPU+GPU hybrid algorithm that simultaneously utilizes the CPU as well. In this direction, we present a heuristic to find the right amount of work division between the CPU and the GPU. For a subclass of sparse matrices, namely band matrices, we present a heuristic that can identify the right amount of work division. Our hybrid approach outperforms the corresponding pure GPU or pure CPU implementations. Our hybrid row-row formulation of the spgemm operation performs up to 6x faster when compared to the optimized multi-core implementation in the Intel MKL library. In a similar fashion, our GPU csrmm operation performs up to 5x faster when compared to corresponding implementation in the cusparse library, which outperforms the Intel MKL library implementation.

Tags: Computer science, CUDA, Matrix multiplication, nVidia, Optimization, Sparse matrix, Tesla C2050

July 16, 2012 by hgpu

No votes yet.

Please wait...

Your response

You must be logged in to post a comment.

HPCTransCompile: An AI Compiler Generated Dataset for High-Performance CUDA Transpilation and LLM Preliminary Exploration

chemtrain: Training Molecular Dynamics Potentials in JAX

chemtrain-deploy: A parallel and scalable framework for machine learning potentials in million-atom MD simulations

microSYCL: SYCL micro-benchmarks repository

Exploring SYCL as a Portability Layer for High-Performance Computing on CPUs

See all packages

* * *

high performance computing on graphics processing units: hgpu.org