Accelerating Sparse Matrix Kernels on Graphics Processing Units

Kiran Kumar Matam
Center for Security, Theory, and Algorithmic Research, International Institute of Information Technology, Hyderabad – 500 032, India
International Institute of Information Technology, 2012

   title={Accelerating Sparse Matrix Kernels on Graphics Processing Units},

   author={Kumar, M.K.},



Download Download (PDF)   View View   Source Source   



After microprocessor clock speeds have levelled off, general purpose computing on Graphics Pro- cessing Units (GPGPUs) has shown some promise for the future High Performance Computing (HPC). This can be largely attributed to the performance per unit cost, performance per unit watt and CUDA, a programming model for GPU. For instance, for under $400, one can buy a GPU that can deliver more than a TFLOP of computing power while requiring less than 250 Wof power. It is therefore no surprise that 3 out of the top 5 supercomputers [71] use GPUs along with other computing elements. However, this comes at the expense of reinterpreting applications in a highly multithreaded manner. GPUs are good at exploiting massive data parallelism, and are well-suited for applications with regular memory access patterns and high arithmetic intensity. For several regular applications, this approach has been very successful [63, 48]. For example for dense matrix multiplication GPUs can perform upto 838GFlops/s. However, most applications in high performance computing are irregular in nature. Examples include list ranking [81], graph algorithms [53], sparse matrix computations [22], among others. Viability of using GPUs for accelerating these irregular workloads needs to be explored. In recent works, efforts are on to arrive at efficient implementations of irregular algorithms on the GPUs [81, 6], and on other recent multi-core architectures such as the Cell [4, 59, 51]. In this thesis we first accelerate Lanczos method for large sparse matrices, an algorithm for finding eigenvectors corresponding to least eigenvalues. We then use this in accelerating two important ap- plications namely spectral graph partitioning and image segmentation. For these applications our GPU imlementation shows a speedup upto 2.9 and 3.27 respectively when compared to highly optimized Intel MKL library implementation on multi-core. We then investigate sparse matrix routines from sparse blas level 2 and level 3. The investigated routines include sparse matrix multiplied with a dense vector / sparse matrix / dense matrix. Sparse Matrix Vector Multiplication (spmv) is the heart of many numerical computations and is notorious for sustaining low fractions of peak performance. This operation finds applications in solving system of linear equations, finding eigenvalues of a matrix to name a few. spmv becomes bottleneck in many of these methods. spmv performance is limited by the available memory bandwidth due to the low computation per data accessed. spmv further suffers from the memory access and thread execution constraints posed by the gpu and also from load balancing among threads. To address some of these challenges we proposed a run-time methodology to choose the right data structures to represent sparse matrices. Our methodology uses parameters that attempt to balance the load amongst threads and also identify the right kind of representation to use for storing the sparse matrix. Experimental results show that our technique improves the performance of the spmv kernel by an average of 25% when compared to the library implementation of spmv on GPU (cusp library). Sparse matrix-sparse/dense matrix multiplications, spgemm and csrmm, among other applications find usage in various matrix formulations of graph problems. GPU based supercomputers are presently experiencing severe performance issues on the Graph-500 benchmarks, a new HPC benchmark suite focusing on graph algorithms. Considering the difficulties in executing graph problems and the duality between graphs and matrices, computations such as spgemm and csrmm have recently caught the attention of HPC community. These computations pose challenges such as load balancing, irregular nature of the computation, and difficulty in predicting output size. It is even more challenging when combined with the GPU architectural constraints in memory accesses, limited shared memory, and thread execution. To address these challenges and perform the operations efficiently on a GPU, we evaluate three pos- sible variations of matrix multiplication (Row-Column, Column-Row, Row-Row) and perform suitable optimizations. Our experiments indicate that the Row-Row formulation outperforms the other formu- lations. Our row-row formulation of the spgemm operation performs up to 6x faster when compared to the highly optimized multi-core implementation in Intel MKL library. In a similar fashion, our GPU csrmm operation performs up to 5x faster when compared to corresponding implementation in the cusparse library, which outperforms the Intel MKL library implementation.
VN:F [1.9.22_1171]
Rating: 0.0/5 (0 votes cast)

* * *

* * *

Follow us on Twitter

HGPU group

1512 peoples are following HGPU @twitter

Like us on Facebook

HGPU group

262 people like HGPU on Facebook

* * *

Free GPU computing nodes at hgpu.org

Registered users can now run their OpenCL application at hgpu.org. We provide 1 minute of computer time per each run on two nodes with two AMD and one nVidia graphics processing units, correspondingly. There are no restrictions on the number of starts.

The platforms are

Node 1
  • GPU device 0: nVidia GeForce GTX 560 Ti 2GB, 822MHz
  • GPU device 1: AMD/ATI Radeon HD 6970 2GB, 880MHz
  • CPU: AMD Phenom II X6 @ 2.8GHz 1055T
  • RAM: 12GB
  • OS: OpenSUSE 13.1
  • SDK: nVidia CUDA Toolkit 6.5.14, AMD APP SDK 3.0
Node 2
  • GPU device 0: AMD/ATI Radeon HD 7970 3GB, 1000MHz
  • GPU device 1: AMD/ATI Radeon HD 5870 2GB, 850MHz
  • CPU: Intel Core i7-2600 @ 3.4GHz
  • RAM: 16GB
  • OS: OpenSUSE 12.3
  • SDK: AMD APP SDK 3.0

Completed OpenCL project should be uploaded via User dashboard (see instructions and example there), compilation and execution terminal output logs will be provided to the user.

The information send to hgpu.org will be treated according to our Privacy Policy

HGPU group © 2010-2015 hgpu.org

All rights belong to the respective authors

Contact us: