high performance computing on graphics processing units: hgpu.org

hgpu.org » Programming » Algorithms » Accelerating Sparse Matrix Kernels on Graphics Processing Units

Accelerating Sparse Matrix Kernels on Graphics Processing Units

Kiran Kumar Matam

Center for Security, Theory, and Algorithmic Research, International Institute of Information Technology, Hyderabad – 500 032, India

International Institute of Information Technology, 2012

@article{kumar2012accelerating,

title={Accelerating Sparse Matrix Kernels on Graphics Processing Units},

author={Kumar, M.K.},

year={2012}

}

Download (PDF)

View

Source

2411

views

After microprocessor clock speeds have levelled off, general purpose computing on Graphics Pro- cessing Units (GPGPUs) has shown some promise for the future High Performance Computing (HPC). This can be largely attributed to the performance per unit cost, performance per unit watt and CUDA, a programming model for GPU. For instance, for under $400, one can buy a GPU that can deliver more than a TFLOP of computing power while requiring less than 250 Wof power. It is therefore no surprise that 3 out of the top 5 supercomputers [71] use GPUs along with other computing elements. However, this comes at the expense of reinterpreting applications in a highly multithreaded manner. GPUs are good at exploiting massive data parallelism, and are well-suited for applications with regular memory access patterns and high arithmetic intensity. For several regular applications, this approach has been very successful [63, 48]. For example for dense matrix multiplication GPUs can perform upto 838GFlops/s. However, most applications in high performance computing are irregular in nature. Examples include list ranking [81], graph algorithms [53], sparse matrix computations [22], among others. Viability of using GPUs for accelerating these irregular workloads needs to be explored. In recent works, efforts are on to arrive at efficient implementations of irregular algorithms on the GPUs [81, 6], and on other recent multi-core architectures such as the Cell [4, 59, 51]. In this thesis we first accelerate Lanczos method for large sparse matrices, an algorithm for finding eigenvectors corresponding to least eigenvalues. We then use this in accelerating two important ap- plications namely spectral graph partitioning and image segmentation. For these applications our GPU imlementation shows a speedup upto 2.9 and 3.27 respectively when compared to highly optimized Intel MKL library implementation on multi-core. We then investigate sparse matrix routines from sparse blas level 2 and level 3. The investigated routines include sparse matrix multiplied with a dense vector / sparse matrix / dense matrix. Sparse Matrix Vector Multiplication (spmv) is the heart of many numerical computations and is notorious for sustaining low fractions of peak performance. This operation finds applications in solving system of linear equations, finding eigenvalues of a matrix to name a few. spmv becomes bottleneck in many of these methods. spmv performance is limited by the available memory bandwidth due to the low computation per data accessed. spmv further suffers from the memory access and thread execution constraints posed by the gpu and also from load balancing among threads. To address some of these challenges we proposed a run-time methodology to choose the right data structures to represent sparse matrices. Our methodology uses parameters that attempt to balance the load amongst threads and also identify the right kind of representation to use for storing the sparse matrix. Experimental results show that our technique improves the performance of the spmv kernel by an average of 25% when compared to the library implementation of spmv on GPU (cusp library). Sparse matrix-sparse/dense matrix multiplications, spgemm and csrmm, among other applications find usage in various matrix formulations of graph problems. GPU based supercomputers are presently experiencing severe performance issues on the Graph-500 benchmarks, a new HPC benchmark suite focusing on graph algorithms. Considering the difficulties in executing graph problems and the duality between graphs and matrices, computations such as spgemm and csrmm have recently caught the attention of HPC community. These computations pose challenges such as load balancing, irregular nature of the computation, and difficulty in predicting output size. It is even more challenging when combined with the GPU architectural constraints in memory accesses, limited shared memory, and thread execution. To address these challenges and perform the operations efficiently on a GPU, we evaluate three pos- sible variations of matrix multiplication (Row-Column, Column-Row, Row-Row) and perform suitable optimizations. Our experiments indicate that the Row-Row formulation outperforms the other formu- lations. Our row-row formulation of the spgemm operation performs up to 6x faster when compared to the highly optimized multi-core implementation in Intel MKL library. In a similar fashion, our GPU csrmm operation performs up to 5x faster when compared to corresponding implementation in the cusparse library, which outperforms the Intel MKL library implementation.

Tags: Algorithms, Computer science, CUDA, Matrix multiplication, nVidia, Sparse matrix, Tesla C1060, Tesla C2050, Thesis

September 14, 2012 by hgpu

No votes yet.

Please wait...

Your response

You must be logged in to post a comment.

high performance computing on graphics processing units: hgpu.org

Accelerating Sparse Matrix Kernels on Graphics Processing Units

Your response

Recent source codes

Interleaved Learning and Exploration: A Self-Adaptive Fuzz Testing Framework for MLIR

Pinocchio: PINpointing Orbit Crossing Collapsed Hierarchical Objects

KernelCoder: trained on a curated dataset of reasoning traces and CUDA kernel pairs

VibeCodeHPC - Multi Agentic Vibe Coding for HPC

Compile-Time Resource Safety for GPU APIs: A Low-Overhead Typestate Framework

exa-AMD: Exascale Accelerated Materials Discovery

TRUST: a thermalhydraulic software package for CFD simulations

Modular: The Modular Platform (includes MAX & Mojo)

Allo: Accelerator Design Language

Towards Robust Agentic CUDA Kernel Benchmarking, Verification, and Optimization

Most viewed papers (last 30 days)

Accelerating Sparse Matrix Kernels on Graphics Processing Units

Share this:

Your response

Recent source codes

Most viewed papers (last 30 days)