high performance computing on graphics processing units: hgpu.org

hgpu.org » Programming » Algorithms » Sparse Matrix-Vector Multiplication on GPU

Sparse Matrix-Vector Multiplication on GPU

Arash Ashari

The Ohio State University

The Ohio State University, 2014

BibTeX

Download (PDF)

View

Source

3969

views

Sparse Matrix-Vector multiplication (SpMV) is one of the key operations in linear algebra. Overcoming thread divergence, load imbalance and un-coalesced and indirect memory access due to sparsity and irregularity are challenges to optimizing SpMV on GPUs. This dissertation develops solutions that address these challenges effectively. The first part of this dissertation focuses on a new blocked row-column (BRC) storage format with a two-dimensional blocking mechanism. It reduces thread divergence by reordering and blocking rows of the input matrix with nearly equal number of non-zero elements onto the same execution units (i.e., warps). BRC improves load balance by partitioning rows into blocks with a constant number of non-zeros such that different warps perform the same amount of work. We also present an approach to optimize BRC performance by judicious selection of block size based on sparsity characteristics of the matrix. The most commonly used format for a sparse matrix is CSR (Compressed Sparse Row), but a number of other representations have recently been developed that achieve higher SpMV performance. However, the alternative representations typically impose a significant preprocessing overhead. While a high preprocessing overhead can be amortized for applications requiring many iterative invocations of SpMV that use the same matrix, it is not always feasible – for instance when analyzing large dynamically evolving graphs. The second part of this dissertation presents ACSR, an adaptive SpMV algorithm that uses the standard CSR format but reduces thread divergence by combining rows into groups (bins) which have a similar number of non-zero elements. Further, for rows in bins that span a wide range of non zero counts, dynamic parallelism is leveraged. A significant benefit of ACSR over other proposed SpMV approaches is that it works directly with the standard CSR format, and thus avoids significant preprocessing overheads. We also demonstrate the use of ACSR for the analysis of dynamic graphs, where the improvement over extant approaches is even higher. SpMV is commonly found in a wide class of data analytics and machine learning (ML) algorithms such as linear regression (LR), generalized linear models (GLM), binomial/multinomial logistic regression (LogReg), and support vector machines (SVM). SpMV is also one of the major performance bottlenecks in these algorithms. In the third part of this dissertation, we study application of SpMV in scalable ML. In particular, we identify the generic pattern of computation (y = alpha * A^T x (z . (A x x)) + beta * y) and its various instantiations. This pattern involves two SpMV computations. Fusing these computations, we develop a pipelined approach to optimize the complete pattern on GPUs. This approach not only reduces the cost of data loads due to improved temporal locality but also enables other optimizations like coarsening, as well as hierarchical aggregation of partial results.

Tags: Algorithms, Computer science, CUDA, Linear Algebra, Machine learning, nVidia, nVidia GeForce GTX 580, nVidia GeForce GTX Titan, Sparse matrix, Tesla K10, Thesis

February 9, 2015 by hgpu

No votes yet.

Please wait...

Your response

You must be logged in to post a comment.

* * *

high performance computing on graphics processing units: hgpu.org

Sparse Matrix-Vector Multiplication on GPU

Your response

Recent source codes

Mutual-Supervised Learning for Sequential-to-Parallel Code Translation

Hardware Compute Partitioning on NVIDIA GPUs for Composable Systems

KISim: Kubernetes Intelligent Scheduling Simulator

Efficient GPU Implementation of Multi-Precision Integer Division

exa-AMD: Exascale Accelerated Materials Discovery

ParEval: A Parallel Code Evaluation Benchmark

FlashSparse: Minimizing Computation Redundancy for Fast Sparse Matrix Multiplications on Tensor Cores

WiLLM: An Open Wireless LLM Communication System

Vcc: the Vulkan Clang Compiler

hpcbench: A set of benchmarking utilities for biomolecular simulation tools

Most viewed papers (last 30 days)

Sparse Matrix-Vector Multiplication on GPU

Share this:

Your response

Recent source codes

Most viewed papers (last 30 days)