Sparse Matrix-Vector Multiplication on GPU
Arash Ashari
The Ohio State University, 2014
@phdthesis{ashari2014sparse,
   title={Sparse Matrix-Vector Multiplication on GPU},
   author={Ashari, Arash},
   year={2014},
   school={The Ohio State University}
}
Sparse Matrix-Vector multiplication (SpMV) is one of the key operations in linear algebra. Thread divergence, load imbalance, and uncoalesced and indirect memory accesses caused by sparsity and irregularity make SpMV challenging to optimize on GPUs. This dissertation develops solutions that address these challenges effectively.

The first part of this dissertation focuses on a new blocked row-column (BRC) storage format with a two-dimensional blocking mechanism. BRC reduces thread divergence by reordering the rows of the input matrix and assigning rows with nearly equal numbers of non-zero elements to the same execution units (i.e., warps). It improves load balance by partitioning rows into blocks with a constant number of non-zeros, so that different warps perform the same amount of work. We also present an approach that optimizes BRC performance through judicious selection of the block size based on the sparsity characteristics of the matrix.

The most commonly used format for a sparse matrix is CSR (Compressed Sparse Row), but a number of other representations have recently been developed that achieve higher SpMV performance. These alternative representations, however, typically impose a significant preprocessing overhead. While a high preprocessing overhead can be amortized by applications requiring many iterative invocations of SpMV on the same matrix, this is not always feasible, for instance when analyzing large, dynamically evolving graphs. The second part of this dissertation presents ACSR, an adaptive SpMV algorithm that uses the standard CSR format but reduces thread divergence by grouping rows into bins with similar numbers of non-zero elements. For bins whose rows span a wide range of non-zero counts, dynamic parallelism is leveraged. A significant benefit of ACSR over other proposed SpMV approaches is that it works directly on the standard CSR format and thus avoids significant preprocessing overheads. We also demonstrate the use of ACSR for the analysis of dynamic graphs, where the improvement over extant approaches is even higher.

SpMV is commonly found in a wide class of data analytics and machine learning (ML) algorithms such as linear regression (LR), generalized linear models (GLM), binomial/multinomial logistic regression (LogReg), and support vector machines (SVM), and it is one of the major performance bottlenecks in these algorithms. In the third part of this dissertation, we study the application of SpMV in scalable ML. In particular, we identify the generic pattern of computation y = alpha * A^T (z ⊙ (A x)) + beta * y, where ⊙ denotes element-wise multiplication, and its various instantiations. This pattern involves two SpMV computations. By fusing these computations, we develop a pipelined approach that optimizes the complete pattern on GPUs. This approach not only reduces the cost of data loads through improved temporal locality, but also enables other optimizations such as coarsening and hierarchical aggregation of partial results.
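To make the BRC idea concrete, here is a minimal host-side sketch in Python/NumPy of the kind of preprocessing a BRC-style format performs. The function name build_brc_like, the default block sizes, and the zero-padding scheme are illustrative assumptions, not the dissertation's actual layout (which targets CUDA):

```python
import numpy as np
from scipy.sparse import csr_matrix

def build_brc_like(A: csr_matrix, block_h: int = 4, block_w: int = 8):
    """Reorder rows by non-zero count and cut them into block_h x block_w
    tiles, so that each warp-sized unit receives near-uniform work."""
    nnz_per_row = np.diff(A.indptr)
    order = np.argsort(-nnz_per_row)            # heaviest rows first
    blocks = []
    for start in range(0, A.shape[0], block_h):
        rows = order[start:start + block_h]     # rows with similar nnz
        width = int(nnz_per_row[rows].max())
        for col0 in range(0, width, block_w):   # split long rows into tiles
            vals = np.zeros((len(rows), block_w))
            cols = np.zeros((len(rows), block_w), dtype=np.int64)
            for i, r in enumerate(rows):
                lo, hi = A.indptr[r], A.indptr[r + 1]
                s = lo + col0
                n = max(0, min(hi, s + block_w) - s)
                if n:                           # pad the remainder with zeros
                    vals[i, :n] = A.data[s:s + n]
                    cols[i, :n] = A.indices[s:s + n]
            blocks.append((rows, vals, cols))
    return blocks
```

SpMV over such tiles maps naturally to one warp per tile, and the zero padding ensures all lanes execute the same number of iterations, which is how the blocking curbs divergence.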
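As a rough illustration of the binning step behind ACSR, the following sketch groups CSR rows by non-zero count without touching the CSR arrays themselves. The power-of-two bin boundaries and the function name are assumptions made for illustration:

```python
import numpy as np
from scipy.sparse import csr_matrix

def bin_rows_by_nnz(A: csr_matrix):
    """Group row indices into power-of-two bins by non-zero count so that
    each bin can be dispatched to a kernel tuned for rows of that size.
    On the GPU, bins spanning a wide range of counts would be handled
    via dynamic parallelism; here we only compute the grouping."""
    nnz = np.diff(A.indptr)
    bins = {}
    for r, n in enumerate(nnz):
        if n == 0:
            continue                      # empty rows need no work
        b = max(0, int(np.ceil(np.log2(n))))
        bins.setdefault(b, []).append(r)
    return {b: np.asarray(rs) for b, rs in bins.items()}
```

The key property is that this is a single cheap pass over the row pointers, with no reordering or copying of the matrix data, which is why an approach like ACSR remains attractive for dynamically evolving graphs where the matrix changes between SpMV calls.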
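The generic pattern from the third part is easy to state as a reference computation. The sketch below writes it out with SciPy, materializing the intermediate product t = A x that the pipelined GPU approach fuses away to improve temporal locality; the dimensions and parameter values are arbitrary:

```python
import numpy as np
from scipy.sparse import random as sprandom

def fused_pattern_reference(A, x, z, y, alpha=1.0, beta=0.0):
    """Reference for y = alpha * A^T (z ⊙ (A x)) + beta * y."""
    t = A @ x                              # first SpMV
    t *= z                                 # element-wise scaling by z
    return alpha * (A.T @ t) + beta * y    # second (transposed) SpMV

A = sprandom(1000, 500, density=0.01, format="csr")
x = np.random.rand(500)
z = np.random.rand(1000)
y = np.zeros(500)
y = fused_pattern_reference(A, x, z, y, alpha=2.0, beta=0.5)
```

Instantiations of this pattern (e.g., the gradient computations in LR, GLM, LogReg, and SVM solvers) differ mainly in how z, alpha, and beta are chosen, which is what makes a single fused kernel broadly reusable.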
February 9, 2015 by hgpu