high performance computing on graphics processing units: hgpu.org

hgpu.org » Programming » Algorithms » Sparse Matrix-Vector Multiplication on GPU

Sparse Matrix-Vector Multiplication on GPU

Arash Ashari

The Ohio State University

The Ohio State University, 2014

@phdthesis{ashari2014sparse,

title={Sparse Matrix-Vector Multiplication on GPU},

author={Ashari, Arash},

year={2014},

school={The Ohio State University}

}

Download (PDF)

View

Source

4192

views

Sparse Matrix-Vector multiplication (SpMV) is one of the key operations in linear algebra. Overcoming thread divergence, load imbalance and un-coalesced and indirect memory access due to sparsity and irregularity are challenges to optimizing SpMV on GPUs. This dissertation develops solutions that address these challenges effectively. The first part of this dissertation focuses on a new blocked row-column (BRC) storage format with a two-dimensional blocking mechanism. It reduces thread divergence by reordering and blocking rows of the input matrix with nearly equal number of non-zero elements onto the same execution units (i.e., warps). BRC improves load balance by partitioning rows into blocks with a constant number of non-zeros such that different warps perform the same amount of work. We also present an approach to optimize BRC performance by judicious selection of block size based on sparsity characteristics of the matrix. The most commonly used format for a sparse matrix is CSR (Compressed Sparse Row), but a number of other representations have recently been developed that achieve higher SpMV performance. However, the alternative representations typically impose a significant preprocessing overhead. While a high preprocessing overhead can be amortized for applications requiring many iterative invocations of SpMV that use the same matrix, it is not always feasible – for instance when analyzing large dynamically evolving graphs. The second part of this dissertation presents ACSR, an adaptive SpMV algorithm that uses the standard CSR format but reduces thread divergence by combining rows into groups (bins) which have a similar number of non-zero elements. Further, for rows in bins that span a wide range of non zero counts, dynamic parallelism is leveraged. A significant benefit of ACSR over other proposed SpMV approaches is that it works directly with the standard CSR format, and thus avoids significant preprocessing overheads. We also demonstrate the use of ACSR for the analysis of dynamic graphs, where the improvement over extant approaches is even higher. SpMV is commonly found in a wide class of data analytics and machine learning (ML) algorithms such as linear regression (LR), generalized linear models (GLM), binomial/multinomial logistic regression (LogReg), and support vector machines (SVM). SpMV is also one of the major performance bottlenecks in these algorithms. In the third part of this dissertation, we study application of SpMV in scalable ML. In particular, we identify the generic pattern of computation (y = alpha * A^T x (z . (A x x)) + beta * y) and its various instantiations. This pattern involves two SpMV computations. Fusing these computations, we develop a pipelined approach to optimize the complete pattern on GPUs. This approach not only reduces the cost of data loads due to improved temporal locality but also enables other optimizations like coarsening, as well as hierarchical aggregation of partial results.

Tags: Algorithms, Computer science, CUDA, Linear Algebra, Machine learning, nVidia, nVidia GeForce GTX 580, nVidia GeForce GTX Titan, Sparse matrix, Tesla K10, Thesis

February 9, 2015 by hgpu

No votes yet.

Please wait...

Your response

You must be logged in to post a comment.

high performance computing on graphics processing units: hgpu.org

Sparse Matrix-Vector Multiplication on GPU

Your response

Recent source codes

RepoLaunch: Automating Build and Test Pipeline of Code Repositories on ANY Language and ANY Platform

RepoLaunch: Automating Build and Test Pipeline of Code Repositories on ANY Language and ANY Platform

CONCUR: a benchmark designed to evaluate multithreaded Java code generated by LLMs

HIPRT: Ray Tracing using HIP

MXFP4 Training Support Codebase

CUDA Agent: Large-Scale Agentic RL for High-Performance CUDA Kernel Generation

CUDABench: Benchmarking LLMs for Text-to-CUDA Generation

CL4SE: A Context Learning Benchmark For Software Engineering Tasks

CodeScaler: Scaling Code LLM Training and Test-Time Inference via Execution-Free Reward Models

A Safety Report on GPT-5.2, Gemini 3 Pro, Qwen3-VL, Grok 4.1 Fast, Nano Banana Pro, and Seedream 4.5

Most viewed papers (last 30 days)

Sparse Matrix-Vector Multiplication on GPU

Share this:

Your response

Recent source codes

Most viewed papers (last 30 days)