high performance computing on graphics processing units: hgpu.org

hgpu.org » Applications » Computer science » Accelerating Sparse Approximate Matrix Multiplication on GPUs

Accelerating Sparse Approximate Matrix Multiplication on GPUs

Xiaoyan Liu, Yi Liu, Ming Dun, Bohong Yin, Hailong Yang, Zhongzhi Luan, Depei Qian

School of Computer Science and Engineering, Beihang University, Beijing, China

arXiv:2103.13042 [cs.PF], (24 Mar 2021)

BibTeX

Download (PDF)

View

Source

1917

views

Although the matrix multiplication plays a vital role in computational linear algebra, there are few efficient solutions for matrix multiplication of the near-sparse matrices. The Sparse Approximate Matrix Multiply (SpAMM) is one of the algorithms to fill the performance gap neglected by traditional optimizations for dense/sparse matrix multiplication. However, existing SpAMM algorithms fail to exploit the performance potential of GPUs for acceleration. In this paper, we present cuSpAMM, the first parallel SpAMM algorithm optimized for multiple GPUs. Several performance optimizations have been proposed, including algorithm re-design to adapt to the thread parallelism, blocking strategies for memory access optimization, and the acceleration with the tensor core. In addition, we scale cuSpAMM to run on multiple GPUs with an effective load balance scheme. We evaluate cuSpAMM on both synthesized and real-world datasets on multiple GPUs. The experiment results show that cuSpAMM achieves significant performance speedup compared to vendor optimized cuBLAS and cuSPARSE libraries.

Tags: Computer science, CUBLAS, CUDA, Linear Algebra, Matrix multiplication, nVidia, Performance, Sparse matrix, Tesla V100

March 28, 2021 by hgpu

No votes yet.

Please wait...

Your response

You must be logged in to post a comment.

* * *

high performance computing on graphics processing units: hgpu.org

Accelerating Sparse Approximate Matrix Multiplication on GPUs

Your response

Recent source codes

Mutual-Supervised Learning for Sequential-to-Parallel Code Translation

Hardware Compute Partitioning on NVIDIA GPUs for Composable Systems

KISim: Kubernetes Intelligent Scheduling Simulator

Efficient GPU Implementation of Multi-Precision Integer Division

exa-AMD: Exascale Accelerated Materials Discovery

ParEval: A Parallel Code Evaluation Benchmark

FlashSparse: Minimizing Computation Redundancy for Fast Sparse Matrix Multiplications on Tensor Cores

WiLLM: An Open Wireless LLM Communication System

Vcc: the Vulkan Clang Compiler

hpcbench: A set of benchmarking utilities for biomolecular simulation tools

Most viewed papers (last 30 days)

Accelerating Sparse Approximate Matrix Multiplication on GPUs

Share this:

Your response

Recent source codes

Most viewed papers (last 30 days)