
LO-SpMM: Low-cost Search for High-performance SpMM Kernels on GPUs

Junqing Lin, Jingwei Sun, Xiaolong Shi, Honghe Zhang, Xianzhi Yu, Xinzhi Wang, Jun Yao, Guangzhong Sun
Computer Science and Technology, University of Science and Technology of China, Hefei, China
ACM Transactions on Architecture and Code Optimization, 2024

@article{lin2024spmm,
   title={LO-SpMM: Low-cost Search for High-performance SpMM Kernels on GPUs},
   author={Lin, Junqing and Sun, Jingwei and Shi, Xiaolong and Zhang, Honghe and Yu, Xianzhi and Wang, Xinzhi and Yao, Jun and Sun, Guangzhong},
   journal={ACM Transactions on Architecture and Code Optimization},
   year={2024},
   publisher={ACM New York, NY}
}


As deep neural networks (DNNs) become increasingly large and complicated, pruning techniques have been proposed to lower the memory footprint and enable more efficient inference. The most critical kernel for executing pruned sparse DNNs on GPUs is Sparse-dense Matrix Multiplication (SpMM). Although advanced tensor compilers can generate high-performance SpMM implementations, they often take a long time to iteratively search tuning configurations, which slows down the cycle of exploring better DNN architectures or pruning algorithms. In this paper, we propose LO-SpMM to efficiently generate high-performance SpMM implementations for sparse DNN inference. Based on an analysis of the nonzero elements' layout, a characterization of the GPU architecture, and a rank-based cost model, LO-SpMM effectively reduces the search space and eliminates likely low-performance candidates. Moreover, rather than generating complete SpMM implementations for evaluation, LO-SpMM constructs simplified proxies to quickly estimate performance, thereby substantially reducing compilation and execution costs. Experimental results show that LO-SpMM reduces search time by up to 281x, while the performance of the generated SpMM implementations is comparable to or better than state-of-the-art sparse tensor compiling solutions.
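
The abstract describes two cost-cutting ideas: ranking tuning configurations with an analytical cost model so most candidates are never built, and timing the few survivors with cheap proxies instead of fully compiled kernels. The following is a minimal sketch of that search loop, not the authors' code; all names, the score formula, and the configuration fields are hypothetical stand-ins for the paper's actual model.

from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Config:
    tile_m: int      # sparse-matrix rows handled per thread block (hypothetical knob)
    tile_n: int      # dense-matrix columns handled per thread block (hypothetical knob)
    vec_width: int   # vectorized-load width for the dense operand (hypothetical knob)

def rank_score(cfg: Config, nnz_per_row: float) -> float:
    """Hypothetical rank-based cost model: higher score = predicted faster.
    A real model would be fit to the nonzero layout and GPU characteristics."""
    reuse = cfg.tile_n * min(nnz_per_row, cfg.tile_m)    # stand-in for data reuse
    pressure = cfg.tile_m * cfg.tile_n / cfg.vec_width   # stand-in for resource pressure
    return reuse / pressure

def search(candidates: List[Config], nnz_per_row: float,
           proxy_time: Callable[[Config], float], keep: int = 8) -> Config:
    # Rank every candidate analytically and keep only the top few,
    # pruning likely low performers without compiling or running them.
    top = sorted(candidates, key=lambda c: rank_score(c, nnz_per_row),
                 reverse=True)[:keep]
    # Measure only the survivors with a cheap proxy (passed in by the
    # caller) rather than a full compile-and-execute cycle.
    return min(top, key=proxy_time)

The point of this structure is that the expensive measurement function is called O(keep) times instead of once per candidate, which is where the reported orders-of-magnitude reduction in search time would come from.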
