LO-SpMM: Low-cost Search for High-performance SpMM Kernels on GPUs

Junqing Lin, Jingwei Sun, Xiaolong Shi, Honghe Zhang, Xianzhi Yu, Xinzhi Wang, Jun Yao, Guangzhong Sun
Computer Science and Technology, University of Science and Technology of China, Hefei, China
ACM Transactions on Architecture and Code Optimization, 2024

@article{lin2024spmm,
   title={LO-SpMM: Low-cost Search for High-performance SpMM Kernels on GPUs},
   author={Lin, Junqing and Sun, Jingwei and Shi, Xiaolong and Zhang, Honghe and Yu, Xianzhi and Wang, Xinzhi and Yao, Jun and Sun, Guangzhong},
   journal={ACM Transactions on Architecture and Code Optimization},
   year={2024},
   publisher={ACM New York, NY}
}

As deep neural networks (DNNs) become increasingly large and complicated, pruning techniques have been proposed to lower the memory footprint and make inference more efficient. The most critical kernel for executing pruned sparse DNNs on GPUs is Sparse-dense Matrix Multiplication (SpMM). Although advanced tensor compilers can generate high-performance SpMM implementations, they often take a long time to iteratively search tuning configurations, and this slows down the cycle of exploring better DNN architectures or pruning algorithms. In this paper, we propose LO-SpMM to efficiently generate high-performance SpMM implementations for sparse DNN inference. Based on an analysis of the layout of nonzero elements, a characterization of the GPU architecture, and a rank-based cost model, LO-SpMM effectively reduces the search space and eliminates likely low-performance candidates. Moreover, rather than generating complete SpMM implementations for evaluation, LO-SpMM constructs simplified proxies to quickly estimate performance, substantially reducing compilation and execution costs. Experimental results show that LO-SpMM reduces the search time by up to 281x, while the performance of the generated SpMM implementations is comparable to or better than state-of-the-art sparse tensor compiling solutions.
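
The two-stage flow the abstract describes, ranking candidate kernel configurations with a cheap cost model and then timing only lightweight proxies instead of fully compiled kernels, can be sketched in Python as follows. This is a minimal illustrative sketch, not the authors' implementation: the configuration space, the rank_score formula, and the proxy_time stub are all assumptions made for the example.

# Hypothetical sketch of a rank-then-proxy kernel search, in the spirit of
# LO-SpMM. Every name and formula below is an illustrative assumption.

from dataclasses import dataclass
from itertools import product

@dataclass(frozen=True)
class Config:
    tile_m: int   # rows of the sparse matrix handled per thread block
    tile_n: int   # columns of the dense matrix handled per thread block
    warps: int    # warps per thread block

def enumerate_configs():
    """Full tuning space before pruning (assumed shape of the space)."""
    return [Config(m, n, w)
            for m, n, w in product([16, 32, 64, 128],
                                   [32, 64, 128],
                                   [2, 4, 8])]

def rank_score(cfg, nnz_per_row, sm_count):
    """Toy rank-based cost model: penalize configurations whose occupancy
    or load balance is predictably poor, so they can be discarded without
    ever being compiled."""
    blocks = max(1, 4096 // cfg.tile_m)           # assumed matrix height
    occupancy = min(1.0, blocks / sm_count)       # coarse SM utilization
    balance = min(1.0, nnz_per_row / cfg.tile_n)  # useful work per tile load
    return occupancy * balance / cfg.warps        # higher is better

def proxy_time(cfg):
    """Stand-in for executing a simplified proxy kernel; a real system
    would build and time a stripped-down SpMM here rather than compiling
    the complete implementation."""
    return cfg.tile_m * cfg.tile_n / (cfg.warps * 32)

def search(nnz_per_row=8, sm_count=108, keep=5):
    """Prune the space with the cost model, then 'measure' only the
    top-ranked survivors with the cheap proxy."""
    ranked = sorted(enumerate_configs(),
                    key=lambda c: rank_score(c, nnz_per_row, sm_count),
                    reverse=True)
    survivors = ranked[:keep]   # search-space reduction step
    return min(survivors, key=proxy_time)

if __name__ == "__main__":
    print(search())

The design choice the sketch mirrors is that both stages are cheap: the cost model is analytical, so low-quality candidates never reach compilation, and the proxy measurement replaces the expensive compile-and-run loop that dominates conventional tensor-compiler autotuning.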