high performance computing on graphics processing units: hgpu.org

Efficient Quantized Sparse Matrix Operations on Tensor Cores

Shigang Li, Kazuki Osawa, Torsten Hoefler

Department of Computer Science, ETH Zurich

arXiv:2209.06979 [cs.DC] (14 Sep 2022)

DOI:10.48550/arXiv.2209.06979

@misc{https://doi.org/10.48550/arxiv.2209.06979,
  doi={10.48550/ARXIV.2209.06979},
  url={https://arxiv.org/abs/2209.06979},
  author={Li, Shigang and Osawa, Kazuki and Hoefler, Torsten},
  keywords={Distributed, Parallel, and Cluster Computing (cs.DC), Machine Learning (cs.LG), FOS: Computer and information sciences, C.1.4; I.2.11},
  title={Efficient Quantized Sparse Matrix Operations on Tensor Cores},
  publisher={arXiv},
  year={2022},
}

Source code package: Magicube

The exponentially growing model size drives the continued success of deep learning, but it brings prohibitive computation and memory cost. From the algorithm perspective, model sparsification and quantization have been studied to alleviate the problem. From the architecture perspective, hardware vendors provide Tensor cores for acceleration. However, it is very challenging to gain practical speedups from sparse, low-precision matrix operations on Tensor cores, because of the strict requirements for data layout and lack of support for efficiently manipulating the low-precision integers. We propose Magicube, a high-performance sparse-matrix library for low-precision integers on Tensor cores. Magicube supports SpMM and SDDMM, two major sparse operations in deep learning with mixed precision. Experimental results on an NVIDIA A100 GPU show that Magicube achieves on average 1.44x (up to 2.37x) speedup over the vendor-optimized library for sparse kernels, and 1.43x speedup over the state-of-the-art with a comparable accuracy for end-to-end sparse Transformer inference.
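For readers unfamiliar with the two operations named in the abstract, the following is a minimal reference sketch of what SpMM and SDDMM compute. It is not Magicube's API and does not use Tensor cores. It assumes CSR storage for the sparse operand, int8 inputs, and int32 accumulation; the function names `spmm_csr` and `sddmm_csr` and the toy data are hypothetical and chosen for illustration only.

```python
import numpy as np


def spmm_csr(indptr, indices, data, dense):
    """SpMM: sparse (CSR, int8) times dense (int8), accumulated in int32."""
    n_rows = len(indptr) - 1
    out = np.zeros((n_rows, dense.shape[1]), dtype=np.int32)
    for i in range(n_rows):
        # Walk the nonzeros of sparse row i.
        for k in range(indptr[i], indptr[i + 1]):
            out[i] += np.int32(data[k]) * dense[indices[k]].astype(np.int32)
    return out


def sddmm_csr(indptr, indices, lhs, rhs):
    """SDDMM: compute (lhs @ rhs.T) only at the positions of a CSR pattern.

    Returns one int32 value per stored position, in CSR order.
    """
    out = np.zeros(len(indices), dtype=np.int32)
    for i in range(len(indptr) - 1):
        for k in range(indptr[i], indptr[i + 1]):
            j = indices[k]
            out[k] = np.dot(lhs[i].astype(np.int32), rhs[j].astype(np.int32))
    return out


# Toy 2x3 sparse matrix with nonzeros at (0,0), (0,2), (1,1).
indptr = np.array([0, 2, 3])
indices = np.array([0, 2, 1])
data = np.array([1, -2, 3], dtype=np.int8)

dense = np.array([[1, 2], [3, 4], [5, 6]], dtype=np.int8)  # 3x2
lhs = np.array([[1, 2], [3, 4]], dtype=np.int8)            # 2x2

product = spmm_csr(indptr, indices, data, dense)  # dense 2x2 result
sampled = sddmm_csr(indptr, indices, lhs, dense)  # 3 sampled dot products
print(product)
print(sampled)
```

Widening to int32 before multiplying avoids int8 overflow in the products and sums. In this sketch SpMM returns a dense matrix, while SDDMM returns only the values at the stored positions of the sparsity pattern.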

Tags: Computer science, CUDA, Deep learning, Linear Algebra, Mixed precision, nVidia, nVidia A100, Package, Sparse matrix, Tesla V100

October 2, 2022 by hgpu
