Optimizing Block-Sparse Matrix Multiplications on CUDA with TVM
arXiv:2007.13055 [cs.MS] (26 Jul 2020)
@misc{gu2020optimizing,
  title={Optimizing Block-Sparse Matrix Multiplications on CUDA with TVM},
  author={Zijing Gu},
  year={2020},
  eprint={2007.13055},
  archivePrefix={arXiv},
  primaryClass={cs.MS}
}
We implemented and optimized matrix multiplications between dense and block-sparse matrices on CUDA. We leveraged TVM, a deep learning compiler, to explore the schedule space of the operation and generate efficient CUDA code. With TVM's automatic parameter tuning, our implementation, which is based on cross-thread reduction, achieved performance competitive with or better than other state-of-the-art frameworks.
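For readers unfamiliar with the operation, the following is a minimal pure-Python sketch of a dense by block-sparse product. It is not the paper's implementation and says nothing about TVM schedules or CUDA code generation. It assumes the convention Y = X · Wᵀ, with the sparse matrix W stored in block compressed sparse row (BSR) form; the function name `bsr_dense_matmul` and its parameter names are hypothetical, chosen only for illustration.

```python
def bsr_dense_matmul(x, w_data, w_indices, w_indptr, n_out):
    """Reference Y = X @ W.T, where W (n_out x K) is stored in BSR form.

    x         : dense M x K matrix, as a list of rows
    w_data    : stored (nonzero) blocks, each a bs_r x bs_c list of rows
    w_indices : block-column index of each stored block
    w_indptr  : w_indptr[i]:w_indptr[i + 1] spans the blocks of block-row i
    n_out     : number of rows of W, i.e. number of columns of Y
    """
    m = len(x)
    y = [[0.0] * n_out for _ in range(m)]
    if not w_data:
        return y  # W has no stored blocks, so Y is all zeros
    bs_r, bs_c = len(w_data[0]), len(w_data[0][0])
    for block_row in range(len(w_indptr) - 1):
        # Visit only the blocks that are actually stored for this block-row.
        for p in range(w_indptr[block_row], w_indptr[block_row + 1]):
            col0 = w_indices[p] * bs_c  # first dense column covered by the block
            block = w_data[p]
            for i in range(m):
                for r in range(bs_r):
                    acc = 0.0
                    # Reduction over the columns of the block.
                    for c in range(bs_c):
                        acc += x[i][col0 + c] * block[r][c]
                    y[i][block_row * bs_r + r] += acc
    return y


# A 4 x 4 matrix W with 2 x 2 blocks; only two of its four blocks are stored:
#   W = [[0, 0, 1, 2],
#        [0, 0, 3, 4],
#        [5, 6, 0, 0],
#        [7, 8, 0, 0]]
w_data = [[[1, 2], [3, 4]], [[5, 6], [7, 8]]]
w_indices = [1, 0]
w_indptr = [0, 1, 2]
x = [[1, 2, 3, 4]]

print(bsr_dense_matmul(x, w_data, w_indices, w_indptr, 4))
# prints [[11.0, 25.0, 17.0, 23.0]]
```

The loops touch only stored blocks, which is where the savings over a dense product come from. How these loops are tiled and mapped onto GPU threads is exactly the schedule space the abstract refers to, and is not modeled here.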
August 2, 2020 by hgpu