high performance computing on graphics processing units: hgpu.org


Optimizing Block-Sparse Matrix Multiplications on CUDA with TVM

Zijing Gu

arXiv:2007.13055 [cs.MS], (26 Jul 2020)


Package: Benchmark for sparse-dense matrix multiplications


We implemented and optimized matrix multiplications between dense and block-sparse matrices on CUDA. We used TVM, a deep learning compiler, to explore the schedule space of the operation and generate efficient CUDA code. With TVM's automatic parameter tuning, our implementation based on cross-thread reduction achieved performance competitive with or better than other state-of-the-art frameworks.
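
To make the operation concrete, here is a minimal pure-Python reference sketch of a dense times block-sparse product. It is not the paper's TVM schedule or its generated CUDA code. It assumes the sparse operand W is stored in the common BSR layout (block values in `data`, block-column indices in `indices`, block-row offsets in `indptr`) and that the product computed is Y = X · Wᵀ; the function name `dense_times_bsr_transposed` and the example matrices are hypothetical, chosen only for illustration.

```python
def dense_times_bsr_transposed(x, data, indices, indptr, block_shape):
    """Reference computation of Y = X @ W.T, where W is given in BSR form.

    x           : dense M x K matrix as a list of rows
    data        : list of nonzero blocks, each a bs_r x bs_c list of rows
    indices     : block-column index of each block in `data`
    indptr      : offsets into `data`/`indices` for each block row of W
    block_shape : (bs_r, bs_c), assumed to divide W's shape evenly
    """
    bs_r, bs_c = block_shape
    n_block_rows = len(indptr) - 1
    y = [[0.0] * (n_block_rows * bs_r) for _ in range(len(x))]
    for i, x_row in enumerate(x):
        for br in range(n_block_rows):
            # Only the stored (nonzero) blocks of this block row are visited.
            for b in range(indptr[br], indptr[br + 1]):
                k0 = indices[b] * bs_c  # first column of W covered by block b
                block = data[b]
                for r in range(bs_r):
                    # Reduction over the block's columns; on a GPU this sum
                    # is the kind of work a schedule could split across threads.
                    acc = 0.0
                    for c in range(bs_c):
                        acc += x_row[k0 + c] * block[r][c]
                    y[i][br * bs_r + r] += acc
    return y


# Illustrative 4 x 4 matrix W with two nonzero 2 x 2 blocks on the diagonal:
#   [[1, 2, 0, 0],
#    [3, 4, 0, 0],
#    [0, 0, 5, 6],
#    [0, 0, 7, 8]]
w_data = [[[1, 2], [3, 4]], [[5, 6], [7, 8]]]
w_indices = [0, 1]
w_indptr = [0, 1, 2]
x_dense = [[1, 0, 0, 1], [0, 1, 1, 0]]

y_ref = dense_times_bsr_transposed(x_dense, w_data, w_indices, w_indptr, (2, 2))
```

Because the loops touch only stored blocks, the work scales with the number of nonzero blocks rather than with the full size of W, which is the property a tuned GPU kernel for this operation exploits.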

Tags: Computer science, CUDA, Deep learning, Machine learning, Matrix multiplication, nVidia, Package, Sparse matrix, Tesla T4

August 2, 2020 by hgpu


