high performance computing on graphics processing units: hgpu.org

Model-driven autotuning of sparse matrix-vector multiply on GPUs

Jee W. Choi, Amik Singh, Richard W. Vuduc

Georgia Institute of Technology, School of Electrical and Computer Engineering, Atlanta, Georgia, USA

Proceedings of the 15th ACM SIGPLAN symposium on Principles and practice of parallel programming, PPoPP ’10

DOI:10.1145/1693453.1693471

@conference{choi2010model,

title={Model-driven autotuning of sparse matrix-vector multiply on GPUs},

author={Choi, J.W. and Singh, A. and Vuduc, R.W.},

booktitle={Proceedings of the 15th ACM SIGPLAN symposium on Principles and practice of parallel programming},

pages={115--126},

year={2010},

organization={ACM}

}

We present a performance model-driven framework for automated performance tuning (autotuning) of sparse matrix-vector multiply (SpMV) on systems accelerated by graphics processing units (GPUs). Our study consists of two parts. First, we describe several carefully hand-tuned SpMV implementations for GPUs, identifying key GPU-specific performance limitations, enhancements, and tuning opportunities. These implementations, which include variants on classical blocked compressed sparse row (BCSR) and blocked ELLPACK (BELLPACK) storage formats, match or exceed state-of-the-art implementations. For instance, our best BELLPACK implementation achieves up to 29.0 Gflop/s in single precision and 15.7 Gflop/s in double precision on the NVIDIA T10P multiprocessor (C1060), improving on prior state-of-the-art unblocked implementations (Bell and Garland, 2009) by up to 1.8× and 1.5× for single and double precision, respectively. However, achieving this level of performance requires input matrix-dependent parameter tuning. Thus, in the second part of this study, we develop a performance model that can guide tuning. Like prior autotuning models for CPUs (e.g., Im, Yelick, and Vuduc, 2004), this model requires offline measurements and run-time estimation, but more directly models the structure of multithreaded vector processors like GPUs. We show that our model can identify implementations that achieve performance within 15% of the best found through exhaustive search.
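For readers unfamiliar with the storage formats named in the abstract, the following is a minimal sketch of plain (unblocked) ELLPACK SpMV, written as a sequential Python reference rather than a GPU kernel. It is not the paper's BELLPACK implementation: the function names `csr_to_ell` and `ell_spmv`, the choice to pad with column index 0, and the small example matrix are all assumptions made for illustration.

```python
def csr_to_ell(row_ptr, col_idx, vals):
    """Convert CSR arrays to a padded ELLPACK layout (illustrative helper).

    Every row is padded to the length of the longest row. Padded slots
    hold value 0.0 and column index 0, so they add nothing to the product.
    """
    n_rows = len(row_ptr) - 1
    width = max((row_ptr[i + 1] - row_ptr[i] for i in range(n_rows)), default=0)
    ell_cols = [[0] * width for _ in range(n_rows)]
    ell_vals = [[0.0] * width for _ in range(n_rows)]
    for i in range(n_rows):
        for k, p in enumerate(range(row_ptr[i], row_ptr[i + 1])):
            ell_cols[i][k] = col_idx[p]
            ell_vals[i][k] = vals[p]
    return ell_cols, ell_vals


def ell_spmv(ell_cols, ell_vals, x):
    """Compute y = A * x for a matrix stored in the padded layout above."""
    y = [0.0] * len(ell_vals)
    for i, (cols, row_vals) in enumerate(zip(ell_cols, ell_vals)):
        acc = 0.0
        for c, v in zip(cols, row_vals):
            acc += v * x[c]  # padded entries contribute 0.0 * x[0]
        y[i] = acc
    return y


# Example matrix (hypothetical):
#   [1 0 2]
#   [0 3 0]
#   [4 5 6]
row_ptr = [0, 2, 3, 6]
col_idx = [0, 2, 1, 0, 1, 2]
vals = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]

ell_cols, ell_vals = csr_to_ell(row_ptr, col_idx, vals)
y = ell_spmv(ell_cols, ell_vals, [1.0, 2.0, 3.0])
print(y)  # [7.0, 6.0, 32.0]
```

In this example the middle row stores one nonzero and two padded slots. That padding overhead depends on the row-length distribution of the input, which is one reason the best format and its parameters vary from matrix to matrix.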

Tags: Algorithms, Computer science, CUDA, Linear Algebra, nVidia, Performance, Sparse matrix, Tesla C1060, Tesla C870, Tesla T10P

February 5, 2011 by hgpu
