Auto-Tuning of Level 1 and Level 2 BLAS for GPUs

Hans Henrik Brandenborg Sørensen
Informatics and Mathematical Modelling, Technical University of Denmark, Bldg. 321, DK-2800 Lyngby, Denmark
Concurrency Computat.: Pract. Exper., Wiley, 2012








The use of high-performance libraries for dense linear algebra operations is of great importance in many numerical scientific applications. The most common operations form the backbone of the Basic Linear Algebra Subprograms (BLAS) library. In this paper, we consider the performance and auto-tuning of level 1 and level 2 BLAS routines on GPUs. As examples, we develop single-precision CUDA kernels for three of the most popular operations: the Euclidean norm (SNRM2), the matrix-vector multiplication (SGEMV), and the triangular solve (STRSV). The target hardware is the most recent Nvidia Tesla 20-series (Fermi architecture), which is designed from the ground up for high-performance computing. We show that achieving high performance for level 1 and level 2 BLAS operations is essentially a matter of fully utilizing the fine-grained parallelism of the many-core GPU. We also show that auto-tuning can be successfully applied to the kernels for these operations so that they perform well for all input sizes.
