Auto-Tuning of Level 1 and Level 2 BLAS for GPUs

Hans Henrik Brandenborg Sorensen
Informatics and Mathematical Modelling, Technical University of Denmark, Bldg. 321, DK-2800 Lyngby, Denmark
Concurrency Computat.: Pract. Exper., Wiley, 2012

@article{sorensen2012auto,
   title={Auto-Tuning of Level 1 and Level 2 BLAS for GPUs},
   author={S{\o}rensen, H. H. B.},
   journal={Concurrency and Computation: Practice and Experience},
   publisher={Wiley},
   year={2012}
}



The use of high-performance libraries for dense linear algebra operations is of great importance in many numerical scientific applications. The most common operations form the backbone of the Basic Linear Algebra Subprograms (BLAS) library. In this paper, we consider the performance and auto-tuning of level 1 and level 2 BLAS routines on GPUs. As examples, we develop single-precision CUDA kernels for three of the most popular operations: the Euclidean norm (SNRM2), the matrix-vector multiplication (SGEMV), and the triangular solve (STRSV). The target hardware is the most recent Nvidia Tesla 20-series (Fermi architecture), which is designed from the ground up for high-performance computing. We show that achieving high performance for level 1 and level 2 BLAS operations is essentially a matter of fully utilizing the fine-grained parallelism of the many-core GPU. We also show that auto-tuning can be successfully applied to the kernels for these operations so that they perform well for all input sizes.
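
As a rough illustration of the fine-grained parallelism the abstract refers to, below is a minimal CUDA sketch of an SNRM2-style reduction. It is not the paper's tuned kernel: the kernel name snrm2_partial, the block size BLOCK, and the grid size are illustrative assumptions, and a production SNRM2 additionally rescales the input to avoid overflow and underflow, which this sketch omits. Each thread accumulates a strided partial sum of squares, and each block then reduces its partials in shared memory.

// Minimal SNRM2-style reduction sketch (illustrative, not the paper's kernel).
#include <cstdio>
#include <cmath>
#include <cuda_runtime.h>

#define BLOCK 256  // candidate thread-block size; the kind of parameter an auto-tuner sweeps

__global__ void snrm2_partial(const float *x, int n, float *out)
{
    __shared__ float s[BLOCK];
    float acc = 0.0f;
    // Grid-stride loop: each thread accumulates a strided partial sum of squares.
    for (int i = blockIdx.x * BLOCK + threadIdx.x; i < n; i += gridDim.x * BLOCK)
        acc += x[i] * x[i];
    s[threadIdx.x] = acc;
    __syncthreads();
    // Tree reduction of the per-thread partials in shared memory.
    for (int stride = BLOCK / 2; stride > 0; stride >>= 1) {
        if (threadIdx.x < stride)
            s[threadIdx.x] += s[threadIdx.x + stride];
        __syncthreads();
    }
    if (threadIdx.x == 0)
        out[blockIdx.x] = s[0];  // one partial sum per block
}

int main()
{
    const int n = 1 << 20, blocks = 64;  // grid size: another tunable parameter
    float *x, *partial;
    cudaMallocManaged(&x, n * sizeof(float));
    cudaMallocManaged(&partial, blocks * sizeof(float));
    for (int i = 0; i < n; ++i) x[i] = 1.0f;  // so that ||x|| = sqrt(n)

    snrm2_partial<<<blocks, BLOCK>>>(x, n, partial);
    cudaDeviceSynchronize();

    float sum = 0.0f;  // final reduction of the per-block partials on the host
    for (int i = 0; i < blocks; ++i) sum += partial[i];
    printf("snrm2 = %f (expected %f)\n", sqrtf(sum), sqrtf((float)n));

    cudaFree(x);
    cudaFree(partial);
    return 0;
}

The per-block partials are reduced on the host here for brevity. Choosing BLOCK and the grid size as a function of the input size is precisely the kind of decision that the auto-tuning studied in the paper automates.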