high performance computing on graphics processing units: hgpu.org


Implementing Level-3 BLAS Routines in OpenCL on Different Processing Units

Kazuya Matsumoto, Naohito Nakasato, Stanislav Sedukhin

Graduate School of Computer Science and Engineering, The University of Aizu, Tsuruga, Ikki-Machi, Aizu-Wakamatsu City, Fukushima, 965-8580 Japan

Technical Report 2014-001, 2014

@article{matsumoto2014implementing,
  title={Implementing Level-3 BLAS Routines in OpenCL on Different Processing Units},
  author={Matsumoto, Kazuya and Nakasato, Naohito and Sedukhin, Stanislav},
  year={2014}
}


This paper presents an implementation of different matrix-matrix multiplication routines in OpenCL. We build on the high-performance GEMM (GEneral Matrix-Matrix Multiply) implementation from our previous work to implement the other matrix-matrix multiply routines in Level-3 BLAS (Basic Linear Algebra Subprograms): SYMM (Symmetric Matrix-Matrix Multiply), SYRK (Symmetric Rank-K Update), SYR2K (Symmetric Rank-2K Update), and TRMM (Triangular Matrix-Matrix Multiply). A key to our approach is to use OpenCL copying kernels to copy the given matrix data into a form such that a high-performance GEMM kernel can be utilized for the computation. We use a previously developed auto-tuning system to optimize the copying kernels as well as the GEMM kernel. The performance evaluation of our implementation is conducted on four different GPUs (AMD Radeon R9 290X, FirePro W9100, Radeon HD 7970, and NVIDIA GeForce GTX Titan), a many-core processor (Intel Xeon Phi 5110P), and a multi-core processor (Core i7 3960X). The evaluation results show that tuning the copying kernels is effective and contributes to developing high-performance BLAS3 routines.
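The copy-then-GEMM idea described in the abstract can be illustrated with a minimal sketch. It is written in plain Python rather than OpenCL and is not the authors' implementation: the function names (`gemm`, `copy_symmetric_to_full`, `copy_triangular_to_full`, `symm`, `trmm`) are hypothetical, only the lower triangle of the input is assumed to hold valid data, and the reference `gemm` merely stands in for a tuned GEMM kernel.

```python
def gemm(a, b):
    """Reference GEMM on lists of rows; stands in for a tuned GEMM kernel."""
    n, k, m = len(a), len(b), len(b[0])
    return [[sum(a[i][p] * b[p][j] for p in range(k)) for j in range(m)]
            for i in range(n)]


def copy_symmetric_to_full(a_lower):
    """Copy step for SYMM (assumed form): mirror the lower triangle."""
    n = len(a_lower)
    return [[a_lower[i][j] if j <= i else a_lower[j][i] for j in range(n)]
            for i in range(n)]


def copy_triangular_to_full(a_lower):
    """Copy step for TRMM (assumed form): zero-fill above the diagonal."""
    n = len(a_lower)
    return [[a_lower[i][j] if j <= i else 0.0 for j in range(n)]
            for i in range(n)]


def symm(a_lower, b):
    """SYMM expressed as a copy followed by an ordinary GEMM."""
    return gemm(copy_symmetric_to_full(a_lower), b)


def trmm(a_lower, b):
    """TRMM expressed as a copy followed by an ordinary GEMM."""
    return gemm(copy_triangular_to_full(a_lower), b)


# The entry above the diagonal (99.0) is junk that must never be read.
A_LOWER = [[1.0, 99.0],
           [2.0, 3.0]]
IDENTITY = [[1.0, 0.0],
            [0.0, 1.0]]

print(symm(A_LOWER, IDENTITY))
print(trmm(A_LOWER, IDENTITY))
```

Multiplying by the identity makes the effect of each copy step visible: `symm` returns the mirrored matrix and `trmm` returns the zero-filled lower triangle, while the junk entry above the diagonal never reaches the result. The sketch does not attempt to reproduce the data layouts or auto-tuned parameters of the actual copying kernels.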

Tags: ATI, ATI Radeon HD 7970, BLAS, Computer science, Intel Xeon Phi, Linear Algebra, Matrix multiplication, nVidia, nVidia GeForce GTX Titan, OpenCL

October 29, 2014 by hgpu

