high performance computing on graphics processing units: hgpu.org

Implementing Level-3 BLAS Routines in OpenCL on Different Processing Units

Kazuya Matsumoto, Naohito Nakasato, Stanislav Sedukhin

Graduate School of Computer Science and Engineering, The University of Aizu, Tsuruga, Ikki-Machi, Aizu-Wakamatsu City, Fukushima, 965-8580 Japan

Technical Report 2014-001, 2014

This paper presents an implementation of different matrix-matrix multiplication routines in OpenCL. We build on the high-performance GEMM (GEneral Matrix-Matrix Multiply) implementation from our previous work to implement the other matrix-matrix multiply routines in Level-3 BLAS (Basic Linear Algebra Subprograms): SYMM (Symmetric Matrix-Matrix Multiply), SYRK (Symmetric Rank-K Update), SYR2K (Symmetric Rank-2K Update), and TRMM (Triangular Matrix-Matrix Multiply). The key to our approach is to use OpenCL copying kernels to rearrange the given matrix data into a form on which a high-performance GEMM kernel can operate. We use a previously developed auto-tuning system to optimize the copying kernels as well as the GEMM kernel. The performance evaluation of our implementation is conducted on four different GPUs (AMD Radeon R9 290X, FirePro W9100, Radeon HD 7970, and NVIDIA GeForce GTX Titan), a many-core processor (Intel Xeon Phi 5110P), and a multi-core processor (Core i7 3960X). The evaluation results show that tuning the copying kernels is effective and contributes to developing high-performance BLAS3 routines.
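The copy-then-GEMM idea from the abstract can be illustrated with a minimal sketch in plain Python. This is not the authors' OpenCL code: the function names (`gemm`, `copy_symmetric_lower`, `copy_triangular_lower`, `symm_left_lower`, `trmm_left_lower`) are hypothetical, the matrices are row-major lists of lists, and the copy steps here only expand or mask a triangle, whereas the tuned copying kernels in the report may also change the data layout.

```python
def gemm(alpha, A, B, beta, C):
    """Reference GEMM: return alpha * A @ B + beta * C (lists of lists)."""
    m, k, n = len(A), len(B), len(B[0])
    return [[alpha * sum(A[i][p] * B[p][j] for p in range(k)) + beta * C[i][j]
             for j in range(n)]
            for i in range(m)]


def copy_symmetric_lower(A):
    """Copy step for SYMM: only the lower triangle of A is read;
    the upper triangle is filled in by mirroring."""
    n = len(A)
    return [[A[i][j] if j <= i else A[j][i] for j in range(n)]
            for i in range(n)]


def copy_triangular_lower(A):
    """Copy step for TRMM: keep the lower triangle, zero the rest."""
    n = len(A)
    return [[A[i][j] if j <= i else 0 for j in range(n)]
            for i in range(n)]


def symm_left_lower(alpha, A, B, beta, C):
    """SYMM (left side, lower storage) expressed as copy + GEMM."""
    return gemm(alpha, copy_symmetric_lower(A), B, beta, C)


def trmm_left_lower(alpha, A, B):
    """TRMM (left side, lower triangular) expressed as copy + GEMM."""
    zero = [[0] * len(B[0]) for _ in range(len(B))]
    return gemm(alpha, copy_triangular_lower(A), B, 0, zero)
```

In both routines the entries above the diagonal of `A` are never used as given, which mirrors the BLAS convention that only one triangle of a symmetric or triangular matrix is referenced. All of the arithmetic goes through the single `gemm` routine, so only the copy steps differ between SYMM and TRMM.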

Tags: ATI, ATI Radeon HD 7970, BLAS, Computer science, Intel Xeon Phi, Linear Algebra, Matrix multiplication, nVidia, nVidia GeForce GTX Titan, OpenCL

October 29, 2014 by hgpu
