Accelerating BLAS on Custom Architecture through Algorithm-Architecture Co-design

Farhad Merchant, Tarun Vatwani, Anupam Chattopadhyay, Soumyendu Raha, S K Nandy, Ranjani Narayan
School of Computer Science and Engineering, Nanyang Technological University, Singapore
arXiv:1610.06385 [cs.AR], (20 Oct 2016)


@misc{merchant2016accelerating,

   title={Accelerating BLAS on Custom Architecture through Algorithm-Architecture Co-design},

   author={Merchant, Farhad and Vatwani, Tarun and Chattopadhyay, Anupam and Raha, Soumyendu and Nandy, S K and Narayan, Ranjani},

   year={2016},

   eprint={1610.06385},

   archivePrefix={arXiv},

   primaryClass={cs.AR}

}









Basic Linear Algebra Subprograms (BLAS) play a key role in high-performance and scientific computing applications. Experimentally, recent generations of multicore processors and General Purpose Graphics Processing Units (GPGPUs) achieve only 15 to 57% of peak performance, at 65W to 240W respectively, for compute-bound operations such as Double/Single Precision General Matrix Multiplication (XGEMM), while for bandwidth-bound operations such as Single/Double Precision Matrix-Vector Multiplication (XGEMV) it is merely 5 to 7%. Achieving high performance for BLAS requires moving away from conventional wisdom and evolving toward a customized accelerator tailored for BLAS. In this paper, we present the acceleration of Level-1 (vector operations), Level-2 (matrix-vector operations), and Level-3 (matrix-matrix operations) BLAS through algorithm-architecture co-design on a Coarse-Grained Reconfigurable Architecture (CGRA). We choose the REDEFINE CGRA as the platform for our experiments, since REDEFINE can be adapted to a domain of interest through tailor-made Custom Function Units (CFUs). For efficient sequential realization of BLAS, we present the design of a Processing Element (PE) that achieves up to 74% of its peak performance for DGEMM, 40% for DGEMV, and 20% for DDOT. We attach this PE to the REDEFINE CGRA as a CFU and show the scalability of our solution. Finally, we show a performance improvement of 3-140x for the PE over commercially available Intel micro-architectures, ClearSpeed CSX700, FPGAs, and Nvidia GPGPUs.

* * *

HGPU group © 2010-2024 hgpu.org

All rights belong to the respective authors