
Accelerating BLAS on Custom Architecture through Algorithm-Architecture Co-design

Farhad Merchant, Tarun Vatwani, Anupam Chattopadhyay, Soumyendu Raha, S K Nandy, Ranjani Narayan
School of Computer Science and Engineering, Nanyang Technological University, Singapore
arXiv:1610.06385 [cs.AR] (20 Oct 2016)

@article{merchant2016accelerating,
   title={Accelerating BLAS on Custom Architecture through Algorithm-Architecture Co-design},
   author={Merchant, Farhad and Vatwani, Tarun and Chattopadhyay, Anupam and Raha, Soumyendu and Nandy, S K and Narayan, Ranjani},
   year={2016},
   month={oct},
   eprint={1610.06385},
   archivePrefix={arXiv},
   primaryClass={cs.AR}
}


Basic Linear Algebra Subprograms (BLAS) play a key role in high performance and scientific computing applications. Experimentally, recent multicore processors and General Purpose Graphics Processing Units (GPGPUs) achieve only 15 to 57% of peak performance, at 65W to 240W of power respectively, for compute-bound operations such as Double/Single Precision General Matrix Multiplication (XGEMM), while for bandwidth-bound operations such as Single/Double Precision Matrix-Vector Multiplication (XGEMV) the figure is merely 5 to 7%. Achieving high performance for BLAS therefore requires moving away from conventional wisdom and towards a customized accelerator tailored for BLAS. In this paper, we present acceleration of Level-1 (vector operations), Level-2 (matrix-vector operations), and Level-3 (matrix-matrix operations) BLAS through algorithm-architecture co-design on a Coarse-Grained Reconfigurable Architecture (CGRA). We choose the REDEFINE CGRA as the platform for our experiments since REDEFINE can be adapted to a domain of interest through tailor-made Custom Function Units (CFUs). For efficient sequential realization of BLAS, we present the design of a Processing Element (PE) that achieves up to 74% of its peak performance for DGEMM, 40% for DGEMV, and 20% for DDOT. We attach this PE to the REDEFINE CGRA as a CFU and show the scalability of our solution. Finally, we show a performance improvement of 3-140x for the PE over commercially available Intel micro-architectures, ClearSpeed CSX700, FPGAs, and Nvidia GPGPUs.
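The three BLAS levels studied in the paper correspond to the standard DDOT, DGEMV, and DGEMM routines. For readers unfamiliar with them, the sketch below shows the three calls through the generic CBLAS interface; the problem size, initialization, and CBLAS backend are illustrative assumptions and this is not the paper's REDEFINE PE/CFU implementation. The difference in arithmetic intensity (roughly 2n^3 flops over 3n^2 words for DGEMM versus 2n^2 flops over n^2 words for DGEMV) is what makes the former compute bound and the latter bandwidth bound, which explains the gap in achievable peak performance cited in the abstract.

/* Minimal CBLAS sketch of the three BLAS levels discussed above:
 * Level-1 DDOT, Level-2 DGEMV, Level-3 DGEMM.
 * Sizes and the CBLAS backend (e.g. OpenBLAS) are illustrative assumptions.
 * Build (assumption): cc blas_levels.c -lcblas -o blas_levels */
#include <stdio.h>
#include <stdlib.h>
#include <cblas.h>

int main(void) {
    const int n = 512;                               /* illustrative problem size */
    double *A = malloc((size_t)n * n * sizeof *A);   /* n x n matrix */
    double *B = malloc((size_t)n * n * sizeof *B);   /* n x n matrix */
    double *C = calloc((size_t)n * n, sizeof *C);    /* result matrix */
    double *x = malloc((size_t)n * sizeof *x);       /* input vector */
    double *y = calloc((size_t)n, sizeof *y);        /* result vector */

    for (int i = 0; i < n * n; ++i) { A[i] = 1.0; B[i] = 2.0; }
    for (int i = 0; i < n; ++i)     { x[i] = 1.0; }

    /* Level-1: dot product, ~2n flops on ~2n words (bandwidth bound) */
    double d = cblas_ddot(n, x, 1, x, 1);

    /* Level-2: y = 1.0*A*x + 0.0*y, ~2n^2 flops on ~n^2 words (bandwidth bound) */
    cblas_dgemv(CblasRowMajor, CblasNoTrans, n, n,
                1.0, A, n, x, 1, 0.0, y, 1);

    /* Level-3: C = 1.0*A*B + 0.0*C, ~2n^3 flops on ~3n^2 words (compute bound) */
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans, n, n, n,
                1.0, A, n, B, n, 0.0, C, n);

    printf("ddot = %f, y[0] = %f, C[0,0] = %f\n", d, y[0], C[0]);
    free(A); free(B); free(C); free(x); free(y);
    return 0;
}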
