
Accelerating BLAS on Custom Architecture through Algorithm-Architecture Co-design

Farhad Merchant, Tarun Vatwani, Anupam Chattopadhyay, Soumyendu Raha, S K Nandy, Ranjani Narayan
School of Computer Science and Engineering, Nanyang Technological University, Singapore
arXiv:1610.06385 [cs.AR] (20 Oct 2016)

@article{merchant2016accelerating,
   title={Accelerating BLAS on Custom Architecture through Algorithm-Architecture Co-design},
   author={Merchant, Farhad and Vatwani, Tarun and Chattopadhyay, Anupam and Raha, Soumyendu and Nandy, S K and Narayan, Ranjani},
   year={2016},
   month={oct},
   eprint={1610.06385},
   archivePrefix={arXiv},
   primaryClass={cs.AR}
}


Basic Linear Algebra Subprograms (BLAS) play a key role in high performance and scientific computing applications. Experimentally, recent multicore processors and General Purpose Graphics Processing Units (GPGPUs) achieve only 15 to 57% of peak performance, at 65W to 240W of power respectively, for compute-bound operations such as Double/Single Precision General Matrix Multiplication (XGEMM), while for bandwidth-bound operations such as Single/Double Precision Matrix-Vector Multiplication (XGEMV) the figure is merely 5 to 7%. Achieving high performance for BLAS therefore requires moving away from conventional wisdom and towards a customized accelerator tailored for BLAS. In this paper, we present acceleration of Level-1 (vector operations), Level-2 (matrix-vector operations), and Level-3 (matrix-matrix operations) BLAS through algorithm-architecture co-design on a Coarse-Grained Reconfigurable Architecture (CGRA). We choose the REDEFINE CGRA as the platform for our experiments since REDEFINE can be adapted to a domain of interest through tailor-made Custom Function Units (CFUs). For efficient sequential realization of BLAS, we present the design of a Processing Element (PE) that achieves up to 74% of its peak performance for DGEMM, 40% for DGEMV, and 20% for DDOT. We attach this PE to the REDEFINE CGRA as a CFU and show the scalability of our solution. Finally, we show a performance improvement of 3-140x for the PE over commercially available Intel micro-architectures, ClearSpeed CSX700, FPGAs, and Nvidia GPGPUs.
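The three BLAS levels studied in the paper correspond to the standard DDOT, DGEMV, and DGEMM routines. For readers unfamiliar with them, the sketch below shows the three calls through the generic CBLAS interface; the problem size, initialization, and CBLAS backend are illustrative assumptions and this is not the paper's REDEFINE PE/CFU implementation. The difference in arithmetic intensity (roughly 2n^3 flops over 3n^2 words for DGEMM versus 2n^2 flops over n^2 words for DGEMV) is what makes the former compute bound and the latter bandwidth bound, which explains the gap in achievable peak performance cited in the abstract.

/* Minimal CBLAS sketch of the three BLAS levels discussed above:
 * Level-1 DDOT, Level-2 DGEMV, Level-3 DGEMM.
 * Sizes and the CBLAS backend (e.g. OpenBLAS) are illustrative assumptions.
 * Build (assumption): cc blas_levels.c -lcblas -o blas_levels */
#include <stdio.h>
#include <stdlib.h>
#include <cblas.h>

int main(void) {
    const int n = 512;                               /* illustrative problem size */
    double *A = malloc((size_t)n * n * sizeof *A);   /* n x n matrix */
    double *B = malloc((size_t)n * n * sizeof *B);   /* n x n matrix */
    double *C = calloc((size_t)n * n, sizeof *C);    /* result matrix */
    double *x = malloc((size_t)n * sizeof *x);       /* input vector */
    double *y = calloc((size_t)n, sizeof *y);        /* result vector */

    for (int i = 0; i < n * n; ++i) { A[i] = 1.0; B[i] = 2.0; }
    for (int i = 0; i < n; ++i)     { x[i] = 1.0; }

    /* Level-1: dot product, ~2n flops on ~2n words (bandwidth bound) */
    double d = cblas_ddot(n, x, 1, x, 1);

    /* Level-2: y = 1.0*A*x + 0.0*y, ~2n^2 flops on ~n^2 words (bandwidth bound) */
    cblas_dgemv(CblasRowMajor, CblasNoTrans, n, n,
                1.0, A, n, x, 1, 0.0, y, 1);

    /* Level-3: C = 1.0*A*B + 0.0*C, ~2n^3 flops on ~3n^2 words (compute bound) */
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans, n, n, n,
                1.0, A, n, B, n, 0.0, C, n);

    printf("ddot = %f, y[0] = %f, C[0,0] = %f\n", d, y[0], C[0]);
    free(A); free(B); free(C); free(x); free(y);
    return 0;
}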
