Improving the Performance of OpenCL-based FPGA Accelerator for Convolutional Neural Network

Jialiang Zhang, Jing Li
Department of Electrical and Computer Engineering, University of Wisconsin-Madison
25th ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (FPGA2017), 2017

@inproceedings{zhang2017improving,
   title={Improving the Performance of OpenCL-based FPGA Accelerator for Convolutional Neural Network},
   author={Zhang, Jialiang and Li, Jing},
   booktitle={25th ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (FPGA 2017)},
   year={2017}
}

OpenCL for FPGA has recently gained great popularity with emerging needs for workload acceleration such as the Convolutional Neural Network (CNN), the most popular deep learning architecture in the domain of computer vision. While OpenCL enhances the code portability and programmability of FPGA, it comes at the expense of performance. The key challenge is to optimize the OpenCL kernels to efficiently utilize the flexible hardware resources in FPGA. Simply optimizing the OpenCL kernel code through various compiler options turns out to be insufficient to achieve desirable performance for both compute-intensive and data-intensive workloads such as convolutional neural networks. In this paper, we first propose an analytical performance model and apply it to perform an in-depth analysis of the resource requirements of CNN classifier kernels and the resources available on modern FPGAs. We identify that the key performance bottleneck is the on-chip memory bandwidth. We propose a new kernel design that effectively addresses this bandwidth limitation and provides an optimal balance between computation, on-chip, and off-chip memory access. As a case study, we further apply these techniques to design a CNN accelerator based on the VGG model. Finally, we evaluate the performance of our CNN accelerator using an Altera Arria 10 GX1150 board. We achieve 866 Gop/s floating-point performance at a 370 MHz working frequency and 1.79 Top/s 16-bit fixed-point performance at 385 MHz. To the best of our knowledge, our implementation achieves the best power efficiency and performance density compared to existing work.
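
The abstract describes an analytical performance model that weighs a kernel's compute demand against memory bandwidth to locate the bottleneck. The paper's actual model is not reproduced here; the sketch below is only a minimal, roofline-style illustration in Python of how such a compute-versus-bandwidth bound can be estimated for a single convolution layer. All helper names and device parameters (peak_gops, bandwidth_gbs) are hypothetical placeholders, not the Arria 10 figures reported in the paper.

    # Roofline-style sketch of a compute-vs-bandwidth bound for one conv layer.
    # Illustrative only; device numbers are placeholders, not the paper's values.

    def conv_layer_ops(h_out, w_out, c_in, c_out, k):
        """Operations of one k x k convolution layer (each MAC counted as 2 ops)."""
        return 2 * h_out * w_out * c_in * c_out * k * k

    def conv_layer_bytes(h_out, w_out, c_in, c_out, k, bytes_per_word):
        """Rough data volume: input tile, weights, and output feature map."""
        inputs  = (h_out + k - 1) * (w_out + k - 1) * c_in
        weights = c_in * c_out * k * k
        outputs = h_out * w_out * c_out
        return (inputs + weights + outputs) * bytes_per_word

    def attainable_gops(ops, bytes_moved, peak_gops, bandwidth_gbs):
        """Roofline bound: min(peak compute, arithmetic intensity * bandwidth)."""
        intensity = ops / bytes_moved            # ops per byte
        return min(peak_gops, intensity * bandwidth_gbs)

    if __name__ == "__main__":
        # Example: a VGG-style 3x3 layer with 256 input/output channels, 16-bit data.
        ops   = conv_layer_ops(56, 56, 256, 256, 3)
        moved = conv_layer_bytes(56, 56, 256, 256, 3, bytes_per_word=2)
        # Placeholder device parameters (NOT the measured Arria 10 numbers).
        bound = attainable_gops(ops, moved, peak_gops=2000, bandwidth_gbs=100)
        print(f"arithmetic intensity: {ops / moved:.1f} ops/byte")
        print(f"attainable throughput bound: {bound:.0f} Gop/s")

Comparing the resulting bound across candidate kernel configurations is one way such a model can expose whether computation or memory bandwidth limits throughput, which is the kind of analysis the paper uses to motivate its new kernel design.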