high performance computing on graphics processing units: hgpu.org

hgpu.org » Applications » Computer science » Acceleration of Deep Learning on FPGA

Acceleration of Deep Learning on FPGA

Huyuan Li

University of Windsor

University of Windsor, 2017

BibTeX

Download (PDF)

View

Source

2566

views

In recent years, deep convolutional neural networks (ConvNet) have shown their popularity in various real world applications. To provide more accurate results, the state-of-the-art ConvNet requires millions of parameters and billions of operations to process a single image, which represents a computational challenge for general purpose processors. As a result, hardware accelerators such as Graphic Processing Units (GPUs) and Field Programmable Gate Arrays (FPGAs), have been adopted to improve the performance of ConvNet. However, GPU-based solution consumes a considerable amount of power and a traditional RTL design on FPGA requires tedious development that is very time-consuming. In this work, we propose a scalable and parameterized end-to-end ConvNet design using Intel FPGA SDK for OpenCL. To validate the design, we implement VGG 16 model on two different FPGA boards. Consequently, our designs achieve 306.41 GOPS on Intel Stratix A7 and 318.94 GOPS on Intel Arria 10 GX 10AX115. To the best of our knowledge, this outperforms previous FPGA-based accelerators. Compared to the CPU (Intel Xeon E5-2620) and a mid-range GPU (Nvidia K40), our design is 24.3X and 1.7X more energy efficient respectively.

Tags: CNN, Computer science, Deep learning, FPGA, Neural networks, nVidia, OpenCL, Tesla K40, Thesis

May 9, 2017 by hgpu

Rating: 1.5/5. From 2 votes.

Please wait...

Your response

You must be logged in to post a comment.

* * *

high performance computing on graphics processing units: hgpu.org

Acceleration of Deep Learning on FPGA

Your response

Recent source codes

Mutual-Supervised Learning for Sequential-to-Parallel Code Translation

Hardware Compute Partitioning on NVIDIA GPUs for Composable Systems

KISim: Kubernetes Intelligent Scheduling Simulator

Efficient GPU Implementation of Multi-Precision Integer Division

exa-AMD: Exascale Accelerated Materials Discovery

ParEval: A Parallel Code Evaluation Benchmark

FlashSparse: Minimizing Computation Redundancy for Fast Sparse Matrix Multiplications on Tensor Cores

WiLLM: An Open Wireless LLM Communication System

Vcc: the Vulkan Clang Compiler

hpcbench: A set of benchmarking utilities for biomolecular simulation tools

Most viewed papers (last 30 days)

Acceleration of Deep Learning on FPGA

Share this:

Your response

Recent source codes

Most viewed papers (last 30 days)