FC_ACCEL: Enabling Efficient, Low-Latency and Flexible Inference in DNN Fully Connected Layers, using Optimized Checkerboard Block matrix decomposition, fast scheduling, and a resource efficient 1D PE array with a custom HBM2 memory subsystem
University of Illinois at Chicago
University of Illinois at Chicago, 2022
@article{iliev2022fc_accel,
title={FC_ACCEL: Enabling Efficient, Low-Latency and Flexible Inference in DNN Fully Connected Layers, using Optimized Checkerboard Block matrix decomposition, fast scheduling, and a resource efficient 1D PE array with a custom HBM2 memory subsystem},
author={Iliev, Nick and Trivedi, Amit R},
year={2022}
}
This article presents a novel low-latency CMOS hardware accelerator for fully connected (FC) layers in deep neural networks (DNNs). The accelerator, FC-Accel, is based on 128 8×8 or 16×16 processing elements (PEs) for matrix-vector multiplication and 128 multiply-accumulate (MAC) units, integrated with 16 High Bandwidth Memory (HBM) stack units for storing the pre-trained weights. Instead of a dedicated non-blocking crossbar switch, a low-latency page-bus demultiplexer-based interconnect links the 16 HBM stacks to the 128-PE array. We show near-linear speedup and reductions in space and time complexity relative to traditional parallel matrix-vector multiplication, using a checkerboard block decomposition algorithm and a novel matched HBM2 memory subsystem for weight and input-feature storage. Computation uses 16-bit fixed-point arithmetic on the key kernel of DNN FC-layer computation: an FC kernel with K×M tiles that can be scaled to different FC layer sizes. We have designed a flexible processing element (PE) that implements the scalable kernel in a 1D array of PEs to conserve resources. PEs can be reconfigured as required by the layer being processed (e.g., FC6, FC7, or FC8 in AlexNet or VGG16). Micro-architectural details for CMOS ASIC implementations are presented, and simulated performance is compared with recent DNN hardware accelerators on AlexNet and VGG16. For simulated FC8-layer processing latency, FC-Accel achieves 108 GOPS (non-pipelined, 100 MHz clock) and 1048 GOPS (pipelined, 662 MHz clock), improving on the recent EIE accelerator quoted at 102 GOPS with an 800 MHz clock and weight compression for the same FC8 layer. Compared to Tensaurus, a recent accelerator for sparse-dense tensor computations, FC-Accel (clocked at 662 MHz) delivers a 2.5× increase in throughput over Tensaurus (clocked at 2 GHz) on VGG16 FC8. The Xilinx Versal-ACAP VC1902 FPGA has an FC8 inferencing latency of 158 usec at 1.33 GHz, which is much slower than FC-Accel’s FC8 latency of 8.5 usec. Compared with an NVIDIA Jetson AGX Xavier GPU running inference on VGG16 FC8, FC-Accel reduces FC8 inferencing latency from the GPU’s average of 120 usec to 8.5 usec. Intel’s Arria-10 DLA FPGA achieves 26 usec for the VGG16 FC8 layer, roughly 3× the latency of the proposed solution.
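To make the checkerboard block decomposition concrete, below is a minimal NumPy reference sketch (not the FC-Accel micro-architecture) of a tiled matrix-vector multiply for an FC layer. It assumes the 16×16 tile size mentioned in the abstract; each tile stands in for one PE's MAC workload, and partial sums along a block row are accumulated into the output slice. The function name, tile parameters, and FC8-like shape in the usage check are illustrative assumptions, not details from the paper.

```python
import numpy as np

def fc_layer_checkerboard(W, x, tile_rows=16, tile_cols=16):
    """Software sketch of a checkerboard-block (tiled) matrix-vector
    multiply y = W @ x for an FC layer, computed tile by tile.

    Each (tile_rows x tile_cols) block models one PE's workload; partial
    products from tiles in the same block row are accumulated, mirroring
    MAC accumulation across tiles. Illustrative only, not FC-Accel itself.
    """
    n_out, n_in = W.shape
    assert x.shape == (n_in,)
    y = np.zeros(n_out, dtype=np.int64)  # wide accumulator for fixed-point sums

    for r0 in range(0, n_out, tile_rows):        # block row -> output slice
        for c0 in range(0, n_in, tile_cols):     # block column -> input slice
            W_tile = W[r0:r0 + tile_rows, c0:c0 + tile_cols]
            x_tile = x[c0:c0 + tile_cols]
            # One tile's MAC work, accumulated into the running output slice.
            y[r0:r0 + tile_rows] += W_tile.astype(np.int64) @ x_tile.astype(np.int64)
    return y

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Hypothetical FC8-like shape (1000 outputs, 4096 inputs) with int16 data.
    W = rng.integers(-128, 128, size=(1000, 4096), dtype=np.int16)
    x = rng.integers(-128, 128, size=4096, dtype=np.int16)
    assert np.array_equal(fc_layer_checkerboard(W, x),
                          W.astype(np.int64) @ x.astype(np.int64))
```

In hardware, the same block-row/block-column loop structure maps to scheduling tiles onto the 1D PE array, with the HBM2 subsystem streaming the weight tiles; the sketch only checks the arithmetic equivalence of the tiled decomposition against a direct matrix-vector product.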
February 13, 2022 by hgpu