FC_ACCEL: Enabling Efficient, Low-Latency and Flexible Inference in DNN Fully Connected Layers, using Optimized Checkerboard Block matrix decomposition, fast scheduling, and a resource efficient 1D PE array with a custom HBM2 memory subsystem
University of Illinois at Chicago
University of Illinois at Chicago, 2022
@article{iliev2022fc_accel,
title={FC_ACCEL: Enabling Efficient, Low-Latency and Flexible Inference in DNN Fully Connected Layers, using Optimized Checkerboard Block matrix decomposition, fast scheduling, and a resource efficient 1D PE array with a custom HBM2 memory subsystem},
author={Iliev, Nick and Trivedi, Amit R},
year={2022}
}
This article presents a novel low-latency CMOS hardware accelerator for fully connected (FC) layers in deep neural networks (DNNs). The accelerator, FC-Accel, is based on 128 8×8 or 16×16 processing elements (PEs) for matrix-vector multiplication and 128 multiply-accumulate (MAC) units, integrated with 16 High Bandwidth Memory (HBM) stack units for storing the pre-trained weights. Instead of a dedicated non-blocking crossbar switch, a low-latency page-bus demultiplexer-based interconnect links the 16 HBM stacks to the 128-PE array. We show near-linear speedup and reductions in space and time complexity relative to traditional parallel matrix-vector multiplication, using a checkerboard block decomposition algorithm and a novel matched HBM2 memory subsystem for weight and input-feature storage. Computation uses 16-bit fixed-point arithmetic on the key kernel of DNN FC-layer computation: an FC kernel with K×M tiles that can be scaled to different FC layer sizes. We have designed a flexible processing element (PE) that implements the scalable kernel in a 1D array of PEs to conserve resources. PEs can be reconfigured as required by the layer being processed (e.g., FC6, FC7, or FC8 in AlexNet or VGG16). Micro-architectural details for CMOS ASIC implementations are presented, and simulated performance is compared with recent DNN hardware accelerators on AlexNet and VGG16. For simulated FC8-layer processing latency, FC-Accel achieves 108 GOPS (non-pipelined, 100 MHz clock) and 1048 GOPS (pipelined, 662 MHz clock), improving on the recent EIE accelerator quoted at 102 GOPS with an 800 MHz clock and weight compression for the same FC8 layer. Compared to Tensaurus, a recent accelerator for sparse-dense tensor computations, FC-Accel (clocked at 662 MHz) delivers a 2.5× increase in throughput over Tensaurus (clocked at 2 GHz) on VGG16 FC8. The Xilinx Versal-ACAP VC1902 FPGA has an FC8 inferencing latency of 158 usec at 1.33 GHz, which is much slower than FC-Accel’s FC8 latency of 8.5 usec. Compared with an NVIDIA Jetson AGX Xavier GPU running inference on VGG16 FC8, FC-Accel reduces FC8 inferencing latency from the GPU’s average of 120 usec to 8.5 usec. Intel’s Arria-10 DLA FPGA achieves 26 usec for the VGG16 FC8 layer, roughly 3× the latency of the proposed solution.
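To make the checkerboard block decomposition concrete, below is a minimal NumPy reference sketch (not the FC-Accel micro-architecture) of a tiled matrix-vector multiply for an FC layer. It assumes the 16×16 tile size mentioned in the abstract; each tile stands in for one PE's MAC workload, and partial sums along a block row are accumulated into the output slice. The function name, tile parameters, and FC8-like shape in the usage check are illustrative assumptions, not details from the paper.

```python
import numpy as np

def fc_layer_checkerboard(W, x, tile_rows=16, tile_cols=16):
    """Software sketch of a checkerboard-block (tiled) matrix-vector
    multiply y = W @ x for an FC layer, computed tile by tile.

    Each (tile_rows x tile_cols) block models one PE's workload; partial
    products from tiles in the same block row are accumulated, mirroring
    MAC accumulation across tiles. Illustrative only, not FC-Accel itself.
    """
    n_out, n_in = W.shape
    assert x.shape == (n_in,)
    y = np.zeros(n_out, dtype=np.int64)  # wide accumulator for fixed-point sums

    for r0 in range(0, n_out, tile_rows):        # block row -> output slice
        for c0 in range(0, n_in, tile_cols):     # block column -> input slice
            W_tile = W[r0:r0 + tile_rows, c0:c0 + tile_cols]
            x_tile = x[c0:c0 + tile_cols]
            # One tile's MAC work, accumulated into the running output slice.
            y[r0:r0 + tile_rows] += W_tile.astype(np.int64) @ x_tile.astype(np.int64)
    return y

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Hypothetical FC8-like shape (1000 outputs, 4096 inputs) with int16 data.
    W = rng.integers(-128, 128, size=(1000, 4096), dtype=np.int16)
    x = rng.integers(-128, 128, size=4096, dtype=np.int16)
    assert np.array_equal(fc_layer_checkerboard(W, x),
                          W.astype(np.int64) @ x.astype(np.int64))
```

In hardware, the same block-row/block-column loop structure maps to scheduling tiles onto the 1D PE array, with the HBM2 subsystem streaming the weight tiles; the sketch only checks the arithmetic equivalence of the tiled decomposition against a direct matrix-vector product.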
February 13, 2022 by hgpu