CaffePresso: An Optimized Library for Deep Learning on Embedded Accelerator-based platforms
School of Computer Science and Engineering, Nanyang Technological University, Singapore 639798
International Conference on Compilers, Architecture, and Synthesis for Embedded Systems, 2016
@inproceedings{hegde2016caffepresso,
title={CaffePresso: An Optimized Library for Deep Learning on Embedded Accelerator-based platforms},
author={Hegde, Gopalakrishna and Ramasamy, Nachiappan and Kapre, Nachiket},
booktitle={International Conference on Compilers, Architecture, and Synthesis for Embedded Systems},
year={2016}
}
Off-the-shelf accelerator-based embedded platforms offer a competitive, energy-efficient solution for lightweight deep learning computations compared to CPU-based systems. Low-complexity classifiers used in power-constrained and performance-limited scenarios are characterized by operations on small image maps with 2-3 deep layers and few class labels. For these use cases, we consider a range of embedded systems with 5-20 W power budgets, such as the Xilinx ZC706 board (with MXP soft vector processor), NVIDIA Jetson TX1 (GPU), TI Keystone II (DSP), and the Adapteva Parallella board (custom multi-core with NoC). Deep learning computations push the capabilities of these platforms to the limit through compute-intensive evaluations of multiple 2D convolution filters per layer and high communication requirements arising from the movement of intermediate maps across layers. We present CaffePresso, a Caffe-compatible framework for generating optimized mappings of user-supplied ConvNet specifications to various accelerators such as FPGAs, DSPs, GPUs, and RISC multicores. We use an automated code generation and autotuning approach based on knowledge of the ConvNet requirements, as well as platform-specific constraints such as on-chip memory capacity, bandwidth, and ALU potential. While one may expect the Jetson TX1 + cuDNN to deliver high performance for ConvNet configurations, we observe (1) a flipped result, with slower GPU processing than most other systems for smaller embedded-friendly datasets such as MNIST and CIFAR10, and (2) a faster and more energy-efficient implementation on the older 28nm TI Keystone II DSP than on the newer 20nm NVIDIA TX1 SoC in all cases.
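To make the role of on-chip memory capacity in autotuning concrete, the following is a minimal sketch, not CaffePresso's actual tuner. It assumes a single "valid" 2D convolution layer whose input tile, output tile, and all filter weights must sit in on-chip memory at once. The function names `conv_working_set_bytes` and `pick_tile`, the 4-byte element size, and the layer dimensions are hypothetical and chosen only for illustration.

```python
def conv_working_set_bytes(tile_h, tile_w, k, in_maps, out_maps, bytes_per_elem=4):
    """Bytes needed on chip to produce one tile_h x tile_w output tile.

    Assumes a 'valid' k x k convolution, so the input tile carries a halo
    of k - 1 extra rows and columns. All names here are illustrative.
    """
    in_tile = (tile_h + k - 1) * (tile_w + k - 1) * in_maps
    out_tile = tile_h * tile_w * out_maps
    weights = k * k * in_maps * out_maps
    return (in_tile + out_tile + weights) * bytes_per_elem


def pick_tile(out_h, out_w, k, in_maps, out_maps, on_chip_bytes):
    """Exhaustively search for the largest-area output tile that fits on chip.

    Returns (tile_h, tile_w), or None if not even a 1 x 1 tile fits.
    """
    best = None
    for th in range(1, out_h + 1):
        for tw in range(1, out_w + 1):
            need = conv_working_set_bytes(th, tw, k, in_maps, out_maps)
            if need <= on_chip_bytes:
                if best is None or th * tw > best[0] * best[1]:
                    best = (th, tw)
    return best


if __name__ == "__main__":
    # Hypothetical layer: 24 x 24 output, 5 x 5 filters, 1 input map, 20 output maps.
    for budget_kb in (2, 32, 64):
        tile = pick_tile(24, 24, 5, 1, 20, budget_kb * 1024)
        print(f"{budget_kb} KB on-chip budget -> tile {tile}")
```

A real tuner would also weigh bandwidth and ALU throughput, which the abstract lists alongside memory capacity; this sketch models only the capacity constraint.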
August 11, 2016 by hgpu