CaffePresso: An Optimized Library for Deep Learning on Embedded Accelerator-based platforms

Gopalakrishna Hegde, Siddhartha, Nachiappan Ramasamy, Nachiket Kapre

School of Computer Science and Engineering, Nanyang Technological University, Singapore 639798

International Conference on Compilers, Architecture, and Synthesis for Embedded Systems, 2016

@article{hegde2016caffepresso,
  title={CaffePresso: An Optimized Library for Deep Learning on Embedded Accelerator-based platforms},
  author={Hegde, Gopalakrishna and {Siddhartha} and Ramasamy, Nachiappan and Kapre, Nachiket},
  year={2016}
}

Off-the-shelf accelerator-based embedded platforms offer a competitive, energy-efficient alternative to CPU-based systems for lightweight deep learning computations. Low-complexity classifiers used in power-constrained and performance-limited scenarios are characterized by operations on small image maps with 2-3 deep layers and few class labels. For these use cases, we consider a range of embedded systems with 5-20 W power budgets, such as the Xilinx ZC706 board (with MXP soft vector processor), NVIDIA Jetson TX1 (GPU), TI Keystone II (DSP), and the Adapteva Parallella board (custom multi-core with NoC). Deep learning computations push the capabilities of these platforms to the limit through compute-intensive evaluations of multiple 2D convolution filters per layer, and through high communication requirements arising from the movement of intermediate maps across layers. We present CaffePresso, a Caffe-compatible framework for generating optimized mappings of user-supplied ConvNet specifications to various target accelerators such as FPGAs, DSPs, GPUs, and RISC multicores. We use an automated code generation and autotuning approach based on knowledge of the ConvNet requirements, as well as platform-specific constraints such as on-chip memory capacity, bandwidth, and ALU potential. While one may expect the Jetson TX1 + cuDNN to deliver high performance for ConvNet configurations, (1) we observe a flipped result, with slower GPU processing compared to most other systems for smaller embedded-friendly datasets such as MNIST and CIFAR10, and (2) we find faster and more energy-efficient implementations on the older 28 nm TI Keystone II DSP than on the newer 20 nm NVIDIA TX1 SoC in all cases.
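To make the idea of tuning against on-chip memory capacity more concrete, here is a minimal sketch in Python. It is not CaffePresso's actual code generator or autotuner, which the abstract does not describe in detail. All names (`ConvLayer`, `conv_macs`, `tile_bytes`, `pick_tile_rows`) are hypothetical, and the memory model (one input row-tile, one output tile, and one kernel resident at a time, stride 1, no padding) is a simplifying assumption made for illustration only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConvLayer:
    """Hypothetical description of one 2D convolution layer."""
    in_maps: int   # number of input feature maps
    out_maps: int  # number of output feature maps
    height: int    # input map height in pixels
    width: int     # input map width in pixels
    kernel: int    # side length of the square K x K kernel


def conv_macs(layer: ConvLayer) -> int:
    """Multiply-accumulate count for a stride-1, unpadded convolution layer."""
    out_h = layer.height - layer.kernel + 1
    out_w = layer.width - layer.kernel + 1
    return layer.in_maps * layer.out_maps * out_h * out_w * layer.kernel ** 2


def tile_bytes(layer: ConvLayer, tile_rows: int, bytes_per_value: int) -> int:
    """Bytes needed on chip for one input row-tile, its output tile, and one kernel.

    Assumption: producing `tile_rows` output rows needs
    `tile_rows + kernel - 1` full-width input rows.
    """
    in_rows = tile_rows + layer.kernel - 1
    out_w = layer.width - layer.kernel + 1
    values = in_rows * layer.width + tile_rows * out_w + layer.kernel ** 2
    return values * bytes_per_value


def pick_tile_rows(layer: ConvLayer, on_chip_bytes: int, bytes_per_value: int = 2) -> int:
    """Largest number of output rows per tile that fits on chip (0 if none fits)."""
    max_rows = layer.height - layer.kernel + 1
    best = 0
    for rows in range(1, max_rows + 1):
        if tile_bytes(layer, rows, bytes_per_value) <= on_chip_bytes:
            best = rows
    return best


if __name__ == "__main__":
    # Illustrative layer: one 28x28 input map (MNIST-sized), six 5x5 filters.
    layer = ConvLayer(in_maps=1, out_maps=6, height=28, width=28, kernel=5)
    print("MACs:", conv_macs(layer))
    print("rows per tile with a 2 KiB scratchpad:", pick_tile_rows(layer, 2048))
```

A real autotuner would also weigh memory bandwidth and ALU throughput, as the abstract notes; this sketch covers only the capacity constraint, and the 2 KiB scratchpad size is an arbitrary example value rather than a figure for any of the boards listed.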

Tags: Computer science, CUDA, Deep learning, DSP, FPGA, nVidia, nVidia Tegra TX1

August 11, 2016 by hgpu
