CaffePresso: An Optimized Library for Deep Learning on Embedded Accelerator-based platforms

Gopalakrishna Hegde, Siddhartha, Nachiappan Ramasamy, Nachiket Kapre

School of Computer Science and Engineering, Nanyang Technological University, Singapore 639798

International Conference on Compilers, Architecture, and Synthesis for Embedded Systems, 2016

Off-the-shelf accelerator-based embedded platforms offer a competitive, energy-efficient alternative to CPU-based systems for lightweight deep learning computations. Low-complexity classifiers used in power-constrained and performance-limited scenarios are characterized by operations on small image maps with 2-3 deep layers and few class labels. For these use cases, we consider a range of embedded systems with 5-20 W power budgets, such as the Xilinx ZC706 board (with MXP soft vector processor), NVIDIA Jetson TX1 (GPU), TI Keystone II (DSP), and the Adapteva Parallella board (custom multi-core with NoC). Deep learning computations push the capabilities of these platforms to the limit through compute-intensive evaluations of multiple 2D convolution filters per layer and high communication requirements arising from the movement of intermediate maps across layers. We present CaffePresso, a Caffe-compatible framework for generating optimized mappings of user-supplied ConvNet specifications to various target accelerators such as FPGAs, DSPs, GPUs, and RISC multicores. We use an automated code generation and autotuning approach based on knowledge of the ConvNet requirements as well as platform-specific constraints such as on-chip memory capacity, bandwidth, and ALU potential. While one may expect the Jetson TX1 + cuDNN to deliver high performance for ConvNet configurations, we observe (1) a flipped result, with slower GPU processing than most other systems for smaller embedded-friendly datasets such as MNIST and CIFAR10, and (2) a faster and more energy-efficient implementation on the older 28nm TI Keystone II DSP than on the newer 20nm NVIDIA TX1 SoC in all cases.
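To make the compute and communication pressure described in the abstract more concrete, the following is a minimal Python sketch of how one might estimate the multiply-accumulate count and the intermediate map size of a single convolution layer, and check that size against an on-chip memory budget. It is not taken from CaffePresso: the `ConvLayer` class, the `fits_on_chip` helper, the layer shape (a 28x28 input, as in MNIST, with twenty 5x5 filters) and the memory capacities are all hypothetical values chosen for illustration, and the model assumes square maps, stride 1, no padding, and 4-byte values.

```python
from dataclasses import dataclass


@dataclass
class ConvLayer:
    """Hypothetical description of one 2D convolution layer (illustration only)."""
    in_maps: int   # number of input feature maps
    out_maps: int  # number of output feature maps
    in_size: int   # input map height == width (square maps assumed)
    kernel: int    # kernel height == width (stride 1, no padding assumed)

    @property
    def out_size(self) -> int:
        # Output height/width for a stride-1 convolution without padding.
        return self.in_size - self.kernel + 1

    def macs(self) -> int:
        # One multiply-accumulate per kernel tap, per output pixel,
        # per (input map, output map) pair.
        return self.in_maps * self.out_maps * self.out_size ** 2 * self.kernel ** 2

    def output_bytes(self, bytes_per_value: int = 4) -> int:
        # Size of the intermediate maps that must move to the next layer.
        return self.out_maps * self.out_size ** 2 * bytes_per_value


def fits_on_chip(layer: ConvLayer, on_chip_bytes: int, bytes_per_value: int = 4) -> bool:
    """Return True if the layer's output maps fit in the given on-chip capacity."""
    return layer.output_bytes(bytes_per_value) <= on_chip_bytes


# Illustrative layer shape (an assumption, not a configuration from the paper).
layer = ConvLayer(in_maps=1, out_maps=20, in_size=28, kernel=5)

print(layer.macs())                     # multiply-accumulates for this layer
print(layer.output_bytes())             # bytes of intermediate maps produced
print(fits_on_chip(layer, 32 * 1024))   # against a hypothetical 32 KiB scratchpad
print(fits_on_chip(layer, 64 * 1024))   # against a hypothetical 64 KiB scratchpad
```

An autotuner of the kind the abstract describes would presumably combine estimates like these with measured bandwidth and ALU throughput when choosing a mapping; this sketch only covers the static counting step.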

Tags: Computer science, CUDA, Deep learning, DSP, FPGA, nVidia, nVidia Tegra TX1

August 11, 2016 by hgpu
