DNN is not all you need: Parallelizing Non-Neural ML Algorithms on Ultra-Low-Power IoT Processors

hgpu.org » Applications » Computer science » DNN is not all you need: Parallelizing Non-Neural ML Algorithms on Ultra-Low-Power IoT Processors

DNN is not all you need: Parallelizing Non-Neural ML Algorithms on Ultra-Low-Power IoT Processors

Enrico Tabanelli, Giuseppe Tagliavini, Luca Benini

Department of Electrical, Electronic and Information Engineering, University of Bologna, Bologna, Italy 40136

arXiv:2107.09448 [cs.AR], (16 Jul 2021)

@misc{tabanelli2021dnn,

title={DNN is not all you need: Parallelizing Non-Neural ML Algorithms on Ultra-Low-Power IoT Processors},

author={Enrico Tabanelli and Giuseppe Tagliavini and Luca Benini},

year={2021},

eprint={2107.09448},

archivePrefix={arXiv},

primaryClass={cs.AR}

}

Download (PDF)

View

Source

Source codes

Package:

PULP: a (Parallel Ultra-Low-Power) open-source multi-core computing platform

1732

views

Machine Learning (ML) functions are becoming ubiquitous in latency- and privacy-sensitive IoT applications, prompting for a shift toward near-sensor processing at the extreme edge and the consequent increasing adoption of Parallel Ultra-Low Power (PULP) IoT processors. These compute- and memory-constrained parallel architectures need to run efficiently a wide range of algorithms, including key Non-Neural ML kernels that compete favorably with Deep Neural Networks (DNNs) in terms of accuracy under severe resource constraints. In this paper, we focus on enabling efficient parallel execution of Non-Neural ML algorithms on two RISCV-based PULP platforms, namely GAP8, a commercial chip, and PULP-OPEN, a research platform running on an FPGA emulator. We optimized the parallel algorithms through a fine-grained analysis and intensive optimization to maximize the speedup, considering two alternative Floating-Point (FP) emulation libraries on GAP8 and the native FPU support on PULP-OPEN. Experimental results show that a target-optimized emulation library can lead to an average 1.61x runtime improvement compared to a standard emulation library, while the native FPU support reaches up to 32.09x. In terms of parallel speedup, our design improves the sequential execution by 7.04x on average on the targeted octa-core platforms. Lastly, we present a comparison with the ARM Cortex-M4 microcontroller (MCU), a widely adopted commercial solution for edge deployments, which is 12.87$x slower than PULP-OPEN.

Tags: Computer science, FPGA, Machine learning, Neural networks, nVidia, Package, SoC, Tesla A100

July 25, 2021 by hgpu

No votes yet.

Please wait...

Your response

You must be logged in to post a comment.

high performance computing on graphics processing units: hgpu.org