
Open-source FPGA-ML codesign for the MLPerf Tiny Benchmark

Hendrik Borras, Giuseppe Di Guglielmo, Javier Duarte, Nicolò Ghielmetti, Ben Hawks, Scott Hauck, Shih-Chieh Hsu, Ryan Kastner, Jason Liang, Andres Meza, Jules Muhizi, Tai Nguyen, Rushil Roy, Nhan Tran, Yaman Umuroglu, Olivia Weng, Aidan Yokuda, Michaela Blott
Heidelberg University, Heidelberg, Germany
arXiv:2206.11791 [cs.LG] (23 Jun 2022)

@misc{borras2022opensource,
   doi={10.48550/arXiv.2206.11791},
   url={https://arxiv.org/abs/2206.11791},
   author={Borras, Hendrik and Di Guglielmo, Giuseppe and Duarte, Javier and Ghielmetti, Nicolò and Hawks, Ben and Hauck, Scott and Hsu, Shih-Chieh and Kastner, Ryan and Liang, Jason and Meza, Andres and Muhizi, Jules and Nguyen, Tai and Roy, Rushil and Tran, Nhan and Umuroglu, Yaman and Weng, Olivia and Yokuda, Aidan and Blott, Michaela},
   keywords={Machine Learning (cs.LG), Hardware Architecture (cs.AR), FOS: Computer and information sciences},
   title={Open-source FPGA-ML codesign for the MLPerf Tiny Benchmark},
   publisher={arXiv},
   year={2022},
   copyright={Creative Commons Attribution 4.0 International}
}

We present our development experience and recent results for the MLPerf Tiny Inference Benchmark on field-programmable gate array (FPGA) platforms. We use the open-source hls4ml and FINN workflows, which aim to democratize AI-hardware codesign of optimized neural networks on FPGAs. We present the design and implementation process for the keyword spotting, anomaly detection, and image classification benchmark tasks. The resulting hardware implementations are quantized, configurable, spatial dataflow architectures tailored for speed and efficiency, and they introduce new generic optimizations and common workflows developed as part of this work. The full workflow is presented from quantization-aware training to FPGA implementation. The solutions are deployed on system-on-chip (Pynq-Z2) and pure FPGA (Arty A7-100T) platforms. The resulting submissions achieve latencies as low as 20 µs and energy consumption as low as 30 µJ per inference. We demonstrate how emerging ML benchmarks on heterogeneous hardware platforms can catalyze collaboration and the development of new techniques and more accessible tools.
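For readers unfamiliar with the hls4ml side of this workflow, the sketch below illustrates the general quantization-aware-training-to-FPGA path the abstract describes. It is a minimal, hypothetical example rather than the authors' benchmark submission: the model architecture, bit widths, class count, and target part (a Zynq-7020, as found on the Pynq-Z2) are illustrative assumptions, and only the public QKeras and hls4ml APIs are used.

    # Minimal sketch of a QAT -> FPGA flow with QKeras + hls4ml.
    # Hypothetical model and settings; not the paper's submission config.
    from tensorflow.keras.models import Sequential
    from tensorflow.keras.layers import Input, Activation
    from qkeras import QDense, QActivation, quantized_bits, quantized_relu
    import hls4ml

    # Toy quantized MLP on a flattened keyword-spotting-style input
    # (e.g., 49x10 MFCC features, 12 output classes).
    model = Sequential([
        Input(shape=(490,)),
        QDense(64, kernel_quantizer=quantized_bits(6, 0, alpha=1),
               bias_quantizer=quantized_bits(6, 0, alpha=1)),
        QActivation(quantized_relu(6)),
        QDense(12, kernel_quantizer=quantized_bits(6, 0, alpha=1),
               bias_quantizer=quantized_bits(6, 0, alpha=1)),
        Activation('softmax'),
    ])
    model.compile(optimizer='adam', loss='categorical_crossentropy')
    # model.fit(X_train, y_train, ...)  # quantization-aware training step

    # Convert the trained model to an HLS project targeting a Zynq-7020
    # (the FPGA on the Pynq-Z2); per-layer precision is inferred from
    # the QKeras quantizers via the generated config.
    config = hls4ml.utils.config_from_keras_model(model, granularity='name')
    hls_model = hls4ml.converters.convert_from_keras_model(
        model,
        hls_config=config,
        output_dir='hls4ml_prj',
        part='xc7z020clg400-1',
    )
    hls_model.compile()                  # build the C-simulation model
    # y_hls = hls_model.predict(X_test)  # bit-accurate emulation on CPU
    # hls_model.build(csim=False)        # run HLS synthesis (needs Vivado)

The FINN flow is analogous in spirit but starts from a Brevitas-trained PyTorch model exported to ONNX, which the FINN compiler then transforms into a spatial dataflow architecture.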
