Deep Convolutional Network evaluation on the Intel Xeon Phi: Where Subword Parallelism meets Many-Core
Eindhoven University of Technology, 2016
@article{raina2016deep,
title={Deep Convolutional Network evaluation on the Intel Xeon Phi: Where Subword Parallelism meets Many-Core},
author={Raina, Gaurav and Corporaal, Henk and Cuijpers, Pieter and Peemen, Maurice and Rauwerda, Gerard},
year={2016},
publisher={Technische Universiteit Eindhoven}
}
With the sharp decline in camera cost and size, and with superior computing power available at increasingly low prices, computer vision applications are becoming ever-present in our daily lives. Research shows that Convolutional Neural Networks (ConvNets) can outperform all other methods for computer vision tasks (such as object detection) in terms of accuracy and versatility [31]. One problem with these neural networks, which mimic the brain, is that they can be very demanding on the processor, requiring millions of computational nodes to function. Hence, it is challenging for neural network algorithms to achieve real-time performance on general-purpose embedded platforms. Parallelization and vectorization are very effective ways to ease this problem and make it possible to implement such ConvNets on energy-efficient embedded platforms. This thesis presents the evaluation of a novel ConvNet for road speed sign detection [38] on a breakthrough 57-core Intel Xeon Phi processor with 512-bit vector support. This mapping demonstrates that the parallelism inherent in the ConvNet algorithm can be effectively exploited by the 512-bit vector ISA and by the many-core paradigm. Detailed evaluation shows that the best mappings require data-reuse strategies that exploit reuse at the cache and register level. These implementations are boosted by the use of low-level vector intrinsics (C-style functions that map directly onto Intel assembly instructions). Ultimately, we demonstrate an approach that can be used to accelerate neural networks on highly parallel many-core processors, with execution speedups of more than 12x on single-core performance alone.
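The register-level data reuse mentioned in the abstract can be illustrated with a minimal sketch. This is not the thesis's code: it is a scalar, portable C stand-in for a one-dimensional convolution and uses no Xeon Phi vector intrinsics. The function names, the kernel width `K`, and the one-dimensional layout are all assumptions made for illustration. The first function reloads every input from memory for each output; the second keeps the sliding window and the weights in local variables, so each input element is loaded only once.

```c
#include <assert.h>
#include <stddef.h>

#define K 3 /* kernel width; hypothetical value chosen for illustration */

/* Baseline: each output element reloads all K inputs from memory. */
void conv1d_naive(const float *in, const float *w, float *out, size_t n_out)
{
    assert(in && w && out);
    for (size_t i = 0; i < n_out; ++i) {
        float acc = 0.0f;
        for (size_t k = 0; k < K; ++k)
            acc += in[i + k] * w[k];
        out[i] = acc;
    }
}

/* Register-level reuse: the weights and the sliding input window live in
 * local variables, so each iteration performs one new load from `in`.
 * Written for K == 3 only. */
void conv1d_reuse(const float *in, const float *w, float *out, size_t n_out)
{
    assert(in && w && out);
    if (n_out == 0)
        return;

    const float w0 = w[0], w1 = w[1], w2 = w[2];
    float x0 = in[0], x1 = in[1];

    for (size_t i = 0; i < n_out; ++i) {
        const float x2 = in[i + 2]; /* the only new input load */
        out[i] = x0 * w0 + x1 * w1 + x2 * w2;
        x0 = x1; /* slide the window by one element */
        x1 = x2;
    }
}
```

Both functions expect `in` to hold at least `n_out + K - 1` elements. On a vector machine the same idea would apply to whole vector registers rather than single floats, but that mapping is beyond what this sketch shows.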
January 26, 2017 by hgpu