Benchmarking Intel Xeon Phi to Guide Kernel Design
Delft University of Technology
Delft University of Technology, PDS Technical Report PDS-2013-005, 2013
@techreport{fang2013benchmarking,
title={Benchmarking Intel Xeon Phi to Guide Kernel Design},
author={Fang, Jianbin and Varbanescu, Ana Lucia and Sips, Henk and Zhang, Lilun and Che, Yonggang and Xu, Chuanfu},
institution={Delft University of Technology},
number={PDS-2013-005},
year={2013}
}
With a minimum of 50 cores, Intel’s Xeon Phi is a true many-core architecture. Featuring fairly powerful cores, two levels of cache, and a very fast interconnect, the Xeon Phi can reach a theoretical peak of 1000 GFLOPS and over 240 GB/s of memory bandwidth. These numbers, together with its flexibility – it can be used as either a coprocessor or a stand-alone processor – are very tempting for parallel applications looking for new performance records. In this paper, we present four hardware-centric guidelines and a machine model for Xeon Phi programmers in search of performance. Specifically, we have benchmarked the main hardware components of the processor – the cores, the memory hierarchy, and the ring interconnect. We show that, under ideal microbenchmarking conditions, the achieved performance is very close to the theoretical performance given in the official programmer’s guide. Furthermore, we have identified and quantified several causes of significant performance penalties that are not covered in the official documentation. Based on this information, we synthesized four optimization guidelines and applied them to a set of kernels, aiming to systematically optimize their performance. The optimization process is guided by performance roofs derived from the same benchmarks. Our experimental results show that, using this strategy, we can achieve impressive performance gains and, more importantly, a high utilization of the processor.
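The idea of a performance roof mentioned in the abstract can be illustrated with a minimal roofline-style sketch. It assumes the two theoretical figures quoted above (1000 GFLOPS and 240 GB/s, the latter taken as a lower bound) rather than the benchmark-derived roofs the report itself uses. The function names and the sample arithmetic intensities are hypothetical and chosen only for illustration.

```python
# Roofline-style bound, assuming the theoretical figures from the abstract.
# The report derives its roofs from measured benchmarks; these constants are
# stand-ins for illustration, not the authors' measured values.

PEAK_GFLOPS = 1000.0    # theoretical compute peak quoted in the abstract
BANDWIDTH_GBS = 240.0   # "over 240 GB/s" in the abstract, used as a lower bound


def attainable_gflops(intensity, peak_gflops=PEAK_GFLOPS, bandwidth_gbs=BANDWIDTH_GBS):
    """Upper bound on kernel performance for a given arithmetic intensity.

    intensity is in floating-point operations per byte moved to or from memory.
    The bound is the lower of the compute roof and the memory roof.
    """
    if intensity < 0:
        raise ValueError("arithmetic intensity must be non-negative")
    return min(peak_gflops, bandwidth_gbs * intensity)


def ridge_point(peak_gflops=PEAK_GFLOPS, bandwidth_gbs=BANDWIDTH_GBS):
    """Arithmetic intensity at which a kernel stops being memory-bound."""
    return peak_gflops / bandwidth_gbs


if __name__ == "__main__":
    # Hypothetical kernels with increasing arithmetic intensity (flops/byte).
    for ai in (0.25, 1.0, 4.0, 8.0):
        bound = attainable_gflops(ai)
        regime = "memory-bound" if ai < ridge_point() else "compute-bound"
        print(f"intensity {ai:5.2f} flops/byte -> at most {bound:7.1f} GFLOPS ({regime})")
```

Under these assumed numbers, a kernel needs roughly 4.2 floating-point operations per byte before the compute roof, rather than the memory roof, becomes the limiting factor.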
July 14, 2013 by hgpu