An Empirical Study of Intel Xeon Phi
Delft University of Technology, the Netherlands
arXiv:1310.5842 [cs.DC], (22 Oct 2013)
@article{2013arXiv1310.5842F,
author={{Fang}, J. and {Varbanescu}, A.~L. and {Sips}, H. and {Zhang}, L. and {Che}, Y. and {Xu}, C.},
title={{An Empirical Study of Intel Xeon Phi}},
journal={ArXiv e-prints},
archivePrefix={arXiv},
eprint={1310.5842},
primaryClass={cs.DC},
keywords={Computer Science - Distributed, Parallel, and Cluster Computing, Computer Science - Performance},
year={2013},
month={oct},
adsurl={http://adsabs.harvard.edu/abs/2013arXiv1310.5842F},
adsnote={Provided by the SAO/NASA Astrophysics Data System}
}
With at least 50 cores, Intel Xeon Phi is a true many-core architecture. Featuring fairly powerful cores, two cache levels, and very fast interconnections, the Xeon Phi can reach a theoretical peak of 1000 GFLOPs and over 240 GB/s. These numbers, as well as its flexibility (it can be used either as a coprocessor or as a stand-alone processor), are very tempting for parallel applications looking for new performance records. In this paper, we present an empirical study of Xeon Phi, stressing its performance limits and relevant performance factors, ultimately aiming to present a simplified view of the machine for regular programmers in search of performance. To do so, we have microbenchmarked the main hardware components of the processor: the cores, the memory hierarchies, the ring interconnect, and the PCIe connection. We show that, in ideal microbenchmarking conditions, the performance that can be achieved is very close to the theoretical peak, as given in the official programmer’s guide. We have also identified and quantified several causes of significant performance penalties. Our findings have been captured in four optimization guidelines and used to build a simplified programmer’s view of Xeon Phi, eventually enabling the design and prototyping of applications on a functionality-based model of the architecture.
October 25, 2013 by hgpu