HISQ inverter on Intel Xeon Phi and NVIDIA GPUs
Fakultat fur Physik, Universitat Bielefeld, D-33615 Bielefeld, Germany
arXiv:1409.1510 [cs.DC] (4 Sep 2014)
@article{2014arXiv1409.1510K,
author={{Kaczmarek}, O. and {Schmidt}, C. and {Steinbrecher}, P. and {Mukherjee}, S. and {Wagner}, M.},
title="{HISQ inverter on Intel Xeon Phi and NVIDIA GPUs}",
journal={ArXiv e-prints},
archivePrefix="arXiv",
eprint={1409.1510},
primaryClass="cs.DC",
keywords={Computer Science – Distributed, Parallel, and Cluster Computing, High Energy Physics – Lattice},
year={2014},
month={sep},
adsurl={http://adsabs.harvard.edu/abs/2014arXiv1409.1510K},
adsnote={Provided by the SAO/NASA Astrophysics Data System}
}
The runtime of a Lattice QCD simulation is dominated by a small kernel, which calculates the product of a sparse matrix, known as the "Dslash" operator, with a vector. Therefore, this kernel is frequently optimized for various HPC architectures. In this contribution we compare the performance of the Intel Xeon Phi to current Kepler-based NVIDIA Tesla GPUs running a conjugate gradient solver. By exposing more parallelism to the accelerator through inverting multiple vectors at the same time, we obtain a performance of 250 GFlop/s on both architectures. This more than doubles the performance of the inversions. We give a short overview of both architectures, discuss some details of the implementation and the effort required to obtain the achieved performance.
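As a rough illustration of the multiple-right-hand-side idea mentioned in the abstract, the CUDA sketch below applies one 3x3 complex matrix per lattice site (standing in for a gauge link of the staggered Dslash) to several color vectors at once. The kernel, its name, and the layout (NRHS, site_matvec_multi, flat site-major arrays) are assumptions for this example, not the authors' implementation; the point is only that loading the site matrix once and reusing it for all right-hand sides increases the flop/byte ratio, which is what exposes more parallelism and lets both accelerators reach higher throughput.

#include <cuComplex.h>
#include <cuda_runtime.h>
#include <cstdio>

constexpr int NRHS = 4;  // number of right-hand sides handled together (assumed for illustration)

// Multiply the per-site 3x3 complex matrix with NRHS color vectors.
__global__ void site_matvec_multi(const cuFloatComplex* __restrict__ U,   // 3x3 matrix per site
                                  const cuFloatComplex* __restrict__ in,  // NRHS vectors, 3 colors per site
                                  cuFloatComplex*       __restrict__ out,
                                  int nSites)
{
    int site = blockIdx.x * blockDim.x + threadIdx.x;
    if (site >= nSites) return;

    // Load the 3x3 site matrix into registers once ...
    cuFloatComplex m[3][3];
    for (int r = 0; r < 3; ++r)
        for (int c = 0; c < 3; ++c)
            m[r][c] = U[(site * 3 + r) * 3 + c];

    // ... and reuse it for every right-hand side, amortizing the memory traffic.
    for (int v = 0; v < NRHS; ++v) {
        const cuFloatComplex* x = in  + ((size_t)v * nSites + site) * 3;
        cuFloatComplex*       y = out + ((size_t)v * nSites + site) * 3;
        for (int r = 0; r < 3; ++r) {
            cuFloatComplex acc = make_cuFloatComplex(0.f, 0.f);
            for (int c = 0; c < 3; ++c)
                acc = cuCaddf(acc, cuCmulf(m[r][c], x[c]));
            y[r] = acc;
        }
    }
}

int main() {
    const int nSites = 1 << 16;
    cuFloatComplex *U, *in, *out;
    cudaMalloc(&U,   (size_t)nSites * 9 * sizeof(cuFloatComplex));
    cudaMalloc(&in,  (size_t)NRHS * nSites * 3 * sizeof(cuFloatComplex));
    cudaMalloc(&out, (size_t)NRHS * nSites * 3 * sizeof(cuFloatComplex));
    cudaMemset(U,  0, (size_t)nSites * 9 * sizeof(cuFloatComplex));
    cudaMemset(in, 0, (size_t)NRHS * nSites * 3 * sizeof(cuFloatComplex));

    site_matvec_multi<<<(nSites + 127) / 128, 128>>>(U, in, out, nSites);
    cudaDeviceSynchronize();
    printf("kernel status: %s\n", cudaGetErrorString(cudaGetLastError()));

    cudaFree(U); cudaFree(in); cudaFree(out);
    return 0;
}

In a full conjugate gradient solver the same reuse would apply to each Dslash application per iteration, one search direction and residual per right-hand side.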
September 5, 2014 by hgpu