Optimization of the Brillouin operator on the KNL architecture
University of Wuppertal, Gaussstrasse 20, D-42119 Wuppertal, Germany
arXiv:1709.01828 [hep-lat], (6 Sep 2017)
@article{durr2017optimization,
title={Optimization of the Brillouin operator on the KNL architecture},
author={D{\"u}rr, Stephan},
year={2017},
month={sep},
archivePrefix={arXiv},
eprint={1709.01828},
primaryClass={hep-lat}
}
Experiences with optimizing the matrix-times-vector application of the Brillouin operator on the Intel KNL processor are reported. Without adjustments to the memory layout, performance figures of 360 Gflop/s in single and 270 Gflop/s in double precision are observed. This is with N_c=3 colors, N_v=12 right-hand sides, N_{thr}=256 threads, on lattices of size 32^3 x 64, using exclusively OMP pragmas. Interestingly, the same routine performs quite well on Intel Core i7 architectures, too. Some observations on the much harder Wilson fermion matrix-times-vector optimization problem are added.
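To illustrate what "using exclusively OMP pragmas" can look like for a stencil applied to several right-hand sides at once, here is a minimal sketch in C. It is not the Brillouin operator and is not taken from the paper. As stated assumptions, it uses a toy 3-point stencil on a one-dimensional periodic lattice with real entries and no gauge links. The function name `apply_toy_stencil`, the constant `N_V`, and the layout `vec[site * N_V + rhs]` are hypothetical choices made for this example only.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical number of right-hand sides, matching N_v=12 in the abstract. */
#define N_V 12

/* Toy stand-in for a stencil operator (NOT the Brillouin operator):
   out(x) = 2*in(x) - in(x+1) - in(x-1) on a periodic 1D lattice,
   applied to all N_V right-hand sides of each site.
   Assumed layout: vec[site * N_V + rhs], so the rhs index is contiguous. */
void apply_toy_stencil(size_t n_sites, const float *in, float *out)
{
    assert(n_sites > 0);

    /* Threads share the loop over lattice sites. */
    #pragma omp parallel for schedule(static)
    for (long x = 0; x < (long)n_sites; x++) {
        size_t xp = ((size_t)x + 1) % n_sites;            /* forward neighbor, periodic  */
        size_t xm = ((size_t)x + n_sites - 1) % n_sites;  /* backward neighbor, periodic */

        /* The contiguous rhs index is the candidate for vectorization. */
        #pragma omp simd
        for (int v = 0; v < N_V; v++) {
            out[x * N_V + v] = 2.0f * in[x * N_V + v]
                             - in[xp * N_V + v]
                             - in[xm * N_V + v];
        }
    }
}
```

Compiled without OpenMP support, the pragmas are ignored and the routine runs serially with the same result. With OpenMP enabled (for example `-fopenmp` with GCC), the outer loop is distributed over threads while the inner loop over right-hand sides is offered to the vectorizer.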
September 12, 2017 by hgpu