Multi GPU Performance of Conjugate Gradient Algorithm with Staggered Fermions
Lattice Gauge Theory Research Center, FPRD, and CTP, Department of Physics and Astronomy, Seoul National University, Seoul, 151-747, South Korea
arXiv:1010.4782v2 [hep-lat] (22 Oct 2010)
We report results of a performance test of GPUs using the conjugate gradient (CG) algorithm for staggered fermions on the MILC fine lattice ($28^3 \times 96$). We use NVIDIA GTX 295 GPUs for the test. With MPI communication turned off and only a single GPU in use, the performance is 35 GFLOPS in double precision, which corresponds to 47% of the peak. With MPI communication turned on and multiple GPUs in use, the performance drops to 12.3 GFLOPS. Data transfer through the InfiniBand network and the PCI-E bus I/O is the main bottleneck. We suggest two potential ways to optimize the data transfer.
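For readers unfamiliar with CG, the following is a minimal serial sketch of the textbook algorithm. It is not the authors' GPU code. For staggered fermions, CG is usually applied to a Hermitian positive-definite normal operator such as $M^\dagger M$. Here a small real symmetric positive-definite matrix `A` stands in for that operator (an assumption for illustration), and the helper names `dot`, `matvec`, and `conjugate_gradient` are hypothetical. In a typical domain-decomposed multi-GPU implementation, each `matvec` needs boundary data from neighboring GPUs and each `dot` needs a global sum. These are the points where network and PCI-E transfers enter the iteration.

```python
def dot(a, b):
    """Inner product; in a multi-GPU run this would need a global reduction."""
    return sum(x * y for x, y in zip(a, b))


def matvec(A, x):
    """Dense matrix-vector product; stands in for applying M^dagger M."""
    return [dot(row, x) for row in A]


def conjugate_gradient(A, b, tol=1e-10, max_iter=1000):
    """Solve A x = b for a symmetric positive-definite A with plain CG.

    Returns the approximate solution and the number of iterations taken.
    """
    n = len(b)
    x = [0.0] * n
    r = list(b)        # residual r = b - A x, with initial guess x = 0
    p = list(r)        # first search direction is the residual
    rr = dot(r, r)
    for k in range(max_iter):
        if rr ** 0.5 < tol:
            return x, k
        Ap = matvec(A, p)
        alpha = rr / dot(p, Ap)                        # step length along p
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        rr_new = dot(r, r)
        beta = rr_new / rr                             # keeps directions A-conjugate
        p = [ri + beta * pi for ri, pi in zip(r, p)]
        rr = rr_new
    return x, max_iter


if __name__ == "__main__":
    A = [[4.0, 1.0], [1.0, 3.0]]
    b = [1.0, 2.0]
    x, iters = conjugate_gradient(A, b)
    print(x, iters)
```

Each iteration performs one operator application and two inner products. This is why, once the operator is split across GPUs, the cost of moving boundary data and reduction results can dominate the arithmetic.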
November 9, 2010 by hgpu