A GPU cluster optimized multigrid scheme for computing unsteady incompressible fluid flow
Institute for Solid State Physics and Optics, Wigner Research Centre for Physics P.O. Box 49, H-1525 Budapest, Hungary
arXiv:1309.7128 [math.NA], (27 Sep 2013)
@article{2013arXiv1309.7128T,
  author        = {{Tegze}, G. and {T{\'o}th}, G.~I.},
  title         = "{A GPU cluster optimized multigrid scheme for computing unsteady incompressible fluid flow}",
  journal       = {ArXiv e-prints},
  archivePrefix = "arXiv",
  eprint        = {1309.7128},
  primaryClass  = "math.NA",
  keywords      = {Mathematics - Numerical Analysis, Physics - Computational Physics},
  year          = 2013,
  month         = sep,
  adsurl        = {http://adsabs.harvard.edu/abs/2013arXiv1309.7128T},
  adsnote       = {Provided by the SAO/NASA Astrophysics Data System}
}
A multigrid scheme has been proposed that allows efficient implementation on modern CPUs, many-integrated-core (MIC) devices, and graphics processing units (GPUs). It is shown that wide single-instruction-multiple-data (SIMD) processing engines are used efficiently when a deep 2h grid hierarchy is replaced with a two-level scheme using 16h-32h restriction. The restriction length can be fitted to the SIMD width to fully utilize the capabilities of modern CPUs and GPUs. This also ensures optimal memory transfer, since no strided memory access is required. The number of expensive restriction steps is greatly reduced, and they are executed on larger chunks of data, which allows optimal caching strategies. A higher-order interpolated stencil was developed to improve the convergence rate by minimizing spurious interference between the coarse- and fine-scale solutions. The method is demonstrated by solving the pressure equation for 2D incompressible fluid flow: the benchmark setups cover shear-driven laminar flow in a cavity and direct numerical simulation (DNS) of a turbulent jet. We show that the scheme also allows efficient use of distributed-memory computer clusters by decreasing the number of memory transfers between host and compute devices, and among cluster nodes. The actual implementation uses a hybrid OpenCL/MPI-based parallelization.
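To make the two-level idea concrete, below is a minimal, self-contained C sketch of a two-grid V-cycle for a 2D Poisson (pressure-like) problem with a single wide 16h restriction in place of a deep 2h hierarchy. It is not the authors' implementation: the higher-order interpolated stencil, the OpenCL kernels, and the MPI layer are replaced by plain Jacobi smoothing, block averaging, and piecewise-constant prolongation, and the grid size, sweep counts, and source term are arbitrary choices for illustration. The crude prolongation used here is exactly the kind of coarse/fine interference the paper's interpolated stencil is designed to suppress.

/*
 * Sketch only (not the paper's code): two-level multigrid V-cycle for
 * -Laplace(u) = f on a 2D grid, with an aggressive 16h restriction
 * replacing a deep 2h hierarchy.  Jacobi smoothing, block averaging and
 * piecewise-constant prolongation stand in for the paper's OpenCL/MPI
 * kernels and higher-order interpolated stencil.
 */
#include <stdio.h>
#include <stdlib.h>
#include <math.h>

#define N   128                  /* fine grid: N x N interior points        */
#define R   16                   /* restriction factor (coarse grid is 16h) */
#define NC  (N / R)              /* coarse grid: NC x NC interior points    */
#define IDX(i, j, n) ((i) * ((n) + 2) + (j))  /* row-major with ghost layer */

/* Jacobi sweeps for -Laplace(u) = f with zero Dirichlet boundaries */
static void jacobi(double *u, const double *f, int n, double hh, int sweeps)
{
    double *t = calloc((size_t)(n + 2) * (n + 2), sizeof *t);
    for (int s = 0; s < sweeps; ++s) {
        for (int i = 1; i <= n; ++i)
            for (int j = 1; j <= n; ++j)
                t[IDX(i, j, n)] = 0.25 * (u[IDX(i - 1, j, n)] + u[IDX(i + 1, j, n)]
                                        + u[IDX(i, j - 1, n)] + u[IDX(i, j + 1, n)]
                                        + hh * hh * f[IDX(i, j, n)]);
        for (int i = 1; i <= n; ++i)
            for (int j = 1; j <= n; ++j)
                u[IDX(i, j, n)] = t[IDX(i, j, n)];
    }
    free(t);
}

/* residual r = f - A u on a grid with spacing hh */
static void residual(double *r, const double *u, const double *f, int n, double hh)
{
    for (int i = 1; i <= n; ++i)
        for (int j = 1; j <= n; ++j)
            r[IDX(i, j, n)] = f[IDX(i, j, n)]
                - (4.0 * u[IDX(i, j, n)] - u[IDX(i - 1, j, n)] - u[IDX(i + 1, j, n)]
                   - u[IDX(i, j - 1, n)] - u[IDX(i, j + 1, n)]) / (hh * hh);
}

/* restriction: average the fine residual over R x R blocks (one 16h step) */
static void restrict_blocks(double *rc, const double *r)
{
    for (int I = 1; I <= NC; ++I)
        for (int J = 1; J <= NC; ++J) {
            double s = 0.0;
            for (int i = (I - 1) * R + 1; i <= I * R; ++i)
                for (int j = (J - 1) * R + 1; j <= J * R; ++j)
                    s += r[IDX(i, j, N)];
            rc[IDX(I, J, NC)] = s / (double)(R * R);
        }
}

/* prolongation: add the coarse correction back, piecewise constant per block */
static void prolong_add(double *u, const double *ec)
{
    for (int i = 1; i <= N; ++i)
        for (int j = 1; j <= N; ++j)
            u[IDX(i, j, N)] += ec[IDX((i - 1) / R + 1, (j - 1) / R + 1, NC)];
}

int main(void)
{
    double h = 1.0 / (N + 1);
    double *u  = calloc((size_t)(N + 2) * (N + 2), sizeof *u);
    double *f  = calloc((size_t)(N + 2) * (N + 2), sizeof *f);
    double *r  = calloc((size_t)(N + 2) * (N + 2), sizeof *r);
    double *rc = calloc((size_t)(NC + 2) * (NC + 2), sizeof *rc);
    double *ec = calloc((size_t)(NC + 2) * (NC + 2), sizeof *ec);

    for (int i = 1; i <= N; ++i)             /* constant source term */
        for (int j = 1; j <= N; ++j)
            f[IDX(i, j, N)] = 1.0;

    for (int cycle = 0; cycle < 10; ++cycle) {
        jacobi(u, f, N, h, 4);               /* pre-smoothing on the fine grid */
        residual(r, u, f, N, h);
        restrict_blocks(rc, r);              /* single wide (16h) restriction  */
        for (size_t k = 0; k < (size_t)(NC + 2) * (NC + 2); ++k) ec[k] = 0.0;
        jacobi(ec, rc, NC, R * h, 200);      /* approximate coarse-grid solve  */
        prolong_add(u, ec);                  /* coarse-grid correction         */
        jacobi(u, f, N, h, 4);               /* post-smoothing                 */

        residual(r, u, f, N, h);
        double nrm = 0.0;
        for (int i = 1; i <= N; ++i)
            for (int j = 1; j <= N; ++j)
                nrm = fmax(nrm, fabs(r[IDX(i, j, N)]));
        printf("cycle %d: residual inf-norm = %g\n", cycle, nrm);
    }
    free(u); free(f); free(r); free(rc); free(ec);
    return 0;
}

In the paper's setting, each R x R block maps naturally onto one SIMD-width-sized chunk of contiguous memory, so the restriction loop above is the part that would be written as an OpenCL kernel operating on unstrided data, and the small NC x NC coarse problem is cheap enough to solve with few device-host or node-to-node transfers.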
September 30, 2013 by hgpu