https://hgpu.org/?p=17886
Improving 3D Lattice Boltzmann Method stencil with asynchronous transfers on many-core processors