Multi-GPU Performance of Incompressible Flow Computation by Lattice Boltzmann Method on GPU Cluster
Global Scientific Information and Computing Center, Tokyo Institute of Technology, 2-12-1, Meguro-ku, Tokyo 152-8550
Parallel Computing (27 February 2011)
GPGPU has drawn much attention for accelerating non-graphics applications. A simulation using the D3Q19 model of the lattice Boltzmann method was executed successfully on a multi-node GPU cluster using CUDA programming and the MPI library. The GPU code runs on TSUBAME, the multi-node GPU cluster of Tokyo Institute of Technology, which is equipped with a total of 680 NVIDIA Tesla GPUs. For multi-GPU computation, a domain partitioning method is used to distribute the computational load across multiple GPUs, and GPU-to-GPU data transfer becomes a severe overhead on the total performance. The parallel results obtained with 1D, 2D and 3D domain partitioning were compared and analyzed. With a 384x384x384 mesh and 96 GPUs, the performance with 3D partitioning is about 3-4 times higher than with 1D partitioning. The performance curve deviates from the ideal line because of the long communication time between GPUs. To hide the communication time, we introduced a technique that overlaps computation and communication, in which data transfer and computation are performed simultaneously in two streams. Using 8-96 GPUs, the performance increases by a factor of about 1.1-1.3 in overlapping mode. As a benchmark problem, a large-scale computation of the flow around a sphere at Re=13000 was carried out successfully using a 2000x1000x1000 mesh and 100 GPUs. For this computation with 2 giga lattice nodes, processing 100,000 time steps took 6.0 hours. Under this condition, the computation time (2.79 hours) and the data communication time (3.06 hours) are almost the same.
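The effect of the partitioning dimension on communication can be illustrated with a rough count of the boundary lattice nodes that one GPU must exchange per time step. The following Python sketch is only an illustration under stated assumptions: the process grids 96x1x1, 12x8x1 and 6x4x4 are hypothetical choices for 96 GPUs (the abstract does not state the grids actually used), the sub-domain is assumed to be an interior one with two neighbours along every partitioned axis, and the halo is taken to be one lattice layer thick. It counts nodes only and does not model the number of D3Q19 distribution functions crossing each face, bandwidth, or latency.

```python
def halo_nodes_per_gpu(mesh, grid):
    """Count boundary-layer lattice nodes one interior GPU exchanges per step.

    mesh: (nx, ny, nz) global mesh size.
    grid: (px, py, pz) number of GPUs along each axis (hypothetical layout).
    Assumes a one-node-thick halo and two neighbours on every split axis.
    """
    for n, p in zip(mesh, grid):
        if n % p != 0:
            raise ValueError("mesh size must be divisible by the process grid")

    # Local sub-domain size owned by a single GPU.
    sub = [n // p for n, p in zip(mesh, grid)]

    total = 0
    for axis, parts in enumerate(grid):
        if parts == 1:
            continue  # axis not partitioned, so no exchange across it
        # Area of the face perpendicular to this axis.
        face = 1
        for other, size in enumerate(sub):
            if other != axis:
                face *= size
        total += 2 * face  # one face towards each of the two neighbours
    return total


if __name__ == "__main__":
    mesh = (384, 384, 384)
    layouts = {"1D": (96, 1, 1), "2D": (12, 8, 1), "3D": (6, 4, 4)}
    nodes_per_gpu = mesh[0] * mesh[1] * mesh[2] // 96
    for name, grid in layouts.items():
        halo = halo_nodes_per_gpu(mesh, grid)
        print(f"{name} {grid}: halo nodes = {halo}, "
              f"halo / interior = {halo / nodes_per_gpu:.3f}")
```

Under these assumptions the 1D layout exchanges several times more boundary nodes per GPU than the 3D layout, which is consistent in direction with the reported advantage of 3D partitioning. The measured speedup also depends on factors this count ignores, such as strided memory access when packing non-contiguous faces.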