MPI Parallelization of GPU-based Lattice Boltzmann Simulations

Arash Bakhtiari
Technische Universität München
Technische Universität München, 2013


   author={Bakhtiari, Arash},
   institution={Institut f{\"{u}}r Informatik, Technische Universit{\"{a}}t M{\"{u}}nchen},
   organization={Institut f{\"{u}}r Informatik, Technische Universit{\"{a}}t M{\"{u}}nchen},
   school={Institut f{\"{u}}r Informatik, Technische Universit{\"{a}}t M{\"{u}}nchen},
   title={MPI Parallelization of GPU-based Lattice Boltzmann Simulations},
   type={Master’s thesis},







In this thesis, an MPI-parallelized LBM code for a multi-GPU platform has been designed and implemented. The primary goal of the thesis is to investigate efficient and scalable multi-GPU LBM code that exploits advanced features of modern GPUs and adopts optimization techniques such as overlapping of work and communication in heterogeneous CPU-GPU clusters. To achieve this goal, three overlapping techniques have been designed and implemented. Each of these techniques exploits advanced features of the OpenCL API and the MPI standard to execute independent operations of the multi-GPU LBM simulation simultaneously. To identify bottlenecks and optimize the software, profiling tools such as Callgrind were used. Based on the profiling results, three optimization techniques were developed that give boundary values an efficient memory access pattern in GPU memory. The overall performance of the software has been evaluated on the MAC GPU cluster. In weak scaling experiments on 8 GPUs, the SBK-SCQ method achieved 97% efficiency with four GPUs as the baseline, whereas in strong scaling experiments with 8 GPUs the best result was a speedup of 2.5, delivered by the MBK-SCQ method. In contrast to the weak scaling performance, the strong scaling speedup falls short of linear scaling because of MPI and CPU-GPU communication overheads. Contrary to expectations, more sophisticated overlapping techniques such as MBK-MCQ did not achieve better results than simpler techniques such as SBK-SCQ. Techniques like MBK-MCQ suffered from the lack of support for advanced OpenCL features in the vendor-provided driver. Finally, Large Eddy Simulation with the Smagorinsky subgrid-scale turbulence model was implemented. With this extension, the software can be used to simulate laminar as well as turbulent flows on a multi-GPU distributed-memory platform.