High Performance Non-Blocking Collective Communication for Next Generation InfiniBand Clusters

Krishna Kandalla
Department of Computer Science and Engineering, The Ohio State University
The Ohio State University, 2013








The emergence of multi-/many-core architectures, accelerators and high-speed networks, along with the continued reduction in hardware costs, makes it possible to design highly capable supercomputers that offer sustained petaflop performance. However, merely using modern compute architectures and high-speed networks is not sufficient to achieve exascale science. Parallel applications typically involve explicit communication between processes to exchange data and synchronize. With increasing system sizes, the communication and synchronization overheads are bound to grow and affect the performance of parallel applications. Hence, the performance and scalability features offered by communication stacks play a key role in modern high performance computing systems. MPI has been the de facto programming model for developing parallel applications. MPI offers various collective communication primitives that allow application developers to express group communication operations in a convenient and portable manner. Until recently, the MPI standard defined collective operations to be blocking, i.e., the processes need to wait in the MPI library until their role in the collective operation is complete. As applications are scaled out, blocking collectives lead to high communication and synchronization overheads. This spurred interest in the design and development of asynchronous collective operations in MPI, and the current MPI-3 revision offers this support. However, delivering near-perfect communication/computation overlap with collective operations is non-trivial. Moreover, scientific applications also need to be re-designed to achieve communication/computation overlap through non-blocking collective operations. Simplistic solutions for designing non-blocking collective operations rely on having the host CPU progress collective communication operations. However, such solutions cannot deliver good performance and overlap.
In this dissertation, we first explore the challenges and benefits associated with designing network-offload based non-blocking collectives by leveraging features offered by the latest InfiniBand network adapters. Next, we address the important challenge of co-designing parallel applications, MPI communication stacks and modern computing hardware to achieve superior performance through computation/communication overlap. We re-design several important scientific applications and kernels, such as parallel 3D FFT, sparse linear solvers (Pre-Conditioned Conjugate Gradient (PCG)), dense linear algebra (High Performance Linpack (HPL) benchmark) and irregular graph algorithms (2D-Breadth First Search (BFS)), to demonstrate the potential benefits of such a co-design effort. Considering the limitations of current generation hardware-based support for non-blocking collectives, we also propose a novel Functional Partitioning based approach to design dense non-blocking collectives efficiently. Further, we also propose designs to improve the performance of blocking collectives on emerging multi-/many-core architectures. All of our work is based on the MVAPICH2 software stack, which is an open-source, high-performance implementation of the MPI standard over InfiniBand, 10GigE/iWARP and RDMA over Converged Ethernet (RoCE). MVAPICH2 is used by more than 2,055 organizations worldwide and powers several supercomputers in the Top500 list.
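
The difference between a blocking collective and the start/compute/wait pattern of a non-blocking one can be illustrated with a small single-process sketch. This is not MPI and not code from the dissertation: the function names (`allreduce_sum`, `independent_work`, `blocking_step`, `nonblocking_step`) are hypothetical, the "ranks" are simply list entries, and the 50 ms sleep stands in for network latency. In MPI-3 the corresponding calls would be `MPI_Iallreduce` to start the operation and `MPI_Wait` to complete it.

```python
from concurrent.futures import ThreadPoolExecutor
import time


def allreduce_sum(contributions):
    """Stand-in for a collective: sum one value per simulated rank."""
    time.sleep(0.05)  # assumed delay, standing in for network latency
    return sum(contributions)


def independent_work(n):
    """Computation that does not depend on the collective's result."""
    return sum(i * i for i in range(n))


def blocking_step(contributions, n):
    # Blocking style: the caller waits inside the collective,
    # and only then starts its own computation.
    total = allreduce_sum(contributions)
    local = independent_work(n)
    return total, local


def nonblocking_step(pool, contributions, n):
    # Non-blocking style: start the collective and get a handle back at once.
    request = pool.submit(allreduce_sum, contributions)
    # This computation proceeds while the collective is still in flight.
    local = independent_work(n)
    # Wait for completion only when the result is actually needed.
    total = request.result()
    return total, local


if __name__ == "__main__":
    with ThreadPoolExecutor(max_workers=1) as pool:
        print(blocking_step([1, 2, 3, 4], 1000))
        print(nonblocking_step(pool, [1, 2, 3, 4], 1000))
```

Both functions return the same values; only the ordering of waiting and computing differs. Note that the helper thread here is driven by the host CPU, which corresponds to the simplistic approach the abstract describes as unable to deliver good overlap. The dissertation instead targets progress by the network adapter, which this sketch does not model.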