Designing Efficient MPI and UPC Runtime for Multicore Clusters with InfiniBand, Accelerators and Co-Processors

Miao Luo
The Ohio State University
The Ohio State University, 2013


   title={Designing Efficient MPI and UPC Runtime for Multicore Clusters with InfiniBand, Accelerators and Co-Processors},

   author={Luo, Miao},


   school={The Ohio State University}


Download Download (PDF)   View View   Source Source   



High End Computing (HEC) has been growing dramatically over the past decades. The emerging multi-core systems, heterogeneous architectures and interconnects introduce various challenges and opportunities to improve the performance of communication middlewares and applications. The increasing number of processor cores and Co-Processors results in not only heavy contention on communication resources, but also much more complicated levels of communication patterns. Message Passing Interface (MPI) is the dominant parallel programming language for HPC application area in the past two decades. MPI has been very successful in implementing regular, iterative parallel algorithms with well defined communication pattern. Instead, the Partitioned Global Address Space (PGAS) programming model provides a flexible way for these applications to express parallelism. Different variations and combinations of these programming languages present new challenges in designing optimized programming model runtimes, in terms of efficient sharing of networking resources and efficient work-stealing techniques for computation load balancing across threads/processes, etc. Middlewares play a key role in delivering the benefits of new hardware techniques to support the new requirement from applications and programming models. This dissertation aims to study several critical contention problems of existing runtimes, which supports popular parallel programming models (MPI and UPC) on emerging multi-core/many-core systems. We start with shared memory contention problem within existing MPI runtime. Then we explore the network throughput congestion issue at node level, based on Unified Parallel C (UPC) runtime. We propose and implement lock-free multi-threaded runtimes for MPI/OpenMP and UPC with multi-endpoint support, respectively. Based on the multi-endpoint design, we further explore how to enhance MPI/OpenMP applications with transparent support for collective operations and minimal modifications for point-to-point operations. Finally we extend our multi-endpoint research to include GPU and MIC architecture for UPC and explore the performance features. Software developed as a part of this dissertation is available in MVAPICH2 and MVAPICH2-X. MVAPICH2 is a popular open-source implementation of MPI over InfiniBand and is used by hundreds of top computing sites all around the world. MVAPICH2-X supports both MPI and UPC hybrid programming models on InfiniBand clusters and is based on MVAPICH2 stack.
No votes yet.
Please wait...

* * *

* * *

HGPU group © 2010-2021 hgpu.org

All rights belong to the respective authors

Contact us: