high performance computing on graphics processing units: hgpu.org

Designing Efficient MPI and UPC Runtime for Multicore Clusters with InfiniBand, Accelerators and Co-Processors

Miao Luo

The Ohio State University

The Ohio State University, 2013

High End Computing (HEC) has grown dramatically over the past decades. Emerging multi-core systems, heterogeneous architectures, and interconnects introduce both challenges and opportunities for improving the performance of communication middleware and applications. The increasing number of processor cores and co-processors results not only in heavy contention for communication resources but also in considerably more complicated communication patterns. The Message Passing Interface (MPI) has been the dominant parallel programming model for HPC applications over the past two decades. MPI has been very successful for implementing regular, iterative parallel algorithms with well-defined communication patterns. In contrast, the Partitioned Global Address Space (PGAS) programming model provides a flexible way for applications outside this class to express parallelism. Different variations and combinations of these programming models present new challenges in designing optimized runtimes, such as efficient sharing of network resources and efficient work-stealing techniques for balancing computational load across threads and processes. Middleware plays a key role in delivering the benefits of new hardware techniques to meet the new requirements of applications and programming models. This dissertation studies several critical contention problems in existing runtimes that support popular parallel programming models (MPI and UPC) on emerging multi-core and many-core systems. We start with the shared-memory contention problem within the existing MPI runtime. We then explore network throughput congestion at the node level, based on the Unified Parallel C (UPC) runtime. We propose and implement lock-free multi-threaded runtimes with multi-endpoint support for MPI/OpenMP and UPC, respectively.
Based on the multi-endpoint design, we further explore how to enhance MPI/OpenMP applications with transparent support for collective operations and minimal modifications for point-to-point operations. Finally, we extend our multi-endpoint research to GPU and MIC architectures for UPC and explore their performance characteristics. Software developed as part of this dissertation is available in MVAPICH2 and MVAPICH2-X. MVAPICH2 is a popular open-source implementation of MPI over InfiniBand and is used by hundreds of top computing sites around the world. MVAPICH2-X supports both MPI and UPC hybrid programming models on InfiniBand clusters and is based on the MVAPICH2 stack.

Tags: Algorithms, Computer science, CUDA, Heterogeneous systems, MPI, nVidia, Tesla C2050

March 9, 2014 by hgpu
