high performance computing on graphics processing units: hgpu.org


High Performance Non-Blocking Collective Communication for Next Generation Infiniband Clusters

Krishna Kandalla

Department of Computer Science and Engineering, The Ohio State University

The Ohio State University, 2013


The emergence of multi-/many-core architectures, accelerators and high-speed networks, along with the continued reduction in hardware costs, makes it possible to design highly capable supercomputers that offer sustained petaflop performance. However, merely using modern compute architectures and high-speed networks is not sufficient to achieve exascale science. Parallel applications typically involve explicit communication between processes to exchange data and synchronize. With increasing system sizes, the communication and synchronization overheads are bound to grow and affect the performance of parallel applications. Hence, the performance and scalability features offered by communication stacks play a key role in modern high performance computing systems. MPI has been the de facto programming model for developing parallel applications. MPI offers various collective communication primitives that allow application developers to express group communication operations in a convenient and portable manner. Until recently, the MPI standard defined collective operations to be blocking, i.e., the processes need to wait in the MPI library until their role in the collective operation is complete. As applications are scaled out, blocking collectives lead to high communication and synchronization overheads. This spurred interest in the design and development of asynchronous collective operations in MPI, and the current MPI-3 revision offers this support. However, delivering near-perfect communication/computation overlap with collective operations is non-trivial. Moreover, scientific applications also need to be re-designed to achieve communication/computation overlap through non-blocking collective operations. Simplistic solutions for designing non-blocking collective operations rely on having the CPUs progress collective communication operations. However, such solutions cannot deliver good performance and overlap.
In this dissertation, we first explore the challenges and benefits associated with designing network-offload based non-blocking collectives by leveraging features offered by the latest InfiniBand network adapters. Next, we address the important challenge of co-designing parallel applications, MPI communication stacks and modern computing hardware to achieve superior performance through computation/communication overlap. We re-design several important scientific applications and kernels, such as parallel 3D FFT, sparse linear solvers (Pre-Conditioned Conjugate Gradient (PCG)), dense linear algebra (High Performance Linpack (HPL) benchmark), and irregular graph algorithms (2D-Breadth First Search (BFS)), to demonstrate the potential benefits of such a co-design effort. Considering the limitations of current generation hardware-based support for non-blocking collectives, we also propose a novel Functional Partitioning based approach to design dense non-blocking collectives efficiently. Further, we also propose designs to improve the performance of blocking collectives on emerging multi-/many-core architectures. All of our work is based on the MVAPICH2 software stack, which is an open-source, high-performance implementation of the MPI standard over InfiniBand, 10GigE/iWARP and RDMA over Converged Ethernet (RoCE). MVAPICH2 is being used by more than 2,055 organizations worldwide and powers several supercomputers in the Top500 list.
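
To make the overlap argument in the abstract concrete, here is a minimal sketch of an idealized cost model. It is not taken from the dissertation: the function names and the `overlap` parameter (the fraction of the shorter phase that is hidden behind the longer one) are assumptions introduced purely for illustration, and the model ignores progress overheads and network contention.

```python
def blocking_step_time(t_comm: float, t_comp: float) -> float:
    """Time for one step when the process waits inside the collective."""
    return t_comm + t_comp


def nonblocking_step_time(t_comm: float, t_comp: float, overlap: float) -> float:
    """Time for one step when part of the shorter phase is hidden.

    overlap is a hypothetical parameter in [0, 1]: 0 means no overlap
    (same cost as blocking), 1 means the shorter phase is fully hidden.
    """
    if not 0.0 <= overlap <= 1.0:
        raise ValueError("overlap must be between 0 and 1")
    hidden = overlap * min(t_comm, t_comp)
    return t_comm + t_comp - hidden


if __name__ == "__main__":
    t_comm, t_comp = 2.0, 3.0  # illustrative values in arbitrary time units
    blocking = blocking_step_time(t_comm, t_comp)
    for overlap in (0.0, 0.5, 1.0):
        t = nonblocking_step_time(t_comm, t_comp, overlap)
        print(f"overlap={overlap:.1f}: step time {t:.1f} (blocking: {blocking:.1f})")
```

Under this model, perfect overlap reduces the step time to the longer of the two phases, while zero overlap gives no benefit over a blocking collective. The abstract's point about CPU-driven progress corresponds to a low effective `overlap` value in this simplified picture.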

Tags: Benchmarking, Computer science, InfiniBand, Intel Xeon Phi, Linear Algebra, MPI

March 14, 2014 by hgpu
