high performance computing on graphics processing units: hgpu.org

hgpu.org » Applications » Computer science » Optimizing All-to-All and Allgather Communications on GPGPU Clusters

Optimizing All-to-All and Allgather Communications on GPGPU Clusters

Ashish Kumar Singh

Ohio State University

Ohio State University, 2012

BibTeX

Download (PDF)

View

Source

2726

views

High Performance Computing (HPC) is rapidly becoming an integral part of Science,Engineering and Business. Scientists and engineers are leveraging HPC solutions to run their applications that require high bandwidth, low latency, and very high compute capabilities. General Purpose Graphics Processing Units (GPGPUs)are becoming more popular within the HPC community because of their highly parallel structure, which makes it possible for applications to gain multi-x performance gain. The Tianhe-1A and Tsubame systems received significant attention for their architectures that leverage GPGPUs. Increasingly many scientific applications that were originally written for CPUs using MPI for parallelism are being ported to these hybrid CPU-GPU clusters. In the traditional sense, CPUs perform computation while the MPI library takes care of communication. When computation is performed on GPGPUs, the data has to be moved from device memory to main memory before it can be used in communication. Though GPGPUs provide huge compute potential, the data movement to and from GPGPUs is both a performance and productivity bottleneck. Recently, the MVAPICH2 MPI library has been modified to directly support point-to-point MPI communication from the GPU memory [33]. Using this support, programmers do not need to explicitly move data to main memory before using MPI. This feature also enables performance improvement due to tight integration of GPU data movement and MPI internal protocols. Collective communication is commonly used in HPC applications. These applications spend a significant portion of their time doing such collective communications. Therefore, optimizing performance of collectives has a significant impact on the applications’ performance. The all-to-all and allgather communication operations in message-passing systems are heavily used collectives that have O(N2) communication, for N processes. In this thesis, we outline the major design alternatives for the two collective communication operations on GPGPU clusters. We propose efficient and scalable designs and provide a corresponding performance analysis. Using our dynamic staging techniques, the latency of MPI Alltoall on GPGPU clusters can be improved by 59% over a Naive approach based implementation and 44% over a Send-Recv based implementation for 32KBytes messages on 32 processes. Our proposed design, Fine Grained Pipeline, can improve the performance of MPI Allgather on GPGPU clusters by 46% over Naive design and 81% over Send-Recv based design for a message size of 16 KBytes on 64 processes. The proposed designs have been incorporated into the open source MPI stack, MVAPICH2.

Tags: Computer science, CUDA, GPU cluster, MPI, nVidia, Tesla C2050, Thesis

July 14, 2012 by hgpu

Rating: 2.3/5. From 2 votes.

Please wait...

Your response

You must be logged in to post a comment.

HPCTransCompile: An AI Compiler Generated Dataset for High-Performance CUDA Transpilation and LLM Preliminary Exploration

chemtrain: Training Molecular Dynamics Potentials in JAX

chemtrain-deploy: A parallel and scalable framework for machine learning potentials in million-atom MD simulations

microSYCL: SYCL micro-benchmarks repository

Exploring SYCL as a Portability Layer for High-Performance Computing on CPUs

See all packages

* * *

high performance computing on graphics processing units: hgpu.org

Optimizing All-to-All and Allgather Communications on GPGPU Clusters

Your response

Recent source codes

Efficient GPU Implementation of Multi-Precision Integer Division

ParEval: A Parallel Code Evaluation Benchmark

FlashSparse: Minimizing Computation Redundancy for Fast Sparse Matrix Multiplications on Tensor Cores

exa-AMD: Exascale Accelerated Materials Discovery

WiLLM: An Open Wireless LLM Communication System

Vcc: the Vulkan Clang Compiler

hpcbench: A set of benchmarking utilities for biomolecular simulation tools

HPCTransCompile: An AI Compiler Generated Dataset for High-Performance CUDA Transpilation and LLM Preliminary Exploration

chemtrain: Training Molecular Dynamics Potentials in JAX

microSYCL: SYCL micro-benchmarks repository

Most viewed papers (last 30 days)

Optimizing All-to-All and Allgather Communications on GPGPU Clusters

Share this:

Your response

Recent source codes

Most viewed papers (last 30 days)