high performance computing on graphics processing units: hgpu.org

Optimizing All-to-All and Allgather Communications on GPGPU Clusters

Ashish Kumar Singh

Ohio State University

Ohio State University, 2012

High Performance Computing (HPC) is rapidly becoming an integral part of science, engineering, and business. Scientists and engineers are leveraging HPC solutions to run applications that require high bandwidth, low latency, and very high compute capabilities. General Purpose Graphics Processing Units (GPGPUs) are becoming more popular within the HPC community because their highly parallel structure makes it possible for applications to achieve multi-fold performance gains. The Tianhe-1A and Tsubame systems received significant attention for their architectures that leverage GPGPUs. An increasing number of scientific applications that were originally written for CPUs, using MPI for parallelism, are being ported to these hybrid CPU-GPU clusters. Traditionally, CPUs perform computation while the MPI library takes care of communication. When computation is performed on GPGPUs, the data has to be moved from device memory to main memory before it can be used in communication. Though GPGPUs provide huge compute potential, the data movement to and from GPGPUs is both a performance and a productivity bottleneck. Recently, the MVAPICH2 MPI library has been modified to directly support point-to-point MPI communication from GPU memory [33]. With this support, programmers do not need to explicitly move data to main memory before using MPI. This feature also improves performance through tight integration of GPU data movement with MPI internal protocols. Collective communication is commonly used in HPC applications, and these applications spend a significant portion of their time in collective operations. Therefore, optimizing the performance of collectives has a significant impact on application performance. The all-to-all and allgather communication operations in message-passing systems are heavily used collectives that have O(N^2) communication for N processes.
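To make the two collectives and the O(N^2) figure concrete, the following is a minimal pure-Python sketch of their data-movement semantics. It does not use MPI or a GPU, and the function names (`alltoall`, `allgather`, `point_to_point_messages`) are hypothetical helpers chosen for illustration. The message count assumes a direct exchange in which every rank sends one block to every other rank.

```python
def alltoall(send):
    """Simulate all-to-all: send[i][j] is the block rank i sends to rank j.

    Returns recv, where recv[j][i] == send[i][j] (a transpose of the blocks).
    """
    n = len(send)
    return [[send[i][j] for i in range(n)] for j in range(n)]


def allgather(blocks):
    """Simulate allgather: blocks[i] is rank i's contribution.

    Every rank ends up with the full list of contributions, in rank order.
    """
    n = len(blocks)
    return [list(blocks) for _ in range(n)]


def point_to_point_messages(n):
    """Messages needed if each of n ranks sends directly to the other n - 1."""
    return n * (n - 1)
```

Under this direct-exchange assumption, 32 processes exchange 32 * 31 = 992 blocks, which is the quadratic growth the abstract refers to.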
In this thesis, we outline the major design alternatives for these two collective communication operations on GPGPU clusters. We propose efficient and scalable designs and provide a corresponding performance analysis. Using our dynamic staging techniques, the latency of MPI Alltoall on GPGPU clusters can be improved by 59% over a Naive implementation and by 44% over a Send-Recv based implementation for 32 KByte messages on 32 processes. Our proposed design, Fine Grained Pipeline, can improve the performance of MPI Allgather on GPGPU clusters by 46% over the Naive design and by 81% over the Send-Recv based design for a message size of 16 KBytes on 64 processes. The proposed designs have been incorporated into the open source MPI stack, MVAPICH2.
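The abstract does not describe how the staging and pipelining designs work internally. As a generic illustration only, and not the thesis's method, the sketch below uses a textbook two-stage pipeline model to show why overlapping device-to-host copies with network transfers can reduce latency. The function names and timings are hypothetical, and the model assumes equal-sized chunks with no per-chunk overhead.

```python
def naive_time(copy_time, network_time):
    """Copy the whole message to host memory, then send it: no overlap."""
    return copy_time + network_time


def pipelined_time(copy_time, network_time, chunks):
    """Two-stage pipeline: chunk i+1 is copied while chunk i is on the network.

    Assumes equal-sized chunks and ignores per-chunk startup overhead.
    """
    copy_per_chunk = copy_time / chunks
    net_per_chunk = network_time / chunks
    # The first chunk passes through both stages; each remaining chunk
    # adds only the time of the slower stage.
    return copy_per_chunk + net_per_chunk + (chunks - 1) * max(copy_per_chunk, net_per_chunk)
```

With hypothetical stage times of 4 units each, the unpipelined cost is 8 units, while splitting the message into 4 chunks gives 5 units in this model. Real systems add per-chunk overhead, so the benefit does not grow indefinitely as chunks get smaller.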

Tags: Computer science, CUDA, GPU cluster, MPI, nVidia, Tesla C2050, Thesis

July 14, 2012 by hgpu
