Optimizing All-to-All and Allgather Communications on GPGPU Clusters
A.K. Singh
The Ohio State University
PhD thesis, The Ohio State University, 2012
@phdthesis{singh2012optimizing,
  title={Optimizing All-to-All and Allgather Communications on GPGPU Clusters},
  author={Singh, A.K.},
  year={2012},
  school={The Ohio State University}
}
High Performance Computing (HPC) is rapidly becoming an integral part of Science, Engineering, and Business. Scientists and engineers are leveraging HPC solutions to run applications that require high bandwidth, low latency, and very high compute capability. General Purpose Graphics Processing Units (GPGPUs) are becoming more popular within the HPC community because of their highly parallel structure, which can give applications many-fold performance gains. The Tianhe-1A and Tsubame systems received significant attention for their architectures that leverage GPGPUs. Increasingly, scientific applications that were originally written for CPUs using MPI for parallelism are being ported to these hybrid CPU-GPU clusters.

In the traditional model, CPUs perform the computation while the MPI library takes care of communication. When computation is performed on GPGPUs, the data has to be moved from device memory to main memory before it can be used in communication. Though GPGPUs provide huge compute potential, the data movement to and from GPGPUs is both a performance and a productivity bottleneck. Recently, the MVAPICH2 MPI library has been modified to directly support point-to-point MPI communication from GPU memory [33]. With this support, programmers do not need to explicitly move data to main memory before using MPI. This feature also enables performance improvements through tight integration of GPU data movement with MPI internal protocols.

Collective communication is commonly used in HPC applications, and these applications spend a significant portion of their time in such collective communications. Therefore, optimizing the performance of collectives has a significant impact on application performance. The all-to-all and allgather operations in message-passing systems are heavily used collectives with O(N^2) communication for N processes. In this thesis, we outline the major design alternatives for these two collective communication operations on GPGPU clusters. We propose efficient and scalable designs and provide a corresponding performance analysis. Using our dynamic staging techniques, the latency of MPI_Alltoall on GPGPU clusters can be improved by 59% over a Naive implementation and by 44% over a Send-Recv based implementation for 32 KByte messages on 32 processes. Our proposed Fine Grained Pipeline design improves the performance of MPI_Allgather on GPGPU clusters by 46% over the Naive design and by 81% over the Send-Recv based design for a message size of 16 KBytes on 64 processes. The proposed designs have been incorporated into the open source MPI stack, MVAPICH2.
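The collective designs discussed in the thesis build on this GPU-aware point-to-point support in MVAPICH2. As a minimal illustrative sketch (not taken from the thesis), the fragment below contrasts a Naive-style path, where the application explicitly stages data through host memory around MPI_Alltoall, with a GPU-aware call that passes device pointers directly so the MPI library can handle the device-host movement internally. It assumes a CUDA-aware MPI build; the message size and variable names are illustrative assumptions.

    /* Sketch: MPI_Alltoall from GPU buffers, two ways.
     * Assumes a CUDA-aware MPI library (e.g. MVAPICH2 with GPU support). */
    #include <mpi.h>
    #include <cuda_runtime.h>
    #include <stdlib.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        int rank, nprocs;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

        const size_t count = 32 * 1024;            /* 32 KBytes per peer (illustrative) */
        size_t total = count * (size_t)nprocs;

        char *d_send, *d_recv;                     /* device buffers */
        cudaMalloc((void **)&d_send, total);
        cudaMalloc((void **)&d_recv, total);
        cudaMemset(d_send, rank, total);           /* dummy payload */

        /* (1) Naive path: application stages data through host memory. */
        char *h_send = (char *)malloc(total);
        char *h_recv = (char *)malloc(total);
        cudaMemcpy(h_send, d_send, total, cudaMemcpyDeviceToHost);
        MPI_Alltoall(h_send, (int)count, MPI_CHAR,
                     h_recv, (int)count, MPI_CHAR, MPI_COMM_WORLD);
        cudaMemcpy(d_recv, h_recv, total, cudaMemcpyHostToDevice);

        /* (2) GPU-aware path: pass device pointers directly and let the
         * MPI library perform and schedule the device<->host movement. */
        MPI_Alltoall(d_send, (int)count, MPI_CHAR,
                     d_recv, (int)count, MPI_CHAR, MPI_COMM_WORLD);

        free(h_send); free(h_recv);
        cudaFree(d_send); cudaFree(d_recv);
        MPI_Finalize();
        return 0;
    }

The dynamic staging and Fine Grained Pipeline designs proposed in the thesis refine the second path inside the collective itself; the exact staging and pipelining mechanisms are detailed in the thesis.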
July 14, 2012 by hgpu