Design, Implementation and Test of Efficient GPU to GPU Communication Methods
Department of Informatics, Technical University of Munich
mediaTUM, Universitätsbibliothek, Technische Universität München, 2020
@article{vanecek2020design,
title={Design, Implementation and Test of Efficient GPU to GPU Communication Methods},
author={Vanecek, Stepan},
year={2020}
}
Stencil codes are commonly used to solve many problems. On parallel heterogeneous systems with CPUs and GPUs, the domain is usually split and assigned to GPUs, where it is further divided into GPU blocks. The iterative distributed stencil computation consists of two steps: computation and communication, in which the subdomains exchange boundary data (the so-called 'halo exchange'). On multi-node systems, it is crucial to transfer data efficiently from one GPU to another via MPI, the de facto standard in HPC. In this master's thesis, methods of GPU-to-GPU data exchange via MPI are examined with a focus on halo exchange. The thesis describes the design of a set of naive baseline approaches and a set of optimized solutions called taskqueue. The main idea behind the taskqueue approach consists in overlapping packing and unpacking (computation) with host-to-host MPI communication, and in reusing one kernel for both packing and unpacking workloads to eliminate the kernel launch, termination, and synchronization overheads. The implementation relies on pinned host memory, a segment of main memory accessible by both the CPU and the GPU, which the two parties use to communicate. A portable solution that runs on both NVidia and AMD GPUs is designed, so that the differences between the two platforms can be observed. The performance of the taskqueue approaches is evaluated against a baseline reference on both NVidia and AMD testbeds. The tests on NVidia yield a stable speedup that ranges from 1.09 to 1.21 for different workload sizes. In contrast, the approach did not prove useful on the AMD testbed, where it needed more than 200× as much time to finish. The main cause is problems with the CPU and GPU concurrently reading from and writing to one memory location. This observation, and other observations made mainly on the AMD testbed, are identified and their implications are discussed in this work. 
The thesis reveals some rigours of platform-agnostic GPU development and uncovers some unexpected behaviour patterns on AMD GPUs when combined with MPI. Finally, optimizations to the taskqueue algorithm are proposed with the aim of achieving better performance on the AMD testbed as well.
January 3, 2021 by hgpu