
Design, Implementation and Test of Efficient GPU to GPU Communication Methods

Stepan Vanecek
Department of Informatics, Technical University of Munich
mediaTUM, Universitätsbibliothek, Technische Universität München, 2020

@mastersthesis{vanecek2020design,

   title={Design, Implementation and Test of Efficient GPU to GPU Communication Methods},

   author={Vanecek, Stepan},

   school={Technical University of Munich},

   year={2020}

}

Stencil codes are commonly used to solve many problems. On parallel heterogeneous systems with CPUs and GPUs, the domain is usually split and assigned to GPUs, where it is further divided into GPU blocks. The iterative distributed stencil computation consists of two steps, computation and communication, in which the subdomains exchange boundary data, a step also called 'halo exchange'. On multi-node systems, it is crucial to transfer data efficiently from one GPU to another via MPI, the de-facto standard solution in HPC. In this master thesis, methods of GPU-to-GPU data exchange via MPI are examined with a focus on halo exchange. The thesis describes the design of a set of naive baseline approaches and a set of optimized solutions called taskqueue. The main idea behind the taskqueue approach is to overlap packing and unpacking (computation) with host-to-host MPI communication, and to reuse one kernel for both the packing and unpacking workloads, eliminating the kernel launch, termination, and synchronization overheads. The implementation relies on pinned host memory, a segment of main memory accessible by both the CPU and the GPU, which the two parties use to communicate. A portable solution that runs on both NVIDIA and AMD GPUs is designed, so that the differences between the platforms can be observed. The performance of the taskqueue approaches is evaluated against a baseline reference on both NVIDIA and AMD testbeds. The tests on NVIDIA yield a stable speedup ranging from 1.09 to 1.21 across workload sizes. In contrast, the approach did not prove useful on the AMD testbed, where it needed more than 200× as much time to finish. The main reason is problems with the CPU and GPU concurrently reading from and writing to the same memory location. This observation, and further observations made mainly on the AMD testbed, are identified and their implications are discussed in this work.
The thesis thereby reveals some of the difficulties of platform-agnostic GPU development and uncovers unexpected behaviour patterns when AMD GPUs are combined with MPI. Finally, optimizations to the taskqueue algorithm are proposed that may achieve better performance on the AMD testbed as well.

* * *

HGPU group © 2010-2021 hgpu.org

All rights belong to the respective authors

Contact us: