high performance computing on graphics processing units: hgpu.org

hgpu.org » Applications » Computer science » Improving Communication Performance in GPU-Accelerated HPC Clusters

Improving Communication Performance in GPU-Accelerated HPC Clusters

Iman Faraji

Queen’s University, Kingston, Ontario, Canada

Queen’s University, 2018

BibTeX

Download (PDF)

View

Source

Source codes

Package:

A benchmark suite to evaluate CPU and GPU communication efficiency of MPI using different communication patterns

2186

views

In recent years, GPUs have been adopted in many High-Performance Computing (HPC) clusters due to their massive computational power and energy efficiency. The Message Passing Interface (MPI) is the de-facto standard for parallel programming. Many HPC applications, written in MPI, use parallel processes and multiple GPUs to achieve higher performance and GPU memory capacity. In such applications, efficiently performing GPU inter-process communication is the key in the application performance. In this dissertation, we present proposals to improve the GPU inter-process communication in HPC clusters using novel GPU-aware designs, efficient and scalable algorithms, topology-aware designs, and hardware features. Specifically, we propose various approaches to improve the efficiency of MPI communication routines in GPU clusters. We also propose designs that evaluate the total application inter-process communication and provide solutions to improve its efficiency. First, we propose efficient GPU-aware algorithms to improve MPI collective performance. We show the importance of minimizing CPU intervention on GPU collective performance. We also utilize GPU features to enhance both collective communication and computation. As inter-process communications scale to across multi-GPU nodes and clusters, efficient inter-process communication routines must consider the physical structure of the underlying system. Given the hierarchical nature of the GPU clusters with multi-GPU nodes, we propose hierarchy-aware designs for GPU collectives and show that different algorithms are favored at different hierarchy levels. With the presence of multiple data copy mechanisms in modern GPU clusters, it is crucial to make an informed decision on how to use them for efficient inter-process communications. In this regard, we propose designs that intelligently decide which data copy mechanisms to use in GPU collectives. Using these designs, we reveal the importance of using multiple data copy mechanisms in performing multiple inter-process communications. Finally, we provide topology-aware solutions to improve the application inter-process communication efficiency, both within multi-GPU nodes and across GPU clusters. First, we study the performance of different communication channels used for GPU inter-process communications. Next, we propose topology-aware designs that consider both the system physical topology and application communication pattern. These designs improve the communication performance by performing more intensive inter-process communication on stronger communication channels.

Tags: Computer science, CUDA, GPU cluster, MPI, nVidia, Package, Performance, Thesis

January 28, 2018 by hgpu

Rating: 3.0/5. From 2 votes.

Please wait...

hpcbench: A set of benchmarking utilities for biomolecular simulation tools

Engineering Supercomputing Platforms for Biomolecular Applications

high performance computing on graphics processing units: hgpu.org

Improving Communication Performance in GPU-Accelerated HPC Clusters

Package:

Recent source codes

hpcbench: A set of benchmarking utilities for biomolecular simulation tools

HPCTransCompile: An AI Compiler Generated Dataset for High-Performance CUDA Transpilation and LLM Preliminary Exploration

chemtrain: Training Molecular Dynamics Potentials in JAX

microSYCL: SYCL micro-benchmarks repository

XaaS containers

CASS: Cuda-Amd aSSembly

Cluser of smartphones for edge computing application using TensorFlow

SYCL Container

Efficient Graph Embedding at Scale: Optimizing CPU-GPU-SSD Integration

Can Large Language Models Predict Parallel Code Performance?

Most viewed papers (last 30 days)

Improving Communication Performance in GPU-Accelerated HPC Clusters

Package:

Share this:

Recent source codes

Most viewed papers (last 30 days)