high performance computing on graphics processing units: hgpu.org

hgpu.org » Programming » Algorithms » Direct Communication Methods for Distributed GPUs

Direct Communication Methods for Distributed GPUs

Lena Oden

Ruprecht-Karls-University, Helderberg

Ruprecht-Karls-University, 2015

@article{oden2015direct,

title={Direct Communication Methods for Distributed GPUs},

author={Oden, Lena},

year={2015}

}

Download (PDF)

View

Source

2569

views

Today, GPUs and other parallel accelerators are widely used in high performance computing, due to their high computational power and high performance per watt. Still, one of the main bottlenecks of GPU-accelerated cluster computing is the data transfer between distributed GPUs. This not only affects performance, but also power consumption. Often, a data transfer between two distributed GPUs even requires intermediate copies in host memory. This overhead penalizes small data movements and synchronization operations. In this work, different communication methods for distributed GPUs are implemented and evaluated. First, a new technique, called GPUDirect RDMA, is implemented for the Extoll device and evaluated. The performance results show that this technique brings performance benefits for small- and mediums-sized data transfers, but for larger transfer sizes, a staged protocol is preferable since the PCIe-bus does not well support peer-to-peer data transfers. In the next step, GPUs are integrated to the one-sided communication library GPI-2. Since this interface was designed for heterogeneous memory structures, it allows an easy integration of GPUs. The performance results show that using one-sided communication for GPUs brings some performance benefits compared to two-sided communication which is the current state-of-the-art. However, using GPI-2 for communication still requires a host thread to control GPU-related communication, although the data is transferred directly between the GPUs without any host copies. Therefore, the subsequent part of the work analyze GPU-controlled communication. First, a put/get communication interface, based on Infiniband verbs, for the GPU is implemented. This interface enables the GPU to independently source and synchronize communication requests without any involvements of the CPU. However, the Infiniband verbs protocol adds a lot of sequential overhead to the communication, so the performance of GPU-controlled put/get communication is far behind the performance of CPU-controlled put/get communication. Another problem is intra-GPU synchronization, since GPU blocks are non-preemptive. The use of communication requests within a GPU can easily result in a deadlock. Dynamic parallelism solves this problem. Although the performance of applications using GPU-controlled communication is still slightly worse than the performance of hybrid applications, the performance per watt increases, since the CPU can be relieved from the communication work. As a communication model that is more in line with the massive parallelism of GPUs, the performance of a hardware-supported global address space for GPUs is evaluated. This global address space allows communication with simple load and store instructions which can be performed by multiple threads in parallel. With this method, the latency for a GPU-to-GPU data transfer can be reduced to 3us, using an FPGA. The results show that a global address space is best for applications that require small, non-blocking, and irregular data transfers. However, the main bottleneck of this method is that is does not allow overlapping of communication and computation which is the case for put/get communication. However, by using GPU optimized communication models, depending on the application, between 10 and 50% better energy efficiency can be reached than by using a hybrid model with CPU-controlled communication.

Tags: Algorithms, Cluster computing, Computer science, CUDA, Heterogeneous systems, MPI, nVidia, Thesis

April 25, 2015 by hgpu

No votes yet.

Please wait...

Your response

You must be logged in to post a comment.

high performance computing on graphics processing units: hgpu.org

Direct Communication Methods for Distributed GPUs

Your response

Recent source codes

CUDAnalyst (CUDA + Analyst)

CodegenBench

KernelBenchX: A Comprehensive Benchmark for Evaluating LLM-Generated GPU Kernels

CUDA Kernel Fusion Benchmarks

IntelliKit: Agent-first tooling for AMD hardware

DITRON: Distributed Compiler based on Triton for Parallel Systems

CuTile Benchmark Suite: Performance and Productivity Tradeoffs for GPU Kernel Programming on Blackwell Architecture

Device Virtual Machine (DVM)

MegaTrain: Full Precision Training of 100B+ Parameter Large Language Models on a Single GPU

Agentic Code Optimization via Compiler-LLM Cooperation

Most viewed papers (last 30 days)

Direct Communication Methods for Distributed GPUs

Share this:

Your response

Recent source codes

Most viewed papers (last 30 days)