high performance computing on graphics processing units: hgpu.org

Optimizing Communication for Clusters of GPUs

Michael Wayne LeBeane

Department of Electrical and Computer Engineering, The University of Texas at Austin

The University of Texas at Austin, 2018

@article{breternitz2018optimizing,
  title={Optimizing Communication for Clusters of GPUs},
  author={Breternitz Jr, Mauricio and Erez, Mattan},
  year={2018}
}

GPUs are frequently used to accelerate data-parallel workloads across a wide variety of application domains. While GPUs offer a large amount of computational throughput within a single node, the largest problems require a cluster of such devices communicating with different compute nodes across a network. These clusters can range in size from a small handful of machines constructed from commodity parts, to several thousand machines built from specialized components. Despite widespread deployment of GPUs across clusters both big and small, communication between GPUs in networks of computers remains unwieldy. Networks of GPUs are currently programmed in a clunky coprocessor style, requiring coordination with a host CPU and driver stack to communicate with other systems. These intra-node bottlenecks for initiating communication operations are often much greater than the cost of sending data over a high-performance network.

This dissertation explores new techniques to more tightly integrate GPUs with network adapters to allow efficient communication between GPUs across the network. It evaluates both hardware and software changes to NICs and GPUs to enable end-to-end, user-space communication between networks of GPUs, avoiding critical path CPU interference.

First, Extended Task Queuing (XTQ) is proposed to provide the ability to launch remote kernels without intervention of a host CPU at the target. Inspired by classic work on active messaging, XTQ uses NIC architectural modifications to support remote kernel launch without the participation of the remote CPU. Bypassing the remote CPU reduces remote kernel launch latencies and allows a more decentralized, cluster-wide work dispatch system.

Next, intra-kernel communication is optimized through the Command Processor Networking (ComP-Net) framework. ComP-Net uses a little-known feature of modern GPUs: embedded, programmable microprocessors that are typically referred to as Command Processors (CPs). GPU communication latency is decreased by running the network software stack on the CP instead of the host CPU. ComP-Net implements a runtime and programming interface that allows the GPU compute units to take advantage of the unique capabilities of a networking CP. Challenges related to the GPU’s relaxed memory model and L2 cache thrashing are addressed to reduce the latency of network communication through the CP.

Finally, GPU Triggered Networking (GPU-TN) is proposed as an alternative intra-kernel networking scheme that enables a GPU to directly trigger network operations from within a GPU kernel without the involvement of any CPU on the critical path. GPU Triggered Networking implements a NIC hardware mechanism by which the GPU can directly trigger the network adapter to send messages. In this approach, the host CPU is responsible for creating the network command packet on behalf of the GPU and registering it with the NIC. When the GPU is ready to send a message, it "triggers" the NIC using a memory-mapped store operation. A small amount of additional hardware in the NIC collects these writes from the GPU and initiates the pending network operation when a threshold condition has been met. These optimizations allow for fine-grained remote communication without ending a kernel.
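The GPU-TN flow described in the abstract (the host pre-registers a command, the GPU issues trigger writes, the NIC fires once a threshold is reached) can be illustrated with a small software model. This is only a sketch under stated assumptions, not the dissertation's hardware design or API: the names `PendingOp`, `TriggeredNic`, `register`, and `trigger_write`, the integer tag, and the "count reaches threshold" rule are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class PendingOp:
    """A network command pre-built by the host CPU (hypothetical layout)."""
    payload: bytes
    dest: int
    threshold: int       # trigger writes required before the send is issued
    count: int = 0       # trigger writes observed so far
    fired: bool = False  # True once the send has been initiated


@dataclass
class TriggeredNic:
    """Software stand-in for the extra NIC hardware that collects GPU writes."""
    pending: Dict[int, PendingOp] = field(default_factory=dict)
    sent: List[Tuple[int, bytes]] = field(default_factory=list)

    def register(self, tag: int, op: PendingOp) -> None:
        # Host CPU step, done ahead of time and off the critical path.
        self.pending[tag] = op

    def trigger_write(self, tag: int) -> bool:
        # Models one memory-mapped store from the GPU to the NIC.
        # Returns True only for the write that causes the send.
        op = self.pending[tag]
        if op.fired:
            return False
        op.count += 1
        if op.count >= op.threshold:  # assumed threshold condition
            op.fired = True
            self.sent.append((op.dest, op.payload))
            return True
        return False


if __name__ == "__main__":
    nic = TriggeredNic()
    # Assumption: four work-groups each issue one trigger write when done.
    nic.register(tag=7, op=PendingOp(payload=b"halo", dest=1, threshold=4))
    for _ in range(4):
        nic.trigger_write(7)
    print(nic.sent)
```

In this model the command is built and registered once by the host, and the only work on the critical path is the sequence of `trigger_write` calls, which mirrors the division of labor the abstract describes.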

Tags: Computer science, GPU cluster, Memory model, Thesis

September 2, 2018 by hgpu
