Optimizing Communication for Clusters of GPUs
Department of Electrical and Computer Engineering, The University of Texas at Austin
The University of Texas at Austin, 2018
@article{breternitz2018optimizing,
title={Optimizing Communication for Clusters of GPUs},
author={Breternitz Jr, Mauricio and Erez, Mattan},
year={2018}
}
GPUs are frequently used to accelerate data-parallel workloads across a wide variety of application domains. While GPUs offer a large amount of computational throughput within a single node, the largest problems require a cluster of such devices, spread across many compute nodes, communicating over a network. These clusters range in size from a handful of machines built from commodity parts to several thousand machines built from specialized components. Despite the widespread deployment of GPUs across clusters both big and small, communication between GPUs in networks of computers remains unwieldy. Networks of GPUs are currently programmed in a clunky coprocessor style, requiring coordination with a host CPU and driver stack to communicate with other systems. These intra-node bottlenecks for initiating communication operations often cost far more than sending the data itself over a high-performance network. This dissertation explores new techniques to more tightly integrate GPUs with network adapters and allow efficient communication between GPUs across the network. It evaluates both hardware and software changes to NICs and GPUs that enable end-to-end, user-space communication between networks of GPUs while avoiding critical-path CPU interference.

First, Extended Task Queuing (XTQ) is proposed to launch remote kernels without intervention from the host CPU at the target. Inspired by classic work on active messaging, XTQ uses NIC architectural modifications to support remote kernel launch without the participation of the remote CPU. Bypassing the remote CPU reduces remote kernel launch latencies and enables a more decentralized, cluster-wide work dispatch system.

Next, intra-kernel communication is optimized through the Command Processor Networking (ComP-Net) framework. ComP-Net uses a little-known feature of modern GPUs: embedded, programmable microprocessors typically referred to as Command Processors (CPs). GPU communication latency is reduced by running the network software stack on the CP instead of the host CPU. ComP-Net implements a runtime and programming interface that allows the GPU compute units to take advantage of the unique capabilities of a networking CP. Challenges related to the GPU’s relaxed memory model and L2 cache thrashing are addressed to reduce the latency of network communication through the CP.

Finally, GPU Triggered Networking (GPU-TN) is proposed as an alternative intra-kernel networking scheme that enables a GPU to initiate network operations from within a kernel without involving any CPU on the critical path. GPU Triggered Networking adds a NIC hardware mechanism through which the GPU can directly trigger the network adapter to send messages. In this approach, the host CPU creates the network command packet on behalf of the GPU and registers it with the NIC ahead of time. When the GPU is ready to send a message, it "triggers" the NIC using a memory-mapped store operation. A small amount of additional hardware in the NIC collects these writes from the GPU and initiates the pending network operation once a threshold condition is met. These optimizations allow fine-grained remote communication without ending a kernel.
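To make the GPU-TN flow concrete, the following is a minimal CUDA sketch of the producer-side pattern only, under stated assumptions: a mapped host counter stands in for the NIC's doorbell and threshold hardware, and a polling host thread stands in for the NIC itself. The names (produce_and_trigger, doorbell, THRESHOLD) are illustrative and are not part of the dissertation's implementation, which places the counter and threshold logic inside the network adapter.

```cuda
// Sketch: emulating a GPU-TN-style trigger in plain CUDA.
// The "NIC" is a host thread polling a mapped counter; in GPU-TN the counter,
// threshold check, and message injection live in NIC hardware, so no CPU sits
// on the critical path. All names here are illustrative assumptions.
#include <cstdio>
#include <cuda_runtime.h>

// Number of producer blocks that must ring the doorbell before the pending
// network operation fires (illustrative value).
constexpr unsigned int THRESHOLD = 4;

// Each block computes its slice of the payload, then rings the doorbell once.
__global__ void produce_and_trigger(float *payload, int n, unsigned int *doorbell)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        payload[i] = 2.0f * i;          // stand-in for real computation

    __syncthreads();                    // block's payload writes are done
    if (threadIdx.x == 0) {
        __threadfence_system();         // make payload visible before the trigger
        atomicAdd(doorbell, 1u);        // the memory-mapped "trigger" store
    }
}

int main()
{
    cudaSetDeviceFlags(cudaDeviceMapHost);
    const int n = THRESHOLD * 256;

    float *payload;
    cudaMalloc(&payload, n * sizeof(float));

    // Mapped, pinned host memory stands in for the NIC's doorbell register.
    unsigned int *doorbell_h, *doorbell_d;
    cudaHostAlloc(&doorbell_h, sizeof(*doorbell_h), cudaHostAllocMapped);
    cudaHostGetDevicePointer(&doorbell_d, doorbell_h, 0);
    *doorbell_h = 0;

    // The host prepares the "network command" ahead of time (omitted here),
    // then launches the kernel and leaves the critical path.
    produce_and_trigger<<<THRESHOLD, 256>>>(payload, n, doorbell_d);

    // Poll the doorbell as the NIC hardware would; only the device writes it.
    while (*(volatile unsigned int *)doorbell_h < THRESHOLD) { /* spin */ }
    printf("threshold reached: pending message would be posted here\n");

    cudaDeviceSynchronize();
    cudaFree(payload);
    cudaFreeHost(doorbell_h);
    return 0;
}
```

In the actual design the threshold comparison and message injection happen inside the NIC, so the polling loop above exists only to emulate that hardware; the kernel-side pattern of "compute, fence, store to a doorbell" is the part the sketch is meant to illustrate.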
September 2, 2018 by hgpu