Communication and Coordination Paradigms for Highly-Parallel Accelerators

Marc S. Orr
University of Wisconsin-Madison, 2016


@phdthesis{orr2016communication,
   title={Communication and Coordination Paradigms for Highly-Parallel Accelerators},
   author={Orr, Marc S.},
   school={University of Wisconsin-Madison},
   year={2016}
}





As CPU performance plateaus, many communities are turning to highly-parallel accelerators such as graphics processing units (GPUs) to obtain their desired level of processing power. Unfortunately, the GPU’s massive parallelism and data-parallel execution model make it difficult to synchronize GPU threads. To resolve this, we introduce aggregation buffers, which are producer/consumer queues that act as an interface from the GPU to a system-level resource. To amortize the high cost of producer/consumer synchronization, we introduce leader-level synchronization, where a GPU thread is elected to synchronize on behalf of its data-parallel cohort. One challenge is to coordinate threads in the same data-parallel cohort that access different aggregation buffers. We explore two schemes to resolve this. In the first, called SIMT-direct aggregation, a data-parallel cohort invokes leader-level synchronization once for each aggregation buffer being accessed. In the second, called indirect aggregation, a data-parallel cohort uses leader-level synchronization to export its operations to a hardware aggregator, which repacks the operations into their respective aggregation buffers.

We investigate two use cases for aggregation buffers. The first is the channel abstraction, proposed by Gaster and Howes to dynamically aggregate asynchronously produced fine-grain work into coarser-grain tasks; however, no practical implementation had been proposed. We investigate implementing channels as aggregation buffers managed by SIMT-direct aggregation. We then present a case study that maps the fine-grain, recursive task spawning in the Cilk programming language to channels by representing it as a flow graph. We implement four Cilk benchmarks and show that Cilk can scale with the GPU architecture, achieving speedups of as much as 4.3x on eight GPU cores.

The second use case for aggregation buffers is to enable PGAS-style communication between threads executing on different GPUs. To explore this, we wrote a software runtime called Gravel, which incorporates aggregation buffers managed by indirect aggregation. Using Gravel, we distribute six applications, each with frequent small messages, across a cluster of eight AMD accelerated processing units (APUs) connected by InfiniBand. Compared to one node, these applications run 5.3x faster, on average. Furthermore, we show that Gravel is more programmable and usually more performant than prior GPU networking models.
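To illustrate the leader-level synchronization idea described above, the following is a minimal CUDA-style sketch (the dissertation itself targets AMD APUs; the buffer layout, `Op` type, and function names here are hypothetical, not the author's implementation). One lane of the data-parallel cohort is elected leader and performs a single atomic reservation on the aggregation buffer's tail on behalf of the whole cohort, which is then broadcast so each lane can fill its own slot.

```cuda
// Hypothetical aggregation-buffer sketch: names and layout are assumptions,
// not the dissertation's actual code.
typedef int Op;  // placeholder payload type

struct AggregationBuffer {
    unsigned long long tail;      // next free slot index (monotonic)
    int                capacity;  // ring-buffer size
    Op                *slots;     // backing storage
};

// Leader-level synchronization: one atomic per cohort, not one per thread.
__device__ void cohort_enqueue(AggregationBuffer *buf, Op my_op) {
    unsigned mask   = __activemask();          // lanes active in this cohort
    int      lane   = threadIdx.x % warpSize;  // assumes a 1-D thread block
    int      leader = __ffs(mask) - 1;         // elect lowest active lane
    int      count  = __popc(mask);            // ops produced by the cohort

    unsigned long long base = 0;
    if (lane == leader) {
        // Leader reserves `count` slots for the entire cohort at once.
        base = atomicAdd(&buf->tail, (unsigned long long)count);
    }
    // Broadcast the reserved base index from the leader to all lanes.
    base = __shfl_sync(mask, base, leader);

    // Each lane writes to its own slot, ordered by lane position.
    int my_offset = __popc(mask & ((1u << lane) - 1));
    buf->slots[(base + my_offset) % buf->capacity] = my_op;
}
```

The payoff is that producer/consumer synchronization cost is paid once per cohort rather than once per thread; SIMT-direct aggregation would repeat this election once per distinct buffer the cohort touches, while indirect aggregation instead exports the cohort's operations to a hardware aggregator in a single step.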

