
Communication and Coordination Paradigms for Highly-Parallel Accelerators

Marc S. Orr
University of Wisconsin-Madison, 2016

@phdthesis{orr2016communication,
   title={Communication and Coordination Paradigms for Highly-Parallel Accelerators},
   author={Orr, Marc S.},
   year={2016},
   school={University of Wisconsin-Madison}
}


As CPU performance plateaus, many communities are turning to highly-parallel accelerators such as graphics processing units (GPUs) to obtain their desired level of processing power. Unfortunately, the GPU’s massive parallelism and data-parallel execution model make it difficult to synchronize GPU threads. To resolve this, we introduce aggregation buffers, which are producer/consumer queues that act as an interface from the GPU to a system-level resource. To amortize the high cost of producer/consumer synchronization, we introduce leader-level synchronization, where a GPU thread is elected to synchronize on behalf of its data-parallel cohort. One challenge is to coordinate threads in the same data-parallel cohort that access different aggregation buffers. We explore two schemes to resolve this. In the first, called SIMT-direct aggregation, a data-parallel cohort invokes leader-level synchronization once for each aggregation buffer being accessed. In the second, called indirect aggregation, a data-parallel cohort uses leader-level synchronization to export its operations to a hardware aggregator, which repacks the operations into their respective aggregation buffers.

We investigate two use cases for aggregation buffers. The first is the channel abstraction, proposed by Gaster and Howes to dynamically aggregate asynchronously produced fine-grain work into coarser-grain tasks, for which no practical implementation had previously been proposed. We investigate implementing channels as aggregation buffers managed by SIMT-direct aggregation. We then present a case study that maps the fine-grain, recursive task spawning of the Cilk programming language onto channels by representing it as a flow graph. We implement four Cilk benchmarks and show that Cilk can scale with the GPU architecture, achieving speedups of as much as 4.3x on eight GPU cores.

The second use case for aggregation buffers is to enable PGAS-style communication between threads executing on different GPUs. To explore this, we wrote a software runtime called Gravel, which incorporates aggregation buffers managed by indirect aggregation. Using Gravel, we distribute six applications, each with frequent small messages, across a cluster of eight AMD accelerated processing units (APUs) connected by InfiniBand. Compared to one node, these applications run 5.3x faster, on average. Furthermore, we show that Gravel is more programmable and usually more performant than prior GPU networking models.
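The core mechanism is easy to illustrate. Below is a minimal CUDA sketch of leader-level synchronization over a single aggregation buffer. The AggBuffer layout, the publish() function, and the overflow-free power-of-two ring are illustrative assumptions for this sketch, not the dissertation's actual interface, and the data-parallel cohort is taken to be a 32-lane CUDA warp rather than an AMD wavefront. One lane is elected leader, reserves slots for the whole cohort with a single atomic, and broadcasts the base index; each lane then writes its own operation.

#include <cstdint>

// Illustrative aggregation buffer: a producer/consumer ring shared with a
// system-level consumer. The layout is an assumption made for this sketch.
struct AggBuffer {
    uint64_t *slots;          // payload storage, 'capacity' entries
    unsigned long long tail;  // producer reservation counter
    unsigned capacity;        // assumed to be a power of two
};

// Each active thread publishes one operation. One leader per warp performs
// the producer-side synchronization on behalf of its cohort, so the cohort
// pays for a single atomic operation instead of one per thread.
__device__ void publish(AggBuffer *buf, uint64_t op) {
    unsigned mask = __activemask();   // lanes in this cohort
    int lane   = threadIdx.x & 31;    // lane id within the warp
    int leader = __ffs(mask) - 1;     // lowest active lane leads
    int count  = __popc(mask);        // cohort size

    unsigned long long base = 0;
    if (lane == leader) {
        // Leader-level synchronization: one atomic reserves 'count'
        // slots for the entire data-parallel cohort.
        base = atomicAdd(&buf->tail, (unsigned long long)count);
    }
    // Broadcast the reserved base index from the leader to every lane.
    base = __shfl_sync(mask, base, leader);

    // Each lane claims a distinct slot, ranked by its position in the
    // active mask, and writes its operation with no further synchronization.
    int rank = __popc(mask & ((1u << lane) - 1));
    buf->slots[(base + rank) & (buf->capacity - 1)] = op;
}

A real runtime would also have the producer check for overflow against the consumer's head pointer and signal the consumer when work is available. In the SIMT-direct scheme, a cohort whose lanes target different buffers would invoke publish() once per distinct buffer; under indirect aggregation, the cohort would instead export its operations in one step to a hardware aggregator that repacks them.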

