
Synchronization and Coordination in Heterogeneous Processors

Joel Hestness
University of Wisconsin-Madison
Ph.D. thesis, 2016

@phdthesis{hestness2016synchronization,
   title={Synchronization and Coordination in Heterogeneous Processors},
   author={Hestness, Joel},
   year={2016},
   school={University of Wisconsin-Madison}
}


Recent developments in internet connectivity and mobile devices have spurred massive data growth. Users demand rapid data processing from both large-scale systems and energy-constrained personal devices. Concurrently with this data growth, transistor scaling trends have slowed, diminishing processor performance and energy improvements compared to prior generations. To sustain performance trends while staying within energy budgets, emerging systems integrate many processing cores and add accelerators for specialized computation. The graphics processor (GPU) has become a prominent accelerator by offering fast, energy-efficient processing for data-parallel applications. In systems like desktops and mobile devices, heterogeneous processors integrate GPUs onto the same chips as general-purpose cores (CPUs).

These integrated processors have introduced new programmability and performance challenges. Like systems containing discrete GPU cards, early heterogeneous processors divide the CPU and GPU memory spaces, requiring programmers to explicitly manage data movement between the core types. To simplify this programming challenge, emerging heterogeneous processors provide shared memory and cache coherence between CPU and GPU cores. Still, few applications have been developed to use these new capabilities, and it is challenging to predict how programmers might use them.

We aim to enable programmers to write applications that deftly and efficiently coordinate computation across CPU and GPU cores in shared-memory, cache-coherent heterogeneous processors. First, we analyze existing GPU computing applications to identify characteristics that can benefit from the new compute and communication capabilities. Generally, application phases with high data-level parallelism (DLP) are a good fit for GPUs, while phases with low DLP are a good fit for CPU cores, especially if they contain high instruction-level parallelism. Further, many GPU computing applications comprise multiple software pipeline stages with these varied compute and memory demands.

This thesis proposes and evaluates techniques that improve the performance, programmability, and energy efficiency of synchronization and coordinated computation in GPUs and heterogeneous processors. Guided by our broad workload analysis, we develop applications that execute concurrently on both CPU and GPU cores, and we identify two processor design challenges that limit the performance and energy efficiency of these coordinated-work applications.

First, GPU barrier synchronization provides a useful mechanism for programmers to reduce the synchronization granularity of numerous GPU threads. However, GPU barriers can still cause high overhead: threads must wait at barriers for lagging threads. Further, communicating from GPU to CPU cores requires longer-latency memory ordering guarantees than those commonly used in discrete GPU applications. To reduce barrier overhead, we propose a hardware technique, the Transparent Fuzzy Barrier, which dynamically finds and executes instructions while threads wait at barriers, cutting barrier overhead by more than 50%.

Second, GPU applications often use very coarse-grained producer-consumer communication, which reduces performance due to excessive cache spills and energy-expensive off-chip memory accesses. Existing software transformations can reduce cache spilling, but they tend to be complicated, requiring the programmer to reason about producer-to-consumer mappings and the cached data footprint. To ease this cache management burden, we propose a novel hardware technique, called Q-cache, to support concurrent producer-consumer activity. Q-cache measures the cached data footprint and throttles producer or consumer tasks when buffered data might start spilling from the cache. Q-cache eliminates cache spills and reduces contention, significantly improving application performance and energy efficiency.
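The Transparent Fuzzy Barrier itself is a hardware mechanism, but its core idea, overlapping barrier wait time with independent instructions, can be approximated in software. The CUDA sketch below is a minimal illustration under stated assumptions, not the thesis's design: it uses the split arrive/wait barrier from libcu++ (available since CUDA 11, on recent GPU architectures), and the programmer, rather than the hardware, identifies the independent work. The kernel and buffer names are hypothetical.

#include <cuda/barrier>

// Illustrative kernel: phase 1 writes a shared-memory tile, phase 2 reads a
// neighbor's entry, so a block-wide barrier must separate the two phases.
__global__ void fuzzy_phase(const float* in, float* out, int n)
{
    __shared__ float tile[256];            // assumes blockDim.x == 256
    __shared__ cuda::barrier<cuda::thread_scope_block> bar;
    if (threadIdx.x == 0)
        init(&bar, blockDim.x);            // one thread sets the expected arrival count
    __syncthreads();                       // make the initialized barrier visible

    int i = blockIdx.x * blockDim.x + threadIdx.x;

    // Phase 1: produce the value that other threads read after the barrier.
    tile[threadIdx.x] = (i < n) ? in[i] * 2.0f : 0.0f;

    auto token = bar.arrive();             // signal arrival without blocking

    // Independent work that never touches tile[] proceeds while lagging
    // threads catch up; this is the overlap a fuzzy barrier exploits. Here
    // the programmer finds it, whereas the thesis's hardware finds it
    // dynamically and transparently.
    float independent = (i < n) ? in[i] + 1.0f : 0.0f;

    bar.wait(std::move(token));            // block only on the remaining arrivals

    // Phase 2: safe to consume a neighbor's tile entry.
    if (i < n)
        out[i] = independent + tile[(threadIdx.x + 1) % blockDim.x];
}

With a conventional __syncthreads() in place of the arrive/wait pair, the independent addition could begin only after every thread reached the barrier; splitting the barrier lets that work fill the wait time instead.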
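Q-cache's measurement and throttling are likewise performed in hardware, transparently to software. The host-side C++ sketch below is only an analog under assumptions: a bounded queue whose capacity is a fixed cache budget, so the producer blocks (is throttled) whenever buffered data would start spilling from cache. The 4 MiB budget and all names are illustrative, and each chunk is assumed smaller than the budget.

#include <condition_variable>
#include <cstddef>
#include <deque>
#include <mutex>
#include <vector>

constexpr std::size_t kCacheBudgetBytes = 4 << 20;   // assumed share of the last-level cache

class ThrottledQueue {
    std::mutex m_;
    std::condition_variable not_full_, not_empty_;
    std::deque<std::vector<float>> chunks_;
    std::size_t bytes_buffered_ = 0;

public:
    // Producer side: blocks (is throttled) while this chunk would push the
    // buffered footprint past the cache budget.
    void push(std::vector<float> chunk) {
        const std::size_t sz = chunk.size() * sizeof(float);
        std::unique_lock<std::mutex> lk(m_);
        not_full_.wait(lk, [&] { return bytes_buffered_ + sz <= kCacheBudgetBytes; });
        bytes_buffered_ += sz;
        chunks_.push_back(std::move(chunk));
        not_empty_.notify_one();
    }

    // Consumer side: draining a chunk frees budget and un-throttles the producer.
    std::vector<float> pop() {
        std::unique_lock<std::mutex> lk(m_);
        not_empty_.wait(lk, [&] { return !chunks_.empty(); });
        std::vector<float> chunk = std::move(chunks_.front());
        chunks_.pop_front();
        bytes_buffered_ -= chunk.size() * sizeof(float);
        not_full_.notify_one();
        return chunk;
    }
};

The software version trades blocked threads for cache residency; Q-cache makes the same trade in hardware, without requiring the programmer to reason about producer-to-consumer mappings or the cached footprint at all.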