
Synchronization and Coordination in Heterogeneous Processors

Joel Hestness
University of Wisconsin-Madison
University of Wisconsin-Madison, 2016

@phdthesis{hestness2016synchronization,
  title  = {Synchronization and Coordination in Heterogeneous Processors},
  author = {Hestness, Joel},
  year   = {2016},
  school = {University of Wisconsin--Madison}
}


Recent developments in internet connectivity and mobile devices have spurred massive data growth. Users demand rapid data processing from both large-scale systems and energy-constrained personal devices. Concurrently with this data growth, transistor scaling trends have slowed, diminishing processor performance and energy improvements compared to prior generations. To sustain performance trends while staying within energy budgets, emerging systems are integrating many processing cores and adding accelerators for specialized computation. The graphics processor (GPU) has become a prominent accelerator by offering fast, energy-efficient processing for data-parallel applications. For systems like desktops and mobile devices, heterogeneous processors have integrated GPUs onto the same chips as general-purpose cores (CPUs). These integrated processors have introduced new programmability and performance challenges. Similar to systems containing discrete GPU cards, early heterogeneous processors divide the CPU and GPU memory spaces, requiring programmers to explicitly manage data movement between the core types. To simplify this programming challenge, emerging heterogeneous processors provide shared memory and cache coherence between CPU and GPU cores. Still, few applications have been developed to use these new capabilities, and it is challenging to predict how programmers might use them.

We aim to enable programmers to write applications that deftly and efficiently coordinate computation across CPU and GPU cores in shared-memory, cache-coherent heterogeneous processors. First, we analyze existing GPU computing applications to identify characteristics that can benefit from the new compute and communication capabilities. Generally, application phases with high data-level parallelism (DLP) are a good fit for GPUs, while phases with low DLP are a good fit for CPU cores, especially if they contain high instruction-level parallelism.
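The phase structure described above can be sketched in plain Python (a CPU-only illustration; the `pipeline` function and its phases are hypothetical, not from the thesis): the high-DLP phase performs independent per-element work of the kind that maps naturally onto data-parallel hardware, while the low-DLP phase is a reduction with a loop-carried dependence, a better fit for a latency-optimized CPU core.

```python
from concurrent.futures import ThreadPoolExecutor

def pipeline(data):
    # High-DLP phase: independent element-wise work, the kind of
    # computation a GPU executes efficiently (emulated here with a
    # thread pool for illustration only).
    with ThreadPoolExecutor() as pool:
        squared = list(pool.map(lambda x: x * x, data))

    # Low-DLP phase: a serial reduction with a loop-carried
    # dependence, better suited to a CPU core.
    total = 0
    for v in squared:
        total = total + v
    return total
```

In a real coordinated-work application, the two phases would run on the core type each is suited to, communicating through the shared, coherent memory the thesis targets.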
Further, many GPU computing applications involve multiple software pipeline stages with these varied compute and memory demands.

This thesis proposes and evaluates techniques that improve the performance, programmability, and energy efficiency of synchronization and coordinated computation in GPUs and heterogeneous processors. Guided by our broad workload analysis, we develop applications that execute concurrently on both CPU and GPU cores, and we identify two processor design challenges that limit the performance and energy efficiency of these coordinated-work applications.

First, GPU barrier synchronization provides a useful mechanism for programmers to reduce synchronization granularity across numerous GPU threads. However, GPU barriers can still cause high overhead: threads must wait at barriers for lagging threads. Further, communication from GPU to CPU cores requires longer-latency memory ordering guarantees than those commonly used in discrete GPU applications. To reduce barrier overhead, we propose a hardware technique, the Transparent Fuzzy Barrier, which dynamically finds and executes instructions while threads wait at barriers, cutting barrier overhead by more than 50%.

Second, GPU applications often use very coarse-grained producer-consumer communication, which reduces performance due to excessive cache spills and energy-expensive off-chip memory accesses. Existing software transformations can reduce cache spilling, but they tend to be complicated, requiring the programmer to reason about producer-to-consumer mappings and the cached data footprint. To ease the cache management burden, we propose a novel hardware technique, called Q-cache, to support concurrent producer-consumer activity. Q-cache measures the cached data footprint and throttles producer or consumer tasks when buffered data might start spilling from the cache. Q-cache eliminates cache spills and reduces contention, significantly improving application performance and energy efficiency.
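Q-cache itself is a hardware mechanism, but its throttling policy resembles a software bounded buffer: the producer blocks once the buffered footprint reaches a capacity limit, so data is consumed before it can spill. A minimal CPU-side sketch of that idea, assuming a hypothetical `run_pipeline` helper and an arbitrary `BUFFER_SLOTS` capacity (neither is from the thesis):

```python
import queue
import threading

BUFFER_SLOTS = 4  # assumed capacity, standing in for the cache footprint limit

def run_pipeline(items):
    # Bounded queue: put() blocks when BUFFER_SLOTS entries are buffered,
    # throttling the producer the way Q-cache throttles tasks in hardware.
    buf = queue.Queue(maxsize=BUFFER_SLOTS)
    results = []

    def producer():
        for x in items:
            buf.put(x * 2)   # blocks if the buffer is full
        buf.put(None)        # sentinel: no more work

    def consumer():
        while True:
            x = buf.get()
            if x is None:
                break
            results.append(x + 1)

    threads = [threading.Thread(target=producer),
               threading.Thread(target=consumer)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

The hardware scheme removes the need for the programmer to pick the capacity or restructure the code: Q-cache measures the cached footprint and applies the throttling transparently.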

* * *


HGPU group © 2010-2017 hgpu.org

All rights belong to the respective authors
