
Synchronization and Coordination in Heterogeneous Processors

Joel Hestness
University of Wisconsin-Madison
Ph.D. thesis, 2016

@phdthesis{hestness2016synchronization,
   title={Synchronization and Coordination in Heterogeneous Processors},
   author={Hestness, Joel},
   year={2016},
   school={University of Wisconsin-Madison}
}


Recent developments in internet connectivity and mobile devices have spurred massive data growth. Users demand rapid data processing from both large-scale systems and energy-constrained personal devices. Concurrently with this data growth, transistor scaling trends have slowed, diminishing processor performance and energy improvements compared to prior generations. To sustain performance trends while staying within energy budgets, emerging systems integrate many processing cores and add accelerators for specialized computation. The graphics processor (GPU) has become a prominent accelerator by offering fast, energy-efficient processing for data-parallel applications. In systems like desktops and mobile devices, heterogeneous processors integrate GPUs onto the same chips as general-purpose cores (CPUs).

These integrated processors have introduced new programmability and performance challenges. Like systems containing discrete GPU cards, early heterogeneous processors divide the CPU and GPU memory spaces, requiring programmers to explicitly manage data movement between the core types. To simplify this programming challenge, emerging heterogeneous processors provide shared memory and cache coherence between CPU and GPU cores. Still, few applications have been developed to use these new capabilities, and it is challenging to predict how programmers might use them.

We aim to enable programmers to write applications that deftly and efficiently coordinate computation across CPU and GPU cores in shared-memory, cache-coherent heterogeneous processors. First, we analyze existing GPU computing applications to identify characteristics that can benefit from the new compute and communication capabilities. Generally, application phases with high data-level parallelism (DLP) are a good fit for GPUs, while phases with low DLP are a good fit for CPU cores, especially if they contain high instruction-level parallelism. Further, many GPU computing applications comprise multiple software pipeline stages with these varied compute and memory demands.

This thesis proposes and evaluates techniques that improve the performance, programmability, and energy efficiency of synchronization and coordinated computation in GPUs and heterogeneous processors. Guided by our broad workload analysis, we develop applications that execute concurrently on both CPU and GPU cores, and we identify two processor design challenges that limit the performance and energy efficiency of these coordinated-work applications.

First, GPU barrier synchronization provides a useful mechanism for programmers to reduce the synchronization granularity of numerous GPU threads. However, GPU barriers can still cause high overhead: threads must wait at barriers for lagging threads. Further, communicating from GPU to CPU cores requires longer-latency memory ordering guarantees than those commonly used in discrete GPU applications. To reduce barrier overhead, we propose a hardware technique, the Transparent Fuzzy Barrier, which dynamically finds and executes instructions while threads wait at barriers, cutting barrier overhead by more than 50%.

Second, GPU applications often use very coarse-grained producer-consumer communication, which reduces performance due to excessive cache spills and energy-expensive off-chip memory accesses. Existing software transformations can reduce cache spilling, but they tend to be complicated, requiring the programmer to reason about producer-to-consumer mappings and the cached data footprint. To ease this cache management burden, we propose a novel hardware technique, called Q-cache, to support concurrent producer-consumer activity. Q-cache measures the cached data footprint and throttles producer or consumer tasks when buffered data might start spilling from the cache. Q-cache eliminates cache spills and reduces contention, significantly improving application performance and energy efficiency.
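The Transparent Fuzzy Barrier itself is a hardware mechanism, but its core idea, overlapping barrier wait time with independent instructions, can be approximated in software. The CUDA sketch below is a minimal illustration under stated assumptions, not the thesis's design: it uses the split arrive/wait barrier from libcu++ (available since CUDA 11, on recent GPU architectures), and the programmer, rather than the hardware, identifies the independent work. The kernel and buffer names are hypothetical.

#include <cuda/barrier>

// Illustrative kernel: phase 1 writes a shared-memory tile, phase 2 reads a
// neighbor's entry, so a block-wide barrier must separate the two phases.
__global__ void fuzzy_phase(const float* in, float* out, int n)
{
    __shared__ float tile[256];            // assumes blockDim.x == 256
    __shared__ cuda::barrier<cuda::thread_scope_block> bar;
    if (threadIdx.x == 0)
        init(&bar, blockDim.x);            // one thread sets the expected arrival count
    __syncthreads();                       // make the initialized barrier visible

    int i = blockIdx.x * blockDim.x + threadIdx.x;

    // Phase 1: produce the value that other threads read after the barrier.
    tile[threadIdx.x] = (i < n) ? in[i] * 2.0f : 0.0f;

    auto token = bar.arrive();             // signal arrival without blocking

    // Independent work that never touches tile[] proceeds while lagging
    // threads catch up; this is the overlap a fuzzy barrier exploits. Here
    // the programmer finds it, whereas the thesis's hardware finds it
    // dynamically and transparently.
    float independent = (i < n) ? in[i] + 1.0f : 0.0f;

    bar.wait(std::move(token));            // block only on the remaining arrivals

    // Phase 2: safe to consume a neighbor's tile entry.
    if (i < n)
        out[i] = independent + tile[(threadIdx.x + 1) % blockDim.x];
}

With a conventional __syncthreads() in place of the arrive/wait pair, the independent addition could begin only after every thread reached the barrier; splitting the barrier lets that work fill the wait time instead.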
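Q-cache's measurement and throttling are likewise performed in hardware, transparently to software. The host-side C++ sketch below is only an analog under assumptions: a bounded queue whose capacity is a fixed cache budget, so the producer blocks (is throttled) whenever buffered data would start spilling from cache. The 4 MiB budget and all names are illustrative, and each chunk is assumed smaller than the budget.

#include <condition_variable>
#include <cstddef>
#include <deque>
#include <mutex>
#include <vector>

constexpr std::size_t kCacheBudgetBytes = 4 << 20;   // assumed share of the last-level cache

class ThrottledQueue {
    std::mutex m_;
    std::condition_variable not_full_, not_empty_;
    std::deque<std::vector<float>> chunks_;
    std::size_t bytes_buffered_ = 0;

public:
    // Producer side: blocks (is throttled) while this chunk would push the
    // buffered footprint past the cache budget.
    void push(std::vector<float> chunk) {
        const std::size_t sz = chunk.size() * sizeof(float);
        std::unique_lock<std::mutex> lk(m_);
        not_full_.wait(lk, [&] { return bytes_buffered_ + sz <= kCacheBudgetBytes; });
        bytes_buffered_ += sz;
        chunks_.push_back(std::move(chunk));
        not_empty_.notify_one();
    }

    // Consumer side: draining a chunk frees budget and un-throttles the producer.
    std::vector<float> pop() {
        std::unique_lock<std::mutex> lk(m_);
        not_empty_.wait(lk, [&] { return !chunks_.empty(); });
        std::vector<float> chunk = std::move(chunks_.front());
        chunks_.pop_front();
        bytes_buffered_ -= chunk.size() * sizeof(float);
        not_full_.notify_one();
        return chunk;
    }
};

The software version trades blocked threads for cache residency; Q-cache makes the same trade in hardware, without requiring the programmer to reason about producer-to-consumer mappings or the cached footprint at all.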