Reducing GPU Offload Latency via Fine-Grained CPU-GPU Synchronization

Daniel Lustig, Margaret Martonosi

Princeton University

19th IEEE International Symposium on High Performance Computer Architecture (HPCA), 2013

GPUs are seeing increasingly widespread use for general-purpose computation due to their excellent performance on highly parallel, throughput-oriented applications. For many workloads, however, the performance benefits of offloading are hindered by the large and unpredictable overheads of launching GPU kernels and of transferring data between CPU and GPU. This paper proposes and evaluates hardware and software support for reducing these overheads and improving the predictability of data latency when offloading computation to GPUs. We first characterize program execution using real-system measurements to show the degree to which kernel launch and data transfer are major sources of overhead. We then propose a scheme of full/empty bits that track when regions of data have been transferred. This dependency tracking is fast, efficient, and fine-grained, mitigating much of the latency uncertainty and cost of offloading in current systems. On top of these full/empty bits, we build APIs that allow for early kernel launch and proactive data returns. These techniques enable faster kernel completion, while correctness remains guaranteed by the full/empty bits. Taken together, they can both greatly improve GPU application performance and broaden the space of applications for which GPUs are beneficial. In particular, across a set of seven diverse benchmarks that make use of our support, the mean improvement in runtime is 26%.
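The full/empty idea in the abstract can be illustrated with a small software analogy. The sketch below is not the paper's hardware mechanism or API: it is a CPU-only simulation in which one `threading.Event` per chunk stands in for a full/empty bit, a producer thread stands in for the CPU-to-GPU transfer, and a consumer thread stands in for a kernel that is launched before any data has arrived. The names `FullEmptyBuffer`, `run_offload`, `write_chunk`, and `read_chunk` are hypothetical and chosen only for this illustration.

```python
import threading


class FullEmptyBuffer:
    """Buffer split into fixed-size chunks, each guarded by a full/empty flag.

    Assumption: a threading.Event per chunk models one full/empty bit.
    """

    def __init__(self, num_chunks, chunk_size):
        self.chunk_size = chunk_size
        self.data = [0] * (num_chunks * chunk_size)
        self.full = [threading.Event() for _ in range(num_chunks)]

    def write_chunk(self, idx, values):
        """Copy one chunk in, then mark it full (data first, flag second)."""
        start = idx * self.chunk_size
        self.data[start:start + self.chunk_size] = values
        self.full[idx].set()

    def read_chunk(self, idx):
        """Block until the chunk is marked full, then return its contents."""
        self.full[idx].wait()
        start = idx * self.chunk_size
        return self.data[start:start + self.chunk_size]


def run_offload(source, chunk_size):
    """Sum each chunk of `source` while the transfer is still in flight.

    Assumes len(source) is a multiple of chunk_size.
    """
    num_chunks = len(source) // chunk_size
    buf = FullEmptyBuffer(num_chunks, chunk_size)
    partial_sums = [None] * num_chunks

    def transfer():
        # Stand-in for the data transfer: chunks arrive one at a time.
        for i in range(num_chunks):
            lo = i * chunk_size
            buf.write_chunk(i, source[lo:lo + chunk_size])

    def kernel():
        # Stand-in for an early-launched kernel: it waits per chunk,
        # not for the whole transfer to finish.
        for i in range(num_chunks):
            partial_sums[i] = sum(buf.read_chunk(i))

    consumer = threading.Thread(target=kernel)
    producer = threading.Thread(target=transfer)
    consumer.start()  # "kernel" starts before any data has been written
    producer.start()
    producer.join()
    consumer.join()
    return partial_sums
```

Calling `run_offload(list(range(8)), 2)` processes four chunks of two elements each. The consumer never reads a chunk before its flag is set, which mirrors the abstract's claim that correctness is preserved by the full/empty bits even though computation starts early.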

Tags: Computer science, CUDA, nVidia, nVidia GeForce GTX 580, Performance

January 23, 2013 by hgpu
