high performance computing on graphics processing units: hgpu.org


Understanding Latency Hiding on GPUs

Vasily Volkov

Electrical Engineering and Computer Sciences, University of California at Berkeley

University of California at Berkeley, Technical Report No. UCB/EECS-2016-143, 2016

@phdthesis{Volkov:EECS-2016-143,
  author={Volkov, Vasily},
  title={Understanding Latency Hiding on GPUs},
  school={EECS Department, University of California, Berkeley},
  year={2016},
  month={Aug},
  url={http://www2.eecs.berkeley.edu/Pubs/TechRpts/2016/EECS-2016-143.html},
  number={UCB/EECS-2016-143}
}


Modern commodity processors such as GPUs may execute up to about a thousand physical threads per chip to better utilize their numerous execution units and hide execution latencies. Understanding this novel capability, however, is hindered by the overall complexity of the hardware and of typical workloads. In this dissertation, we suggest a better way to understand modern multithreaded performance by considering a family of synthetic workloads that use the same key hardware capabilities (memory access, arithmetic operations, and multithreading) but are otherwise as simple as possible. One of our surprising findings is that prior performance models for GPUs fail on these workloads: they mispredict observed throughputs by factors of up to 1.7. We analyze these prior approaches, identify a number of common pitfalls, and discuss the related subtleties in understanding concurrency and Little’s Law. We also further this understanding by considering a few basic questions, such as how different latencies compare with each other in terms of latency hiding, and how the number of threads needed to hide latency depends on basic parameters of the executed code, such as arithmetic intensity. Finally, we outline a performance modeling framework that is free from the limitations we identified. As a tangential development, we present a number of novel experimental studies, such as how mean memory latency depends on memory throughput, how latencies of individual memory accesses are distributed around the mean, and how occupancy varies during execution.
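As a rough illustration of the Little's Law relationship the abstract refers to, the sketch below computes how many operations must be in flight to sustain a given throughput at a given latency, and how many threads that implies. It is a minimal sketch and not the dissertation's model: the function names, the `in_flight_per_thread` parameter, and the example numbers (400 cycles, 0.5 accesses per cycle) are all hypothetical and chosen only to show the arithmetic.

```python
import math


def required_concurrency(latency_cycles: float, throughput_per_cycle: float) -> float:
    """Little's Law: mean operations in flight = mean latency x mean throughput."""
    return latency_cycles * throughput_per_cycle


def threads_needed(latency_cycles: float,
                   throughput_per_cycle: float,
                   in_flight_per_thread: float = 1.0) -> int:
    """Threads needed if each thread keeps `in_flight_per_thread` operations outstanding.

    `in_flight_per_thread` is a hypothetical stand-in for per-thread parallelism
    (for example, independent memory accesses issued back to back).
    """
    concurrency = required_concurrency(latency_cycles, throughput_per_cycle)
    return math.ceil(concurrency / in_flight_per_thread)


# Hypothetical numbers, not measurements from the thesis.
in_flight = required_concurrency(latency_cycles=400, throughput_per_cycle=0.5)
threads = threads_needed(400, 0.5, in_flight_per_thread=2)
print(in_flight, threads)
```

Under these assumed numbers, sustaining the target throughput takes 200 operations in flight, so doubling the work each thread keeps outstanding halves the number of threads required. The abstract notes that applying this law to real GPUs involves subtleties that a one-line formula like this does not capture.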

Tags: Computer science, CUDA, nVidia, nVidia GeForce GTX 480, Performance, Thesis

October 12, 2016 by hgpu

