GPU Performance Modeling and Optimization
Ang Li
Technische Universiteit Eindhoven
Technische Universiteit Eindhoven, 2016
@phdthesis{li2016gpu,
   title={GPU performance modeling and optimization},
   author={Li, Ang},
   year={2016},
   school={Technische Universiteit Eindhoven}
}
The last decade has witnessed the blooming emergence of general-purpose GPU computing (GPGPU). With the exponential growth of cores and threads in modern GPU processors, analyzing and optimizing their performance has become a grand challenge. On the modeling side, this thesis proposes an analytic model for throughput-oriented parallel processors. The model is visualizable, traceable and portable, and it gives both application designers and hardware architects a good abstraction for understanding performance and motivating potential optimizations. On the optimization side, we focus in turn on each crucial component of a GPU streaming multiprocessor, in particular the register file, the compute units (SPU, DPU, SFU), the caches (L1, L2, read-only, texture, constant) and the scratchpad memory; we clarify the underlying performance tradeoffs of each and propose effective solutions for handling those tradeoffs within the design space. All the proposed optimization approaches are purely software-based. They are adaptive, transparent, traceable and portable, which leads to achievable and immediate performance gains on various existing GPU devices, especially GPU-integrated high-performance computers (HPC).

The first contribution, in Chapter 3, is a novel visualizable analytic model called the "X-model", designed specifically for today's highly parallel machines. It comprehensively analyzes the interaction between four types of parallelism (TLP, ILP, DLP and MLP) and two types of memory effects (the local on-chip cache effect and the remote off-chip memory effect) in terms of system throughput. The X-model serves as the theoretical basis of this thesis.

The second contribution, in Chapter 4, is an effective auto-tuning framework that resolves the conflict between overall thread concurrency and per-thread register usage on GPUs. We observe that the performance impact of register usage is continuous, while that of concurrency is discrete. Their joint effect forms a special relationship in which a series of critical points can be pre-computed, each denoting the best achievable performance at a given concurrency level. The global optimum, i.e., the optimal number of registers per thread, can therefore be selected quickly and efficiently to deliver the best GPU performance (a hedged sketch of this enumeration follows the abstract).

The third contribution, in Chapter 5, is an adaptive cache-bypassing framework for GPUs. It uses a simple but effective approach to throttle the number of threads that may access the three types of GPU caches (L1, L2 and read-only), thereby avoiding the fierce cache thrashing typical of GPUs and significantly improving the performance of cache-sensitive applications (see the second sketch below).

In Chapter 6, we focus on a crucial GPU component that has long been ignored, the Special Function Units (SFUs), and show their outstanding role in performance acceleration and approximate computing for GPU applications. We exhaustively evaluate the transcendental functions that are accelerated by SFUs and propose a transparent, tractable and portable design framework for SFU-driven approximate acceleration on GPUs. It partitions the active threads into a PE-based slower but accurate path and an SFU-based faster but approximate path, and tunes the partition ratio between the two paths to control the tradeoff between the performance and the accuracy of a GPU kernel. In this way, a fine-grained and almost linear tuning space for the performance-accuracy tradeoff is created (also sketched below).
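To make the critical points of Chapter 4 concrete, here is a minimal host-side sketch (ours, not the thesis's actual tuner) that enumerates them for assumed hardware parameters: a 64K-entry per-SM register file, 256-thread blocks and a 255-register per-thread ISA limit. It deliberately ignores other occupancy limiters such as shared-memory usage, block slots and register-allocation granularity.

// Hypothetical critical-point enumeration: for each discrete concurrency
// level (resident blocks per SM), the largest per-thread register budget
// that still sustains that level is a critical point; only these few
// configurations need to be timed to find the global optimum.
#include <cstdio>

int main() {
    const int regfile_per_sm    = 65536;  // 32-bit registers per SM (assumed)
    const int threads_per_block = 256;    // assumed launch configuration
    const int max_blocks_per_sm = 16;

    for (int n = 1; n <= max_blocks_per_sm; ++n) {
        int regs = regfile_per_sm / (n * threads_per_block);
        if (regs < 16) break;             // below any useful budget
        if (regs > 255) regs = 255;       // per-thread ISA limit
        printf("concurrency %2d blocks/SM -> critical point: %3d regs/thread\n",
               n, regs);
    }
    return 0;
}

A real tuner would recompile the kernel at each printed budget (e.g., with nvcc's -maxrregcount flag), time it, and keep the fastest critical point.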
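Chapter 5's thread-throttling idea can be illustrated with the following hedged CUDA sketch, which lets only the first cached_warps warps of a block load through L1 (the PTX ld.global.ca cache operator) while the remaining warps bypass it (ld.global.cg caches in L2 only). The kernel and threshold parameter are illustrative; the thesis's framework selects such limits adaptively and also covers the L2 and read-only caches.

// Load through the full cache hierarchy (L1 + L2).
__device__ float load_cached(const float* p) {
    float v;
    asm volatile("ld.global.ca.f32 %0, [%1];" : "=f"(v) : "l"(p));
    return v;
}

// Bypass L1: cache in L2 only, so fewer threads thrash L1.
__device__ float load_bypass(const float* p) {
    float v;
    asm volatile("ld.global.cg.f32 %0, [%1];" : "=f"(v) : "l"(p));
    return v;
}

__global__ void throttled_copy(const float* in, float* out, int n,
                               int cached_warps) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= n) return;
    int warp = threadIdx.x >> 5;  // warp index within the block
    float v = (warp < cached_warps) ? load_cached(in + tid)
                                    : load_bypass(in + tid);
    out[tid] = v;
}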
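Chapter 6's dual-path partitioning is equally easy to sketch: warps below an illustrative fast_warps threshold take the SFU-backed hardware intrinsics (__sinf, __expf), while the rest use the accurate software implementations. Sweeping the threshold from 0 to the number of warps per block yields the near-linear performance-accuracy knob described above; the framework's actual partitioning and tuning logic is more elaborate.

// Hypothetical dual-path kernel: per-warp choice between the approximate
// SFU path and the accurate PE path, controlled by the fast_warps ratio.
__global__ void sfu_partition(const float* x, float* y, int n, int fast_warps) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= n) return;
    int warp = threadIdx.x >> 5;
    if (warp < fast_warps)
        y[tid] = __sinf(x[tid]) * __expf(x[tid]);  // fast, approximate (SFU)
    else
        y[tid] = sinf(x[tid]) * expf(x[tid]);      // slower, accurate (PE)
}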
Finally, the last contribution, in Chapter 7, is a novel approach for fine-grained inter-thread synchronization in the shared memory of modern GPUs. By reassembling the low-level assembly-based micro-operations that comprise an atomic instruction, we develop a highly efficient, low-cost locking approach that can be leveraged to set up a fine-grained producer-consumer synchronization channel between cooperating threads in a thread block. Additionally, we show how to implement a dataflow algorithm on GPUs using a real 2D-wavefront application.
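As a rough illustration of such a channel, the standard-CUDA sketch below spins on a volatile shared-memory flag to pass one value from a producer thread to a consumer thread without a block-wide barrier. The thesis instead reassembles a cheaper lock from the SASS micro-operations underlying atomic instructions, so this portable version only conveys the idea; note that the producer and consumer are placed in different warps to avoid intra-warp spin deadlock on pre-Volta SIMT hardware.

__global__ void channel_demo(float* data) {
    __shared__ volatile float buf;   // the single-slot channel
    __shared__ volatile int ready;   // volatile: re-read on every spin
    if (threadIdx.x == 0) ready = 0;
    __syncthreads();                 // one-time setup barrier only

    if (threadIdx.x == 0) {          // producer (warp 0)
        buf = data[blockIdx.x];
        __threadfence_block();       // publish buf before raising the flag
        ready = 1;
    } else if (threadIdx.x == 32) {  // consumer (warp 1)
        while (ready == 0) ;         // fine-grained wait on the flag
        data[blockIdx.x] = buf + 1.0f;
    }
}

A 2D-wavefront kernel would chain such channels between neighboring threads so that each cell starts as soon as its north and west inputs are ready, rather than after a block-wide barrier.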
October 29, 2016 by hgpu