Analysis and Optimization Techniques for Massively Parallel Processors
Princeton University
Princeton University, 2014
@phdthesis{jia2014analysis,
title={Analysis and Optimization Techniques for Massively Parallel Processors},
author={Jia, Wenhao},
school={Princeton University},
year={2014}
}
In response to the ever-growing demand for computing power, heterogeneous parallelism has emerged as a widespread computing paradigm over the past decade. In particular, massively parallel processors such as graphics processing units (GPUs) have become the prevalent throughput computing elements in heterogeneous systems, offering high performance and power efficiency for general-purpose workloads. However, GPUs are difficult to program and to design for, for several reasons. First, GPU architectures are relatively new and still change frequently, making it challenging for GPU programmers and designers to determine which architectural resources have the highest performance or power impact. Second, a lack of virtualization in GPUs often causes strong and unexpected resource interactions. It also forces software developers to program for specific hardware details such as thread counts and scratchpad sizes, imposing programmability and portability hurdles. Third, though some GPU components such as general-purpose caches have been introduced to improve performance and programmability, they are not well tailored to GPU characteristics such as favoring throughput over latency. As a result, these conventionally designed components suffer from resource contention caused by high thread parallelism and do not achieve their full performance and programmability potential. To overcome these challenges, this thesis proposes statistical analysis techniques and software and hardware optimizations that improve the performance, power efficiency, and programmability of GPUs. These proposals make it easier for programmers and designers to produce optimized GPU software and hardware designs. The first part of the thesis describes how statistical analysis can help users explore a GPU software or hardware design space with performance or power as the metric of interest. In particular, two fully automated tools, Stargazer and Starchart, are developed and presented.
Stargazer is based on linear regression. It identifies globally important GPU design parameters and their interactions, revealing which factors have the highest performance or power impact. Starchart improves on Stargazer by using recursive partitioning to identify not only globally but also locally influential design parameters. More importantly, Starchart can be used to solve design problems formulated as a series of design decisions. These tools ease design tuning while reducing design exploration time by a factor of 300-3000 compared to exhaustive approaches. Then, inspired by two Starchart case studies, the second part of the thesis focuses on two key GPU software design decisions: cache configuration and thread block size selection. Compile-time algorithms are proposed to make these decisions automatically, improve program performance, and ease GPU programming. The first algorithm analyzes a program's memory access patterns and turns caching on or off accordingly for each instruction. This improves the performance benefit of caching from 5.8% to 18%. The second algorithm estimates the number of threads sufficient to saturate either memory bandwidth or compute throughput. Running programs with the estimated thread counts, instead of the hardware maximum, reduces GPU core resource usage by 27-62% while improving performance by 5-10%. Finally, to show how well-designed hardware can transparently improve GPU performance and programmability, the third part of the thesis proposes and evaluates the memory request prioritization buffer (MRPB). MRPB automates GPU cache management, reduces cache contention, and increases cache throughput. It does so by using request reordering to reduce cache thrashing and by using cache bypassing to reduce resource stalls.
In addition to improving performance by 1.3-2.7 times and easing GPU programming, MRPB highlights the value of tailoring conventionally designed GPU hardware components to the massively parallel nature of GPU workloads. In summary, using GPUs as an example, the high-level statistical tools and the more focused software and hardware studies presented in this thesis demonstrate how to use automation techniques to effectively improve the performance, power efficiency, and programmability of emerging heterogeneous computing platforms.
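As a rough illustration of the recursive partitioning idea mentioned in the abstract, the following Python sketch grows a small regression tree over a toy design space. It is not Starchart's implementation: the parameter names (`block_size`, `use_cache`), the run times, the `min_gain` stopping rule, and all function names are hypothetical and chosen only to show how a parameter can be influential within one region of the design space without being influential globally.

```python
from statistics import mean

# Hypothetical measurements, invented for illustration only:
# each design point maps two parameters to a run time in ms.
SAMPLES = [
    {"block_size": 64, "use_cache": 0, "time": 10.0},
    {"block_size": 64, "use_cache": 1, "time": 6.0},
    {"block_size": 128, "use_cache": 0, "time": 10.2},
    {"block_size": 128, "use_cache": 1, "time": 6.2},
    {"block_size": 256, "use_cache": 0, "time": 4.0},
    {"block_size": 256, "use_cache": 1, "time": 4.1},
    {"block_size": 512, "use_cache": 0, "time": 3.9},
    {"block_size": 512, "use_cache": 1, "time": 4.0},
]
PARAMS = ["block_size", "use_cache"]


def sse(values):
    """Sum of squared errors around the mean (0 for an empty list)."""
    if not values:
        return 0.0
    m = mean(values)
    return sum((v - m) ** 2 for v in values)


def best_split(samples, params):
    """Return (param, threshold, sse_after) for the lowest-error binary split,
    or None if no parameter takes more than one value in these samples."""
    best = None
    for p in params:
        levels = sorted({s[p] for s in samples})
        for lo, hi in zip(levels, levels[1:]):
            t = (lo + hi) / 2  # midpoint between adjacent observed levels
            left = [s["time"] for s in samples if s[p] <= t]
            right = [s["time"] for s in samples if s[p] > t]
            cost = sse(left) + sse(right)
            if best is None or cost < best[2]:
                best = (p, t, cost)
    return best


def build_tree(samples, params, min_gain=0.5):
    """Recursively partition the samples; stop when the best split
    reduces the squared error by no more than min_gain (an assumed rule)."""
    times = [s["time"] for s in samples]
    split = best_split(samples, params)
    if split is None or sse(times) - split[2] <= min_gain:
        return {"leaf": True, "mean": mean(times), "n": len(samples)}
    p, t, _ = split
    return {
        "leaf": False,
        "param": p,
        "threshold": t,
        "left": build_tree([s for s in samples if s[p] <= t], params, min_gain),
        "right": build_tree([s for s in samples if s[p] > t], params, min_gain),
    }


def show(node, indent=""):
    """Print the tree as nested split conditions."""
    if node["leaf"]:
        print(f"{indent}mean time {node['mean']:.2f} ms over {node['n']} points")
        return
    print(f"{indent}{node['param']} <= {node['threshold']}")
    show(node["left"], indent + "  ")
    print(f"{indent}{node['param']} > {node['threshold']}")
    show(node["right"], indent + "  ")


tree = build_tree(SAMPLES, PARAMS)
show(tree)
```

On this toy data the root split falls on `block_size`, and `use_cache` is split on only inside the small-block branch; the large-block branch stays a single leaf. This mirrors the distinction the abstract draws between globally and locally influential parameters, and reading the splits from the root down gives one way to view a design problem as a series of decisions.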
December 15, 2014 by hgpu