high performance computing on graphics processing units: hgpu.org


Analysis and Optimization Techniques for Massively Parallel Processors

Wenhao Jia

Princeton University

Princeton University, 2014

@article{jia2014analysis,
  title={Analysis and Optimization Techniques for Massively Parallel Processors},
  author={Jia, Wenhao},
  year={2014}
}


In response to the ever-growing demand for computing power, heterogeneous parallelism has emerged as a widespread computing paradigm over the past decade. In particular, massively parallel processors such as graphics processing units (GPUs) have become the prevalent throughput computing elements in heterogeneous systems, offering high performance and power efficiency for general-purpose workloads. However, GPUs are difficult to program and design for several reasons. First, GPUs are relatively new and still receive frequent design changes, making it challenging for GPU programmers and designers to determine which architectural resources have the highest performance or power impact. Second, a lack of virtualization in GPUs often causes strong and unexpected resource interactions. It also forces software developers to program for specific hardware details such as thread counts and scratchpad sizes, imposing programmability and portability hurdles. Third, though some GPU components such as general-purpose caches have been introduced to improve performance and programmability, they are not well tailored to GPU characteristics such as favoring throughput over latency. As a result, these conventionally designed components suffer from resource contention caused by high thread parallelism and do not achieve their full performance and programmability potential.

To overcome these challenges, this thesis proposes statistical analysis techniques and software and hardware optimizations that improve the performance, power efficiency, and programmability of GPUs. These proposals make it easier for programmers and designers to produce optimized GPU software and hardware designs. The first part of the thesis describes how statistical analysis can help users explore a GPU software or hardware design space with performance or power as the metric of interest. In particular, two fully automated tools, Stargazer and Starchart, are developed and presented.
Stargazer is based on linear regression. It identifies globally important GPU design parameters and their interactions, revealing which factors have the highest performance or power impact. Starchart improves on Stargazer by using recursive partitioning to identify not only globally but also locally influential design parameters. More importantly, Starchart can be used to solve design problems formulated as a series of design decisions. These tools ease design tuning while reducing design exploration time by a factor of 300-3000 compared to exhaustive approaches.

Then, inspired by two Starchart case studies, the second part of the thesis focuses on two key GPU software design decisions: cache configuration and thread block size selection. Compile-time algorithms are proposed to make these decisions automatically, improve program performance, and ease GPU programming. The first algorithm analyzes a program’s memory access patterns and turns caching on or off accordingly for each instruction. This raises the performance benefit of caching from 5.8% to 18%. The second algorithm estimates the number of threads sufficient to saturate either memory bandwidth or compute throughput. Running programs with the estimated thread counts, instead of the hardware maximum, reduces GPU core resource usage by 27-62% while improving performance by 5-10%.

Finally, to show how well-designed hardware can transparently improve GPU performance and programmability, the third part of the thesis proposes and evaluates the memory request prioritization buffer (MRPB). MRPB automates GPU cache management, reduces cache contention, and increases cache throughput. It does so by using request reordering to reduce cache thrashing and by using cache bypassing to reduce resource stalls.
In addition to improving performance by a factor of 1.3-2.7 and easing GPU programming, MRPB highlights the value of tailoring conventionally designed GPU hardware components to the massively parallel nature of GPU workloads. In summary, using GPUs as an example, the high-level statistical tools and the more focused software and hardware studies presented in this thesis demonstrate how automation techniques can improve the performance, power efficiency, and programmability of emerging heterogeneous computing platforms.
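The abstract does not spell out how Starchart works internally, so the following is only a generic sketch of recursive partitioning (a small regression tree) applied to a design space, not the thesis's implementation. The parameter names (`cache_on`, `threads_per_block`, `unroll`) and the `synthetic_runtime` model are hypothetical and exist purely to give the tree something to partition. The sketch illustrates the distinction drawn above: the root split exposes a globally influential parameter, while splits deeper in the tree expose parameters that matter only within one region of the design space.

```python
import itertools
from statistics import mean


def sse(values):
    """Sum of squared errors of the values around their mean."""
    if not values:
        return 0.0
    m = mean(values)
    return sum((v - m) ** 2 for v in values)


def best_split(samples, params):
    """Return the (param, threshold, cost) split with the lowest total SSE."""
    best = None
    for p in params:
        levels = sorted({cfg[p] for cfg, _ in samples})
        # Candidate thresholds sit midway between adjacent observed levels.
        for lo, hi in zip(levels, levels[1:]):
            t = (lo + hi) / 2
            left = [y for cfg, y in samples if cfg[p] <= t]
            right = [y for cfg, y in samples if cfg[p] > t]
            cost = sse(left) + sse(right)
            if best is None or cost < best[2]:
                best = (p, t, cost)
    return best


def build_tree(samples, params, depth=0, max_depth=2):
    """Recursively partition samples; stop at max_depth or when no split helps."""
    ys = [y for _, y in samples]
    split = best_split(samples, params) if depth < max_depth else None
    if split is None or split[2] >= sse(ys):
        return {"leaf": True, "mean": mean(ys), "n": len(ys)}
    p, t, _ = split
    left = [(cfg, y) for cfg, y in samples if cfg[p] <= t]
    right = [(cfg, y) for cfg, y in samples if cfg[p] > t]
    return {
        "leaf": False,
        "param": p,
        "threshold": t,
        "left": build_tree(left, params, depth + 1, max_depth),
        "right": build_tree(right, params, depth + 1, max_depth),
    }


def synthetic_runtime(cfg):
    """Hypothetical runtime model (assumption, not measured data)."""
    t = 100.0 - cfg["unroll"]
    if cfg["cache_on"]:
        t -= 40.0
        # Thread count only matters when the cache is enabled.
        if cfg["threads_per_block"] >= 256:
            t += 15.0
    return t


# Hypothetical design space: every combination of the three parameters.
space = {
    "cache_on": [0, 1],
    "threads_per_block": [64, 128, 256, 512],
    "unroll": [1, 2, 4],
}
names = list(space)
configs = [dict(zip(names, combo)) for combo in itertools.product(*space.values())]
samples = [(cfg, synthetic_runtime(cfg)) for cfg in configs]

tree = build_tree(samples, names)
print("root split:", tree["param"], "at", tree["threshold"])
print("split within cache-enabled region:", tree["right"]["param"])
```

On this synthetic data the root split falls on `cache_on`, and `threads_per_block` appears only inside the cache-enabled branch, which is the kind of locally influential parameter a single global ranking would understate. A real study would replace `synthetic_runtime` with measured or simulated results for a sampled subset of configurations.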

Tags: Algorithms, ATI, ATI Radeon HD 7970, Computer science, CUDA, Heterogeneous systems, nVidia, OpenCL, PTX, Tesla C2070, Thesis, Virtualization

December 15, 2014 by hgpu
