Program Optimization Strategies for Data-Parallel Many-Core Processors
S. Ryoo
University of Illinois at Urbana-Champaign
University of Illinois at Urbana-Champaign, 2008
@phdthesis{ryoo2008program,
  title={Program optimization strategies for data-parallel many-core processors},
  author={Ryoo, S.},
  school={University of Illinois at Urbana-Champaign},
  year={2008}
}
Program optimization for highly parallel systems has historically been considered an art, with experts doing much of the performance tuning by hand. With the introduction of inexpensive, single-chip, massively parallel platforms, more developers will be creating highly data-parallel applications for these platforms while lacking the substantial experience and knowledge needed to maximize application performance. In addition, hand-optimization even by motivated and informed developers takes a significant amount of time and generally still leaves double-digit percentages of hardware performance unrealized. This creates a need for structured and automatable optimization techniques capable of finding a near-optimal program configuration for this new class of architecture. My work discusses strategies for optimizing programs on a highly data-parallel architecture with fine-grained sharing of resources. I first investigate useful strategies for optimizing a suite of applications. I then introduce program optimization carving, an approach that discovers high-performance application configurations for data-parallel, many-core architectures. Instead of applying a particular phase ordering of optimizations, it starts with an optimization space of major transformations and then reduces that space by examining the static code and pruning configurations that do not maximize desirable qualities in isolation or in combination. Careful selection of pruning criteria for applications running on the NVIDIA GeForce 8800 GTX reduces the optimization space by as much as 98% while finding configurations within 1% of the best performance. Random sampling, in contrast, can require nearly five times as many configurations to find performance within 10% of the best. I also examine the technique’s effectiveness under varying pruning criteria.
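The abstract describes carving as pruning an explicit space of transformation configurations using static criteria before any configuration is run. The sketch below is not taken from the thesis: the transformation parameters (tile size, unroll factor, shared-memory prefetching) and the two static metrics are hypothetical stand-ins, and Pareto-optimal filtering is used here as one plausible reading of pruning configurations that "do not maximize desirable qualities in isolation or combination."

from itertools import product

# Hypothetical optimization space: each configuration is a choice of
# tiling factor, unroll factor, and whether to prefetch into shared memory.
TILE_SIZES = [4, 8, 16]
UNROLL_FACTORS = [1, 2, 4]
PREFETCH = [False, True]

def static_metrics(tile, unroll, prefetch):
    """Stand-in static estimates of two desirable qualities.

    Real carving derives such metrics from the compiled kernel; the
    formulas here are placeholders for illustration only.
    """
    utilization = tile * (2 if prefetch else 1)   # hypothetical metric
    efficiency = unroll / (1 + 0.1 * tile)        # hypothetical metric
    return utilization, efficiency

def dominated(a, b):
    """True if configuration b is at least as good as a in every metric
    and strictly better in at least one."""
    return all(y >= x for x, y in zip(a, b)) and any(y > x for x, y in zip(a, b))

# Enumerate the full space, then carve it down to the Pareto-optimal
# configurations; only these survivors would be compiled and timed.
space = list(product(TILE_SIZES, UNROLL_FACTORS, PREFETCH))
scored = {cfg: static_metrics(*cfg) for cfg in space}
carved = [cfg for cfg in space
          if not any(dominated(scored[cfg], scored[other])
                     for other in space if other != cfg)]

print(f"full space: {len(space)} configurations, carved: {len(carved)}")

On this toy space the carve keeps 3 of the 18 configurations; only those survivors would then be compiled and timed, which is the sense in which carving avoids exhaustive empirical search.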
January 24, 2011 by hgpu