Dynamic Orchestration of Massively Data Parallel Execution
Mehrzad Samadi
Ph.D. thesis, University of Michigan, 2014
@phdthesis{samadi2014dynamic,
   title={Dynamic Orchestration of Massively Data Parallel Execution},
   author={Samadi, Mehrzad},
   year={2014},
   school={University of Michigan}
}
Graphics processing units (GPUs) are specialized hardware accelerators capable of rendering graphics much faster than conventional general-purpose processors, and they are widely used in personal computers, tablets, mobile phones, and game consoles. Thanks to their highly parallel architecture, modern GPUs are not only efficient at manipulating computer graphics but are also more effective than CPUs for algorithms in which large blocks of data can be processed in parallel. While GPUs provide low-cost and efficient platforms for accelerating massively parallel applications, tedious performance tuning is required to maximize application execution efficiency. Achieving high performance requires programmers to manually manage the amount of on-chip memory used per thread, the total number of threads per multiprocessor, the pattern of off-chip memory accesses, and so on. In addition to this complex programming model, there is a lack of performance portability across systems with different runtime properties. Programmers usually make assumptions about runtime properties when they write code and optimize it based on those assumptions; if any of these properties changes during execution, the optimized code performs poorly. Alleviating these limitations requires several implementations of the application, each tuned for different runtime properties, yet it is not practical for the programmer to write several versions of the same code, each optimized for an individual runtime condition. In this thesis, we propose a static and dynamic compiler framework that takes the burden of fine-tuning different implementations of the same code off the programmer. This framework enables the programmer to write the program once and lets a static compiler generate different versions of a data parallel application with several tuning parameters. The runtime system then selects the best version and fine-tunes its parameters based on runtime properties such as device configuration, input size, dependences, and data values.
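To make the multi-version idea concrete, a minimal CUDA sketch follows. The kernels, the thread-coarsening factor of four, and the occupancy-based selection heuristic are illustrative assumptions for this post, not the code or selection policy generated by the thesis framework, which produces the versions and chooses among them automatically rather than relying on hand-written variants.

// Illustrative sketch: two hand-written variants of a saxpy kernel that differ
// in thread-coarsening factor, plus a simple host-side heuristic that picks one
// at launch time from the input size and device properties.
#include <cuda_runtime.h>

// Version 1: one element per thread.
__global__ void saxpy_v1(int n, float a, const float* x, float* y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * x[i] + y[i];
}

// Version 2: each thread processes four elements (coarsened), trading fewer
// active threads for less per-thread scheduling overhead on large inputs.
__global__ void saxpy_v4(int n, float a, const float* x, float* y) {
    int base = (blockIdx.x * blockDim.x + threadIdx.x) * 4;
    for (int k = 0; k < 4; ++k) {
        int i = base + k;
        if (i < n) y[i] = a * x[i] + y[i];
    }
}

// Hypothetical selection heuristic: use the coarsened version only when the
// input is large enough to keep every multiprocessor fully occupied.
void launch_saxpy(int n, float a, const float* x, float* y) {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    const int threads = 256;
    long long resident =
        (long long)prop.multiProcessorCount * prop.maxThreadsPerMultiProcessor;
    if (n >= 4 * resident) {
        int blocks = (n + threads * 4 - 1) / (threads * 4);
        saxpy_v4<<<blocks, threads>>>(n, a, x, y);
    } else {
        int blocks = (n + threads - 1) / threads;
        saxpy_v1<<<blocks, threads>>>(n, a, x, y);
    }
}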
May 10, 2014 by hgpu