Source-to-source transformations for irregular and multithreaded code optimization
University of Versailles, Saint-Quentin-en-Yvelines
University of Versailles, 2012
@phdthesis{jaeger2012source,
title={Source-to-source transformations for irregular and multithreaded code optimization},
author={Jaeger, J.},
school={University of Versailles Saint-Quentin-en-Yvelines},
year={2012}
}
Source-to-source optimization is an efficient method for generating a high-performance program from a basic implementation, and it addresses two main challenges: irregular codes and heterogeneous implementations. In the last decade, general-purpose CPUs moved towards multi-core architectures, and the end of the increase in processor frequency marked a turning point: the best performance of a single chip is now achieved only by efficiently exploiting the parallelism inside the chip. The optimization process is therefore key to obtaining continuously increasing speed-ups on the newest architectures. Parallelization on a single chip brings new problems to consider, with the integration of different cache levels on the chip and several threads running simultaneously and accessing shared resources. Such coexistence implies that the different levels of parallelism (vector, instruction-level parallelism, threads, memory accesses) interact more than ever, and optimization for high performance should consider all levels. A second paradigm shift occurs with the generalization of hardware accelerators and heterogeneous machines, which requires expertise in every architecture composing the heterogeneous system in order to generate efficient code for the target. The growing complexity of hardware architectures poses many challenges in the HPC area, especially for irregular codes, whether irregular in data access or in control flow, since generating an efficient version of such code even on a single core remains difficult. In this dissertation, we provide methods to generate efficient code from an initial implementation for irregular programs and heterogeneous parallelizations. The remainder of Chapter 1 presents the evolution of machine architecture from the first scalar computers to today's multi-core and heterogeneous systems, the most widely used source-to-source optimizations and loop transformations, and an insight into the hardware behaviour of vectorized computations.
Chapter 2 describes our CPC framework, which extracts codelets from irregular codes, optimizes these codelets independently of the overall program, and then predicts the overall speed-up of the whole system. In Chapter 3, we develop methods, of varying complexity and memory impact, to address alignment issues caused by vectorization or bank conflicts. We apply our methods to symptomatic stencil cases and provide, along with these methods, an algorithm that uses them to generate heterogeneous codes for CPUs and GPUs. Parallelization techniques are discussed in Chapter 4 through the presentation of two works, one addressing the generation of parallelized codelets, the other scheduling sequential tasks on a heterogeneous system. To conclude, Chapter 5 recalls the contributions of the dissertation and discusses possible improvements and future developments of the presented work.
July 24, 2012 by hgpu