Efficient code generation for hardware accelerators by refining partially specified implementation
DI-ENS – Département d’informatique de l’École normale supérieure
tel-02385303 (26 November 2020)
@phdthesis{beaugnon:tel-02385303,
title={Efficient code generation for hardware accelerators by refining partially specified implementation},
author={Beaugnon, Ulysse},
url={https://tel.archives-ouvertes.fr/tel-02385303},
number={2019PSLEE050},
school={Universit{\'e} Paris sciences et lettres},
year={2019},
month={Jun},
keywords={Constraint programming ; GPU ; Compilation ; Code optimization ; Performance model ; Programmation par contraintes ; GPU ; Mod{\`e}le de performance ; Optimisation de code ; Compilation},
type={Theses},
pdf={https://tel.archives-ouvertes.fr/tel-02385303v2/file/Beaugnon-2019-these.pdf},
hal_id={tel-02385303},
hal_version={v2}
}
Software programmable hardware accelerators, such as Graphics Processing Units (GPUs), are specialized processors designed to perform specific tasks more efficiently than general-purpose processors. They trade off generality against specialized data paths and massive parallelism, providing raw processing power that is orders of magnitude higher than that of contemporary multicore CPUs. Unfortunately, finding an efficient implementation for a function on a hardware accelerator is a complex problem. It requires making careful decisions for mapping computations to the appropriate levels of parallelism and for making data movements across the different memory spaces explicit, in addition to choosing among the many possible thread-local optimizations. While the set of possible optimizations is usually well known, complex interactions between them make it hard to find a global optimum. Indeed, anticipating downstream transformations and deriving profitability information from intermediate compilation steps is a challenge. Transformations may not commute, and some optimization opportunities may only become available after applying so-called "enabling" transformations. Conversely, some transformations may hinder further optimizations. As a result, the production of highly tuned implementations remains a critical challenge to achieve competitive performance.

This dissertation introduces the concept of a candidate to formally define, represent and explore spaces of possible implementations. A candidate is a partially specified implementation with some decisions fixed while others are left open. It represents a whole set of possible implementations of the same function. Candidates expose all potential decisions upfront and ensure they are commutative. Taking a decision always restricts the set of possible implementations. This defines a well-behaved optimization space; in particular, it allows search algorithms looking for the best implementations to make the most performance-impacting decisions first and to have global knowledge of which optimizations may feature in implementations. We provide a framework that automatically generates code to represent and manipulate candidates, from a declarative description of the available choices and their interactions. This description is independent of the function to implement.

We instantiate our concept of candidate to generate efficient code for linear algebra functions on GPUs. This shows our approach is expressive enough to model interacting decisions with a fundamental impact on the structure of the generated code, including compositions of strip-mining, loop fusion, loop interchange, unrolling, vectorization, parallelization, and the orchestration of data movements across the memory hierarchy. We develop a model capable of computing a lower bound on the execution time of any implementation that derives from a candidate. We show that this model provides actionable information, even after taking only a few decisions, and that it enables pruning the implementation space, reducing its size by several orders of magnitude. We propose a simple search algorithm to illustrate our approach. It combines the lower bound performance model and actual evaluation on the hardware with statistical exploration to drive the search towards the most efficient implementations. Our experiments show that it generates code that is competitive with hand-tuned libraries for linear algebra functions on GPUs. They also demonstrate that taking the most important decisions first helps find better implementations faster, thus showing how the concept of candidate empowers search algorithms.
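To make the candidate abstraction and the lower-bound pruning it enables concrete, here is a minimal sketch in Rust. All names, the cost model, and the hardware measurement are illustrative assumptions, not the thesis's actual framework: each open choice carries a domain of alternatives, a decision only ever shrinks a domain (so decisions commute), and a lower bound computed on a partial candidate can discard whole subtrees of the search, branch-and-bound style.

use std::collections::BTreeMap;

#[derive(Clone, Copy, PartialEq, Eq, Debug)]
enum LoopKind { Unroll, Vectorize, Parallelize, Keep }

#[derive(Clone, Debug)]
struct Candidate {
    // Every potential decision is exposed upfront with its current domain.
    choices: BTreeMap<&'static str, Vec<LoopKind>>,
}

impl Candidate {
    fn new(loops: &[&'static str]) -> Self {
        let all = vec![LoopKind::Unroll, LoopKind::Vectorize,
                       LoopKind::Parallelize, LoopKind::Keep];
        Candidate { choices: loops.iter().map(|&l| (l, all.clone())).collect() }
    }

    // Restrict one choice to a single alternative; returns None if that
    // alternative was already ruled out. A real system would also propagate
    // constraints here, shrinking the domains of other choices.
    fn decide(&self, name: &str, kind: LoopKind) -> Option<Candidate> {
        let mut next = self.clone();
        let dom = next.choices.get_mut(name)?;
        if !dom.contains(&kind) { return None; }
        *dom = vec![kind];
        Some(next)
    }

    // A candidate with no open choice left is a concrete implementation.
    fn is_implementation(&self) -> bool {
        self.choices.values().all(|d| d.len() == 1)
    }

    // Stand-in for the performance model: a lower bound on the run time of
    // ANY implementation deriving from this candidate.
    fn lower_bound(&self) -> f64 {
        self.choices.len() as f64 // stub: one optimistic time unit per loop
    }
}

// Stand-in for running a fully specified implementation on the hardware.
fn measure(_c: &Candidate) -> f64 { 4.2 }

// Branch-and-bound: drop a subtree as soon as its lower bound cannot beat
// the best execution time measured so far.
fn explore(c: &Candidate, best: &mut f64) {
    if c.lower_bound() >= *best { return; } // prune
    if c.is_implementation() {
        *best = best.min(measure(c));
        return;
    }
    // Branch on the first open choice; the thesis argues for taking the
    // most performance-impacting decision first instead.
    let (name, dom) = c.choices.iter().find(|(_, d)| d.len() > 1).unwrap();
    let (name, dom) = (*name, dom.clone());
    for kind in dom {
        if let Some(next) = c.decide(name, kind) {
            explore(&next, best);
        }
    }
}

fn main() {
    let mut best = f64::INFINITY;
    explore(&Candidate::new(&["i", "j"]), &mut best);
    println!("best measured time: {best}");
}

In the dissertation itself, the domains, their consistency constraints, and the propagation code are generated from a declarative description rather than written by hand, and the lower bound comes from a dedicated performance model instead of the stub above.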
December 13, 2020 by hgpu