Static and Dynamic Analyses for Efficient GPU Execution
Philip Munksgaard
University of Copenhagen
University of Copenhagen, 2023
@phdthesis{munksgaard2023static,
  title={Static and Dynamic Analyses for Efficient GPU Execution},
  author={Munksgaard, Philip},
  year={2023},
  school={School of The Faculty of Science, University of Copenhagen}
}
In this thesis we describe a host of static and dynamic techniques for the efficient execution of GPU programs. The most significant is the array short-circuiting technique, which automatically rewrites array updates and concatenations to happen in place when deemed safe. The optimization is based on FunMem, an intermediate representation with non-semantic memory information that we also introduce. FunMem allows the compiler to analyze, reason about, and optimize memory usage patterns while retaining high-level language information.

To map array elements to specific locations in memory, FunMem uses LMADs. LMADs have traditionally seen most use in the context of automatic parallelization of sequential loops, where a set interpretation is used to aggregate memory accesses across loops. We introduce two new interpretations: as index functions in FunMem, and as a slicing mechanism in a user-facing language. The former enables cheap change-of-layout transformations in FunMem, while the latter allows users to express complex slices of arrays that are not otherwise expressible. Finally, to enable the aforementioned short-circuiting optimization on complex benchmarks, we present a heuristic for proving non-overlap of multi-dimensional LMADs in the set interpretation.

We also introduce a technique for automatically finding near-optimal threshold values for multi-versioned code. Multi-versioned code addresses the problem that no single compilation strategy is best for a given program: the compiler generates multiple semantically equivalent but differently optimized versions of the code and guards each version with a conditional on runtime parameters, yielding a tuning tree. By relying on a monotonicity assumption, which relates the relative performance of the code versions to the runtime-parameter conditionals, we show how to automatically tune the threshold values such that each tuning dataset is executed using the fastest code version. As a result, our one-time offline tuning process produces threshold values that optimally discriminate between the code versions for the training datasets used, and indeed for all datasets with similar size characteristics.

Both the static and dynamic techniques have been developed in the context of, and implemented for, the Futhark parallel programming language, and we show significant performance improvements from both array short-circuiting and autotuning.
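To make the index-function interpretation concrete, here is a minimal Python sketch of an LMAD, assuming the standard definition of a Linear Memory Access Descriptor as an offset plus one (stride, extent) pair per dimension; the class and its methods are illustrative and do not reflect FunMem's actual representation. Note how transposition, a change-of-layout transformation, reduces to permuting strides and extents, with no data movement.

```python
from dataclasses import dataclass

@dataclass
class LMAD:
    offset: int
    strides: list[int]  # one stride per dimension
    shape: list[int]    # one extent per dimension

    def flat_index(self, idx: list[int]) -> int:
        # Map a multi-dimensional index to a flat memory offset:
        # offset + sum_k idx[k] * strides[k], with 0 <= idx[k] < shape[k].
        assert all(0 <= i < n for i, n in zip(idx, self.shape))
        return self.offset + sum(i * s for i, s in zip(idx, self.strides))

    def transpose(self) -> "LMAD":
        # A change-of-layout transformation: reversing the dimension
        # order only permutes strides and extents; no data is moved.
        return LMAD(self.offset, self.strides[::-1], self.shape[::-1])

# A row-major 3x4 array: element (i, j) lives at flat offset i*4 + j.
a = LMAD(offset=0, strides=[4, 1], shape=[3, 4])
assert a.flat_index([1, 2]) == 6
assert a.transpose().flat_index([2, 1]) == 6  # same element, new layout
```

The same machinery suggests why LMAD-based slicing is more expressive than ordinary per-dimension slices: for instance, the diagonal of the 3x4 array above is just `LMAD(offset=0, strides=[5], shape=[3])`, a view that no combination of per-dimension start:stop:step slices can produce.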
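The set interpretation instead views an LMAD as the set of flat offsets it can touch, and short-circuiting an update requires proving that two such sets are disjoint. The following is a deliberately simplified, conservative disjointness test, reusing the `LMAD` class from the sketch above; it is far weaker than the heuristic in the thesis, since it only compares the extreme addresses of each LMAD and so cannot, for example, prove two interleaved strided accesses disjoint.

```python
def address_interval(lmad: LMAD) -> tuple[int, int]:
    # Smallest and largest flat offsets the LMAD can touch. Negative
    # strides lower the minimum; nonnegative strides raise the maximum.
    lo = hi = lmad.offset
    for s, n in zip(lmad.strides, lmad.shape):
        extent = s * (n - 1)
        if extent >= 0:
            hi += extent
        else:
            lo += extent
    return lo, hi

def surely_disjoint(a: LMAD, b: LMAD) -> bool:
    # Sound but incomplete: True means provably non-overlapping;
    # False means "possibly overlapping", never "definitely overlapping".
    alo, ahi = address_interval(a)
    blo, bhi = address_interval(b)
    return ahi < blo or bhi < alo

# Two consecutive rows of the row-major 3x4 array never overlap:
row0 = LMAD(offset=0, strides=[1], shape=[4])
row1 = LMAD(offset=4, strides=[1], shape=[4])
assert surely_disjoint(row0, row1)
```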
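On the autotuning side, the following sketch (with made-up timings, and not Futhark's actual tuner) shows the core idea for a single two-way threshold under the monotonicity assumption: as the runtime parameter grows, the winning code version can flip at most once, so any threshold between the largest parameter value where the first version wins and the smallest where the second wins discriminates the training datasets optimally.

```python
def tune_threshold(samples: list[tuple[int, float, float]]) -> int:
    # samples: (runtime parameter, time of version A, time of version B).
    # Returns a threshold t such that running A when the parameter is
    # below t, and B otherwise, picks the fastest version on every sample.
    a_wins = [p for p, ta, tb in samples if ta <= tb]
    b_wins = [p for p, ta, tb in samples if ta > tb]
    if not b_wins:
        return max(p for p, _, _ in samples) + 1  # A always wins
    if not a_wins:
        return 0                                  # B always wins
    # Monotonicity: once B starts winning, it keeps winning.
    assert max(a_wins) < min(b_wins), "monotonicity assumption violated"
    return min(b_wins)

# Hypothetical timings for, say, a sequentialized version A versus a
# fully parallel version B, measured on three training datasets:
samples = [(1_000, 0.8, 2.1), (10_000, 3.0, 2.9), (100_000, 25.0, 4.2)]
assert tune_threshold(samples) == 10_000
```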
August 13, 2023 by hgpu