Integrating Accelerators in Heterogeneous Systems
School of Graduate Studies Rutgers, The State University of New Jersey
The State University of New Jersey, 2021
@phdthesis{vesely2021integrating,
title={Integrating accelerators in heterogeneous systems},
author={Vesel{\'y}, J{\'a}n},
year={2021},
school={Rutgers University-School of Graduate Studies}
}
This work studies programmability-enhancing abstractions in the context of accelerators and heterogeneous systems. Specifically, the focus is on adapting abstractions that have been successfully established to improve the programmability of CPUs. Specialized accelerators, including GPUs, TPUs, and FPGAs, promise to deliver orders-of-magnitude improvements in performance and energy efficiency. However, to exploit these benefits, programmers must port existing applications, or develop new ones, to target accelerator-specific programming environments. The availability of established programmability abstractions aids this process and extends the performance benefits to a wider range of applications. This work presents three cases of known CPU abstractions and studies their suitability for accelerator programming: virtual memory, operating system services, and mapping of high-level languages. I study both their suitability in terms of existing operational semantics and the design considerations necessary for efficient implementation.
First, I study the mapping of high-level dynamic languages to accelerators. High-level languages like Python, which offer a large selection of support libraries, are increasingly popular with designers of scientific applications. High-level languages are often used to bind together otherwise highly optimized components to form a complete program. I use this observation to examine a specific case of cognitive modeling workloads written in Python and propose a path to efficient execution on accelerators. I demonstrate that it is often possible to extract and optimize core computational kernels using standard compiler techniques. Extracting such kernels offers multiple benefits: it improves performance, it eliminates dynamic language features for more efficient mapping to accelerators, and it offers opportunities for exploiting compiler-based analyses to provide direct user feedback.
The second major area of study is access to system services from accelerator programs. While accelerators often work as memory-to-memory devices, there is an increasing amount of evidence in favour of providing them with direct access to network or permanent storage. This work discusses the suitability of existing operating system interfaces (POSIX) and their semantics for inclusion in GPU programs. It considers the differences between the CPU and GPU execution models and the suitability of CPU system calls from both a semantic and a performance point of view.
Finally, I examine challenges in implementing virtual memory for accelerators. To avoid expensive data marshalling overhead, accelerators often support a unified virtual address space (also called unified virtual memory). This feature allows the operating system to synchronize CPU and accelerator address spaces. However, designing such a system requires several trade-offs to accommodate the complexities of maintaining the mirror layout while matching accelerator-specific data access patterns. This work investigates integrated GPUs as a case study of accelerators and identifies several opportunities for improvement in designing device-side address translation hardware to provide a unified virtual address space.
Overall, this thesis studies programmability enhancements known from the CPU world and their applications to accelerators. It demonstrates that these techniques adapt well and provide programmability and familiarity to application programmers. Such a combination not only opens the door to new applications but also allows for straightforward acceleration of existing ones, delivering the performance benefits of accelerators to a wide range of applications. The proposed extensions to accelerators were implemented, and data were collected, on real systems without any use of system simulators or hardware emulation.
March 7, 2021 by hgpu