A Toolkit for Building Dynamic Compilers for Array-Based Languages Targeting CPUs and GPUs
School of Computer Science, McGill University, Montreal
McGill University, 2015
@phdthesis{garg2015toolkit,
title={A Toolkit for Building Dynamic Compilers for Array-Based Languages Targeting CPUs and GPUs},
author={Garg, Rahul},
year={2015},
school={McGill University, Montr{\'e}al}
}
Array-based languages such as MATLAB and Python (with NumPy) have become very popular for scientific computing. However, the performance of the implementations of these languages is often lacking; for example, some of the implementations are interpreted. Further, these languages were not designed with multi-core CPUs and GPUs in mind and thus do not take full advantage of modern hardware. Developing just-in-time (JIT) compilers for these languages that allow scientific programmers to efficiently target both CPUs and GPUs is therefore of increasing interest. However, building such compilers requires considerable effort. Prior to this thesis, there were no reusable compiler toolkits for array-based languages, even though many of the compilation challenges are similar across languages.

This thesis presents a set of two novel and reusable tools, Velociraptor and RaijinCL, that simplify the work of building JIT compilers for array-based languages targeting both CPUs and GPUs. Velociraptor is a reusable, embeddable dynamic compiler toolkit, while RaijinCL is an auto-tuning, high-performance matrix operations library. Velociraptor provides a new high-level intermediate representation (IR) called VRIR, which has been specifically designed for numeric computations, with rich support for arrays as well as high-level parallel and GPU constructs. A compiler developer uses Velociraptor by generating VRIR for key parts of an input program. Velociraptor provides an optimizing compiler toolkit for generating CPU and GPU code, and also provides a smart runtime system to manage the GPU.

An important contribution of the thesis is a new dynamic compilation technique called region specialization, designed particularly for numerical programs. Region specialization first performs region detection analysis, a novel compiler analysis which identifies regions of code that may be interesting for the compiler to analyze, such as loops and library calls involving arrays. Region detection analysis also identifies parameters, such as the shapes and values of variables, which are critical for inferring the properties of a region. Region specialization then dynamically generates specialized code for the region based upon the runtime values of the parameters identified by region detection analysis.

To demonstrate that Velociraptor is not tied to a single language or compiler toolchain, we present two case studies of using Velociraptor: a proof-of-concept Python compiler targeting CPUs and GPUs, and a GPU extension for a MATLAB JIT. We evaluated both compilers built using Velociraptor against production-level static and dynamic compilers on a variety of benchmarks. We demonstrate that the CPU code generated by our toolkit is either competitive with, or outperforms, the available tools, and that our parallel CPU and GPU code generation can provide further speedups.

RaijinCL was implemented to support, on the GPU, the matrix library operations found in array-based languages, such as matrix multiplication. RaijinCL is an auto-tuning, high-performance matrix operations library that runs across GPUs from many vendors. We present detailed experimental results from many different GPU architectures to show that it is competitive with vendor-tuned libraries. Finally, while much of the literature on matrix libraries for GPUs targets discrete GPUs, we also demonstrate a prototype extension to RaijinCL that maximizes system performance on single-chip CPU/GPU systems by using both the CPU and the GPU.
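To make the region-specialization idea above concrete, here is a minimal Python/NumPy sketch of specializing a detected array region on its runtime shape and dtype parameters, with one cached "compiled" version per parameter key. This is an illustrative analogy only, not code from the thesis and not Velociraptor's actual API; the names specialize_on_shape, build_saxpy, and saxpy_region are hypothetical.

import numpy as np

def specialize_on_shape(build_kernel):
    # Cache one specialized kernel per (shape, dtype) key.
    # `build_kernel` plays the role of the code generator: given the
    # runtime parameters of the region, it returns a callable
    # specialized for those parameters.
    cache = {}

    def dispatch(*arrays):
        key = tuple((a.shape, a.dtype.str) for a in arrays)
        kernel = cache.get(key)
        if kernel is None:               # first time these parameters are seen
            kernel = build_kernel(*key)  # "compile" a specialized version
            cache[key] = kernel
        return kernel(*arrays)

    return dispatch

def build_saxpy(x_info, y_info):
    # Toy "code generator": in a real JIT this is where specialized
    # CPU or GPU code would be emitted for the detected region.
    shape, dtype = x_info            # runtime parameters of the region
    n = int(np.prod(shape))

    def saxpy_region(x, y, a=2.0):
        # Specialized for a fixed n and dtype; a real compiler could
        # unroll, vectorize, or offload to the GPU based on these
        # now-known parameters.
        out = np.empty(n, dtype=dtype)
        for i in range(n):
            out[i] = a * x[i] + y[i]
        return out

    return saxpy_region

saxpy = specialize_on_shape(build_saxpy)
x = np.arange(8, dtype=np.float32)
y = np.ones(8, dtype=np.float32)
print(saxpy(x, y))  # builds a specialization for shape (8,), float32, then runs it
print(saxpy(x, y))  # reuses the cached specialization

In this sketch the (shape, dtype) key stands in for the parameters that region detection analysis would identify; the actual toolkit operates on VRIR regions and generates native CPU and GPU code rather than Python closures.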
October 6, 2015 by hgpu