Abstraction and Implementation of Unstructured Grid Algorithms on Massively Parallel Heterogeneous Architectures
Pazmany Peter Catholic University, Faculty of Information Technology
Pazmany Peter Catholic University, 2014
@article{reguly2014abstraction,
title={Abstraction and Implementation of Unstructured Grid Algorithms on Massively Parallel Heterogeneous Architectures},
author={Reguly, Istv{\'a}n Zolt{\'a}n and Ol{\'a}h, Andr{\'a}s and Nagy, Zolt{\'a}n},
year={2014}
}
The last decade saw the long tradition of frequency scaling of processing units grind to a halt, and efforts were re-focused on maintaining computational growth by other means, such as increased parallelism, deep memory hierarchies and complex execution logic. After a long period of "boring productivity", a host of new architectures, accelerators, programming languages and parallel programming techniques appeared, all trying to address different aspects of performance, productivity and generality. Computation has become cheap and the cost of data movement is now the most important factor for performance; yet in the field of scientific computing, many still use decades-old programming languages and techniques. Domain scientists cannot be expected to gain intimate knowledge of the latest hardware and programming techniques that are often necessary to achieve high performance, thus it is more important than ever for computer scientists to take on some of this work: to come up with ways of using new architectures and to provide a synthesis of these results that domain scientists can use to accelerate their research. Despite decades of research, there is no universal auto-parallelising compiler that would make codes performance-portable to modern and future hardware; therefore research often focuses on specific domains of computational science, such as dense or sparse linear algebra, structured or unstructured grid algorithms, N-body problems and many others.

For my research, I have chosen to target the domain of unstructured grid computations; these are most commonly used to discretise partial differential equations when the computational domain is geometrically complex and the solution requires different resolution in different areas. Examples include computational fluid dynamics simulations around complex shorelines or the blades of an aircraft turbine. The aim of this dissertation is to present my research into unstructured grid algorithms, starting out at different levels of abstraction where certain assumptions are made, which in turn are used to reason about these algorithms and to apply transformations to them. They are then mapped to computer code, with a focus on the interplay between the programming model, the execution model and the hardware, investigating how the parallelism and the deep memory hierarchies available on modern heterogeneous hardware can be utilised optimally. In short, my goal is to address some of the programming challenges of modern computing, namely parallelism, locality, load balancing and resilience, in the field of unstructured grid computations.

The first part of my research studies the Finite Element Method (FEM) and therefore starts at a relatively high level of abstraction, allowing a wide range of transformations to the numerical methods involved. I show that by formulating the FE integration differently, it is possible to trade off computations for communications or temporary storage; mapping these formulations to the resource-scarce Graphics Processing Unit (GPU) architecture, I demonstrate that a redundant-compute approach scales to high-order elements. Then, data storage formats and their implications for the algorithms are investigated: I demonstrate that an FEM-specific format is highly competitive with classical approaches, deriving its data movement requirements and showing its advantages in a massively parallel setting.
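To make the computation-versus-storage trade-off concrete, the sketch below (my own illustration, assuming linear triangular elements and a placeholder local integration, not code from the dissertation) contrasts a classical assembled sparse matrix-vector product in CSR format with a matrix-free, element-by-element product that re-integrates each local element matrix on every application, so that no global matrix is ever stored or moved through memory.

// Illustrative sketch (not code from the dissertation): contrasts a classical,
// assembled sparse matrix-vector product in CSR format with a matrix-free,
// element-by-element product that recomputes each local element matrix on the
// fly, trading redundant floating-point work for reduced storage and data movement.
#include <algorithm>
#include <cstddef>
#include <vector>

// y = A*x with A assembled and stored in CSR format (classical approach).
void spmv_csr(const std::vector<double>& val, const std::vector<int>& col,
              const std::vector<int>& row_ptr,
              const std::vector<double>& x, std::vector<double>& y) {
  for (std::size_t r = 0; r + 1 < row_ptr.size(); ++r) {
    double sum = 0.0;
    for (int k = row_ptr[r]; k < row_ptr[r + 1]; ++k)
      sum += val[k] * x[col[k]];
    y[r] = sum;
  }
}

// Matrix-free alternative: no global matrix is ever stored; instead the local
// element matrix is re-integrated for every product (placeholder integration
// for linear triangles, an assumption made for this sketch only).
constexpr int NODES_PER_ELEM = 3;

void local_stiffness(int /*elem*/, double ke[NODES_PER_ELEM][NODES_PER_ELEM]) {
  for (int i = 0; i < NODES_PER_ELEM; ++i)
    for (int j = 0; j < NODES_PER_ELEM; ++j)
      ke[i][j] = (i == j) ? 2.0 : -1.0;  // stand-in for real FE integration
}

void spmv_matrix_free(int num_elems, const std::vector<int>& elem_node,
                      const std::vector<double>& x, std::vector<double>& y) {
  std::fill(y.begin(), y.end(), 0.0);
  for (int e = 0; e < num_elems; ++e) {
    double ke[NODES_PER_ELEM][NODES_PER_ELEM];
    local_stiffness(e, ke);  // redundant compute, repeated on every product
    for (int i = 0; i < NODES_PER_ELEM; ++i)
      for (int j = 0; j < NODES_PER_ELEM; ++j)
        y[elem_node[e * NODES_PER_ELEM + i]] +=
            ke[i][j] * x[elem_node[e * NODES_PER_ELEM + j]];
  }
}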
Finally, I parametrise the execution of the sparse matrix-vector product (SpMV) for massively parallel architectures, and present a constant-time heuristic as well as a machine learning algorithm to determine the optimal values of these parameters for general sparse matrices.

Following the first part, which focused on challenges in the context of the finite element method, I broaden the scope of my research by addressing general unstructured grid algorithms that are defined through the OP2 domain-specific library [16]. OP2's abstraction for unstructured grid computations covers the finite element method, but also others such as the finite volume method. The entry point here, that is, the level of abstraction, is lower than that of the FEM, so there is no longer control over the numerical method, but a much broader range of applications is supported. The second part of my research investigates possible transformations to the execution of computations defined through the OP2 abstraction in order to address the challenges of locality, resilience and load balancing at a higher level, one that is not concerned with the exact implementation. I present a fully automated checkpointing algorithm that is capable of locating the point during execution where the state space is minimal, saving the data, and, in the case of a failure, automatically fast-forwarding execution to the point of the last backup. Furthermore, I give an algorithm for the redundant-compute tiling, or cache-blocking, execution of general unstructured grid computations that fully automatically reorganises computations across subsequent loops in order to improve the temporal locality of data accesses. I also address the issue of utilisation in modern heterogeneous systems where hardware with different performance characteristics is present: I give a model for such heterogeneous cooperative execution of unstructured grid algorithms and validate it with an implementation in OP2.

Finally, the third part of my research presents results on how an algorithm defined once through OP2 can be automatically mapped to a range of contrasting programming languages, execution models and hardware. I show how execution is organised on large-scale heterogeneous systems, utilising layered programming abstractions, across deep memory hierarchies and many levels of parallelism. First, I discuss how execution is mapped to GPUs through CUDA and the Single Instruction Multiple Threads (SIMT) model, deploying a number of optimisations to execution patterns and data structures fully automatically through code generation. Second, I present a mapping to CPU-like architectures, utilising OpenMP for multi-threading and MPI for distributed-memory execution, as well as AVX vector instructions for Single Instruction Multiple Data (SIMD) execution. These techniques are evaluated on computational fluid dynamics simulations, including the industrial application Rolls-Royce Hydra, on a range of hardware: GPUs, multi-core CPUs, the Intel Xeon Phi co-processor and distributed-memory supercomputers combining CPUs and GPUs. My work demonstrates the viability of the high-level abstraction approach and its benefits in terms of performance and developer productivity.
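To illustrate the abstraction that the second and third parts build on, the sketch below shows how a single indirect loop over edges is expressed through OP2's C/C++ API (the tiny mesh, the dataset names and the res kernel are placeholders of my own, not taken from the thesis).

// Illustrative OP2 usage (tiny placeholder mesh and kernel; based on the
// published OP2 C/C++ API, not on code from the thesis).
#include "op_seq.h"  // OP2 header for the sequential developer build

// User kernel: one edge reads its weight and the two nodal values it touches,
// and increments the residual at both nodes.
void res(const double *w, const double *p1, const double *p2,
         double *du1, double *du2) {
  double flux = (*w) * ((*p1) - (*p2));
  *du1 += flux;
  *du2 -= flux;
}

int main(int argc, char **argv) {
  op_init(argc, argv, 2);

  // A 4-node, 3-edge chain stands in for a real mesh read from file.
  int nnode = 4, nedge = 3;
  int edge2node[] = {0, 1, 1, 2, 2, 3};   // two node indices per edge
  double w[]  = {0.5, 0.5, 0.5};
  double p[]  = {1.0, 2.0, 3.0, 4.0};
  double du[] = {0.0, 0.0, 0.0, 0.0};

  op_set nodes = op_decl_set(nnode, "nodes");
  op_set edges = op_decl_set(nedge, "edges");
  op_map pedge = op_decl_map(edges, nodes, 2, edge2node, "pedge");
  op_dat p_w  = op_decl_dat(edges, 1, "double", w,  "p_w");
  op_dat p_p  = op_decl_dat(nodes, 1, "double", p,  "p_p");
  op_dat p_du = op_decl_dat(nodes, 1, "double", du, "p_du");

  // One declarative loop over edges: the library and its code generators
  // choose colouring, partitioning and the backend-specific execution.
  op_par_loop(res, "res", edges,
              op_arg_dat(p_w,  -1, OP_ID, 1, "double", OP_READ),
              op_arg_dat(p_p,   0, pedge, 1, "double", OP_READ),
              op_arg_dat(p_p,   1, pedge, 1, "double", OP_READ),
              op_arg_dat(p_du,  0, pedge, 1, "double", OP_INC),
              op_arg_dat(p_du,  1, pedge, 1, "double", OP_INC));

  op_exit();
  return 0;
}

A loop specified in this declarative form is the unit that the transformations of the second part (checkpointing, tiling, heterogeneous execution) and the code generation of the third part (CUDA, OpenMP/MPI, vectorised backends) operate on.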
July 24, 2014 by hgpu