On Runtime Systems for Task-based Programming on Heterogeneous Platforms
LaBRI – Laboratoire Bordelais de Recherche en Informatique, Inria Bordeaux – Sud-Ouest
tel-01959127, 18 December 2018
@phdthesis{thibault:tel-01959127,
title={On Runtime Systems for Task-based Programming on Heterogeneous Platforms},
author={Thibault, Samuel},
url={https://hal.inria.fr/tel-01959127},
school={Universit{\'e} de Bordeaux},
year={2018},
month={Dec},
keywords={Runtime Systems; Task graphs; Task graph scheduling; Distributed Computing; Support Ex{\'e}cutif; Graphe de t{\^a}ches; Ordonnancement de graphe de t{\^a}ches; Calcul Distribu{\'e}},
type={Habilitation {\`a} diriger des recherches},
pdf={https://hal.inria.fr/tel-01959127/file/hdr.pdf},
hal_id={tel-01959127},
hal_version={v1}
}
Simulation has become pervasive in science. Real experimentation remains an essential step in scientific research, but simulation has replaced a wide range of costly, lengthy, or even dangerous experiments. It does, however, require massive computation power, and scientists will always welcome bigger and faster computation platforms in order to keep simulating more accurately and more extensively. The HPC field has kept providing such platforms, with various shifts along the decades, from vector computers to clusters. The past decade seems to have seen another such shift, as shown by the TOP500 list of the fastest supercomputers. To stay in the race, most of the largest platforms now include accelerators such as GPGPUs or Xeon Phi, making them heterogeneous systems. Programming such systems is significantly more complex than programming the homogeneous platforms we were used to, as it now requires orchestrating asynchronous accelerator operations alongside the usual computation and network communications, to the point that optimizing execution by hand no longer seems reasonable. A deep trend that has emerged to cope with this new complexity is the use of task-based programming models. These are not new, but they have regained a lot of interest lately, showing up in a large variety of industrial platforms and research projects that use this model with various programming interfaces and features. A key part, however, is often forgotten, misunderstood, or simply ignored: the underlying runtime system which manages tasks. Yet this is where an extremely wide range of optimization and support can be provided, thanks to the task-based programming model.
As we will see in this document, this model is indeed very appealing for runtime systems: since they get to know the set of tasks which will have to be executed, which data they will access, possibly an estimation of the duration of the tasks, etc., this opens up extensive possibilities which are not reachable by an Operating System with the current limited system interfaces. It is increasingly heard in conference keynotes that runtime systems will be key to HPC’s continued success. Thanks to such rich information from applications, runtime systems can bring a lot of questions to the table, in terms of task scheduling of course, but also transfer optimization, memory management, performance feedback, etc. When the PhD thesis of Cedric Augonnet started in 2008, we began addressing a few of these questions within a runtime system, StarPU. During the decade that followed, we deepened the investigations and opened new directions, which I will discuss in this document. We have chosen to stay focused on the runtime aspects, leaving somewhat aside, for instance, the programming languages which can be introduced to make task-based programming easier. The runtime part itself has indeed kept providing various challenges, which show up in task-based runtime systems in general. These challenges happen to be related to various other research topics, and collaboration with the respective research teams then became only natural: task scheduling of course, but also network communication, statistics, performance visualization, etc. Conversely, the existence of the actual working runtime system StarPU provided them with interesting test cases for their respective approaches. It also allowed various research projects to be conducted around it, without direct contribution to StarPU.
December 23, 2018 by hgpu