Hierarchical Transparent Programming for Heterogeneous Computing
Universidad de Valladolid
Universidad de Valladolid, 2014
@phdthesis{escribano2014hierarchical,
title={Hierarchical Transparent Programming for Heterogeneous Computing},
author={Gonz{\'a}lez Escribano, Arturo and Llanos Ferraris, Diego R.},
school={Universidad de Valladolid},
year={2014}
}
Parallel computing, and the development of parallel programs, is a way to reduce program execution time. For many years, optimization efforts focused on sequential code without considering parallel tasks. With the arrival of multi-core devices, code parallelization has become increasingly important. Parallel computing concerns both the hardware and the software point of view: in both cases, many calculations are carried out simultaneously, and the final objective is an improvement in computing capacity. The rapid increase in the performance of Graphics Processing Units (GPUs), coupled with recent improvements in their ease of programming, has made graphics hardware a compelling platform for the High Performance Computing (HPC) field in a wide range of applications. Thanks to this impressive processing potential, a single GPU has sufficient power to compete with many super-scalar CPUs. The notion of heterogeneous computing emerged a few years ago. It refers to exploiting a system composed of multiple, mixed compute devices. A heterogeneous system can be composed of commodity multi-core processors, graphics processors, and reconfigurable processors, among others. Although using heterogeneous systems to take maximum advantage of all their computing capabilities may seem a natural idea, their programming complexity is one step beyond that of intrinsically complex parallel programming. Two main problems appear. First, the need to write specialized code for both the CPU cores and the GPUs or other accelerators present in the system. Second, the problems related to data distribution among devices and the associated load-balancing problem, since heterogeneous systems do not share a common address space and their devices present different computing powers. During the last decade, different programming models have been proposed to handle the complexity of multilevel data partition and mapping.
These programming models fall roughly into two categories: those that hide the underlying communications, and those where explicit communication is driven by the partition made by the user. These parallel programming models do not help the programmer to explicitly express the communication pattern needed by the algorithm regardless of the data partition chosen. Tiling is a well-known technique used to distribute data and tasks in parallel programs and to improve the locality of nested loops in sequential code. The use of data structures that support tiles allows a better exploitation of the memory hierarchy, since data is often reused within a tile. Trasgo is a programming framework being developed by our research group at the University of Valladolid (Spain). Trasgo is based on high-level, nested-parallel specifications that allow the programmer to easily express several complex combinations of data and task parallelism with a common scheme. One of its most important features is that this model hides the layout and scheduling details. The Trasgo back-end is supported by Hitmap, a runtime library for hierarchical tiling and mapping of arrays. The Hitmap library implements functions to efficiently create, manipulate, map, and communicate hierarchical tiling arrays. In this Ph.D. thesis we study the possibility of developing a portable and transparent programming system that incorporates hierarchical tiling and scheduling policies in order to take advantage of heterogeneous computing capabilities. To accomplish our research proposal we take advantage of the Hitmap library. Hitmap is used as a prototype framework that integrates a parallel computation model exploiting all available hardware resources (CPUs and GPUs) in heterogeneous environments. This framework makes it possible to write abstract codes that are transparently adapted to heterogeneous systems with mixed types of accelerator devices.
We present a study of GPU architectures to help determine good values for the configuration parameters that must be chosen by the programmer. The knowledge obtained from this study is used to create proper policies for selecting the values of GPU configuration parameters, such as threadblock geometry and size, and the L1 cache configuration. These policies are included in the aforementioned framework. After examining and analyzing the experimental results, we consider the feasibility of creating a programming environment that contains automatic data-partitioning techniques and communication tools, and that selects, transparently to the programmer, good values of the GPU configuration parameters for heterogeneous systems.
May 17, 2014 by hgpu