high performance computing on graphics processing units: hgpu.org

hgpu.org » Applications » Computer science » On algorithmic reductions in task-parallel programming models

On algorithmic reductions in task-parallel programming models

Jan Ciesko

Universitat Politecnica de Catalunya, Departament d’Arquitectura de Computadors

Universitat Politecnica de Catalunya, 2017

@article{ciesko2017algorithmic,

title={On algorithmic reductions in task-parallel programming models},

author={Ciesko, Jan and others},

year={2017},

publisher={Universitat Polit{`e}cnica de Catalunya}

}

Download (PDF)

View

Source

Source codes

Package:

SmartJumper

2543

views

Wide adoption of parallel processing hardware in mainstream computing as well as the interest for efficient parallel programming in developer communities increase the demand for programming models that offer support for common algorithmic patterns. An algorithmic pattern of particular interest are reductions. Reductions are iterative memory updates of a program variable and appear in many applications. While their definition is simple, their variety of implementations including the use of different loop constructs and calling patterns makes their support in parallel programming models difficult. Further, their characteristic update operation over arbitrary data types that requires atomicity makes their execution computationally expensive and scalable execution challenging. These challenges and their relevance makes reductions a benchmark for compilers, runtime systems and hardware architectures today. This work advances research on algorithmic reductions. It improves their programmability by adding support for task-parallel and array-type reductions. Task-parallel reductions occur in while-loops and recursive algorithms. While for each recursive algorithm an iterative formulation exists, while-loop programs represent a super class of for-loop computable programs and therefore cannot be transformed or substituted. This limitation requires an explicit support for reduction algorithms that fall within this class. Since tasks are suited for a concurrent formulation of these algorithms, the presented work focuses on language extension to the task construct in OmpSs and OpenMP. In the first section of this work we present a generic support for task-parallel reductions in OmpSs and OpenMP and introduce the ideas of reduction scope, reduction domains and static and on-demand memory allocation. With this foundation and the feedback received from the OpenMP language review board, we develop a formalized proposal to add support for task-parallel reductions in OpenMP. This engagement led to a fruitful outcome as our proposal has been accepted into OpenMP recently. As a first step towards support of array-type reduction in a task-parallel programming model, we present a landscape of support techniques and group them by their underlying strategy. Techniques follow either the strategy of direct access (atomics), redirection or iteration ordering. We call techniques that implement redirection into thread-private data containers as techniques with alternative memory layouts (AMLs) and techniques that are based on iteration ordering as techniques with alternative iteration space (AIS). A universal support of AML-based techniques in parallel programming models can be achieved by defining basic interface methods allocate, get and reduce. As examples for new techniques that implement this interface, we present CachedPrivate and PIBOR. CachedPrivate implements a software cache to reduce communication caused by irregular accesses to remote nodes on distributed memory systems. PIBOR implements Privatization with In-lined Block-ordering, a technique that improves data locality by redirecting accesses into thread-local bins. Both techniques implement a get-method that returns a private memory storage for each update operation of the reduction loop. As an example of a technique with an alternative iteration space (AIS), we present Commutative Reductions (ComRed). This technique uses an inspector-executor execution model to generate knowledge about memory access patterns and memory overlaps between participating tasks. This information is used during the execution phase to schedule tasks with overlaps commutatively. We show that this execution model requires only a small set of additional language constructs. Performance results obtained throughout different Chapters of this work demonstrate that software techniques can improve application performance by a factor of 2-4.

Tags: Computer science, Intel Xeon Phi, OpenMP, Package, Thesis

December 10, 2017 by hgpu

Rating: 1.0/5. From 1 vote.

Please wait...