Novel Data-Partitioning Algorithms for Performance and Energy Optimization of Data-Parallel Applications on Modern Heterogeneous HPC Platforms

Hamidreza Khaleghzadeh
University College Dublin
University College Dublin, 2019




Heterogeneity has become one of the most profound and challenging characteristics of today’s HPC environments. Modern HPC platforms have become highly heterogeneous owing to the tight integration of multicore CPUs and accelerators (such as Graphics Processing Units, Intel Xeon Phis, or Field-Programmable Gate Arrays), which empowers them to maximize the dominant objectives of performance and energy efficiency. Traditional parallel algorithms and tools, designed for legacy homogeneous platforms, will deliver only a small fraction of the potential performance and energy efficiency that we should expect from highly hybrid HPC platforms in the future. Performance and energy are the two dominant objectives for optimization on modern heterogeneous HPC platforms such as supercomputers and cloud computing infrastructures. Recent research on modern homogeneous multicore platforms demonstrates that the performance and energy profiles of data-parallel applications executing on such platforms exhibit drastic variations due to inherent complexities in these platforms, such as severe contention for shared resources and Non-Uniform Memory Access (NUMA). In this thesis, we show that these inherent characteristics and complexities pose serious challenges to the modelling and optimization of data-parallel applications on modern heterogeneous platforms for performance and energy. We illustrate that the discrete functional relationships between performance and workload size, and between energy and workload size, have nonlinear and non-convex shapes, which deviate significantly from the shapes and assumptions that allowed state-of-the-art optimization algorithms to find optimal solutions for performance and energy consumption. Thereby, we demonstrate that the workload distribution has become an important decision variable that can no longer be ignored on modern heterogeneous HPC platforms.
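To make the point about non-convex profiles concrete, the following minimal Python sketch searches all workload distributions across two processors and compares the best one with a load-balanced one. It is not one of the algorithms from the thesis: it is a plain exhaustive search, and the `TIME` profiles and the name `best_distribution` are invented for illustration only.

```python
from itertools import product

# Hypothetical discrete execution-time profiles (seconds), indexed by
# workload size 0..8. The values are invented, not measured.
TIME = [
    [0.0, 1.0, 2.0, 3.0, 6.0, 5.0, 6.0, 7.0, 8.0],    # processor 0: non-convex spike at size 4
    [0.0, 1.5, 3.0, 4.5, 6.0, 7.5, 9.0, 10.5, 12.0],  # processor 1: linear profile
]


def best_distribution(time_profiles, n):
    """Exhaustively search distributions that sum to n.

    Returns (parallel_time, distribution), where parallel_time is the
    maximum execution time over all processors for that distribution.
    """
    p = len(time_profiles)
    best = None
    for dist in product(range(n + 1), repeat=p):
        if sum(dist) != n:
            continue
        parallel_time = max(time_profiles[i][x] for i, x in enumerate(dist))
        if best is None or parallel_time < best[0]:
            best = (parallel_time, dist)
    return best


# Load-balanced split: both processors finish at the same time (6.0 s).
balanced = (4, 4)
balanced_time = max(TIME[i][x] for i, x in enumerate(balanced))

optimal_time, optimal = best_distribution(TIME, 8)
print("balanced:", balanced, balanced_time)
print("optimal: ", optimal, optimal_time)
```

Under these made-up profiles, the split `(4, 4)` equalizes the execution times of the two processors, yet the unbalanced split `(5, 3)` finishes sooner because it avoids the spike in the first profile. Exhaustive search is only feasible for toy sizes; it serves here to show why the workload distribution has to be treated as a decision variable.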
We formulate the problem of optimizing data-parallel applications on modern heterogeneous HPC platforms for performance and dynamic energy, and then propose two new model-based data-partitioning algorithms, HPOPTA and HEOPTA. These algorithms minimize, respectively, the execution time and the dynamic energy consumption of computations in the parallel execution of applications. We also present two further algorithms, HEPOPTA and HTPOPTA, which solve bi-objective optimization problems on modern heterogeneous HPC platforms: HEPOPTA for execution time and dynamic energy, and HTPOPTA for execution time and total energy. All these algorithms consider a single decision variable, the workload distribution. Unlike traditional approaches, which look for load-balanced solutions, the solutions returned by these algorithms are, generally speaking, non-balanced. In a typical hybrid node, the tight integration of accelerators with multicore CPUs via PCI-E communication links has inherent limitations, such as the limited main memory of the accelerators and the limited bandwidth of the PCI-E links. These limitations pose formidable programming challenges to the execution of large workload sizes on these accelerators. In this research, we describe an out-of-card library, called HCLOOC, containing interfaces that address these challenges. It employs optimal software pipelines to overlap data transfers between the host CPU and the accelerator with computations on the accelerator. It is designed using fundamental building blocks: OpenCL command queues for FPGAs, Intel offload streams for Intel Xeon Phis, and CUDA streams and events, which allow concurrent utilization of the copy and execution engines provided in Nvidia GPUs.
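The benefit of overlapping transfers with computation can be illustrated with a small timing model. The Python sketch below is not the HCLOOC interface or its pipeline design; it is a hypothetical three-stage model (host-to-device copy, kernel, device-to-host copy) that assumes each stage has its own engine and that device buffers are never the bottleneck. The function names and chunk timings are invented for illustration.

```python
def pipeline_makespan(stage_times):
    """Completion time of a chunked three-stage software pipeline.

    stage_times[i] = (h2d, compute, d2h) durations for chunk i.
    Assumption: each stage runs on its own engine, so a chunk can start
    a stage once (a) it has finished the previous stage and (b) the
    previous chunk has released that stage's engine.
    """
    engine_free = [0.0, 0.0, 0.0]  # time at which each engine becomes idle
    for chunk in stage_times:
        ready = 0.0  # time at which this chunk finished its previous stage
        for stage, duration in enumerate(chunk):
            start = max(ready, engine_free[stage])
            engine_free[stage] = start + duration
            ready = engine_free[stage]
    return engine_free[-1]


def serial_time(stage_times):
    """Completion time when no stage overlaps with any other."""
    return sum(sum(chunk) for chunk in stage_times)


# Four equal chunks with invented durations (seconds): copy in, compute, copy out.
chunks = [(2.0, 3.0, 1.0)] * 4

print("serial:   ", serial_time(chunks))
print("pipelined:", pipeline_makespan(chunks))
```

With these invented durations, the non-overlapped execution takes 24 seconds, while the pipelined one takes 15 seconds, because the copies of later chunks are hidden behind the kernel executions of earlier ones. Once the pipeline is full, the longest stage (here the kernel) bounds the throughput.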
We experimentally analyse and demonstrate the optimality and efficiency of the proposed algorithms and library using two well-known scientific data-parallel applications, matrix multiplication and 2D fast Fourier transform, on a cluster of two highly heterogeneous nodes. Each application invokes highly optimized vendor-specific kernels for CPUs and accelerators. The matrix multiplication application is implemented using HCLOOC, which allows the accelerators to run computations of arbitrary workload size.