high performance computing on graphics processing units: hgpu.org

Novel Data-Partitioning Algorithms for Performance and Energy Optimization of Data-Parallel Applications on Modern Heterogeneous HPC Platforms

Hamidreza Khaleghzadeh

University College Dublin

University College Dublin, 2019

Source codes

Package: libhclooc: Software Library Facilitating Out-of-core Implementations of Accelerator Kernels on Hybrid Computing Platforms

Heterogeneity has become one of the most profound and challenging characteristics of today’s HPC environments. Modern HPC platforms are highly heterogeneous owing to the tight integration of multicore CPUs and accelerators (such as Graphics Processing Units, Intel Xeon Phis, or Field-Programmable Gate Arrays), which enables them to maximize the dominant objectives of performance and energy efficiency. Traditional parallel algorithms and tools, designed for legacy homogeneous platforms, will deliver only a small fraction of the potential performance and energy efficiency that we should expect from highly hybrid HPC platforms in the future. Performance and energy are the two dominant objectives for optimization on modern heterogeneous HPC platforms such as supercomputers and cloud computing infrastructures. Recent research on modern homogeneous multicore platforms demonstrates that the performance and energy profiles of data-parallel applications executing on such platforms exhibit drastic variations due to inherent complexities in these platforms, such as severe contention for shared resources and Non-Uniform Memory Access (NUMA). In this thesis, we show that these inherent characteristics and complexities pose serious challenges to the modelling and optimization of data-parallel applications on modern heterogeneous platforms for performance and energy. We illustrate that the discrete functional relationships between performance and workload size and between energy and workload size have nonlinear and non-convex shapes, which deviate significantly from the shapes and assumptions that allowed state-of-the-art optimization algorithms to find optimal solutions for performance and energy consumption. We thereby demonstrate that workload distribution has become an important decision variable that can no longer be ignored on modern heterogeneous HPC platforms.
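
To illustrate why workload distribution matters when time functions are discrete and non-convex, here is a minimal sketch in Python. It is a plain dynamic-programming search, not any of the algorithms proposed in the thesis, and the function name `partition_min_time` and the two time profiles are hypothetical values chosen only for illustration.

```python
from typing import List, Sequence, Tuple


def partition_min_time(
    times: Sequence[Sequence[float]], n: int
) -> Tuple[float, List[int]]:
    """Distribute n workload units over len(times) processors.

    times[i][x] is the (hypothetical) measured execution time of
    processor i on x units, for x = 0..n. Returns the minimal parallel
    execution time (the maximum over processors) and one distribution
    that achieves it. Generic dynamic programming, O(p * n^2).
    """
    inf = float("inf")
    # best[w]: minimal parallel time for w units on the processors seen so far.
    best = [0.0 if w == 0 else inf for w in range(n + 1)]
    choices: List[List[int]] = []
    for proc_times in times:
        new_best = [inf] * (n + 1)
        pick = [0] * (n + 1)
        for w in range(n + 1):
            for x in range(w + 1):
                cand = max(proc_times[x], best[w - x])
                if cand < new_best[w]:
                    new_best[w] = cand
                    pick[w] = x
        best = new_best
        choices.append(pick)
    # Walk back through the recorded choices to recover the distribution.
    dist = [0] * len(times)
    w = n
    for i in reversed(range(len(times))):
        dist[i] = choices[i][w]
        w -= dist[i]
    return best[n], dist


# Hypothetical non-convex profiles: both processors are slow at 2 units.
cpu = [0.0, 1.0, 5.0, 3.0, 4.0]
gpu = [0.0, 1.5, 5.0, 2.5, 6.0]
optimal_time, optimal_dist = partition_min_time([cpu, gpu], 4)
balanced_time = max(cpu[2], gpu[2])
```

With these made-up profiles the even split (2, 2) takes 5.0 time units, while the uneven split (1, 3) takes 2.5, which is the kind of non-balanced solution the abstract refers to.
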
We formulate the problem of optimizing data-parallel applications on modern heterogeneous HPC platforms for performance and dynamic energy, and then propose two new model-based data-partitioning algorithms, HPOPTA and HEOPTA. These algorithms minimize, respectively, the execution time and the dynamic energy consumption of computations in the parallel execution of applications. We also present two other algorithms, HEPOPTA and HTPOPTA, which solve bi-objective optimization problems on modern heterogeneous HPC platforms for execution time and dynamic energy, and for execution time and total energy, respectively. All these algorithms consider one decision variable, workload distribution. Unlike traditional approaches that look for load-balanced solutions, the solutions returned by these algorithms are, generally speaking, non-balanced. In a typical hybrid node, the tight integration of accelerators with multicore CPUs via PCI-E communication links has inherent limitations, such as the limited main memory of accelerators and the limited bandwidth of the PCI-E links. These limitations pose formidable programming challenges to the execution of large workload sizes on these accelerators. In this research, we describe an out-of-core library, called HCLOOC, containing interfaces that address these challenges. It employs optimal software pipelines to overlap data transfers between the host CPU and the accelerator with computations on the accelerator. It is built on fundamental building blocks: OpenCL command queues for FPGAs, Intel offload streams for Intel Xeon Phis, and CUDA streams and events, which allow concurrent utilization of the copy and execution engines provided in Nvidia GPUs.
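
The benefit of overlapping transfers with computation can be estimated with a small timing model. The Python sketch below is not the HCLOOC API and does not call any accelerator runtime. It assumes a workload already split into chunks, one engine for host-to-device copies, one for kernels, and one for device-to-host copies, and it ignores device memory limits. All names and durations are hypothetical.

```python
from typing import Sequence


def serial_time(
    h2d: Sequence[float], kernel: Sequence[float], d2h: Sequence[float]
) -> float:
    """Total time when every copy and kernel runs one after another."""
    return sum(h2d) + sum(kernel) + sum(d2h)


def pipelined_time(
    h2d: Sequence[float], kernel: Sequence[float], d2h: Sequence[float]
) -> float:
    """Finish time of a three-stage pipeline over the chunks.

    Assumption: uploads, kernels, and downloads each use their own
    engine, so a stage only waits for its own previous chunk and for
    the preceding stage of the same chunk.
    """
    up_end = run_end = down_end = 0.0
    for up, run, down in zip(h2d, kernel, d2h):
        up_end = up_end + up                    # upload engine is sequential
        run_end = max(up_end, run_end) + run    # kernel needs its input chunk
        down_end = max(run_end, down_end) + down  # download needs the result
    return down_end


# Four equal chunks with hypothetical durations (arbitrary time units).
uploads = [1.0] * 4
kernels = [2.0] * 4
downloads = [1.0] * 4
t_serial = serial_time(uploads, kernels, downloads)
t_pipelined = pipelined_time(uploads, kernels, downloads)
```

In this made-up case the serial schedule takes 16 units and the pipelined one 10, because all but the first upload and the last download are hidden behind kernel execution.
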
We experimentally analyse and demonstrate the optimality and efficiency of the proposed algorithms and library using two well-known scientific data-parallel applications, matrix multiplication and 2D fast Fourier transform, on a cluster of two highly heterogeneous nodes. Each application invokes highly optimized vendor-specific kernels for CPUs and accelerators. The matrix multiplication application is implemented using HCLOOC, which allows the accelerators to run computations of arbitrary workload size.

Tags: Computer science, CUDA, FPGA, Heterogeneous systems, Matrix multiplication, nVidia, OpenCL, Package, Thesis

March 17, 2019 by hgpu
