Out-of-core Implementation for Accelerator Kernels on Heterogeneous Clouds
School of Computer Science, University College Dublin, Belfield, Dublin 4, Ireland
The Journal of Supercomputing, 2017
@article{khaleghzadeh2017out,
title={Out-of-core implementation for accelerator kernels on heterogeneous clouds},
author={Khaleghzadeh, Hamidreza and Zhong, Ziming and Reddy, Ravi and Lastovetsky, Alexey},
journal={The Journal of Supercomputing},
pages={1–18},
year={2017},
publisher={Springer}
}
Cloud environments today are increasingly featuring hybrid nodes containing multicore CPU processors and a diverse mix of accelerators such as Graphics Processing Units (GPUs), Intel Xeon Phi co-processors, and Field-Programmable Gate Arrays (FPGAs) to facilitate easier migration to them of HPC workloads. While virtualization of accelerators in clouds is a leading research challenge, we address the programming challenges that assail execution of large instances of data-parallel applications using these accelerators in this paper. In a typical hybrid node in a cloud, the tight integration of accelerators with multicore CPUs via PCI-E communication links contains inherent limitations such as limited main memory of accelerators and limited bandwidth of the PCI-E communication links. These limitations poses formidable programming challenges to execution of large problem sizes on these accelerators. In this paper, we describe a library containing interfaces (HCLOOC) that addresses these challenges. It employs optimal software pipelines to overlap data transfers between host CPU and the accelerator and computations on the accelerator. It is designed using the fundamental building blocks, which are OpenCL command queues for FPGAs, Intel offload streams for Intel Xeon Phis, and CUDA streams and events that allow concurrent utilization of the copy and execution engines provided in NVidia GPUs. We elucidate the key features of our library using an out-of-core implementation of matrix multiplication of large dense matrices on a hybrid node, an Intel Haswell multicore CPU server hosting three accelerators that includes NVidia K40c GPU, Intel Xeon Phi 3120P, and a Xilinx FPGA. Based on experiments with the GPU, we show that our out-of-core implementation achieves 82% of peak double-precision floating performance of the GPU and a speedup of 2.7 times over the NVidia’s out-of-core matrix multiplication implementation (CUBLAS-XT). We also demonstrate that our implementation exhibits 0% drop in performance when the problem size exceeds the main memory of the GPU. We observe this 0% drop also for our implementation for Intel Xeon Phi and Xilinx FPGA.
September 16, 2017 by hgpu