Data-aware scheduling of legacy kernels on heterogeneous platforms with distributed memory
NEC Laboratories America, Inc., Princeton, NJ, USA
In Proceedings of the 22nd ACM Symposium on Parallelism in Algorithms and Architectures (2010), pp. 82–91.
@conference{becchi2010data,
title={Data-aware scheduling of legacy kernels on heterogeneous platforms with distributed memory},
author={Becchi, M. and Byna, S. and Cadambi, S. and Chakradhar, S.},
booktitle={Proceedings of the 22nd ACM Symposium on Parallelism in Algorithms and Architectures},
pages={82--91},
year={2010},
organization={ACM}
}
In this paper, we describe a runtime that automatically enhances the performance of applications running on heterogeneous platforms consisting of a multi-core CPU and a throughput-oriented many-core GPU. The CPU and GPU are connected by a non-coherent interconnect such as PCI-E and therefore do not share memory. Heterogeneous platforms available today, such as [9], are of this type. Our goal is to enable the programmer to seamlessly use such a system without rewriting the application and with minimal knowledge of the underlying architectural details. Assuming that applications perform function calls to computational kernels with available CPU and GPU implementations, our runtime achieves this goal by automatically scheduling the kernels and managing data placement. In particular, it intercepts function calls to well-known computational kernels and schedules them on the CPU or the GPU based on their argument size and location. To improve performance, it defers all data transfers between the CPU and the GPU until they are necessary. By managing data placement transparently to the programmer, it provides a unified memory view despite the underlying separate memory subsystems. We experimentally evaluate our runtime on a heterogeneous platform consisting of a 2.5 GHz quad-core Xeon CPU and an NVIDIA C870 GPU. Given array sorting, parallel reduction, dense and sparse matrix operations, and ranking as computational kernels, we use our runtime to automatically retarget SSI [25], K-means [32], and two synthetic applications to the above platform with no code changes. We find that, in most cases, performance improves if the computation is moved to the data, and not vice versa. For instance, even if a particular instance of a kernel is slower on the GPU than on the CPU, the overall application may be faster if the kernel is nevertheless scheduled on the GPU, especially if the kernel's data already resides in GPU memory due to prior scheduling decisions. Our results show that data-aware CPU/GPU scheduling improves performance by up to 25% over the best data-agnostic scheduling on the same platform.
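The paper itself includes no code; the following self-contained C++ sketch is only a rough illustration of the data-aware dispatch policy the abstract describes. All names (Loc, Buffer, Runtime, dispatch) and all cost constants are hypothetical, and a real runtime would profile kernel and link performance and issue actual PCI-E transfers (e.g., via cudaMemcpy) rather than model them.

#include <cstdio>
#include <cstddef>

enum class Loc { CPU, GPU };

// Tracks where the current valid copy of a kernel argument resides.
struct Buffer {
    std::size_t bytes;
    Loc loc;
};

struct Runtime {
    // Hypothetical cost model: a per-byte PCI-E transfer cost and per-device
    // kernel throughputs; a real runtime would measure these.
    double xfer_cost(const Buffer& b, Loc target) const {
        return (b.loc == target) ? 0.0 : b.bytes * 3e-10;   // ~3.3 GB/s link
    }
    double kernel_cost(std::size_t bytes, Loc dev) const {
        // For this kernel the GPU is assumed slightly slower in isolation.
        return bytes * (dev == Loc::GPU ? 2.5e-10 : 2.0e-10);
    }

    // Data-aware dispatch: choose the device that minimizes kernel time plus
    // any transfer needed to bring the argument to it; the copy stays
    // deferred until a kernel is actually dispatched to the other device.
    Loc dispatch(Buffer& arg) {
        double cpu = kernel_cost(arg.bytes, Loc::CPU) + xfer_cost(arg, Loc::CPU);
        double gpu = kernel_cost(arg.bytes, Loc::GPU) + xfer_cost(arg, Loc::GPU);
        Loc chosen = (gpu < cpu) ? Loc::GPU : Loc::CPU;
        if (arg.loc != chosen) {
            arg.loc = chosen;   // a real runtime would issue cudaMemcpy here
        }
        return chosen;
    }
};

int main() {
    Runtime rt;
    Buffer a{64u << 20, Loc::GPU};   // 64 MB already resident in GPU memory
    // Although the CPU runs this kernel instance faster in isolation,
    // avoiding the PCI-E round trip makes the GPU the better choice.
    Loc d = rt.dispatch(a);
    std::printf("scheduled on %s\n", d == Loc::GPU ? "GPU" : "CPU");
    return 0;
}

With the constants above, the CPU kernel alone would take about 13 ms and the GPU kernel about 17 ms, but moving the 64 MB argument off the GPU would add roughly 20 ms of transfer time, so the dispatcher keeps the computation on the GPU. This is the "move the computation to the data" effect the abstract reports.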
May 10, 2011 by hgpu