
Data-aware scheduling of legacy kernels on heterogeneous platforms with distributed memory

Michela Becchi, Surendra Byna, Srihari Cadambi, Srimat Chakradhar
NEC Laboratories America, Inc., Princeton, NJ, USA
In Proceedings of the 22nd ACM Symposium on Parallelism in Algorithms and Architectures (2010), pp. 82–91.

@conference{becchi2010data,
   title={Data-aware scheduling of legacy kernels on heterogeneous platforms with distributed memory},
   author={Becchi, M. and Byna, S. and Cadambi, S. and Chakradhar, S.},
   booktitle={Proceedings of the 22nd ACM Symposium on Parallelism in Algorithms and Architectures},
   pages={82--91},
   year={2010},
   organization={ACM}
}

In this paper, we describe a runtime that automatically enhances the performance of applications running on heterogeneous platforms consisting of a multi-core CPU and a throughput-oriented many-core GPU. The CPU and GPU are connected by a non-coherent interconnect such as PCI-E and therefore do not share memory. Heterogeneous platforms available today, such as [9], are of this type. Our goal is to let the programmer use such a system seamlessly, without rewriting the application and with minimal knowledge of the underlying architectural details. Assuming that applications perform function calls to computational kernels with available CPU and GPU implementations, our runtime achieves this goal by automatically scheduling the kernels and managing data placement. In particular, it intercepts function calls to well-known computational kernels and schedules them on the CPU or GPU based on their argument size and location. To improve performance, it defers all data transfers between the CPU and the GPU until they are necessary. By managing data placement transparently to the programmer, it provides a unified memory view despite the underlying separate memory sub-systems. We experimentally evaluate our runtime on a heterogeneous platform consisting of a 2.5 GHz quad-core Xeon CPU and an NVIDIA C870 GPU. Given array sorting, parallel reduction, dense and sparse matrix operations, and ranking as computational kernels, we use our runtime to automatically retarget SSI [25], K-means [32], and two synthetic applications to the above platform with no code changes. We find that, in most cases, performance improves if the computation is moved to the data, and not vice versa. For instance, even if a particular instance of a kernel is slower on the GPU than on the CPU, the overall application may be faster if the kernel is scheduled on the GPU anyway, especially if the kernel data is already located in GPU memory due to prior scheduling decisions. Our results show that data-aware CPU/GPU scheduling improves performance by up to 25% over the best data-agnostic scheduling on the same platform.
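
The core mechanism the abstract describes (intercept a kernel call, weigh each device's compute cost against the PCI-E transfer cost implied by where the arguments currently reside, and defer transfers until the chosen device needs the data) can be sketched as follows. This is a minimal illustrative C++ sketch, not the authors' runtime: all names (Buffer, KernelVariants, choose_device, schedule_kernel) and the cost constants are hypothetical stand-ins for the measured costs a real implementation would use.

// Minimal sketch of the data-aware scheduling policy described above.
// Everything here is hypothetical: the names and the cost constants stand
// in for the measured execution and PCI-E transfer costs a real runtime
// would use.
#include <cstddef>

enum class Device { CPU, GPU };

// Per-argument bookkeeping: which memory holds the up-to-date copy.
struct Buffer {
    std::size_t bytes;
    Device location;
};

// A computational kernel exposed with both CPU and GPU implementations.
struct KernelVariants {
    void (*cpu_impl)(Buffer&);
    void (*gpu_impl)(Buffer&);
};

// Placeholder cost model: per-byte compute estimates plus a PCI-E latency
// and bandwidth term for transfers (illustrative constants only).
static double est_exec_time(Device d, std::size_t bytes) {
    return static_cast<double>(bytes) * (d == Device::GPU ? 0.1e-9 : 0.5e-9);
}
static double est_transfer_time(std::size_t bytes) {
    return 1e-5 + static_cast<double>(bytes) * 0.3e-9;
}

// Data-aware policy: each device's cost includes the transfer needed if
// the argument currently lives on the other device, so computation tends
// to move to the data rather than the data to the computation.
static Device choose_device(const Buffer& b) {
    double cpu = est_exec_time(Device::CPU, b.bytes)
               + (b.location == Device::GPU ? est_transfer_time(b.bytes) : 0.0);
    double gpu = est_exec_time(Device::GPU, b.bytes)
               + (b.location == Device::CPU ? est_transfer_time(b.bytes) : 0.0);
    return gpu < cpu ? Device::GPU : Device::CPU;
}

// Interception point for a kernel call: the transfer is deferred until the
// chosen device actually needs the data (a cudaMemcpy in a CUDA backend).
void schedule_kernel(const KernelVariants& k, Buffer& b) {
    Device target = choose_device(b);
    if (b.location != target) {
        b.location = target;  // lazy transfer happens only here
    }
    (target == Device::GPU ? k.gpu_impl : k.cpu_impl)(b);
}

The point to notice is in choose_device: a kernel instance whose GPU variant is slower in isolation can still be scheduled on the GPU, because moving its arguments back to the CPU would cost more than the slowdown, which is exactly the data-aware effect the abstract quantifies.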