A scheduling and runtime framework for a cluster of heterogeneous machines with multiple accelerators
Indian Institute of Technology, Delhi
Indian Institute of Technology, 2014
@article{beri2014scheduling,
title={A scheduling and runtime framework for a cluster of heterogeneous machines with multiple accelerators},
author={Beri, Tarun and Bansal, Sorav and Kumar, Subodh},
year={2014}
}
We present a system that enables simple and intuitive programming of CPU+GPU clusters. This system relieves the programmer of burdens such as load balancing, detailed data communication, task mapping, and scheduling. Our programming model is based on a bulk synchronous distributed shared memory model, which is suitable for heterogeneous multi-GPU clusters, especially for compute-intensive workloads. We report on prototype applications built with our system. For example, parallelizing a sequential version of matrix multiplication or 2D FFT on a cluster requires about 30 additional lines of code. When distributing the multiplication of two square matrices, with 1 billion elements each, across a small cluster with 120 CPU cores and 20 GPUs, our runtime scheduler achieves more than 140x speedup over the single-core CPU implementation; the single-GPU implementation runs out of memory for this experiment. This performance is possible due to a number of challenging optimizations working in concert. These include prefetching, pipelining, maximizing overlap between computation and communication, and scheduling across devices of vastly different capacities.
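To make the idea of scheduling across devices of vastly different capacities concrete, the following is a minimal sketch of one simple approach: statically splitting the rows of a matrix among devices in proportion to an estimated relative throughput. This is an illustration only and not the scheduler described above; the function name `split_rows` and the throughput figures are hypothetical assumptions, and a real runtime would likely also rebalance dynamically.

```python
def split_rows(total_rows, throughputs):
    """Split rows among devices in proportion to their relative throughput.

    Hypothetical helper for illustration; not part of the framework above.
    Returns one half-open (start, end) row range per device, in order.
    """
    if total_rows < 0:
        raise ValueError("total_rows must be non-negative")
    if not throughputs or any(t <= 0 for t in throughputs):
        raise ValueError("throughputs must be a non-empty list of positive numbers")

    total = sum(throughputs)
    ranges = []
    start = 0
    cumulative = 0.0
    for i, t in enumerate(throughputs):
        cumulative += t
        if i == len(throughputs) - 1:
            # The last device takes whatever remains, so every row is covered.
            end = total_rows
        else:
            # Round the cumulative share so boundaries never overlap or leave gaps.
            end = round(total_rows * cumulative / total)
        ranges.append((start, end))
        start = end
    return ranges


if __name__ == "__main__":
    # Assumed relative speeds: two CPU cores (1 each) and one GPU (2).
    for device, (lo, hi) in enumerate(split_rows(100, [1, 1, 2])):
        print(f"device {device}: rows [{lo}, {hi})")
```

Splitting on cumulative shares, rather than rounding each device's share independently, keeps the ranges contiguous and guarantees that they sum exactly to the total row count.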
February 11, 2014 by hgpu