Enabling multiple accelerator acceleration for Java/OpenMP
University of Erlangen-Nuremberg, Computer Science Department, Programming Systems Group, Erlangen, Germany
Proceedings of the 3rd USENIX Conference on Hot Topics in Parallelism, HotPar’11, 2011
@inproceedings{veldema2011enabling,
title={Enabling Multiple Accelerator Acceleration for Java/OpenMP},
author={Veldema, R. and Blass, T. and Philippsen, M.},
booktitle={Proceedings of the 3rd USENIX Conference on Hot Topics in Parallelism, HotPar’11},
year={2011}
}
While using a single GPU is fairly easy, using multiple CPUs and GPUs potentially distributed over multiple machines is hard because data needs to be kept consistent using message exchange and the load needs to be balanced. We propose (1) an array package that provides partitioned and replicated arrays and (2) a compute-device library to abstract from GPUs and CPUs and their location. Our system automatically distributes a parallel-for loop in data-parallel fashion over all the devices. There are three contributions in this paper. First, we provide transparent use of multiple distributed GPUs and CPUs from within Java/OpenMP. Second, we partition arrays according to the compute-devices’ relative performance that is computed from the execution time of a small micro benchmark and a series of small bandwidth tests run at program start. Third, we repartition the arrays dynamically at run-time by increasing or decreasing the number of machines used and by switching from CPUs-only to GPUs-only, to combinations of CPUs and GPUs, and back. With our dynamic device switching we minimize communication while maximizing device use. Our system automatically finds the optimal device sets and achieves a speedup of 5 – 200 on a cluster of 8 machines with 2 GPUs each.
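The performance-proportional partitioning described in the abstract can be illustrated with a minimal Java sketch. It is not the authors' implementation: the class name `ProportionalPartitioner`, the method `partition`, and the example scores are hypothetical, and the sketch assumes each device's relative performance has already been reduced to one non-negative number (for example, derived from a start-up microbenchmark). It only shows how an index range could be split so that faster devices receive proportionally larger chunks.

```java
import java.util.Arrays;

// Hypothetical helper, not part of the system described in the paper.
final class ProportionalPartitioner {

    /**
     * Splits the index range [0, length) into one contiguous chunk per device.
     * Device i receives indices [bounds[i], bounds[i + 1]).
     * relativePerformance[i] is an assumed, precomputed score for device i.
     */
    static int[] partition(int length, double[] relativePerformance) {
        if (length < 0) {
            throw new IllegalArgumentException("length must be non-negative");
        }
        double total = 0.0;
        for (double score : relativePerformance) {
            if (score < 0.0) {
                throw new IllegalArgumentException("scores must be non-negative");
            }
            total += score;
        }
        if (total <= 0.0) {
            throw new IllegalArgumentException("at least one score must be positive");
        }

        int devices = relativePerformance.length;
        int[] bounds = new int[devices + 1];
        double cumulative = 0.0;
        for (int i = 0; i < devices; i++) {
            cumulative += relativePerformance[i];
            // Rounding the cumulative share keeps the boundaries monotonic.
            bounds[i + 1] = (int) Math.round(length * (cumulative / total));
        }
        bounds[devices] = length; // guard against floating-point drift
        return bounds;
    }

    public static void main(String[] args) {
        // Made-up scores: two CPUs and two GPUs assumed to be 4x faster.
        double[] scores = {1.0, 1.0, 4.0, 4.0};
        System.out.println(Arrays.toString(partition(1000, scores)));
    }
}
```

Under these assumptions, repartitioning at run-time would amount to calling `partition` again with an updated score array, for instance after devices are added or removed, and moving only the elements whose owner changed.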
September 9, 2011 by hgpu