Achieving a single compute device image in OpenCL for multiple GPUs
Seoul National University, Seoul, South Korea
In Proceedings of the 16th ACM symposium on Principles and practice of parallel programming (2011), pp. 277-288
@conference{kim2011achieving,
title={Achieving a single compute device image in OpenCL for multiple GPUs},
author={Kim, J. and Kim, H. and Lee, J.H. and Lee, J.},
booktitle={Proceedings of the 16th ACM symposium on Principles and practice of parallel programming},
pages={277–288},
year={2011},
organization={ACM}
}
In this paper, we propose an OpenCL framework that combines multiple GPUs and treats them as a single compute device. Providing a single virtual compute device image to the user makes an OpenCL application written for a single GPU portable to a platform that has multiple GPU devices. It also lets the application exploit the full computing power of the multiple GPU devices and the total amount of GPU memory available in the platform. Our OpenCL framework automatically distributes, at run time, the OpenCL kernel written for a single GPU into multiple CUDA kernels that execute on the multiple GPU devices. It applies a run-time memory access range analysis to the kernel by performing a sampling run and identifies an optimal workload distribution for the kernel. To achieve a single compute device image, the runtime maintains a virtual device memory that is allocated in the main memory. The OpenCL runtime treats this memory as if it were the memory of a single GPU device and keeps it consistent with the memories of the multiple GPU devices. Our OpenCL-C-to-C translator generates the sampling code from the OpenCL kernel code, and our OpenCL-C-to-CUDA-C translator generates the CUDA kernel code for the distributed OpenCL kernel. We show the effectiveness of our OpenCL framework by implementing the OpenCL runtime and the two source-to-source translators. We evaluate its performance on a system that contains 8 GPUs using 11 OpenCL benchmark applications.
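For context, the kind of host code the framework targets is an ordinary single-device OpenCL program such as the minimal sketch below. The kernel name, buffer sizes, and omitted error handling are illustrative assumptions, not taken from the paper; the point is that, under the proposed runtime, code like this that requests a single GPU device would transparently see the multiple GPUs as one compute device.

/* Minimal single-GPU OpenCL host sketch (illustrative; "vec_add" and the
 * sizes are assumptions). Under the proposed framework this unmodified
 * program would run across all GPUs in the platform. */
#include <CL/cl.h>
#include <stdio.h>
#include <stdlib.h>

static const char *src =
    "__kernel void vec_add(__global const float *a,\n"
    "                      __global const float *b,\n"
    "                      __global float *c) {\n"
    "    size_t i = get_global_id(0);\n"
    "    c[i] = a[i] + b[i];\n"
    "}\n";

int main(void) {
    enum { N = 1 << 20 };
    size_t bytes = N * sizeof(float), gsize = N;
    float *a = malloc(bytes), *b = malloc(bytes), *c = malloc(bytes);
    for (int i = 0; i < N; ++i) { a[i] = (float)i; b[i] = 2.0f * i; }

    cl_platform_id plat; cl_device_id dev; cl_int err;
    clGetPlatformIDs(1, &plat, NULL);
    /* The application asks for one GPU device; the runtime decides how to
     * map that single device image onto the actual hardware. */
    clGetDeviceIDs(plat, CL_DEVICE_TYPE_GPU, 1, &dev, NULL);
    cl_context ctx = clCreateContext(NULL, 1, &dev, NULL, NULL, &err);
    cl_command_queue q = clCreateCommandQueue(ctx, dev, 0, &err);

    cl_mem da = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR, bytes, a, &err);
    cl_mem db = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR, bytes, b, &err);
    cl_mem dc = clCreateBuffer(ctx, CL_MEM_WRITE_ONLY, bytes, NULL, &err);

    cl_program prog = clCreateProgramWithSource(ctx, 1, &src, NULL, &err);
    clBuildProgram(prog, 1, &dev, NULL, NULL, NULL);
    cl_kernel k = clCreateKernel(prog, "vec_add", &err);
    clSetKernelArg(k, 0, sizeof(cl_mem), &da);
    clSetKernelArg(k, 1, sizeof(cl_mem), &db);
    clSetKernelArg(k, 2, sizeof(cl_mem), &dc);

    /* A single NDRange launch written for one GPU; a multi-GPU runtime such
     * as the one proposed could split the work-groups behind this call. */
    clEnqueueNDRangeKernel(q, k, 1, NULL, &gsize, NULL, 0, NULL, NULL);
    clEnqueueReadBuffer(q, dc, CL_TRUE, 0, bytes, c, 0, NULL, NULL);

    printf("c[100] = %f\n", c[100]);
    /* Resource releases omitted for brevity. */
    return 0;
}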
February 25, 2011 by hgpu