Distributed-Shared CUDA: Virtualization of Large-Scale GPU Systems for Programmability and Reliability
Department of Mechanical Engineering, Keio University, Yokohama, Japan
The Fourth International Conference on Future Computational Technologies and Applications (FUTURE COMPUTING), 2012
@inproceedings{kawai2012distributed,
title={Distributed-Shared CUDA: Virtualization of Large-Scale GPU Systems for Programmability and Reliability},
author={Kawai, A. and Yasuoka, K. and Yoshikawa, K. and Narumi, T.},
booktitle={FUTURE COMPUTING 2012, The Fourth International Conference on Future Computational Technologies and Applications},
pages={7--12},
year={2012}
}
One of the difficulties for current GPGPU (General-Purpose computing on Graphics Processing Units) users is writing code that uses multiple GPUs. One limiting factor is that only a few GPUs can be attached to a single PC, so MPI (Message Passing Interface) is the common tool for using tens of GPUs or more. However, MPI-based parallel code is sometimes more complicated than a serial one. In this paper, we propose DS-CUDA (Distributed-Shared Compute Unified Device Architecture), a middleware that simplifies the development of code that uses multiple GPUs distributed over a network. DS-CUDA provides a global view of GPUs at the source-code level. It virtualizes a cluster of GPU-equipped PCs so that it appears as a single PC with many GPUs. It also provides an automated redundant calculation mechanism to enhance the reliability of GPUs. The performance of Monte Carlo and many-body simulations is measured on a 22-node (64-GPU) fraction of the TSUBAME 2.0 supercomputer. The results indicate that DS-CUDA is a practical solution for using tens of GPUs or more.
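The abstract mentions an automated redundant calculation mechanism but does not say how it is implemented. The following is a minimal, generic sketch of the idea, assuming the same calculation is run on two devices and repeated when the results disagree. It is not DS-CUDA code: `SimulatedDevice`, `sum_of_squares`, and `run_redundantly` are hypothetical names, and the devices are plain Python objects rather than GPUs.

```python
class SimulatedDevice:
    """Hypothetical stand-in for a GPU; not part of DS-CUDA."""

    def __init__(self, name, faults=0):
        self.name = name
        # Number of upcoming calls that will return a corrupted result.
        self.faults = faults

    def sum_of_squares(self, values):
        result = sum(v * v for v in values)
        if self.faults > 0:
            self.faults -= 1
            return result + 1  # emulate a silent computation error
        return result


def run_redundantly(task, primary, secondary, max_retries=3):
    """Run `task` on two devices and accept the result only if both agree.

    Assumption: a mismatch is handled by repeating the calculation on
    both devices, up to `max_retries` additional attempts.
    """
    for _ in range(max_retries + 1):
        first = task(primary)
        second = task(secondary)
        if first == second:
            return first
    raise RuntimeError("results still disagree after retries")


if __name__ == "__main__":
    data = [1, 2, 3]
    gpu0 = SimulatedDevice("gpu0")
    gpu1 = SimulatedDevice("gpu1", faults=1)  # first call is corrupted
    print(run_redundantly(lambda d: d.sum_of_squares(data), gpu0, gpu1))
```

In this sketch the first attempt disagrees because the second simulated device returns a corrupted value, so the calculation is repeated and the matching result is returned. Comparing two runs detects an error but cannot tell which device was wrong, which is why both runs are repeated.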
July 29, 2012 by hgpu