RDMA-Based Job Migration Framework for MPI over InfiniBand
Department of Computer Science and Engineering, The Ohio State University
IEEE International Conference on Cluster Computing, 2010, pp. 116-125
@conference{ouyang2010rdma,
title={RDMA-Based Job Migration Framework for MPI over InfiniBand},
author={Ouyang, X. and Marcarelli, S. and Rajachandrasekar, R. and Panda, D.K.},
booktitle={2010 IEEE International Conference on Cluster Computing},
pages={116--125},
year={2010},
organization={IEEE}
}
Coordinated checkpoint and recovery is a common approach to achieving fault tolerance on large-scale systems. The traditional mechanism dumps the process images of all the processes involved in the parallel job to a local disk or a central storage area. When a failure occurs, the processes are restarted and restored from the latest checkpoint image. However, this kind of approach is unable to provide the scalability required by increasingly large-sized jobs, since it puts a heavy I/O burden on the storage subsystem, and resubmitting a job during the restart phase incurs a lengthy queuing delay. In this paper, we enhance the fault tolerance of MVAPICH2 [1], an open-source high-performance MPI-2 implementation, by using a proactive job migration scheme. Instead of checkpointing all the processes of the job and saving their process images to stable storage, we transfer the processes running on a health-deteriorating node to a healthy spare node, and resume these processes from the spare node. RDMA-based process image transmission is designed to take advantage of high-performance communication in InfiniBand. Experimental results show that the Job Migration scheme can achieve a speedup of 4.49 times over the Checkpoint/Restart scheme in handling a node failure for a 64-process application running on 8 compute nodes. To the best of our knowledge, this is the first such job migration design for InfiniBand-based clusters.
March 17, 2011 by hgpu