Parallel Compression Checkpointing for Socket-Level Heterogeneous Systems
School of Computer Science, National University of Defense Technology, Changsha 410073, China
13th IEEE International Conference on High Performance Computing and Communications (HPCC-2011), pp. 468-476, 2011
@article{liu2011parallel,
title={Parallel Compression Checkpointing for Socket-Level Heterogeneous Systems},
author={LIU, Y. and ZHU, H. and LIU, Y. and WANG, F. and FAN, B.},
year={2011}
}
Checkpointing is an effective fault-tolerance technique for improving the reliability of large-scale parallel computing systems. However, checkpointing causes a large number of compute nodes to store a huge amount of data to the file system simultaneously. This not only requires a large amount of storage space to hold the system state, but also puts tremendous pressure on the communication network and I/O subsystem, because a massive number of accesses is concentrated in a short period of time. Data compression can reduce the size of the checkpoint data that must be saved to the file system and sent over the communication network. However, compression incurs a large time overhead, especially in large-scale parallel systems, which is the main technical barrier to its practical use. In this paper, we propose a parallel compression checkpointing technique that reduces this time overhead on socket-level heterogeneous architectures. It integrates several parallel processing techniques, including transferring checkpoint data between the CPU, the GPU, and the file system in double-buffered pipelines, aggregating file write operations, and running a SIMD parallel compression algorithm on the GPU. The paper also reports an implementation of the technique on the Tianhe-1 supercomputer and evaluation experiments on that system. The experimental data show that the technique is efficient and practical.
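To make the pipeline idea in the abstract concrete, the following is a minimal CPU-only sketch, not the authors' implementation. It assumes that a `zlib` stream can stand in for the GPU compressor and that two-slot queues can stand in for the double buffers; the names `checkpoint`, `CHUNK_SIZE`, and `AGGREGATE_SIZE` are hypothetical and chosen only for illustration.

```python
import io
import queue
import threading
import zlib

CHUNK_SIZE = 64 * 1024        # hypothetical size of one pipeline buffer
AGGREGATE_SIZE = 256 * 1024   # hypothetical threshold for one aggregated write


def checkpoint(data: bytes, sink) -> int:
    """Compress `data` in a three-stage pipeline and write it to `sink`.

    Returns the number of compressed bytes written.
    """
    # Two slots per queue: while one buffer is being consumed by the next
    # stage, the other can be filled (the double-buffering assumption).
    staged = queue.Queue(maxsize=2)
    compressed = queue.Queue(maxsize=2)
    written = [0]

    def compress_stage():
        # Stand-in for the GPU compressor: a single zlib stream fed chunk by chunk.
        comp = zlib.compressobj()
        while True:
            chunk = staged.get()
            if chunk is None:              # end-of-data sentinel
                compressed.put(comp.flush())
                compressed.put(None)
                return
            compressed.put(comp.compress(chunk))

    def write_stage():
        # Aggregate small compressed blocks into fewer, larger writes.
        pending = bytearray()
        while True:
            block = compressed.get()
            if block is None:
                break
            pending += block
            if len(pending) >= AGGREGATE_SIZE:
                sink.write(bytes(pending))
                written[0] += len(pending)
                pending.clear()
        if pending:                        # write whatever is left over
            sink.write(bytes(pending))
            written[0] += len(pending)

    workers = [threading.Thread(target=compress_stage),
               threading.Thread(target=write_stage)]
    for w in workers:
        w.start()

    # Stage 1 runs in the calling thread: slice the state into buffers.
    for offset in range(0, len(data), CHUNK_SIZE):
        staged.put(data[offset:offset + CHUNK_SIZE])
    staged.put(None)

    for w in workers:
        w.join()
    return written[0]


if __name__ == "__main__":
    state = b"checkpoint state " * 50000
    out = io.BytesIO()
    print(checkpoint(state, out), "compressed bytes for", len(state), "input bytes")
```

The three stages overlap because each one blocks only when its neighbouring queue is full or empty. In a GPU setting the same structure would presumably map to asynchronous host-device copies and a compression kernel, but that mapping is an assumption here and is not taken from the paper.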
October 22, 2011 by hgpu