Optimizing a Near-duplicate Document Detection System with SIMD Technologies
School of Information Science and Engineering, Central South University, Changsha 410083, China
Journal of Computational Information Systems 7(11): 3846-3853, 2011
@article{yuan2011optimizing,
title={Optimizing a Near-duplicate Document Detection System with SIMD Technologies},
author={YUAN, X. and LONG, J. and ZHANG, H. and ZHANG, Z. and GUI, W.},
journal={Journal of Computational Information Systems},
volume={7},
number={11},
pages={3846--3853},
year={2011}
}
Although considerable effort has been devoted to duplicate document detection (DDD) and its applications, few studies have addressed the optimization of its time-consuming functions. An experimental analysis conducted on one million grant proposal documents from nsfc.gov.cn shows that DDD remains slow even when clustering and sampling methods are used. Profiling our system with the Intel VTune Performance Analyzer, we find that shingle comparison is its most time-consuming part, accounting for 58% of CPU usage. Based on an analysis of the whole non-parallel algorithm and of the data statistics, we propose and implement an optimized shingle comparison algorithm using Intel SIMD technology and GPUs. Experiments on Intel CPUs demonstrate performance gains of 11.6% to 38.5%, depending on the SIMD instruction set (SSE/SSE2/SSE4.2) and parameter settings. Furthermore, our GPU implementation achieves a 170% performance gain. Higher performance could be obtained by combining these two SIMD technologies.
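To make the shingle comparison step concrete, the following is a minimal sketch in C and not the authors' implementation. It assumes that each document is represented as an array of distinct 32-bit shingle hashes and that similarity is measured by the standard Jaccard resemblance; the abstract specifies neither. The function names (`contains_shingle`, `count_common_shingles`, `shingle_resemblance`) are hypothetical. When the compiler targets SSE2, the inner search compares one shingle against four candidates per instruction; otherwise a scalar loop is used.

```c
#include <stddef.h>
#include <stdint.h>

#if defined(__SSE2__)
#include <emmintrin.h>  /* SSE2 intrinsics */
#endif

/* Hypothetical helper: returns 1 if `needle` occurs among the `n`
 * shingle hashes in `hay`, 0 otherwise. */
static int contains_shingle(const uint32_t *hay, size_t n, uint32_t needle)
{
    size_t i = 0;
#if defined(__SSE2__)
    /* Broadcast the needle into all four 32-bit lanes. */
    __m128i key = _mm_set1_epi32((int)needle);
    for (; i + 4 <= n; i += 4) {
        /* Unaligned load of four candidate shingles. */
        __m128i block = _mm_loadu_si128((const __m128i *)(hay + i));
        /* Lane-wise equality: matching lanes become all ones. */
        __m128i eq = _mm_cmpeq_epi32(block, key);
        if (_mm_movemask_epi8(eq) != 0)
            return 1;
    }
#endif
    /* Scalar tail (or the whole loop when SSE2 is unavailable). */
    for (; i < n; i++)
        if (hay[i] == needle)
            return 1;
    return 0;
}

/* Counts shingles of `a` that also appear in `b`.
 * Assumption: both arrays hold distinct values. */
size_t count_common_shingles(const uint32_t *a, size_t na,
                             const uint32_t *b, size_t nb)
{
    size_t common = 0;
    for (size_t i = 0; i < na; i++)
        common += (size_t)contains_shingle(b, nb, a[i]);
    return common;
}

/* Jaccard resemblance |A intersect B| / |A union B| (an assumed measure). */
double shingle_resemblance(const uint32_t *a, size_t na,
                           const uint32_t *b, size_t nb)
{
    if (na == 0 && nb == 0)
        return 1.0;  /* two empty shingle sets are treated as identical */
    size_t common = count_common_shingles(a, na, b, nb);
    return (double)common / (double)(na + nb - common);
}
```

For two documents with shingle sets {1, 2, 3, 4, 5} and {4, 5, 6, 7, 8, 9}, `count_common_shingles` finds 2 shared shingles and `shingle_resemblance` returns 2/9. The all-pairs search is deliberately simple; it only illustrates why the comparison loop lends itself to data-parallel instructions.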
October 24, 2011 by hgpu