Optimizing a Near-duplicate Document Detection System with SIMD Technologies

Xinpan Yuan, Jun Long, Hao Zhang, Zuping Zhang, Weihua Gui
School of Information Science and Engineering, Central South University, Changsha 410083, China
Journal of Computational Information Systems 7: 11 3846-3853, 2011











Although considerable effort has been devoted to duplicate document detection (DDD) and its applications, very few studies address the optimization of its time-consuming functions. An experimental analysis conducted on one million grant proposal documents from nsfc.gov.cn shows that DDD remains quite slow even when clustering and sampling methods are used. By profiling our system with Intel VTune Performance Analyzer, we find that shingle comparison is the most time-consuming part of the system, accounting for 58% of CPU usage. Based on an analysis of the whole non-parallel algorithm and the data statistics, we propose and implement an optimized shingle comparison algorithm using Intel SIMD technology and GPUs. Experiments on Intel CPUs demonstrate performance gains of 11.6% to 38.5%, depending on the SIMD instruction set (SSE/SSE2/SSE4.2) and the parameter settings. Furthermore, our GPU implementation achieves a 170% performance gain. Higher performance could be obtained by combining these two SIMD technologies.
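The abstract does not spell out the optimized comparison routine, so the following is only a minimal sketch of how a shingle comparison might be vectorized with SSE2, not the authors' algorithm. It assumes each document is represented by an array of distinct 32-bit shingle hashes; the function names (count_common_scalar, count_common_simd, shingle_resemblance) and the brute-force matching strategy are hypothetical choices made for illustration. On compilers that do not define __SSE2__, the SIMD routine falls back to the scalar version.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#if defined(__SSE2__)
#include <emmintrin.h> /* SSE2 intrinsics */
#endif

/* Scalar reference: count how many hashes in a[] also occur in b[].
   Assumption: the values within each array are distinct. */
size_t count_common_scalar(const uint32_t *a, size_t na,
                           const uint32_t *b, size_t nb)
{
    size_t common = 0;
    for (size_t i = 0; i < na; i++) {
        for (size_t j = 0; j < nb; j++) {
            if (a[i] == b[j]) { common++; break; }
        }
    }
    return common;
}

/* Hypothetical SSE2 variant: broadcast one hash from a[] and compare it
   against four hashes of b[] per instruction. */
size_t count_common_simd(const uint32_t *a, size_t na,
                         const uint32_t *b, size_t nb)
{
    assert((a != NULL || na == 0) && (b != NULL || nb == 0));
#if defined(__SSE2__)
    size_t common = 0;
    for (size_t i = 0; i < na; i++) {
        __m128i needle = _mm_set1_epi32((int)a[i]);
        size_t j = 0;
        int found = 0;
        /* Full blocks of four 32-bit hashes. */
        for (; j + 4 <= nb && !found; j += 4) {
            __m128i block = _mm_loadu_si128((const __m128i *)(b + j));
            __m128i eq = _mm_cmpeq_epi32(needle, block);
            found = _mm_movemask_epi8(eq) != 0; /* any lane equal? */
        }
        /* Scalar tail for the remaining 0 to 3 hashes. */
        for (; j < nb && !found; j++)
            found = (a[i] == b[j]);
        common += (size_t)found;
    }
    return common;
#else
    return count_common_scalar(a, na, b, nb); /* portable fallback */
#endif
}

/* Jaccard-style resemblance |A and B| / |A or B| over the two shingle sets. */
double shingle_resemblance(const uint32_t *a, size_t na,
                           const uint32_t *b, size_t nb)
{
    size_t common = count_common_simd(a, na, b, nb);
    size_t total = na + nb - common;
    return total == 0 ? 0.0 : (double)common / (double)total;
}
```

Two documents would then be flagged as near-duplicates when shingle_resemblance exceeds a chosen threshold. The threshold, like the hash width, is an assumption of this sketch and is not given in the abstract.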
