Learning hash codes for efficient content reuse detection

Qi Zhang, Yan Wu, Zhuoye Ding, Xuanjing Huang
School of Computer Science, Fudan University, 825 Zhangheng Road, Shanghai, P.R. China
35th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR ’12), 2012

@inproceedings{zhang2012learning,
   title={Learning hash codes for efficient content reuse detection},
   author={Zhang, Q. and Wu, Y. and Ding, Z. and Huang, X.},
   booktitle={Proceedings of the 35th international ACM SIGIR conference on Research and development in information retrieval},
   pages={405--414},
   year={2012},
   organization={ACM}
}


Content reuse is extremely common in user-generated media. Reuse detection serves as the basis for many applications. However, with the explosive growth of the Internet and the continuously growing use of user-generated media, the task becomes more critical and more difficult. In this paper, we present a novel, efficient, and scalable approach to detecting content reuse. We propose a new signature generation algorithm based on learned hash functions for words. To deal with tens of billions of documents, we implement the detection approach on graphics processing units (GPUs). The experimental comparison covers the efficiency and effectiveness of the proposed approach on different types of document collections, including ClueWeb09 and Tweets2011, among others. Experimental results show that the proposed approach achieves the same detection rates as state-of-the-art systems while using significantly less execution time (a 400X to 1500X speedup).
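
The abstract does not spell out the signature algorithm, so the following is only a rough illustration of the general idea of building a document signature from per-word hash codes and comparing signatures by Hamming distance. It is not the authors' method: the learned hash functions are replaced here by a fixed BLAKE2 hash, the aggregation is a SimHash-style bitwise majority vote, and the names `word_code`, `signature`, and `hamming` are hypothetical. It runs on the CPU and makes no attempt to reproduce the GPU implementation.

```python
import hashlib

BITS = 64  # assumed signature width; the abstract does not give one


def word_code(word: str) -> int:
    """Stand-in for a learned per-word hash function (assumption):
    a fixed 64-bit BLAKE2 hash of the word."""
    digest = hashlib.blake2b(word.encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def signature(text: str) -> int:
    """SimHash-style signature: each word votes +1 or -1 on every bit,
    and the signature keeps the bits with a positive total."""
    counts = [0] * BITS
    for word in text.lower().split():
        code = word_code(word)
        for bit in range(BITS):
            counts[bit] += 1 if (code >> bit) & 1 else -1
    sig = 0
    for bit, total in enumerate(counts):
        if total > 0:
            sig |= 1 << bit
    return sig


def hamming(a: int, b: int) -> int:
    """Number of bit positions in which two signatures differ."""
    return bin(a ^ b).count("1")
```

Under this sketch, two documents would be flagged as candidate reuse when the Hamming distance between their signatures falls below a chosen threshold; the threshold itself would have to be tuned per collection.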
