high performance computing on graphics processing units: hgpu.org


Learning hash codes for efficient content reuse detection

Qi Zhang, Yan Wu, Zhuoye Ding, Xuanjing Huang

School of Computer Science, Fudan University, 825 Zhangheng Road, Shanghai, P.R.China

35th international ACM SIGIR conference on Research and development in information retrieval (SIGIR ’12), 2012

DOI:10.1145/2348283.2348339

@inproceedings{zhang2012learning,
  title={Learning hash codes for efficient content reuse detection},
  author={Zhang, Q. and Wu, Y. and Ding, Z. and Huang, X.},
  booktitle={Proceedings of the 35th international ACM SIGIR conference on Research and development in information retrieval},
  pages={405--414},
  year={2012},
  organization={ACM}
}


Content reuse is extremely common in user-generated media, and reuse detection serves as the basis for many applications. However, with the explosive growth of the Internet and the continuously growing use of user-generated media, the task has become both more critical and more difficult. In this paper, we present a novel, efficient, and scalable approach to detecting content reuse. We propose a new signature generation algorithm based on learned hash functions for words. To handle tens of billions of documents, we implement the detection approach on graphics processing units (GPUs). The experimental comparison covers both the efficiency and the effectiveness of the proposed approach on different types of document collections, including ClueWeb09 and Tweets2011, among others. Experimental results show that the proposed approach achieves the same detection rates as state-of-the-art systems while using significantly less execution time (a 400X to 1500X speedup).
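
The abstract does not spell out how the learned word hash functions work, so the following is only a rough sketch of the general idea of building a document signature from per-word hash codes. It uses a SimHash-style bit vote, and a fixed BLAKE2b hash stands in for the learned word codes; this is an assumption for illustration, not the authors' algorithm. The names `word_code`, `signature`, and `hamming`, and the 64-bit width, are hypothetical.

```python
import hashlib

BITS = 64  # signature width; an assumption, not taken from the paper


def word_code(word: str) -> int:
    """Stand-in for a learned per-word hash code (here: 64 bits of BLAKE2b)."""
    digest = hashlib.blake2b(word.encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def signature(text: str) -> int:
    """SimHash-style signature: each word votes +1/-1 on every bit position."""
    votes = [0] * BITS
    for word in text.lower().split():
        code = word_code(word)
        for bit in range(BITS):
            votes[bit] += 1 if (code >> bit) & 1 else -1
    sig = 0
    for bit, total in enumerate(votes):
        if total > 0:  # set the bit when a majority of words had it set
            sig |= 1 << bit
    return sig


def hamming(a: int, b: int) -> int:
    """Number of bit positions in which two signatures differ."""
    return bin(a ^ b).count("1")


if __name__ == "__main__":
    original = "content reuse is extremely common in user generated media"
    reused = "content reuse is very common in user generated media"
    print(hamming(signature(original), signature(reused)))
```

Under this kind of scheme, documents whose signatures differ in only a few bits are treated as reuse candidates. Because each signature can be computed and compared independently of the others, the work maps naturally onto many parallel threads, which is one plausible reason a GPU implementation helps at this scale.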

Tags: Algorithms, Computer science, CUDA, Hashing, Information Retrieval, nVidia, nVidia Quadro FX 4000, Security

October 8, 2012 by hgpu


