high performance computing on graphics processing units: hgpu.org


Learning hash codes for efficient content reuse detection

Qi Zhang, Yan Wu, Zhuoye Ding, Xuanjing Huang

School of Computer Science, Fudan University, 825 Zhangheng Road, Shanghai, P.R.China

35th international ACM SIGIR conference on Research and development in information retrieval (SIGIR ’12), 2012

DOI:10.1145/2348283.2348339

@inproceedings{zhang2012learning,
  title={Learning hash codes for efficient content reuse detection},
  author={Zhang, Q. and Wu, Y. and Ding, Z. and Huang, X.},
  booktitle={Proceedings of the 35th international ACM SIGIR conference on Research and development in information retrieval},
  pages={405--414},
  year={2012},
  organization={ACM}
}


Content reuse is extremely common in user-generated media, and reuse detection serves as the basis for many applications. However, with the explosive growth of the Internet and the continuously growing use of user-generated media, the task has become both more critical and more difficult. In this paper, we present a novel, efficient, and scalable approach to detecting content reuse. We propose a new signature generation algorithm based on learned hash functions for words. To handle tens of billions of documents, we implement the detection approach on graphics processing units (GPUs). The experimental comparison covers the efficiency and effectiveness of the proposed approach on different types of document collections, including ClueWeb09 and Tweets2011, among others. Experimental results show that the proposed approach achieves the same detection rates as state-of-the-art systems while using significantly less execution time (a 400X to 1500X speedup).
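The abstract does not say how word-level hash codes are combined into a document signature, so the following is only a rough sketch of the general idea, not the authors' algorithm. It assumes each word maps to a fixed-width bit code and swaps in a SimHash-style per-bit majority vote to form the signature. The names `word_code`, `signature`, `hamming`, and the `learned_codes` dictionary (a stand-in for learned hash functions) are all hypothetical, and the code runs on the CPU rather than a GPU.

```python
import hashlib

BITS = 32  # signature width; an arbitrary choice for this sketch


def word_code(word, learned_codes=None):
    """Return a BITS-wide integer code for a word.

    `learned_codes` is a hypothetical dict standing in for learned hash
    functions; words it does not cover fall back to an MD5-derived code.
    """
    if learned_codes and word in learned_codes:
        return learned_codes[word]
    digest = hashlib.md5(word.encode("utf-8")).digest()
    return int.from_bytes(digest[:BITS // 8], "big")


def signature(text, learned_codes=None):
    """Combine word codes by per-bit majority vote (SimHash-style assumption)."""
    votes = [0] * BITS
    for word in text.lower().split():
        code = word_code(word, learned_codes)
        for bit in range(BITS):
            votes[bit] += 1 if (code >> bit) & 1 else -1
    # A bit is set in the signature when more words set it than cleared it.
    return sum(1 << bit for bit in range(BITS) if votes[bit] > 0)


def hamming(a, b):
    """Number of differing bits between two signatures."""
    return bin(a ^ b).count("1")
```

Under these assumptions, two documents would be flagged as candidate reuse when the Hamming distance between their signatures falls below a chosen threshold. Because each signature is a single integer, the pairwise comparison is the kind of uniform, data-parallel work that suits a GPU.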

Tags: Algorithms, Computer science, CUDA, Hashing, Information Retrieval, nVidia, nVidia Quadro FX 4000, Security

October 8, 2012 by hgpu
