high performance computing on graphics processing units: hgpu.org

Optimizing a Near-duplicate Document Detection System with SIMD Technologies

Xinpan Yuan, Jun Long, Hao Zhang, Zuping Zhang, Weihua Gui

School of Information Science and Engineering, Central South University, Changsha 410083, China

Journal of Computational Information Systems 7: 11 3846-3853, 2011

@article{yuan2011optimizing,
  title={Optimizing a Near-duplicate Document Detection System with SIMD Technologies},
  author={Yuan, Xinpan and Long, Jun and Zhang, Hao and Zhang, Zuping and Gui, Weihua},
  journal={Journal of Computational Information Systems},
  volume={7},
  number={11},
  pages={3846--3853},
  year={2011}
}

Although considerable effort has been devoted to duplicate document detection (DDD) and its applications, few studies have addressed the optimization of its time-consuming functions. An experimental analysis of one million grant proposal documents from nsfc.gov.cn shows that DDD remains slow even when clustering and sampling methods are used. Profiling our system with Intel VTune Performance Analyzer shows that shingle comparison is the most time-consuming part, accounting for 58% of CPU usage. Based on an analysis of the whole non-parallel algorithm and the data statistics, we propose and implement an optimized shingle comparison algorithm using Intel SIMD technology and GPUs. Experiments on Intel CPUs demonstrate performance gains of 11.6% to 38.5%, depending on the SIMD instruction set (SSE/SSE2/SSE4.2) and parameter settings. Furthermore, our GPU implementation achieves a 170% performance gain. Higher performance could be obtained by combining these two SIMD technologies.
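The abstract does not spell out how shingles are built or compared, so the following is only a minimal scalar sketch of what a shingle comparison typically looks like, not the authors' implementation. It assumes word-level shingles of width w, hashed to 32-bit values with CRC32, and a Jaccard-style resemblance score; the function names (shingles, count_common, resemblance) and the choice of hash are illustrative assumptions. The merge-style loop in count_common is the kind of element-by-element comparison that SIMD instructions or a GPU could process several values at a time.

```python
import zlib


def shingles(text, w=3):
    """Return the set of 32-bit hashes of all w-word shingles in `text`.

    Word-level shingles and CRC32 hashing are assumptions for this sketch.
    """
    words = text.lower().split()
    hashes = set()
    for i in range(max(len(words) - w + 1, 0)):
        gram = " ".join(words[i:i + w])
        hashes.add(zlib.crc32(gram.encode("utf-8")))
    return hashes


def count_common(a, b):
    """Count values that occur in both sorted lists using a merge-style scan."""
    i = j = common = 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            common += 1
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return common


def resemblance(doc_a, doc_b, w=3):
    """Jaccard-style resemblance: shared shingles divided by distinct shingles."""
    sa = sorted(shingles(doc_a, w))
    sb = sorted(shingles(doc_b, w))
    if not sa and not sb:
        return 1.0  # two documents with no shingles are treated as identical
    common = count_common(sa, sb)
    return common / (len(sa) + len(sb) - common)
```

For example, resemblance("the quick brown fox jumps over the lazy dog", "the quick brown fox jumps over the lazy cat") compares two documents that differ only in their last three-word shingle, so the score is high but below 1. Keeping the hashes in sorted arrays of fixed-width integers is a deliberate choice here: that layout is what makes a data-parallel comparison straightforward.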

Tags: Algorithms, Clustering, Computer science, nVidia, nVidia GeForce 8600 GT, OpenCL, Optimization

October 24, 2011 by hgpu
