Document Stream Clustering using GPUs

Michael J. Szaszy, Hanan Samet
Department of Computer Science, University of Maryland, College Park, MD 20742
University of Maryland, 2013








The Web constantly generates streams of textual information in the form of news articles and tweets. For Information Retrieval systems to make sense of all this data, partitional clustering algorithms are used to create groups of similar documents. Traditional clustering algorithms, like K-means, are not well suited to stream processing, where the dataset is constantly changing as new documents are published. In this paper we present a clustering algorithm designed to work with streaming documents. These documents, described by their TF-IDF (term frequency – inverse document frequency) [15] term vectors, are incrementally assigned to appropriate clusters based on the cosine similarity metric. We provide an efficient implementation of this algorithm on a GPU using CUDA that achieves speedups of over 43X compared to its serial CPU implementation and can cluster a document within 0.01 seconds of receiving its term vector, even when there are 1.6 million clusters. Our implementation scales to clustering 5.5 million documents in 16.1 hours on a single GTX 480 GPU and can easily be extended to run on a system containing large numbers of GPUs.
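The incremental assignment described in the abstract can be illustrated with a minimal serial sketch in Python. It is not the authors' implementation: the abstract does not say how a new cluster is created or how a cluster is represented, so the similarity `threshold`, the summed-vector centroid, and the names `cosine` and `StreamClusterer` are all assumptions made for illustration. The loop over clusters in `add` is the kind of per-document comparison that a GPU implementation could plausibly run in parallel.

```python
import math


def cosine(a, b):
    """Cosine similarity of two sparse vectors stored as {term: weight} dicts."""
    if len(a) > len(b):
        a, b = b, a  # iterate over the shorter vector
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


class StreamClusterer:
    """Hypothetical single-pass clusterer; threshold rule is an assumption."""

    def __init__(self, threshold=0.5):
        self.threshold = threshold  # assumed minimum similarity to join a cluster
        self.centroids = []         # one summed term vector per cluster
        self.sizes = []             # number of documents per cluster

    def add(self, vec):
        """Assign one TF-IDF term vector to a cluster and return its index."""
        best, best_sim = -1, 0.0
        # Compare the new document against every existing cluster.
        for i, centroid in enumerate(self.centroids):
            sim = cosine(vec, centroid)
            if sim > best_sim:
                best, best_sim = i, sim
        if best >= 0 and best_sim >= self.threshold:
            # Fold the document into the cluster. Cosine similarity ignores
            # vector length, so the sum points the same way as the mean.
            centroid = self.centroids[best]
            for term, weight in vec.items():
                centroid[term] = centroid.get(term, 0.0) + weight
            self.sizes[best] += 1
            return best
        # No cluster is similar enough: start a new one (assumed behavior).
        self.centroids.append(dict(vec))
        self.sizes.append(1)
        return len(self.centroids) - 1
```

For example, feeding the vectors `{"gpu": 1.0, "cuda": 1.0}` and `{"gpu": 1.0, "cuda": 0.8}` to a `StreamClusterer(threshold=0.5)` places both in the same cluster, while `{"election": 1.0, "vote": 1.0}` shares no terms with it and starts a second cluster.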