Mr. Scan: Extreme Scale Density-Based Clustering using a Tree-Based Network of GPGPU Nodes
Computer Sciences Department, University of Wisconsin, Madison, WI 53706
University of Wisconsin, 2013
@article{welton2013mr,
title={Mr. Scan: Extreme Scale Density-Based Clustering using a Tree-Based Network of GPGPU Nodes},
author={Welton, Benjamin and Samanas, Evan and Miller, Barton P},
year={2013}
}
Density-based clustering algorithms are a widely used class of data mining techniques that can find irregularly shaped clusters and cluster data without prior knowledge of the number of clusters it contains. DBSCAN is the best-known density-based clustering algorithm. We introduce our version of DBSCAN, called Mr. Scan, which uses a hybrid parallel implementation that combines the MRNet tree-based distribution network with GPGPU-equipped nodes. This design allows Mr. Scan to efficiently and accurately cluster multi-billion point datasets. Mr. Scan avoids the problems of existing implementations by effectively partitioning the point space and by optimizing DBSCAN’s computation over dense data regions. We tested Mr. Scan on a geolocated Twitter dataset. At its largest scale, Mr. Scan clustered 6.5 billion points from the Twitter dataset on 8,192 GPU nodes on Cray Titan in 17.3 minutes. All other parallel DBSCAN implementations have demonstrated the ability to cluster only up to 100 million points.
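For readers unfamiliar with DBSCAN, the following is a minimal sketch of the textbook sequential algorithm that the abstract refers to. It is not Mr. Scan's MRNet/GPGPU implementation. The function names `dbscan` and `region_query` are illustrative, the parameters `eps` (neighborhood radius) and `min_pts` (minimum neighbors for a core point) follow the usual DBSCAN convention, and the brute-force neighbor search is assumed here only for brevity.

```python
from collections import deque
from math import dist  # Euclidean distance, Python 3.8+

NOISE = -1        # label for points that belong to no cluster
UNVISITED = None  # label for points not yet examined


def region_query(points, i, eps):
    """Return indices of all points within eps of points[i], including i itself."""
    return [j for j, q in enumerate(points) if dist(points[i], q) <= eps]


def dbscan(points, eps, min_pts):
    """Label each point with a cluster id (0, 1, ...) or NOISE.

    Brute-force O(n^2) sketch of sequential DBSCAN for illustration only.
    """
    labels = [UNVISITED] * len(points)
    cluster_id = 0
    for i in range(len(points)):
        if labels[i] is not UNVISITED:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:
            # Not a core point; may later be relabeled as a border point.
            labels[i] = NOISE
            continue
        # points[i] is a core point: start a new cluster and expand it.
        labels[i] = cluster_id
        queue = deque(neighbors)
        while queue:
            j = queue.popleft()
            if labels[j] == NOISE:
                # Previously marked noise, but reachable from a core point:
                # it is a border point of this cluster and is not expanded.
                labels[j] = cluster_id
                continue
            if labels[j] is not UNVISITED:
                continue
            labels[j] = cluster_id
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:
                # points[j] is also a core point, so its neighbors join too.
                queue.extend(j_neighbors)
        cluster_id += 1
    return labels
```

The number of clusters is never passed in; it falls out of the density parameters, which is the property the abstract highlights. The quadratic neighbor search in this sketch is the kind of cost that becomes prohibitive at scale, which is why large-scale implementations partition the point space.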
April 30, 2013 by hgpu