Parallel data mining algorithms for multi-dimensional points on GPUs

Takazumi Matsumoto
The Hong Kong Polytechnic University
The Hong Kong Polytechnic University, 2015


   title={Parallel data mining algorithms for multi-dimensional points on GPUs},

   author={Matsumoto, Takazumi},


   school={The Hong Kong Polytechnic University}


Download Download (PDF)   View View   Source Source   



Data mining tasks such as clustering, outlier detection and similarity search typically employ a series of algorithms to operate on a large set of data, making them amenable to parallelization. Thus parallelization of data mining operations such as distance computation has been extensively studied in the literature. In recent years, the use of Graphics Processing Units (GPUs) for data mining has been steadily increasing. As state-of-the-art processors now include both CPU and GPU, it is important to consider which approaches benefit from GPU processing and which do not, and apply a heterogeneous processing approach to improve the efficiency when applicable. Similarity search, also referred to as k-nearest neighbor search, is a particular application of distance computation where only to top k results (e.g., top 10) are required. It is used extensively in multimedia search, where only a small subset of possible results are used, and numerous approaches using both exact and approximate k-nearest neighbor methods have been proposed over the years. Our contribution is a new exact distance algorithm that not only outperforms competing exact GPU methods, but can compete with leading approximate GPU methods. It can also leverage heterogeneous (CPU-GPU) processing for improved efficiency. Outlier detection, also known as anomaly detection, involves detecting data points that deviate from expected patterns. It is an important part of applications such as fraud detection and system fault detection. In many real-world applications such as wireless sensor readings and weather forecasts, data are uncertain in nature. Rather than discarding uncertainty information or storing many sample readings, it is possible to instead approximate the probability distribution function (pdf) to compactly store and efficiently calculate uncertain distance when required. Existing GPU approaches are limited to working with certain data, while our contribution is a parallel method for outlier detection on uncertain data. Clustering is another common operation in data mining. Our work focuses on a particular application, label generation for web search results that automatically generates labels for related topics returned by a user’s web search. This is accomplished using a fuzzy transduction-based relational model. We also contribute a GPU implementation of our proposed solution.
No votes yet.
Please wait...

* * *

* * *

HGPU group © 2010-2024 hgpu.org

All rights belong to the respective authors

Contact us: