Parallel Hierarchical Clustering on the GPU
Leopold-Franzens-University Innsbruck
Leopold-Franzens-University Innsbruck, 2013
@article{reiterer2014hierarchical,
title={Hierarchical Clustering on the GPU},
author={Reiterer, Ursula},
year={2014}
}
Clustering is a basic task in exploratory data analysis. It is used to partition elements of a set into disjoint groups, so-called clusters, such that elements within a group are similar to each other, but dissimilar to elements of other groups. Several clustering algorithms exist, which can be applied depending on the type of dataset and the particular purpose. However, clustering large datasets can be computationally expensive and requires techniques to increase the performance of these algorithms. To achieve this, recent research often focuses on the massively parallelism provided by Graphics Processing Units (GPUs). In this thesis, we present three clustering algorithms: k-means, k-medoids and hierarchical clustering. Using OpenCL as programming model, we implemented a completely parallel k-means, which reduces the data movement between CPU and GPU and optimizes the parallel assignment step using the GPU’s local memory. Similarly, we designed a parallel k-medoids, which executes all basic clustering steps on the GPU and uses a tiling approach for the computation of the new medoids. Finally, we focused on divisive hierarchical clustering. We first implemented a standard approach, which processes each cluster separately in parallel and continues the clustering sequentially on the CPU as soon as the clusters become too small to make efficient use of the GPU. Then, we introduced a new approach, which increases the parallelism by applying k-means concurrently on all data segments of the same hierarchy depth, using a segmented reduction for the computation of the new means. We further optimized this segmented reduction by injecting PTX assembly instructions in the kernels. Our results show that our segmented approach is 2-3 times faster than the classical parallel approach and 4-5 times faster than the sequential version.
September 20, 2014 by hgpu