Design and evaluation of a parallel k-nearest neighbor algorithm on CUDA-enabled GPU
Chinese Acad. of Sci., Grad. Univ., Beijing, China
IEEE 2nd Symposium on Web Society (SWS), 2010
@conference{liang2010design,
title={Design and evaluation of a parallel k-nearest neighbor algorithm on CUDA-enabled GPU},
author={Liang, S. and Liu, Y. and Wang, C. and Jian, L.},
booktitle={Web Society (SWS), 2010 IEEE 2nd Symposium on},
pages={53–60},
organization={IEEE}
}
Recent developments in Graphics Processing Units (GPUs) have enabled inexpensive high performance computing for general-purpose applications. Due to GPU’s tremendous computing capability, it has emerged as the co-processor of CPU to achieve a high overall throughput. CUDA programming model provides the programmers adequate C language like APIs to better exploit the parallel power of the GPU. K-nearest neighbor (KNN) is a widely used classification technique and has significant applications in various domains, especially in text classification. The computational-intensive nature of KNN requires a high performance implementation. In this paper, we propose CUKNN, a CUDA-based parallel implementation of KNN. It launches two CUDA kernels, distance calculation kernel and selecting kernel. In the distance calculation kernel, a great number of concurrent CUDA threads are issued, where each thread performs the calculation between the query object and a reference object; in the selecting kernel, threads in a block find the local-k nearest neighbors of the query object concurrently, and then a thread is invoked to find the global-k nearest neighbors out of the queues of local-k neighbors. Various CUDA optimization techniques are applied to maximize the utilization of GPU. We evaluate our implementation by using synthetic dataseis and a real physical simulation dataset. The experimental results demonstrate that CUKNN outperforms the serial KNN on an HP xw8600 workstation significantly, achieving up to 46.7IX speedup on the synthetic dataseis and 42.49X on the physical simulation dataset including I/O cost. It also shows good scalability when varying the number of dimensions of the reference dataset, the number of objects in the reference dataset, and the number of objects in the query dataset.
May 4, 2011 by hgpu