Parallel Matching and Clustering Algorithms on GPUs
Md Naim
University of Bergen, 2017
@phdthesis{naim2017parallel,
  title={Parallel Matching and Clustering Algorithms on GPUs},
  author={Naim, Md},
  year={2017},
  school={The University of Bergen}
}
The main focus of this thesis is on developing efficient GPU algorithms for certain matching and clustering problems. Through extensive experiments we show that sparse and unstructured problems can benefit greatly from using GPUs, as long as the algorithms are carefully designed. Even though none of the presented algorithms are fundamentally new, they still require significant redesign to make them efficient on GPUs. Common to all the developed algorithms is that they emphasize achieving an even load balance and a high degree of parallelism, while at the same time avoiding time-consuming synchronization operations. The experiments verify the performance of the suggested algorithms: in some cases, even a single GPU can outperform tens of multiprocessors on a state-of-the-art supercomputer.

The area of computing with GPU-enhanced systems is changing rapidly. This is especially true for memory management. For instance, simultaneous access to the same memory by a host and a device was not possible until it was recently introduced in the Pascal GPU [15] (a sketch of this programming model is given below). However, starting from the early unified shader architecture, the fundamental architectural features of GPUs have not undergone radical changes. Instead, features have been added incrementally while the speed and size of the devices have increased gradually. Based on this observation, we believe that algorithms developed for GPUs today will also be relevant in the near future.

Although the algorithms presented in this thesis can be viewed as proof of concept that carefully designed GPU algorithms for certain graph problems can compete with implementations on traditional parallel supercomputers, the effort this requires should not be underestimated. For this to become possible on a more regular basis, higher-level abstractions for graph analytics on GPUs need to be developed further. As an example, we believe that the strategy used in Paper III, where vertices were grouped according to their degree before being allocated to different thread blocks, is one technique that could benefit other applications (sketched below). Various reduction operations are also candidates for generic implementation (see the final sketch below).

It seems clear that GPUs will play a major role in the future of HPC, with large systems containing several interconnected GPUs. The design of efficient graph algorithms for such systems is still in its early stages and is likely to generate much interesting work in the future. Just as with the traditional distinction between shared-memory and distributed-memory algorithms, we believe a similar division will emerge among GPU algorithms, depending on how the underlying devices are interconnected.
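The host/device memory sharing mentioned in the abstract refers to CUDA unified memory with on-demand page migration, introduced with the Pascal architecture. The following minimal sketch (ours, not taken from the thesis) shows how a single cudaMallocManaged allocation can be touched by both the host and the device:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Increment each element of a managed array on the device.
__global__ void increment(int *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1;
}

int main() {
    const int n = 1 << 20;
    int *data = nullptr;
    // One allocation visible to both host and device; on Pascal and
    // later, pages migrate between CPU and GPU on demand.
    cudaMallocManaged(&data, n * sizeof(int));
    for (int i = 0; i < n; ++i) data[i] = i;      // host writes directly
    increment<<<(n + 255) / 256, 256>>>(data, n); // device updates in place
    cudaDeviceSynchronize();                      // complete before host reads
    printf("data[42] = %d\n", data[42]);          // prints 43
    cudaFree(data);
    return 0;
}
```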
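The degree-grouping strategy from Paper III is only described at a high level in the abstract. The host-side sketch below is our own illustration, with made-up bucket boundaries: vertices of a CSR graph are bucketed by degree range so that each bucket can later be processed by a kernel configuration suited to it, for example one thread per low-degree vertex and one warp or block per high-degree vertex.

```cuda
#include <algorithm>
#include <cstdio>
#include <vector>

// Bucket b collects vertices of degree <= 2^b (boundaries are illustrative).
static int bucketOf(int degree) {
    int b = 0;
    while ((1 << b) < degree) ++b;
    return b;
}

int main() {
    // Toy CSR row offsets for 5 vertices with degrees 2, 1, 33, 4, 0.
    std::vector<int> offsets = {0, 2, 3, 36, 40, 40};
    const int n = static_cast<int>(offsets.size()) - 1;

    const int kBuckets = 8;
    std::vector<std::vector<int>> buckets(kBuckets);
    for (int v = 0; v < n; ++v) {
        int deg = offsets[v + 1] - offsets[v];
        buckets[std::min(bucketOf(deg), kBuckets - 1)].push_back(v);
    }
    // Each bucket would then be assigned to thread blocks tuned to its
    // degree range, giving an even load balance across the device.
    for (int b = 0; b < kBuckets; ++b)
        printf("bucket %d (degree <= %d): %zu vertices\n",
               b, 1 << b, buckets[b].size());
    return 0;
}
```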
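As one example of a reduction operation that lends itself to generic implementation, a common CUDA building block is a warp-level sum based on shuffle instructions, which avoids shared-memory synchronization within the warp. This sketch is again ours (it assumes CUDA 9 or later for __shfl_down_sync), not code from the thesis:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Warp-level sum: after the loop, lane 0 holds the sum over all 32 lanes.
__device__ int warpReduceSum(int val) {
    for (int offset = 16; offset > 0; offset >>= 1)
        val += __shfl_down_sync(0xffffffff, val, offset);
    return val;
}

__global__ void sumKernel(const int *in, int *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int v = (i < n) ? in[i] : 0;
    v = warpReduceSum(v);
    if ((threadIdx.x & 31) == 0)   // one atomic per warp instead of per thread
        atomicAdd(out, v);
}

int main() {
    const int n = 1024;
    int *in, *out;
    cudaMallocManaged(&in, n * sizeof(int));
    cudaMallocManaged(&out, sizeof(int));
    for (int i = 0; i < n; ++i) in[i] = 1;
    *out = 0;
    sumKernel<<<(n + 255) / 256, 256>>>(in, out, n);
    cudaDeviceSynchronize();
    printf("sum = %d\n", *out);    // prints 1024
    cudaFree(in);
    cudaFree(out);
    return 0;
}
```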
October 21, 2017 by hgpu