## Hybrid Multicore Algorithms for Some Semi-Numerical Applications and Graphs

Center for Security, Theory, and Algorithmic Research, International Institute of Information Technology, Hyderabad – 500 032, India

International Institute of Information Technology, 2014

@phdthesis{banerjee2014hybrid,

title={Hybrid Multicore Algorithms for Some Semi-Numerical Applications and Graphs},

author={Banerjee, Dip Sankar},

year={2014},

school={International Institute of Information Technology Hyderabad}

}

The computing industry has undergone several paradigm shifts in the last few decades. Fueled by the need of faster computing, larger data and real time processing needs parallel computing has emerged as one of the dominant paradigms. Motivated by the success achieved in distributed computing models and the limitations faced by single core processors, parallel computing is the only alternative for building faster computers. Parallel computing is one of the most challenging areas computer science in the present and developing algorithms and optimization techniques for utilizing the processing power present in a current generation parallel computer is still a very exciting area for research. The parallel computing industry underwent a massive shift with the conventional sequential computers hitting the power wall. It led to the development of multicore and many-core computing chips that had multiple sequential computing cores packed into a single chip. The immediate impact was the need for (re)designing sequential algorithms in order to utilize the computing power of such chips. Combined with the intricate memory and cache structures, parallel algorithms require a higher degree of engineering for the most optimal performance. The many-core revolution started with the release of Graphics Processing Units (GPU) which had a large number of compute cores and offered massive parallelism. With the evolution of the many-core chips, the GPUs found application in graphics, gaming as well as general purpose computation. In the same time frame, the Central Processing Units (CPU) too under went a sea of innovation and emerged as more powerful and mature computing machines. However, the multicore CPUs were mostly ignored in its initial days. With the advancement of accelerator platforms, the CPUs and GPUs are now able to communicate in a more efficient manner. In the recent times there has been quite a few works such as the ones in [79, 91, 43] that shows that hybrid algorithms actually provide better performance and efficiency over conventional accelerator based computing. In this thesis we work towards the development of hybrid multicore computing. Hybrid multicore computing is developing algorithms and optimization strategies for popular computing primitives on an hybrid platform. A hybrid platform is one which contains two or more multicore or many-core devices. There are several challenges towards the efficient algorithm design on hybrid platforms such as that of communication bottlenecks, load balance and synchronization. In this thesis we work towards developing algorithms for some computing primitives such as that of list ranking, sorting and pseudorandomness and some graph algorithms. We experiment on a hybrid platform consisting of a 6-core Intel CPU and Nvidia GPUs of broadly two generations. In our first work we work towards the development of hybrid algorithms for List Ranking. To this end, we explore the algorithmic and analytical issues in hybrid multicore computing. Our case studies involve two different ways of designing hybrid multicore algorithms. The main contribution of this work is to address the issues related to the design of hybrid solutions. We show our hybrid algorithm for list ranking is faster by 50% compared to the best known implementation [139]. This work is followed by a hybrid implementation of comparison sorting. Sorting has been a topic of immense research value since the inception of Computer Science. In this work, we present a hybrid comparison based sorting algorithm which utilizes a many-core GPU and a multi- core CPU to perform sorting. The algorithm is broadly based on splitting the input list according to a large number of splitters followed by creating independent sublists. Sorting the independent sublists results in sorting the entire original list. On a hybrid platform, our algorithm achieves a 20% gain over the current best known comparison sort result that was published by Davidson et. al. [32]. On the above experimental platform, our results are better by 40% on average over a similar GPU-alone algorithm proposed by Leischner et. al. [81]. Our results also show that our algorithm and its implementation scale with the size of the input. We also show that such performance gains can be obtained on other hybrid platforms. The use of many-core architectures and accelerators, such as GPUs, with good programmability has allowed them to be deployed for vital computational work. The ability to use randomness in computation is known to help in several situations. For such computations to be made possible on a general purpose computer, a source of randomness, or in general a pseudo random generator (PRNG), is essential. However, most of the PRNGs currently available on GPUs suffer from some basic drawbacks. It is of high interest to develop a parallel, quality PRNG that also works in an on demand model. In this work, we investigate a hybrid technique to create an efficient PRNG. The basic technique we apply is that of random walks on expander graphs. Unlike existing generators available in the GPU programming environment, our generator can produce random numbers on demand as opposed to a one- time generation. Our approach produces 0.07 GNumbers per second. The quality of our generator is tested with industry standard tests. In the second part of the thesis, we work towards hybrid graph connected components and breadth first search. For computing graph connected components we use a static load balancing technique for partitioning of work between two devices and is followed by a low overhead consolidation phase for merging the results. We achieve almost 25% improvement over the best known implementation in [123]. For performing breadth first search (BFS), we employ a graph pruning strategy to reduce the size of large graphs which is often composed of a large percentage of pendant nodes. We apply a hybrid BFS on the remainder graph and is followed by a re-insertion phase. We achieve a 35% improvement over the current best know solution which was published by Munguia et al. in [91].

January 19, 2015 by hgpu