Adaptation of the MapReduce programming framework to compute-intensive data-analytics kernels
University of Illinois at Urbana-Champaign
University of Illinois at Urbana-Champaign, 2013
@phdthesis{campbell2013adaptation,
  title={Adaptation of the MapReduce programming framework to compute-intensive data-analytics kernels},
  author={Campbell, R.H. and Sanders, W. and Hwu, W.M. and Raghunathan, A.},
  school={University of Illinois at Urbana-Champaign},
  year={2013}
}
Compute-intensive data-analytics (CIDA) applications have become a major component of many business domains as well as scientific computing. These algorithms stem from domains as diverse as web analysis and social networks, machine learning and data mining, text analysis, bioinformatics, astronomy image analysis, business analytics, large-scale graph algorithms, image/video processing and recognition, some high-performance computing problems, quantitative finance, and simulation, among others. These problems operate on massive data sets and require substantial computation per data element.

This thesis presents a vision of CIDA applications programmed in a MapReduce-style framework and running on clusters of accelerators. Regardless of the type of accelerator, whether GPUs (NVIDIA or AMD), other manycore architectures (such as Intel Larrabee or MIC), or heterogeneous chips (AMD Fusion or the IBM Cell processor), these platforms impose one fundamental condition on software: increased sensitivity to locality. As a result, the common theme in this thesis is increasing locality in CIDA applications. We report on four research efforts toward this goal.

The Multiple Independent Threads on a Heterogeneous Resource Architecture (MITHRA) project integrates Hadoop MapReduce with GPUs, on which the map() functions execute. Moving the map() functions to GPUs increases locality of reference and yields better performance. We have shown that when the MITHRA model is applicable (for instance, to Monte Carlo algorithms), each computing node can perform orders of magnitude more work in the same run time.

We then introduce partitioned iterative convergence (PIC), an approach to realizing iterative algorithms on clusters. Conventional MapReduce implementations of iterative algorithms are quite inefficient for several reasons; complementary to prior work, we focus on the high network traffic caused by frequent model updates and on the lack of parallelism across iterations. PIC has two phases. The first, the best-effort phase, partitions the problem and runs the sub-problems on individual cluster nodes, where locality can be exploited better. The results of this phase may be numerically inaccurate (about 3% in our experiments), but they can be computed much faster. The second, the top-off phase, runs the original iterative algorithm for a few more iterations, starting from the best-effort results, to compute an accurate answer.

Finally, we introduce two GPU-based projects that increase the performance of MapReduce-style functions on GPUs. The first is loop maximizing, a code transformation that eliminates control-flow divergence (and hence serialization on GPUs) and makes better use of GPU processing elements. Using this technique, we have achieved the highest reported speedups for gene-alignment algorithms. The second is Plasma, a library for dynamic shared-memory allocation and access on GPUs under the assumption that the GPU threads execute independently, as both map() and reduce() functions do in a MapReduce-style environment.

Together, the two MapReduce adaptations (MITHRA and PIC), the loop-maximizing optimization, and the Plasma library lay out a plan for achieving good performance on locality-sensitive clusters. This thesis shows the feasibility of this approach and describes how each of these projects contributes to that goal.
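To make the MITHRA execution model concrete, here is a minimal CUDA sketch of a Monte Carlo computation in MapReduce style: each GPU thread plays the role of one map() task (counting random points inside the unit quarter-circle), and the host performs the reduce(). The kernel, its name, and its parameters are illustrative assumptions, not MITHRA's actual interface; MITHRA itself dispatches Hadoop map() tasks to the GPU rather than exposing a hand-written kernel like this.

```cuda
#include <cstdio>
#include <curand_kernel.h>

// Each thread acts as one map() task: it draws `samples_per_thread` random
// points and emits the count that fall inside the unit quarter-circle.
__global__ void monte_carlo_map(unsigned long long seed,
                                int samples_per_thread,
                                unsigned int *hits) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    curandState rng;
    curand_init(seed, tid, 0, &rng);

    unsigned int inside = 0;
    for (int i = 0; i < samples_per_thread; ++i) {
        float x = curand_uniform(&rng);
        float y = curand_uniform(&rng);
        if (x * x + y * y <= 1.0f) ++inside;
    }
    hits[tid] = inside;  // one intermediate value per map() task
}

int main() {
    const int threads = 256, blocks = 64, samples = 4096;
    const int n = threads * blocks;

    unsigned int *d_hits, *h_hits = new unsigned int[n];
    cudaMalloc(&d_hits, n * sizeof(unsigned int));

    monte_carlo_map<<<blocks, threads>>>(1234ULL, samples, d_hits);
    cudaMemcpy(h_hits, d_hits, n * sizeof(unsigned int),
               cudaMemcpyDeviceToHost);

    // Host-side reduce(): sum the per-task counts to estimate pi.
    unsigned long long total = 0;
    for (int i = 0; i < n; ++i) total += h_hits[i];
    printf("pi ~= %f\n", 4.0 * total / ((double)n * samples));

    cudaFree(d_hits);
    delete[] h_hits;
    return 0;
}
```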
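The two PIC phases can be sketched in plain host code (compilable with nvcc as part of a CUDA program). Below, a 1-D k-means with k = 2, a typical iterative-convergence kernel, is first run to convergence on each partition independently (best-effort phase), the per-partition centroids are merged, and a few global iterations top off the result. The partitioning, seeding, and averaging merge are illustrative assumptions, not PIC's actual policies.

```cuda
#include <cstdio>
#include <vector>
#include <cmath>
#include <utility>

// One Lloyd's-algorithm sweep: assign points to the nearest centroid,
// then recompute each centroid as the mean of its points.
static void kmeans_step(const std::vector<float> &pts, float c[2]) {
    double sum[2] = {0, 0};
    int cnt[2] = {0, 0};
    for (float p : pts) {
        int a = std::fabs(p - c[0]) <= std::fabs(p - c[1]) ? 0 : 1;
        sum[a] += p; cnt[a]++;
    }
    for (int j = 0; j < 2; ++j)
        if (cnt[j]) c[j] = (float)(sum[j] / cnt[j]);
}

int main() {
    std::vector<float> data = {0.1f, 0.3f, 0.2f, 5.1f, 4.9f,
                               5.3f, 0.4f, 5.0f, 0.0f, 5.2f};

    // Best-effort phase: split the data and iterate on each partition
    // locally (on a real cluster this is one partition per node, with no
    // network traffic between iterations).
    std::vector<std::vector<float>> parts = {
        {data.begin(), data.begin() + 5}, {data.begin() + 5, data.end()}};
    float merged[2] = {0, 0};
    for (auto &part : parts) {
        float c[2] = {part.front(), part.back()};      // crude local seed
        for (int it = 0; it < 10; ++it) kmeans_step(part, c);
        if (c[0] > c[1]) std::swap(c[0], c[1]);  // align labels for merging
        merged[0] += c[0] / parts.size();        // merge models by averaging
        merged[1] += c[1] / parts.size();
    }

    // Top-off phase: a few iterations of the original global algorithm,
    // seeded with the (approximate) best-effort model.
    for (int it = 0; it < 3; ++it) kmeans_step(data, merged);
    printf("centroids: %f %f\n", merged[0], merged[1]);
    return 0;
}
```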
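The flavor of loop maximizing can be shown with a toy kernel. When per-thread trip counts differ, the threads of a warp serialize; running every lane to the warp's maximum trip count and predicating the body removes the loop-level divergence. This is a hand-written sketch of the general idea only, not the thesis's actual transformation, and the kernels and data here are assumptions for illustration.

```cuda
#include <cstdio>

__global__ void divergent(const int *counts, float *out) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    float acc = 0.0f;
    // Trip count varies per thread -> control-flow divergence in the warp.
    for (int i = 0; i < counts[tid]; ++i)
        acc += i * 0.5f;
    out[tid] = acc;
}

__global__ void loop_maximized(const int *counts, float *out) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    int n = counts[tid];

    // Warp-wide maximum of n via a shuffle reduction.
    unsigned mask = __activemask();
    int m = n;
    for (int off = 16; off > 0; off >>= 1)
        m = max(m, __shfl_xor_sync(mask, m, off));

    float acc = 0.0f;
    // Every lane now runs the same m iterations; the guard on the body is
    // simple enough to compile to predication rather than a branch.
    for (int i = 0; i < m; ++i)
        if (i < n) acc += i * 0.5f;
    out[tid] = acc;
}

int main() {
    const int N = 64;
    int h_counts[N]; for (int i = 0; i < N; ++i) h_counts[i] = 1 + i % 32;
    int *d_counts; float *d_out;
    cudaMalloc(&d_counts, N * sizeof(int));
    cudaMalloc(&d_out, N * sizeof(float));
    cudaMemcpy(d_counts, h_counts, N * sizeof(int), cudaMemcpyHostToDevice);
    divergent<<<1, N>>>(d_counts, d_out);
    loop_maximized<<<1, N>>>(d_counts, d_out);
    cudaDeviceSynchronize();
    cudaFree(d_counts); cudaFree(d_out);
    return 0;
}
```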
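Finally, a minimal sketch of what dynamic shared-memory allocation for independently executing threads can look like: a per-block bump allocator built on atomicAdd, so each thread claims a variable-sized slice of shared memory without coordinating with its neighbors. The function names and layout here are hypothetical; Plasma's real interface is not reproduced.

```cuda
#include <cstdio>

extern __shared__ unsigned char heap[];   // dynamic shared memory
__shared__ unsigned int heap_top;         // bump pointer, in bytes

// Atomically carve `bytes` out of the block's shared-memory heap.
__device__ void *shared_alloc(unsigned int bytes) {
    bytes = (bytes + 3u) & ~3u;           // round up to 4-byte alignment
    unsigned int off = atomicAdd(&heap_top, bytes);
    return heap + off;                    // no bounds check in this sketch
}

__global__ void demo(const int *sizes, int *out) {
    if (threadIdx.x == 0) heap_top = 0;
    __syncthreads();

    // Each thread independently allocates a scratch buffer of its own size,
    // as independently executing map()/reduce() functions would.
    int n = sizes[threadIdx.x];
    int *scratch = (int *)shared_alloc(n * sizeof(int));
    for (int i = 0; i < n; ++i) scratch[i] = threadIdx.x + i;

    int sum = 0;
    for (int i = 0; i < n; ++i) sum += scratch[i];
    out[threadIdx.x] = sum;
}

int main() {
    const int T = 32;
    int h_sizes[T]; for (int i = 0; i < T; ++i) h_sizes[i] = 1 + i % 8;
    int *d_sizes, *d_out;
    cudaMalloc(&d_sizes, T * sizeof(int));
    cudaMalloc(&d_out, T * sizeof(int));
    cudaMemcpy(d_sizes, h_sizes, T * sizeof(int), cudaMemcpyHostToDevice);

    // Reserve dynamic shared memory for the worst case (8 ints per thread).
    demo<<<1, T, T * 8 * sizeof(int)>>>(d_sizes, d_out);
    cudaDeviceSynchronize();
    cudaFree(d_sizes); cudaFree(d_out);
    return 0;
}
```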
February 9, 2013 by hgpu