Exploiting Coarse-grained Parallelism in B+ Tree Searches on an APU
AMD Research, Advanced Micro Devices, Inc., USA
2nd Workshop on Irregular Applications: Architectures & Algorithms (IA3), held in conjunction with AMC/IEEE International Conference on High Performance Computing, Networking, Storage and Analysis (SC’12), 2012
@article{daga2012exploiting,
title={Exploiting Coarse-grained Parallelism in B+ Tree Searches on an APU},
author={Daga, M. and Nutter, M.},
year={2012}
}
B+ tree structured index searches are one of the fundamental database operations and hence, accelerating them is essential. GPUs provide a compelling mix of performance per watt and performance per dollar, and thus are an attractive platform for accelerating B+ tree searches. However, tree search on discrete GPUs presents significant challenges for acceleration due to (i) the irregular representation in memory and (ii) the requirement to copy the tree to the GPU memory over the PCIe bus. In this paper, we present the acceleration of B+ tree searches on a fused CPU+GPU processor (an accelerated processing unit or APU). We counter the aforementioned issues by reorganizing the B+ tree in memory and utilizing the novel heterogeneous system architecture, which eliminates (i) the need to copy the tree to the GPU and (ii) the limitation on the size of the tree that can be accelerated. Our approach exploits the coarsegrained parallelism in tree search, wherein we execute multiple searches in parallel to optimize for the SIMD width without modifying the inherent B+ tree data structure. Our results illustrate that the APU implementation can perform up to 70M1 queries per second and is 4.9x faster in the best case and 2.5x faster on average than a hand-tuned, SSE-optimized, six-core CPU implementation, for varying orders of the B+ tree with 4M keys. We also present an analysis of the effect of caches on performance, and of the efficacy of the APU to eliminate data-copies.
November 18, 2012 by hgpu