Exploiting Coarse-grained Parallelism in B+ Tree Searches on an APU

hgpu.org » Applications » Computer science » Exploiting Coarse-grained Parallelism in B+ Tree Searches on an APU

Exploiting Coarse-grained Parallelism in B+ Tree Searches on an APU

Mayank Daga, Mark Nutter

AMD Research, Advanced Micro Devices, Inc., USA

2nd Workshop on Irregular Applications: Architectures & Algorithms (IA3), held in conjunction with AMC/IEEE International Conference on High Performance Computing, Networking, Storage and Analysis (SC’12), 2012

BibTeX

Download (PDF)

View

Source

2430

views

B+ tree structured index searches are one of the fundamental database operations and hence, accelerating them is essential. GPUs provide a compelling mix of performance per watt and performance per dollar, and thus are an attractive platform for accelerating B+ tree searches. However, tree search on discrete GPUs presents significant challenges for acceleration due to (i) the irregular representation in memory and (ii) the requirement to copy the tree to the GPU memory over the PCIe bus. In this paper, we present the acceleration of B+ tree searches on a fused CPU+GPU processor (an accelerated processing unit or APU). We counter the aforementioned issues by reorganizing the B+ tree in memory and utilizing the novel heterogeneous system architecture, which eliminates (i) the need to copy the tree to the GPU and (ii) the limitation on the size of the tree that can be accelerated. Our approach exploits the coarsegrained parallelism in tree search, wherein we execute multiple searches in parallel to optimize for the SIMD width without modifying the inherent B+ tree data structure. Our results illustrate that the APU implementation can perform up to 70M1 queries per second and is 4.9x faster in the best case and 2.5x faster on average than a hand-tuned, SSE-optimized, six-core CPU implementation, for varying orders of the B+ tree with 4M keys. We also present an analysis of the effect of caches on performance, and of the efficacy of the APU to eliminate data-copies.

Tags: APU, ATI, ATI Radeon HD 7660, ATI Radeon HD 7970, Computer science, Databases, Heterogeneous systems, OpenCL

November 18, 2012 by hgpu

No votes yet.

Please wait...

high performance computing on graphics processing units: hgpu.org

Exploiting Coarse-grained Parallelism in B+ Tree Searches on an APU

Recent source codes

HPCTransCompile: An AI Compiler Generated Dataset for High-Performance CUDA Transpilation and LLM Preliminary Exploration

chemtrain: Training Molecular Dynamics Potentials in JAX

microSYCL: SYCL micro-benchmarks repository

XaaS containers

SYCL Container

CASS: Cuda-Amd aSSembly

Cluser of smartphones for edge computing application using TensorFlow

CFAL-bench

Efficient Graph Embedding at Scale: Optimizing CPU-GPU-SSD Integration

Can Large Language Models Predict Parallel Code Performance?

Most viewed papers (last 30 days)

Exploiting Coarse-grained Parallelism in B+ Tree Searches on an APU

Share this:

Recent source codes

Most viewed papers (last 30 days)