high performance computing on graphics processing units: hgpu.org

hgpu.org » Programming » Algorithms » Relational Algorithms for Multi-Bulk-Synchronous Processors

Relational Algorithms for Multi-Bulk-Synchronous Processors

Gregory Frederick Diamos, Haicheng Wu, Ashwin Lele, Jin Wang, Sudhakar Yalamanchili

NVIDIA Research

NVIDIA Research, Technical report, 2012

@article{diamos2012relational,

title={Relational Algorithms for Multi-Bulk-Synchronous Processors},

author={Diamos, G.F. and Wu, H. and Lele, A. and Wang, J. and Yalamanchili, S.},

year={2012}

}

Download (PDF)

View

Source

1386

views

Relational databases remain an important application domain for organizing and analyzing the massive volume of data generated as sensor technology, retail and inventory transactions, social media, computer vision, and new fields continue to evolve. At the same time, processor architectures are beginning to shift towards hierarchical and parallel architectures employing throughput-optimized memory systems, lightweight multi-threading, and Single-Instruction Multiple-Data (SIMD) core organizations. Examples include general purpose graphics processing units (GPUs) such as NVIDIA’s Fermi, Intels Sandy Bridge, and AMD’s Fusion processors. This paper explores the mapping of primitive relational algebra operations onto GPUs. In particular, we focus on algorithms and data structure design identifying a fundamental conflict between the structure of algorithms with good computational complexity and that of algorithms with memory access patterns and instruction schedules that achieve peak machine utilization. To reconcile this conflict, our design space exploration converges on a hybrid multi-stage algorithm that devotes a small amount of the total runtime to prune input data sets using an irregular algorithm with good computational complexity. The partial results are then fed into a regular algorithm that achieves near peak machine utilization. The design process leading to the most efficient algorithm for each stage is described, detailing alternative implementations, their performance characteristics, and an explanation of why they were ultimately abandoned. The least efficient algorithm (JOIN) achieves 57%72% of peak machine performance depending on the density of the input. The most efficient algorithms (PRODUCT, PROJECT, and SELECT) achieve 86% 92% of peak machine performance across all input data sets. To the best of our knowledge, these represent the best known published results to date for any implementations. This work lays the foundation for the development of a relational database system that achieves good scalability on a Multi-Bulk-Synchronous-Parallel (M-BSP) processor architecture. Additionally, the algorithm design may offer insights into the design of parallel and distributed relational database systems. It leaves the problems of query planning, operator!query synthesis, corner case optimization, and system/OS interaction as future work that would be necessary for commercial deployment.

Tags: Algorithms, Computational Complexity, Computer science, CUDA, Databases, Design space exploration, nVidia, Optimization, Tesla C2050

March 9, 2012 by hgpu

No votes yet.

Please wait...

gpu_tracker: Context manager and CLI that tracks the computational-resource-usage of a code block or shell command, particularly the GPU usage

gpu_tracker: Python package for tracking and profiling GPU utilization in both desktop and high-performance computing environments

* * *

high performance computing on graphics processing units: hgpu.org

Relational Algorithms for Multi-Bulk-Synchronous Processors

Recent source codes

QArray

Celerity: High-level C++ for Accelerator Clusters

CIFAR-10 Airbench: 94% on CIFAR-10 in 3.29 second

gpu_tracker: Context manager and CLI that tracks the computational-resource-usage of a code block or shell command, particularly the GPU usage

LOOPer: a polyhedral compiler for expressing fast and portable data parallel algorithms

OpenMC Monte Carlo Code

Polygeist: C/C++ frontend for MLIR

Parallel Gaussian process with kernel approximation in CUDA

Optical flow algorithms for SYCL

OpenMP5-Offload-OpenMC-Intel-PVC

Most viewed papers (last 30 days)

Relational Algorithms for Multi-Bulk-Synchronous Processors

Share this:

Recent source codes

Most viewed papers (last 30 days)