high performance computing on graphics processing units: hgpu.org

Relational Algorithms for Multi-Bulk-Synchronous Processors

Gregory Frederick Diamos, Haicheng Wu, Ashwin Lele, Jin Wang, Sudhakar Yalamanchili

NVIDIA Research

NVIDIA Research, Technical report, 2012

@article{diamos2012relational,
  title={Relational Algorithms for Multi-Bulk-Synchronous Processors},
  author={Diamos, G.F. and Wu, H. and Lele, A. and Wang, J. and Yalamanchili, S.},
  year={2012}
}

Relational databases remain an important application domain for organizing and analyzing the massive volume of data generated as sensor technology, retail and inventory transactions, social media, computer vision, and new fields continue to evolve. At the same time, processor architectures are beginning to shift towards hierarchical and parallel architectures employing throughput-optimized memory systems, lightweight multi-threading, and Single-Instruction Multiple-Data (SIMD) core organizations. Examples include general-purpose graphics processing units (GPUs) such as NVIDIA's Fermi, Intel's Sandy Bridge, and AMD's Fusion processors. This paper explores the mapping of primitive relational algebra operations onto GPUs. In particular, we focus on algorithm and data structure design, identifying a fundamental conflict between the structure of algorithms with good computational complexity and that of algorithms with memory access patterns and instruction schedules that achieve peak machine utilization. To reconcile this conflict, our design space exploration converges on a hybrid multi-stage algorithm that devotes a small amount of the total runtime to pruning input data sets using an irregular algorithm with good computational complexity. The partial results are then fed into a regular algorithm that achieves near-peak machine utilization. The design process leading to the most efficient algorithm for each stage is described, detailing alternative implementations, their performance characteristics, and an explanation of why they were ultimately abandoned. The least efficient algorithm (JOIN) achieves 57%-72% of peak machine performance, depending on the density of the input. The most efficient algorithms (PRODUCT, PROJECT, and SELECT) achieve 86%-92% of peak machine performance across all input data sets. To the best of our knowledge, these are the best published results to date for any implementation.
This work lays the foundation for the development of a relational database system that achieves good scalability on a Multi-Bulk-Synchronous-Parallel (M-BSP) processor architecture. Additionally, the algorithm design may offer insights into the design of parallel and distributed relational database systems. It leaves the problems of query planning, operator→query synthesis, corner case optimization, and system/OS interaction as future work that would be necessary for commercial deployment.
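
The two-stage idea in the abstract (a cheap irregular pruning pass followed by a regular, data-independent pass) can be illustrated with a small sequential sketch. This is not the paper's GPU implementation: the function names `prune`, `dense_join`, and `hybrid_join`, the representation of a relation as a key-sorted list of `(key, value)` tuples, and the choice of pruning by overlapping key range are all assumptions made for illustration.

```python
from bisect import bisect_left, bisect_right


def prune(left, right):
    """Stage 1 (irregular): binary-search both key-sorted inputs and keep
    only the tuples whose keys fall in the range the two inputs share.
    Hypothetical helper, not taken from the paper."""
    if not left or not right:
        return [], []
    lo = max(left[0][0], right[0][0])    # smallest key that can match
    hi = min(left[-1][0], right[-1][0])  # largest key that can match
    left_keys = [k for k, _ in left]
    right_keys = [k for k, _ in right]
    kept_left = left[bisect_left(left_keys, lo):bisect_right(left_keys, hi)]
    kept_right = right[bisect_left(right_keys, lo):bisect_right(right_keys, hi)]
    return kept_left, kept_right


def dense_join(left, right):
    """Stage 2 (regular): compare every remaining pair of tuples. The loop
    structure does not depend on the data, which is the property that
    maps well onto SIMD hardware."""
    return [(lk, lv, rv) for lk, lv in left for rk, rv in right if lk == rk]


def hybrid_join(left, right):
    """Inner join on the key: prune first, then run the dense comparison."""
    kept_left, kept_right = prune(left, right)
    return dense_join(kept_left, kept_right)


if __name__ == "__main__":
    a = [(1, "a"), (3, "b"), (5, "c"), (7, "d")]
    b = [(4, "x"), (5, "y"), (5, "z"), (9, "w")]
    print(hybrid_join(a, b))
```

In this sketch the binary searches do a small amount of data-dependent work to shrink the inputs, and the all-pairs comparison then does the bulk of the work with a predictable access pattern. A GPU version would presumably partition the pruned inputs into fixed-size blocks per thread group, which the sketch does not attempt.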

Tags: Algorithms, Computational Complexity, Computer science, CUDA, Databases, Design space exploration, nVidia, Optimization, Tesla C2050

March 9, 2012 by hgpu
