Scalable Query Evaluation in Relational Databases

hgpu.org » Programming » Algorithms » Scalable Query Evaluation in Relational Databases

Scalable Query Evaluation in Relational Databases

Rasmus Resen Amossen

The Efficient Computation research group, IT University of Copenhagen, Denmark

IT University of Copenhagen, 2011

BibTeX

Download (PDF)

View

Source

2030

views

The scalability of a query depends on the amount of data that needs to be accessed when computing the answer. This implies three immediate general strategies for improving query performance: decrease the amount of data (including intermediate results) to be accessed by accessing it smarter; decrease the amount by simply reducing the data quantity in the first place; and increase the amount of data accessed per time unit. This PhD dissertation presents four research results, covering each of these three approaches. The first three results focus on variations of the highly applicable query class join-project, which is a join of two database tables followed by a duplicate eliminating projection. Join-projects are equivalent to sparse Boolean matrix multiplication and frequent pair mining (the special case of frequent itemset with itemset cardinality limited to 2). We describe a new output sensitive algorithm for join-projects which has small intermediate results on worst-case inputs, and in particular, is efficient in both the RAM and I/O model. The algorithm uses the output size to deduce its computation strategy, and this introduces a chicken-and-egg problem: how do we obtain the output size without actually computing the output? This question is answered in another result in which we obtain a (1+epsilon) approximation of the output size in expected linear time and I/O for epsilon>n^(-1/4). In another result we address the throughput itself by using the massive parallel capabilities of graphics processing units (GPUs) to handle the pair mining problem. For that we present a new data structure, BatMap, which is a novel vertical data layout that is particularly well suited for parallel processing. The last result deals with the general problem of reducing the quantity of data that must be accessed for answering any given query on a row store RDBMS. We present a quadratic integer program formulation of the vertical partitioning problem for OLTP workloads in a distributed environment. This quadratic optimization problem is NP-hard so we also describe a randomized heuristic that empirically has shown to be reliable in sense of both speed and cost reduction.

Tags: Algorithms, Computer science, Databases, Matrix multiplication, nVidia, nVidia GeForce GTX 285, OpenCL, Optimization, Python, Thesis

December 5, 2011 by hgpu

No votes yet.

Please wait...

Your response

You must be logged in to post a comment.

HPCTransCompile: An AI Compiler Generated Dataset for High-Performance CUDA Transpilation and LLM Preliminary Exploration

chemtrain: Training Molecular Dynamics Potentials in JAX

chemtrain-deploy: A parallel and scalable framework for machine learning potentials in million-atom MD simulations

microSYCL: SYCL micro-benchmarks repository

Exploring SYCL as a Portability Layer for High-Performance Computing on CPUs

See all packages

* * *

high performance computing on graphics processing units: hgpu.org