## Scalable Query Evaluation in Relational Databases

The Efficient Computation research group, IT University of Copenhagen, Denmark

IT University of Copenhagen, 2011

@article{amossen2011scalable,

title={Scalable Query Evaluation in Relational Databases},

author={Amossen, R.R.},

year={2011}

}

The scalability of a query depends on the amount of data that needs to be accessed when computing the answer. This implies three immediate general strategies for improving query performance: decrease the amount of data (including intermediate results) to be accessed by accessing it smarter; decrease the amount by simply reducing the data quantity in the first place; and increase the amount of data accessed per time unit. This PhD dissertation presents four research results, covering each of these three approaches. The first three results focus on variations of the highly applicable query class join-project, which is a join of two database tables followed by a duplicate eliminating projection. Join-projects are equivalent to sparse Boolean matrix multiplication and frequent pair mining (the special case of frequent itemset with itemset cardinality limited to 2). We describe a new output sensitive algorithm for join-projects which has small intermediate results on worst-case inputs, and in particular, is efficient in both the RAM and I/O model. The algorithm uses the output size to deduce its computation strategy, and this introduces a chicken-and-egg problem: how do we obtain the output size without actually computing the output? This question is answered in another result in which we obtain a (1+epsilon) approximation of the output size in expected linear time and I/O for epsilon>n^(-1/4). In another result we address the throughput itself by using the massive parallel capabilities of graphics processing units (GPUs) to handle the pair mining problem. For that we present a new data structure, BatMap, which is a novel vertical data layout that is particularly well suited for parallel processing. The last result deals with the general problem of reducing the quantity of data that must be accessed for answering any given query on a row store RDBMS. We present a quadratic integer program formulation of the vertical partitioning problem for OLTP workloads in a distributed environment. This quadratic optimization problem is NP-hard so we also describe a randomized heuristic that empirically has shown to be reliable in sense of both speed and cost reduction.

December 5, 2011 by hgpu