A Study of Data Partitioning on OpenCL-based FPGAs

Zeke Wang, Bingsheng He, Wei Zhang
Nanyang Technological University, Singapore
International Conference on Field-programmable Logic and Applications (FPL), 2015

@article{wang2015study,
   title={A Study of Data Partitioning on OpenCL-based FPGAs},
   author={Wang, Zeke and He, Bingsheng and Zhang, Wei},
   journal={Work},
   volume={6},
   number={7},
   pages={1},
   year={2015}
}

Considerable research effort has been devoted to accelerating relational database applications on FPGAs, owing to their high energy efficiency and high throughput. Most existing studies are based on hardware description languages (HDLs). Recently, FPGA vendors have started to develop OpenCL SDKs that offer much better programmability. In this paper, we investigate the performance of relational database applications on OpenCL-based FPGAs. As a start, we study the performance of data partitioning, a core operation widely used in relational databases. Because of its random memory accesses, data partitioning is time-consuming and can become a major bottleneck for database operators such as hash join. We start with the state-of-the-art OpenCL implementation, which was originally designed for CPUs/GPUs, and find that it suffers from lock overheads and memory bandwidth overheads. To reduce lock overheads, we develop a simple yet efficient multi-kernel approach that leverages two emerging features of the Altera OpenCL SDK, namely task kernels and channels. Moreover, on-chip buckets are employed to reduce the number of memory transactions. We further develop a cost model to guide the parameter configuration. We evaluate the proposed design on a recent Altera Stratix V FPGA. Our results demonstrate that (1) our cost model can accurately predict the performance of data partitioning under different parameter settings, and (2) our proposed multi-kernel approach can achieve a 10.7X speedup over the existing OpenCL implementation. In addition, experiments with three case studies show that the optimized implementations can achieve a 4-12X performance improvement over the original implementations.
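
To illustrate why staging tuples in small buckets can reduce the number of memory transactions, here is a minimal CPU-side sketch in Python. It is not the authors' OpenCL design: the function name `partition_buffered`, the modulo hash, and the `bucket_size` parameter are assumptions made for illustration, and each flush of a bucket stands in for one burst write to off-chip memory.

```python
def partition_buffered(keys, num_partitions, bucket_size):
    """Hash-partition keys, staging them in small per-partition buckets.

    Returns (partitions, flushes). `flushes` counts how many times a bucket
    was written out, a stand-in for off-chip memory transactions.
    """
    partitions = [[] for _ in range(num_partitions)]  # models off-chip output
    buckets = [[] for _ in range(num_partitions)]     # models on-chip buckets
    flushes = 0

    for key in keys:
        p = key % num_partitions  # hypothetical hash function (assumption)
        buckets[p].append(key)
        if len(buckets[p]) == bucket_size:
            # Bucket is full: write it out as one burst.
            partitions[p].extend(buckets[p])
            buckets[p].clear()
            flushes += 1

    # Drain partially filled buckets at the end of the input.
    for p in range(num_partitions):
        if buckets[p]:
            partitions[p].extend(buckets[p])
            buckets[p].clear()
            flushes += 1

    return partitions, flushes


if __name__ == "__main__":
    keys = list(range(64))
    _, buffered = partition_buffered(keys, num_partitions=4, bucket_size=8)
    _, unbuffered = partition_buffered(keys, num_partitions=4, bucket_size=1)
    print(buffered, unbuffered)  # prints "8 64"
```

With a bucket size of 1, every tuple triggers its own write; larger buckets batch writes per partition, at the cost of more on-chip storage. The trade-off between bucket size and resource use is the kind of parameter choice the abstract says the cost model is meant to guide.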
