Sorting on FPGAs using Merge Trees
UCLA
UCLA, 2019
@article{samardzic2019sorting,
title={Sorting on FPGAs using Merge Trees},
author={Samardzic, Nikola},
year={2019}
}
Hardware Mergers can be used to implement sorting algorithms on Field-Programmable Gate Arrays (FPGAs) by inductively merging elements as in the Merge Sort algorithm.[1][2] These Hardware Mergers have also been laid out onto the FPGA in a complete binary tree pattern (called a Hardware Merge Tree) which further enhances performance of the sorting procedure by merging many input arrays at once.[3] In [4] the authors recognize the need for using different Merger Trees when sorting on different levels of the memory hierarchy (On-Chip, DRAM, or Flash). Nonetheless, their choice of Merge Trees at each level of the hierarchy is rather ad hoc and there is no rigorous analysis to justify their design choices. In this work, we develop a comprehensive analytical model that takes into consideration the off-chip memory bandwidth and the amount of on-chip resources to optimize the overall sorting latency. Further, we develop methods for implementing a wider range of Merge Trees as well as a general DRAM access pattern and data loading model that operates at peak DRAM bandwidth without affecting the throughput of the Merge Tree. As all signals in our Merge Tree are local to the tree nodes (Hardware Mergers), the design can be scaled to future improvements in FPGA resource availability as well as to a network of many interconnected FPGAs, without impact to the Merge Tree throughput. We present the only end-to-end sorting design on FPGAs aside from the design in [4]. Cycle-accurate simulations of our design report that it outperforms in total sorting time the implementation in [4] on the same hardware setup by 3.1x. Our cycle-accurate software simulations which account for DRAM random access and sequential access trade-offs demonstrates that our end-to-end design has total sorting time within 90% of the predictions of our model.
July 21, 2019 by hgpu