
Applications

Georgios Smaragdos, Georgios Chatzikonstantis, Rahul Kukreja, Harrys Sidiropoulos, Dimitrios Rodopoulos, Ioannis Sourdis, Zaid Al-Ars, Christoforos Kachris, Dimitrios Soudris, Chris I. De Zeeuw, Christos Strydis
Mark Govett, Jim Rosinski, Jacques Middlecoff, Tom Henderson, Jin Lee, Alexander MacDonald, Paul Madden, Julie Schramm, Antonio Duarte
Florencio Balboa Usabiaga, Blaise Delmotte, Aleksandar Donev
Marcos Amaris, Raphael Y. de Camargo, Mohamed Dyab, Alfredo Goldman, Denis Trystram
Song Han, Junlong Kang, Huizi Mao, Yiming Hu, Xin Li, Yubin Li, Dongliang Xie, Hong Luo, Song Yao, Yu Wang, Huazhong Yang, William J. Dally
Enrico Piccinini, Claudia Benedetti, Ilaria Siloi, Matteo G. A. Paris, Paolo Bordone
Ahmad Lashgar, Amirali Baniasadi
G. Amadio, A. Ananya, J. Apostolakis, A. Arora, M. Bandieramonte, A. Bhattacharyya, C. Bianchini, R. Brun, P. Canal, F. Carminati, L. Duhem, D. Elvira, A. Gheata, M. Gheata, I. Goulas, R. Iope, S. Jun, G. Lima, A. Mohanty, T. Nikitina, M. Novak, W. Pokorski, A. Ribon, R. Sehgal, O. Shadura, S. Vallecorsa, S. Wenzel, Y. Zhang
Kazuhiro Yamato
Martin Schrimpf
Karel Adamek, Wesley Armour

The availability of OpenCL for FPGAs has raised new questions about the efficiency of massive thread-level parallelism on FPGAs. The general trend is toward deep pipelining and in-order execution of many OpenCL threads across a shared data-path. While this can be very effective for regular kernels, its efficiency diminishes significantly for irregular kernels with runtime-dependent control flow, so new approaches are needed to improve the execution efficiency of FPGAs when targeting irregular OpenCL kernels. This paper proposes a novel solution, called Hardware Thread Reordering (HTR), to boost the throughput of FPGAs when executing irregular kernels with non-deterministic runtime control flow. The key insight of HTR is out-of-order OpenCL thread execution over a shared data-path to achieve significantly higher throughput. Thread reordering is performed at basic-block granularity. The synthesized basic blocks are extended with independent pipeline control signals and context registers to bypass the live values of reordered threads. We demonstrate the efficiency of the proposed solution on three parallel irregular kernels. For the experiments, we use the LegUp tool to compare the baseline (in-order) data-path with the HTR-enhanced data-path. Our RTL simulation results show that the HTR-enhanced data-path achieves up to an 11X increase in kernel throughput at a very low overhead (less than a 2X increase in FPGA resources).
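To give a rough intuition for why reordering helps irregular kernels, the following is a minimal toy scheduling sketch in Python. It is not the HTR micro-architecture described in the abstract. It assumes a single loop basic block with `depth` pipeline slots, measures time in "rounds" (one loop iteration per occupied slot per round), and uses hypothetical function names (`in_order_rounds`, `reordered_rounds`) and made-up trip counts. The in-order model admits threads in batches and waits for the slowest thread in each batch, while the reordered model hands a freed slot to the next waiting thread.

```python
import heapq


def in_order_rounds(trips, depth):
    """Toy in-order model (an assumption, not the paper's design).

    Threads enter the shared loop block in batches of `depth`; the next
    batch enters only once the slowest thread of the current batch is done.
    """
    total = 0
    for start in range(0, len(trips), depth):
        total += max(trips[start:start + depth])
    return total


def reordered_rounds(trips, depth):
    """Toy reordered model (an assumption, not the paper's design).

    As soon as a thread finishes its iterations, its pipeline slot is
    given to the next waiting thread, so threads may complete out of order.
    """
    free_at = [0] * depth          # round at which each slot becomes free
    heapq.heapify(free_at)
    finish = 0
    for t in trips:
        start = heapq.heappop(free_at)   # earliest available slot
        end = start + t
        heapq.heappush(free_at, end)
        finish = max(finish, end)
    return finish


if __name__ == "__main__":
    irregular = [1, 8, 1, 1, 8, 1, 1, 1]   # made-up, runtime-dependent trip counts
    regular = [3] * 8                      # made-up, uniform trip counts
    print(in_order_rounds(irregular, 4), reordered_rounds(irregular, 4))  # 16 9
    print(in_order_rounds(regular, 4), reordered_rounds(regular, 4))      # 6 6
```

In this toy model the two policies tie on the uniform workload and differ only when trip counts diverge, which mirrors the abstract's distinction between regular and irregular kernels. The numbers carry no relation to the paper's reported results.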

* * *


HGPU group © 2010-2024 hgpu.org

All rights belong to the respective authors
