27194

Enabling The Feed-Forward Design Model in OpenCL Using Pipes

Mostafa Eghbali Zarch, Michela Becchi
North Carolina State University, Raleigh, NC, USA
arXiv:2208.13364 [cs.DC], (29 Aug 2022)

@misc{https://doi.org/10.48550/arxiv.2208.13364,

   doi={10.48550/ARXIV.2208.13364},

   url={https://arxiv.org/abs/2208.13364},

   author={Zarch, Mostafa Eghbali and Becchi, Michela},

   keywords={Distributed, Parallel, and Cluster Computing (cs.DC), FOS: Computer and information sciences, FOS: Computer and information sciences},

   title={Enabling The Feed-Forward Design Model in OpenCL Using Pipes},

   publisher={arXiv},

   year={2022},

   copyright={arXiv.org perpetual, non-exclusive license}

}

Download Download (PDF)   View View   Source Source   

152

views

Over the past few years, there has been an increased interest in using FPGAs alongside CPUs and GPUs in high-performance computing systems and data centers. This trend has led to a push toward the use of high-level programming models and libraries, such as OpenCL, both to lower the barriers to the adoption of FPGAs by programmers unfamiliar with hardware description languages (HDLs), and to allow to seamlessly deploy a single code on different devices. Today, both Intel and Xilinx (now part of AMD) offer toolchains to compile OpenCL code onto FPGA. However, using OpenCL on FPGAs is complicated by performance portability issues, since different devices have fundamental differences in architecture and nature of hardware parallelism they offer. Hence, platform-specific optimizations are crucial to achieving good performance across devices. In this paper, we propose using the feed-forward design model based on pipes in order to improve the performance of OpenCL codes running on FPGA. We show the code transformations required to apply this method to existing OpenCL kernels, and we discuss the restrictions to its applicability. Using popular benchmark suites and microbenchmarks, we show that the feed-forward design model can result in higher utilization of the global memory bandwidth available and increased instruction concurrency, thus improving the overall throughput of the OpenCL implementations at a modest resource utilization cost. Further concurrency can be achieved by using multiple producers and multiple consumers.
No votes yet.
Please wait...

* * *

* * *

* * *

HGPU group © 2010-2022 hgpu.org

All rights belong to the respective authors

Contact us: