Solving convex optimization problems on FPGA using OpenCL
Delft University of Technology, 2020
@article{berkers2020solving,
title={Solving convex optimization problems on FPGA using OpenCL},
author={Berkers, Martijn},
year={2020}
}
The use of accelerators in HPC applications has grown enormously over the last decade, and throughput demands in the field continue to rise. For many of the algorithms involved, it is not clear which hardware architecture performs best. Our work explores the performance of different hardware architectures in solving a convex optimization problem. Such algorithms consist of a sequence of dependent operations, which makes them an interesting use case because parallelism is not easily found. We focus on an on-machine computational model used at ASML and explore the acceleration of a quadratic programming active-set algorithm on dedicated hardware. Libraries are available for this on both the CPU and GPU, but nothing is available for the FPGA. Our work fills this gap by implementing the algorithm in a high-level parallel programming language in order to ease development for FPGA accelerators. We use the Intel FPGA SDK for OpenCL framework to evaluate the performance trade-offs involved in FPGA acceleration and compare the performance to both the CPU and GPU using library functions. To fit the FPGA architecture, the algorithm is converted to a dataflow algorithm to enable streaming of data between kernels. The implementation leverages the features of the Intel FPGA SDK for OpenCL framework to stream data between kernels using low-latency on-chip communication. We demonstrate that such a complicated algorithm can be implemented efficiently using the OpenCL framework. Our implementation achieves competitive performance compared to optimized library functions on both the CPU and GPU. The OpenCL framework allows for easy design space exploration, and we have explored different optimization strategies. In double-precision floating point, the execution time of the final FPGA implementation is 3.5x and 1.2x longer than that of the CPU and GPU, respectively.
If the accuracy of the FPGA implementation is reduced to single precision, execution time improves by 2.2x compared to the double-precision variant. Higher throughput can be achieved by duplicating the implementation; with the current size of the algorithm, two additional copies are possible. A handcrafted implementation could further improve the FPGA performance by manually managing local memory structures and reusing processing elements. However, the OpenCL framework requires significantly fewer lines of code and significantly less development time than traditional hardware description languages.
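The dataflow idea mentioned in the abstract, in which each kernel consumes values as soon as the previous kernel produces them, can be illustrated with a small sketch. This is not the thesis implementation: Python generators stand in for the on-chip communication between OpenCL kernels, and the stage names (`stream_rows`, `dot_stage`, `residual_stage`), the matrix `A`, the point `x`, and the right-hand side `b` are all hypothetical values chosen for illustration. The pipeline computes the constraint residuals b - Ax, a quantity an active-set method typically evaluates when checking constraints.

```python
def stream_rows(matrix):
    """Producer stage: emit one matrix row at a time."""
    for row in matrix:
        yield row


def dot_stage(rows, vector):
    """Middle stage: consume rows as they arrive, emit one dot product per row."""
    for row in rows:
        yield sum(a * v for a, v in zip(row, vector))


def residual_stage(products, rhs):
    """Consumer stage: emit b_i - (A x)_i as each product arrives."""
    for p, b_i in zip(products, rhs):
        yield b_i - p


def constraint_residuals(A, x, b):
    """Chain the three stages; no stage waits for the full matrix."""
    pipeline = residual_stage(dot_stage(stream_rows(A), x), b)
    return list(pipeline)


# Hypothetical data: three linear constraints in two variables.
A = [[1.0, -2.0], [-1.0, -2.0], [-1.0, 2.0]]
x = [2.0, 0.0]
b = [-2.0, -6.0, -2.0]
residuals = constraint_residuals(A, x, b)
print(residuals)
```

Each stage holds only one element at a time, which mirrors why a streaming design can avoid round trips to global memory between kernels. On an FPGA the stages would run concurrently in hardware, whereas the generators here only interleave.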
March 8, 2020 by hgpu