Fast Arbitrary Precision Floating Point on FPGA
Department of Computer Science, ETH Zurich, Switzerland
arXiv:2204.06256 [cs.DC], (13 Apr 2022)
@misc{https://doi.org/10.48550/arxiv.2204.06256,
doi={10.48550/ARXIV.2204.06256},
url={https://arxiv.org/abs/2204.06256},
author={de Fine Licht, Johannes and Pattison, Christopher A. and Ziogas, Alexandros Nikolaos and Simmons-Duffin, David and Hoefler, Torsten},
keywords={Distributed, Parallel, and Cluster Computing (cs.DC), FOS: Computer and information sciences},
title={Fast Arbitrary Precision Floating Point on FPGA},
publisher={arXiv},
year={2022},
copyright={arXiv.org perpetual, non-exclusive license}
}
Numerical codes that require arbitrary precision floating point (APFP) numbers for their core computation are dominated by elementary arithmetic operations due to the super-linear complexity of multiplication in the number of mantissa bits. APFP computations on conventional software-based architectures are made exceedingly expensive by the lack of native hardware support, requiring elementary operations to be emulated using instructions operating on machine-word-sized blocks. In this work, we show how APFP multiplication on compile-time fixed-precision operands can be implemented as deep FPGA pipelines with a recursively defined Karatsuba decomposition on top of native DSP multiplication. When comparing our design implemented on an Alveo U250 accelerator to a dual-socket 36-core Xeon node running the GNU Multiple Precision Floating-Point Reliable (MPFR) library, we achieve a 9.8x speedup at 4.8 GOp/s for 512-bit multiplication, and a 5.3x speedup at 1.2 GOp/s for 1024-bit multiplication, corresponding to the throughput of more than 351x and 191x CPU cores, respectively. We apply this architecture to general matrix-matrix multiplication, yielding a 10x speedup at 2.0 GOp/s over the Xeon node, equivalent to more than 375x CPU cores, effectively allowing a single FPGA to replace a small CPU cluster. Due to the significant dependence of some numerical codes on APFP, such as semidefinite program solvers, we expect these gains to translate into real-world speedups. Our configurable and flexible HLS-based code provides a high-level software interface for plug-and-play acceleration, published as an open source project.
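As a rough software illustration of the recursively defined Karatsuba decomposition mentioned in the abstract, the following Python sketch multiplies two fixed-width mantissas by splitting them in half until the operands fit an assumed "native" multiplier width. It is not the authors' HLS implementation: the function name `karatsuba`, the constant `BASE_BITS`, and its value of 16 are assumptions chosen for illustration, and a plain integer multiply stands in for a DSP multiplication.

```python
import random

# Assumed operand width handled by one "native" multiply (illustrative only,
# not the DSP width used in the paper).
BASE_BITS = 16


def karatsuba(a: int, b: int, bits: int) -> int:
    """Multiply two non-negative integers that each fit in `bits` bits."""
    if bits <= BASE_BITS:
        # Base case: stands in for a native hardware multiplication.
        return a * b

    half = bits // 2
    mask = (1 << half) - 1

    # Split each operand into a low and a high part.
    a_lo, a_hi = a & mask, a >> half
    b_lo, b_hi = b & mask, b >> half

    # Three recursive multiplications instead of four.
    lo = karatsuba(a_lo, b_lo, half)
    hi = karatsuba(a_hi, b_hi, bits - half)
    # The sums can be one bit wider than the larger of the two parts.
    cross = karatsuba(a_lo + a_hi, b_lo + b_hi, bits - half + 1) - lo - hi

    # Recombine: a*b = hi * 2^(2*half) + cross * 2^half + lo.
    return (hi << (2 * half)) + (cross << half) + lo


def random_mantissa(bits: int, rng: random.Random) -> int:
    """Hypothetical helper: a random non-negative integer below 2**bits."""
    return rng.getrandbits(bits)
```

For example, `karatsuba(x, y, 512)` with two 512-bit mantissas `x` and `y` returns the same value as `x * y`. Because the width is known in advance, the recursion depth is fixed, which is the property that lets a hardware design unroll the decomposition into a pipeline when the precision is fixed at compile time.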
April 17, 2022 by hgpu