
Fast Arbitrary Precision Floating Point on FPGA

Johannes de Fine Licht, Christopher A. Pattison, Alexandros Nikolaos Ziogas, David Simmons-Duffin, Torsten Hoefler
Department of Computer Science, ETH Zurich, Switzerland
arXiv:2204.06256 [cs.DC] (13 Apr 2022)

@misc{https://doi.org/10.48550/arxiv.2204.06256,
   doi       = {10.48550/ARXIV.2204.06256},
   url       = {https://arxiv.org/abs/2204.06256},
   author    = {de Fine Licht, Johannes and Pattison, Christopher A. and Ziogas, Alexandros Nikolaos and Simmons-Duffin, David and Hoefler, Torsten},
   keywords  = {Distributed, Parallel, and Cluster Computing (cs.DC), FOS: Computer and information sciences},
   title     = {Fast Arbitrary Precision Floating Point on FPGA},
   publisher = {arXiv},
   year      = {2022},
   copyright = {arXiv.org perpetual, non-exclusive license}
}

Numerical codes that require arbitrary precision floating point (APFP) numbers for their core computation are dominated by elementary arithmetic operations due to the super-linear complexity of multiplication in the number of mantissa bits. APFP computations on conventional software-based architectures are made exceedingly expensive by the lack of native hardware support, requiring elementary operations to be emulated using instructions operating on machine-word-sized blocks. In this work, we show how APFP multiplication on compile-time fixed-precision operands can be implemented as deep FPGA pipelines with a recursively defined Karatsuba decomposition on top of native DSP multiplication. When comparing our design implemented on an Alveo U250 accelerator to a dual-socket 36-core Xeon node running the GNU Multiple Precision Floating-Point Reliable (MPFR) library, we achieve a 9.8x speedup at 4.8 GOp/s for 512-bit multiplication, and a 5.3x speedup at 1.2 GOp/s for 1024-bit multiplication, corresponding to the throughput of more than 351 and 191 CPU cores, respectively. We apply this architecture to general matrix-matrix multiplication, yielding a 10x speedup at 2.0 GOp/s over the Xeon node, equivalent to more than 375 CPU cores, effectively allowing a single FPGA to replace a small CPU cluster. Due to the significant dependence of some numerical codes, such as semidefinite program solvers, on APFP, we expect these gains to translate into real-world speedups. Our configurable and flexible HLS-based code provides a high-level software interface for plug-and-play acceleration and is published as an open-source project.
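The central algorithmic idea, a recursively defined Karatsuba decomposition, replaces each wide mantissa multiplication with three half-width ones until the operands are narrow enough to multiply directly. The sketch below is a minimal software illustration of that decomposition, not the authors' HLS pipeline: the 32-bit limb width, the recursion threshold, and all helper names are illustrative assumptions, whereas the FPGA design unrolls the same recursion into a pipeline whose base-case products land on native DSP multipliers.

// Minimal sketch of recursive Karatsuba multiplication on multi-limb integers
// (the mantissa part of an APFP multiply). Illustrative only; limb width,
// threshold, and helpers are assumptions, not the paper's implementation.
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <vector>

using std::size_t;
using Limb = uint32_t;   // one "digit" of the mantissa
using Wide = uint64_t;   // holds a full Limb x Limb product
constexpr int kLimbBits = 32;

// Base case: O(n^2) schoolbook multiplication, i.e. a grid of narrow
// multiplies (the part that maps onto DSP blocks in hardware).
std::vector<Limb> Schoolbook(const std::vector<Limb>& a, const std::vector<Limb>& b) {
  std::vector<Limb> r(a.size() + b.size(), 0);
  for (size_t i = 0; i < a.size(); ++i) {
    Wide carry = 0;
    for (size_t j = 0; j < b.size(); ++j) {
      Wide cur = Wide(a[i]) * b[j] + r[i + j] + carry;
      r[i + j] = Limb(cur);
      carry = cur >> kLimbBits;
    }
    r[i + b.size()] = Limb(carry);
  }
  return r;
}

// r[off...] += b, propagating the carry inside r.
void AddAt(std::vector<Limb>& r, const std::vector<Limb>& b, size_t off) {
  Wide carry = 0;
  size_t i = 0;
  for (; i < b.size(); ++i) {
    Wide cur = Wide(r[off + i]) + b[i] + carry;
    r[off + i] = Limb(cur);
    carry = cur >> kLimbBits;
  }
  for (; carry && off + i < r.size(); ++i) {
    Wide cur = Wide(r[off + i]) + carry;
    r[off + i] = Limb(cur);
    carry = cur >> kLimbBits;
  }
}

// a -= b in place; callers guarantee a >= b, so no final borrow remains.
void SubInPlace(std::vector<Limb>& a, const std::vector<Limb>& b) {
  Wide borrow = 0;
  for (size_t i = 0; i < a.size(); ++i) {
    Wide sub = (i < b.size() ? b[i] : 0) + borrow;
    borrow = (Wide(a[i]) < sub) ? 1 : 0;
    a[i] = Limb(Wide(a[i]) - sub);
  }
}

// Sum of two equally sized halves, with one extra limb for the carry-out.
std::vector<Limb> AddHalves(const std::vector<Limb>& a, const std::vector<Limb>& b) {
  std::vector<Limb> r(a.size() + 1, 0);
  Wide carry = 0;
  for (size_t i = 0; i < a.size(); ++i) {
    Wide cur = Wide(a[i]) + b[i] + carry;
    r[i] = Limb(cur);
    carry = cur >> kLimbBits;
  }
  r[a.size()] = Limb(carry);
  return r;
}

// Karatsuba: three half-width multiplications instead of four, applied
// recursively until the operands are narrow enough to multiply directly.
std::vector<Limb> Multiply(const std::vector<Limb>& a, const std::vector<Limb>& b) {
  const size_t n = a.size();
  if (n != b.size() || n % 2 != 0 || n <= 2) return Schoolbook(a, b);
  const size_t m = n / 2;
  std::vector<Limb> a0(a.begin(), a.begin() + m), a1(a.begin() + m, a.end());
  std::vector<Limb> b0(b.begin(), b.begin() + m), b1(b.begin() + m, b.end());
  std::vector<Limb> z0 = Multiply(a0, b0);            // low * low
  std::vector<Limb> z2 = Multiply(a1, b1);            // high * high
  std::vector<Limb> z1 = Multiply(AddHalves(a0, a1),  // (a0+a1) * (b0+b1)
                                  AddHalves(b0, b1));
  SubInPlace(z1, z0);                                 // isolate the middle term
  SubInPlace(z1, z2);
  std::vector<Limb> r(2 * n, 0);
  AddAt(r, z0, 0);      // z0
  AddAt(r, z1, m);      // + z1 * 2^(32*m)
  AddAt(r, z2, 2 * m);  // + z2 * 2^(32*2m)
  return r;
}

int main() {
  // Two 512-bit mantissas (16 limbs), mirroring the paper's 512-bit case.
  std::vector<Limb> a(16), b(16);
  uint64_t state = 0x243F6A8885A308D3ULL;  // deterministic pseudo-random limbs
  auto next = [&state]() {
    state = state * 6364136223846793005ULL + 1442695040888963407ULL;
    return Limb(state >> 32);
  };
  for (auto& x : a) x = next();
  for (auto& x : b) x = next();
  std::printf("Karatsuba matches schoolbook: %s\n",
              Multiply(a, b) == Schoolbook(a, b) ? "yes" : "no");
}

Built with any C++11 compiler, the sketch checks the decomposition against plain schoolbook multiplication; the hardware design instead fixes the precision at compile time and fully unrolls this recursion into a deep pipeline.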