High Precision Integer Multiplication with a GPU Using Strassen’s Algorithm with Multiple FFT Sizes
Computer Science Department, University of Massachusetts, Amherst, MA 01003-4610, USA
Parallel Processing Letters (PPL), Volume: 21, Issue: 3, pp. 359-375, 2011
@article{emmart2011high,
title={HIGH PRECISION INTEGER MULTIPLICATION WITH A GPU USING STRASSEN’S ALGORITHM WITH MULTIPLE FFT SIZES},
author={EMMART, N.},
journal={Parallel Processing Letters},
volume={21},
number={3},
pages={359–375},
year={2011}
}
We have improved our prior implementation of Strassens algorithm for high performance multiplication of very large integers on a general purpose graphics processor (GPU). A combination of algorithmic and implementation optimizations result in a factor of up to 13.9 speed improvement over our previous work, running on an NVIDIA 295. We have also reoptimized the implementation for an NVIDIA 480, from which we obtain a factor of up to 19 speedup in comparison with a Core i7 processor core of the same technology generation. To provide a fairer chip to chip comparison, we also determined total GPU throughput on a set of multiplications relative to all of the cores on a multicore chip running in parallel. We find that the GTX 480 provides a factor of six higher throughput than all four cores/eight threads of the Core i7. This paper discusses how we adapted the algorithm to operate within the limitations of the GPU and how we dealt with other issues encountered in the implementation process, including details of the memory layout of our FFTs. Compared with our earlier work, which used Karatsuba’s algorithm to guide multiplication of different operand sizes built on top of Strassen’s algorithm being applied to fixed-size segments of the operands, we are now able to apply Strassen’s algorithm directly to operands ranging in size from 255K bits to 16,320K bits.
January 11, 2012 by hgpu