Strassen’s Matrix Multiplication on GPUs
Department of Computer and Information Science and Engineering, University of Florida, Gainesville, FL, USA
IEEE 17th International Conference on Parallel and Distributed Systems (ICPADS), 2011
We provide efficient single-precision and integer GPU implementations of Strassen’s algorithm as well as of Winograd’s variant. On an NVIDIA C1060 GPU, when multiplying 16384×16384 matrices, the 4-level Strassen implementation achieves a speedup of 32% in single precision (35% for integers) and Winograd’s variant achieves 33% (36% for integers), relative to the sgemm code (the integer version of sgemm in the integer case) in CUBLAS 3.0. For n = 16384, the maximum numerical error of the single-precision implementations is about 2 orders of magnitude higher than that of sgemm; the integer implementations have zero error.
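For readers unfamiliar with the recursion, the following is a minimal CPU-side sketch of Strassen’s seven-product scheme, written in plain Python for illustration only. It is not the authors’ GPU code: the function names (naive_matmul, strassen), the levels parameter, and the fallback to a classical multiply are assumptions made for this example, and the matrices are assumed square with a dimension divisible by 2 at every level used. The classical multiply stands in for the role sgemm plays at the bottom of the recursion.

```python
def naive_matmul(a, b):
    """Classical O(n^3) product of two matrices given as lists of rows."""
    inner, cols = len(b), len(b[0])
    return [[sum(row[k] * b[k][j] for k in range(inner)) for j in range(cols)]
            for row in a]


def add(a, b):
    """Elementwise sum of two equally sized matrices."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def sub(a, b):
    """Elementwise difference of two equally sized matrices."""
    return [[x - y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def split(m):
    """Split a square matrix of even dimension into four quadrants."""
    h = len(m) // 2
    return ([r[:h] for r in m[:h]], [r[h:] for r in m[:h]],
            [r[:h] for r in m[h:]], [r[h:] for r in m[h:]])


def strassen(a, b, levels):
    """Multiply square matrices using `levels` levels of Strassen recursion.

    Falls back to the classical product when no levels remain or the
    dimension is odd (hypothetical cutoff rule chosen for this sketch).
    """
    if levels == 0 or len(a) % 2 != 0:
        return naive_matmul(a, b)

    a11, a12, a21, a22 = split(a)
    b11, b12, b21, b22 = split(b)

    # Seven half-size products instead of the classical eight.
    m1 = strassen(add(a11, a22), add(b11, b22), levels - 1)
    m2 = strassen(add(a21, a22), b11, levels - 1)
    m3 = strassen(a11, sub(b12, b22), levels - 1)
    m4 = strassen(a22, sub(b21, b11), levels - 1)
    m5 = strassen(add(a11, a12), b22, levels - 1)
    m6 = strassen(sub(a21, a11), add(b11, b12), levels - 1)
    m7 = strassen(sub(a12, a22), add(b21, b22), levels - 1)

    # Recombine the products into the four quadrants of the result.
    c11 = add(sub(add(m1, m4), m5), m7)
    c12 = add(m3, m5)
    c21 = add(m2, m4)
    c22 = add(add(sub(m1, m2), m3), m6)

    top = [left + right for left, right in zip(c11, c12)]
    bottom = [left + right for left, right in zip(c21, c22)]
    return top + bottom
```

With integer entries every addition and multiplication above is exact, so the result matches the classical product element for element, which is consistent with the zero error reported for the integer implementations. In floating point the extra additions and subtractions change the rounding behavior, which is why the single-precision error differs from that of sgemm.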
January 23, 2012 by hgpu