Improving Performance of Matrix Multiplication and FFT on GPU
Key Laboratory of High Confidence Software Technologies, Ministry of Education, School of Electronics Engineering and Computer Science, Peking University, Beijing, China
In ICPADS ’09: Proceedings of the 2009 15th International Conference on Parallel and Distributed Systems (2009), pp. 42-48.
@conference{cui2010improving,
title={Improving performance of matrix multiplication and FFT on GPU},
author={Cui, X. and Chen, Y. and Mei, H.},
booktitle={Parallel and Distributed Systems (ICPADS), 2009 15th International Conference on},
pages={42--48},
issn={1521-9097},
year={2009},
organization={IEEE}
}
In this paper we discuss our experience in improving the performance of two key algorithms using CUDA: the single-precision matrix-matrix multiplication subprogram (SGEMM of BLAS) and the single-precision FFT. The former is computation-intensive, while the latter is memory-bandwidth- or communication-intensive. For SGEMM, a peak performance of 393 Gflops is achieved on the NVIDIA GeForce GTX280, about 5% faster than the CUBLAS 2.0 library. Better FFT performance is obtained for a range of dimensions. Some common principles for the design and implementation of many-core algorithms are discussed.
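The compute-bound vs. bandwidth-bound distinction the abstract draws can be made concrete with arithmetic intensity (flops per byte of memory traffic). The sketch below uses textbook operation counts, not figures from the paper: an n x n SGEMM performs 2n^3 flops over roughly 3n^2 floats of traffic, while a radix-2 complex FFT of length n performs about 5n log2(n) flops over 2n complex values. The specific sizes chosen are illustrative assumptions.

```python
import math


def sgemm_intensity(n: int) -> float:
    """Arithmetic intensity of an n x n single-precision GEMM.

    2*n^3 flops over ~3*n^2 floats moved (read A, read B, write C).
    """
    flops = 2 * n ** 3
    bytes_moved = 3 * n ** 2 * 4  # 4 bytes per single-precision float
    return flops / bytes_moved


def fft_intensity(n: int) -> float:
    """Arithmetic intensity of a radix-2 complex FFT of length n.

    ~5*n*log2(n) flops over 2*n complex values (read input, write output).
    """
    flops = 5 * n * math.log2(n)
    bytes_moved = 2 * n * 8  # 8 bytes per single-precision complex value
    return flops / bytes_moved


if __name__ == "__main__":
    # SGEMM intensity grows linearly with n (n/6 flops/byte), so large matrix
    # multiplies are limited by the ALUs; FFT intensity grows only with log n,
    # so the FFT stays limited by memory bandwidth.
    print(f"SGEMM (n=4096): {sgemm_intensity(4096):.1f} flops/byte")
    print(f"FFT (n=2^20):   {fft_intensity(2 ** 20):.2f} flops/byte")
```

For a GTX280-class GPU (roughly 933 Gflops single-precision peak against about 142 GB/s of bandwidth, i.e. a machine balance near 6.6 flops/byte), these numbers show why the two kernels call for different optimization strategies.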
November 2, 2010 by hgpu