Improving Performance of Matrix Multiplication and FFT on GPU

Xiang Cui, Yifeng Chen, Hong Mei
Key Laboratory of High Confidence Software Technologies, Ministry of Education, School of Electronics Engineering and Computer Science, Peking University, Beijing, China
In ICPADS ’09: Proceedings of the 2009 15th International Conference on Parallel and Distributed Systems (2009), pp. 42-48.


@inproceedings{cui2009improving,
  title={Improving performance of matrix multiplication and FFT on GPU},
  author={Cui, X. and Chen, Y. and Mei, H.},
  booktitle={Parallel and Distributed Systems (ICPADS), 2009 15th International Conference on},
  pages={42--48},
  year={2009}
}
In this paper we discuss our experiences in improving the performance of two key algorithms using CUDA: the single-precision matrix-matrix multiplication subprogram (SGEMM of BLAS) and the single-precision FFT. The former is computation-intensive, while the latter is memory-bandwidth- and communication-intensive. For SGEMM, a peak performance of 393 Gflops is achieved on an NVIDIA GeForce GTX 280, about 5% faster than the CUBLAS 2.0 library. For FFT, better performance results are obtained across a range of dimensions. Some common principles for the design and implementation of many-core algorithms are also discussed.
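The abstract contrasts a computation-bound kernel (SGEMM) with a bandwidth-bound one (FFT). The standard starting point that high-performance CUDA SGEMM implementations optimize beyond is the shared-memory tiled kernel, where each thread block stages square tiles of A and B in on-chip shared memory so each global-memory element is reused TILE times. The sketch below is a minimal illustration of that baseline technique, not the authors' implementation; the names `sgemm_tiled` and `TILE`, and the assumption that N is a multiple of `TILE`, are ours.

```cuda
// Minimal shared-memory tiled SGEMM sketch: C = A * B, row-major,
// square N x N matrices, N assumed to be a multiple of TILE.
#include <cuda_runtime.h>

#define TILE 16

__global__ void sgemm_tiled(const float *A, const float *B, float *C, int N)
{
    __shared__ float As[TILE][TILE];   // tile of A staged on-chip
    __shared__ float Bs[TILE][TILE];   // tile of B staged on-chip

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;

    // Slide the tile window along the shared dimension.
    for (int t = 0; t < N / TILE; ++t) {
        As[threadIdx.y][threadIdx.x] = A[row * N + t * TILE + threadIdx.x];
        Bs[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * N + col];
        __syncthreads();               // tile fully loaded before use

        for (int k = 0; k < TILE; ++k)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();               // done with tile before overwriting
    }
    C[row * N + col] = acc;
}
```

Kernels approaching the paper's 393 Gflops on a GTX 280 typically go further, e.g. computing several output elements per thread in registers and tuning tile shapes to the memory-coalescing rules of the hardware.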

