Fast Implementation of DGEMM on Fermi GPU

Guangming Tan, Linchuan Li, Sean Triechle, Everett Phillips, Yungang Bao, Ninghui Sun
Key Laboratory of Computer Architecture, Institute of Computing Technology, Chinese Academy of Sciences
ACM/IEEE Supercomputing (SC’11), 2011


   title={Fast Implementation of DGEMM on Fermi GPU},

   author={Tan, G. and Li, L. and Triechle, S. and Phillips, E. and Bao, Y. and Sun, N.},

   booktitle={International Conference for High Performance Computing, Networking, Storage and Analysis, SC '11},



In this paper we present a thorough account of our experience tuning double-precision matrix-matrix multiplication (DGEMM) on the Fermi GPU architecture. We choose an optimal algorithm with blocking in both shared memory and registers to satisfy the constraints of the Fermi memory hierarchy. Our optimization strategy is further guided by a performance model based on micro-architecture benchmarks. Our optimizations include software pipelining, use of vector memory operations, and instruction scheduling. Our best CUDA algorithm achieves performance comparable to the latest CUBLAS library. We further improve upon this with an implementation in the native machine language, leading to a 20% increase in performance: the achieved peak performance (efficiency) rises from 302 Gflop/s (58%) to 362 Gflop/s (70%).
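
The two-level blocking mentioned in the abstract can be illustrated with a minimal pure-Python sketch. It is not the authors' kernel: the function name `dgemm_blocked` and the block sizes `tile` and `reg` are hypothetical, chosen only for illustration, and the abstract does not state the block sizes actually used. The outer tiles stand in for blocks that a GPU kernel would stage in shared memory, and the small accumulator array stands in for values held in registers.

```python
def dgemm_blocked(A, B, C, alpha=1.0, beta=0.0, tile=4, reg=2):
    """Return alpha*A*B + beta*C using two-level loop blocking.

    Illustrative only: `tile` and `reg` are assumed sizes, not values
    taken from the paper. Matrices are lists of row lists.
    """
    m, k, n = len(A), len(B), len(B[0])
    # Start from beta*C, then accumulate alpha*A*B block by block.
    out = [[beta * C[i][j] for j in range(n)] for i in range(m)]
    # Outer loops: tiles that a GPU kernel would stage in shared memory.
    for i0 in range(0, m, tile):
        for j0 in range(0, n, tile):
            for p0 in range(0, k, tile):
                i1 = min(i0 + tile, m)
                j1 = min(j0 + tile, n)
                p1 = min(p0 + tile, k)
                # Inner loops: micro-tiles that would be held in registers.
                for i in range(i0, i1, reg):
                    for j in range(j0, j1, reg):
                        ih = min(i + reg, i1)
                        jh = min(j + reg, j1)
                        # Local accumulators, reused across the whole p range.
                        acc = [[0.0] * (jh - j) for _ in range(ih - i)]
                        for p in range(p0, p1):
                            for ii in range(i, ih):
                                a = A[ii][p]  # loaded once, used for a full row of the micro-tile
                                for jj in range(j, jh):
                                    acc[ii - i][jj - j] += a * B[p][jj]
                        # Write the micro-tile back once per (tile, p0) step.
                        for ii in range(i, ih):
                            for jj in range(j, jh):
                                out[ii][jj] += alpha * acc[ii - i][jj - j]
    return out
```

The point of the structure is data reuse: each element loaded into the inner accumulators' working set is used several times before the next load, which is the property that blocking in shared memory and registers exploits on a GPU. In Python the sketch shows only the loop structure, not any performance benefit.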