Performance Upper Bound Analysis and Optimization of SGEMM on Fermi and Kepler GPUs

Junjie Lai, André Seznec
INRIA, France
International Symposium on Code Generation and Optimization (CGO ’13), 2013
@inproceedings{lai:hal-00789958,
   hal_id={hal-00789958},
   url={http://hal.inria.fr/hal-00789958},
   title={Performance Upper Bound Analysis and Optimization of SGEMM on Fermi and Kepler GPUs},
   author={Lai, Junjie and Seznec, Andr{\'e}},
   keywords={Kepler GPU; Fermi GPU; SGEMM; CUDA; Performance Upper Bound Analysis},
   language={English},
   affiliation={ALF – INRIA – IRISA},
   booktitle={CGO ’13 – 2013 International Symposium on Code Generation and Optimization},
   address={Shenzhen, China},
   audience={international},
   year={2013},
   month={Feb},
   pdf={http://hal.inria.fr/hal-00789958/PDF/112_Lai.pdf}
}

In this paper, we present an approach to estimate GPU applications’ performance upper bound based on algorithm analysis and assembly code level benchmarking. As an example, we analyze the potential peak performance of SGEMM (Single-precision General Matrix Multiply) on Fermi (GF110) and Kepler (GK104) GPUs. We try to answer the question of how much optimization space is left for SGEMM and why. According to our analysis, the nature of the Fermi (Kepler) instruction set and the limited issue throughput of the schedulers are the main factors that keep SGEMM from approaching the theoretical peak performance. The estimated upper-bound peak performance of SGEMM is around 82.5% of the theoretical peak performance on the GTX580 Fermi GPU and 57.6% on the GTX680 Kepler GPU. Guided by this analysis and using the native assembly language, our SGEMM implementations achieve, on average, about 5% better performance than CUBLAS in the CUDA 4.1 SDK for large matrices on GTX580. The achieved performance is around 90% of the estimated upper-bound performance of SGEMM on GTX580. On GTX680, the best performance we achieve is around 77.3% of the estimated performance upper bound. We also describe how to use native assembly language directly in the CUDA runtime source
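
The percentages quoted in the abstract can be chained to see roughly where the reported implementations land relative to the theoretical peak. The Python sketch below is only a back-of-the-envelope illustration, not the authors' method. The core counts and clock frequencies in `SPECS` are assumptions taken from commonly published card specifications (they do not appear in the abstract), and the function names are hypothetical. The peak formula assumes one fused multiply-add (2 FLOPs) per core per cycle.

```python
# Fractions quoted in the abstract.
UPPER_BOUND_OF_PEAK = {"GTX580": 0.825, "GTX680": 0.576}  # estimated bound / theoretical peak
ACHIEVED_OF_BOUND = {"GTX580": 0.90, "GTX680": 0.773}     # achieved / estimated bound

# ASSUMPTION (not from the abstract): commonly published specifications,
# given as (CUDA cores, clock in GHz). GTX580 uses the shader clock,
# GTX680 the base clock.
SPECS = {"GTX580": (512, 1.544), "GTX680": (1536, 1.006)}


def theoretical_peak_gflops(cores, clock_ghz):
    """Single-precision peak, assuming one FMA (2 FLOPs) per core per cycle."""
    return cores * 2 * clock_ghz


def summarize(gpu):
    """Chain the abstract's percentages into absolute and relative figures."""
    cores, clock_ghz = SPECS[gpu]
    peak = theoretical_peak_gflops(cores, clock_ghz)
    bound = peak * UPPER_BOUND_OF_PEAK[gpu]
    achieved = bound * ACHIEVED_OF_BOUND[gpu]
    return {
        "peak_gflops": peak,
        "bound_gflops": bound,
        "achieved_gflops": achieved,
        # This ratio depends only on the abstract's percentages, not on SPECS.
        "achieved_of_peak": UPPER_BOUND_OF_PEAK[gpu] * ACHIEVED_OF_BOUND[gpu],
    }


if __name__ == "__main__":
    for gpu in SPECS:
        s = summarize(gpu)
        print(
            f"{gpu}: peak {s['peak_gflops']:.0f} GFLOPS, "
            f"bound {s['bound_gflops']:.0f} GFLOPS, "
            f"achieved {s['achieved_gflops']:.0f} GFLOPS "
            f"({s['achieved_of_peak']:.1%} of peak)"
        )
```

Only the `achieved_of_peak` ratio follows directly from the abstract: about 74% of the theoretical peak on GTX580 and about 45% on GTX680. The absolute GFLOPS values change if different clock figures are assumed.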