high performance computing on graphics processing units: hgpu.org


Performance Upper Bound Analysis and Optimization of SGEMM on Fermi and Kepler GPUs

Junjie Lai, Andre Seznec

INRIA, France

International Symposium on Code Generation and Optimization (CGO ’13), 2013

@inproceedings{lai:hal-00789958,

hal_id={hal-00789958},

url={http://hal.inria.fr/hal-00789958},

title={Performance Upper Bound Analysis and Optimization of SGEMM on Fermi and Kepler GPUs},

author={Lai, Junjie and Seznec, Andr{\'e}},

keywords={Kepler GPU; Fermi GPU; SGEMM; CUDA; Performance Upper Bound Analysis},

language={English},

affiliation={ALF – INRIA – IRISA},

booktitle={CGO ’13 – 2013 International Symposium on Code Generation and Optimization},

address={Shenzhen, China},

audience={international},

year={2013},

month={Feb},

pdf={http://hal.inria.fr/hal-00789958/PDF/112_Lai.pdf}

}


In this paper, we present an approach to estimate the performance upper bound of GPU applications based on algorithm analysis and assembly-code-level benchmarking. As an example, we analyze the potential peak performance of SGEMM (Single-precision General Matrix Multiply) on Fermi (GF110) and Kepler (GK104) GPUs. We try to answer the question of how much optimization space is left for SGEMM and why. According to our analysis, the nature of the Fermi (Kepler) instruction set and the limited issue throughput of the schedulers are the main factors that keep SGEMM from approaching the theoretical peak performance. The estimated upper-bound performance of SGEMM is around 82.5% of the theoretical peak performance on the GTX580 Fermi GPU and 57.6% on the GTX680 Kepler GPU. Guided by this analysis and using the native assembly language, our SGEMM implementations achieve, on average, about 5% better performance than CUBLAS in the CUDA 4.1 SDK for large matrices on GTX580. The achieved performance is around 90% of the estimated upper-bound performance of SGEMM on GTX580. On GTX680, the best performance we achieve is around 77.3% of the estimated performance upper bound. We also describe how to use native assembly language directly in the CUDA runtime source
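The abstract quotes two kinds of ratio: the estimated upper bound relative to theoretical peak, and the achieved performance relative to that upper bound. Multiplying them gives a rough figure for achieved performance relative to theoretical peak. The short Python sketch below does only that arithmetic. The function and dictionary names are illustrative and do not come from the paper, and the inputs are the rounded "around" figures from the abstract (an average for GTX580, a best case for GTX680), so the results are approximate: roughly 74% of peak on GTX580 and roughly 45% on GTX680.

```python
# Ratios quoted in the abstract, expressed as fractions.
# Names are hypothetical; only the numbers come from the abstract.
UPPER_BOUND_VS_PEAK = {"GTX580": 0.825, "GTX680": 0.576}
ACHIEVED_VS_UPPER_BOUND = {"GTX580": 0.90, "GTX680": 0.773}


def achieved_fraction_of_peak(gpu: str) -> float:
    """Return achieved / theoretical peak as the product of the two quoted ratios."""
    return UPPER_BOUND_VS_PEAK[gpu] * ACHIEVED_VS_UPPER_BOUND[gpu]


if __name__ == "__main__":
    for gpu in UPPER_BOUND_VS_PEAK:
        fraction = achieved_fraction_of_peak(gpu)
        print(f"{gpu}: about {fraction:.0%} of theoretical peak")
```

Read this way, the gap between the achieved and theoretical figures splits into two parts: the portion the analysis attributes to the instruction set and scheduler issue throughput (the upper bound), and the remaining distance between the implementation and that bound.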

Tags: Algorithms, Benchmarking, Computer science, CUBLAS, CUDA, nVidia, nVidia GeForce GTX 680, Optimization, Performance

February 22, 2013 by hgpu
