A Practical Performance Model for Compute and Memory Bound GPU Kernels

Elias Konstantinidis, Yiannis Cotronis
University of Athens, Department of Informatics and Telecommunications
23rd Euromicro International Conference on Parallel, Distributed and Network-Based Processing (PDP), 2015


@INPROCEEDINGS{konstantinidis2015practical,
   author={E. Konstantinidis and Y. Cotronis},
   booktitle={2015 23rd Euromicro International Conference on Parallel, Distributed, and Network-Based Processing},
   title={A Practical Performance Model for Compute and Memory Bound GPU Kernels},
   year={2015},
   keywords={graphics processing units;parallel architectures;performance evaluation;CUDA kernels;GPU hardware;GPU kernels;architecture specifications;compute memory bandwidth ratios;compute memory bound characteristic;memory bound GPU Kernels;memory transfer performance;microbenchmark program;microbenchmarking specifications;peak compute performance;performance prediction;practical performance model;quadrant split model;roofline visual performance model;variable operation intensity;Bandwidth;Computational modeling;Graphics processing units;Kernel;Performance evaluation;Throughput;Visualization;GPU kernels;micro-benchmarks;performance model;performance prediction},
}

Performance prediction of GPU kernels is generally a tedious procedure with unpredictable results. In this paper, we provide a practical model for estimating the performance of CUDA kernels on GPU hardware in an automated manner. First, we propose the quadrant-split model, an alternative to the roofline visual performance model, which provides insight into the performance-limiting factors of multiple devices with different compute-to-memory bandwidth ratios with respect to a particular kernel. We elaborate on the compute-bound versus memory-bound characteristic of kernels. In addition, a micro-benchmark program was developed that exposes peak compute and memory transfer performance using variable operation intensity. Experimental results of executions on different GPUs are presented. In the proposed performance prediction procedure, a set of kernel features is extracted through an automated profiling execution, which records a set of significant kernel metrics. Additionally, a small set of device features for the target GPU is generated using micro-benchmarking and architecture specifications. By combining the kernel and device features, we determine the performance-limiting factor and generate an estimate of the kernel's execution time. We performed experiments on DAXPY, DGEMM, FFT and stencil computation kernels using four GPUs, and observed an absolute prediction error of 10.1% in the average case and 25.8% in the worst case.
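The core idea of combining kernel features (operation counts, memory traffic) with device features (peak compute and bandwidth) to find the limiting factor can be sketched in a roofline-style estimate. The function below is an illustrative simplification, not the paper's exact model; the feature values and peak figures are hypothetical.

```python
def estimate_kernel_time(flops, bytes_moved, peak_gflops, peak_gbps):
    """Roofline-style lower-bound estimate of kernel execution time.

    flops: total floating-point operations performed by the kernel
    bytes_moved: total bytes transferred to/from device memory
    peak_gflops, peak_gbps: device peaks (GFLOP/s and GB/s), e.g. from
    micro-benchmarks as in the paper.
    Returns (estimated time in seconds, limiting factor).
    """
    t_compute = flops / (peak_gflops * 1e9)     # time if compute bound
    t_memory = bytes_moved / (peak_gbps * 1e9)  # time if memory bound
    if t_compute >= t_memory:
        return t_compute, "compute"
    return t_memory, "memory"

# Example: DAXPY (y = a*x + y) on n doubles performs 2n FLOPs and moves
# 24n bytes (two 8-byte reads, one 8-byte write per element), so its
# operation intensity of 1/12 FLOP/byte makes it memory bound on any
# device whose compute-to-bandwidth ratio exceeds that. Device peaks
# below are hypothetical.
n = 1 << 24
t, limit = estimate_kernel_time(2 * n, 24 * n, peak_gflops=1000, peak_gbps=200)
```

Whichever of the two times dominates identifies the limiting factor, mirroring the compute-bound/memory-bound split the model is built around.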

HGPU group © 2010-2022 hgpu.org

All rights belong to the respective authors
