A Practical Performance Model for Compute and Memory Bound GPU Kernels

Elias Konstantinidis, Yiannis Cotronis
University of Athens, Department of Informatics and Telecommunications
23rd Euromicro International Conference on Parallel, Distributed and Network-Based Processing (PDP), 2015


@INPROCEEDINGS{konstantinidis2015practical,
   author={E. Konstantinidis and Y. Cotronis},
   booktitle={2015 23rd Euromicro International Conference on Parallel, Distributed, and Network-Based Processing},
   title={A Practical Performance Model for Compute and Memory Bound GPU Kernels},
   year={2015},
   keywords={graphics processing units;parallel architectures;performance evaluation;CUDA kernels;GPU hardware;GPU kernels;architecture specifications;compute memory bandwidth ratios;compute memory bound characteristic;memory bound GPU Kernels;memory transfer performance;microbenchmark program;microbenchmarking specifications;peak compute performance;performance prediction;practical performance model;quadrant split model;roofline visual performance model;variable operation intensity;Bandwidth;Computational modeling;Graphics processing units;Kernel;Performance evaluation;Throughput;Visualization;GPU kernels;micro-benchmarks;performance model;performance prediction},
}

Performance prediction of GPU kernels is generally a tedious procedure with unpredictable results. In this paper, we provide a practical model for estimating the performance of CUDA kernels on GPU hardware in an automated manner. First, we propose the quadrant-split model, an alternative to the roofline visual performance model, which provides insight into the performance-limiting factors of multiple devices with different compute-to-memory bandwidth ratios with respect to a particular kernel. We elaborate on the compute-bound versus memory-bound characteristic of kernels. In addition, a micro-benchmark program was developed that exposes peak compute and memory transfer performance using variable operation intensity. Experimental results of executions on different GPUs are presented. In the proposed performance prediction procedure, a set of kernel features is extracted through an automated profiling execution, which records a set of significant kernel metrics. Additionally, a small set of device features for the target GPU is generated using micro-benchmarking and architecture specifications. By combining the kernel and device features, we determine the performance-limiting factor and generate an estimate of the kernel's execution time. We performed experiments on DAXPY, DGEMM, FFT and stencil computation kernels using four GPUs, and observed an absolute prediction error of 10.1% in the average case and 25.8% in the worst case.
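The core idea of combining kernel features (operation counts, memory traffic) with device features (peak compute and bandwidth) to find the limiting factor can be sketched in a roofline-style estimate. The function below is an illustrative simplification, not the paper's exact model; the feature values and peak figures are hypothetical.

```python
def estimate_kernel_time(flops, bytes_moved, peak_gflops, peak_gbps):
    """Roofline-style lower-bound estimate of kernel execution time.

    flops: total floating-point operations performed by the kernel
    bytes_moved: total bytes transferred to/from device memory
    peak_gflops, peak_gbps: device peaks (GFLOP/s and GB/s), e.g. from
    micro-benchmarks as in the paper.
    Returns (estimated time in seconds, limiting factor).
    """
    t_compute = flops / (peak_gflops * 1e9)     # time if compute bound
    t_memory = bytes_moved / (peak_gbps * 1e9)  # time if memory bound
    if t_compute >= t_memory:
        return t_compute, "compute"
    return t_memory, "memory"

# Example: DAXPY (y = a*x + y) on n doubles performs 2n FLOPs and moves
# 24n bytes (two 8-byte reads, one 8-byte write per element), so its
# operation intensity of 1/12 FLOP/byte makes it memory bound on any
# device whose compute-to-bandwidth ratio exceeds that. Device peaks
# below are hypothetical.
n = 1 << 24
t, limit = estimate_kernel_time(2 * n, 24 * n, peak_gflops=1000, peak_gbps=200)
```

Whichever of the two times dominates identifies the limiting factor, mirroring the compute-bound/memory-bound split the model is built around.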

HGPU group © 2010-2022 hgpu.org

All rights belong to the respective authors
