Analytical Performance Estimation during Code Generation on Modern GPUs

Dominik Ernst, Markus Holzer, Georg Hager, Matthias Knorr, Gerhard Wellein

Erlangen National High Performance Computing Center (NHR@FAU), Friedrich-Alexander-Universität Erlangen-Nürnberg, Martensstraße 1, Erlangen, 91058, Germany

arXiv:2204.14242 [cs.DC], (29 Apr 2022)

DOI:10.48550/arXiv.2204.14242

@misc{https://doi.org/10.48550/arxiv.2204.14242,
  doi={10.48550/ARXIV.2204.14242},
  url={https://arxiv.org/abs/2204.14242},
  author={Ernst, Dominik and Holzer, Markus and Hager, Georg and Knorr, Matthias and Wellein, Gerhard},
  keywords={Distributed, Parallel, and Cluster Computing (cs.DC), FOS: Computer and information sciences},
  title={Analytical Performance Estimation during Code Generation on Modern GPUs},
  publisher={arXiv},
  year={2022},
}

Source code package: WARPSPEED: An analytic Data Volume and Performance Estimator for GPU Kernels

Automatic code generation is frequently used to create implementations of algorithms that are specifically tuned to particular hardware and application parameters. The code generation process involves selecting adequate code transformations, tuning parameters, and parallelization strategies. To select the best-performing configuration, we propose an alternative to time-intensive autotuning, scenario-specific performance models, and black-box machine learning. This paper identifies the relevant performance-defining mechanisms for memory-intensive GPU applications through a performance model coupled with an analytic hardware metric estimator. This approach enables quick exploration of large configuration spaces and identifies highly efficient code candidates with high accuracy. We examine how the A100 GPU architecture changed relative to its predecessor, the V100, and address the challenge of modeling data transfer volumes through the new memory hierarchy. We show how our method can be coupled to the pystencils stencil code generator, which is used to generate kernels for a range-four 3D-25pt stencil and for a complex two-phase fluid solver based on the Lattice Boltzmann Method. For both, it delivers a ranking that can be used to select the best-performing candidate. The method is not limited to stencil kernels but can be integrated into any code generator that can generate the required address expressions.
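The idea of ranking candidate configurations from estimated data volumes can be illustrated with a minimal sketch. This is not the paper's model or the WARPSPEED API: it assumes a plain bandwidth-limit estimate in which the slowest memory level bounds the update rate. The names `Candidate`, `estimate_updates_per_second`, and `rank_candidates` are hypothetical, and all byte counts and bandwidths are illustrative placeholders rather than measured or published values.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    """One hypothetical kernel configuration with estimated traffic per lattice update."""
    name: str
    dram_bytes_per_update: float  # assumed estimate of DRAM traffic, in bytes
    l2_bytes_per_update: float    # assumed estimate of L2 cache traffic, in bytes


def estimate_updates_per_second(c: Candidate, dram_bw: float, l2_bw: float) -> float:
    """Bandwidth-limit estimate: the slowest memory level bounds the update rate."""
    return min(dram_bw / c.dram_bytes_per_update, l2_bw / c.l2_bytes_per_update)


def rank_candidates(candidates, dram_bw: float, l2_bw: float):
    """Sort candidates from highest to lowest estimated update rate."""
    return sorted(
        candidates,
        key=lambda c: estimate_updates_per_second(c, dram_bw, l2_bw),
        reverse=True,
    )


# Placeholder bandwidths in bytes per second (illustrative, not hardware specifications).
DRAM_BW = 1.4e12
L2_BW = 4.0e12

# Placeholder traffic estimates for three hypothetical thread block shapes.
candidates = [
    Candidate("block_16x16x1", dram_bytes_per_update=32.0, l2_bytes_per_update=40.0),
    Candidate("block_32x4x4", dram_bytes_per_update=20.0, l2_bytes_per_update=48.0),
    Candidate("block_64x2x2", dram_bytes_per_update=24.0, l2_bytes_per_update=80.0),
]

ranking = rank_candidates(candidates, DRAM_BW, L2_BW)
for c in ranking:
    rate = estimate_updates_per_second(c, DRAM_BW, L2_BW)
    print(f"{c.name}: {rate / 1e9:.1f} G updates/s (estimated)")
```

In this sketch, only the relative order of the candidates matters. A code generator could evaluate such an estimate for every configuration and compile or benchmark only the top-ranked ones.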

Tags: Code generation, Computer science, CUDA, Lattice Boltzmann model, Machine learning, nVidia, OpenCL, Package, Tesla A100, Tesla V100

May 8, 2022 by hgpu
