Improving Performance of Iterative Applications through Interleaved Execution of Approximated CUDA Kernels

Gabriel Freytag

Universidade Federal do Rio Grande do Sul, Instituto de Informática

Universidade Federal do Rio Grande do Sul, 2023

Approximate computing techniques, particularly those involving reduced and mixed precision, are widely studied in the literature as a way to accelerate applications and reduce energy consumption. Although many researchers analyze performance, accuracy loss, and energy consumption across a wide range of application domains, few evaluate approximate computing techniques in iterative applications. These applications rely on the results of previous iterations to perform subsequent ones, which makes them sensitive to precision errors that can propagate and magnify throughout the execution. Monitoring accuracy loss during execution is also challenging: calculating it at runtime is computationally expensive and becomes infeasible for applications with a considerable volume of data. This thesis presents a methodology for generating interleaved execution configurations of multiple kernel versions for iterative applications on GPUs. The methodology involves sampling the accuracy loss profile, extracting performance and accuracy loss statistics, and generating, offline, interleaved execution configurations of kernel versions for different accuracy loss thresholds. Experiments conducted on three iterative physical simulation applications over three-dimensional data domains demonstrated that the methodology can extract performance and accuracy loss statistics and generate interleaved execution configurations of kernel versions with speedups of up to 2x and energy consumption reductions of up to 60%. For future work, we suggest studying different optimization strategies for generating interleaved execution configurations of kernel versions, such as using neural networks and machine learning.
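To make the idea of an offline-generated interleaved configuration more concrete, here is a minimal Python sketch. It is not the method from the thesis: it assumes that each kernel version adds a fixed, independent amount of accuracy loss per iteration (a simple additive model, whereas the abstract notes that errors can propagate and magnify), and every name in it (`KernelVersion`, `interleave`, `estimated_speedup`, the timing and loss figures) is hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class KernelVersion:
    """Hypothetical per-version statistics, as might come from offline sampling."""
    name: str
    time_ms: float        # assumed mean time per iteration
    loss_per_iter: float  # assumed accuracy loss added per iteration


def interleave(exact, approx, iterations, max_loss):
    """Return one version name per iteration, keeping total loss <= max_loss.

    Assumes an additive loss model, which is a simplification.
    """
    extra = approx.loss_per_iter - exact.loss_per_iter
    if extra <= 0:
        n_approx = iterations  # the approximate version costs no extra accuracy
    else:
        budget = max_loss - exact.loss_per_iter * iterations
        n_approx = max(0, min(iterations, int(budget / extra)))

    # Spread the approximate iterations evenly across the run.
    schedule, acc = [], 0
    for _ in range(iterations):
        acc += n_approx
        if acc >= iterations:
            acc -= iterations
            schedule.append(approx.name)
        else:
            schedule.append(exact.name)
    return schedule


def estimated_speedup(schedule, exact, approx):
    """Ratio of all-exact run time to the scheduled run time."""
    times = {exact.name: exact.time_ms, approx.name: approx.time_ms}
    baseline = exact.time_ms * len(schedule)
    return baseline / sum(times[name] for name in schedule)


# Illustrative numbers only; they are not measurements from the thesis.
fp64 = KernelVersion("fp64", time_ms=2.0, loss_per_iter=0.0)
fp16 = KernelVersion("fp16", time_ms=1.0, loss_per_iter=0.5)
schedule = interleave(fp64, fp16, iterations=10, max_loss=2.0)
speedup = estimated_speedup(schedule, fp64, fp16)
```

In this sketch a tighter `max_loss` yields fewer approximate iterations and a lower estimated speedup, which mirrors the trade-off between accuracy loss thresholds and performance described in the abstract.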

Tags: Computer science, CUDA, Machine learning, Mixed precision, Neural networks, nVidia, nVidia A100, nVidia P100, Thesis

June 18, 2023 by hgpu
