Data-Driven Dynamic Autotuning: Optimizing Autotuning Overhead with Prior Tuning Data
Jaroslav Oľha
Masaryk University, 2024
@phdthesis{olha2024data,
  title={Data-Driven Dynamic Autotuning: Optimizing Autotuning Overhead with Prior Tuning Data},
  author={Oľha, Jaroslav},
  school={Masaryk University},
  year={2024}
}
Modern high-performance computing (HPC) applications often rely on heterogeneous hardware resources to achieve maximum performance. This approach offers obvious benefits, combining the processing power of multiple different processors and allowing them to be more specialized. However, since HPC applications typically need to be programmed in a hardware-aware manner to achieve maximum performance, this places a greater burden on programmers to ensure that their programs can take full advantage of a wide variety of processing units. This issue can be addressed with source code autotuning: many implementation variants are defined in advance, and the most appropriate one is selected once all execution conditions, such as the available hardware, become known. The problem of efficient computing is thus transformed into a search problem.

In some cases, the conditions that determine the efficiency of implementations only become known at runtime, or they keep changing during execution and require adaptation on the fly. In such a scenario, it is possible to overlap the autotuning process with the actual execution of the tuned code, which is commonly referred to as dynamic autotuning. The objectives of dynamic autotuning shift towards finding approximate solutions quickly rather than searching for the global optimum. This thesis addresses several issues that arise in this context; its main focus, in addition to introducing key concepts and presenting technical solutions, is the problem of autotuning overhead. Since the typical dynamic autotuning use case requires the tuning process to run concurrently with the tuned application, any autotuning overhead usually comes at the expense of the actual computation. Reducing this overhead is therefore even more critical than it is for traditional offline autotuning approaches.
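The dynamic-autotuning pattern described above — evaluating candidate configurations while the application keeps doing useful work — can be sketched roughly as follows. This is a minimal illustration under assumed names; the tuning parameters, the synthetic cost model, and `run_kernel` are all hypothetical and not the thesis's actual implementation.

```python
import itertools

def run_kernel(config, workload):
    """Stand-in for the tuned computation: its cost depends on the
    chosen configuration (hypothetical block-size/unroll parameters)."""
    block, unroll = config
    # Synthetic cost model: larger blocks and more unrolling run faster.
    return workload / (block * (1.0 + 0.1 * unroll))

def dynamic_autotune(search_space, workloads):
    """Overlap tuning with execution: each incoming workload is processed
    with either an untried candidate configuration (while any remain) or
    the best configuration found so far."""
    candidates = iter(search_space)
    best_config, best_cost = None, float("inf")
    for work in workloads:
        trial = next(candidates, None)
        config = trial if trial is not None else best_config
        cost = run_kernel(config, work)
        # Keep the fastest configuration observed so far.
        if trial is not None and cost < best_cost:
            best_config, best_cost = config, cost
    return best_config

# Nine candidate (block, unroll) configurations, twenty units of real work:
space = list(itertools.product([32, 64, 128], [1, 2, 4]))
best = dynamic_autotune(space, workloads=[1000.0] * 20)
```

Note that the search never pauses the application: every evaluation, even of a poor candidate, still processes a real workload, which is why any overhead comes directly out of the computation.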
The author focuses on two main aspects of minimizing autotuning overhead: finding well-performing configurations more quickly, and setting a tuning budget that ensures minimal application run time. A well-chosen tuning budget ensures that autotuning neither wastes computational resources by running too long for minor code improvements, nor ends prematurely, leaving potential performance gains untapped. In both cases, historical data from previous autotuning efforts plays a major role, highlighting the importance of collecting and reusing this data. The experimental results presented in this thesis clearly show that various properties of tuning spaces, such as tuning parameter importance, relative portions of well-performing configurations, or the relationships between tuning parameters and hardware performance counters, are transferable across different hardware models. These findings led to the development of a profile-based searcher, which has shown considerable ability to improve autotuning convergence, and a tuning budget estimation method, which can ensure a near-optimal number of tuning iterations – both enhancing the effectiveness of dynamic autotuning methods and minimizing their negative impact on the tuned application.
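The two ingredients above — a searcher biased by prior tuning data and a capped tuning budget — can be illustrated with a small sketch. All names, weights, and the cost model here are assumptions for illustration, not the thesis's actual profile-based searcher or budget-estimation method.

```python
import random

def profile_biased_search(configs, evaluate, prior_weights, budget):
    """Random search biased by prior tuning data: configurations whose
    parameter values performed well on previously tuned hardware
    (hypothetical weights) are sampled more often, and the search stops
    after a fixed iteration budget to bound autotuning overhead."""
    weights = [prior_weights.get(c, 1.0) for c in configs]
    best, best_cost = None, float("inf")
    for _ in range(budget):
        config = random.choices(configs, weights=weights, k=1)[0]
        cost = evaluate(config)
        if cost < best_cost:
            best, best_cost = config, cost
    return best, best_cost

# Hypothetical example: tuning a single block-size parameter whose
# optimum on the current hardware is 128. Prior data from other devices
# already favours the 64/128 region, so a small budget suffices.
random.seed(0)  # deterministic for illustration
space = [16, 32, 64, 128, 256]
best, best_cost = profile_biased_search(
    space,
    evaluate=lambda block: abs(block - 128),  # synthetic cost model
    prior_weights={64: 3.0, 128: 5.0},
    budget=10,
)
```

The design choice mirrors the abstract's finding that tuning-space properties transfer across hardware models: weights learned on one device remain useful guidance on another, and the budget keeps the concurrent tuning from eating into the application's run time.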
November 3, 2024 by hgpu