high performance computing on graphics processing units: hgpu.org

hgpu.org » Applications » Computer science » Static and Dynamic Analyses for Efficient GPU Execution

Static and Dynamic Analyses for Efficient GPU Execution

Philip Munksgaard

University of Copenhagen

University of Copenhagen, 2023

BibTeX

Download (PDF)

View

Source

949

views

In this thesis we describe a host of static and dynamic techniques for efficient execution of GPU programs. Most significant is the array short-circuiting technique, which automatically rewrites array updates and concatenations to happen in-place when deemed safe. The optimization is based on FunMem, an intermediate representation with non-semantic memory information that we also introduce. FunMem allows the compiler to analyze, reason about and optimize memory usage patterns while keeping high-level language information. To map array elements to specific locations in memory, FunMem uses LMADs. LMADs have traditionally seen most use in the context of automatic parallelization of sequential loops, where they are used in a set-interpretation to aggregate memory accesses across loops, but we introduce two new interpretations: as index functions in FunMem and as a slicing mechanism in a user-facing language. The first enable cheap change-of-layout transformations in FunMem, while the other allows users to express complex slices of arrays not otherwise possible. Finally, to enable the aforementioned short-circuiting optimization on complex benchmarks, we present a heuristic for proving non-overlap of multi-dimensional LMADs in the set-interpretation. We also introduce a technique for automatically finding near-optimal threshold values for multi-versioned code. Multi-versioned code attempts to solve the problem of not having a single best compilation strategy for a certain program, by generating multiple semantically-equivalent but differently-optimized versions of code and guarding each version with a conditional based on runtime parameters, leading to the creation of a tuning tree. By relying on a monotonicity assumption, which relates the relative performance of each code version to the runtime parameter conditionals, we show how to automatically tune the threshold values such that each tuning dataset is executed using the fastest code version. As a result, our one-time offline tuning process produces tuning values that optimally discriminate between the different code versions for the training datasets used, and indeed for all datasets with similar size-characteristics. Both the static and dynamic techniques have been developed in the context of and implemented for the Futhark parallel programming language, and we show significant performance improvements from both array short-circuiting and autotuning.

Tags: AMD Radeon Instinct MI100, ATI, Benchmarking, Computer science, nVidia, nVidia A100, OpenCL, Performance, Thesis

August 13, 2023 by hgpu

No votes yet.

Please wait...

hpcbench: A set of benchmarking utilities for biomolecular simulation tools

Engineering Supercomputing Platforms for Biomolecular Applications

high performance computing on graphics processing units: hgpu.org

Static and Dynamic Analyses for Efficient GPU Execution

Recent source codes

hpcbench: A set of benchmarking utilities for biomolecular simulation tools

HPCTransCompile: An AI Compiler Generated Dataset for High-Performance CUDA Transpilation and LLM Preliminary Exploration

chemtrain: Training Molecular Dynamics Potentials in JAX

microSYCL: SYCL micro-benchmarks repository

XaaS containers

CASS: Cuda-Amd aSSembly

Cluser of smartphones for edge computing application using TensorFlow

SYCL Container

Efficient Graph Embedding at Scale: Optimizing CPU-GPU-SSD Integration

Can Large Language Models Predict Parallel Code Performance?

Most viewed papers (last 30 days)

Static and Dynamic Analyses for Efficient GPU Execution

Share this:

Recent source codes

Most viewed papers (last 30 days)