high performance computing on graphics processing units: hgpu.org


GPU Array Access Auto-Tuning

Nicolas Weber

Technische Universität Darmstadt

Technische Universität Darmstadt, 2017

@phdthesis{weber2017gpu,
title={GPU Array Access Auto-Tuning},
author={Weber, Nicolas},
year={2017},
school={Technische Universit{\"a}t}
}


GPUs have been used for years in compute-intensive applications. Their massively parallel processing capabilities can speed up calculations significantly. However, to leverage this speedup it is necessary to rethink and develop new algorithms that allow parallel processing. These algorithms are only one piece of achieving high performance. Nearly as important as suitable algorithms are the actual implementation and the use of special hardware features such as intra-warp communication, shared memory, caches, and memory access patterns. Optimizing these factors is usually a time-consuming task that requires a deep understanding of the algorithms and the underlying hardware. Unlike that of CPUs, the internal structure of GPUs has changed significantly and will likely change even more over the years. Therefore it does not suffice to optimize the code once during development; it has to be re-optimized for each new GPU generation that is released. To efficiently (re-)optimize code for the underlying hardware, auto-tuning tools have been developed that perform these optimizations automatically, taking this burden from the programmer. In particular, NVIDIA, the leading manufacturer of GPUs today, has made significant changes to the memory hierarchy over the last four hardware generations. This makes the memory hierarchy an attractive target for an auto-tuner. In this thesis we introduce the MATOG auto-tuner, which automatically optimizes array access for NVIDIA CUDA applications. To achieve these optimizations, MATOG has to analyze the application to determine optimal parameter values. The analysis relies on empirical profiling combined with a prediction method and a data post-processing step. This allows MATOG to find nearly optimal parameter values in a minimal amount of time. Further, MATOG is able to automatically detect varying application workloads and can apply different optimization parameter settings at runtime.
To show MATOG’s capabilities, we evaluated it on a variety of applications, ranging from simple algorithms to complex applications, on the last four hardware generations with a total of 14 GPUs. MATOG is able to achieve equal or even better performance than hand-optimized code. Further, it is able to provide performance portability across different GPU types (low-end, mid-range, high-end, and HPC) and generations. By dynamically changing data layouts throughout the execution, it is in some cases able to exceed the performance of hand-crafted code that has been specifically optimized for the tested GPU.
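To make the idea of a tunable array layout concrete, the following is a minimal host-side C++ sketch, not MATOG's actual interface or algorithm. It assumes a hypothetical `FieldArray` type whose layout (array of structs or struct of arrays) is chosen at runtime, and a hypothetical `pick_layout` function that times a simple workload under each layout and keeps the faster one. The abstract describes empirical profiling combined with a prediction method and post-processing; only the plain "measure every candidate" step is illustrated here, on the CPU rather than on a GPU.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <vector>

// Hypothetical names for illustration only; none of this is MATOG's API.
enum class Layout { AoS, SoA };

// A logical array of `count` elements with `fields` floats per element.
// The layout decides how (element, field) maps to a linear offset.
struct FieldArray {
    Layout layout;
    std::size_t count;
    std::size_t fields;
    std::vector<float> data;

    FieldArray(Layout l, std::size_t n, std::size_t f)
        : layout(l), count(n), fields(f), data(n * f, 0.0f) {}

    // AoS keeps the fields of one element adjacent;
    // SoA keeps the same field of consecutive elements adjacent.
    std::size_t offset(std::size_t elem, std::size_t field) const {
        assert(elem < count && field < fields);
        return layout == Layout::AoS ? elem * fields + field
                                     : field * count + elem;
    }

    float& at(std::size_t elem, std::size_t field) {
        return data[offset(elem, field)];
    }
};

// Example workload: sum one field over all elements.
// The calling code is identical for both layouts.
float sum_field(const FieldArray& a, std::size_t field) {
    float s = 0.0f;
    for (std::size_t i = 0; i < a.count; ++i) {
        s += a.data[a.offset(i, field)];
    }
    return s;
}

// Naive empirical tuning: run the workload once per layout and
// return whichever layout was faster on this machine.
Layout pick_layout(std::size_t n, std::size_t fields) {
    Layout best = Layout::AoS;
    double best_time = -1.0;
    for (Layout l : {Layout::AoS, Layout::SoA}) {
        FieldArray a(l, n, fields);
        auto t0 = std::chrono::steady_clock::now();
        volatile float sink = sum_field(a, 0);  // volatile keeps the call alive
        (void)sink;
        auto t1 = std::chrono::steady_clock::now();
        double dt = std::chrono::duration<double>(t1 - t0).count();
        if (best_time < 0.0 || dt < best_time) {
            best_time = dt;
            best = l;
        }
    }
    return best;
}
```

Because the workload only goes through `offset`, switching layouts does not require touching the algorithm itself, which is the property that lets a tuner swap layouts per GPU or per workload. A single timed run is noisy; a real tuner would repeat measurements and, as the abstract notes, avoid measuring every configuration exhaustively.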

Tags: Computer science, CUDA, nVidia, nVidia GeForce GT 440, nVidia GeForce GT 620, nVidia GeForce GT 730, nVidia GeForce GTX 1080, nVidia GeForce GTX 480, nVidia GeForce GTX 560 Ti, nVidia GeForce GTX 570, nVidia GeForce GTX 590, nVidia GeForce GTX 680, nVidia GeForce GTX 780, nVidia GeForce GTX 980, nVidia GeForce GTX Titan X, Performance, performance portability, Tesla C2070, Tesla K20, Thesis

August 8, 2017 by hgpu

