high performance computing on graphics processing units: hgpu.org

Analyzing CUDA’s Compiler through the Visualization of Decoded GPU Binaries

Cedric Nugteren, Bart Mesman, Henk Corporaal

Eindhoven University of Technology, The Netherlands

ODES-8: Proceedings of the 8th Workshop on Optimizations for DSP and Embedded Systems at CGO ’10, 2010

@inproceedings{nugteren2010analyzing,
  title={Analyzing CUDA’s Compiler through the Visualization of Decoded GPU Binaries},
  author={Nugteren, C. and Mesman, B. and Corporaal, H.},
  booktitle={ODES-8: Proceedings of the 8th Workshop on Optimizations for DSP and Embedded Systems at CGO’10},
  year={2010}
}

As GPU architectures grow in importance because of their large number of parallel processors, NVIDIA’s CUDA environment is becoming widely used for general-purpose applications. To use this parallel processing power efficiently, programmers need to parallelize their algorithms and map them onto the hardware well. The difficulty of this task motivates an investigation of CUDA’s compiler. Part of the compiler in the CUDA tool-chain is entirely undocumented, as is its output. To draw conclusions about the behaviour of this compiler, the resulting object code is reverse engineered. A visualization tool is introduced that analyzes the previously unknown compiler behaviour and helps the programmer improve the mapping process. These improvements focus on register allocation and instruction reordering. This paper describes an extension to the CUDA tool-chain that provides programmers with a visualization of register life ranges. The paper also presents guidelines describing how to apply optimizations to obtain a lower register pressure. In one case study, optimizing the code with the help of the visualization tool increases performance by 33% compared to already optimized CUDA code. In 11 other case studies, register pressure is reduced by an average of 18%. The presented guidelines could be added to the compiler so that a similar register pressure reduction is achieved automatically at compile time for new and existing CUDA programs.
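The abstract's notions of register life ranges and register pressure can be illustrated with a small sketch. This is not the authors' tool and does not decode real GPU binaries: it assumes a toy straight-line program with no control flow, hypothetical register names (`r0` to `r4`), and a simplified model in which a register is live from its first definition through its last use, inclusive. The function names `live_ranges`, `register_pressure`, and `render` are invented for illustration.

```python
from typing import Dict, List, Optional, Tuple

# One toy instruction: (destination register or None, source registers).
Instr = Tuple[Optional[str], List[str]]


def live_ranges(program: List[Instr]) -> Dict[str, Tuple[int, int]]:
    """Map each register to (first definition index, last use index)."""
    ranges: Dict[str, Tuple[int, int]] = {}
    for i, (dst, srcs) in enumerate(program):
        for reg in srcs:
            # A use extends the range; an undefined register starts here.
            start = ranges[reg][0] if reg in ranges else i
            ranges[reg] = (start, i)
        if dst is not None:
            start = ranges[dst][0] if dst in ranges else i
            ranges[dst] = (start, i)
    return ranges


def register_pressure(ranges: Dict[str, Tuple[int, int]], n: int) -> List[int]:
    """Count how many registers are live at each of the n instructions."""
    return [sum(1 for s, e in ranges.values() if s <= i <= e) for i in range(n)]


def render(ranges: Dict[str, Tuple[int, int]], n: int) -> str:
    """Draw one row per register: '#' where it is live, '.' elsewhere."""
    rows = []
    for reg, (s, e) in sorted(ranges.items()):
        bar = "".join("#" if s <= i <= e else "." for i in range(n))
        rows.append(f"{reg:>3} {bar}")
    return "\n".join(rows)


# Hypothetical program computing (a + b) * c, with the load of c placed late.
LATE_LOAD: List[Instr] = [
    ("r0", []),            # r0 = load a
    ("r1", []),            # r1 = load b
    ("r2", ["r0", "r1"]),  # r2 = r0 + r1
    ("r3", []),            # r3 = load c
    ("r4", ["r2", "r3"]),  # r4 = r2 * r3
    (None, ["r4"]),        # store r4
]

# Same computation with all loads hoisted to the top.
EARLY_LOAD: List[Instr] = [
    ("r0", []),            # r0 = load a
    ("r1", []),            # r1 = load b
    ("r3", []),            # r3 = load c
    ("r2", ["r0", "r1"]),  # r2 = r0 + r1
    ("r4", ["r2", "r3"]),  # r4 = r2 * r3
    (None, ["r4"]),        # store r4
]

if __name__ == "__main__":
    for name, prog in (("late load", LATE_LOAD), ("early load", EARLY_LOAD)):
        ranges = live_ranges(prog)
        print(name, "peak pressure:", max(register_pressure(ranges, len(prog))))
        print(render(ranges, len(prog)))
```

In this toy model the two orderings compute the same result, but hoisting the load of `c` raises the peak number of simultaneously live registers from 3 to 4. That is the kind of effect of instruction ordering on register pressure that a life-range visualization makes visible; real compilers and hardware involve control flow, register reuse within an instruction, and latency trade-offs that this sketch ignores.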

Tags: Algorithms, Computer science, CUDA, nVidia, Optimization, Performance, Visualization

March 13, 2012 by hgpu
