high performance computing on graphics processing units: hgpu.org


GPU Performance Modeling and Optimization

Ang Li

Technische Universiteit Eindhoven

Technische Universiteit Eindhoven, 2016

@phdthesis{li2016gpu,
  title={GPU performance modeling and optimization},
  author={Li, Ang},
  year={2016},
  school={Technische Universiteit Eindhoven}
}


The last decade has witnessed the rapid emergence of general-purpose computing on graphics processing units (GPGPU). With the exponential growth in the number of cores and threads in modern GPUs, analyzing and optimizing their performance has become a grand challenge. In this thesis, as the modeling part, we propose an analytic model for throughput-oriented parallel processors. The model is visualizable, traceable and portable, and provides a good abstraction that helps both application designers and hardware architects understand performance and motivates potential optimization approaches. As the optimization part, we examine each crucial component of a GPU streaming multiprocessor in turn, in particular register files, compute units (SPU, DPU, SFU), caches (L1, L2, read-only, texture, constant) and scratchpad memory; clarify its underlying performance tradeoffs; and propose effective solutions for handling these tradeoffs in the design space. All the proposed optimization approaches are purely software-based. They are adaptive, transparent, traceable and portable, which leads to achievable and immediate performance gains on various existing GPU devices, especially in GPU-integrated high-performance computing (HPC) systems.

Specifically, the first contribution, in Chapter 3, is a novel visualizable analytic model called "X", designed for today’s highly parallel machines. It comprehensively analyzes the interaction between four types of parallelism (TLP, ILP, DLP and MLP) and two types of memory effects (the local on-chip cache effect and the remote off-chip memory effect) in terms of system throughput. The X-model serves as the theoretical basis of this thesis.

The second contribution, in Chapter 4, is an effective auto-tuning framework that resolves the conflict between overall thread concurrency and per-thread register usage on GPUs. We discover that the performance impact of register usage is continuous, whereas that of concurrency is discrete. Their joint effect forms a special relationship such that a series of critical points can be pre-computed. Each critical point denotes the best performance for one concurrency level. Therefore, the global optimum, that is, the optimal number of registers per thread, can be selected quickly and efficiently to deliver the best GPU performance.

The third contribution, in Chapter 5, is an adaptive cache bypassing framework for GPUs. It uses a simple but effective approach to throttle the number of threads that can access the three types of GPU caches (L1, L2 and read-only), thereby avoiding severe cache thrashing and significantly improving performance for cache-sensitive applications.

In Chapter 6, we focus on a crucial GPU component that has long been ignored, the Special Function Units (SFUs), and show their outstanding role in performance acceleration and approximate computing for GPU applications. We exhaustively evaluate the numeric transcendental functions that are accelerated by SFUs and propose a transparent, tractable and portable design framework for SFU-driven approximate acceleration on GPUs. It partitions the active threads into a PE-based slower but accurate path and an SFU-based faster but approximate path, and tunes the partition ratio between the two paths to control the tradeoff between the performance and accuracy of GPU kernels. In this way, a fine-grained and almost linear tuning space for the tradeoff between performance and accuracy can be created.

Finally, the last contribution, in Chapter 7, is a novel approach for fine-grained inter-thread synchronization on the shared memory of modern GPUs. By reassembling the low-level assembly-based micro-operations that comprise an atomic instruction, we develop a highly efficient, low-cost lock approach that can be leveraged to set up a fine-grained producer-consumer synchronization channel between cooperating threads in a thread block. Additionally, we show how to implement a dataflow algorithm on GPUs using a real 2D-wavefront application.
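
To make the Chapter 4 idea of pre-computed critical points more concrete, the following Python sketch enumerates, for each concurrency level (number of resident thread blocks on one streaming multiprocessor), the largest per-thread register budget that still permits that level. It is only an illustration under stated assumptions and not the thesis's actual framework: the hardware limits (65536 registers and 2048 threads per multiprocessor, at most 255 registers per thread), the block size, the omission of register allocation granularity, and the `toy_runtime` cost model are all hypothetical. In practice the candidates would presumably be ranked by timing the kernel compiled with each register limit.

```python
"""Sketch: enumerate register/concurrency critical points for one SM.

All hardware numbers and the runtime model below are illustrative
assumptions, not values or methods taken from the thesis.
"""


def critical_points(regs_per_sm, max_threads_per_sm, block_size,
                    max_regs_per_thread):
    """Return [(resident_blocks, regs_per_thread), ...], sorted by blocks.

    For each concurrency level, the critical point is the largest
    per-thread register count that still lets that many blocks reside on
    the multiprocessor. Allocation granularity is ignored (assumption).
    """
    max_blocks = max_threads_per_sm // block_size
    best_blocks_for_regs = {}
    for blocks in range(1, max_blocks + 1):
        regs = min(regs_per_sm // (blocks * block_size), max_regs_per_thread)
        if regs > 0:
            # Levels are visited in increasing order, so when two levels
            # share the same budget the higher concurrency is kept.
            best_blocks_for_regs[regs] = blocks
    return sorted((b, r) for r, b in best_blocks_for_regs.items())


def select_best(points, runtime_of):
    """Pick the critical point with the lowest measured or modeled runtime."""
    return min(points, key=lambda p: runtime_of(*p))


def toy_runtime(blocks, regs):
    """Hypothetical cost model: more blocks hide latency, but budgets
    below 48 registers per thread are assumed to cause costly spilling."""
    return 1.0 / blocks + 0.05 * max(0, 48 - regs)


if __name__ == "__main__":
    pts = critical_points(65536, 2048, 256, 255)
    for blocks, regs in pts:
        print(f"{blocks} blocks/SM -> at most {regs} registers/thread")
    print("selected:", select_best(pts, toy_runtime))
```

Only one candidate per concurrency level needs to be evaluated, which is why the search space in this sketch shrinks from up to 255 register settings to at most eight.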

Tags: Computer science, CUDA, Hardware Architecture, nVidia, Performance, PTX, Thesis

October 29, 2016 by hgpu
