high performance computing on graphics processing units: hgpu.org

hgpu.org » Applications » Computer science » An in-depth performance analysis of irregular workloads on VLIW APU

An in-depth performance analysis of irregular workloads on VLIW APU

Matthew Doerksen

The University of Manitoba, Winnipeg, Manitoba, Canada

The University of Manitoba, 2014

@phdthesis{doerksen2014depth,

title={An in-depth performance analysis of irregular workloads on VLIW APU},

author={Doerksen, Matthew James},

year={2014},

school={The University of Manitoba}

}

Download (PDF)

View

Source

2392

views

Heterogeneous multi-core architectures have a higher performance/power ratio than traditional homogeneous architectures. Due to their heterogeneity, these architectures support diverse applications but developing parallel algorithms on these architectures can be difficult. In implementing algorithms for heterogeneous systems, proprietary languages are often required, limiting portability. Although general purpose graphics processing units (GPUs) have shown great promise in accelerating the performance of throughput computing applications, it is still limited by the memory wall. The memory wall can greatly affect application performance for problems that incorporate amorphous parallelism or irregular workload. Now, AMD’s Fusion series of Accelerated Processing Units (APUs) attempts to solve this problem. The APU is a radical change from the traditional systems of a few years ago. This design change enables consumers to have a capable CPU connected to a powerful, compute-capable GPU using a Very Long Instruction Word (VLIW) architecture. In this thesis, I address the suitability of irregular workload problems on APU architectures. I consider four scientific computing problems of varying characteristics and map them onto the architectural features of the APU. I develop several software optimizations for each problem by making effective use of VLIW static scheduling through techniques such as loop unrolling and vectorization. Using AMD’s OpenCL profiler, I analyze the execution of the various optimizations and provide an in-depth performance analysis using metrics such as kernel occupancy, ALUFetchRatio, ALUBusy Percentage and ALUPacking. Finally, I show the effect of register pressure due to vectorization and the limitations associated with the APU architecture for irregular workloads.

Tags: APU, ATI, ATI Radeon HD 7970, Computer science, Heterogeneous systems, OpenCL, Performance, Thesis

June 16, 2014 by hgpu

No votes yet.

Please wait...

Your response

You must be logged in to post a comment.

high performance computing on graphics processing units: hgpu.org

An in-depth performance analysis of irregular workloads on VLIW APU

Your response

Recent source codes

Agentic Code Optimization via Compiler-LLM Cooperation

MegaTrain: Full Precision Training of 100B+ Parameter Large Language Models on a Single GPU

Device Virtual Machine (DVM)

AutoKernel: Autoresearch for GPU kernels. Give it any PyTorch model, go to sleep, wake up to optimized Triton kernels

SOL-ExecBench: Speed-of-Light Benchmarking for Real-World GPU Kernels Against Hardware Limits

Triton-Sanitizer: A Fast and Device-Agnostic Memory Sanitizer for Triton with Rich Diagnostic Context

LLM.Q: Quantized LLM training in pure CUDA/C++

True 4-Bit Quantized CNN Training on CPU

cuFuzz: A GPU-oriented coverage-guided fuzzer for userland CUDA application

KernelSkill: A Multi-Agent Framework for GPU Kernel Optimization

Most viewed papers (last 30 days)

An in-depth performance analysis of irregular workloads on VLIW APU

Share this:

Your response

Recent source codes

Most viewed papers (last 30 days)