high performance computing on graphics processing units: hgpu.org


D5.5.2 – Architectural Techniques to exploit SLACK & ACCURACY trade-offs

S. Kaxiras, G. Keramidas, K. Koukos

@article{keramidasd5,
  title={D5.5.2 – Architectural Techniques to exploit SLACK & ACCURACY trade-offs},
  author={Keramidas, Georgios and TUB, Ben Juurlink and Stamoulis, Reviewers Iakovos}
}


In this work we (a) explore memory slack in state-of-the-art many-core CPUs and GPUs, (b) present techniques to eliminate slack, and (c) explore the architectural parameters that improve power efficiency. Dynamic Voltage-Frequency Scaling (DVFS) is one of the most beneficial techniques for improving the power efficiency of CPUs. However, the end of Dennard scaling, under which the available voltage range shrinks as technology advances, threatens the effectiveness of DVFS. This is already very common in GPUs and will become a severe limitation for many-cores in the near future. In this report we analyse the impact of core DVFS at different memory frequencies on state-of-the-art GPUs. Because of limitations imposed by either the programming models or the hardware itself, we could not apply DVFS on embedded low-power GPUs. We therefore shift our attention to general-purpose multi-cores and demonstrate significant energy benefits from our proposed execution scheme. For the GPU evaluation we use the NVIDIA CUDA toolkit and some custom micro-benchmarks. Our analysis shows that DVFS can give significant energy benefits on architectures with restricted memory bandwidth, such as embedded or mobile GPUs (although this is restricted to simulated runs only, due to limitations). Finally, our work (a) proposes and evaluates a novel execution scheme for general-purpose many-cores, and (b) investigates an intriguing future direction, revealing that the energy inefficiencies of GPUs are related not to memory slack but to the mechanisms used to hide slack, which seem to compromise application locality.
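To make the link between memory slack, DVFS, and the shrinking voltage range concrete, here is a minimal toy model in Python. It is not the model used in the report: the perfect compute/memory overlap, the linear voltage-frequency relation, the dynamic-power form C·V²·f, and every numeric value and function name (`run_time`, `core_energy`, `best_frequency`) are illustrative assumptions.

```python
# Toy DVFS model. All parameters and names are hypothetical, chosen only
# to illustrate the reasoning in the abstract; they are not from the report.

def run_time(f_ghz, compute_gcycles, mem_time_s):
    """Execution time assuming compute and memory phases overlap perfectly,
    so the longer of the two determines the total."""
    return max(compute_gcycles / f_ghz, mem_time_s)

def core_energy(f_ghz, compute_gcycles, mem_time_s,
                v_min=0.8, v_max=1.2, f_min=0.5, f_max=2.0,
                c_eff=1.0, p_static=0.5):
    """Core energy = (dynamic + static power) * run time.
    Voltage is assumed to scale linearly with frequency between
    (f_min, v_min) and (f_max, v_max)."""
    v = v_min + (v_max - v_min) * (f_ghz - f_min) / (f_max - f_min)
    p_dynamic = c_eff * v * v * f_ghz
    return (p_dynamic + p_static) * run_time(f_ghz, compute_gcycles, mem_time_s)

def best_frequency(freqs, compute_gcycles, mem_time_s, **kw):
    """Return the frequency with the lowest modelled core energy."""
    return min(freqs, key=lambda f: core_energy(f, compute_gcycles, mem_time_s, **kw))

if __name__ == "__main__":
    freqs = [0.5, 1.0, 1.5, 2.0]
    # Memory-bound workload: 1 Gcycle of compute hidden behind 1 s of memory time.
    for f in freqs:
        t = run_time(f, 1.0, 1.0)
        e = core_energy(f, 1.0, 1.0)
        print(f"f={f:.1f} GHz  time={t:.2f} s  energy={e:.3f}")
    print("lowest-energy frequency:", best_frequency(freqs, 1.0, 1.0))
```

In this sketch the core can drop from 2.0 GHz to 1.0 GHz without lengthening the run, because memory time dominates; that idle margin is the slack. Passing a narrower voltage range (for example `v_min=1.1`) reduces the energy saved by the same frequency drop, which mirrors the concern about DVFS effectiveness after the end of Dennard scaling.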

Tags: Benchmarking, Memory, Power-efficient computing

September 6, 2013 by hgpu
