high performance computing on graphics processing units: hgpu.org

Inter-Warp Instruction Temporal Locality in Deep-Multithreaded GPUs

Ahmad Lashgar, Amirali Baniasadi, Ahmad Khonsari

School of Electrical and Computer Engineering, University College of Engineering, University of Tehran, Tehran, Iran

26th International Conference on Architecture of Computing Systems (ARCS 2013), 2013

@article{lashgarinter2013inter,
  title={Inter-Warp Instruction Temporal Locality in Deep-Multithreaded GPUs},
  author={Lashgar, A. and Baniasadi, A. and Khonsari, A.},
  year={2013}
}

GPUs employ thousands of threads per core to achieve high throughput. These threads exhibit locality in control flow, in instruction and data addresses, and in values. In this study, we investigate inter-warp instruction temporal locality and show that, during short intervals, a significant share of fetched instructions are fetched unnecessarily. This observation provides several opportunities to enhance GPUs. We discuss different possibilities and evaluate a filter cache as a case study. Moreover, we investigate how variations in microarchitectural parameters impact the potential benefits of a filter cache in GPUs.
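The idea of inter-warp instruction temporal locality can be illustrated with a toy model. The sketch below is not the authors' simulator or methodology; it assumes a small fully associative LRU filter cache indexed by program counter and a hypothetical round-robin fetch order in which all warps step through the same straight-line code. The function names `filter_cache_hit_rate` and `round_robin_fetches` are made up for this illustration.

```python
from collections import OrderedDict


def filter_cache_hit_rate(fetch_stream, num_entries):
    """Fraction of fetches served by a small fully associative LRU filter cache.

    fetch_stream is a sequence of program counters (PCs) in fetch order.
    This is an illustrative model only, not the configuration from the paper.
    """
    cache = OrderedDict()  # PCs ordered from least to most recently used
    hits = 0
    for pc in fetch_stream:
        if pc in cache:
            hits += 1
            cache.move_to_end(pc)  # refresh LRU position on a hit
        else:
            cache[pc] = True
            if len(cache) > num_entries:
                cache.popitem(last=False)  # evict the least recently used PC
    return hits / len(fetch_stream) if fetch_stream else 0.0


def round_robin_fetches(num_warps, program_length):
    """Assumed fetch order: every warp fetches PC i before any warp fetches PC i+1."""
    return [pc for pc in range(program_length) for _ in range(num_warps)]


if __name__ == "__main__":
    stream = round_robin_fetches(num_warps=8, program_length=100)
    print(filter_cache_hit_rate(stream, num_entries=4))
```

Under these assumptions, only the first warp to reach each instruction misses, so with 8 warps the hit rate is 7/8 regardless of program length. Real schedulers, branch divergence, and memory stalls let warps drift apart, which is why sensitivity to microarchitectural parameters matters.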

Tags: Computer science, Energy-efficient computing, GPGPU-sim, nVidia, Tesla

January 17, 2013 by hgpu
