
Principal Kernel Analysis: A Tractable Methodology to Simulate Scaled GPU Workloads

Cesar A. Baddouh, Mahmoud Khairy, Roland Green, Mathias Payer, Timothy G. Rogers
Purdue University
54th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO ’21), Pages 724–737, 2021

@inproceedings{avalos2021principal,
   title={Principal Kernel Analysis: A Tractable Methodology to Simulate Scaled GPU Workloads},
   author={Avalos Baddouh, Cesar and Khairy, Mahmoud and Green, Roland N and Payer, Mathias and Rogers, Timothy G},
   booktitle={MICRO-54: 54th Annual IEEE/ACM International Symposium on Microarchitecture},
   pages={724--737},
   year={2021}
}


Simulating all threads in a scaled GPU workload results in prohibitive simulation cost. Because cycle-level simulation is orders of magnitude slower than native silicon, the only solution is to reduce the amount of work simulated while still accurately representing the program. Existing solutions to simulate GPU programs either scale the input size, simulate the first several billion instructions, or simulate a portion of both the GPU and the workload. These solutions lack validation against scaled systems, produce unrealistic contention conditions, and frequently miss critical code sections. Existing CPU sampling mechanisms, like SimPoint, reduce per-thread workload and are ill-suited to GPU programs, where reducing the number of threads is critical. Sampling solutions in the GPU space lack silicon validation, require per-workload parameter tuning, and do not scale. A tractable solution, validated on contemporary scaled workloads, is needed to provide credible simulation results. By studying scaled workloads with centuries-long simulation times, we uncover practical and algorithmic limitations of existing solutions and propose Principal Kernel Analysis: a hierarchical program sampling methodology that concisely represents GPU programs by selecting representative kernel portions using a scalable profiling methodology, a tractable clustering algorithm, and detection of intra-kernel IPC stability. We validate Principal Kernel Analysis across 147 workloads and three GPU generations using the Accel-Sim simulator, demonstrating a better performance/error tradeoff than prior work and showing that century-long MLPerf simulations are reduced to hours with an average cycle error of 27% versus silicon.
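The two levels of sampling described in the abstract (grouping similar kernels so only one representative per group is simulated, then stopping a kernel's simulation once its IPC stabilizes) can be illustrated with a minimal Python sketch. Everything specific here is an assumption made for illustration and not taken from the paper: the feature vectors (instruction count, memory-instruction fraction), the greedy threshold grouping used in place of the paper's clustering algorithm, the coefficient-of-variation stability test, and all function names and thresholds.

```python
from statistics import mean, pstdev


def close(a, b, tol):
    """True when every feature of a is within relative tolerance tol of b."""
    return all(abs(x - y) <= tol * max(abs(x), abs(y), 1e-12)
               for x, y in zip(a, b))


def group_kernels(profiles, tol=0.05):
    """Greedy grouping (a stand-in for a real clustering algorithm): each
    kernel joins the first group whose representative, the group's first
    member, has similar profiled features."""
    groups = []  # list of (representative_name, [member_names])
    reps = []    # feature vectors of the representatives
    for name, feats in profiles:
        for i, rep in enumerate(reps):
            if close(feats, rep, tol):
                groups[i][1].append(name)
                break
        else:
            reps.append(feats)
            groups.append((name, [name]))
    return groups


def projected_cycles(groups, rep_cycles):
    """Scale each representative's simulated cycles by its group size."""
    return sum(rep_cycles[rep] * len(members) for rep, members in groups)


def stable_index(ipc_samples, window=4, max_cv=0.02):
    """Return the first sample index at which the last `window` IPC samples
    have a coefficient of variation below max_cv, or None if that never
    happens."""
    for end in range(window, len(ipc_samples) + 1):
        recent = ipc_samples[end - window:end]
        m = mean(recent)
        if m > 0 and pstdev(recent) / m < max_cv:
            return end - 1
    return None


# Hypothetical profile: (kernel name, (instruction count, memory fraction)).
profiles = [
    ("k0", (1000.0, 0.30)),
    ("k1", (1010.0, 0.30)),
    ("k2", (50000.0, 0.80)),
    ("k3", (1005.0, 0.31)),
]
groups = group_kernels(profiles)
# Only k0 and k2 would be simulated; their cycle counts are made-up inputs.
total = projected_cycles(groups, {"k0": 2000, "k2": 90000})
stop_at = stable_index([0.5, 0.9, 1.2, 1.0, 1.01, 0.99, 1.0, 1.0])
```

In this toy input, four kernel launches collapse into two simulated representatives, and the IPC trace would allow simulation of that kernel to stop at the seventh sample. The savings on a real workload depend on how repetitive its kernels are and on the chosen thresholds.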
