GCStack+GCScaler: Fast and Accurate GPU Performance Analyses Using Fine-Grained Stall Cycle Accounting and Interval Analysis
Yonsei University, Seoul, Republic of Korea
Proceedings of the 52nd Annual International Symposium on Computer Architecture (ISCA ’25), 2025
To design next-generation Graphics Processing Units (GPUs), GPU architects rely on GPU performance analyses to identify key GPU performance bottlenecks and explore GPU design spaces. Unfortunately, the existing GPU performance analysis mechanisms make it difficult for GPU architects to conduct fast and accurate GPU performance analyses. The existing mechanisms can provide misleading insights into GPU performance bottlenecks. They characterize the performance-degrading stall events of a GPU using coarse-grained, issue-stage-centric, and priority-based cycle stacks which tend to exaggerate memory-side stall events and hide concurrently occurring stall events. The existing mechanisms also incur high GPU design space exploration overhead, as they involve repetitive cycle-level timing simulations for evaluating alternative GPU designs. In this paper, we propose two GPU performance analysis mechanisms, namely GCStack and GCScaler. The two mechanisms enable fast and accurate GPU performance analyses by (1) accurately characterizing the performance bottlenecks of a baseline GPU using fine-grained stall cycle accounting, and (2) accurately scaling the stall cycles of the baseline GPU using interval analysis and analytical scaling models. GCStack captures all the concurrently occurring stall events within each stall cycle, and characterizes the performance as a fine-grained cycle stack. Using the fine-grained cycle stack, GCScaler leverages the existing GPU interval analysis techniques’ accurate stall cycle scaling capability to estimate the cycle stack for an alternative GPU design. GCScaler further employs analytical scaling models designed to scale the idle and synchronization stall cycles accurately. Our evaluation using 47 benchmarks shows that GCStack and GCScaler accelerate an exploration of 1,000 GPU designs by 32.7× over repetitive timing simulations while achieving a low mean absolute performance estimation error rate of 6.37%.
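To make the contrast between priority-based and fine-grained cycle stacks concrete, here is a minimal Python sketch on a toy trace. It is not the paper's implementation: the abstract does not say how GCStack apportions a stall cycle among concurrent events, so the even split used here is an assumption, and the event names, the `PRIORITY` order, and both function names are hypothetical.

```python
from collections import Counter

# Hypothetical priority order for a coarse-grained, priority-based stack
# (memory-side events win ties). Not taken from the paper.
PRIORITY = ["memory", "sync", "compute", "idle"]


def priority_stack(cycles):
    """Charge each stall cycle entirely to its highest-priority event."""
    stack = Counter()
    for events in cycles:
        winner = min(events, key=PRIORITY.index)
        stack[winner] += 1.0
    return dict(stack)


def fine_grained_stack(cycles):
    """Split each stall cycle evenly across all concurrent events.

    The even split is an illustrative assumption, not GCStack's documented rule.
    Each cycle must contain at least one event.
    """
    stack = Counter()
    for events in cycles:
        share = 1.0 / len(events)
        for event in events:
            stack[event] += share
    return dict(stack)


# Toy trace: each entry is the set of stall events observed in one stall cycle.
trace = [
    {"memory", "compute"},
    {"memory", "sync"},
    {"compute"},
    {"memory", "compute"},
]

print(priority_stack(trace))
print(fine_grained_stack(trace))
```

On this trace the priority-based stack charges three of the four stall cycles to memory and never reports the synchronization stall. The even-split stack attributes 1.5 cycles to memory, 2.0 to compute, and 0.5 to synchronization. This is the kind of exaggeration and hiding that the abstract attributes to priority-based accounting.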
June 29, 2025 by hgpu