
Dissecting the NVIDIA Hopper Architecture through Microbenchmarking and Multiple Level Analysis

Weile Luo, Ruibo Fan, Zeyu Li, Dayou Du, Hongyuan Liu, Qiang Wang, Xiaowen Chu
The Hong Kong University of Science and Technology (Guangzhou)
arXiv:2501.12084 [cs.DC], 21 Jan 2025

@misc{luo2025dissectingnvidiahopperarchitecture,
   title={Dissecting the NVIDIA Hopper Architecture through Microbenchmarking and Multiple Level Analysis},
   author={Weile Luo and Ruibo Fan and Zeyu Li and Dayou Du and Hongyuan Liu and Qiang Wang and Xiaowen Chu},
   year={2025},
   eprint={2501.12084},
   archivePrefix={arXiv},
   primaryClass={cs.DC},
   url={https://arxiv.org/abs/2501.12084}
}


Modern GPUs, with their specialized hardware like tensor cores, are essential for demanding AI and deep learning applications. This study presents a comprehensive, multi-level microbenchmarking analysis of the NVIDIA Hopper GPU architecture, delving into its performance characteristics and novel features. We benchmark Hopper’s memory subsystem latency and throughput, comparing its L2 partitioned cache behavior and global memory access patterns against recent GPU generations, Ampere and Ada Lovelace. Our analysis reveals significant performance differences and architectural improvements in Hopper. A core contribution of this work is a detailed evaluation of Hopper’s fourth-generation tensor cores, including their FP8 precision support and the novel asynchronous wgmma instructions, assessing their impact on matrix multiply-accumulate operations. We further investigate the performance implications of other key Hopper innovations: DPX instructions for accelerating dynamic programming algorithms, distributed shared memory (DSM) for inter-SM communication, and the Tensor Memory Accelerator (TMA) for asynchronous data movement. This multi-level approach encompasses instruction-level microbenchmarks, library-level analysis of the Transformer Engine, and application-level benchmarks of tensor core performance within large language models. Our findings provide valuable, in-depth insights for software developers seeking to optimize performance and develop accurate performance models for the Hopper architecture, ultimately contributing to a deeper understanding of its potential for accelerating AI and other computationally intensive workloads.
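The DPX instructions mentioned in the abstract are exposed in CUDA C++ through integer intrinsics. As a minimal illustrative sketch (not taken from the paper's benchmarks), the kernel below uses __vimax3_s32, a CUDA 12 intrinsic that returns the maximum of three signed 32-bit integers and lowers to a single DPX instruction on Hopper (sm_90); this max-of-three relaxation is the kind of inner step found in dynamic-programming algorithms such as sequence alignment.

#include <cuda_runtime.h>
#include <cstdio>

// One dynamic-programming "relaxation" step: out[i] = max(a[i], b[i], c[i]).
// __vimax3_s32 (CUDA 12.0+) maps to one DPX instruction on sm_90; on older
// architectures (Ampere, Ada) it is emulated with ordinary integer ops.
__global__ void relax_max3(const int *a, const int *b, const int *c,
                           int *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = __vimax3_s32(a[i], b[i], c[i]);
}

int main()
{
    const int n = 1 << 20;
    int *a, *b, *c, *out;
    cudaMallocManaged(&a, n * sizeof(int));
    cudaMallocManaged(&b, n * sizeof(int));
    cudaMallocManaged(&c, n * sizeof(int));
    cudaMallocManaged(&out, n * sizeof(int));
    for (int i = 0; i < n; ++i) { a[i] = i; b[i] = n - i; c[i] = 42; }

    relax_max3<<<(n + 255) / 256, 256>>>(a, b, c, out, n);
    cudaDeviceSynchronize();

    printf("out[0] = %d\n", out[0]);  // expected: max(0, n, 42) = n
    cudaFree(a); cudaFree(b); cudaFree(c); cudaFree(out);
    return 0;
}

Compiled with nvcc -arch=sm_90 the intrinsic uses the hardware DPX path, while the same source runs (emulated) on Ampere and Ada Lovelace, which is precisely the kind of cross-generation comparison the paper's instruction-level microbenchmarks carry out.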
