
Microbenchmarking NVIDIA’s Blackwell Architecture: An in-depth Architectural Analysis

Aaron Jarmusch, Sunita Chandrasekaran
Dept. of Computer and Information Sciences, University of Delaware, Newark, US
arXiv:2512.02189 [cs.AR] (1 Dec 2025)




As GPU architectures rapidly evolve to meet the growing demands of exascale computing and machine learning, the performance implications of architectural innovations remain poorly understood across diverse workloads. NVIDIA’s Blackwell (B200) generation introduces significant architectural advances, including 5th-generation tensor cores, tensor memory (TMEM), a decompression engine (DE), and a dual-chip design; however, systematic methodologies for quantifying these improvements lag behind hardware development cycles. We contribute an open-source microbenchmark suite that offers practical insights into optimizing workloads to fully utilize the rich feature set of the modern GPU architecture. This work aims to enable application developers to make informed architectural decisions and to guide future GPU design directions. Our work studies Blackwell GPUs and compares them to the H200 generation with regard to the memory subsystem, the tensor core pipeline, and floating-point precisions (FP32, FP16, FP8, FP6, FP4). Our systematic evaluation of dense/sparse GEMM, transformer inference, and training workloads demonstrates that B200’s tensor core enhancements achieve 1.56x higher mixed-precision throughput and 42% better energy efficiency than H200. Our memory analysis reveals a 58% reduction in memory access latency on cache misses, fundamentally changing optimal algorithm design strategies.
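For readers unfamiliar with how memory-latency microbenchmarks of this kind are typically built, the sketch below shows a common pointer-chasing technique: a single thread follows a random cyclic chain of indices so that every load depends on the previous one, and the elapsed clock cycles approximate the latency of a cache-missing access. This is not the authors' suite; the array size, seed, iteration count, and kernel name are illustrative assumptions.

    // Hypothetical pointer-chasing latency microbenchmark (illustrative sketch,
    // not the paper's actual code). Build with: nvcc -O3 pointer_chase.cu
    #include <cstdio>
    #include <random>
    #include <vector>
    #include <cuda_runtime.h>

    __global__ void pointer_chase(const unsigned int *chain, unsigned int start,
                                  int iters, unsigned long long *cycles,
                                  unsigned int *sink) {
        unsigned int idx = start;
        unsigned long long t0 = clock64();
        for (int i = 0; i < iters; ++i)
            idx = chain[idx];              // each load depends on the previous one
        unsigned long long t1 = clock64();
        *cycles = (t1 - t0) / iters;       // average clock cycles per dependent load
        *sink = idx;                       // keep the chain from being optimized away
    }

    int main() {
        const int N = 1 << 24;             // 64 MiB of 4-byte indices: larger than L2
        std::vector<unsigned int> chain(N);
        for (int i = 0; i < N; ++i) chain[i] = i;

        // Sattolo's algorithm builds a random single-cycle permutation, so the
        // chase visits all N slots and defeats caching and hardware prefetching.
        std::mt19937 rng(42);
        for (int i = N - 1; i > 0; --i) {
            std::uniform_int_distribution<int> pick(0, i - 1);
            std::swap(chain[i], chain[pick(rng)]);
        }

        unsigned int *d_chain, *d_sink;
        unsigned long long *d_cycles;
        cudaMalloc(&d_chain, N * sizeof(unsigned int));
        cudaMalloc(&d_sink, sizeof(unsigned int));
        cudaMalloc(&d_cycles, sizeof(unsigned long long));
        cudaMemcpy(d_chain, chain.data(), N * sizeof(unsigned int),
                   cudaMemcpyHostToDevice);

        // A single thread serializes every load on the previous one.
        pointer_chase<<<1, 1>>>(d_chain, 0, 1 << 20, d_cycles, d_sink);

        unsigned long long cycles;
        cudaMemcpy(&cycles, d_cycles, sizeof(cycles), cudaMemcpyDeviceToHost);
        printf("average latency: %llu cycles per dependent load\n", cycles);

        cudaFree(d_chain); cudaFree(d_sink); cudaFree(d_cycles);
        return 0;
    }

Comparing the reported cycles-per-load on B200 and H200 for working sets that miss in each cache level is one way to observe latency differences of the kind the abstract describes.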
