Practical FP4 Training for Large-Scale MoE Models on Hopper GPUs
Zhejiang Lab, Hangzhou, China
arXiv:2603.02731 [cs.LG] (3 Mar 2026)
@misc{zhang2026practical,
  title={Practical FP4 Training for Large-Scale MoE Models on Hopper GPUs},
  author={Wuyue Zhang and Chongdong Huang and Chunbo You and Cheng Gu and Fengjuan Wang and Mou Sun},
  year={2026},
  eprint={2603.02731},
  archivePrefix={arXiv},
  primaryClass={cs.LG},
  url={https://arxiv.org/abs/2603.02731}
}
Training large-scale Mixture-of-Experts (MoE) models is bottlenecked by activation memory and expert-parallel communication, yet FP4 training remains impractical on Hopper-class GPUs without native MXFP4 or NVFP4 support. In this work, we present a training recipe that enables MXFP4 efficiency for MoE models on Hopper architectures without native 4-bit computation support. A central challenge is to integrate FP4 into an existing BF16/FP8 hybrid training pipeline without incurring costly precision round-trips (e.g., FP4 → BF16 → FP8). We address this challenge by introducing direct FP8-to-FP4 quantization and de-quantization, together with scaling-aware FP4 row-wise to column-wise conversion, enabling FP4 activations and expert-parallel communication with minimal overhead. Core MoE computations are executed in FP8, while activations and expert-parallel communication are compressed using MXFP4, achieving substantial memory and bandwidth savings without degrading convergence. At the 671B-parameter scale, our method achieves end-to-end training performance comparable to strong FP8 baselines, while reducing peak activation memory by 14.8% (11.8 GB) and improving training throughput by 12.5%, from 1157 to 1302 tokens per GPU per second. These results show that FP4 efficiency can be practically realized for large-scale MoE training through careful software-hardware co-design, even without native FP4 Tensor Core support.
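For readers unfamiliar with the MXFP4 format the abstract refers to, the following is a minimal NumPy sketch of MXFP4 block quantization as defined in the OCP Microscaling spec: each block of 32 values shares one power-of-two (E8M0) scale, and each element is rounded onto the FP4 E2M1 grid. This is an illustration of the data format only, not the paper's kernel implementation; the function names and the scale-selection convention shown here are assumptions.

```python
import numpy as np

# Representable non-negative magnitudes of FP4 E2M1.
FP4_E2M1_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])
BLOCK = 32  # MXFP4 block size: 32 elements share one E8M0 scale


def quantize_mxfp4(x):
    """Quantize a 1-D array (length a multiple of 32) to MXFP4.

    Returns the dequantized-element grid values per block and the
    shared power-of-two scale per block.
    """
    blocks = x.reshape(-1, BLOCK)
    amax = np.abs(blocks).max(axis=1, keepdims=True)
    safe_amax = np.where(amax > 0, amax, 1.0)
    # Shared scale (OCP MX convention): 2^(floor(log2(amax)) - emax_elem),
    # where emax_elem = 2 is the largest E2M1 exponent.
    scale = np.where(amax > 0, 2.0 ** (np.floor(np.log2(safe_amax)) - 2), 1.0)
    scaled = blocks / scale
    # Round-to-nearest onto the signed E2M1 grid (clamps at +/-6).
    idx = np.abs(np.abs(scaled)[..., None] - FP4_E2M1_GRID).argmin(axis=-1)
    q = np.sign(scaled) * FP4_E2M1_GRID[idx]
    return q, scale


def dequantize_mxfp4(q, scale):
    """Recover approximate values: element grid value times shared scale."""
    return (q * scale).reshape(-1)
```

With amax mapped near the top of the E2M1 range, the round-trip error per element is bounded by half the local grid spacing times the block scale, which is what lets activations and expert-parallel traffic be compressed 2x relative to FP8 at modest accuracy cost.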
March 8, 2026 by hgpu