FP8-Flow-MoE: A Casting-Free FP8 Recipe without Double Quantization Error
Zhejiang Lab
arXiv:2511.02302 [cs.LG] (4 Nov 2025)
@misc{wang2025fp8flowmoecastingfreefp8recipe,
  title         = {FP8-Flow-MoE: A Casting-Free FP8 Recipe without Double Quantization Error},
  author        = {Fengjuan Wang and Zhiyi Su and Xingzhu Hu and Cheng Wang and Mou Sun},
  year          = {2025},
  eprint        = {2511.02302},
  archivePrefix = {arXiv},
  primaryClass  = {cs.LG},
  url           = {https://arxiv.org/abs/2511.02302}
}
Training large Mixture-of-Experts (MoE) models remains computationally prohibitive due to their extreme compute and memory demands. Although low-precision training promises to accelerate computation and reduce memory footprint, existing implementations still rely on BF16-dominated dataflows with frequent quantize-dequantize (Q/DQ) conversions. These redundant casts erode much of FP8’s theoretical efficiency. However, naively removing these casts by keeping dataflows entirely in FP8 introduces double quantization error: tensors quantized along different dimensions accumulate inconsistent scaling factors, degrading numerical stability. We propose FP8-Flow-MoE, an FP8 training recipe featuring a quantization-consistent FP8-centric dataflow with a scaling-aware transpose and fused FP8 operators that streamline computation and reduce the number of explicit cast operations from 12 to 2. Evaluations on a 671B-parameter MoE model demonstrate up to 21% higher throughput and 16.5 GB lower memory usage per GPU compared to BF16 and naïve FP8 baselines, while maintaining stable convergence. We provide a plug-and-play FP8 recipe compatible with TransformerEngine and Megatron-LM, which will be open-sourced soon.
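
The double quantization error and the scaling-aware transpose described in the abstract lend themselves to a short illustration. The following sketch is plain NumPy, not the paper's TransformerEngine/Megatron-LM kernels; fp8_round, quantize, and the error comparison are illustrative stand-ins under the assumption of simple per-axis max scaling to the FP8 E4M3 range. It shows why a dequantize-transpose-requantize chain with scales computed along different dimensions rounds the data twice, while a scaling-aware transpose that permutes the already-quantized payload and carries the original scales rounds only once.

# Toy sketch (not the paper's implementation) of double quantization error
# versus a scaling-aware transpose. FP8 E4M3 is simulated with a coarse
# 3-mantissa-bit rounding helper; all names here are illustrative.
import numpy as np

E4M3_MAX = 448.0  # largest finite value representable in FP8 E4M3

def fp8_round(x):
    """Crude stand-in for E4M3 rounding: keep 3 explicit mantissa bits."""
    x = np.clip(x, -E4M3_MAX, E4M3_MAX)
    exp = np.floor(np.log2(np.maximum(np.abs(x), 1e-12)))
    step = 2.0 ** (exp - 3)  # spacing of representable values in each binade
    return np.round(x / step) * step

def quantize(x, axis):
    """Per-axis scaling so the largest element maps to E4M3_MAX."""
    scale = np.max(np.abs(x), axis=axis, keepdims=True) / E4M3_MAX
    return fp8_round(x / scale), scale

rng = np.random.default_rng(0)
x = rng.standard_normal((256, 512)).astype(np.float32)

# Naive FP8 dataflow: quantize row-wise, dequantize, transpose, re-quantize
# column-wise -> two independent roundings with inconsistent scaling factors.
q_row, s_row = quantize(x, axis=1)
x_hat = q_row * s_row                     # dequantize (first rounding error)
q_col, s_col = quantize(x_hat.T, axis=1)  # re-quantize (second rounding error)
double_err = np.abs(q_col * s_col - x.T).mean()

# Scaling-aware transpose: permute the FP8 payload and carry the original
# row scales along with it -> only the first rounding error remains.
q_t, s_t = q_row.T, s_row.T
single_err = np.abs(q_t * s_t - x.T).mean()

print(f"double-quantization error: {double_err:.6f}")
print(f"scaling-aware transpose:   {single_err:.6f}")

On random data the second path reproduces the original tensor more closely, since the FP8 payload is never re-rounded; the production recipe additionally fuses such steps into FP8 operators so the intermediate dequantized tensor never materializes.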
November 9, 2025 by hgpu