Rethinking Shrinkage Bias in LLM FP4 Pretraining: Geometric Origin, Systemic Impact, and UFP4 Recipe

Published: 06/18/2026

Analysis

This analysis is AI-generated and may not be fully accurate. Please refer to the original paper.

1. Bibliographic Information

1.1. Title

The title of the paper is "Rethinking Shrinkage Bias in LLM FP4 Pretraining: Geometric Origin, Systemic Impact, and UFP4 Recipe". The central topic revolves around identifying a fundamental flaw in the current standard 4-bit floating-point format (E2M1) used for Large Language Model (LLM) pretraining and proposing a new recipe called UFP4 based on uniform grids to resolve this issue.

1.2. Authors

The authors are Qian Zhao, Kunlong Chen, Changxin Tian, Zhonghui Jiang, Haitao Zhang, Chaofan Yu, Peijie Jiang, Mingliang Gong, Jia Liu, Ziqi Liu, Zhiqiang Zhang*, Jun Zhou. They are affiliated with the Ling Team, Ant Group. The corresponding authors are Zhiqiang Zhang and Jun Zhou. The research background of the team appears to be focused on efficient AI system architecture, specifically low-precision training and scaling laws for large models.

1.3. Journal/Conference

The paper is currently available as a preprint on arXiv (arXiv:2606.20381). As of the provided text, it has not yet been published in a specific journal or conference proceedings, though it is dated June 19, 2026. arXiv is a highly reputable open-access repository for scientific preprints in fields like computer science and mathematics, widely used to disseminate cutting-edge research before formal peer review.

1.4. Publication Year

The paper is dated June 19, 2026.

1.5. Abstract

The paper addresses the challenge of FP4 training (training with 4-bit floating-point numbers), which promises significant reductions in memory and computation costs for LLM pretraining. Current hardware and recipes (e.g., NVIDIA Blackwell, AMD MI350) primarily rely on the E2M1 data element format. The authors identify a fundamental limitation in E2M1: it inherently suffers from Shrinkage Bias, a systematic negative rounding error caused by the geometric asymmetry of its representable value bins. This bias accumulates multiplicatively across network layers and is amplified by the Random Hadamard Transform (RHT), explaining the training instability seen in existing E2M1-based methods. In contrast, uniform grids (like E1M2 or INT4) avoid this error. Based on this, the authors propose UFP4, a uniform 4-bit training recipe that applies RHT to all three training GEMMs (matrix multiplications) and restricts stochastic rounding to the upstream gradient dY. Experiments on Dense 1.5B, MoE 7.9B, and MoE 124B models show that UFP4 achieves lower loss degradation than strong E2M1 baselines. The authors suggest future accelerators should support uniform 4-bit grids as first-class primitives.

Official source: https://arxiv.org/abs/2606.20381
PDF: https://arxiv.org/pdf/2606.20381v1
Publication status: Preprint

2. Executive Summary

2.1. Background & Motivation

The core problem the paper aims to solve is the training instability and loss degradation observed when using FP4 (4-bit floating point) precision for the pretraining of Large Language Models (LLMs). While FP4 offers substantial cost savings (halving the memory footprint compared to FP8), it is notoriously difficult to stabilize during the training process.

This problem is critically important because the cost of training state-of-the-art LLMs is skyrocketing. Moving to lower precision formats like FP4 is essential for making future models economically and environmentally viable. However, current industry-standard solutions, such as NVIDIA's NVFP4 recipe, rely heavily on the E2M1 format (a 4-bit float with 2 exponent bits and 1 mantissa bit). Despite various mitigation techniques like scaling and stochastic rounding, these recipes still struggle to match the accuracy of higher precision (BF16) training.

The paper's entry point is a deep geometric analysis of the number representation grid itself. The authors challenge the default choice of E2M1, identifying that the specific arrangement of numbers in this format creates a "Shrinkage Bias" that acts as a fundamental barrier to stable training, especially when combined with outlier-handling techniques like the Random Hadamard Transform (RHT).

2.2. Main Contributions / Findings

The primary contributions of the paper are:

  1. Identification of Shrinkage Bias: The authors formally define and analyze "Shrinkage Bias," a systematic negative rounding error inherent to non-uniform formats like E2M1. They show this is caused by the geometric asymmetry of the rounding bins in the E2M1 grid.

  2. Systemic Impact Analysis: They theoretically and empirically demonstrate that this bias is not just local noise; it accumulates multiplicatively across deep network layers, causing signal decay. Furthermore, they show that the Random Hadamard Transform (RHT), typically used to fix outliers, actually exacerbates this bias under E2M1 by pushing data into the most asymmetric bins.

  3. Proposal of UFP4: They propose UFP4, a novel 4-bit training recipe. Instead of E2M1, UFP4 uses a uniform grid (E1M2/INT4 style). This uniformity eliminates the geometric source of Shrinkage Bias.

  4. Full-RHT Strategy: Unlike E2M1 recipes that restrict RHT to specific paths to avoid instability, UFP4 safely enables RHT on all three training GEMMs (forward, backward data, backward weight) because the uniform grid handles the post-RHT distribution correctly.

The key finding is that UFP4 consistently outperforms strong E2M1 baselines on large-scale models (up to 124B parameters), achieving lower BF16-relative loss degradation. This suggests that the hardware industry should prioritize support for uniform 4-bit grids (E1M2/INT4) alongside the existing E2M1 standard.

3. Prerequisite Knowledge & Related Work

3.1. Foundational Concepts

To understand this paper, several foundational concepts in deep learning and numerical computing are required:

  • Floating-Point Representation (FP4): Standard computers use 32-bit (FP32) or 16-bit (FP16/BF16) floating-point numbers. FP4 compresses this to just 4 bits (1 sign bit, and the rest split between exponent and mantissa).

    • E2M1: A specific FP4 format with 2 exponent bits and 1 mantissa bit. Its non-negative magnitudes are 0, 0.5, 1, 1.5, 2, 3, 4, and 6, which gives a wide dynamic range for 4 bits but coarse, unevenly spaced steps at the upper end.
    • E1M2: A specific FP4 format with 1 exponent bit and 2 mantissa bits. This has a narrower dynamic range but finer resolution (numbers are closer together). The paper treats it as a uniform grid, alongside INT4.
    • INT4: A 4-bit integer format. This represents numbers with perfectly uniform spacing (e.g., -8, -7, ..., 7).
  • Blockwise Quantization: To make 4-bit numbers usable for the wide range of values in a neural network, the tensor is divided into small blocks (e.g., 1x16). Each block has its own "scale factor" to stretch the 4-bit values to fit the data in that specific block.

  • GEMM (General Matrix Multiply): The core mathematical operation in deep learning, typically calculating $Y = XW^T$. This is where the vast majority of computation happens.

  • Random Hadamard Transform (RHT): A mathematical operation that rotates the data. In deep learning tensors, some values are "outliers" (extremely large). RHT mixes these outlier values across the vector, making the distribution smoother and easier to quantize.

  • Stochastic Rounding (SR): When a number falls between two representable 4-bit values, standard rounding picks the closest one deterministically. Stochastic rounding picks one of the two neighbors probabilistically, based on how close the true value is to each. This preserves the expected value of the data, reducing bias.

  • Round-to-Nearest-Even (RTNE): The standard, deterministic rounding rule used in IEEE floating-point standards. If a number is exactly halfway between two representable values, it rounds to the one with an even least significant bit.
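To make the difference between the grids concrete, the following minimal sketch enumerates the non-negative magnitudes of a tiny sign/exponent/mantissa format. It assumes an exponent bias of 1, IEEE-style subnormals, and no inf/NaN codes. The function names are illustrative and do not come from the paper.

```python
def fp4_magnitudes(exp_bits, man_bits, bias=1):
    """Non-negative values of a small float format (assumed: no inf/NaN codes)."""
    values = []
    for e in range(2 ** exp_bits):
        for m in range(2 ** man_bits):
            frac = m / 2 ** man_bits
            if e == 0:
                # Subnormal codes: no implicit leading 1.
                values.append(frac * 2.0 ** (1 - bias))
            else:
                values.append((1.0 + frac) * 2.0 ** (e - bias))
    return sorted(values)


def gaps(levels):
    """Spacing between consecutive levels."""
    return [b - a for a, b in zip(levels, levels[1:])]


E2M1 = fp4_magnitudes(exp_bits=2, man_bits=1)
E1M2 = fp4_magnitudes(exp_bits=1, man_bits=2)
INT4 = list(range(0, 8))  # non-negative INT4 codes

for name, levels in (("E2M1", E2M1), ("E1M2", E1M2), ("INT4", INT4)):
    print(name, levels, "gaps:", gaps(levels))
```

Under these assumptions the E2M1 gaps change size along the grid, while the E1M2 and INT4 gaps are constant. This is the sense in which the latter two are called uniform.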

3.2. Previous Works

The paper places itself in the context of recent advances in low-precision training:

  • FP8 Training: Papers like (Micikevicius et al., 2022) established FP8 as a viable standard for training, which paved the way for going even lower to FP4.
  • MXFP4 & NVFP4: Proposed by (Rouhani et al., 2023) and (NVIDIA et al., 2026), these are specific hardware-software recipes for FP4 training. They rely on E2M1 with fine-grained block scaling. NVFP4 introduced the use of RHT and SR to handle outliers, but the authors of the current paper show these methods still suffer from the geometric "Shrinkage Bias."
  • HiFloat4: Work by (Taghian et al., 2026) adopts a uniform S1P2 data element (similar to E1M2) on Ascend NPUs, supporting the paper's argument that uniform grids are promising.
  • Quantization-Aware Training & Outlier Handling: Methods like (Wang et al., 2025) and (Panferov et al., 2026) modify the gradient estimator or clamp outliers to stabilize FP4. The paper argues these treat symptoms of the E2M1 format flaw rather than the root cause.

3.3. Technological Evolution

The field has evolved from using standard FP32/BF16 to FP8 (e.g., in H100 GPUs), and now towards FP4 (e.g., in Blackwell GPUs). Initially, the focus was on dynamic range (handling outliers), favoring formats like E2M1. However, as researchers applied techniques like RHT to smooth out distributions, the bottleneck shifted from "dynamic range" to "local resolution" (precision). This paper argues that the grid geometry itself should change to match the smoother data distributions created by transforms like RHT.

3.4. Differentiation Analysis

The core difference between this paper and previous works (like NVFP4) is the choice of the number format grid.

  • Previous work: Stick with E2M1 (non-uniform) and try to fix the resulting errors with complex scaling, rounding tricks, or restricting where RHT is applied.
  • This paper (UFP4): Switch to a uniform grid (E1M2/INT4). This eliminates the geometric bias at the source. Because the bias is gone, they can apply RHT everywhere (Full-RHT), which previous methods couldn't do safely.

4. Methodology

4.1. Principles

The core principle of the UFP4 method is that geometric uniformity in the number representation grid is superior to wide dynamic range when using Random Hadamard Transforms (RHT).

The intuition is as follows:

  1. The Problem with E2M1: The E2M1 format has "bins" (ranges of numbers that round to a specific value) that are asymmetric. For example, the level 2 has neighbors 1.5 and 3, so its bin (1.75, 2.5) extends 0.25 to the left of the level but 0.5 to the right. Under a roughly uniform density, more of the mass in such a bin lies above the level than below it, so on average values are rounded down (towards zero). This is "Shrinkage Bias."
  2. Accumulation: In a deep network, you multiply matrices together. If every matrix operation slightly shrinks the magnitude of the signal, this shrinking compounds exponentially. By the time the signal reaches the deep layers, it has decayed significantly, harming training.
  3. The RHT Mismatch: RHT is used to spread outliers out. This makes the distribution of values more uniform (dense). However, this pushes more data into the mid-range values of E2M1, which happen to be where the geometric asymmetry (and thus the bias) is the worst. So, RHT, intended to help, actually makes the Shrinkage Bias worse in E2M1.
  4. The UFP4 Solution: Use a Uniform Grid (E1M2 or INT4). In a uniform grid, the bins are symmetric. There is no inherent bias towards zero. Therefore, the signal does not decay. Furthermore, because the grid is uniform, it perfectly matches the "smoothed" distribution created by RHT. This allows RHT to be used on all operations without fear of exacerbating bias.

4.2. Core Methodology In-depth

4.2.1. Formalizing Shrinkage Bias

The authors first provide a rigorous mathematical definition of Shrinkage Bias.

Consider a block of data with a scale factor $s$. Let $t = |x|/s$ be the normalized magnitude of a value, and let $\rho_G$ denote round-to-nearest-even onto the grid $G$. The paper defines Shrinkage Bias through the expected rounding error for a distribution $\mathcal{P}$ over $t$: $b_G(\mathcal{P}) = \mathbb{E}_{t \sim \mathcal{P}}[\rho_G(t) - t]$. If this value is negative, the quantized magnitude is on average smaller than the real magnitude (magnitude shrinkage).

Assume we are looking at the non-negative levels of the format codebook $G_{+} = \{q_0, q_1, \dots, q_n\}$ where $0 = q_0 < q_1 < \dots < q_n$. For any interior level $q_i$, the "bin" of values that round to it is the interval between the midpoints to its two neighbors. The authors define the interior of this bin as:

$$\mathcal{B}_{i}=\left(\frac{q_{i-1}+q_{i}}{2},\frac{q_{i}+q_{i+1}}{2}\right),\qquad\ell_{i}=\frac{q_{i}-q_{i-1}}{2},\quad r_{i}=\frac{q_{i+1}-q_{i}}{2}.$$

Here, $\ell_i$ is the width of the bin to the left of $q_i$, and $r_i$ is the width to the right. If the density of data inside this bin is roughly uniform, we can calculate the expected error. The error for a value $t$ inside the bin is $q_i - t$. By changing variables to $u = t - q_i$, the expected error becomes the average of $-u$ over the bin:

$$\mathbb{E}[\rho_{G}(t)-t\mid t\in\mathcal{B}_{i}]=\frac{1}{\ell_{i}+r_{i}}\int_{-\ell_{i}}^{r_{i}}(-u)\,du=\frac{\ell_{i}-r_{i}}{2}=\frac{2q_{i}-q_{i-1}-q_{i+1}}{4}.$$

This formula is the smoking gun. It shows that the expected error depends on the difference between the left and right bin widths ($\ell_i - r_i$).

  • If the grid is uniform (like E1M2), $\ell_i = r_i$, so the error is 0. No bias.

  • If the grid is non-uniform (like E2M1), $\ell_i \neq r_i$ wherever the spacing changes. Specifically, at the transition levels of E2M1 (e.g., the level 2, which sits between 1.5 and 3), the right side is wider than the left ($r_i > \ell_i$), making the result negative. This confirms the systematic "Shrinkage Bias."

The following figure (Figure 1 from the original paper) visually demonstrates this geometric asymmetry in E2M1 compared to the uniform grids.

img-1.jpeg A chart showing the non-uniform versus uniform behavior of E2M1, E1M2, and INT4 across data element values. By comparing the three formats, the figure highlights the systematic bias of the non-uniform E2M1 format. Key values and markers in the chart help explain the quantization behavior of each format.
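The per-bin formula above can be checked numerically. The sketch below evaluates the closed form $(2q_i - q_{i-1} - q_{i+1})/4$ for every interior level and compares it against a direct average of $q_i - t$ over the bin, assuming a uniform density inside each bin. The E2M1 magnitudes are the standard ones. The uniform grid with step 0.25 and all function names are illustrative assumptions.

```python
E2M1 = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
UNIFORM = [0.25 * k for k in range(8)]  # assumed E1M2-style uniform magnitudes


def bin_bias(levels, i):
    """Closed-form expected rounding error for interior level i."""
    return (2 * levels[i] - levels[i - 1] - levels[i + 1]) / 4


def bin_bias_numeric(levels, i, n=10_000):
    """Midpoint-rule average of (q_i - t) over the rounding bin of q_i."""
    lo = (levels[i - 1] + levels[i]) / 2
    hi = (levels[i] + levels[i + 1]) / 2
    step = (hi - lo) / n
    return sum(levels[i] - (lo + (k + 0.5) * step) for k in range(n)) / n


e2m1_bias = {E2M1[i]: bin_bias(E2M1, i) for i in range(1, len(E2M1) - 1)}
uniform_bias = {UNIFORM[i]: bin_bias(UNIFORM, i) for i in range(1, len(UNIFORM) - 1)}

print("E2M1 per-level bias:   ", e2m1_bias)
print("Uniform per-level bias:", uniform_bias)
```

In this sketch only the E2M1 levels 2 and 4, where the spacing doubles, carry a nonzero (negative) bias. Every interior level of the uniform grid has zero bias.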

4.2.2. Systemic Impact and Multiplicative Accumulation

Next, the authors explain how this small local error becomes a systemic problem.

When performing a matrix multiplication $Z = AB^{\top}$ with quantized operands $\widehat{A}$ and $\widehat{B}$, the shrinkage can be measured by projecting the quantized operand onto the exact BF16 operand. They define a scaling factor $\alpha_A$:

$$\alpha_{A}=\frac{\langle\widehat{A},A\rangle_{F}}{\|A\|_{F}^{2}},\qquad\widehat{A}=\alpha_{A}A+R_{A},\qquad\langle R_{A},A\rangle_{F}=0.$$

Here, $\alpha_A$ represents how much of the original signal $A$ is preserved in the quantized version $\widehat{A}$. If $\alpha_A < 1$, the signal is attenuated (shrunk). $R_A$ is the residual noise orthogonal to $A$.
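As a rough illustration of this projection, the sketch below quantizes a flattened tensor blockwise (1x16 blocks, absmax scaling onto the E2M1 magnitudes) and then computes $\alpha_A$ and the residual $R_A$. The quantizer is a simplified stand-in, not the paper's kernel: it uses plain nearest-level rounding with ties going to the lower level, a full-precision scale, and synthetic Gaussian data. All names are illustrative.

```python
import random

E2M1 = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]


def quantize_block(block, levels):
    """Absmax-scale one block onto `levels`, round to nearest, dequantize."""
    amax = max(abs(v) for v in block)
    if amax == 0.0:
        return list(block)
    scale = amax / levels[-1]
    out = []
    for v in block:
        t = abs(v) / scale
        q = min(levels, key=lambda level: abs(level - t))  # ties: lower level
        out.append((q if v >= 0 else -q) * scale)
    return out


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def shrinkage_projection(a, a_hat):
    """Return (alpha, residual) such that a_hat = alpha * a + residual."""
    alpha = dot(a_hat, a) / dot(a, a)
    residual = [h - alpha * x for h, x in zip(a_hat, a)]
    return alpha, residual


rng = random.Random(0)
a = [rng.gauss(0.0, 1.0) for _ in range(16 * 64)]  # synthetic stand-in tensor
a_hat = []
for start in range(0, len(a), 16):
    a_hat.extend(quantize_block(a[start:start + 16], E2M1))

alpha, residual = shrinkage_projection(a, a_hat)
print(f"alpha = {alpha:.5f}, <R, A> = {dot(residual, a):.2e}")
```

By construction the residual is orthogonal to the original tensor up to floating-point error. The value of `alpha` depends on the data distribution and on the quantizer details assumed here.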

Substituting this into the GEMM equation $Z_q = \widehat{A}\widehat{B}^{\top}$ yields:

$$Z_{q}=\widehat{A}\widehat{B}^{\top}=\underbrace{\alpha_{A}\alpha_{B}AB^{\top}}_{\alpha_{A}\alpha_{B}Z}+\underbrace{\alpha_{A}AR_{B}^{\top}+\alpha_{B}R_{A}B^{\top}+R_{A}R_{B}^{\top}}_{\text{residual noise}}.$$

This decomposition shows that the main signal $Z$ is scaled by $\alpha_A \alpha_B$. When both $\alpha$'s are less than 1 (due to shrinkage), the signal is multiplied by a factor $\eta \approx \alpha_A \alpha_B < 1$ at every layer.

In a deep network with $K$ layers, this attenuation accumulates. The cumulative scaling factor is the product of all individual factors:

$$\prod_{k=1}^{K}\eta_{k}=\prod_{k=1}^{K}(1-\delta_{k})\approx\exp\left(-\sum_{k=1}^{K}\delta_{k}\right),\qquad\delta_{k}=1-\eta_{k}.$$

This approximation (derived using $\log(1-\delta) \approx -\delta$) shows that the decay is exponential in the summed per-layer shrinkage, and therefore exponential in depth when $\delta_k$ is roughly constant. Even a tiny per-layer shrinkage $\delta_k$ can compound into substantial signal loss in a very deep model.
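A small numeric check makes the compounding concrete. The sketch below compares the exact product $\prod_k (1-\delta_k)$ with the approximation $\exp(-\sum_k \delta_k)$. The per-layer shrinkage of 0.002 and the depth of 64 are hypothetical values chosen only for illustration; they are not figures from the paper.

```python
import math


def cumulative_attenuation(deltas):
    """Exact product of per-layer factors eta_k = 1 - delta_k."""
    product = 1.0
    for d in deltas:
        product *= 1.0 - d
    return product


def exponential_approximation(deltas):
    """First-order approximation exp(-sum(delta_k))."""
    return math.exp(-sum(deltas))


deltas = [0.002] * 64  # hypothetical: 0.2% shrinkage per layer, 64 layers
exact = cumulative_attenuation(deltas)
approx = exponential_approximation(deltas)
print(f"exact = {exact:.5f}, approx = {approx:.5f}")
```

Because $\log(1-\delta) \le -\delta$ for $0 \le \delta < 1$, the exponential form is an upper bound on the exact product, and the two agree closely while each $\delta_k$ is small.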

4.2.3. RHT Exacerbation

The authors then analyze why Random Hadamard Transform (RHT) makes this worse for E2M1.

RHT rotates the tensor to disperse outliers. This shifts the "regime" of the data from dynamic-range-limited (few huge outliers) to local-resolution-limited (many values of moderate size). To quantify this, they measure the Effective Bucket Ratio ($B_{\text{eff}}$), which measures how evenly the 4-bit codebook levels are used:

$$B_{\text{eff}}(G, T) = \frac{\exp(\mathcal{E}(G, T))}{K}, \quad \text{where} \quad \mathcal{E}(G, T) = -\sum_{i=1}^{K} p_{i} \log\left(p_{i} + \epsilon\right).$$

Here, $\mathcal{E}$ is the entropy of the bucket distribution, $p_i$ is the fraction of values of tensor $T$ that land in bucket $i$, and $K$ in this formula is the number of buckets (not the number of layers used above). A higher $B_{\text{eff}}$ means the buckets are used more uniformly.
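The Effective Bucket Ratio is straightforward to compute from a histogram of bucket counts. The following sketch implements the formula directly. The value of `eps` and the example histograms are assumptions made for illustration.

```python
import math


def effective_bucket_ratio(counts, eps=1e-12):
    """exp(entropy of bucket usage) divided by the number of buckets."""
    total = sum(counts)
    probs = [c / total for c in counts]
    entropy = -sum(p * math.log(p + eps) for p in probs)
    return math.exp(entropy) / len(counts)


evenly_used = [10] * 8            # every bucket used equally often
collapsed = [80, 0, 0, 0, 0, 0, 0, 0]  # all values land in a single bucket
skewed = [50, 20, 10, 8, 6, 3, 2, 1]

for name, counts in (("even", evenly_used), ("collapsed", collapsed), ("skewed", skewed)):
    print(f"{name:9s} B_eff = {effective_bucket_ratio(counts):.3f}")
```

The ratio reaches 1 when all buckets are used equally and falls to 1/K when every value collapses into one bucket, so it can be read as the fraction of the codebook that is effectively in use.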

They also measure the change in Signal-to-Quantization-Noise Ratio ($\Delta \mathrm{SQNR}$) induced by RHT:

$$\Delta \mathrm{SQNR} = 10 \log_{10} \frac{\mathrm{NMSE}_{I}}{\mathrm{NMSE}_{\mathbf{H}_{16}}},$$

where $\mathrm{NMSE}_I$ is the error without RHT and $\mathrm{NMSE}_{\mathbf{H}_{16}}$ is the error with RHT.

The analysis shows that for E2M1, while RHT increases bucket utilization (good), it actually decreases SQNR (bad). This is because RHT pushes data into the mid-range bins of E2M1, which are precisely the asymmetric bins causing Shrinkage Bias. For the uniform E1M2 grid, however, RHT increases both utilization and SQNR.
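The $\Delta\mathrm{SQNR}$ measurement can be sketched on a single 16-element block. The code below applies a random sign flip followed by an orthonormal 16-point Hadamard transform, quantizes with absmax scaling, and compares NMSE with and without the rotation. Because the rotation is orthonormal, the error measured in the rotated domain equals the error after rotating back. The quantizer, the E1M2 magnitudes, the synthetic block with one injected outlier, and all names are assumptions for illustration, not the paper's implementation.

```python
import math
import random

E2M1 = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
E1M2 = [0.25 * k for k in range(8)]  # assumed uniform magnitudes


def hadamard(vec):
    """Orthonormal fast Walsh-Hadamard transform (length is a power of 2)."""
    out = list(vec)
    h = 1
    while h < len(out):
        for start in range(0, len(out), 2 * h):
            for j in range(start, start + h):
                a, b = out[j], out[j + h]
                out[j], out[j + h] = a + b, a - b
        h *= 2
    norm = math.sqrt(len(out))
    return [v / norm for v in out]


def quantize_block(block, levels):
    """Absmax-scale onto `levels`, round to nearest level, dequantize."""
    amax = max(abs(v) for v in block)
    if amax == 0.0:
        return list(block)
    scale = amax / levels[-1]
    out = []
    for v in block:
        t = abs(v) / scale
        q = min(levels, key=lambda level: abs(level - t))
        out.append(math.copysign(q * scale, v))
    return out


def nmse(block, levels):
    deq = quantize_block(block, levels)
    err = sum((x - y) ** 2 for x, y in zip(block, deq))
    return err / sum(x * x for x in block)


def delta_sqnr_db(block, signs, levels):
    """10*log10(NMSE without RHT / NMSE with RHT); positive means RHT helps."""
    rotated = hadamard([s * x for s, x in zip(signs, block)])
    return 10.0 * math.log10(nmse(block, levels) / nmse(rotated, levels))


rng = random.Random(0)
block = [rng.gauss(0.0, 1.0) for _ in range(16)]
block[3] = 40.0  # inject one synthetic outlier
signs = [rng.choice((-1.0, 1.0)) for _ in range(16)]

results = {name: delta_sqnr_db(block, signs, levels)
           for name, levels in (("E2M1", E2M1), ("E1M2", E1M2))}
for name, value in results.items():
    print(f"{name}: delta SQNR = {value:+.2f} dB")
```

A single synthetic block is only a toy. The paper's conclusions are drawn from real training tensors, so the numbers printed here should not be read as a reproduction of its results.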

The following figure (Figure 2 from the original paper) illustrates how RHT changes the bucket utilization and SQNR for real tensors, showing the divergence between E2M1 and E1M2.

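img-3.jpeg A scatter plot showing Delta SQNR (dB) against the raw data for the fwd_w linear function. The X axis is the raw data value and the Y axis is the Delta SQNR. In the lower left, some blue points show negative SQNR, while the orange points in the upper right indicate a marked SQNR improvement.

img-4.jpeg A chart showing the relationship between the output and input of a linear model. Orange and blue points are distributed across the coordinate system, giving a visual view of the model's prediction error.

img-5.jpeg A chart showing the data distribution for linear_fc1 / bwd_dy. The upper half shows blue and light-blue data points, while the lower half shows orange and brown data points, reflecting how the values change under different conditions.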

4.2.4. The UFP4 Recipe

Based on the analysis, the authors propose UFP4. The recipe is defined by the following structural choices:

  1. Format: Switch from E2M1 to a Uniform Grid (E1M2 or INT4 style). This eliminates the geometric bias at the source.

  2. RHT Scope: Apply RHT to all three training GEMMs:

    • FPROP (fwd_y): The forward pass matrix multiplication.
    • DGRAD (bwd_dx): The backward pass with respect to inputs.
    • WGRAD (bwd_dw): The backward pass with respect to weights.
    • Note: Previous E2M1 recipes (like NVFP4) only applied RHT to bwd_dw because applying it to fwd_y and bwd_dx caused instability due to the Shrinkage Bias. UFP4 fixes this, allowing full coverage.
  3. Stochastic Rounding (SR): Apply SR only to the upstream gradient dY. This preserves the gradient expectation for the optimizer.

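The following figure (Figure 3 from the original paper) provides an overview of the UFP4 recipe flow.

img-9.jpeg A schematic of the UFP4 training pipeline, showing the flow of information and operations between layers, including BF16 activations and the optimizer design. With a uniform grid, UFP4 applies the Random Hadamard Transform (RHT) across all forward and backward GEMMs.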
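The reason stochastic rounding is reserved for the gradient can be seen in a few lines: on a uniform grid, stochastic rounding reproduces the input in expectation, whereas deterministic rounding always returns the same neighbor. The sketch below is a minimal illustration on an unbounded uniform grid. The step size, sample count, and function names are assumptions, and clipping to the 4-bit range is deliberately omitted.

```python
import math
import random


def stochastic_round(x, step, rng):
    """Round x to a multiple of `step`; round up with probability equal
    to the fractional distance from the lower neighbor."""
    lower = math.floor(x / step)
    frac = x / step - lower
    return (lower + (1 if rng.random() < frac else 0)) * step


def round_to_nearest(x, step):
    """Deterministic rounding (Python's round() is round-half-to-even)."""
    return round(x / step) * step


rng = random.Random(0)
x, step = 0.3, 0.25
samples = [stochastic_round(x, step, rng) for _ in range(20_000)]
sr_mean = sum(samples) / len(samples)

print(f"input               = {x}")
print(f"round-to-nearest    = {round_to_nearest(x, step)}")
print(f"mean of SR samples  = {sr_mean:.4f}")
```

Each individual stochastic sample is noisier than the deterministic result, which is one plausible reason to restrict it to the path where an unbiased expectation matters most.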

5. Experimental Setup

5.1. Datasets

The paper validates the UFP4 recipe using large-scale pretraining experiments. While the specific text corpus (e.g., Common Crawl, The Pile) is not explicitly named in the provided text, the experiments focus on model scale and architecture:

  • Dense 1.5B: A dense (non-MoE) model with 1.5 billion parameters.

  • MoE 7.9B: A Mixture-of-Experts model with 7.9 billion parameters.

  • MoE 124B: A very large Mixture-of-Experts model with 124 billion parameters.

These models were chosen to demonstrate that the findings hold across different architectural paradigms (Dense vs. MoE) and scales, which supports the industrial applicability of the method. The training involved long-run pretraining over a large number of tokens.

5.2. Evaluation Metrics

The paper uses several metrics to evaluate the quality of the FP4 training:

  1. BF16-relative Loss Degradation:

    • Conceptual Definition: This metric quantifies how much worse the model trained in FP4 performs compared to a reference model trained in BF16. It measures the relative error in the Language Modeling (LM) loss. A value closer to 0% indicates the FP4 training is as good as BF16.
    • Mathematical Formula: $\text{Relative Error} = \frac{|\mathcal{L}_r - \mathcal{L}_{\text{BF16}}|}{\mathcal{L}_{\text{BF16}}}$
    • Symbol Explanation:
      • $\mathcal{L}_r$: The loss of the recipe being evaluated (e.g., UFP4 or E2M1 baseline).
      • $\mathcal{L}_{\text{BF16}}$: The loss of the baseline BF16 model.
  2. Signal-to-Quantization-Noise Ratio (SQNR):

    • Conceptual Definition: This measures the fidelity of the quantization process. It compares the power of the original signal to the power of the error introduced by quantization. Higher SQNR means less distortion.
    • Mathematical Formula: $\text{SQNR} = 10 \log_{10} \left( \frac{\sum_i x_i^2}{\sum_i (x_i - \hat{x}_i)^2} \right)$
    • Symbol Explanation:
      • $x_i$: The original values in the tensor.
      • $\hat{x}_i$: The quantized (dequantized) values.
  3. Effective Bucket Ratio ($B_{\text{eff}}$):

    • Conceptual Definition: This measures how evenly the available 4-bit representable values are being used. If all values cluster in just one bin, the ratio is low. If they are spread out evenly across all bins, the ratio is high. High utilization is generally good for quantization efficiency.
    • Mathematical Formula: (As defined in Methodology) $B_{\text{eff}}(G, T) = \frac{\exp(\mathcal{E}(G, T))}{K}$
    • Symbol Explanation:
      • $\mathcal{E}(G, T)$: The entropy of the distribution of values across the $K$ buckets.
      • $K$: The total number of buckets (16 for FP4).
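The first two metrics above can be written as short helper functions. This is a minimal sketch with illustrative names. The example losses and vectors are made-up inputs, not values from the paper.

```python
import math


def relative_loss_error(loss_recipe, loss_bf16):
    """BF16-relative loss degradation: |L_r - L_BF16| / L_BF16."""
    return abs(loss_recipe - loss_bf16) / loss_bf16


def sqnr_db(original, dequantized):
    """Signal-to-quantization-noise ratio in dB."""
    signal = sum(x * x for x in original)
    noise = sum((x - y) ** 2 for x, y in zip(original, dequantized))
    return 10.0 * math.log10(signal / noise)


# Hypothetical inputs for illustration only.
print(f"relative error = {100 * relative_loss_error(2.02, 2.00):.2f}%")
print(f"SQNR = {sqnr_db([1.0, 2.0, 3.0], [1.0, 2.0, 2.5]):.2f} dB")
```

Note that `sqnr_db` is undefined when the quantization is exact (zero noise). A production version would need to handle that case explicitly.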

5.3. Baselines

The paper compares UFP4 against a strong E2M1-based reference recipe. This baseline was not just a default setting; it was selected via a controlled configuration search (detailed in Appendix B) to ensure it was the best possible implementation of the E2M1 paradigm. The key differences are summarized in the following table (Table 1 from the original paper):

The following are the results from Table 1 of the original paper:

| Configuration | E2M1-based recipe | E1M2-based recipe (UFP4) |
| --- | --- | --- |
| Format | E2M1 | E1M2/INT4-style uniform grid |
| Quant block size | 1×16 | 1×16 |
| Scale hierarchy | FP32 single-level | FP32 single-level |
| RHT scope | bwd_dw | fwd_y, bwd_dx, bwd_dw |
| RHT block size | 16 | 16 |
| SR scope | dY | dY |
| 2D weight scaling | X | X |

The primary differences are the Format (E2M1 vs. uniform grid) and the RHT scope (bwd_dw only vs. Full-RHT).
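The two recipes can also be written down as plain configuration objects, which makes the difference between them mechanical to inspect. The sketch below is only a transcription of Table 1 into a hypothetical `Fp4Recipe` dataclass. The class, its field names, and the helper are illustrative and do not correspond to any real training framework. The "2D weight scaling" row is left out because its marker is identical for both recipes.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class Fp4Recipe:
    """Hypothetical container for the Table 1 settings."""
    data_format: str
    quant_block: tuple
    scale_hierarchy: str
    rht_scope: tuple
    rht_block_size: int
    sr_scope: tuple


E2M1_REFERENCE = Fp4Recipe(
    data_format="E2M1",
    quant_block=(1, 16),
    scale_hierarchy="FP32 single-level",
    rht_scope=("bwd_dw",),
    rht_block_size=16,
    sr_scope=("dY",),
)

UFP4 = Fp4Recipe(
    data_format="E1M2/INT4-style uniform grid",
    quant_block=(1, 16),
    scale_hierarchy="FP32 single-level",
    rht_scope=("fwd_y", "bwd_dx", "bwd_dw"),
    rht_block_size=16,
    sr_scope=("dY",),
)


def differing_fields(a, b):
    """Names of the settings on which two recipes disagree."""
    return [f.name for f in fields(a) if getattr(a, f.name) != getattr(b, f.name)]


print(differing_fields(E2M1_REFERENCE, UFP4))
```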

6. Results & Analysis

6.1. Core Results Analysis

The paper organizes experiments into five questions (Q1-Q5) to trace the impact of the proposed changes.

6.1.1. Q1: Does RHT change the preferred 4-bit grid?

The authors analyze single tensors and GEMM outputs. They find that for "well-behaved" tensors (few outliers), RHT has little effect. However, for "outlier-heavy" tensors (common in LLM layers such as linear_fc2), RHT reverses the format ranking:

  • Without RHT: E2M1 is better because its wide dynamic range handles the few outliers well.

  • With RHT: E1M2 (UFP4) becomes better. RHT smooths the distribution, removing the need for wide dynamic range and highlighting the need for local precision, which E1M2 provides.

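The following figures (Figures 4 and 5 from the original paper) show the SQNR and bucket utilization for these tensors, illustrating the inversion of performance.

img-10.jpeg A chart showing how the signal-to-noise ratio (SNR) varies across layers. The left panel (a) shows the per-layer SNR, and the right panel (b) shows the SNR change induced by the Random Hadamard Transform (RHT), with different colors for the E1M2 and E2M1 formats.

img-11.jpeg A chart showing per-layer magnitude bucket utilization and the utilization gain introduced by the Random Hadamard Transform. The left panel shows how bucket utilization changes across layers, and the right panel shows the utilization increase after RHT. The overall trend shows a relative advantage for the uniform format over the non-uniform format.

img-12.jpeg A chart showing the per-layer SQNR (Signal-to-Quantization-Noise Ratio) of the linear fully connected layers and the SQNR change induced by RHT (Random Hadamard Transform). The left panel shows the SQNR at different layers, while the right panel highlights the effect of RHT.

img-13.jpeg A chart whose left panel shows per-layer magnitude bucket utilization and whose right panel shows the utilization gain from the Random Hadamard Transform (RHT). It compares the E1M2 and E2M1 formats across layers, indicating that the effectiveness of E2M1 decreases with depth and that the effect of RHT on E1M2 increases slightly.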

This confirms the hypothesis: once RHT is applied, the uniform grid is superior.

6.1.2. Q2: Does UFP4 reduce the BF16-relative training loss gap?

In end-to-end training on Dense 1.5B, MoE 7.9B, and MoE 124B, UFP4 consistently stays closer to the BF16 baseline than the E2M1 reference.

  • Dense 1.5B: Error dropped from 1.2570% to 0.9673%.

  • MoE 7.9B: Error dropped from 2.3596% to 1.8469%.

  • MoE 124B: Error dropped from 1.7308% to 1.3863%.

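The following figure (Figure 8 from the original paper) shows the loss curves over training steps, visually demonstrating that UFP4 maintains lower loss degradation throughout the long run.

img-16.jpeg A chart showing the BF16-relative language modeling loss error $|\mathcal{L}_r - \mathcal{L}_{\text{BF16}}| / \mathcal{L}_{\text{BF16}}$ of the Dense 1.5B model at different numbers of training tokens. The X axis is the number of training tokens and the Y axis is the relative loss error. Comparing the E1M2 and E2M1 curves shows that E1M2 performs better over long training.

img-17.jpeg A plot showing the performance of E2M1 relative to BF16 for the MoE 7.9B model over 100 billion training tokens. The curves show the loss under different quantization schemes, with relative loss values of 2.339% and 1.840%.

img-18.jpeg A plot comparing the loss of E2M1 against BF16 and the loss of E1M2 against BF16 at different numbers of training tokens. It shows how the performance of the MoE 124B model changes across training stages.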
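The three reported gaps can be put on a common footing by asking what fraction of the E2M1 gap to BF16 is removed by UFP4. The sketch below does that arithmetic using only the percentages quoted in the list above. The dictionary and function names are illustrative.

```python
# (E2M1 gap %, UFP4 gap %) as quoted in the results list above.
GAPS_PERCENT = {
    "Dense 1.5B": (1.2570, 0.9673),
    "MoE 7.9B": (2.3596, 1.8469),
    "MoE 124B": (1.7308, 1.3863),
}


def relative_reduction(e2m1_gap, ufp4_gap):
    """Fraction of the E2M1 gap to BF16 that UFP4 removes."""
    return (e2m1_gap - ufp4_gap) / e2m1_gap


reductions = {name: relative_reduction(*pair) for name, pair in GAPS_PERCENT.items()}
for name, r in reductions.items():
    print(f"{name}: gap to BF16 shrinks by {100 * r:.1f}%")
```

On these numbers UFP4 removes roughly a fifth to a quarter of the E2M1 gap at each scale, which also makes clear that a sizable residual gap to BF16 remains.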

6.1.3. Q3: Does the advantage persist across model scales?

Using a scaling-law analysis (training models from 10M to 324M parameters), the authors show that the UFP4 curve is consistently below the E2M1 curve. The fitted curves suggest that the advantage of UFP4 is not specific to one model size but persists as the model scales.

The following figure (Figure 9 from the original paper) shows the scaling law fit and the loss comparison across sizes.

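img-19.jpeg A loss scaling curve showing how the loss changes with FLOPs under different training configurations. UFP4 achieves a lower loss than the E2M1 baseline, and the curves fall steadily, indicating that model performance improves as compute increases.

img-20.jpeg A chart showing an IsoFLOPS comparison under different experimental conditions. Multiple lines represent the performance of different configurations, with a logarithmic Y axis and the experimental condition on the X axis. The configurations perform differently under the same conditions, indicating the advantage of the new method in compute efficiency.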

6.2. Ablation Studies / Parameter Analysis

6.2.1. Q4: Which recipe components matter?

The authors perform ablation studies on the Dense 1.5B model to test the contribution of RHT scope and Stochastic Rounding (SR).

The following are the results from Table 2 of the original paper:

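| Setting | SR | fwd_y | bwd_dx | bwd_dw | Mean LM loss | Δ loss |
| --- | --- | --- | --- | --- | --- | --- |
| RHT scope ablation | | | | | | |
| No RHT | ✓ | - | - | - | 1.89202 | 0.00000 |
| RHT on bwd_dw | ✓ | - | - | ✓ | 1.88721 | -0.00481 |
| RHT on bwd_dx, bwd_dw | ✓ | - | ✓ | ✓ | 1.88912 | -0.00290 |
| RHT on fwd_y, bwd_dw | ✓ | ✓ | - | ✓ | 1.88558 | -0.00644 |
| Full RHT w/ SR (UFP4) | ✓ | ✓ | ✓ | ✓ | 1.88079 | -0.01123 |
| SR ablation under full RHT | | | | | | |
| Full RHT w/o SR | - | ✓ | ✓ | ✓ | 1.88535 | 0.00000 |
| Full RHT w/ SR (UFP4) | ✓ | ✓ | ✓ | ✓ | 1.88079 | -0.00456 |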

Analysis:

  • Full RHT is best: The configuration with RHT on all paths (fwd_y, bwd_dx, bwd_dw) achieves the lowest loss. This validates the core claim that uniform grids enable Full-RHT.

  • SR helps: Adding Stochastic Rounding to dY provides an additional loss reduction.

    The authors also tested if restricting the range of E2M1 (e.g., max_fpx=2.0) could mimic a uniform grid. The results showed that simply restricting E2M1 is not a substitute for a native uniform grid, as it sacrifices dynamic range without fully resolving the geometric issues.

The following figure (Figure 10 from the original paper) shows that range-restricted E2M1 variants fail to match the performance of the native E2M1 reference, let alone UFP4.

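img-21.jpeg A chart showing, for Dense 1.5B training, the relative language modeling loss difference of the Full-RHT E2M1 variants with respect to the E2M1 reference. As the number of training tokens increases, the loss difference gradually stabilizes, with a maximum loss difference of +1.99%.

img-22.jpeg A plot showing, for the MoE 7.9B model, the relative loss difference against the number of training tokens under different E2M1 configurations. It shows the loss trend of Full-RHT E2M1 (max_fpx=2, 3, 4) relative to the baseline, each recording only a small loss reduction.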
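One way to see why range restriction is a poor substitute is to look at what is left of the E2M1 grid after capping it. The sketch below assumes that a setting such as max_fpx=2.0 simply discards magnitudes above the cap, which is an interpretation and not something this analysis confirms. It then counts the surviving levels and lists the interior levels whose bins are still asymmetric. All names are illustrative.

```python
E2M1 = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]


def restrict(levels, max_level):
    """Assumed semantics of range restriction: drop levels above the cap."""
    return [q for q in levels if q <= max_level]


def asymmetric_levels(levels):
    """Interior levels whose left and right gaps differ."""
    return [
        levels[i]
        for i in range(1, len(levels) - 1)
        if 2 * levels[i] != levels[i - 1] + levels[i + 1]
    ]


summary = {}
for cap in (2.0, 3.0, 4.0, 6.0):
    kept = restrict(E2M1, cap)
    summary[cap] = (len(kept), asymmetric_levels(kept))
    print(f"cap={cap}: {len(kept)} magnitudes, asymmetric at {summary[cap][1]}")
```

Under this reading, a cap of 2 does give a uniform grid but keeps only five of the eight magnitudes, while caps of 3 or 4 retain the asymmetric bin at level 2. That is consistent with the statement that restriction gives up range without fully removing the geometric issue.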

6.2.2. Q5: Can RHT be fused into FP4 quantization efficiently?

Finally, the authors address the computational cost. They show that by fusing the RHT operation directly into the quantization kernel, the overhead is minimal.

  • Fused Latency: Only $1.06\times$ (SM90) and $1.07\times$ (SM100) the latency of standalone quantization.

  • Unfused Latency: $1.62\times$ and $1.41\times$, respectively.

These measurements indicate that the RHT step in UFP4 adds little overhead in practice once it is fused into the quantization kernel.

The following figure (Figure 11 from the original paper) illustrates the low overhead of the fused kernel.

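img-23.jpeg A chart showing the average latency ratio of RHT plus quantization on SM90 and SM100. The left bar chart shows the latency ratio relative to standalone quantization, and the right bar chart shows the extra overhead of fused RHT and quantization for each shape. The U/F ratios are 1.62x and 1.41x, reflecting the comparison between unfused and fused latency.

img-24.jpeg A chart showing the overhead of fused RHT and quantization. The horizontal axis shows different data sizes, and the vertical axis shows the overhead ratio relative to the baseline on SM90 and SM100. The overhead ratio rises gradually as the data size increases.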

7. Conclusion & Reflections

7.1. Conclusion Summary

The paper concludes that the default industry choice of E2M1 for FP4 training is fundamentally flawed when combined with modern outlier-handling techniques like the Random Hadamard Transform (RHT). The authors identify "Shrinkage Bias," a geometric property of non-uniform grids, as the root cause of training instability and signal decay. They propose UFP4, a recipe based on uniform grids (E1M2/INT4), which eliminates this bias. By doing so, UFP4 unlocks the ability to apply RHT to all training operations (Full-RHT), resulting in significantly lower BF16-relative loss degradation across dense and MoE models at various scales.

7.2. Limitations & Future Work

The authors acknowledge that current hardware accelerators (like NVIDIA Blackwell and AMD MI350) are primarily optimized for E2M1 data paths. While they show that E2M1 range restriction is not a viable software fix, the immediate limitation is the lack of native hardware support for E1M2/INT4 grids in major ecosystems. The authors suggest that future accelerator design should treat uniform 4-bit grids as "first-class citizens" alongside E2M1. They also note that while UFP4 closes the gap to BF16, a residual gap still remains, indicating room for further refinement in scaling or quantization techniques.

7.3. Personal Insights & Critique

This paper provides a compelling "back to basics" argument. Often, research focuses on adding complex layers or heuristics (like better gradient estimators) to fix problems. This paper, however, digs down to the fundamental bit-level representation of numbers. It is a strong reminder that numerical format geometry matters.

The insight that RHT changes the "regime" of the data from dynamic-range-limited to local-resolution-limited is profound. It suggests that as we get better at preprocessing tensors (making them smoother), our number formats should evolve from "wide range" types (FP) to "high precision" types (INT/Uniform).

A potential critique or area for further exploration could be the interaction with other non-linearities or activation functions. The paper focuses on linear layers and GEMMs. It would be interesting to see if this uniform grid advantage holds for other operations like LayerNorm or Softmax, which have their own numerical stability challenges. Additionally, the reliance on RHT implies a computational overhead; while the fused kernel results are promising, the absolute cost on older hardware without these fused kernels could be a barrier to adoption. Overall, the paper makes a robust case for a paradigm shift in how we think about low-precision training formats.