NVIDIA Unveils 4-Bit Pretraining With NVFP4: Scaling A 12B Hybrid Mamba-Transformer Across 10 Trillion Tokens

Training cutting-edge large language models (LLMs) using FP8 precision has become the norm, but dropping down to 4-bit floating point has been a persistent challenge. The issue is that narrower formats squeeze the dynamic range and magnify quantization errors over long sequences of tokens. A new study from NVIDIA introduces a pretraining approach built on NVFP4, a 4-bit microscaling format with native support in Blackwell Tensor Cores. The team validated this method by training a 12-billion-parameter hybrid Mamba-Transformer model on 10 trillion tokens. According to the researchers, this represents the longest publicly documented 4-bit precision training run so far. The finished model scores 62.58% on MMLU-Pro 5-shot, compared to 62.62% for the FP8 baseline, and NVFP4 training is supported through NVIDIA’s Transformer Engine.

Understanding NVFP4

To appreciate why NVFP4 matters, it’s useful to look at how microscaling formats function. In a microscaling (MX) format, a group of consecutive low-precision values shares one scale factor, which maps the group back to a wider numerical range during matrix multiplication. MXFP4 uses blocks of 32 elements, with each element stored as E2M1 — that’s 1 sign bit, 2 exponent bits, and 1 mantissa bit — representing only the values ±0, ±0.5, ±1, ±1.5, ±2, ±3, ±4, and ±6. Block scale factors are stored in UE8M0, limiting them to powers of two.
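The E2M1 value list above follows directly from the bit layout. As a minimal sketch, assuming the usual exponent bias of 1 and a subnormal encoding when the exponent bits are zero (the function name `decode_e2m1` is purely illustrative), the eight magnitudes can be enumerated like this:

```python
def decode_e2m1(code: int) -> float:
    """Decode a 4-bit E2M1 code laid out as: sign | 2 exponent bits | 1 mantissa bit."""
    sign = -1.0 if (code >> 3) & 1 else 1.0
    exp = (code >> 1) & 0b11
    man = code & 1
    if exp == 0:
        # Subnormal: no implicit leading one, so the only nonzero magnitude is 0.5.
        return sign * 0.5 * man
    # Normal: implicit leading one, exponent bias assumed to be 1.
    return sign * (1.0 + 0.5 * man) * 2.0 ** (exp - 1)


# Codes 0..7 have the sign bit clear and cover every representable magnitude.
magnitudes = [decode_e2m1(code) for code in range(8)]
print(magnitudes)
```

With only eight magnitudes available, almost all of the usable range has to come from the shared block scale, which is why the choice of scale format matters so much.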

NVFP4 introduces three key changes. First, the block size shrinks from 32 to 16 elements, reducing the dynamic range each scale factor must handle. Second, block scale factors use E4M3 instead of UE8M0, sacrificing some exponent range for greater mantissa precision so the per-block amax (absolute maximum value) can be mapped much closer to the FP4 maximum. Third, NVFP4 adds a second level of scaling: an FP32 per-tensor scale that remaps values so the E4M3 block scales themselves remain representable. The outcome is that at least 6.25% of values in each block — specifically the per-block amax — are captured at near-FP8 accuracy, while the rest use standard FP4.
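The two-level scaling described above can be illustrated with a small quantize-then-dequantize ("fake quantization") routine. This is a sketch under stated assumptions, not the Transformer Engine implementation: the helper names are hypothetical, the E4M3 rounding is a simplified emulation, ties in FP4 rounding are not resolved to even, and the per-tensor scale is assumed to be chosen so that the largest block scale lands on the E4M3 maximum of 448.

```python
import math

FP4_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
FP4_MAX, E4M3_MAX = 6.0, 448.0


def round_e4m3(x: float) -> float:
    """Round a non-negative value to a nearby E4M3 value (simplified emulation)."""
    if x <= 0.0:
        return 0.0
    x = min(x, E4M3_MAX)
    exp = max(math.floor(math.log2(x)), -6)  # clamp to the subnormal exponent
    step = 2.0 ** (exp - 3)                  # 3 mantissa bits
    return min(round(x / step) * step, E4M3_MAX)


def quantize_fp4(x: float) -> float:
    """Snap to the closest E2M1 magnitude, keeping the sign; large inputs clamp to 6."""
    mag = min(FP4_GRID, key=lambda g: abs(abs(x) - g))
    return math.copysign(mag, x)


def nvfp4_fake_quantize(tensor: list[float], block: int = 16) -> list[float]:
    """Quantize and immediately dequantize, so the rounding error is visible."""
    tensor_amax = max(abs(v) for v in tensor)
    if tensor_amax == 0.0:
        return [0.0] * len(tensor)
    # Level 2: FP32 per-tensor scale (assumed choice, see lead-in).
    tensor_scale = tensor_amax / (FP4_MAX * E4M3_MAX)
    out = []
    for start in range(0, len(tensor), block):
        chunk = tensor[start:start + block]
        block_amax = max(abs(v) for v in chunk)
        # Level 1: per-block scale stored in E4M3, relative to the tensor scale.
        block_scale = round_e4m3(block_amax / FP4_MAX / tensor_scale)
        scale = block_scale * tensor_scale
        for v in chunk:
            out.append(quantize_fp4(v / scale) * scale if scale > 0.0 else 0.0)
    return out


# Two blocks with very different magnitudes.
x = [0.01 * (i + 1) for i in range(16)] + [5.0 - 0.3 * i for i in range(16)]
q = nvfp4_fake_quantize(x)
print("small-block amax:", x[15], "->", q[15])
print("large-block amax:", x[16], "->", q[16])
```

In this sketch the amax of each block is recovered to within the rounding error of its E4M3 scale, while the remaining values in the block only get the coarse FP4 grid.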

On NVIDIA Blackwell hardware, FP4 GEMMs deliver 4× the throughput of BF16 on GB200 and 6× on GB300, translating to roughly 2× and 3× speedups over FP8. Memory usage for operands is cut roughly in half compared to FP8.

What Gets Quantized — and What Stays in Higher Precision

Only the GEMMs within linear (fully-connected) layers — specifically the forward pass (Fprop), gradient with respect to inputs (Dgrad), and gradient with respect to weights (Wgrad) — actually run in NVFP4. Embeddings, the output projection head, normalization layers, activation functions, and all attention components (including softmax and the query-key and attention score-value batched GEMMs) remain in BF16 or FP32. Model weights, weight gradients used for accumulation across microbatches and data-parallel replicas, and optimizer states are all maintained in FP32. Tensor parallel reductions are performed in BF16.

The Four-Part Training Methodology

Simply quantizing every linear-layer GEMM to NVFP4 using default settings (1×16 block scaling everywhere, round-to-nearest-even on every tensor, no transforms) causes training to diverge early. NVIDIA’s approach stabilizes training through four components, and ablation studies on the 12B model confirm that each one is essential.

Selective high precision: Linear layers in the first two and the last eight of the model’s 62 blocks (roughly 16% of all linear layers) are kept in BF16. Ablation experiments showed that the final blocks are the most sensitive because they need more dynamic range than FP4 can offer; keeping only the last four blocks in BF16 was also sufficient for stable convergence.
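As a rough illustration of the layer selection, the sketch below builds a per-block precision plan for a 62-block model. The names are hypothetical, and counting by block is only a proxy for the share of linear layers, since different block types hold different numbers of linear layers.

```python
NUM_BLOCKS = 62
FIRST_BF16 = 2  # leading blocks whose linear layers stay in BF16
LAST_BF16 = 8   # trailing blocks whose linear layers stay in BF16


def linear_precision(block_idx: int) -> str:
    """Return the GEMM precision assumed for linear layers in a given block."""
    if block_idx < FIRST_BF16 or block_idx >= NUM_BLOCKS - LAST_BF16:
        return "bf16"
    return "nvfp4"


plan = [linear_precision(i) for i in range(NUM_BLOCKS)]
bf16_fraction = plan.count("bf16") / NUM_BLOCKS
print(f"{plan.count('bf16')} of {NUM_BLOCKS} blocks in BF16 ({bf16_fraction:.1%})")
```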

Random Hadamard Transforms (RHT): Outlier values in weight gradients are redistributed into an approximately Gaussian distribution by multiplying input tiles with a 16×16 Hadamard matrix combined with a random ±1 sign vector. Since the orthogonal transforms cancel out within the dot product, no mathematical correction is required in the GEMM. The d=16 size was determined through experimentation: d=4 harmed convergence, while d=128 produced similar results. RHT is applied only to the inputs of the weight-gradient (Wgrad) GEMM, and a single random sign vector is shared across all linear layers. At the 1.2B scale, randomization had no effect, but it produced a measurable improvement in the 12B run.
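The claim that the transform cancels inside the dot product can be checked numerically. The following sketch uses a Sylvester-constructed 16×16 Hadamard matrix and toy operands whose shared inner dimension is a single 16-wide tile; the shapes, seed, and helper names are illustrative assumptions, and no quantization is applied here.

```python
import random


def hadamard(n: int) -> list[list[float]]:
    """Sylvester construction of an n x n Hadamard matrix, scaled to be orthonormal."""
    h = [[1.0]]
    while len(h) < n:
        h = [row + row for row in h] + [row + [-v for v in row] for row in h]
    norm = n ** 0.5
    return [[v / norm for v in row] for row in h]


def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


rng = random.Random(0)
d = 16
signs = [rng.choice((-1.0, 1.0)) for _ in range(d)]
# Random Hadamard matrix: diag(signs) @ H, which is still orthogonal.
rh = [[signs[i] * v for v in row] for i, row in enumerate(hadamard(d))]

# Toy GEMM operands with inner dimension 16 and one large outlier.
a = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(4)]
a[0][3] = 50.0
b = [[rng.gauss(0, 1) for _ in range(3)] for _ in range(d)]

exact = matmul(a, b)
rotated = matmul(matmul(a, rh), matmul(transpose(rh), b))
max_err = max(abs(x - y) for r1, r2 in zip(exact, rotated) for x, y in zip(r1, r2))

row_amax_before = max(abs(v) for v in a[0])
row_amax_after = max(abs(v) for v in matmul(a, rh)[0])
print("max difference between products:", max_err)
print("row amax before/after transform:", row_amax_before, row_amax_after)
```

The product is unchanged up to floating-point error, while the outlier's energy is spread across all 16 positions of the tile, so the block amax that the FP4 scale has to cover becomes much smaller.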

Two-dimensional (2D) block scaling for weights: Conventional NVFP4 applies scaling to 1×16 blocks along the dot-product axis. Since the backward pass transposes the weight matrix, the forward and backward passes end up working with different quantized versions of the weights, which violates the chain rule. NVIDIA addresses this by scaling weights in 16×16 blocks so that both passes share the same quantized representation. Activations and gradients continue to use 1×16 scaling, as they are less affected by this mismatch.
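The mismatch between forward and backward weights is easy to reproduce on a single 16×16 tile. In the sketch below (hypothetical helper names, scales kept in full precision rather than E4M3 for simplicity), the tile is fake-quantized once with 1×16 scales along rows, once with 1×16 scales along columns as the transposed backward pass would see it, and once with a single 16×16 scale.

```python
import math
import random

FP4_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]


def to_fp4(x: float) -> float:
    """Snap to the closest E2M1 magnitude, keeping the sign."""
    mag = min(FP4_GRID, key=lambda g: abs(abs(x) - g))
    return math.copysign(mag, x)


def fake_quant(values, amax):
    """Quantize-dequantize a group of values that share the scale amax / 6."""
    scale = amax / 6.0
    return [to_fp4(v / scale) * scale for v in values]


def transpose(m):
    return [list(col) for col in zip(*m)]


def count_mismatches(x, y):
    return sum(p != q for rx, ry in zip(x, y) for p, q in zip(rx, ry))


rng = random.Random(0)
n = 16
w = [[rng.gauss(0, 1) for _ in range(n)] for _ in range(n)]
wt = transpose(w)

# 1x16 scaling: the forward pass scales along rows of W, the backward pass along rows of W^T.
fwd_1d = [fake_quant(row, max(map(abs, row))) for row in w]
bwd_1d = transpose([fake_quant(row, max(map(abs, row))) for row in wt])

# 16x16 scaling: one shared amax for the whole tile, regardless of orientation.
tile_amax = max(abs(v) for row in w for v in row)
fwd_2d = [fake_quant(row, tile_amax) for row in w]
bwd_2d = transpose([fake_quant(row, tile_amax) for row in wt])

mismatch_1d = count_mismatches(fwd_1d, bwd_1d)
mismatch_2d = count_mismatches(fwd_2d, bwd_2d)
print("elements that differ with 1x16 scaling:", mismatch_1d)
print("elements that differ with 16x16 scaling:", mismatch_2d)
```

With a single scale per tile, quantization becomes an element-wise function that does not depend on which axis the GEMM reduces over, which is why both passes see identical weights.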

Stochastic rounding on gradients: Round-to-nearest-even creates a systematic bias when used on gradient tensors. Stochastic rounding instead rounds probabilistically, weighting the choice by how close the value is to each of the two nearest representable numbers, which eliminates that bias. The researchers explicitly state in their paper that stochastic rounding is harmful when applied to forward-pass tensors, so its use is limited to gradients only.
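A small experiment makes the bias argument concrete. This sketch rounds a single value that sits between two FP4 grid points many times and compares the averages; the helper names, the sample value, and the seed are illustrative choices, and the deterministic rounding shown does not implement the ties-to-even rule.

```python
import random

FP4_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]


def round_nearest(x: float) -> float:
    """Deterministic rounding to the closest grid point (for 0 <= x <= 6)."""
    return min(FP4_GRID, key=lambda g: abs(x - g))


def round_stochastic(x: float, rng: random.Random) -> float:
    """Round up with probability proportional to the distance from the lower neighbour."""
    lo = max(g for g in FP4_GRID if g <= x)
    hi = min(g for g in FP4_GRID if g >= x)
    if lo == hi:
        return lo
    p_up = (x - lo) / (hi - lo)
    return hi if rng.random() < p_up else lo


rng = random.Random(0)
x = 2.3  # lies between the grid points 2 and 3
n = 50_000
mean_sr = sum(round_stochastic(x, rng) for _ in range(n)) / n
mean_rn = round_nearest(x)
print("round-to-nearest result:", mean_rn)
print("stochastic rounding mean:", mean_sr)
```

Deterministic rounding always returns the same neighbour, so its error never averages out, whereas the stochastic version is correct in expectation at the cost of extra variance per sample.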

Results on the 12B Hybrid Mamba-Transformer

The 12B model is built on the Nemotron-Nano-12B-v2-Base architecture — 62 blocks (6 Self-Attention, 28 FFN, 28 Mamba-2), hidden dimension 5120, FFN dimension 20480 — trained with a Warmup-Stable-Decay schedule (constant learning rate for the first 80% of training, then decay over the remaining 20%), batch size 736, sequence length 8192. The FP8 reference baseline follows the DeepSeek-V3 approach: E4M3 elements, 128×128 weight blocks, 1×128 activation and gradient blocks, with the first block and last two blocks kept in BF16.
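The Warmup-Stable-Decay shape can be sketched as a simple function of the training step. The article does not specify the warmup length, the decay curve, or the final learning rate, so the linear decay to zero, the optional linear warmup, and all parameter names below are assumptions; the 4.5e-4 peak comes from the explainer's training table.

```python
def wsd_lr(step: int, total_steps: int, peak_lr: float = 4.5e-4,
           warmup_steps: int = 0, stable_frac: float = 0.8,
           final_lr: float = 0.0) -> float:
    """Learning rate at a given step under an assumed Warmup-Stable-Decay schedule."""
    if step < warmup_steps:
        # Assumed linear warmup up to the peak.
        return peak_lr * (step + 1) / warmup_steps
    stable_end = round(stable_frac * total_steps)
    if step < stable_end:
        # Stable phase: constant learning rate.
        return peak_lr
    # Decay phase: assumed linear interpolation from peak_lr to final_lr.
    progress = (step - stable_end) / max(1, total_steps - stable_end)
    return peak_lr + (final_lr - peak_lr) * progress


total = 100
for step in (0, 79, 80, 90, 99):
    print(step, wsd_lr(step, total))
```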

NVFP4 validation loss remains within 1% of the FP8 baseline during the stable phase and drifts to just over 1.5% during the decay phase. Downstream accuracy (NVFP4 vs FP8) is similar across most benchmarks: MMLU 76.57% vs 77.36%, GSM8K CoT 92.27% vs 89.08%, MATH 81.48% vs 83.32%, AGIEval English CoT 70.31% vs 67.01%. Coding benchmarks show the widest gap (HumanEval+ 57.43% vs 59.93%, MBPP+ 55.91% vs 59.11%), which the team partly attributes to noise in the final-checkpoint evaluation. The team also describes a precision-switching strategy: switching the forward pass from NVFP4 to BF16 starting at 8.2T tokens (the final 18% of the schedule) cut the relative loss error from 1.5% down to 0.5%.

NVFP4 vs MXFP4

On a separate 8B hybrid Mamba-Transformer trained on 1T tokens, NVFP4 achieved a relative loss error of roughly 1.5% compared to BF16, while MXFP4 hovered around 2.5%. To match the NVFP4 loss at 1T tokens, MXFP4 needed 1.36T tokens — a 36% increase in token consumption. The team traces this gap to NVFP4’s smaller block size and E4M3 scales, which retain more of the FP4 dynamic range than MXFP4’s power-of-two UE8M0 scales (which can sacrifice up to one entire binade and the ±4, ±6 values in the worst case).
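The worst case mentioned above can be worked through with a few block amax values. The sketch assumes the power-of-two scale is rounded up to the next power of two so that the block amax never saturates, and it treats the NVFP4 scale as an ideal fractional value, ignoring E4M3 rounding; both function names are hypothetical.

```python
import math

FP4_MAX = 6.0


def power_of_two_scale(amax: float) -> float:
    """UE8M0-style scale: smallest power of two that keeps amax / scale <= 6 (assumed rule)."""
    return 2.0 ** math.ceil(math.log2(amax / FP4_MAX))


def fractional_scale(amax: float) -> float:
    """Idealized NVFP4-style scale that maps amax exactly onto the FP4 maximum."""
    return amax / FP4_MAX


for amax in (6.0, 5.0, 3.1, 6.1):
    mx = amax / power_of_two_scale(amax)
    nv = amax / fractional_scale(amax)
    print(f"amax={amax}: scaled amax with power-of-two scale={mx:.2f}, with fractional scale={nv:.2f}")
```

Under these assumptions a block amax of 6.1 ends up scaled to roughly 3.05, so nothing in that block can reach the ±4 or ±6 codes, while a fractional scale still places the amax at the top of the FP4 range.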

Marktechpost’s Visual Explainer

● NVIDIA Technical Report

A 4-bit floating-point training recipe validated on a 12-billion-parameter hybrid Mamba-Transformer trained on 10 trillion tokens — the longest publicly documented 4-bit pretraining run to date.

62.58%

MMLU-Pro (vs 62.62 FP8)

SOURCE — arXiv:2509.25149v2 · NVIDIA · Available in Transformer Engine

01 — Context

Why move from FP8 to 4-bit pretraining

FP8 training has become the standard for frontier LLM pretraining. Shifting to FP4 offers a 2× to 3× increase in arithmetic throughput over FP8 and roughly half the operand memory footprint — but narrower formats squeeze dynamic range and magnify quantization error over long token horizons.

The key challenge is maintaining training stability and downstream accuracy across multi-trillion-token runs. This report lays out a recipe that achieves both, built on NVFP4, a 4-bit microscaling format with native hardware support on NVIDIA Blackwell Tensor Cores.

Throughput relative to BF16	GB200	GB300
BF16 baseline	1×	1×
FP8	2×	2×
FP4 (NVFP4)	4×	6×

02 — The Format

What NVFP4 actually stores

Each element is encoded as E2M1 — 1 sign bit, 2 exponent bits, 1 mantissa bit — representing one of: ±0, ±0.5, ±1, ±1.5, ±2, ±3, ±4, ±6.

Every group of 16 consecutive elements shares a single E4M3 scale factor. A second FP32 per-tensor scale sits above that to keep the E4M3 block scales within range. The outcome: at least 6.25% of values in each block (the per-block amax) are represented at near-FP8 precision.

Figure: a block of 16 FP4 elements sharing one E4M3 (FP8) block scale, with the block amax mapped to the FP4 maximum.

03 — Format Comparison

How NVFP4 differs from MXFP4

NVFP4 introduces three design changes to the microscaling approach that meaningfully boost representation fidelity at 4 bits.

Property	MXFP4	NVFP4
Block size	32	16
Element	E2M1	E2M1
Block scale	UE8M0	E4M3
Scale type	Power of 2	Fractional
Tensor scale	None	FP32

MXFP4’s power-of-two UE8M0 scaling can sacrifice up to one full binade of dynamic range and, in the worst case, leaves the ±4 and ±6 FP4 values unused after scale rounding. NVFP4’s E4M3 scaling maps the block amax much closer to the FP4 maximum.

04 — Scope

What runs in NVFP4 — and what doesn’t

Only the three GEMMs inside linear layers — Fprop, Dgrad, and Wgrad — actually execute in NVFP4. Everything else remains in higher precision.

In NVFP4

Linear Fprop GEMM
Linear Dgrad GEMM
Linear Wgrad GEMM

In BF16 / FP32

Embeddings · Output head
Normalization layers
Non-linearities
Attention (softmax, QK, score-V)
Master weights · Optimizer states
TP reductions (BF16)

The “FP4 training” label refers to the most compute-intensive GEMMs, not the entire forward and backward computation graph.

05 — The Recipe

Four techniques required for convergence

Quantizing every linear-layer GEMM to NVFP4 with default settings — 1×16 block scaling everywhere, round-to-nearest-even, no transforms — diverges early in training. The recipe stabilizes it with four components. Ablation studies show each is necessary.

Selective High Precision

Keep ~16% of linear layers in BF16, concentrated in the final blocks. For the 12B model: first 2 + final 8 of 62 blocks.

Random Hadamard Transforms (RHT)

16×16 Hadamard matrix + random ±1 sign vector, applied only to Wgrad inputs. d=4 performed worse; d=128 was similar to d=16.

2D Block Scaling for Weights

16×16 block scales for weights so forward and backward passes see the same quantized representation. Activations and gradients keep 1×16 scaling.

Stochastic Rounding on Gradients

Probabilistic rounding eliminates systematic gradient bias. Detrimental on forward-pass tensors — restrict to gradients only.

06 — Training Setup

The 12B hybrid Mamba-Transformer

The model uses the Nemotron-Nano-12B-v2-Base architecture: 62 blocks consisting of 6 Self-Attention, 28 FFN, and 28 Mamba-2 blocks.

Architecture

Blocks 62
Hidden dim 5120
FFN dim 20480
Q heads 40
KV heads 8
Mamba state dim 128

Training

Tokens 10T
Batch size 736
Sequence length 8192
Schedule WSD 80/20
Peak LR 4.5e-4
Weight decay 0.1

FP8 reference baseline follows DeepSeek-V3: E4M3 elements, 128×128 weight blocks, 1×128 activation/gradient blocks, with the first block and last two in BF16.

07 — Downstream Results

NVFP4 matches FP8 across most benchmarks

Validation loss stays within 1% of FP8 during the stable phase, widening to slightly above 1.5% during decay. Downstream accuracies tracked below.

Benchmark	FP8	NVFP4
MMLU-Pro 5-shot	62.62	62.58
MMLU	77.36	76.57
AGIEval English CoT	67.01	70.31
GSM8K CoT	89.08	92.27
MATH	83.32	81.48
MGSM	81.87	85.53
HumanEval+	59.93	57.43
MBPP+	59.11	55.91
ARC Challenge	91.81	91.81

Coding shows the widest gap. Switching the forward pass to BF16 at 8.2T tokens (last 18%) reduces relative loss error from 1.5% to 0.5%.

08 — Format Efficiency

NVFP4 vs MXFP4 on the same 8B model

On an 8B hybrid Mamba-Transformer trained on the same data, NVFP4 converged to a meaningfully better loss than MXFP4 in the same token budget.

Loss vs BF16 @ 1T tokens

NVFP4 ~1.5% gap
MXFP4 ~2.5% gap

Tokens to match NVFP4 loss

NVFP4 1.00T
MXFP4 1.36T (+36%)

The 36% token overhead translates directly into longer training time. Smaller block size and E4M3 scales preserve more of the FP4 dynamic range than MXFP4’s UE8M0 design.

09 — Practitioner Takeaways

What this unlocks for AI engineers

4-bit pretraining at multi-trillion-token scale is now reproducible with a known recipe, on Blackwell hardware, via Transformer Engine.

✓

Throughput

FP4 GEMMs execute twice as fast as FP8 on GB200 and three times faster on GB300. Memory for operands is reduced by approximately half.

✓

Reproducible method

Selective BF16 layers + 16×16 RHT on Wgrad + 2D weight scaling + stochastic rounding applied to gradients.

→

Unresolved questions

Quantizing all linear layers, applying NVFP4 to attention and communication pathways, scaling laws for FP4 across different parameter counts and training horizons.

⌘

Access

NVFP4 training is available in NVIDIA Transformer Engine. Source: arXiv:2509.25149v2.
