Oh wow, they pre-trained Nemotron 3 Ultra in NVFP4
big update for estimating future model sizes and flops, especially for OpenAI models
FP4 training achieved a 0.4% loss gap compared to BF16.
Users praise NVIDIA's Nemotron-3 Ultra models and NVFP4 pretraining for the detailed tech reports that offer transparency on data work, benchmarks, and real-world fit.
If FP4 training is a thing now is Anthropic cooked?
Chips (compute in PFLOPS, FP4 / FP8):
• Trainium 2: no FP4 / 1.3
• Trainium 3: 2.5 / 2.5
• GB200: 10 / 5
• VR200: 35 / 17.5
Compute of the largest datacenters (in ZFLOPS):
• Rainier: ~1M Trainium 2 = no FP4, 1.3 FP8
• Abilene: ~220k GB200 = 2.2 FP4, 1.1 FP8
• Fairwater: ~300k GB200 = 3.0 FP4, 1.5 FP8
With FP8, Anthropic has as much compute as OpenAI, but if OpenAI is training in FP4, Anthropic is already 1.7-2.3x behind in FLOPS, meaning OpenAI can RL longer or train a bigger model.
Trainium 3, which is already being deployed, is ~4x behind GB200 and ~14x behind VR200 on FP4 compute per chip. To keep up, Amazon's largest cluster would need ~4x more chips per cluster in 2026, but 14x in 2027.
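The ratios above follow directly from the quoted per-chip and chip-count figures. A minimal sketch that reproduces the arithmetic, assuming those figures are right (the dictionary layout and the `cluster_zflops` helper are made up for illustration):

```python
# Per-chip throughput in PFLOPS, copied from the figures quoted above.
CHIP_PFLOPS = {
    "Trainium 2": {"fp4": None, "fp8": 1.3},  # None = no FP4 support
    "Trainium 3": {"fp4": 2.5, "fp8": 2.5},
    "GB200": {"fp4": 10.0, "fp8": 5.0},
    "VR200": {"fp4": 35.0, "fp8": 17.5},
}


def cluster_zflops(chip, n_chips, precision):
    """Aggregate cluster throughput in ZFLOPS (1 ZFLOPS = 1e6 PFLOPS)."""
    per_chip = CHIP_PFLOPS[chip][precision]
    return None if per_chip is None else per_chip * n_chips / 1e6


# Chip counts are the approximate ones from the post.
rainier_fp8 = cluster_zflops("Trainium 2", 1_000_000, "fp8")
abilene_fp4 = cluster_zflops("GB200", 220_000, "fp4")
fairwater_fp4 = cluster_zflops("GB200", 300_000, "fp4")

print(f"Abilene FP4 / Rainier FP8:   {abilene_fp4 / rainier_fp8:.2f}x")
print(f"Fairwater FP4 / Rainier FP8: {fairwater_fp4 / rainier_fp8:.2f}x")

# Per-chip FP4 gap relative to Trainium 3.
t3_fp4 = CHIP_PFLOPS["Trainium 3"]["fp4"]
gb200_per_t3 = CHIP_PFLOPS["GB200"]["fp4"] / t3_fp4
vr200_per_t3 = CHIP_PFLOPS["VR200"]["fp4"] / t3_fp4
print(f"GB200 vs Trainium 3: {gb200_per_t3:.0f}x, VR200 vs Trainium 3: {vr200_per_t3:.0f}x")
```

This only compares peak numbers; utilization, interconnect, and memory are ignored, same as in the post.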
It seems like OpenAI will completely compute mog Anthropic in 2027.
So in 2027 Anthropic models will either fall behind, or more likely, they will start using Google TPUs or even VR200 clusters for training.
The model frontier probably looks something like this (image), if you ignore deployment (in-)efficiencies.
So OpenAI could actually train a 20-30T model this year and deploy it without too much hassle on a single 20TB GB300 NVL72 or later this year on VR200.
Maybe a chonky GPT-6?
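A rough check of the "fits on a single 20TB rack" claim, as a sketch. It assumes NVFP4 stores 4-bit values plus one 8-bit scale per 16-element block (about 4.5 bits per parameter), and it counts weights only: KV cache, activations, and any layers kept in higher precision are ignored. The helper name is hypothetical.

```python
def weight_terabytes(n_params, bits_per_param):
    """Raw weight storage in TB (1 TB = 1e12 bytes)."""
    return n_params * bits_per_param / 8 / 1e12


# Assumption: 4-bit values plus one 8-bit scale shared by each 16-element block.
NVFP4_BITS_PER_PARAM = 4 + 8 / 16
RACK_MEMORY_TB = 20  # the "20TB GB300 NVL72" figure quoted above

for n_params in (20e12, 30e12):
    tb = weight_terabytes(n_params, NVFP4_BITS_PER_PARAM)
    print(
        f"{n_params / 1e12:.0f}T params: {tb:.2f} TB of weights, "
        f"{RACK_MEMORY_TB - tb:.2f} TB left for everything else"
    )
```

Under these assumptions a 30T model leaves only a few TB of headroom, so the upper end of the range is tight.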
MTP=6, latentMoE, Mamba. It's very interesting in that they *chose* to make a low-sparsity, relatively compact, computationally intense model, and then make it fast despite that. If it is not a yapper on top of that (and it seems that way), that's doubly impressive.
@scaling01 They've already trained a bigger model than Mythos
Post-training uses expert RLVR + MOPD merging of the experts.
SFT data is a mix of:
- long context data
- GPTOSS-120B generated synthetic data for STEM/Instruction Following. Interestingly, during SFT they mask out thinking tokens and do not train on traces
- Multilingual Safety
- An expanded search dataset of 21.7K trajectories with MiniMax 2.5 and GLM5.1 as teachers
- Terminal use data where DSV3.2 is used within Terminus 2 Agent
- Synthetic Conversational using GLM 5 and Nemotron GenRM as the preference model
- SWE trajectories using several harnesses (OpenHands, SWE-Agent, MiniSWEAgent, Opencode). The second image below has some info on the heuristics they used to filter for trajectories
- DSV3.2 + Nemotron Cascade/Math for Math data
- GPTOSS120B competitive coding problems
- CUDA Kernel data using DeepSeek R1(?) and GPT OSS 120B for both PyTorch and natural language to kernel problems. They select based on benchmarking and also train for repair, with the model even having access to the profiler. (Super hyped for this)
The second divergence is weirder. They have 2 hypotheses:
1) Expert imbalance. They find that with Ultra, MaxVio (a measure of expert imbalance) increased significantly in the first layer, where the first layer ended up receiving 12x more than the mean
2) Residual stream norms differed by 4 orders of magnitude across depth.
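For readers who have not seen MaxVio before, here is a minimal sketch. It assumes the usual definition from the load-balancing literature, (max load - mean load) / mean load over the experts of one layer; the thread does not spell out the formula, so treat it as an assumption, and the token counts are invented.

```python
import numpy as np


def max_vio(tokens_per_expert):
    """Maximal violation of balance: (max load - mean load) / mean load.

    Assumed definition; 0.0 means perfectly balanced routing.
    """
    load = np.asarray(tokens_per_expert, dtype=np.float64)
    mean = load.mean()
    return (load.max() - mean) / mean


balanced = [100] * 23
# Hypothetical layer: one hot expert gets 12x the mean load, the rest share what is left.
skewed = [1200] + [50] * 22

print(max_vio(balanced))
print(max_vio(skewed))
```

With this definition, a single expert receiving 12x the mean load corresponds to MaxVio = 11.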
Regardless, the solution was to cut to 20T tokens (this answers my earlier question), roll back, and start learning rate annealing
Anyways huge credit to Nvidia for being transparent!

The base model basically crushes all other base models while admittedly the other base models are a bit old at this point
i think Nemotron 3 Super already did NVFP4, but it was a much smaller model
this is like a scaling proof
@teortaxesTex GPT-6?
or did they also scale up GPT-5.6, like they did with GPT-5.4 and GPT-5.5?
Now back to the fun part about stability during NVFP4 training.
When faced with divergence around 40% of training, they roll back to using FP32 gradient.
Previous NVIDIA work showed that reverting to high precision can kind of recover some loss divergence, provided that it's done early in training, since later stages have too low a learning rate to recover.

The gains are limited on non-agentic reasoning. The hypothesis is that a lot of the teacher's distribution/gain is never sampled by the student, so the student can never learn it.


@scaling01 Good thing they have TPUs coming with it
Lastly, a section on inference. Arch-wise, it's already pretty inference friendly, with latentMoE allowing for more routed experts, hybrid Mamba2, MTP, etc.
First, Ultra has lower sparsity, which means more FLOPs at prefill, where it trails Qwen with ~2x more activated params. But it dominates at decode, mainly due to the SSM layers
Optimal MTP length is 6 for almost 3x faster throughput. For the Mamba state, they snapshot at each step to facilitate rollbacks on rejection.
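Why snapshots are needed: a recurrent state, unlike a KV cache, cannot simply be truncated when a drafted token is rejected. A toy sketch of the snapshot-and-rollback idea follows. `ToySSM` and `speculative_step` are hypothetical stand-ins, not NVIDIA's implementation, and "verification" here is a plain token comparison.

```python
from dataclasses import dataclass


@dataclass
class ToySSM:
    """Stand-in for a recurrent (Mamba-like) layer with a scalar state."""
    state: float = 0.0

    def step(self, token: int) -> None:
        self.state = 0.5 * self.state + token


def speculative_step(model, draft_tokens, target_tokens):
    """Run all drafted tokens, then roll the state back to the last accepted one.

    snapshots[i] holds the state after consuming i draft tokens.
    Returns the number of accepted draft tokens.
    """
    snapshots = [model.state]
    for tok in draft_tokens:
        model.step(tok)
        snapshots.append(model.state)

    accepted = 0
    for drafted, verified in zip(draft_tokens, target_tokens):
        if drafted != verified:
            break
        accepted += 1

    model.state = snapshots[accepted]  # rollback on rejection
    return accepted


model = ToySSM()
# Six drafted tokens (MTP length 6); the verifier disagrees from the 4th onward.
n_accepted = speculative_step(model, [1, 2, 3, 4, 5, 6], [1, 2, 3, 9, 9, 9])

reference = ToySSM()
for tok in [1, 2, 3]:
    reference.step(tok)
print(n_accepted, model.state == reference.state)
```

After the rollback, the caller would feed the verifier's corrected token and continue from the restored state.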
There is also a bunch of miscellaneous quantization stuff: 1) Scaling algorithm selection 2) Quantization speed/performance 3) Int8 stochastic-rounded Mamba cache (important since Mamba has a larger cache at lower context lengths)
Finally, they show minimal degradation post-quant

Some MOPD open questions they list out:
- Instead of just doing loss only on the selected token, train on the entire distribution. They say this did not help much as it might amplify noise
- How to ensure student trajectories lie within the teacher's support for effective scoring
- Efficiency across different domains where rollouts have very different times
They now have a long section on RL infra. Training is 1-step async off-policy and aggressively uses 5 MTP spec decoding to speed up rollout
Many of the failures they encounter are due to sandbox/tool calling/generation. Many of their optimizations target reliability/efficiency here.
The table below is a nice summary, but again I don't see a point in summarizing since the section has so many cool details. Go check it out! This is just a quick list:
- Slurm launch overheads
- Make EP AllToAll topology aware so it stays within NVLink domains and does not cross InfiniBand between racks. (20%! throughput improvement)
- In an NVL72 rack, GPUs are affiliated with nearby NUMA nodes for faster memory access. Without special care, workers on a GPU could be bound to a remote node rather than the local, faster memory. (10% throughput)
- Async checkpointing
- Compilation times taking ages, so cache and share compiled artifacts. (I have a tweet about this hahaha)
- Multinode vLLM instability
- All nodes were reading and crashing the shared storage.
MTP is trained with KL against the backbone head's full logits as a speculator. A single head/weights is used recursively
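A minimal sketch of a distillation-style KL between the backbone head and an MTP head, in NumPy. The direction KL(backbone || MTP), the averaging over positions, and the function names are all assumptions for illustration; the thread only says the KL is taken against the backbone's full logits.

```python
import numpy as np


def log_softmax(logits):
    """Numerically stable log-softmax over the last (vocab) axis."""
    z = logits - logits.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))


def kl_to_backbone(backbone_logits, mtp_logits):
    """Mean over positions of KL(backbone || MTP head), using full logits.

    Assumption: the backbone distribution is the fixed target.
    """
    log_p = log_softmax(backbone_logits)
    log_q = log_softmax(mtp_logits)
    return float((np.exp(log_p) * (log_p - log_q)).sum(axis=-1).mean())


rng = np.random.default_rng(0)
backbone = rng.normal(size=(4, 32))  # 4 positions, toy vocab of 32
mtp = backbone + 0.5 * rng.normal(size=(4, 32))

kl_same = kl_to_backbone(backbone, backbone)
kl_diff = kl_to_backbone(backbone, mtp)
print(kl_same, kl_diff)
```

Because a single head is reused recursively, the same function would be applied at each draft depth against the matching backbone target.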
Second plot shows efficiency/accuracy across different reasoning efforts. Verbosity is relative to Qwen 3.5 397B's average token usage
For pretraining data, NVIDIA releases new subsets for code/legal.
1) Code includes 173B new tokens with a cutoff of Sept 2025
2) Synthetic QA data in both MCQ and free-form answer formats. They validate that this increases downstream evals, specifically MMLU Pro and GPQA
3) Synthetic fact-seeking data. This helps SimpleQA scores
4) Legal documents. Full list is in the paper; the main thing is that they use the largest Qwen3 model to perform rephrasing. This naturally boosts legal bench scores
They have a quality based data schedule. The first 15T (75%) focus on diversity while the rest of training samples from higher quality tokens.
They also have a long-context mid-training phase of 33B tokens towards the end, going up to 1M context length.
Interestingly, they say that only putting code and math SFT style data in the 4K Sequence length iterations worked best to maintain short benchmark scores.
(20T tokens seems slightly low for this size ?)
The main highlight is that NVIDIA did NVFP4 pretraining. Much of the recipe follows previous Nemotron work:
- Hadamard transforms applied to the weight gradient computation to reduce the impact of outliers.
- Some layers kept at higher precision (table from Nemotron 3 Super). Specifically, final layers tend to require more dynamic range and mantissa than FP4 provides.
- Stochastic rounding rather than deterministic rounding to prevent bias, specifically in the gradients.
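Two of these ingredients are easy to see in a toy NumPy sketch: a Hadamard rotation spreads a single outlier across a block, and stochastic rounding to a 4-bit grid is unbiased on average. This is an illustration under simplifying assumptions, not the NVFP4 recipe itself: the grid is the plain FP4 E2M1 value set, block scaling is omitted, and the helper names are made up.

```python
import numpy as np

# Representable FP4 (E2M1) magnitudes, mirrored to get a sorted symmetric grid.
_POS = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])
FP4_GRID = np.concatenate([-_POS[:0:-1], _POS])


def hadamard(n):
    """Orthonormal Hadamard matrix via Sylvester construction (n a power of two)."""
    h = np.array([[1.0]])
    while h.shape[0] < n:
        h = np.block([[h, h], [h, -h]])
    return h / np.sqrt(n)


def stochastic_round(x, rng):
    """Round each value to a neighbouring grid point, rounding up with
    probability proportional to its distance from the lower neighbour."""
    x = np.clip(x, FP4_GRID[0], FP4_GRID[-1])
    hi_idx = np.clip(np.searchsorted(FP4_GRID, x, side="left"), 1, len(FP4_GRID) - 1)
    lo, hi = FP4_GRID[hi_idx - 1], FP4_GRID[hi_idx]
    p_up = (x - lo) / (hi - lo)
    return np.where(rng.random(x.shape) < p_up, hi, lo)


rng = np.random.default_rng(0)

# One outlier of 8.0 in a block of 16 becomes sixteen entries of magnitude 2.0.
block = np.zeros(16)
block[3] = 8.0
h = hadamard(16)
rotated = h @ block
print(np.abs(block).max(), np.abs(rotated).max())

# 2.3 is not representable; samples land on 2.0 or 3.0 but average out near 2.3.
samples = stochastic_round(np.full(100_000, 2.3), rng)
print(samples.mean())
```

Deterministic round-to-nearest would map every 2.3 to 2.0, a systematic bias that accumulates over many gradient updates.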
To validate the recipe, they train smaller models up to 16T and show a mere ~0.4% relative train loss gap with the bf16 baseline.
See more: https://arxiv.org/abs/2509.25149
(we discuss other stability issues in later sections)
@scaling01 No idea how they'll label it. 5.6 is likely just a post-trained 5.5
1) The first divergence happens because of the MTP loss, which is scaled to 0.1. The gradient from the MTP heads is lost in BF16. From the figure we see that MTP loss starts diverging before overall training loss. This was resolved by using the FP32 gradient recipe
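One way a small, down-weighted gradient can vanish in BF16 is during accumulation: BF16 has only 8 bits of significand, so adding a tiny term to a much larger one rounds it away. The sketch below emulates BF16 rounding in NumPy to show the effect. Whether this is the exact mechanism in the report is an assumption, and the gradient magnitudes are invented.

```python
import numpy as np


def to_bf16(x):
    """Round float32 values to the nearest bfloat16 (ties to even).

    Result is returned as float32. NaN/inf are not handled in this sketch.
    """
    bits = np.atleast_1d(np.asarray(x, dtype=np.float32)).view(np.uint32)
    rounding_bias = ((bits >> 16) & 1) + np.uint32(0x7FFF)
    return ((bits + rounding_bias) & np.uint32(0xFFFF0000)).view(np.float32)


main_grad = np.float32(1.0)          # hypothetical gradient from the main loss
mtp_grad = np.float32(0.1 * 0.02)    # hypothetical MTP gradient after the 0.1 scaling

sum_fp32 = main_grad + mtp_grad
sum_bf16 = to_bf16(main_grad + mtp_grad)[0]

print(sum_fp32 > main_grad)   # FP32 keeps the MTP contribution
print(sum_bf16 == main_grad)  # BF16 rounds it away
```

The small gradient is representable in BF16 on its own; it is the sum with a larger value that drops it, which is why keeping gradients in FP32 helps.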
Next, a massive section on MOPD. Algorithm wise, they actually have 2 MOPD iterations. Loss is just reverse KL without any environment reward loss term.
Their version of MOPD has 3 async parts (rollout, teacher and learners). So they compute the log probs from all 3 with the KL being between the teacher and the learner.
They do a similar importance ratio masking where the ratio that is masked is between the rollout and teacher while the teacher-learner is clipped.
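A very rough sketch of the per-token signal described above, assuming log probs of the sampled tokens are available from all three parts. It takes the description literally (mask on the rollout/teacher ratio, clip the teacher-learner log ratio). The thresholds, the function name, and the exact form of masking and clipping are assumptions, not values from the report.

```python
import numpy as np


def mopd_token_signal(logp_rollout, logp_teacher, logp_learner,
                      mask_range=(0.5, 2.0), clip=5.0):
    """Monte Carlo reverse-KL estimate on sampled tokens.

    per-token term: log pi_learner(token) - log pi_teacher(token), clipped.
    Tokens whose teacher/rollout probability ratio falls outside mask_range
    are dropped. All thresholds here are hypothetical.
    """
    logp_rollout = np.asarray(logp_rollout, dtype=np.float64)
    logp_teacher = np.asarray(logp_teacher, dtype=np.float64)
    logp_learner = np.asarray(logp_learner, dtype=np.float64)

    ratio = np.exp(logp_teacher - logp_rollout)
    keep = (ratio >= mask_range[0]) & (ratio <= mask_range[1])
    per_token = np.clip(logp_learner - logp_teacher, -clip, clip)
    return float((per_token * keep).sum() / max(keep.sum(), 1)), keep


value, keep = mopd_token_signal(
    logp_rollout=[-1.0, -1.0, -1.0],
    logp_teacher=[-1.0, -1.2, -4.0],
    logp_learner=[-1.5, -1.0, -0.5],
)
print(value, keep)
```

In a real trainer the learner log probs would carry gradients; here everything is a plain array, so this only shows which tokens contribute and with what weight.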
They have a ton of info on the data/process used for each teacher. It's super detailed, so I think it's pointless for me to distill it here; read that section for this part
Not much info on the RL data beyond the standard domains. The algorithm seems to be GRPO with IS masking so pretty standard
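For reference, a minimal sketch of the two "standard" pieces mentioned: GRPO's group-normalized advantages and a token-level importance-sampling (IS) mask. The ratio bounds and function names are assumptions; the thread does not give the actual values.

```python
import numpy as np


def grpo_advantages(rewards, eps=1e-6):
    """Group-relative advantages: normalize rewards within one prompt's rollouts."""
    r = np.asarray(rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + eps)


def is_mask(logp_learner, logp_rollout, low=0.5, high=2.0):
    """Keep tokens whose learner/rollout probability ratio stays within [low, high].

    The bounds are hypothetical.
    """
    ratio = np.exp(np.asarray(logp_learner) - np.asarray(logp_rollout))
    return (ratio >= low) & (ratio <= high)


# Four rollouts for one prompt with verifiable 0/1 rewards.
adv = grpo_advantages([1.0, 0.0, 0.0, 1.0])
mask = is_mask([-1.0, -1.1, -3.0], [-1.0, -1.0, -1.0])
print(adv, mask)
```

Masked tokens contribute nothing to the policy gradient, which limits the damage from stale, off-policy rollouts in an async setup.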
They say they have a "Gaussian" sampling strategy for the curriculum. I'm not entirely sure what this means so leaving this section from Nano here
For benchmarking, their table has 0 bolding or highlighting which makes it annoying to compare so I got GPT-Image to annotate.
Kimi K2.6 is still quite insane, but Nemotron 3 Ultra fares very well on non-agentic tasks

I think this table is interesting for seeing in which domains the student outperforms the teacher.
The merged model outperforms the specialized RLVR model on agentic and instruction following benches. On TBench, the student significantly outperforms the teacher which is interesting.
For reference, the second table is a similar figure from Mimo-v2-flash. Interesting to compare relative performance in ~similar domains