NVIDIA's Nemotron 3 Ultra report details pre-training a 550B model in NVFP4 precision over 20 trillion tokens

FP4 training achieved a 0.4% loss gap compared to BF16.

Original post

Lisan al Gaib @scaling01

Oh wow, they pre-trained Nemotron 3 Ultra in NVFP4

big update for estimating future model sizes and flops, especially for OpenAI models

7:21 AM · Jun 4, 2026 · 69.2K Views

Sentiment

Users praise NVIDIA's Nemotron-3 Ultra models and NVFP4 pretraining for the detailed tech reports that offer transparency on data work, benchmarks, and real-world fit.

Pos 100.0% · Neg 0.0% (15 comments with sentiment)

Related links

arXiv:2509.25149 (arxiv.org)

Posts from X

Views 52.6K · Bookmarks 106 · Likes 285 · Retweets 16 · Replies 25

Lisan al Gaib@scaling01

If FP4 training is a thing now, is Anthropic cooked?

Chips (compute in PFLOPS, FP4 / FP8):
• Trainium 2: no FP4 / 1.3
• Trainium 3: 2.5 / 2.5
• GB200: 10 / 5
• VR200: 35 / 17.5

Compute of the largest datacenters (in ZFLOPS):
• Rainier: ~1M Trainium 2 = no FP4, 1.3 FP8
• Abilene: ~220k GB200 = 2.2 FP4, 1.1 FP8
• Fairwater: ~300k GB200 = 3.0 FP4, 1.5 FP8

With FP8, Anthropic has about as much compute as OpenAI, but if OpenAI is training in FP4, Anthropic is already 1.7-2.3x behind in FLOPS, meaning OpenAI can RL longer or train a bigger model.

Trainium 3, which is already being deployed, is ~4x behind GB200 and ~14x behind VR200 on FP4 compute per chip. Amazon's largest cluster would need ~4x more chips per cluster in 2026, but 14x in 2027.
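The cluster arithmetic above can be reproduced in a few lines. The sketch below is only an illustration, not the author's method: it uses the per-chip and chip-count figures quoted in the post, and the names `CHIP_PFLOPS`, `CLUSTERS`, `cluster_zflops`, and `best_zflops` are hypothetical. It multiplies chip count by per-chip PFLOPS and converts to ZFLOPS (1 ZFLOPS = 10^6 PFLOPS), ignoring utilization and interconnect effects.

```python
# Per-chip throughput in PFLOPS as quoted in the post: (FP4, FP8).
# None marks a format the post says the chip does not support.
CHIP_PFLOPS = {
    "Trainium 2": (None, 1.3),
    "Trainium 3": (2.5, 2.5),
    "GB200": (10.0, 5.0),
    "VR200": (35.0, 17.5),
}

# Approximate chip counts quoted in the post (assumed exact here).
CLUSTERS = {
    "Rainier": ("Trainium 2", 1_000_000),
    "Abilene": ("GB200", 220_000),
    "Fairwater": ("GB200", 300_000),
}

PFLOPS_PER_ZFLOPS = 1_000_000  # 1 ZFLOPS = 10^21 FLOPS = 10^6 PFLOPS


def cluster_zflops(name):
    """Return (fp4, fp8) aggregate ZFLOPS for a cluster; fp4 may be None."""
    chip, count = CLUSTERS[name]
    fp4, fp8 = CHIP_PFLOPS[chip]

    def to_zflops(pflops):
        return None if pflops is None else pflops * count / PFLOPS_PER_ZFLOPS

    return to_zflops(fp4), to_zflops(fp8)


def best_zflops(name):
    """Highest aggregate throughput over the formats the chip supports."""
    return max(v for v in cluster_zflops(name) if v is not None)


if __name__ == "__main__":
    rainier = best_zflops("Rainier")
    for name in ("Abilene", "Fairwater"):
        ratio = best_zflops(name) / rainier
        print(f"{name} (FP4) vs Rainier (FP8): {ratio:.1f}x")
```

With these inputs the Abilene and Fairwater ratios come out at roughly 1.7 and 2.3, which is the range the post quotes.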

It seems like OpenAI will completely compute mog Anthropic in 2027.

So in 2027 Anthropic models will either fall behind, or more likely, they will start using Google TPUs or even VR200 clusters for training.

The model frontier probably looks something like this (image), if you ignore deployment (in-)efficiencies.

So OpenAI could actually train a 20-30T model this year and deploy it without too much hassle on a single 20TB GB300 NVL72 or later this year on VR200.

Maybe a chonky GPT-6?
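The claim that a 20-30T model could sit on a single 20TB system follows from a weights-only estimate at 4 bits per parameter. The sketch below is a rough check under stated assumptions, not a deployment calculation: `weight_terabytes` and `fits` are hypothetical helpers, the 20 TB capacity is the figure from the post, and block scale factors, KV cache, and activations are all ignored (`bits_per_param` can be raised to approximate scale-factor overhead).

```python
def weight_terabytes(num_params, bits_per_param=4.0):
    """Weights-only footprint in decimal terabytes (1 TB = 1e12 bytes)."""
    return num_params * bits_per_param / 8 / 1e12


def fits(num_params, capacity_tb=20.0, bits_per_param=4.0):
    """True if the weights alone fit in the given memory capacity."""
    return weight_terabytes(num_params, bits_per_param) <= capacity_tb


if __name__ == "__main__":
    for params in (20e12, 30e12):
        tb = weight_terabytes(params)
        print(f"{params / 1e12:.0f}T params -> {tb:.1f} TB of 4-bit weights, "
              f"fits in 20 TB: {fits(params)}")
```

On this estimate a 30T-parameter model needs about 15 TB for weights, which leaves limited headroom for everything else on a 20 TB system.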

Teortaxes▶️ (DeepSeek 推特🐋铁粉 2023 – ∞)@teortaxesTex

MTP=6, latentMoE, Mamba. It's very interesting in that they *chose* to make a low-sparsity, relatively compact, computationally intense model, and then make it fast despite that. If it is not a yapper on top of that (and it seems that way), that's doubly impressive.

wh@nrehiew_

Lastly, a section on inference. Arch-wise, it's already pretty inference-friendly, with latentmoe allowing for more routed experts, hybrid mamba2, mtp etc.

First, Ultra has lower sparsity, which means more flops at prefill, where it trails Qwen with ~2x more activated params. But it dominates at decode, mainly due to the SSM layers.

Optimal MTP length is 6 for almost 3x faster throughput. For the Mamba state, they snapshot at each step to facilitate rollbacks on rejection.
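The snapshot-and-rollback idea can be illustrated with a toy recurrent state. This is a minimal sketch under loud assumptions, not NVIDIA's implementation: `speculative_step`, `step_fn`, and `accept_fn` are hypothetical names, the state is a plain Python value standing in for a Mamba state, and the verifier is a stub. The state is saved after every drafted token so that a rejection can restore the state at the last accepted position without recomputing from the start.

```python
def speculative_step(state, step_fn, draft_tokens, accept_fn):
    """Advance a recurrent state through drafted tokens, rolling back on rejection.

    state        -- toy stand-in for a recurrent (SSM) state
    step_fn      -- step_fn(state, token) -> new state; must not mutate its input
    draft_tokens -- tokens proposed by the draft heads (e.g. 6 for MTP=6)
    accept_fn    -- accept_fn(index, token) -> bool; hypothetical verifier verdict

    Returns (state after the last accepted token, number of accepted tokens).
    """
    # snapshots[i] holds the state after consuming the first i drafted tokens.
    snapshots = [state]
    for token in draft_tokens:
        snapshots.append(step_fn(snapshots[-1], token))

    # Accept the longest prefix the verifier agrees with.
    accepted = 0
    for index, token in enumerate(draft_tokens):
        if not accept_fn(index, token):
            break
        accepted += 1

    # Rollback is just an index into the saved snapshots.
    return snapshots[accepted], accepted


if __name__ == "__main__":
    # Toy recurrence and a verifier that rejects from the fourth draft onward.
    new_state, n = speculative_step(
        state=0,
        step_fn=lambda s, t: 2 * s + t,
        draft_tokens=[1, 2, 3, 4, 5, 6],
        accept_fn=lambda i, t: i < 3,
    )
    print(new_state, n)
```

The trade-off in this sketch is memory for time: keeping one snapshot per drafted token costs extra state storage but makes a rejection a constant-time lookup.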

Teortaxes▶️ (DeepSeek 推特🐋铁粉 2023 – ∞)@teortaxesTex

@scaling01 They've already trained a bigger model than Mythos