/AI1h ago

NVIDIA's Nemotron 3 Ultra report details pre-training a 550B model in NVFP4 precision over 20 trillion tokens

FP4 training achieved a 0.4% loss gap compared to BF16.

Original post

Lisan al Gaib@scaling01

Oh wow, they pre-trained Nemotron 3 Ultra in NVFP4

big update for estimating future model sizes and flops, especially for OpenAI models

7:21 AM · Jun 4, 2026 · 10.2K Views

Sentiment

Many users praised NVIDIA's Nemotron-3 tech reports for their transparency on NVFP4 pretraining techniques and for openly releasing useful pretraining data mixtures and SFT details.

Positive: 100.0%, negative: 0.0% (10 comments with sentiment).

Posts from X

Lisan al Gaib@scaling01

i think Nemotron 3 Super already did NVFP4, but it was a much smaller model

this is like a scaling proof

wh@nrehiew_

Post-training uses expert RLVR + MOPD merging of the experts.

SFT data is a mix of:
- Long context data
- GPTOSS-120B generated synthetic data for STEM/Instruction Following. Interestingly, during SFT they mask out thinking tokens and do not train on traces
- Multilingual Safety
- An expanded search dataset of 21.7K trajectories with MiniMax 2.5 and GLM5.1 as teachers
- Terminal use data where DSV3.2 is used within Terminus 2 Agent
- Synthetic Conversational using GLM 5 and Nemotron GenRM as the preference model
- SWE trajectories using several harnesses (OpenHands, SWE-Agent, MiniSWEAgent, Opencode). The second image below has some info on the heuristics they used to filter for trajectories
- DSV3.2 + Nemotron Cascade/Math for Math data
- GPTOSS120B competitive coding problems
- CUDA Kernel data using DeepSeek R1(?) and GPT OSS 120B for both pytorch and natural language to kernel problems. They select based on benchmarking and also train for repair, with the model even having access to the profiler. (Super hyped for this)

wh@nrehiew_

The second divergence is weirder. They have 2 hypotheses:
1) Expert imbalance. They find that with Ultra, MaxVio (a measure of expert imbalance) increased significantly in the first layer, which ended up receiving 12x more than the mean
2) Residual stream norms differed by 4 orders of magnitude across depth.

Regardless, the solution was to cut to 20T tokens (this answers my earlier question), roll back, and start learning rate annealing

Anyways huge credit to Nvidia for being transparent!
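For readers unfamiliar with MaxVio, here is a minimal sketch of how the metric is commonly defined in the MoE load-balancing literature: the maximum expert load minus the mean load, divided by the mean load. Whether the report computes it exactly this way (per layer, per batch, over tokens) is an assumption; the function name `max_vio` and the toy routing lists are hypothetical.

```python
from collections import Counter


def max_vio(expert_ids, num_experts):
    """Return (max load - mean load) / mean load for one MoE layer.

    expert_ids: one expert index per routed token (hypothetical toy input).
    A value of 0.0 means perfectly balanced routing; larger means the
    busiest expert is further above the average load.
    """
    if not expert_ids:
        raise ValueError("need at least one routed token")
    counts = Counter(expert_ids)
    # Experts that received no tokens still count toward the mean.
    loads = [counts.get(e, 0) for e in range(num_experts)]
    mean_load = sum(loads) / num_experts
    return (max(loads) - mean_load) / mean_load


balanced = max_vio([0, 1, 2, 3], num_experts=4)
skewed = max_vio([0, 0, 0, 1], num_experts=4)
```

In this toy case the balanced routing gives 0.0 and the skewed routing gives 2.0, since the busiest expert holds 3 tokens against a mean of 1.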

wh@nrehiew_

For benchmarking, their table has 0 bolding or highlighting, which makes it annoying to compare, so I got GPT-Image to annotate.

Kimi K2.6 is still quite insane, but Nemotron 3 Ultra fares very well on non-agentic tasks

wh@nrehiew_

They now have a long section on RL infra. Training is 1-step async off-policy and aggressively uses 5 MTP spec decoding to speed up rollouts
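A toy sketch of what "1 step async off policy" usually means: rollouts for the next update are generated while the current update runs, so each training batch comes from weights at most one version old. This is one reading of the phrase, not the report's actual scheduler; `Rollouts`, `generate`, and `train_one_step_async` are hypothetical names, and generation and training run sequentially here for simplicity.

```python
from dataclasses import dataclass


@dataclass
class Rollouts:
    policy_version: int  # weights version that generated these samples


def generate(policy_version):
    """Stand-in for the rollout workers (hypothetical)."""
    return Rollouts(policy_version)


def train_one_step_async(num_steps):
    """Return the staleness (in policy versions) of each training batch."""
    policy_version = 0
    pending = generate(policy_version)  # warm-up batch from initial weights
    lags = []
    for _ in range(num_steps):
        batch = pending
        # In a real system this generation overlaps with the update below,
        # so it has to use the weights from before that update.
        pending = generate(policy_version)
        lags.append(policy_version - batch.policy_version)
        policy_version += 1  # stand-in for one optimizer step on `batch`
    return lags


lags = train_one_step_async(4)
```

After the warm-up step, every batch is exactly one policy version stale, which is the off-policy gap the training objective has to tolerate.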

Many of the failures they encounter are due to sandbox/toolcalling/generation. Many of their optimizations target reliability/efficiency here.

The table below is a nice summary, but again I don't see a point in summarizing since the section has so many cool details. Go check it out! This is just a quick list:
- Slurm launch overheads
- Make EP AllToAll topology-aware so it stays within NVLink domains and does not go over InfiniBand between racks. (20%! throughput improvement)
- In an NVL72 rack, GPUs are affiliated with nearby NUMA nodes for faster memory access. Without special care, workers on a GPU could be bound to a remote node rather than the faster local memory. (10% throughput)
- Async checkpointing
- Compilation times taking ages, so cache and share compiled artifacts. (I have a tweet about this hahaha)
- Multinode vLLM instability
- All nodes were reading from and crashing the shared storage.
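The NUMA item above can be illustrated with a minimal Linux-only sketch: look up the NUMA node a GPU's PCI device reports in sysfs, then pin the worker process to that node's CPUs. This is not the report's implementation. The helper names and the example PCI address are hypothetical, and the sketch only sets CPU affinity; it does not set an explicit memory policy.

```python
import os
from pathlib import Path


def parse_cpulist(text):
    """Parse a kernel cpulist string such as '0-3,8-11' into a set of ints."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            lo, hi = part.split("-")
            cpus.update(range(int(lo), int(hi) + 1))
        else:
            cpus.add(int(part))
    return cpus


def local_cpus_for_pci_device(pci_bus_id, sysfs_root="/sys"):
    """Return the CPUs on the NUMA node of a PCI device, or None if unknown."""
    node_file = Path(sysfs_root, "bus/pci/devices", pci_bus_id, "numa_node")
    node = int(node_file.read_text())
    if node < 0:  # the kernel reports -1 when no affinity is known
        return None
    cpulist = Path(sysfs_root, "devices/system/node", f"node{node}", "cpulist")
    return parse_cpulist(cpulist.read_text())


def bind_current_process(pci_bus_id):
    """Pin this process to the CPUs local to the given device, if known."""
    cpus = local_cpus_for_pci_device(pci_bus_id)
    if cpus:
        os.sched_setaffinity(0, cpus)  # pid 0 means the calling process


parsed = parse_cpulist("0-3,8-11\n")
```

A worker would call something like `bind_current_process("0000:3b:00.0")` (a made-up address) before allocating host buffers, so that its memory tends to land on the local node.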
