Oh wow, they pre-trained Nemotron 3 Ultra in NVFP4
big update for estimating future model sizes and flops, especially for OpenAI models
FP4 training achieved a 0.4% loss gap compared to BF16.
i think Nemotron 3 Super already did NVFP4, but it was a much smaller model
this is like a scaling proof
Post-training uses expert RLVR + MOPD merging of the experts.
SFT data is a mix of:
- Long context data
- GPT-OSS-120B generated synthetic data for STEM/Instruction Following. Interestingly, during SFT they mask out thinking tokens and do not train on traces
- Multilingual Safety
- An expanded search dataset of 21.7K trajectories with MiniMax 2.5 and GLM5.1 as teachers
- Terminal use data where DSV3.2 is used within Terminus 2 Agent
- Synthetic Conversational using GLM 5 and Nemotron GenRM as the preference model
- SWE trajectories using several harnesses (OpenHands, SWE-Agent, MiniSWEAgent, Opencode). The second image below has some info on the heuristics they used to filter for trajectories
- DSV3.2 + Nemotron Cascade/Math for Math data
- GPT-OSS-120B competitive coding problems
- CUDA Kernel data using DeepSeek R1(?) and GPT-OSS-120B for both PyTorch and natural language to kernel problems. They select based on benchmarking and also train for repair, with the model even having access to the profiler. (Super hyped for this)
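To make the "mask out thinking tokens" point concrete, here is a minimal sketch of a loss mask that skips everything inside a reasoning span. The `<think>` / `</think>` tag names, the string-level tokens, and the choice to mask the tags themselves are all assumptions for illustration, not details from the report. A real SFT pipeline would normally also mask the prompt tokens, which this sketch leaves out.

```python
from typing import List


def thinking_loss_mask(
    tokens: List[str],
    open_tag: str = "<think>",    # hypothetical tag name
    close_tag: str = "</think>",  # hypothetical tag name
) -> List[bool]:
    """Return one bool per token: True means the token contributes to the SFT loss."""
    mask = []
    inside = False
    for tok in tokens:
        if tok == open_tag:
            inside = True
            mask.append(False)  # assumption: the opening tag is masked too
        elif tok == close_tag:
            inside = False
            mask.append(False)  # assumption: the closing tag is masked too
        else:
            mask.append(not inside)
    return mask


tokens = ["Q", ":", "2+2", "<think>", "add", "them", "</think>", "4"]
mask = thinking_loss_mask(tokens)
```

In a training loop, positions where the mask is `False` would have their labels replaced by the ignore value of the loss function, so only the non-thinking tokens produce gradients.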
The second divergence is weirder. They have 2 hypotheses: 1) Expert imbalance. They find that with Ultra, MaxVio (a measure of expert imbalance) increased significantly in the first layer, which ended up receiving 12x more than the mean. 2) Residual stream norms differed by 4 orders of magnitude across depth.
Regardless, the solution was to cut to 20T tokens (this answers my earlier question), roll back, and start learning rate annealing
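For anyone unfamiliar with MaxVio, a rough sketch follows. It uses the definition that is common in the MoE load-balancing literature, (max expert load - mean load) / mean load, computed per layer. Whether the report uses exactly this form is an assumption, and the token counts are made up for illustration.

```python
from typing import Sequence


def max_violation(tokens_per_expert: Sequence[float]) -> float:
    """MaxVio for one MoE layer: how far the busiest expert sits above the mean load."""
    mean_load = sum(tokens_per_expert) / len(tokens_per_expert)
    return (max(tokens_per_expert) - mean_load) / mean_load


balanced = [100, 100, 100, 100]  # illustrative token counts, not from the report
skewed = [500, 100, 100, 100]    # one expert takes most of the traffic

balanced_vio = max_violation(balanced)  # 0.0: perfectly even routing
skewed_vio = max_violation(skewed)      # 1.5: busiest expert is 1.5 mean-loads above the mean
```

A value of 0 means perfectly even routing, and larger values mean the busiest expert is handling proportionally more tokens than average.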
Anyways huge credit to Nvidia for being transparent!
For benchmarking, their table has 0 bolding or highlighting which makes it annoying to compare so I got GPT-Image to annotate.
Kimi K2.6 is quite insane still, but Nemotron 3 Ultra fares very well on non-agentic tasks
They now have a long section on RL infra. Training is 1 step async off policy and aggressively uses 5 MTP spec decoding to speed up rollout
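A toy sketch of what "1 step async off policy" usually means: rollouts for a trainer step are generated while the previous step is still training, so they come from weights that are at most one update old. This is the common reading of the term and an assumption about their setup. The function and field names are hypothetical.

```python
from typing import Dict, List


def one_step_off_policy_schedule(num_steps: int) -> List[Dict[str, int]]:
    """For each trainer step, record which policy version produced its rollouts."""
    schedule = []
    for step in range(num_steps):
        # The trainer at `step` turns policy version `step` into `step + 1`.
        # Its rollouts were generated during the previous trainer step, so they
        # come from the version before the current one (clamped at 0 for the start).
        rollout_version = max(step - 1, 0)
        schedule.append(
            {
                "step": step,
                "rollout_version": rollout_version,
                "staleness": step - rollout_version,
            }
        )
    return schedule


schedule = one_step_off_policy_schedule(4)
```

The point of the overlap is that generation and training run concurrently, at the cost of rollouts lagging the trainer by up to one policy update.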
Many of the failures they encounter are due to sandbox/toolcalling/generation. Many of their optimizations target reliability/efficiency here.
The table below is a nice summary, but again I don't see a point in summarizing since the section has so many cool details. Go check it out! This is just a quick list:
- Slurm launch overheads
- Make EP AllToAll topology-aware so it stays within NVLink domains and does not go over InfiniBand between racks (20% throughput improvement!)
- In an NVL72 rack, GPUs are affiliated with nearby NUMA nodes for faster memory access. Without special care, workers on a GPU could be bound to a remote node rather than the faster local memory (10% throughput)
- Async checkpointing
- Compilation times taking ages, so cache and share compiled artifacts (I have a tweet about this hahaha)
- Multinode vLLM instability
- All nodes were reading from the shared storage and crashing it
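A rough sketch of the topology-aware EP idea from the list above: pick expert-parallel groups so every group sits inside one NVLink domain, so the AllToAll never has to cross racks. It assumes ranks are numbered contiguously within a rack and that the EP size divides the domain size. The function names, the 144-GPU world size, and the EP size of 8 are hypothetical. Only the 72-GPU domain comes from the NVL72 mention.

```python
from typing import List


def build_ep_groups(world_size: int, ep_size: int, nvlink_domain_size: int) -> List[List[int]]:
    """Contiguous EP groups that never straddle an NVLink domain boundary."""
    if world_size % ep_size != 0:
        raise ValueError("ep_size must divide world_size")
    if nvlink_domain_size % ep_size != 0:
        raise ValueError("ep_size must divide nvlink_domain_size")
    return [list(range(start, start + ep_size)) for start in range(0, world_size, ep_size)]


def spans_one_domain(group: List[int], nvlink_domain_size: int) -> bool:
    """True if every rank in the group lives in the same NVLink domain."""
    return len({rank // nvlink_domain_size for rank in group}) == 1


WORLD, EP, DOMAIN = 144, 8, 72  # two hypothetical NVL72 racks

local_groups = build_ep_groups(WORLD, EP, DOMAIN)

# Counter-example: strided groups (rank i, i + 18, i + 36, ...) mix both racks,
# so their AllToAll traffic would have to leave the NVLink domain.
stride = WORLD // EP
strided_groups = [list(range(i, WORLD, stride)) for i in range(stride)]

all_local = all(spans_one_domain(g, DOMAIN) for g in local_groups)
strided_local = all(spans_one_domain(g, DOMAIN) for g in strided_groups)
```

In a real job these rank lists would be handed to the communication library when creating the EP process groups. The sketch only shows the grouping logic.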
Many users praised NVIDIA's Nemotron-3 tech reports for their transparency on NVFP4 pretraining techniques and for openly releasing useful pretraining data mixtures and SFT details.