in 2027 you'll be able to pretrain a ~mythos-class model (100T tokens, 200B active params) on like 10 kyber racks in a month or two
Users agree with the prediction that 10 Kyber Racks can pretrain 200B-param models by 2027 because they see it as very much possible.
No Digg Deeper questions have been answered for this story yet.
Most Activity
a nvl576 kyber rack needs like a megawatt all-in. gigawatts are coming online. building very very capable models is going to be pretty broadly accessible and difficult to monitor.
in 2027 you'll be able to pretrain a ~mythos-class model (100T tokens, 200B active params) on like 10 kyber racks in a month or two

@willccbb Won't it be tough to procure data / RL envs?

Grok did the math for us!!
Rough FLOPs estimate for pretraining (Chinchilla-style scaling rule of thumb): ~6 × parameters × tokens = 6 × 200 × 10⁹ × 100 × 10¹² = 1.2 × 10²⁶ FLOPs One Kyber NVL576 rack → 5 × 10¹⁸ FLOPS (FP8 training). 10 racks → 50 × 10¹⁸ = 5 × 10¹⁹ FLOPS aggregate. Theoretical time at 100% utilization: 1.2 × 10²⁶ ÷ 5 × 10¹⁹ ≈ 2.4 million seconds ≈ 28 days (about 1 month). Real-world training utilization (Model FLOPs Utilization / MFU) is typically 40–60% due to communication, data loading, checkpoints, etc. This pushes the timeline to ~1–2 months.

@BenjaminEmHe “procure” is the wrong angle
efficiency of expert human data for capability improvement increases significantly as the cost of code + general reasoning goes to zero, you spend more FLOPs on juicing signal from more abstract sources like freeform interviews rather than “labels”

@PaIbraNiang1 you can do the experts in nvfp4 at least

@willccbb @inductionheads What model can you get if you are gpu poor and only have maybe two or three nvl576 kyber racks?

@willccbb >difficult to monitor doubt this

>requires auditing all code running in every medium sized datacenter Wouldn't it be enough to legally mandate that 3rd party compute provider monitor and disclose to some regulatory body just I/O patterns (mem, disk, network)? (I'm making assumption that such patterns are different enough for large scale LLM training runs vs LLM or VLM inference, not sure about this, you know much more on this topic)
(also, I expect a system where training model exceeding some capabilities/parameters will be illegal unless disclosed to and approved by some regulatory body. I think this would be enough to deter even without monitoring)

@willccbb Bet

@williawa @inductionheads glm-5.2 / opus / gpt-5.4

@willccbb I'm pretty sure that I am not the "you" in this statement.

@willccbb Recalculating with nvfp4 would reduce the training time at 15-23 days (roughly 3-4 weeks) correct me if I’m wrong.

@willccbb Students would build Beowulf clusters for that.

@willccbb I thought this said 2037

@willccbb We just beed the good pre train data that labs spent millions in creating

@willccbb but Rubin is being scaled down ~1/2

@willccbb the most important question is what's the cost ? 👀

@willccbb don't tell the USG this

@umi33563 requires auditing all code running in every medium sized datacenter that might be used for video model inference or recsys serving or whatever. admin would have to nationalize the whole supply chain or tank the economy

@TheWhiteTower16 @willccbb Try getting those 10 Kyber racks.