A historical bug in LLM scaling laws led the AI industry to train oversized, undertrained models

Story Overview

The original 2020 scaling laws from Kaplan et al. quietly steered labs toward models that were too big for the data they received, because every size variant trained on the same fixed token budget and a learning-rate schedule that tapered to zero. That setup masked how much extra data larger models actually needed, so the field spent roughly two years over-allocating parameters at the expense of tokens.
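One way to picture the schedule issue described above is a minimal sketch, assuming a cosine decay to zero over one fixed horizon (the story only says the schedule "tapered to zero"; the cosine shape and the function name `cosine_lr` are illustrative assumptions). If the horizon is fixed, a loss read off partway through a run is measured while the learning rate is still high, so it is not comparable to a run whose schedule was designed to finish at that token count.

```python
import math


def cosine_lr(step: int, total_steps: int, peak_lr: float = 1.0) -> float:
    """Cosine decay from peak_lr to zero over total_steps (hypothetical helper)."""
    # Clamp progress to [0, 1] so steps past the horizon stay at zero.
    progress = min(max(step / total_steps, 0.0), 1.0)
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * progress))


# A run stopped at 25% of a fixed 1000-step horizon still has a high learning rate,
# while a run whose schedule was sized to 250 steps has fully decayed by then.
lr_fixed_horizon = cosine_lr(250, total_steps=1000)
lr_matched_horizon = cosine_lr(250, total_steps=250)
print(f"fixed horizon:   {lr_fixed_horizon:.3f}")
print(f"matched horizon: {lr_matched_horizon:.3f}")
```

Under this assumption, the short-budget data points are pessimistic, which is one plausible way a fixed schedule could understate the benefit of training on more tokens.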

Original post

Sander Dieleman @ ICML 2026🇰🇷@sedielem

Here's a cool piece of LLM lore: the original scaling laws were wrong due to a bug, which probably led to a lot of wasted compute on oversized undertrained models 🫣 (and that was before we even started properly accounting for inference cost!)

Diogo Almeida@CompleteSkeptic

http://x.com/i/article/2073276453131780096

9:38 AM · Jul 4, 2026 · 52.1K Views

Industry Shift

Chinchilla reset the recipe

DeepMind’s 2022 work varied parameters and tokens jointly, finding that roughly twenty training tokens per parameter is close to optimal for a given compute budget. The 70B-parameter Chinchilla model beat the 280B-parameter Gopher on the same training compute budget, flipping the earlier guidance on its head.
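The twenty-tokens-per-parameter rule of thumb can be turned into a rough sizing calculation. The sketch below assumes the common approximation that training compute is about 6 FLOPs per parameter per token (C ≈ 6·N·D); that approximation and the helper names are assumptions for illustration and are not stated in the story.

```python
def training_flops(n_params: float, n_tokens: float) -> float:
    """Approximate training compute, assuming C ~= 6 * N * D."""
    return 6.0 * n_params * n_tokens


def compute_optimal_allocation(compute_flops: float, tokens_per_param: float = 20.0):
    """Split a FLOP budget into (parameters, tokens) at a fixed tokens-per-parameter ratio.

    With D = r * N and C = 6 * N * D, solving for N gives N = sqrt(C / (6 * r)).
    """
    n_params = (compute_flops / (6.0 * tokens_per_param)) ** 0.5
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens


# Budget of a 70B-parameter model trained at 20 tokens per parameter (1.4T tokens).
budget = training_flops(70e9, 20 * 70e9)
n, d = compute_optimal_allocation(budget)
print(f"budget: {budget:.2e} FLOPs")
print(f"params: {n:.2e}, tokens: {d:.2e}")
```

Because both parameters and tokens grow with the square root of compute under this rule, a 4x larger budget calls for roughly 2x the parameters and 2x the tokens, rather than putting most of the extra compute into model size.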

Open Question

Exact waste remains uncounted

A commenter who was at Google at the time recalls the bug being discussed internally before it was publicly acknowledged, yet no one has tallied the total compute spent on oversized, undertrained runs between 2020 and early 2022.

Sentiment

Commenters welcome the public acknowledgment of a bug in the original LLM scaling laws, which wasted compute on oversized models, because it brings to light discussions that were previously internal at places like Google.

Positive: 100.0%, Negative: 0.0% (1 comment with sentiment)

Posts from X

brownyc@aksharikuu

@sedielem Please don’t tell me this now in the middle of my internship

Michele Catasta@pirroh

@sedielem I was still at Google when we started to talk about that bug, but I didn’t know it was publicly acknowledged until now.

Thanks for sharing 🙂 Good ol’ times…

Ishaan Goyal@IshaanGoyal05

the comment on the original blog post was even more interesting. didn't realize that the nature of a language can affect scaling laws as well (for example, the comment mentioned a model with the same architecture but trained on French that got 100% accuracy on their validation probe after 175M tokens, while the English one took more than 3B tokens and was still lossy). Makes me wonder whether changing the main interaction language of a model can affect its logic as well, or nah? maybe Chinese models will always have an edge in maths over American models, because they see more Chinese data, and Chinese is famous for carrying very high semantic meaning per token for maths. Maybe that data distribution alone helps the model infer enough high-quality signal about maths and logic that it carries over to tasks where Chinese tokens are not used. wdyt?

Suresh@_Suresh2

@sedielem found a similar off-by-one in token counting once, made the smaller model look worse than it was

Adel Bucetta@adelbucetta

@sedielem that's an interesting footnote in history but what i find more fascinating is how the same mistake has been repeated with newer architectures, often due to similar design assumptions being made without proper validation through real-world experiments

The Guy With A Hat@theguywithahat0

@sedielem Interesting. So even GPT 3 was trained on too little data.

Ishaan Goyal@IshaanGoyal05

@sedielem i am a complete novice, so i might be speaking out of my ass

Rupert Davies@HumanTechGuy

@sedielem So the foundational assumption behind the entire scaling race had a bug. And the response was... scale harder. You couldn't write better satire.

露子刈る@rokokaru

@sedielem Not like anyone will actually properly train a transformer, ever. Because "ewwwww anthropomorphism" or whatever people say. Current paradigms fall to pieces when considering sub bit precision enabled by higher cognitive abilities