IMO — Ilya is wrong
- Frontier LLMs are trained on ~200 TB of text
- There's ~200 zettabytes of data out there
- That's about 1 billion times more data
- It doubles every 2 years
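Quick sanity check on that ratio, as a rough sketch. It assumes decimal (SI) units, takes the ~200 TB and ~200 ZB figures at face value, and reads "doubles every 2 years" as smooth exponential growth. The names here (`projected_zb` and friends) are just illustrative.

```python
# Decimal (SI) byte units; an assumption, since the post does not say SI vs binary.
TB = 10**12
ZB = 10**21

training_text_bytes = 200 * TB   # ~200 TB of training text (figure from the post)
world_data_bytes = 200 * ZB      # ~200 ZB of data overall (figure from the post)

# How many times larger the overall data pool is than the training text.
ratio = world_data_bytes // training_text_bytes
print(f"{ratio:,}x more data")


def projected_zb(years, start_zb=200, doubling_years=2):
    """Project total data in ZB, assuming it doubles every `doubling_years`."""
    return start_zb * 2 ** (years / doubling_years)


print(projected_zb(2), "ZB after 2 years")
print(projected_zb(4), "ZB after 4 years")
```

The ratio works out to 10^9, so "about 1 billion times" holds if both estimates are in the right ballpark.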
The problem is the data is private. Can't scrape it.
The problem is not data scarcity, it's data access.
The solution is attribution-based control (article below)
"Unlocking a Million Times More Data For AI"