
Anomaly In Fineweb-10B Dataset Triggers Nanogpt Speedrun Spikes

Original post

the nanogpt contest is based on fineweb-10B, and shard 001, document 106548 contains a giant anomaly that caused the one consistent spike in all my test runs: a single 60k-token document where 20% of the tokens are token n°11976, simply broken English gpt-2 tokenization

9:37 AM · May 15, 2026
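The anomaly described above (one long document dominated by a single token id) is easy to detect with a frequency scan. Below is a minimal sketch of such a scan; the real fineweb-10B shards are binary token files, but here the check is demonstrated on a synthetic token stream, and the thresholds (`min_len`, `max_frac`) are illustrative assumptions, not values from the thread:

```python
import numpy as np

GPT2_EOT = 50256  # GPT-2 end-of-text token, used as the document delimiter

def find_degenerate_docs(tokens, min_len=1000, max_frac=0.05):
    """Split a flat token stream on EOT and flag documents where a
    single token id accounts for more than max_frac of the tokens."""
    flagged = []
    boundaries = np.flatnonzero(tokens == GPT2_EOT)
    start = 0
    for doc_id, end in enumerate(np.append(boundaries, len(tokens))):
        doc = tokens[start:end]
        start = end + 1
        if len(doc) < min_len:
            continue
        ids, counts = np.unique(doc, return_counts=True)
        top = counts.argmax()
        frac = counts[top] / len(doc)
        if frac > max_frac:
            flagged.append((doc_id, len(doc), int(ids[top]), float(frac)))
    return flagged

# synthetic stream: one normal document, then a degenerate 60k-token
# document where ~20% of positions hold a single id (mimicking the
# reported token n°11976)
rng = np.random.default_rng(0)
normal = rng.integers(0, 50000, size=2000)
bad = rng.integers(0, 50000, size=60000)
bad[rng.choice(60000, size=12000, replace=False)] = 11976
stream = np.concatenate([normal, [GPT2_EOT], bad, [GPT2_EOT]])

print(find_degenerate_docs(stream))
```

On a real shard the same function would run over the decoded `uint16` token array; a document flagged this way is a strong candidate for the kind of loss spike described in the post.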


Alexander Doria @Dorialexander

ok i'm starting to suspect many nanogpt speedrun spikes/anomalies (and maybe even minute optimizations) can be traced to this one Marathi blog that somehow evaded the English filter.

4:34 PM · May 15, 2026 · 27K Views

typical run, with the anomalous batch standing out.


4:37 PM · May 15, 2026 · 3.2K Views

so, important clarification: this should not affect the official track (but definitely affects informal research).

5:22 PM · May 15, 2026 · 3K Views