the nanogpt speedrun contest is based on fineweb-10B, and shard 001 document 106548 contains a giant anomaly that caused the one consistent spike in all my test runs: a single 60k-token document where ~20% of the tokens are token n°11976, i.e. GPT-2 tokenization simply breaking down on non-English text
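if you want to reproduce this, here's a minimal sketch that scans a shard for long documents dominated by one token. it assumes the llm.c-style .bin shard format the speedrun uses (256-int32 header with magic 20240520, then uint16 tokens, documents delimited by the GPT-2 EOT token 50256); the filename and thresholds are my assumptions, and the doc index it prints may not line up exactly with how the original dataset counts documents.

```python
import numpy as np
from collections import Counter

EOT = 50256  # GPT-2 <|endoftext|>; llm.c-style shards delimit documents with it

def load_shard(path):
    # header: 256 int32s (magic 20240520, version, num_tokens), then uint16 tokens
    header = np.fromfile(path, dtype=np.int32, count=256)
    assert header[0] == 20240520, "not an llm.c-style token shard?"
    return np.fromfile(path, dtype=np.uint16, offset=256 * 4, count=int(header[2]))

def scan(tokens, min_len=10_000, top_frac=0.15):
    # split at EOT boundaries and flag long docs dominated by a single token
    starts = np.concatenate(([0], np.flatnonzero(tokens == EOT) + 1, [len(tokens)]))
    for i, (lo, hi) in enumerate(zip(starts[:-1], starts[1:])):
        doc = tokens[lo:hi]
        if len(doc) < min_len:
            continue
        tok, n = Counter(doc.tolist()).most_common(1)[0]
        if n / len(doc) >= top_frac:
            print(f"doc {i}: {len(doc)} tokens, token {tok} = {n / len(doc):.0%}")

scan(load_shard("fineweb_train_000001.bin"))  # shard 001; filename is an assumption
```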

ok i'm starting to suspect many nanogpt speedrun spikes/anomalies (and maybe even some minute optimizations) can be traced to this one marathi blog that somehow evaded the English filter.
typical run, with the anomalous batch standing out.
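and if you're curious what that dominant token actually is, tiktoken makes it a one-liner (this just decodes the id quoted above; I'm not claiming what it prints):

```python
import tiktoken

enc = tiktoken.get_encoding("gpt2")
# byte-level BPE falls back to byte fragments on scripts like Devanagari,
# which is how a marathi page can end up as one token repeated tens of thousands of times
print(repr(enc.decode([11976])))
```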

so, important clarification: this should not affect the official track (but it definitely affects informal research runs).