tag yourself, i’m token 33723
They say “Look at your data!”
But when is the last time you looked at the pretraining data?
The tool visualizes document concatenation and Common Crawl metadata.
Users are excited by the tool displaying real pretraining data instances from OLMo 3 32B, praising how open source makes it possible and noting the fun of reading training batches.
@xeophon @tmkadamcz it’s always fun to read training batches!
except when @mechanicaldirk does, during olmo 3 ablations he ended up reading 8M tokens at a time looking for spiky data
@tmkadamcz open source rocks, @soldni will love this

Check it out: http://TokenStew.com
Each view is one 8192-token instance from OLMo 3 32B's pretraining (stage-1) run.
You can truly view the entire pretraining data, no tricks. The site fetches data live from the 24-terabyte corpus using HTTP range requests.
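For anyone curious what "fetches data live using HTTP range requests" could look like, here is a minimal sketch, not the site's actual code. It assumes the corpus is a flat file of little-endian uint32 token IDs with instances stored back to back; the layout, the function names, and any URL passed to `fetch_instance` are hypothetical. Only the 8192-token instance size comes from the post.

```python
import struct
import urllib.request

SEQ_LEN = 8192        # tokens per instance, as stated in the post
BYTES_PER_TOKEN = 4   # ASSUMPTION: token IDs are little-endian uint32


def instance_byte_range(index: int) -> tuple[int, int]:
    """Return the inclusive (first, last) byte offsets of instance `index`."""
    if index < 0:
        raise ValueError("instance index must be non-negative")
    size = SEQ_LEN * BYTES_PER_TOKEN
    first = index * size
    return first, first + size - 1


def range_header(first: int, last: int) -> str:
    """Format an HTTP Range header value; both offsets are inclusive."""
    return f"bytes={first}-{last}"


def decode_tokens(payload: bytes) -> list[int]:
    """Decode raw bytes into token IDs, ignoring any trailing partial token."""
    count = len(payload) // BYTES_PER_TOKEN
    return list(struct.unpack(f"<{count}I", payload[: count * BYTES_PER_TOKEN]))


def fetch_instance(url: str, index: int) -> list[int]:
    """Fetch one instance from a hypothetical flat token file at `url`."""
    first, last = instance_byte_range(index)
    request = urllib.request.Request(
        url, headers={"Range": range_header(first, last)}
    )
    with urllib.request.urlopen(request) as response:
        # 206 Partial Content means the server honored the Range header.
        if response.status != 206:
            raise RuntimeError(f"expected 206, got {response.status}")
        return decode_tokens(response.read())
```

Under these assumptions each view costs one 32 KiB request, which is why a static file host can serve a multi-terabyte corpus without a database in front of it.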

@xeophon what number is the cat tokens at?! i need to be it.