Tokenstew.com launches to let researchers inspect raw 8192-token pretraining sequences from OLMo 3 32B

The tool visualizes document concatenation and Common Crawl metadata.

Original post

Florian Brand@xeophon

tag yourself, i’m token 33723

Tom Adamczewski@tmkadamcz

They say “Look at your data!"

But when is the last time you looked at the pretraining data?

1:09 PM · Jun 19, 2026 · 968 Views

Sentiment

Users are excited by the tool displaying real pretraining data instances from OLMo 3 32B, praising how open source makes it possible and noting the fun of reading training batches.

Positive: 100.0% · Negative: 0.0% · 2 comments with sentiment.

Posts from X

Most Activity

Views: 93 · Likes: 5 · Replies: 1

Luca Soldaini 🎀@soldni

@xeophon @tmkadamcz it’s always fun to read training batches!

except when @mechanicaldirk does, during olmo 3 ablations he ended up reading 8M tokens at a time looking for spiky data

Florian Brand@xeophon

@tmkadamcz open source rocks, @soldni will love this

Tom Adamczewski@tmkadamcz

Check it out: http://TokenStew.com

Each view is one 8192-token instance from OLMo 3 32B's pretraining (stage-1) run.

You can truly view the entire pretraining data, no tricks. The site fetches data live from the 24-terabyte corpus using HTTP range requests.
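As a rough illustration of how a single 8192-token instance could be pulled out of a large token file with one HTTP range request, here is a minimal Python sketch. It is not TokenStew's actual code. The on-disk layout (a flat array of little-endian uint32 token IDs), the constant `BYTES_PER_TOKEN`, and all function names are assumptions made for this example; only the 8192-token instance length and the use of range requests come from the post.

```python
import struct
import urllib.request

SEQ_LEN = 8192        # tokens per instance, as stated in the post
BYTES_PER_TOKEN = 4   # ASSUMPTION: token IDs are stored as little-endian uint32


def instance_byte_range(index, seq_len=SEQ_LEN, bytes_per_token=BYTES_PER_TOKEN):
    """Return the inclusive (first_byte, last_byte) span of instance `index`."""
    size = seq_len * bytes_per_token
    start = index * size
    return start, start + size - 1


def range_header(start, end):
    """Format an HTTP Range header value; both ends are inclusive."""
    return f"bytes={start}-{end}"


def decode_tokens(raw):
    """Unpack raw bytes into a list of little-endian uint32 token IDs."""
    count = len(raw) // BYTES_PER_TOKEN
    return list(struct.unpack(f"<{count}I", raw[: count * BYTES_PER_TOKEN]))


def fetch_instance(url, index):
    """Fetch one instance from a hypothetical token file at `url`."""
    start, end = instance_byte_range(index)
    request = urllib.request.Request(url, headers={"Range": range_header(start, end)})
    with urllib.request.urlopen(request) as response:
        # 206 Partial Content means the server honored the byte range.
        if response.status != 206:
            raise RuntimeError(f"expected 206 Partial Content, got {response.status}")
        return decode_tokens(response.read())
```

Under these assumptions each instance occupies 32,768 bytes, so the client only needs the instance index to compute which bytes to request, and the server never has to send more than that slice of the corpus.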

testtm@test_tm7873

@xeophon what number is the cat tokens at?! i need to be it.