Tech · 6h ago

Rohan Anil, CoreAutoAI co-founder and former Gemini pretraining lead, proposes standardizing the AI training stack starting with tokenizers

Story Overview

Rohan Anil draws on his Gemini pretraining work to argue that open models would train faster if the industry agreed on core components, starting with the tokenizers that turn text into model inputs. He flags the current patchwork of tokenizer choices as lacking clear technical justification and positions early standardization as a practical way to accelerate collective progress.

Original post

rohan anil @_arohan_ · #102 in Tech

It will be exciting times when we start collaborating on standardization around every part of the training tech stack starting from the tokenizer.

It’s not that clear why open models use slightly different tokenizers. That would be a good accelerant.

1:06 AM · Jun 23, 2026 · 5.5K Views

Efficiency Edge

Diverging data raises the stakes for shared tools

Replies in the thread note that efficiency gains from consistent tokenizers grow more important precisely because labs now train on increasingly different data sources.

Open Question

Broader pipeline standards stay undefined

Beyond the tokenizer starting point, no proposed standards, timelines, or participating organizations have been outlined, leaving the scope of any collaboration open.

Sentiment

Commenters welcome standardizing the AI training stack, starting with tokenizers, saying it could ease collaboration and deliver large efficiency gains; one reply predicts at least a 2x improvement.

Pos: 100.0%

Neg: 0.0%

2 comments with sentiment.

Posts from X

Most Activity

Views: 792 · Likes: 6 · Replies: 1

Alexander Doria @Dorialexander

@_arohan_ Training efficiency (even more significant now that labs are not at all using the same data sources).

6h

rohan anil @_arohan_

@Dorialexander Its not that much and particularly given many of them >128k tbh

6h

rohan anil @_arohan_

My personal expectations is there are a few algorithmic moves to give hardware + model codesign moats.

6h

Frosty40 @FrostForger

was workin on a tokenizer concept last night - intentionally mispelling by the llm to reduce tokens on non-value words, and experimenting with rhyme condensation. -ther, -ing, seeing how far you can take compressing the likes, and only using the differences. like caseops but on steroids and for people who dont mind reading mispelled words. whats the pareto there, and do you get more value tokens in by using linguistic drift? I didn't look it up before starting because that ruins all the fun

1h

Alexander Doria @Dorialexander

@_arohan_ still tight compression for multilingual.

6h

immortal @immortaldip

@_arohan_ honestly, i want only English model, drop other tokens and focus only one language with frontier capability.

6h

Ferbin @Ferbin08

@_arohan_ yeah but inference at scale for a month flips that. training is one-time, serving is forever.

6h

erik@try.works @trydotworks

@_arohan_ Standardization will enable an easy 2x improvement across the stack at minimum. Probably 10x in the long run

3h

Matt @Matthewagi

@_arohan_ I'd love to be wrong but this feels like an "if everyone would just" situation

2h

Saurabh @royalmamba_

@_arohan_ Different user groups, different priorities. Wouldn't any non-english lab want to optimize tokens for their own consumers n get cheaper inference compared to others.

3h

KIKI @0xkivaro

@_arohan_ tokenizers feel like an underrated place to collaborate

4h

Frosty40 @FrostForger

@_arohan_ SlipGram is the name mmmk

1h