1d ago

Jan Tempus and coauthors release a paper, 'Tokenisation via Convex Relaxations', that frames tokenization as an optimization problem in a 100-million-dimensional space, solved via convex relaxations

The authors report that the method consistently improves language models over BPE.



Original post

#486 @JM_ALEXIAOP

Jan Tempus @JAN55028368

In our new paper, we reinterpret tokenisation as a problem in high-dimensional geometry (100M dims to be precise!), which we can solve efficiently to get a globally near-optimal tokeniser! Our method consistently improves language models over BPE. See 🧵for details.

5:51 AM · May 22, 2026

QUOTE POST

#565 Alex Nichol @UNIXPICKLE

I really enjoyed reading this paper. I paused after the graph framing but before the ILP formulation to derive it myself. Took >an hour, even knowing that it *could* be framed as an LP. Fun puzzle! I won't spoil it.

Jan Tempus @Jan55028368

12:51 PM · May 22, 2026 · 34.5K Views

6:07 AM · May 23, 2026 · 5K Views

