Jan Tempus and coauthors release a paper, 'Tokenisation via Convex Relaxations', that frames tokenization as an optimization problem in a 100-million-dimensional space, solved via convex relaxations

The authors report that the method consistently improves language models over BPE.

Original post

In our new paper, we reinterpret tokenisation as a problem in high-dimensional geometry (100M dims to be precise!), which we can solve efficiently to get a globally near-optimal tokeniser! Our method consistently improves language models over BPE. See 🧵for details.

5:51 AM · May 22, 2026

I really enjoyed reading this paper. I paused after the graph framing but before the ILP formulation to derive it myself. Took >an hour, even knowing that it *could* be framed as an LP. Fun puzzle! I won't spoil it.

Jan Tempus @Jan55028368

12:51 PM · May 22, 2026 · 34.5K Views
6:07 AM · May 23, 2026 · 5K Views