/Tech · 33d ago

Perplexity open-sources a rebuilt Unigram tokenizer that reduces CPU utilization by 5x to 6x

It resolves latency bottlenecks in small rerankers and embedders

Original post

Aravind Srinivas #99

Perplexity @perplexity_ai · #1079 in Tech

We're open-sourcing the Unigram tokenizer we rebuilt to reduce CPU utilization by 5-6x.

Small rerankers and embedders run in single-digit milliseconds on GPU, making CPU tokenization a meaningful share of total latency.

http://github.com/perplexityai/pplx-garden

8:55 AM · May 27, 2026 · 96.9K Views

Sentiment

Many users praised Perplexity for open-sourcing an optimized Unigram Tokenizer that cuts CPU utilization 5-6x, calling the reduction a huge win for production inference and the open source community.

Positive: 100.0%

Negative: 0.0%

17 comments with sentiment.

Posts from X

Most Activity

Views: 35.2K · Bookmarks: 47 · Likes: 201 · Retweets: 11 · Replies: 28

Aravind Srinivas @AravSrinivas

Every millisecond matters. We’re open-sourcing the tokenizer we built and deployed in production; it’s far more efficient than Hugging Face and SentencePiece.

33d · 35.2K views · 201 likes · 47 bookmarks

Perplexity @perplexity_ai

At production input lengths, the encoder cuts p50 latency by roughly 5× vs. HuggingFace tokenizers, 2× vs. SentencePiece C++, and 1.5× vs. IREE C.

At 514 tokens, it runs in 63 µs with zero heap allocations.

33d1.6K83
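For readers who want to check p50 figures like these on their own hardware, here is a minimal timing sketch. It is not the benchmark Perplexity ran; the function name `p50_latency_us`, the run counts, and the stand-in workload are all assumptions made for illustration.

```python
import statistics
import time


def p50_latency_us(fn, arg, runs=1000, warmup=100):
    """Median (p50) wall-clock latency of fn(arg), in microseconds."""
    # Warm-up calls so caches and lazy initialization do not skew the samples.
    for _ in range(warmup):
        fn(arg)
    samples = []
    for _ in range(runs):
        start = time.perf_counter_ns()
        fn(arg)
        samples.append((time.perf_counter_ns() - start) / 1_000)
    return statistics.median(samples)


# Stand-in workload (assumption): whitespace splitting instead of a real tokenizer.
latency = p50_latency_us(str.split, "small rerankers and embedders", runs=200)
print(f"p50: {latency:.2f} µs")
```

To compare two encoders, call the function once per encoder on the same input and divide the two medians.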

Perplexity @perplexity_ai

The work targets XLM-RoBERTa’s 250K-token Unigram vocabulary, commonly used for ranking and retrieval.

The encoder produces the same tokens as the reference implementation, but avoids rebuilding strings and chasing hash maps while deciding how text should be split.

33d1.3K101
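As background on what a Unigram encoder computes, the sketch below runs the textbook Viterbi search for the highest-scoring segmentation. It is not Perplexity's implementation: the vocabulary and scores are a made-up toy (not XLM-RoBERTa's), and it deliberately uses string slices and dictionary lookups, the kind of string rebuilding and hash-map work the post says the rebuilt encoder avoids.

```python
import math

# Hypothetical toy vocabulary: piece -> log-probability (assumed values).
VOCAB = {
    "▁": -4.0, "▁to": -2.0, "▁token": -3.0, "token": -3.2,
    "ken": -3.5, "izer": -3.0,
    "t": -6.0, "o": -6.0, "k": -6.0, "e": -6.0,
    "n": -6.0, "i": -6.0, "z": -6.0, "r": -6.0,
}
MAX_PIECE_LEN = max(len(piece) for piece in VOCAB)


def unigram_encode(text):
    """Return the segmentation of text with the highest total log-probability."""
    # SentencePiece-style convention: mark word boundaries with "▁".
    s = "▁" + text.replace(" ", "▁")
    n = len(s)
    best = [-math.inf] * (n + 1)  # best[i]: best score for the prefix s[:i]
    back = [0] * (n + 1)          # back[i]: start index of the last piece in that path
    best[0] = 0.0
    for end in range(1, n + 1):
        for start in range(max(0, end - MAX_PIECE_LEN), end):
            score = VOCAB.get(s[start:end])  # slice + hash lookup per candidate
            if score is None or best[start] == -math.inf:
                continue
            if best[start] + score > best[end]:
                best[end] = best[start] + score
                back[end] = start
    if best[n] == -math.inf:
        raise ValueError("text cannot be segmented with this vocabulary")
    # Walk the back-pointers from the end to recover the pieces.
    pieces = []
    pos = n
    while pos > 0:
        pieces.append(s[back[pos]:pos])
        pos = back[pos]
    pieces.reverse()
    return pieces


print(unigram_encode("tokenizer"))  # ['▁token', 'izer']
```

With these toy scores, `▁token` + `izer` totals -6.0 and beats `▁to` + `ken` + `izer` at -8.5, so the two-piece split wins.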

Perplexity @perplexity_ai

Read more about improving Unigram tokenizer CPU performance on our blog:

https://research.perplexity.ai/articles/improving-unigram-tokenizer-cpu-performance

33d1.1K32

Suhail @Suhail

@AravSrinivas Super cool

33d7.4K20

찡긋 @Alignment100

@perplexity_ai Future perplexity Korea B2C "ME"

33d731

audex @audexdev

@perplexity_ai this is the kind of latency work that actually matters.

once the model pass is single-digit ms, all the boring CPU pieces stop being boring.

33d541

Jaydev Gusani @1337JG

@perplexity_ai Rebuilding a tokenizer to slash CPU utilization by 5-6x is an incredible engineering flex. But it’s wild that we optimize single-digit milliseconds under the hood, yet the desktop user interface remains locked in a rigid, hardcoded container surrounded by dead space.

33d431

Null Hype @nullhypeai

@perplexity_ai Good signal for where inference optimization is moving.

Once rerankers/embedders run in single digit ms on GPU, CPU tokenization and heap allocation stop being background noise.

The model is not always the bottleneck. The path into the model starts to matter.

33d411

lifcc @mylifcc

@perplexity_ai 5-6x CPU reduction is a real lever here: once rerankers are single-digit ms on GPU, tokenization stops being plumbing and starts showing up in p95.

33d371

Markets & Mayhem @Mayhem4Markets

@perplexity_ai This is actually cool

33d69

Element Dong @elementdsj

@perplexity_ai zero heap allocations at 63µs is the thing. cpu tokenization becoming the gc pressure point in high-throughput inference is what creeps up on you.

33d63

Anton Abyzov @aabyzov

@perplexity_ai 5-6x CPU win on tokenization is the kind of unsexy infra optimization most teams skip until latency budgets blow up. Glad it is open. Reranker stacks badly need this.

33d58

heath @HeathAtHelix

@perplexity_ai 5-6x just on tokenization is wild. easy to forget the CPU side once the model itself is single-digit ms on GPU.

33d51

rahul @ErRahul337

@AravSrinivas Amazing work @aravsrinivas! 5-6x lower CPU utilization on tokenization is huge for production inference.

Every ms counts when your rerankers/embedders are already blazing fast on GPU. Will definitely try pplx-unigram from pplx-garden 🔥

33d50

Deva @DevaBuilds

@perplexity_ai At 100ms inference, tokenization is rounding error. At 8ms GPU it's 20% of wall time. Batching overhead and HTTP fan-out are probably next in line. The optimization queue for fast retrieval keeps moving up the stack.

33d47
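The arithmetic behind that comment can be sketched as follows. The 2 ms tokenization cost is an assumption, chosen because it is the value that gives a 20% share next to an 8 ms GPU pass; the sketch also assumes tokenization and the model pass run back to back with no overlap.

```python
def tokenization_share(tokenize_ms, model_ms):
    """Fraction of wall time spent tokenizing when the two stages run serially."""
    return tokenize_ms / (tokenize_ms + model_ms)


ASSUMED_TOKENIZE_MS = 2.0  # assumption, not a measured figure

slow_model = tokenization_share(ASSUMED_TOKENIZE_MS, 100.0)  # about 2%
fast_model = tokenization_share(ASSUMED_TOKENIZE_MS, 8.0)    # 20%

print(f"100 ms model: {slow_model:.1%} of wall time")
print(f"8 ms model:   {fast_model:.1%} of wall time")
```

The tokenization cost is the same in both cases; only the size of the model pass around it changes.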

heath @HeathAtHelix

@perplexity_ai love that the actual rewrite is open-sourced and not just the benchmark number

33d41