Tech · 26d ago

Perplexity AI releases pplx-embed-v1-late-0.6b, a 0.6-billion-parameter late-interaction embedding model, on Hugging Face with per-token MaxSim optimization and multilingual support

AI Judge changed title after evaluation, original title: "Perplexity AI releases pplx-embed-v1-late-0.6b, a 0.6B-parameter token-level embedding model optimized for late-interaction retrieval with MaxSim scoring"

Companion MaxSim kernel reports a 3–5x speedup over naive PyTorch, with Metal and CUDA implementations.


Original post

Julien Chaumond#335

Erik Kaunismäki@ErikKaum

Releasing my first kernel on @huggingface: MaxSim

Late-interaction retrieval (ColBERT / PyLate) bottlenecks on materializing the full similarity matrix. This kernel avoids it by using tiled scoring with simdgroup_matrix (Metal) and WMMA.

Result is 3–5× speedup compared to naive PyTorch.

Try it out 👇

4:38 AM · May 18, 2026 · 37.6K Views
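To make the bottleneck in the post concrete, here is a minimal stdlib-only sketch of MaxSim scoring: for each query token, take the maximum dot product over all document tokens, then sum those maxima. The names `maxsim_naive`, `maxsim_tiled`, and the `tile` parameter are hypothetical and chosen for illustration. This is not the released kernel, which according to the post relies on simdgroup_matrix (Metal) and WMMA. The sketch only shows why a running maximum over tiles removes the need to hold the full similarity matrix, and it assumes a non-empty document.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def maxsim_naive(query, doc):
    """Build the full |query| x |doc| similarity matrix, then reduce it."""
    sim = [[dot(q, d) for d in doc] for q in query]
    return sum(max(row) for row in sim)


def maxsim_tiled(query, doc, tile=2):
    """Visit the document in tiles and keep one running maximum per query token.

    Only the scores for the current tile exist at any time, so the full
    similarity matrix is never stored.
    """
    best = [float("-inf")] * len(query)
    for start in range(0, len(doc), tile):
        block = doc[start:start + tile]
        for i, q in enumerate(query):
            tile_max = max(dot(q, d) for d in block)
            if tile_max > best[i]:
                best[i] = tile_max
    return sum(best)


# Toy token embeddings (illustrative values, not model output).
query = [[1.0, 0.0], [0.0, 1.0]]
doc = [[0.5, 0.5], [1.0, 0.0], [0.0, 2.0]]

print(maxsim_naive(query, doc))          # 3.0
print(maxsim_tiled(query, doc, tile=2))  # 3.0
```

Both functions return the same score. The tiled version trades the matrix for a vector of running maxima, which is the property a GPU kernel can exploit by keeping each tile in fast on-chip memory.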

Sentiment

Users are excited that Perplexity has open-sourced its multilingual ColBERT embedding model on Hugging Face, with usage notes for Erik Kaunismäki's optimized MaxSim kernel, because the reported 3–5x speedup and the multilingual support fill a practical gap for retrieval tasks.

Pos: 100.0% · Neg: 0.0%

30 comments with sentiment.

Cluster Engagement

Posts from X

Most Activity

Views: 20.8K · Bookmarks: 51 · Likes: 92 · Retweets: 18

Bo@bo_wangbo

okay maybe it's a good time? We have a small colbert model trained at pplx, it is a continued training of pplx-embed-0.6b, so natively multilingual, just made it open and added a section on how to use the MaxSim kernel:

https://huggingface.co/perplexity-ai/pplx-embed-v1-late-0.6b

Erik Kaunismäki@ErikKaum

Releasing my first kernel on @huggingface: MaxSim

Late-interaction retrieval (ColBERT / PyLate) bottlenecks on materializing the full similarity matrix. This kernel avoids it by using tiled scoring with simdgroup_matrix (Metal) and WMMA.

Result is 3–5× speedup compared to naive PyTorch.

Try it out 👇

26d · 20.8K views · 92 likes · 51 bookmarks

REPLIES10

Bo@bo_wangbo

We casually trained a lot of SOTA search models internally, shall we make some small release from time to time 🤣🤣

Antoine Chaffin@antoine_chaffin

@bo_wangbo stealth releasing probably the strongest open multilingual ColBERT (and it's an encoder-based one 🫶)

Very happy to see this, I've played with @perplexity_ai's Qwen based encoder in PyLate, it's really cool to see it works just with `trust_remote_code=True`!

26d11.3K7421

Omar Khattab@lateinteraction

oh! cool to see @perplexity_ai train late interaction (colbert) models

Bo@bo_wangbo

https://huggingface.co/perplexity-ai/pplx-embed-v1-late-0.6b

26d5.4K5417

Antoine Chaffin@antoine_chaffin

@bo_wangbo stealth releasing probably the strongest open multilingual ColBERT (and it's an encoder-based one 🫶)

Very happy to see this, I've played with @perplexity_ai's Qwen based encoder in PyLate, it's really cool to see it works just with `trust_remote_code=True`!

Bo@bo_wangbo

https://huggingface.co/perplexity-ai/pplx-embed-v1-late-0.6b

26d13.6K4611

Raphaël Sourty@raphaelsrty

Multilingual colbert model from @bo_wangbo at perplexity, trained with PyLate, we are finally getting strong open-source multilingual colbert models

Bo@bo_wangbo

https://huggingface.co/perplexity-ai/pplx-embed-v1-late-0.6b

26d2.5K284

Said Taghadouini@staghado

@ErikKaum @huggingface nice! i worked on something similar last year(including the backward pass in triton) never got to publish it. time to revisit ig!

26d11031

Rémi@remilouf

@ErikKaum @huggingface TIL about kernels. Cool initiative

26d2234

PITTI@PITTI_DATA

@antoine_chaffin @bo_wangbo @perplexity_ai I found out about it this weekend, so reimplemented it (along with your colbert zero and late on models, and neobert and openai privacy filter). After oai privacy filter, the bidirectional attention mask on qwen3 gave me another idea…

26d1751

Karolus Sariola@ksariola

@ErikKaum @huggingface Congrats, way to go!!

26d1511

Bo@bo_wangbo

@antoine_chaffin @perplexity_ai my personal faith: every language model should be multilingual by default 🤣

26d1431

Raphaël Sourty@raphaelsrty

@ErikKaum @huggingface Very cool

26d1391

Erik Kaunismäki@ErikKaum

@bo_wangbo awesome to see this 🔥

also: I really need to make the backward implementation of this as well 👀

26d1001

Erik Kaunismäki@ErikKaum

@staghado @huggingface Nice, I highly recommend getting back to it, kernel dev tooling is a lot nicer than it was a year ago 😄

I'm also planning a backward pass for this one, for maxsim backwards is probably even more important than forward 👍

26d711
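For readers wondering what a MaxSim backward pass has to compute, the following is a minimal stdlib-only sketch, not the planned kernel. It assumes the score is the sum over query tokens of the maximum dot product with any document token. Under that assumption each query token's gradient is the document token it matched, and each document token's gradient is the sum of the query tokens that selected it. The function names and the `upstream` parameter are hypothetical.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def maxsim_forward(query, doc):
    """Return the MaxSim score and, per query token, the index of its best match."""
    score, argmax = 0.0, []
    for q in query:
        sims = [dot(q, d) for d in doc]
        j = sims.index(max(sims))  # list.index returns the first maximal position
        argmax.append(j)
        score += sims[j]
    return score, argmax


def maxsim_backward(query, doc, argmax, upstream=1.0):
    """Gradients of the score with respect to query and document embeddings.

    Only the (query token, matched doc token) pairs recorded in `argmax`
    receive gradient; every other pair contributes zero.
    """
    dim = len(doc[0])
    grad_q = [[upstream * x for x in doc[j]] for j in argmax]
    grad_d = [[0.0] * dim for _ in doc]
    for q, j in zip(query, argmax):
        for k in range(dim):
            grad_d[j][k] += upstream * q[k]
    return grad_q, grad_d


# Toy token embeddings (illustrative values, not model output).
query = [[1.0, 0.0], [0.0, 1.0]]
doc = [[0.5, 0.5], [1.0, 0.0], [0.0, 2.0]]

score, argmax = maxsim_forward(query, doc)
grad_q, grad_d = maxsim_backward(query, doc, argmax)
print(score, argmax)  # 3.0 [1, 2]
print(grad_q)         # [[1.0, 0.0], [0.0, 2.0]]
print(grad_d)         # [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
```

Because the backward pass depends only on the saved argmax indices, a forward kernel that records them does not need the similarity matrix again during training.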

Erik Kaunismäki@ErikKaum

@huggingface cc at least @antoine_chaffin & @lateinteraction

26d421

Ha Hoang@HaHoang411

@ErikKaum @huggingface beautiful!

26d391

Said Taghadouini@staghado

@ErikKaum @huggingface will definitely give it a try! iirc i hit an issue matching pytorch's 'first index wins' tiebreak in triton, because i couldn't control the order in which the argmax reductions were executed. It's mathematically correct to take any argmax index but i wanted to match pytorch.

26d261
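The tiebreak issue described above can be illustrated with a small stdlib-only sketch. It is not the Triton code from the post, and it does not claim to show how PyTorch implements argmax. The idea: if each partial result carries its index along with its value, and the merge step prefers the smaller index on equal values, the final index no longer depends on the order in which partial results are combined. All names here are hypothetical, and NaN handling is ignored.

```python
from itertools import permutations


def argmax_first(values):
    """Sequential argmax that keeps the first index among equal maxima."""
    best = 0
    for i, v in enumerate(values):
        if v > values[best]:  # strict comparison keeps the earlier index
            best = i
    return best


def merge(a, b):
    """Combine two (value, index) partials: larger value wins, then smaller index."""
    if b[0] > a[0] or (b[0] == a[0] and b[1] < a[1]):
        return b
    return a


def argmax_tiled(values, tile, tile_order):
    """Reduce each tile locally, then merge the partials in the given order."""
    partials = []
    for t in tile_order:
        start = t * tile
        chunk = values[start:start + tile]
        local = argmax_first(chunk)
        partials.append((chunk[local], start + local))
    result = partials[0]
    for p in partials[1:]:
        result = merge(result, p)
    return result[1]


# Three tiles of two values each; the maximum 0.9 appears at indices 1, 3, and 4.
values = [0.2, 0.9, 0.1, 0.9, 0.9, 0.3]

print(argmax_first(values))  # 1
print({argmax_tiled(values, 2, order) for order in permutations(range(3))})  # {1}
```

Comparing on the value alone would return whichever tied tile happened to be merged first, which is still a valid argmax but does not reproduce a first-index-wins result.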

Julien Blanchon 🇺🇦@JulienBlanchon

@ErikKaum @huggingface How do you create a kernel on hf?

26d211

Amélie Chatelain@AmelieTabatta

@ErikKaum @huggingface oa that sounds sick! Congrats on the release!

26d181

Antoine Chaffin@antoine_chaffin

@PITTI_DATA @bo_wangbo @perplexity_ai Nice!! Did you ever try to use a LI model as base for token classification?

26d56

Julien Blanchon 🇺🇦@JulienBlanchon

@ErikKaum @huggingface I've one tagged as model that I should maybe move to a kernel https://huggingface.co/blanchon/voronoi

26d121