
Perplexity AI releases pplx-embed-v1-late-0.6b, a 0.6-billion-parameter late-interaction embedding model, on Hugging Face with per-token MaxSim optimization and multilingual support


A companion MaxSim kernel delivers a 3–5× speedup over naive PyTorch on Metal and CUDA.

Original post

Releasing my first kernel on @huggingface: MaxSim. Late-interaction retrieval (ColBERT / PyLate) bottlenecks on materializing the full similarity matrix. This kernel avoids it by using tiled scoring with simdgroup_matrix (Metal) and WMMA. Result is 3–5× speedup compared to naive PyTorch. Try it out 👇

4:38 AM · May 18, 2026
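
For readers unfamiliar with MaxSim, here is a rough sketch of the idea the post describes. It is not the released kernel: it is plain Python on the CPU, and the function names (`maxsim_naive`, `maxsim_tiled`) and the `tile` parameter are hypothetical, chosen only for illustration. MaxSim scores a query against a document by taking, for each query token, the best dot product over all document tokens, then summing those maxima. The naive version builds the whole query-by-document similarity matrix first. The tiled version keeps only a running maximum per query token and scores one block of document tokens at a time, so the full matrix never exists in memory.

```python
# Illustrative sketch only; the real kernel does this on the GPU with
# hardware matrix instructions. All names here are hypothetical.

def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def maxsim_naive(query, doc):
    """Materialize the full |query| x |doc| similarity matrix, then reduce."""
    sim = [[dot(q, d) for d in doc] for q in query]
    return sum(max(row) for row in sim)


def maxsim_tiled(query, doc, tile=2):
    """Same score, but only one tile of document tokens is scored at a time."""
    best = [float("-inf")] * len(query)  # running max per query token
    for start in range(0, len(doc), tile):
        block = doc[start:start + tile]
        for i, q in enumerate(query):
            block_max = max(dot(q, d) for d in block)
            if block_max > best[i]:
                best[i] = block_max
    return sum(best)


# Toy per-token embeddings: 2 query tokens, 3 document tokens, dimension 2.
query = [[1.0, 0.0], [0.0, 1.0]]
doc = [[0.5, 0.5], [1.0, 0.0], [0.0, 2.0]]

print(maxsim_naive(query, doc))  # 3.0
print(maxsim_tiled(query, doc))  # 3.0
```

Both functions return the same score because a maximum can be computed block by block. The memory saving comes from never holding more than one tile of similarities at once.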
Reposted by

oh! cool to see @perplexity_ai train late interaction (colbert) models

BoBo@bo_wangbo

okay maybe it's a good time? We have a small colbert model trained at pplx, it is a continue-training of pplx-embed-0.6b, so native multilingual, just made it open and added a section how to use MaxSim kernel: https://huggingface.co/perplexity-ai/pplx-embed-v1-late-0.6b

5:07 PM · May 18, 2026 · 20.2K Views
8:12 PM · May 18, 2026 · 5K Views