
Mixedbread AI uses Sparse Autoencoders to extract sparse latent terms from dense embeddings for BM25 retrieval

Ben Clavié says this reveals hidden indexable structures in embeddings

Mixedbread (@mixedbreadai)

By now, everyone knows that single-vector embedding models are hugely limiting for modern workflows.

But they contain more than you think: you can extract sparse Latent Terms from them.

And it turns out that BM25 is all you need to turn this vocabulary into a strong retriever.

11:48 AM · Jun 2, 2026 · 6.8K Views
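
The post describes a two-step pipeline: derive sparse "latent terms" from a dense embedding, then rank with BM25 over those terms. The following is a minimal sketch of that general idea, not Mixedbread's implementation. Everything in it is an assumption made for illustration: the encoder is a random, untrained stand-in for a trained Sparse Autoencoder, and the names `make_encoder`, `latent_terms`, and `BM25Index`, the top-k sparsification, and the way activations are quantized into pseudo term frequencies are all hypothetical.

```python
import math
import random
from collections import Counter


def make_encoder(dim, n_latents, seed=0):
    """Random stand-in for a trained SAE encoder matrix (assumption)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_latents)]


def latent_terms(embedding, encoder, k=4, scale=4):
    """Map a dense embedding to a bag of latent terms.

    Applies ReLU(W x), keeps the k strongest latents, and quantizes each
    activation into an integer pseudo term frequency (hypothetical choice).
    """
    acts = [max(0.0, sum(w * x for w, x in zip(row, embedding))) for row in encoder]
    top = sorted(range(len(acts)), key=lambda i: acts[i], reverse=True)[:k]
    return Counter({f"L{i}": 1 + int(scale * acts[i]) for i in top if acts[i] > 0.0})


class BM25Index:
    """Plain Okapi BM25 over bags of latent terms."""

    def __init__(self, docs, k1=1.2, b=0.75):
        self.docs = docs
        self.k1, self.b = k1, b
        self.lengths = [sum(d.values()) for d in docs]
        self.avgdl = (sum(self.lengths) / len(docs)) or 1.0
        # Document frequency: number of documents containing each term.
        self.df = Counter(t for d in docs for t in d)

    def idf(self, term):
        n = self.df.get(term, 0)
        return math.log(1.0 + (len(self.docs) - n + 0.5) / (n + 0.5))

    def score(self, query_terms, doc_id):
        # Query term weights are ignored; only term presence in the query matters.
        doc = self.docs[doc_id]
        norm = self.k1 * (1 - self.b + self.b * self.lengths[doc_id] / self.avgdl)
        return sum(
            self.idf(t) * doc[t] * (self.k1 + 1) / (doc[t] + norm)
            for t in query_terms
            if t in doc
        )

    def search(self, query_terms, top_n=3):
        ranked = sorted(
            range(len(self.docs)),
            key=lambda i: self.score(query_terms, i),
            reverse=True,
        )
        return ranked[:top_n]


if __name__ == "__main__":
    rng = random.Random(1)
    encoder = make_encoder(dim=8, n_latents=32)
    embeddings = [[rng.gauss(0.0, 1.0) for _ in range(8)] for _ in range(5)]
    index = BM25Index([latent_terms(e, encoder) for e in embeddings])
    query = latent_terms(embeddings[2], encoder)
    print(index.search(query))
```

In this sketch the latent identifiers play the role that words play in a text index, so any standard inverted-index engine could in principle store them. Whether a real system sparsifies or weights activations this way is not stated in the post.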
Ben Clavié (@bclavie)

my main takeaway from this isn't "oh, this is cool! BM25 works on hidden activations!" but "we understand so little about retrieval that models have an entire sparse indexable world we knew almost nothing about".

future's bright, tons of work to do.

29m · 662 Views · 10 Likes · 3 Bookmarks