/AI · 5h ago

Mixedbread AI uses Sparse Autoencoders to extract sparse latent terms from dense embeddings for BM25 retrieval

Ben Clavié says this reveals hidden indexable structures in embeddings

Original post

samsja#1262

Mixedbread@mixedbreadai

By now, everyone knows that single-vector embedding models are hugely limiting for modern workflows.

But they contain more than you think: you can extract sparse Latent Terms from them.

And it turns out that BM25 is all you need to turn this vocabulary into a strong retriever.

11:48 AM · Jun 2, 2026 · 6.8K Views
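The idea in the post can be illustrated with a toy sketch: encode each dense embedding with a sparse autoencoder, treat the most active latents as "terms", and rank documents with ordinary BM25 over those terms. Everything here is an assumption for illustration, not Mixedbread's method: the encoder weights `w_enc` and `b_enc` are made up rather than trained, the top-k selection, the `latent_<i>` naming, and the BM25 parameters `k1` and `b` are hypothetical choices, and each active latent counts as one term occurrence.

```python
import math
from collections import Counter

def latent_terms(embedding, w_enc, b_enc, k=2):
    """Hypothetical SAE encoder: ReLU(W x + b), then keep the k most active latents."""
    acts = [max(0.0, sum(w * x for w, x in zip(row, embedding)) + bias)
            for row, bias in zip(w_enc, b_enc)]
    top = sorted(range(len(acts)), key=lambda i: acts[i], reverse=True)[:k]
    # Each surviving latent index becomes a pseudo-word in the sparse vocabulary.
    return [f"latent_{i}" for i in top if acts[i] > 0.0]

def bm25_scores(query_terms, docs, k1=1.2, b=0.75):
    """Plain BM25 (Lucene-style IDF) over lists of latent terms."""
    n = len(docs)
    avgdl = (sum(len(d) for d in docs) / n) or 1.0
    df = Counter(t for d in docs for t in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for t in set(query_terms):
            if tf[t] == 0:
                continue
            idf = math.log(1 + (n - df[t] + 0.5) / (df[t] + 0.5))
            norm = tf[t] + k1 * (1 - b + b * len(d) / avgdl)
            score += idf * tf[t] * (k1 + 1) / norm
        scores.append(score)
    return scores

# Made-up encoder weights (4 latents over 3-dimensional embeddings), not a trained SAE.
w_enc = [[1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 1, 0]]
b_enc = [0.0, 0.0, 0.0, -0.5]

doc_embeddings = [[0.9, 0.1, 0.0], [0.0, 0.2, 0.9], [0.1, 0.9, 0.0]]
doc_terms = [latent_terms(e, w_enc, b_enc) for e in doc_embeddings]
query_terms = latent_terms([0.8, 0.0, 0.1], w_enc, b_enc)

scores = bm25_scores(query_terms, doc_terms)
best = max(range(len(scores)), key=lambda i: scores[i])
print(doc_terms, query_terms, best)
```

In this toy setup the first document shares both of the query's latent terms and ranks highest, while the second shares none and scores zero. A real pipeline would presumably train the autoencoder on embeddings from the target model and might use activation magnitudes rather than binary term counts.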

Sentiment

Users find it neat that SAE latents on embedders produce natural-language-like distributions and are curious how BM25 on extracted terms compares to learned sparse retrieval.

Pos: 100.0% · Neg: 0.0%

1 comment with sentiment.

Posts from X

Most Activity

Views 662 · Bookmarks 3 · Likes 10 · Retweets 2

Ben Clavié@bclavie

my main takeaway from this isn't "oh, this is cool! BM25 works on hidden activations!" but "we understand so little about retrieval that models have an entire sparse indexable world we knew almost nothing about".

future's bright, tons of work to do.