
Waterloo researchers replicate DPR model in Pyserini


Researchers Xueguang Ma, Ronak Pradeep and University of Waterloo undergraduate Kai Sun replicated the 2020 Dense Passage Retrieval (DPR) model from Karpukhin et al. Their April 2021 arXiv study combined BM25 sparse retrieval with dense neural retrievers inside the Pyserini toolkit. The resulting hybrid methods appear in Pyserini and Pi-Serini, with evaluations on the BEIR, LongEmbed and BrowseComp-Plus benchmarks. Recent extensions pair the retriever with search agents.

Original post

I think @xueguang_ma is being too modest, so I'll provide context: he along with @rpradeep42 and a UWaterloo ugrad (Kai Sun) popularized hybrid search in its current form. So, if you're using hybrid search today, thank them. 🙏 Yes, this is clickbait-y, so I'll support my claims 🧵

10:08 AM · May 14, 2026

@lintool cheeky not to count index as its embeddings

Jimmy Lin @lintool

Since we're counting model parameters, let me introduce you to a two-parameter model for agentic search that's awesome: It's called BM25. I haven't tried it yet, but I think fp4 will work fine. https://arxiv.org/abs/2605.10848

3:56 PM · May 15, 2026 · 5.8K Views
3:18 AM · May 16, 2026 · 454 Views

The DPR paper might have shown you wouldn't need BM25, but it was already known, before hybrid search was popularized, that neural models alone were insufficient. In particular, in our SIGIR paper (which slightly preceded the BEIR paper by @beirmug and colleagues) we learned that strong neural rankers don't generalize well and underperform BM25. The @beirmug paper makes some additional findings, in particular that dense retrieval generalizes more poorly than BERT-based rankers. The @UWaterloo team surely deserves the credit for the ultimate demonstration of the usefulness of hybrid search. However, the need for hybrid search was also motivated by prior work. https://arxiv.org/abs/2103.03335

Jimmy Lin @lintool

Thus, our conclusions: This I believe is the first demonstration of the need for hybrid search. Hence the claim that hybrid search is a @UWaterloo innovation. You're welcome! The broader lesson is that old baselines are still surprisingly important. Let's not forget them.

5:14 PM · May 14, 2026 · 3.9K Views
5:59 PM · May 14, 2026 · 1.5K Views

@mrdrozdov @xueguang_ma @mat_jacob1002 I think in that case one should blame the ranker, not hybrid search. With a better ranker, hybrid search + ranker should have outperformed just vector search.

7:37 PM · May 14, 2026 · 27 Views

BTW, there's a truly unique (to my knowledge) @UWaterloo invention that has become a largely essential cog in the hybrid retrieval machine. Yet nobody is talking about it. It is reciprocal rank fusion (RRF), published by G. Cormack, @claclarke, and Stefan Büttcher. AFAIK, it is implemented and used nearly everywhere.

8:38 PM · May 14, 2026 · 2.1K Views
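
For readers who have not seen reciprocal rank fusion, here is a minimal sketch of the idea: each document's fused score is the sum, over the input rankings, of 1 / (k + rank). The function name, the toy document ids, and the two example runs are hypothetical; k = 60 is the constant commonly used with this method, and real systems may handle ties and truncation differently.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists of doc ids; each doc scores sum(1 / (k + rank))."""
    scores = defaultdict(float)
    for ranking in rankings:
        # Ranks are 1-based: the top document in a list has rank 1.
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by doc id for determinism.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical runs: one from a sparse retriever, one from a dense retriever.
bm25_run = ["d1", "d2", "d3"]
dense_run = ["d3", "d1", "d4"]
fused = reciprocal_rank_fusion([bm25_run, dense_run])
fused_ids = [doc_id for doc_id, _ in fused]
```

Because only ranks are used, the two retrievers' raw scores never need to be put on a common scale, which is one plausible reason the method is so widely adopted.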

I think @xueguang_ma is being too modest, so I'll provide context: he along with @rpradeep42 and a UWaterloo ugrad (Kai Sun) popularized hybrid search in its current form.

So, if you're using hybrid search today, thank them. 🙏

Yes, this is clickbait-y, so I'll support my claims 🧵

Xueguang Ma @xueguang_ma

This plot reminds me of my first IR work reproducing DPR in Pyserini, where we found BM25 is amazingly helpful when hybridized with a dense retriever. BM25 is never just a simple baseline -- used the right way, it can easily outperform many fancy methods. BM25 was the most robust method shown in BEIR, the most effective and efficient method for long-context search shown in LongEmbed, and now @mattjustram and @xuzihuan4 show that BM25 can push search agents onto the best efficiency frontier. p.s. Pyserini and pi-serini are two different repos.

3:19 AM · May 13, 2026 · 11.4K Views
5:08 PM · May 14, 2026 · 4.9K Views

The original DPR paper https://aclanthology.org/2020.emnlp-main.550/ claimed that with dense retrieval, you no longer needed BM25.

5:09 PM · May 14, 2026 · 551 Views

But that's not what we found: even with DPR, a dense-sparse hybrid with BM25 is significantly better than DPR alone. https://arxiv.org/abs/2104.05740

5:11 PM · May 14, 2026 · 655 Views
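
To make "dense-sparse hybrid" concrete, here is a minimal sketch of one common way to combine the two runs: a weighted sum of per-document scores. This is an illustration under stated assumptions, not necessarily the exact procedure of the linked study. The function name, the weight `alpha`, the toy scores, and the choice to back off to a run's minimum score for documents it did not retrieve are all hypothetical.

```python
def hybrid_scores(dense, sparse, alpha=1.0):
    """Rank documents by dense_score + alpha * sparse_score."""
    # Assumption: a document missing from one run is given that run's
    # minimum score, so it is penalized but not discarded.
    min_dense = min(dense.values())
    min_sparse = min(sparse.values())
    fused = {}
    for doc_id in dense.keys() | sparse.keys():
        d = dense.get(doc_id, min_dense)
        s = sparse.get(doc_id, min_sparse)
        fused[doc_id] = d + alpha * s
    # Highest fused score first; ties broken by doc id for determinism.
    return sorted(fused.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical scores from a dense retriever and from BM25.
dense_run = {"d1": 80.0, "d2": 78.0}
sparse_run = {"d2": 12.0, "d3": 15.0}
hybrid = hybrid_scores(dense_run, sparse_run, alpha=1.0)
hybrid_ids = [doc_id for doc_id, _ in hybrid]
```

In practice `alpha` would be tuned on held-out data, since dense and sparse scores live on different scales.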

Thus, our conclusions: This I believe is the first demonstration of the need for hybrid search. Hence the claim that hybrid search is a @UWaterloo innovation. You're welcome!

The broader lesson is that old baselines are still surprisingly important. Let's not forget them.

5:14 PM · May 14, 2026 · 3.9K Views

Since we're counting model parameters, let me introduce you to a two-parameter model for agentic search that's awesome: It's called BM25. I haven't tried it yet, but I think fp4 will work fine. https://arxiv.org/abs/2605.10848

3:56 PM · May 15, 2026 · 5.8K Views
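
The "two parameters" are BM25's k1 (term-frequency saturation) and b (document-length normalization). The sketch below scores a toy collection with a common BM25 variant. The function name, whitespace tokenization, the specific IDF formula, the example documents, and the parameter values are assumptions made for illustration, not a description of any particular toolkit's implementation.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=0.9, b=0.4):
    """Score each document against the query with a simple BM25 variant."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized) / n_docs
    # Document frequency: number of documents containing each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        # b controls how strongly longer documents are penalized.
        length_norm = 1 - b + b * len(tokens) / avg_len
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # k1 controls how quickly repeated terms stop adding score.
            score += idf * tf[term] * (k1 + 1) / (tf[term] + k1 * length_norm)
        scores.append(score)
    return scores


# Hypothetical three-document collection.
docs = [
    "hybrid search combines sparse and dense retrieval",
    "grep scans text for matching lines",
    "dense retrieval encodes passages as vectors",
]
scores = bm25_scores("dense retrieval", docs)
```

Here the second document shares no query terms and scores zero, while the third outscores the first only because it is shorter, which is the length normalization at work.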

But I think we can do better... what about zero parameters? Let me introduce you to something else that's awesome: It's called grep. https://arxiv.org/abs/2605.05242

3:57 PM · May 15, 2026 · 396 Views

@srchvrs @xueguang_ma @mat_jacob1002 Indeed. No silver bullet. :)

7:59 PM · May 14, 2026 · 25 Views