Sharon Li introduces ModeX for evaluator-free Best-of-N selection, but Stella Biderman says the clustering method is standard textbook ML

The official implementation has been released on GitHub.

Original post
Sharon Li @SharonYixuanLi

Best-of-N sampling is often used to boost LLM performance, but the selection relies on external evaluators, adding cost and bias. What if you could select the best output without any external scoring at all?

Introducing ModeX: Evaluator-Free Best-of-N Selection for Open-Ended Generation, accepted at #ACL2026 Main! (led by Hyeong Kyu Choi @HyeonggyuC)

💡 Our key insight: among multiple LLM generations, high-quality outputs tend to cluster together semantically. The best answer is the modal one: the generation that captures the dominant consensus.

How ModeX works:
1. Build a similarity graph over N candidate generations
2. Recursively apply spectral clustering via the Fiedler vector to isolate the dominant semantic cluster
3. Select the centroid of that cluster as the final output
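
The three steps can be sketched roughly as follows. This is a minimal illustration, not the released ModeX implementation: the cosine-similarity graph, the "keep the larger side" rule, the `min_size` and `max_depth` stopping parameters, and the medoid-style centroid are all assumptions made for this example, and the embeddings are assumed to be nonzero vectors from any sentence encoder.

```python
import numpy as np


def cosine_similarity_graph(embeddings):
    """Nonnegative cosine-similarity adjacency matrix with a zero diagonal.

    Assumes every row of `embeddings` is a nonzero vector.
    """
    x = np.asarray(embeddings, dtype=float)
    x = x / np.linalg.norm(x, axis=1, keepdims=True)
    w = np.clip(x @ x.T, 0.0, None)  # drop negative similarities
    np.fill_diagonal(w, 0.0)         # no self-loops
    return w


def fiedler_split(w):
    """Bipartition nodes by the sign of the Fiedler vector of L = D - W."""
    laplacian = np.diag(w.sum(axis=1)) - w
    # eigh returns eigenvalues in ascending order; column 1 is the
    # eigenvector of the second-smallest eigenvalue (the Fiedler vector).
    _, vecs = np.linalg.eigh(laplacian)
    return vecs[:, 1] >= 0


def select_mode(embeddings, min_size=3, max_depth=3):
    """Return the index of the candidate chosen as the 'modal' generation.

    min_size and max_depth are hypothetical stopping rules for this sketch.
    """
    w = cosine_similarity_graph(embeddings)
    idx = np.arange(len(w))
    for _ in range(max_depth):
        if len(idx) <= min_size:
            break
        mask = fiedler_split(w[np.ix_(idx, idx)])
        left, right = idx[mask], idx[~mask]
        if len(left) == 0 or len(right) == 0:
            break
        # Assumption: the dominant cluster is the larger side of the cut.
        idx = left if len(left) >= len(right) else right
    sub = w[np.ix_(idx, idx)]
    # Assumption: "centroid" = member with the highest total similarity
    # to the rest of the surviving cluster (a medoid).
    return int(idx[np.argmax(sub.sum(axis=1))])
```

Calling `select_mode(embeddings)` on an N-by-d array returns the position of the selected candidate, so the final output would be `candidates[select_mode(embeddings)]`. Nothing in this sketch calls a reward model or a second LLM, which is the property the post emphasizes.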

No reward models. No external evaluators. No auxiliary inference. Just the texts themselves.

📊 Results across text summarization (CNN/DailyMail), code generation (HumanEval), and math reasoning (Math-500) show ModeX consistently outperforms single-path and multi-path baselines, achieving state-of-the-art among evaluator-free methods.

We also provide theoretical justifications connecting our graph-based mode selection to kernel density estimation, grounding the approach with principled foundations.
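
One standard way to see why a graph-based selection can relate to density estimation, offered here as an illustration and not as the paper's actual argument: assume the edge weights come from a kernel, $W_{ij} = K(x_i, x_j)$, with self-loops included. A node's degree $d_i$ is then a kernel density estimate at $x_i$, up to the factor $N$ and the kernel's normalizing constant. The symbols $K$, $x_i$, $d_i$, and $\hat{p}$ are notation introduced for this sketch.

```latex
\hat{p}(x_i) = \frac{1}{N} \sum_{j=1}^{N} K(x_i, x_j),
\qquad
d_i = \sum_{j=1}^{N} W_{ij} = N \, \hat{p}(x_i)
```

Under that assumption, picking the highest-degree candidate amounts to picking the sample where the estimated density is highest, which is one reading of "finding the mode".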

📄 Paper: http://arxiv.org/abs/2601.02535
💻 Code: http://github.com/deeplearning-wisc/ModeX

Sometimes the best signal is already hiding in the samples; you just need to find the mode. 🎯

7:49 AM · Jun 9, 2026 · 17 Views
Posts from X
Views 6.2K · Bookmarks 52 · Likes 81 · Retweets 9 · Replies 3

Surely you would want to cite works on semantic uncertainty:

https://arxiv.org/abs/2302.09664

or the Nature version:

https://www.nature.com/articles/s41586-024-07421-0

or other works by Yarin Gal, e.g.:

https://proceedings.neurips.cc/paper_files/paper/2024/hash/10c456d2160517581a234dfde15a7505-Abstract-Conference.html

A bit surprised that the related work section is so short on semantic approaches for clustering.

1d · Views 628 · Likes 12 · Bookmarks 5
Stella Biderman @BlancheMinerva

@SharonYixuanLi This methodology can be found in virtually every ML textbook in the world and is already in widespread use.

23h · Views 630 · Likes 7 · Bookmarks 4