AI · 11h ago

Sharon Li introduces ModeX for evaluator-free Best-of-N selection, but Stella Biderman says the clustering method is standard textbook ML

The official implementation has been released on GitHub.

Original post
Sharon Li (@SharonYixuanLi) · #604 in AI

Best-of-N sampling is often used to boost LLM performance, but the selection relies on external evaluators, adding cost and bias. What if you could select the best output without any external scoring at all?

Introducing ModeX: Evaluator-Free Best-of-N Selection for Open-Ended Generation, accepted at #ACL2026 Main! (led by Hyeong Kyu Choi @HyeonggyuC)

💡 Our key insight: among multiple LLM generations, high-quality outputs tend to cluster together semantically. The best answer is the modal one: the generation that captures the dominant consensus.

How ModeX works:
1. Build a similarity graph over N candidate generations
2. Recursively apply spectral clustering via the Fiedler vector to isolate the dominant semantic cluster
3. Select the centroid of that cluster as the final output

No reward models. No external evaluators. No auxiliary inference. Just the texts themselves.

📊 Results across text summarization (CNN/DailyMail), code generation (HumanEval), and math reasoning (Math-500) show ModeX consistently outperforms single-path and multi-path baselines, achieving state-of-the-art among evaluator-free methods.

We also provide theoretical justifications connecting our graph-based mode selection to kernel density estimation, grounding the approach with principled foundations.

📄 Paper: http://arxiv.org/abs/2601.02535
💻 Code: http://github.com/deeplearning-wisc/ModeX

Sometimes the best signal is already hiding in the samples; you just need to find the mode. 🎯

7:49 AM · Jun 9, 2026 · 17 Views
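
The three steps listed in the post can be illustrated with a short sketch. This is not the authors' implementation (that is at the GitHub link above); it is a minimal example under several assumptions of ours: a bag-of-words cosine similarity stands in for whatever similarity the paper uses, the split uses the sign of the Fiedler vector of the unnormalized graph Laplacian, the larger side is kept as the "dominant" cluster, and the "centroid" is taken to be the member with the highest summed similarity to the rest of its cluster. The function names `bow_cosine_matrix`, `fiedler_split`, and `select_modal`, and the parameters `min_size` and `max_depth`, are hypothetical.

```python
from collections import Counter

import numpy as np


def bow_cosine_matrix(texts):
    """Pairwise cosine similarity over word counts (assumed stand-in similarity)."""
    vocab = sorted({w for t in texts for w in t.lower().split()})
    index = {w: i for i, w in enumerate(vocab)}
    X = np.zeros((len(texts), len(vocab)))
    for row, t in enumerate(texts):
        for w, c in Counter(t.lower().split()).items():
            X[row, index[w]] = c
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    X = X / np.maximum(norms, 1e-12)  # guard against empty texts
    return X @ X.T


def fiedler_split(W):
    """Boolean mask splitting nodes by the sign of the Fiedler vector."""
    laplacian = np.diag(W.sum(axis=1)) - W
    _, vecs = np.linalg.eigh(laplacian)  # eigenvalues come back in ascending order
    return vecs[:, 1] >= 0               # eigenvector of the second-smallest eigenvalue


def select_modal(texts, min_size=3, max_depth=3):
    """Return (index, text) of the candidate chosen by recursive bipartition."""
    W = bow_cosine_matrix(texts)
    np.fill_diagonal(W, 0.0)  # no self-loops in the similarity graph
    members = np.arange(len(texts))
    for _ in range(max_depth):
        if len(members) <= min_size:
            break
        mask = fiedler_split(W[np.ix_(members, members)])
        side_a, side_b = members[mask], members[~mask]
        if len(side_a) == 0 or len(side_b) == 0:
            break  # degenerate split, stop recursing
        # Assumption: the larger side is the dominant cluster.
        members = side_a if len(side_a) >= len(side_b) else side_b
    sub = W[np.ix_(members, members)]
    best = int(members[np.argmax(sub.sum(axis=1))])  # most central member
    return best, texts[best]


candidates = [
    "the cat sat on the mat",
    "the cat sat on a mat",
    "the cat sat on the soft mat",
    "the cat is on the mat",
    "stock markets fell sharply on monday morning",
]
idx, chosen = select_modal(candidates)
print(idx, chosen)
```

In this toy input the last candidate shares only one word with the others, so it is only weakly connected in the graph and the first split separates it from the four similar sentences. Choosing the member with the largest summed similarity amounts to maximizing an unnormalized kernel density estimate over the cluster if the similarity is read as a kernel; the post says the paper draws a connection to kernel density estimation, but this sketch does not reproduce that analysis.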
Views 4.4K · Bookmarks 40 · Likes 51 · Retweets 5 · Replies 3
Stella Biderman (@BlancheMinerva)

@SharonYixuanLi This methodology can be found in virtually every ML textbook in the world and is already in widespread use.

4h · Views 397 · Likes 3 · Bookmarks 2

Surely you would want to cite works on semantic uncertainty:

https://arxiv.org/abs/2302.09664

or the Nature version:

https://www.nature.com/articles/s41586-024-07421-0

or other works by Yarin Gal, e.g.:

https://proceedings.neurips.cc/paper_files/paper/2024/hash/10c456d2160517581a234dfde15a7505-Abstract-Conference.html

A bit surprised that the related work section is so short on semantic approaches for clustering.

10h · Views 79 · Likes 2 · Bookmarks 1