Best-of-N sampling is often used to boost LLM performance, but the selection relies on external evaluators, adding cost and bias. What if you could select the best output without any external scoring at all?
Introducing ModeX: Evaluator-Free Best-of-N Selection for Open-Ended Generation, accepted at #ACL2026 Main! (led by Hyeong Kyu Choi @HyeonggyuC)
💡 Our key insight: among multiple LLM generations, high-quality outputs tend to cluster together semantically. The best answer is the modal one: the generation that captures the dominant consensus.
How ModeX works:
1⃣ Build a similarity graph over N candidate generations
2⃣ Recursively apply spectral clustering via the Fiedler vector to isolate the dominant semantic cluster
3⃣ Select the centroid of that cluster as the final output
No reward models. No external evaluators. No auxiliary inference. Just the texts themselves.
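A minimal sketch of the three steps above, not the released implementation (see the repo for that). Everything named here is an assumption for illustration: the helpers `fiedler_split` and `select_mode`, the `min_cluster_size` stopping rule, keeping the larger side of each split as the "dominant" cluster, and taking the centroid to be the member with the highest total similarity to the rest of its cluster. The pairwise similarity matrix is taken as given; how it is computed from the texts is left out.

```python
import numpy as np

def fiedler_split(sim):
    """Bipartition a similarity graph by the sign of its Fiedler vector."""
    weights = np.array(sim, dtype=float)      # copy so the caller's matrix is untouched
    np.fill_diagonal(weights, 0.0)            # ignore self-similarity
    laplacian = np.diag(weights.sum(axis=1)) - weights   # unnormalized Laplacian L = D - W
    _, eigvecs = np.linalg.eigh(laplacian)    # eigenvalues are returned in ascending order
    return eigvecs[:, 1] >= 0                 # eigenvector of the second-smallest eigenvalue

def select_mode(sim, min_cluster_size=3):
    """Return the index of the candidate picked as the mode (hypothetical stopping rule)."""
    sim = np.asarray(sim, dtype=float)
    members = np.arange(sim.shape[0])
    # Assumed rule: keep splitting while the cluster is larger than min_cluster_size.
    while len(members) > max(min_cluster_size, 1):
        mask = fiedler_split(sim[np.ix_(members, members)])
        left, right = members[mask], members[~mask]
        if len(left) == 0 or len(right) == 0:
            break                             # no usable split; stop here
        # Assumed rule: the larger side is the dominant cluster.
        members = left if len(left) >= len(right) else right
    sub = sim[np.ix_(members, members)].copy()
    np.fill_diagonal(sub, 0.0)
    # Assumed centroid: the member most similar, in total, to the rest of its cluster.
    return int(members[np.argmax(sub.sum(axis=1))])

# Toy similarity matrix: candidates 0-4 agree with each other, 5-6 form a small outlier group.
sim = np.full((7, 7), 0.1)
sim[:5, :5] = 0.8
sim[2, :5] = sim[:5, 2] = 0.9                 # candidate 2 is closest to everyone in the big group
sim[5:, 5:] = 0.9
np.fill_diagonal(sim, 1.0)

best = select_mode(sim, min_cluster_size=5)
print(best)  # 2
```

On this toy matrix the first Fiedler split separates candidates 0-4 from 5-6, the larger group is kept, and candidate 2 is returned because it has the highest total similarity inside that group.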
📊 Results across text summarization (CNN/DailyMail), code generation (HumanEval), and math reasoning (Math-500) show ModeX consistently outperforms single-path and multi-path baselines, achieving state-of-the-art among evaluator-free methods.
We also provide a theoretical justification that connects our graph-based mode selection to kernel density estimation, giving the approach a principled foundation.
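One rough way to picture the KDE connection, as a sketch that assumes the edge weights come from a kernel $k$ (the paper's exact construction and statement may differ): the weighted degree of candidate $x_i$ in the similarity graph is, up to a normalizing constant, a kernel density estimate evaluated at $x_i$.

```latex
\hat{p}(x_i) \;=\; \frac{1}{N} \sum_{j=1}^{N} k(x_i, x_j),
\qquad
i^\star \;=\; \arg\max_{i} \, \hat{p}(x_i) \;=\; \arg\max_{i} \sum_{j=1}^{N} k(x_i, x_j)
```

Under that assumption, picking the most strongly connected candidate amounts to picking the sample with the highest estimated density, which is an estimate of the mode.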
📄 Paper: http://arxiv.org/abs/2601.02535 💻 Code: http://github.com/deeplearning-wisc/ModeX
Sometimes the best signal is already hiding in the samples; you just need to find the mode. 🎯
