NEW paper worth reading.
GPT-5.4 nano plus a critic-comparator orchestration loop hits 76.4% on SWE-bench Verified, matching standalone Gemini 3 Pro and Claude Opus 4.5 Thinking.
The trick: generate k=8 candidate patches with the weak model, then select among them using execution and proof signals.
What does this mean?
Many of the patches you'd expect from a frontier model are already inside a weak model's top-8 candidates.
When you have 8 candidate patches from a weak model, don't ask the model which is best. Run them and verify them. In the paper's results, that is enough to match a frontier model's accuracy.
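To make "run them, don't ask" concrete, here is a minimal sketch of execution-based selection. It is not the paper's critic-comparator loop and it ignores proof signals entirely. The names `passes` and `select_by_execution` are hypothetical, and plain Python functions stand in for candidate patches.

```python
from typing import Callable, List, Sequence, Tuple

Candidate = Callable[[int], int]   # stand-in for one patched program (assumption)
TestCase = Tuple[int, int]         # (input, expected output)


def passes(candidate: Candidate, case: TestCase) -> bool:
    """Run one candidate on one test; a crash counts as a failure."""
    arg, expected = case
    try:
        return candidate(arg) == expected
    except Exception:
        return False


def select_by_execution(candidates: Sequence[Candidate],
                        tests: Sequence[TestCase]) -> int:
    """Return the index of the candidate that passes the most tests.

    Ties go to the earliest candidate, because max() keeps the first
    maximal element it sees.
    """
    scores: List[int] = [sum(passes(c, t) for t in tests) for c in candidates]
    return max(range(len(candidates)), key=scores.__getitem__)


# Toy task: implement abs(). Four "proposals", only one fully correct.
candidates: List[Candidate] = [
    lambda x: x,                      # wrong for negatives
    lambda x: -x,                     # wrong for positives
    lambda x: x if x >= 0 else -x,    # correct
    lambda x: 1 // 0,                 # always crashes
]
tests: List[TestCase] = [(3, 3), (-2, 2), (0, 0)]

best = select_by_execution(candidates, tests)
print(best)
```

The selector never asks a model for an opinion: it only counts which candidate survives execution. In a real setup the tests would be a repository's test suite run in a sandbox, which this sketch does not attempt.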
The takeaway for AI devs: a weak model's top-k often already contains the right answer. What limits you is the quality of your selector, not the capability of the model.
Paper: https://arxiv.org/abs/2605.14163
Learn to build effective AI agents in our academy: https://academy.dair.ai/