This is the most interesting recent benchmark result that I've seen:
The 100-line mini-swe-agent harness gets better performance out of Opus, GPT, and Gemini than their respective bespoke harnesses do.
(As measured on the excellent DeepSWE bench.)
Why would that be true?