1/5 Our latest Labs in Front piece: Agent pipeline order matters. By reversing a common agent recipe (enrich first, then scale) to scale first, enrich second, we reached SOTA on a Dec '25 to Mar '26 SWE-rebench slice (123 issues): 60.9%.
Users praise AI21 Labs for reversing its agent pipeline as it enables effective scale-first enrichment rather than blind search on SWE-rebench.

2/5 Started with a baseline: classic ReAct agent (GPT-5.2), single Docker-terminal tool. Baselines on the slice: vanilla 53.8%, enrich-only 55.6%, scale-only (n=5 + LLM judge) 55.4%, enrich-then-scale 57.7%.

3/5 Reversing (scale-then-enrich) pushed our score to 59.7%. Why? Enrich-first searches a big repo blind from the raw issue. Scale-first hands the extractor your N rollouts - aka a contextual map of where fixes were attempted - so it targets high-probability files.

4/5 Still came in ~$0.30 under Claude Code’s spend at a similar score. So we added a lightweight Test Agent that writes repo tests and filters failing patches, pushing our final result to 60.9% - surpassing Claude Code (60.9% vs 56.2%) at the same cost.

5/5 Takeaway: pipeline order is a hyperparameter. If you're already paying for parallel rollouts, reuse them - they're relevant context, not just candidate answers. Full write-up: [https://www.ai21.com/blog/first-scale-then-enrich-how-the-right-execution-strategy-helped-us-reach-state-of-the-art-on-swe-rebench/?utm_source=org-twitter]
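
A rough sketch of the scale-then-enrich ordering described in this thread, not AI21's actual implementation. Every name here (`Rollout`, `files_from_rollouts`, `scale_then_enrich`, `toy_agent`) is hypothetical, the extractor is reduced to counting which files the rollouts touched, and the second agent pass that consumes the hints is an assumption, since the thread does not say how the enriched context is used afterwards.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Rollout:
    patch: str                # candidate diff produced by one agent run
    touched_files: List[str]  # files that run edited or inspected


def files_from_rollouts(rollouts: List[Rollout], top_k: int = 2) -> List[str]:
    """Toy extractor: rank files by how many rollouts touched them."""
    counts = Counter(f for r in rollouts for f in sorted(set(r.touched_files)))
    return [name for name, _ in counts.most_common(top_k)]


def scale_then_enrich(
    issue: str,
    run_agent: Callable[[str, List[str]], Rollout],
    pick_best: Callable[[str, List[Rollout]], Rollout],
    passes_tests: Optional[Callable[[Rollout], bool]] = None,
    n: int = 5,
) -> Tuple[List[str], Rollout]:
    # 1. Scale: n rollouts from the raw issue, with no extra context.
    explored = [run_agent(issue, []) for _ in range(n)]
    # 2. Enrich: the extractor reads the rollouts instead of searching blind.
    hints = files_from_rollouts(explored)
    # 3. Assumption: rerun the agent with the hints as added context.
    candidates = [run_agent(issue, hints) for _ in range(n)]
    # 4. Optional test filter: drop failing patches, but never drop all of them.
    if passes_tests is not None:
        candidates = [c for c in candidates if passes_tests(c)] or candidates
    # 5. A judge (an LLM in the thread, a plain callable here) picks one patch.
    return hints, pick_best(issue, candidates)


# Toy stand-ins so the sketch runs without any model or Docker tool.
scripted = iter([
    ["core/parser.py", "README.md"],
    ["core/parser.py"],
    ["core/parser.py", "tests/test_parser.py"],
])


def toy_agent(issue: str, hints: List[str]) -> Rollout:
    if hints:
        return Rollout(f"patch targeting {hints[0]}", [hints[0]])
    return Rollout("exploratory patch", next(scripted))


hints, best = scale_then_enrich(
    "parser crashes on empty input",
    toy_agent,
    pick_best=lambda issue, cands: cands[0],
    passes_tests=lambda r: "parser" in r.patch,
    n=3,
)
```

Swapping steps 1 and 2 gives the enrich-first ordering, where the extractor would only see the issue text and nothing about where fixes were attempted.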

@AI21Labs scale first enriches pipeline noise second makes the bar go brr

@AI21Labs small detail but reversing the recipe shifted the whole ceiling
scale first really lets enrichment do its thing

@AI21Labs I have similar thoughts: order probably matters more than people assume in these agent pipelines. Curious whether this holds outside the Dec-Mar slice too.

@AI21Labs Scale-first gives enrichment a map instead of a blind search. Wonder how many production systems still do it backwards.