📝 New research from @scale_AI
Frontier SWE benchmarks are usually single-turn, one-shot tasks: the agent gets a detailed spec upfront, then implements autonomously.
That is not how most real coding-agent workflows feel.
Introducing SWE-Interact.
🧵
Both benchmarks are built using real-world SWE-chat data.
Big day for interactive coding benchmarks! Two new evals just dropped, both powered by SWE-chat.
SWE-Together transforms real coding sessions from SWE-chat into replayable evals, with robust checks for correctness and user experience.
SWE-Interact flips traditional one-shot benchmarks into dynamic developer workflows, with user simulators conditioned on realistic SWE-chat personas.
Congrats to @yifannnwu @mohit_r9a and teams on the releases!