Scale AI releases SWE-Together and SWE-Interact to evaluate coding agents on dynamic, multi-turn software engineering workflows

Both benchmarks are built using real-world SWE-chat data.

Original post

📝 New research from @scale_AI

Frontier SWE benchmarks are usually single-turn, one-shot tasks: the agent gets a detailed spec upfront, then implements autonomously.

That is not how most real coding-agent workflows feel.

Introducing SWE-Interact.

🧵

9:23 AM · Jun 30, 2026 · 358 Views

Posts from X

Kevin Li (@kevin_x_li)

Big day for interactive coding benchmarks! Two new evals just dropped, both powered by SWE-chat.

SWE-Together transforms real coding sessions from SWE-chat into replayable evals, with robust checks for correctness and user experience.

SWE-Interact flips traditional one-shot benchmarks into dynamic developer workflows, with user simulators conditioned on realistic SWE-chat personas.

Congrats to @yifannnwu @mohit_r9a and teams on the releases!
