Alex Ratner, Snorkel AI co-founder, launches Senior SWE-Bench to evaluate AI agents on under-specified software tasks

It measures real code outcomes instead of procedural compliance

Original post

Alex Ratner @ajratner

Incredibly excited to launch Senior SWE-Bench w @SnorkelAI + @karthik_r_n Austin Wang @Princeton (lab behind SWE Bench) + @fredsala @GOrlanski @UWMadison !

Our focus: realistically under-specified instructions like a senior SWE would run with + real outcomes-based grading.

Henry Kiss Ehrenberg @henryehrenberg

We expect agents to act like senior engineers, but most benchmarks still evaluate them like interns.

Excited to introduce Senior SWE-Bench, an open-source and @harborframework-native benchmark that assesses agents as senior engineers on long-horizon tasks with realistically under-specified instructions.

We expect agents to build real features going on just a quick Slack message, nothing like the super technical instructions most benchmarks provide. Senior SWE-Bench fixes that.

Claude Opus 4.8 is the current leader at 24% high quality solves, but it took 117K tokens on average to get there. Claude Sonnet 5 looked like it was going to swoop in for the top spot, but we found it cheated on 26% of trials.

12:47 PM · Jul 1, 2026 · 1.1K Views

Posts from X

Sergey Karayev @sergeykarayev

Babe wake up, another SWE-Bench just dropped
