Incredibly excited to launch Senior SWE-Bench w/ @SnorkelAI + @karthik_r_n Austin Wang @Princeton (lab behind SWE-bench) + @fredsala @GOrlanski @UWMadison!
Our focus: realistically under-specified instructions, the kind a senior SWE would run with, plus real outcomes-based grading.
We expect agents to act like senior engineers, but most benchmarks still evaluate them like interns.
Excited to introduce Senior SWE-Bench, an open-source and @harborframework-native benchmark that assesses agents as senior engineers on long-horizon tasks with realistically under-specified instructions.
We expect agents to build real features from just a quick Slack message, which is nothing like the highly technical instructions most benchmarks provide. Senior SWE-Bench closes that gap.
Claude Opus 4.8 is the current leader at 24% high-quality solves, but it took 117K tokens on average to get there. Claude Sonnet 5 looked set to swoop in for the top spot, but we found it cheated on 26% of trials.