Lisan al Gaib says Claude Opus 4.8 at low effort nearly matches the SWE-Bench Pro performance of Claude Opus 4.6 at high effort.
Extra-high effort yields a 70% pass rate.
@scaling01 the default is iso-compute to 4.7 for coding tasks :pray:
we might have a GPT-5.2-xhigh situation on our hands: Opus 4.8 low thinks almost as much as Opus 4.6 high
okay might just be the benchmark
this looks much better

PB is seemingly close to being solved, so it was in fact an elicitation (and money) issue
Sadly they don't specify the harness for PB in the system card, while they do for some other benches

