Abundant launches SWE-Marathon coding benchmark, where Claude Opus 4.8 leads frontier models with a 26% resolution rate

Story Brief

The long-horizon tasks include building Slack from scratch.

Commentary on X

@bcherny Claude has really turned to crap. I’ve switched to Codex for now while still subscribed, hoping Mythos will improve it. Today I revisited a couple of simple updates, and it hallucinated and tried to change things I never asked for. Opus 4.8 cannot be trusted.
