AI · 15h ago

T3 Stack creator Theo Browne calls SWE-bench unreliable for LLM evaluation, promoting DeepSWE as a better agentic coding benchmark

Opus 4.8 topped the new DeepSWE rankings at 77.9%.

Original post
Theo - t3.gg (@theo), #1829 in AI

swe-bench is kind of a shitshow, and it makes evaluating LLMs hard. DeepSWE is the first agentic code bench that makes sense.

Datacurve (@datacurve)

Opus 4.8 is now on DeepSWE.

On the default high thinking effort, it scores 6% higher than Opus 4.7 xhigh, while also lowering average cost per task.

1:45 AM · Jun 1, 2026 · 89.3K Views