/AI · 15h ago

T3 Stack creator Theo Browne calls SWE-bench unreliable for LLM evaluation, promoting DeepSWE as a better agentic coding benchmark

Claude-3-Opus topped the new DeepSWE rankings at 77.9%.

Original post

Theo - t3.gg (@theo) · #1829 in AI

swe-bench is kind of a shitshow, and it makes evaluating LLMs hard. DeepSWE is the first agentic code bench that makes sense.

Datacurve (@datacurve)

Opus 4.8 is now on DeepSWE.

On the default high thinking effort, it scores 6% higher than Opus 4.7 xhigh, while also lowering average cost per task.

1:45 AM · Jun 1, 2026 · 89.3K Views

Sentiment

Many users praised DeepSWE for independently measuring realistic engineering issues like long dependencies and flaky tests, while others called the benchmark untrustworthy due to suspected VC funding bias and flawed harnesses.

Positive: 57.7%
Negative: 42.3%

27 comments with sentiment.
