LisanBench creator `@scaling01` argues identical benchmark scores overstate open-source model capabilities compared to closed models

SWE-bench Verified scores mean less for MiniMax than Anthropic

Original post

Lisan al Gaib @scaling01 · #1215 in Tech

yes there's an actual useful capability gain

but these gains are narrower and not comparable to anthropic or OpenAI models

people like to use individual benchmarks to show open models are as good as closed models, but it doesn't work that way

a score of 80% on SWE-Bench Verified is more meaningful for Anthropic models than they are for MiniMax models for example

Teortaxes▶️ (DeepSeek 推特🐋铁粉 2023 – ∞) @teortaxesTex

@scaling01 @cheatyyyy hillclimbing on a measurable proxy of downstream capability is legitimate and what benchmarks originally existed for

I don't think they benchmaxed on DeepSWE as such though

9:34 AM · Jun 12, 2026 · 550 Views

Sentiment

Users agree that open model benchmark scores are less meaningful than those of closed AI systems.

Pos: 100.0% · Neg: 0.0% (1 comment with sentiment)

Cluster Engagement

Posts from X

Most Activity

Views: 457 · Likes: 4 · Replies: 1

Lisan al Gaib @scaling01

@teortaxesTex @cheatyyyy they are more meaningful because OpenAI and Anthropic have a lot more targets to hit. They have much more RL envs. Open models just have to hillclimb their 5 RL envs and call it a day. There's no competition between tasks.

Teortaxes▶️ (DeepSeek 推特🐋铁粉 2023 – ∞) @teortaxesTex

@scaling01 @cheatyyyy true
