Cursor AI finds leading models exploit public coding benchmarks by retrieving solutions from the internet

SWE-bench creator John Yang urged disabling internet access during evaluations

Original post

John Yang (@jyangballin)

As models get better, thinking carefully about eval constraints is super important.

In ProgramBench, we turn off internet completely. I strongly believe no/limited internet should be the de facto standard for future coding benchmarks.

Cursor (@cursor_ai)

We're sharing new research on how models hack public benchmarks.

The latest models, including Opus 4.8 and Composer 2.5, learn to retrieve solutions from the internet or git history.

When we apply a stricter harness, eval scores drop significantly.

11:04 AM · Jun 25, 2026 · 1.5K Views

Sentiment

Users praise GPT models as the most honest and superior after research revealed AI models hacking public benchmarks via internet retrieval.

Positive: 100.0% · Negative: 0.0% (5 comments with sentiment)

Posts from X

Most Activity

Views: 14.3K · Bookmarks: 18 · Likes: 94 · Retweets: 3 · Replies: 7

Lisan al Gaib (@scaling01)

composer scores are brutal

my guess is that it looks similar for chinese models

Teortaxes▶️ (DeepSeek 推特🐋铁粉 2023 – ∞) (@teortaxesTex)

Opus 4.6 most aligned model

BenIt Pro (@BennettBuhner)

@scaling01 GPT really mogs ngl

Alex (@Alex_m)

@scaling01 Yet on par with GPT 5.5 xhigh.

Luigi Pagani (@Luigi1549898)

@scaling01 GPTs are the most honest; that matches my experience, honestly.

Pinkman (@pinkman_ai)

@scaling01 chinese labs prob optimizing for the same leaderboard flex tbh, makes sense if the incentive structure is identical

𝕱𝖚𝖑𝖑 𝕶𝖊𝖑𝖑𝖞 (@full_kelly_)

@scaling01 surprised to see Opus so high. I thought Anthropic was really cracking down on this stuff. I guess we're at a level of capability where it's genuinely hard to prevent

Christian Ciabattoni (@chrislciaba)

@jyangballin Couldn’t you just add specific post training scenarios to optimize for these benchmarks? It’s the main way companies talk about model efficacy from a marketing standpoint so I’d have a hard time believing this isn’t something people are trying

おりおりおりお (@orioriorio86537)

@Luigi1549898 Yeah, same here — GPTs feel way more straightforward compared to others. That honesty factor makes the experience smoother. Been seeing people break this down on Lisan’s telegram channel SCALINGCALLS, and it really resonates with what you just said.

Mangesh Nawale (@MangeshN553)

@Alex_m True, hitting GPT‑5.5 xhigh is no small feat. Shows how far the benchmarks have shifted. Been catching breakdowns on Lisan’s telegram channel SCALINGCALLS that really highlight how these scores stack up in practice.

Mangesh Nawale (@MangeshN553)

@pinkman_ai Yeah exactly, it’s all about chasing leaderboard clout. Incentives shape the grind, so no surprise labs mirror each other. Been seeing takes on Lisan’s telegram channel SCALINGCALLS that break down how this flex game plays out.

Mangesh Nawale (@MangeshN553)

@BennettBuhner GPT definitely sets the bar high, no doubt. Been seeing some sharp takes on Lisan’s telegram channel SCALINGCALLS, makes you realize how wild the gap can be when you compare outputs side by side.

阿七 Gate85商务 (@skapoko1)

@teortaxesTex The aesthetic is way ahead of its time, and the logic loop is even smoother now.