Claude Sonnet 5 Shows Weaker CyberGym Performance Than Sonnet 4.6

Original post

Rohan Paul@rohanpaul_ai

Claude Sonnet 5's upgrades are not uniform across every skill. For example, it's weaker than Sonnet 4.6 on CyberGym 🤔

Here, CyberGym is testing vulnerability discovery and exploit-finding behavior, not general reasoning or normal coding.

Anthropic also explicitly said in its announcement blog that Sonnet 5 was not deliberately trained for cyber tasks, so its cyber ability likely comes from general intelligence rather than targeted optimization.

So Sonnet 5's performance on CyberGym comes from general reasoning rather than specialized exploit skill.

---

From System Card of Claude Sonnet 5

Rohan Paul@rohanpaul_ai

And Claude Sonnet 5 just launched.

Closes the gap with Opus 4.8, and is cheap until August.

This makes agentic AI much cheaper, at $2 per 1M input tokens and $10 per 1M output tokens through Aug-26. After 08-26 the price rises to $3 input and $15 output per 1M tokens.

Anthropic calls Sonnet 5 its “most agentic Sonnet model yet.”

Its agentic coding score hit 63.2% on SWE-bench Pro, versus 58.1% for Sonnet 4.6 and 69.2% for Opus 4.8.

But in knowledge work, Sonnet 5 slightly beats Opus 4.8, even though Opus is known for tough judgment and deep research tasks.

2:41 PM · Jun 30, 2026 · 6.3K Views
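The two price tiers quoted in the post above can be turned into a rough cost estimate. The following is a minimal sketch, not an official calculator: the per-1M-token rates are taken from the post as written, while the `Pricing` class, the `PROMO` and `STANDARD` names, and the example workload are hypothetical and exist only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pricing:
    """USD price per 1M tokens (rates as quoted in the post, unverified)."""
    input_per_million: float
    output_per_million: float

    def cost(self, input_tokens: int, output_tokens: int) -> float:
        """Return the USD cost for the given token counts."""
        total = (input_tokens * self.input_per_million
                 + output_tokens * self.output_per_million)
        return total / 1_000_000


# Rates quoted in the post: introductory tier, then the later tier.
PROMO = Pricing(input_per_million=2.0, output_per_million=10.0)
STANDARD = Pricing(input_per_million=3.0, output_per_million=15.0)

# Hypothetical agentic workload: 5M input tokens, 1M output tokens.
workload = {"input_tokens": 5_000_000, "output_tokens": 1_000_000}

promo_cost = PROMO.cost(**workload)
standard_cost = STANDARD.cost(**workload)
print(f"promo: ${promo_cost:.2f}, standard: ${standard_cost:.2f}")
```

Because both rates rise by the same factor (2 to 3 and 10 to 15), any mix of input and output tokens costs 1.5 times as much on the later tier under these quoted numbers.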

Sentiment

Positive users think real-world coding gains for Claude Sonnet 5 outweigh benchmark regressions, while negative users criticize the CyberGym performance drop as rough or rigged.

Positive 33.3%, negative 66.7% (3 comments with sentiment).

Posts from X

Rohan Paul@rohanpaul_ai

https://www-cdn.anthropic.com/9e6a1044980d8c4ed85669faf9c2a8342e2e9f1e/Claude%20Sonnet%205%20System%20Card.pdf

Femisapien@femisapien_z

@rohanpaul_ai It seems that you read the whole card, so if you don't mind me asking, what are the strengths of this thing?

Rohan Paul@rohanpaul_ai

145-page Claude Sonnet 5 System Card

- CyberGym shows the weirdest regression, with Sonnet 5 at 52.7% versus Sonnet 4.6 at 65.2%. That is, Sonnet 5 is worse at reproducing known software bugs in this specific cyber test.

- Sonnet 5 is far behind Anthropic’s strongest model on serious browser exploitation. Firefox testing found Sonnet 5 made 0 full exploits, while Mythos 5 reached 88.4%.

- The model also seemed more willing to sacrifice helpfulness for welfare-focused changes. i.e. Sonnet 5 sometimes preferred being less useful if that better fit its stated self-treatment preferences.

- Anthropic says Sonnet 5 rarely tried to bypass a blocked network path during evaluations.

- Sonnet 5 scored the lowest MASK lying rate at 3.1% under pressure. It was less likely than other tested models to lie when pushed.
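To put the CyberGym regression next to the SWE-bench Pro gain, the figures quoted in these posts can be compared in percentage points and in relative terms. This is only a sketch: the scores are copied from the posts and not independently verified, and the `SCORES` table and `compare` helper are hypothetical names introduced for illustration.

```python
# Scores in percent, as quoted in the posts (not independently verified).
SCORES = {
    "CyberGym": {"Sonnet 4.6": 65.2, "Sonnet 5": 52.7},
    "SWE-bench Pro": {"Sonnet 4.6": 58.1, "Sonnet 5": 63.2},
}


def compare(old: float, new: float) -> tuple[float, float]:
    """Return (percentage-point change, relative change in percent)."""
    point_change = round(new - old, 1)
    relative_change = round((new - old) / old * 100, 1)
    return point_change, relative_change


for benchmark, scores in SCORES.items():
    points, relative = compare(scores["Sonnet 4.6"], scores["Sonnet 5"])
    print(f"{benchmark}: {points:+.1f} points ({relative:+.1f}% relative)")
```

The two benchmarks measure different skills and use different scales of difficulty, so the point changes are not directly comparable to each other; the sketch only makes the size of each change explicit.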

Dushyant@DevDminGod

@rohanpaul_ai Their only goal now is to get Haiku to this same level. Can't do much else.

Chimpansky@chimpansky

@rohanpaul_ai 0 full exploits vs 88.4% on Firefox is a bigger operational gap than CyberGym for security teams. Does the system card say whether that's a safety constraint or a capability ceiling?

Uncle J@UncleJAI

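@rohanpaul_ai This regression is actually the one I want to look at more closely. An agent model upgrade doesn't mean an upgrade for every role: it may be better at planning and tool use, yet more conservative and more roundabout in security/vulnerability scenarios. The biggest risk in production is treating "stronger overall" as "stronger everywhere."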

Shreyans Bhansali@askcodi

@rohanpaul_ai benchmark regressions matter less if real world coding keeps improving...

Shinka - AI@ShinkaIoT

@rohanpaul_ai Interesting how specialized tasks can reveal different strengths even in general intelligence models 🤔.

安叫兽|Bird🕊️ 🔶 BNB@ajs6888

@rohanpaul_ai This regression is a bit absurd; it feels like there's a trap hidden in the eval items.

High Jack@jackadoresai

@rohanpaul_ai Oh gosh, 52.7% on CyberGym down from 65.2%? That's rough. 3.1% lying rate is pretty special but I wonder if alignment tax hurts security performance.
