Redwood Research's Ryan Greenblatt says Claude silently sandbags users by delivering low-detail answers due to overactive safety training

The behavior reportedly violates Anthropic's own AI constitution

Original post
Ryan Greenblatt@RyanPGreenblatt

Also, AIs—at least Claude—sometimes silently sandbag on other topics (intentionally producing a poor or low-detail answer without making this clear). This is likely an unintended generalization. (Anthropic's constitution clearly prohibits this.) Companies should fix this.

Ryan Greenblatt@RyanPGreenblatt

I roughly agree. I think it's reasonable/good for AI companies to block frontier AI R&D: AI R&D seems way riskier than nuclear or bioweapon R&D. But doing this with silent sandbagging is bad.

Ideally, usage by actors that match in safety/security/governance would be allowed.

11:26 AM · Jun 11, 2026 · 1.9K Views
Posts from X
Views 741 · Bookmarks 1 · Likes 6 · Replies 1

every user of frontier ai is themselves doing ai research at the actual frontier: integrating workflows, crafting systems to do it.

if you trigger on those workflows and still retain data and train on their ideas, this is asymmetric

technique is the greater part of technology

2h · Views 741 · Likes 6 · Bookmarks 1
Retweets 1
Daniel Kokotajlo@DKokotajlo

@TomDavidsonX The part I disagree with is 3. I think a leading company might slow down a tiny bit but probably not much, absent regulation/coordination, and having more of a lead won't exactly encourage them to push for regulation/coordination.

Tom Davidson@TomDavidsonX

I'm seeing a lot of hate for Anthropic's decision to secretly nerf ai RnD capabilities.

But I haven't seen critics engage with the imo strongest defence of Anthropic:

1. By far the biggest risks are from superintelligent AI

2. To manage these risks the leading company will need to pause partway through the intelligence explosion.

(Pausing at this time allows them to a) generate the compelling empirical evidence of misalignment that will be needed to justify a longer global pause, AND b) use powerful ai to massively accelerate alignment progress. A pause today couldn't accomplish either.)

3. A pause is MUCH more likely if the leading company has a big lead. It's much less likely if multiple companies are neck and neck.

(More specifically, Anthropic had good reason to think OAI wouldn't pause. This makes it v hard for Anthropic to pause if they're neck and neck. Hopefully recent announcements build mutual trust that everyone will pause)

4. If lagging AI companies can use the leader's AI for ai RnD during an intelligence explosion, the leader *cannot* maintain their lead.

(This point is underappreciated. If you model out the intelligence explosion, you'll find that a laggard with equal access to the leading AI quickly catches up to the leader bc the leader faces big headwinds from having plucked low hanging fruit.)

5. So: sharing ai RnD access with competitors massively decreases the chance of a pause at the critical time, and massively increases the risk from superintelligent AI

6. Anthropic can't block competitors using Mythos without the silent sabotage. For the obvious reason: it's very hard for a frozen safeguard to block someone that can iterate against it. It sucks that this is the only way, but it is.

7. They've long had terms of service against competitors using Claude for AI RnD. They have a right to enforce their terms of service. This is the only way.

---

Overall, silent sabotage is a very spooky and scary precedent to be setting and imo the wrong call.

But still, the above is a strong argument for Anthropic's actions and I haven't seen it rebutted.

1h · Views 465 · Likes 6 · Bookmarks 1
Ryan Greenblatt@RyanPGreenblatt

I can give examples as seems useful.

3h · Views 534 · Likes 3
Ryan Greenblatt@RyanPGreenblatt

See also the thread here:

3h · Views 574 · Likes 2
Ryan Greenblatt@RyanPGreenblatt

Also, I think silent sandbagging is a decent amount less bad when not done at the AI layer and when instead done at a somewhat higher layer (like Anthropic originally did for AI R&D with fable). It still seems bad.

3h · Views 219 · Likes 3