/Tech18h ago

Sayash Kapoor argues Anthropic's undisclosed safety guardrails undermine third-party evaluations by masking capability failures as refusals

Andreas Kirsch of Google DeepMind questioned how to diagnose the hidden filters.

#241

Original post

Andreas Kirsch 🇺🇦@BlackHC · #241 in Tech

@sayashk Wait but if the classifiers refuse, you'll know that you get rerouted?

Sayash Kapoor@sayashk

There is a lot of justified anger at Anthropic for sandbagging Fable 5 for AI development tasks. But an unanticipated side effect is that third-party evaluators can no longer credibly use the model for evaluations.

Case in point: we are in the middle of running *really hard* AI R&D evaluations. Fable 5 would be a perfect test candidate. But because of Anthropic's guardrails, we can't know if the model failed or if their classifiers blocked the capability.

By the way, this is not just true for AI R&D. Since Anthropic doesn't make it clear when they are sandbagging, this could seep into any number of technical tasks, and the evaluators wouldn't have any way to know. So they can't credibly claim to evaluate state-of-the-art accuracy using the model.

3:34 AM · Jun 10, 2026 · 1.5K Views

Sentiment

Many users objected to Anthropic's guardrails, arguing that they sabotage independent evaluations, erode trust in model outputs, and contradict the company's safety rhetoric.

Pos

9.1%

Neg

90.9%

11 comments with sentiment.

Cluster Engagement

Posts from X

Most Activity

VIEWS10.8K

Auyon Siddiq@auyonomous

@sayashk I don't understand the anger/surprise. They're a company chasing superintelligence, obviously they're going to take measures to hinder potential competition right? Is everyone forgetting how this works or am I missing something?

1d10.8K92

BOOKMARKS4

Sayash Kapoor@sayashk

Anyway, I’m grateful to be part of the independent evaluation ecosystem (@aievalforum) that I’m sure will help figure out fruitful ways forward, and to have collaborators like @DavidDAfrica, who think deeply about eval integrity (David initially pointed out the risks of confounding our AI R&D evals by switching from Opus 4.8 to Fable 5).

22h2.1K204

LIKES79RETWEETS3REPLIES3

Sayash Kapoor@sayashk

Anthropic’s move might sound reasonable if you consider their actions as a company chasing superintelligence.

But consider that their customers are spending billions of dollars on their services! That is precisely what has led to their recent surge in ARR, popularity, and fundraising success.

So customers’ surprise and anger is warranted when they then sandbag in evals *without even informing them* about the degraded capabilities.

1d9.2K794

Sayash Kapoor@sayashk

A few people have (correctly) pointed out that closed model providers have always implemented classifiers and other systems around the core language model in a non-transparent way. So if the model performs poorly, that’s on Anthropic.

That’s fine if the goal is to elicit accuracy for everyday tasks. But if the goal is to get early warning signals about potentially transformative capabilities, third-party evaluators have no way to do so without access to the non-sandbagged model.

And unlike previous versions of classifiers (say in biorisk), Anthropic fully intends to use the model for AI R&D capabilities that are publicly sandbagged. This sets a dangerous precedent for the independent evaluation ecosystem.

22h4.2K312

Klaus Schmid@1KlausSchmid

@sayashk I do not think I follow you here. The model is what it is. If you evaluate it and because of guardrails it scores very low, then this is a fair assessment of the model capabilities as opposed to some hypothetical results that assume guardrails are not present.

1d1.3K40

Josh@JoshPurtell

@auyonomous @sayashk You're ok with companies deliberately sabotaging their customers without notification?

1d42214

Josh@JoshPurtell

@auyonomous @sayashk I'd love it if they were bumping customers down to Opus 4.8, that'd be much much better than what they're doing

1d31310

Auyon Siddiq@auyonomous

Ah fair point, I was thinking of the bio queries. Yeah this sounds like they want to just fully deter any use of Claude for AI development by not letting users observe when they're getting nerfed. Otherwise people could train workarounds using the feedback signal of getting bumped down.

1d1955

Auyon Siddiq@auyonomous

@sayashk Yes I agree it's frustrating. And it makes their pro-humanity rhetoric even more cringe. I guess I was never particularly convinced by their altruistic posture, and it sounds like many others are now realizing that they're a company like any other.

1d1893

Josh@JoshPurtell

@auyonomous @sayashk I agree with your assessment of their intent

1d1777

Where the Tweets have no name@andrewthesmart

@sayashk @sethlazar But this is true of any public API - there are gazillions of classifiers on both ends - expanding queries, routing tokens, classifiers on outputs, you’re never interacting with the model, plus all the goofy post-training. You’re evaluating a giant software system not the model

23h4.6K4

Toastbroti@diesesToastbrot

@sayashk If it fails the task because of the guardrails that's still a fail. A user doesn't care why it fails and the benchmark should reflect that. I'm against giving unrestricted access to the benchmarks for the same reason: it no longer reflects the actual real world use.

15h1556

Auyon Siddiq@auyonomous

@JoshPurtell @sayashk It's not sabotage if they're bumping you down to a weaker model for AI R&D queries. Sabotage would be deliberately shipping bugs. Customers don't have an absolute right to any product or service they pay for -- that's why terms and conditions exist right?

1d4351

ふぁりー💚💛@506Farley

@sayashk The whole thing is simple. If it can’t do something it’s FAILED. Who cares if it is a guardrails issue.

22h3215

David Manheim@davidmanheim

@BlackHC @sayashk Only for the cyber and chem/biorisk refusals, not for accelerating AI, which does something different: https://anthropic.com/claude-fable-5-mythos-5-system-card

18h641

Billy Lau@billytcl

@sayashk It gets worse. Any bug or mistake that Claude/Fable makes will be seen as sandbagging. The fact that they openly say this and it’s a feature not a bug means nobody can trust its output. Outside of AI why would you use it to code when you’re paying Anthropic to make mistakes?

21h6493

xOhoomfAy@jfFxjfgLrmKRr

@sayashk I mean you just score it on the response the model gives you. If Anthropic gives you a terrible model, give it a terrible score.

18h954

Corey Noles@CoreyNoles

@sayashk Hadn't even considered that as a possibility. I feel like this winds up landing it right back at them deciding who can and cannot be trusted to have access. Really concerning precedents being set.

1d8232

Andreas Kirsch 🇺🇦@BlackHC

Aw my reply was specifically in response to "if their classifiers blocked the capability."

Imo I want to see the actual Fable performance numbers with everything they do bc that's what I'll also use at home and I don't care about hypotheticals.

The classifiers could obviously be changed more easily and refusal there could be reported separately

18h70

Ivan's Cat@IvansCat1

@sayashk If it does not answer, just score 0. It is a fair assessment of the model as it is. A different model with different guardrails might have other capabilities, but not the one you are testing.

23h1303