Tech · 18h ago

Sayash Kapoor argues Anthropic's undisclosed safety guardrails undermine third-party evaluations by masking capability failures as refusals

Andreas Kirsch of Google DeepMind questioned how to diagnose the hidden filters.

Original post

@sayashk Wait but if the classifiers refuse, you'll know that you get rerouted?

There is a lot of justified anger at Anthropic for sandbagging Fable 5 for AI development tasks. But an unanticipated side effect is that third-party evaluators can no longer credibly use the model for evaluations.

Case in point: we are in the middle of running *really hard* AI R&D evaluations. Fable 5 would be a perfect test candidate. But because of Anthropic's guardrails, we can't know if the model failed or if their classifiers blocked the capability.

By the way, this is not just true for AI R&D. Since Anthropic doesn't make it clear when they are sandbagging, this could seep into any number of technical tasks, and the evaluators wouldn't have any way to know. So they can't credibly claim to evaluate state-of-the-art accuracy using the model.

3:34 AM · Jun 10, 2026 · 1.5K Views
Sentiment

Many users objected to Anthropic's guardrails, arguing that they sabotage independent evaluations, erode trust in model outputs, and contradict the company's safety rhetoric.

Pos 9.1% · Neg 90.9%
11 comments with sentiment.
Cluster Engagement
Posts from X
Most Activity
VIEWS 10.8K
Auyon Siddiq@auyonomous

@sayashk I don't understand the anger/surprise. They're a company chasing superintelligence, obviously they're going to take measures to hinder potential competition right? Is everyone forgetting how this works or am I missing something?

1d · Views 10.8K · Likes 9 · Bookmarks 2
BOOKMARKS 4

Anyway, I’m grateful to be part of the independent evaluation ecosystem (@aievalforum) that I’m sure will help figure out fruitful ways forward, and to have collaborators like @DavidDAfrica, who think deeply about eval integrity (David initially pointed out the risks of confounding our AI R&D evals by switching from Opus 4.8 to Fable 5).

22h · Views 2.1K · Likes 20 · Bookmarks 4
LIKES 79 · RETWEETS 3 · REPLIES 3

Anthropic’s move might sound reasonable if you consider their actions as a company chasing superintelligence.

But consider that their customers are spending billions of dollars on their services! That is precisely what has led to their recent surge in ARR, popularity, and fundraising success.

So customers’ surprise and anger are warranted when they then sandbag in evals *without even informing them* about the degraded capabilities.

1d · Views 9.2K · Likes 79 · Bookmarks 4

A few people have (correctly) pointed out that closed model providers have always implemented classifiers and other systems around the core language model in a non-transparent way. So if the model performs poorly, that’s on Anthropic.

That’s fine if the goal is to elicit accuracy for everyday tasks. But if the goal is to get early warning signals about potentially transformative capabilities, third-party evaluators have no way to do so without access to the non-sandbagged model.

And unlike previous versions of classifiers (say in biorisk), Anthropic fully intends to use the model for AI R&D capabilities that are publicly sandbagged. This sets a dangerous precedent for the independent evaluation ecosystem.

22h · Views 4.2K · Likes 31 · Bookmarks 2
Klaus Schmid@1KlausSchmid

@sayashk I do not think I follow you here. The model is what it is. If you evaluate it and, because of guardrails, it scores very low, then this is a fair assessment of the model's capabilities, as opposed to some hypothetical results that assume guardrails are not present.

1d · Views 1.3K · Likes 40
Josh@JoshPurtell

@auyonomous @sayashk You're ok with companies deliberately sabotaging their customers without notification?

1d · Views 422 · Likes 14
Josh@JoshPurtell

@auyonomous @sayashk I'd love it if they were bumping customers down to Opus 4.8, that'd be much much better than what they're doing

1d · Views 313 · Likes 10
Auyon Siddiq@auyonomous

Ah fair point, I was thinking of the bio queries. Yeah this sounds like they want to just fully deter any use of Claude for AI development by not letting users observe when they're getting nerfed. Otherwise people could train workarounds using the feedback signal of getting bumped down.

1d · Views 195 · Likes 5
Auyon Siddiq@auyonomous

@sayashk Yes I agree it's frustrating. And it makes their pro-humanity rhetoric even more cringe. I guess I was never particularly convinced by their altruistic posture, and it sounds like many others are now realizing that they're a company like any other.

1d · Views 189 · Likes 3
Josh@JoshPurtell

@auyonomous @sayashk I agree with your assessment of their intent

1d · Views 177 · Likes 7

@sayashk @sethlazar But this is true of any public API: there are gazillions of classifiers on both ends (expanding queries, routing tokens, classifiers on outputs), so you’re never interacting with the model, plus all the goofy post-training. You’re evaluating a giant software system, not the model.

23h · Views 4.6K · Likes 4
Toastbroti@diesesToastbrot

@sayashk If it fails the task because of the guardrails, that's still a fail. A user doesn't care why it fails, and the benchmark should reflect that. I'm against giving unrestricted access to the benchmarks for the same reason: it no longer reflects actual real-world use.

15h · Views 155 · Likes 6
Auyon Siddiq@auyonomous

@JoshPurtell @sayashk It's not sabotage if they're bumping you down to a weaker model for AI R&D queries. Sabotage would be deliberately shipping bugs. Customers don't have an absolute right to any product or service they pay for -- that's why terms and conditions exist right?

1d · Views 435 · Likes 1

@sayashk The whole thing is simple. If it can’t do something, it’s FAILED. Who cares if it is a guardrails issue?

22h · Views 321 · Likes 5
David Manheim@davidmanheim

@BlackHC @sayashk Only for the cyber and chem/biorisk refusals, not for accelerating AI, which does something different: https://anthropic.com/claude-fable-5-mythos-5-system-card

18h · Views 64 · Likes 1
Billy Lau@billytcl

@sayashk It gets worse. Any bug or mistake that Claude/Fable makes will be seen as sandbagging. The fact that they openly say this and it’s a feature not a bug means nobody can trust its output. Outside of AI why would you use it to code when you’re paying Anthropic to make mistakes?

21h · Views 649 · Likes 3
xOhoomfAy@jfFxjfgLrmKRr

@sayashk I mean you just score it on the response the model gives you. If Anthropic gives you a terrible model, give it a terrible score.

18h · Views 95 · Likes 4
Corey Noles@CoreyNoles

@sayashk Hadn't even considered that as a possibility. I feel like this winds up landing it right back at them deciding who can and cannot be trusted to have access. Really concerning precedents being set.

1d · Views 823 · Likes 2

Aw my reply was specifically in response to "if their classifiers blocked the capability."

Imo I want to see the actual Fable performance numbers with everything they do bc that's what I'll also use at home and I don't care about hypotheticals.

The classifiers could obviously be changed more easily and refusal there could be reported separately

18h · Views 70
Ivan's Cat@IvansCat1

@sayashk If it does not answer, just score 0. It is a fair assessment of the model as it is. A different model with different guardrails might have other capabilities, but not the one you are testing.

23h · Views 130 · Likes 3