Tech · 18h ago

Princeton's Sayash Kapoor says Anthropic's undisclosed safety filters on Fable 5 undermine the credibility of third-party AI evaluations

Evaluators cannot distinguish model failures from intentional safety blocks.

Original post
Sayash Kapoor (@sayashk) · #789 in Tech

There is a lot of justified anger at Anthropic for sandbagging Fable 5 for AI development tasks. But an unanticipated side effect is that third-party evaluators can no longer credibly use the model for evaluations.

Case in point: we are in the middle of running *really hard* AI R&D evaluations. Fable 5 would be a perfect test candidate. But because of Anthropic's guardrails, we can't know if the model failed or if their classifiers blocked the capability.

By the way, this is not just true for AI R&D. Since Anthropic doesn't make it clear when they are sandbagging, this could seep into any number of technical tasks, and the evaluators wouldn't have any way to know. So they can't credibly claim to evaluate state-of-the-art accuracy using the model.

7:02 PM · Jun 9, 2026 · 92.8K Views
Sentiment

Users expressed anger and surprise at Anthropic for sandbagging its models, viewing the tactic as a way to undermine credible third-party AI evaluations.

Positive: 0.0% · Negative: 100.0% (1 comment with sentiment)
Cluster Engagement
Posts from X
Most activity: 9K views · 4 bookmarks · 77 likes · 3 replies

Anthropic’s move might sound reasonable if you consider their actions as a company chasing superintelligence.

But consider that their customers are spending billions of dollars on their services! That is precisely what has led to their recent surge in ARR, popularity, and fundraising success.

So customers’ surprise and anger are warranted when Anthropic then sandbags in evals *without even informing them* about the degraded capabilities.

17h · 9K views · 77 likes · 4 bookmarks
Retweets: 28 (original post)

18h · 92.8K views · 1.2K likes · 169 bookmarks

Anyway, I’m grateful to be part of the independent evaluation ecosystem (@aievalforum), which I’m sure will help figure out fruitful ways forward, and to have collaborators like @DavidDAfrica, who think deeply about eval integrity (David initially pointed out the risks of confounding our AI R&D evals by switching from Opus 4.8 to Fable 5).

13h · 2.1K views · 20 likes · 4 bookmarks

A few people have (correctly) pointed out that closed model providers have always implemented classifiers and other systems around the core language model in a non-transparent way. So if the model performs poorly, that’s on Anthropic.

That’s fine if the goal is to elicit accuracy for everyday tasks. But if the goal is to get early warning signals about potentially transformative capabilities, third-party evaluators have no way to do so without access to the non-sandbagged model.

And unlike previous versions of classifiers (say in biorisk), Anthropic fully intends to use the model for AI R&D capabilities that are publicly sandbagged. This sets a dangerous precedent for the independent evaluation ecosystem.

13h · 4.1K views · 30 likes · 2 bookmarks
Miles Brundage (@Miles_Brundage)

More on the research point: things are hard enough in the external evaluation world already

18h · 1K views · 10 likes · 0 bookmarks
Auyon Siddiq (@auyonomous)

@sayashk I don't understand the anger/surprise. They're a company chasing superintelligence; obviously they're going to take measures to hinder potential competition, right? Is everyone forgetting how this works, or am I missing something?

17h · 48 views · 1 like