Princeton's Sayash Kapoor says Anthropic's undisclosed safety filters on Fable 5 undermine the credibility of third-party AI evaluations

Evaluators cannot distinguish model failures from intentional safety blocks.

Original post
Sayash Kapoor @sayashk · #745 in AI

There is a lot of justified anger at Anthropic for sandbagging Fable 5 for AI development tasks. But an unanticipated side effect is that third-party evaluators can no longer credibly use the model for evaluations.

Case in point: we are in the middle of running *really hard* AI R&D evaluations. Fable 5 would be a perfect test candidate. But because of Anthropic's guardrails, we can't know if the model failed or if their classifiers blocked the capability.

By the way, this is not just true for AI R&D. Since Anthropic doesn't make it clear when they are sandbagging, this could seep into any number of technical tasks, and the evaluators wouldn't have any way to know. So they can't credibly claim to evaluate state-of-the-art accuracy using the model.

7:02 PM · Jun 9, 2026 · 2.8K Views
Sentiment

Users expressed anger and surprise at Anthropic for sandbagging its models, arguing that the undisclosed practice undermines credible third-party AI evaluations.

Positive: 0.0% · Negative: 100.0% · 1 comment with sentiment
Cluster Engagement
Posts from X

Anthropic’s move might sound reasonable if you consider their actions as a company chasing superintelligence.

But consider that their customers are spending billions of dollars on their services! That is precisely what has led to their recent surge in ARR, popularity, and fund raising success.

So customers’ surprise and anger is warranted when they then sandbag in evals *without even informing them* about the degraded capabilities.

1h · Views 792 · Likes 7 · Bookmarks 0
Miles Brundage @Miles_Brundage

More on the research point - things are hard enough in external evaluation world already

1h · Views 563 · Likes 8 · Bookmarks 0
Auyon Siddiq @auyonomous

@sayashk I don't understand the anger/surprise. They're a company chasing superintelligence, obviously they're going to take measures to hinder potential competition right? Is everyone forgetting how this works or am I missing something?

1h · Views 48 · Likes 1