Princeton's Sayash Kapoor says Anthropic's undisclosed safety filters on Fable 5 undermine the credibility of third-party AI evaluations

Anthropic’s move might sound reasonable if you view them as a company chasing superintelligence.

But consider that their customers are spending billions of dollars on their services! That is precisely what has led to their recent surge in ARR, popularity, and fundraising success.

So customers’ surprise and anger are warranted when Anthropic then sandbags in evals *without even informing customers* about the degraded capabilities.

Sayash Kapoor @sayashk

There is a lot of justified anger at Anthropic for sandbagging Fable 5 for AI development tasks. But an unanticipated side effect is that third-party evaluators can no longer credibly use the model for evaluations.

Case in point: we are in the middle of running *really hard* AI R&D evaluations. Fable 5 would be a perfect test candidate. But because of Anthropic's guardrails, we can't know if the model failed or if their classifiers blocked the capability.

By the way, this is not just true for AI R&D. Since Anthropic doesn't make it clear when they are sandbagging, this could seep into any number of technical tasks, and the evaluators wouldn't have any way to know. So they can't credibly claim to evaluate state-of-the-art accuracy using the model.

Sayash Kapoor @sayashk

Anyway, I’m grateful to be part of the independent evaluation ecosystem (@aievalforum), which I’m sure will help figure out fruitful ways forward, and to have collaborators like @DavidDAfrica, who think deeply about eval integrity (David initially pointed out the risks of confounding our AI R&D evals by switching from Opus 4.8 to Fable 5).

Sayash Kapoor @sayashk

A few people have (correctly) pointed out that closed model providers have always implemented classifiers and other systems around the core language model in a non-transparent way. So if the model performs poorly, that’s on Anthropic.

That’s fine if the goal is to elicit accuracy for everyday tasks. But if the goal is to get early warning signals about potentially transformative capabilities, third-party evaluators have no way to do so without access to the non-sandbagged model.

And unlike previous versions of classifiers (say in biorisk), Anthropic fully intends to use the model for AI R&D capabilities that are publicly sandbagged. This sets a dangerous precedent for the independent evaluation ecosystem.

Miles Brundage @Miles_Brundage

More on the research point: things are hard enough in the external evaluation world already

Auyon Siddiq @auyonomous

@sayashk I don't understand the anger/surprise. They're a company chasing superintelligence; obviously they're going to take measures to hinder potential competition, right? Is everyone forgetting how this works, or am I missing something?
