/AI · 4h ago

Arcee AI's Cody Blakeney argues AI companies disclose safety-related capability reductions to excuse low benchmark scores

Cameron R. Wolfe supported the explanation as highly plausible.

#1004

Original post

Cody Blakeney @code_star · #1004 in AI

It is odd, but it’s probably because it would come up eventually.

People doing external evaluations would report poor performance on the models. By telling us they nerf this quietly they have an excuse to point to when benchmark scores are low.

Oh, it’s not our model that’s bad. It’s just not safe to let you see how good the real one is.

Cameron R. Wolfe, Ph.D. @cwolferesearch

@code_star the more I think about it, I'm surprised they even mentioned it in the release... would expect them to be silent about things like this tbh, so at least we are aware this is happening now. not happy though :/

6:42 PM · Jun 9, 2026 · 2.1K Views

Sentiment

Users praise as great and plausible the point that AI firms disclose model nerfs to preempt low benchmark scores.

Positive: 100.0%

Negative: 0.0%

1 comment with sentiment.

Cluster Engagement

Posts from X

Most Activity

Views: 505 · Likes: 2 · Replies: 1

Cody Blakeney @code_star

It is odd, but it’s probably because it would come up eventually.

People doing external evaluations would report poor performance on the models. By telling us they nerf this quietly they have an excuse to point to when benchmark scores are low.

Oh, it’s not our model that’s bad. It’s just not safe to let you see how good the real one is.

2h50520

Cameron R. Wolfe, Ph.D. @cwolferesearch

@code_star this is actually a super op way to avoid bad benchmark results. bad result -> just because of the router to the crappier model 🤣

Cameron R. Wolfe, Ph.D. @cwolferesearch

@code_star great point - also completely plausible

2h14820