/Tech · 2h ago

Policy Expert Warns Silent Model Switching Harms AI Safety Research

Original post

Miles Brundage@Miles_Brundage · #27 in Tech

I tentatively think that silent model switching is never a good idea.

It's horrible for research (including safety research), among many other effects

12:07 PM · Jun 9, 2026 · 4K Views

Sentiment

Users criticize silent model switching in frontier AI models as disrespectful and inappropriate, arguing companies should explicitly notify users instead of making unannounced changes.

Pos: 0.0%

Neg: 100.0%

3 comments with sentiment.

Cluster Engagement

Posts from X

Views: 1.4K · Likes: 19 · Replies: 1

Miles Brundage@Miles_Brundage

Prompted by the Fable "nerfing on frontier AI development related queries" stuff but the point is more general...

I have criticized OAI many times for silent A/B testing, which I think is inappropriate for such a critical technology

Miles Brundage@Miles_Brundage

That doesn't mean Ant + others should just sit there and tolerate abuse.

There is a large action space, including throttling + issuing warnings, investigating the abuse, etc.

Miles Brundage@Miles_Brundage

It also means you don't get a feedback signal on false positives - people can't complain if they don't know it's happening.

Miles Brundage@Miles_Brundage

@BlackHC @yong_zhengxin One might choose to call it something other than model-switching (PET, steering vectors... sounds like effectively model switching to me, but anyway)... point is, it is a silent degradation

Andreas Kirsch 🇺🇦@BlackHC

@Miles_Brundage @yong_zhengxin It is not switching though. Still using Fable but sandbagging via prompt injection?

Miles Brundage@Miles_Brundage

@yong_zhengxin They said explicitly that it's silent for this case, though not for cyber and bio

Miles Brundage@Miles_Brundage

@yong_zhengxin (though this is distinct from getting a hard refusal, which obviously is something you can notice)

Yong Zheng-Xin@yong_zhengxin

@Miles_Brundage yea i think for the frontier LLM research, the nerf is silent (i think the intervention is through PEFT, steering, etc.)

For malicious use such as CBRN and Cyber, where the safeguard is through model switching, that one is explicit that the response is coming from Opus 4.8

Yong Zheng-Xin@yong_zhengxin

@Miles_Brundage i don’t think it is silent switching. I think it’s shown to the user that the result returned is now from Opus 4.8 (but i think the one for LLM research switch is silent).

Now the problem is that we now miss the opportunity to stress test the model itself.

Miles Brundage@Miles_Brundage

@yong_zhengxin See:

Jake Halloran@jakehalloran1

@Miles_Brundage yeah agreed. if they want to guard against it, power to them, that's their right, but doing so silently is bleh at best

just tell the user no

Miles Brundage@Miles_Brundage

@BlackHC @yong_zhengxin Not sure I follow what point you're trying to make. Sounded like you were defending the [model/system/whatever] switching thing, but now I am not sure

Andreas Kirsch 🇺🇦@BlackHC

@Miles_Brundage @yong_zhengxin I guess that's why Fable and Mythos are separate offerings because one can simply view Fable as the whole system (incl steering vectors etc)? Obv this won't allow valid inferences for Mythos

Yong Zheng-Xin@yong_zhengxin

i think the threat model for silent nerfing is different from the explicit model switching. my understanding is that for classic refusal, they just default to Opus 4.8 (i’d not be surprised if the classifier for model switching is the same as for refusal).

for silent nerfing, i think it is just explicit sandbagging to mog certain labs from distillation attacks or using Claude to improve their models.

Invincible@InvincibleEdge

@Miles_Brundage could not disagree on this one?

keeping the user aware is baseline respect imo
