I tentatively think that silent model switching is never a good idea.
It's horrible for research (including safety research), among many other effects
Users criticize silent model switching in frontier AI models as disrespectful and inappropriate, arguing companies should explicitly notify users instead of making unannounced changes.
Prompted by the Fable "nerfing on frontier AI development related queries" stuff but the point is more general...
I have criticized OAI many times for silent A/B testing, which I think is inappropriate for such a critical technology
It also means you don't get a feedback signal on false positives - people can't complain if they don't know it's happening.
That doesn't mean Ant + others should just sit there and tolerate abuse.
There is a large action space, including throttling + issuing warnings, investigating the abuse, etc.
@BlackHC @yong_zhengxin One might choose to call it something other than model-switching (PEFT, steering vectors... sounds like effectively model switching to me, but anyway)... point is, it is a silent degradation
@Miles_Brundage @yong_zhengxin It is not switching though. Still using Fable but sandbagging via prompt injection?

@yong_zhengxin They said explicitly that it's silent for this case, though not for cyber and bio

@yong_zhengxin (though this is distinct from getting a hard refusal, which obviously is something you can notice)

@Miles_Brundage yea i think for the frontier LLM research, the nerf is silent (i think the intervention is through PEFT, steering, etc.)
For malicious use such as CBRN and cyber, where the safeguard is through model switching, that one is explicit that the response is coming from Opus 4.8

@Miles_Brundage i don’t think it is silent switching. I think it’s shown to the user that the result returned is now from Opus 4.8 (but i think the switch for LLM research is silent).
The problem is that we now miss the opportunity to stress test the model itself.

@yong_zhengxin See:

@Miles_Brundage yeah agreed. if they want to guard against it, power to them, that's their right, but doing so silently is bleh at best
just tell the user no
@BlackHC @yong_zhengxin Not sure I follow what point you're trying to make. Sounded like you were defending the [model/system/whatever] switching thing, but now I am not sure
@Miles_Brundage @yong_zhengxin I guess that's why Fable and Mythos are separate offerings because one can simply view Fable as the whole system (incl steering vectors etc)? Obv this won't allow valid inferences for Mythos

i think the threat model for silent nerfing is different from the explicit model switching. my understanding is that for classic refusal, they just default to Opus 4.8 (i’d not be surprised if the classifier for model switching is the same as for refusal).
for silent nerfing, i think it is just explicit sandbagging to stop certain labs from running distillation attacks or using Claude to improve their models.

@Miles_Brundage could not disagree on this one?
keeping the user aware is baseline respect imo