(btw i know PEFT is technically training the model, but they probably don't use PEFT to limit the cyber capabilities of fable compared to mythos)
glad anthropic walked this back and will now tell users when capabilities are nerfed
my biggest concern was that this was being hidden from users, and the paranoia that would have created. i still think part of that paranoia will remain as people realize that even as a good actor you won't always have access to the best model, which is why open models and open research are critical
@drfeifei, @sriramk and many others say it much better than me, but i consider it very important for our civilization that good faith researchers get access to the best AI, and that at least part of this research happens in the open and not only inside a few closed labs (not talking only about ai research here)
going forward, i REALLY hope that anthropic (and other labs) will be transparent when they nerf a model in certain fields, whether it's at inference time (~PEFT/steering, previous safeguard) or at training time (training against, mythos vs fable)
i also hope we will see more work and transparency on evaluating models' capabilities to do ai research, both autonomy and raw capabilities. right now this is very light even in anthropic and oai system cards. you can't treat this as a first-class risk and only report weak evals to the public. we also need strong third party actors here







