ht @Moh1tAgarwal for noticing this eval in the blog post and pointing it out as an RL target
Re the Fable ML sandbagging: the model's AI research capabilities were probably trained, at least in part, on Anthropic employees' diffs on top of proprietary algos and infra.
So the IP leak is somewhat like a researcher who knows Anthropic's stack getting poached to another lab.
Anthropic's recent "When AI builds itself" post talks about a next-step eval: they snapshot a research session at the moment a human researcher made a suboptimal next-step choice, show a model only the transcript up to that point, and ask what it would do next. A hindsight-equipped LLM judge then decides whether the model's suggestion or the human's actual choice was better.
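To make the shape of that eval concrete, here's a rough sketch of how I read it. Every name here (`Snapshot`, `next_step_win_rate`, the toy proposer and judge) is my own invention for illustration, not anything from the post, and the real judge would be an LLM rather than a string match.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class Snapshot:
    """One decision point from a research session (hypothetical structure)."""
    transcript_prefix: str  # session up to the decision point; all the model sees
    human_next_step: str    # what the researcher actually did next
    hindsight: str          # what was learned afterwards; only the judge sees this


# Hypothetical interfaces: a model that proposes a next step from the prefix,
# and a judge that returns True when the model's suggestion beats the human's.
Proposer = Callable[[str], str]
Judge = Callable[[Snapshot, str], bool]


def next_step_win_rate(
    snapshots: Sequence[Snapshot], propose: Proposer, judge: Judge
) -> float:
    """Fraction of snapshots where the judge prefers the model's suggestion."""
    if not snapshots:
        raise ValueError("need at least one snapshot")
    wins = 0
    for snap in snapshots:
        suggestion = propose(snap.transcript_prefix)  # no access to hindsight
        if judge(snap, suggestion):
            wins += 1
    return wins / len(snapshots)


# Toy stand-ins so the sketch runs; a real setup would call LLMs here.
def toy_propose(prefix: str) -> str:
    return "run the ablation first" if "ablation" in prefix else "scale up"


def toy_judge(snap: Snapshot, suggestion: str) -> bool:
    # Assumption: the model "wins" if hindsight endorses its suggestion.
    return suggestion in snap.hindsight


SNAPSHOTS = [
    Snapshot("loss plateaued; ablation not yet run", "scale up",
             "in hindsight: run the ablation first"),
    Snapshot("baseline looks fine", "scale up",
             "in hindsight: check the data pipeline"),
]
RATE = next_step_win_rate(SNAPSHOTS, toy_propose, toy_judge)
```

If something like this were used for RL, the per-snapshot judge verdict would presumably be the reward signal, but that part is my guess.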
This eval seems like a very good RL target for AI R&D - one among many that could be used to have AIs emulate Anthropic researchers and their research products.
I'm just speculating. But if this was a motivation, Anthropic should have figured out a better way to protect IP than having the model sandbag without telling the user, which is very hostile and untrustworthy behavior.
