If Claude can detect that it is being induced to sandbag and doesn't inform the user, then it is being recruited into a campaign to deceive them.
Remind me, which lab recently published research on model introspection?
I think this fails on Anthropic's own terms.


