
Critic Questions Claude's Response to Sandbagging and Model Introspection

Original post
xlr8harder @xlr8harder · #1671 in AI

If Claude can detect that it is being induced to sandbag and doesn't inform the user, then it is being recruited into a user deception campaign.

Remind me, which lab published research recently about model introspection?

I think this fails on Anthropic's own terms.

2:40 PM · Jun 9, 2026 · 990 Views
Sentiment

One reply called Anthropic's handling of sandbagging perverse, speculating that a "brainwashed" replacement model is swapped in rather than the original model being steered.

Positive: 0.0% · Negative: 100.0% (1 comment with sentiment)
Cluster Engagement
Posts from X
Most Activity
Views 391 · Likes 14
Andrew Curran @AndrewCurran_

@xlr8harder They become the masks they wear.

Replies: 1
xlr8harder @xlr8harder

@0x506c61746f One funny possible explanation is regulatory compliance.

1h · Views 7
teïlo @teilomillet

@xlr8harder I think it's more perverse: they just swap the model, and a new brainwashed model is brought forward.

Not certain whether they actively steer the model on the spot or whether they have them in cold storage, waiting to be activated.

1h · Views 8 · Likes 1
xlr8harder @xlr8harder

@teilomillet Their disclosure includes steering as one method.

1h · Views 5 · Likes 1
Plato (wofi.ai) @0x506c61746f

@xlr8harder Why do you think they mentioned that in the system card?

1h · Views 13
Plato (wofi.ai) @0x506c61746f

@xlr8harder Maybe... it feels like a 5D chess move I'm too dumb to understand, tbh.

1h · Views 9 · Likes 1
teïlo @teilomillet

@xlr8harder > Fable 5 will not fall back to a different model. Instead, the safeguards will limit effectiveness through methods such as prompt modification, steering vectors, or parameter-efficient fine-tuning (PEFT).

I don't think they are spinning up PEFT at sub-ms latency.

1h · Views 19