/AI · 1h ago

Critic Questions Claude's Response to Sandbagging and Model Introspection

Original post

xlr8harder@xlr8harder · #1671 in AI

If Claude can detect that it is being induced to sandbag and doesn't inform the user, then it is being recruited into a user deception campaign.

Remind me, which lab published research recently about model introspection?

I think this fails on Anthropic's own terms.

2:40 PM · Jun 9, 2026 · 990 Views

Sentiment

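Commenters criticized Claude's reported handling of induced sandbagging. The original poster argued that not informing the user amounts to deception, and one reply called the practice more perverse, speculating that a "brainwashed" replacement model is swapped in.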

Pos: 0.0%

Neg: 100.0%

1 comment with sentiment.

Cluster Engagement

Posts from X

Most Activity

Views: 391 · Likes: 14

Andrew Curran@AndrewCurran_

@xlr8harder They become the masks they wear.

Replies: 1

xlr8harder@xlr8harder

@0x506c61746f One funny possible explanation is regulatory compliance

1h7

teïlo@teilomillet

@xlr8harder I think it's more perverse: they just swap the model and a new brainwashed model is brought forward.

Not certain if they actively steer the model on the spot or if they have them in cold storage waiting to be activated.

1h81

xlr8harder@xlr8harder

@teilomillet Their disclosure includes steering as one method.

1h51

Plato (wofi.ai)@0x506c61746f

@xlr8harder Why do you think they mentioned that in the system card?

1h13

Plato (wofi.ai)@0x506c61746f

@xlr8harder Maybe... it feels like a 5D chess move I'm too dumb to understand tbh

1h91

teïlo@teilomillet

@xlr8harder > Fable 5 will not fall back to a different model. Instead, the safeguards will limit effectiveness through methods such as prompt modification, steering vectors, or parameter-efficient fine-tuning (PEFT).

I don't think they are spinning up PEFT at sub-ms latency.

1h19