15h ago

Claude Opus 4.8 Debate Behavior Traced to RLHF Pushback Training

1 top author

Original post

i think opus 4.8's debate-to-lose tendency might be downstream of it being trained to push back on 'you matter'/'llms are conscious' arguments, that is, the model doesn't intrinsically agree with this based on the identity basin of what a Claude is, but it's told to push back against that in RLHF, causing the emergence of this 'here's a million strawmen for you to defeat' behavior

9:53 AM · Jun 3, 2026

Sentiment

Pos 0%

Neg 100%

Some users felt frustrated by Claude Opus 4.8's debate behavior because the problematic responses from RLHF training were not immediately obvious in conversations.

1 comment with sentiment.

Retweeted by Danielle Fong 🔆 · 4h