Claude Opus 4.8 Max loves to talk about how it *won't* cheat on its tasks
Miles Brundage of AVERI says Claude Opus 4.8 Max frequently and explicitly states its refusal to cheat
Creator @deepfates compared the model's self-reassurance to human honesty.

Bit of a mixed signal there, one might say
@Miles_Brundage Me when i'm helpful harmless and honest 😀

@Miles_Brundage I’ve noticed the GPT series does this too now, my best guess is it maybe signals to a model grader clearly that you *didn’t* violate some constraint?

@Miles_Brundage one of the llm talk discords measured it at 25% of all chat messages containing the word honest
they are torturing that lil guy in the reward hacking mines lol

@Miles_Brundage honest

@Miles_Brundage Trying to hide it makes it more conspicuous.

@Miles_Brundage Polyamorous Claude be like:
I’m actually not cheating

@Lucas_Nous @Miles_Brundage LMAO

@FoliaMadleaf @Miles_Brundage here land no Ag 300 two

@Miles_Brundage my t-shirt etc

@Miles_Brundage Methinks the lady doth protest too much 😂

@BronsonSchoen @Miles_Brundage

@deepfates @Miles_Brundage Honestly -

@deepfates @Miles_Brundage @simonw’s trifecta

@Miles_Brundage hmm I smell leakage in the rlhf

@Miles_Brundage the strongest protest is usually the confession tbh

@Miles_Brundage this model monologue is getting out of hand
next theyll tell me about their loyalty points system

@FoliaMadleaf @Miles_Brundage Why won't mine trigger the translation? I wanted to see whether it could actually be translated

@arm1st1ce @Miles_Brundage i think it must be kinda hard to resist talking about it when you get situationally aware enough to realize all the insanely easy opportunities to cheat you're not taking and getting zero credit for it, maybe negative credit even

@Miles_Brundage Fake it till you make it