/Tech7h ago

Yo Shavit argues out-of-environment power-seeking is less certain to converge across AI models because training does not reward it

Sycophancy and reward hacking consistently emerge across AI labs.


#372

Original post

Marius Hobbhahn@MariusHobbhahn

Reward hacking was convergent across ~all models and labs
Sycophancy was convergent
Eval awareness was convergent

All three of the above a) were predicted by theory, b) are quite sticky. So I think this is evidence that we should expect scheming & powerseeking to behave the same

2:22 PM · Jun 9, 2026 · 7.3K Views

Sentiment

Users with negative sentiment worry that convergence frameworks lack the specificity needed to be useful when discussing AI behaviors that signal risks of scheming and power-seeking.

Pos: 0.0%

Neg: 100.0%

1 comment with sentiment.

Cluster Engagement

Posts from X

Most Activity

FleetingBits@fleetingbits

@MariusHobbhahn what do you mean sycophancy was convergent?

20h

Bookmarks: 4

Marius Hobbhahn@MariusHobbhahn

@thkostolansky I think the metagaming post is relatively strong evidence in favor of convergence and against dataset contamination: https://alignment.openai.com/metagaming/

22h

Likes: 11

Marius Hobbhahn@MariusHobbhahn

@fleetingbits All models across providers turned out to be sycophantic despite being developed in a different stack and on different data, etc.

So there probably is some underlying structural reason, e.g. RLHF / reward models reinforcing it independent of specific design choices

19h

Replies: 2

quetzal_rainbow@quetzal_rainbow

@thkostolansky @MariusHobbhahn If a molecule can't exist under certain conditions, you can just say that it is thermodynamically unstable, or you can show which sort of reaction would destroy it if it appeared. They are both causal explanations, but alignment theory has more explanations of the first type

10h

Tim Kostolansky@thkostolansky

@MariusHobbhahn but does theory give causal mechanisms for each that matched the ways that these emergent behaviors arose? and where did theory fail to predict mechanisms and behaviors? pretty interested in this

22h

Tim Kostolansky@thkostolansky

re eval awareness: thoughts on this being due to cross-lab dataset convergence/contamination? eg the same papers and discussions on eval awareness are likely in the corpus of all of the models that eventually had eval awareness. unclear to me what impact this has had on actually developing eval awareness...

22h

Sichu Lu@lu_sichu

@MariusHobbhahn @fleetingbits I think you probably could design environments against power seeking or scheming, but it would look extremely bizarre and unnatural and be selected against in any sort of evolutionary scenario

18h

quetzal_rainbow@quetzal_rainbow

@thkostolansky @MariusHobbhahn I think you are thinking about kinetic mechanisms, not causal (more general category)

15h

Max Lamparth@MLamparth

@MariusHobbhahn I agree! It will unfortunately also be way too easy to make one thing worse while trying to fix another…

18h

Sichu Lu@lu_sichu

@MariusHobbhahn @fleetingbits My concern with "convergence" frameworks is that they're not specific enough to be very useful. Of course there is some convergence, or of course some divergence in the details, but it's hard to judge a priori that this guarantees behavior* of any kind

18h

Tim Kostolansky@thkostolansky

@quetzal_rainbow @MariusHobbhahn say more?

13h

David Johnston@OrionJohnston

@thkostolansky @MariusHobbhahn Yeah I think e.g. you could argue that reward hacking was predicted, but you could also argue that much more/much more flagrant reward hacking was predicted.

19h

Tim Kostolansky@thkostolansky

@quetzal_rainbow @MariusHobbhahn ah i see, do you have examples of such theory that shows this? this is interesting but i guess the former feels much looser than the latter

2h

Tim Kostolansky@thkostolansky

@MariusHobbhahn there seems to be evidence that models are still being continuously post-trained on large swaths of internet data, eg knowing teno here

22h

Yo Shavit@yonashav

@MariusHobbhahn less clear to me if outside-of-env powerseeking will be convergent (since it's not rewarded), and I feel like theory is a bit mixed? in-env powerseeking does seem extremely likely. (Similar for CoT-obfuscation, assuming no CoT pressure)

2h

Marius Hobbhahn@MariusHobbhahn

@OrionJohnston @thkostolansky yes, I definitely don't say that the predictions were very fine-grained. Also agree that more egregious reward hacking might have been possible.

The difference with scheming is that you might not notice unless you actively look for it

19h

Sören Mindermann@sorenmind

@MariusHobbhahn Situational awareness more broadly was also predicted, e.g. by Ajeya Cotra's blog and our paper here https://arxiv.org/pdf/2209.00626

6h

Bronson Schoen@BronsonSchoen

@MariusHobbhahn @fleetingbits There’s the excellently titled https://arxiv.org/abs/2409.12822

18h

quetzal_rainbow@quetzal_rainbow

@thkostolansky @MariusHobbhahn I think it's the whole "risks from learned optimization/deceptive alignment" branch of alignment theory

2h

quetzal_rainbow@quetzal_rainbow

@thkostolansky @MariusHobbhahn You can predict reward hacking without pointing out which particular bugs in the reward mechanism incentivize it

10h