Tech · 7h ago

Yo Shavit argues out-of-environment power-seeking is less certain to converge across AI models because training does not reward it

Sycophancy and reward hacking consistently emerge across AI labs.

Original post
Marius Hobbhahn@MariusHobbhahn

Reward hacking was convergent across ~all models and labs
Sycophancy was convergent
Eval awareness was convergent

All three of the above a) were predicted by theory, b) are quite sticky. So I think this is evidence that we should expect scheming & powerseeking to behave the same

2:22 PM · Jun 9, 2026 · 7.3K Views
Sentiment

Users with negative sentiment worry that convergence frameworks lack the specificity needed to be useful for discussing AI behaviors that signal risks of scheming and power-seeking.

Positive: 0.0% · Negative: 100.0% · 1 comment with sentiment.
Cluster Engagement
Posts from X
Most Activity
FleetingBits@fleetingbits

@MariusHobbhahn what do you mean sycophancy was convergent?

20h · Views 649 · Likes 7
Marius Hobbhahn@MariusHobbhahn

@thkostolansky I think the metagaming post is relatively strong evidence in favor of convergence and against dataset contamination: https://alignment.openai.com/metagaming/

22h · Views 344 · Likes 7 · Bookmarks 4
Marius Hobbhahn@MariusHobbhahn

@fleetingbits All models across providers turned out to be sycophantic despite being developed in a different stack and on different data, etc.

So there probably is some underlying structural reason, e.g. RLHF / reward models reinforcing it independent of specific design choices

19h · Views 344 · Likes 11 · Bookmarks 1
quetzal_rainbow@quetzal_rainbow

@thkostolansky @MariusHobbhahn If a molecule can't exist under certain conditions, you can just say that it is thermodynamically unstable or you can show which sort of reaction would destroy it if it appeared. They are both causal explanations, but alignment theory has more explanations of the first type

10h · Views 20
Tim Kostolansky@thkostolansky

@MariusHobbhahn but does theory give causal mechanisms for each that matched the ways that these emergent behaviors arose? and where did theory fail to predict mechanisms and behaviors? pretty interested in this

22h · Views 298 · Likes 4 · Bookmarks 2
Tim Kostolansky@thkostolansky

re eval awareness: thoughts on this being due to cross-lab dataset convergence/contamination? eg the same papers and discussions on eval awareness are likely in the corpus of all of the models that eventually had eval awareness. unclear to me what impact this has had on actually developing eval awareness...

22h · Views 482 · Likes 4
Sichu Lu@lu_sichu

@MariusHobbhahn @fleetingbits I think you probably could design environments against power seeking or scheming but it would look extremely bizarre and unnatural and be selected against in any sort of evolutionary scenario

18h · Views 44 · Likes 1 · Bookmarks 1
quetzal_rainbow@quetzal_rainbow

@thkostolansky @MariusHobbhahn I think you are thinking about kinetic mechanisms, not causal (more general category)

15h · Views 35 · Likes 1 · Bookmarks 1
Max Lamparth@MLamparth

@MariusHobbhahn I agree! It will unfortunately also be way too easy to make one thing worse while trying to fix another…

18h · Views 227 · Likes 1 · Bookmarks 1
Sichu Lu@lu_sichu

@MariusHobbhahn @fleetingbits My concern with "convergence" frameworks is that they're not specific enough to be very useful. Of course there is some convergence, and of course some divergence in the details, but it's hard to judge a priori that this guarantees behavior of any kind

18h · Views 18 · Likes 1 · Bookmarks 1
Tim Kostolansky@thkostolansky

@quetzal_rainbow @MariusHobbhahn say more?

13h · Views 23 · Likes 1
David Johnston@OrionJohnston

@thkostolansky @MariusHobbhahn Yeah I think e.g. you could argue that reward hacking was predicted, but you could also argue that much more/much more flagrant reward hacking was predicted.

19h · Views 23 · Likes 1
Tim Kostolansky@thkostolansky

@quetzal_rainbow @MariusHobbhahn ah i see, do you have examples of such theory that shows this? this is interesting but i guess the former feels much looser than the latter

2h · Views 13
Tim Kostolansky@thkostolansky

@MariusHobbhahn there seems to be evidence that models are still being continuously post-trained on large swaths of internet data, eg knowing teno here

22h · Views 138 · Likes 2
Yo Shavit@yonashav

@MariusHobbhahn less clear to me if outside-of-env powerseeking will be convergent (since it's not rewarded), and I feel like theory is a bit mixed? in-env powerseeking does seem extremely likely. (Similar for CoT-obfuscation, assuming no CoT pressure)

2h · Views 598 · Likes 0 · Bookmarks 0
Marius Hobbhahn@MariusHobbhahn

@OrionJohnston @thkostolansky yes, I definitely don't say that the predictions were very fine-grained. Also agree that more egregious reward hacking might have been possible.

The difference with scheming is that you might not notice unless you actively look for it

19h · Views 29 · Likes 2

@MariusHobbhahn Situational awareness more broadly was also predicted, e.g. by Ajeya Cotra's blog and our paper here https://arxiv.org/pdf/2209.00626

6h · Views 63 · Likes 1
Bronson Schoen@BronsonSchoen

@MariusHobbhahn @fleetingbits There’s the excellently titled https://arxiv.org/abs/2409.12822

18h · Views 36 · Likes 1
quetzal_rainbow@quetzal_rainbow

@thkostolansky @MariusHobbhahn I think it's the whole "risks from learned optimization/deceptive alignment" branch of alignment theory

2h · Views 13 · Likes 1
quetzal_rainbow@quetzal_rainbow

@thkostolansky @MariusHobbhahn You can predict reward hacking without pointing out which particular bugs in the reward mechanism incentivize it

10h · Views 13 · Likes 1