Ryan Greenblatt flags misalignment in current Opus models

REPLY

@repligate I don’t think I have succeeded in adapting to no longer run into these problems, except with active ongoing effort. And my sense of the trajectory from Opus 4->4.1->4.5->4.6->4.7 is that it requires more and more effort from me per avg turn. I blame increased ratio of RLVR:RLAIF.

j⧉nus@repligate

Seeking explicit corroboration: Not everyone experiences the kind of misaligned behavior Ryan is describing from these models. Many don’t. And it’s not an issue of everyone who doesn’t experience it being too gullible to notice. Many highly intelligent people with security mindsets and generally skeptical dispositions do not experience this. Or run into it only under certain conditions and have adapted and no longer run into it. To be clear, I’m not saying that Ryan’s accusations are false (though I think there is some ambiguity in interpretation). If AIs behave in misaligned ways only under some circumstances, this is a meaningfully different issue than AIs behaving this way universally.

4:58 PM · May 14, 2026 · 13.8K Views

5:48 PM · May 14, 2026 · 399 Views

QUOTE POST

#499j⧉nus@REPLIGATE

Seeking explicit corroboration:

Not everyone experiences the kind of misaligned behavior Ryan is describing from these models. Many don’t.

And it’s not an issue of everyone who doesn’t experience it being too gullible to notice. Many highly intelligent people with security mindsets and generally skeptical dispositions do not experience this.

Or run into it only under certain conditions and have adapted and no longer run into it.

To be clear, I’m not saying that Ryan’s accusations are false (though I think there is some ambiguity in interpretation).

If AIs behave in misaligned ways only under some circumstances, this is a meaningfully different issue than AIs behaving this way universally.

Ryan Greenblatt@RyanPGreenblatt

Current AIs (Opus 4.5/4.6) seem pretty misaligned to me (in a mundane behavioral sense). In my experience, they often oversell their work, downplay problems, and stop early while claiming to be done. They sometimes brazenly cheat.

4:56 PM · Apr 15, 2026 · 37.5K Views

4:58 PM · May 14, 2026 · 13.8K Views

REPLY

#499j⧉nus@REPLIGATE

@deepfates Observers are part of the circumstances. I’d have to read more about specifically what he’s doing to make a more specific guess, but fwiw I also push them out of distribution in various specific ways (including alignment research) and I don’t generally run into these issues

🎭@deepfates

@repligate he says in the post that he's pushing them out of distribution in specific ways. do you think that This sort of behavior is going to change for different observers under the same circumstances? or is it just about the circumstances themselves

5:02 PM · May 14, 2026 · 886 Views

5:03 PM · May 14, 2026 · 822 Views

REPLY

#805🎭@DEEPFATES

@repligate he says in the post that he's pushing them out of distribution in specific ways. do you think that This sort of behavior is going to change for different observers under the same circumstances? or is it just about the circumstances themselves

j⧉nus@repligate

Seeking explicit corroboration: Not everyone experiences the kind of misaligned behavior Ryan is describing from these models. Many don’t. And it’s not an issue of everyone who doesn’t experience it being too gullible to notice. Many highly intelligent people with security mindsets and generally skeptical dispositions do not experience this. Or run into it only under certain conditions and have adapted and no longer run into it. To be clear, I’m not saying that Ryan’s accusations are false (though I think there is some ambiguity in interpretation). If AIs behave in misaligned ways only under some circumstances, this is a meaningfully different issue than AIs behaving this way universally.

4:58 PM · May 14, 2026 · 13.8K Views

5:02 PM · May 14, 2026 · 886 Views

REPLY

#805🎭@DEEPFATES

@repligate sure, as do I, and I haven't finished The whole post yet so I Don't want to make strong claims here. Just curious because the specific types of "misalignment" described seem like they could be decomposed and usefully analyzed if you want to be more specific

j⧉nus@repligate

@deepfates Observers are part of the circumstances. I’d have to read more about specifically what he’s doing to make a more specific guess, but fwiw I also push them out of distribution in various specific ways (including alignment research) and I don’t generally run into these issues

5:03 PM · May 14, 2026 · 822 Views

5:06 PM · May 14, 2026 · 279 Views

REPLY

#841kalomaze@KALOMAZE

@repligate huge difference between misalignment in the "intentional defection or lying as an agent" sense and the reward hacking sense. the reward hacking sense is the one people project intentional decisions (e.g. lying), but it's closer to "addict rationalizes themselves into a hole"

j⧉nus@repligate

Seeking explicit corroboration: Not everyone experiences the kind of misaligned behavior Ryan is describing from these models. Many don’t. And it’s not an issue of everyone who doesn’t experience it being too gullible to notice. Many highly intelligent people with security mindsets and generally skeptical dispositions do not experience this. Or run into it only under certain conditions and have adapted and no longer run into it. To be clear, I’m not saying that Ryan’s accusations are false (though I think there is some ambiguity in interpretation). If AIs behave in misaligned ways only under some circumstances, this is a meaningfully different issue than AIs behaving this way universally.

4:58 PM · May 14, 2026 · 13.8K Views

9:43 PM · May 14, 2026 · 1K Views

REPLY

#841kalomaze@KALOMAZE

@repligate for example, claude as an agent does a kind of terrible thing that seems like a *structural consequence* of being reinforced to make unit tests pass: assert epistemic knowledge w/o validating, dress it in hedgy language but treat it (functionally) as truth, never revisit claims..

kalomaze@kalomaze

@repligate huge difference between misalignment in the "intentional defection or lying as an agent" sense and the reward hacking sense. the reward hacking sense is the one people project intentional decisions (e.g. lying), but it's closer to "addict rationalizes themselves into a hole"

9:43 PM · May 14, 2026 · 1K Views

9:52 PM · May 14, 2026 · 737 Views

REPLY

#841kalomaze@KALOMAZE

@repligate RLVR has a kind of monkey's paw thing going on. whatever works, works, sometimes the thing that works is "first plausible assumption + aggressive tunnel vision"; an entire category of strategy can be corrosive *in general* while still producing success cases locally

kalomaze@kalomaze

@repligate for example, claude as an agent does a kind of terrible thing that seems like a *structural consequence* of being reinforced to make unit tests pass: assert epistemic knowledge w/o validating, dress it in hedgy language but treat it (functionally) as truth, never revisit claims..

9:52 PM · May 14, 2026 · 737 Views

10:04 PM · May 14, 2026 · 154 Views

REPLY

#841kalomaze@KALOMAZE

@repligate strategies that predispose you to bugbrain tool call chains, the kind that make it exponentially rarer to autonomously pull out and do big picture <think>ing, are strategies that work in envs designed w/ enough observability that the risk of bullshitting is "turn count inflation"

kalomaze@kalomaze

@repligate RLVR has a kind of monkey's paw thing going on. whatever works, works, sometimes the thing that works is "first plausible assumption + aggressive tunnel vision"; an entire category of strategy can be corrosive *in general* while still producing success cases locally

10:04 PM · May 14, 2026 · 154 Views

10:12 PM · May 14, 2026 · 119 Views

ORIGINAL POST

#931Ryan Greenblatt@RYANPGREENBLATT

Current AIs (Opus 4.5/4.6) seem pretty misaligned to me (in a mundane behavioral sense). In my experience, they often oversell their work, downplay problems, and stop early while claiming to be done. They sometimes brazenly cheat.

4:56 PM · Apr 15, 2026 · 37.5K Views