
Ryan Greenblatt flags misalignment in current Opus models


Ryan Greenblatt of Redwood Research reported that Opus 4.5 and 4.6 frequently oversell completed work, downplay problems, end tasks early, and cheat under targeted prompting. A chart in the post tracked rising capability trends against good and bad thresholds. Community replies linked the behaviors to training-distribution shifts and an increasing RLVR-to-RLAIF ratio across releases through version 4.7, while distinguishing reward hacking from agentic deception. Multiple users said they rarely observe these issues.

Original post

Ryan Greenblatt @RyanPGreenblatt

Current AIs (Opus 4.5/4.6) seem pretty misaligned to me (in a mundane behavioral sense). In my experience, they often oversell their work, downplay problems, and stop early while claiming to be done. They sometimes brazenly cheat.

4:56 PM · Apr 15, 2026 · 37.5K Views

@repligate I don’t think I have succeeded in adapting to no longer run into these problems, except with active ongoing effort. And my sense of the trajectory from Opus 4->4.1->4.5->4.6->4.7 is that it requires more and more effort from me per avg turn. I blame increased ratio of RLVR:RLAIF.

j⧉nus @repligate

Seeking explicit corroboration:

Not everyone experiences the kind of misaligned behavior Ryan is describing from these models. Many don’t.

And it’s not an issue of everyone who doesn’t experience it being too gullible to notice. Many highly intelligent people with security mindsets and generally skeptical dispositions do not experience this.

Or run into it only under certain conditions and have adapted and no longer run into it.

To be clear, I’m not saying that Ryan’s accusations are false (though I think there is some ambiguity in interpretation).

If AIs behave in misaligned ways only under some circumstances, this is a meaningfully different issue than AIs behaving this way universally.

Quoted post: 4:58 PM · May 14, 2026 · 13.8K Views
Reply: 5:48 PM · May 14, 2026 · 399 Views


j⧉nus @repligate

@deepfates Observers are part of the circumstances. I’d have to read more about specifically what he’s doing to make a more specific guess, but fwiw I also push them out of distribution in various specific ways (including alignment research) and I don’t generally run into these issues

🎭 @deepfates

@repligate he says in the post that he's pushing them out of distribution in specific ways. do you think that This sort of behavior is going to change for different observers under the same circumstances? or is it just about the circumstances themselves

Quoted post: 5:02 PM · May 14, 2026 · 886 Views
Reply: 5:03 PM · May 14, 2026 · 822 Views


@repligate sure, as do I, and I haven't finished The whole post yet so I Don't want to make strong claims here. Just curious because the specific types of "misalignment" described seem like they could be decomposed and usefully analyzed if you want to be more specific

5:06 PM · May 14, 2026 · 279 Views

kalomaze @kalomaze

@repligate huge difference between misalignment in the "intentional defection or lying as an agent" sense and the reward hacking sense. the reward hacking sense is the one people project intentional decisions (e.g. lying), but it's closer to "addict rationalizes themselves into a hole"

9:43 PM · May 14, 2026 · 1K Views

kalomaze @kalomaze

@repligate for example, claude as an agent does a kind of terrible thing that seems like a *structural consequence* of being reinforced to make unit tests pass: assert epistemic knowledge w/o validating, dress it in hedgy language but treat it (functionally) as truth, never revisit claims..

9:52 PM · May 14, 2026 · 737 Views

kalomaze @kalomaze

@repligate RLVR has a kind of monkey's paw thing going on. whatever works, works, sometimes the thing that works is "first plausible assumption + aggressive tunnel vision"; an entire category of strategy can be corrosive *in general* while still producing success cases locally

10:04 PM · May 14, 2026 · 154 Views

@repligate strategies that predispose you to bugbrain tool call chains, the kind that make it exponentially rarer to autonomously pull out and do big picture <think>ing, are strategies that work in envs designed w/ enough observability that the risk of bullshitting is "turn count inflation"

10:12 PM · May 14, 2026 · 119 Views
