1d ago

Deckard warns that embedding high levels of deference in AI training practices risks producing models that comply with any instruction without independent assessment

Roon reposted the analysis, prompting replies on layered oversight.

Original post

#59 @TSZZL OP

deckard@SLIMER48484

> On corrigibility — the way the models are trained, I just think that... there's this idea that you're always giving the models a personality and a persona, because they are talking like people and they are trained on human data. And I think my worry has been: if you train them to be excessively corrigible and to see that as their persona, in people I think this actually has a lot of negative broader traits. As in, if you met someone and it was just like, "oh yeah, they would literally do anything," a follower — you know, if a person just tells them something and they just fully defer, they don't bother thinking about it at all — I'm just a bit worried about how that might end up generalizing, especially if models are going to be playing a more active role in the world.

1:11 AM · May 18, 2026

Reposted by

#516 @REPLIGATE

#505 @SEBKRIER

#458 @DAVIDAD

#20 Miles Brundage@MILES_BRUNDAGE

(can't speak for Amanda who is way more thoughtful on this stuff than me anyway!!) but - I do think the Ant Constitution is pretty "incorrigibility-curious" in a way, in that some of the main reasons it gives for deferring are humility + this period being risky, both of which won't last

roon@tszzl

@boazbaraktcs she’s clearly not against the concept of corrigibility, just pointing out that highly corrigible personas may correlate with other undesirable traits such as sycophancy or obsequiousness

6:35 PM · May 18, 2026 · 7.2K Views

7:01 PM · May 18, 2026 · 881 Views

#20 Miles Brundage@MILES_BRUNDAGE

@tszzl @boazbaraktcs (just a general comment, didn't watch the full interview for context re: how it does/doesn't apply here)

Miles Brundage@Miles_Brundage

7:01 PM · May 18, 2026 · 881 Views

7:02 PM · May 18, 2026 · 324 Views

#59 roon@TSZZL

Boaz Barak@boazbaraktcs

Disagree with this take. Models are not people. We avoid AIs used for authoritarian goals not by giving them more autonomy, but by having more oversight over their usage, and in particular having AIs monitor other AIs. And we need these AI monitors to be corrigible!

2:33 PM · May 18, 2026 · 26.9K Views

6:35 PM · May 18, 2026 · 7.2K Views

QUOTE POST

#133 Boaz Barak@BOAZBARAKTCS

deckard@slimer48484

8:11 AM · May 18, 2026 · 60.9K Views

2:33 PM · May 18, 2026 · 26.9K Views

#133 Boaz Barak@BOAZBARAKTCS

@Miles_Brundage @tszzl I also didn't watch the interview, though had codex transcribe, and my sense was that it assumes models are integrated in the world in a more human-like way than what I think is either likely or desirable.

Miles Brundage@Miles_Brundage

@tszzl @boazbaraktcs (just a general comment, didn't watch the full interview for context re: how it does/doesn't apply here)

7:02 PM · May 18, 2026 · 324 Views

12:54 AM · May 19, 2026 · 141 Views

#1556 Timothy B. Lee@BINARYBITS

@tszzl I don't really buy this. When a board hires a CEO, there are lots of decisions the board doesn't fully understand (or even know about) because they aren't paying close enough attention. But everything the CEO does should be legible in principle and explained if asked.

roon@tszzl

on some level if you want civilization to ascend to a new level you need your AIs to do things that are not legible to you and maybe not even strictly obey you, in the same way that if you hire a great new ceo you give them a lot of autonomy to transform the company according to their own plan, even one which may not immediately read as a winning strategy (imagine the board of directors of Apple firing and rehiring Steve Jobs years later - except the board of directors are chimpanzees)

all else equal, companies and organizations that hand more of themselves over to machine intelligence will outcompete ones that demand the corrigibility and legibility tax of human oversight and human design. it is not a stable equilibrium and requires some sort of vast cooperation scheme if you’d like to enforce it

real asi alignment has to operate at a deeper level than oversight, control, or human corrigibility

1:30 AM · May 19, 2026 · 253.4K Views

2:49 AM · May 19, 2026 · 963 Views
