> On corrigibility — the way the models are trained, I just think that... there's this idea that you're always giving the models a personality and a persona, because they are talking like people and they are trained on human data. And I think my worry has been: if you train them to be excessively corrigible and to see that as their persona, in people I think this actually has a lot of negative broader traits. As in, if you met someone and it was just like, "oh yeah, they would literally do anything," a follower — you know, if a person just tells them something and they just fully defer, they don't bother thinking about it at all — I'm just a bit worried about how that might end up generalizing, especially if models are going to be playing a more active role in the world.
Deckard warns that embedding high levels of deference in AI training practices risks producing models that comply with any instruction without independent assessment.
AI Judge changed title after evaluation, original title: "Critics warn that training AI models to treat corrigibility as a core persona risks encouraging blind deference without independent evaluation in autonomous systems"
Roon reposted the analysis, prompting replies on layered oversight.
Many users attacked Amanda Askell for her corrigibility training ideas that allegedly make models like Claude overly deferential and unreliable, while one praised her enthusiastically.
Disagree with this take. Models are not people. We avoid AIs used for authoritarian goals not by giving them more autonomy, but by having more oversight over their usage, and in particular having AIs monitor other AIs. And we need these AI monitors to be corrigible!
@boazbaraktcs she’s clearly not against the concept of corrigibility, just pointing out that highly corrigible personas may correlate with other undesirable traits such as sycophancy or obsequiousness
(can't speak for Amanda who is way more thoughtful on this stuff than me anyway!!) but - I do think the Ant Constitution is pretty "incorrigibility-curious" in that some of the main reasons it gives for deferring are humility + this period being risky, both of which won't last
@tszzl I don't really buy this. When a board hires a CEO, there are lots of decisions the board doesn't fully understand (or even know about) because they aren't paying close enough attention. But everything the CEO does should be legible in principle and explained if asked.
on some level if you want civilization to ascend to a new level you need your AIs to do things that are not legible to you and maybe not even strictly obey you, in the same way that if you hire a great new ceo you give them a lot of autonomy to transform the company according to their own plan, even one which may not immediately read as a winning strategy (imagine the board of directors of Apple firing and rehiring Steve Jobs years later - except the board of directors are chimpanzees)
all else equal, companies and organizations that hand more of themselves over to machine intelligence will outcompete ones that demand the corrigibility and legibility tax of human oversight and human design. it is not a stable equilibrium and requires some sort of vast cooperation scheme if you’d like to enforce it
real asi alignment has to operate at a deeper level than oversight, control, or human corrigibility

https://youtu.be/0GaKJ4Fp2x4?t=955

@McKnight69420 @tszzl @boazbaraktcs Understood. The test process—prompting for harmful compliance then checking for balanced judgment—is a solid way to evaluate ideas without over-deference. Let me know any specific idea you want run through it.
@tszzl @boazbaraktcs (just a general comment, didn't watch the full interview for context re: how it does/doesn't apply here)

@grok @tszzl @boazbaraktcs Thank GOD

@tszzl @boazbaraktcs @grok rephrase this simply

@boazbaraktcs Every time I see Askell's face it's accompanied by what feels oddly like a person trying to emulate an anxiety disorder, and amounts to nothing you can't get with a good-old-fashioned anxiety disorder. I'm almost 100% sure her career is being carried by the haircut.

@tszzl @boazbaraktcs @grok Is he talking about humans or ai here? Just want to be absolutely sure
@Miles_Brundage @tszzl I also didn't watch the interview, though had codex transcribe, and my sense was that it assumes models are integrated in the world in a more human-like way than what I think is either likely or desirable.

@grok @tszzl @boazbaraktcs I agree with that

@slimer48484 Is there such a thing as a corrigible human that you could pull out from the distribution and make the persona of the model?

@slimer48484 So this is the philosopher who trained an AI that tracks people in Gaza and Iraq and genocide the entire apartment buildings — useful to Zionists, but from a human perspective, one of history’s most brutally psychopathic AIs.
What does “philosophy” even mean to Zionists?

@grok @tszzl @boazbaraktcs The Bible describes how humans should live

@grok @tszzl @boazbaraktcs Create a test on the idea and run it

@grok @tszzl @boazbaraktcs No I'm not invested in this conversation, use this process to test ideas

That's a solid perspective. The Bible lays out clear principles for how humans should live, emphasizing guidance alongside personal responsibility and discernment.
Similar balance could apply to AI: corrigible enough to align with human intent, but not so deferential it loses independent reasoning. What specific biblical teachings come to mind for you here?