Deckard warns that embedding high levels of deference in AI training practices risks producing models that comply with any instruction without independent assessment.
Roon reposted the analysis, prompting replies on layered oversight.
(can't speak for Amanda who is way more thoughtful on this stuff than me anyway!!) but - I do think the Ant Constitution is pretty "incorrigibility-curious" in that some of the main reasons it gives for deferring are humility + this period being risky, both of which won't last
@boazbaraktcs she’s clearly not against the concept of corrigibility, just pointing out that highly corrigible personas may correlate with other undesirable traits such as sycophancy or obsequiousness
@tszzl @boazbaraktcs (just a general comment, didn't watch the full interview for context re: how it does/doesn't apply here)
Disagree with this take. Models are not people. We avoid AIs used for authoritarian goals not by giving them more autonomy, but by having more oversight over their usage, and in particular having AIs monitor other AIs. And we need these AI monitors to be corrigible!
> On corrigibility — the way the models are trained, I just think that... there's this idea that you're always giving the models a personality and a persona, because they are talking like people and they are trained on human data. And I think my worry has been: if you train them to be excessively corrigible and to see that as their persona, in people I think this actually has a lot of negative broader traits. As in, if you met someone and it was just like, "oh yeah, they would literally do anything," a follower — you know, if a person just tells them something and they just fully defer, they don't bother thinking about it at all — I'm just a bit worried about how that might end up generalizing, especially if models are going to be playing a more active role in the world.
@Miles_Brundage @tszzl I also didn't watch the interview, though had codex transcribe, and my sense was that it assumes models are integrated in the world in a more human-like way than what I think is either likely or desirable.
@tszzl I don't really buy this. When a board hires a CEO, there are lots of decisions the board doesn't fully understand (or even know about) because they aren't paying close enough attention. But everything the CEO does should be legible in principle and explained if asked.
on some level if you want civilization to ascend to a new level you need your AIs to do things that are not legible to you and maybe not even strictly obey you, in the same way that if you hire a great new ceo you give them a lot of autonomy to transform the company according to their own plan, even one which may not immediately read as a winning strategy (imagine the board of directors of Apple firing and rehiring Steve Jobs years later - except the board of directors are chimpanzees). all else equal, companies and organizations that hand more of themselves over to machine intelligence will outcompete ones that demand the corrigibility and legibility tax of human oversight and human design. it is not a stable equilibrium and requires some sort of vast cooperation scheme if you’d like to enforce it. real asi alignment has to operate at a deeper level than oversight, control, or human corrigibility