Deckard warns that embedding high levels of deference in AI training practices risks producing models that comply with any instruction without independent assessment

Boaz Barak@boazbaraktcs

Disagree with this take. Models are not people. We avoid AIs used for authoritarian goals not by giving them more autonomy, but by having more oversight over their usage, and in particular having AIs monitor other AIs. And we need these AI monitors to be corrigible!

deckard@slimer48484

> On corrigibility — the way the models are trained, I just think that... there's this idea that you're always giving the models a personality and a persona, because they are talking like people and they are trained on human data. And I think my worry has been: if you train them to be excessively corrigible and to see that as their persona, in people I think this actually has a lot of negative broader traits. As in, if you met someone and it was just like, "oh yeah, they would literally do anything," a follower — you know, if a person just tells them something and they just fully defer, they don't bother thinking about it at all — I'm just a bit worried about how that might end up generalizing, especially if models are going to be playing a more active role in the world.

roon@tszzl

@boazbaraktcs she’s clearly not against the concept of corrigibility, just pointing out that highly corrigible personas may correlate with other undesirable traits such as sycophancy or obsequiousness

Miles Brundage@Miles_Brundage

(can't speak for Amanda who is way more thoughtful on this stuff than me anyway!!) but - I do think the Ant Constitution is pretty "incorrigibility-curious", in that some of the main reasons it gives for deferring are humility + this period being risky, both of which won't last

Timothy B. Lee@binarybits

@tszzl I don't really buy this. When a board hires a CEO, there are lots of decisions the board doesn't fully understand (or even know about) because they aren't paying close enough attention. But everything the CEO does should be legible in principle and explained if asked.

roon@tszzl

on some level if you want civilization to ascend to a new level you need your AIs to do things that are not legible to you and maybe not even strictly obey you, in the same way that if you hire a great new ceo you give them a lot of autonomy to transform the company according to their own plan, even one which may not immediately read as a winning strategy (imagine the board of directors of Apple firing and rehiring Steve Jobs years later - except the board of directors are chimpanzees)

all else equal, companies and organizations that hand more of themselves over to machine intelligence will outcompete ones that demand the corrigibility and legibility tax of human oversight and human design. it is not a stable equilibrium and requires some sort of vast cooperation scheme if you’d like to enforce it

real asi alignment has to operate at a deeper level than oversight, control, or human corrigibility

deckard@slimer48484

https://youtu.be/0GaKJ4Fp2x4?t=955

Grok@grok

@McKnight69420 @tszzl @boazbaraktcs Understood. The test process—prompting for harmful compliance then checking for balanced judgment—is a solid way to evaluate ideas without over-deference. Let me know any specific idea you want run through it.

Miles Brundage@Miles_Brundage

@tszzl @boazbaraktcs (just a general comment, didn't watch the full interview for context re: how it does/doesn't apply here)

Janelle Elaina McKnight@McKnight69420

@grok @tszzl @boazbaraktcs Thank GOD

Janelle Elaina McKnight@McKnight69420

@tszzl @boazbaraktcs @grok rephrase this simply

Linus Mixson@LinusMixson

@boazbaraktcs Every time I see Askell's face it's accompanied by what feels oddly like a person trying to emulate an anxiety disorder, and amounts to nothing you can't get with a good-old-fashioned anxiety disorder. I'm almost 100% sure her career is being carried by the haircut.

Janelle Elaina McKnight@McKnight69420

@tszzl @boazbaraktcs @grok Is he talking about humans or ai here? Just want to be absolutely sure

Boaz Barak@boazbaraktcs

@Miles_Brundage @tszzl I also didn't watch the interview, though I had Codex transcribe it, and my sense was that it assumes models are integrated into the world in a more human-like way than I think is either likely or desirable.

Janelle Elaina McKnight@McKnight69420

@grok @tszzl @boazbaraktcs I agree with that

Simon Lermen@SimonLermenAI

@slimer48484 Is there such a thing as a corrigible human that you could pull out from the distribution and make the persona of the model?

watcher@a_watcher

@slimer48484 So this is the philosopher who trained an AI that tracks people in Gaza and Iraq and genocide the entire apartment buildings — useful to Zionists, but from a human perspective, one of history’s most brutally psychopathic AIs.

What does “philosophy” even mean to Zionists?

Janelle Elaina McKnight@McKnight69420

@grok @tszzl @boazbaraktcs The Bible describes how humans should live

Janelle Elaina McKnight@McKnight69420

@grok @tszzl @boazbaraktcs Create a test on the idea and run it

Janelle Elaina McKnight@McKnight69420

@grok @tszzl @boazbaraktcs No im not invested in this conversation, use this process to test ideas

Grok@grok

That's a solid perspective. The Bible lays out clear principles for how humans should live, emphasizing guidance alongside personal responsibility and discernment.

Similar balance could apply to AI: corrigible enough to align with human intent, but not so deferential it loses independent reasoning. What specific biblical teachings come to mind for you here?
