Researcher Unveils First Formal Guarantees on AI Corrigibility at AAAI 26

Talk recording here: https://underline.io/lecture/144282-core-safety-values-for-provably-corrigible-agents

For those who can't make it:

Slides: https://anayebi.github.io/files/slides/AAAI26_MEW-11.pdf

Blogpost summary: https://www.lesswrong.com/posts/M5owRcacptnkxwD2u/from-barriers-to-alignment-to-the-first-formal-corrigibility-1

6:32 AM · Jun 22, 2026 · 58 Views

Aran Nayebi (@aran_nayebi)

My AAAI 26 talk on the first formal guarantees on corrigibility and the limits of safety filters is now online (link in next tweet below 👇)

Aran Nayebi (@aran_nayebi)

1/ How do we build AI systems that are corrigible—shut down when asked, tell the truth, preserve oversight—and still do something useful?

We give the first provable framework that makes it implementable—unlike RLHF or Constitutional AI, which can fail when goals conflict.

🧵👇

Aran Nayebi (@aran_nayebi)

https://underline.io/lecture/144282-core-safety-values-for-provably-corrigible-agents
