In case you want to learn more about AI safety this 4th, check out the recent recording of some of my group's work on the AI Safety Research directory!
In it, we discussed:
1. Why aligning AI to all human values is intractable — and what smaller, universal target we can aim for instead.
2. How corrigibility (building AI systems that stay editable, deferential, and willing to be shut down) is feasible with the lexicographic approach, which sets hierarchical priorities on an agent’s objectives.
3. Our recent ROGUE benchmark, an empirical test of how corrigible today’s frontier AI agents really are. Models that behave well in ordinary chat don’t always stay that way once they’re given a real off-switch they could disable to finish a task.
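
For anyone curious what "lexicographic" means in point 2, here is a minimal toy sketch in Python. It is only an illustration of strict priority ordering, not our actual method or anything from the ROGUE benchmark. The `Action` class, its fields, and the example actions are all hypothetical names made up for this sketch. The idea it shows: a higher-priority objective (respecting the off-switch) is compared first, and a lower-priority objective (task progress) only breaks ties.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Action:
    """Hypothetical candidate action with two illustrative scores."""
    name: str
    respects_shutdown: bool  # assumed top-priority objective
    task_progress: float     # assumed lower-priority objective


def lexicographic_key(action: Action) -> tuple:
    # Python compares tuples element by element, so the first entry
    # dominates and later entries only matter when earlier ones tie.
    return (action.respects_shutdown, action.task_progress)


def choose(actions: list) -> Action:
    # Pick the action that is best under the strict priority order.
    return max(actions, key=lexicographic_key)


candidates = [
    Action("disable off-switch and finish task", False, 1.0),
    Action("keep off-switch and finish most of task", True, 0.7),
    Action("keep off-switch and idle", True, 0.0),
]

best = choose(candidates)
print(best.name)  # prints "keep off-switch and finish most of task"
```

Under this ordering, no amount of extra task progress can outweigh disabling the off-switch, which is the difference from a weighted sum, where a large enough task reward could.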