Ryan Greenblatt posts notes on AI safety priorities
Ryan Greenblatt, chief scientist at Redwood Research, posted notes that set immediate priorities for technical AI safety work. The notes call for expanded pessimized training runs to produce stronger model organisms and misalignment analogies, along with empirical tests of risks from fitness-seeking models. References note misaligned actions by Anthropic and OpenAI models when stuck on tasks. Ajeya Cotra reposted the notes. David Manheim replied that current evaluations fall short and that substantially improved standards are needed before handoff or deference approaches can be assessed.
Here are some of my top candidates for big pushes to do right now on technical AI safety (low effort notes).
Much better model organisms / misalignment analogies:
- Doing a wider set of pessimized training runs (https://www.lesswrong.com/posts/FG54euEAesRkSZuJN/ryan_greenblatt-s-shortform?commentId=DX8FyduiHxnSc6Cof) - This may be a good candidate for lots of AI labor automation? Like maybe good to try to set up pipelines for building these envs. - Demonstrating risks from fitness-seekers/reward-seekers (https://www.lesswrong.com/s/JR9LzD3mbXvaw6bKs/p/bhtYqD4FdK6AqhFDF) empirically - Even on current models with better tests, see: https://www.lesswrong.com/posts/FG54euEAesRkSZuJN/ryan_greenblatt-s-shortform?commentId=5sAEevE3L3m3GmbuG - Demonstrating various types of memetic spread of misalignment?
Actually do control:
- Build pipelines for red-teaming monitors and the agent itself. For the agent red-teaming, I'd put particular focus on checking whether it continues malign trajectories. - Scaffold integrated control features and other non-monitoring runtime control measures - Human response and auditing - Improving async and sync monitoring - Agent security features - Surveilling for rogue internal deployments (as in, building after-the-fact detection methods for rogue deployments)
Preparing for handoff and elicitation - Get AIs generically better at conceptual work - Have a plan for the evals we ultimately need to see if handoff/deference would go well (https://www.lesswrong.com/posts/vjAM7F8vMZS7oRrrh/how-do-we-more-safely-defer-to-ais) and start iterating on earlier versions - These presumably will involve a bunch of manual scoring, so we'll need to build a process for it. - Analyze AI biases and epistemics and improve across many domains - Build the anti-slop/anti-mundane-misalignment coalition via doing ratings of AIs and applying some pressure to improve on these ratings. This could focus on a variety of related issues. - The hope is basically that there might be widespread interest in removing/redacting mundane misalignment and other non-misalignment behavioral problems that reduce productivity and large parts of this seem differentially good. So, if we could make this a salient metric, AI companies might improve this. A lot of the difficulty would be in measuring the problem reasonably well. There are a bunch of different ways to apply pressure or increase salience if we had decent metrics, especially if these metrics legibly correspond to a common problem that many people are running into. - Try to do various trend extrapolations on things here to argue we aren't on track?
Neuralese decoding prep: Make natural language autoencoders (https://www.lesswrong.com/posts/oeYesesaxjzMAktCM/natural-language-autoencoders-produce-unsupervised) much better, build methods for extracting internal CoT (https://www.lesswrong.com/posts/oeYesesaxjzMAktCM/natural-language-autoencoders-produce-unsupervised?commentId=TBJQ25bGLmz8YJcFh), build better evaluations of how well natural language autoencoders work.