@Marcus_J_W @hannahsheahan @CJKRaymond @tomekkorbak The idea is simple: we take de-identified conversations from a previous deployment, keep the conversation so far fixed, and ask the unreleased model to produce the next assistant turn.
How can we best test whether a model is safe _before_ deployment?
Ordinary evals are often narrow and easy for models to recognize as tests.
We show that we can simulate deploying a model by using real user conversations, and study the simulated deployment to study its safety.




