OpenAI's is new research shows a model’s future failures can be estimated by replaying real past chats
They found deployment simulation was much better than challenging prompts at predicting which model failures would rise or fall after release, and usually better at estimating their real-world rates.
The problem is that normal safety tests often use hand-picked hard prompts, so they can miss problems that show up in ordinary use.
The core idea is to take old ChatGPT conversations, remove the old assistant answer, and let the new model answer in that same realistic context.
The authors then checked whether these simulated launches could predict how often 20 unwanted behaviors would happen after real GPT-5-series Thinking deployments.
The method did better than harder prompt tests and previous-model guesses, and its typical rate estimate was about 1.5x away from the later real rate.