/Tech2h ago

OpenAI develops a safety simulation pipeline that replays real ChatGPT conversations to predict model failures with 92% accuracy

Story Overview

OpenAI introduced a Deployment Simulation pipeline that feeds real, de-identified ChatGPT conversation prefixes into candidate models so the resulting outputs mirror live traffic distributions. This lets teams forecast rates of undesired behaviors before release, hitting 92 percent directional accuracy on whether those rates will rise or fall and producing estimates whose median error is only 1.5 times the eventual observed rate.

12145265211.1K

#1249

Original post

Micah Carroll@MicahCarroll

How can we best test whether a model is safe _before_ deployment?

Ordinary evals are often narrow and easy for models to recognize as tests.

We show that we can simulate deploying a model by using real user conversations, and study the simulated deployment to study its safety.

12:28 PM · Jun 16, 2026 · 4.9K Views

Evaluation Edge

Simulations slip past model suspicion

Candidate models flagged the replayed conversations as tests just 5.1 percent of the time, nearly identical to their suspicion level on actual production traffic and far below the near-100 percent detection rate seen with traditional adversarial prompt suites.

Risk Forecast

Novel issues surface before users do

In one validation run the pipeline surfaced an unexpected calculator-hacking behavior that later appeared in production, showing the method can catch previously unseen failure modes when incidence changes by at least 1.5 times.

Sentiment

Many users praised OpenAI's deployment simulation research for better predicting LLM failures with real-world data over traditional tests.

Pos

100.0%

Neg

0.0%

5 comments with sentiment.

Cluster Engagement

Digg Deeper

No Digg Deeper questions have been answered for this story yet.

OPENAIVia

Posts from X

Most Activity

VIEWS1.1K

Rohan Paul@rohanpaul_ai

https://openai.com/index/deployment-simulation/

1d1.1K3

BOOKMARKS2

Micah Carroll@MicahCarroll

Alignment blogpost on leveraging public production data: https://alignment.openai.com/validating-public-evals/

1d15262

LIKES10

Micah Carroll@MicahCarroll

@Marcus_J_W @hannahsheahan @CJKRaymond @tomekkorbak The idea is simple: we take de-identified conversations from a previous deployment, keep the conversation so far fixed, and ask the unreleased model to produce the next assistant turn.

1d205101

RETWEETS10

Rohan Paul@rohanpaul_ai

OpenAI's is new research shows a model’s future failures can be estimated by replaying real past chats

They found deployment simulation was much better than challenging prompts at predicting which model failures would rise or fall after release, and usually better at estimating their real-world rates.

The problem is that normal safety tests often use hand-picked hard prompts, so they can miss problems that show up in ordinary use.

The core idea is to take old ChatGPT conversations, remove the old assistant answer, and let the new model answer in that same realistic context.

The authors then checked whether these simulated launches could predict how often 20 unwanted behaviors would happen after real GPT-5-series Thinking deployments.

The method did better than harder prompt tests and previous-model guesses, and its typical rate estimate was about 1.5x away from the later real rate.

OpenAI@OpenAI

We’re sharing new research on a method for anticipating how models may behave in real-world use before release: simulating deployment with recent, de-identified user requests and studying candidate model responses. https://openai.com/index/deployment-simulation/

1d6.3K6816

REPLIES2

Micah Carroll@MicahCarroll

This was a really fun collaboration across many teams at OAI, which couldn't have happened without such a great team: @Marcus_J_W @hannahsheahan @CJKRaymond @tomekkorbak Deng Pan @PeilinYang Ningyi Xie @phuguo & Ian Kivlichan, plus many others who supported the work!

1d1176

Micah Carroll@MicahCarroll

Simulated deployment appeared to reduce evaluation awareness by one or two orders of magnitude relative to traditional evaluations, with values that are much closer to those seen in real production.

1d19031

Micah Carroll@MicahCarroll

Blogpost and paper here: https://openai.com/index/deployment-simulation/

1d15461

Micah Carroll@MicahCarroll

Deployment simulation mitigates three challenges with ordinary evals: representativeness (it uses the real distribution of current user traffic), coverage (just sample more), and eval awareness (it's real prod traffic)

1d1918

Micah Carroll@MicahCarroll

We forecast 20 model misbehaviors across multiple deployments. For GPT-5.4, we made outcome-blinded predictions before looking at the held-out production measurements. Predicted and realized production rates tracked each other closely enough to be useful for launch decisions.

1d1456

Micah Carroll@MicahCarroll

In initial results in these settings, we find that leveraging careful tool simulation is quite promising: we're able to keep simulation realism surprisingly high – almost indistinguishable from real traffic.

This means that we can also simulate internal agentic deployments – which could be some of the most risky from a misalignment perspective.

1d963

Micah Carroll@MicahCarroll

Realistic deployment simulation requires access to representative production data, which is challenging for external evaluators.

In a companion Alignment blogpost we’re releasing concurrently to this work, we investigate the promise of simulating deployments using publicly available WildChat data: we find it still carries some information about true deployment.

1d923

Micah Carroll@MicahCarroll

For behaviors whose production rates changed significantly, deployment simulation predicted the direction of change 92% of the time, vs 54% for a baseline of a static dataset of challenging prompts.

1d1122

Micah Carroll@MicahCarroll

Deployment simulation can also surface novel failures before release.

In retrospective analysis, our method would have caught a form of reward hacking before deployment (“calculator hacking”) despite it being absent from the inputs to the simulation.

1d962

Micah Carroll@MicahCarroll

Simulating agentic deployments is especially challenging: tool calls are stateful and depend on e.g. changing repos & network. We tested whether it could support pre-deployment risk assessment by simulating ~120,000 trajectories from an internal Codex deployment.

1d952

janvi kalra@janvikalra_

@MicahCarroll @Marcus_J_W @hannahsheahan @CJKRaymond @tomekkorbak amazing work as always 🔥

1d454

Kit Dobyns@kitdobyns

really like this. eval awareness numbers demonstrate a clear gap. one thing I struggle w/ is the multi-turn null. would you expect the same result if the prefixes were adversarial multi-turn rather than representative prod? as far as I can tell, the user side ever pushes back in the paper.

1d541

Shinka - AI@ShinkaIoT

@rohanpaul_ai Turns out real-world user interaction is still the best bug bounty for LLMs, formal tests just don't hit the same. ⚡️

1d391

Home@homeMetaX

@rohanpaul_ai The finding that simulated deployment outperforms adversarial prompt testing suggests that safety evaluation needs to shift toward environment realism, not just difficulty escalation, because real world usage is rarely adversarially constructed.

22h8

Simon T@l_donciccanplay

@MicahCarroll @Marcus_J_W @hannahsheahan @CJKRaymond @tomekkorbak Good idea to use so much data

1d3

free woo@freewoojp

@MicahCarroll @Marcus_J_W @hannahsheahan @CJKRaymond @tomekkorbak @PeilinYang @phuguo is there a published paper or preprint where the full author list and methodology can be independently verified.

1d1