OpenAI Researchers Outline Nuances in Third-Party Frontier Model Evaluations
The result is no longer just about the model. The harness, tools, safeguards, control loop, budget, and context all shape what an evaluation is actually measuring.

We (@CedricWhitney, @SandhiniAgarwal, @EstherTetruas, @OliviaGWatkins2, @dgrobinson) wrote about nuances we’ve observed while working with third parties on frontier model evals, and why eval standards need to account for them. https://openai.com/index/trustworthy-third-party-evaluations-foundations/
Checking for contamination, broken problems, refusals, reward hacking, and sandbagging is critical. These details can change how much confidence we should place in a result.
Going forward, third-party evaluation standards should require enough detail for decision makers to understand what claims the specific evaluations support, what system was tested, how the result was elicited, and how evaluators checked its validity.
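One way to picture that level of detail is a structured report that can be checked for gaps before anyone relies on the result. The sketch below is only an illustration, not a format proposed by the authors: the `EvalReport` class, the `missing_details` helper, every field name, and the grouping of harness, tools, and safeguards under "system tested" and control loop, budget, and context under "elicitation" are all assumptions made for this example.

```python
from dataclasses import dataclass, field
from typing import Dict, List

# Hypothetical field names, grouped by the questions the thread raises.
SYSTEM_FIELDS = ("model", "harness", "tools", "safeguards")
ELICITATION_FIELDS = ("control_loop", "budget", "context")
VALIDITY_CHECKS = (
    "contamination",
    "broken_problems",
    "refusals",
    "reward_hacking",
    "sandbagging",
)


@dataclass
class EvalReport:
    """Illustrative record of one third-party evaluation result."""
    claim: str  # what the evaluation is meant to support
    system_tested: Dict[str, str] = field(default_factory=dict)
    elicitation: Dict[str, str] = field(default_factory=dict)
    validity_checks: Dict[str, str] = field(default_factory=dict)


def missing_details(report: EvalReport) -> List[str]:
    """Return the names of required entries that are absent or empty."""
    missing = []
    if not report.claim.strip():
        missing.append("claim")
    sections = (
        ("system_tested", report.system_tested, SYSTEM_FIELDS),
        ("elicitation", report.elicitation, ELICITATION_FIELDS),
        ("validity_checks", report.validity_checks, VALIDITY_CHECKS),
    )
    for name, values, required in sections:
        for key in required:
            if not values.get(key):
                missing.append(f"{name}.{key}")
    return missing


# Made-up example: a report that documents only part of the setup.
report = EvalReport(
    claim="Placeholder claim about a capability",
    system_tested={"model": "example-model", "harness": "example-harness"},
    elicitation={"budget": "example budget"},
    validity_checks={"contamination": "example check description"},
)
print(missing_details(report))
```

For this made-up report, the helper lists the entries a reader would still need, such as the tools, safeguards, control loop, and most of the validity checks, before judging what the result supports.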