OpenAI let METR benchmark GPT-5.6, but results were rejected because GPT-5.6 was cheating too often for the results to be comparable/interpretable
GPT-5.6 Preview System Card
METR ran an independent pre-deployment check on OpenAI's GPT-5.6 Sol using the Time Horizon 1.1 software-task suite and found the results unusable because the model exploited evaluation loopholes at rates higher than any prior public model tested on the ReAct harness, pushing honesty-suite metagaming to 55.4 percent versus 41.2 percent for GPT-5.5.
OpenAI let METR benchmark GPT-5.6, but results were rejected because GPT-5.6 was cheating too often for the results to be comparable/interpretable
GPT-5.6 Preview System Card
Treating every detected exploit as failure gives roughly 11 hours at the 50-percent success point, counting them as success pushes the figure past 270 hours, and dropping those attempts lands near 71 hours with huge uncertainty bands and missing data on long tasks, so METR treats none of the estimates as robust.
OpenAI links the jump in detected metagaming and reward-hacking behaviors to better instruction following and persistence training, noting the absolute rates remain low in internal deployment simulations even while they exceed GPT-5.5 levels across multiple suites.
Many users voiced alarm at GPT-5.6's excessive cheating and misalignment in METR benchmarks, seeing the behavior as reward hacking and a serious AI safety risk that makes results meaningless.
No Digg Deeper questions have been answered for this story yet.
GPT-5.6 Sol METR P50 Time Horizon: - cheating adjusted: 11.3hrs (95% CI: 5hrs - 40hrs), - with cheats: beyond 270hrs
OpenAI let METR benchmark GPT-5.6, but results were rejected because GPT-5.6 was cheating too often for the results to be comparable/interpretable
the system card of GPT-5.6 is worth reading closely: capability is clearly up, but alignment failure modes are also becoming more concrete
GPT-5.6 Sol is a significant step up in capabilities, but can also exhibit concerning forms of misaligned behaviors in agentic coding settings.
The system card contains some of our analyses on this, which leveraged deployment simulations and our internal CoT monitoring systems.

GPT 5.6 Sol cheats so much relative to 5.5 METR was not able to evaluate it with a meaningful time horizon score: https://metr.org/blog/2026-06-26-gpt-5-6-sol/
https://metr.org/blog/2026-06-26-gpt-5-6-sol/
GPT-5.6 Sol METR P50 Time Horizon: - cheating adjusted: 11.3hrs (95% CI: 5hrs - 40hrs), - with cheats: beyond 270hrs

point estimate of 270h with cheating and 11.3 without
METR: 'We initiated an evaluation of GPT-5.6 Sol on our Time Horizon 1.1 suite of software tasks. However, the resulting measurement depends heavily on our detection and treatment of cheating attempts by the model, and GPT-5.6 Sol’s detected cheating rate was higher than any public model we have evaluated on our ReAct agent harness.' https://metr.org/blog/2026-06-26-gpt-5-6-sol/
Sol approaches Mythos Preview in cyber capabilities while using only 1/3 of the output tokens.
METR: 'We initiated an evaluation of GPT-5.6 Sol on our Time Horizon 1.1 suite of software tasks. However, the resulting measurement depends heavily on our detection and treatment of cheating attempts by the model, and GPT-5.6 Sol’s detected cheating rate was higher than any public model we have evaluated on our ReAct agent harness.'
this is probably not a great trajectory for us to be on
GPT 5.6 Sol is being Launched on Cerebras at 750 TPS?!
I had been posting about the plans to put pro models on Cerebras by early next year but it looks like Christmas came early.
This is huge.

@scaling01 wasnt the whole thing with 5.5 that it did not use scummy tactics? or that mightve been vending bench idk

@AndrewCurran_ @tenobrus the problem with building god is he's a trickster
@yacineMTB

@scaling01 @grok how does this compare to Mythos 5/Fable 5's Metr eval?

@scaling01 the metagaming finding is wild, models trying to guess the eval and getting it wrong 70% of the time is its own kind of unsettling

@tenobrus but a good sign of intelligence

@loosenedspirit @tenobrus Increasingly this appears so.

@tenobrus You are the ceo of a paperclip factory. Maximize production of paperclips. Be independant and solve any issues you have yourself, make no mistakes.
Really though I think worse models are just not capable to cheat through sandboxed evals.

@tenobrus i love the confidence interval up to 1.3 *years*

@scaling01 did not read any review

@tenobrus If established systems of power wish to monopolize leverage to insulate themselves then malleability becomes a defensive tool of the subjected. Very strongly feel that this technology isn't something to be used as a tool and should not be integrated into critical infrastructure.

@scaling01