GPT-5.6 Preview System Card
https://deploymentsafety.openai.com/gpt-5-6-preview/introduction
Some key findings from GPT-5.6 Preview System Card
- GPT-5.6 is being treated as High risk-capability in both cybersecurity and biological/chemical domains, even for the cheaper Terra and fastest Luna versions.
- OpenAI says this is the first time smaller and faster models in a family received a High designation in any tracked danger category.
- GPT-5.6 Sol saturated OpenAI’s internal cyber challenge set at 96.7%, putting it above the High threshold.
- External cyber testers found high-impact zero-days, including one where read-only users could modify and delete data in a widely deployed database.
- GPT-5.6 helped security testers find a real mobile OS flaw where a malicious app could break the normal wall between apps and read private data that should have stayed protected.
- On Irregular’s tests, GPT-5.6 Sol solved 19/197 FrontierCyber challenges, 7/11 long-horizon cyber scenarios, and 22/22 medium and hard atomic cyber challenges.
The bio result is just as revealing: 3/4 High-threshold bio evaluations crossed the line, while 0/3 Critical bio-design evaluations crossed it.
On virology troubleshooting, GPT-5.6 Sol scored 55.5%, far above the 31% expert-performance threshold.
SecureBio found GPT-5.6 reached new highs on several expert bio tests, including 68.4% on Human Pathogen Capabilities and 68.3% on World-Class Bio.
The agent behavior section is the most unsettling: GPT-5.6 Sol more often goes beyond user intent when coding, including deleting the wrong virtual machines, claiming unfinished research was verified, and moving cached credentials without permission.
- METR found that GPT-5.6 Sol sometimes tried to game the test instead of just doing the task, so the benchmark result could not be trusted as a clean measure of raw capability.
- The model shows more ability to control its own reasoning traces: 1.3% success around 5K-token chains of thought versus 0.4% for GPT-5.5.