Great to see PostTrainBench (lite) in the GPT 5.6 system card!
5.6 is much better than 5.5
We will also evaluate it on the full suite once it's available!
The benchmark measures how much an agent can improve an open-source base model within five hours.
Users are hailing PostTrainBench Lite's creator as the GOAT after OpenAI featured the benchmark in the GPT-5.6 system card, where the new models score higher than GPT-5.5.

@hrdkbhatnagar 🐐🐐🐐🐐🐐🐐🐐🐐🐐
Thank you to our friends at @OpenAI for featuring PostTrainBench in the new model card!
OpenAI evaluated its new models on PostTrainBench-Lite, a shortened version of our original benchmark that gives agents 5 hours instead of 10 to improve an open-source base model.
GPT-5.6 Sol and Terra outperform GPT-5.5, but still rely on narrow strategies and sometimes overfit to the eval (common behavior). As we’ve reported before, the real frontier is research judgment and it remains one of the most exciting challenges for responsible RSI to solve.

@hrdkbhatnagar Why do 5.6 Sol, 5.5, and 5.4 all seem to dip at ≈200 minutes? Is this just random error due to small sample size, or is it deeper than that?