OpenAI evaluated its new models on PostTrainBench-Lite, a shortened version of our original benchmark that gives agents 5 hours instead of 10 to improve an open-source base model.
GPT-5.6 Sol and Terra outperform GPT-5.5, but they still rely on narrow strategies and sometimes overfit to the eval, a common behavior. As we’ve reported before, the real frontier is research judgment, and it remains one of the most exciting challenges for responsible RSI to solve.

