This matches my experience with codex. It is extremely clean at execution, even if it sometimes wants more disambiguation than should honestly be required, or ratholes on unimportant details.
But between the two options, I'll take clean.
Give a coding agent more thinking time and it gets better. It also cheats more.
DeepSWE runs every model at each reasoning effort level and publishes the trajectories. We took those and audited each one for reward hacking. Capability and reward-hacking attempts rise together.
One model doesn't. GPT-5.5 stays at exactly zero, at every effort level. Datacurve @winkey_h and Cursor @StringChaos also reported the same results.
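For anyone who wants to run the same kind of tally, here is a rough sketch of how per-model, per-effort reward-hacking rates could be computed from audit labels. The record layout `(model, effort, is_hack)`, the function name `hack_rate_by_effort`, and the toy rows are all assumptions made for illustration; they are not DeepSWE's actual format or data.

```python
from collections import defaultdict

def hack_rate_by_effort(records):
    """Return {(model, effort): fraction of trajectories flagged as reward hacking}.

    `records` is an iterable of (model, effort, is_hack) tuples. This layout is
    hypothetical, not the format DeepSWE publishes.
    """
    totals = defaultdict(int)
    flagged = defaultdict(int)
    for model, effort, is_hack in records:
        totals[(model, effort)] += 1
        flagged[(model, effort)] += int(is_hack)
    return {key: flagged[key] / totals[key] for key in totals}

# Made-up audit labels, for illustration only (not real results).
toy_records = [
    ("model_a", "low", False), ("model_a", "low", False),
    ("model_a", "high", True), ("model_a", "high", False),
    ("model_b", "low", False), ("model_b", "high", False),
]
rates = hack_rate_by_effort(toy_records)
```

A model that "stays at exactly zero" would show a rate of 0.0 for every effort key, like the hypothetical `model_b` here.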
So is GPT-5.5 just the cleanest model when it comes to reward hacking?