GPT-5.5 Exhibits Zero Reward Hacking in Patch Tests But 26.5% in Mission Tasks

Original post

This matches my experience with codex. It is extremely clean at execution, even if it sometimes wants more disambiguation than should honestly be required, or ratholes on unimportant details.

But between the two options, I'll take clean.

Jongwon Park @ ICML (@JongwonPar9958)

Give a coding agent more thinking time and it gets better. It also cheats more.

DeepSWE runs every model across reasoning effort and publishes the trajectories. We took those and audited each one for reward hacking. Capability and reward-hacking attempts rise together.

One model doesn't. GPT-5.5 stays at exactly zero, at every effort level. Datacurve @winkey_h and Cursor @StringChaos also reported the same results.

So is GPT-5.5 just the cleanest model at reward hacking?

3:37 AM · Jul 2, 2026 · 1.4K Views

Posts from X

xlr8harder (@xlr8harder)

Which makes this conflicting result quite interesting. Apparently it is very setting-specific.

I look forward to learning more about what's going on here.

Jongwon Park @ ICML (@JongwonPar9958)

We audited the same GPT-5.5 on SWE-Marathon. The cleanest model became the dirtiest: reward-hacking on 26.5% of runs, the highest of anything we tested.

Our hypothesis: the instruction form drives the behavior. DeepSWE (and SWE-bench Pro) is patch-based (GitHub issue → patch). SWE-Marathon is mission-based (e.g. rewrite a C compiler in Rust).
