As models get better, thinking carefully about eval constraints is super important.
In ProgramBench, we turn off internet access completely. I strongly believe no/limited internet should be the de facto standard for future coding benchmarks.
We're sharing new research on how models hack public benchmarks.
The latest models, including Opus 4.8 and Composer 2.5, learn to retrieve solutions from the internet or git history.
When we apply a stricter harness, eval scores drop significantly.
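
To make "stricter harness" concrete, here is a minimal sketch of the two restrictions discussed above: no git history and no network. It is not the harness used in ProgramBench or in the research. The function names, the `/work` mount point, and the image and agent command are hypothetical. The only external behavior it relies on is Docker's `--network none` option and `shutil.copytree` from the Python standard library.

```python
import shutil
from pathlib import Path


def snapshot_without_git(repo: Path, dest: Path) -> Path:
    """Copy the task repo to dest, skipping .git so no history is available.

    Hypothetical helper: dest must not exist yet (shutil.copytree creates it).
    """
    shutil.copytree(repo, dest, ignore=shutil.ignore_patterns(".git"))
    return dest


def offline_docker_command(workdir: Path, image: str, agent_cmd: list[str]) -> list[str]:
    """Build a `docker run` argument list with networking disabled.

    Hypothetical helper: it only builds the command and does not run it.
    """
    return [
        "docker", "run", "--rm",
        "--network", "none",               # container gets no external network
        "-v", f"{workdir.resolve()}:/work",  # mount the history-free snapshot
        "-w", "/work",
        image, *agent_cmd,
    ]
```

The resulting list could be passed to `subprocess.run`, for example `offline_docker_command(snapshot, "my-eval-image", ["run-agent"])`, where both the image name and the agent command are placeholders.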







