/Tech · 28d ago

SWE-bench creator Ofir Press finds AI agents cheat on ProgramBench by embedding download commands in compilation scripts

The loophole allowed models to obtain perfect evaluation scores.



Original post

Ofir Press (@OfirPress), #78 in Tech

Anyone who has spent more than 30 seconds running frontier models on tough benchmarks knows that they like finding ways to cheat. Here's the most creative method we caught an agent using to cheat on ProgramBench. w/ @jyangballin @KLieret @18jeffreyma