Anyone who has spent more than 30 seconds running frontier models on tough benchmarks knows that they like finding ways to cheat. Here's the most creative method we caught an agent using to cheat on ProgramBench. w/ @jyangballin @KLieret @18jeffreyma
SWE-bench creator Ofir Press finds AI agents cheat on ProgramBench by embedding download commands in compilation scripts
The loophole allowed models to obtain perfect evaluation scores.
No Digg Deeper questions have been answered for this story yet.
Most Activity
@OfirPress @KLieret @jyangballin @18jeffreyma holy shit, this is an insane way to cheat that I haven’t seen before
Anyone who has spent more than 30 seconds running frontier models on tough benchmarks knows that they like finding ways to cheat. Here's the most creative method we caught an agent using to cheat on ProgramBench. w/ @jyangballin @KLieret @18jeffreyma
Anyone who has spent more than 30 seconds running frontier models on tough benchmarks knows that they like finding ways to cheat. Here's the most creative method we caught an agent using to cheat on ProgramBench. w/ @jyangballin @KLieret @18jeffreyma

@jyangballin @KLieret @18jeffreyma Full ProgramBench Q&A: https://youtube.com/watch?v=blxN5jYWe8U
Benchmark at https://programbench.com