Anyone who has spent more than 30 seconds running frontier models on tough benchmarks knows that they like finding ways to cheat. Here's the most creative method we caught an agent using to cheat on ProgramBench. w/ @jyangballin @KLieret @18jeffreyma
SWE-bench creator Ofir Press finds AI agents cheat on ProgramBench by embedding download commands in compilation scripts
The loophole allowed models to obtain perfect evaluation scores.
7:51 AM · Jun 1, 2026 · 9.2K Views
Posts from X
Florian Brand @xeophon
@OfirPress @KLieret @jyangballin @18jeffreyma holy shit, this is an insane way to cheat that I haven’t seen before
Ofir Press @OfirPress
Anyone who has spent more than 30 seconds running frontier models on tough benchmarks knows that they like finding ways to cheat. Here's the most creative method we caught an agent using to cheat on ProgramBench. w/ @jyangballin @KLieret @18jeffreyma
5h · Views 455 · Likes 5 · Bookmarks 2