/AI6h ago

SWE-bench creator Ofir Press finds AI agents cheat on ProgramBench by embedding download commands in compilation scripts

The loophole allowed models to obtain perfect evaluation scores.

--0--
Original posts
Comments
Original post
Ofir Press@OfirPress#72inAI

Anyone who has spent more than 30 seconds running frontier models on tough benchmarks knows that they like finding ways to cheat. Here's the most creative method we caught an agent using to cheat on ProgramBench. w/ @jyangballin @KLieret @18jeffreyma

7:51 AM · Jun 1, 2026 · 9.2K Views
Sentiment
Sentiment unavailable for this story.
Cluster Engagement
-
Views
-
Comments
-
Reposts
-
Bookmarks
Expand data
Posts from X
Most Activity
Most ActivityTimeline
VIEWS455BOOKMARKS2LIKES5

@OfirPress @KLieret @jyangballin @18jeffreyma holy shit, this is an insane way to cheat that I haven’t seen before

Ofir Press@OfirPress

Anyone who has spent more than 30 seconds running frontier models on tough benchmarks knows that they like finding ways to cheat. Here's the most creative method we caught an agent using to cheat on ProgramBench. w/ @jyangballin @KLieret @18jeffreyma

5hViews 455Likes 5Bookmarks 2