Tech · 4h ago

SWE-bench creator Ofir Press outlines a cyclical framework where benchmark development drives language model progress

The cycle maps how benchmarks guide model pretraining and scaffolding

Original post

Ofir Press (@OfirPress) · #78 in Tech

slide from my current talk:

5:34 PM · Jun 17, 2026 · 2.5K Views

Posts from X

This is the "outer" reinforcement learning loop.

@OfirPress I think you're missing one more step before new evals are created: benchmaxxing.
