
ProgramBench analysis finds GPT-5.5 spends 80% of turns probing requirements while Claude relies on iterative generation and verification

GPT models then generate near-final code with minimal verification.

Original post
Ofir Press (@OfirPress)

John (@jyangballin) talking about the wide behavioral differences between GPT and Claude on ProgramBench.

7:42 AM · Jun 1, 2026 · 1.5K Views
Posts from X
Lisan al Gaib (@scaling01)

@OfirPress @jyangballin (and Mythos if that's possible)

Ofir Press (@OfirPress)

@jyangballin Full ProgramBench Q&A: https://www.youtube.com/watch?v=blxN5jYWe8U

Full benchmark at https://programbench.com/
