
ProgramBench analysis finds GPT-5.5 spends 80% of turns probing requirements while Claude relies on iterative generation and verification

GPT models then generate near-final code with minimal verification.

Original post

Ofir Press (@OfirPress)

John (@jyangballin) talking about the wide behavioral differences between GPT and Claude on ProgramBench.

7:42 AM · Jun 1, 2026 · 1.5K Views

Posts from X

Views: 115 · Likes: 2

Lisan al Gaib (@scaling01)

@OfirPress @jyangballin (and Mythos if that's possible)

Ofir Press (@OfirPress)

@jyangballin Full ProgramBench Q&A: https://www.youtube.com/watch?v=blxN5jYWe8U

Full benchmark at https://programbench.com/
