Tech · 3h ago

Second Batch of 1st Proof Results Shows No AI Improvement

Original post

Firstproof results are out. My main takeaway: GPT 5.5 Pro is a very strong model; 3 of the 4 teams used it. Our Princeton team used Gemini 3.1 with our fall '25-style harness (the original version performed very well on IMO problems). But it is clear that vanilla prompting of GPT 5.5 Pro gives very strong (and token-efficient) results on research-level math problems: https://1stproof.org/assets/docs/report.pdf

12:16 PM · Jun 10, 2026 · 8.7K Views

Sanjeev Arora@prfsanjeevarora · #121 in Tech

Sentiment

Users reacted negatively to the second batch of 1st proof results showing no AI improvement, calling the outcomes worse than expected and criticizing high costs for only minor gains.

Pos: 0.0% · Neg: 100.0%

2 comments with sentiment.

Posts from X

Daniel Litt@littmath

I haven't yet had a chance to get a sense of how difficult these problems are, though the report (here: https://1stproof.org/assets/docs/report.pdf) contains preregistered difficulty statements. See e.g. here for the statement on problem 8, which wasn't solved.

3h

Daniel Litt@littmath

A bit interesting also that Submission D (a scaffold for Gemini 3.1) seems to have underperformed GPT 5.5 Pro "out of the box," in many cases at ~10x the cost... Submission A, which performed best, spent between $36 and ~$950 per problem.

2h

Jose Brox 🏳️‍🌈🏳️‍⚧️@josebrox

@littmath Are the prompts used known?

2h

Torgeir Lysen@torgeirlysen

@littmath Correct me if I’m mistaken, but it seems like the problems might have been slightly more difficult this time around.

3h

Daniel Litt@littmath

@torgeirlysen Hard to say, I'm not really sure.

2h

Torgeir Lysen@torgeirlysen

@littmath I was just going by the wording in the report, since they increased the proof length and specified that the problems require “nonstandard insight” to solve.

2h

Tilman Bayer@tilmanbayer

@littmath Interesting how often reviewers disagreed about whether "the mathematics is correct", e.g. regarding all four submissions for problem 7... From a quick skim of section 5, it seems this was often because of differences in tolerance for omitted steps or definitions.

2h

Daniel Litt@littmath

@torgeirlysen I think this was probably more "learning from experience" than "making the problems more difficult." E.g. I think it's pretty plausible that problem 7 from the first batch was harder in some sense than any of these, though I'm not sure.

2h

big brane boi@bcubeddd

@littmath Wow, that’s worse than even I would have expected.

2h

Torgeir Lysen@torgeirlysen

@littmath Perhaps. It’s quite hard to tell without being an expert in any of these areas.

2h

unique foot connoisseur@nonorrvau

@littmath From a brief glance, this feels super disappointing for harnesses? It seems like they 10x the cost for pretty minor improvements that might be overshadowed by even a few months of general model improvements.

2h

7rtp@fredyfredo123

I have some results in @leanprover.

I've enriched Nat: each integer is a relational interface with its own global trace.

A trajectory does not merely cross numbers: it crosses traced interfaces carrying a diagonal witness, memory, and height.

If it fails to close, height grows. Under finite budget, closure, return, or explicit obstruction is forced.

3h

Daniel Litt@littmath

@josebrox Yes, see section 4 here: https://1stproof.org/assets/docs/report.pdf

2h

Daniel Litt@littmath

@tilmanbayer Informal mathematics is much less of a "verifiable domain" than one might think!

2h

FateOfMuffins@FateOfMuffins

@littmath It seems Systems A and B are both harnesses of GPT 5.5 Pro (which is itself a harness)? It seems System B wasn't able to squeeze much more out of GPT 5.5 Pro than it delivers out of the box, despite 30x the cost.

I wonder how 5.6, 5.6 Pro and Mythos 5 would do in the coming weeks.

1h