
Second Batch of 1st Proof Results Shows No AI Improvement

Original post
Sanjeev Arora @prfsanjeevarora

Firstproof results are out. My main takeaway: GPT5.5pro is a very strong model. 3/4 teams used it. Our Princeton team used Gemini 3.1 with our fall '25 style harness (the original version performed very well on IMO problems). But it is clear that vanilla prompting of 5.5pro gives very strong, and token-efficient, results on research-level math problems: https://1stproof.org/assets/docs/report.pdf

12:16 PM · Jun 10, 2026 · 8.7K Views
Sentiment

Users reacted negatively to the second batch of 1st proof results showing no AI improvement, calling the outcomes worse than expected and criticizing high costs for only minor gains.

Pos 0.0% · Neg 100.0% · 2 comments with sentiment.
Cluster Engagement
Posts from X
Most Activity
Daniel Litt @littmath

I haven't yet had a chance to get a sense of how difficult these problems are, though the report (here: https://1stproof.org/assets/docs/report.pdf) contains preregistered difficulty statements. See e.g. here for the statement on problem 8, which wasn't solved.

3h · Views 2.1K · Likes 13 · Bookmarks 1
Daniel Litt @littmath

A bit interesting also that Submission D (a scaffold for Gemini 3.1) seems to have underperformed GPT 5.5 Pro "out of the box," in many cases at ~10x the cost... Submission A, which performed best, spent between $36 and ~$950 per problem.

2h · Views 1.1K · Likes 12 · Bookmarks 2
Torgeir Lysen @torgeirlysen

@littmath Correct me if I’m mistaken, but it seems like the problems might have been slightly more difficult this time around.

3h · Views 317 · Likes 7
Daniel Litt @littmath

@torgeirlysen Hard to say, I'm not really sure.

2h · Views 268 · Likes 5
Torgeir Lysen @torgeirlysen

@littmath I was just thinking based on the wording in the report, since they increased the proof length and specified that the problems require “nonstandard insight” to solve.

2h · Views 125 · Likes 3
Tilman Bayer @tilmanbayer

@littmath Interesting how often reviewers disagreed about whether "the mathematics is correct", e.g. regarding all four submissions for problem 7... From a quick skim of section 5, it seems this was often because of differences in tolerance for omitted steps or definitions.

2h · Views 232 · Likes 2
Daniel Litt @littmath

@torgeirlysen I think this was probably more "learning from experience" than "making the problems more difficult." E.g. I think it's pretty plausible that problem 7 from the first batch was harder in some sense than any of these, though I'm not sure.

2h · Views 122 · Likes 1
big brane boi @bcubeddd

@littmath Wow that’s worse than what even I would have expected

2h · Views 75 · Likes 5
Torgeir Lysen @torgeirlysen

@littmath Perhaps. It’s quite hard to tell without being an expert in any of these areas.

2h · Views 90 · Likes 2

@littmath from a brief glance this feels super disappointing for harnesses? seems like they 10x the cost for pretty minor improvements that might be overshadowed by even a few months of general improvements?

2h · Views 67 · Likes 1
7rtp @fredyfredo123

I have some results in @leanprover

I've enriched Nat, each integer is a relational interface with its own global trace.

A trajectory does not merely cross numbers: it crosses traced interfaces carrying a diagonal witness, memory, and height.

If it fails to close, height grows. Under finite budget, closure, return, or explicit obstruction is forced.

3h · Views 197
Daniel Litt @littmath

@josebrox Yes, see section 4 here: https://1stproof.org/assets/docs/report.pdf

2h · Views 110
Daniel Litt @littmath

@tilmanbayer Informal mathematics is much less of a "verifiable domain" than one might think!

2h · Views 78
FateOfMuffins @FateOfMuffins

@littmath It seems Systems A and B are both harnesses of GPT 5.5 Pro (which itself is a harness)? It seems System B wasn't able to squeeze much more out of GPT 5.5 Pro than it delivers out of the box, despite 30x the cost.

I wonder how 5.6, 5.6 Pro and Mythos 5 would do in the coming weeks.

1h · Views 9