Firstproof results are out. My main takeaway: GPT 5.5 Pro is a very strong model. 3/4 teams used it. Our Princeton team used Gemini 3.1 with our fall '25 style harness (the original version performed very well on IMO problems). But it is clear that vanilla prompting of 5.5 Pro gives very strong, and token-efficient, results on research-level math problems: https://1stproof.org/assets/docs/report.pdf

I haven't yet had a chance to get a sense of how difficult these problems are, though the report (https://1stproof.org/assets/docs/report.pdf) contains preregistered difficulty statements. See e.g. the statement on problem 8, which wasn't solved.

A bit interesting also that Submission D (a scaffold for Gemini 3.1) seems to have underperformed GPT 5.5 Pro "out of the box," in many cases at ~10x the cost... Submission A, which performed best, spent between $36 and ~$950 per problem.

@littmath Are the prompts used known?

@littmath Correct me if I’m mistaken, but it seems like the problems might have been slightly more difficult this time around.

@torgeirlysen Hard to say, I'm not really sure.

@littmath I was just thinking based on the wording in the report. Since they increased the proof length and specified that the problems require “nonstandard insight” to solve.

@littmath Interesting how often reviewers disagreed about whether "the mathematics is correct", e.g. regarding all four submissions for problem 7... From a quick skim of section 5, it seems this was often because of differences in tolerance for omitted steps or definitions.

@torgeirlysen I think this was probably more "learning from experience" than "making the problems more difficult." E.g. I think it's pretty plausible that problem 7 from the first batch was harder in some sense than any of these, though I'm not sure.

@littmath Wow that’s worse than what even I would have expected

@littmath Perhaps. It’s quite hard to tell without being an expert in any of these areas.

@littmath from a brief glance this feels super disappointing for harnesses? seems like they 10x the cost for pretty minor improvements that might be overshadowed by even a few months of general improvements?

I have some results in @leanprover
I've enriched Nat, each integer is a relational interface with its own global trace.
A trajectory does not merely cross numbers: it crosses traced interfaces carrying a diagonal witness, memory, and height.
If it fails to close, height grows. Under finite budget, closure, return, or explicit obstruction is forced.

@josebrox Yes, see section 4 here: https://1stproof.org/assets/docs/report.pdf

@tilmanbayer Informal mathematics is much less of a "verifiable domain" than one might think!

@littmath It seems Systems A and B are both harnesses of GPT 5.5 Pro (which itself is a harness)? It seems system B wasn't able to squeeze that much more out of just GPT 5.5 Pro out of the box despite 30x the cost.
I wonder how 5.6, 5.6 Pro, and Mythos 5 would do in the coming weeks.