Leanstral 1.5 shows smooth test-time scaling on PutnamBench but draws skepticism from Princeton's Sanjeev Arora

Sanjeev Arora called the mathematical model's leaderboard performance unimpressive.

Original post

Sanjeev Arora (@prfsanjeevarora)

This is not impressive compared to current models on the leaderboard https://trishullab.github.io/PutnamBench/leaderboard.html

Mert Ünsal (@mertunsal2020)

Leanstral 1.5 shows the strongest test-time scaling we have seen from a formal-reasoning model. The figure below tracks Pass@8 on PutnamBench as we raise the token budget per attempt from 25k to 4M: performance climbs smoothly the whole way.

3:43 PM · Jul 3, 2026 · 7.8K Views

Related links

PutnamBench Leaderboard: https://trishullab.github.io/PutnamBench/leaderboard.html

Posts from X

Dan Roy (@roydanroy)

@prfsanjeevarora 👀

Mert Ünsal (@mertunsal2020)

It’s much cheaper than Aleph Prover and Seed Prover high (and beats Seed Prover high, which uses 10 H20 days per problem)!

It’s also better than Goedel Architect without NL solutions provided and a tiny bit worse with, using no special scaffold (you run the code agent with compaction and that’s it)!

Agree that it doesn’t move the frontier in number of Putnam problems solved which is why we put the Aleph Prover in the main plot!
