This is not impressive compared to current models on the PutnamBench leaderboard: https://trishullab.github.io/PutnamBench/leaderboard.html
Leanstral 1.5 shows the strongest test-time scaling we have seen from a formal-reasoning model. The figure below tracks Pass@8 on PutnamBench as we raise the token budget per attempt from 25k to 4M: performance climbs smoothly the whole way.
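
For readers unfamiliar with the metric, here is a minimal sketch of how a Pass@8 score can be computed. The text does not say how the number was calculated, so this assumes the common unbiased pass@k estimator, 1 - C(n-c, k) / C(n, k), where n is the number of attempts sampled per problem and c is the number the proof checker accepted. The names `pass_at_k` and `benchmark_pass_at_k` and the shape of the `results` dictionary are hypothetical, chosen only for illustration.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate the chance that at least one of k attempts succeeds,
    given n sampled attempts of which c were verified."""
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) counts the k-subsets made up only of failed attempts;
    # it is 0 when there are fewer than k failures.
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(results: dict[str, list[bool]], k: int = 8) -> float:
    """Average pass@k over problems.

    `results` maps a problem id to one boolean per attempt
    (True means the proof was accepted). This layout is an assumption.
    """
    scores = [pass_at_k(len(a), sum(a), k) for a in results.values()]
    return sum(scores) / len(scores)
```

With exactly 8 attempts per problem this reduces to the fraction of problems where at least one attempt was verified. Under this reading, each point on a scaling curve would be this average recomputed at one per-attempt token budget.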
