but honestly, this isn't the WeirdML profile; it clearly still sucks at this. Too many tokens
Results for BrokenArXiv:
Teortaxes noted the models missed expected WeirdML efficiency profiles.
Some users praise MathArena for running thorough benchmarks like ArXivMath, noting that high costs keep most other organizations from running GPT Pro.

Further: Fable 5 is less expensive than Opus 4.8 on ArXivMath, since it uses fewer tokens. Gemini-3.1-Pro also scores quite poorly this month, with DeepSeek-v4-Flash outperforming it.

Despite its impressive performance, Fable 5 is much more expensive than GPT 5.5. An accurate evaluation of its capabilities requires a comparison with GPT-5.5-Pro, but we currently cannot make this comparison due to the cost of GPT-5.5-Pro.
The latest versions of ArXivMath and BrokenArXiv have been released! Impressive performance from Fable 5, which takes the top spot on ArXivMath. On BrokenArXiv, GPT 5.5 remains in the lead.

Full results: http://matharena.ai

@TimGMath @Liam06972452 Incredible! Is broken arxiv math an alternative version of the same problems or a different proposal altogether?

@TimGMath have you approached OpenAI for credits for GPT Pro? i love MathArena and so few orgs run GPT Pro due to costs, painting an incomplete picture :( maybe @reach_vb could help you find the right person for grants?