Tech | 3h ago

Fable 5 and GPT 5.5 top updated ArXivMath and BrokenArXiv benchmarks, but critic warns of excessive token consumption

Teortaxes noted the models missed expected WeirdML efficiency profiles.

Sentiment

Some users praise MathArena for running thorough benchmarks like ArXivMath, noting that high costs keep most other organizations from evaluating GPT Pro.

Positive: 100.0% | Negative: 0.0% (1 comment with sentiment)
Cluster Engagement
Posts from X
Tim (@TimGMath)

Results for BrokenArXiv:

7h | Views 1.5K | Likes 5 | Bookmarks 1
Tim (@TimGMath)

Further: Fable 5 is less expensive than Opus 4.8 on ArXivMath, since it uses fewer tokens. Also, Gemini-3.1-Pro scores quite poorly this month, with DeepSeek-v4-Flash outperforming it.

7h | Views 388 | Likes 3 | Bookmarks 1
Tim (@TimGMath)

Despite its impressive performance, Fable 5 is much more expensive than GPT 5.5 and requires a comparison with GPT-5.5-Pro for an accurate evaluation of its capabilities, but we currently cannot make this comparison due to the costs of GPT-5.5-Pro.

7h | Views 425 | Likes 9 | Bookmarks 1
Tim (@TimGMath)

The latest versions of ArXivMath and BrokenArXiv have been released! Impressive Performance of Fable 5, which takes the top spot on ArXivMath. On BrokenArXiv, GPT 5.5 continues to be in the lead.

7h | Views 16.5K | Likes 85 | Bookmarks 26
Tim (@TimGMath)

Full results: http://matharena.ai

7h | Views 307 | Likes 2
Samian Noesis (@samiannoesis)

@TimGMath @Liam06972452 Incredible! Is broken arxiv math an alternative version of the same problems or a different proposal altogether?

5h | Views 144

@TimGMath have you approached openai for credits for gpt pro? i love matharena and so few orgs run gpt pro due to costs, painting an incomplete picture :( maybe @reach_vb could help you find the correct person for grants?

2h | Views 3