Florian Brand of Prime Intellect argues OpenAI Codex can legitimately achieve 50% accuracy on the GSM8K math reasoning benchmark

Dimitris Papailiopoulos clarified the 50% metric was a target.

Original post

@DimitrisPapail tell codex 50% is possible and it should reach it by legitimate means without cheating

6:49 AM · May 30, 2026

@xeophon i set it as a /goal. didn't care :(

Florian Brand @xeophon

@DimitrisPapail tell codex 50% is possible and it should reach it by legitimate means without cheating

1:49 PM · May 30, 2026 · 584 Views
1:50 PM · May 30, 2026 · 329 Views

@DimitrisPapail If you're willing to include things like units as hard-coded hints you can get more than 15%... what were your rules?

Dimitris Papailiopoulos @DimitrisPapail

The best symbolic solver for GSM8K (i.e., a pure Python program) achieved a ~15% test error on GSM8K, which is kind of incredible considering Llama 13B achieved the same :) Both Codex and CC seemed frustrated when pushed to grind beyond 15%, as it seems to be a ceiling. The End.

1:46 PM · May 30, 2026 · 15.3K Views
2:18 PM · May 30, 2026 · 490 Views