Florian Brand of Prime Intellect argues OpenAI Codex can legitimately achieve 50% accuracy on the GSM8K math reasoning benchmark

Dimitris Papailiopoulos replied that he had already set the 50% figure as a /goal for Codex, and that it made no difference.

Original post

#1153 Florian Brand @xeophon

@DimitrisPapail tell codex 50% is possible and it should reach it by legitimate means without cheating

6:49 AM · May 30, 2026

#197 Dimitris Papailiopoulos @DimitrisPapail

@xeophon i set it as a /goal. didn't care :(

1:50 PM · May 30, 2026 · 329 Views

#1352 Alex J. Champandard 🌱 @ALEXJC

@DimitrisPapail If you're willing to include things like units as hard-coded hints you can get more than 15%... what were your rules?

Dimitris Papailiopoulos @DimitrisPapail

The best symbolic solver for GSM8K (i.e., a pure python program) achieved a ~15% test error on GSM8K, which is kind of incredible considering Llama 13B achieved the same :) Both Codex and CC seemed frustrated when pushed to grind beyond 15%, as it seems to be a ceiling. The End.

1:46 PM · May 30, 2026 · 15.3K Views

2:18 PM · May 30, 2026 · 490 Views
