Florian Brand of Prime Intellect argues OpenAI Codex can legitimately achieve 50% accuracy on the GSM8K math reasoning benchmark.
Dimitris Papailiopoulos clarified the 50% metric was a target.
@xeophon i set it as a /goal. didn't care :(
@DimitrisPapail tell codex 50% is possible and it should reach it by legitimate means without cheating
@DimitrisPapail If you're willing to include things like units as hard-coded hints you can get more than 15%... what were your rules?
The best symbolic solver for GSM8k (i.e., a pure python program) achieved ~15% test accuracy on GSM8k, which is kind of incredible considering Llama 13B achieved the same :) Both Codex and CC seemed frustrated when pushed to grind beyond 15%, as it seems to be a ceiling. The End.