Math AI is roughly where coding was before CLI agents: single-turn and mostly ungrounded, with no dense feedback loop.
The best math prover we have today is GPT-5.5 Pro, which for the most part produces single-turn natural-language proofs, with no real reactive environment, grounding, or multi-turn correction. That is very much the opposite of the setting CLI agents like Codex or Claude Code operate in. With current top math AI models, you generate first and verify after the fact.
Terminal agents work so well because the terminal grounds them after every turn and lets them self-correct as they go. Each step gets verified on the way to the solution, and the bash terminal offers so much signal (literally thousands of tokens) that this helps at both training and test time! That kind of reactive and very verbose environment is exactly why Claude Code and Codex have taken off, and why they are the closest an LLM has come to being an embodied agent.
My conjecture is that math needs the equivalent: a reactive environment, a "file system", and a "math terminal" that builds pieces of the proof as you go, verifies them, and lets the model backtrack and redo a step without keeping the entire proof or process in its context. When a real agentic math model is trained by experience inside that kind of environment, I expect a phase transition, given how strong GPT-5.5 and Gemini 3.1 Pro already are in ungrounded, single-turn settings.
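
To make the idea a bit more concrete, here is a minimal Python sketch of what such a loop could look like. Everything in it is hypothetical: `MathTerminal`, `submit`, `backtrack`, `bounded_check`, and `requires` are names made up for illustration, and the "verifier" is only a bounded numeric check standing in for a real proof checker such as a Lean kernel. It shows the shape of the interaction (submit a piece, get feedback, store what passes, redo what fails), not an actual system.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

# A check receives the verified lemmas so far and returns (passed, feedback).
# Hypothetical interface: a real system would call a proof checker here.
Check = Callable[[Dict[str, str]], Tuple[bool, str]]


@dataclass
class MathTerminal:
    # The "file system": verified pieces live here, outside the model's context.
    lemmas: Dict[str, str] = field(default_factory=dict)
    # The per-turn feedback the model would read, like terminal output.
    log: List[str] = field(default_factory=list)

    def submit(self, name: str, statement: str, check: Check) -> bool:
        """Verify one piece of the proof; store it only if the check passes."""
        ok, message = check(self.lemmas)
        self.log.append(f"{name}: {'ok' if ok else 'FAILED'} ({message})")
        if ok:
            self.lemmas[name] = statement
        return ok

    def backtrack(self, name: str) -> None:
        """Drop a stored lemma so it can be redone."""
        self.lemmas.pop(name, None)
        self.log.append(f"{name}: removed")


def bounded_check(claimed: Callable[[int], int]) -> Check:
    """Toy verifier: compare a claimed closed form for 1 + 3 + ... + (2n - 1)
    against direct summation for n = 1..50. This is testing, not proving."""
    def check(_lemmas: Dict[str, str]) -> Tuple[bool, str]:
        for n in range(1, 51):
            actual = sum(2 * k - 1 for k in range(1, n + 1))
            if claimed(n) != actual:
                return False, f"counterexample at n={n}: expected {actual}, got {claimed(n)}"
        return True, "holds for n = 1..50"
    return check


def requires(dep: str) -> Check:
    """Toy verifier: only checks that a dependency was verified earlier."""
    def check(lemmas: Dict[str, str]) -> Tuple[bool, str]:
        if dep in lemmas:
            return True, f"uses verified lemma '{dep}'"
        return False, f"missing lemma '{dep}'"
    return check


term = MathTerminal()
# Turn 1: a wrong claim is rejected, with a counterexample as feedback.
first = term.submit("odd_sum", "sum of first n odd numbers = n^2 + 1",
                    bounded_check(lambda n: n * n + 1))
# Turn 2: the corrected claim passes and is stored.
second = term.submit("odd_sum", "sum of first n odd numbers = n^2",
                     bounded_check(lambda n: n * n))
# Turn 3: a later step builds on the stored lemma instead of re-deriving it.
third = term.submit("corollary", "1 + 3 + ... + 199 = 10000", requires("odd_sum"))
print("\n".join(term.log))
```

The point of the sketch is the feedback loop: the failed first attempt leaves a counterexample in the log, the retry succeeds, and the later step depends only on the stored lemma rather than on the full history of attempts.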