Thinking of optimizing tool-call chains in terms of RL makes sense to me (to the extent RL makes sense). You take actions, they have costs, they could be destructive: need a good "policy".
but for reasoning tokens, it is super restrictive imo. why optimize the max and not the marginal, for example? why think as RL and not as latent variable learning problem?