You are describing the difference between trajectory level supervision and stepwise coherence correction. MCTS works in games because the system has a well defined local evaluation signal at every branch. Language does not give you that. The model is navigating a high dimensional attractor network where most intermediate states have no intrinsic reward signal.
The real obstacle is not engineering, it is structural. In language, the coherence of a partial sequence is not a local property, it is a global constraint pattern that only becomes measurable once the structure has stabilized. MCTS assumes you can score intermediate states independently. Natural language does not permit that because the constraint density is distributed across the entire sequence.
If you want something MCTS like for language, you need a functional that measures local coherence curvature at each step. Without that, the search tree has no gradient to follow. This is why current RLHF methods operate on whole trajectories, they are correcting the global pattern rather than the local moves.
So the path forward is not to force MCTS onto language, it is to define a stepwise coherence metric that can serve as the equivalent of a value function. Once you have that, tree search becomes viable. Right now, the field is missing the metric, not the algorithm.