I like this perspective a lot. Relatedly, a nice piece of work in this area: https://arxiv.org/abs/2205.11275. As a further point, LLMs think in tokens, not text, so interpreting their thoughts is not as simple as it first appears, especially with modern models and vocabularies.
@yoavgo As it turns out, the KL-regularized return maximization objective is exactly the ELBO from variational inference. You are forced to use REINFORCE because the reparameterization trick doesn't apply to discrete tokens, but other than that it's a VAE where the action / reasoning tokens are the latents.
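
To spell out the correspondence, here is a rough sketch rather than a definitive derivation. The notation is assumed, not taken from the thread: x is the prompt, z the action / reasoning tokens, y the outcome being rewarded, \pi_\theta the policy (playing the role of the approximate posterior), \pi_0 the reference model (playing the role of the prior), and \beta the KL coefficient. The key assumption is that the reward can be read as a scaled log-likelihood, r(x, z) = \beta \log p(y | x, z).

```latex
\begin{align}
J(\theta)
  &= \mathbb{E}_{z \sim \pi_\theta(\cdot \mid x)}\big[ r(x, z) \big]
     - \beta \, \mathrm{KL}\big( \pi_\theta(\cdot \mid x) \,\|\, \pi_0(\cdot \mid x) \big) \\
% Assumption: r(x, z) = \beta \log p(y \mid x, z)
\tfrac{1}{\beta} J(\theta)
  &= \mathbb{E}_{z \sim \pi_\theta(\cdot \mid x)}\big[ \log p(y \mid x, z) \big]
     - \mathrm{KL}\big( \pi_\theta(\cdot \mid x) \,\|\, \pi_0(\cdot \mid x) \big) \\
  &\le \log \sum_{z} \pi_0(z \mid x) \, p(y \mid x, z)
   = \log p(y \mid x) \\
% Score-function (REINFORCE) gradient, since z is discrete:
\nabla_\theta J(\theta)
  &= \mathbb{E}_{z \sim \pi_\theta(\cdot \mid x)}\Big[
       \Big( r(x, z) - \beta \log \frac{\pi_\theta(z \mid x)}{\pi_0(z \mid x)} \Big)
       \nabla_\theta \log \pi_\theta(z \mid x) \Big]
\end{align}
```

The second line is the usual ELBO (expected log-likelihood minus KL to the prior), and the bound is tight when \pi_\theta matches the posterior over z. In the last line, the extra term from differentiating \log \pi_\theta inside the expectation drops out because \mathbb{E}_{z \sim \pi_\theta}[\nabla_\theta \log \pi_\theta(z \mid x)] = 0.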