
NLP researcher Yoav Goldberg proposes latent variable learning for reasoning tokens, but Rishabh Agarwal warns of mathematical intractability

Agarwal says marginalization is only feasible under binary rewards.

Original post

Thinking of optimizing tool-call chains in terms of RL makes sense to me (to the extent RL makes sense). You take actions, they have costs, they could be destructive: need a good "policy".

but for reasoning tokens, it is super restrictive imo. why optimize the max and not the marginal, for example? why think as RL and not as latent variable learning problem?

9:56 AM · Jun 4, 2026 · 3.2K Views
Posts from X
Views 1.1K · Bookmarks 2 · Likes 7 · Replies 2
Taco Cohen@TacoCohen

@yoavgo As it turns out, the KL regularized return maximization objective is exactly the ELBO from variational inference. One is forced to REINFORCE because you can’t use the reparameterization trick, but other than that it’s a VAE where action / reasoning tokens are the latents.

3h · Views 1.1K · Likes 7 · Bookmarks 2
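
The correspondence Cohen describes can be written out as follows. This is a sketch under assumed notation that does not appear in the thread: x is the prompt, z the reasoning tokens, r(x, z) the reward, π_θ the policy, π_ref the reference policy, and β > 0 the KL weight. Reading π_ref as the prior and exp(r/β) as an unnormalized likelihood is one standard interpretation, not necessarily the exact one Cohen has in mind.

```latex
% KL-regularized return (assumed notation, see above)
J(\theta) = \mathbb{E}_{z \sim \pi_\theta(\cdot \mid x)}\big[ r(x, z) \big]
  - \beta \, \mathrm{KL}\big( \pi_\theta(\cdot \mid x) \,\|\, \pi_{\mathrm{ref}}(\cdot \mid x) \big)

% Jensen's inequality with \pi_\theta as the variational posterior:
% the log marginal over reasoning paths is bounded below by J / \beta (an ELBO)
\log \sum_{z} \pi_{\mathrm{ref}}(z \mid x) \, e^{r(x, z)/\beta}
  \;\ge\; \mathbb{E}_{z \sim \pi_\theta}\!\left[ \frac{r(x, z)}{\beta} \right]
  - \mathrm{KL}\big( \pi_\theta \,\|\, \pi_{\mathrm{ref}} \big)
  \;=\; \frac{J(\theta)}{\beta}

% z is discrete, so no reparameterization; the score-function (REINFORCE) gradient is
\nabla_\theta J(\theta) = \mathbb{E}_{z \sim \pi_\theta}\!\left[
  \Big( r(x, z) - \beta \log \frac{\pi_\theta(z \mid x)}{\pi_{\mathrm{ref}}(z \mid x)} \Big)
  \nabla_\theta \log \pi_\theta(z \mid x) \right]
```

With β = 1 and r(x, z) = log p(y | x, z), the left-hand side of the inequality becomes log p(y | x), the marginal likelihood of the answer y that Goldberg's original post asks about.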
Retweets 1
Taco Cohen@TacoCohen

@yoavgo Yeah, if you use a single-sample estimator of the expectation. That's also true when you sample one z ~ p(z | x) in VAE. Of course you could use multiple samples or do fancier things.

@TacoCohen but you still commit to a single latent path, no? because this is how agents in an environment must act

2h · Views 82 · Likes 1 · Bookmarks 1
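
The difference between a single-sample and a multi-sample estimator can be illustrated on a toy problem. The sketch below is not from the thread: the four "reasoning paths", their prior probabilities, and their likelihoods are made-up numbers, and the multi-sample estimator is an importance-weighted bound that uses the prior as the proposal, which is only one possible reading of "multiple samples or fancier things".

```python
import math
import random

# Hypothetical toy problem: four possible reasoning paths z for one prompt x.
PRIOR = [0.4, 0.3, 0.2, 0.1]            # p(z | x), assumed values
LIKELIHOOD = [0.05, 0.10, 0.60, 0.90]   # p(y | x, z), assumed values


def exact_log_marginal():
    """log p(y | x) = log sum_z p(z | x) p(y | x, z): the 'marginal' objective."""
    return math.log(sum(p * l for p, l in zip(PRIOR, LIKELIHOOD)))


def best_path_log_joint():
    """max_z log [p(z | x) p(y | x, z)]: the 'max' objective, a single best path."""
    return max(math.log(p * l) for p, l in zip(PRIOR, LIKELIHOOD))


def k_sample_bound(k, rng):
    """Draw k paths from the prior and return log of the mean likelihood.

    With k = 1 this is the single-sample estimate log p(y | x, z).
    In expectation it is a lower bound on log p(y | x) that tightens as k grows.
    """
    zs = rng.choices(range(len(PRIOR)), weights=PRIOR, k=k)
    return math.log(sum(LIKELIHOOD[z] for z in zs) / k)


def average_bound(k, trials=5000, seed=0):
    """Monte Carlo average of the k-sample bound over many independent draws."""
    rng = random.Random(seed)
    return sum(k_sample_bound(k, rng) for _ in range(trials)) / trials


if __name__ == "__main__":
    print("exact log marginal :", exact_log_marginal())
    print("best single path   :", best_path_log_joint())
    print("avg 1-sample bound :", average_bound(1))
    print("avg 16-sample bound:", average_bound(16))
```

In this toy setting the averaged 16-sample bound sits much closer to the exact log marginal than the 1-sample bound does, while the best single path accounts for only part of the marginal mass. An agent acting in an environment still has to execute one path, which is Goldberg's point; the multi-sample bound changes the training signal, not the need to commit at action time.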