/AI · 3h ago

NLP researcher Yoav Goldberg proposes latent variable learning for reasoning tokens, but Rishabh Agarwal warns of mathematical intractability

Agarwal says marginalization is only feasible under binary rewards.

Original post

(((ل()(ل() 'yoav))))👾 @yoavgo · #92 in AI

Thinking of optimizing tool-call chains in terms of RL makes sense to me (to the extent RL makes sense). You take actions, they have costs, they could be destructive: need a good "policy".

but for reasoning tokens, it is super restrictive imo. why optimize the max and not the marginal, for example? why think as RL and not as latent variable learning problem?

9:56 AM · Jun 4, 2026 · 3.2K Views
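One way to read the max-versus-marginal question is sketched below. The numbers are hypothetical toy values, not taken from the thread: three reasoning paths z for a single prompt x, each with a prior probability p(z | x) and a probability p(y | x, z) of producing the correct answer y. The function names are illustrative only.

```python
import math

# Hypothetical toy numbers (an assumption for illustration, not from the thread).
P_Z_GIVEN_X = [0.5, 0.3, 0.2]    # p(z | x) for three reasoning paths; sums to 1
P_Y_GIVEN_XZ = [0.1, 0.9, 0.6]   # p(y | x, z): chance each path ends in the correct answer


def joint_scores(prior, likelihood):
    """Return p(z | x) * p(y | x, z) for every reasoning path z."""
    return [pz * py for pz, py in zip(prior, likelihood)]


def max_objective(prior, likelihood):
    """Log-score of the single best path: log max_z p(z | x) p(y | x, z)."""
    return math.log(max(joint_scores(prior, likelihood)))


def marginal_objective(prior, likelihood):
    """Log marginal likelihood: log sum_z p(z | x) p(y | x, z)."""
    return math.log(sum(joint_scores(prior, likelihood)))


if __name__ == "__main__":
    print("max objective:     ", max_objective(P_Z_GIVEN_X, P_Y_GIVEN_XZ))
    print("marginal objective:", marginal_objective(P_Z_GIVEN_X, P_Y_GIVEN_XZ))
```

Under this reading, the max objective credits only the best path, while the marginal credits every path that reaches the answer, weighted by how likely the model is to take it.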

Posts from X


Taco Cohen@TacoCohen

@yoavgo As it turns out, the KL regularized return maximization objective is exactly the ELBO from variational inference. One is forced to REINFORCE because you can’t use the reparameterization trick, but other than that it’s a VAE where action / reasoning tokens are the latents.
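The correspondence described in this reply can be written out as follows. This is a sketch under assumed notation that does not appear in the thread: prompt x, reasoning tokens z, policy \pi_\theta, reference policy \pi_0, reward r, regularization strength \beta > 0, and a binary "optimality" variable O whose likelihood is defined from the reward.

```latex
% Assumed notation, introduced for illustration only.
% KL-regularized return maximization:
\begin{align*}
J(\theta) &= \mathbb{E}_{z \sim \pi_\theta(\cdot \mid x)}\big[ r(x, z) \big]
  - \beta \, \mathrm{KL}\big( \pi_\theta(\cdot \mid x) \,\|\, \pi_0(\cdot \mid x) \big)
\end{align*}
% Define p(O = 1 \mid x, z) = \exp\big(r(x, z) / \beta\big) / C, with C a constant
% that does not depend on \theta. Treating z as the latent, \pi_0 as the prior and
% \pi_\theta as the variational posterior gives the ELBO:
\begin{align*}
\mathrm{ELBO}(\theta) &= \mathbb{E}_{z \sim \pi_\theta(\cdot \mid x)}\big[ \log p(O = 1 \mid x, z) \big]
  - \mathrm{KL}\big( \pi_\theta(\cdot \mid x) \,\|\, \pi_0(\cdot \mid x) \big) \\
  &\le \log \sum_{z} \pi_0(z \mid x) \, p(O = 1 \mid x, z) = \log p(O = 1 \mid x), \\
\frac{J(\theta)}{\beta} &= \mathrm{ELBO}(\theta) + \log C .
\end{align*}
% Because z is discrete, the reparameterization trick is unavailable and the
% gradient takes the score-function (REINFORCE) form:
\begin{align*}
\nabla_\theta J(\theta) &= \mathbb{E}_{z \sim \pi_\theta(\cdot \mid x)}\Big[
  \Big( r(x, z) - \beta \log \frac{\pi_\theta(z \mid x)}{\pi_0(z \mid x)} \Big)
  \nabla_\theta \log \pi_\theta(z \mid x) \Big]
\end{align*}
```

Under these assumptions, maximizing J and maximizing the ELBO differ only by the scale \beta and the constant \log C, so they have the same maximizer.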


Taco Cohen@TacoCohen

@yoavgo Yeah, if you use a single-sample estimator of the expectation. That's also true when you sample one z ~ p(z | x) in VAE. Of course you could use multiple samples or do fancier things.
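The difference between a single-sample and a multi-sample estimator can be illustrated with a small Monte Carlo sketch. Everything here is an assumption for illustration: the toy probabilities are hypothetical, the paths are sampled from the prior p(z | x), and the K-sample estimator shown (the log of an average of K sampled likelihoods) is one common choice among the "fancier things" one could do, not necessarily what the reply has in mind.

```python
import math
import random

# Hypothetical toy numbers (an assumption for illustration, not from the thread).
P_Z = [0.5, 0.3, 0.2]   # p(z | x) for three reasoning paths
P_Y = [0.1, 0.9, 0.6]   # p(y | x, z) for each path


def sample_path(rng):
    """Draw one reasoning path index z ~ p(z | x)."""
    return rng.choices(range(len(P_Z)), weights=P_Z, k=1)[0]


def k_sample_log_marginal(k, rng):
    """Estimate log p(y | x) as log( (1/K) * sum_k p(y | x, z_k) ), z_k ~ p(z | x).

    With k = 1 this is log p(y | x, z) for one committed path; larger k
    averages over several paths before taking the log.
    """
    likelihoods = [P_Y[sample_path(rng)] for _ in range(k)]
    return math.log(sum(likelihoods) / k)


def average_estimate(k, trials, seed=0):
    """Average the K-sample estimator over many independent trials."""
    rng = random.Random(seed)
    return sum(k_sample_log_marginal(k, rng) for _ in range(trials)) / trials


if __name__ == "__main__":
    exact = math.log(sum(pz * py for pz, py in zip(P_Z, P_Y)))
    print("exact log marginal:", exact)
    for k in (1, 4, 64):
        print(f"K={k:>2} average estimate:", average_estimate(k, trials=2000))
```

In this toy setup the single-sample estimate is, on average, a loose lower bound on the exact log marginal, and it tightens as K grows.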

(((ل()(ل() 'yoav))))👾@yoavgo

@TacoCohen but you still commit to a single latent path, no? because this is how agents in an environment must act
