/AI · 13h ago

Christian A. Naesseth and Kyle Kastner highlight the mathematical equivalence of RL with KL penalties and variational inference

Reasoning tokens act as latent variables under this formulation.

#1009

Original post

Kyle Kastner@kastnerkyle · #1009 in AI

I like this perspective a lot. Relatedly, here is a nice work in this area: https://arxiv.org/abs/2205.11275. As a further point, LLMs think in tokens, not text; interpreting thoughts is not as simple as it first appears, especially with modern models and vocabularies.

Taco Cohen@TacoCohen

@yoavgo As it turns out, the KL regularized return maximization objective is exactly the ELBO from variational inference. One is forced to REINFORCE because you can’t use the reparameterization trick, but other than that it’s a VAE where action / reasoning tokens are the latents.

8:17 AM · Jun 6, 2026 · 683 Views
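
The equivalence claimed in the quoted post can be sketched in a few lines. The notation here is an assumption and does not come from the posts: x is the prompt, z the action / reasoning tokens, r(x, z) the reward, β > 0 the KL weight, p_0 the reference policy, and q the policy being trained. Under those assumptions, dividing the KL-regularized return by β gives an ELBO-shaped bound in which exp(r/β) plays the role of an (unnormalized) likelihood and p_0 the prior over latents:

```latex
% Assumed notation (not from the posts): x = prompt, z = action/reasoning tokens,
% r(x,z) = reward, \beta > 0 = KL weight, p_0 = reference policy, q = trained policy.
\begin{align}
J(q) &= \mathbb{E}_{z \sim q(\cdot \mid x)}\big[ r(x,z) \big]
        - \beta \, \mathrm{KL}\big( q(z \mid x) \,\|\, p_0(z \mid x) \big) \\
\frac{J(q)}{\beta} &= \mathbb{E}_{z \sim q(\cdot \mid x)}\big[ \log \ell(x,z) \big]
        - \mathrm{KL}\big( q(z \mid x) \,\|\, p_0(z \mid x) \big),
        \qquad \ell(x,z) := \exp\!\big( r(x,z)/\beta \big) \\
&\le \log \mathbb{E}_{z \sim p_0(\cdot \mid x)}\big[ \ell(x,z) \big],
        \qquad \text{with equality at } q^{*}(z \mid x) \propto p_0(z \mid x)\, \ell(x,z)
\end{align}
```

The last line follows from Jensen's inequality. When z is a sequence of discrete tokens there is no reparameterization available, which is why the gradient of the first term is usually estimated with REINFORCE, as the quoted post notes.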

Christian A. Naesseth@canaesseth

The connection between control and inference is super useful and still somewhat underappreciated.

Control/Planning/RL: REINFORCE and Pathwise Gradient

Inference/VI: Score Function Estimator and Reparameterization Trick

#RL #Control #VI #ML #Steering
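
The pairing above (REINFORCE with the score function estimator, pathwise gradient with the reparameterization trick) can be illustrated on a toy problem. This is a minimal sketch under assumptions that are not in the post: a Gaussian z ~ N(mu, sigma^2), a hypothetical test function f(z) = z^2, and the gradient of E[f(z)] with respect to mu, whose exact value is 2·mu. The function name `compare_estimators` is made up for illustration.

```python
import numpy as np


def f(z):
    """Hypothetical test function; E[f(z)] = mu^2 + sigma^2 for z ~ N(mu, sigma^2)."""
    return z ** 2


def compare_estimators(mu=1.5, sigma=1.0, n=200_000, seed=0):
    """Estimate d/dmu E[f(z)] with both estimators and report per-sample variance."""
    rng = np.random.default_rng(seed)
    eps = rng.standard_normal(n)
    z = mu + sigma * eps  # reparameterized sample: z = mu + sigma * eps

    # Score function estimator (REINFORCE): f(z) * d/dmu log N(z; mu, sigma^2)
    score_samples = f(z) * (z - mu) / sigma ** 2

    # Pathwise estimator (reparameterization trick): f'(z) * dz/dmu = 2z * 1
    pathwise_samples = 2.0 * z

    return {
        "exact": 2.0 * mu,
        "score_mean": float(score_samples.mean()),
        "pathwise_mean": float(pathwise_samples.mean()),
        "score_var": float(score_samples.var()),
        "pathwise_var": float(pathwise_samples.var()),
    }


if __name__ == "__main__":
    for name, value in compare_estimators().items():
        print(f"{name}: {value:.4f}")
```

Both estimators are unbiased for the same gradient. In this toy setting the pathwise estimator has much lower per-sample variance, but it needs a differentiable path from the parameters to the sample, which discrete tokens do not provide.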

Kyle Kastner@kastnerkyle

Additionally, "neural thickets" give some interesting empirical evidence that this search for behavior often ends up nearby in weight space: https://arxiv.org/abs/2603.12228