Nando de Freitas releases reinforcement learning tutorial with notebook
Nando de Freitas released a reinforcement learning tutorial focused on policy gradients. The release includes a Python notebook and the corresponding TeX source files, hosted at love4all.ai. Development used coding assistance from OpenAI GPT, Codex, and AnthropicAI Claude. Tao Xu noted that the notebook folds the KL divergence directly into the advantage calculation and normalization, which couples the KL term with extra variance, unlike the standard weighted combination used in policy-gradient and PPO methods. De Freitas confirmed this was a bug and fixed it.
This is a tutorial on reinforcement learning based on previous posts here. I'm including a policy-gradient Python notebook and the TeX source so it can be translated to other languages to spread knowledge.
@OpenAI GPT & Codex and @AnthropicAI Claude Code helped me. Both were great.
So that people can find these, I am now placing all materials on my first blog website ❤️4∀.ai

The tutorial covers spicy topics like "is reward enough?", but first it provides the foundations: policy gradients, PPO, GRPO, a probabilistic version via expectation maximisation (EM), RL for pretraining via e.g. online EM, imitation via DAgger, self-improvement, and tool-use with GLM
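For readers who want the core idea in code before opening the notebook, here is a minimal sketch of the score-function (REINFORCE) policy-gradient update on a toy three-armed bandit, with the advantage normalized across the batch. Everything in it (the bandit, the batch size, the learning rate) is illustrative; it is not the notebook's code.

```python
# Minimal REINFORCE sketch on a toy 3-armed bandit (illustrative only):
# softmax policy, score-function gradient, batch-normalized advantage.
import numpy as np

rng = np.random.default_rng(0)
true_means = np.array([0.1, 0.5, 0.9])  # hypothetical mean reward per arm
theta = np.zeros(3)                     # policy logits
lr = 0.1

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

for step in range(2000):
    probs = softmax(theta)
    actions = rng.choice(3, size=32, p=probs)       # sample a batch of arms
    rewards = rng.normal(true_means[actions], 0.1)  # noisy rewards
    # Normalize the advantage across the batch (mean-zero, unit std).
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
    # For a softmax policy, grad of log pi(a) w.r.t. logits = onehot(a) - probs.
    grad = np.zeros(3)
    for a, A in zip(actions, adv):
        g = -probs.copy()
        g[a] += 1.0
        grad += A * g
    theta += lr * grad / len(actions)   # gradient ascent on expected reward

print("learned policy:", softmax(theta).round(3))  # should favour arm 2
```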
@NandoDF the notebook puts KL into the advantage calculation and normalization; doesn't that couple KL with extra variance?
@txhf That puzzled me too. I left it because the results were ok, but let me ablate against the usual weighted combination. I'll let you know. Thanks Tao 🙏
@txhf You're absolutely right - that was a bug. It didn't show in the results because the example is rather easy. I increased the complexity a tiny bit and fixed it. Thanks 🙏
Actually, for people following this, I would recommend you also try the standard KL estimator.
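To make the fix concrete: the point of the thread is to keep the KL penalty out of the advantage that gets mean/std-normalized, and to add it as a separately weighted loss term instead. Below is a hedged PyTorch sketch using the "k3" per-sample KL estimator (one common reading of "the standard KL estimator"); the function names and the beta value are illustrative assumptions, not the notebook's code.

```python
# Illustrative sketch (not the notebook's code): decouple the KL penalty
# from advantage normalization.
import torch

def kl_k3(logp_theta: torch.Tensor, logp_ref: torch.Tensor) -> torch.Tensor:
    # Per-sample "k3" estimator of KL(pi_theta || pi_ref) for samples drawn
    # from pi_theta: r - 1 - log r, with r = pi_ref / pi_theta. Unbiased,
    # always non-negative, and lower variance than -log r alone.
    log_ratio = logp_ref - logp_theta
    return torch.exp(log_ratio) - 1.0 - log_ratio

def pg_loss(logp_theta, logp_ref, rewards, beta=0.05):
    # Normalize ONLY the reward-derived advantage ...
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
    pg_term = -(adv.detach() * logp_theta).mean()
    # ... and add the KL penalty as a separately weighted term, so its
    # scale is not distorted by the batch's mean/std statistics.
    return pg_term + beta * kl_k3(logp_theta, logp_ref).mean()
```

The buggy variant folds the KL penalty into `rewards` before the normalization, which both rescales the penalty by the batch standard deviation and injects the estimator's variance into the advantage; keeping the two terms separate avoids that coupling.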