I'm doing Q&A videos as I roll through my course. Here's the next one, covering subtle fixes to the on-policy distillation and reward model derivations, common notation traps when doing this math, and additional resources to go deeper (e.g. @johnschulman2's KL estimation blog post).
Q&A 2 is here!
00:00 Derivation fixes
06:10 Code examples & additional resources
08:08 Extra RL notation and notes
Keep sending questions on YouTube, GitHub, and Discord. Phoebe and I are loving them.
