Sasha Rush (@srush_nlp)
I am a huge RL terminology skeptic. Having to teach MDP and POMDP made me sick. And the reward/advantage/Q-function terminology is a mess. But I think in other ways this is a good case of a field working in isolation for a decade and then having the right abstractions ready to go when LLMs needed them.