the unchecked assumption that bothers me most in how people talk about RL is the failure to decouple what policy gradients actually are in the abstract (general approximators of the gradient of some hypothetical continuous objective, plus variance) from the sparsity of the task design
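a rough sketch of what that means in practice, under toy assumptions that aren't from the thread (a 4-action softmax policy, made-up reward tables, hypothetical function names): the score-function estimator averages toward the exact gradient of expected reward, and nothing in it cares whether the reward is graded or a 0/1 verifier-style signal.

```python
import math
import random


def softmax(logits):
    # Numerically stable softmax over a list of floats.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def exact_gradient(logits, rewards):
    # For J(theta) = sum_a pi(a) * r(a) with a softmax policy,
    # dJ/dtheta_k = pi_k * (r_k - J).
    pi = softmax(logits)
    j = sum(p * r for p, r in zip(pi, rewards))
    return [p * (r - j) for p, r in zip(pi, rewards)]


def sampled_gradient(logits, rewards, n_samples, rng):
    # Score-function (REINFORCE) estimate: average of
    # r(a) * grad log pi(a), where grad log pi(a)_i = 1[i == a] - pi_i.
    pi = softmax(logits)
    k = len(logits)
    grad = [0.0] * k
    for _ in range(n_samples):
        a = rng.choices(range(k), weights=pi)[0]
        for i in range(k):
            score = (1.0 if i == a else 0.0) - pi[i]
            grad[i] += rewards[a] * score
    return [g / n_samples for g in grad]


if __name__ == "__main__":
    rng = random.Random(0)
    logits = [0.0, 0.5, -0.5, 0.2]       # hypothetical policy parameters
    dense = [0.1, 0.4, 0.7, 1.0]         # graded reward (assumed)
    sparse = [0.0, 0.0, 0.0, 1.0]        # 0/1 "verifier" reward (assumed)
    for name, rewards in (("dense", dense), ("sparse", sparse)):
        print(name, exact_gradient(logits, rewards))
        print(name, sampled_gradient(logits, rewards, 20000, rng))
```

same estimator in both cases; only the reward table changes, which is a task design choice and not a property of the algorithm.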
Critique Questions Common Assumptions About Reinforcement Learning Policy Gradients
i worry that there are otherwise brilliant people who think (explicitly or not) that GRPO is a thing that "works best on verifiers", and who haven't registered that exact verifiers are a contingent task design trend that caught on for legibility & speccability reasons
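to make that concrete, here's a minimal sketch of the group-relative advantage step usually described for GRPO (reward minus group mean, divided by group std). the function name, the `eps` value, and the choice of population std are assumptions for illustration, and real implementations differ in details. the point is just that the normalization takes any scalar reward, binary or graded.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-6):
    # Normalize each sampled completion's reward against its own group:
    # A_i = (r_i - mean(r)) / (std(r) + eps).
    # Population std is an assumption here; some codebases use sample std.
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


if __name__ == "__main__":
    # Exact-verifier style rewards: pass/fail per completion.
    print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))
    # Graded rewards (e.g. a rubric or partial-credit score, hypothetical).
    print(group_relative_advantages([0.2, 0.9, 0.5, 0.4]))
    # A group where every completion gets the same reward carries no signal.
    print(group_relative_advantages([1.0, 1.0, 1.0, 1.0]))
```

the binary case is just one input to the same function; when every completion in a group scores the same, the advantages are all zero, which is a sparsity problem of the task and not something the algorithm requires.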
people sorta want deep learning to not be about the structure of how you frame the problem you're trying to solve + what constraints a system has to solve around. this is an issue in general (algo changes instead of data/task changes), but for RL it's especially brutal