@teortaxesTex wrt serious PPO-replacement contenders: why not start off with ideas like not regressing a possibly multimodal return estimate to a single point via MSE? distributional RL, but make it LM. something like that. gotta toy w/ ideas like these at some point
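a toy sketch of the complaint, not anyone's actual method: with bimodal returns, the MSE-optimal scalar is the mean, which can be a return that never occurs, while a categorical head trained with cross-entropy can keep both modes. the names `mse_optimal_value` and `categorical_fit`, the 11-bin support, and the nearest-bin projection (a simplification of the C51-style interpolated projection) are all assumptions made for illustration.

```python
def mse_optimal_value(returns):
    # The scalar that minimizes mean squared error is the sample mean.
    return sum(returns) / len(returns)


def categorical_fit(returns, v_min, v_max, n_bins):
    # The categorical distribution that minimizes cross-entropy against
    # binned targets is the empirical histogram over those bins.
    # Assumption: nearest-bin projection instead of linear interpolation.
    width = (v_max - v_min) / (n_bins - 1)
    counts = [0] * n_bins
    for r in returns:
        clipped = min(max(r, v_min), v_max)
        counts[round((clipped - v_min) / width)] += 1
    return [c / len(returns) for c in counts]


# Hypothetical bimodal returns: half the rollouts fail (0.0), half succeed (1.0).
returns = [0.0] * 50 + [1.0] * 50

scalar = mse_optimal_value(returns)
probs = categorical_fit(returns, v_min=0.0, v_max=1.0, n_bins=11)
support = [i * 0.1 for i in range(11)]
dist_mean = sum(p * z for p, z in zip(probs, support))

print("MSE-optimal scalar:", scalar)
print("categorical probs:", probs)
print("mean of categorical fit:", dist_mean)
```

both fits agree on the mean, but only the categorical one records that the mass sits at the two ends and not in the middle.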
@teortaxesTex >forces the assumption of a frozen reference model uuuugghhhhhhhhhhhhhhhh