@Grad62304977 Short horizon is a bandit problem so value functions are unnecessary. Long horizon chains have many sub chains that have been seen previously and can be valued. Short horizon PPo has extra latency has it has to wait for the critic
Not even directly saying that PPO is bad, but just questioning the idea that before PPO wasn’t worth it and now it is. U could say for shorter horizons the value models job is easier so it should also be great there too

