Will Brown of Prime Intellect warns early multi-turn RLHF limits exploration, while Kalomaze proposes GAN-style ranking
The proposal uses density ratio estimation for outcome distributions.
the *right* way to do multi-turn RL for long-running chat is probably to just get really good at simulating users
maybe something like ECHO works here, or a two-sided GAN loop with online traces as targets
but users are complex, this complexity needs to be modeled somewhere
as an analogy, a strategy you *could* take in multi-turn RL is just using turn-level advantages for single completions
this is how early multi-turn RLHF was often done
issue is you never see interaction chain counterfactuals, exploration is kneecapped
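A minimal sketch of the turn-level approach described above, assuming a group-mean baseline and made-up reward values (neither is specified in the thread). Every completion is sampled from the same fixed conversation history, so the advantage only compares alternatives for one turn and never scores how the rest of the interaction would have unfolded.

```python
from statistics import mean

def turn_level_advantages(rewards):
    """Advantage of each sampled completion relative to the group mean.

    All completions are assumed to share one fixed history, so no
    alternative interaction chain is ever rolled out or scored.
    """
    baseline = mean(rewards)
    return [r - baseline for r in rewards]

# Hypothetical scores for 4 completions sampled at a single turn.
rewards = [0.2, 0.9, 0.4, 0.5]
advantages = turn_level_advantages(rewards)
```

The advantages sum to zero by construction, which is why this signal only ranks single-turn alternatives against each other and says nothing about counterfactual later turns.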
fortunately, we now have Discrete Gradient Descent, and so the worldsim piece is getting much easier
@willccbb btw. modern GAN literature shows that *relative* ranking avoids collapse
i think the principle is general; for anything that looks like discriminative value estimation, you get denser signal for free if formulated as a ranking problem.
density ratio estimation for outcome dists
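One possible reading of the ranking idea, sketched with a Bradley-Terry style pairwise loss next to an absolute binary cross-entropy loss. The function names and scores are illustrative assumptions, not anything proposed in the thread. The pairwise loss depends only on the difference between two scores, so every pair of outcomes yields a training signal regardless of where the scores sit on an absolute scale.

```python
import math

def softplus(x):
    # Numerically stable log(1 + exp(x)).
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def absolute_loss(score, label):
    """Binary cross-entropy on one score (label 1 = good, 0 = bad)."""
    return softplus(-score) if label == 1 else softplus(score)

def pairwise_ranking_loss(score_better, score_worse):
    """Bradley-Terry style loss: a function of the score difference only."""
    return softplus(-(score_better - score_worse))

# Hypothetical scores: shifting both by a constant leaves the loss unchanged.
loss_low = pairwise_ranking_loss(2.0, 1.0)
loss_high = pairwise_ranking_loss(12.0, 11.0)
```

In this sketch `loss_low` and `loss_high` agree because only the gap of 1.0 between the two scores matters, whereas `absolute_loss` would treat the two cases differently.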
@willccbb MBRL will surely make a comeback