Will Brown of Prime Intellect warns early multi-turn RLHF limits exploration, while Kalomaze proposes GAN-style ranking

The proposal uses density ratio estimation for outcome distributions.

Original post

as an analogy, a strategy you *could* take in multi-turn RL is just using turn-level advantages for single completions

this is how early multi-turn RLHF was often done

issue is you never see interaction chain counterfactuals, exploration is kneecapped

10:57 AM · May 27, 2026
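
A minimal sketch of what "turn-level advantages for single completions" could look like, assuming a group-mean baseline over several candidate replies sampled from one fixed conversation prefix. The post does not say which baseline or reward source was used, so the function name and the reward values here are hypothetical.

```python
from statistics import mean, pstdev

def turn_level_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of sibling completions.

    Every completion in the group shares the same conversation prefix,
    and none of them is rolled forward into later turns.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Hypothetical scores for 4 candidate replies to the same prefix.
rewards = [0.2, 0.9, 0.5, 0.4]
advantages = turn_level_advantages(rewards)
```

Each reply is judged only against its siblings at the same turn. Because no candidate is continued into the rest of the conversation, its effect on later turns never enters the signal, which is the missing "interaction chain counterfactual" the post describes.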

the *right* way to do multi-turn RL for long-running chat is probably to just get really good at simulating users

maybe something like ECHO works here, or a two-sided GAN loop with online traces as targets

but users are complex, this complexity needs to be modeled somewhere

Quoted post: will brown @willccbb · 5:57 PM · May 27, 2026 · 1.6K Views
6:00 PM · May 27, 2026 · 2.7K Views

fortunately, we now have Discrete Gradient Descent, and so the worldsim piece is getting much easier

6:06 PM · May 27, 2026 · 1.1K Views

@willccbb btw. modern GAN literature shows that *relative* ranking avoids collapse

i think the principle is general; for anything that looks like discriminative value estimation, you get denser signal for free if formulated as a ranking problem. density ratio estimation for outcome dists

6:38 PM · May 27, 2026 · 167 Views
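
One way to read the ranking suggestion, sketched under assumptions: the reply does not name a loss, so this uses a logistic loss on the difference between two scores (the form used by relativistic GAN discriminators and Bradley-Terry preference models). The function name and scores are hypothetical.

```python
import math

def pairwise_ranking_loss(score_preferred, score_other):
    """Logistic loss on a score difference: softplus(-(s_p - s_o))."""
    margin = score_preferred - score_other
    # Numerically stable softplus(-margin).
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))

# Only the difference matters: shifting both scores changes nothing.
loss_low = pairwise_ranking_loss(2.0, 1.0)
loss_high = pairwise_ranking_loss(102.0, 101.0)

# A correctly ordered pair costs less than a reversed one.
loss_correct = pairwise_ranking_loss(2.0, 1.0)
loss_reversed = pairwise_ranking_loss(1.0, 2.0)

# Tied scores give log(2), the loss of a coin flip.
loss_tied = pairwise_ranking_loss(0.0, 0.0)
```

The loss depends only on how two outcomes compare, not on their absolute scores, so every pair of differing outcomes contributes a training signal. The density-ratio reading follows from a standard result: with balanced classes, the logit of an optimal logistic classifier equals the log ratio of the two class densities.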

@willccbb MBRL will surely make a comeback

9:30 PM · May 27, 2026 · 38 Views