Will Brown of Prime Intellect warns early multi-turn RLHF limits exploration, while Kalomaze proposes GAN-style ranking

REPLY

the *right* way to do multi-turn RL for long-running chat is probably to just get really good at simulating users

maybe something like ECHO works here, or a two-sided GAN loop with online traces as targets

but users are complex, this complexity needs to be modeled somewhere

will brown@willccbb

as an analogy, a strategy you *could* take in multi-turn RL is just using turn-level advantages for single completions

this is how early multi-turn RLHF was often done

issue is you never see interaction chain counterfactuals, exploration is kneecapped

5:57 PM · May 27, 2026 · 1.6K Views

6:00 PM · May 27, 2026 · 2.7K Views
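
A rough illustration of the contrast in the quoted tweet, not the author's implementation: the first function scores several completions sampled for one fixed context (turn-level advantages), while the second scores whole conversations rolled out against a simulated user. The function names, the mean baseline, and the reward numbers are all assumptions made for this sketch.

```python
from statistics import mean


def turn_level_advantages(rewards):
    """Advantage of each sampled completion for ONE fixed context.

    Every completion is compared only with its siblings at the same turn,
    so nothing downstream of the turn is ever observed.
    """
    baseline = mean(rewards)
    return [r - baseline for r in rewards]


def trajectory_level_advantages(rollouts):
    """Advantage of each full conversation (hypothetical simulated-user rollouts).

    Each rollout is a list of per-turn rewards; the return is their sum,
    so a turn is credited for what it leads to later in the chain.
    """
    returns = [sum(turn_rewards) for turn_rewards in rollouts]
    baseline = mean(returns)
    return [g - baseline for g in returns]


# Three completions for the same context, judged in isolation.
single_turn = turn_level_advantages([1.0, 0.0, 0.5])

# Two full conversations: the first opens strongly and stalls,
# the second opens weakly but pays off on the next turn.
full_chain = trajectory_level_advantages([[1.0, 0.0], [0.0, 2.0]])
```

In this toy case the opening turn with reward 1.0 looks best when scored alone, yet the conversation that starts with reward 0.0 gets the higher trajectory-level advantage. That is the kind of interaction-chain counterfactual a turn-level setup never sees.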

REPLY

will brown@willccbb

fortunately, we now have Discrete Gradient Descent, and so the worldsim piece is getting much easier

6:06 PM · May 27, 2026 · 1.1K Views

REPLY

kalomaze@KALOMAZE

@willccbb btw. modern GAN literature shows that *relative* ranking avoids collapse

i think the principle is general; for anything that looks like discriminative value estimation, you get denser signal for free if formulated as a ranking problem.

density ratio estimation for outcome dists

6:38 PM · May 27, 2026 · 167 Views
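
One common way to write "relative ranking" down is a pairwise logistic loss on the score difference, the form used by relativistic GAN discriminators and Bradley-Terry preference models. Whether this is the exact formulation the reply has in mind is not stated, so treat the sketch below as an assumption. It contrasts that loss with an absolute real-vs-fake cross-entropy; the function names and scores are illustrative.

```python
import math


def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def absolute_loss(score_real, score_fake):
    """Binary cross-entropy on each score separately:
    -log sigmoid(score_real) - log(1 - sigmoid(score_fake))."""
    return softplus(-score_real) + softplus(score_fake)


def pairwise_ranking_loss(score_real, score_fake):
    """Logistic ranking loss on the gap only:
    -log sigmoid(score_real - score_fake)."""
    return softplus(-(score_real - score_fake))


# Same gap of 1.0 between the two scores, at two different offsets.
rank_low = pairwise_ranking_loss(2.0, 1.0)
rank_high = pairwise_ranking_loss(12.0, 11.0)
abs_low = absolute_loss(2.0, 1.0)
abs_high = absolute_loss(12.0, 11.0)
```

The ranking loss depends only on the gap between the two scores, so `rank_low` and `rank_high` are equal. The absolute loss changes when both scores shift together, which ties the critic to a fixed scale instead of to the comparison.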

REPLY

Charles Foster@CFGEEK

@willccbb MBRL will surely make a comeback

9:30 PM · May 27, 2026 · 38 Views
