Prime Intellect's Will Brown argues that self-distillation is unlikely to enable "exploration-free" RL: RL is fundamentally about exploration, and his instinct is that it requires modeling the world.
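Dhruv Batra agreed with the claim as stated but added two caveats: self-distillation does not imply the absence of exploration, and RL does not require replayable environments, as offline RL shows.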
@willccbb Agreed with your claim as stated, but caveats to avoid a misreading of your claim:
1. self-distillation ⇏ no exploration (see pedagogical RL)
2. RL ⇏ replayable environments (see any offline RL paper)
i think some people are hoping that self-distillation enables “exploration-free” RL purely via reflection on live data, allowing them to bypass the need for replayable environments
unfortunately, RL is all about exploration
my instinct is you basically need to model the world