Really cool work @erichzjiang! 👏👏👏 I have a question about your visualization.
IIUC, in a "pure" self-distillation step you would sample terminal actions from the current policy and maximize the ELBO under these actions. (For FPO, this is the supervised conditional flow-matching loss.) Perfect self-distillation would produce a new policy that preserves the marginal distribution of terminal actions (and hence the action log-likelihoods). But your visualization seems to suggest that all intermediate marginal distributions for t ∈ (0, 1) are also preserved in the self-distillation step? Is that correct? If so, I don't see why that would necessarily be true.
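
To make the concern concrete, here is a toy 1D sketch (not FPO's actual code). Everything in it is an assumption on my part: the convention that t = 0 is noise and t = 1 is the action, the linear interpolant x_t = (1 - t) ε + t x_1 with fresh independent Gaussian noise ε, and the hypothetical "current policy" flow x_t = (1 + t) x_0. It relies on the fact that a perfect conditional flow-matching fit reproduces the marginals of the interpolant, so I sample the interpolant directly instead of training anything.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000
t = 0.5  # an intermediate time; assumed convention: t=0 noise, t=1 action

# Hypothetical "current policy" flow (an assumption, not FPO's):
# x_t = (1 + t) * x_0, so the terminal action is x_1 = 2 * x_0 ~ N(0, 4).
x0 = rng.standard_normal(n)
current_mid = (1.0 + t) * x0
current_terminal = 2.0 * x0

# Perfect self-distillation via conditional flow matching, assuming a linear
# interpolant with fresh noise independent of the sampled terminal actions.
# The distilled flow's marginal at time t is the law of (1 - t) * eps + t * x_1.
eps = rng.standard_normal(n)
distilled_mid = (1.0 - t) * eps + t * current_terminal
distilled_terminal = current_terminal  # at t = 1 the interpolant equals x_1

var_current_mid = current_mid.var()      # analytically (1 + t)^2 = 2.25
var_distilled_mid = distilled_mid.var()  # analytically (1 - t)^2 + 4 t^2 = 1.25
var_terminal = distilled_terminal.var()  # analytically 4 for both flows

print(var_current_mid, var_distilled_mid, var_terminal)
```

In this toy case the terminal marginal is preserved exactly, but the marginal at t = 0.5 changes, which is why I would not expect the intermediate marginals to be preserved in general.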