What if we treat flow steps as RL actions?
Combined with our “flow reversal” technique, this leads to a really clean & powerful recipe for flow offline RL!
Thread 🧵
A new way to do off-policy RL with diffusion: if we have off-policy data, we need to figure out what the diffusion latent steps for it would be with our *current* policy (not the one that collected it), so this requires reversing the diffusion process on off-policy data.

We find that we can avoid expanding the value learning horizon by constructing "virtual" flow trajectories from standard prior data, which are perfectly suited for multi-step returns.
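
One way to read the multi-step-return point, as a hedged sketch rather than the paper's actual objective: assume (this is not stated in the thread) that intermediate flow steps carry zero reward and no discount, and that only the final step, which completes the action, yields the environment reward and discount. Under those assumptions, every partial action on a virtual trajectory gets the same target as the completed action, so the bootstrapping horizon does not grow with the number of flow steps. The helper `multi_step_target` and all numbers are hypothetical.

```python
def multi_step_target(rewards, discounts, bootstrap_value):
    """n-step return: discounted rewards along a fixed path, plus a bootstrap."""
    target, scale = 0.0, 1.0
    for reward, discount in zip(rewards, discounts):
        target += scale * reward
        scale *= discount
    return target + scale * bootstrap_value


# Assumed setup: 4 flow steps per action, environment reward 1.0,
# discount 0.99, and a next-state value estimate of 10.0.
num_flow_steps, env_reward, gamma, next_value = 4, 1.0, 0.99, 10.0

# Assumption: only the last flow step yields reward and discount.
rewards = [0.0] * (num_flow_steps - 1) + [env_reward]
discounts = [1.0] * (num_flow_steps - 1) + [gamma]

# Target for the partial action at each flow index k (k = 0 is the noise end).
targets = [
    multi_step_target(rewards[k:], discounts[k:], next_value)
    for k in range(num_flow_steps)
]
print(targets)  # all entries equal env_reward + gamma * next_value
```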

We introduce *Reversal Q-Learning (RQL)*.
RQL achieves strong results across 50 locomotion and manipulation tasks compared to 19 other state-of-the-art flow-based offline RL algorithms.

That's it! A big thank you to my co-authors @seohong_park @svlevine.
Website: http://aober.ai/rql
Paper: http://arxiv.org/abs/2606.17551
Codebase: http://github.com/aoberai/rql

We generate trajectories in the expanded framework via "flow reversal", which follows the flow ODE in reverse from actions in prior data.
We show these trajectories are deterministic and on-policy, and they thereby allow for unbiased, zero-variance multi-step returns.
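
As a rough illustration of flow reversal, here is a minimal pure-Python sketch. Everything in it is an assumption for illustration: the 1-D `velocity` field stands in for a learned flow policy, and `flow_forward` / `flow_reverse` are hypothetical helpers using plain Euler steps, not code from the RQL codebase. It integrates the flow ODE backward from a dataset action to obtain a "virtual" trajectory of partial actions, then replays it forward to check that the current policy would approximately regenerate that action.

```python
import math


def velocity(x, t, state):
    """Hypothetical stand-in for a learned flow velocity field."""
    return math.tanh(state) - x


def flow_forward(x0, state, num_steps):
    """Euler-integrate the flow ODE from t=0 (noise) to t=1 (action)."""
    h = 1.0 / num_steps
    xs = [x0]
    for k in range(num_steps):
        x = xs[-1]
        xs.append(x + h * velocity(x, k * h, state))
    return xs


def flow_reverse(action, state, num_steps):
    """Euler-integrate the same ODE backward from a dataset action at t=1.

    Returns the partial actions ordered from t=0 to t=1, so the last
    element is the dataset action itself.
    """
    h = 1.0 / num_steps
    xs = [action]
    for k in range(num_steps, 0, -1):
        x = xs[-1]
        xs.append(x - h * velocity(x, k * h, state))
    xs.reverse()
    return xs


# Virtual trajectory for one (state, action) pair from prior data.
virtual = flow_reverse(action=0.9, state=0.3, num_steps=100)

# Replaying the current flow from the recovered starting point lands close
# to the dataset action (up to Euler discretization error).
replayed = flow_forward(virtual[0], state=0.3, num_steps=100)
print(abs(replayed[-1] - 0.9))
```

No sampling is involved, so reversing the same pair twice yields the same trajectory. In this sketch the round trip is only approximate because explicit Euler steps are not exactly invertible; a finer step size or a more accurate integrator shrinks the gap.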

We can then use reparameterized gradients on each flow step (alongside a BC term).

We know iterative generative models like flow matching are powerful for modeling complex robot policies in offline reinforcement learning (RL).
Yet, training them is non-trivial: BPTT is unstable, and 1-step distillations inhibit expressivity.

The implementation is really simple.
We learn a value function jointly over complete and partially-generated actions.

We can directly do RL over refinement steps, but this expands each action into multiple decision steps, multiplying the value learning horizon.
This expansion is particularly bad for off-policy RL, which exhibits the “curse of horizon”.
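
To make the horizon expansion concrete, here is a small back-of-the-envelope sketch. The episode length and flow step counts are illustrative assumptions, not numbers from the paper, and `expanded_horizon` is a hypothetical helper.

```python
def expanded_horizon(env_horizon: int, flow_steps: int) -> int:
    """Decision steps per episode when every flow step counts as an RL action."""
    return env_horizon * flow_steps


# Illustrative only: a 1000-step episode with 1, 4, or 10 flow steps per action.
for flow_steps in (1, 4, 10):
    print(flow_steps, expanded_horizon(1000, flow_steps))
```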

We propose a new algorithmic idea, starting from a simple view of flow RL.
A flow policy constructs actions via a sequence of refinement steps. To do RL, we can treat individual refinement steps as actions and apply standard RL algorithms.

@aditya_oberai sounds like a smart way to extend learning without extra data