Researchers Treat Flow Steps as Actions for Efficient Offline RL

Original post

Aditya Oberai@aditya_oberai

What if we treat flow steps as RL actions?

Combined with our “flow reversal” technique, this leads to a really clean & powerful recipe for flow offline RL!

Thread 🧵

8:30 AM · Jun 17, 2026 · 25.8K Views

Sentiment

Users appreciate treating flow steps as actions for efficient offline RL because it offers a smart way to extend learning without extra data.

Positive: 100.0%. Negative: 0.0%. 1 comment with sentiment.


Sergey Levine@svlevine

A new way to do off-policy RL with diffusion: if we have off-policy data, we need to figure out what the diffusion latent steps for it would be with our *current* policy (not the one that collected it), so this requires reversing the diffusion process on off-policy data.

Aditya Oberai@aditya_oberai

We recognize we can prevent an expansion in the value learning horizon by constructing "virtual" flow trajectories from standard prior data that are perfectly suited for multi-step returns.
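
One way to read this, as a minimal sketch rather than the authors' implementation: if a virtual flow trajectory ends in the dataset action, a multi-step return from any partial action on it runs to the end of the trajectory, collects the environment reward, and bootstraps once. The function name, the zero reward on intermediate flow steps, and the choice not to discount flow steps are all assumptions made for illustration.

```python
def virtual_trajectory_targets(reward, next_value, num_flow_steps, discount=0.99):
    """Return one bootstrapped target per partial action x_0, ..., x_K.

    Assumptions (not from the thread): intermediate flow steps earn zero
    reward and are not discounted, so only the environment step contributes.
    """
    # The multi-step return from any partial action runs to the end of the
    # flow trajectory, collects the environment reward, then bootstraps once.
    env_target = reward + discount * next_value
    return [env_target] * (num_flow_steps + 1)


targets = virtual_trajectory_targets(reward=1.0, next_value=10.0, num_flow_steps=4)
```

Under these assumptions every partial action shares the environment-level target, so the number of bootstrapping hops does not grow with the number of flow steps.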

Aditya Oberai@aditya_oberai

We introduce *Reversal Q-Learning (RQL)*.

RQL achieves strong results across 50 locomotion and manipulation tasks compared to 19 other state-of-the-art flow-based offline RL algorithms.

Aditya Oberai@aditya_oberai

That's it! A big thank you to my co-authors @seohong_park @svlevine.

Website: http://aober.ai/rql Paper: http://arxiv.org/abs/2606.17551 Codebase: http://github.com/aoberai/rql

Aditya Oberai@aditya_oberai

We generate trajectories in the expanded framework via "flow reversal", which follows the flow ODE in reverse from actions in prior data.

We show these trajectories are deterministic and on-policy, and they thereby allow for unbiased, zero-variance multi-step returns.
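
The reversal idea can be sketched with a toy velocity field, assuming plain Euler integration. Here `toy_velocity` is a hypothetical stand-in for a learned flow policy, and `integrate`, the step count, and the example state and action are illustrative choices, not details from the paper. Reverse Euler only approximately inverts forward Euler, so the replayed action matches the dataset action up to discretization error.

```python
def toy_velocity(state, x, t):
    """Hypothetical stand-in for a learned velocity field v(s, x, t)."""
    target = [0.5 * s for s in state]
    return [tg - xi for tg, xi in zip(target, x)]


def integrate(state, x_start, num_steps, reverse=False):
    """Euler-integrate the flow ODE; reverse=True runs from t=1 back to t=0."""
    dt = 1.0 / num_steps
    sign = -1.0 if reverse else 1.0
    x = list(x_start)
    path = [x]
    for k in range(num_steps):
        t = 1.0 - k * dt if reverse else k * dt
        v = toy_velocity(state, x, t)
        x = [xi + sign * dt * vi for xi, vi in zip(x, v)]
        path.append(x)
    return path


state = [1.0, -2.0]
dataset_action = [0.9, 0.1]

# Reverse pass: start at the dataset action (t=1) and walk back to t=0.
reversed_path = integrate(state, dataset_action, num_steps=100, reverse=True)
virtual_trajectory = reversed_path[::-1]  # ordered x_0, ..., x_K

# Forward pass from the recovered start point should land near the dataset action.
replayed = integrate(state, virtual_trajectory[0], num_steps=100)[-1]
error = max(abs(a - b) for a, b in zip(replayed, dataset_action))
```

No random sampling appears anywhere in the reverse pass: given the state, the dataset action, and the current velocity field, the trajectory is fully determined, which is the property the thread relies on.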

Aditya Oberai@aditya_oberai

We can then use reparameterized gradients on each flow step (alongside a BC term).

Aditya Oberai@aditya_oberai

We know iterative generative models like flow matching are powerful for modeling complex robot policies in offline reinforcement learning (RL).

Yet, training them is non-trivial: BPTT is unstable, and 1-step distillations inhibit expressivity.

Aditya Oberai@aditya_oberai

The implementation is really simple.

We learn a value function jointly over complete and partially-generated actions.

Aditya Oberai@aditya_oberai

We can directly do RL over refinement steps, but this expands each action into multiple decision steps, multiplying the value learning horizon.

This expansion is particularly bad for off-policy RL, which exhibits the “curse of horizon”.
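
The arithmetic behind the horizon expansion can be sketched as follows. The helper names and the example numbers are illustrative assumptions; the thread gives no specific horizons or discounts.

```python
def expanded_horizon(env_horizon, flow_steps):
    """Decision steps per episode once every action becomes flow_steps decisions."""
    return env_horizon * flow_steps


def per_step_discount(env_discount, flow_steps):
    """Discount per refinement step that compounds to env_discount per action."""
    return env_discount ** (1.0 / flow_steps)


def effective_horizon(discount):
    """Standard 1 / (1 - discount) rule of thumb for how far values look ahead."""
    return 1.0 / (1.0 - discount)


steps = expanded_horizon(env_horizon=1000, flow_steps=10)
gamma_flow = per_step_discount(env_discount=0.99, flow_steps=10)
ratio = effective_horizon(gamma_flow) / effective_horizon(0.99)
```

With 10 flow steps per action, one-step bootstrapping has to propagate value through roughly 10 times as many decisions, which is the expansion the thread describes.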

Aditya Oberai@aditya_oberai

We propose a new algorithmic idea, starting from a simple view of flow RL.

A flow policy constructs actions via a sequence of refinement steps. To do RL, we can treat individual refinement steps as actions and apply standard RL algorithms.
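
As a rough illustration of this view, the sketch below expands one environment action into a sequence of refinement decisions. It assumes Euler integration and a hypothetical `toy_velocity` field in place of a learned policy; the transition layout (`obs`, `action`, `is_final`) is an invented convention for the example, not the paper's data format.

```python
def toy_velocity(state, x, t):
    """Hypothetical stand-in for a learned velocity field v(s, x, t)."""
    target = [0.5 * s for s in state]
    return [tg - xi for tg, xi in zip(target, x)]


def refinement_transitions(state, noise, num_steps):
    """Expand one environment action into num_steps refinement decisions."""
    dt = 1.0 / num_steps
    x = list(noise)
    transitions = []
    for k in range(num_steps):
        # The "action" at this decision step is the Euler increment.
        step = [dt * vi for vi in toy_velocity(state, x, k * dt)]
        x_next = [xi + si for xi, si in zip(x, step)]
        transitions.append({
            "obs": (tuple(state), tuple(x), k),  # env state, partial action, step index
            "action": tuple(step),
            "next_partial": tuple(x_next),
            "is_final": k == num_steps - 1,  # only now is the action sent to the env
        })
        x = x_next
    return transitions, x


transitions, final_action = refinement_transitions(
    state=[1.0, -2.0], noise=[0.0, 0.0], num_steps=4
)
```

Any standard RL algorithm could in principle consume transitions of this shape, at the cost of one decision per refinement step instead of one per environment step.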

Christina Scoot@citylightspop

@aditya_oberai sounds like a smart way to extend learning without extra data
