Diffusion (or flow) makes for excellent policies, but training them with RL is notoriously hard: BPTT is unstable, RL over diffusion blows up the horizon. In our new paper, we show how we can optimize flow matching actors by using "one weird trick" -- "approximate" the Jacobian of the flow denoising process with the identity matrix. 👇
Users reacted negatively to the Jacobian approximation technique for stabilizing RL flow policies by accusing researcher @svlevine of chasing personal romantic interests among prominent AI figures instead of pursuing technical work.
Our method (QGF) outperforms using the true Jacobian or BPTT. It is entirely a test-time method (i.e., the policy is trained with BC, the Q-function is trained with TD, and at test time, we optimize the Q-function wrt actions using the identity Jacobian "approximation").
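To make the identity-Jacobian idea concrete, here is a minimal toy sketch in NumPy. It is not the paper's implementation: the velocity field, the critic, and every name in it (velocity, run_flow, q_value, q_grad, guide_with_identity_jacobian, mu, a_star) are hypothetical stand-ins chosen for illustration. It assumes the guidance is applied to the initial noise: the exact gradient of Q with respect to the noise would be the flow's Jacobian (transposed) times the action gradient of Q, and the sketch simply replaces that Jacobian with the identity.

```python
import numpy as np

def velocity(a, t, mu):
    # Hypothetical stand-in for a BC-trained flow velocity field:
    # it pulls samples toward a fixed "behavior" action mu.
    return mu - a

def run_flow(z, mu, num_steps=10):
    # Euler integration of the flow from noise z (t=0) to an action (t=1).
    a = z.copy()
    dt = 1.0 / num_steps
    for k in range(num_steps):
        a = a + dt * velocity(a, k * dt, mu)
    return a

def q_value(a, a_star):
    # Hypothetical stand-in for a TD-trained critic, peaked at a_star.
    return -np.sum((a - a_star) ** 2)

def q_grad(a, a_star):
    # Analytic gradient of q_value with respect to the action.
    return -2.0 * (a - a_star)

def guide_with_identity_jacobian(z, mu, a_star, lr=0.1, iters=200):
    # Gradient ascent on the noise z. The exact gradient would be
    # (d run_flow / d z)^T @ q_grad(a); here that Jacobian is replaced
    # by the identity, so no backpropagation through the flow is needed.
    z = z.copy()
    for _ in range(iters):
        a = run_flow(z, mu)
        z = z + lr * q_grad(a, a_star)
    return z

rng = np.random.default_rng(0)
mu = np.array([0.0, 0.0])        # assumed behavior-policy mode
a_star = np.array([0.5, -0.3])   # assumed maximizer of the toy critic
z0 = rng.standard_normal(2)

a_before = run_flow(z0, mu)
a_after = run_flow(guide_with_identity_jacobian(z0, mu, a_star), mu)
print("Q before:", q_value(a_before, a_star))
print("Q after: ", q_value(a_after, a_star))
```

In this toy setup the true Jacobian is just a positive multiple of the identity, so the approximation only rescales the step size. Whether it behaves as well for a learned, nonlinear flow is the empirical claim made in the thread, not something this sketch demonstrates.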
To find out more, check out the paper and website here: https://q-guided-flow.github.io/
A fun collaboration with @zhiyuan_zhou_, @andy_peng05, @CharlesXu0124, @qiyang_li, @kvfrans, @jtspringenberg

@svlevine Gosh, he is literally chasing all hot guys in the ai, software and tech industry. My gosh, dude, you might have a preference for bfs but maybe the guys don't want to be in a relationship with you. Why are you collaborating with every single hot guy in the tech industry in the 🌎?