SCORE Enables Direct Sim-To-Real Transfer For Robot Policies

Policies trained on real robot data via imitation can be surprisingly capable. But for domains like dexterous manipulation, they are often not quite good enough: they move slowly, miss grasps, make unreliable contact, and fail under small perturbations.

Can we improve them without any additional data collection on the real robot?

In SCORE, we show that we can cheaply improve real-world diffusion/flow policies by using simulation only to learn how to steer them at deployment. This leads to large gains in real-world success and speed across a variety of tasks, without requiring additional real-world experience:

https://weirdlabuw.github.io/score/

🧵 (1/10)

Abhishek Gupta@abhishekunique7

This makes the iteration loop really fast! We can go from task setup to real-world deployment for a new task in just half a day, turning a brittle base policy into one that is much more successful, as well as more robust, faster, and more precise. For instance, we see robust, continuous block picking with much higher throughput than was possible before. (6/10)

Abhishek Gupta@abhishekunique7

SCORE gets the benefits of simulation for policy improvement: parallel interaction, privileged state, resets, and robustness to perturbations. Meanwhile, it avoids much of the manual engineering usually needed to make sim RL work, such as dense reward shaping, curriculum learning, or policy distillation. You don’t need to finetune the base policy; you just transfer the steering policy directly from sim to real. (5/10)

Abhishek Gupta@abhishekunique7

The empirical results are quite striking. Using only steering learned in simulation, we see a 2.4x improvement in real-world success rate and a 36.8% improvement in policy throughput. We also see the policy demonstrate retry and robustness behaviors that the base policy failed to show. Interestingly, the relatively coarse knob of policy steering is able to solve some pretty cool, high-dexterity problems. (7/10)

Abhishek Gupta@abhishekunique7

To realize this, SCORE constrains RL improvement in simulation to the real-world policy’s support: the actions the real-world policy can plausibly generate with non-zero likelihood.

We realize a support constraint by only learning to steer in simulation, building on our prior work on diffusion steering. Freezing the base generative control policy (flow/diffusion policy) trained on real robot data, we use RL to learn how to steer the policy, i.e., to determine which latent inputs lead to success, rather than finetuning the policy itself. This lets simulation select among real-world behaviors, rather than inventing actions that may be unsafe or untransferable.

In this case diffusion steering is not just a convenient choice of RL algorithm, but actually necessary to enforce support constraints. (4/10)
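
To make the idea of steering a frozen policy concrete, here is a minimal toy sketch. Everything in it is an assumption for illustration and none of it is the SCORE implementation: `base_policy` is a hypothetical one-dimensional stand-in for a frozen flow/diffusion policy, `sim_reward` is an invented simulator reward with a spurious peak that plays the role of a simulator exploit, and plain random search stands in for RL.

```python
import math
import random


def base_policy(obs: float, latent: float) -> float:
    """Hypothetical frozen base policy: maps an observation and a latent input to an action.

    tanh bounds the output to (obs - 1, obs + 1); that interval plays the role of the support.
    """
    return obs + math.tanh(latent)


def sim_reward(action: float) -> float:
    """Toy simulator reward: a genuine peak at 0.8 and a larger, spurious peak at 3.0."""
    genuine = math.exp(-((action - 0.8) ** 2) / 0.02)
    exploit = 2.0 * math.exp(-((action - 3.0) ** 2) / 0.02)
    return genuine + exploit


def random_search(objective, low: float, high: float, samples: int = 5000, seed: int = 0) -> float:
    """Stand-in for RL: return the sampled input in [low, high] with the highest objective."""
    rng = random.Random(seed)
    candidates = [rng.uniform(low, high) for _ in range(samples)]
    return max(candidates, key=objective)


def optimize_action_directly() -> float:
    """Unconstrained improvement: search raw actions, free to leave the support."""
    return random_search(sim_reward, low=-4.0, high=4.0)


def optimize_by_steering(obs: float = 0.0) -> float:
    """Steering: search latents only, then read the action off the frozen base policy."""
    best_latent = random_search(lambda w: sim_reward(base_policy(obs, w)), low=-3.0, high=3.0)
    return base_policy(obs, best_latent)


if __name__ == "__main__":
    print("direct action search:", round(optimize_action_directly(), 2))
    print("latent steering:     ", round(optimize_by_steering(), 2))
```

In this toy setup, the direct search is drawn to the spurious peak because nothing ties it to the base policy, while the steered search can only return actions the frozen policy is able to produce, so it settles on the genuine peak.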

Abhishek Gupta@abhishekunique7

A standard fix is to keep the learned policy close to the real-world policy, using a “distributional” constraint (like KL). But this often creates a difficult tradeoff:

Too loose → the policy exploits simulation. Too tight → the policy barely improves.

So we asked: is there a better way to constrain sim RL that allows policies to actually improve, while avoiding exploitation behavior? (3/10)
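
For reference, one common way to write such a distributional constraint is as a KL penalty on the simulated RL objective. The notation is an assumption for illustration and is not taken from the paper: \pi is the learned policy, \pi_{\text{base}} the real-world policy, J_{\text{sim}} the return in simulation, d^{\pi} the state distribution visited by \pi, and \beta the penalty weight.

```latex
\max_{\pi} \; J_{\text{sim}}(\pi)
  \;-\; \beta \, \mathbb{E}_{s \sim d^{\pi}}
  \Big[ D_{\mathrm{KL}}\big( \pi(\cdot \mid s) \,\|\, \pi_{\text{base}}(\cdot \mid s) \big) \Big]
```

In this form, a small \beta corresponds to the "too loose" regime and a large \beta to the "too tight" one.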

Raymond Yu@yu_raymond5

Why is this? Because simulators are imperfect: contact, friction, compliance, geometry, and forces differ enough from the real world that RL can find solutions that underperform on hardware.

Abhishek Gupta@abhishekunique7

What is pretty cool is that SCORE does not require perfect (or even near-perfect) base policies for successful improvement. What matters is the coverage of the base policy: failures, recoveries, and play data can all expand what the real-world policy is able to do.

Even if the base policy does not use these behaviors reliably for zero-shot success, SCORE can learn to steer towards them in simulation.

Some of our most robust policies came not from cleaner datasets, but from broader ones! (8/10)
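
A toy sketch of why coverage matters for steering. All names here are hypothetical: the behavior labels, the `succeeds` rule, and the 50% perturbation rate are invented for illustration, and the "steering" is an oracle that uses privileged simulator state, not the learned steering policy from the paper.

```python
import random

# Hypothetical behavior sets; the source does not list the behaviors in its datasets.
CLEAN_DATA = ["grasp"]
BROAD_DATA = ["grasp", "retry_grasp", "recover_from_slip"]


def succeeds(behavior: str, perturbed: bool) -> bool:
    """Toy simulator rule: a plain grasp only works when the block is not perturbed."""
    if perturbed:
        return behavior in ("retry_grasp", "recover_from_slip")
    return behavior in ("grasp", "retry_grasp")


def unsteered(behaviors, perturbed, rng):
    """Base policy alone: sample uniformly from whatever behaviors its data covers."""
    return rng.choice(behaviors)


def steered(behaviors, perturbed, rng):
    """Oracle steering: pick a covered behavior that succeeds, if one exists."""
    for behavior in behaviors:
        if succeeds(behavior, perturbed):
            return behavior
    return behaviors[0]  # nothing in the support works, so steering cannot help


def success_rate(behaviors, choose, episodes: int = 1000, seed: int = 0) -> float:
    """Fraction of toy episodes solved; half of the episodes are perturbed on average."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(episodes):
        perturbed = rng.random() < 0.5
        wins += int(succeeds(choose(behaviors, perturbed, rng), perturbed))
    return wins / episodes


if __name__ == "__main__":
    for name, data in (("clean", CLEAN_DATA), ("broad", BROAD_DATA)):
        print(name, "unsteered:", success_rate(data, unsteered),
              "steered:", success_rate(data, steered))
```

In this toy, steering cannot rescue the clean dataset on perturbed episodes because no recovery behavior exists in its support, whereas the broad dataset contains behaviors that steering can select.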

Raymond Yu@yu_raymond5

Huge shoutout to my extraordinary co-lead @willhuey9, who taught me a ton throughout this project. Also, big thanks to my advisor @abhishekunique7. This project would look quite different if he wasn’t pushing us so hard.

This was a fun collaboration with @mukadammh and Anusha Nagabandi at Amazon!

Website, plz visit, you have no idea how many iterations it took: https://weirdlabuw.github.io/score/ Paper: https://arxiv.org/abs/2606.27475

Raymond Yu@yu_raymond5

A common response to this issue is to add distributional constraints, such as BC/KL regularization or small residual actions, to keep the learned policy from deviating far from the base policy.

These can help, but they create a frustrating tradeoff. Constrain too loosely, and RL exploits the simulator. Constrain too harshly, and the robot barely improves.

Raymond Yu@yu_raymond5

Let’s take a different perspective. The base policy was trained on real robot trajectories, so the behaviors it can generate are grounded in things the robot can actually execute on hardware. This gives us a useful constraint: the policy’s support, or the set of actions it can already generate.

So we ask: what if simulation RL was only allowed to improve the policy within this support?

In SCORE, we freeze the base flow policy and use simulation to find latent inputs that make it succeed. Simulation learns where to steer the policy, not how to invent new actions.
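
One way to write this down, as a sketch with assumed notation that is not taken from the paper: f_{\theta} is the frozen base flow policy, w its latent input, g_{\phi} the steering policy learned in simulation, and \mathcal{W} the set of latents the base policy's noise prior can produce.

```latex
\begin{align}
  a &= f_{\theta}(s, w), \qquad w = g_{\phi}(s), \qquad \theta \text{ frozen} \\
  \max_{\phi} \; & J_{\text{sim}}(\phi)
  \quad \text{where} \quad
  \{\, f_{\theta}(s, w) : w \in \mathcal{W} \,\}
  \subseteq \operatorname{supp} \pi_{\text{base}}(\cdot \mid s)
\end{align}
```

Only \phi is optimized, so every action the steered policy emits is one the base policy could already have produced.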

Raymond Yu@yu_raymond5

Turns out by doing this, SCORE removes a lot of the annoying engineering that usually makes sim RL painful.

1.) We don’t need dense reward shaping or a careful curriculum to get policy improvement to work in simulation.
2.) We don’t need distillation or real-world finetuning to transfer it back to the robot.

Empirically, SCORE is quite amazing! Across eight tasks, we see a 2.4x improvement in real-world success, and a 36.8% speedup in policy throughput!!

Raymond Yu@yu_raymond5

And… it’s super fast to iterate on new tasks!

Just when I think I’ve wrapped up the project, my advisor @abhishekunique7 insists we add even MORE tasks. Luckily, SCORE let me finish the new tasks the following day!

Although, to my surprise, this turned into a recursive loop.

Raymond Yu@yu_raymond5

Alright, to wrap things up, the clear limitation in the room is that support constraints are nice, but they do not let policies go beyond their coverage.

This means expanding coverage is really important.

One example is adding retries, failures, and play data to the base policy. Once these behaviors exist somewhere in the support, SCORE can learn to steer toward the ones that are useful.

Some of our most robust policies came not from larger datasets, but more diverse ones!

Raymond Yu@yu_raymond5

So far, I’ve made simulation sound like the villain.

But simulation gives us a lot: cheap parallel interaction, privileged state, resets, and perturbations.

SCORE gets to keep most of these benefits, while avoiding the part where RL invents simulator-only behaviors. In practice, this lets us turn a brittle base policy into one with much higher throughput, success rate, and overall robustness.

Will Huey@willhuey9

It’s becoming increasingly common to evaluate policies in sim. Policy improvement is harder. We show why vanilla RL learns to exploit the sim2real gap, and provide a simple and principled solution.

Will Huey@willhuey9

Shoutout @yu_raymond5 for pushing this project to its absolute limit. Could not have asked for a better co-lead (and he’s somehow an undergrad)

Abhishek Gupta@abhishekunique7

This project was a huge amount of hard work by @yu_raymond5 and @willhuey9. No matter what task I threw at them, they got it to work. Really incredible work! The method works surprisingly well, and we highly recommend you try it out.

This was joint work with @mukadammh and Anusha Nagabandi at Amazon!

Website (lots of fun videos!): https://weirdlabuw.github.io/score/ Paper: https://arxiv.org/abs/2606.27475

(10/10)

Abhishek Gupta@abhishekunique7

Now, SCORE is not magic; there are clear limitations. The real-world policy can be improved by choosing better behaviors inside its support, but SCORE cannot create behaviors that were never present in the data, so it relies on the base policy having enough coverage and capability. (9/10)

Raymond Yu@yu_raymond5

Potentially obvious finding: naively taking a policy trained on real-world data and fine-tuning it with RL in simulation can produce quite dangerous behavior…

https://weirdlabuw.github.io/score/ 🧵👇
