
Self-play regularized with 30 minutes of human data produces human-like autonomous driving policies

Training required only 15 hours on a single GPU.


Original post

Daphne Cornelisse@daphne_cor

New Paper: Human-like Autonomy Emerges from Self-Play and a Pinch of Human Data.

We trained self-play RL on 60 years of simulation on 1 GPU in ~15 hours. Regularizing with 30 minutes of demonstration data produces much more human-like driving policies!

11:29 AM · Jun 19, 2026 · 32.3K Views
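As a rough sanity check on the figures in the post, the ratio of simulated experience to wall-clock time can be worked out directly. The sketch below assumes that "60 years" means total simulated driving time (possibly summed over many parallel agents, which the post does not specify) and that a year is 365.25 days. The variable names are illustrative only.

```python
# Back-of-the-envelope arithmetic for the figures quoted in the post.
# Assumptions: a year is 365.25 days; "60 years" is total simulated experience.
HOURS_PER_YEAR = 365.25 * 24

sim_hours = 60 * HOURS_PER_YEAR   # simulated experience, in hours
wall_clock_hours = 15             # approximate training time on one GPU
human_data_hours = 0.5            # 30 minutes of human demonstrations

# How many hours of simulated driving are generated per hour of training.
speedup = sim_hours / wall_clock_hours

# How small the human dataset is relative to the simulated experience.
human_fraction = human_data_hours / sim_hours

print(f"simulated hours: {sim_hours:,.0f}")
print(f"simulated hours per wall-clock hour: {speedup:,.0f}")
print(f"human data / simulated experience: {human_fraction:.2e}")
```

Under these assumptions, training generates about 35,000 hours of simulated driving per wall-clock hour, and the human demonstrations amount to roughly one millionth of the simulated experience.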

Sentiment

Commenters praise the self-play RL paper, calling it coherent, well written, and interesting. One commenter describes the unregularized policy as "superhuman".

Positive: 100.0% · Negative: 0.0% (7 comments with sentiment)


Posts from X


kache@yacineMTB

@daphne_cor Unregularized is a better driver. Deploy it in real life. Superhuman


Daphne Cornelisse@daphne_cor

Project page: https://spiced-self-play.com/ Arxiv: https://arxiv.org/abs/2606.19370


kache@yacineMTB

@daphne_cor The reason unregularized drives the way it does is because it has god sight. Give it only raytraced pixels by simulating lidar


Daphne Cornelisse@daphne_cor

A big thank you to my collaborators for their unique contributions to this work: @julianh651, Zixu Zhang, Waël Doulazmi, @kev_joseph_, Jaime Fernández Fisac, and @EugeneVinitsky!


Marcos Pereira@marcospereeira

@daphne_cor @yacineMTB I want to see the alien driving policies


Dan Advantage@DanAdvantage

@daphne_cor @julianh651 @kev_joseph_ @EugeneVinitsky very nice. thank you for a coherent, well-written paper! actually interesting research too.


Daphne Cornelisse@daphne_cor

@DanAdvantage @julianh651 @kev_joseph_ @EugeneVinitsky Thank you for the kind words! I hope it led to some high-advantage transitions


Spencer Cheng@spenccheng

@daphne_cor This is awesome Daphne!!


Victor Butoi@ion_barrel

@daphne_cor Cool!!!


Besar@moveToMoonlight

@daphne_cor regularized respects traffic, unregularized respects gradients:)


Lambda Rick 🏴‍☠️/acc@benrayfield

@yacineMTB @daphne_cor don't forget to put in human road-rage


Daphne Cornelisse@daphne_cor

@marcospereeira @yacineMTB We provide a couple of comparisons on the webpage! https://spiced-self-play.com/

e.g., see yellow car here


Dan Advantage@DanAdvantage

@daphne_cor @julianh651 @kev_joseph_ @EugeneVinitsky haha thanks for the irl chuckle


Dan Advantage@DanAdvantage

@yacineMTB @daphne_cor what they don't show is regularized still driving to point b after 10000000 hours meanwhile unreg is at the cabana sipping champagne


Daphne Cornelisse@daphne_cor

Hi! You could make the policies more cautious by intentionally creating more uncertainty in the environment. But that alone won't give you human-like policies. For context, the reward we specify is high-level and sparse (by design) - it only tells the agent: get to the goal destination safely. Since there are multiple possible ways to solve this perfectly (solution space is large), there is no guarantee that the RL policy converges to the same solution as people do. We talk about this more in the introduction, if you're interested. 🙂
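To make the point about a sparse reward and a large solution space concrete, here is a minimal sketch. It is not the paper's implementation: the reward values, the function names, and the choice of a negative log-likelihood term as the regularizer are all assumptions made for illustration. The thread only says that self-play is regularized with human demonstration data, not how.

```python
import math

def sparse_goal_reward(reached_goal: bool, collided: bool) -> float:
    """Hypothetical sparse reward: nonzero only on terminal events."""
    if collided:
        return -1.0
    if reached_goal:
        return 1.0
    return 0.0  # no shaping signal about *how* to drive along the way

def demo_nll(policy_probs: list[float], human_action: int) -> float:
    """Negative log-likelihood of the human's action under the policy.

    Assumed regularizer: it is small when the policy would have chosen
    what the human driver chose in the same state.
    """
    return -math.log(policy_probs[human_action])

def regularized_loss(rl_loss: float, policy_probs: list[float],
                     human_action: int, weight: float = 0.1) -> float:
    """RL loss plus a weighted human-likeness penalty (illustrative form)."""
    return rl_loss + weight * demo_nll(policy_probs, human_action)

# Two policies that both reach the goal safely get the same sparse reward,
# so the reward alone cannot tell them apart.
same_reward = sparse_goal_reward(True, False)

# Action 0 is what the human did in this (hypothetical) state.
human_like = regularized_loss(-same_reward, [0.9, 0.1], human_action=0)
alien = regularized_loss(-same_reward, [0.1, 0.9], human_action=0)
print(human_like < alien)  # the regularizer breaks the tie
```

Both policies solve the task equally well by the sparse reward, and only the demonstration term separates them, which is the role the post attributes to the human data.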
