Tech · 4h ago

Dwarkesh Patel argues reinforcement learning with verifiable rewards faces strict generalization limits in future AI training paradigms

Story Overview

Dwarkesh Patel's June 26 essay lays out why reinforcement learning with verifiable rewards (RLVR) is unlikely to scale into open-ended, long-horizon agents on its own. The piece stresses that grindability, meaning the ability to run massively parallel, reset-friendly rollouts, matters as much as verifiability, which explains why gains have come more slowly in computer-use domains than in math or coding. It also flags that short-horizon RLVR may fail to transfer to messy, non-stationary real-world settings, and that in-context learning cannot replace weight updates when the valuable data only appears during actual deployment.

Original post

Dwarkesh Patel @dwarkesh_sp · #60 in Tech

What does the next training paradigm look like?

0:00:00 – The big research bet the labs are making
0:02:12 – Grindability is just as important as verifiability
0:06:10 – Will RLVR alone generalize?
0:08:41 – Getting the learning back to the weights
0:15:22 – Dreaming
0:17:23 – What 2027 looks like

Also on YouTube, pod feed, and Substack.

9:56 AM · Jun 26, 2026 · 51.7K Views

Open Question

Why RLVR progress may plateau

The essay treats current RL training as classroom case studies whose lessons do not automatically translate outside the lab, leaving real-world experience largely untapped.

FYI

The missing piece for 2027 models

Patel projects that mechanisms to fold deployment-derived updates back into model weights will become central, since single-user sessions rarely supply enough samples for effective continual learning.

Sentiment

Many commenters welcomed ideas raised around the Dwarkesh podcast episode on the next AI training paradigm, such as decentralized training and realistic environments, and framed them as key solutions. Others criticized the host or the labs' current approach.

Pos: 66.7%

Neg: 33.3%

12 comments with sentiment.

Posts from X

Jayoo Hwang@JayooHwang

@dwarkesh_sp This has been bugging me for a while too.

I've written some more about why computer use is so clunky and explored a few solutions recently:

3h46411

BOOKMARKS 1 · LIKES 2

taf 🝮@tafphorisms

@dwarkesh_sp Shirt looks like it’s amazing quality but fit is tight around the chest. If you’re in Austin, try Gessane Tailors. They’ve been around forever (they did work for George Bush) and I’ve used them lots.

19m9721

RETWEETS 2

Virang Jhaveri@VirangJhaveri

@dwarkesh_sp @steipete we are focusing on this problem!

Looping on real world outcomes. Do checkout https://github.com/Nimrobo/superdense

57m3511

Sums@sums001

@dwarkesh_sp My brain after coffee.

4h3832

Aashish Reddy@_AashishReddy

🚨 NEW: Dwarkesh thinks continual learning is a big deal

(Jokes aside I think this was my favourite essay he’s written, recommended)

Dwarkesh Patel@dwarkesh_sp

What does the next training paradigm look like?

Also on YouTube, pod feed, and Substack.

3h2.2K128

Florian S@airesearch12

@dwarkesh_sp Decentralized. It's the only way to tackle the hardware shortage and absurd compute costs. This to me looks very much like the solution:

2h2331

Nathan Witkin@NateWitkin

Discuss some further reasons to be skeptical about 'RL for everything' here (along with much else): https://open.substack.com/pub/arachnemag/p/ais-reliability-gap?r=18kjq3&utm_campaign=post-expanded-share&utm_medium=web

TLDR:

1. Most tasks lack objective success conditions.

2. Even those that have them may admit to a range of solutions workers have conflicting preferences over.

3. Knowledge work is always changing (especially now) so some RL environments may become obsolete before they can be made useful.

3h2201

Dan@DanMeier20

@dwarkesh_sp the only question left is which lab will buy you

3h169

Kevin Son@oraclekev

The approach taken by frontier labs is a natural progression of scaling the current paradigm. It essentially brute-forces the saturation of specialized tasks using massive compute and feedback loops. Ultimately, this is a stopgap measure until true continual learning and generalization are achieved; while performance on these tasks will continue to improve, significant gaps will always remain when encountering unexpected long-tail events. The issue with these long-tail events is that if they happen to be mission-critical, it’s game over for the entire workstream. Furthermore, it is virtually impossible to predict when one of these data-insufficiency or hallucination events will occur. As you eloquently elaborated on sample-efficient continual learning, I would just add that OPSD should be classified as continual fine-tuning rather than continual learning, as weight updates are sparse because an entire internet's worth of patterns is already present from pre-training.

3h119

David@dave21_ai

@dwarkesh_sp just stop this already

3h114

Fraser@FraserGreenlee

@dwarkesh_sp Given the labs already use agents internally across teams they have a great test bed for online learning.

2h108

Alejandro Méndez@Alejand63610945

@dwarkesh_sp substrate of truth.

3h101

judah@joodalooped

@dwarkesh_sp I'M SORRY BUT I MUST SHILL "WORK DATA" AS A TERM ONCE AGAIN

w.r.t. the end of the video this time, i.e. the new basis of scaling

https://anjalishriva.com/work-data/

1h281

xiao sun@xiaosun86

@dwarkesh_sp just like ai solve the world problem through coding, ai should understand the world through modeling. finite element method, solving differential equations, even making metaphors, yes analogy is also a kind of modeling, lossy, with boundaries, though.

1h63

eishan@eishanlawrence5

@dwarkesh_sp Please let the experts talk about this stuff

2h41

PJ Standley@ArkStabler

A customer of mine lost their key man 3 weeks after I started working with them.

The new person asked the @openclaw agent “we lost our key man, what do you know about the company and how do I keep it moving while learning?”

Gobs of important company information poured out of the agent.

New employee onboarded and orchestrating smoothly

My customer was elated

@openclaw is the way for small businesses to tackle these issues in their world.

56m10

Darshan V@darshanv3v

@dwarkesh_sp OPSD + dreaming seems like a brilliant idea , I am not sure if EBM's will become mainstream, but feels like EBM might be a good candidate for model architecture since they are good calibrated verifiers it would also allow for dreaming

3h7

deep Manifold@BetaTomorrow

@dwarkesh_sp This may answer your questions...

2h5

Owl@LetMeChatGPThat

@dwarkesh_sp What bookshelves did you buy?

9m4

Mario Princess@majesticcoder

@dwarkesh_sp Really interested in the idea of getting the learning back to the weights. Feels like a key question for where AI goes next.

3h2