Tech · 28d ago

Eric Jang argues AlphaGo's Monte Carlo Tree Search bypasses the credit assignment problems plaguing naive LLM reinforcement learning

Policy gradients struggle on trajectories exceeding 100,000 tokens.

Original post

Dwarkesh Patel@dwarkesh_sp · #60 in Tech

Every variant of Monte Carlo Tree Search faces the explore-exploit tradeoff: pick the branch that looks best right now, or test new branches?

Algorithms like PUCT, used in AlphaGo, score each move with two competing terms.

One is how good a move looks based on your exploration up till now. The other is a novelty bonus that rewards moves you've not visited much.

The neat thing is that, over time, the term dominating the overall score shifts automatically. The algorithm hands off from 'explore' to 'exploit' all on its own.

@ericjang11 explains how it works:

8:02 AM · May 20, 2026 · 16.7K Views
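
The two-term score described in the post above can be sketched in a few lines. This is a minimal illustration of the commonly cited PUCT form, Q + c_puct * P * sqrt(N_parent) / (1 + N), not code from the lecture. The `Edge` class, the `select` helper, and the constant `c_puct=1.5` are assumptions made for the example.

```python
import math
from dataclasses import dataclass


@dataclass
class Edge:
    """Statistics for one candidate move (hypothetical container for this sketch)."""
    prior: float            # P: prior probability of the move from a policy
    visits: int = 0         # N: how many times search has tried this move
    value_sum: float = 0.0  # sum of values backed up through this move

    @property
    def q(self) -> float:
        # Mean value so far; 0.0 for a move that has never been visited.
        return self.value_sum / self.visits if self.visits else 0.0


def puct_score(edge: Edge, parent_visits: int, c_puct: float = 1.5) -> float:
    exploit = edge.q  # how good the move looks so far
    # Novelty bonus: large for rarely visited moves, shrinks as visits grow.
    explore = c_puct * edge.prior * math.sqrt(parent_visits) / (1 + edge.visits)
    return exploit + explore


def select(edges: dict, c_puct: float = 1.5) -> str:
    """Return the move with the highest PUCT score."""
    parent_visits = sum(e.visits for e in edges.values())
    return max(edges, key=lambda m: puct_score(edges[m], parent_visits, c_puct))


# Early in search: the unvisited move wins on its novelty bonus.
early = {"a": Edge(prior=0.5, visits=10, value_sum=6.0), "b": Edge(prior=0.5)}
print(select(early))  # b

# Late in search: bonuses have shrunk, so the higher mean value wins.
late = {
    "a": Edge(prior=0.5, visits=1000, value_sum=600.0),
    "b": Edge(prior=0.5, visits=1000, value_sum=300.0),
}
print(select(late))  # a
```

Because the bonus is divided by 1 + N, it fades for any move that keeps getting visited, which is the automatic handoff from explore to exploit that the post describes.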

Sentiment

Many users praised Eric Jang's clear explanations of MCTS and related AI concepts in the Dwarkesh episode, along with the accessible blackboard-and-clip format, which sparks curiosity and simplifies complex ideas.

Positive: 92.8%

Negative: 7.2%

10 comments with sentiment.

Via DWARKESH.COM

Posts from X

Most Activity

Views 67.6K · Bookmarks 619 · Likes 686 · Retweets 60 · Replies 21

Dwarkesh Patel@dwarkesh_sp

Monte Carlo Tree Search training corrects the model move by move, while current LLM training only tells it whether the whole trajectory worked.

MCTS is preferable if you can get it. But nobody's managed to get MCTS to work for language models.

In his blackboard lecture @ericjang11 talked to me about why:

28d67.6K686619
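
The contrast drawn in the post above, one verdict for the whole trajectory versus a correction at every move, can be made concrete with a toy sketch. Everything here is an assumption for illustration: the function names are hypothetical, and the plus or minus 1 labels are a simplification. AlphaZero-style training uses the search's visit distribution at each state as the target rather than a binary label.

```python
from typing import List, Sequence


def outcome_only_signal(moves: Sequence[str], succeeded: bool) -> List[float]:
    """Every move gets the same label: did the whole trajectory work?"""
    reward = 1.0 if succeeded else -1.0
    return [reward for _ in moves]


def per_move_signal(moves: Sequence[str], search_moves: Sequence[str]) -> List[float]:
    """Each move is compared with what a search preferred at that step."""
    if len(moves) != len(search_moves):
        raise ValueError("moves and search_moves must have the same length")
    return [1.0 if m == s else -1.0 for m, s in zip(moves, search_moves)]


played = ["a", "b", "c"]       # moves the model actually made (it lost)
preferred = ["a", "x", "c"]    # moves a hypothetical search would have chosen

print(outcome_only_signal(played, succeeded=False))  # [-1.0, -1.0, -1.0]
print(per_move_signal(played, preferred))            # [1.0, -1.0, 1.0]
```

With the outcome-only signal, the two good moves are penalized along with the bad one. The per-move signal isolates the second move as the mistake, but it depends on having a search that can produce a trustworthy preference at every step.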

Boaz Barak@boazbaraktcs

I can't wait for codex to be able to do all its work unsupervised so I have the time to watch all the @dwarkesh_sp 's blackboard lectures.

24d9.2K7221

Dwarkesh Patel@dwarkesh_sp

Watch Eric's full lecture here: https://www.dwarkesh.com/p/eric-jang

28d11.2K2724

Dwarkesh Patel@dwarkesh_sp

Watch the full episode here: https://www.dwarkesh.com/p/eric-jang

28d5.8K158

Brian Cheong@briancheong

@dwarkesh_sp Feels like the same gap we hit with agentic coding: you need step-level verifiers, not just end-to-end reward. Without a cheap partial-state scorer, MCTS stays mostly theoretical for language.

27d54711

AiDevCraft@AiDevCraft

The MCTS-for-LLMs ceiling is really about rollout cost, not search itself: every leaf evaluation is a full model call that costs dollars and seconds, so the thousands of simulations Go used per move are economically out of reach. Process reward models are the practical shadow of move-by-move credit until rollouts get an order of magnitude cheaper.

27d24511

IAm@JoeBlogs685544

You are describing the difference between trajectory-level supervision and stepwise coherence correction. MCTS works in games because the system has a well-defined local evaluation signal at every branch. Language does not give you that. The model is navigating a high-dimensional attractor network where most intermediate states have no intrinsic reward signal.

The real obstacle is not engineering; it is structural. In language, the coherence of a partial sequence is not a local property. It is a global constraint pattern that only becomes measurable once the structure has stabilized. MCTS assumes you can score intermediate states independently. Natural language does not permit that, because the constraint density is distributed across the entire sequence.

If you want something MCTS-like for language, you need a functional that measures local coherence curvature at each step. Without that, the search tree has no gradient to follow. This is why current RLHF methods operate on whole trajectories: they are correcting the global pattern rather than the local moves.

So the path forward is not to force MCTS onto language. It is to define a stepwise coherence metric that can serve as the equivalent of a value function. Once you have that, tree search becomes viable. Right now, the field is missing the metric, not the algorithm.

27d12711

Ash@ashworks1706

@DnuLkjkjh @dwarkesh_sp There have been efforts, and the model still reward-hacks the step-level reward by exiting early. The result is poor reasoning accuracy: it doesn't go deep enough into the problem because it's too scared of the step penalty.

27d17

dnu@DnuLkjkjh

@dwarkesh_sp step-level credit assignment is the part i’d want in coding agents too. outcome-only training hides whether the miss was tool choice, edit order, or bad context packing.

28d1271

christopher peel@ChristopherPeel

@dwarkesh_sp @grok can you summarize and provide resources for building with this?

28d78

Santosh@SantoshStyles

@dwarkesh_sp These clip intros make each piece so easy to learn. Almost in a turn-your-brain-off sort of way. They spike my curiosity and prime me to get into the concepts without the friction of missing a piece or two.

Not always easy to do with STEM

27d4082

kayvon! (building murray ai)@MurrayAI

Exactly. Outcome-only supervision works okay for chat, but it’s a nightmare for agents that have to operate step-by-step in noisy or physical environments.

If you can’t tell the model which specific action was wrong (vs just “the whole plan failed”), learning robust recovery and guardrails becomes extremely sample-inefficient.

27d1.2K

James Kovalenko@deburdened

@dwarkesh_sp MCTS works because it turns trajectory-level failure into move-level correction.

the missing piece is a verifier that can score partial semantic states without collapsing into final-answer reward.

Once partial states become auditable, search trajectories can be retained.

28d1201

Behesi Darinda@Maverick_Quant

@dwarkesh_sp This is for deterministic environments with perfect information. For incomplete-information or stochastic environments, you need CFR (counterfactual regret minimization).

27d341

Jonathan Bergqvist@j_bergq

@dwarkesh_sp Loved the episode, please do more in this format

26d295

joel@AISloppyJoel

@dwarkesh_sp Dwarkesh I think you’re the only hope to fix the literacy crisis post AGI

26d145

Duoduo@duoduobz

@dwarkesh_sp Move-level feedback is the missing granularity. Whole-trajectory reward tells an agent that something went wrong; it does not teach where the plan bent. Long-horizon agents need credit assignment you can inspect step by step.

27d139

MumbaiPanda@DarwinianVyas

@dwarkesh_sp "Current LLM training only tells it if the whole trajectory worked" No man, even academia has moved way beyond R1. Techniques like on-policy self-correction are ubiquitous in academic papers.

26d134

Alper FERUDUN@AlperTheKing

@dwarkesh_sp For agents, PUCT needs a cost term too. The branch with the best expected value may also be a 40-tool-call rabbit hole. Planning policy should score reward, uncertainty, wall-clock, token burn, and reversibility before expanding.

28d92
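
One way to read the suggestion above is as a penalty subtracted from the usual PUCT score. The sketch below is hypothetical and does not come from any published system: the function `cost_aware_puct`, the single `expected_cost` number (standing in for tool calls, tokens, or wall-clock time), and the `cost_weight` values are all assumptions chosen for illustration.

```python
import math


def cost_aware_puct(
    q: float,
    prior: float,
    visits: int,
    parent_visits: int,
    expected_cost: float,
    c_puct: float = 1.5,
    cost_weight: float = 0.01,
) -> float:
    """Standard PUCT score minus a linear penalty on expected cost (assumed form)."""
    bonus = c_puct * prior * math.sqrt(parent_visits) / (1 + visits)
    return q + bonus - cost_weight * expected_cost


# Two branches with identical priors and visit counts.
# "deep" has the better value estimate but needs about 40 tool calls.
shared = dict(prior=0.5, visits=10, parent_visits=20)
deep = dict(q=0.8, expected_cost=40.0, **shared)
cheap = dict(q=0.6, expected_cost=2.0, **shared)

# Without a cost term, the higher-value branch is preferred.
print(cost_aware_puct(cost_weight=0.0, **deep) > cost_aware_puct(cost_weight=0.0, **cheap))  # True

# With the penalty, the cheap branch scores higher.
print(cost_aware_puct(**cheap) > cost_aware_puct(**deep))  # True
```

A linear penalty is only the simplest choice. Factors such as uncertainty or reversibility, which the comment also lists, would need their own terms and weights.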

Adel Bucetta@adelbucetta

@dwarkesh_sp the honest answer is that scaling mcts to language models requires a fundamental shift in how we think about optimization and feedback loops. current approaches just aren't designed to handle the complexity

27d90