AlphaGo's PUCT Algorithm Automatically Shifts From Exploration To Exploitation
——0——
Watch the full episode here: https://www.dwarkesh.com/p/eric-jang
Every variant of Monte Carlo Tree Search faces the explore-exploit tradeoff: pick the branch that looks best right now, or test new branches? Algorithms like PUCT, used in AlphaGo, score each move with two competing terms. One is how good a move looks based on your exploration up till now. The other is a novelty bonus that rewards moves you've not visited much. The neat thing is that, over time, the term dominating the overall score shifts automatically. The algorithm hands off from 'explore' to 'exploit' all on its own. @ericjang11 explains how it works:
3:02 PM · May 20, 2026 · 10.2K Views
3:02 PM · May 20, 2026 · 4.4K Views