Dwarkesh Podcast

Building AlphaGo from scratch – Eric Jang

Eric Jang walks through how to build AlphaGo from scratch, but with modern AI tools.

Sometimes you understand the future better by stepping backward. AlphaGo is still the cleanest worked example of the primitives of intelligence: search, learning from experience, and self-play. You have to go back to 2017 to get insight into how the more general AIs of the future might learn.

Once he explained how AlphaGo works, it gave us the context to have a discussion about how RL works in LLMs and how it could work better. Naive policy gradient RL has to figure out which of the 100k+ tokens in your trajectory actually got you the right answer, while AlphaGo’s MCTS suggests a strictly better action every single move, giving you a training target that sidesteps the credit assignment problem. The way humans learn is surely closer to the second.

Eric also kickstarted an Autoresearch loop on his project. It was very interesting to discuss which parts of AI research LLMs can already automate pretty well (implementing and running experiments, optimizing hyperparameters) and which they still struggle with (choosing the right question to investigate next, escaping research dead ends). This is informative for all the recent discussion about when we should expect an intelligence explosion, and what it would look like from the inside.

EPISODE LINKS

* Check out the flashcards I wrote to retain the insights: https://flashcards.dwarkesh.com/eric-jang/
* Transcript: https://www.dwarkesh.com/p/eric-jang

SPONSORS

- Cursor's agent SDK let me build a pipeline to generate flashcards for this episode. For each card, I had an agent read the transcript, ingest blackboard screenshots, generate an SVG visual, and run everything through a critic. A durable agent is much better at this kind of work than a chain of LLM calls, and Cursor's SDK made it easy. Check out the cards at https://flashcards.dwarkesh.com and get started with the SDK at https://cursor.com/dwarkesh
- Jane Street gave me a real deep-dive tour of one of their datacenters. I got to ask a bunch of questions to Ron Minsky, who co-leads Jane Street's tech group, and Dan Pontecorvo, who runs Jane Street's physical engineering team. They were willing to literally pull up the floorboards and take out racks to explain how everything works. Check out the full tour at https://janestreet.com/dwarkesh

To sponsor a future episode, visit https://dwarkesh.com/advertise.

TIMESTAMPS

00:00:00 – Basics of Go
00:08:06 – Monte Carlo Tree Search
00:31:53 – What the neural network does
01:00:22 – Self-play
01:25:27 – Alternative RL approaches
01:45:36 – Why doesn’t MCTS work for LLMs
02:00:58 – Off-policy training
02:11:51 – RL is even more information inefficient than you thought
02:22:05 – Automated AI researchers

Dwarkesh Patel (host) · Eric Jang (guest)

May 15, 2026 · 2h 37m

WHAT IT’S REALLY ABOUT

Rebuilding AlphaGo: Go rules, MCTS, neural nets, and self-play scaling

The episode introduces the rules and scoring nuances of Go (especially Tromp-Taylor) to show why naïve search is intractable and why value estimation is essential.
It walks through Monte Carlo Tree Search mechanics (UCB/PUCT), the node statistics stored (Q, visit counts, priors), and the selection–expansion–evaluation–backup loop used per move.
It explains AlphaGo’s two-headed neural network (policy and value), how the policy prior guides tree expansion while the value head truncates depth, and why the CNN/ResNet inductive bias still holds up well against transformers in low-data regimes.
It details the key training loop: self-play with MCTS produces improved (often “softer”) action distributions that are distilled back into the policy network, making future search cheaper and stronger over time.
The conversation broadens to why MCTS ideas don’t transfer cleanly to LLMs (huge branching, low revisits, weak local evaluation), why naïve policy-gradient RL is information-inefficient, and what current LLM agents can/can’t automate in research iteration.
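As a concrete companion to the Tromp-Taylor scoring mentioned in the first point above, here is a minimal sketch of area scoring: each player gets one point per stone of their color, plus every empty point whose region touches only their color. The board encoding (`'B'`, `'W'`, `'.'`) and the function name are assumptions made for illustration, and komi is left out.

```python
from collections import deque

def tromp_taylor_score(board):
    """board: list of equal-length strings using 'B', 'W', and '.'.
    Returns (black_score, white_score) before komi."""
    rows, cols = len(board), len(board[0])
    score = {"B": 0, "W": 0}
    seen = set()
    for r in range(rows):
        for c in range(cols):
            cell = board[r][c]
            if cell in score:
                score[cell] += 1          # every stone counts for its owner
            elif (r, c) not in seen:
                # Flood-fill this empty region and record which colors border it.
                region, borders = 0, set()
                queue = deque([(r, c)])
                seen.add((r, c))
                while queue:
                    y, x = queue.popleft()
                    region += 1
                    for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                        if 0 <= ny < rows and 0 <= nx < cols:
                            neighbor = board[ny][nx]
                            if neighbor == ".":
                                if (ny, nx) not in seen:
                                    seen.add((ny, nx))
                                    queue.append((ny, nx))
                            else:
                                borders.add(neighbor)
                # Empty points score only if they reach exactly one color.
                if len(borders) == 1:
                    score[borders.pop()] += region
    return score["B"], score["W"]

example = [
    ".B.W.",
    ".B.W.",
    ".B.W.",
    ".B.W.",
    ".B.W.",
]
black, white = tromp_taylor_score(example)
```

In the example, the middle column touches both colors and so counts for nobody, which is the kind of scoring nuance that makes a purely mechanical rule set convenient for self-play.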

IDEAS WORTH REMEMBERING

5 ideas

AlphaGo’s core trick is shrinking both search depth and breadth with learned priors.

The value network approximates “who wins from here” so rollouts don’t need to reach terminal states (depth reduction), while the policy network concentrates probability mass on plausible moves so MCTS explores far fewer branches (breadth reduction).
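A back-of-the-envelope calculation can make the depth and breadth reduction tangible. The sketch below counts leaves of a uniform tree as branching^depth. The figures of roughly 250 moves per position and 150 moves per game are commonly cited rough numbers for Go, and the "pruned" numbers are purely hypothetical stand-ins for what a prior and a value head might buy.

```python
import math

def log10_leaves(branching: int, depth: int) -> float:
    """log10 of branching**depth, the leaf count of a uniform tree."""
    return depth * math.log10(branching)

# Rough full-game tree: ~250 legal moves, ~150 plies.
full = log10_leaves(250, 150)

# Hypothetical pruned search: the policy prior keeps ~5 candidate moves,
# and the value head lets search stop after ~20 plies.
pruned = log10_leaves(5, 20)

print(f"full tree:   about 10^{full:.0f} leaves")
print(f"pruned tree: about 10^{pruned:.0f} leaves")
```

The exact numbers matter less than the shape of the result: cutting either the base or the exponent alone still leaves an enormous tree, so both reductions are needed.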

MCTS is built online while searching, not precomputed.

Because Go’s game tree is astronomically large, you iteratively grow the tree during simulations using selection→expansion→evaluation→backup, rather than generating the full tree and then searching it.
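The selection→expansion→evaluation→backup loop can be sketched on a toy game. The example below is an illustrative assumption, not the episode's code: it uses a take-1-to-3-stones game (whoever takes the last stone wins), uniform priors in place of a policy network, and a single random playout in place of a value network. All names (`Node`, `search`, `evaluate`) are hypothetical.

```python
import math
import random

class Node:
    def __init__(self, pile, prior):
        self.pile = pile        # stones left; the player to move is implicit
        self.prior = prior      # P(a), assigned by the parent at expansion
        self.children = {}      # action -> Node, grown lazily during search
        self.visits = 0         # N
        self.value_sum = 0.0    # W, from the view of the player to move here

    def q(self):
        return self.value_sum / self.visits if self.visits else 0.0

def legal_moves(pile):
    return [m for m in (1, 2, 3) if m <= pile]

def evaluate(pile, rng):
    """Stand-in for a value network: one random playout.
    Returns +1 if the player to move at `pile` wins, else -1."""
    sign = 1
    while pile > 0:
        pile -= rng.choice(legal_moves(pile))
        sign = -sign
    return -sign  # whoever faces an empty pile has lost

def select_child(node, c_puct=1.5):
    def score(child):
        explore = c_puct * child.prior * math.sqrt(node.visits) / (1 + child.visits)
        return -child.q() + explore  # child's Q is from the opponent's view
    return max(node.children.items(), key=lambda kv: score(kv[1]))

def search(root_pile, n_simulations=2000, seed=0):
    rng = random.Random(seed)
    root = Node(root_pile, prior=1.0)
    for _ in range(n_simulations):
        node, path = root, [root]
        while node.children:                      # 1. selection
            _, node = select_child(node)
            path.append(node)
        moves = legal_moves(node.pile)            # 2. expansion
        for m in moves:
            node.children[m] = Node(node.pile - m, prior=1.0 / len(moves))
        value = evaluate(node.pile, rng)          # 3. evaluation
        for visited in reversed(path):            # 4. backup, flipping sign per ply
            visited.visits += 1
            visited.value_sum += value
            value = -value
    return {m: child.visits for m, child in root.children.items()}

visits = search(5)
best_move = max(visits, key=visits.get)
```

With 5 stones, the winning move is to take 1 and leave a multiple of 4, and the visit counts should concentrate there. The tree only ever contains positions the simulations actually reached, which is the sense in which it is built online.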

PUCT combines exploitation, exploration, and a neural prior into a single selection rule.

The Q term exploits known good moves, the exploration term pushes visits to underexplored children, and the policy prior P(a) biases exploration toward network-suggested moves—crucial when the action space is large.
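A small worked example may help show how the three terms trade off. The sketch below assumes the AlphaGo Zero style form of the rule, score(a) = Q(a) + c_puct · P(a) · sqrt(ΣN) / (1 + N(a)); the constant `c_puct=1.5` and the child statistics are made-up values for illustration.

```python
import math

def puct_scores(q, n, prior, c_puct=1.5):
    """Score each child as Q + c_puct * P * sqrt(total visits) / (1 + N)."""
    total = sum(n)
    return [
        q_a + c_puct * p_a * math.sqrt(total) / (1 + n_a)
        for q_a, n_a, p_a in zip(q, n, prior)
    ]

# Three hypothetical children: a well-explored decent move, an unvisited move
# the network likes, and an unvisited move the network dislikes.
q     = [0.30, 0.00, 0.00]
n     = [90,   0,    0]
prior = [0.50, 0.45, 0.05]

scores = puct_scores(q, n, prior)
next_child = max(range(len(scores)), key=scores.__getitem__)
```

Here the unvisited move with the high prior is selected next. The heavily visited child has a small exploration bonus because its visit count sits in the denominator, and the low-prior child waits much longer for its turn, which is how the prior keeps a large action space manageable.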

Training works because MCTS produces dense, low-variance supervision at every move.

Instead of only reinforcing entire winning trajectories, self-play plus MCTS yields an improved target distribution π_MCTS(s) for each encountered state, turning much of learning into stable supervised learning on better labels.
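To make the "supervised learning on better labels" framing concrete, here is a minimal sketch of a per-state training target and loss, assuming the AlphaGo Zero style objective: cross-entropy between the MCTS visit distribution and the policy output, plus squared error between the value prediction and the game outcome. Weight regularization is omitted, and the function names are hypothetical.

```python
import math

def visit_distribution(visits, temperature=1.0):
    """Turn MCTS visit counts into a target proportional to N^(1/T)."""
    powered = [v ** (1.0 / temperature) for v in visits]
    total = sum(powered)
    return [p / total for p in powered]

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def policy_value_loss(policy_logits, value_pred, pi_mcts, outcome):
    """Cross-entropy to the search distribution plus squared value error."""
    p = softmax(policy_logits)
    policy_loss = -sum(t * math.log(p_a) for t, p_a in zip(pi_mcts, p) if t > 0)
    value_loss = (outcome - value_pred) ** 2
    return policy_loss + value_loss

# Hypothetical search result for one position with three legal moves.
pi = visit_distribution([70, 20, 10])

# A network that already matches the search, versus an uninformed one.
matched = policy_value_loss([math.log(t) for t in pi], 1.0, pi, outcome=1.0)
uniform = policy_value_loss([0.0, 0.0, 0.0], 0.0, pi, outcome=1.0)
```

Every position visited in self-play yields one such (state, π_MCTS, outcome) example, so the network receives a gradient signal at each move rather than a single reward spread over the whole game.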

Soft targets (full MCTS visit distributions) carry more information than hard argmax labels.

Jang notes distillation is powerful partly because matching a distribution conveys “dark knowledge” about relative move quality; training only on the single chosen move discards that information and can slow learning.
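One crude way to see what a hard label throws away is to compare two positions whose search results differ sharply but collapse to the same argmax. The distributions below are invented for illustration, and entropy in bits is used only as a rough proxy for how much a target says about relative move quality.

```python
import math

def entropy_bits(dist):
    """Shannon entropy of a distribution, in bits."""
    return -sum(p * math.log2(p) for p in dist if p > 0)

def one_hot_argmax(dist):
    """The hard label: all mass on the most visited move."""
    best = max(range(len(dist)), key=dist.__getitem__)
    return [1.0 if i == best else 0.0 for i in range(len(dist))]

# Hypothetical MCTS visit distributions for two different positions.
close_call = [0.50, 0.45, 0.05]   # two moves are nearly as good as each other
clear_cut  = [0.98, 0.01, 0.01]   # one move dominates

hard_a = one_hot_argmax(close_call)
hard_b = one_hot_argmax(clear_cut)
```

Both positions produce the identical hard label `[1.0, 0.0, 0.0]`, so a network trained on it never learns that the second move in the first position was almost as good. The soft targets keep that distinction.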

WORDS WORTH SAVING

5 quotes

And, and this is what makes Go a very beautiful game, is that, um, you can kind of, uh, lose the battle but win the war.

— Eric Jang

It was just profound to see, you know, how smart AI systems could become and the, the kind of computational complexity class that they could tackle with deep learning.

— Eric Jang

And, and they're essentially running a neural network that looks at a board, and implicitly they are amortizing a huge number of possible game play-outs and, and taking that average and then deciding whether the board is winnable or not.

— Eric Jang

So, so this was a breakthrough that I think most people don't even understand today, like, fully comprehend, like, how profound that accomplishment is.

— Eric Jang

The major reason is that you never have to initialize at a zero percent success rate and solve the exploration problem of how to get a non-zero success rate.

— Eric Jang

Go rules and Tromp-Taylor scoring
Combinatorial explosion and transpositions in Go trees
Monte Carlo Tree Search (PUCT/UCB), node data structures
Policy/value networks and architectures (ResNet vs transformer)
MCTS as a policy improvement/distillation operator
Self-play bootstrapping and value-function grounding
Off-policy buffers, relabeling, and stability in RL/robotics
Why MCTS is hard for LLM reasoning
Bits-per-sample, soft targets, and RL information inefficiency
Automated research agents: hyperparameter search vs experiment selection

High quality AI-generated summary created from speaker-labeled transcript.
