Building AlphaGo from scratch – Eric Jang

Eric Jang walks through how to build AlphaGo from scratch, but with modern AI tools.

Sometimes you understand the future better by stepping backward. AlphaGo is still the cleanest worked example of the primitives of intelligence: search, learning from experience, and self-play. You have to go back to 2017 to get insight into how the more general AIs of the future might learn.

Once he explained how AlphaGo works, it gave us the context to have a discussion about how RL works in LLMs and how it could work better – naive policy gradient RL has to figure out which of the 100k+ tokens in your trajectory actually got you the right answer, while AlphaGo’s MCTS suggests a strictly better action every single move, giving you a training target that sidesteps the credit assignment problem. The way humans learn is surely closer to the second.

Eric also kickstarted an Autoresearch loop on his project. And it was very interesting to discuss which parts of AI research LLMs can already automate pretty well (implementing and running experiments, optimizing hyperparameters) and which they still struggle with (choosing the right question to investigate next, escaping research dead ends). Informative to all the recent discussion about when we should expect an intelligence explosion, and what it would look like from the inside.

EPISODE LINKS
* Check out the flashcards I wrote to retain the insights: https://flashcards.dwarkesh.com/eric-jang/
* Transcript: https://www.dwarkesh.com/p/eric-jang

SPONSORS
- Cursor's agent SDK let me build a pipeline to generate flashcards for this episode. For each card, I had an agent read the transcript, ingest blackboard screenshots, generate an SVG visual, and run everything through a critic. A durable agent is much better at this kind of work than a chain of LLM calls, and Cursor's SDK made it easy. Check out the cards at https://flashcards.dwarkesh.com and get started with the SDK at https://cursor.com/dwarkesh
- Jane Street gave me a real deep-dive tour of one of their datacenters. I got to ask a bunch of questions to Ron Minsky, who co-leads Jane Street's tech group, and Dan Pontecorvo, who runs Jane Street's physical engineering team. They were willing to literally pull up the floorboards and take out racks to explain how everything works. Check out the full tour at https://janestreet.com/dwarkesh

To sponsor a future episode, visit https://dwarkesh.com/advertise.

TIMESTAMPS
00:00:00 – Basics of Go
00:08:06 – Monte Carlo Tree Search
00:31:53 – What the neural network does
01:00:22 – Self-play
01:25:27 – Alternative RL approaches
01:45:36 – Why doesn’t MCTS work for LLMs
02:00:58 – Off-policy training
02:11:51 – RL is even more information inefficient than you thought
02:22:05 – Automated AI researchers

Dwarkesh Patel (host), Eric Jang (guest)

May 15, 2026 · 2h 37m

CHAPTERS

Why rebuild AlphaGo now: compute efficiency, KataGo, and the mystery of “amortized search”
Eric explains why AlphaGo remains a uniquely instructive system: it solved an intractable search problem by compressing deep lookahead into a relatively small neural network plus search. He motivates his sabbatical project—rebuilding and modifying AlphaGo—by noting how modern open-source systems like KataGo drastically reduced training compute, making “team-of-DeepMind” scale results reproducible on a modest budget.
Go rules crash course: capturing, threats, and why humans stop early
They walk through the basic mechanics of Go: placing stones, liberties, captures, and tactical threats akin to “check” in chess. Eric emphasizes that Go’s beauty comes from trading local losses for global gains, which later connects to why evaluation is hard and why value functions matter.
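The liberty and capture mechanics described here can be sketched in a few lines. This is a minimal illustration, not code from the episode: the board representation (a dict from points to "B" or "W") and the names `neighbors`, `group_and_liberties`, and `play` are assumptions made for this example, and suicide and ko rules are deliberately left out.

```python
from typing import Dict, Iterator, Set, Tuple

Point = Tuple[int, int]
Board = Dict[Point, str]  # occupied points only, mapped to "B" or "W"


def neighbors(p: Point, size: int) -> Iterator[Point]:
    """Yield the orthogonally adjacent points that lie on the board."""
    r, c = p
    for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
        q = (r + dr, c + dc)
        if 0 <= q[0] < size and 0 <= q[1] < size:
            yield q


def group_and_liberties(board: Board, start: Point, size: int) -> Tuple[Set[Point], Set[Point]]:
    """Flood-fill the chain containing `start`; return (stones, liberties)."""
    color = board[start]
    stones, libs, stack = {start}, set(), [start]
    while stack:
        p = stack.pop()
        for q in neighbors(p, size):
            if q not in board:
                libs.add(q)  # an empty neighbor is a liberty
            elif board[q] == color and q not in stones:
                stones.add(q)
                stack.append(q)
    return stones, libs


def play(board: Board, move: Point, color: str, size: int) -> Board:
    """Place a stone, then remove opposing chains left with no liberties."""
    new = dict(board)
    new[move] = color
    for q in neighbors(move, size):
        if q in new and new[q] != color:
            stones, libs = group_and_liberties(new, q, size)
            if not libs:
                for s in stones:
                    del new[s]
    return new
```

For example, a lone white stone with three black neighbors has one liberty left, which is the "check"-like threat; a black stone on that last liberty captures it.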
Scoring and rulesets: Tromp-Taylor vs human ambiguity (and why it matters for AI)
Eric contrasts human scoring conventions with Tromp-Taylor rules, which are unambiguous and thus favored for Go AI training. The discussion highlights a key theme: humans use implicit evaluation to declare a game “done” before full resolution, while computers need a precise terminal definition unless they learn a value function.
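To show why Tromp-Taylor scoring is easy to automate, here is a sketch of area scoring: each player gets one point per stone, plus every empty region that touches only their color. The board representation and function names are assumptions for this example, and komi is omitted.

```python
from typing import Dict, Iterator, Tuple

Point = Tuple[int, int]
Board = Dict[Point, str]  # occupied points only, mapped to "B" or "W"


def neighbors(p: Point, size: int) -> Iterator[Point]:
    r, c = p
    for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
        q = (r + dr, c + dc)
        if 0 <= q[0] < size and 0 <= q[1] < size:
            yield q


def tromp_taylor_score(board: Board, size: int) -> Dict[str, int]:
    """Area score: stones plus empty regions that reach only one color."""
    score = {"B": 0, "W": 0}
    for color in board.values():
        score[color] += 1
    seen = set()
    for r in range(size):
        for c in range(size):
            p = (r, c)
            if p in board or p in seen:
                continue
            # Flood-fill this empty region and record which colors border it.
            region, borders, stack = {p}, set(), [p]
            while stack:
                cur = stack.pop()
                for q in neighbors(cur, size):
                    if q in board:
                        borders.add(board[q])
                    elif q not in region:
                        region.add(q)
                        stack.append(q)
            seen |= region
            if len(borders) == 1:  # region touches exactly one color
                score[borders.pop()] += len(region)
    return score
```

No judgment about "dead" stones is needed: whatever is on the board at the end is counted mechanically, which is the property that makes the ruleset convenient as a terminal definition for self-play.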
Why Go is hard computationally: exploding game trees, symmetries, and depth
They frame Go as a deterministic, perfect-information game with an enormous branching factor and long horizon, making brute-force search impossible. Eric explains transpositions (different move orders reaching the same state) and why even with merging, the search space is astronomically large.
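A quick back-of-the-envelope sketch of both points. The upper bound assumes each of the 361 points is independently empty, black, or white (many such configurations are illegal), and the transposition key is a simplification that ignores captures, side to move, and ko history. The names here are illustrative only.

```python
# Crude upper bound on 19x19 board configurations: 3 states per point.
points = 19 * 19
upper_bound = 3 ** points
digits = len(str(upper_bound))
print(f"3^{points} has {digits} decimal digits")


def position_key(moves):
    """Order-independent key: the set of (point, color) placements."""
    return frozenset(moves)


# Two different move orders that reach the same stones on the board.
order_1 = [((3, 3), "B"), ((15, 15), "W"), ((3, 15), "B")]
order_2 = [((3, 15), "B"), ((15, 15), "W"), ((3, 3), "B")]
same_state = position_key(order_1) == position_key(order_2)
print("transposition detected:", same_state)
```

Merging transpositions shrinks the tree to a graph over positions, but the count of positions alone is already far beyond anything enumerable.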
Monte Carlo Tree Search fundamentals: UCB/PUCT, visits, Q-values, and exploration
Eric introduces MCTS as an incremental tree-building procedure guided by bandit-inspired action selection (UCB) and AlphaGo’s PUCT variant. They unpack the node statistics (visit counts, mean action values) and why exploration bonuses matter when estimates are uncertain.
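As a sketch of the selection rule, the following scores each action by Q plus an exploration bonus proportional to the prior and shrinking with visits, in the form commonly given for AlphaGo Zero-style PUCT. The `Edge` class, the `c_puct` value of 1.5, and the `max(1, total)` guard for an unvisited parent are assumptions for this example, not details confirmed by the episode.

```python
import math
from dataclasses import dataclass
from typing import Dict


@dataclass
class Edge:
    prior: float            # P(s, a), e.g. from a policy network
    visits: int = 0         # N(s, a)
    value_sum: float = 0.0  # sum of backed-up values through this edge

    @property
    def q(self) -> float:
        """Mean action value; 0 for an unvisited edge."""
        return self.value_sum / self.visits if self.visits else 0.0


def puct_select(edges: Dict[str, Edge], c_puct: float = 1.5) -> str:
    """Return the action maximizing Q + c_puct * P * sqrt(sum N) / (1 + N)."""
    total = sum(e.visits for e in edges.values())
    sqrt_total = math.sqrt(max(1, total))  # guard so priors matter at N = 0

    def score(action: str) -> float:
        e = edges[action]
        return e.q + c_puct * e.prior * sqrt_total / (1 + e.visits)

    return max(edges, key=score)


edges = {
    "a": Edge(prior=0.6, visits=10, value_sum=2.0),  # well explored, Q = 0.2
    "b": Edge(prior=0.4, visits=0),                  # never tried
}
print(puct_select(edges))
```

Here the unvisited action wins despite its lower prior, because its exploration bonus has not yet been divided down by visits.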
From terminal outcomes to backups: how rollouts become a value estimate
They describe how MCTS assigns values to leaves (win/loss at terminal states) and propagates them upward via the backup step, producing Q estimates at earlier decisions. This clarifies how sparse, selective expansion can still yield useful guidance without building the full tree.
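The backup step can be sketched as a walk from the leaf back to the root that updates visit counts and running value sums. One assumption to flag: this sketch stores each node's value from the perspective of the player to move at that node, so the sign flips at every level. Other implementations store values per edge or from a fixed player's perspective.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Node:
    visits: int = 0
    value_sum: float = 0.0

    @property
    def q(self) -> float:
        return self.value_sum / self.visits if self.visits else 0.0


def backup(path: List[Node], leaf_value: float) -> None:
    """Propagate a leaf evaluation up the visited path.

    `path` runs from root to leaf; `leaf_value` is from the perspective of
    the player to move at the leaf (+1 win, -1 loss).
    """
    value = leaf_value
    for node in reversed(path):
        node.visits += 1
        node.value_sum += value
        value = -value  # the parent is the opponent's turn


root, mid, leaf = Node(), Node(), Node()
backup([root, mid, leaf], leaf_value=1.0)
print(root.q, mid.q, leaf.q)
```

Repeating this over many simulations turns sparse terminal outcomes into averaged Q estimates at nodes near the root.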
Neural networks in AlphaGo: policy + value heads, architectures, and inductive bias
Eric introduces the two-network view (policy distribution and value prediction), then explains the modern merged “trunk + two heads” approach. They discuss architecture trade-offs (ResNets vs transformers), why local convolutional bias helps on limited data, and KataGo’s global feature tricks.
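The "trunk + two heads" shape can be illustrated without any ML framework. This is a toy sketch only: a single dense ReLU layer stands in for the ResNet trunk, weights are random and untrained, and the class name `TwoHeadNet` and its sizes are assumptions for this example.

```python
import math
import random
from typing import List, Tuple


def matvec(weights: List[List[float]], x: List[float]) -> List[float]:
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


def softmax(z: List[float]) -> List[float]:
    m = max(z)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


class TwoHeadNet:
    """Shared trunk feeding a policy head (softmax) and a value head (tanh)."""

    def __init__(self, n_inputs: int, n_hidden: int, n_moves: int, seed: int = 0):
        rng = random.Random(seed)

        def init(rows: int, cols: int) -> List[List[float]]:
            return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]

        self.trunk = init(n_hidden, n_inputs)
        self.policy_head = init(n_moves, n_hidden)
        self.value_head = init(1, n_hidden)

    def forward(self, x: List[float]) -> Tuple[List[float], float]:
        hidden = [max(0.0, v) for v in matvec(self.trunk, x)]  # shared features
        policy = softmax(matvec(self.policy_head, hidden))     # move distribution
        value = math.tanh(matvec(self.value_head, hidden)[0])  # in [-1, 1]
        return policy, value


net = TwoHeadNet(n_inputs=9, n_hidden=16, n_moves=10)  # 3x3 board + pass
policy, value = net.forward([1.0] * 9)
```

Sharing the trunk means both heads learn from the same board features, so one forward pass serves both the prior and the leaf evaluation during search.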
Getting a working baseline: supervised learning from expert games and sanity checks
They recommend starting with expert data (AlphaGo Lee style) to ensure the rules, training, and inference loop work before attempting tabula rasa self-play. Eric notes that a small network trained to imitate expert moves can already be surprisingly strong, serving as a crucial debugging and validation milestone.
Neural-MCTS in practice: selection → expansion → evaluation → backup (and AlphaGo Lee’s rollouts)
Eric gives a concrete walkthrough of the four-step MCTS loop using network priors and value estimates. He explains AlphaGo Lee’s additional rollout-to-end blending (policy self-play rollouts) and why later systems dropped it for speed once value learning improved.
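The four steps can be put together in a compact sketch. Several things here are assumptions made to keep it runnable: a toy take-1-or-2 stone game stands in for Go, `evaluate` is a placeholder that returns uniform priors and a neutral value instead of a trained network, node values are stored from the perspective of the player to move, and there are no rollouts (as in the later systems).

```python
import math


class Node:
    def __init__(self, prior):
        self.prior = prior
        self.visits = 0
        self.value_sum = 0.0  # perspective: player to move at this node
        self.children = {}    # action -> Node

    def q(self):
        return self.value_sum / self.visits if self.visits else 0.0


# Toy game: a pile of stones, take 1 or 2, taking the last stone wins.
def legal_actions(state):
    return [a for a in (1, 2) if a <= state]


def evaluate(state):
    """Placeholder for the network: uniform priors, neutral value."""
    actions = legal_actions(state)
    return {a: 1.0 / len(actions) for a in actions}, 0.0


def select_child(node, c_puct=1.5):
    sqrt_total = math.sqrt(max(1, node.visits))

    def score(item):
        _, child = item
        # child.q() is from the opponent's perspective, so negate it.
        return -child.q() + c_puct * child.prior * sqrt_total / (1 + child.visits)

    return max(node.children.items(), key=score)


def run_mcts(root_state, num_simulations=200):
    root = Node(prior=1.0)
    for _ in range(num_simulations):
        node, state, path = root, root_state, [root]
        # 1. Selection: descend with PUCT until reaching an unexpanded node.
        while node.children:
            action, node = select_child(node)
            state -= action
            path.append(node)
        if state == 0:
            # Terminal: the previous player took the last stone and won.
            value = -1.0
        else:
            # 2. Expansion and 3. evaluation of the new leaf.
            priors, value = evaluate(state)
            for a, p in priors.items():
                node.children[a] = Node(prior=p)
        # 4. Backup: flip the sign at each level going up.
        for n in reversed(path):
            n.visits += 1
            n.value_sum += value
            value = -value
    return {a: child.visits for a, child in root.children.items()}


visits = run_mcts(4)
print(visits)
```

From a pile of 4, taking 1 leaves the opponent a losing pile of 3, and the root visit counts concentrate on that move even though the placeholder evaluator knows nothing about the game.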
Self-play as policy improvement: distilling MCTS into the network (test-time compute → training)
They explain the core AlphaGo Zero/AlphaZero loop: MCTS produces a stronger policy than the raw network, and the network is trained to imitate the MCTS distribution so future searches start from a better prior. This is framed as amortizing test-time search into the forward pass, enabling better performance for the same simulation budget.
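A sketch of how search output becomes a training target: root visit counts are normalized into a distribution, and the network's policy is trained toward it with cross-entropy. The function names, the temperature parameter, and the example numbers are assumptions for illustration.

```python
import math
from typing import Dict


def visit_distribution(visits: Dict[str, int], temperature: float = 1.0) -> Dict[str, float]:
    """Turn root visit counts into a policy target; lower temperature sharpens it."""
    weights = {a: n ** (1.0 / temperature) for a, n in visits.items()}
    total = sum(weights.values())
    return {a: w / total for a, w in weights.items()}


def cross_entropy(target: Dict[str, float], predicted: Dict[str, float], eps: float = 1e-12) -> float:
    """Loss that is minimized when the network's policy matches the target."""
    return -sum(p * math.log(predicted[a] + eps) for a, p in target.items())


search_visits = {"a": 30, "b": 10}          # hypothetical MCTS result
target = visit_distribution(search_visits)  # {"a": 0.75, "b": 0.25}
raw_policy = {"a": 0.5, "b": 0.5}           # hypothetical network output
print(cross_entropy(target, raw_policy))
```

Each gradient step on this loss moves the raw network toward what search found, so the next round of search starts from a better prior for the same simulation budget.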
When MCTS doesn’t help: value drift, resignation bias, and low-simulation variance
Eric clarifies there’s no guarantee MCTS improves the policy if value estimates are wrong or the search is too shallow. He gives practical failure modes—like training data missing late-game states due to resignations—causing incorrect backups and degraded policy improvement.
Alternatives to MCTS: model-free RL, neural fictitious self-play, and why credit assignment is hard
They compare AlphaGo-style improvement (better labels per state via search) to model-free RL approaches that reinforce entire winning trajectories, which suffer from severe credit assignment and variance problems. Eric introduces neural fictitious self-play: train best responses against fixed opponents and distill them into a robust mixed strategy, as used in complex games like StarCraft/Dota.
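The contrast between the two update signals can be made concrete for a single state with a softmax policy. In this sketch (names and numbers are illustrative assumptions), REINFORCE scales the gradient for whichever action was taken by the final game outcome, while a search-derived target gives a per-state correction regardless of how the game ended.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def reinforce_grad(logits: List[float], action: int, outcome: float) -> List[float]:
    """Ascent direction on logits for outcome * log pi(action)."""
    pi = softmax(logits)
    return [outcome * ((1.0 if i == action else 0.0) - p) for i, p in enumerate(pi)]


def search_target_grad(logits: List[float], target: List[float]) -> List[float]:
    """Descent direction on logits for cross-entropy against a search target."""
    pi = softmax(logits)
    return [t - p for t, p in zip(target, pi)]


logits = [0.0, 0.0]  # two moves, currently equally likely
# Move 1 was a blunder, but the game was still won (outcome = +1).
print(reinforce_grad(logits, action=1, outcome=1.0))
# Search at this state preferred move 0.
print(search_target_grad(logits, target=[0.9, 0.1]))
```

The first update raises the probability of the blunder because the trajectory happened to win; the second pushes toward the better move at that specific state, which is the sense in which search-based labels sidestep credit assignment.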
Why MCTS doesn’t transfer cleanly to LLMs: massive branching, no revisits, and weak heuristics
They address why PUCT-style tree search is awkward for language: the effective action space is enormous and repeated sampling of the same ‘child’ is rare, breaking the visit-count exploration logic. They note that LLMs often exhibit implicit backtracking behaviors, and that forward-search ideas may return in different forms, but the Go-style recipe doesn’t plug in directly.
Off-policy training and replay buffers: when it helps, when it harms (DAgger intuition)
They discuss why replay buffers—often criticized as “off-policy”—can still work in AlphaGo-like settings if the buffer covers states near the policy’s typical trajectories and includes corrective actions. Eric describes experiments relabeling random stored states with MCTS (like offline planning/daydreaming) and relates this to classic off-policy RL pipelines, while emphasizing stability advantages of more on-policy training.
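A minimal replay buffer with a relabeling hook, as a sketch of the setup described. The class, its methods, and the `search_fn` callback are hypothetical names for this example; in a real pipeline `search_fn` would run MCTS from the stored state with the current network.

```python
import random
from collections import deque
from typing import Callable, Dict, List, Tuple

Example = Tuple[str, Dict[str, float], float]  # (state, policy target, outcome)


class ReplayBuffer:
    def __init__(self, capacity: int, seed: int = 0):
        self.items = deque(maxlen=capacity)  # oldest examples fall off the front
        self.rng = random.Random(seed)

    def add(self, state: str, policy_target: Dict[str, float], outcome: float) -> None:
        self.items.append((state, policy_target, outcome))

    def sample(self, batch_size: int) -> List[Example]:
        """Uniformly sample a training batch without replacement."""
        return self.rng.sample(list(self.items), min(batch_size, len(self.items)))

    def relabel(self, index: int, search_fn: Callable[[str], Dict[str, float]]) -> None:
        """Replace a stale policy target with a fresh one from `search_fn`."""
        state, _, outcome = self.items[index]
        self.items[index] = (state, search_fn(state), outcome)


buf = ReplayBuffer(capacity=3)
for i in range(5):
    buf.add(f"state-{i}", {"pass": 1.0}, outcome=1.0)
buf.relabel(0, lambda state: {"a": 0.75, "b": 0.25})  # stand-in for MCTS
batch = buf.sample(2)
```

A small capacity keeps the data close to on-policy; a larger one reuses more old games, which only helps if the stored states are still ones the current policy would plausibly reach.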
RL information inefficiency and the power of soft targets: bits per sample, pass rate, distillation
Dwarkesh and Eric explore why naive RL can be dramatically less information-efficient than supervised learning, especially at low success rates and long horizons. Eric connects this to AlphaGo’s choice to train on the full MCTS distribution (soft labels), which carries more information than a single winning action, and to why distillation can be so effective.
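One way to make the "bits per sample" framing concrete, offered as an assumption about how the argument can be formalized rather than as the episode's exact derivation: a pass/fail outcome with pass rate p carries at most the binary entropy H(p) bits per episode, while a one-hot supervised label over n actions can carry up to log2(n) bits per move. The choice of 362 actions (361 points plus pass) is also an assumption.

```python
import math


def binary_entropy(p: float) -> float:
    """Upper bound, in bits, on what one pass/fail outcome can convey."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))


def max_bits_one_hot(num_actions: int) -> float:
    """Upper bound, in bits, on what one supervised move label can convey."""
    return math.log2(num_actions)


for pass_rate in (0.5, 0.1, 0.01):
    print(f"pass rate {pass_rate}: {binary_entropy(pass_rate):.4f} bits per episode")
print(f"one-hot move label: {max_bits_one_hot(362):.2f} bits per move")
```

The per-episode bound shrinks quickly as the pass rate drops, and it is spread across every decision in the trajectory, whereas a per-move label (or a full soft distribution from search) delivers its information at each step.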
Automated AI researchers: what LLM agents help with vs where they still fail
Eric shares lessons using frontier LLMs as research assistants during the AlphaGo rebuild: they excel at hyperparameter and code-level optimization, running experiments, and producing reports. They remain weaker at higher-level experimental direction—knowing when to abandon a track, diagnosing whether failures are conceptual vs bug-driven, and doing “lateral thinking” across research branches.
