Noam Brown: AI vs Humans in Poker and Games of Strategic Negotiation | Lex Fridman Podcast #344

Noam Brown is a research scientist at FAIR, Meta AI, co-creator of AI that achieved superhuman level performance in games of No-Limit Texas Hold'em and Diplomacy.

Please support this podcast by checking out our sponsors:
- True Classic Tees: https://trueclassictees.com/lex and use code LEX to get 25% off
- Audible: https://audible.com/lex to get 30-day free trial
- InsideTracker: https://insidetracker.com/lex to get 20% off
- ExpressVPN: https://expressvpn.com/lexpod to get 3 months free

EPISODE LINKS:
Noam's Twitter: https://twitter.com/polynoamial
Noam's LinkedIn: https://www.linkedin.com/in/noam-brown-8b785b62/
webDiplomacy: https://webdiplomacy.net/

Noam's papers:
Superhuman AI for multiplayer poker: https://par.nsf.gov/servlets/purl/10119653
Superhuman AI for heads-up no-limit poker: https://par.nsf.gov/servlets/purl/10077416
Human-level play in the game of Diplomacy: https://www.science.org/doi/10.1126/science.ade9097

PODCAST INFO:
Podcast website: https://lexfridman.com/podcast
Apple Podcasts: https://apple.co/2lwqZIr
Spotify: https://spoti.fi/2nEwCF8
RSS: https://lexfridman.com/feed/podcast/
Full episodes playlist: https://www.youtube.com/playlist?list=PLrAXtmErZgOdP_8GztsuKi9nrraNbKKp4
Clips playlist: https://www.youtube.com/playlist?list=PLrAXtmErZgOeciFP3CBCIEElOJeitOr41

OUTLINE:
0:00 - Introduction
1:09 - No Limit Texas Hold 'em
5:02 - Solving poker
18:12 - Poker vs Chess
24:50 - AI playing poker
58:18 - Heads-up vs Multi-way poker
1:09:08 - Greatest poker player of all time
1:12:42 - Diplomacy game
1:22:33 - AI negotiating with humans
2:04:58 - AI in geopolitics
2:09:43 - Human-like AI for games
2:15:44 - Ethics of AI
2:19:57 - AGI
2:23:57 - Advice to beginners

SOCIAL:
- Twitter: https://twitter.com/lexfridman
- LinkedIn: https://www.linkedin.com/in/lexfridman
- Facebook: https://www.facebook.com/lexfridman
- Instagram: https://www.instagram.com/lexfridman
- Medium: https://medium.com/@lexfridman
- Reddit: https://reddit.com/r/lexfridman
- Support on Patreon: https://www.patreon.com/lexfridman

Noam Brown (guest), Lex Fridman (host)

Dec 6, 2022 · 2h 29m

CHAPTERS

No-limit Texas Hold’em basics: betting freedom, psychology, and “jumpy” money
Noam explains what makes no-limit Texas Hold’em the dominant poker variant and why the ability to bet any amount changes the mental game. Lex and Noam discuss how large stakes create discomfort, risk aversion, and opportunities to pressure opponents with aggressive sizing.
- •No-limit vs limit: any bet size allowed, so pots can escalate rapidly
- •“Jumpy” play: when money becomes life-impacting, decision-making becomes risk-averse
- •Big bets can strategically force opponents into uncomfortable, error-prone spots
- •Poker AI vs “reading souls”: tension between game theory and psychological narratives
Nash equilibrium and “not losing in expectation”: why poker can be ‘solved’ (in heads-up)
The conversation formalizes what it means to play optimally in finite two-player zero-sum games. Noam introduces Nash equilibrium with rock-paper-scissors intuition and clarifies the meaning of ‘in expectation’ in high-variance games like poker.
- •Definition: optimal strategies exist for finite two-player zero-sum games
- •Rock-paper-scissors as intuition for mixed strategies and equilibrium
- •‘In expectation’ vs single-hand outcomes; variance and long-run guarantees
- •Limits of Nash equilibrium guarantees outside two-player zero-sum settings
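The rock-paper-scissors intuition can be checked numerically. The sketch below is illustrative only and not taken from the episode: the payoff matrix, the `rock_heavy` opponent, and the helper names are assumptions chosen for the example. It shows that the uniform mix earns zero in expectation against any opponent and cannot be exploited, while a lopsided mix can.

```python
from fractions import Fraction

# Payoff to the row player; rows and columns are rock, paper, scissors.
PAYOFF = [
    [0, -1, 1],
    [1, 0, -1],
    [-1, 1, 0],
]


def expected_payoff(row_strategy, col_strategy):
    """Expected payoff to the row player when both sides play mixed strategies."""
    return sum(
        row_strategy[i] * col_strategy[j] * PAYOFF[i][j]
        for i in range(3)
        for j in range(3)
    )


def best_response_value(col_strategy):
    """Best expected payoff any pure action achieves against a fixed opponent mix."""
    return max(
        sum(col_strategy[j] * PAYOFF[i][j] for j in range(3))
        for i in range(3)
    )


uniform = [Fraction(1, 3)] * 3
# Hypothetical opponent who throws rock half of the time.
rock_heavy = [Fraction(1, 2), Fraction(1, 4), Fraction(1, 4)]

print(expected_payoff(uniform, rock_heavy))  # 0: uniform does not lose in expectation
print(best_response_value(uniform))          # 0: nothing exploits the uniform mix
print(best_response_value(rock_heavy))       # 1/4: always playing paper exploits it
```

Note that the uniform mix also fails to profit from the rock-heavy opponent, which is the trade-off behind the equilibrium-versus-exploitative debate discussed later in the episode.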
Counterfactual Regret Minimization (CFR): self-play as a principled route to equilibrium
Noam describes CFR as the core learning mechanism: simulate play, compute counterfactuals, accumulate regret, and shift probabilities toward higher-regret actions. Lex connects this to how humans analyze ‘what if’ alternatives after hands.
- •CFR loop: random start → counterfactual evaluation → regret updates → convergence
- •Regret as a signal for how much better another action would have been
- •Self-play is not inherently neural-network-based; it’s a general RL idea
- •Human learning parallels: post-hand questioning and counterfactual reasoning
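The regret-update loop can be illustrated with regret matching, the per-decision update that CFR builds on, applied to rock-paper-scissors self-play. This is a minimal sketch under stated assumptions, not the poker implementation: it covers a single one-shot decision rather than a game tree, it uses exact expected values instead of sampled play, and the seeded `initial_regrets` (used only so the two players do not start at equilibrium) are hypothetical.

```python
# Payoff to the player choosing the row action; the game is symmetric,
# so the same matrix serves both players.
PAYOFF = [
    [0, -1, 1],
    [1, 0, -1],
    [-1, 1, 0],
]


def strategy_from_regrets(regrets):
    """Play each action in proportion to its positive cumulative regret."""
    positive = [max(r, 0.0) for r in regrets]
    total = sum(positive)
    if total > 0:
        return [p / total for p in positive]
    return [1.0 / len(regrets)] * len(regrets)


def train(iterations, initial_regrets=(1.0, 0.0, 0.0)):
    """Run regret-matching self-play and return both players' average strategies."""
    regrets = [list(initial_regrets), [0.0, 0.0, 0.0]]
    strategy_sums = [[0.0] * 3, [0.0] * 3]
    for _ in range(iterations):
        strategies = [strategy_from_regrets(r) for r in regrets]
        for player in range(2):
            opponent = strategies[1 - player]
            # Counterfactual question: how would each action have done here?
            action_values = [
                sum(opponent[j] * PAYOFF[a][j] for j in range(3))
                for a in range(3)
            ]
            expected = sum(
                strategies[player][a] * action_values[a] for a in range(3)
            )
            for a in range(3):
                # Regret: how much better action a would have been.
                regrets[player][a] += action_values[a] - expected
                strategy_sums[player][a] += strategies[player][a]
    return [[s / iterations for s in sums] for sums in strategy_sums]


average_strategies = train(100_000)
print(average_strategies[0])  # each probability should be close to 1/3
```

The current strategies may keep cycling; it is the average strategy over all iterations that approaches the equilibrium.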
Poker vs chess/go: imperfect information, probabilities, ranges, and explicit theory of mind
Noam argues poker’s hidden information forces reasoning about action probabilities and balance, not just best moves. They unpack ranges, bluff frequencies, and the need to model beliefs and ‘what you think I think’ recursively.
- •Imperfect information makes mixed strategies essential (probabilities matter)
- •Bluff value depends on frequency and reputation; balance is key
- •Ranges and unpredictability: act as if multiple hands could take the same line
- •Successful bots use explicit belief/theory-of-mind reasoning
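Why bluff frequency matters can be seen in a toy river model, which is an assumption made for this example and not a model from the episode: the bettor holds either a value hand that always wins or a bluff that always loses, and the caller must pay `bet` to win `pot + bet`. The caller is indifferent when the bluffing fraction equals `bet / (pot + 2 * bet)`.

```python
from fractions import Fraction


def call_ev(pot, bet, bluff_fraction):
    """Caller's expected value of calling in the toy model.

    The caller wins pot + bet against a bluff and loses bet against value.
    """
    return bluff_fraction * (pot + bet) - (1 - bluff_fraction) * bet


def indifference_bluff_fraction(pot, bet):
    """Bluffing fraction that makes calling and folding equally good.

    pot and bet are expected to be integers (for example, chip counts).
    """
    return Fraction(bet, pot + 2 * bet)


pot_sized = indifference_bluff_fraction(pot=1, bet=1)
print(pot_sized)                            # 1/3
print(call_ev(1, 1, pot_sized))             # 0: caller is indifferent
print(call_ev(1, 1, Fraction(1, 2)))        # 1/2: over-bluffing makes calling profitable
print(indifference_bluff_fraction(1, 10))   # 10/21 for a bet of ten times the pot
```

In this simplified setting, bluffing more often than the indifference point makes calling profitable, and bluffing less often makes folding correct, which is one way to read the claim that balance is key.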
GTO vs exploitative poker: why equilibrium play ‘crushed’ top pros
The discussion turns to the long-standing debate between game-theory-optimal (GTO) and exploitative styles. Noam recounts how Libratus’ equilibrium-approximation strategy beat elite heads-up players without adapting to them, reshaping community beliefs.
- •Historical debate: exploitative reads vs equilibrium-based GTO
- •Libratus’ strategy: prioritize balance; be robust even if fully ‘figured out’
- •Match outcome: ~120,000 hands; large cumulative edge over top professionals
- •GTO isn’t ‘predictable’—it is intentionally mixed/unpredictable
From precomputed strategy to real-time search: the algorithmic leap behind Libratus
Noam explains why earlier approaches failed: relying on huge precomputed abstractions without real-time planning. The key improvement was adding search during play—analogous to how humans think longer in tough spots—yielding enormous performance gains.
- •2015 loss revealed a gap: humans search/plan in real time; bot acted instantly
- •2017 approach: search-based refinement on top of offline computation
- •Search in poker must consider probabilities across possible private hands
- •Small amounts of search can outperform massive increases in precomputed policy size
Libratus match logistics and engineering: stress, compute scale, and optimizations
Noam details the setup of the 20-day match, including the prize incentives and how the human players collaborated to find weaknesses in the bot. He also shares implementation realities: C++ performance engineering, parallelism, memory scale, and constant uncertainty about the true ‘bar’ for beating humans.
- •Match design: 4 pros, 20 days, 120k hands, prize money incentives
- •Humans teamed up, compared notes, and hunted weaknesses
- •Engineering: C++, heavy parallelization, large CPU counts and memory
- •No clear benchmark; had to overbuild strength and optimize system throughput
Unexpected strategy shift: over-bets and how AI changed high-level poker play
A striking emergent behavior was Libratus’ use of huge over-bets (sometimes many times the pot), which created novel decision pressure for humans. The community later adopted over-bets widely, illustrating how AI can reshape strategic norms.
- •Humans typically size bets relative to pot; Libratus sometimes bet ~10x pot
- •Over-bets polarize ranges (‘nuts or bluff’) and force extremely hard decisions
- •Humans spent minutes deliberating; AI’s behavior proved strategically sound
- •AI-driven discovery influenced modern high-level poker strategy
From heads-up to 6-player poker: why leaving the two-player zero-sum setting breaks guarantees, yet equilibrium methods still work
Moving to multiway poker removes the clean two-player zero-sum guarantees and introduces equilibrium-selection and coordination issues. Noam argues poker remains sufficiently adversarial that equilibrium-approximation techniques still perform strongly in practice.
- •6-player poker isn’t two-player zero-sum; Nash guarantees no longer apply
- •Multiple equilibria and coordination problems complicate theory
- •Poker’s anti-collusion, adversarial structure makes equilibrium methods effective
- •Distinction between provable guarantees vs strong empirical performance
Pluribus and depth-limited search: massive cost reduction through algorithmic innovation
Noam highlights the most surprising Pluribus result: it was dramatically cheaper than Libratus due to depth-limited search. The chapter emphasizes how algorithmic advances can dwarf hardware trends, enabling laptop-scale training for strong strategies.
- •Depth-limited search scales better than searching to endgame every time
- •Training cost dropped from roughly $100k-scale runs for Libratus to roughly $150-scale runs for Pluribus
- •Algorithmic improvements—not hardware price drops—drove the difference
- •General lesson: smart search approximations unlock scalability
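The basic shape of a depth-limited search can be sketched in the simpler perfect-information setting, as minimax with a depth cutoff. This is an analogy only and an assumption of this example: the poker version described in the episode must additionally reason over hidden hands, which this sketch does not attempt. The toy tree and the constant `estimate` function are hypothetical stand-ins for a real game and a real value estimate.

```python
def depth_limited_value(node, depth, maximizing, estimate):
    """Minimax value of a node, replacing deep subtrees with an estimate.

    A node is either a number (terminal payoff to the maximizer) or a list
    of child nodes. When the depth budget runs out at a non-terminal node,
    the search stops and trusts estimate(node) instead of searching on.
    """
    if isinstance(node, (int, float)):
        return node
    if depth == 0:
        return estimate(node)
    child_values = [
        depth_limited_value(child, depth - 1, not maximizing, estimate)
        for child in node
    ]
    return max(child_values) if maximizing else min(child_values)


# Hypothetical toy tree: the maximizer moves first, then the minimizer.
tree = [[3, 5], [2, [9, 1]]]


def estimate(node):
    """Stand-in value estimate: treat every unexplored subtree as worth 0."""
    return 0.0


print(depth_limited_value(tree, 10, True, estimate))  # 3: full search
print(depth_limited_value(tree, 2, True, estimate))   # 3: cutoff still finds it
print(depth_limited_value(tree, 1, True, estimate))   # 0.0: too shallow, all estimates
```

The cost of the search is governed by the depth budget, and the quality of the answer by how good the estimate at the cutoff is.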
Neural nets (mostly) weren’t the key in poker: why beliefs and balance were the real challenge
Contrary to popular expectation, Libratus and Pluribus didn’t rely on neural networks. Noam explains that unlike Go (where feature/value learning was hard), poker’s core challenge was computing balanced strategies under hidden information; modern systems use nets differently via belief-conditioned value functions.
- •Libratus/Pluribus: no neural nets; relied on game-theoretic algorithms + search
- •Go benefited from neural nets for evaluation; poker’s bottleneck was balance/mixing
- •Modern poker value functions take beliefs as input (what each player thinks)
- •Performance evaluation in poker is noisy due to variance; hard to rank ‘GOATs’
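The notion of a belief can be made concrete with a Bayesian update over an opponent's possible hands after observing an action. The sketch below is illustrative: the three hand buckets, the prior, and the betting probabilities are invented for the example, and real systems track beliefs over far larger hand spaces.

```python
from fractions import Fraction


def update_belief(prior, action_prob_given_hand):
    """Posterior over hands after seeing an action, via Bayes' rule.

    prior maps hand -> probability; action_prob_given_hand maps hand -> the
    probability that the opponent's strategy takes the observed action.
    """
    unnormalized = {
        hand: prior[hand] * action_prob_given_hand[hand] for hand in prior
    }
    total = sum(unnormalized.values())
    if total == 0:
        raise ValueError("observed action has zero probability under this strategy")
    return {hand: weight / total for hand, weight in unnormalized.items()}


# Hypothetical numbers: the opponent always bets strong hands, never bets
# medium hands, and bluffs with weak hands one third of the time.
prior = {"strong": Fraction(1, 5), "medium": Fraction(1, 2), "weak": Fraction(3, 10)}
bet_probability = {"strong": Fraction(1), "medium": Fraction(0), "weak": Fraction(1, 3)}

posterior = update_belief(prior, bet_probability)
print(posterior["strong"], posterior["medium"], posterior["weak"])  # 2/3 0 1/3
```

A belief-conditioned value function would take a distribution like `posterior` as part of its input, rather than a single known state.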
Diplomacy rules and feel: a 7-player alliance-and-backstab game built on private talk
The podcast shifts to Diplomacy: a strategy board game set in Europe on the eve of World War I, where negotiation is central and commitments are non-binding. Noam explains simultaneous moves, support mechanics, role-playing flavor, and why the game is ‘about people rather than pieces.’
- •Seven powers on a Europe map; success depends on alliances and coordination
- •Private, unstructured natural-language negotiation is the primary ‘action space’
- •Simultaneous resolution enables betrayal; promises aren’t enforceable
- •Win conditions: solo majority is rare; draws and scoring systems are common
Cicero’s core idea: intent-conditioned dialogue + RL planning, with safety/quality filters
Noam outlines how Cicero ties strategy to language: a planner/RL system computes intents (desired moves and requests), and a dialogue model generates messages aligned to those intents. Because language models can go off the rails, the system uses value-based filtering, coherence checks, and mechanisms to avoid strategically harmful messages (including excessive lying).
- •Separate modules: strategic planning/RL produces intents; LM turns intent → message
- •Filters evaluate likely downstream impact of a message (expected-value reasoning)
- •Avoiding unforced ‘reveals’ (e.g., announcing an attack) via response modeling
- •Pragmatic finding: lying often reduces long-run performance by destroying trust
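The division of labor described above (planner produces an intent, a language model proposes messages, a filter scores them) can be sketched as a small pipeline. Everything here is a hypothetical stand-in and not Cicero's actual interface: the `Intent` fields, `toy_generate`, `toy_value`, and the scoring rule are assumptions made so the example runs on its own.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass(frozen=True)
class Intent:
    """Hypothetical intent: our planned move and the move we want from a partner."""
    own_move: str
    requested_move: str


def choose_message(
    intent: Intent,
    generate_candidates: Callable[[Intent], List[str]],
    estimate_value: Callable[[Intent, str], float],
    baseline_value: float,
) -> Optional[str]:
    """Return the best candidate that is at least as good as staying silent."""
    scored = [(estimate_value(intent, m), m) for m in generate_candidates(intent)]
    acceptable = [(value, m) for value, m in scored if value >= baseline_value]
    if not acceptable:
        return None
    return max(acceptable, key=lambda pair: pair[0])[1]


def toy_generate(intent: Intent) -> List[str]:
    """Stand-in for a dialogue model conditioned on the intent."""
    return [
        f"Could you {intent.requested_move} this turn?",
        f"I will {intent.own_move}, so please {intent.requested_move}.",
        "Nice weather in Vienna.",
    ]


def toy_value(intent: Intent, message: str) -> float:
    """Stand-in value estimate: reward the request, penalize revealing our move."""
    value = 0.0
    if intent.requested_move in message:
        value += 1.0
    if intent.own_move in message:
        value -= 2.0
    return value


intent = Intent(own_move="attack Venice", requested_move="support Trieste")
print(choose_message(intent, toy_generate, toy_value, baseline_value=0.0))
```

In this toy scoring, the message that announces the attack is filtered out because its estimated value falls below the baseline of saying nothing.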
Why self-play alone fails in cooperative multi-agent worlds: human anchoring and irrationality
The discussion explains why pure self-play can produce conventions that are incompatible with humans, much like learning to drive on the left in a country where everyone drives on the right, and why humans may punish rational-but-unfair behavior. Cicero uses supervised ‘anchor’ policies learned from human games and regularizes self-play toward human-likeness to remain a viable partner.
- •Self-play from scratch can learn ‘robot conventions’ that humans won’t follow
- •Human reactions (anger, fairness norms) can dominate outcomes over rational payoffs
- •Anchor policy: supervised model of human play; self-play is regularized toward it
- •Example failure mode: grabbing extra centers while ‘helping’ against a leader triggers retaliation
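One common way to regularize a policy toward an anchor is a KL penalty, and the sketch below assumes that form; the summary above only says self-play is "regularized toward" the anchor, so the exact objective is an assumption of this example. Maximizing expected value minus `lam` times the KL divergence from the anchor gives a policy proportional to `anchor(a) * exp(value(a) / lam)`. The action names and numbers are hypothetical.

```python
import math


def regularized_policy(anchor, action_values, lam):
    """Policy maximizing expected value minus lam * KL(policy || anchor).

    The closed form is policy(a) proportional to anchor(a) * exp(value(a) / lam).
    Large lam stays close to the human-like anchor; small lam approaches the
    greedy choice among actions the anchor gives nonzero probability.
    """
    if lam <= 0:
        raise ValueError("lam must be positive")
    # Subtract the maximum value before exponentiating for numerical stability.
    max_value = max(action_values.values())
    weights = {
        action: anchor[action] * math.exp((action_values[action] - max_value) / lam)
        for action in anchor
    }
    total = sum(weights.values())
    return {action: weight / total for action, weight in weights.items()}


# Hypothetical anchor (imitation of human play) and estimated action values.
anchor = {"hold": 0.7, "support_ally": 0.25, "grab_center": 0.05}
values = {"hold": 0.5, "support_ally": 0.6, "grab_center": 1.0}

print(regularized_policy(anchor, values, lam=100.0))  # close to the anchor
print(regularized_policy(anchor, values, lam=0.01))   # almost all on grab_center
```

The parameter `lam` sets the trade-off: a human-compatible partner needs it large enough that moves humans would find alien or provocative stay unlikely, even when their estimated value is high.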
Beyond games: human-like AI, cheating risks, ethics of deception, and paths toward AGI
They zoom out to implications: Diplomacy as a bridge toward real-world negotiation, the difficulty of deploying RL without clear rewards, and the promise of intent-conditioned NPCs. The conversation closes with concerns about cheating detection, deception ethics, anti-AI bias, data efficiency as a bottleneck for AGI, and advice for newcomers to ML.
- •Transfer to NPCs and interactive agents: intent-conditioned language as a general pattern
- •Human-like bots complicate cheat detection in chess/poker and online play
- •Ethics: deception, ‘white lies,’ trust-building, and governance of persuasive systems
- •AGI hurdles: data efficiency, general planning, and leveraging broad prior knowledge
- •Career advice: build math/CS/stat foundations; seek distinctive perspectives
