How a reasoning model cracked an 80-year-old math problem — the OpenAI Podcast Ep. 20

Last month AI found something mathematicians had missed for decades. Reasoning researchers Alexander Wei, Hongxun Wu, and Lijie Chen join the podcast to discuss how a general-purpose model helped disprove an 80-year-old conjecture from famed mathematician Paul Erdős. They walk through the moment the result started looking real, what it took to verify the proof, and what’s happened since sharing the discovery with the world. They also explore what this means for the future of math and for researchers learning to work with AI.

Chapters
0:44 AI and the International Math Olympiad and International Olympiad in Informatics
6:35 An OpenAI model disproves the Erdős unit distance conjecture
8:33 Running the model and checking the proof
11:04 Why general models matter for discovery
15:55 Creativity, tools, and how the proof worked
18:25 Why AI should feel empowering for mathematicians
22:31 Advice for researchers using AI
27:24 What comes next for math and AI research
37:30 Cryptography, quantum computing, and the future

Andrew Mayne (host), Lijie Chen (guest), Hongxun Wu (guest), Alexander Wei (guest)

Jun 4, 2026, 41m

CHAPTERS

0:00 – 0:37
Meet the reasoning research team and how the breakthrough felt in the moment
Andrew Mayne introduces Alexander Wei, Hongxun Wu, and Lijie Chen from OpenAI’s reasoning research team. They describe the shock, excitement, and skepticism that followed the model’s unexpected math result.
- Introductions to the guests and their roles on the reasoning team
- Initial emotional reaction: excitement, disbelief, lack of sleep
- Early framing: this is beyond Olympiad-level performance
- Hints that the result could be publishable in top math venues
0:37 – 2:16
Why IMO/IOI matter: hard benchmarks that became an AI milestone
The team explains the International Math Olympiad (IMO) and International Olympiad in Informatics (IOI) and why they became a long-running yardstick for AI reasoning. They describe how quickly models went from struggling with basic math to achieving elite contest performance.
- What IMO and IOI are and why they’re devilishly difficult
- These competitions as an implicit “grand challenge” for AI reasoning
- Progress timeline: expectations vs reality on reaching IMO-gold level
- Reasoning as a shift from instant answers to deeper problem-solving
2:16 – 5:18
Test-time compute: letting models think longer to solve harder problems
Alexander outlines the central idea behind the project: spending more compute at inference time to improve reasoning. This “think longer” paradigm enables models to explore approaches, self-correct, and produce stronger final outputs.
- Definition of inference-time/test-time compute
- Contrast with earlier models that answered “off the cuff”
- How longer deliberation enables harder reasoning tasks
- The research motivation: push beyond what models “obviously can’t do”
5:18 – 6:16
From benchmarks to big questions: P vs NP and the limits of today’s models
The conversation broadens from contest performance to frontier problems like P vs NP. Lijie argues some breakthroughs may require building entirely new theories—something current systems may still struggle to do.
- P vs NP raised as a canonical “new theory required” problem
- Difference between solving problems and inventing foundational frameworks
- Uncertainty about timelines given rapid recent progress
- Hongxun’s background and why he joined as the field accelerated
6:16 – 7:22
The 80-year-old Erdős unit distance conjecture—what it asks
Alexander explains the Erdős unit distance conjecture in combinatorial geometry: how many pairs of points can be exactly unit distance apart, and how this grows with the number of points. The stakes are high because it’s a central, long-studied problem in discrete geometry.
- Problem statement: maximum number of unit-distance pairs among n planar points
- Asymptotic growth as the key quantity of interest
- Historical context: an 80-year-old open problem
- Why it’s nontrivial and foundational in its subfield
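For readers who want the problem statement above in symbols, here is a standard textbook formulation. It is not a quote from the episode, and the notation u(n) and the constants c and C are assumptions introduced for this note. The bounds shown are the classical ones that predate the result discussed here: the lower bound comes from Erdős' grid construction and the upper bound is due to Spencer, Szemerédi, and Trotter.

```latex
% u(n): maximum number of unit-distance pairs among n points in the plane
u(n) \;=\; \max_{\substack{P \subset \mathbb{R}^2 \\ |P| = n}}
  \#\bigl\{\, \{p, q\} \subset P \;:\; \lVert p - q \rVert = 1 \,\bigr\}

% Classical bounds for large n (c, C > 0 are absolute constants):
n^{\,1 + c/\log\log n} \;\le\; u(n) \;\le\; C\, n^{4/3}
```

The "asymptotic growth" in the summary refers to how fast u(n) grows as n increases, which is where the gap between these two bounds sits.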
7:22 – 8:16
What the model actually proved: the square grid is far from optimal
The team describes the core disproof: Erdős’ conjectured optimal square-grid construction is not close to optimal. The model found a stronger construction relying on sophisticated number theory, surprising researchers who expected the grid to be near-best.
- Erdős’ original conjecture: square grid is essentially optimal
- Model’s disproof: a construction that beats the grid substantially
- Use of “high-powered” number theory to build the construction
- Significance: a general-purpose model produced research-grade math
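To make the square-grid baseline concrete, the following is a minimal Python sketch that counts how many point pairs in a small k × k integer grid share the same distance. It only illustrates the classical grid construction, not the model's new construction, which the summary does not spell out. The function names are hypothetical. The idea is that if many pairs sit at distance √d, rescaling the grid by 1/√d turns all of them into unit-distance pairs.

```python
from collections import Counter
from itertools import combinations


def squared_distance_counts(k):
    """Count unordered pairs of points in the k x k integer grid by squared distance."""
    points = [(x, y) for x in range(k) for y in range(k)]
    counts = Counter()
    for (x1, y1), (x2, y2) in combinations(points, 2):
        counts[(x1 - x2) ** 2 + (y1 - y2) ** 2] += 1
    return counts


def best_grid_unit_pairs(k):
    """Return (d, pairs): the squared distance d realized by the most pairs.

    Scaling the grid by 1/sqrt(d) makes those pairs unit-distance pairs.
    Ties are broken in favor of the smaller d.
    """
    counts = squared_distance_counts(k)
    d, pairs = max(counts.items(), key=lambda item: (item[1], -item[0]))
    return d, pairs


if __name__ == "__main__":
    # 25 points: the most common squared distance is 5, realized by 48 pairs
    print(best_grid_unit_pairs(5))  # (5, 48)
```

For a 5 × 5 grid the best choice is already not the obvious side length 1 (40 pairs) but √5 (48 pairs), which hints at why the grid count depends on number theory: it tracks how many ways an integer can be written as a sum of two squares. The brute-force loop is quadratic in the number of points, so it is only meant for small k.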
8:16 – 11:04
How they ran it and verified it: skepticism, peer review, and slow confidence
Lijie and Hongxun recount how they tested internal models, got plausible solutions, and then sought verification from mathematically trained colleagues. Confidence grew over days as reviewers failed to find errors, turning initial disbelief into cautious acceptance.
- They were testing model capability limits using a subset of Erdős problems
- Two different internal models produced correct-looking solutions
- Verification workflow: model self-check, then human expert review
- Reviewers’ confidence moved from “definitely wrong” to “maybe correct”
11:04 – 15:56
Why general models matter: discovery without benchmark-specific training
The discussion emphasizes that the system wasn’t trained specifically for this conjecture, or even for math alone. They argue that building broadly capable reasoning models can surface unexpected research breakthroughs and democratize access over time.
- OpenAI emphasis: general capability over narrow benchmark training
- This model also works as a general-purpose assistant (e.g., coding)
- Claim: similar capabilities could become accessible to people at home
- Online reactions: researchers asking to try their own open problems
15:56 – 17:05
Creativity in the proof: bridging class field theory and combinatorial geometry
Alexander highlights what felt genuinely creative: importing class field theory into a combinatorial geometry setting, a connection not commonly executed in prior work. They describe how both insight and meticulous technical execution were required, beyond typical contest-level math.
- Novel cross-field connection: class field theory applied to geometry
- Creativity as making the bridge and successfully executing details
- The proof is described as well above typical IMO difficulty
- “Come back from lunch” effect: models exceeding expectations
17:05 – 18:24
Tools and grounding: web access, coding ability, and definition-checking behavior
They clarify the model operated like a general ChatGPT/Codex setup, with abilities like browsing and coding, but not Lean formalization in this case. A memorable anecdote: the model looked up “unit” in a dictionary, reflecting a tendency to ground definitions carefully.
- Model setup: general agentic assistant (code + web), not Lean-based proof
- Grounding behavior: restating definitions and checking meanings
- Anecdote: Cambridge Dictionary lookup of the word “unit”
- Implication: tool use supports careful interpretation and verification
18:24 – 22:30
Why this should empower mathematicians: humans digest, generalize, and extend
Hongxun and Lijie argue the result shouldn’t intimidate mathematicians, but amplify them. They emphasize a division of labor: models generate breakthroughs and ideas; humans interpret, refine, and apply them to build broader theory and solve related problems.
- Post-result: mathematicians improved bounds and extended ideas to other problems
- Humans’ continuing role: digestion, understanding, theory-building
- AI excels at connecting distant ideas; humans can pursue long arcs of theory
- Analogy to coding: better tools can increase overall output, not reduce it
22:30 – 26:54
Practical advice for researchers: ask bold questions and recalibrate trust
The team offers concrete guidance for using AI in research: use stronger reasoning settings, ask ambitious questions directly, and learn the model’s boundary through iterative trust-building. They note that human decompositions can impose wrong priors that limit discovery.
- Recommendation: use higher-capability tiers / longer-thinking modes
- Ask the boldest version of the question; don’t over-decompose prematurely
- Humans have priors and blind spots; models can bypass them
- Technique: periodically “double trust,” observe failures, then recalibrate
26:54 – 37:30
What comes next: scaling independence, new ideas, and AI doing AI research
They discuss future milestones: longer autonomous work horizons, generating genuinely new ideas, and AI accelerating AI research itself. They frame progress as an exponential increase in effective independent work time, but note new mathematical theories may still demand long horizons.
- “Moore’s law” style doubling of effective independent work horizon
- Near-term: models solve hard problems; longer-term: new theories and new math
- Milestone: AI substantially contributing to AI research itself
- Follow-up replication: other models/labs can reach results with more steering
37:30 – 41:17
Cryptography and quantum computing: stress-testing assumptions and speeding progress
The episode closes by connecting reasoning advances to cryptography and quantum computing. Lijie notes cryptography rests on conjectured hardness assumptions that stronger reasoning could validate or break, while Hongxun suggests AI may accelerate quantum error correction and implementation timelines.
- Cryptography relies on unproven hardness assumptions (e.g., factoring)
- AI could prove security foundations or discover vulnerabilities
- Quantum computing is a different paradigm, but AI can accelerate its development
- AI assistance extends beyond one-shot answers: interactive explanation and teaching