OpenAI: What happens now that AI is good at math? (the OpenAI Podcast Ep. 17)
CHAPTERS
From “laughable at math” to Fields Medal assistance
Andrew Mayne opens with Sébastien Bubeck and Ernest Ryu on how quickly AI math capabilities have advanced, from pre-reasoning models to systems that can meaningfully assist working mathematicians. They frame mathematics as an unexpectedly revealing domain for measuring reasoning progress and discuss why the pace has surprised even experts.
- •AI math progress over the last few years has been “miraculous”
- •Shift from no reasoning models to research-grade help in ~2 years
- •Math as a clear, objective lens for capability progress
- •Why the jump surprised the research and math communities
A 42-year-old open problem solved with ChatGPT (and a human verifier)
Ernest recounts using ChatGPT to tackle a genuinely open optimization-theory question about Nesterov’s accelerated gradient method. Over about 12 hours across three evenings, he iterated with the model, corrected mistakes, guided approaches, and verified the final argument—resulting in a correct resolution of a decades-old question.
- •Target problem: whether Nesterov acceleration can diverge in worst cases
- •Process: not one-shot—interactive collaboration with corrections and steering
- •Human role: verification and direction-setting were essential
- •Outcome: a verified proof settling the question (the iterates converge, so worst-case divergence is ruled out)
- •Public feedback loop: sharing on social media and community scrutiny
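For readers who have not seen the method, here is a minimal sketch of Nesterov's accelerated gradient iteration in one common textbook form. The episode summary does not spell out the algorithm, so the momentum schedule, the function name nesterov_agd, and the toy quadratic below are assumptions chosen for illustration. The sketch does not reproduce the proof discussed in the episode; it only shows the sequence of iterates whose long-run behavior the open question was about.

```python
import math

def nesterov_agd(grad, x0, step, num_iters):
    """Run a textbook form of Nesterov's accelerated gradient method.

    grad: function returning the gradient (list of floats) at a point
    x0: starting point (list of floats)
    step: step size, typically 1/L for an L-smooth objective
    """
    x = list(x0)   # main iterate
    y = list(x0)   # extrapolated ("look-ahead") point
    t = 1.0        # momentum parameter
    for _ in range(num_iters):
        g = grad(y)
        # Gradient step taken from the extrapolated point.
        x_next = [yi - step * gi for yi, gi in zip(y, g)]
        # Standard momentum schedule (assumed here; other schedules exist).
        t_next = (1.0 + math.sqrt(1.0 + 4.0 * t * t)) / 2.0
        beta = (t - 1.0) / t_next
        # Extrapolate along the direction of the last move.
        y = [xn + beta * (xn - xo) for xn, xo in zip(x_next, x)]
        x, t = x_next, t_next
    return x

# Hypothetical toy objective: f(x) = 0.5 * (x[0]**2 + 10 * x[1]**2) - x[0] - x[1],
# which is 10-smooth and minimized at (1.0, 0.1).
def toy_grad(x):
    return [x[0] - 1.0, 10.0 * x[1] - 1.0]

x_final = nesterov_agd(toy_grad, x0=[0.0, 0.0], step=0.1, num_iters=500)
```

On this well-behaved toy problem the iterates settle near the minimizer. The research question concerned what can be guaranteed for every smooth convex objective, which a numerical run like this cannot establish.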
Calibrating progress: from everyday arithmetic failures to IMO gold
The group contrasts early model limitations (splitting expenses, time-zone scheduling) with sudden improvements that culminated in gold-medal-level performance at the International Mathematical Olympiad (IMO). They distinguish competition math (short, “canned” solutions) from research math, while noting the practical threshold: for most STEM users, today’s models cover nearly all needed mathematics, provided users stay cautious and check the results.
- •Earlier failures on practical multi-step arithmetic and planning tasks
- •IMO gold as a major milestone but not the same as research math
- •Pragmatic benchmark: most non-research STEM math is now doable with AI
- •Ongoing need for verification (models still make mistakes)
What changed: beyond scaling, toward reasoning systems
Sébastien argues the “scaling alone” framing misses the point: multiple innovations progressed together, not one silver bullet. He situates the progress historically (e.g., Minerva era) to show how quickly expectations have shifted, and emphasizes that modern models can often solve problems directly (not just via calculator tools).
- •Progress came from many coordinated research advances, not just scaling
- •Pre-ChatGPT results (e.g., Minerva) now look primitive in hindsight
- •Reasoning improvements changed what models can do without tools
- •Math capability moved from novelty to near-saturation for standard tasks
Why math matters for AGI: long, consistent chains of thought
Math is presented as more than “cool”—it demands long-horizon, error-intolerant reasoning where a single mistake collapses the whole argument. That property makes it an ideal training/measurement ground for reasoning that should transfer to other domains, mirroring why humans learn math for disciplined thinking.
- •Math problems require long-duration, consistent reasoning
- •Single error can invalidate an entire proof—strong pressure for self-correction
- •Math as an unambiguous benchmark with verifiable answers (below research level)
- •Expectation: reasoning skills learned in math will generalize broadly
Erdős problems, literature search breakthroughs, and a communication trap
Sébastien explains Paul Erdős, the culture around his questions, and the online catalog of open problems. They describe early successes where models didn’t invent new proofs but performed deep “literature search + translation,” connecting results across fields—followed by controversy when those results were misunderstood as brand-new solutions.
- •Who Erdős was, why his problem lists matter, and what Erdős numbers are
- •Using Erdős problem repositories as a testbed for research capabilities
- •Early wins: models surfaced existing solutions in obscure/adjacent literature
- •Key nuance: mapping between fields/languages was the hard part AI helped with
- •Public messaging risk: claims were misread as ‘solved 10 hard open problems’
From rediscovery to genuinely new combinatorics results
They describe rapid acceleration from “finding answers that already exist” to producing novel, publishable solutions. This raises deeper questions about what scientific creativity is—mere recombination plus reasoning, or rare sparks of genius—and whether AI can continuously extend human knowledge without bound.
- •Progress moved from literature-based solutions to original, publishable work
- •Claim: more than 10 genuinely new solutions so far, some fit for top journals
- •Debate on the nature of discovery: recombination vs. genius insights
- •AI research forces rethinking what ‘insight’ actually means
The automated researcher and ‘AGI time’ (seconds → weeks → months)
The conversation shifts to building systems that can work autonomously over long horizons, not just within a single chat session. Sébastien introduces “AGI time” as the duration an AI can sustain human-like research thinking, arguing that the key frontier is extending this from days to weeks and beyond—an open research problem tied to the “automated researcher” vision.
- •Today’s workflow resembles professor–student iteration, but faster
- •Automated researcher: agents working autonomously over long periods
- •“AGI time” as a capability axis: seconds → minutes → hours → days → weeks
- •Long-horizon autonomy needed for major breakthroughs and lab-integrated science
- •No one knows the full recipe yet; requires further innovation
Context limits, persistent workspaces, and the Codex analogy for math
Ernest highlights that a typical chat context holds roughly as much as a 50-page paper, which is insufficient for deep breakthroughs that require far more thinking than the final write-up. He points to tools like Codex that operate over large codebases with persistent artifacts as a model for how math research agents could maintain long-running notes, summaries, and evolving work products.
- •Context window limits constrain deep research inside a single chat
- •Human research: months of thought distilled into a shorter paper
- •Codex-style persistent repositories suggest a path for long projects
- •Agents could ‘compactify’/summarize while preserving long-term progress
- •Goal: systems that support more than 50 pages’ worth of coherent, cumulative reasoning
Science acceleration in practice: lowering friction and expanding who can do what
Andrew shares a hands-on example of generating a benchmark dataset mid-workflow in minutes—something that would otherwise derail progress. Sébastien ties this to ‘science acceleration’ and emphasizes a two-way effect: mathematicians gain easy access to coding/experiments, and scientists in other fields gain access to advanced math.
- •AI removes “setup overhead” that causes people to abandon ideas
- •Example: generating benchmark data quickly during an active project
- •Mathematicians can now run computational experiments without coding expertise
- •Non-math scientists can leverage higher-level math with AI assistance
- •Acceleration is broad because training techniques are general across domains
Humans’ role as models surpass researchers: direction, meaning, and priorities
Sébastien predicts continued progress: systems that think for weeks, then years, plus agents that find mistakes in papers and even propose valuable new questions. He argues the human role becomes more about setting goals and ensuring science serves human needs (health, control over the environment), since AI has no intrinsic stake in those outcomes.
- •Trendline suggests rapid extension of ‘thinking time’ and capability
- •Agents already find errors in papers and generate publishable research questions
- •AI is not just an answerer; it can be a question-asker
- •Human purpose: understanding and applying knowledge to human-relevant goals
- •Need for humans to guide what problems matter and keep control
Verification, shallow understanding risk, and how to learn math with ChatGPT
They discuss the double-edged nature of AI: it can accelerate proof checking and improve trust, but overreliance can erode deep expertise and produce confident nonsense, especially from non-experts attempting grand proofs. They close with practical learning advice: use ChatGPT as an adaptive tutor, ask it for problems at your level, iterate socially, but still do the hard work of understanding and verifying.
- •AI can speed verification and flag suspect steps, but shouldn’t be sole arbiter
- •Risk: mental atrophy and shallow understanding if users outsource thinking
- •Non-experts can produce long, wrong ‘proofs’—expertise remains crucial
- •Cultural accountability: humans must take responsibility for AI-assisted outputs
- •Learning advice: explain your background to ChatGPT, ask targeted follow-ups, generate level-appropriate questions, and practice verification