No Priors

No Priors Ep. 113 | With OpenAI's Eric Mitchell and Brandon McKinzie

Sarah Guo and Brandon McKinzie on OpenAI’s O3: Tool-Using Reasoning Model Redefines Deep, Steerable AI.

Hosts: Sarah Guo, Elad Gil · Guests: Brandon McKinzie, Eric Mitchell
May 1, 2025 · 38m
Design and capabilities of OpenAI’s O3 reasoning model
Reinforcement learning and test-time scaling for deep reasoning
Tool use (browsing, code execution, data analysis) as a force multiplier
Product tradeoffs: speed vs. depth, unifying vs. specializing models
Applications in coding, research, analytics, and computer control
Challenges in evaluation, data, and multi-agent / human collaboration
Future directions: robotics, computer use, and spiky capability growth

OpenAI’s O3: Tool-Using Reasoning Model Redefines Deep, Steerable AI

The episode explores OpenAI’s O3 reasoning model with researchers Eric Mitchell and Brandon McKinzie, focusing on how it ‘thinks before responding’ and uses tools to handle complex, multi-step tasks. They explain that O3 is trained heavily with reinforcement learning to solve hard problems, allocate compute at test time, and orchestrate tools like browsing and code execution. The conversation covers product tradeoffs between speed and depth, steerability for end users vs. developers, and why tool use dramatically improves test-time scaling, especially in vision and coding. They also discuss future directions such as computer use, robotics, multi-agent collaboration, better evals, and how AI can accelerate AI research itself.

Key Takeaways

Reasoning models benefit from ‘thinking time’ and dynamic compute allocation.

O3 can spend more computation at inference to reason step-by-step, and empirical curves show that letting it think longer typically yields higher accuracy, especially on hard problems.

Tool use turns language models into higher-level agents rather than text generators.

By browsing, writing and running code, and iterating on results, O3 can autonomously decompose tasks like due diligence or research into sequences of tool calls, making its ‘thinking tokens’ far more productive.
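The tool-orchestration pattern described here can be sketched as a simple loop in which a model alternates between reasoning steps and tool calls until it produces a final answer. This is an illustrative sketch only, not OpenAI’s actual implementation; the `model` and `tools` interfaces and the stopping rule are hypothetical.

```python
# Illustrative sketch of a tool-use loop: the model decides at each step
# whether to call a tool (e.g. browse, run code) or emit a final answer.
# The model/tool interfaces below are hypothetical, not OpenAI's API.
from typing import Callable

def run_agent(task: str,
              model: Callable[[list[dict]], dict],
              tools: dict[str, Callable[[str], str]],
              max_steps: int = 10) -> str:
    """Alternate between model steps and tool calls until the model
    returns a final answer or the step budget is exhausted."""
    history = [{"role": "user", "content": task}]
    for _ in range(max_steps):
        step = model(history)  # model decides: answer now, or call a tool
        if step.get("final_answer"):
            return step["final_answer"]
        name = step["tool"]
        result = tools[name](step["input"])  # run the chosen tool
        # feed the tool's output back so the next step can build on it
        history.append({"role": "tool", "name": name, "content": result})
    return "step budget exhausted"
```

The step budget mirrors the episode’s point about allocating compute at test time: a harder task can simply be given more loop iterations.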

Reinforcement learning on difficult, tool-based tasks is central to O3’s training.

Instead of only next-token prediction, O3 is optimized via RL to solve challenging, long-horizon tasks, learn when to call tools, and manage uncertainty and multi-step workflows.

Model steerability will matter as much as raw capability.

Users and API developers need to specify constraints like latency, cost, and depth of analysis; the vision is models that understand this context and adapt how much they reason accordingly.

Browsing and vision tools sharply improve test-time scaling and reliability.

For images and current information, O3 can recognize its own uncertainty and then act on it, using browsing and vision tools to check rather than guess.

High-quality evals are as strategically important as high-quality training data.

As frontier models ‘solve’ many existing benchmarks, uncontaminated, well-designed evaluations become critical to measure real progress and avoid optimizing blindly against noisy or saturated metrics.

AI is reaching an inflection point as a force multiplier for AI research and coding.

The guests already use O-series models multiple times per day for navigating large codebases and research tasks, and they see a near-term loop where models materially accelerate the development of their successors.

Notable Quotes

O3 is focused on thinking carefully before it responds, and these models are in some vaguely general sense smarter than models that don’t think before they respond.

Eric Mitchell

You can feel this when you’re talking to O3… the longer it thinks, I really get the impression that I’m going to get a better result.

Brandon McKinzie

You can just allocate compute a lot more efficiently because you can defer stuff that the model doesn’t have comparative advantage at to a tool that is really well suited to doing that thing.

Eric Mitchell

It kind of drives me crazy in some sense that our models are not already just on my computer all day, watching what I’m doing… I hate typing.

Brandon McKinzie

Evaluating the capabilities of a general capable agent is really hard to do in a rigorous way… evals are a little underappreciated.

Eric Mitchell

Questions Answered in This Episode

How does O3 internally decide when to keep thinking versus when it’s confident enough to answer immediately?

What concrete safety and reliability thresholds would OpenAI need before allowing models to fully control a user’s computer or email?

How might multi-agent RL—training multiple O3-like models to collaborate—change how we structure teams and knowledge work?

Where do the guests expect general-purpose models like O3 to hit hard limits without specialized systems (e.g., in robotics or real-time control)?

What new kinds of evals would best capture O3’s real-world usefulness on long, messy tasks like multi-week software projects or complex M&A analysis?
