No Priors Ep. 113 | With OpenAI's Eric Mitchell and Brandon McKinzie
Sarah Guo with OpenAI’s Eric Mitchell and Brandon McKinzie on O3: Tool-Using Reasoning Model Redefines Deep, Steerable AI
OpenAI’s O3: Tool-Using Reasoning Model Redefines Deep, Steerable AI
The episode explores OpenAI’s O3 reasoning model with researchers Eric Mitchell and Brandon McKinzie, focusing on how it ‘thinks before responding’ and uses tools to handle complex, multi-step tasks. They explain that O3 is trained heavily with reinforcement learning to solve hard problems, allocate compute at test time, and orchestrate tools like browsing and code execution. The conversation covers product tradeoffs between speed and depth, steerability for end users vs. developers, and why tool use dramatically improves test-time scaling, especially in vision and coding. They also discuss future directions such as computer use, robotics, multi-agent collaboration, better evals, and how AI can accelerate AI research itself.
Key Takeaways
Reasoning models benefit from ‘thinking time’ and dynamic compute allocation.
O3 can spend more computation at inference to reason step-by-step, and empirical curves show that letting it think longer typically yields higher accuracy, especially on hard problems.
Tool use turns language models into higher-level agents rather than text generators.
By browsing, writing and running code, and iterating on results, O3 can autonomously decompose tasks like due diligence or research into sequences of tool calls, making its ‘thinking tokens’ far more productive.
Reinforcement learning on difficult, tool-based tasks is central to O3’s training.
Instead of only next-token prediction, O3 is optimized via RL to solve challenging, long-horizon tasks, learn when to call tools, and manage uncertainty and multi-step workflows.
Model steerability will matter as much as raw capability.
Users and API developers need to specify constraints like latency, cost, and depth of analysis; the vision is models that understand context (e. ...
Browsing and vision tools sharply improve test-time scaling and reliability.
For images and current information, O3 can recognize its own uncertainty and then act (e. ...
High-quality evals are as strategically important as high-quality training data.
As frontier models ‘solve’ many existing benchmarks, uncontaminated, well-designed evaluations become critical to measure real progress and avoid optimizing blindly against noisy or saturated metrics.
AI is reaching an inflection point as a force multiplier for AI research and coding.
The guests already use O-series models multiple times per day for navigating large codebases and research tasks, and they see a near-term loop where models materially accelerate the development of their successors.
Notable Quotes
“O3 is focused on thinking carefully before it responds, and these models are in some vaguely general sense smarter than models that don’t think before they respond.”
— Eric Mitchell
“You can feel this when you’re talking to O3… the longer it thinks, I really get the impression that I’m going to get a better result.”
— Brandon McKinzie
“You can just allocate compute a lot more efficiently because you can defer stuff that the model doesn’t have comparative advantage at to a tool that is really well suited to doing that thing.”
— Eric Mitchell
“It kind of drives me crazy in some sense that our models are not already just on my computer all day, watching what I’m doing… I hate typing.”
— Brandon McKinzie
“Evaluating the capabilities of a general capable agent is really hard to do in a rigorous way… evals are a little underappreciated.”
— Eric Mitchell
Questions Answered in This Episode
How does O3 internally decide when to keep thinking versus when it’s confident enough to answer immediately?
What concrete safety and reliability thresholds would OpenAI need before allowing models to fully control a user’s computer or email?
How might multi-agent RL—training multiple O3-like models to collaborate—change how we structure teams and knowledge work?
Where do the guests expect general-purpose models like O3 to hit hard limits without specialized systems (e.g., in robotics or real-time control)?
What new kinds of evals would best capture O3’s real-world usefulness on long, messy tasks like multi-week software projects or complex M&A analysis?