No Priors Ep. 113 | With OpenAI's Eric Mitchell and Brandon McKinzie

No Priors · May 1, 2025 · 38m

Sarah Guo (host), Elad Gil (host), Eric Mitchell (guest), Brandon McKinzie (guest)

- Design and capabilities of OpenAI’s O3 reasoning model
- Reinforcement learning and test-time scaling for deep reasoning
- Tool use (browsing, code execution, data analysis) as a force multiplier
- Product tradeoffs: speed vs. depth, unifying vs. specializing models
- Applications in coding, research, analytics, and computer control
- Challenges in evaluation, data, and multi-agent / human collaboration
- Future directions: robotics, computer use, and spiky capability growth

In this episode of No Priors, hosts Sarah Guo and Elad Gil speak with OpenAI researchers Eric Mitchell and Brandon McKinzie about the O3 reasoning model: how it thinks before responding, and how it uses tools to handle complex, multi-step tasks.

OpenAI’s O3: Tool-Using Reasoning Model Redefines Deep, Steerable AI

The episode explores OpenAI’s O3 reasoning model with researchers Eric Mitchell and Brandon McKinzie, focusing on how it ‘thinks before responding’ and uses tools to handle complex, multi-step tasks. They explain that O3 is trained heavily with reinforcement learning to solve hard problems, allocate compute at test time, and orchestrate tools like browsing and code execution. The conversation covers product tradeoffs between speed and depth, steerability for end users vs. developers, and why tool use dramatically improves test-time scaling, especially in vision and coding. They also discuss future directions such as computer use, robotics, multi-agent collaboration, better evals, and how AI can accelerate AI research itself.

Key Takeaways

Reasoning models benefit from ‘thinking time’ and dynamic compute allocation.

O3 can spend more computation at inference to reason step-by-step, and empirical curves show that letting it think longer typically yields higher accuracy, especially on hard problems.

Tool use turns language models into higher-level agents rather than text generators.

By browsing, writing and running code, and iterating on results, O3 can autonomously decompose tasks like due diligence or research into sequences of tool calls, making its ‘thinking tokens’ far more productive.
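The tool-use loop described above can be sketched as a simple dispatch cycle: the model inspects the trace so far, chooses a tool or a final answer, and each tool result feeds back into the next step. This is an illustrative toy only — the tool names, the `agent_loop` interface, and the scripted stand-in model are all hypothetical, not OpenAI's actual implementation.

```python
def run_python(code: str) -> str:
    # Toy "code execution" tool: evaluate a single arithmetic expression.
    return str(eval(code, {"__builtins__": {}}, {}))

def browse(query: str) -> str:
    # Toy "browsing" tool backed by a canned lookup table.
    return {"openai o3 release": "O3 announced April 2025"}.get(
        query.lower(), "no results")

TOOLS = {"python": run_python, "browse": browse}

def agent_loop(model_step, task: str, max_steps: int = 5) -> str:
    """Feed the task and a growing trace to the model; dispatch tool
    calls until it emits a final answer or the step budget runs out."""
    trace = [("task", task)]
    for _ in range(max_steps):
        action, payload = model_step(trace)   # model picks the next action
        if action == "answer":
            return payload
        result = TOOLS[action](payload)       # run the chosen tool
        trace.append((action, result))        # result feeds the next step
    return "step budget exhausted"

# A scripted stand-in for the model: run some code, then answer with it.
def scripted_model(trace):
    if len(trace) == 1:
        return ("python", "17 * 23")
    return ("answer", trace[-1][1])

print(agent_loop(scripted_model, "What is 17 * 23?"))  # → 391
```

The point of the sketch is the feedback structure: because tool outputs re-enter the trace, the model can review results and adjust course, which is what makes a high-level request like "do some due diligence" decomposable into concrete steps.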

Reinforcement learning on difficult, tool-based tasks is central to O3’s training.

Instead of only next-token prediction, O3 is optimized via RL to solve challenging, long-horizon tasks, learn when to call tools, and manage uncertainty and multi-step workflows.

Model steerability will matter as much as raw capability.

Users and API developers need to specify constraints like latency, cost, and depth of analysis; the vision is models that understand context (e.g., …)

Browsing and vision tools sharply improve test-time scaling and reliability.

For images and current information, O3 can recognize its own uncertainty and then act (e.g., …)

High-quality evals are as strategically important as high-quality training data.

As frontier models ‘solve’ many existing benchmarks, uncontaminated, well-designed evaluations become critical to measure real progress and avoid optimizing blindly against noisy or saturated metrics.

AI is reaching an inflection point as a force multiplier for AI research and coding.

The guests already use O-series models multiple times per day for navigating large codebases and research tasks, and they see a near-term loop where models materially accelerate the development of their successors.

Notable Quotes

O3 is focused on thinking carefully before it responds, and these models are in some vaguely general sense smarter than models that don’t think before they respond.

Eric Mitchell

You can feel this when you’re talking to O3… the longer it thinks, I really get the impression that I’m going to get a better result.

Brandon McKinzie

You can just allocate compute a lot more efficiently because you can defer stuff that the model doesn’t have comparative advantage at to a tool that is really well suited to doing that thing.

Eric Mitchell

It kind of drives me crazy in some sense that our models are not already just on my computer all day, watching what I’m doing… I hate typing.

Brandon McKinzie

Evaluating the capabilities of a general capable agent is really hard to do in a rigorous way… evals are a little underappreciated.

Eric Mitchell

Questions Answered in This Episode

How does O3 internally decide when to keep thinking versus when it’s confident enough to answer immediately?

What concrete safety and reliability thresholds would OpenAI need before allowing models to fully control a user’s computer or email?

How might multi-agent RL—training multiple O3-like models to collaborate—change how we structure teams and knowledge work?

Where do the guests expect general-purpose models like O3 to hit hard limits without specialized systems (e.g., in robotics or real-time control)?

What new kinds of evals would best capture O3’s real-world usefulness on long, messy tasks like multi-week software projects or complex M&A analysis?

Transcript Preview

Sarah Guo

(instrumental music) . Hi, listeners, and welcome back to No Priors. Today, I'm speaking with Brandon McKinzie and Eric Mitchell, two of the minds behind OpenAI's O3 model. O3 is the latest in the line of reasoning models from OpenAI, super powerful with the ability to figure out what tools to use, and then use them across multi-step tasks. We'll talk about how it was made, what's next, and how to reason about reasoning. Brandon and Eric, welcome to No Priors.

Brandon McKinzie

Thanks for having us.

Eric Mitchell

Yeah, thanks for having us.

Elad Gil

Do you mind walking us through, um, O3, what's different about it, what it, what it was in terms of a breakthrough in terms of, like, you know, a focus on reasoning and you're adding memory and other things versus just a core foundation model at LLM and what that is?

Eric Mitchell

So O3 is, like, our most recent model in this O series line of models that, um, are focused on thinking carefully before they respond, and these models are in sort of some vaguely general sense smarter than, like, models that don't think before they respond. You know, similarly to humans, um, it's easier to be, you know, more accurate if you think before you respond. I think the thing that is really exciting about O3, um, is that not only is it just smarter if you make, like, an apples to apples comparison to our previous O series models, you know, it's just better at, like, giving you correct answers of math problems or factual questions about the world or whatever. Um, this is true and it's great, and we, you know, will continue to train models that are smarter, um, but it's also very cool because it uses a lot of tools that, um, uh, that, that enhance its ability to do things that are useful for you. So yeah, like, you can train a model that's really smart, but, like, if it can't browse the web and get up-to-date information, there's just a limitation on how much useful stuff that model can do for you. If the model can't actually write and execute code, um, there's just a limitation to, um, how, you know, the, the sorts of things that an LLM can do efficiently, whereas, like, a relatively simple Python program can, you know, solve a particular problem very easily. So, um, not only is the model it's, on its own smarter than our previous O series models, which is great, but it's also able to use all these tools that, like, further enhance its abilities and whether that's doing, like, research on something where you want up-to-date information or you want the model to do some data analysis for you, or you want the model to be able to do the data analysis and then kind of review the results and adjust course as it sees fit instead of you having to be so sort of prescriptive about, like, each step along the way. 
The model's sort of able to take these, like, high-level requests, like do some due diligence on this company and, you know, maybe run some reasonable, like, forecasting models on so and so thing, and then, you know, write a summary for me. Like, the model will kind of, like, infer a reasonable set of actions to do on its own. So it gives you kind of like a higher level interface to, to doing some of these more complicated, uh, tasks.
