No Priors Ep. 113 | With OpenAI's Eric Mitchell and Brandon McKinzie
CHAPTERS
- 0:00 – 3:20
What makes o3 different: deliberate reasoning plus tool-using agent behavior
Eric explains o3 as the newest “O-series” reasoning model that pauses to think before answering. Beyond being stronger at core accuracy (math, facts, reasoning), o3 is designed to select and use tools—like browsing and code execution—to complete multi-step tasks more autonomously.
- •O-series emphasis: thinking carefully before responding improves correctness
- •o3 is “smarter” than prior O-series models on standard question answering
- •Tool access expands usefulness beyond static model knowledge
- •Model can infer a plan: research, analyze data, revise steps, and summarize
- •Higher-level interface: users give goals, model decides actionable steps
- 3:20 – 5:24
How o3 is trained: reinforcement learning and “test-time scaling” that feels worth the wait
Brandon highlights reinforcement learning as the major training difference versus pure next-token pretraining. They discuss test-time scaling—more inference-time compute leading to better outcomes—and how o3’s longer thinking tends to reliably translate into better results.
- •Reinforcement learning is the biggest shift in building reasoning models
- •Training targets solving difficult tasks with flexible thinking time
- •Earlier test-time scaling could degrade into unproductive “ranting”
- •With o3, longer thinking more consistently improves outcome quality
- •User experience: waiting feels justified because behavior stays goal-directed
- 5:24 – 6:39
Model unification and the product problem: fewer switches, more “smart defaults”
Elad asks whether the future is separate fast/cheap vs slow/expensive models. Eric frames OpenAI’s interest in unifying models and making the system choose appropriately, reducing user confusion from a “model switcher” experience.
- •Too many model choices create friction for users
- •Goal: make the experience intuitive—system decides which capability to use
- •Open question: two models vs many models vs decision baked into the model
- •Different tasks demand different latency/cost trade-offs
- •Experimentation will determine the best UI/UX abstraction
- 6:39 – 9:02
Steerability and uncertainty: models should think longer only when they need to
Sarah pushes on whether combining reasoning with pretraining is necessary versus purely product-driven selection. Brandon and Eric argue for models that understand their own uncertainty and can be steered by context—respond instantly when confident, deliberate when uncertain, and respect developer latency constraints.
- •Ideal behavior: time spent equals time needed for correctness
- •Better uncertainty estimation enables adaptive thinking time
- •Potential bifurcation: end-user product vs developer/API constraints
- •Steerability: “this is an API use case—be fast” vs “take your time”
- •Limits: deciding the right approach may itself require thought
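The "think only as long as needed" idea can be sketched as a small policy. This is an illustration only, not how o3 works: the function name `choose_thinking_steps`, the `Budget` type, the 0.9 threshold, and the linear scaling rule are all assumptions made for the example. It assumes a confidence estimate is already available, which the chapter notes is itself a hard problem.

```python
import math
from dataclasses import dataclass


@dataclass
class Budget:
    """Hypothetical latency constraint set by the caller (e.g. an API developer)."""
    max_thinking_steps: int


def choose_thinking_steps(confidence: float, budget: Budget, threshold: float = 0.9) -> int:
    """Return how many deliberation steps to spend before answering.

    Assumption: `confidence` is the model's own estimate in [0, 1] that an
    immediate answer would be correct.
    """
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must be in [0, 1]")
    if confidence >= threshold:
        return 0  # confident enough: respond instantly
    # Less confidence means a larger share of the allowed budget is used.
    shortfall = (threshold - confidence) / threshold
    steps = math.ceil(shortfall * budget.max_thinking_steps)
    # Spend at least one step when uncertain, but never exceed the caller's cap.
    return min(max(steps, 1), budget.max_thinking_steps)


if __name__ == "__main__":
    print(choose_thinking_steps(0.95, Budget(max_thinking_steps=10)))  # confident
    print(choose_thinking_steps(0.20, Budget(max_thinking_steps=10)))  # uncertain
    print(choose_thinking_steps(0.20, Budget(max_thinking_steps=0)))   # "be fast"
```

A budget of zero models the "this is an API use case, be fast" instruction: the cap wins even when the model is uncertain.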
- 9:02 – 11:05
Why tool use boosts test-time scaling: productive tokens, verifiable computation, and new inputs
They dig into why tools make extended reasoning more effective. Brandon gives visual reasoning examples where image manipulation (crop/zoom) turns extra tokens into progress; Eric adds that code tools shift work to verifiable, efficient computation rather than fragile in-context arithmetic.
- •Without tools, longer chains can become unproductive or unstable
- •Vision: tools let the model reduce uncertainty by manipulating images
- •With tools, “test-time scaling slope” becomes much steeper
- •Code execution enables fast self-verification and correct-by-construction steps
- •Tools allocate compute to what models aren’t comparatively best at doing
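A toy example of why a code tool makes a step verifiable: instead of trusting arithmetic done in context, the proposed answer is checked by running exact computation. The task (integer factorization) and the helper name `verify_factorization` are chosen only for illustration and do not come from the episode.

```python
import math
from typing import Sequence


def verify_factorization(n: int, factors: Sequence[int]) -> bool:
    """Check a proposed factorization by exact multiplication.

    Returns True only if every factor is greater than 1 and their product is n.
    """
    if not factors or any(f <= 1 for f in factors):
        return False
    return math.prod(factors) == n


if __name__ == "__main__":
    # A plausible-looking but wrong in-context guess is caught immediately.
    print(verify_factorization(91, [7, 12]))
    # The corrected proposal passes the same check.
    print(verify_factorization(91, [7, 13]))
```

The check is cheap compared with generating the answer, which is the sense in which extra tokens spent on tool calls turn into progress rather than drift.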
- 11:05 – 13:46
Deep Research as a proving ground: browsing + RL objectives tuned to user tolerance
Elad asks how Deep Research is trained and whether it required special RL. Eric describes browsing as a natural, widely applicable tool-use testbed and notes that RL is fundamentally about choosing objectives—length, depth, report format—based on expected user needs and patience.
- •Browsing is broadly useful for up-to-date, real-world queries
- •Early browsing was hard to make reliable; newer approaches improved
- •Tool-use tasks test whether RL yields long-horizon, meaningful behaviors
- •RL objective design depends on intended users and output expectations
- •Trade-offs: 30-minute rollouts vs concise reports; pages vs deep dossiers
- 13:46 – 15:56
Near-term application areas: coding, internal research velocity, and AI helping build AI
Brandon emphasizes coding and accelerating research workflows as immediate wins. Elad connects this to the “bootstrap” idea: using AI to improve the next generation of AI, with many sub-tasks across hardware, training, and evaluation that can be optimized.
- •Coding agents are a high-leverage, near-term application
- •Models are crossing an inflection point into daily usefulness
- •Internal codebases and complex systems benefit from persistent assistance
- •AI can speed many research components: training, evals, hardware workflow
- •Feedback loop: better AI helps create better AI faster
- 15:56 – 18:57
Future interaction: computer-use agents, opt-in monitoring, and safe affordances
Conversation shifts to richer interfaces—models operating directly on a user’s computer and business software workflows. Brandon wants “computer use” assistants that can help in-the-moment; Eric stresses caution due to asymmetric downside risks and the need for sandboxed deployment.
- •Computer-use agents could reduce friction vs typing prompts
- •Privacy/creepiness concerns mean monitoring should be opt-in, with an easy way to opt out
- •Models already show surprisingly human-like tool behavior
- •Safety: limit affordances to avoid costly mistakes (emails, deletions)
- •Deployment should expand capability as reliability improves
- 18:57 – 21:54
A framework for task difficulty: environment uncertainty and simulatable vs real-world constraints
Sarah asks for an organizing model of what gets easier with intelligence, tools, and RL. Eric offers two main dimensions: how much external uncertainty the task involves (needing experiments/tool feedback) and how constrained it is by real-world time/physics versus fast simulation.
- •Simple recall tasks require minimal environment interaction
- •Harder tasks require executing, testing, and iterating in an environment
- •External feedback introduces uncertainty that must be managed
- •Simulatable domains (e.g., coding) advance faster than robotics
- •Real-world physics imposes latency constraints that disembodied models can ignore
- 21:54 – 25:21
General-purpose vs specialized models: robotics, real-time requirements, and vision quirks
Elad probes whether robotics foundation models will merge into general systems. Brandon argues a unified model seems plausible, though conflicts can arise when very different capabilities share weights; they discuss real-time constraints and how current vision learning differs from embodied perception, illustrated by the “10:10 clock” bias.
- •Trend: specialized models often get subsumed by general-purpose models
- •Robotics adds strict real-time constraints (“gravity won’t wait”)
- •Small biological brains show efficient real-time competence, raising questions
- •Vision edge case: clock-reading bias toward 10:10 due to internet imagery
- •Closing the loop via action can reduce uncertainty versus pure thinking
- 25:21 – 29:07
Simulating humans and collaboration: multi-agent training, interactive intervention, and cost trade-offs
Sarah raises the challenge of simulating human collaboration in long-running tasks like software engineering. Brandon suggests training multiple models together (multi-agent RL) as a starting point for collaboration skills, and he imagines interactive training where humans can intervene mid-reasoning; Eric notes humans are high-uncertainty, slow “tool calls,” yet often crucial.
- •Long-horizon work involves unpredictable human collaboration
- •Idea: train models to cooperate with other models (multi-agent RL)
- •Interactive training could let users interrupt and shape behavior on the fly
- •Humans are slow/expensive tool calls compared with reading/browsing at scale
- •Hardest part of projects can be coordination, not coding itself
- 29:07 – 34:39
How models advance next: spiky capability gains, data vs algorithms, and the eval bottleneck
Sarah asks whether RL-driven progress makes capability improvements “spikier” across domains. Eric agrees there’s some structural spikiness due to prioritization, but warns against assuming models only improve in math/code; both highlight parallel tracks—domain data targeting and broad algorithmic lifts—and they emphasize the growing importance of uncontaminated, high-quality evals.
- •RL domain choices can create uneven (“spiky”) capability improvements
- •Counterpoint: improvements can generalize beyond targeted domains
- •Labs balance domain-specific data efforts with algorithmic changes that lift broadly
- •Evals are increasingly scarce as models saturate existing benchmarks
- •Wish list: uncontaminated evals and frontier-level long-horizon training data
- 34:39 – 38:10
Using reasoning models effectively: sample the distribution and aggregate many runs
They close with practical usage advice: don’t judge a model from a single sample. Eric advocates sending the same prompt many times to understand variance; Sarah suggests a product feature to automatically run many attempts and rank/synthesize the best responses, acknowledging the compute cost.
- •Single-prompt comparisons are misleading due to response variance
- •Peak performance can be impressive even if typical runs vary
- •Repeated sampling builds intuition about the model’s behavior distribution
- •Feature request: “run 100 times,” rank outputs, return top results
- •Trade-off is cost/infrastructure, but power users would pay for it
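The "run it many times and rank" advice can be sketched in a few lines. This is a minimal illustration, not a product design: `sample_many`, `rank_answers`, and `toy_model` are hypothetical names, the model call is replaced by a canned stand-in, and ranking by answer frequency (majority vote) is just one simple way to aggregate runs.

```python
from collections import Counter
from itertools import cycle
from typing import Callable, List, Tuple


def sample_many(sample_fn: Callable[[str], str], prompt: str, n: int) -> List[str]:
    """Send the same prompt n times and collect every response."""
    if n < 1:
        raise ValueError("n must be at least 1")
    return [sample_fn(prompt) for _ in range(n)]


def rank_answers(answers: List[str]) -> List[Tuple[str, float]]:
    """Rank distinct answers by the share of runs that produced them."""
    counts = Counter(answers)
    total = len(answers)
    return [(answer, count / total) for answer, count in counts.most_common()]


# Hypothetical stand-in for a real model call: cycles through canned responses.
_canned = cycle(["42", "41", "42", "42", "24"])


def toy_model(prompt: str) -> str:
    return next(_canned)


if __name__ == "__main__":
    runs = sample_many(toy_model, "What is 6 * 7?", n=10)
    for answer, share in rank_answers(runs):
        print(f"{answer}: {share:.0%} of runs")
```

Looking at the whole ranked list, not just the top entry, is what gives a sense of the response distribution; the cost grows linearly with `n`, which is the compute trade-off raised in the episode.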