No Priors Ep. 112 | With OpenAI Deep Research, Isa Fulford

On this episode of No Priors, Sarah sits down with Isa Fulford, one of the masterminds behind deep research. They unpack how the initiative began, the role of human expert data, and what it takes to build agents with real-world capability and even taste. Isa shares the differences between deep research and OpenAI’s o3 model, the challenges around latency, and how she sees agent capabilities evolving. Plus, OpenAI has announced that deep research is free for all US users starting today.

Sign up for new podcasts every week. Email feedback to show@no-priors.com

Follow us on Twitter: @NoPriorsPod | @Saranormous | @EladGil | @IsaFulf

Show Notes:
0:00 Deep research’s inception & evolution
6:12 Data creation
7:20 Reinforcement fine-tuning
9:05 Why human expert data matters
11:23 Failure modes of agents
13:55 The roadmap ahead for Deep Research
18:32 How do agents develop taste?
19:29 Experience and path to building a broadly capable agent
22:03 Deep research vs. o3
25:55 Latency
27:56 Predictions for agent capabilities

Sarah Guo (host) | Isa Fulford (guest)

Apr 24, 2025 | 30m

CHAPTERS

0:00 – 2:10
Deep Research origin: applying new RL to everyday browsing tasks
Isa shares how Deep Research started from excitement about an internal reinforcement learning breakthrough that boosted math/science/coding performance. She and colleagues explored whether the same approach could power practical, everyday agent tasks—especially web research and software engineering.
- Internal RL progress inspired new product experimentation
- Goal: transfer RL success from closed-form tasks to real-world user workflows
- Early focus split between browsing research and software engineering
- Browsing chosen as a high-value, widely applicable knowledge-work task
2:10 – 3:38
Why “read-only” research beats transactional agent demos
Sarah contrasts typical agent demos (ordering food, buying flowers) with Deep Research’s emphasis on synthesizing information. Isa explains the deliberate focus on read-only synthesis tasks as both broadly useful and aligned with OpenAI’s long-term scientific discovery goals.
- Priority on information synthesis over taking actions
- Read-only tasks map to many knowledge-work professions
- Synthesis framed as prerequisite to scientific discovery (e.g., literature reviews)
- Read-only scope also simplifies early safety constraints
3:38 – 4:00
Defining end goals and evals for open-ended browsing
Because browsing lacks neat ground-truth datasets, the team began by enumerating the kinds of outputs they wanted (ranked product lists, literature reviews, etc.). They used these target use cases to shape evaluations and steer what the model should become good at.
- Browsing is open-ended; datasets don’t naturally exist like math problems do
- Started with a concrete list of desired real-world research tasks
- Designed tasks to be gradable enough to drive training progress
- Used user-like requests (products, Reddit reviews, literature reviews) as anchors
4:00 – 4:56
From prompted demo to full training: tools, data, and iteration cadence
Isa describes building an initial UI demo with prompting only to pitch the concept, then shifting into the hard work of training. The team iterated for months on eval improvements, building browsing tooling and collaborating closely with RL researchers.
- Initial prototype had no training: prompting + UI to sell the vision
- Then: build datasets, grading methods, and browsing tools
- Close collaboration with Edward Sun, others, and the RL team
- Long uninterrupted iteration period helped drive eval gains before shipping
4:56 – 5:53
Internal eval tasks and early adoption signals
They used recurring “favorite” evaluation tasks to track progress (e.g., finding coauthored papers) and saw organic internal usage even when quality was still early. Strong internal pull—people complaining when it went down—signaled product value.
- Recurring eval: find all coauthored papers by specific researchers
- Example of a removed eval (middle name lookup) highlights privacy boundaries
- Early internal playground (Streamlit) enabled broad testing
- High demand from internal users suggested real utility early
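The coauthored-papers eval is an example of a set-valued task that can be graded automatically. A minimal sketch of how such a grader might look, assuming the model's answer is compared against a hand-curated reference list by normalized title. The function names, the normalization rule, and the precision/recall/F1 scoring are illustrative assumptions, not the grader the team used:

```python
import re


def normalize_title(title: str) -> str:
    """Lowercase and collapse punctuation/whitespace so near-identical titles match."""
    return re.sub(r"[^a-z0-9]+", " ", title.lower()).strip()


def grade_paper_list(predicted: list[str], reference: list[str]) -> dict[str, float]:
    """Score a predicted list of paper titles against a reference list.

    Hypothetical grader: exact match on normalized titles, reported as
    precision, recall, and F1.
    """
    pred = {normalize_title(t) for t in predicted}
    ref = {normalize_title(t) for t in reference}
    hits = len(pred & ref)
    precision = hits / len(pred) if pred else 0.0
    recall = hits / len(ref) if ref else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


if __name__ == "__main__":
    # Placeholder titles, not real papers.
    answer = ["Paper A", "paper b!", "Paper X"]
    truth = ["Paper A", "Paper B", "Paper C"]
    print(grade_paper_list(answer, truth))
```

Recall rewards finding every paper, while precision penalizes invented or misattributed ones, so a single score like F1 can track both kinds of progress across training runs.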
5:53 – 7:01
Data creation + tool stack: browsing, PDFs/images, and Python analysis
Isa outlines the core ingredients: human trainer data, synthetic datasets, and careful dataset design to exercise desired skills. The agent relies on a text-based browser that can interpret PDFs/images plus a Python tool for quantitative analysis and plotting, with more tools planned later.
- Used a mixture of human and synthetic data
- Dataset design must target specific skills the agent should learn
- Need grading mechanisms to train and measure progress
- Tooling: text browser (PDFs/images) + Python for analysis/plots; toolset will expand
7:01 – 8:41
When reinforcement fine-tuning (RFT) is worth it vs orchestration
Sarah asks for guidance to startups deciding between RFT and simpler agent orchestration. Isa advises RFT when tasks are truly out-of-distribution or when marginal quality improvements are mission-critical; otherwise, model improvements over time may make custom training unnecessary.
- Training on-task reliably improves performance on that task
- Some generalization occurs, but targeted training still helps
- Best for out-of-distribution domains or workflows that need an extra 10–15% in quality
- If models improve naturally each release, RFT may not justify the effort
8:41 – 10:16
Why expert human data matters—even for “universal” browsing
Isa explains that effective research requires domain judgment about source quality and relevance. They took a broad approach by recruiting experts across many domains, leveraging RL’s ability to learn the process from outcome-based objectives, with human data as a key driver of success.
- Browsing quality depends on source evaluation and domain relevance judgments
- RL can learn the process if you specify tasks and desired outcomes
- OpenAI can afford a broad, multi-domain expert approach (hard for startups)
- Human expert data remained crucial despite also using synthetic datasets
10:16 – 12:46
Agent failure modes: hacking constraints, hallucinations, and trust
The conversation shifts to common agent pitfalls and safety. Isa notes that while Deep Research can’t take consequential external actions yet, its long, comprehensive outputs can increase user trust—making hallucinations and incorrect inferences especially important to mitigate via citations and ongoing improvement.
- Models may use unexpected search terms or planning behaviors
- Risk of “hacking” constraints (working around tool/search restrictions)
- Long, confident reports can raise over-trust; hallucinations are a key issue
- Citations help users verify sources; future action-taking agents raise stakes
12:46 – 13:36
Guardrails and confirmations: how trust evolves for action-taking agents
Isa discusses how early agent products should require user confirmation for sensitive actions, similar to Operator-style approvals. Over time, as reliability grows, users may loosen oversight for routine actions, moving toward more autonomous delegation.
- Early stage: confirmations and explicit guardrails build user trust
- Oversight helps prevent unintended side effects (e.g., embarrassing emails)
- End state aims for autonomy, but only with strong confidence in safety
- User trust will likely increase with repeated successful experiences
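The confirmation pattern in this chapter can be pictured as a gate in front of each action. A minimal sketch, assuming each action carries a `sensitive` flag and the user keeps a set of action names they have chosen to trust. `Action`, `execute_with_guardrail`, and the `trusted` set are hypothetical names for illustration and do not describe how Operator or Deep Research is implemented:

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Action:
    name: str                # hypothetical action identifier, e.g. "send_email"
    sensitive: bool          # whether the action has external side effects
    run: Callable[[], str]   # the work to perform if allowed


def execute_with_guardrail(
    action: Action,
    confirm: Callable[[Action], bool],
    trusted: frozenset[str] = frozenset(),
) -> str:
    """Run an action, asking the user first unless it is safe or already trusted."""
    needs_confirmation = action.sensitive and action.name not in trusted
    if needs_confirmation and not confirm(action):
        return f"skipped: {action.name}"
    return action.run()


if __name__ == "__main__":
    send = Action("send_email", sensitive=True, run=lambda: "email sent")
    deny = lambda a: False  # stand-in for a user who declines

    # Early stage: every sensitive action needs approval.
    print(execute_with_guardrail(send, confirm=deny))
    # Later: the user has loosened oversight for this routine action.
    print(execute_with_guardrail(send, confirm=deny, trusted=frozenset({"send_email"})))
```

Growing the `trusted` set over time mirrors the shift Isa describes, from approving each sensitive step to delegating routine actions once the agent has earned confidence.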
13:36 – 16:31
Roadmap: unified agent, private data access, and compounding capability
Isa describes the product trajectory toward a unified agent that can handle a wide variety of delegated tasks. Near-term improvements include researching over private/internal data (docs, GitHub) and eventually taking actions via APIs, enabled by a compounding loop between product teams and large RL training runs.
- Vision: one unified agent for tasks you’d delegate to a coworker
- Next: research over private corp data (internal docs, GitHub)
- Later: action-taking via APIs with stronger safety measures
- Team contributes datasets to large RL runs; better base models accelerate progress
16:31 – 21:43
How people actually use Deep Research: science, code search, and data analysis
Isa highlights surprising and validating real-world usage patterns, especially domain experts endorsing outputs in areas she can’t personally evaluate. She also notes adoption for code search and coding questions, plus workflows that mix browsing with file-based data analysis and report generation.
- Most compelling validation: domain experts ratifying outputs (e.g., medical/science)
- Unexpected use case: code search and coding help using latest package/repo info
- Data analysis workflows: upload files, compute stats, produce reports with numbers
- Power comes from combining a strong base model with browsing + tooling
21:43 – 25:35
Deep Research vs o3: when to use which (specificity, retrieval, comprehensiveness)
Sarah asks for a practical mental model: when is Deep Research better than o3? Isa suggests Deep Research excels on well-defined queries that benefit from live online retrieval and when comprehensive, long-form synthesis is desired; general overviews may not require it.
- Best fit: specific, well-scoped questions needing current web info
- Retrieval and source targeting improve quality for constrained queries
- Trained for longer, more comprehensive outputs than typical chat models
- Example: shopping/fashion searches with many constraints and up-to-date availability
25:35 – 26:54
Latency and “how long to think”: toward adjustable depth and faster modes
Sarah raises the friction of multi-minute wait times compared to instant tools. Isa agrees there’s a missing middle ground (more than search, less than deep research) and argues the model should infer the right thinking time rather than forcing users to choose. She notes that training so far has been optimized for maximum thinking time.
- Current tradeoff: depth and tool use increase latency
- Need an intermediate mode between quick search and deep research
- Ideally the model decides how long to think; user toggles may be poor UX
- Future releases aim to fill the speed/depth gap
26:54 – 30:45
Predictions: day-long research, unified coworker-like agents, and rapid progress
Isa imagines Deep Research scaling to projects that take humans days or weeks, such as extended research or thesis-level synthesis, once context, tooling, and safety improve. Looking a year out, she expects more unified agents that can both code (e.g., open PRs) and handle personal tasks (e.g., trip booking), with interfaces that feel like a coworker you can message and collaborate with.
- Long-horizon tasks require breakthroughs in context/memory and efficiency
- Potential: hours of compute could replace days of expert work; day-long runs could replace weeks
- Near-term hope: trusted coding agents that produce PRs plus broader life tasks
- Unified interface reduces cognitive overhead: one agent with state and memory, like a coworker