Dwarkesh Podcast

John Schulman (OpenAI Cofounder) — Reasoning, RLHF, & plan for 2027 AGI

John Schulman on how post-training tames the shoggoth, and the nature of the progress to come...

EPISODE LINKS
* Apple Podcasts: https://podcasts.apple.com/us/podcast/john-schulman-openai-cofounder-reasoning-rlhf-plan/id1516093381?i=1000655679622
* Spotify: https://open.spotify.com/episode/1ivzHH9RWciXe4O1rKtldf?si=53503781e05f4d8f
* Transcript: https://www.dwarkeshpatel.com/p/john-schulman/
* Me on Twitter: https://twitter.com/dwarkesh_sp/

SPONSOR
* CommandBar is an AI user assistant that any software product can embed to non-annoyingly assist, support, and unleash their users. Used by forward-thinking CX, product, growth, and marketing teams. Learn more at https://www.commandbar.com/

If you’re interested in advertising on the podcast, fill out this form: https://airtable.com/appxGOvFLDLP5dlzv/pagFVrbHRohW6F2bZ/form

TIMESTAMPS
00:00:00 - Pre-training, post-training, and future capabilities
00:17:20 - Plan for AGI 2025
00:29:43 - Teaching models to reason
00:40:10 - The Road to ChatGPT
00:51:33 - What makes for a good RL researcher?
01:00:18 - Keeping humans in the loop
01:14:36 - State of research, plateaus, and moats

John Schulman (guest), Dwarkesh Patel (host)
May 15, 2024 · 1h 35m

CHAPTERS

  1. 0:00 – 2:44

    Pre-training vs. post-training: what each stage actually builds

    Schulman contrasts pre-training (internet-scale next-token prediction) with post-training (narrowing behavior toward a helpful assistant). He emphasizes calibration, persona flexibility in base models, and how post-training optimizes for human usefulness rather than web imitation.

    • Pre-training as next-token prediction over broad web/code data
    • Base models as calibrated probability estimators that can adopt many personas
    • Post-training targets a narrower assistant-like behavior
    • Objective shift: likelihood maximization → human-preference/usefulness optimization
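
The objective shift in the last bullet can be made concrete with a toy sketch. The first function is the standard next-token negative log-likelihood. The second is a pairwise (Bradley-Terry style) preference loss, which is one common way to formalize "human preference" and is an assumption here, not something the episode summary specifies. The vocabulary, probabilities, and reward values are all hypothetical.

```python
import math

def next_token_nll(probs, target):
    """Pre-training style loss: negative log-likelihood of the observed next token."""
    return -math.log(probs[target])

def preference_loss(reward_chosen, reward_rejected):
    """Pairwise preference loss: -log sigmoid(reward_chosen - reward_rejected).

    Assumed formulation (Bradley-Terry); low when the preferred
    response is scored higher than the rejected one.
    """
    margin = reward_chosen - reward_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# Hypothetical model output: a distribution over a 3-token vocabulary.
predicted = {"cat": 0.7, "dog": 0.2, "car": 0.1}

loss_likely = next_token_nll(predicted, "cat")    # token the model expected
loss_unlikely = next_token_nll(predicted, "car")  # token the model found surprising

# Hypothetical scalar rewards for two candidate responses.
loss_agrees = preference_loss(2.0, 0.0)     # scores agree with the human label
loss_disagrees = preference_loss(0.0, 2.0)  # scores contradict the human label
```

The first loss rewards imitating whatever text appears in the data, while the second only cares about which of two responses a rater preferred.
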
  2. 2:44 – 6:50

    Near-term capability jumps: from single-step help to full coding projects

    The discussion turns to what models may do in 1–2 years, especially executing multi-file, iterative coding projects. Schulman points to training for longer tasks and better error recovery as central unlocks.

    • Models evolving from suggestions to end-to-end project execution
    • Training data today is mostly short-horizon; longer-horizon training is needed
    • RL or iterative supervision could teach multi-step project completion
    • Improved robustness: recovering from errors and edge cases
  3. 6:50 – 11:53

    Long-horizon tasks, phase transitions, and what could still be missing

    Dwarkesh probes whether long-horizon coherence will scale smoothly or via phase transitions. Schulman suggests capabilities may generalize across time scales, but warns that long-horizon training won’t automatically fix all deficits.

    • Longer-horizon tasks likely cost more and require more intelligence
    • Scaling may show phase transitions rather than smooth laws
    • Humans plan across timescales using shared ‘machinery’; models might too
    • Other deficits may remain: decision quality, getting stuck, ambiguity handling
  4. 11:53 – 13:23

    Multimodal agents and AI-friendly interfaces: will the web change?

    They discuss whether future multimodal models will need redesigned UIs. Schulman expects models to use human-designed interfaces via vision, though AI-optimized text representations and clearer interaction affordances could help.

    • Vision improvements may let models operate existing human UIs
    • Some services may add AI-oriented UX (clean text views, explicit interactables)
    • Unclear need for ubiquitous APIs; models may adapt to current web
    • Mundane barriers (access, tool affordances) may slow early progress
  5. 13:23 – 17:20

    Evidence for generalization: cross-language transfer and ‘knowing limitations’

    Schulman gives concrete examples where small post-training changes generalize widely. He highlights cross-lingual improvements from English-only fine-tuning and early ChatGPT work that taught the model to admit limitations with very few examples.

    • English-only post-training often improves behavior in other languages
    • Text-only tuning can sometimes improve multimodal behavior
    • Tiny datasets can fix broad classes of mistakes (e.g., false tool claims)
    • Generalization reduces the need for exhaustive domain-specific examples
  6. 17:20 – 23:44

    If AGI arrives soon: slowdown, coordination, and release strategy

    Dwarkesh pushes a ‘what if AGI next year’ scenario; Schulman argues for caution: pausing further training/deployment, sandboxing, and staged rollout. He notes race dynamics could force risky behavior without coordination among major labs.

    • If AGI comes early, safer posture may require slowing training/deployment
    • Sandboxing and limiting scale of deployment as risk controls
    • Game theory: coordination needed to avoid racing and safety compromises
    • Hard question remains: what exactly the world ‘waits for’ after a pause
  7. 23:44 – 29:42

    How you’d build confidence: incremental deployment, red-teaming, monitoring

    Schulman prefers continuous, incremental releases paired with proportional safety improvements rather than a single discontinuous jump. For jumpy capability gains, he argues for simulated deployments, adversarial testing, and defense-in-depth monitoring.

    • Incremental capability increases are safer than lock-down + big release
    • Simulated deployment and strong red-teaming in unfavorable conditions
    • Real-time monitoring/oversight systems to detect trouble quickly
    • Defense in depth: model behavior + external safeguards
  8. 29:42 – 31:26

    RLHF as a ‘drive’: incentives, instrumental convergence, and tool-using agents

    They explore what RLHF corresponds to psychologically—something like a goal to maximize approval. Schulman notes current RLHF feels safer because it optimizes for producing approved text, but tool-using long-horizon agents may create stranger incentives.

    • RLHF can be analogized to a drive/goal toward preferred states
    • Current RLHF: optimize for human-approved text, limited world incentives
    • Tool-using long-horizon action sequences can create unexpected behaviors
    • Instrumental convergence concerns depend heavily on task specification
  9. 31:26 – 32:52

    Teaching models to reason: training-time practice plus test-time computation

    Dwarkesh contrasts training models to pick good chains of thought vs. spending compute at inference time for self-talk. Schulman argues reasoning inherently needs step-by-step test-time computation, but benefits most from combining training-time practice with inference-time deliberation.

    • Reasoning as tasks requiring step-by-step test-time computation
    • Training-time ‘practice’ can improve the quality of reasoning traces
    • Best approach likely combines training and inference compute
    • Framing: deliberation is both learned and executed at runtime
  10. 32:52 – 38:16

    The ‘middle ground’ between pre-training and in-context learning: online learning & memory

    They discuss what’s missing between trillion-token pre-training and ephemeral in-context learning: systems that learn during tasks, retain medium-term memory, and actively seek information to fill knowledge gaps. Schulman expects online learning and introspective active learning to become more important for long-horizon agents.

    • Need for medium-term adaptation that persists beyond one session
    • Long-context helps but won’t replace online learning/fine-tuning
    • Models could introspect: detect knowledge gaps and seek targeted info
    • Calibration about uncertainty may enable more deliberate active learning
  11. 38:16 – 40:10

    Do classic RL algorithms still matter? Toward learned search over policy gradients

    Dwarkesh asks whether finicky RL methods will remain relevant as models get smarter. Schulman argues policy gradients are sample-inefficient for fast adaptation; he expects more ‘learned search’ and in-context exploration strategies to dominate for many tasks.

    • Policy gradients likely too sample-inefficient for rapid test-time learning
    • Analogy: motor learning may resemble policy gradients but is slow
    • Future agents may use learned exploration/search algorithms
    • In-context learning as a ‘learned algorithm’ for trying possibilities well
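
To illustrate why policy gradients are described as sample-inefficient, here is a minimal REINFORCE sketch on a two-armed bandit. It is a toy under stated assumptions: the reward values, learning rate, baseline update, and function names are all hypothetical, and it is not a description of any method used at OpenAI. Even on this trivial problem the policy is nudged by one sampled action at a time, so it takes many trials to commit to the better arm.

```python
import math
import random

def softmax(logits):
    """Convert logits to probabilities (shifted by the max for numerical stability)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def reinforce_bandit(arm_rewards, steps=2000, lr=0.1, seed=0):
    """Run REINFORCE with a running-average baseline on a deterministic bandit."""
    rng = random.Random(seed)
    logits = [0.0] * len(arm_rewards)
    baseline = 0.0
    for _ in range(steps):
        probs = softmax(logits)
        # Sample one action from the current policy.
        action = rng.choices(range(len(probs)), weights=probs)[0]
        reward = arm_rewards[action]
        advantage = reward - baseline
        # Track average reward to reduce gradient variance.
        baseline += 0.05 * (reward - baseline)
        # Gradient of log pi(action) w.r.t. each logit is (indicator - prob).
        for a in range(len(logits)):
            grad_log_prob = (1.0 if a == action else 0.0) - probs[a]
            logits[a] += lr * advantage * grad_log_prob
    return softmax(logits)

# Hypothetical rewards: arm 1 is clearly better than arm 0.
final_probs = reinforce_bandit([0.2, 1.0])
```

A learned search or in-context strategy, by contrast, could in principle try each arm once and then commit, which is the kind of gap the chapter points at.
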
  12. 40:10 – 49:30

    The road to ChatGPT: from instruction-following to conversational alignment

    Schulman recounts the lineage: instruction-following models were easier than base prompting; chat became compelling via WebGPT and the need for clarifications. Mixing instruction and chat data yielded a more coherent, reliable assistant persona and improved handling of limitations/hallucinations.

    • Early focus: instruction-following to reduce prompt engineering burden
    • Chat motivation: Q&A naturally needs follow-ups and clarifying questions
    • ChatGPT built on GPT-3.5; browsing was explored but de-emphasized
    • Chat format made labeling and desired behavior more intuitive/coherent
  13. 49:30 – 52:54

    Post-training scaling: why it matters, who excels at it, and why it’s a moat

    They discuss shifting compute toward post-training as model-generated outputs surpass average web quality. Schulman describes what makes strong RL/post-training researchers (whole-stack intuition, empirical rigor, first-principles thinking) and argues the complexity and tacit knowledge create a partial moat—though distillation can erode it.

    • Argument for more post-training compute: model outputs can exceed web quality
    • GPT-4’s ‘Elo’ gains attributed largely to post-training iterations
    • Good post-training researchers understand algorithms + data/labeling pipeline
    • Moat: tacit organizational know-how; counterforce: distillation/cloning outputs
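
For readers unfamiliar with Elo, the sketch below shows how a rating gap maps to an expected head-to-head win rate under the standard Elo formula. The ratings used are hypothetical and are not figures from the episode summary.

```python
def elo_expected_score(rating_a, rating_b):
    """Expected score (win probability, counting draws as half) of A against B."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))

# Hypothetical ratings: model A sits 100 points above model B.
expected_win_rate = elo_expected_score(1100, 1000)
```

Under this formula a 100-point gap corresponds to an expected win rate of roughly 64%, and equal ratings give exactly 50%.
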
  14. 52:54 – 1:06:00

    Plateaus, data walls, and the state of research: incentives, transfer, and raters

    Dwarkesh raises the ‘plateau’ hypothesis and asks about transfer across domains (code↔reasoning) and data limitations. Schulman cautions that training cadence makes conclusions premature, notes ablations are hard at frontier scale, comments on ML literature health, and explains rater populations and how generalization reduces need for domain-expert labels.

    • Don’t over-interpret lack of visible releases since GPT-4; cycles are long
    • Data walls may change pre-training over time, but not an immediate stop sign
    • Transfer/ablations are hard to study at GPT-4 scale; small-scale results may not extrapolate
    • ML literature: generally healthier than social sciences, but needs more ‘science’ vs benchmark hill-climbing
    • Raters vary internationally by task; generalization often reduces need for narrow expert labels
  15. 1:06:00 – 1:32:11

    Keeping humans in the loop: regulation, alignment tradeoffs, and the model spec

    They debate whether human oversight can survive competitive pressures; Schulman suggests regulation or provider-level constraints may be needed. He explains alignment as balancing stakeholders (users, developers, platform, broader public) and introduces OpenAI’s model spec as an actionable guide for resolving edge cases without being overly paternalistic.

    • Competitive dynamics may push toward fully AI-run firms unless constrained
    • Possible levers: regulation across countries or agreements among providers
    • Alignment requires stakeholder tradeoffs, not just ‘do what the user wants’
    • Model spec aims to be operational: concrete edge cases over vague principles
    • Goal: helpful, instruction-following behavior while preventing harm/misuse
  16. 1:32:11 – 1:35:50

    Agentic future form factors and timelines: proactive collaborators that may replace jobs

    Schulman expects assistants that can see screens, maintain project context, and work proactively in the background—moving beyond search-engine-style one-off queries. He closes with a personal estimate that systems could replace much of his job within about five years.

    • Agents likely become screen-aware and integrated into daily workflows
    • Uncertain best UI: on-device ‘Clippy’ vs cloud colleague
    • Key missing capability: proactivity plus persistent project understanding
    • Job-replacement median guess: ~5 years
