John Schulman (OpenAI Cofounder) — Reasoning, RLHF, & plan for 2027 AGI

John Schulman on how posttraining tames the shoggoth, and the nature of the progress to come...

EPISODE LINKS
* Apple Podcasts: https://podcasts.apple.com/us/podcast/john-schulman-openai-cofounder-reasoning-rlhf-plan/id1516093381?i=1000655679622
* Spotify: https://open.spotify.com/episode/1ivzHH9RWciXe4O1rKtldf?si=53503781e05f4d8f
* Transcript: https://www.dwarkeshpatel.com/p/john-schulman/
* Me on Twitter: https://twitter.com/dwarkesh_sp/

SPONSOR
* CommandBar is an AI user assistant that any software product can embed to non-annoyingly assist, support, and unleash their users. Used by forward-thinking CX, product, growth, and marketing teams. Learn more at https://www.commandbar.com/

If you’re interested in advertising on the podcast, fill out this form: https://airtable.com/appxGOvFLDLP5dlzv/pagFVrbHRohW6F2bZ/form

TIMESTAMPS
00:00:00 - Pre-training, post-training, and future capabilities
00:17:20 - Plan for AGI 2025
00:29:43 - Teaching models to reason
00:40:10 - The Road to ChatGPT
00:51:33 - What makes for a good RL researcher?
01:00:18 - Keeping humans in the loop
01:14:36 - State of research, plateaus, and moats

John Schulman (guest), Dwarkesh Patel (host)

May 15, 2024 · 1h 35m

CHAPTERS

0:00 – 2:44
Pre-training vs. post-training: what each stage actually builds
Schulman contrasts pre-training (internet-scale next-token prediction) with post-training (narrowing behavior toward a helpful assistant). He emphasizes calibration, persona flexibility in base models, and how post-training optimizes for human usefulness rather than web imitation.
- Pre-training as next-token prediction over broad web/code data
- Base models as calibrated probability estimators that can adopt many personas
- Post-training targets a narrower assistant-like behavior
- Objective shift: likelihood maximization → human-preference/usefulness optimization
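The objective shift in the last bullet can be written out as a rough sketch. The formulas below are the standard textbook formulation from the RLHF literature, not something stated in the episode; the symbols (policy $\pi_\theta$, reference model $\pi_{\text{ref}}$, reward model $r_\phi$, preferred and rejected responses $y_w$ and $y_l$, KL weight $\beta$) are assumptions introduced here for illustration.

```latex
% Pre-training: minimize next-token negative log-likelihood over a corpus
\mathcal{L}_{\text{pre}}(\theta) = -\sum_{t=1}^{T} \log p_\theta(x_t \mid x_{<t})

% Reward model fit to pairwise human preferences (Bradley-Terry form)
P(y_w \succ y_l \mid x) = \sigma\big(r_\phi(x, y_w) - r_\phi(x, y_l)\big)

% Post-training: maximize learned reward with a KL penalty toward the reference model
\max_{\theta}\; \mathbb{E}_{x,\, y \sim \pi_\theta(\cdot \mid x)}\big[r_\phi(x, y)\big]
  - \beta\, \mathrm{KL}\big(\pi_\theta(\cdot \mid x)\,\|\,\pi_{\text{ref}}(\cdot \mid x)\big)
```

The first objective rewards imitating whatever text appears in the corpus; the third rewards outputs that a learned proxy for human preference scores highly, while the KL term keeps the policy close to its starting point.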
2:44 – 6:50
Near-term capability jumps: from single-step help to full coding projects
The discussion turns to what models may do in 1–2 years, especially executing multi-file, iterative coding projects. Schulman points to training for longer tasks and better error recovery as central unlocks.
- Models evolving from suggestions to end-to-end project execution
- Training data today is mostly short-horizon; longer-horizon training is needed
- RL or iterative supervision could teach multi-step project completion
- Improved robustness: recovering from errors and edge cases
6:50 – 11:53
Long-horizon tasks, phase transitions, and what could still be missing
Dwarkesh probes whether long-horizon coherence will scale smoothly or via phase transitions. Schulman suggests capabilities may generalize across time scales, but warns that long-horizon training won’t automatically fix all deficits.
- Longer-horizon tasks likely cost more and require more intelligence
- Scaling may show phase transitions rather than smooth laws
- Humans plan across timescales using shared ‘machinery’; models might too
- Other deficits may remain: decision quality, getting stuck, ambiguity handling
11:53 – 13:23
Multimodal agents and AI-friendly interfaces: will the web change?
They discuss whether future multimodal models will need redesigned UIs. Schulman expects models to use human-designed interfaces via vision, though AI-optimized text representations and clearer interaction affordances could help.
- Vision improvements may let models operate existing human UIs
- Some services may add AI-oriented UX (clean text views, explicit interactables)
- Unclear need for ubiquitous APIs; models may adapt to current web
- Mundane barriers (access, tool affordances) may slow early progress
13:23 – 17:20
Evidence for generalization: cross-language transfer and ‘knowing limitations’
Schulman gives concrete examples where small post-training changes generalize widely. He highlights cross-lingual improvements from English-only fine-tuning and early ChatGPT work that taught the model to admit limitations with very few examples.
- English-only post-training often improves behavior in other languages
- Text-only tuning can sometimes improve multimodal behavior
- Tiny datasets can fix broad classes of mistakes (e.g., false tool claims)
- Generalization reduces the need for exhaustive domain-specific examples
17:20 – 23:44
If AGI arrives soon: slowdown, coordination, and release strategy
Dwarkesh pushes a ‘what if AGI next year’ scenario; Schulman argues for caution: pausing further training/deployment, sandboxing, and staged rollout. He notes race dynamics could force risky behavior without coordination among major labs.
- If AGI comes early, safer posture may require slowing training/deployment
- Sandboxing and limiting scale of deployment as risk controls
- Game theory: coordination needed to avoid racing and safety compromises
- Hard question remains: what exactly the world ‘waits for’ after a pause
23:44 – 29:42
How you’d build confidence: incremental deployment, red-teaming, monitoring
Schulman prefers continuous, incremental releases paired with proportional safety improvements rather than a single discontinuous jump. For jumpy capability gains, he argues for simulated deployments, adversarial testing, and defense-in-depth monitoring.
- Incremental capability increases are safer than lock-down + big release
- Simulated deployment and strong red-teaming in unfavorable conditions
- Real-time monitoring/oversight systems to detect trouble quickly
- Defense in depth: model behavior + external safeguards
29:42 – 31:26
RLHF as a ‘drive’: incentives, instrumental convergence, and tool-using agents
They explore what RLHF corresponds to psychologically—something like a goal to maximize approval. Schulman notes current RLHF feels safer because it optimizes for producing approved text, but tool-using long-horizon agents may create stranger incentives.
- RLHF can be analogized to a drive/goal toward preferred states
- Current RLHF: optimize for human-approved text, limited world incentives
- Tool-using long-horizon action sequences can create unexpected behaviors
- Instrumental convergence concerns depend heavily on task specification
31:26 – 32:52
Teaching models to reason: training-time practice plus test-time computation
Dwarkesh contrasts training models to pick good chains of thought vs. spending compute at inference time for self-talk. Schulman argues reasoning inherently needs step-by-step test-time computation, but benefits most from combining training-time practice with inference-time deliberation.
- Reasoning as tasks requiring step-by-step test-time computation
- Training-time ‘practice’ can improve the quality of reasoning traces
- Best approach likely combines training and inference compute
- Framing: deliberation is both learned and executed at runtime
32:52 – 38:16
The ‘middle ground’ between pre-training and in-context learning: online learning & memory
They discuss what’s missing between trillion-token pre-training and ephemeral in-context learning: systems that learn during tasks, retain medium-term memory, and actively seek information to fill knowledge gaps. Schulman expects online learning and introspective active learning to become more important for long-horizon agents.
- Need for medium-term adaptation that persists beyond one session
- Long-context helps but won’t replace online learning/fine-tuning
- Models could introspect: detect knowledge gaps and seek targeted info
- Calibration about uncertainty may enable more deliberate active learning
38:16 – 40:10
Do classic RL algorithms still matter? Toward learned search over policy gradients
Dwarkesh asks whether finicky RL methods will remain relevant as models get smarter. Schulman argues policy gradients are sample-inefficient for fast adaptation; he expects more ‘learned search’ and in-context exploration strategies to dominate for many tasks.
- Policy gradients likely too sample-inefficient for rapid test-time learning
- Analogy: motor learning may resemble policy gradients but is slow
- Future agents may use learned exploration/search algorithms
- In-context learning as a ‘learned algorithm’ for trying possibilities well
40:10 – 49:30
The road to ChatGPT: from instruction-following to conversational alignment
Schulman recounts the lineage: instruction-following models were easier than base prompting; chat became compelling via WebGPT and the need for clarifications. Mixing instruction and chat data yielded a more coherent, reliable assistant persona and improved handling of limitations/hallucinations.
- Early focus: instruction-following to reduce prompt engineering burden
- Chat motivation: Q&A naturally needs follow-ups and clarifying questions
- ChatGPT built on GPT-3.5; browsing was explored but de-emphasized
- Chat format made labeling and desired behavior more intuitive/coherent
49:30 – 52:54
Post-training scaling: why it matters, who excels at it, and why it’s a moat
They discuss shifting compute toward post-training as model-generated outputs surpass average web quality. Schulman describes what makes strong RL/post-training researchers (whole-stack intuition, empirical rigor, first-principles thinking) and argues the complexity and tacit knowledge create a partial moat—though distillation can erode it.
- Argument for more post-training compute: model outputs can exceed web quality
- GPT-4’s ‘Elo’ gains attributed largely to post-training iterations
- Good post-training researchers understand algorithms + data/labeling pipeline
- Moat: tacit organizational know-how; counterforce: distillation/cloning outputs
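As background on the ‘Elo’ bullet, the sketch below shows the standard Elo expected-score and update formulas, which is how a rating gap translates into a head-to-head win rate. It is a general illustration, not a description of how any particular model leaderboard is computed; the function names and the `k` factor are assumptions chosen for the example.

```python
def elo_expected_score(rating_a, rating_b):
    """Expected score (roughly, win probability) of A against B under Elo."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def elo_update(rating_a, rating_b, score_a, k=32.0):
    """Return updated (rating_a, rating_b) after one comparison.

    score_a is 1.0 if A was preferred, 0.0 if B was, 0.5 for a tie.
    """
    expected_a = elo_expected_score(rating_a, rating_b)
    delta = k * (score_a - expected_a)
    return rating_a + delta, rating_b - delta
```

Under these formulas, a 100-point rating advantage corresponds to an expected score of about 0.64, so the higher-rated system would be preferred in roughly 64% of pairwise comparisons.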
52:54 – 1:06:00
Plateaus, data walls, and the state of research: incentives, transfer, and raters
Dwarkesh raises the ‘plateau’ hypothesis and asks about transfer across domains (code↔reasoning) and data limitations. Schulman cautions that training cadence makes conclusions premature, notes ablations are hard at frontier scale, comments on ML literature health, and explains rater populations and how generalization reduces need for domain-expert labels.
- Don’t over-interpret lack of visible releases since GPT-4; cycles are long
- Data walls may change pre-training over time, but not an immediate stop sign
- Transfer/ablations are hard to study at GPT-4 scale; small-scale results may not extrapolate
- ML literature: generally healthier than social sciences, but needs more ‘science’ vs benchmark hill-climbing
- Raters vary internationally by task; generalization often reduces need for narrow expert labels
1:06:00 – 1:32:11
Keeping humans in the loop: regulation, alignment tradeoffs, and the model spec
They debate whether human oversight can survive competitive pressures; Schulman suggests regulation or provider-level constraints may be needed. He explains alignment as balancing stakeholders (users, developers, platform, broader public) and introduces OpenAI’s model spec as an actionable guide for resolving edge cases without being overly paternalistic.
- Competitive dynamics may push toward fully AI-run firms unless constrained
- Possible levers: regulation across countries or agreements among providers
- Alignment requires stakeholder tradeoffs, not just ‘do what the user wants’
- Model spec aims to be operational: concrete edge cases over vague principles
- Goal: helpful, instruction-following behavior while preventing harm/misuse
1:32:11 – 1:35:50
Agentic future form factors and timelines: proactive collaborators that may replace jobs
Schulman expects assistants that can see screens, maintain project context, and work proactively in the background—moving beyond search-engine-style one-off queries. He closes with a personal estimate that systems could replace much of his job within about five years.
- Agents likely become screen-aware and integrated into daily workflows
- Uncertain best UI: on-device ‘Clippy’ vs cloud colleague
- Key missing capability: proactivity plus persistent project understanding
- Job-replacement median guess: ~5 years
