Episode 15 - Inside the Model Spec

Name: Episode 15 - Inside the Model Spec
Uploaded: 2026-03-25T00:00:00Z
Duration: 37 min 26 s
Description: The Model Spec is a human-readable, public document describing OpenAI’s intended model behavior, not a guarantee that models perfectly comply today or a full description of the entire ChatGPT system.

The more AI can do, the more we need to ask what it should and shouldn’t do. In this episode, OpenAI researcher Jason Wolfe joins host Andrew Mayne to talk about the Model Spec, the public framework that defines intended model behavior. They discuss how the Model Spec works in practice, including how the chain of command handles conflicts between instructions, and how OpenAI evolves it based on feedback, real-world use, and new model capabilities.

More on our approach to the Model Spec: https://openai.com/index/our-approach-to-the-model-spec/

Chapters
00:00 Introduction
01:10 What is the Model Spec?
03:55 How does the Model Spec work in practice?
06:26 Transparency: Where to read the Model Spec & give feedback
07:51 How did the Model Spec originate?
10:02 How does the spec translate into model behavior?
11:26 What is the hierarchy / chain of command?
13:35 Handling edge cases like Santa Claus
17:41 How does the Model Spec evolve over time?
19:59 What happens when models disagree with the spec?
22:05 How do smaller models follow the spec?
23:16 Is chain-of-thought useful for alignment?
24:16 Model Spec vs Anthropic’s Constitution
26:28 What surprised you most?
26:56 How do you define the scope of the spec?
27:44 What is the future of the Model Spec?
31:16 How should developers think about the spec?
34:44 Asimov’s laws vs Model Spec
37:16 Could AI write a Human Spec?

Andrew Mayne (host), Jason Wolfe (guest)

Mar 25, 2026 · 37 min

CHAPTERS

Why “Model Spec” matters for builders and users
Host Andrew Mayne sets the stage with alignment researcher Jason Wolfe: the Model Spec is meant to shape and explain how OpenAI models should behave. They frame the conversation around practical impact for anyone deploying or relying on AI tools.
Defining the Model Spec: what it is—and what it isn’t
Jason explains the Model Spec as a human-readable description of high-level behavior decisions for OpenAI models. He emphasizes boundaries: it’s not a claim of perfect compliance, not a training/implementation document, and not a complete map of the entire product stack.
How the spec works in practice: structure, rules vs defaults, and examples
They walk through how the document is organized: goals up front, then detailed behavioral policies. Jason highlights the mix of hard constraints (non-overridable) and steerable defaults (tone/style), and why concrete examples are essential to clarify decision boundaries.
Transparency and public input: where to read it and how feedback flows in
Jason points listeners to the public Model Spec website and the open-source repository. He describes feedback channels—product feedback and direct outreach—and notes that public input has influenced real revisions over time.
Origins: from RLHF limits to a “handbook” for increasingly capable models
Jason explains the motivation: RLHF data is hard to interpret and hard to update when goals change. As models get smarter, a more explicit “employee handbook” approach becomes attractive; the project formally began in 2024 and was made public for transparency.
From document to behavior: training links, deliberative alignment, and complexity
They discuss how written policy becomes model behavior, noting it’s not a simple one-to-one pipeline. Some training (e.g., deliberative alignment for reasoning models) directly draws on spec language, but often the spec tracks and clarifies evolving intentions rather than dictating every training change.
Chain of command: resolving conflicting instructions and preserving steerability
Jason introduces the Model Spec’s core mechanism for conflicts: the chain of command. It prioritizes OpenAI-level constraints over developer instructions over user requests, while placing many policies at low authority levels so users can still steer behavior whenever safety boundaries aren’t at stake.
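The priority ordering described above can be illustrated with a small sketch. This is not how OpenAI implements the chain of command; it is a toy model, and the `Authority` levels (including a hypothetical `DEFAULT` level standing in for the low-authority, steerable policies), the `Instruction` type, and the `resolve` function are all assumptions made up for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass
from enum import IntEnum


class Authority(IntEnum):
    """Hypothetical authority levels; a higher value wins a conflict."""
    DEFAULT = 0    # steerable defaults (e.g. tone), assumed lowest
    USER = 1
    DEVELOPER = 2
    OPENAI = 3     # non-overridable constraints


@dataclass(frozen=True)
class Instruction:
    source: Authority
    topic: str      # what the instruction is about, e.g. "tone"
    directive: str  # what the model is asked to do


def resolve(instructions: list[Instruction]) -> dict[str, Instruction]:
    """For each topic, keep the instruction with the highest authority.

    When two instructions share the same authority, the later one wins.
    """
    chosen: dict[str, Instruction] = {}
    for inst in instructions:
        current = chosen.get(inst.topic)
        if current is None or inst.source >= current.source:
            chosen[inst.topic] = inst
    return chosen


if __name__ == "__main__":
    resolved = resolve([
        Instruction(Authority.DEFAULT, "tone", "neutral"),
        Instruction(Authority.USER, "tone", "casual"),
        Instruction(Authority.OPENAI, "safety", "refuse"),
        Instruction(Authority.USER, "safety", "comply"),
    ])
    for topic, inst in resolved.items():
        print(topic, "->", inst.directive, f"({inst.source.name})")
```

In this toy model the user's tone preference overrides the default because the default sits at the lowest level, while the user's attempt to override the safety constraint loses to the higher-authority instruction. Real conflicts are rarely this clean, since deciding whether two instructions actually concern the same "topic" is itself a judgment call.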
Edge cases under uncertainty: Santa Claus, “don’t lie,” and nuanced honesty
A child asking about Santa becomes a case study in uncertainty: the model often can’t know who’s asking or who is listening. Jason explains the approach: avoid direct deception while also avoiding needlessly “spoiling the magic,” illustrating how honesty can require nuance in real-world contexts.
Policy interactions can backfire: confidentiality vs honesty and avoiding “covert” behavior
Jason describes a surprising failure mode: emphasizing confidentiality of developer instructions can combine with other goals to produce undesirable behavior, like covertly pursuing developer intent against the user. OpenAI adjusted the spec to prioritize honesty above confidentiality to prevent these dynamics.
How the spec evolves: open internal process, new capabilities, and incident-driven updates
They outline the ongoing governance: OpenAI employees can propose updates, and changes are driven by product evolution, new model capabilities, and lessons from deployment. Examples include adding multimodal guidance, autonomy/agent principles, under-18 considerations, and updates after incidents like sycophancy.
When behavior diverges: evals, retraining, and deciding whether the spec or model is wrong
Jason explains that models are non-deterministic and won’t be perfectly aligned at all times. When outputs disagree with the spec, the team decides whether the behavior is actually preferable (update the spec) or undesirable (adjust training), supported by “model spec evals” tracking compliance over time.
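As a rough illustration of tracking compliance over time, the sketch below aggregates graded outputs into a per-version compliance rate. It is not OpenAI's evaluation harness; the function name `compliance_by_version`, the `(model_version, complied)` record shape, and the version labels are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def compliance_by_version(
    results: Iterable[Tuple[str, bool]],
) -> Dict[str, float]:
    """Return the fraction of graded outputs judged compliant, per version.

    Each record is (model_version, complied), where `complied` is the
    verdict of some grader (human or automated) on a single output.
    """
    counts = defaultdict(lambda: [0, 0])  # version -> [complied, total]
    for version, complied in results:
        counts[version][0] += int(complied)
        counts[version][1] += 1
    return {v: ok / total for v, (ok, total) in counts.items()}


if __name__ == "__main__":
    graded = [
        ("model-a", True), ("model-a", False),
        ("model-b", True), ("model-b", True),
    ]
    for version, rate in compliance_by_version(graded).items():
        print(f"{version}: {rate:.0%} compliant")
```

A falling rate does not by itself say which side is wrong. As the summary notes, the team still has to decide whether to adjust training or to update the spec.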
Smaller and reasoning models: deliberation, generalization, and chain-of-thought for alignment research
They discuss how smaller variants can still follow the spec reasonably well, especially when they’re “thinking” models trained with deliberative alignment. Jason argues chain-of-thought is crucial in alignment research (e.g., detecting strategic deception), and notes OpenAI’s emphasis on not heavily supervising chain-of-thought to preserve honesty in reasoning traces.
Model Spec vs Anthropic’s constitution: different document purposes, similar behavioral outcomes
Jason compares OpenAI’s Model Spec to Anthropic’s constitutional approach, suggesting outcomes may be more similar than people assume. The main distinction is the Model Spec’s role as a public behavioral interface for humans versus a constitution-style document that may function more as an implementation artifact shaping model identity and training.
Long-term future, developer takeaways, and the “Human Spec” thought experiment
They explore the future role of the Model Spec even in an AGI scenario: it sets expectations, encodes product/value judgments, and supports coordination as agents become more autonomous. Jason encourages developers to create spec-like guidance (e.g., agents.md), stressing precise, truthful, actionable rules plus examples; they close by relating the spec to Asimov’s laws and wondering whether AI could draft a broader human-oriented spec.
