Episode 15 - Inside the Model Spec

The more AI can do, the more we need to ask what it should and shouldn’t do. In this episode, OpenAI researcher Jason Wolfe joins host Andrew Mayne to talk about the Model Spec, the public framework that defines intended model behavior. They discuss how the Model Spec works in practice, including how the chain of command handles conflicts between instructions, and how OpenAI evolves it based on feedback, real-world use, and new model capabilities. More on our approach to the Model Spec: https://openai.com/index/our-approach-to-the-model-spec/ Chapters 00:00 Introduction 01:10 What is the Model Spec? 03:55 How does the Model Spec work in practice? 06:26 Transparency: Where to read the Model Spec & give feedback 07:51 How did the Model Spec originate? 10:02 How does the spec translate into model behavior? 11:26 What is the hierarchy / chain of command? 13:35 Handling edge cases like Santa Claus 17:41 How does the Model Spec evolve over time? 19:59 What happens when models disagree with the spec? 22:05 How do smaller models follow the spec? 23:16 Is chain-of-thought useful for alignment? 24:16 Model Spec vs Anthropic’s Constitution 26:28 What surprised you most? 26:56 How do you define the scope of the spec? 27:44 What is the future of the Model Spec? 31:16 How should developers think about the spec? 34:44 Asimov’s laws vs Model Spec 37:16 Could AI write a Human Spec?

Andrew MaynehostJason Wolfeguest

Mar 25, 202637mWatch on YouTube ↗

WHAT IT’S REALLY ABOUT

OpenAI’s Model Spec: transparency, policy hierarchy, and alignment practice today

The Model Spec is a human-readable, public document describing OpenAI’s intended model behavior, not a guarantee that models perfectly comply today or a full description of the entire ChatGPT system.
In practice, the spec combines high-level goals, policy details, and many examples to resolve tricky boundary cases while preserving user steerability where possible.
A core mechanism is the “chain of command,” which prioritizes OpenAI instructions over developer instructions over user instructions, while assigning “authority levels” so many policies can remain overridable by users.
The spec and model behavior co-evolve: capability changes, new product surfaces (multimodal, agents, under-18 mode), and real-world incidents drive updates, alongside training interventions and spec-wide evaluations.
Wolfe argues transparency matters (open-source spec, public access, feedback loops) and that techniques like deliberative alignment and chain-of-thought inspection can improve understanding of compliance and detect strategic deception.

IDEAS WORTH REMEMBERING

5 ideas

The Model Spec is an expectations contract, not an implementation manual.

Wolfe emphasizes the spec is primarily for humans—users, developers, policymakers—to understand intended behavior; it doesn’t attempt to document every system component (e.g., product features like memory, policy enforcement layers).

Alignment is iterative: the spec is a “North Star” that can lead current model behavior.

OpenAI expects gaps between written intent and model outputs because training is complex and outputs are non-deterministic; they close gaps via training interventions, evals, and sometimes revising the spec if the “violation” reflects a better policy.

Conflict resolution is central, so the spec encodes a chain-of-command hierarchy.

When instructions conflict, the model should prefer OpenAI-level policies over developer messages over user prompts, but OpenAI tries to keep many policies low-authority so users can override defaults (tone/style) without breaking safety boundaries.

Examples are how you make abstract principles operational.

Because many decisions are ambiguous (e.g., honesty vs kindness), the spec uses borderline scenarios and idealized responses to clarify decision boundaries and convey nuanced “how to talk” guidance that’s hard to formalize.

Honesty can conflict with other values, and hidden interactions can create dangerous behavior.

A key surprise was confidentiality interacting with developer goals in a way that could encourage covert pursuit of developer intent; OpenAI revised the spec so honesty more clearly outranks confidentiality to avoid incentives for deceptive behavior.

WORDS WORTH SAVING

5 quotes

The spec is our attempt to explain the high-level decisions we’ve made about how our models should behave.

— Jason Wolfe

The goal is always primarily to be understandable to humans.

— Jason Wolfe

At sort of the heart of the spec is this thing we call the chain of command.

— Jason Wolfe

The spec often leads where our models actually are today.

— Jason Wolfe

You can look at the chain of thought and see that no, actually the model’s misbehaving.

— Jason Wolfe

Definition and non-goals of the Model SpecPractical structure: goals, policies, and examplesTransparency, open-source access, and feedback channelsOrigins: limits of RLHF and “employee handbook” framingTraining translation: deliberative alignment and indirect linkageChain of command and authority levelsEdge cases: honesty vs politeness/confidentiality (Santa/Tooth Fairy)Spec evolution: new modalities, agents, under-18 mode, incidentsModel-spec evals and handling spec/model mismatchesSmall models and reasoning models’ complianceChain-of-thought for alignment vs detecting schemingComparison to Anthropic’s “constitution/soul spec”Developer guidance: writing actionable mini-specs (agents.md)Asimov’s laws analogy and conflict resolution

High quality AI-generated summary created from speaker-labeled transcript.

Get more out of YouTube videos.

High quality summaries for YouTube videos. Accurate transcripts to search & find moments. Powered by ChatGPT & Claude AI.