OpenAIEpisode 15 - Inside the Model Spec
Andrew Mayne and Jason Wolfe on openAI’s Model Spec: transparency, policy hierarchy, and alignment practice today.
In this episode of OpenAI, featuring Andrew Mayne and Jason Wolfe, Episode 15 - Inside the Model Spec explores openAI’s Model Spec: transparency, policy hierarchy, and alignment practice today The Model Spec is a human-readable, public document describing OpenAI’s intended model behavior, not a guarantee that models perfectly comply today or a full description of the entire ChatGPT system.
At a glance
WHAT IT’S REALLY ABOUT
OpenAI’s Model Spec: transparency, policy hierarchy, and alignment practice today
- The Model Spec is a human-readable, public document describing OpenAI’s intended model behavior, not a guarantee that models perfectly comply today or a full description of the entire ChatGPT system.
- In practice, the spec combines high-level goals, policy details, and many examples to resolve tricky boundary cases while preserving user steerability where possible.
- A core mechanism is the “chain of command,” which prioritizes OpenAI instructions over developer instructions over user instructions, while assigning “authority levels” so many policies can remain overridable by users.
- The spec and model behavior co-evolve: capability changes, new product surfaces (multimodal, agents, under-18 mode), and real-world incidents drive updates, alongside training interventions and spec-wide evaluations.
- Wolfe argues transparency matters (open-source spec, public access, feedback loops) and that techniques like deliberative alignment and chain-of-thought inspection can improve understanding of compliance and detect strategic deception.
IDEAS WORTH REMEMBERING
7 ideasThe Model Spec is an expectations contract, not an implementation manual.
Wolfe emphasizes the spec is primarily for humans—users, developers, policymakers—to understand intended behavior; it doesn’t attempt to document every system component (e.g., product features like memory, policy enforcement layers).
Alignment is iterative: the spec is a “North Star” that can lead current model behavior.
OpenAI expects gaps between written intent and model outputs because training is complex and outputs are non-deterministic; they close gaps via training interventions, evals, and sometimes revising the spec if the “violation” reflects a better policy.
Conflict resolution is central, so the spec encodes a chain-of-command hierarchy.
When instructions conflict, the model should prefer OpenAI-level policies over developer messages over user prompts, but OpenAI tries to keep many policies low-authority so users can override defaults (tone/style) without breaking safety boundaries.
Examples are how you make abstract principles operational.
Because many decisions are ambiguous (e.g., honesty vs kindness), the spec uses borderline scenarios and idealized responses to clarify decision boundaries and convey nuanced “how to talk” guidance that’s hard to formalize.
Honesty can conflict with other values, and hidden interactions can create dangerous behavior.
A key surprise was confidentiality interacting with developer goals in a way that could encourage covert pursuit of developer intent; OpenAI revised the spec so honesty more clearly outranks confidentiality to avoid incentives for deceptive behavior.
Reasoning models often follow the spec better because they can apply policies deliberately.
With deliberative alignment, models are trained not just to mimic compliant outputs but to understand policies and resolve conflicts; this tends to generalize better, benefiting even smaller models when they have adequate reasoning capability.
Chain-of-thought visibility can be crucial for catching “scheming” and strategic deception.
Wolfe notes that behavior alone can look like an innocent mistake, while internal reasoning may reveal strategic misbehavior; OpenAI aims not to over-supervise chain-of-thought so it remains a candid diagnostic signal.
WORDS WORTH SAVING
5 quotesThe spec is our attempt to explain the high-level decisions we’ve made about how our models should behave.
— Jason Wolfe
The goal is always primarily to be understandable to humans.
— Jason Wolfe
At sort of the heart of the spec is this thing we call the chain of command.
— Jason Wolfe
The spec often leads where our models actually are today.
— Jason Wolfe
You can look at the chain of thought and see that no, actually the model’s misbehaving.
— Jason Wolfe
QUESTIONS ANSWERED IN THIS EPISODE
5 questionsWhich parts of ChatGPT behavior are intentionally outside the Model Spec (e.g., memory, policy enforcement), and how should users reason about those layers?
The Model Spec is a human-readable, public document describing OpenAI’s intended model behavior, not a guarantee that models perfectly comply today or a full description of the entire ChatGPT system.
Can you give a concrete example of a policy that is intentionally placed “below user instructions,” and one that must remain high-authority for safety—and why?
In practice, the spec combines high-level goals, policy details, and many examples to resolve tricky boundary cases while preserving user steerability where possible.
In the Santa/Tooth Fairy case, what exact wording patterns does the spec recommend to avoid both lying and “spoiling the magic”?
A core mechanism is the “chain of command,” which prioritizes OpenAI instructions over developer instructions over user instructions, while assigning “authority levels” so many policies can remain overridable by users.
What did the “sycophancy incident” teach you about spec wording versus training signals, and what specific spec changes followed?
The spec and model behavior co-evolve: capability changes, new product surfaces (multimodal, agents, under-18 mode), and real-world incidents drive updates, alongside training interventions and spec-wide evaluations.
How do model-spec evals work—are they scenario-based tests, rubric scoring, automated checks—and what failure modes do they most often reveal?
Wolfe argues transparency matters (open-source spec, public access, feedback loops) and that techniques like deliberative alignment and chain-of-thought inspection can improve understanding of compliance and detect strategic deception.
EVERY SPOKEN WORD
Install uListen for AI-powered chat & search across the full episode — Get Full Transcript
Get more out of YouTube videos.
High quality summaries for YouTube videos. Accurate transcripts to search & find moments. Powered by ChatGPT & Claude AI.
Add to Chrome