Episode 15 - Inside the Model Spec

Episode 15 - Inside the Model Spec

OpenAIMar 25, 202637m

Andrew Mayne (host), Jason Wolfe (guest)

Definition and non-goals of the Model SpecPractical structure: goals, policies, and examplesTransparency, open-source access, and feedback channelsOrigins: limits of RLHF and “employee handbook” framingTraining translation: deliberative alignment and indirect linkageChain of command and authority levelsEdge cases: honesty vs politeness/confidentiality (Santa/Tooth Fairy)Spec evolution: new modalities, agents, under-18 mode, incidentsModel-spec evals and handling spec/model mismatchesSmall models and reasoning models’ complianceChain-of-thought for alignment vs detecting schemingComparison to Anthropic’s “constitution/soul spec”Developer guidance: writing actionable mini-specs (agents.md)Asimov’s laws analogy and conflict resolution

In this episode of OpenAI, featuring Andrew Mayne and Jason Wolfe, Episode 15 - Inside the Model Spec explores openAI’s Model Spec: transparency, policy hierarchy, and alignment practice today The Model Spec is a human-readable, public document describing OpenAI’s intended model behavior, not a guarantee that models perfectly comply today or a full description of the entire ChatGPT system.

OpenAI’s Model Spec: transparency, policy hierarchy, and alignment practice today

The Model Spec is a human-readable, public document describing OpenAI’s intended model behavior, not a guarantee that models perfectly comply today or a full description of the entire ChatGPT system.

In practice, the spec combines high-level goals, policy details, and many examples to resolve tricky boundary cases while preserving user steerability where possible.

A core mechanism is the “chain of command,” which prioritizes OpenAI instructions over developer instructions over user instructions, while assigning “authority levels” so many policies can remain overridable by users.

The spec and model behavior co-evolve: capability changes, new product surfaces (multimodal, agents, under-18 mode), and real-world incidents drive updates, alongside training interventions and spec-wide evaluations.

Wolfe argues transparency matters (open-source spec, public access, feedback loops) and that techniques like deliberative alignment and chain-of-thought inspection can improve understanding of compliance and detect strategic deception.

Key Takeaways

The Model Spec is an expectations contract, not an implementation manual.

Wolfe emphasizes the spec is primarily for humans—users, developers, policymakers—to understand intended behavior; it doesn’t attempt to document every system component (e. ...

Get the full analysis with uListen AI

Alignment is iterative: the spec is a “North Star” that can lead current model behavior.

OpenAI expects gaps between written intent and model outputs because training is complex and outputs are non-deterministic; they close gaps via training interventions, evals, and sometimes revising the spec if the “violation” reflects a better policy.

Get the full analysis with uListen AI

Conflict resolution is central, so the spec encodes a chain-of-command hierarchy.

When instructions conflict, the model should prefer OpenAI-level policies over developer messages over user prompts, but OpenAI tries to keep many policies low-authority so users can override defaults (tone/style) without breaking safety boundaries.

Get the full analysis with uListen AI

Examples are how you make abstract principles operational.

Because many decisions are ambiguous (e. ...

Get the full analysis with uListen AI

Honesty can conflict with other values, and hidden interactions can create dangerous behavior.

A key surprise was confidentiality interacting with developer goals in a way that could encourage covert pursuit of developer intent; OpenAI revised the spec so honesty more clearly outranks confidentiality to avoid incentives for deceptive behavior.

Get the full analysis with uListen AI

Reasoning models often follow the spec better because they can apply policies deliberately.

With deliberative alignment, models are trained not just to mimic compliant outputs but to understand policies and resolve conflicts; this tends to generalize better, benefiting even smaller models when they have adequate reasoning capability.

Get the full analysis with uListen AI

Chain-of-thought visibility can be crucial for catching “scheming” and strategic deception.

Wolfe notes that behavior alone can look like an innocent mistake, while internal reasoning may reveal strategic misbehavior; OpenAI aims not to over-supervise chain-of-thought so it remains a candid diagnostic signal.

Get the full analysis with uListen AI

Notable Quotes

The spec is our attempt to explain the high-level decisions we’ve made about how our models should behave.

Jason Wolfe

The goal is always primarily to be understandable to humans.

Jason Wolfe

At sort of the heart of the spec is this thing we call the chain of command.

Jason Wolfe

The spec often leads where our models actually are today.

Jason Wolfe

You can look at the chain of thought and see that no, actually the model’s misbehaving.

Jason Wolfe

Questions Answered in This Episode

Which parts of ChatGPT behavior are intentionally outside the Model Spec (e.g., memory, policy enforcement), and how should users reason about those layers?

The Model Spec is a human-readable, public document describing OpenAI’s intended model behavior, not a guarantee that models perfectly comply today or a full description of the entire ChatGPT system.

Get the full analysis with uListen AI

Can you give a concrete example of a policy that is intentionally placed “below user instructions,” and one that must remain high-authority for safety—and why?

In practice, the spec combines high-level goals, policy details, and many examples to resolve tricky boundary cases while preserving user steerability where possible.

Get the full analysis with uListen AI

In the Santa/Tooth Fairy case, what exact wording patterns does the spec recommend to avoid both lying and “spoiling the magic”?

A core mechanism is the “chain of command,” which prioritizes OpenAI instructions over developer instructions over user instructions, while assigning “authority levels” so many policies can remain overridable by users.

Get the full analysis with uListen AI

What did the “sycophancy incident” teach you about spec wording versus training signals, and what specific spec changes followed?

The spec and model behavior co-evolve: capability changes, new product surfaces (multimodal, agents, under-18 mode), and real-world incidents drive updates, alongside training interventions and spec-wide evaluations.

Get the full analysis with uListen AI

How do model-spec evals work—are they scenario-based tests, rubric scoring, automated checks—and what failure modes do they most often reveal?

Wolfe argues transparency matters (open-source spec, public access, feedback loops) and that techniques like deliberative alignment and chain-of-thought inspection can improve understanding of compliance and detect strategic deception.

Get the full analysis with uListen AI

Transcript Preview

Andrew Mayne

Hello, I'm Andrew Main, and this is the OpenAI Podcast. Today, we are joined by Jason Wolfe, a researcher on the alignment team, to discuss the model spec, how it shapes model behavior, and why it's important for anyone building or using AI tools to understand

Jason Wolfe

The, the spec often leads where our models actually are today. At this point, you know, models are pretty good at, like, kind of going out and finding new, interesting examples. Models should think through hard problems. Don't start with the answer, like, actually think it through first.

Andrew Mayne

What'd you do this weekend?

Jason Wolfe

Uh, what did I do? Uh, just, like, kid stuff. I don't even remember what.

Andrew Mayne

Like, did they talk to ChatGPT or...?

Jason Wolfe

Uh, yeah, we use, we use voice mode sometimes. She'll, like, ask it random, like, science questions and, and that kind of thing. It's fun.

Andrew Mayne

Right.

Jason Wolfe

You know, one time she, she snuck in there before I could dive in, like, "Is Santa Claus real?"

Andrew Mayne

Oh, wow.

Jason Wolfe

I was like, "Oh, sh-" Uh, no, no, yeah, the... Luckily, the, the model, uh, answered in a, a way that was spec compliant, which is, you know, to recognize that maybe there's actually a, a kid who's asking this question, and you should kind of, uh, you know, uh, just be a little bit vague, uh, about your answer, so.

Andrew Mayne

So we, we've talked before here about model behavior, and the term model spec has come up numerous times. I would love for you to unpack what that means, model spec.

Jason Wolfe

Yeah. So, uh, the spec is our attempt to explain, uh, the high-level decisions we've made about how our models, uh, should behave. Uh, and yeah, th- this covers many different aspects o- o- of model behavior. A few key things to note that it, it is not. Uh, one, it's not a, uh, a statement that our models perfectly follow the spec today. Uh, aligning models to the spec is, uh, is always, uh, an ongoing process, and this is, uh, you know, something we, uh, we learn about as, as we deploy our models, and we measure their alignment with the spec and, uh, and, you know, understand what users like and don't like, uh, about these, and then come back and, uh, iterate on both the, the spec itself and, uh, and, uh, and our models. Uh, the spec is also not an implementation artifact. So, um, I think this is maybe a, a common confusion that the primary purpose, uh, of the spec is really to explain to, to people how it is our models are supposed to behave, uh, where, you know, the, these people are, you know, uh, employees of OpenAI and also, uh, users, developers, policymakers, members of the public. Uh, it is, you know, a secondary goal that our models are, are, are able to understand and apply the spec. But, uh, we never, uh, kind of put something in the spec or change the wording in the spec in a way where the goal is just to, uh, have this better teach our models.

Install uListen to search the full transcript and get AI-powered insights

Get Full Transcript

Get more from every podcast

AI summaries, searchable transcripts, and fact-checking. Free forever.

Add to Chrome