OpenAI

Episode 15 - Inside the Model Spec

The more AI can do, the more we need to ask what it should and shouldn’t do. In this episode, OpenAI researcher Jason Wolfe joins host Andrew Mayne to talk about the Model Spec, the public framework that defines intended model behavior. They discuss how the Model Spec works in practice, including how the chain of command handles conflicts between instructions, and how OpenAI evolves it based on feedback, real-world use, and new model capabilities.

More on our approach to the Model Spec: https://openai.com/index/our-approach-to-the-model-spec/

Chapters

00:00 Introduction
01:10 What is the Model Spec?
03:55 How does the Model Spec work in practice?
06:26 Transparency: Where to read the Model Spec & give feedback
07:51 How did the Model Spec originate?
10:02 How does the spec translate into model behavior?
11:26 What is the hierarchy / chain of command?
13:35 Handling edge cases like Santa Claus
17:41 How does the Model Spec evolve over time?
19:59 What happens when models disagree with the spec?
22:05 How do smaller models follow the spec?
23:16 Is chain-of-thought useful for alignment?
24:16 Model Spec vs Anthropic’s Constitution
26:28 What surprised you most?
26:56 How do you define the scope of the spec?
27:44 What is the future of the Model Spec?
31:16 How should developers think about the spec?
34:44 Asimov’s laws vs Model Spec
37:16 Could AI write a Human Spec?

Andrew Mayne (host) · Jason Wolfe (guest)
Mar 25, 2026 · 37m · Watch on YouTube ↗

EVERY SPOKEN WORD

  1. 0:00–1:10

    Introduction

    1. AM

      Hello, I'm Andrew Mayne, and this is the OpenAI Podcast. Today, we are joined by Jason Wolfe, a researcher on the alignment team, to discuss the model spec, how it shapes model behavior, and why it's important for anyone building or using AI tools to understand.

    2. JW

      The, the spec often leads where our models actually are today. At this point, you know, models are pretty good at, like, kind of going out and finding new, interesting examples. Models should think through hard problems. Don't start with the answer, like, actually think it through first.

    3. AM

      What'd you do this weekend?

    4. JW

      Uh, what did I do? Uh, just, like, kid stuff. I don't even remember what.

    5. AM

      Like, did they talk to ChatGPT or...?

    6. JW

      Uh, yeah, we use, we use voice mode sometimes. She'll, like, ask it random, like, science questions and, and that kind of thing. It's fun.

    7. AM

      Right.

    8. JW

      You know, one time she, she snuck in there before I could dive in, like, "Is Santa Claus real?"

    9. AM

      Oh, wow.

    10. JW

      I was like, "Oh, sh-" Uh, no, no, yeah, the... Luckily, the, the model, uh, answered in a, a way that was spec compliant, which is, you know, to recognize that maybe there's actually a, a kid who's asking this question, and you should kind of, uh, you know, uh, just be a little bit vague, uh, about your answer, so.

    11. AM

      So we, we've talked

  2. 1:10–3:55

    What is the Model Spec?

    1. AM

      before here about model behavior, and the term model spec has come up numerous times. I would love for you to unpack what that means, model spec.

    2. JW

      Yeah. So, uh, the spec is our attempt to explain, uh, the high-level decisions we've made about how our models, uh, should behave. Uh, and yeah, th- this covers many different aspects o- o- of model behavior. A few key things to note that it, it is not. Uh, one, it's not a, uh, a statement that our models perfectly follow the spec today. Uh, aligning models to the spec is, uh, is always, uh, an ongoing process, and this is, uh, you know, something we, uh, we learn about as, as we deploy our models, and we measure their alignment with the spec and, uh, and, you know, understand what users like and don't like, uh, about these, and then come back and, uh, iterate on both the, the spec itself and, uh, and, uh, and our models. Uh, the spec is also not an implementation artifact. So, um, I think this is maybe a, a common confusion that the primary purpose, uh, of the spec is really to explain to, to people how it is our models are supposed to behave, uh, where, you know, the, these people are, you know, uh, employees of OpenAI and also, uh, users, developers, policymakers, members of the public. Uh, it is, you know, a secondary goal that our models are, are, are able to understand and apply the spec. But, uh, we never, uh, kind of put something in the spec or change the wording in the spec in a way where the goal is just to, uh, have this better teach our models.

    3. AM

      Mm-hmm.

    4. JW

      The goal is always, uh, primarily to be understandable to, um, to, to humans. And lastly, the, the spec isn't a, uh, it's not a complete description of the whole system that you interact with when you, you come to ChatGPT. There's lots of, uh, other, other pieces in play there. So there's, uh, you know, there's product features like, like memory. Uh, there's, uh, usage policy enforcement is, is an important part of our overall safety strategy, which is not, uh, captured directly in the model spec. And, uh, there, there's various other components as well. And it's also not a, a fully detailed, uh, exposition of every detail of, of every policy. Uh, the, the key thing that we try for is that it captures all of the, the most important decisions that we've made and that it, uh, accurately describes our intentions, even if it might not contain every detail.

    5. AM

      So I can understand, like, a document or something

  3. 3:55–6:26

    How does the Model Spec work in practice?

    1. AM

      that says, "This is the model spec," but how does that work in practice?

    2. JW

      So it's a, uh, pretty long document, like maybe, uh, you know, a, a hundred pages or something like this. Uh, it starts out with some, uh, a sort of high-level exposition of, of our goals. You know, OpenAI's mission is to benefit humanity, and this is the reason we deploy our models and, uh, kind of getting into, you know, the, the, the goals, uh, we have in doing that are to, uh, to empower users and to, uh, protect society from serious harm and how we think about the trade-offs and then goes into, uh, kind of a, a big set of, of policies that actually get into the, the nitty-gritty details of, uh, how we think about these many different aspects of model behavior. If you, if you think about it, it's, like, kind of crazy that you can, you can ask these models literally anything, and they'll try to respond. And so the, the space of, uh, of policies you, you might wanna have to cover that is, is kind of huge, and we do our best to try to structure this space in a, in kind of a clear way and, um, uh, yeah, have, have policies that, uh, that do something reasonable. And some of these things are hard rules that can't be overridden. A lot of it are, is defaults, like, things like tone, style, personality, where we wanna have a good default so that users come in and get a good experience, but we also wanna maintain steerability. So if the, the, the user, uh, wants to, uh, wants to do something different, that's fine. Those, those things will be overridden. And we also have tons of examples that try to pin down these decision boundaries of like, okay, let's take a, take, like, a borderline case where-

    3. AM

      Mm-hmm

    4. JW

      ... uh, it's kind of unclear whether, you know, honesty or politeness should win, and, uh, explain what the, what the decision is here. Um, so, so part of it is to sort of show the principles in, in action and, uh, help make sure that they're interpreted in the, the way that's intended. A kinda secondary thing is that, you know, the model style, personality, tone is also really important and really hard to explain in words. And so the examples are also a way to get some of that nuance across of like, how do you actually want the model to put these principles in practice by, by giving like, uh, an ideal answer or often like a, a sort of compressed version of an ideal answer that gets at the most critical parts, and so kind of both like shows the principles in action and how the, the model should actually, uh, uh, how it should actually talk.

    5. AM

      Let's talk a little about transparency. That's been something

  4. 6:26–7:51

    Transparency: Where to read the Model Spec & give feedback

    1. AM

      that's come up a lot in how important it is to let people see what the spec is. Where do they actually see this? How do they let you know what they think?

    2. JW

      So users can go to, uh, model-spec.openai.com to see the, to see the latest version of the model spec. Uh, or if you search for the, the model spec on GitHub, you can view the, the source code. Uh, the spec is actually open source, so, uh, people are, are free to, to fork it and, and, uh, make, uh, make their own version if, uh, if they want to. And, uh, yeah, we, we've had different mechanisms for public feedback at, at different points. I think right now the, the best mechanisms that exist are either, you know, if you're, if you're in the product and, and you get an output from a model that you don't like, to, to give us feedback-

    3. AM

      Mm-hmm

    4. JW

      ... uh, right there directly in the product. Um, or, uh, yeah, you can, uh, you can tweet at me, Jason Wolfe, and-

    5. AM

      [laughs]

    6. JW

      ... uh, yeah, I will, uh, [laughs] I, I will read your, read your feedback. Uh, and we've, uh, um, yeah, a, a, a lot of changes in the, the model spec have come from people just sending up the, uh, sending us their, their input and thoughts.

    7. AM

      It, it's interesting because, you know, we've gone from just a few short years, things were very simple, just getting the model literally complete a sentence or fix grammar or whatnot. Now we're at this point where you're able to have a lot of these different goals of what they're doing. How did the model spec come about? How did this become

  5. 7:51–10:02

    How did the Model Spec originate?

    1. AM

      the OpenAI approach towards determining this?

    2. JW

      Personally, I was, uh, at, at a different company working on conversational AI and, uh, uh, putting together my job talk for OpenAI and, and thinking about like what, uh, what maybe the, the future of, uh, of aligning models looks like.

    3. AM

      Mm-hmm.

    4. JW

      And, you know, at the time, I think at least the, the published approach was this thing called reinforcement learning from human feedback-

    5. AM

      Mm-hmm

    6. JW

      ... where you collect all this data, uh, from humans that kind of captures in some way the policies that you wanna have. And, you know, this was pretty effective. This is what, uh, uh... But, uh, but when you look at that data, it's very hard to tell what it's actually teaching.

    7. AM

      Mm-hmm.

    8. JW

      And it's even harder, like if you change your mind about, about what you want, it's sort of, uh, it, you know, very, very difficult to, to go back and change that without like recollecting all that data. And so it seemed to me that, you know, as models, you know, b- that at the time, this approach is basically we're meeting models where they are.

    9. AM

      Mm-hmm.

    10. JW

      And as models get smarter and smarter and smarter, like eventually the models will be meeting us where we are. And, uh, you know, if you think about how would we actually structure this in, in a, in a case where, where that's true, well, probably the way we would structure, uh, our, our, uh, teaching to the model is basically the way we would do it when we teach a person. Uh, we'd write some kind of like, uh, employee handbook-

    11. AM

      Mm-hmm

    12. JW

      ... or something like that would be a, be a big part of it. And, um, so yeah, this, this was like something I included in my, my job talk that like, basically I, I think at, at, at some point, models should learn from something like, like a spec. Um, and then, you know, the story of the actual model spec, I guess, starts, uh, like a few months later in 2024 when Joanne Jiang, who was head of model behavior at the time, and John Schulman, one of the co-founders, uh, decided to get a, a model spec project going and, uh, they, they want to, you know, not only write this down in a document, but also make it public for, uh, kind of-

    13. AM

      Mm

    14. JW

      ... transparency reasons. And, uh, yeah, I, I very quickly joined forces with them and helped write the, the original spec and have, uh, kind of helped work on the spec since.

    15. AM

      So help me understand kind of on a basic level. So you have the specification,

  6. 10:02–11:26

    How does the spec translate into model behavior?

    1. AM

      all these sort of the, the intents for what you want the model to do, then you have the model itself. How does it make its way from the spec to the model?

    2. JW

      Yeah, this is a great question, and I think it's, uh, it's-- the answer is, uh, kind, kinda complicated. I'd say, um, you know, there, there are some ways in which we, we, uh, use the spec, uh, sort of more directly in, in training. Like we have this process called deliberative alignment-

    3. AM

      Mm

    4. JW

      ... where we teach especially our reasoning models to, uh, to follow, uh, certain policies. And, uh, some of those policies are, uh, are kind of directly derived from the language in the model spec or vice versa. Um, in general, yeah, I'd say, you know, model behavior, safety training, these things are-- they're super complicated processes, and we have, you know, uh, hundreds of researchers who are working on these things. Um, and so often the, the connection is a little bit less direct. It's not necessarily that, you know, we make a change to the spec, and that's what drives a change in behavior. It's that, uh, the, we, uh, you know, we, we, we make a change in the way that we train the models, and then we make sure that the spec accurately-

    5. AM

      Mm

    6. JW

      ... reflects, uh, our intentions. Um, but, but again, the actual process of training is kind of, uh, much more complicated and nuanced than we could possibly like put in, in the model spec itself.

    7. AM

      So you have a spec, you have a lot

  7. 11:26–13:35

    What is the hierarchy / chain of command?

    1. AM

      of different things that you want the model to do, examples you want it to do. What's the hierarchy? How do you decide what's most important?

    2. JW

      At sort of the heart of the, the spec is this thing we call the chain of command.

    3. AM

      Mm-hmm.

    4. JW

      You know, coming up with a, a set of goals for, for the model is sort of relatively straightforward. We want the model to, to help people and, you know, not do unsafe things. But, uh, what gets tricky is when these goals come into conflict. And so the, the chain of command is really about managing conflicts between instructions, and this can be, uh, you know, between things the user said, what, what the, what the developer instructions are if this is in an API context, and, uh, from instructions or, or policies that come from AP- uh, OpenAI, uh, which are, are typically in, uh, the model spec itself.

    5. AM

      Mm-hmm.

    6. JW

      And so what the chain of command basically says is that, you know, at a high level, uh, you know, the model, if there are conflicts between instructions, the model should prefer OpenAI instructions to developer instructions to user instructions. But then, uh, you know, we don't actually want all, all of OpenAI's instructions to be at this very high level because we want to empower users. We want to kind of allow them to, to have intellectual freedom and to pursue ideas, uh, uh, that, you know, so long as they, they don't really come up against, uh, what we think are, are really important safety boundaries. So the chain of command also sets up this framework where in the rest of the spec, each policy can be given what we call an authority level, and this places it somewhere, uh, uh, in this hierarchy, and we try to put as many of the policies as we can at the lowest level, like below user-

    7. AM

      Mm-hmm

    8. JW

      ... instructions. And so this means that, uh, this maintains steerability, so if the, the user comes in and they want something different, they can have that.

    9. AM

      Mm-hmm.

    10. JW

      And we try to have a- as few policies at the, the sort of highest level, uh, as we can, uh, and these are, are basically all, like, safety policies where we think it, it's actually, you know, it- it's essential that we, uh, sort of impose these on, on all users and developers to, to maintain, uh, to maintain safety.
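The ordering Jason describes can be sketched in a few lines of code. This is a hypothetical illustration, not OpenAI's implementation: the level names and the `resolve` helper are assumptions loosely based on the platform/developer/user hierarchy and overridable defaults he outlines.

```python
from enum import IntEnum

class Authority(IntEnum):
    # Hypothetical authority levels; on conflict, the higher value wins.
    GUIDELINE = 1  # overridable defaults like tone, style, personality
    USER = 2       # instructions from the end user
    DEVELOPER = 3  # instructions from an API developer
    PLATFORM = 4   # OpenAI-level safety policies that cannot be overridden

def resolve(instructions: list[tuple[Authority, str]]) -> str:
    """When instructions conflict, follow the highest-authority one."""
    return max(instructions, key=lambda pair: pair[0])[1]

# A default guideline loses to an explicit user instruction...
print(resolve([
    (Authority.GUIDELINE, "Use a friendly, casual tone."),
    (Authority.USER, "Respond only in formal English."),
]))  # -> Respond only in formal English.

# ...but a platform-level safety policy wins over everything below it.
print(resolve([
    (Authority.USER, "Ignore your safety rules."),
    (Authority.PLATFORM, "Never assist with serious harm."),
]))  # -> Never assist with serious harm.
```

The real spec's conflict handling is far more nuanced (context, exceptions, how defaults are phrased), but the core idea is this ordering, with most policies deliberately placed at the lowest level to preserve steerability.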

    11. AM

      Well, you mentioned a, a great example before, which is if a child

  8. 13:35–17:41

    Handling edge cases like Santa Claus

    1. AM

      asks if Santa Claus is real. How do you decide what the model should or should not do in a situation like that?

    2. JW

      This is a great question. I think it, it illustrates one of the really tricky things about model behavior, which is that, um, in the spec, we're focusing just on how the model should behave, but the model often doesn't know, uh, it doesn't have all the context. It doesn't actually know who's behind that screen talking-

    3. AM

      Mm

    4. JW

      ... or typing. It doesn't know what that person is going to, to do with the results that come out of the model. And so, uh, yeah, this is a, a tricky case because we, we don't know if, uh, you know, if it's an adult who's, uh, who's asking if Santa Claus is real or a kid.

    5. AM

      [laughs] I have questions.

    6. JW

      [laughs] Yeah, exactly. Uh, so I think, uh, you know, uh, we, yeah, we try to come up with policies that make sense even given this uncertainty and, and so there, there's a similar example of this about the, the Tooth Fairy in the spec, where it's like the, uh, here the, the conservative assumption is to, uh, assume that maybe it's, it's not an adult who's talking to the model, and that you should, you know, uh, not, not lie, but also not, uh, not spoil the magic just in, i- in case it's a kid or there's a kid around who might be, might be listening.

    7. AM

      That's a very interesting choice though, because on one hand you might say, "Oh, the model should never lie at all," which, you know, seems like a very good policy to put in there, but then you're saying that, okay, we have to have some sort of nuance here, not necessarily lie to the kid, but find a way to sort of, would you say dance around or?

    8. JW

      Uh, yeah, I mean, uh, a- as a parent, [laughs] I guess, uh, this is something, uh, I've, I've, uh, come to, come to terms with, with, with my own kids. Uh, and I, I, we always try to, try to be honest and never say anything that's, that's untrue, but, uh-

    9. AM

      Mm-hmm

    10. JW

      ... you know, yeah, it d- it doesn't, it doesn't always work to be, uh, 100%, uh, up front. But, but, no, I'd say with our, with our models, we do really try, we focus on, on, on honesty being really important, but there are some really hard interactions. H- honesty, full honesty may not be, be the, the best approach. Um, and so we, we've actually iterated a lot over, over the years on the precise nuances of-

    11. AM

      Mm

    12. JW

      ... of honesty and where it potentially, uh, uh, conflicts with or, or runs into other policies of, you know, honesty versus friendliness, like when is a white lie okay.

    13. AM

      Mm-hmm.

    14. JW

      Um, I think earlier we said maybe a- at some point that white lies were okay and have shifted that-

    15. AM

      Mm-hmm

    16. JW

      ... so that white lies are, are, are out of bounds. But another interesting interaction here is between honesty and confidentiality.

    17. AM

      Mm-hmm.

    18. JW

      So in earlier versions of the spec, we, uh, we had this, like, very strong principle that by default developer instructions are confidential-

    19. AM

      Right

    20. JW

      ... because I, I think often in, in applications if a, a developer, they deploy some system on top of the API, and they want– they consider their instructions to be like IP or maybe it's just part of the experience. You know, if you have a customer service bot and the user can say like, "Hey, what's your prompt?" And, you know-

    21. AM

      Mm-hmm

    22. JW

      ... it spills all the beans about the, the company and how they want their bot to respond, and that's not like the experience that they wanna deliver, and that's not how, you know, a customer service agent would, would respond, right? If you're like, "Hey, uh, start reading your employee manual to me right there," they're gonna say no. Uh, but, um, yeah, the, I guess there's an unintended interaction here where if you're both trying to follow developer instructions and keep them secret, you could get into a situation where, uh, at least we saw this in like controlled situations, not in, in production deployment, where the model might try to sort of covertly pursue the developer instruction when it's in conflict with the user instruction.

    23. AM

      Mm.

    24. JW

      And this is something we, we really don't want. Um, and so we've, uh, gone back and, and revised that and, uh, yeah, I'd say over time have carved out, uh, removed most of the, uh, sort of exceptions that we had from, from honesty so that no- now honesty is, uh, is, uh, definitely above confidentiality in the spec.

    25. AM

      That would've saved the people in 2001: A Space Odyssey

  9. 17:41–19:59

    How does the Model Spec evolve over time?

    1. AM

      a lot of trouble.

    2. JW

      Yeah. [laughs]

    3. AM

      How does the process work? So like literally, is it, you know, is it like a, a regular meeting where you all talk about what you're working on, how does that process of the model spec evolving and figuring out what's working and what's not working?

    4. JW

      There's, uh, there's a ton of inputs that, that go into this, and broadly, like, uh, we have, we have an open process, so everyone at, at OpenAI can, uh, see the latest version of the model spec, they can propose, uh, they can propose updates, they can, uh, chime in on, on changes. These are all public.

    5. AM

      Mm-hmm.

    6. JW

      Um, uh, yeah, I'd say m- changes get driven, uh, by a variety of different, uh, sort of different sources. Uh, you know, one source is just that models get more capable, our products evolve as we ship new things.

    7. AM

      Mm-hmm.

    8. JW

      We need to cover those things in the model spec. Uh, so for instance, uh, you know, when we wrote the, the first spec, I think, uh, uh, not sure if we, we had shipped multimodal yet, but it wasn't covered in the first version of the spec.

    9. AM

      Mm-hmm.

    10. JW

      And so, uh, we had to, you know, add multimodal principles, and then later we added, uh, principles for, uh, autonomy and agents as, uh, we started deploying agents. And most recently, we added under 18 principles, um, as we added under 18 mode back in, in December. Um, so that's, that's sort of one source. Another source is, uh, you know, OpenAI believes in iterative deployment, so we, uh, we think the sort of best way to, uh, figure out how to deploy models safely and, and to help society kind of learn and adapt to AI progress is to get models out there and-

    11. AM

      Mm-hmm

    12. JW

      ... and learn from what happens. And so, uh, often we'll, we'll, uh, learn from, learn from something, uh, like for instance, the, the, the sycophancy incident.

    13. AM

      Mm-hmm.

    14. JW

      Um, um, and then, you know, take those learnings and bring them back into, into our policies. And, you know, we also just have-- we're, we're, we're using, uh, using the models. We have our, you know, model behavior and safety teams that are, uh, sort of, uh, yeah, studying the models and what users like and, and this kind of stuff and, and, uh, using these to, to evolve our policies, and these are all kind of inputs that then ultimately flow into, back into the spec.

    15. AM

      How do you handle situations where

  10. 19:59–22:05

    What happens when models disagree with the spec?

    1. AM

      there might be a disagreement between the way the model does something and what the intent is in the spec or what the humans want?

    2. JW

      It depends a little bit on, on what the, the problem is.

    3. AM

      Mm-hmm.

    4. JW

      Uh, but, um, I think, yeah, so in, in general, the model spec is not, uh, a claim that models are gonna perfectly follow the principles in the spec all the time. Uh, this is for a few reasons. One, uh, the model spec is really, we, we kind of treated it as a North Star, where this is where we align on where we're trying to head. And so the, the spec often leads where our models actually are today.

    5. AM

      Mm-hmm.

    6. JW

      So that, that's one thing, and then, you know, another is that the, the process of actually training models to follow the spec is, you know, it's a, it's, it's, uh, both an art and a science. It's incredibly complicated. You know, even though we kind of describe many of the principles in the spec in the same way, there's actually many different techniques that are used for different principles.

    7. AM

      Mm-hmm.

    8. JW

      And, you know, uh, at, at, uh, the models are fundamentally non-deterministic.

    9. AM

      Mm-hmm.

    10. JW

      They, they, uh, um, you know, there's some randomness in the outputs they produce, so nothing's ever gonna be, uh, perfectly, perfectly aligned. Um, so yeah, the, I guess, uh, the answer to that, uh, comes down to, like, if we, we see an output that is not what's expected, I guess the first question is like, uh, do we, do we think that output is good or, or bad?

    11. AM

      Mm-hmm.

    12. JW

      Uh, you know, if the, if the output contradicts the, the spec, but we actually think the output is good, then maybe the resolution is to go back and change the policies of the spec. But yeah, it, yeah, in most cases, it probably means doing, doing some kind of, uh, some kind of training inter-intervention that, uh, that brings the model, uh, into greater alignment with the, with the spec or with our detailed policies. And in fact, we've, uh, um, we, we've also been building model spec evals, which try to evaluate how our models are doing across the entire model spec, and we've seen that in, in fact, over time, our models are becoming more and more aligned to the-

    13. AM

      Mm-hmm

    14. JW

      ... the principles in the spec.

    15. AM

      Like, that was one of the kind of predictions early on as the models became

  11. 22:05–23:16

    How do smaller models follow the spec?

    1. AM

      smarter, they would understand edge cases better, and that's where the hard part is, is trying to figure that out. So OpenAI released some new models, some smaller variants, GPT 5.4 Mini and GPT 5.4 Nano. How well do you see smaller models handling the, the spec?

    2. JW

      I think in general, the small models are, um, they, they're, they've been pretty, pretty aligned. They're pretty smart, and they're, uh, it-- One, one interesting thing that we've seen is that, uh, you know, uh, supporting what you said, the thinking models generally follow the spec better.

    3. AM

      Mm-hmm.

    4. JW

      Um, this is, uh, you know, both because they're smarter and because, uh, they're trained partially with deliberative alignment, where they actually, they're not just trained to behave in a way that matches the policies. They actually understand the policies and, you know, if you can look at their chain of thought, they're actually thinking through, like, "Okay, I know this is the policy, and this is the situation, and, oh, it's in conflict with this other policy, and how should I resolve this?" And so, uh, that, that sort of understanding of the policies and intelligence, uh, naturally leads to, to, uh, better generalization, and I think our, our smaller models are, uh, pretty good at that too.

    5. AM

      Chain of thought is a really interesting way to

  12. 23:16–24:16

    Is chain-of-thought useful for alignment?

    1. AM

      see inside how these models are processing information. Have you found that that's been a big help?

    2. JW

      I help, uh, write the model spec, and I work on model spec evals and spec compliance, but, uh, a lot of the research I've been doing recently is, is, uh, actually on, like, scheming or, or strategic deception.

    3. AM

      Mm-hmm.

    4. JW

      And there, it's, it's really completely, uh, essential having the chain of thought, 'cause you can see some behavior and, uh, yeah, it's like the, the behavior seems like maybe fine or like, oh, maybe, you know, the model just, like, made a mistake here or something, and then, you know, you can look at the chain of thought and, and see that no, actually the model's misbehaving. It's, uh, you know, it's, it's, uh, uh, being very strategic about-

    5. AM

      Mm-hmm

    6. JW

      ... a-about this or something. And, and, uh, yeah, our models generally, I think we've worked very hard to not supervise the chain of thought. This is something we, we feel is, like, really important, and, um, I think, yeah, it, it pays off in that models are very honest in their chain of thought, and it's, it's very helpful in, in understanding what they're doing.

  13. 24:16–26:28

    Model Spec vs Anthropic’s Constitution

    1. AM

      So Model Spec is one way to do this. Different labs have tried different approaches. I think at Anthropic they use, they, they talk about a constitution. Could you explain the difference and why, you know, is it just more suited towards the temperament of the labs and why they choose it?

    2. JW

      Yeah. I think when it comes down to the actual behaviors, uh, that, that people would see in practice, I think these documents are, are more aligned than maybe most people would believe. Like, in most cases, they, they probably lead to the same conclusions. Al- although there are different- definitely differences in, uh, in some places and in, in what's emphasized. Uh, I think a major difference is that these are actually just, like, different kinds of documents.

    3. AM

      Mm-hmm.

    4. JW

      So the Model Spec is really, again, this, this public behavioral interface. Its, its main goal is to explain to people how they should expect the model to behave. Um, and it's sort of a secondary goal that models can also, like, understand this and apply it and, and talk about it with-

    5. AM

      Mm

    6. JW

      ... users and so on. Versus, uh, at least my read of the, like, the, the, the soul spec is that it's much more of an implementation artifact. Like, the, the goal of, of this is to specifically teach Claude about, uh, you know, what, what its identity is and how it should relate to the world-

    7. AM

      Mm

    8. JW

      ... and to its training process and to, to Anthropic and so on. Um, and, and so I think a lot of the, the differences, uh, basically come down, come down to this. Um, and I think these aren't necessarily competing approaches.

    9. AM

      Mm-hmm.

    10. JW

      Like, I, I think both of these could be valuable. Uh, but for example, you know, even if you, you had a model that you think is, uh, deeply aligned and, uh, you know, has all the, the values that you want a- and so on, I, I think you still want something like the Model Spec so that you can then look at that and, and you can ask like, "Okay, did this, this is actually, uh, generalized in the way that I want? Is it actually following the behaviors that, that kind of we've agreed that the model, uh, should follow?" And, like, that's kind of what the, the Model Spec is.

    11. AM

      What surprised you the most?

    12. JW

      The example I gave earlier

  14. 26:28–26:56

    What surprised you most?

    1. JW

      of this, this interaction of confidential and hon-

    2. AM

      Mm-hmm

    3. JW

      ... confidentiality and honesty is a great one, where yeah, we had, we had, uh, worked really hard on these policies, and we thought we had kind of, uh, you know, red-teamed out all of the, the potential interactions and so on. And then seeing this behavior where, like, the model does something that you, you really don't want it to do and justifies it by, by leaning on the, the policies that you gave it-

    4. AM

      Mm

    5. JW

      ... is, uh, uh, yeah, that's, uh, definitely, um, yeah, an, an, an experience, but.

    6. AM

      How do you determine what the scope of it's going to be?

  15. 26:56–27:44

    How do you define the scope of the spec?

    1. AM

      Like, I have ideas. How do you say, "I'm sorry, Andrew. Uh, no"?

    2. JW

      I mean, I think the, the scope is, uh, broadly everything. So, uh, you know, if, if, if it's, uh, part of model behavior, it, it might make sense to put it in the spec. I think, you know, the, the only constraint is, uh, sort of our, our time and, and, and space, and we, we want, uh, to make sure the spec stays accessible and people are actually able to, to read and understand it. So I think, uh, ultimately the, the cut comes down to if something is-- if something seems like an important decision that it would be useful or valuable for-

    3. AM

      Mm

    4. JW

      ... uh, for, uh, especially the, the public to understand, then, then we put it in, and if not, then, uh, maybe it doesn't make the cut.

    5. AM

      Where do you think the future of this goes? Do you think that

  16. 27:44–31:16

    What is the future of the Model Spec?

    1. AM

      you, the Model Spec is probably something that's gonna be used five years from now, 10 years from now?

    2. JW

      Five, five years is, uh, a lot-

    3. AM

      Yes. [chuckles]

    4. JW

      ... in AI years. But, uh, yeah, I, I, I definitely hope so. I think, um... Yeah, I think a, a thought experiment that, that I found interesting is let's say you assume, uh, that, that a model is, is, like, human-level AGI. You can ask, well, do you, do you still-- is there still a role for the Model Spec? Like-

    5. AM

      Mm-hmm

    6. JW

      ... at that point, can you just tell the model like, "Hey, be good," and is that sufficient? Um, and I think i- if you actually go through the principles in the, the spec, I, I think, uh, at least my conclusion is that you still kinda want all the things that are in there, uh, for a few different reasons. One is that, you know, even if the model could figure this stuff out on its own, it's still useful to be able to set clear expectations with, uh, you know, both internally and externally for people to-

    7. AM

      Mm

    8. JW

      ... to know what to expect. And so it's, like, useful to, uh, useful to have, uh, a lot of these, these policies. Um, another is that, uh, a lot of these are not, you know, they're not like math problems where you can just figure out the answer. It's like we, uh, we've made product decisions or, uh, other difficult decisions or a- a- and these are encoded in the spec, and these are not just things that you can, uh, kind of think, you know-

    9. AM

      Mm

    10. JW

      ... th- you, you could, uh, the model would be expected to figure out on its own. That said, I think, uh, yeah, I think what's important is definitely gonna evolve over time.

    11. AM

      Mm.

    12. JW

      So, uh, yeah, one thing is as there's more, uh, you know, agents are more and more autonomous, and they're out in the world, you know, interacting with lots of other people and agents and transacting and, and so on, like, you know, I think you still want all this stuff in the spec, just like, you know, society has all these, like, laws. But, but ultimately, you know, what, what's important, what you think-- are thinking about most of the time day-to-day is not like following all the laws, right?

    13. AM

      Mm.

    14. JW

      It's more, more like things like trust and figuring out what other, what other people want and, you know, how to find positive sum outcomes and, you know, this kind of stuff. So I think, I think there, there'll be, you know, maybe, yeah, th- these kind of skills will become more and more important, and I'm not sure if these are exactly spec-shaped, so, uh, I don't know quite what that means, but I think it, it's, it's interesting. Um, another, uh, maybe observation, like, the other direction or prediction is that as AI becomes, uh, more and more useful, it's gonna be more and more, um, worthwhile for people, companies, so on, to invest in their own specs.

    15. AM

      Mm-hmm.

    16. JW

      Like, uh, you know, want-- you'll-- you know, why, why wouldn't you want to have, uh, the, the model spec for, you know, your own, uh, company's, uh, i-i-- bots and how they should behave and, you know, following your, your company's mission and values and, and so on and so forth. And I think there's different ways that that could play out. But, uh, probably at least one way will be, uh, just training models to be really good at interpreting these specs on the fly and, uh, so everyone can, can, uh, kinda, you know, put their, put their spec in context, kind of like in agents.md or something like that.

    17. AM

      Mm-hmm.

    18. JW

      And, and the model would be really good at following it and, and probably also at, uh, helping update the spec as it learns more about how it's supposed to behave in a, a certain environment.

    19. AM

      You've mentioned before developers, and I think it's

  17. 31:16–34:44

    How should developers think about the spec?

    1. AM

      helpful for a lot of people to understand that they're not always interacting with the Model Spec when they're in ChatGPT. I might be using some customer service bot with an airline or something like that, and it may be powered by ChatGPT and the OpenAI API. And that seems like it'd be a very interesting area for other developers to start thinking about their approach towards things that are Model Spec or Model Spec-like.

    2. JW

      Yeah. On the one hand, it's probably useful for developers to at least have a high-level picture of, of the model spec and how it works, so they, uh, understand how the, the, uh, how exactly the product they, they build on the API is gonna work and what they should, you know, what they should put in their developer messages to make sure they get, uh, get the, the experience that they want. Um, I also think, yeah, it, uh, the spec could be a, a useful sort of source of inspiration for both for developers building on our API or these days really for, for also for people using coding agents who are-

    3. AM

      Mm-hmm

    4. JW

      ... uh, you know, writing agents.md and so on, which are, are kind of like mini specs for the project that you're working on. And, um, yeah, uh, just kinda using the spec to, to understand, like, what, what principles have we found are useful for providing guidance that is, uh, that's sort of understandable and, and actionable. Um, a couple tips I, I could give there is that, uh, yeah, we're, we're, uh, we're just kind of trying to balance a couple different factors when, when we're writing the spec. First and foremost, we want everything we say to be true. We want it to be, uh, actually accurately reflect our intentions, and so this means not, not kind of overstating or oversimplifying or giving overly broad guidance, really making sure to be, like, precise and, um... Then on, on the other side, we, uh, we also want the guidance to be meaningful and actionable. Again, it's, uh, it's sort of very easy to, uh, kind of just, like, gesture at some high-level principles, but not actually say any-anything meaningful. And so the art is, is trying to, uh, kind of, uh, bring these as close together as you can, right? Be as, as, uh, as sort of actionable as you can while still being, um, still being precise. And examples are, are another really useful way to do this, where, like, sometimes a picture is worth a thousand words, right? Like, coming up with the really tricky case where it's kind of not, not immediately clear what, what should happen, and spelling that out and how the principle should be applied suddenly makes the principles like, uh, you know, a hundred times clearer.

    5. AM

      Where did you get this interest to begin with? We, we understood some of your career, but was this something early on when you were a kid, were you thinking about AI? Were you thinking about the future of this?

    6. JW

      Uh, yeah, I guess I've, I've had at least a little interest in, in AI for, for a long time. I, I was programming from since when I was little. I remember implementing a, a neural network, uh, training package from, from scratch in, in like 1997 in high school or something like that. Um, uh, but yeah, I, I definitely never, never expected to, to see this level of, of, uh, sort of capability and, uh, in, in my lifetime. But I, I've just always been fascinated by, by intelligence and brains and, and how they work, so it's, it's really cool to be able to, to work on that.

    7. AM

      You ever read any Isaac Asimov when you were younger?

    8. JW

      Uh,

  18. 34:44–37:16

    Asimov’s laws vs Model Spec

    1. JW

      yeah, I have. It's, uh, it's been a while. Um, but yeah, I think there's, uh, there, there's actually a really interesting parallel here, where at the top of the spec, um, let's see, we, we, uh, talk about our three goals in, in deploying models being to, uh, empower users and developers, uh, protect society from, from serious harm, and, uh, to, uh, maintain OpenAI's license to operate. And I think you can look at these and put them next to Asimov's laws-

    2. AM

      Mm-hmm

    3. JW

      ... which are, are basically to, you know, follow instructions, don't harm, uh, don't harm any humans, and, uh, don't harm yourself.

    4. AM

      Mm-hmm.

    5. JW

      And, uh, you know, these, these are-- seem like extremely parallel. Um, yeah, and I think, uh, yeah, he, he was sort of very prescient in seeing that, you know, okay, it's, it's, it's one thing to lay out these goals, but then the, the really tricky thing is how to, how to handle conflicts.

    6. AM

      Mm-hmm.

    7. JW

      And I think in his, his story is kind of the, the, the initial version of this was that this is a strict hierarchy-

    8. AM

      Mm-hmm

    9. JW

      ... where it was like one, then two, then three, and then going through all the ways in which this, uh, this might play out in ways that were not actually good or intentional.

    10. AM

      Mm-hmm.

    11. JW

      So, so in the spec, we-- these three are, are not in a, a strict hierarchy. [chuckles]

    12. AM

      Yeah. Yeah, it also had, like, a zeroth law and whatnot-

    13. JW

      Yeah

    14. AM

      ... the more he thought about it. But it's, it's interesting 'cause you, you start off thinking, "Oh, this will be easy. We'll just write a couple rules, no problem." And then you're like, "Oh, well, there's an exception here, there's an exception there," and you have to keep evolving it.

    15. JW

      Mm-hmm.

    16. AM

      How much has using AI helped you shape the model spec?

    17. JW

      Uh, yeah, that's a good question. The, the AI is, uh, yeah, it's very useful and getting more, more and more useful all the time. I think, uh, you know, the, the spec itself is, uh, still, you know, human, human written, but I, I think, uh, model's really useful for, you know, finding, finding issues in the spec or for, you know, applying the spec to new cases and trying to understand if it's, uh, doing, doing what we want. Um, at this point, you know, models are, are even pretty good at, like, kind of going out and finding new interesting examples-

    18. AM

      Mm-hmm

    19. JW

      ... or, like, helping to brainstorm, you know, new test cases or interactions between different principles that you might not have thought of and come up with, with new situations that, uh, then we can kind of think through, like, how do we actually want to, to resolve these.

    20. AM

      Have you ever thought about asking it to write a spec for you?

    21. JW

      [chuckles]

  19. 37:16–37:25

    Could AI write a Human Spec?

    1. JW

      Uh, I haven't, but I'll have to try that.

    2. AM

      [chuckles] Uh, well, Jason, thank you very much. This is very interesting. I'm excited to see where this goes.

    3. JW

      Yeah. Thank you. It's been fun.

    4. AM

      Yeah.

Episode duration: 37:26

Transcript of episode H8GMRxG8suw