No Priors Ep.61 | OpenAI's Sora Leaders Aditya Ramesh, Tim Brooks and Bill Peebles
- 0:00 – 1:05
Sora team introduction
- SGSarah Guo
(instrumental music plays) Hi, listeners. Welcome to another episode of No Priors. Today, we're excited to be talking to the team behind OpenAI's Sora, which is a new generative video model that can take a text prompt and return a clip that is high definition, visually coherent, and up to a minute long. Sora also raised the question of whether these large video models are world simulators and applied the scalable transformers architecture to the video domain. We're here with the team behind it, Aditya Ramesh, Tim Brooks, and Bill Peebles. Welcome to No Priors, guys.
- TBTim Brooks
Thanks so much for having us.
- BPBill Peebles
Thanks.
- SGSarah Guo
To start off, why don't we just ask each of you to introduce yourselves so our listeners know, uh, who we're talking to. Aditya, mind starting us off?
- ARAditya Ramesh
Sure. I'm Aditya, I lead the Sora team together with Tim and Bill.
- TBTim Brooks
Hi, I'm Tim. I also lead the Sora team.
- BPBill Peebles
I'm Bill, also lead the Sora team.
- SGSarah Guo
Simple enough. Um, maybe we can just start with, you know, the OpenAI mission is AGI, right? Um, greater intelligence. Is text-to-video, like, on path to that
- 1:05 – 2:25
Simulating the world with Sora
- SGSarah Guo
mission? How'd you end up working on this?
- BPBill Peebles
Yeah, we absolutely believe models like Sora are really on the critical pathway to AGI. We think one sample that illustrates this kind of nicely is a scene with a bunch of people walking through Tokyo during the winter. And in that scene, there's so much complexity. So you have a camera which is flying through the scene. There's lots of people which are interacting with one another. They're talking, they're holding hands. There are people selling items at nearby stalls. And we really think this sample illustrates how Sora is on a pathway towards being able to model extremely complex environments and worlds, uh, all within the weights of a neural network. And looking forward, you know, in order to generate truly realistic video, you have to have learned some model of how people work, how they interact with others, how they think ultimately, and not only people, also animals and really any kind of object you want to model. And so looking forward as we continue to scale up models like Sora, we think we're going to be able to build these, like, world simulators where essentially, you know, anybody can interact with them. I, as a human, can have my own simulator running and I can go and, like, give a human in, in that simulator work to go do, and they can come back with it after they're done. And we think this is a pathway to AGI which is just going to happen as we scale up Sora in the future.
- SGSarah Guo
It's been said that we're still far away despite massive demand for a consumer product. Like, what, uh, is, is that on the roadmap? What do you have to work on before you, you have broader access to Sora?
- 2:25 – 5:50
Building the most valuable consumer product
- SGSarah Guo
Tim, you wanna talk about it?
- TBTim Brooks
Sure. Yeah, so we really want to engage with people outside of OpenAI in thinking about how Sora will impact the world, how it will be useful to people. And so we don't currently have immediate plans or even a timeline for creating a product. But what we are doing is we're giving access to Sora to a small group of artists, as well as to red teamers, to start learning about what impact Sora will have. And so we're getting feedback from artists about how we can make it most useful as a tool for them, as well as feedback from red teamers, um, about how we can make this safe, how we could introduce it to the public. And this is gonna set our roadmap for our future research and inform, if we do in the future end up coming up with a product or not, um, exactly what timelines that would have.
- SGSarah Guo
Aditya, can you tell us about some of the feedback you've gotten?
- ARAditya Ramesh
Yeah. So we have given access to Sora to, like, a small handful of artists and creators just to get early feedback. Um, in general, I think a big thing is just controllability. So right now, the model really only accepts text as input, and while that's useful, it's still pretty constraining in terms of being able to, uh, specify, like, precise descriptions of what you want. So we're thinking about, like, you know, how to extend the capabilities of the model p- potentially in the future so that you can supply inputs other than just text.
- EGElad Gil
Do y'all have a favorite thing that you've seen artists or others use it for, or a favorite video or something that you've found really inspiring? I know that when it launched, a lot of people were really stricken by just how beautiful some of the images were, how striking, how you'd see the shadow of a cat in a pool of water, things like that. But I was just curious what, what you've seen sort of emerge as people- more and more people have started using it.
- TBTim Brooks
Yeah, it's been really amazing to see what the artists do with the model because we have our own ideas of some things to try, but then people who, for their profession, are making creative content are, like, so creatively brilliant and do such amazing things. So Shy Kids have this really cool video that they made, this short story of, uh, Airhead with, um, this character that has a balloon, and they really, like, made this story. And there, it was really cool to see a way that Sora can unlock and make this story easier for them to tell. And I think there, it's even less about, like, a particular clip or video that Sora made and more about the story that the- these artists want to tell and are able to share, and then Sora can help enable that. So that is really amazing to see.
- SGSarah Guo
You, you mentioned the Tokyo scene.
- TBTim Brooks
Yeah.
- SGSarah Guo
Others?
- BPBill Peebles
My personal favorite sample that we've created is, uh, the Bling Zoo. So we- I posted this on my Twitter, uh, the day we launched Sora, and it's essentially a, a multi-shot scene of a zoo in New York which is also a jewelry store. And so you see, like, saber tooth tigers kinda, like, decked out with bling, you know?
- SGSarah Guo
It was very surreal, yeah.
- BPBill Peebles
Yeah, yeah. And so I love those kinds of samples because as someone who, you know, loves to generate creative content but doesn't really have the skills to do it, it's, like, so easy to go play with this model and to just fire off a bunch of ideas and, uh, get something that's pretty compelling. Like, the time it took to actually generate that in terms of iterating on prompts was, you know, really, like, less than an hour to, like, get something I really loved. Um, so I had so much fun just playing with the model to get something like that out of it, and it's great to see that artists are also enjoying using the models and getting great content from that.
- EGElad Gil
What do you think is a timeline to broader use of these sorts of models for
- 5:50 – 8:41
Alternative use cases and simulation capabilities
- EGElad Gil
short films or other things? 'Cause if you look at, for example, the evolution of Pixar, they really started making these Pixar shorts, and then a subset of them turned into these longer format movies. And, um, a lot of it had to do with how well could they actually world model even little things like the movement of hair or things like that. And so it's been interesting to watch the evolution of that prior generation of technology, which I now think is 30 years old or something like that.
- TBTim Brooks
Mm-hmm.
- EGElad Gil
Do you have a prediction on when we'll start to see actual content, either from Sora or from other models, that will be professionally produced and sort of part of the broader media genre?
- TBTim Brooks
That's a good question. I- I don't have a prediction on the exact timeline, but- but one thing related to this I'm really interested in is what things other than, like, traditional films people might use this for. I do think that, yeah, maybe over the next couple years we'll see people starting to make, like, more and more films. But I think people will also find completely new ways to use these models that are just different from the current media that we're used to, 'cause it's a very different paradigm when you can tell these models kind of what you want them to see and they can respond in a way, and maybe there are just, like, new modes of interacting with content that, like, really creative artists will come up with. So I'm actually, like, most excited for what totally new things people will be doing that's just different from what-
- EGElad Gil
Yeah.
- TBTim Brooks
... we currently have.
- EGElad Gil
It's really interesting because one of the things you mentioned earlier, this is also a way to do world modeling. And I think
- ARAditya Ramesh
Yeah.
- EGElad Gil
... you've been at OpenAI for something like five years, and so you've seen a lot of the evolution of models in the company and what you've worked on. And I remember going to the office really early on, and it was initially things like robotic arms and it was self-playing games and things, or self play for games and things like that. Um, as you think about the capabilities of this world simulation model, do you think it'll become a physics engine for simulation where people are, you know, actually simulating, like, wind tunnels? Is it a basis for robotics? Can you assist there? Is it something else? I'm just sort of curious about some of these other forward-looking applications that could emerge.
- ARAditya Ramesh
Yeah. I- I totally think that carrying out simulations in the video model is- is something that we're gonna be able to do, um, in the future at some point. Um, Bill actually has a lot of thoughts about, uh, this sort of thing, so maybe you can-
- BPBill Peebles
Yeah, I mean, I- I think you hit the nail on the head with applications like robotics. Um, you know, there's so much you learn from video, which you don't necessarily get from other modalities that companies like OpenAI have invested a lot in in the past, like language. You know, like the minutiae of like how arms and joints move through space. You know, again, getting back to that scene in Tokyo, how those legs are moving and how they're making contact with the ground in a physically accurate way. So you learn so much about the physical world, uh, just from training on raw video that we really believe that it's gonna be essential for, uh, things like physical embodiment moving forward.
- SGSarah Guo
And talking more about, uh, the model itself, there are a bunch of really interesting innovations here, right? So not to put you on the spot, Tim, but can you, uh, describe for a broad technical audience what a diffusion
- 8:41 – 10:15
Diffusion transformers explanation
- SGSarah Guo
transformer is?
- TBTim Brooks
Totally. So Sora builds on research from both the DALL-E models and the GPT models at OpenAI, and diffusion is a process that creates, uh, data, in our case videos, by starting from noise and iteratively removing noise many times until eventually you've removed so much noise that it just creates a sample. And so that is our process for generating the videos. We start from a video of noise, and we remove it incrementally. But then architecturally, it's really important that our models are scalable and that they can learn from a lot of data and learn these really complex and challenging relationships in videos. And so we use an architecture that is similar to the GPT models, and that's called a transformer. And so diffusion transformers combine these two concepts, and the transformer architecture allows us to scale these models, and as we put more compute and more data into training them, they get better and better. And we even released a technical report on Sora, and we show the results that you get from the same prompt when you use a smaller amount of compute, an intermediate amount of compute, and more compute. And by using this method, as you use more and more compute, the results get better and better. And we strongly believe this trend will continue, so that by using this really simple methodology, we'll be able to continue improving these models by adding more compute, adding more data, and they will be able to do all these amazing things we've been talking about, having better simulation and longer term generations.
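To make the process Tim describes concrete, here is a minimal toy sketch of a diffusion transformer: a transformer predicts the noise in a sequence of patch tokens, and sampling starts from pure noise and removes a little of it at each step. The module names, sizes, and the crude denoising update are all illustrative assumptions, not Sora's actual architecture or sampler.

```python
# Toy diffusion transformer sketch (illustrative only, not Sora's code).
import torch
import torch.nn as nn

class TinyDiffusionTransformer(nn.Module):
    def __init__(self, patch_dim=64, d_model=128, n_layers=2):
        super().__init__()
        self.embed = nn.Linear(patch_dim, d_model)       # patch tokens -> model width
        self.time_embed = nn.Linear(1, d_model)          # condition on the noise level t
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, patch_dim)        # predict the noise to subtract

    def forward(self, noisy_patches, t):
        # noisy_patches: (batch, n_tokens, patch_dim); t: (batch, 1) in [0, 1]
        h = self.embed(noisy_patches) + self.time_embed(t).unsqueeze(1)
        return self.head(self.backbone(h))

@torch.no_grad()
def sample(model, n_tokens=16, patch_dim=64, steps=50):
    """Start from a 'video of noise' and iteratively remove noise."""
    x = torch.randn(1, n_tokens, patch_dim)
    for i in reversed(range(steps)):
        t = torch.full((1, 1), (i + 1) / steps)
        predicted_noise = model(x, t)
        x = x - predicted_noise / steps                  # crude Euler-style update, illustrative
    return x

print(sample(TinyDiffusionTransformer()).shape)          # torch.Size([1, 16, 64])
```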
- 10:15 – 13:08
Scaling laws for video
- SGSarah Guo
Bill, uh, can we characterize at all what the scaling laws for this type of model look like yet?
- BPBill Peebles
Good question. So as Tim alluded to, you know, one of the benefits of using transformers is that you inherit all of their great properties that we've seen in other domains like language. Um, so you absolutely can begin to come up with scaling laws for video as opposed to language. And this is something that, you know, we're actively looking at in our team and, you know, not only constructing them but figuring out ways to make them better. So, you know, if I use the same amount of training compute, can I get an even better loss, uh, without fundamentally increasing the amount of compute needed? So these are a lot of the questions that we tackle day to day on the research team to make Sora and future models as good as possible.
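As a back-of-envelope illustration of what "constructing a scaling law" can look like, here is a sketch that fits a power law to hypothetical (training compute, loss) points and extrapolates; all numbers are made up for illustration and are not Sora's measurements.

```python
# Fit a power law loss ≈ k * compute^slope (a straight line in log-log space)
# to hypothetical measurements and extrapolate to a larger compute budget.
import numpy as np

compute = np.array([1e18, 1e19, 1e20, 1e21])   # training FLOPs (hypothetical)
loss    = np.array([3.10, 2.55, 2.18, 1.93])   # validation loss at each scale (hypothetical)

slope, intercept = np.polyfit(np.log10(compute), np.log10(loss), 1)

def predicted_loss(flops):
    return 10 ** (slope * np.log10(flops) + intercept)

print(f"fitted exponent: {slope:.3f}")
print(f"extrapolated loss at 1e22 FLOPs: {predicted_loss(1e22):.2f}")
```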
- SGSarah Guo
One of the, like, questions about applying, you know, transformers in this domain is, um, like tokenization, right? Uh, and so by the way, I don't know who came up with this name, but like latent spacetime patches is like a great sci-fi name here.
- BPBill Peebles
(laughs)
- SGSarah Guo
Can you explain, like, what that is and, like, why- why it is relevant here? Because, you know, the ability to do minute long generation, um, and get to, uh, like visual and temporal coherence is really amazing.
- EGElad Gil
Right.
- TBTim Brooks
I don't think we came up with it, like, as a name so much as, like, a descriptive thing of exactly what it... Like, that's what we call it.
- SGSarah Guo
Yeah.
- TBTim Brooks
'Cause there-
- SGSarah Guo
Even better though.
- TBTim Brooks
Yeah.
- ARAditya Ramesh
Yeah.
- EGElad Gil
Yeah.
- BPBill Peebles
Yeah. So one of the critical successes for the LLM paradigm has been this notion of tokens. So if you look at the internet, there's all kinds of text data on it. There's books, there's code, there's math. And what's beautiful about language models is that they have this singular notion of a token, which enables them to be trained on this vast swath of, like, very diverse data. There's really no analog for prior visual generative models. So, you know, what was very standard in the past before Sora is that you would train, say, an image generative model or a video generative model on just, like, 256 by 256 resolution images or 256 by 256 video that's exactly, like, four seconds long. And this is very limiting because it limits the types of data you can use. You have to throw away so much of, you know, uh, the visual data that exists on the internet, and that limits, like, the generalist capabilities of the model. So with Sora, we introduced this notion of spacetime patches, where you can essentially just represent data however it exists in an image, in a really long video, in like a, a tall, vertical video, by just taking out cubes. So you can essentially imagine, right, a video as just like a stack, a vertical stack of, uh, individual images, and so you can just take these, like, 3D cubes out of it, and that is our notion of a token when we ultimately feed it into the transformer. And the result of this is that Sora, you know, can do a lot more than just generate, say, like, 720p video, um, at, for some, like, fixed duration, right? You can generate vertical videos, widescreen videos. You can do anything, uh, from like a one-to-two aspect ratio to two-to-one. It can generate images. It's an image generation model. And so this is really the first generative model of visual content, uh, that has breadth in a way that language models have breadth. So that was really why we pursued this
- 13:08 – 15:30
Applying end-to-end deep learning to video
- BPBill Peebles
direction.
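A rough sketch of the spacetime-patch idea Bill describes, assuming a simple NumPy video array; the patch sizes and the even-trimming choice are hypothetical illustrations, not how Sora actually tokenizes video.

```python
# Cut a video of any duration, resolution, or aspect ratio into 3D spacetime
# cubes and flatten each cube into one token (illustrative sketch only).
import numpy as np

def video_to_spacetime_patches(video, t_patch=4, h_patch=16, w_patch=16):
    """video: (frames, height, width, channels) -> (n_patches, patch_elements)."""
    T, H, W, C = video.shape
    # Trim so each dimension divides evenly into patches (a simplifying choice here).
    T, H, W = T - T % t_patch, H - H % h_patch, W - W % w_patch
    video = video[:T, :H, :W]
    patches = video.reshape(T // t_patch, t_patch,
                            H // h_patch, h_patch,
                            W // w_patch, w_patch, C)
    patches = patches.transpose(0, 2, 4, 1, 3, 5, 6)     # gather each cube's dims together
    return patches.reshape(-1, t_patch * h_patch * w_patch * C)

# The same routine handles widescreen video, vertical video, or a single image
# (one frame), which is the flexibility being described above.
widescreen = np.random.rand(16, 256, 448, 3)
vertical   = np.random.rand(16, 448, 256, 3)
image      = np.random.rand(1, 256, 256, 3)
for clip in (widescreen, vertical, image):
    t = 1 if clip.shape[0] == 1 else 4
    print(video_to_spacetime_patches(clip, t_patch=t).shape)
```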
- SGSarah Guo
It feels just as important on the, like, input and training side, right? In- in terms of being able to take in different types of video?
- BPBill Peebles
Absolutely. And so a huge part of this project, uh, was really developing the infrastructure and systems needed to be able to work with this vast data, um, in a way that hasn't been needed for previous, uh, image or video generation systems.
- TBTim Brooks
A lot of the models before Sora that were working on video were really looking at extending image generation models, and so there was a lot of great work on image generation. And what many people have been doing is taking an image generator and extending it a bit. Instead of doing one image, you can do a few seconds. But what was really important for Sora and was really this difference in architecture was instead of starting from an image generator and trying to add on video, we started from scratch, and we started with the question of how are we gonna do a minute of HD footage? And that was our goal. And when you have that goal, we knew that we couldn't just extend an image generator. We knew that in order to do a minute of HD footage, we needed something that was scalable, that broke down data into a really simple way so that we could use scalable models. So I think that really was the architectural evolution from image generators to what led us to Sora.
- EGElad Gil
That's a really interesting framework, because it feels like it could be applied to all sorts of other areas where people aren't currently applying end-to-end deep learning.
- TBTim Brooks
Yeah, I think that's right. And it- it makes sense 'cause in the shortest term, right, we weren't the first to come out with a video generator. A lot of people have done impressive work on video generation. But we were like, okay, we'd rather pick a point further in the future and just, you know, work for a year on that. Um, and there is this pressure to do things fast because AI is so fast, and the fastest thing to do is, oh, let's take what's working now and let's kind of like add on something to it, and that probably is, as you're saying, more general than just image-to-video but other things. But sometimes it takes taking a step back and saying like, "What- what will the solution to this look like in three years? Let's start building that."
- EGElad Gil
Mm-hmm. Yeah, it seems like a very similar transition happened in self-driving recently-
- TBTim Brooks
Mm-hmm.
- EGElad Gil
... where- wh- where people went from bespoke, edge case sort of predictions-
- TBTim Brooks
Right.
- EGElad Gil
... and heuristics and all beta DL to like end-to-end deep learning-
- TBTim Brooks
Yeah.
- EGElad Gil
... in some of the new models. So it's- it's very exciting to see it applied to video. One of the striking things about Sora is just the visual aesthetic
- 15:30 – 17:08
Tuning the visual aesthetic of Sora
- EGElad Gil
of it, and I'm a little bit curious, how did you go about either, uh, tuning or crafting that aesthetic? Because I know that in some of the more traditional, um, image gen models, uh, you both have feedback that helps impact evolution of aesthetic over time, but in some cases, people are literally tuning the models, and so I'm a little bit curious how you thought about it in the context of Sora.
- ARAditya Ramesh
Yeah. Well, to be honest, we didn't spend a ton of effort on it for Sora.
- SGSarah Guo
The world is just beautiful?
- ARAditya Ramesh
Yeah. (laughs)
- SGSarah Guo
Oh, this is a great answer.
- EGElad Gil
There is. (laughs)
- ARAditya Ramesh
I- I think that's maybe the honest answer to most of it. I think Sora's language understanding definitely allows the user to steer it, uh, in a way that would be more difficult with like other models. So you can provide a lot of like hints and visual cues that will sort of steer the model toward the type of generations that you want.
- SGSarah Guo
But it's not like the ide- yeah, aesthetic is like deeply embedded?
- ARAditya Ramesh
Yeah, not yet.
- SGSarah Guo
(laughs)
- ARAditya Ramesh
Um, but I think moving to the future-
- EGElad Gil
That's for two.
- ARAditya Ramesh
... you know, I- I feel like the models kind of empowering people to sort of, uh, get it to grok your personal sense of aesthetic is gonna be something that, uh, a lot of people will look forward to. Uh, many of the artists and creators that we talked to, they'd love to just like upload their whole portfolio of assets to the model and be able to draw upon like a large body of work when they're writing captions and have the model understand like the jargon of their design firm accumulated over many decades and so on. Um, so I think personalization and- and, uh, how that will kind of work together with, uh, aesthetics is gonna be a cool thing to explore later on.
- SGSarah Guo
I think to the point, um, Tim was making about just like, uh, you know, new applications beyond traditional
- 17:08 – 20:12
The road to “desktop Pixar” for everyone
- SGSarah Guo
entertainment, I work and I travel and I have young kids, and so I don't know if this is like something to be judged for or not, but one of the things I do today is, um, generate what amount to like short audio books with voice cloning, um, DALL-E images and, you know, stories in the style of like The Magic Tree House or whatever in a- around some topic that either I'm interested in, like, ah, you know, hang out with Roman emperor X, right, or, um, something the- the girls, my kids are interested in. But this is computationally expensive and hard and not yet quite possible. But I imagine there's some version of like desktop Pixar for everyone, which is like, you know, I- I think kids are gonna find this first, but I'm gonna narrate a story and have like magical visuals happen in real time. I think that's a very different entertainment paradigm than we have now.
- TBTim Brooks
Totally. I mean-
- SGSarah Guo
Are we gonna get it?
- TBTim Brooks
I- yeah, I think we're headed there, and a different entertainment paradigm and also a different educational paradigm and a communication paradigm. Entertainment's a big part of that, but I think there are actually many potential applications once this really understands our world. And so much of our world and how we experience it is visual. And something really cool about these models is that they're starting to better understand our world and what we live in and the things that we do. And we can potentially use them to entertain us, but also to educate us. And, like, sometimes if I'm trying to learn something, the best thing would be if I could get a custom-tailored educational video to explain it to me. Or if I'm trying to communicate something to someone, you know, maybe the best communication I could do is make a video to explain my point. So I think that entertainment, but also kind of a much broader set of potential things that video models could be useful for.
- SGSarah Guo
That makes sense. I mean, that resonates in that I think if you ask people under some certain age cutoff, th- they'd say the, the biggest driver of education in the world is YouTube today.
- TBTim Brooks
Right.
- SGSarah Guo
Better or worse.
- TBTim Brooks
Yeah.
- EGElad Gil
Have you all tried applying this to things like digital avatars? I mean, there's companies like Synthesia, HeyGen, et cetera. They're doing interesting things in this area about having a true, um, uh, something that really encapsulates a person in a very deep and rich way. Uh, seems kind of fascinating as one potential adaption- adaptive, uh, approach to this. I'm just sort of curious if you've tried anything along those lines yet. Or if it, if it's not really applicable, given that it's more of, like, text-to-video prompts.
- TBTim Brooks
So we haven't f- We've really focused on just the core technology behind it so far. So we haven't focused that much on, for that matter, particular applications, including the idea of avatars, which makes a lot of sense, and I think that would be very cool to try. I think where we are in the trajectory of Sora right now is like, this is the GPT-1 of these n- this new paradigm of visual models, in that we're really looking at the fundamental research into making these way better, making it a way better engine that could power all these different things. So that's s- so our focus is just on this fundamental development of the technology right now, maybe more so than specific downstream applications.
- EGElad Gil
That makes sense. Yeah, one of the reasons I ask about, uh, the avatar stuff as well is it starts to open questions around safety. And so I
- 20:12 – 22:34
Safety for visual models
- EGElad Gil
was...
- TBTim Brooks
Yeah.
- EGElad Gil
... a little bit curious, you know, how you all thought about, um, safety in the context of video models and the potential to do deepfakes or spoofs or things like that.
- ARAditya Ramesh
Yeah, I can speak a little bit to that. It's definitely a pretty complex topic. I think a lot of the safety mitigations could probably be ported over from DALL-E 3. Um, for example, the way we handle, like, racy images or gory images, things like that. Um, there's definitely gonna be new safety issues to worry about, for example, misinformation. Um, or for example, like, do we allow users to generate images that have offensive words on them? And I think one key thing to figure out here is, like, how much responsibility, uh, do the companies deploying this technology bear? Uh, how much should social media companies do, for example, to inform users that content they're seeing, uh, may not be from a trusted source? And how much responsibility does the user bear for, you know, using this technology to create something in the first place? Um, so I think it's tricky and we need to think hard about these issues to sort of, uh, reach a position that, that we think is, is gonna be best for people.
- EGElad Gil
Yeah, that makes sense. It's also, there's a lot of precedent. Like, people used to use Photoshop to manipulate images and then publish them.
- ARAditya Ramesh
Yeah.
- EGElad Gil
And make claims, and it's not like, uh, people said that therefore the maker of Photoshop is liable for somebody abusing the technology. So, i- it seems like there's a lot of precedent in terms of how you can think about some of these things as well.
- ARAditya Ramesh
Yeah, totally. Like, we want to release something that people feel like they really have the freedom to express themselves and do what they want to do. Um, but at the same time, sometimes that's at odds with, uh, you know, doing something that is responsible and sort of gradually, um, releasing the technology in a way that people can get used to it.
- EGElad Gil
Uh, I guess a, a question for all of you, maybe starting with Tim, is like... And if you can share this, great. If not
- BPBill Peebles
(laughs)
- EGElad Gil
... understood (laughs) . But, uh, what is the thing you're most excited about in terms of the future product roadmap or where you're heading or some of the capabilities that you're working on next?
- TBTim Brooks
Yeah. Um, great question. I'm really excited about the things that people will create with this. I think there are so many brilliant, creative people with ideas of things that they want to make. And sometimes being able to make that is really hard because it requires resources or tools or things that you don't have access to. And there's the potential for this technology to enable so many people with brilliant, creative ideas to make things. And I'm really excited for what awesome things they're gonna make and that this technology will help them make.
- SGSarah Guo
Bill, maybe
- 22:34 – 25:04
Limitations of Sora
- SGSarah Guo
one, one question for you would just be if this is, um, as you just mentioned, like, the GPT-1, uh, we have a long way to w- go. Uh, this isn't something that the general public has an opportunity to experiment with yet. Can you sort of characterize what the limitations are or what the gaps are that you wanna work on besides the obvious around, like, length, right?
- BPBill Peebles
Yeah. So, I think in terms of making this something that's more widely available, um, you know, there's a lot of serving kind of considerations that have to go in there. So a big one here is making it cheap enough for people to use. So we've said, you know, in the past that in terms of generating videos, it, it depends a lot on the exact parameters of, you know, like, the resolution and the duration of the video you're creating. Uh, but, you know, it's not instant, and you have to wait at least, like, a few minutes, uh, for, like, these really long videos that we're generating. And so, we're actively working on threads here to make that cheaper in order to democratize this, uh, more broadly. Uh, I think there's a lot of considerations as Aditya and Tim were alluding to on the safety side as well. Um, so in order for this to really become more broadly accessible, we need to, you know, make sure that, especially in an election year, we're being really careful with the potential for misinformation and any surrounding risks. Uh, we're actively working on addressing these threads today. That's a big part of our research roadmap.
- SGSarah Guo
What about just core, um, like, uh, for lack of a better term, like, quality issues?
- BPBill Peebles
Yeah, yeah.
- SGSarah Guo
Right? Are there specific things, like if it's object permanence or certain types of interactions you're thinking through?
- BPBill Peebles
Yeah. So as we look, you know, forward to, you know, like, the GPT-2 or GPT-3 moment, uh, I think we're really excited for very complex long-term physical interactions to become, uh, much more accurate. So to give a concrete example of where Sora falls short today...... you know, if I have a video of someone, like, playing soccer and they're kicking around a ball, at some point, you know, that ball is probably gonna, like, vaporize and maybe come back. Um, so it can do certain kinds of simpler interactions pretty reliably, you know, things like people walking, for example. Um, but these types of more detailed object to object interactions are definitely, uh, you know, still a feature that's in the oven and we think it's gonna get a lot better with scale. But that's something to look forward to moving forward.
- SGSarah Guo
There's one sample that I think is, like, a glimpse of the future... I mean, sure, there, there are many, but there's one I've seen, uh, which is, um, you know, a man taking a bite of a burger-
- BPBill Peebles
Yeah.
- SGSarah Guo
... and the bite being in the burger in terms of, like, keeping state-
- BPBill Peebles
Yeah.
- SGSarah Guo
... which is very cool.
- BPBill Peebles
Yeah. We're really excited about that one. Also, there's another one where, uh, it's, like, a woman, like, painting with watercolors on a canvas, and it actually leaves a trail. So there's, like, glimmers of, you know, this kind of capability in the current model, as you said, uh, and we think it's gonna get much better in the future.
- SGSarah Guo
Is there anything
- 25:04 – 29:32
Learning from how Sora is learning
- SGSarah Guo
you can say about how, um, the work you've done with Sora, uh, sort of affects the broader research roadmap?
- TBTim Brooks
Yeah, so I think something here is about s- the knowledge that Sora ends up learning about the world, just from seeing all this visual data. It understands 3D, which is one cool thing, because we haven't trained it to. We didn't explicitly bake 3D information into it whatsoever, we just trained it on video data and it learned about 3D because 3D exists in those videos, and it learned that when you take a bite out of a hamburger that you leave a bite mark. So it's learning so much about our world. And when we interact with the world, so much of it is visual, so much of what we see and learn throughout our lives is visual information. So we really think that just in terms of intelligence, in terms of leading toward AI models that are more intelligent, that better understand the world like we do, this will actually be really important for them to have this grounding of like, "Hey, this is the world that we live in." There's so much complexity in it, there's so much about how people interact, how things happen, how events in the past end up impacting events in the future, that this will actually lead to just much more intelligent AI models, more broadly than even generating videos.
- EGElad Gil
It's almost like you invented, like, the future visual cortex plus some part of the, uh, reasoning parts of the brain or something, sort of simultaneously. Uh-
- TBTim Brooks
Yeah. And, and that's a cool comparison, because a lot of the intelligence that humans have is actually about world modeling, right? All the time when we're thinking about how we're going to do things, we're playing out scenarios in our head. We have dreams where we're playing out scenarios in our head. We're thinking in advance of doing things, "If I did this, this thing would happen. If I did this other thing, what would happen?" Right? So we have a world model, and building Sora as a world model is very similar to a big part of the intelligence that humans have.
- SGSarah Guo
Um, how do you guys think about the, uh, sort of analogy to humans as having a very approximate world model versus something that is, um, as accurate as, like, let's say a, uh, a physics engine in the traditional sense, right? Because if I, you know, hold an apple and I drop it, I expect it to fall at a certain rate, but most humans do not think of that as articulating a path with a speed as a calculation. Um, do you think that, uh, sort of learning is, like, parallel in, um, large models?
- BPBill Peebles
I think it's a, a really interesting observation. I think how we think about things is that it's almost like a deficiency, you know, in humans, that it's not so high fidelity.
- SGSarah Guo
Mm-hmm.
- BPBill Peebles
So, you know, the fact that we actually can't do very accurate long-term prediction when you get down to a really narrow set of physics-
- SGSarah Guo
Mm-hmm.
- BPBill Peebles
... um, it's something that we can improve upon with some of these systems. And so we're optimistic that Sora will, you know, supersede that kind of capability and will, you know, in the long run, enable it to be more intelligent one day than humans as world models.
- SGSarah Guo
Mm-hmm.
- BPBill Peebles
Um, but it is, you know, certainly a, an existence proof that it's not necessary for other types of intelligence. Regardless of that, it's still something that Sora and, and models in the future will be able to improve upon.
- SGSarah Guo
Okay, so it's very clear that the trajectory prediction for, like, throwing a football is gonna be better-
- BPBill Peebles
Yeah.
- SGSarah Guo
... in the next versions of these models than mine is, let's say.
- TBTim Brooks
I- if I could add something to that, this relates to the paradigm of scale and, uh, the bitter lesson a bit about how we want methods that as you increase compute, get better and better. And something that works really well in this paradigm is doing the simple but challenging task of just predicting data. And you can try coming up with more complicated tasks, for example, something that doesn't use video explicitly, but is maybe in some, like, space that simulates approximate things or something. But all this complexity actually isn't beneficial when it comes to the scaling laws of how methods improve as you increase scale. And what works really well as you increase scale is just predict data, and that's what we do with text, we just predict text, and that's exactly what we're doing with visual data with Sora, which is we're not making something complicated or trying to figure out some new thing to optimize. We're saying, "Hey, the best way to learn intelligence in a scalable manner-
- SGSarah Guo
Yeah.
- TBTim Brooks
... is to just predict data."
- SGSarah Guo
That makes sense in relating to what you said, Bill, like, predictions will just get much better with no necessary limit that approximates-
- BPBill Peebles
That's right.
- SGSarah Guo
... humans. Right. Aditya, is there, is there anything, uh, you feel like the general public misunderstands about video
- 29:32 – 31:24
The biggest misconceptions about video models
- SGSarah Guo
models or about Sora or you want them to know?
- ARAditya Ramesh
I think maybe the biggest update to people with the release of Sora is that internally, we've always made an analogy, as Bill and Tim said, between Sora and GPT models in that, um, you know, when GPT-1 and GPT-2 came out, it started to become increasingly clear, uh, to some people that simply scaling up these models would give them amazing capabilities. Uh, and it wasn't clear right away if, like, "Oh, well, scaling up next token prediction, uh, would result in a language model that's helpful for writing code." Um, to us, like, it's felt pretty clear that d- applying the same methodology to video models is also gonna result in really amazing capabilities. Um, and I think Sora 1.0 is kind of an existence proof that there's one point on the scaling curve now, and we're very excited for what this is gonna lead to.
- SGSarah Guo
Yeah, amazing. Well, I, I don't know why it's such a surprise to everybody, but bitter lesson once again.
- BPBill Peebles
Yeah. (laughs)
- SGSarah Guo
Yeah.
- BPBill Peebles
I would just say that, as both Tim and Aditya were alluding to, we really do feel like this is the GPT-1 moment, and these models are going to get a lot better very quickly. And we're really excited both for the incredible benefits we think this is gonna bring to the creative world, what the implications are long term for AGI, um, and at the same time, we're trying to be very mindful about the safety considerations and building a robust stack now to make sure that society is actually gonna get the benefits of this while mitigating the downsides. Uh, but it's exciting times and we're looking forward to what future models are gonna be capable of.
- SGSarah Guo
Yeah, congrats on such an amazing, amazing release. Find us on Twitter @nopriorspod. Subscribe to our YouTube channel if you wanna see our faces. Follow the show on Apple Podcasts, Spotify, or wherever you listen. That way, you get a new episode every week. And sign up for emails or find transcripts for every episode at no-priors.com.
Episode duration: 31:24