a16z: Google DeepMind Lead Researchers on Genie 3 & the Future of World-Building
EVERY SPOKEN WORD
45 min read · 8,547 words
- 0:00 – 0:29
Introduction
- Shlomi Fruchter
All of the applications basically stem from the ability to generate a world from just a few words. You look at it and there's a world, you know, being generated in front of your eyes, and it's amazing that it's happening. I was very excited about how far we can push that.
- Jack Parker-Holder
And it's at the point where a human who is not an expert will watch it and think it looks real, right? And I think that's pretty incredible. [upbeat music]
- 0:29 – 1:10
The Evolution of Generative Models
- Erik Torenberg
Jack, Shlomi, Genie 3 has taken over the internet. We're honored to have you on the podcast today. Has the response surprised you? Reflect a little bit on the reaction.
- Shlomi Fruchter
We weren't sure how big it was gonna be, but I definitely felt that we had something that was a long time coming: basically being able to generate environments in real time. I think a lot of work done inside Google DeepMind and outside pointed in that direction, but we really wanted to make it happen, and I hope we have. Yeah.
- Erik Torenberg
Team, why don't we reflect internally a little bit on what we found so game-changing about Genie 3 and why we're so excited to have this conversation. Marco.
- 1:10 – 4:35
Real-Time Interactivity & User Experience
- Marco Mascorro
Yeah, for sure. I mean, first of all, it's an amazing model. There's a lot of excitement around the spatial memory, the consistency across all the frames. I think this is the first time you can have some sort of interactive way of doing this with videos, because it used to be that you would do one prompt and get maybe 15 seconds of video. Now you can actually have an interactive element to it, which I think is very exciting. So can you elaborate a little more on your insights here? For example, figuring out what data you should collect, how you make it so interactive, and keeping the flow of the whole video, which I thought was phenomenal.
- Jack Parker-Holder
Sure, yeah. So you've highlighted a few capabilities: the length of the generation, the consistency of the worlds, and maybe the diversity of things you can generate. The main thing is that we'd made progress on quite a few different fronts, in separate efforts. We had the Genie 2 project, which generated much more 3D-style environments, but it wasn't super high quality. It was a big step coming from Genie 1, but it wasn't the same quality as something like Veo 2, the state-of-the-art video model at the time, which came out in December, roughly the same time, a week after Genie 2. Internally there was a lot of discussion between the two projects about the different directions we were pursuing. And then Shlomi had also worked on GameNGen, the Doom paper, as people know it, which I think you guys also wrote a nice piece on straight after it came out, so that also attracted a lot of attention. So we felt that across these different projects we had a lot of interesting things that would naturally combine, and we could take the most ambitious version of the combined project and see if it was possible. Fortunately it was. I think the timeline is probably the bit that surprised many of us, because obviously we set ourselves these goals and tried very hard to achieve them, but you can never be totally sure how it's actually gonna feel when you've got to that point. It ended up being something that resonated with people a lot more than maybe we expected, but we were always believers, so...
- Shlomi Fruchter
Yeah, I'll just add that the real-time component is really important. Not many people have experienced it firsthand, but in the release we really tried to at least have a few trusted testers interact with it, and also to convey the feel of it by adding these overlays that show what happens, how people can use the keyboard to control it. I think there is something magical about the real-time aspect. I felt it for the first time when our model, the GameNGen model actually, started working fast enough, and we were just like, "Oh my God, I can actually walk around." It was a wow moment. There is something really magical when it responds immediately. I think that's what sparked the imagination of many people when the Doom simulation came out, and here we really wanted to push it somewhere we weren't sure was gonna work. It was definitely at the edge of what's possible; that's how we felt. So we just said, "Yeah, let's try and see if we can make it happen."
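To make the real-time constraint concrete, here is a minimal sketch of what an action-conditioned, real-time generation loop looks like. This is purely illustrative: `WorldModel`, its methods, and the 24 fps budget are assumptions for the sketch, not Genie 3's actual interface or published specs.

```python
import time

# Illustrative sketch only. `WorldModel`, its methods, and the 24 fps
# target are assumptions for this example, not Genie 3's real API or specs.
TARGET_FPS = 24
FRAME_BUDGET = 1.0 / TARGET_FPS  # ~41.7 ms to produce and show each frame

class WorldModel:
    """Stand-in for an autoregressive, action-conditioned frame generator."""
    def __init__(self, prompt: str):
        self.history = [f"first_frame_for({prompt})"]  # frames generated so far

    def next_frame(self, action: str) -> str:
        # A real model would run inference here, conditioned on the
        # generation history plus the user's latest action.
        frame = f"frame_{len(self.history)}_after_{action}"
        self.history.append(frame)
        return frame

def run_session(prompt: str, read_keyboard, render, max_frames: int = 240):
    model = WorldModel(prompt)
    for _ in range(max_frames):
        start = time.monotonic()
        action = read_keyboard()          # e.g. "forward", "turn_left"
        frame = model.next_frame(action)
        render(frame)
        # The loop body must fit inside FRAME_BUDGET, or the world stops
        # responding "immediately" and the magic Shlomi describes is lost.
        remaining = FRAME_BUDGET - (time.monotonic() - start)
        if remaining > 0:
            time.sleep(remaining)
```

The point of the sketch is the budget: unlike an offline video model, every frame of an interactive world has to be produced after the user's input arrives and before the next display refresh.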
- 4:35 – 8:15
Applications and Use Cases
- Justine Moore
I don't know if this was on purpose or not, but you perfectly timed it, when everyone on X and Reddit and everywhere was making those videos of characters walking through games. But those obviously weren't interactive, they weren't real time, and then you came out with this release that said, now this is an actual product, and it blew folks away. I'm curious, because you can imagine so many different applications for this, right? More controllable video generation, making it much easier to create games, even personal gaming where someone is creating their own world to walk through, RL environments for agents, robotics. Are there any particular use cases that you're most excited about?
- Shlomi Fruchter
I think all of the applications basically stem from the ability to generate a world from just a few words. For me, that potential was clear when I started looking at video models, pretty early on. I think one of the models was Imagen Video, a model by Google Research, and there were a lot of models that were very basic compared to what we have today. But the ability to simulate something, where you look at it and there's a world being generated in front of your eyes, it's amazing that it's happening. At that point I was very excited about how far we can push that. Veo was one way to do it, and Genie is definitely another way, making it a bit more interactive. So all of the applications stem from this core capability: it can be entertainment, of course, as you said; it can be training agents; it can be helping agents reason about the world; education. I don't think any particular application is more important than the others. I think it's really up to how developers will build on top of it in the future.
- Jack Parker-Holder
Yeah, I would give basically the same answer in the end, with a different journey to get there. I personally worked in reinforcement learning for a few years before starting the Genie project in 2022. The original motivation was that in RL at the time, we had this problem of asking, "Which environment should we try and solve?" Once you've already done Go, which people thought was years or decades away, well, we reached superhuman level in 2016. And then StarCraft three years later, which is not a particularly long time for such a significant step. By around 2021, it was a big question what we should try to do with RL. We knew the algorithms could learn superhuman capabilities if they had the right environment, but we didn't know what the environment would be. So we were working on designing our own ones. But then, when the first text-to-image models came out, a more promising path appeared: what if we just think long term about what really unlocks unlimited environments? That being said, when we started the project in 2022 it was very focused on that one application, but it seems quite clear now that this could have a big impact on all those other areas you mentioned. It's like language models in 2021: you probably wouldn't have guessed that an IMO gold medal would come that fast a few years later as a direct application of that technology. It was probably more, "Oh, it can help me with my emails," or whatever. I think it's really cool to build this kind of new class of foundation models and then see what people can imagine doing with them. That's one of the very exciting things about sharing the research preview: we get this kind of feedback. So we're hoping a lot of these things can happen.
- 8:15 – 13:12
The Importance of Spatial Memory
- Anjney Midha
One of the things in the research preview post, Jack, that blew me away, and it wasn't even your first GIF in the blog post, it was the second or third: you had this visual of somebody painting a wall with a paintbrush, and then the character moves-
- Shlomi Fruchter
Yeah, the spatial memory.
- Anjney Midha
... out of-
- Shlomi Fruchter
Yeah.
- Anjney Midha
Right? Like out of-
- Shlomi Fruchter
Phenomenal.
- Anjney Midha
... to a different part of the wall, paints, and then moves back, and the original paint is still there. And I didn't believe it. I was like, "There's no way." And then I read the post, and you're right, it was described as spatial memory. So the persistence part for me... I mean, I'm not taking away from all the other stuff. The interactivity is amazing. But I think broadly speaking-
- Jack Parker-Holder
Yeah.
- Anjney Midha
... folks expected that at some point video generation, for example, would become real time. When I saw the Genie 3 post, it was like, "Oh, okay, they actually went and did it." But the spatial memory, the persistence, was when I sat up in my chair and thought, "How did that happen?" Could you talk a little bit about when you discovered that? Was it an emergent property, or was it a specific design goal? What's the backstory? Because that feels like a big unlock. Jack, why don't we start with you?
- Jack Parker-Holder
Yeah, that's a great question. I'll say a few things. The TL;DR is that it was totally planned for, but still incredibly surprising when it worked that well. That specific sample, when I saw it, was hard to believe. For a second I actually wasn't sure the model had generated it. I had to watch it a few times, freeze the frames, look back, and check that it was the same. But to go back a few steps: Genie 2 had some memory, right? That kind of got lost, because Genie 2 came at a time when there were lots of very exciting announcements, with Veo 2 only a few days later. [clears throat] It was a busy time of year, and the main headline act was that we could generate new worlds at all, right?
- Anjney Midha
Mm.
- Jack Parker-Holder
So that was the thing we wanted to emphasize. But it did have, you know, a few seconds of memory, and we had a couple of examples. Like, I created a robot near a pyramid, looked away, looked back, and the pyramid's there.
- Anjney Midha
Mm.
- Jack Parker-Holder
But it's kind of blurry; it's not perfect. And some other models around the same time, or more recently, didn't have this feature at all, so people indexed on that, because they didn't notice the early signs of it in the Genie 2 work. For Genie 3, we went much more ambitious with the same sort of approach, and we made memory a headline goal for ourselves: we said we want minute-plus memory, and real time, and this high resolution, all in the same model. And those are kind of conflicting objectives, right? So we set ourselves this technical challenge, and we said if we target this, it's just about feasible, and it'll be pretty incredible. But you still don't know, obviously, how it's gonna pan out. So when you get to the end of the research, you know, seven months later, seeing those samples is still quite mind-blowing, to be honest. So yeah. [chuckles] It was planned for, but it's still pretty cool and exciting when you see it, because at the end of the day, research projects aren't sure things, are they? So...
- Shlomi Fruchter
One thing we didn't want to do was build an explicit representation. There are definitely methods that achieve consistency through an explicit 3D representation, you know, NeRFs, Gaussian splatting, and other methods that basically say: if we know what the world looks like, and we use the prior assumption that the world remains pretty much static, then we can build a representation and know what you're looking at. That's great for some applications, but we didn't want to go down that path because we felt it was somewhat limiting. So we can definitely say the model doesn't do that. It generates frame by frame, and we think that was really key for the generalization to actually work.
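For readers who want the distinction spelled out, below is a rough sketch of the two approaches Shlomi contrasts, with stub implementations. All names are hypothetical; only the contrast itself (explicit 3D reconstruction versus implicit frame-by-frame prediction) comes from his description.

```python
from collections import deque

# --- Explicit route (NeRF / Gaussian splatting, as Shlomi describes it) ---
def render_from_reconstruction(scene_3d, camera_pose):
    # Consistency is guaranteed by construction: every frame is rendered
    # from one persistent 3D scene. But the world is assumed to stay
    # (mostly) static, which is what makes this route limiting.
    return ("rendered", scene_3d, camera_pose)  # placeholder renderer

# --- Implicit route (what Shlomi says Genie 3 does instead) ---
class FrameGenerator:
    """Stand-in for a learned model; no 3D scene is ever built."""
    def predict(self, history, action):
        return f"frame_{len(history)}_given_{action}"

def implicit_step(model: FrameGenerator, history: deque, action: str):
    # Each frame is predicted from the raw history of (frame, action)
    # pairs. Spatial consistency has to emerge from what the model
    # learned, not from an explicit data structure, which is why the
    # team argues it generalizes further.
    frame = model.predict(list(history), action)
    history.append((frame, action))
    return frame
```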
- Jack Parker-Holder
Every time someone interacts with it for the first time and they test it, they look away and then look back, I'm always holding my breath. [laughs] And then it looks back and it's the same, and I'm like, "Whoa." [laughs] It's still really cool.
- Shlomi Fruchter
It's very cool.
- Jack Parker-Holder
Um, yeah.
- Speaker
And how long is this spatial memory? I don't know if you can talk about it. You mentioned a minute plus, but is there some sort of measure that you have? Can you keep it for half an hour, or what is the limit on that?
- Shlomi Fruchter
There is no fundamental limitation, but with the current design we're limited to one minute of this type of memory.
- Jack Parker-Holder
Yeah. It's also a real-time trade-off, I guess. We felt that, given the breadth and the other capabilities, a minute was sufficient for this version. It's quite a significant leap, but obviously eventually you'd want to extend this.
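A back-of-envelope calculation shows why a minute of memory and real-time generation pull against each other. The figures below (24 fps, 256 latent tokens per frame) are assumptions chosen for illustration; neither number is stated in the episode.

```python
# Assumed figures for illustration only; not stated specs of Genie 3.
fps = 24                 # frames the model must produce per second
horizon_s = 60           # "minute-plus" memory target
tokens_per_frame = 256   # hypothetical latent tokens representing one frame

frames_in_window = fps * horizon_s                    # 1,440 frames
context_tokens = frames_in_window * tokens_per_frame  # 368,640 tokens
frame_budget_ms = 1000 / fps                          # ~41.7 ms per frame

print(frames_in_window, context_tokens, round(frame_budget_ms, 1))
# Every extra second of memory grows the context each new frame must
# attend over, yet that frame still has to arrive within ~42 ms to stay
# real time. Hence the "conflicting objectives" Jack mentions.
```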
- 13:12 – 19:45
Emergent Behaviors & Model Capabilities
- Marco Mascorro
One more question, related to the jump between Genie versions. In LLMs, for example with DeepSeek-R1, they saw in the paper that the longer they kept it running, they would suddenly see these interesting behaviors: the model would start reasoning, or would say, "Oh, I'm wrong here, I should self-correct." In the scaling from two to three, did you see any sort of interesting behavior you were not expecting, which suddenly just appeared by increasing the amount of data and compute?
- Shlomi Fruchter
Yeah. Overall, like many generative models, we see that improvements happen with scale. That's not a secret. I don't think it's the same type of intelligence an LLM has; I'm not sure reasoning is the right term. But we definitely do see things like: if you approach a door, it can infer that it makes sense for the agent to maybe open it, so you might see it starting to do that, for example. Or there's better world understanding that happens over time, and things just look better and more realistic. So those are the trends we've observed. Yeah.
- Jack Parker-Holder
Yeah. And from Genie 2 to 3, I think the real-world capabilities really increased. On the physics side, some of the water simulations and some of the lighting are really breathtaking. We have this example of the storm on the blog, and that one I think is super cool. And it's at the point where a human who is not an expert will watch it and think it looks real, which I think is pretty incredible. Whereas with Genie 2, it kind of understood roughly what these things should do, but you knew it wasn't real. You could look at it and clearly see it was not completely photorealistic. So I think that's quite a big leap in quality on that side.
- Justine Moore
Yeah. One of the things that was really cool in all the examples was the water. It's a great way to see whether the model understands what the world is and how objects interact, and that example someone posted of the feet going into a puddle was amazing. But then there was also that example of-
- Jack Parker-Holder
Yeah.
- Justine Moore
... a cartoon character, more of an animated style, who was running across this green patch of land and then ran into this blue, wavy thing that looked like water, and he started swimming, which I thought was really interesting. Were there particular things you had to do for the model to understand how characters should interact in different environments and different styles?
- Jack Parker-Holder
What you're describing is really the breadth of different environment terrains and worlds, like water, or walking on sand versus going downhill in snow, and how the agent's interactions should differ given the terrain they're in. And I think that really is a property of scale and breadth of training. This is very much an emergent thing; I don't think there's anything really specific we do for it. You hope the model has learned this because it should have general world knowledge. It doesn't always work perfectly, but in general it's pretty good. So for the skiing examples, you do go fast when you go downhill, and when you turn and try to go back uphill, it's very slow, if possible at all. When you go into water, you hope, as you said, that the agent will start swimming and splashing, and this typically happens. When you look down near a puddle, hopefully you're wearing Wellington boots. This kind of stuff does just make sense, and it feels pretty magical because it aligns with what you were thinking about the world, and the model's just generated it all. So yeah, that's also one of the really exciting things for sure.
- Shlomi Fruchter
Yeah. And on top of that, one trade-off we typically have is that we want the model to do two things. We want it to create a world that looks consistent, so, as Jack said, if you walk in rain or puddles, you're probably wearing boots. But if we provide a different description, if the prompt says something else, we want it to still follow the prompt. And there's some tension there, because some things are very unlikely. You might say, "I wanna wear flip-flops and jump in the rain," or whatever, and the model still has to create something that is very unlikely. That's where video models typically find it more challenging, and where our models might find it more challenging too, but it's still successful to a surprising degree at going into these low-probability areas. And in a way, that's what we want, right? Many people don't wanna just look at a video that looks like their own room; they want something a bit more exciting. And that's the magic of these models: they can take you to places that are maybe not so likely to exist in reality.
- Jack Parker-Holder
The text following is really amazing in this model, and that does feel really magical. That's something Veo does really well too: it's really well aligned with the text, pretty much whatever you ask for. And we have that with Genie 3. You can describe very specific worlds, really arbitrary, silly things, and it pretty much works. We actually had this discussion because people were very disappointed to find out that the video I made of my dog was not actually from my dog's photograph. I just described her in text.
- Anjney Midha
Mm.
- Jack Parker-Holder
[chuckles] I don't know if that's a big secret, but it looks exactly like her. The model just kind of knows, and I think that's pretty amazing. It's actually a really important capability that we didn't have with Genie 2, because there we relied on image prompting, and so there was a transfer issue: you rely on Imagen to generate the image, and that often looks really good, but it's not necessarily a good image for starting a world. Whereas going directly from text, you get the controllability to prompt anything you want, plus it just naturally works because it's in the correct space for the model to do its thing. That's something really powerful.
- 19:45 – 20:48
Instruction Following & Text Adherence
- Anjney Midha
And why is that, Jack? What do you think led to such a massive gain in instruction following and text adherence? Because it's a pretty hard thing to do.
- Jack Parker-Holder
Well, our team had never really worked on this, and Genie 1 and 2 both worked with image prompting.
- Anjney Midha
Yeah.
- Jack Parker-Holder
So for this next phase, we leveraged a lot of the research done internally on other projects. Personnel-wise, Shlomi has obviously been co-leading the Veo project, so we were able to build on a lot of other work and ideas internally, and that basically allowed us to turbocharge progress. If we'd done this by incrementally building it ourselves on an island, it would've taken a lot longer than being part of Google DeepMind, where we have teams with a lot of knowledge in different areas that we can lean and build on. That's what's super exciting about being in this company right now: we have so many experts in different areas that we can seek out advice and help from.
- 20:48 – 21:56
Comparing Genie 3 and Other Models
- Anjney Midha
And Shlomi, a question for you on that: having led the Veo 3 work, which is kind of mind-blowing, is there a reason why this is Genie 3 and not, like, Veo 3 real time?
- Shlomi Fruchter
So I think it's definitely a bit different, right? Genie allows you to navigate an environment and then maybe take actions, and that's not something Veo can do at this point.
- Anjney Midha
Yeah.
- Shlomi Fruchter
But there are other aspects that Genie doesn't have. Genie doesn't have audio, for example.
- Anjney Midha
Right.
- Shlomi Fruchter
So we just think that while there are definitely similarities, it's sufficiently different. Another thing is that at this point Genie 3 is not available as a product, whereas Veo we do think about as a product, one that has gone mainstream and become very popular. What the future holds, I don't know. But at this point we just felt it's sufficiently different in terms of capabilities and how we think about it. Genie 3 is pretty much a research preview; it's not something we are releasing as a product at this point.
- 21:56 – 32:23
The Future of World Models & Modalities
- Anjney Midha
Something we think about a lot is: what are the edges of a modality? We talk about this all the time. The lines start blurring pretty quickly between real-time image and video, and then between real-time video and interactive world generation. I don't think we have a good word for what Genie 3 is yet, but you called it a world model, which I think is a great term. In your mind, where do video generation modalities stop and real-time worlds start? Do you think these are converging into basically one modality? Or, if you had to predict over the next few years, will they diverge into completely different disciplines? They seem to share one parent today, which is video generation, but where is the world going, do you think? Are these two completely different fields?
- Shlomi Fruchter
From my perspective, they're different. I would say modality is one thing: we have text, we have audio. Even within audio there are different submodalities; speech is not the same as music. We have different products for music generation and other models for speech generation and speech understanding. So even within one modality you can have different flavors, and then of course you have video and other things. So modality is one dimension, and another is how fast we can create new samples-
- Anjney Midha
Yeah.
- Shlomi Fruchter
... and a completely orthogonal dimension is how much control we have. I think we picked a specific direction, a specific vector in that space, for Genie 3. Different products and different models can try to go in different directions. The space is pretty big, and there are a lot of trade-offs to be made. So I don't know; it really depends. Some people believe there's one model that will do everything, and I think it's still open-ended what the best way is. We're in a place where engineering is a big part of our research, and it's not a paper, right? We want to build something people can actually use. Abstract ideas get you to some point, but to actually build things, we have to make concrete decisions.
- Anjney Midha
Yeah.
- Shlomi Fruchter
And I think that forces you to decide what you want to do and what you're going to build.
- Jack Parker-Holder
Yeah, I think this is a really interesting point, and ultimately it has to be driven by technical decisions and also by the goals. If you look at the models right now, we obviously made a choice that we want Veo 3 and Genie 3 to be separate projects this year. And if you look at them both as they are right now, they each have capabilities the other model does not have. Technically, combining all of that into one model already would be, I think, very challenging.
- Anjney Midha
Mm-hmm.
- Jack Parker-Holder
I mean, Veo 3 is clearly at a higher quality threshold than Genie 3, and it has very different priorities. So the natural thing is to say, "Well, what if we just took these and combined them?" But that may not be the best next step for either of the two models. The thing the other one has may not actually be the most compelling thing for a completely different experience. And given the breadth of interest in both models, there's actually quite a small set of people really actively using both, and they tend to be folks like yourselves who are broadly interested in AI rather than in specific downstream use cases. You mentioned agent training, for one, which needs a high action frequency and more egocentric worlds where tasks can be achieved, but doesn't require the high-quality, cinema-style videos you could generate with a Veo model. It's quite different. And on the filmmaking side, I'm not so sure Genie 3 is really there at this point, and that wouldn't necessarily be the goal.
- Anjney Midha
I don't know. On filmmaking, Justine can do some pretty incredible things with the filmmaking tools today. You'd be surprised. [laughs]
- Jack Parker-Holder
[laughs]
- Justine Moore
Give me access and I will make amazing films with Genie 3. I guess that gets to one of my questions, though. The work you're doing is incredible, and you clearly have so much going on just coordinating training these models and managing these teams. How much do you also have to think about the downstream use cases of a model when you're training it? You could imagine a world in which you just say, "We don't really know or care what people are gonna do with it yet. We're just gonna go in the research direction we think we should go and see what happens." But based on how you're talking about it, it sounds like you've also been pretty thoughtful about what capabilities or features are needed for different potential use cases of different models.
- Shlomi Fruchter
Yeah. I'll say that we have some applications in mind, but that's not what's driving the research. It's more about how far we can push in this particular direction: can we make all of that work with really great quality, really fast generation, real time, very controllable? That's what drove us to develop Genie 3, and the applications follow. To be honest, I don't know what the applications will be; I think we'll be very surprised. With Veo 3, people keep finding new ways it can be useful and new ways to prompt it. There's visual stuff people just discovered that we didn't even think about initially. [chuckles] I expect the same thing here, and that's why I'm excited for more people to be able to access it in the future. In general, our approach is to make sure that over time there is more access to the models we build, and I think that's the only way to discover the real potential.
- Marco Mascorro
I guess one question somewhat related to that: how do you think about going forward, like Genie 4 or 5 or other models? What is top of mind right now? For example, gaming seems like it could be one of the applications: multiplayer-type games where you have two spatial memories, two completely different views, that at some point merge. How are you thinking about what's next? Is it scaling these models on more data and more compute? Is it creating these multi-universe type of things, where you have multiple players looking at the same world from different views? What's top of mind for you guys?
- Jack Parker-Holder
Uh, top of mind for the next few days might be a vacation.
- Marco Mascorro
[laughs]
- Shlomi Fruchter
[laughs]
- Jack Parker-Holder
After that, [chuckles] maybe walking my dog in the real world. And then, I think you mentioned a bunch of really interesting things, to be honest. We're still collecting a lot of feedback on this current model. In general, we're most interested in building the most capable models, and we hope to have even broader impact in the future and really enable other teams to do cool things with it, both internally and externally. For me, I started this with a very focused vision about AGI, and what I'm still most excited about for AGI is embodied agents. I really believe this is the fastest path to getting these agents into the real world, and I think we made a big step towards that. But I'm sometimes even more excited about applications I never thought of that come up from other people seeing the model. So there's this trade-off: obviously you wanna focus on some applications, but you wanna be open-minded about others. And that's the real joy of building models like this: you get to see all these people who can be way more creative with it than me. There are all these really cool things we can do, and I honestly can't tell you what the biggest application will be in one year. But we'll definitely be trying to build better models.
- Shlomi Fruchter
Yeah. As impressive as the models maybe are, I think they're all still very far from actually simulating the world accurately, from being able to put a person in there who can then do whatever they want. And when I say far, I don't mean far in calendar time, because we live in an accelerated timeline, but it feels like there is more work to do to get there. And I imagine, once we can actually, whatever the form factor will be, step into this world and just tell it what we want to experience, there are so many applications. Imagine, for example, someone who is afraid of talking to people on a stage or on a podcast. They can simulate that.
- Jack Parker-Holder
[laughs]
- Shlomi Fruchter
Right? Or someone who's afraid of spiders can maybe actually see themselves getting over that. That's just one example, and actually my wife thought of it; it's not my idea. [laughs] There are so many things, right? It all hinges on the ability to simulate the world and maybe put ourselves in it, maybe see ourselves from the side, and potentially have agents interacting with things. The realism, really making it work in a way that is similar to our world, I think is really key.
- Jack Parker-Holder
I'm actually personally petrified of skiing, and the model-
- Shlomi Fruchter
Ah. [laughs]
- Jack Parker-Holder
... is already quite good at that. So I might spend some time on it when things quieten down, 'cause I promised my wife that our children would grow up knowing how to ski.
- Shlomi Fruchter
[laughs]
- Jack Parker-Holder
And we're getting close to the age where I have to live up to my promise, and I'm not sure if I wanna do it yet. So... [laughs]
- Shlomi Fruchter
We have to improve the model for you, Jack, so you can actually-
- Jack Parker-Holder
[laughs]
- Shlomi Fruchter
... get that in distribution.
- 32:23 – 37:58
Robotics, Simulation, and Real-World Impact
- Marco Mascorro
We were just talking, before we started, about how we might see applications in robotics. I mean, Jack, you were talking about embodied AI, and right now the limitation-
- Jack Parker-Holder
Yeah.
- Marco Mascorro
... in robotics is the data, right? How much data you can collect. And now you can probably just generate a lot of different scenes that you were not able to get before purely from recording videos. So I think that's another thing that's pretty exciting. And, I mean, congrats on the model. It's phenomenal.
- Anjney Midha
On the robotics application, there was a conversation I was listening to from Demis yesterday where he was talking about your work on Genie 3, and he mentioned that there's an agent, I think you call it SIMA-
- Jack Parker-Holder
Yeah.
- Anjney Midha
... right? Which can then interact with the Genie agent.
- Jack Parker-Holder
Yeah.
- Anjney Midha
And as I was hearing him describe it, it was kind of breaking my mind: you had one agent asking the Genie agent to essentially create a real-time environment for it to interact in, right?
- Jack Parker-Holder
Yeah.
- Anjney Midha
... which was when I realized, oh, the way you've built it, it's composable with other agents. Can you talk a little about why that's so important for robotics, like Marco was saying? And what are the major limitations today that you think we'd have to overcome as a field to make the rate of progress in robotics much faster than it is now?
- Jack Parker-Holder
So, we designed it to be an environment rather than an agent. Genie 3 is very much an environment model; we don't see it as an agent that can itself think and act in the world. It's more a general-purpose simulator, in a sense, that can simulate experiences for agents. And we know learning from experience is a really important paradigm for agents. That's how we got AlphaGo: it learned by playing Go by itself, trying new things and learning from feedback with reinforcement learning, improving itself and actually discovering new things. It discovered new moves, like move 37, that humans didn't think were worthwhile, because it could experience and try things for itself. In robotics, we have this paradigm right now where there are data-driven approaches, where you collect data in quite a laborious way, but it looks like the downstream task, so it looks real and there's not much of a mismatch between the two domains. Or you can learn in simulation. But robotics simulators, even the best ones, and we have some of the best at DeepMind, we have MuJoCo, which we work with, are still quite far away from the real world, so you have a sim-to-real gap. And even the sim-to-real gap itself is, I think, poorly named, because what people consider "real" in robotics is typically still a lab or some very constrained environment where you've got a bunch of spotlights on a robot and tons of researchers crowding round watching. Whereas really real, for me, to make yet another dog reference, is the ability to walk my dog when I'm too busy [laughs] to hold the lead: cross the street, see someone who's scared of dogs and know to go around them, see someone with a ball and change direction. All these challenging situations in the real world. And of course you still have gripping and these other tasks, but you need to really discover your own behaviors from your own experience. Doing that in physical embodied worlds is super challenging for many reasons: it can be expensive to collect data in those settings, you'd have to keep moving the robot back to where it started every time it doesn't do something right, and it can be unsafe. So there are many reasons why we can't really do learning from experience in the physical world, and so we do it in simulation. But what we think with Genie 3 is that it's the best of both: you're taking a real-world, data-driven approach, but you've got the ability to learn in simulation. It combines the good parts of each. That's why I think it could be super powerful, not just for the robotics example, but I really love this idea of, [laughs] when it rains a lot in London, not having to take my dog for the second walk.
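To make the "environment, not agent" framing concrete, here is a minimal sketch of learning from experience inside a generated world. Every name is hypothetical; neither Genie 3 nor SIMA exposes this interface. It only illustrates the loop Jack describes.

```python
# Hypothetical interfaces; purely a sketch of the loop described above.
def train_in_generated_world(world, agent, prompt, episodes=100, steps=500):
    for _ in range(episodes):
        # Cheap, safe reset: generate a fresh world from text, instead of
        # walking a physical robot back to its starting position.
        obs = world.reset(prompt)
        for _ in range(steps):
            action = agent.act(obs)              # agent picks an action
            obs = world.step(action)             # world model predicts the result
            reward = agent.estimate_reward(obs)  # e.g. from a learned critic
            agent.learn(obs, action, reward)     # improve from own experience

# The appeal: real-world, data-driven visuals (shrinking the appearance
# side of the sim-to-real gap) combined with the resettable, unlimited
# practice loop that made AlphaGo-style learning from experience work.
```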
- Shlomi Fruchter
And as you can see, we basically built a model for Jack's personal applications. [laughs] That's what's driving the project. That's the point, yeah.
- Anjney Midha
But clearly, Jack-
- Marco Mascorro
There's a lot of dog owners out there.
- Anjney Midha
Yeah.
- Shlomi Fruchter
Yeah.
- Marco Mascorro
Yeah.
- Anjney Midha
I'm just saying, clearly, Jack, it's time to move to California.
- Marco Mascorro
Yeah. [laughs] Less rain.
- Shlomi Fruchter
Less lag.
- Jack Parker-Holder
I mean, I personally love California, but my wife's not convinced, so... [laughs]
- Speaker
We'll convince her.
- Shlomi Fruchter
Yeah. Just to touch on maybe a final point on the robotics part. Robotics is definitely more than visual, right? I think this is an important point. We can drive the decisions of the robot by looking around, but it still has to take actions, decide where to move, how to respond to the environment. So there are definitely some gaps. But at the core of the problem is being able to reason about the environment, and we think that's something world models, or general-purpose world models such as Genie 3, can really help with. And maybe with future research we can actually bridge those gaps of physical understanding and actually get physical responses from the world, which is a very interesting direction to explore.
- 37:58 – 40:41
Looking Ahead: Genie 4, 5, and Future World Models
- Speaker
One last question from my side, and I don't know if you can answer this, but is it gonna become public? Can developers access it at some point, or is there some sort of idea on this?
- Shlomi Fruchter
As you can see, we are very excited about having more people access it, so we definitely want to make that happen. There's no concrete timeline at the moment, but I'm sure once we have more to share, we will.
- Speaker
Awesome.
- Justine Moore
One of the things I've been thinking about a lot is that with every modality, maybe first LLMs and then image and video and audio, there are early glimmers of something really exciting in a project or a research preview, and then a ton of data and compute and researchers get poured at the problem, and you hopefully see this sort of exponential progress until you eventually get to the point where you're out of data, or the improvements don't come as easily. I'm wondering where you think we are on that curve for world models.
- Jack Parker-Holder
That's a really good question. I actually have a super hand-wavy, somewhat swerving answer, and I think it's actually both. The current capabilities are already quite compelling, so you could make the case that if what you wanted was a minute of photorealistic, any-world generation with memory, that could actually be the end goal, and two or three years ago I probably would have said that was a five-year goal. At that point, if you just wanted to improve on that, you'd maybe see diminishing returns. The jump from Genie 2 to Genie 3 was absolutely massive; it went from being a cool bit of research showing signs of life to something that could already be very compelling. But I think there's a lot more you can do with this, and Shlomi referenced this himself: it's not yet the case that you're dropping yourself into the world the way you'd be in the real world. It's actually quite different to that. When you take a minute to look away from the computer screen, it's quite a bit richer out there. [laughs] And that's just for the real world; we also want the ability to generate completely new things. So I think we've got a huge gap to close with the new capabilities we want to add. Maybe it's a bit different to language models. Or actually, maybe it is similar: with language models there have been lots of new steps that came on top that we maybe didn't think were possible. We thought things were plateauing, and then a new idea came that made a significant change, and that has happened a couple of times in the past few years. So I think there are a few more of those left for sure.
- 40:41 – 42:21
Are We Living in a Simulation?
- Speaker
My final question for you guys is: are we living in a simulation? [laughs]
- Shlomi Fruchter
Oh, yeah. That's how every interview has to finish.
- Speaker
[laughs]
- Shlomi Fruchter
My thinking on that... Actually, yeah, I've thought about it a bit. If we live in a simulation, my take is that it doesn't run on our current hardware, because-
- Speaker
[laughs]
- Shlomi Fruchter
... it's analog; it's continuous. All of the observations are continuous. But maybe the quantum level is some limitation of our... You wanted to go philosophical, so here we go.
- Speaker
[laughs]
- Shlomi Fruchter
It's some kind of hardware limitation of the simulation we run on. So yeah, take it or leave it. [laughs]
- Speaker
[laughs] Great answer.
- Jack Parker-Holder
Clearly, there's a lot of work for the TPU team to do. [laughs]
- Shlomi Fruchter
Yeah, maybe quantum computing will actually be what runs our simulation. So yeah. Yeah.
- Speaker
That's a great place to wrap. Shlomi, Jack, thank you so much for coming on the podcast.
- Jack Parker-Holder
Thank you, guys.
- Shlomi Fruchter
Thanks for having us.
- Speaker
Thank you, guys. [upbeat music]
Episode duration: 42:21