a16z: How Fei-Fei Li Is Rebuilding AI for the Real World
EVERY SPOKEN WORD
- 0:00 – 0:39
Spatial Intelligence
- FLFei-Fei Li
Space, the 3D space, the space out there, the space in your mind's eye, the spatial intelligence is a part of, uh, a critical part of intelligence. Suddenly we can actually create infinite universes. Some are for robots, some are for creativity, some are for socialization, some are for travel, some are for storytelling. It suddenly will enable us to live in a multiverse. I didn't need LLM to convince me LWM is important. [upbeat music]
- 0:39 – 1:17
Fei-Fei Li’s Background
- ETErik Torenberg
Martin, why, why don't you, uh, briefly brag on behalf of Fei-Fei a little bit and-
- MCMartin Casado
Yeah, yeah
- ETErik Torenberg
... uh, just share how would you summarize her contributions to AI for people unfamiliar?
- MCMartin Casado
So someone that doesn't need a lot of introduction, and she's done so many things that I can't fill in, so maybe I'll just do the ones that's, that's appropriate to this. I mean, you know, of course she was on the Twitter board. She was a Google exec. She's founder and CEO of World Labs. Um, but very, very importantly, like we all know AI and everybody, you know, we all talk about kind of neural networks and there's a number of people that focused on, you know, making those effective. Um, but Fei-Fei really singularly brought in data to the equation, which now we're recognizing is actually probably the bigger problem, the more interesting one. And so she truly is the godmother of AI as everybody calls her.
- 1:17 – 5:14
Building a World Model
- ETErik Torenberg
And, and Fei-Fei, why did you have to have Martin as a, as the first investor?
- FLFei-Fei Li
Well, first of all, I knew Martin for more than a decade.
- ETErik Torenberg
Long time.
- FLFei-Fei Li
You know, I joined Stanford in 2009 as a young assistant professor, and Martin was finishing his, uh, uh, PhD there. So I always know, and of course Martin's advisor, Nick McKeown, was a good friend, and I always know Martin went on to become a very successful entrepreneur and very successful, uh, uh, investor. So I, we, we see each other, we talk about things, but as I was, uh, formulating the idea of World Labs, I was looking for, uh, what I would call my unicorn investor. [laughs] I don't know if that's a word, but, uh-
- ETErik Torenberg
[laughs]
- FLFei-Fei Li
... that's how I think about this-
- ETErik Torenberg
All clear. [laughs]
- FLFei-Fei Li
... is, uh, who is not only, um, obviously a very, uh, established and successful investor who can be with entrepreneurs on this journey through the ups and downs, who can be very insightful, who can bring the kind of, uh, uh, knowledge, advice, uh, resource, but I was also particularly looking for an intellectual partner.
- ETErik Torenberg
Yeah.
- FLFei-Fei Li
Because what we are doing at World Labs is very deep tech. We are trying to do something no one else has done. Um, we know with a lot of conviction it will change the world literally.
- ETErik Torenberg
Yeah.
- FLFei-Fei Li
But I need someone who is, um, a computer scientist, who is a student of AI, who is, uh, understand product market, um-
- ETErik Torenberg
Go to market
- FLFei-Fei Li
... customers.
- ETErik Torenberg
Yeah.
- FLFei-Fei Li
Go to market, and, and, and just can be on the phone or in person with me every moment of the day as an intellectual partner.
- MCMartin Casado
Well, it's actually, it's actually the, the origin story of u- us first connecting is actually pretty interesting. So Fei-Fei has clearly been thinking about this idea for like, you know, a very long time, like well before starting this, so m- maybe years even and-
- FLFei-Fei Li
Yeah
- MCMartin Casado
... and, and she'll talk about, she has this very deep intuition of what AI needs in order to basically navigate the world, right? But we were at this, this one of Mark's fancy dinner or lunches and, uh, there's a bunch of AI people, and everybody was so excited about LLMs, right? And they was talking about language, and, uh, I'd kind of come to this independent conclusion just 'cause I've actually done a lot of like image investing, um, that like, that wasn't the end of the story. And so Fei-Fei, we're at the end of this table, this, you know, a lot of these people talking about it, and Fei-Fei leans over to me and she's like, "You know what we're missing?" I said, "What are we missing?" She said, "We're missing a world model." And I'm like, "Yes." Yeah, and it kind of fell into place then 'cause I'd been like thinking about stuff at a high level, but like, I mean, she just kind of perfectly-- as she does, she just kind of perfectly articulated this. So she'd, uh, had, you know, a year's worth of thinking about this, had talked to people, et cetera. And so in some way we kind of in our own crooked paths had arrived at a very similar intuition. Hers was like way more filled out [laughs] than like, you know, mine was just kind of this kind of fancy thing. But then after that we actually had a number of conversations where we both kind of agreed that we were aligned on this kind of idea.
- FLFei-Fei Li
Actually, I don't know if you know this, so of course during that lunch, uh, we were just-- we hit it off on this world model idea, but I was at that point already talking to various people, not just, uh, computer scientists, technologists, but also investors and, and potentially business partners. And to be honest, most people didn't get it.
- ETErik Torenberg
Mm.
- FLFei-Fei Li
So, and I was... They, they, they... You know, when I say world model, they nod, but I can just kind of tell that was just a polite nod. So I called Martin, I'm like, "Do you mind coming over to Stanford campus and have coffee with me?" And I said, "Martin, can you define your world model to me?"
- ETErik Torenberg
[laughs]
- FLFei-Fei Li
I really wanted to hear if Martin actually meant it, and the way he de- defined it about an AI model that truly understand the 3D structure, shape, and the compositionality of the world was exactly what I was talking about, and I was like, wow, he's the only person so far-
- ETErik Torenberg
[laughs]
- FLFei-Fei Li
... I've talked to-
- ETErik Torenberg
True
- FLFei-Fei Li
... who actually meant it.
- ETErik Torenberg
Wow.
- FLFei-Fei Li
It's not just nodding.
- 5:14 – 8:07
Reflecting on AI's Evolution
- ETErik Torenberg
World Labs and, and the specifics of this, but maybe first let's, I wanna take you back both to your, your PhD days, your professor days, and just sort of reflect on if you could go back in time and sort of have knowledge of what's happened the preceding 10 years in, in AI, what, what do you think would've been the biggest surprises, or what was the thing that you didn't see coming, uh, that would've shocked your, your younger self?
- FLFei-Fei Li
Yeah, it's ironic to say because as Martin said, I was the person who brought data into the AI world, but I still continue to be so surprised emotionally that the, the data hungry models, the data-driven AI can come this far and genuinely have incredible emergent behaviors-
- ETErik Torenberg
Yeah
- FLFei-Fei Li
... of thinking machine, right? So...
- ETErik Torenberg
Yeah. Why start another foundation model company? W- Why aren't LLMs enough?
- FLFei-Fei Li
You know, my intellectual journey is not about company or papers, is about finding the North Star problem. So it's not like I woke up and say, "I have to do a company." I woke up or every day, day after day for the past few years thinking that there is so much more than language. Language is a incredibly powerful, uh, encoding of thoughts and information, but it's actually not a powerful encoding of what the 3D physical world that all animals and living things living. And if you look at human intelligence, so much is beyond the realm of language. Language is l- a lossy way to capture, um, the world. And, uh, also one subtlety of language is language is purely generative.
- ETErik Torenberg
Mm.
- FLFei-Fei Li
Language doesn't exist in, in nature. We look around, there's not a syllable or, or, or word, whereas the entire physical, perceptual, visual world is there, and animals' entire evolutionary history is built upon so much perceptual and, uh, and eventually embodied intelligence. Humans, not only we survive, live, work, but we build civilization upon constructing the world and, and changing the world. So, so that's the problem I wanna tackle, and in order to tackle that problem, obviously research was important, and I spent years doing that, uh, as a academic, and it's still fun, but I do realize, and especially talking to Martin, that the time has come that concentrated industry-grade effort, focused effort in terms of, um, compute, data, talent is, is really the answer to, to bringing this to life.
- ETErik Torenberg
Yeah.
- FLFei-Fei Li
And that's why I wanted to start World Labs.
- ETErik Torenberg
Amazing.
- 8:07 – 10:20
The Importance of 3D Understanding
- MCMartin Casado
Yeah, Erik, you can do a very simple thought experiment that kind of highlights the difference between language and space. So if I put you in a room and I blindfolded you, and I just described the room, and then I asked you to do a task, the chances of you being able to do it are very little. I'm like, "Oh, 10 foot in front of you is, like, a cup," and like, [chuckles] you know? Like, on the left is... Like, this is just... It's this very inaccurate way to con-convey reality, 'cause reality is so complex and it's so exact, right? On the other hand, if I took off the blindfold, [laughs] and you can see the actual space, right? And you... And what your brain is doing is actually reconstructing the 3D, right? Then you can actually go and manipulate things and touch things, right? And so one way to think about it is we do a lot of language processing, and we use that to communicate and, and, and, you know, high-level ideas, et cetera. But when it comes to navigating the actual world, we really, really rely on the world itself and our, our ability to reconstruct that.
- ETErik Torenberg
And how and when did you realize that, that language model w-weren't enough? 'Cause it seems like it's not super widely known. I don't, I don't hear about this all the time.
- MCMartin Casado
Well, f- so there's this k- you know, so if you ask me, like, kind of what is kind of this, you know, surprising, you know, breakthrough, it's that, um, it's that language went first because we've, like, worked so hard on robotics, right? I mean, I feel like even to look at autonomous vehicles, I mean, as, as an industry, we've invested, like, what, $100 billion in it. You know, I remember when Sebastian Thrun, like, actually won, like, the DARPA Grand Challenge in-
- FLFei-Fei Li
2006.
- MCMartin Casado
2006, and we're like, "Hooray. [laughs] AV is done," right?
- ETErik Torenberg
Yeah.
- MCMartin Casado
And then, you know, 20 years later, like, we're finally there, $100 billion in, et cetera. This is, like, a 2D problem.
- ETErik Torenberg
Yeah.
- MCMartin Casado
And so that was the path we were going on is do you actually solve, like, world navigation? And it's harder. Then out of nowhere comes these LLMs, and they, they, they, they, they are unit economic positive. They solve all of these language problems, like, basically immediately. And so it just took me a moment... Actually, Fei-Fei said it beautifully, which is, you know, the part of our brain that actually deals with language is actually pretty recent, and so we're actually pretty inefficient at it, right? And so the fact that a computer does it better is, like, not super surprising, but the part of the brain that actually does the navigation, you know, the, the spatial has been around since mammalian brains.
- FLFei-Fei Li
Yeah.
- MCMartin Casado
Maybe the reptilian brains. We're about 4 million years.
- FLFei-Fei Li
No, it's, it's even more than that. It's a trilobite brain.
- MCMartin Casado
Yeah, yeah. Right?
- FLFei-Fei Li
Trilobite had brain.
- MCMartin Casado
Right.
- FLFei-Fei Li
500
- 10:20 – 12:19
Unrolling Evolution: Why 3D Intelligence Is Harder Than Language
- FLFei-Fei Li
million years.
- MCMartin Casado
Yeah.
- ETErik Torenberg
Wow.
- MCMartin Casado
So, so almost like we're unrolling evolution, right? Like, so the language part is actually very, very important for, like, like, high-level concepts and, like, you know, the laptop class type work-
- ETErik Torenberg
Yeah
- MCMartin Casado
... which is what it's impacting right now. But when it comes to space, and this is everything from, like, you know, robotics to anything where you're trying to construct something physical, you have to solve this problem. And then we know from AV that it's a very tough problem. But, and then maybe this is what is worth talking about, like, the, the, the generative wave gave us some insight on how you might, you might wanna do it. So it really felt like-
- FLFei-Fei Li
Yeah
- MCMartin Casado
... that was the time to-
- FLFei-Fei Li
Well-
- MCMartin Casado
... talk about that
- FLFei-Fei Li
... my journey is very different because I've always been in vision, right? So I feel like I didn't need LLM to convince me LWM is important. I do wanna say we're not here bashing language.
- MCMartin Casado
Of course. Yeah.
- FLFei-Fei Li
I'm, I'm just so excited.
- MCMartin Casado
Yeah.
- FLFei-Fei Li
In fact, seeing ChatGPT and LLMs and these foundation models having such breakthrough success inspires us to, to realize the moment is closer for world models. But Martin said it so beautifully. It's that, uh, space, the 3D space, the space out there, the space in your mind's eye, the, the, the spatial intelligence that enable people to do so many things that's beyond language is a part of, uh, a critical part of intelligence. It goes from ancient animals all the way to humanity's most innoven- innovative findings, such as the structure of DNA, right? That double helix in 3D space, there's no way you could use language alone to reason that out.
- ETErik Torenberg
Yeah.
- FLFei-Fei Li
You know? That, that... So that's just one, one example. Another one of my favorite scientific ex-example is Buckyball. Um-
- ETErik Torenberg
Oh, yeah. Right. Yeah, yeah, yeah
- FLFei-Fei Li
... you know, the carbon-
- ETErik Torenberg
Yeah
- FLFei-Fei Li
... carbon molecule structure that is so beautifully constructed. Uh, that kind of example shows how, uh, incredibly profound space-
- ETErik Torenberg
Yeah
- FLFei-Fei Li
... and, and 3D world is.
- ETErik Torenberg
Let's, um, let's paint even more of a picture.
- 12:19 – 16:52
From Single Reality to Infinite Virtual Universes
- ETErik Torenberg
When, when World Labs is, is, is... has achieved its vision or, or large world models have achieved their vision, what, what is the, uh, what are some applications or use cases or... that we can present to the audience to help, help make it concrete?
- FLFei-Fei Li
Yeah, there is a lot, right? Um, for example, creativity is very visual.
- MCMartin Casado
Yeah.
- FLFei-Fei Li
We have creators from, uh, design to movie to architecture to industry design. Creativity is not just only for entertainment, it could be for productivity, for, uh, for machinery, for many things. That alone is a highly, highly visual, perceptual, spatial, um, um, area or areas of, um, um, work. And of course, we mentioned robotics. Robotics to me is any embodied machines. It's not just humanoids or cars. There's so much in between, but all of them have to somehow figured out, uh, the 3D space it lives in, have to be trained, uh, to understand the 3D space, and have to do things, sometimes even collaboratively with humans, and that needs spatial intelligence. Of course, I think what one thing that's very exciting for me is that for the entirety of human civilization, we all collectively as people lived in one 3D world, and that is the physical Earth 3D world. A few of us went to the, uh, Moon, but, you know-
- MCMartin Casado
[laughs]
- FLFei-Fei Li
... very small number.
- MCMartin Casado
[laughs]
- FLFei-Fei Li
But that's, that's one world.
- MCMartin Casado
Yeah.
- FLFei-Fei Li
But that's what makes the digital virtual world incredible. With this technology, which we should talk about, it's the combination of generation and reconstruction, suddenly we can actually create infinite universes.
- MCMartin Casado
Hmm.
- FLFei-Fei Li
We can, for, some are for robots, some are for creativity, some are for socialization, some are for travel, some are for storytelling. You can, it, it, it suddenly will enable us to live in a multiverse, uh, way, and that's just, uh, the, the, the imagination is boundless.
- MCMartin Casado
These conversations can sound abstract, but they're actually not. But the way, the reason they sound abstract is because it's truly horizontal, just like LLMs are, right? So like if you ask, say like, "What are LLMs good at?" Like, the same LLM we use for like, you know, like an emotional conversation, we use to write code.
- FLFei-Fei Li
Yeah.
- MCMartin Casado
We use to, like, do lists. We use it for self-actualization, right? And so I think we can get actually pretty concrete about, like, what these models do, right? With these models, you can take a view of the world, like a 2D view of the world, like an... And then you could actually create a 3D full representation, including what you're not seeing, like, like the back of the table, for example, within the computer. So given just a 2D view, you have the full thing, and then you ask, okay, well, what can you do with that thing, for example? Well, you can manipulate it, you can move it, you can measure it, you can stack it. So anything that you would do with space, you could do, right? That means you could do architecture, you could do design. But it turns out the ability to fill out the back of the table means that you can fill out stuff that was never there to begin with, right?
- FLFei-Fei Li
Yeah.
- MCMartin Casado
So let's say that I just had a 2D picture of this. I could create a 360 of everything.
- FLFei-Fei Li
Yeah.
- MCMartin Casado
Right? And so now you have fully generative. And so what does that mean? That means, you know, that's video games, that's creativity.
- FLFei-Fei Li
Yeah.
- MCMartin Casado
And so it's a super, super horizontal piece that takes basically a computer with a, a single view in the world or maybe multiple views in the world, and creates a full 3D representation that that computer then a- can act on. And so you can see that that's a very, like, concrete, pivotal thing from everything from, like, robotics to video games to, to, to art and design. It seems like we haven't fully been appreciating sort of the 3D co- components u-until now. Is that, is that fair to say?
- FLFei-Fei Li
It is fair to say. In fact, I think, um, ev- took evolution a long time. 3D is-
- MCMartin Casado
[laughs]
- FLFei-Fei Li
... not a easy problem, but I always, uh, come back to the, the fact that I had a conversation with my six-year-old, uh, years ago about why trees don't have eyes, right?
- MCMartin Casado
[laughs]
- FLFei-Fei Li
And the fundamental thing is trees don't move. They don't need eyes.
- MCMartin Casado
Yeah.
- FLFei-Fei Li
So the fact that the entire basis of animal life is moving and doing things and interacting gives life to perception and spatial intelligence.
- MCMartin Casado
Yeah.
- FLFei-Fei Li
And in turn, spatial intelligence is gonna reinvent horizontally, as Martin said, so many of the, the way of work and life that, uh, humans are, are doing.
- 16:52 – 17:57
3D vs 2D: Why 2D Isn’t Enough for Machines
- MCMartin Casado
need to be 3D or why can't you just use 2D?
- FLFei-Fei Li
Physics happens in 3D, and interaction happens in 3D. Navigating behind the back of the table needs to happen in 3D. Composing the world, whether physically, digitally, needs to happen in 3D. Um, so fundamentally the problem is a, a 3D problem.
- MCMartin Casado
One way to think about it is, um, if it's a human being looking at a, say a 2D video, the human being can reconstruct the 3D in their head, right? But, like, if you need a com- Like, let's say I've got a robot that has the output of the model. If that's 2D, and then you ask the robot to do, I don't know, distance [laughs] or to grab something, like, like that information's missing. Like the, you know, you've got the X, Y, Z plane. The Z plane just isn't there at all, right? And so for many things that are spatial, you need to provide that information to the computer so that you can actually navigate in 3D space. And so 2D video's great if it's a human because we already can turn it into 3D, but, like, for any computer program, it'll need to be 3D.
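A minimal sketch of the missing-Z point above, assuming an idealized pinhole camera. The `project` helper, the focal length, and the sample coordinates are illustrative assumptions, not anything described in the episode. Two points at different distances along the same line of sight land on identical 2D image coordinates, so a 2D output alone cannot tell a robot how far away the object is.

```python
def project(point, focal_length=1.0):
    """Idealized pinhole projection of a 3D point (x, y, z) onto a 2D image plane."""
    x, y, z = point
    if z <= 0:
        raise ValueError("point must be in front of the camera (z > 0)")
    # Dividing by z is the step that discards depth.
    return (focal_length * x / z, focal_length * y / z)


# Hypothetical example: two cups on the same line of sight, 1 m and 4 m away.
near_cup = (0.5, 0.25, 1.0)
far_cup = (2.0, 1.0, 4.0)

print(project(near_cup))
print(project(far_cup))
# Both calls yield the same image coordinates, so the reach distance
# cannot be recovered from this single 2D view.
```

A person watching the 2D image fills in the depth from prior experience; a program consuming only the projected coordinates has no such prior unless the model supplies the third dimension.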
- FLFei-Fei Li
Actually, I want to tell you a personal story. About, uh,
- 17:57 – 19:24
Fei-Fei’s Personal Story of Losing Stereo Vision
- FLFei-Fei Li
five years ago, ironically, I lost my stereo vision for a few months because I had a cornea injury, and that means I'm lit- I was literally seeing with one eye. And like Martin said, my whole life has been trained with stereo vision, so even if I, I was seeing with one eye, I kind of know what the 3D world looked like. But it was, it was a fascinating period as a vision scientist-
- MCMartin Casado
[laughs]
- FLFei-Fei Li
... for me to experiment what the world is. And one thing that truly drove home literally was I was frightened to drive.
- MCMartin Casado
Wow.
- FLFei-Fei Li
Is b- first of all, I couldn't get on a highway. That, that speed, I could not, you know.
- MCMartin Casado
Mm.
- FLFei-Fei Li
But I was just driving in my own neighborhood, and I realized I don't have a good distance measure between my car and the parked car-
- MCMartin Casado
Oh
- FLFei-Fei Li
... on a local, you know, small road. When even though I have perfect understanding of how big is my car almost, how big is the neighbor's, the, you know, the, the, the parked cars, I know the roads for years and years, but just driving there, I had to be so slow, like almost 10 miles an hour so that I don't scratch the cars.
- MCMartin Casado
Wow.
- FLFei-Fei Li
And that was exactly why we needed, uh, stereo vision. And, uh-
- MCMartin Casado
That's a great... That's actually a great articulation of why this, like, 3D is just-
- FLFei-Fei Li
Yeah
- MCMartin Casado
... X and P if you're doing some processing, right? Like-
- FLFei-Fei Li
Yeah. So I don't recommend it, but if you're-
- MCMartin Casado
[laughs]
- FLFei-Fei Li
... very... But park your car one and drive your car two with one eye and, and feel it. That's your own car. [laughs]
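A rough sketch of why two views make distance measurable, using the textbook relation for an idealized, rectified two-camera setup: depth = focal length × baseline / disparity. The function name and all numbers are hypothetical choices for illustration; this is not a model of human vision or of anything World Labs has described.

```python
def depth_from_disparity(focal_px, baseline_m, disparity_px):
    """Depth in metres for an idealized, rectified stereo pair.

    focal_px:     focal length in pixels (assumed identical for both cameras)
    baseline_m:   distance between the two camera centres, in metres
    disparity_px: horizontal shift of the same point between the two images
    """
    if disparity_px <= 0:
        # With a single view there is no disparity to measure,
        # so this relation gives no depth estimate at all.
        raise ValueError("disparity must be positive")
    return focal_px * baseline_m / disparity_px


# Hypothetical numbers: 800 px focal length, 6.25 cm between the two views.
parked_car = depth_from_disparity(800.0, 0.0625, 25.0)   # larger shift
far_corner = depth_from_disparity(800.0, 0.0625, 5.0)    # smaller shift
print(parked_car, far_corner)
```

Nearby objects shift more between the two views than distant ones, which is the cue that disappears when only one eye, or one camera, is available.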
- 19:24 – 22:24
Research and Development at World Labs
- MCMartin Casado
Yeah. With LLMs, a lot of the research was done at the big companies. What's the state of the research here?
- FLFei-Fei Li
This is definitely, um, a newer area of research compared to LLM. It's not totally fair to say new, because in computer vision, we have been, as a field, we have been doing bits and pieces. For example, one important revolution that has happened in 3D computer vision was, uh, a neural radiance field, or NeRF, and that was done by our cofounder, uh, Ben Mildenhall-
- MCMartin Casado
Yeah
- FLFei-Fei Li
... and, uh, and his, uh, colleagues at Berkeley. And that was a, um, a way to do 3D re- reconstruction, uh, using deep learning that was really taking the world by storm about four years ago. We've also, uh, got a cofounder, Chris- Christoph Lassner, whose pioneering work, um, was part of the reason Gaussian splat representation, uh, started to, um, again, uh, become really popular as a way to represent 3D, volumetric, uh, 3D. And of course, uh, Justin Johnson, who was my former student, also cofounder of, uh, World Labs, were, um, among the first generation of deep learning computer vision student who did so much foundational work in image generation when before transformer were out, we were using GANs to do image generation and then style transfer, which, uh, you know, um, w- was really popularized some of the, the, the components or ingredients of, uh, of what we're doing, uh, here. So, so things were happening in, in academia. Things were happening in industry. At World Labs, we just have the conviction that we're gonna, we're gonna be all in on this one singular big North Star problem-
- MCMartin Casado
Yeah
- FLFei-Fei Li
... concentrating on the world's smartest people in computer vision, in, uh, diffusion models, in graph- computer graphics, in, um, optimization, in AI, all of-
- MCMartin Casado
Data
- FLFei-Fei Li
... in data. All of them come into this one team and try to make this work and to product, uh, productize this.
- MCMartin Casado
I, I mean, I will say from an outsider standpoint, and so I'm not, you know, like I'm, I'm not an expert in any of these spaces, but it really feels like to solve this problem, you need experts both in AI, and that's like the data and the models, like the actual model architecture, um, and graphics, which is, like, how do you actually represent these things in memory in a computer and then on the screen? To take... It's a very special team to actually crack this problem, which, which, you know, Fei-Fei's managed to put together. [upbeat music]
Episode duration: 22:25
Transcript of episode fQGu016AlVo