“The Future of AI is Here” — Fei-Fei Li Unveils the Next Frontier of AI

Fei-Fei Li and Justin Johnson are pioneers in AI. While the world has only recently witnessed a surge in consumer AI, our guests have long been laying the groundwork for innovations that are transforming industries today. In this episode, a16z General Partner Martin Casado joins Fei-Fei and Justin to explore the journey from early AI winters to the rise of deep learning and the rapid expansion of multimodal AI. From foundational advancements like ImageNet to the cutting-edge realm of spatial intelligence, Fei-Fei and Justin share the breakthroughs that have shaped the AI landscape and reveal what's next for innovation at World Labs. If you're curious about how AI is evolving beyond language models and into a new realm of 3D, generative worlds, this episode is a must-listen.

Timestamps:
00:00 - Spatial Intelligence: A New Frontier
01:38 - Scaling AI: The Impact of ImageNet on Computer Vision
06:56 - The Role of Compute
09:16 - Data as the Key Driver
17:01 - Defining AI’s Ultimate Goal
18:58 - What is Spatial Intelligence? Unlocking 3D Understanding in AI
26:35 - Comparing Models: Spatial Intelligence vs. Language-Based AI
29:41 - 1D vs. 3D
32:39 - Building Immersive Worlds with Spatial Intelligence
35:11 - From Static Scenes to Dynamic Worlds
37:42 - The Future of VR and AR
40:42 - Creating Deep Tech Platforms
44:26 - Building a World-Class Team
45:54 - Measuring Success: Milestones in Spatial Intelligence

Resources:
Learn more about World Labs: https://www.worldlabs.ai
Find Fei-Fei on Twitter: https://x.com/drfeifei
Find Justin on Twitter: https://x.com/jcjohnss
Find Martin on Twitter: https://x.com/martin_casado

Stay Updated:
Let us know what you think: https://ratethispodcast.com/a16z
Find a16z on Twitter: https://twitter.com/a16z
Find a16z on LinkedIn: https://www.linkedin.com/company/a16z
Subscribe on your favorite podcast app: https://a16z.simplecast.com/
Follow our host: https://twitter.com/stephsmithio

Please note that the content here is for informational purposes only; should NOT be taken as legal, business, tax, or investment advice or be used to evaluate any investment or security; and is not directed at any investors or potential investors in any a16z fund. a16z and its affiliates may maintain investments in the companies discussed. For more details please see a16z.com/disclosures.

Fei-Fei Li (guest) · Justin Johnson (guest) · Martin Casado (host)

Sep 20, 2024 · 48m

EVERY SPOKEN WORD

45 min read · 9,476 words

0:00 – 1:38
Spatial Intelligence: A New Frontier
1. FLFei-Fei Li
  Visual-spatial intelligence is so fundamental. It's as fundamental as language. We've got these ingredients, compute, deeper understanding of data, and we've got some advancement of algorithms. We are in the right moment to really make a bet and to focus and just unlock that.
2. JJJustin Johnson
  [upbeat music]
3. MCMartin Casado
  Well, over the last two years, we've seen this kind of massive rush of consumer AI companies and technology, and it's been quite wild, but you've been doing this now for decades. And so maybe walk through a little bit about how we got here, kind of like your key contributions and insights along the way.
4. FLFei-Fei Li
  So it is a very exciting moment, right? Just zooming back, AI is in a very exciting moment.
5. MCMartin Casado
  Yeah.
6. FLFei-Fei Li
  I personally have been doing this for, for two decades plus, and, you know, we have come out of the last AI winter. We have seen the birth of modern AI. Then we have seen deep learning taking off, showing us possibilities like playing chess. But then we're starting to see the, the, the deepening of the technology-
7. MCMartin Casado
  Yeah
8. FLFei-Fei Li
  ... and the industry, um, adoption of, uh, of some of the earlier possibilities like language models. And now I think we're in the middle of a Cambrian explosion in almost a literal sense because now in addition to texts, you're seeing pixels, videos, audios all coming out with possible AI applications and models, so it's a very exciting
1:38 – 6:56
Scaling AI: The Impact of ImageNet on Computer Vision
1. FLFei-Fei Li
  moment.
2. MCMartin Casado
  I know you both so well, and many people know you both so well 'cause you're so prominent in the field. But not everybody, like, grew up in AI, so maybe it's kind of worth just going through, like, your quick backgrounds just to kind of level set the audience.
3. JJJustin Johnson
  Yeah, sure. So I first got into AI, uh, at the end of my undergrad. Uh, I did math and computer science for undergrad at Caltech. It was awesome. But then towards the end of that, there was this paper that came out that was, at the time, a very famous paper, the Cat paper-
4. MCMartin Casado
  Yeah
5. JJJustin Johnson
  ... um, from Honglak Lee and Andrew Ng and others that were at Google Brain at the time. And that was, like, the first time that I came across this concept of deep learning. Um, and to me, it just felt like this amazing technology. And that was the first time that I came across this recipe that would come to define the next, like, more than decade of my life, which is that you can get these amazingly powerful learning algorithms that are very generic, couple them with very large amounts of compute, couple them with very large amounts of data, and magic things started to happen when you combine those ingredients. So I, I first came across that idea, like, around twenty eleven, twenty twelve-ish, and I just thought, like, "Oh my God, this is, this is gonna be what I wanna do." So it was obvious you gotta go to grad school to do this stuff and then, um-
6. MCMartin Casado
  [laughs]
7. JJJustin Johnson
  ... sort of saw that Fei-Fei was at Stanford, one of the few people in the world at the time who was kind of on that, on that train. And that was just an amazing time to be in deep learning and computer vision specifically because that was really the era when this went from these first nascent bits of technology that were just starting to work and really got developed acro- and spread across a ton of different applications. So then over that time, we saw the beginnings of language modeling. We saw the beginnings of discriminative computer vision, where you could take pictures and understand what's in them in a lot of different ways. We also saw some of the early bits of what we would now call gen AI, generative modeling, generating images, generating text. A lot of those core algorithmic, algorithmic pieces actually got figured out by the academic community, um, during my PhD years. Like, there was a time I would just, like, wake up every morning and check the new papers on arXiv and just be read- It was like unwrapping presents on Christmas-
8. MCMartin Casado
  Yeah
9. JJJustin Johnson
  ... that, like, every day you know there's gonna be some amazing new discovery, some amazing new application or algorithm somewhere in the world. What happened is in the next, last two years, everyone else in the world kinda came to the same realization using AI to get new Christmas presents every day. But I think for those of us that have been in the field for a decade or more, um, we've sort of had that experience for a very long time.
10. FLFei-Fei Li
  Obviously, I'm much older than Justin. [laughs]
11. JJJustin Johnson
  Okay.
12. FLFei-Fei Li
  I, [laughs] I come to AI through a different angle, which is from physics, because my undergraduate, uh, background was physics. But physics is the kind of discipline that teaches you to think audacious questions and think about what is the remaining mystery of the world. Of course, in physics, it's atomic world, um, you know, universe and all that. But somehow I-- that kind of training thinking got me into the audacious question that really captured my own imagination, which is intelligence. So I did my PhD in AI and computational neuroscience at Caltech. So Justin and I actually didn't overlap, but we share, um, uh, the same alma mater, um, at Caltech.
13. JJJustin Johnson
  Oh, and, and the same advisor at Caltech.
14. FLFei-Fei Li
  Yes, same advisor, your undergraduate advisor and my PhD advisor, uh, Pietro Perona. And my PhD time, which is similar to your, your, your PhD time, was when AI was still in the winter in the public eye. But it was not in the winter in my eye because it's that pre-spring hibernation. There's so much life. Machine learning, statistical modeling was really gaining, uh-
15. MCMartin Casado
  Yeah
16. FLFei-Fei Li
  ... gaining power and we... I, I think I was one of the native generation in machine learning and AI, whereas I look at Justin's generation as the native deep learning generation.
17. JJJustin Johnson
  Yeah.
18. FLFei-Fei Li
  So, so, so machine learning was the precursor of deep learning, and we were experimenting with all kinds of models. But one thing came out at the end of my PhD and the beginning of my assistant professor time. There was a overlooked elements of AI that is mathematically important to drive generalization. But the whole field was not thinking that way, and it was data.
19. JJJustin Johnson
  Yeah.
20. FLFei-Fei Li
  Because we were thinking about, um, you know, the intricacy of Bayesian models or, or whatever, you know, um, uh, kernel methods and all that. But what was fundamental that my students in my lab realized probably, uh, earlier than most people is that if you-- If you let data drive models, you can unleash the kind of power that we haven't seen before. And that was really the, the, the reason we went on a pretty crazy bet on ImageNet, which is, you know what? Just forget about any scale we're seeing now, which is thousands of data points. At that point, uh, NLP community has their own datasets. I remember UC Irvine dataset or some dataset in NLP was-- it was small. Computer vision community has their datasets, but all in the order of thousands or tens of thousands. We're like, "We need to drive it to internet scale."
21. MCMartin Casado
  Yeah.
22. FLFei-Fei Li
  And luckily, it was also the, the, the coming of age of internet. So we were-
23. MCMartin Casado
  Right
24. FLFei-Fei Li
  ... riding that wave, and that's when I came to Stanford.
6:56 – 9:16
The Role of Compute
1. MCMartin Casado
  So it-- these epochs are what we often talk about. Like ImageNet is clearly the epoch that created, you know, or, or at least like maybe made like popular and viable computer vision. In the gen AI wave, we talk about two kind of core unlocks. One is like the Transformers paper, which is attention, then we talk about Stable Diffusion. Is that a fair way to think about this, which is like there's these two algorithmic unlocks that came from academia or Google-
2. FLFei-Fei Li
  [clears throat]
3. MCMartin Casado
  ... and like that's where everything comes from, or has it been more deliberate? Or have there been other kind of big unlocks that kind of brought us here that we don't talk as much about?
4. JJJustin Johnson
  Yeah. I, I think the big unlock is compute. Like, I know the story of AI is often the story of compute, but even no matter how much people talk about it, I pe- I think people underestimate it.
5. FLFei-Fei Li
  Yeah.
6. JJJustin Johnson
  Right? And the amount of, the amount of growth that we've seen in computational power over the last decade is astounding. The first paper that's really credited with the, like, the breakthrough moment in computer vision for deep learning was AlexNet, um, which was a twenty twelve paper wh- uh, that where a deep neural network did really well on the ImageNet challenge and just blew away all the other algorithms that Fei-Fei had been working on and the types of algorithms that you'd been working on more in grad school. That AlexNet was a sixty million parameter deep neural network, um, and it was trained for six days on two GTX 580s-
7. FLFei-Fei Li
  Mm-hmm
8. JJJustin Johnson
  ... which was the top consumer card at the time, which came out in two thousand and ten. Um, so I was looking at some numbers last night just to, you know, put these in perspective. The newest, the latest and greatest from NVIDIA is the GB200. Um, do either of you wanna guess how much raw compute factor we have between the GTX 580 and the GB200?
9. MCMartin Casado
  Shoot. No. What?
10. FLFei-Fei Li
  Go for it.
11. JJJustin Johnson
  It's, it's, uh, it's, it's in the thousands.
12. MCMartin Casado
  [chuckles] Oh my-
13. JJJustin Johnson
  So I, I ran the numbers last night, like that two week ra- that two week training run-
14. MCMartin Casado
  Yeah
15. JJJustin Johnson
  ... that, uh, six days on two GTX 580s, if you scale-
16. FLFei-Fei Li
  Seconds
17. JJJustin Johnson
  ... it, it comes out to just under five minutes-
18. MCMartin Casado
  Wow
19. JJJustin Johnson
  ... on a single GB, on a single GB200.
20. MCMartin Casado
  Wow.
21. FLFei-Fei Li
  Justin is making a really good point.
22. MCMartin Casado
  Yeah.
23. FLFei-Fei Li
  The twenty twelve AlexNet paper on ImageNet Challenge-
24. MCMartin Casado
  Yeah
25. FLFei-Fei Li
  ... is literally a very classic model, and that is the convolutional neural network model.
26. MCMartin Casado
  Yeah.
27. FLFei-Fei Li
  And that was published in nineteen eighties, the first paper. I remember as a graduate student learning that, and it more or less also has six, seven layers. The-- practically, the only difference between AlexNet and the ConvNet, what's the difference, is the GPUs-
28. MCMartin Casado
  Yeah
29. FLFei-Fei Li
  ... the two GPUs and the deluge of data.
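Editor's note: Justin's back-of-the-envelope comparison in this segment can be checked with a few lines of arithmetic. The sketch below uses only the figures quoted in the conversation (six days, two GTX 580s, just under five minutes on one GB200) and assumes both older GPUs were busy for the whole run. The function names are illustrative, and the result is the speedup implied by those quoted figures, not a measured benchmark.

```python
# Back-of-the-envelope check of the figures quoted in the conversation.
TRAINING_DAYS = 6       # quoted: "trained for six days"
NUM_GTX_580 = 2         # quoted: "on two GTX 580s"
GB200_MINUTES = 5.0     # quoted: "just under five minutes", used as an upper bound

MINUTES_PER_DAY = 24 * 60


def gpu_minutes(days: float, num_gpus: int) -> float:
    """Total device time, assuming every GPU was busy for the whole run."""
    return days * MINUTES_PER_DAY * num_gpus


def implied_speedup(old_gpu_minutes: float, new_minutes: float) -> float:
    """Per-device throughput ratio implied by the two wall-clock figures."""
    return old_gpu_minutes / new_minutes


total = gpu_minutes(TRAINING_DAYS, NUM_GTX_580)
factor = implied_speedup(total, GB200_MINUTES)

print(f"GTX 580 device time: {total:.0f} GPU-minutes")
print(f"Implied per-device speedup: at least {factor:.0f}x")
```

Twelve GPU-days is 17,280 GPU-minutes, so finishing in under five minutes implies a factor of more than 3,456, which is consistent with the "in the thousands" figure given in the conversation.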
9:16 – 17:01
Data as the Key Driver
1. MCMartin Casado
  Yeah. Well, so that's where I was gonna go, which is like-
2. FLFei-Fei Li
  Yeah.
3. JJJustin Johnson
  Mm-hmm
4. MCMartin Casado
  ... so I think most people now are familiar with like quote, "the bitter lesson."
5. JJJustin Johnson
  Yeah.
6. MCMartin Casado
  And the bitter lesson says is if you make an algorithm, don't be cute.
7. JJJustin Johnson
  Yeah.
8. MCMartin Casado
  Just make sure you can take advantage of available compute, 'cause the available com-compute will show up, right? And so, like, you just, like, need to, like-
9. JJJustin Johnson
  Yeah
10. MCMartin Casado
  ... vibe, like, the com... On the other hand, there's another narrative, um, which seems to me to be, like, just as credible, which is like it's actually new data sources that unlock deep learning, right? Like ImageNet is a great example. But, like, a lot of people, like, self-attention is great from Transformers, but they'll also say, this is a way you can exploit human labeling of data because, like, it's the humans-
11. JJJustin Johnson
  Yeah
12. MCMartin Casado
  ... that put the structure in the sentences. And if you look at CLIP, they'll say, well, like, we're using the internet to, like, actually, like, have humans use the alt tag to label images, right?
13. JJJustin Johnson
  Yeah.
14. MCMartin Casado
  And so, like, that's a story of data that's not a story of compute. And so-
15. JJJustin Johnson
  Yeah
16. MCMartin Casado
  ... is it just-- is the answer just both or is, like, one more than the other or-
17. JJJustin Johnson
  It's, it's-- I think it's both, but you're hitting on, on another really good point. So I think there's actually two epochs that to me feel quite distinct in the algorithmics here.
18. FLFei-Fei Li
  Mm.
19. JJJustin Johnson
  So, like, the ImageNet era is actually the era of supervised learning.
20. FLFei-Fei Li
  Mm-hmm.
21. JJJustin Johnson
  Um, so in the era of supervised learning, you have a lot of data, but you don't know how to use data on its own. Like, the expectation of ImageNet and other datasets of that time period was that we're gonna get a lot of images, but we need people to label every one. And all of the training data that we're gonna train on, like, a person, a human labeler has looked at every one and said something about that image.
22. MCMartin Casado
  Yeah.
23. JJJustin Johnson
  Um, and the big algorithmic unlocks, we know how to train on things that don't require human-labeled data.
24. MCMartin Casado
  As the, as the naive person in the room that doesn't have an AI background, it seems to me if you're training on human data, like, the humans have labeled it, it's just not explicit, like-
25. FLFei-Fei Li
  I, I knew you were gonna say that-
26. JJJustin Johnson
  [laughs]
27. MCMartin Casado
  [laughs]
28. FLFei-Fei Li
  ... Martin. I knew that.
29. JJJustin Johnson
  Right.
30. FLFei-Fei Li
  Yes, philosophically, that's a really important question, but that actually is more true in language than pixels.
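Editor's note: the distinction drawn in this segment, between data where a person labels every example and data that can be trained on without explicit labels, can be made concrete with a small sketch. The file names, labels, and the `next_token_pairs` helper below are made up for illustration, and whitespace splitting stands in for a real tokenizer. This is a toy picture of the idea, not how any particular model builds its training set.

```python
from typing import List, Tuple

# Supervised setting: every example carries a label that a person supplied.
# These file names and labels are hypothetical.
labeled_images: List[Tuple[str, str]] = [
    ("img_0001.jpg", "cat"),
    ("img_0002.jpg", "dog"),
]


def next_token_pairs(text: str) -> List[Tuple[List[str], str]]:
    """Build (context, target) training pairs from raw text alone.

    The target is simply the next word, so no separate labeling step is needed.
    """
    tokens = text.split()  # whitespace split stands in for a real tokenizer
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]


pairs = next_token_pairs("the cat sat on the mat")
for context, target in pairs:
    print(" ".join(context), "->", target)
```

The second case also illustrates Martin's objection: the targets still come from a human, because a person wrote the sentence, but nobody had to annotate it after the fact.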
17:01 – 18:58
Defining AI’s Ultimate Goal
1. MCMartin Casado
  it's a phenomenal book. I like... I, I really recommend you read it. And it seems for a long time, like, a lot of the-- and I'm talking to you, Fei-Fei, like, a lot of your research has been, you know, and your direction has been towards kind of spatial stuff and pixel stuff and intelligence, and now you're doing World Labs, and it's around spatial intelligence. And so maybe talk through, like, you know, has this been part of a long journey for you? Like, why did you decide to do it now? Is it a technical unlock? Is it a personal unlock? Just kind of, like, move us from that kind of milieu of AI research to, to World Labs.
2. FLFei-Fei Li
  Sure. For me is, uh, um, it is, uh, uh, both personal and intellectual, right? My entire-- You talk about my book. My entire intellectual journey is really this passion to seek North Stars, but also believing that those North Stars are critically important for the advancement of our field. So at the beginning, I remembered after graduate school, I thought my North Star was telling stories of, uh, images, because for me, that's such an important piece of visual intelligence. That's part of what you call AI or AGI. But when Justin and Andrej did that, I was like, "Oh my God, that's, [laughs] that was my life's dream. What do I do l- next?" [laughs] So it, it came a lot faster than... I thought it would take 100 years-
3. MCMartin Casado
  [laughs]
4. FLFei-Fei Li
  -to do that. So, um, but visual intelligence is my passion because I do believe for every intelligent, uh, being, like people or robots or some other form, um, knowing how to see the world, reason about it, interact in it, whether you're navigating or, or, or manipulating or making things, you can even build civilization upon it. It--
18:58 – 26:35
What is Spatial Intelligence? Unlocking 3D Understanding in AI
1. FLFei-Fei Li
  Visual-spatial intelligence is so fundamental. It's as fundamental as language, possibly more ancient and, and more fundamental in certain ways. So, so it's very natural for me that, um, World Labs is-- our North Star is to unlock spatial intelligence. The moment to me is right to do it, like Justin was saying, compute. We've got these ingredients. We've got compute. We've got a much deeper understanding of data, way deeper than ImageNet days. You know, uh, compared to, to that, those days, [laughs] uh, we're so much more sophisticated. And we've got some advancement of algorithms-
2. MCMartin Casado
  Mm
3. FLFei-Fei Li
  ... including co-founders in World Lab like Ben Mildenhall and, uh, Christoph, uh, Lassner. They were at the cutting edge of NeRF, that we are in the right moment to really make a bet and to focus and just unlock that.
4. MCMartin Casado
  Great. So I just wanna clarify it for, for folks that are listening to this, which is, so, you know, you're starting this company, World Labs. Spatial intelligence is kind of how you're generally describing the problem you're solving. Can you maybe try to crisply describe what that means?
5. JJJustin Johnson
  Yeah. So spatial intelligence is about machines' ability to under-- to perceive, reason, and act in 3D and-- 3D space and time, to understand how objects and events are positioned in 3D space and time, how interactions in the world can affect those 3D posit-- 3D spa-- 4D positions over space time, um, and both sort of perceive, reason about, generate, interact with, really take the machine out of the mainframe or out of the data center and putting it out into the world and understanding the 3D/4D world with all of its richness.
6. MCMartin Casado
  And so to be very clear, are we talking about the physical world, or are we just talking about an abstract notion of world?
7. JJJustin Johnson
  I think it can be both. I think it can be both, and that encompasses our vision long term.
8. FLFei-Fei Li
  Yeah.
9. JJJustin Johnson
  Even if you're generating worlds, even if you're generating content, um, doing that in-- positioned in 3D with 3D, uh, has a lot of benefits. Um, or if you're recognizing the real world, being able to put 3D understanding into the m- into the real world as well is-
10. MCMartin Casado
  Yeah
11. JJJustin Johnson
  ... part of it.
12. MCMartin Casado
  Great. So I mean, ju-just for everybody listening, like, the, the two other co-founders, Ben Mildenhall and Christoph Lassner, are absolute legends in, in the field at the, at the same level. These four decided to come out and do this company now, [laughs] and so I'm trying to get... dig to, like, like, why now is the, the, the right time.
13. JJJustin Johnson
  Yeah. I mean, this is, again, part of a longer evolution for me. But, like, really after PhD when I was really wanting to develop into my own independent researcher, both at, uh, for my later career, I was just thinking, "What are the big problems in AI and computer vision?" Um, and the conclusion that I came to about that time was that the previous decade had mostly been about understanding data that already exists, um, but the next decade was going to be about understanding new data. And if we think about that, the data that already exists was all of the images and videos that maybe existed on the web already, and the next decade was gonna be about understanding new data, right? Like, people are, people have smartphones. Smartphones are collecting cameras. Those cameras have new sensors. Those cameras are positioned in the 3D world. It's not just you're gonna get a bag of pixels from the internet and know nothing about it and try to say if it's a cat or a dog. We wanna treat these i- treat images as universal sensors to the physical world, and how can we use that to understand the 3D and 4D structure of the world, um, e- either in physical spaces or, or, or generative spaces. So I made a pretty big pivot post-PhD into 3D computer vision-
14. MCMartin Casado
  Okay
15. JJJustin Johnson
  ... predicting 3D shapes of objects with some of my colleagues at FAIR at the time.
16. MCMartin Casado
  Mm-hmm.
17. JJJustin Johnson
  Then later, I got really enamored by this idea of learning 3D structure through 2D, right? Because we talk about data a lot. It's, it's, um, you know, 3D data is hard to get on its own, um, but there-- because there's a very strong mathematical connection here, um, our 2D images are projections of a 3D world, and there's a lot of mathematical structure here we can take advantage of. So even if you have a lot of 2D data, there's, there's a lot of people who've done amazing work to figure out how can you back out the 3D structure of the world from large quantities of 2D observations. Um, and then in 2020, you asked about very breakthrough moments. There was a really big breakthrough moment from, from our co-founder, Ben Mildenhall, at the time with his paper NeRF-
18. MCMartin Casado
  Yeah
19. JJJustin Johnson
  ... um, Neural Radiance Fields.
20. MCMartin Casado
  Great.
21. JJJustin Johnson
  And that was a very simple, very clear way of backing out 3D structure from 2D observations. That just lit a fire under this whole space of 3D computer vision.
22. FLFei-Fei Li
  Mm-hmm.
23. JJJustin Johnson
  I think there's another aspect here that maybe, uh, people outside the field don't quite understand. Uh, so that was also a time when large language models were starting to take off.
24. FLFei-Fei Li
  Yep.
25. JJJustin Johnson
  So a lot of the stuff with language modeling actually had gotten developed in academia. Even during my PhD, I did some early work with Andrej Karpathy on language modeling in 2014.
26. FLFei-Fei Li
  LSTM, I still remember. [laughs]
27. JJJustin Johnson
  Yeah, yeah, it was LSTMs, RNNs-
28. FLFei-Fei Li
  Pre-BERT
29. JJJustin Johnson
  ... GRUs.
30. FLFei-Fei Li
  Yes. [laughs]
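Editor's note: Justin's remark earlier in this segment that 2D images are projections of a 3D world can be illustrated with the standard pinhole camera model. The sketch below is a generic textbook projection, not NeRF and not anything specific to World Labs. The focal length and image center are arbitrary assumed values, and the `project` helper is hypothetical.

```python
from typing import Tuple

Point3D = Tuple[float, float, float]
Pixel = Tuple[float, float]


def project(point: Point3D, focal: float = 500.0,
            cx: float = 320.0, cy: float = 240.0) -> Pixel:
    """Pinhole projection of a camera-frame 3D point onto the image plane."""
    x, y, z = point
    if z <= 0:
        raise ValueError("point must be in front of the camera (z > 0)")
    # Dividing by depth is what discards the third dimension.
    return (focal * x / z + cx, focal * y / z + cy)


near = (0.5, 0.25, 2.0)
far = (1.0, 0.5, 4.0)   # same viewing ray, twice as far from the camera

print(project(near))
print(project(far))
```

Both points land on the same pixel, so a single image cannot tell them apart. Recovering depth needs extra constraints, such as many views of the same scene, which is the "large quantities of 2D observations" setting described in the conversation.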
26:35 – 29:41
Comparing Models: Spatial Intelligence vs. Language-Based AI
1. MCMartin Casado
  Throughout this entire conversation, you're talking about languages, and you're talking about pixels. So maybe it's a good time to talk about how like spatial intelligence and what you're working on contrasts with language approaches, which of course are very popular now. Like, is it complementary? Is it orthogonal?
2. JJJustin Johnson
  Yeah, I, I think, I think they're complementary.
3. MCMartin Casado
  I, I, I don't mean to be too, too leading here. Like maybe just contrast them. Like, everybody s- says like, "Listen, I, I, I know OpenAI, and I know GPT, and I know multimodal models," and a lot of what you're talking about is like they've got pixels, and they've got languages and like doesn't this kind of do what we want to do with spatial reasoning?
4. JJJustin Johnson
  Yeah. So I think to do that, you need to open up the black box a little bit of how these systems work under the hood. Um, so with language models and the multimodal language models that we're seeing nowadays, their, their, their un- their underlying representation under the hood is, is a one-dimensional representation.
5. FLFei-Fei Li
  Yeah.
6. JJJustin Johnson
  We talk about context lengths. We talk about transformers. We talk about sequences.
7. FLFei-Fei Li
  Attention.
8. JJJustin Johnson
  Attention.
9. FLFei-Fei Li
  Yeah.
10. JJJustin Johnson
  Fundamentally, their representation of the world is, is one-dimensional, so these things fundamentally operate on a one-dimensional sequence of tokens. So this is a very natural representation when you're talking about language because written text is a one-dimensional sequence of discrete letters, so that kind of un- underlying representation is the thing that led to LLMs. And now the multimodal LLMs that we're seeing now, you kind of end up shoehorning the other modalities into this underlying representation of a 1D sequence of tokens.
11. MCMartin Casado
  Yeah.
12. JJJustin Johnson
  Um, now when we move to spatial intelligence, it's kind of going the other way, where we're saying that the three-dimensional nature of the world should be front and center in the representation.
13. FLFei-Fei Li
  Mm-hmm.
14. JJJustin Johnson
  So at an algorithmic perspective, that opens up the door for us to process data in different ways to get different kinds of outputs out of it, um, and to tackle slightly different problems. So even at, at a coarse level, you kind of look at outside, and you say, "Oh, multimodal LLMs can look at images too." Well, they can, but I, I think that it's-- they don't have that fundamental 3D representation at the heart of their approaches.
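Editor's note: the idea of shoehorning other modalities into a 1D sequence of tokens can be sketched with a toy image. The code below cuts a small 2D grid into patches and lists them in raster order, in the style of patch-based vision transformers. The `flatten_patches` helper is hypothetical, it assumes the image dimensions are divisible by the patch size, and it is not claimed to match what any specific multimodal model does.

```python
from typing import List

Grid = List[List[int]]


def flatten_patches(image: Grid, patch: int) -> List[List[int]]:
    """Cut a 2D grid into patch x patch tiles and list them in raster order."""
    height, width = len(image), len(image[0])
    tokens = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            tile = [image[r][c]
                    for r in range(top, top + patch)
                    for c in range(left, left + patch)]
            tokens.append(tile)
    return tokens


# A 4x4 "image" whose pixel values are just their raster index, 0..15.
image = [[r * 4 + c for c in range(4)] for r in range(4)]
tokens = flatten_patches(image, patch=2)

for position, tile in enumerate(tokens):
    print(position, tile)
```

The top-left and bottom-left patches touch in the image but end up at positions 0 and 2 in the sequence. The 2D layout is no longer explicit in the representation and has to be recovered by the model, for example through position information.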
15. FLFei-Fei Li
  I totally agree with Justin. I think talking about the 1D versus fundamentally 3D representation is one of the most core differentiation. The other thing is, uh, slightly philosophical, but it's really important to, for me at least, is language is fundamentally a purely generated signal. There's no language out there. You don't go out in the nature, and there's words written in the sky for you. Whatever data you feed in, you pretty much can just somehow regurgitate with enough generalizability, uh, the, the same data out, and that's language to language. And but, but 3D world is not. There is a 3D world out there that follows laws of physics, that has its own structures due to materials and, and many other things. And to, to fundamentally back that information out and be able to represent it and be able to generate it is just fundamentally quite a different problem. We will be borrowing, um, similar ideas or, or useful ideas from language and LLMs, but this is fundamentally philosophically, to me, a
29:41 – 32:39
1D vs. 3D
1. FLFei-Fei Li
  different problem.
2. JJJustin Johnson
  Right.
3. MCMartin Casado
  So language 1D and probably a bad representation of the physical world because it's been generated by humans and is probably lossy. There's a whole nother modality of generative AI models which are pixels, and these are 2D image and 2D video. And like one could say that like if you look at a video, it looks, you know, you can see 3D stuff because like you can pan a camera or whatever it is. And so like, how would like spatial intelligence be different than, say, 2D video?
4. JJJustin Johnson
  Here when I think about this, it's useful to disentangle two things. Um, one is the underlying representation, and then two is kind of the, the user-facing affordances that you have.
5. FLFei-Fei Li
  Yeah.
6. JJJustin Johnson
  Um, and here's where, where you can get sometimes confused because, um, fundamentally, we see 2D, right? Like our retinas are 2D structures in our bodies, and we've got two of them. So like fundamentally, our visual system perceives 2D images. Um, but the problem is that depending on what res-representation you use, there could be different affordances that are more natural or less natural. So even if you, uh, are... At the end of the day, you might be seeing a 2D image or a 2D video, um, your brain is perceiving that as a projection of a 3D world. So there's things you might want to do, like move objects around, move the camera around. Um, in principle, you might be able to do these with a purely 2D representation and model, but it's just not a fit to the problems that you're asking the model to do, right? Like modeling the 2D projections of a dynamic 3D world is, is a function that probably can be modeled. But by putting a 3D representation into the heart of a model, there's just gonna be a better fit between the kind of representation that the model is working on and the kind of tasks that you want that model to do. So our bet is that by threading a little bit more 3D representation under the hood, that'll en-enable better affordances for, for users.
7. FLFei-Fei Li
  And this also goes back to the North Star. For me, you know, why is it spatial intelligence? Why is it not flat pixel intelligence? Is because I think the arc of intelligence has to go to what Justin calls affordances. And, uh, and the arc of intelligence, if you look at evolution, right, the arc of intelligence eventually enables animals and humans, especially human as an intelligent animal, to move around the world, interact with it, create civilization, create life, create a piece of sandwich, whatever you do in this 3D world. And, and translating that into a piece of technology, that three native 3D-ness is fundamentally important for the flood, flood gate, um, of possible applications-
8. JJJustin Johnson
  Yeah
9. FLFei-Fei Li
  ... even if some of them, the, the serving of them looks 2D, but the, but it's innately 3D-
10. JJJustin Johnson
  Great
11. FLFei-Fei Li
  ... um, to me.
12. MCMartin Casado
  I think this is actually very subtle-
13. FLFei-Fei Li
  Yeah
14. MCMartin Casado
  ... and incredibly critical-
15. FLFei-Fei Li
  Mm-hmm
16. MCMartin Casado
  ... point, and so I think it's worth digging into, and a good way to do this is talking about
32:39 – 35:11
Building Immersive Worlds with Spatial Intelligence
1. MCMartin Casado
  use cases. And so just to level set this is we're talking about generating a, a, a technology, let's call it a model, that can do spatial intelligence. So maybe in the abstract, what might that look like kind of a little bit more concretely? What would be the potential use cases that you could apply this to?
2. JJJustin Johnson
  So I think there's a, there's a couple different kinds of things we imagine these spatially intelligent models able to do over time. Um, and one that I'm really excited about is world generation. We're all, we're all used to something like a text-to-image generator or starting to see text-to-video generators where you put an image, put in a video, and out pops an amazing image or an amazing two-second clip. Um, but I, I think you could imagine leveling this up and getting 3D worlds out. So one thing that we could imagine spatial intelligence helping us with in the future are upleveling these experiences into 3D, where we're not getting just an image out or just a clip out, but you're getting out a full simulated but vibrant and interactive 3D world.
3. FLFei-Fei Li
  For gaming?
4. JJJustin Johnson
  Maybe for gaming, right?
5. FLFei-Fei Li
  [laughs]
6. JJJustin Johnson
  Maybe for gaming, maybe for virtual photography, like you name it. There's... I think there's, there's... Even if you got this to work, there'd be, there'd be a million applications.
7. FLFei-Fei Li
  For education even, right?
8. JJJustin Johnson
  Yeah, for education. I mean, I guess one of, one of my things is that, like, we-- In, in some sense, this enables a new form of media, right? Because we already have the ability to create virtual interactive worlds, um, but it costs hundreds of mil- hundreds of millions of dollars and a ton, and a ton of development time. And as a result, like what are the places that people derive this technological ability is, is video games, right? Because if we, we do have the ability as a society to create amazingly detailed virtual interactive worlds that give you amazing experiences, but because it takes so much labor to do so, then the only economically viable use of that technology in its form today is, is games that can be sold for seventy dollars apiece to millions and millions of people to recoup the investment. If we had the ability to create these same virtual interactive, vibrant 3D worlds, um, you could see a lot of other applications of this, right? Because if you bring down that cost of producing that kind of content, then people are gonna use it for other things. What if you could have a, an interact-- like sort of a personalized expe- 3D experience that's as good and as rich, as detailed as one of these triple A video games that cost hundreds of millions of dollars to produce, but it could be catered to like this very niche thing that only maybe a couple people would want that particular thing. That's not a particular product or a particular roadmap, but I think that's a vision of a new kind of media that would be enabled by, um, spatial intelligence in the generative realms.
35:11 – 37:42
From Static Scenes to Dynamic Worlds
1. MCMartin Casado
  If I think about a world, I actually think about things that are not just scene generation. I think about stuff like movement and physics and so like, like in the limit, is that included? And then the second one is-
2. FLFei-Fei Li
  Absolutely
3. MCMartin Casado
  ... if I'm interacting with it, like, like are there semantics? And I mean by that, like if I open a book, are there like pages and are there words in it, and do, do they mean... Like, like are we talking like a full depth experience or are we talking about like kind of a static scene?
4. JJJustin Johnson
  I think you'll see a progression of this technology over time.
5. FLFei-Fei Li
  Yeah.
6. JJJustin Johnson
  This is really hard stuff to build.
7. MCMartin Casado
  Yeah.
8. JJJustin Johnson
  So I think the static, the static problem is a little bit easier. Um, but in the limit, I think we want this to be fully dynamic, fully interactable, all the things that you just said.
9. MCMartin Casado
  Yeah.
10. FLFei-Fei Li
  I mean, that's the definition of spatial intelligence.
11. JJJustin Johnson
  Exactly.
12. FLFei-Fei Li
  Yeah. So, so there is gonna be a progression. We'll start with more static, but everything you've said is in the, in the roadmap of, uh, spatial intelligence
13. JJJustin Johnson
  I mean, this is kind of in, in the name of the company itself
14. FLFei-Fei Li
  Yeah
15. JJJustin Johnson
  ... World Labs. Um, like the world is about building and understanding worlds
16. MCMartin Casado
  I love that
17. JJJustin Johnson
  And, and like this is actually a little bit of inside baseball. I realized after we told the name to people, they don't always get it because in computer vision and, and reconstruction and generation, we often make a distinction or a delineation about the kinds of things you can do. Um, and kinda the first level is objects, right? Like a microphone, a cup, a chair. Like, these are discrete things in the world. Um, and a lot of the ImageNet-style stuff that Fei-Fei worked on was about recognizing objects in the world. Then leveling up the next level of objects I think of as scenes. Like, scenes are compositions of objects. Like, now we've got this recording studio with a table and microphones and people in chairs. It's some composition of objects. But, but then, like, we, we envision worlds as a step beyond scenes, right? Like, scenes are kind of maybe individual things, but we wanna break the boundaries, go outside the door, like step out from the table, walk out from the door, walk down the street, and see the cars buzzing past and see, like, the, the, the, the, the leaves on the trees moving and be able to interact with those things
18. FLFei-Fei Li
  Another thing that's really exciting, 'cause Justin mentioned the word new media, with this technology, the boundary between real world and virtual imagined world or augmented world or predicted world is all blurry. You really-
19. JJJustin Johnson
  Yeah
20. FLFei-Fei Li
  ... it- there- this real world is 3D, right? So in the digital world, you have to have a 3D representation to even blend with the real world, you know?
21. JJJustin Johnson
  Yeah.
22. FLFei-Fei Li
  You cannot have a 2D, you cannot have a 1D-
23. JJJustin Johnson
  Right
24. FLFei-Fei Li
  ... to be able to interface-
25. JJJustin Johnson
  Right
26. FLFei-Fei Li
  ... with the real 3D world in an effective way. Uh, with this, it unlocks it, so it, it, the use cases can, can be quite limitless because
37:42 – 40:42
The Future of VR and AR
1. FLFei-Fei Li
  of this
2. MCMartin Casado
  Right. So the first use case that, that Justin was talking about would be, like, the generation of a virtual world for any number of use cases. The one that you're just alluding to would be more of an augmented reality, right? Where like-
3. FLFei-Fei Li
  Yes. Just around the time World Labs was, uh, um, being formed, uh, Vision Pro was released by-
4. MCMartin Casado
  Yeah
5. FLFei-Fei Li
  ... Apple. And, uh, they used the word spatial computing. We're almost... Like, they almost stole our- [laughs]
6. JJJustin Johnson
  Our name. [laughs]
7. FLFei-Fei Li
  Our... But we're spatial intelligence. Uh, so spatial computing needs spatial intelligence. That's exactly, uh, right. So we don't know what hardware form it will take. It'll be goggles, glasses-
8. JJJustin Johnson
  Contact lenses
9. FLFei-Fei Li
  ... contact lenses. But that interface between the true real world and what you can do on top of it, whether it's to help you to augment your capability to work on a piece of machine and fix your car even if you are not a trained mechanic or to just be in, uh, Pokemon Go, uh, uh, uh, plus, plus for entertainment. Suddenly this piece of technology is, is going to be the, the, the operating system basically, uh, for, for AR, VR, uh, mixed VR.
10. JJJustin Johnson
  In the limit, like, what does an AR device need to do? It's this thing that's always on. It's with you. It's looking out into the world, so it needs to understand the stuff that you're seeing, um, and maybe help you out with tasks in your daily life. But I'm, I'm real- also really excited about this blend between virtual and physical that becomes-
11. FLFei-Fei Li
  Yep
12. JJJustin Johnson
  ... really critical. If you have the ability to understand what's around you in real time in perfect 3D, then it actually starts to deprecate large parts of the real world as well. Like, right now, how many differently sized screens do we all own for different use cases?
13. FLFei-Fei Li
  Too many.
14. MCMartin Casado
  Lots. Yeah, lots. Yeah
15. JJJustin Johnson
  Right? You've got, you've got your, you've got your phone, you've got your iPad, you've got your computer monitor, you've got your TV. Like-
16. MCMartin Casado
  Your watch
17. JJJustin Johnson
  ... you've got your watch. Like, these are all basically different-sized screens because they need to present information to you in different contexts and in different, different positions. But if you've got the ability to seamlessly blend virtual content with the physical world, it kind of deprecates the need for all of those. It just ideally seamlessly blends the information that you need to know in the moment with the right way, mechanism of, of giving you that information.
18. MCMartin Casado
  Totally.
19. FLFei-Fei Li
  Another huge case of being able to blend the, the digital virtual world with the 3D physical world is for any agents to be able to do things in the physical world. And if humans use these mixed AR devices to do things, like I said, I don't know how to fix a car, but if I have to, I put on this, this goggle or glass and suddenly I'm guided to do that. But there are other types of agents, namely robots, any kind of robots, so not just humanoid, and, uh, their interface by definition is the 3D world, but their compute, their brain by definition is the digital world.
20. JJJustin Johnson
  Yeah.
21. FLFei-Fei Li
  So what connects that from the learning to, to behaving between a robot brain to the real-world brain? It has to be spatial intelligence.
40:42 – 44:26
Creating Deep Tech Platforms
1. MCMartin Casado
  So you've talked about virtual worlds, you've talked about kind of more of an augmented reality, and now you've just talked about the purely physical world, basically, which would be used for robotics. Um, for any company, that would be, like, a very large charter, especially if you're gonna get into each one of these different areas. So how do you think about the idea of, like, deep, deep tech versus any of these specific application areas?
2. FLFei-Fei Li
  We see ourselves as a deep tech company, as the platform company that provides models-
3. MCMartin Casado
  Yeah
4. FLFei-Fei Li
  ... that, uh, that can serve different use cases. Uh-
5. MCMartin Casado
  Is... O- of these three, is there any one that you think is kind of more natural early on that people can kind of expect the company to lean into, or is it-
6. FLFei-Fei Li
  I think it suffices to say the devices are not totally ready. [laughs]
7. JJJustin Johnson
  Actually, I got my first VR headset in grad school. Um, and just, like, that's one of these tr- transformative technology experiences. You put it on, you're like, "Oh, my God." Like, "This is crazy." And I think a lot of people have that experience the first time they use VR. Um, so I've, I've been excited about this space for a long time, and I, I love the Vision Pro. Like, I stayed up late to order one of the first ones-
8. FLFei-Fei Li
  [laughs]
9. JJJustin Johnson
  ... like the first day it came out. Um, but I, I think the reality is it's just not there yet as a platform-
10. MCMartin Casado
  Yeah
11. JJJustin Johnson
  ... for mass market appeal
12. FLFei-Fei Li
  So very likely as a company, we'll, we'll, we'll move into a market that's more ready than-
13. JJJustin Johnson
  Then I, I think there can sometimes be simplicity and generality, right? Like if you-- we, we have this notion of being a deep tech company. We, we believe that there is some under-underlying fundamental problems that need to be solved really well, and if solved really well, can apply to a lot of different domains. We really view this long arc of the company as building and realizing the, the dreams of spatial intelligence writ large.
14. MCMartin Casado
  So this is a lot of technologies to build, it seems to me.
15. JJJustin Johnson
  Yeah. I think it's a really hard problem. Um, I think sometimes from people who are not directly in the AI space, they just see it as like AI as one de-undifferentiated mass of talent.
16. MCMartin Casado
  Yep.
17. JJJustin Johnson
  Um, and, and for those of us who have been here a long-- for, for longer, you realize that there's a lot of different, a lot of different kinds of talent that need to come together to build anything in A- in AI, in particular this one. We talked a little bit about the, the data problem. We've talked a little bit about some of the algorithms that we-- that I worked on during my PhD, but there's a lot of other stuff we need to do this too. Um, you need really high quality, large scale engineering. You need really deep understanding of 3D, of the 3D world. You need really-- there's actually a lot of connections with computer graphics, um, because they've been kind of atta-attacking a lot of the same problems from the, from the opposite direction. So when we think about team construction, we think about how do we find expert, like absolute top of the world, best experts in the world at each of these different sub-domains that are necessary to build this really hard thing.
18. FLFei-Fei Li
  When I thought, uh, thought about how we form the best founding team for World Labs, it has to start with the a, a group of phenomenal multidisciplinary founders.
19. JJJustin Johnson
  Yeah.
20. FLFei-Fei Li
  And of course, Justin is natural for me. [chuckles] Justin, cover your ears, as one of my best students and, uh, uh, one of the smartest, uh, technologists. But there are two, two other people I have known by reputation and, and one of them Justin even worked with that I was drooling for, right? One is Ben Mildenhall. We talked about his-
21. JJJustin Johnson
  Yeah. NeRF. Yeah
22. FLFei-Fei Li
  ... um, seminal work in NeRF. But another person is, uh, Christoph Lassner, who has been reputated in the community of computer graphics, and, uh, especially he had the foresight of working on a precursor of the Gaussian splat, um, representation for 3D modeling five years, right-
23. JJJustin Johnson
  Yeah
24. FLFei-Fei Li
  ... before the, uh, the Gaussian splat take off. And when, when we heard about-- when we talk about the potential, uh, possibility of working with Christoph Lassner, Justin just, uh, jumped off his chair.
25. MCMartin Casado
  Ben and Christoph are, are, are legends.
44:26 – 45:54
Building a World-Class Team
1. MCMartin Casado
  And maybe just quickly talk about kind of like how you thought about the build out of the rest of the team, because again, like it's, you know, there's a lot to build here and a lot to work on, not just in kind of AI or graphics, but like systems and so forth.
2. FLFei-Fei Li
  Yeah. Um, this is what so far I'm personally most proud of, is the formidable team. I've had the privilege of working with the smartest young people in my entire career, right?
3. JJJustin Johnson
  Of course. Yeah.
4. FLFei-Fei Li
  From, from the top universities, being a professor at Stanford. But the kind of talent that we put together here at, uh, at, uh, World Labs is just phenomenal. I've never seen the concentration. And I think the biggest differentiating, um, element here is that we're believers of, uh, spatial intelligence. All of the multidisciplinary talents, whether it's system engineering, machine, uh, machine learning infra, to, you know, uh, generative modeling, to data, to, you know, graphics, all of us, whether it's our personal research journey or, or technology journey or even personal hobby, we believe that spatial intelligence has to happen at this moment with this group of people. And, uh, that's how we really found our founding team. And, uh, and that focus of energy and talent is, is, is really just, uh, um, humbling to me. I, I just love it.
45:54 – 47:58
Measuring Success: Milestones in Spatial Intelligence
1. MCMartin Casado
  So I know you've been guided by a North Star. So something about North Stars is like you can't actually reach them-
2. JJJustin Johnson
  [laughs]
3. MCMartin Casado
  ... because they're in the sky, but it's a great way to have guidance. So how will you know when you've accomplished what you've set out to accomplish? Or is this a lifelong thing that's gonna continue kind of infinitely?
4. FLFei-Fei Li
  First of all, there's real North Stars and virtual North Stars.
5. MCMartin Casado
  Right. [laughs]
6. FLFei-Fei Li
  Sometimes you can reach virtual North Stars.
7. MCMartin Casado
  Fair enough. Good to know.
8. FLFei-Fei Li
  Like the-
9. MCMartin Casado
  In, in, in the world, in the world model-
10. FLFei-Fei Li
  Exactly. [laughs]
11. MCMartin Casado
  ... you can hit the North Star. [laughs]
12. FLFei-Fei Li
  Like I said, we-- I thought one of my North Star that would take a hundred years with storytelling of images and, um, Justin and Andrej-
13. MCMartin Casado
  Fair
14. FLFei-Fei Li
  ... you know, in my opinion, solved it for me.
15. MCMartin Casado
  Great.
16. FLFei-Fei Li
  So, um, so we could get to our North Star. But I think for me is when so many people and so many businesses are using our models to unlock their, um, needs for spatial intelligence, and that's the moment I know we have reached a major milestone.
17. MCMartin Casado
  Actual deployment.
18. FLFei-Fei Li
  Yes.
19. MCMartin Casado
  Actual impact.
20. FLFei-Fei Li
  Yeah.
21. MCMartin Casado
  Actually, yes.
22. JJJustin Johnson
  Yeah. I, I don't think we're ever gonna get there. Um, I, I think that this is such a fundamental thing, like the universe is a giant evolving four-dimensional structure.
23. MCMartin Casado
  Hundred percent.
24. JJJustin Johnson
  And spatial intelligence writ large is just understanding that i-in all of its depths and figuring out all the applications to that. So I, I think that we have a, we have a particular set of ideas in mind today, but I, I think this, I think this journey is gonna take us places that we can't even imagine right now.
25. FLFei-Fei Li
  The magic of good technology is that technology opens up more possibilities and, and unknowns. So, so we will be pushing and then the possibilities will, will be expanding.
26. MCMartin Casado
  Brilliant. Thank you, Justin. Thank you, Fei-Fei. This was fantastic.
27. FLFei-Fei Li
  Thank you, Martin.
28. JJJustin Johnson
  Yeah. Thank you, Martin.
29. SPSpeaker
  [upbeat music] Thank you so much for listening to the a16z podcast. If you've made it this far, don't forget to subscribe so that you are the first to get our exclusive video content. Or you can check out this video that we've hand selected for you. [upbeat music]

Episode duration: 48:09

Transcript of episode vIXfYFB7aBI
