a16z

“The Future of AI is Here” — Fei-Fei Li Unveils the Next Frontier of AI

Fei-Fei Li and Justin Johnson are pioneers in AI. While the world has only recently witnessed a surge in consumer AI, our guests have long been laying the groundwork for innovations that are transforming industries today. In this episode, a16z General Partner Martin Casado joins Fei-Fei and Justin to explore the journey from early AI winters to the rise of deep learning and the rapid expansion of multimodal AI. From foundational advancements like ImageNet to the cutting-edge realm of spatial intelligence, Fei-Fei and Justin share the breakthroughs that have shaped the AI landscape and reveal what's next for innovation at World Labs. If you're curious about how AI is evolving beyond language models and into a new realm of 3D, generative worlds, this episode is a must-listen.

Timestamps:
00:00 - Spatial Intelligence: A New Frontier
01:38 - Scaling AI: The Impact of ImageNet on Computer Vision
06:56 - The Role of Compute
09:16 - Data as the Key Driver
17:01 - Defining AI’s Ultimate Goal
18:58 - What is Spatial Intelligence? Unlocking 3D Understanding in AI
26:35 - Comparing Models: Spatial Intelligence vs. Language-Based AI
29:41 - 1D vs. 3D
32:39 - Building Immersive Worlds with Spatial Intelligence
35:11 - From Static Scenes to Dynamic Worlds
37:42 - The Future of VR and AR
40:42 - Creating Deep Tech Platforms
44:26 - Building a World-Class Team
45:54 - Measuring Success: Milestones in Spatial Intelligence

Resources:
Learn more about World Labs: https://www.worldlabs.ai
Find Fei-Fei on Twitter: https://x.com/drfeifei
Find Justin on Twitter: https://x.com/jcjohnss
Find Martin on Twitter: https://x.com/martin_casado

Stay Updated:
Let us know what you think: https://ratethispodcast.com/a16z
Find a16z on Twitter: https://twitter.com/a16z
Find a16z on LinkedIn: https://www.linkedin.com/company/a16z
Subscribe on your favorite podcast app: https://a16z.simplecast.com/
Follow our host: https://twitter.com/stephsmithio

Please note that the content here is for informational purposes only; should NOT be taken as legal, business, tax, or investment advice or be used to evaluate any investment or security; and is not directed at any investors or potential investors in any a16z fund. a16z and its affiliates may maintain investments in the companies discussed. For more details please see a16z.com/disclosures.

Fei-Fei Li (guest), Justin Johnson (guest), Martin Casado (host)
Sep 19, 2024 · 48m

At a glance

WHAT IT’S REALLY ABOUT

Fei-Fei Li’s spatial intelligence bet: AI’s next 3D leap

  1. The speakers frame AI’s recent progress as a convergence of compute, data, and algorithms, with ImageNet exemplifying how scaling labeled data unlocked modern computer vision.
  2. They argue compute growth is still underappreciated, noting that AlexNet’s training run, which once took days, would now take minutes on modern NVIDIA hardware.
  3. They describe a shift from supervised learning toward methods that exploit unlabeled or implicitly labeled data, enabling more open-ended generative capabilities.
  4. They define “spatial intelligence” as machine understanding and interaction in 3D space and time (4D), emphasizing that current LLM-style 1D token representations are a poor native fit for physical reality.
  5. They position World Labs as a deep-tech platform company aiming to generate and understand interactive worlds for media, AR/VR, and robotics, enabled by breakthroughs like NeRF that merge reconstruction and generation.

IDEAS WORTH REMEMBERING

5 ideas

Spatial intelligence is positioned as co-equal to language for general intelligence.

Li argues visual-spatial capability underpins navigation, manipulation, and building in the world, making it a foundational substrate for agents and new applications—not just a “modality” to add onto text.

Compute is a primary accelerant, and its impact is often underestimated.

Johnson highlights that AlexNet (trained for days on 2010-era GPUs) would take minutes on a GB200-class GPU, illustrating how algorithmic ideas can look revolutionary once hardware catches up.

Data unlocked vision once, but the next leap depends on learning beyond explicit labels.

They distinguish the ImageNet era (human-labeled supervised learning with fixed ontologies) from today’s more scalable approaches that learn from less constrained, more implicit, or self-supervised signals.

3D representation is the key differentiator from multimodal LLMs, not just “seeing pixels.”

They claim many multimodal systems still “shoehorn” vision into 1D token sequences, whereas spatial intelligence makes 3D/4D structure central, enabling more natural camera/object control and interaction.
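
To make the "1D token sequence" point concrete, here is a minimal sketch of patch-style tokenization, assuming the common approach of cutting an image into tiles and reading them out in row-major order. The function names (`patchify`, `token_position`) and the 4x4 toy image are hypothetical, and this is not a description of any specific model discussed in the episode.

```python
def patchify(image, patch):
    """Split a 2D grid into patch x patch tiles, returned as a 1D row-major list."""
    rows, cols = len(image), len(image[0])
    tokens = []
    for r in range(0, rows, patch):
        for c in range(0, cols, patch):
            # Each token is the flattened contents of one tile.
            tile = tuple(image[r + dr][c + dc]
                         for dr in range(patch) for dc in range(patch))
            tokens.append(tile)
    return tokens


def token_position(index, cols, patch):
    """Recover the (patch_row, patch_col) a token came from."""
    tiles_per_row = cols // patch
    return divmod(index, tiles_per_row)


# Toy 4x4 "image" whose pixel value encodes its own position (assumption for illustration).
image = [[r * 4 + c for c in range(4)] for r in range(4)]
tokens = patchify(image, 2)

# Tokens 0 and 2 are vertical neighbors in the image but are not adjacent in the sequence.
print(token_position(0, 4, 2), token_position(2, 4, 2))
```

The spatial layout survives only as bookkeeping on top of the sequence (the index arithmetic in `token_position`), which is one way to read the claim that the structure of the world is not native to a 1D representation.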

NeRF marked a practical inflection point because it made 3D reconstruction tractable and fast.

NeRF demonstrated a simple method to infer 3D structure from 2D views with modest compute, catalyzing a wave of academic progress and helping merge reconstruction (from real scenes) with generation (imagined scenes).
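
As a rough illustration of how a NeRF-style method ties 3D structure to 2D views, the sketch below composites density and color samples along a single camera ray into one pixel, using the standard front-to-back volume rendering sum. In an actual NeRF the densities and colors come from a learned network queried at each sample point, and training compares rendered pixels against observed images; here the values are set by hand, and the function name `render_ray` is hypothetical.

```python
import math


def render_ray(densities, colors, deltas):
    """Composite samples along one ray, front to back.

    densities: non-negative volume density at each sample
    colors:    (r, g, b) at each sample
    deltas:    distance covered by each sample along the ray
    Returns the rendered pixel and the light fraction that passed through everything.
    """
    transmittance = 1.0  # fraction of light not yet absorbed
    pixel = [0.0, 0.0, 0.0]
    for sigma, color, delta in zip(densities, colors, deltas):
        alpha = 1.0 - math.exp(-sigma * delta)  # opacity of this segment
        weight = transmittance * alpha
        for k in range(3):
            pixel[k] += weight * color[k]
        transmittance *= 1.0 - alpha
    return pixel, transmittance


# Hand-set example: empty space (density 0) in front of an opaque red surface.
pixel, remaining = render_ray(
    densities=[0.0, 1e9],
    colors=[(0.0, 1.0, 0.0), (1.0, 0.0, 0.0)],
    deltas=[1.0, 1.0],
)
print(pixel, remaining)
```

Because every step is differentiable with respect to the densities and colors, errors measured on 2D pixels can be pushed back into the 3D quantities, which is the sense in which 3D structure is inferred from 2D observations.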

WORDS WORTH SAVING

5 quotes

Visual-spatial intelligence is so fundamental. It's as fundamental as language.

Fei-Fei Li

Even no matter how much people talk about it, I think people underestimate [compute].

Justin Johnson

The only difference between AlexNet and the ConvNet... is the GPUs... and the deluge of data.

Fei-Fei Li

Their underlying representation under the hood is... one-dimensional... [but] the three-dimensional nature of the world should be front and center.

Justin Johnson

When NeRF happened... suddenly reconstruction and generation start to really merge.

Fei-Fei Li

ImageNet and the supervised learning era
Compute scaling and the “bitter lesson”
Data scale vs. labeling and implicit supervision
Generative modeling continuum (style transfer, GANs, diffusion)
NeRF and 3D from 2D observations
Spatial intelligence (3D/4D representations and affordances)
Applications: world generation, AR/VR, robotics; platform strategy

High quality AI-generated summary created from speaker-labeled transcript.
