“The Future of AI is Here” — Fei-Fei Li Unveils the Next Frontier of AI

a16z · Sep 20, 2024 · 48m

Fei-Fei Li (guest), Justin Johnson (guest), Martin Casado (host)

- ImageNet and the supervised learning era
- Compute scaling and the “bitter lesson”
- Data scale vs. labeling and implicit supervision
- Generative modeling continuum (style transfer, GANs, diffusion)
- NeRF and 3D from 2D observations
- Spatial intelligence (3D/4D representations and affordances)
- Applications: world generation, AR/VR, robotics; platform strategy

In this episode of a16z, Fei-Fei Li and Justin Johnson discuss Li’s spatial intelligence bet: AI’s next leap into 3D.

Fei-Fei Li’s spatial intelligence bet: AI’s next 3D leap

The speakers frame AI’s recent progress as a convergence of compute, data, and algorithms, with ImageNet exemplifying how scaling labeled data unlocked modern computer vision.

They argue compute growth is still underappreciated, citing AlexNet’s once-massive training run now compressible to minutes on modern NVIDIA hardware.

They describe a shift from supervised learning toward methods that exploit unlabeled or implicitly labeled data, enabling more open-ended generative capabilities.

They define “spatial intelligence” as machine understanding and interaction in 3D space and time (4D), emphasizing that current LLM-style 1D token representations are a poor native fit for physical reality.

They position World Labs as a deep-tech platform company aiming to generate and understand interactive worlds for media, AR/VR, and robotics, enabled by breakthroughs like NeRF that merge reconstruction and generation.

Key Takeaways

Spatial intelligence is positioned as co-equal to language for general intelligence.

Li argues visual-spatial capability underpins navigation, manipulation, and building in the world, making it a foundational substrate for agents and new applications—not just a “modality” to add onto text.

Compute is a primary accelerant, and its impact is often underestimated.

Johnson highlights that AlexNet (trained for days on 2010-era GPUs) would take minutes on a GB200-class GPU, illustrating how algorithmic ideas can look revolutionary once hardware catches up.

Data unlocked vision once, but the next leap depends on learning beyond explicit labels.

They distinguish the ImageNet era (human-labeled supervised learning with fixed ontologies) from today’s more scalable approaches that learn from less constrained, more implicit, or self-supervised signals.

3D representation is the key differentiator from multimodal LLMs, not just “seeing pixels.”

They claim many multimodal systems still “shoehorn” vision into 1D token sequences, whereas spatial intelligence makes 3D/4D structure central, enabling more natural camera/object control and interaction.

NeRF marked a practical inflection point because it made 3D reconstruction tractable and fast.

NeRF demonstrated a simple method to infer 3D structure from 2D views with modest compute, catalyzing a wave of academic progress and helping merge reconstruction (from real scenes) with generation (imagined scenes).
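NeRF’s core idea is differentiable volume rendering: sample points along a camera ray, query a learned network for density and color at each point, and alpha-composite the samples into a pixel. As a rough illustration of the compositing step only (the function name and toy inputs below are hypothetical, not World Labs or NeRF reference code):

```python
import numpy as np

def render_ray(densities, colors, deltas):
    """NeRF-style alpha compositing along one camera ray.

    densities: (N,) non-negative volume densities sigma_i per sample
    colors:    (N, 3) RGB predicted at each sample
    deltas:    (N,) distances between adjacent samples along the ray
    """
    # Opacity contributed by each ray segment: alpha_i = 1 - exp(-sigma_i * delta_i)
    alphas = 1.0 - np.exp(-densities * deltas)
    # Transmittance T_i: fraction of light surviving all earlier segments
    trans = np.cumprod(np.concatenate([[1.0], 1.0 - alphas[:-1]]))
    # Compositing weights, then the expected color along the ray
    weights = trans * alphas
    return (weights[:, None] * colors).sum(axis=0)
```

In a full NeRF, `densities` and `colors` come from an MLP queried at each 3D sample point, and the whole pipeline is differentiable, so the network can be fit to 2D views by gradient descent on pixel error.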

World generation could become a new medium if creation costs collapse.

They argue today’s rich interactive worlds are mostly limited to AAA games due to labor and cost; spatially intelligent generative systems could make bespoke, interactive 3D experiences economically viable for many niches.

A platform approach requires multidisciplinary talent spanning vision, graphics, systems, and data.

They emphasize the problem is not “just AI”: success depends on unifying ML, 3D geometry, computer graphics, and large-scale engineering—reflected in their founding team and hiring philosophy.

Notable Quotes

Visual-spatial intelligence is so fundamental. It's as fundamental as language.

Fei-Fei Li

Even no matter how much people talk about it, I think people underestimate [compute].

Justin Johnson

The only difference between AlexNet and the ConvNet... is the GPUs... and the deluge of data.

Fei-Fei Li

Their underlying representation under the hood is... one-dimensional... [but] the three-dimensional nature of the world should be front and center.

Justin Johnson

When NeRF happened... suddenly reconstruction and generation start to really merge.

Fei-Fei Li

Questions Answered in This Episode

What would a “native 3D/4D representation” look like architecturally—tokens, grids, Gaussians, radiance fields, or something else?

You argue multimodal LLMs are fundamentally 1D; what specific spatial tasks (camera control, object permanence, physics) expose the limits most clearly today?

NeRF training was described as feasible on a single GPU—what are the next bottlenecks for scaling spatial intelligence: compute, data capture, evaluation, or interaction/physics modeling?

How do you plan to source or generate the right kind of data for spatial intelligence, given that high-quality 3D ground truth is scarce?

Where do diffusion models fit in your roadmap: are they the generative “engine,” or do you expect new model classes optimized for 3D/4D?
