a16z: “The Future of AI is Here” — Fei-Fei Li Unveils the Next Frontier of AI
At a glance
WHAT IT’S REALLY ABOUT
Fei-Fei Li’s spatial intelligence bet: AI’s next 3D leap
- The speakers frame AI’s recent progress as a convergence of compute, data, and algorithms, with ImageNet exemplifying how scaling labeled data unlocked modern computer vision.
- They argue compute growth is still underappreciated, citing AlexNet’s once-massive training run now compressible to minutes on modern NVIDIA hardware.
- They describe a shift from supervised learning toward methods that exploit unlabeled or implicitly labeled data, enabling more open-ended generative capabilities.
- They define “spatial intelligence” as machine understanding and interaction in 3D space and time (4D), emphasizing that current LLM-style 1D token representations are a poor native fit for physical reality.
- They position World Labs as a deep-tech platform company aiming to generate and understand interactive worlds for media, AR/VR, and robotics, enabled by breakthroughs like NeRF that merge reconstruction and generation.
IDEAS WORTH REMEMBERING
Spatial intelligence is positioned as co-equal to language for general intelligence.
Li argues visual-spatial capability underpins navigation, manipulation, and building in the world, making it a foundational substrate for agents and new applications—not just a “modality” to add onto text.
Compute is a primary accelerant, and its impact is often underestimated.
Johnson highlights that AlexNet (trained for days on 2010-era GPUs) would take minutes on a GB200-class GPU, illustrating how algorithmic ideas can look revolutionary once hardware catches up.
Data unlocked vision once, but the next leap depends on learning beyond explicit labels.
They distinguish the ImageNet era (human-labeled supervised learning with fixed ontologies) from today’s more scalable approaches that learn from less constrained, more implicit, or self-supervised signals.
3D representation is the key differentiator from multimodal LLMs, not just “seeing pixels.”
They claim many multimodal systems still “shoehorn” vision into 1D token sequences, whereas spatial intelligence makes 3D/4D structure central, enabling more natural camera/object control and interaction.
NeRF marked a practical inflection point because it made 3D reconstruction tractable and fast.
NeRF demonstrated a simple method to infer 3D structure from 2D views with modest compute, catalyzing a wave of academic progress and helping merge reconstruction (from real scenes) with generation (imagined scenes).
WORDS WORTH SAVING
Visual-spatial intelligence is so fundamental. It's as fundamental as language.
— Fei-Fei Li
Even no matter how much people talk about it, I think people underestimate [compute].
— Justin Johnson
The only difference between AlexNet and the ConvNet... is the GPUs... and the deluge of data.
— Fei-Fei Li
Their underlying representation under the hood is... one-dimensional... [but] the three-dimensional nature of the world should be front and center.
— Justin Johnson
When NeRF happened... suddenly reconstruction and generation start to really merge.
— Fei-Fei Li
AI-generated summary created from a speaker-labeled transcript.