a16z
How Fei-Fei Li Is Rebuilding AI for the Real World
CHAPTERS
Spatial intelligence as the next frontier beyond language
Fei-Fei Li frames “space”—the 3D world outside us and in our mind’s eye—as a core component of intelligence. She argues that progress in AI will increasingly require models that understand and operate in 3D, not just in words.
- •Spatial intelligence is fundamental to animal and human intelligence
- •3D understanding unlocks capabilities that language alone can’t encode
- •The conversation sets up a shift from language-first AI to world-first AI
Fei-Fei Li’s impact: making data central to modern AI
Martin Casado summarizes Fei-Fei’s major contributions and why she’s often called the “godmother of AI.” He emphasizes her role in elevating data as a first-class driver of AI progress, alongside model architecture.
- •Fei-Fei helped bring large-scale data to the center of AI progress
- •Her career spans academia and major tech leadership roles
- •Data quality/scale is positioned as an enduring differentiator in AI
Why World Labs needed an “intellectual partner” investor
Fei-Fei explains why she chose Martin as the first investor: not just for capital, but for deep technical alignment and ongoing collaboration. She describes World Labs as “deep tech” requiring sustained, high-conviction partnership.
- •World Labs’ mission is framed as a North Star, long-horizon effort
- •Investor criteria: technical depth, go-to-market insight, and day-to-day intellectual partnership
- •Emphasis on concentrated resources (compute, data, talent) to make the leap
The “world model” dinner moment: aligning on what’s missing
Martin recounts a key conversation at a dinner during the height of LLM hype, where Fei-Fei crystallized the idea: AI is missing a true “world model.” Fei-Fei adds that many people nodded at the phrase without understanding it, so she tested alignment by asking Martin to define it precisely.
- •“World model” means 3D structure, shape, and compositional understanding
- •Shared intuition: LLMs aren’t the end of the story for real-world AI
- •Clear articulation and shared definitions were pivotal to forming the company
Looking back at AI’s evolution: surprise at data-driven emergence
Asked what would surprise her younger self, Fei-Fei highlights the emotional and scientific surprise of how far data-hungry models have gone. She notes emergent behaviors that feel like “thinking machines,” even given her long-standing belief in data-centric AI.
- •Unexpected pace/scale of capability emergence from data-driven training
- •Emotional surprise despite being an early champion of data in AI
- •Sets context for extending foundation-model ideas beyond language
Why LLMs aren’t enough: language is lossy for the physical world
Fei-Fei argues language is powerful but an incomplete encoding of reality—especially the 3D physical world where perception, interaction, and embodiment matter. She contrasts language’s generative nature with the grounded, ever-present structure of the perceptual world.
- •Language captures thought but is a lossy medium for 3D physical reality
- •Much of intelligence (perception, action, construction) is beyond language
- •Motivation for building world models with industry-grade focus
A thought experiment: blindfolded instructions vs seeing the room
Martin illustrates why words fail as a substitute for spatial representation: describing a room to a blindfolded person is inadequate for precise tasks. Vision (and internal 3D reconstruction) enables accurate manipulation and navigation.
- •Reality is high-dimensional and exact; language is approximate
- •Humans act effectively when reconstructing 3D from perception
- •World models aim to give machines a manipulable 3D representation
“Unrolling evolution”: why 3D intelligence is harder than language
The discussion argues that language capabilities arrived first in AI partly because spatial navigation, which is far older in evolutionary terms, is also far more demanding to reproduce. They point to decades of expensive robotics and autonomous vehicle efforts as evidence that world interaction remains difficult.
- •Language processing is evolutionarily recent; spatial navigation is ancient
- •Robotics/AV illustrate how hard real-world navigation is (even “2D” versions)
- •Generative-model breakthroughs suggest a new path for 3D world modeling
Why 3D matters: science, creativity, and human breakthroughs depend on space
Fei-Fei connects spatial reasoning to major human achievements, from deciphering DNA’s double helix to understanding molecular structures like buckyballs. The point is that core reasoning and innovation often require 3D mental models, not just verbal ones.
- •Spatial reasoning underpins scientific discovery and invention
- •Examples: DNA structure (double helix), buckyball molecular geometry
- •3D intelligence is framed as a critical axis of general intelligence
From one reality to infinite virtual universes (the multiverse vision)
Fei-Fei describes how combining reconstruction and generation could let us create “infinite universes” for robotics training, creativity, travel, socialization, and storytelling. The promise is a horizontal platform, akin to LLMs, but for spatial worlds.
- •3D world models enable both reconstruction (what’s there) and generation (what could be there)
- •Use cases span robots, creative tools, social/virtual experiences, and storytelling
- •The “multiverse” idea: moving from one shared physical world to many virtual ones
3D vs 2D: why 2D isn’t enough for machines
They argue that physics and interaction happen in 3D, so machine agents need explicit depth and geometry—especially for tasks like measuring distances and grasping objects. Humans can infer 3D from 2D video, but robots/computers need that structure represented directly.
- •Z-depth is essential for interaction, manipulation, and navigation
- •2D is often sufficient for humans due to built-in 3D reconstruction
- •Machine action requires explicit 3D state to plan and execute
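A rough illustration of the point about measuring distances, not something shown in the episode: under a standard pinhole camera model, the same pair of pixels corresponds to very different physical separations depending on depth, so a machine cannot measure from 2D coordinates alone. The intrinsics (FX, FY, CX, CY), pixel coordinates, and depth values below are hypothetical, chosen only for the sketch.

```python
import math

def backproject(u, v, depth, fx, fy, cx, cy):
    """Map pixel (u, v) with metric depth to a 3D point in the camera frame (pinhole model)."""
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return (x, y, depth)

# Hypothetical camera intrinsics: focal lengths and principal point, in pixels.
FX = FY = 500.0
CX, CY = 320.0, 240.0

# The same two pixels, 100 px apart on the image, observed at two different depths.
left_px, right_px = (270, 240), (370, 240)

near_gap = math.dist(
    backproject(*left_px, 1.0, FX, FY, CX, CY),
    backproject(*right_px, 1.0, FX, FY, CX, CY),
)
far_gap = math.dist(
    backproject(*left_px, 4.0, FX, FY, CX, CY),
    backproject(*right_px, 4.0, FX, FY, CX, CY),
)

print(f"gap at 1 m depth: {near_gap:.2f} m")
print(f"gap at 4 m depth: {far_gap:.2f} m")
```

The pixel distance is identical in both cases; only the depth value separates a gap a gripper could span from one it could not.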
Fei-Fei’s stereo vision injury: a personal proof of 3D’s importance
Fei-Fei recounts temporarily losing stereo vision due to a cornea injury and how it made driving feel unsafe. The experience highlights how critical depth perception is for accurate distance estimation and real-world behavior.
- •Loss of stereo vision impaired distance judgment even in familiar environments
- •Driving required extreme caution and slow speeds to avoid collisions
- •A vivid analogy for why AI systems need true depth understanding
Inside World Labs R&D: the 3D toolkit and team composition
Fei-Fei outlines the state of the field and the building blocks World Labs is combining: NeRFs, Gaussian splats, image generation, and broader advances from academia and industry. Both she and Martin emphasize that success requires a rare blend of AI/modeling, data, and computer graphics expertise to represent and render 3D worlds effectively.
- •Key technical pillars: NeRF (3D reconstruction), Gaussian splats (3D representation), early deep-learning image generation
- •World Labs’ approach: concentrate top talent + compute + data around one North Star problem
- •Building 3D world models requires integrating AI architectures with graphics expertise: how worlds are represented in memory and rendered on screen