How Fei-Fei Li Is Rebuilding AI for the Real World

What if the next leap in artificial intelligence isn’t about better language, but better understanding of space? In this episode, a16z General Partner Erik Torenberg moderates a conversation with Fei-Fei Li, cofounder and CEO of World Labs, and Martin Casado, a16z General Partner and early investor in the company. Together, they explore the concept of world models: AI systems that understand and reason about the physical, 3D world, not just text. Fei-Fei, often called the “godmother of AI,” explains why spatial intelligence is a critical (and missing) component of today's AI systems, and why her new company is going all-in on solving this challenge. Martin shares the story of how he and Fei-Fei aligned on this vision long before it was trendy, and why it may define the future of robotics, creativity, and computation itself. From the limitations of LLMs to the promise of embodied AI, from personal anecdotes to deep technical insights, this is a discussion on what it truly means to build intelligence for the real (and virtual) world.
Timecodes:
00:00 Spatial Intelligence
00:39 Fei-Fei Li’s Background
01:17 Building a World Model
05:14 Reflecting on AI's Evolution
08:07 The Importance of 3D Understanding
10:20 Unrolling Evolution: Why 3D Intelligence Is Harder Than Language
12:19 From Single Reality to Infinite Virtual Universes
16:52 3D vs 2D: Why 2D Isn’t Enough for Machines
17:57 Fei-Fei’s Personal Story of Losing Stereo Vision
19:24 Research and Development at World Labs

Resources:
Find Fei-Fei on X: https://x.com/drfeifei
Find Martin on X: https://x.com/martin_casado
Learn more about World Labs: https://www.worldlabs.ai/

Stay Updated:
Let us know what you think: https://ratethispodcast.com/a16z
Find a16z on Twitter: https://twitter.com/a16z
Find a16z on LinkedIn: https://www.linkedin.com/company/a16z
Subscribe on your favorite podcast app: https://a16z.simplecast.com/
Follow our host: https://x.com/eriktorenberg

Please note that the content here is for informational purposes only; should NOT be taken as legal, business, tax, or investment advice or be used to evaluate any investment or security; and is not directed at any investors or potential investors in any a16z fund. a16z and its affiliates may maintain investments in the companies discussed. For more details please see a16z.com/disclosures.

Fei-Fei Li (guest), Erik Torenberg (host), Martin Casado (guest)

Jun 4, 2025 · 22m

CHAPTERS

Spatial intelligence as the next frontier beyond language
Fei-Fei Li frames “space”—the 3D world outside us and in our mind’s eye—as a core component of intelligence. She argues that progress in AI will increasingly require models that understand and operate in 3D, not just in words.
- Spatial intelligence is fundamental to animal and human intelligence
- 3D understanding unlocks capabilities that language alone can’t encode
- The conversation sets up a shift from language-first AI to world-first AI
Fei-Fei Li’s impact: making data central to modern AI
Martin Casado summarizes Fei-Fei’s major contributions and why she’s often called the “godmother of AI.” He emphasizes her role in elevating data as a first-class driver of AI progress, alongside model architecture.
- Fei-Fei helped bring large-scale data to the center of AI progress
- Her career spans academia and major tech leadership roles
- Data quality/scale is positioned as an enduring differentiator in AI
Why World Labs needed an “intellectual partner” investor
Fei-Fei explains why she chose Martin as the first investor: not just for capital, but for deep technical alignment and ongoing collaboration. She describes World Labs as “deep tech” requiring sustained, high-conviction partnership.
- World Labs’ mission is framed as a North Star, long-horizon effort
- Investor criteria: technical depth, go-to-market insight, and day-to-day intellectual partnership
- Emphasis on concentrated resources (compute, data, talent) to make the leap
The “world model” dinner moment: aligning on what’s missing
Martin recounts a key conversation during an LLM-hype dinner where Fei-Fei crystallized the idea: AI is missing a true “world model.” Fei-Fei adds that many people nodded at the phrase without understanding it, so she tested alignment by asking Martin to define it precisely.
- “World model” means 3D structure, shape, and compositional understanding
- Shared intuition: LLMs aren’t the end of the story for real-world AI
- Clear articulation and shared definitions were pivotal to forming the company
Looking back at AI’s evolution: surprise at data-driven emergence
Asked what would surprise her younger self, Fei-Fei highlights the emotional and scientific surprise of how far data-hungry models have gone. She notes emergent behaviors that feel like “thinking machines,” even given her long-standing belief in data-centric AI.
- Unexpected pace/scale of capability emergence from data-driven training
- Emotional surprise despite being an early champion of data in AI
- Sets context for extending foundation-model ideas beyond language
Why LLMs aren’t enough: language is lossy for the physical world
Fei-Fei argues language is powerful but an incomplete encoding of reality—especially the 3D physical world where perception, interaction, and embodiment matter. She contrasts language’s generative nature with the grounded, ever-present structure of the perceptual world.
- Language captures thought but is a lossy medium for 3D physical reality
- Much of intelligence (perception, action, construction) is beyond language
- Motivation for building world models with industry-grade focus
A thought experiment: blindfolded instructions vs seeing the room
Martin illustrates why words fail as a substitute for spatial representation: describing a room to a blindfolded person is inadequate for precise tasks. Vision (and internal 3D reconstruction) enables accurate manipulation and navigation.
- Reality is high-dimensional and exact; language is approximate
- Humans act effectively when reconstructing 3D from perception
- World models aim to give machines a manipulable 3D representation
“Unrolling evolution”: why 3D intelligence is harder than language
The speakers argue that language capabilities arrived first in AI partly because language is an evolutionarily recent skill, whereas spatial navigation is deeply rooted in evolution and far more demanding. They point to decades of expensive robotics and autonomous vehicle efforts as evidence that world interaction remains difficult.
- Language processing is evolutionarily recent; spatial navigation is ancient
- Robotics/AV illustrate how hard real-world navigation is (even “2D” versions)
- Generative-model breakthroughs suggest a new path for 3D world modeling
Why 3D matters: science, creativity, and human breakthroughs depend on space
Fei-Fei connects spatial reasoning to major human achievements, from deciphering DNA’s double helix to understanding molecular structures like buckyballs. The point is that core reasoning and innovation often require 3D mental models, not just verbal ones.
- Spatial reasoning underpins scientific discovery and invention
- Examples: DNA structure (double helix), buckyball molecular geometry
- 3D intelligence is framed as a critical axis of general intelligence
From one reality to infinite virtual universes (the multiverse vision)
Fei-Fei describes how combining reconstruction and generation could let us create “infinite universes” for robotics training, creativity, travel, socialization, and storytelling. The promise is a horizontal platform, akin to LLMs, but for spatial worlds.
- 3D world models enable both reconstruction (what’s there) and generation (what could be there)
- Use cases span robots, creative tools, social/virtual experiences, and storytelling
- The “multiverse” idea: moving from one shared physical world to many virtual ones
3D vs 2D: why 2D isn’t enough for machines
They argue that physics and interaction happen in 3D, so machine agents need explicit depth and geometry—especially for tasks like measuring distances and grasping objects. Humans can infer 3D from 2D video, but robots/computers need that structure represented directly.
- Z-depth is essential for interaction, manipulation, and navigation
- 2D is often sufficient for humans due to built-in 3D reconstruction
- Machine action requires explicit 3D state to plan and execute
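The point about measuring distances can be illustrated with a small sketch. It is not from the episode and is not World Labs' method; it uses the standard pinhole camera model, and the intrinsics (FX, FY, CX, CY), pixel coordinates, and depths are hypothetical values chosen for illustration. It shows that the same 100-pixel gap in an image corresponds to different real-world distances depending on depth, which is why a 2D image alone does not fix a metric measurement.

```python
import math

# Hypothetical camera intrinsics (focal lengths and principal point, in pixels).
FX = FY = 500.0
CX, CY = 320.0, 240.0


def backproject(u, v, depth):
    """Map pixel (u, v) with metric depth to a 3D point in the camera frame."""
    x = (u - CX) * depth / FX
    y = (v - CY) * depth / FY
    return (x, y, depth)


# Two pixels that are 100 px apart in the image.
pixel_a = (320, 240)
pixel_b = (420, 240)

# Same 2D gap, observed on a surface 1 m away and on a surface 4 m away.
near = math.dist(backproject(*pixel_a, 1.0), backproject(*pixel_b, 1.0))
far = math.dist(backproject(*pixel_a, 4.0), backproject(*pixel_b, 4.0))

print(f"100 px gap at 1 m depth: {near:.2f} m")
print(f"100 px gap at 4 m depth: {far:.2f} m")
```

Without the depth argument, `backproject` has no way to distinguish the two cases, which mirrors the chapter's claim that a machine needs depth and geometry represented explicitly.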
Fei-Fei’s stereo vision injury: a personal proof of 3D’s importance
Fei-Fei recounts temporarily losing stereo vision due to a cornea injury and how it made driving feel unsafe. The experience highlights how critical depth perception is for accurate distance estimation and real-world behavior.
- Loss of stereo vision impaired distance judgment even in familiar environments
- Driving required extreme caution and slow speeds to avoid collisions
- A vivid analogy for why AI systems need true depth understanding
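As a rough illustration of why two viewpoints help with distance estimation, the textbook relation for a rectified stereo camera pair is depth = focal length × baseline / disparity. The sketch below is an assumption-laden example, not a model of human vision and not something described in the episode: the focal length, the 0.065 m baseline, and the disparity values are all hypothetical.

```python
def depth_from_disparity(focal_px, baseline_m, disparity_px):
    """Depth (meters) of a point seen by a rectified stereo pair.

    focal_px:     focal length in pixels (hypothetical value below)
    baseline_m:   distance between the two viewpoints in meters
    disparity_px: horizontal shift of the point between the two images
    """
    if disparity_px <= 0:
        # Zero disparity means no depth cue is available from the pair.
        raise ValueError("disparity must be positive to recover depth")
    return focal_px * baseline_m / disparity_px


FOCAL_PX = 700.0    # hypothetical
BASELINE_M = 0.065  # hypothetical spacing between the two viewpoints

for disparity in (20.0, 10.0, 5.0):
    depth = depth_from_disparity(FOCAL_PX, BASELINE_M, disparity)
    print(f"disparity {disparity:>4.1f} px -> depth {depth:.2f} m")
```

With a single viewpoint there is no disparity to measure, so this cue disappears entirely, which is consistent with the difficulty in judging distance that Fei-Fei describes.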
Inside World Labs R&D: the 3D toolkit and team composition
Fei-Fei outlines the state of the field and the building blocks World Labs is combining: NeRFs, Gaussian splats, image generation, and broader advances from academia and industry. Both she and Martin emphasize that success requires a rare blend of AI/modeling, data, and computer graphics expertise to represent and render 3D worlds effectively.
- Key technical pillars: NeRF (3D reconstruction), Gaussian splats (3D representation), early deep-learning image generation
- World Labs’ approach: concentrate top talent, compute, and data around one North Star problem
- Solving 3D world models requires integrating AI architectures with graphics/representation in memory and on screen