No Priors Ep. 117 | With Co-Director of Stanford's HAI & Founder of World Labs Dr. Fei-Fei Li

In this episode of No Priors, Sarah and Elad are joined by Fei-Fei Li, AI pioneer, co-director of Stanford’s Human-Centered AI Institute, and founder of World Labs. Fei-Fei shares why she’s building at the intersection of embodiment and intelligence, and what today’s AI systems are still missing. From the early days of ImageNet to her vision for the next generation of robotics, she unpacks the human and technical motivations behind World Labs. They also discuss the challenges of 3D world modeling, her approach to building exceptional teams, and the special qualities that have led students of hers, such as Andrej Karpathy, to make major breakthroughs.

Show Notes:
0:00 Why and what Fei-Fei is building
3:00 World models at World Labs
6:44 Missing gaps in the AI future
9:16 Robotics and physical intelligence
16:15 Greatest challenges of 3D
19:08 Fei-Fei’s PhD work and ImageNet
23:05 Special moments in career
29:33 Building teams
32:05 Human-centered AI

Sarah Guo (host), Fei-Fei Li (guest), Elad Gil (host)

Jun 5, 2025 · 35m

CHAPTERS

0:00 – 1:23
Why Fei-Fei Li is founding World Labs now: building spatial intelligence
Fei-Fei explains why she chose to start a company at this moment: a personal drive to build and a conviction that spatial intelligence is the next foundational capability AI must master. She frames 3D world understanding as broadly enabling across many industries and use cases.
- Motivation to transition from academia/policy into building a product company
- Spatial intelligence as a missing pillar for AI’s next phase
- 3D world models as a broadly enabling technology layer
- Excitement about building with a high-caliber, mission-driven team
1:23 – 2:54
Defining spatial intelligence: understanding, reasoning, and generating 3D worlds
Fei-Fei defines spatial intelligence as the capacity to perceive, reason about, interact with, and generate 3D environments. She argues that because reality is fundamentally 3D, AI systems that lack 3D understanding will remain incomplete.
- Spatial intelligence = understand/reason/interact/generate in 3D
- 3D representations unlock design, navigation, simulation, AR/VR
- Spatial intelligence is fundamental across animals and human evolution
- Claim: AI without spatial intelligence is incomplete
2:54 – 4:18
World Labs’ technical thesis: 3D world models and 3D generative foundation models
The conversation turns to what World Labs is building: 3D-native world models and a foundation model for 3D generation. Fei-Fei emphasizes realism/plausibility—geometry and physics that hold together—so models can be used for real interaction and not just visuals.
- World Labs is focusing on 3D-native world models
- Positioning: the first company they know of targeting 3D generation foundation models
- Realism vs plausibility: physics and geometry must be coherent
- 3D world models as the substrate for many downstream spatial tasks
4:18 – 6:43
Why vision and spatial cognition matter (neuroscience framing)
Fei-Fei contrasts spatial intelligence with language: evolution had to solve the problem of reconstructing 3D structure from sensory inputs. She highlights that even humans struggle to explicitly generate complex 3D models without training, which creates an opportunity for AI to augment creation and interaction.
- Animals reconstruct 3D worlds from light captured by eyes
- Spatial reasoning is essential for navigation and manipulation
- Humans aren’t naturally good at explicit 3D generation without training
- AI could make 3D creation/editing fluid and accessible
6:43 – 9:11
Gaps in the AI future: beyond language—3D, robotics systems, emotional intelligence
Elad asks what other major modeling gaps remain after language progress. Fei-Fei proposes three broad buckets—language, 3D/spatial, and emotional intelligence—while noting robotics requires system integration beyond “just the brain” of a model.
- Language is, to a large extent, solved relative to earlier eras
- 3D/spatial is as hard and as critical as language
- Emotional intelligence remains largely unsolved and hard to define
- Robotics is a system integration challenge (not only model capability)
9:11 – 11:48
Robotics and physical intelligence: data mixtures, simulation, and haptics
Fei-Fei discusses the coming era of humans cohabiting with robots, clarifying robots need not be humanoid. She argues training will require hybrid data sources and says simulation is underappreciated, while haptics is an especially underrated modality for manipulation.
- Robots will be ubiquitous; “robot” ≠ humanoid by default
- Training will rely on hybrid data (video, sim, teleop, embodied collection)
- Simulation and synthetic data will be central to scaling robotics
- Haptics is crucial for manipulation and often underintegrated with vision
11:48 – 13:21
Morphological intelligence: why robot form factors will diversify
Building on robotics, Fei-Fei explains “morphological intelligence,” where an agent’s form can optimize for tasks. She predicts a diversity of robot form factors driven by efficiency—fish-like underwater robots, specialized flying systems—rather than a single generalized humanoid shape.
- Morphology can be optimized for tasks (morphological intelligence)
- Economic/energy efficiency pushes toward specialization and diversity
- Humanoid forms are inefficient for many environments (underwater, flight)
- Expectation: many robot embodiments over time, not one standard form
13:21 – 15:29
Near-term commercial applications: creativity tools and 3D content for XR/metaverse
Sarah asks about practical applications of 3D world generation. Fei-Fei points to creativity as the immediate wedge—AI as a collaborator for designers and artists—and argues AR/VR adoption is bottlenecked by content creation that 3D generative models can accelerate.
- Creativity as a primary near-term use case for spatial AI
- Analogy to LLM copilots in software engineering (human-AI collaboration)
- Applications: designers, 3D/VFX artists, marketing, game development
- XR/metaverse needs scalable 3D content creation beyond hardware progress
15:29 – 16:18
World models as a path to scalable RL and interactive agents
The hosts probe whether world models help create more generalizable agents via reinforcement learning. Fei-Fei argues that many 3D tasks (like design) naturally fit RL-style optimization objectives and require 3D interaction to be modeled well.
- 3D interaction is central to real-world and digital agent behavior
- World models provide a substrate for training and evaluation loops
- Design involves multi-objective optimization (beauty, efficiency, constraints)
- RL becomes more natural with interactive, editable 3D environments
16:18 – 18:05
Hard problems in 3D: data scarcity and productization challenges
Fei-Fei outlines why 3D foundation models are difficult: high-quality 3D data is scarce and requires sophisticated acquisition and synthesis. She also notes that 3D is harder to “deliver” as a product than language—people don’t passively consume 3D the way they read text—so UX and workflows matter.
- 3D data is scarce; requires heavy data engineering and synthesis
- Contrast with NLP: abundant internet text enabled rapid scaling
- 3D is harder to productize and integrate into everyday workflows
- Need for active interaction/editing experiences rather than passive viewing
18:05 – 19:03
Personal curiosity and imagined worlds: from microscopic spaces to “inside a dishwasher”
In a lighter segment, Fei-Fei describes the types of worlds she’d love to explore through spatial modeling. Her examples emphasize educational and experiential possibilities—seeing hidden systems and scales that are inaccessible in daily life.
- Interest in exploring worlds we can’t normally see
- Examples: microscopic environments, inside an engine, inside a dishwasher
- Spatial models as experiential learning tools
- Virtual exploration as a compelling driver for 3D world generation
19:03 – 23:01
Early career data lessons: the 101-category dataset and the road to ImageNet
Prompted by Sarah (and Andrej Karpathy’s suggestion), Fei-Fei recounts her PhD-era push for object recognition and the realization that data was the limiting factor. She describes curating an early 101-category dataset from dictionary words and early Google image search—foreshadowing the later ImageNet-scale approach.
- Object recognition was blocked by a lack of training data in the early 2000s
- Advisor encouraged dataset creation; debate over scope led to 101 classes
- Using dictionary words + manual curation/cleaning from early image search
- Hands-on effort and data pragmatism as a recurring theme
23:01 – 28:07
Seminal moments: ImageNet’s validation and early vision-language convergence
Fei-Fei reflects on defining career moments: ImageNet’s long arc—from skepticism and tenure risk to AlexNet and broad validation—and the excitement of early image captioning/storytelling work that combined CNNs with sequential models (LSTMs). She emphasizes scientific validation as “making a difference,” not just recognition.
- ImageNet journey: early resistance, scaling via Mechanical Turk, AlexNet win
- Validation for scientists = proving a doubted hypothesis can work
- Early convergence of language+vision via captioning/storytelling research
- Surprise at how quickly deep learning advanced these “lifetime” problems
28:07 – 35:46
Advice and leadership: fearlessness, hiring for diverse expertise, and human-centered AI
Fei-Fei’s core advice is to be fearless—bold enough to pursue big problems without being recklessly irrational. She then describes how she hires at World Labs (diverse disciplines + fearless mindset) and closes with her vision for human-centered AI: technology that augments people while preserving values like justice and prosperity, with healthcare as a major application area.
- Advice: be fearless, pursuing big problems boldly but not recklessly
- Hiring: diversity of thinking across graphics, vision, data, infra, optimization, product
- Assessing fearlessness via curiosity, comfort with uncertainty, and motivation
- Human-centered AI vision: collaboration/superpowering humans, with healthcare impact
