a16z
How Fei-Fei Li Is Rebuilding AI for the Real World
At a glance
WHAT IT’S REALLY ABOUT
Fei-Fei Li’s push for 3D “world models” beyond language
- Fei-Fei Li argues that language is a powerful but lossy, non-natural encoding that cannot capture the full structure of the 3D physical world where animals and humans act.
- The conversation frames “world models” as AI systems that reconstruct and generate complete 3D representations from limited views, enabling measurement, manipulation, and interaction in space.
- The speakers explain why spatial intelligence is evolutionarily older and harder than language, pointing to the slow progress and massive investment that autonomy and robotics have required.
- World models are positioned as a horizontal platform technology, unlocking applications from robotics and embodied agents to design, architecture, games, and “infinite virtual universes.”
- Fei-Fei outlines World Labs’ approach: concentrate top talent across computer vision, graphics, diffusion/generative modeling, optimization, and data, building on advances like NeRFs and Gaussian splats.
IDEAS WORTH REMEMBERING
5 ideas
LLMs don’t equal full intelligence because the world isn’t made of words.
Fei-Fei emphasizes that language is a human-made, generative abstraction; the physical environment is inherently perceptual and 3D, and much human reasoning and action depends on non-linguistic spatial understanding.
A “world model” means reconstructing and reasoning over 3D structure, not just recognizing images.
The goal is to infer full 3D geometry and compositional structure (including occluded parts like “the back of the table”) so a machine can compute distance, manipulate objects, and plan actions.
3D is harder than language because it’s tied to action, physics, and evolution’s oldest capabilities.
The speakers argue that spatial cognition predates language by hundreds of millions of years, and that modern evidence (e.g., the decades it has taken autonomous vehicles to progress) suggests real-world navigation and interaction are fundamentally difficult.
2D data can be enough for humans, but not for robots—because humans supply the missing Z-axis mentally.
A human can watch 2D video and internally reconstruct depth, but a robot asked to grasp, measure, or avoid collisions needs explicit 3D state to act safely and precisely.
World models are a horizontal platform like LLMs, with many downstream use cases.
Once a system can create and edit a 3D scene representation, it can power robotics training, industrial/architectural design workflows, creative world-building, and immersive simulation environments.
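As a rough illustration of the 2D-versus-3D point above (this sketch is not from the conversation), the standard pinhole camera model shows why a robot needs depth to measure anything. Two pixels a fixed number of pixels apart can be very different distances apart in meters, depending on the Z value at each pixel. The intrinsics `FX`, `FY`, `CX`, `CY`, the pixel coordinates, and the depth values are all made-up assumptions chosen for the example.

```python
import math

# Hypothetical camera intrinsics (focal lengths and principal point, in pixels).
FX = FY = 500.0
CX, CY = 320.0, 240.0


def backproject(u, v, depth_m):
    """Pinhole back-projection: pixel (u, v) plus depth in meters
    gives a 3D point (x, y, z) in the camera frame."""
    x = (u - CX) * depth_m / FX
    y = (v - CY) * depth_m / FY
    return (x, y, depth_m)


def metric_distance(p, q):
    """Euclidean distance between two 3D points, in meters."""
    return math.dist(p, q)


# Two image points 100 pixels apart (assumed values).
pixel_a, pixel_b = (320, 240), (420, 240)
pixel_gap = math.dist(pixel_a, pixel_b)

# Without depth, the pixel gap says nothing about real-world distance.
# With an assumed depth per pixel, the 3D gap becomes measurable.
point_a = backproject(*pixel_a, depth_m=2.0)
point_b = backproject(*pixel_b, depth_m=4.0)

print(f"pixel gap: {pixel_gap:.0f} px")
print(f"3D gap:    {metric_distance(point_a, point_b):.2f} m")
```

Changing either assumed depth changes the 3D gap while the pixel gap stays the same, which is the missing-Z-axis problem in miniature.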
WORDS WORTH SAVING
5 quotes
You know what we're missing? ... We're missing a world model.
— Fei-Fei Li
Language is a lossy way to capture, um, the world.
— Fei-Fei Li
Physics happens in 3D, and interaction happens in 3D.
— Fei-Fei Li
With this technology, which we should talk about, it's the combination of generation and reconstruction, suddenly we can actually create infinite universes.
— Fei-Fei Li
But I was just driving in my own neighborhood, and I realized-I don't have a good distance measure between my car and the parked car-
— Fei-Fei Li
High quality AI-generated summary created from speaker-labeled transcript.