a16z

How Fei-Fei Li Is Rebuilding AI for the Real World

What if the next leap in artificial intelligence isn’t about better language, but about a better understanding of space? In this episode, a16z General Partner Erik Torenberg moderates a conversation with Fei-Fei Li, cofounder and CEO of World Labs, and Martin Casado, a16z General Partner and early investor in the company. Together, they explore the concept of world models: AI systems that understand and reason about the physical, 3D world, not just text. Fei-Fei, often called the “godmother of AI,” explains why spatial intelligence is a critical (and missing) component of today's AI systems, and why her new company is going all-in on solving this challenge. Martin shares the story of how he and Fei-Fei aligned on this vision long before it was trendy, and why it may define the future of robotics, creativity, and computation itself. From the limitations of LLMs to the promise of embodied AI, from personal anecdotes to deep technical insights, this is a discussion of what it truly means to build intelligence for the real (and virtual) world.
Timecodes:
00:00 Spatial Intelligence
00:39 Fei-Fei Li’s Background
01:17 Building a World Model
05:14 Reflecting on AI's Evolution
08:07 The Importance of 3D Understanding
10:20 Unrolling Evolution: Why 3D Intelligence Is Harder Than Language
12:19 From Single Reality to Infinite Virtual Universes
16:52 3D vs 2D: Why 2D Isn’t Enough for Machines
17:57 Fei-Fei’s Personal Story of Losing Stereo Vision
19:24 Research and Development at World Labs

Resources:
Find Fei-Fei on X: https://x.com/drfeifei
Find Martin on X: https://x.com/martin_casado
Learn more about World Labs: https://www.worldlabs.ai/

Stay Updated:
Let us know what you think: https://ratethispodcast.com/a16z
Find a16z on Twitter: https://twitter.com/a16z
Find a16z on LinkedIn: https://www.linkedin.com/company/a16z
Subscribe on your favorite podcast app: https://a16z.simplecast.com/
Follow our host: https://x.com/eriktorenberg

Please note that the content here is for informational purposes only; should NOT be taken as legal, business, tax, or investment advice or be used to evaluate any investment or security; and is not directed at any investors or potential investors in any a16z fund. a16z and its affiliates may maintain investments in the companies discussed. For more details please see a16z.com/disclosures.

Fei-Fei Li (guest), Erik Torenberg (host), Martin Casado (guest)
Jun 3, 2025 · 22m

At a glance

WHAT IT’S REALLY ABOUT

Fei-Fei Li’s push for 3D “world models” beyond language

  1. Fei-Fei Li argues that language is a powerful but lossy, non-natural encoding that cannot capture the full structure of the 3D physical world where animals and humans act.
  2. The conversation frames “world models” as AI systems that reconstruct and generate complete 3D representations from limited views, enabling measurement, manipulation, and interaction in space.
  3. They explain why spatial intelligence is evolutionarily older and harder than language—highlighted by the slow progress and massive investment required in autonomy and robotics.
  4. World models are positioned as a horizontal platform technology, unlocking applications from robotics and embodied agents to design, architecture, games, and “infinite virtual universes.”
  5. Fei-Fei outlines World Labs’ approach: concentrate top talent across computer vision, graphics, diffusion/generative modeling, optimization, and data, building on advances like NeRFs and Gaussian splats.

IDEAS WORTH REMEMBERING

5 ideas

LLMs don’t equal full intelligence because the world isn’t made of words.

Fei-Fei emphasizes that language is a human-made, generative abstraction; the physical environment is inherently perceptual and 3D, and much human reasoning and action depends on non-linguistic spatial understanding.

A “world model” means reconstructing and reasoning over 3D structure, not just recognizing images.

The goal is to infer full 3D geometry and compositional structure (including occluded parts like “the back of the table”) so a machine can compute distance, manipulate objects, and plan actions.

3D is harder than language because it’s tied to action, physics, and evolution’s oldest capabilities.

They argue spatial cognition predates language by hundreds of millions of years, and modern evidence (e.g., decades-long autonomous vehicle progress) suggests real-world navigation and interaction are fundamentally difficult.

2D data can be enough for humans, but not for robots—because humans supply the missing Z-axis mentally.

A human can watch 2D video and internally reconstruct depth, but a robot asked to grasp, measure, or avoid collisions needs explicit 3D state to act safely and precisely.
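To make the missing Z-axis concrete, here is a minimal sketch in Python using the standard pinhole camera model. It is not anything described in the episode: the intrinsics (`FX`, `FY`, `CX`, `CY`), the pixel coordinates, and the depth values are hypothetical numbers chosen for illustration. It shows two points that sit only 10 pixels apart in the image yet are roughly 4 metres apart in space once depth is known.

```python
import math

# Hypothetical camera intrinsics (focal lengths and principal point, in pixels).
FX, FY = 500.0, 500.0
CX, CY = 320.0, 240.0


def backproject(u, v, depth):
    """Map pixel (u, v) with a known depth (metres) to a 3D point in the
    camera frame, using the pinhole camera model."""
    x = (u - CX) * depth / FX
    y = (v - CY) * depth / FY
    return (x, y, depth)


# Two pixels that are close together in the 2D image...
pixel_a, depth_a = (320, 240), 2.0  # hypothetical: a near object
pixel_b, depth_b = (330, 240), 6.0  # hypothetical: a far object

pixel_gap = math.dist(pixel_a, pixel_b)  # separation measured in the image

# ...turn out to be far apart once the Z-axis is supplied.
point_a = backproject(*pixel_a, depth_a)
point_b = backproject(*pixel_b, depth_b)
metric_gap = math.dist(point_a, point_b)  # separation measured in 3D space

print(f"image separation: {pixel_gap:.1f} px")
print(f"3D separation:    {metric_gap:.2f} m")
```

Without the depth values, the pixel coordinates alone cannot distinguish this layout from one where both objects sit side by side, which is why a system that has to grasp or avoid things needs the 3D state made explicit.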

World models are a horizontal platform like LLMs, with many downstream use cases.

Once a system can create and edit a 3D scene representation, it can power robotics training, industrial/architectural design workflows, creative world-building, and immersive simulation environments.

WORDS WORTH SAVING

5 quotes

You know what we're missing? ... We're missing a world model.

Fei-Fei Li

Language is a lossy way to capture, um, the world.

Fei-Fei Li

Physics happens in 3D, and interaction happens in 3D.

Fei-Fei Li

With this technology, which we should talk about, it's the combination of generation and reconstruction, suddenly we can actually create infinite universes.

Fei-Fei Li

But I was just driving in my own neighborhood, and I realized-I don't have a good distance measure between my car and the parked car-

Fei-Fei Li

Spatial intelligence as core to general intelligence
World models vs large language models (LLMs)
3D reconstruction and generation from 2D views
Why 2D representations are insufficient for machines
Robotics, autonomy, and embodied AI constraints
Infinite virtual universes and creative tooling
Technical foundations: NeRF, Gaussian splats, early generative vision work

High quality AI-generated summary created from speaker-labeled transcript.
