“The Future of AI is Here” — Fei-Fei Li Unveils the Next Frontier of AI

Fei-Fei Li and Justin Johnson are pioneers in AI. While the world has only recently witnessed a surge in consumer AI, our guests have long been laying the groundwork for innovations that are transforming industries today. In this episode, a16z General Partner Martin Casado joins Fei-Fei and Justin to explore the journey from early AI winters to the rise of deep learning and the rapid expansion of multimodal AI. From foundational advancements like ImageNet to the cutting-edge realm of spatial intelligence, Fei-Fei and Justin share the breakthroughs that have shaped the AI landscape and reveal what's next for innovation at World Labs. If you're curious about how AI is evolving beyond language models and into a new realm of 3D, generative worlds, this episode is a must-listen.

Timestamps:
00:00 - Spatial Intelligence: A New Frontier
01:38 - Scaling AI: The Impact of ImageNet on Computer Vision
06:56 - The Role of Compute
09:16 - Data as the Key Driver
17:01 - Defining AI’s Ultimate Goal
18:58 - What is Spatial Intelligence? Unlocking 3D Understanding in AI
26:35 - Comparing Models: Spatial Intelligence vs. Language-Based AI
29:41 - 1D vs. 3D
32:39 - Building Immersive Worlds with Spatial Intelligence
35:11 - From Static Scenes to Dynamic Worlds
37:42 - The Future of VR and AR
40:42 - Creating Deep Tech Platforms
44:26 - Building a World-Class Team
45:54 - Measuring Success: Milestones in Spatial Intelligence

Resources:
Learn more about World Labs: https://www.worldlabs.ai
Find Fei-Fei on Twitter: https://x.com/drfeifei
Find Justin on Twitter: https://x.com/jcjohnss
Find Martin on Twitter: https://x.com/martin_casado

Stay Updated:
Let us know what you think: https://ratethispodcast.com/a16z
Find a16z on Twitter: https://twitter.com/a16z
Find a16z on LinkedIn: https://www.linkedin.com/company/a16z
Subscribe on your favorite podcast app: https://a16z.simplecast.com/
Follow our host: https://twitter.com/stephsmithio

Please note that the content here is for informational purposes only; should NOT be taken as legal, business, tax, or investment advice or be used to evaluate any investment or security; and is not directed at any investors or potential investors in any a16z fund. a16z and its affiliates may maintain investments in the companies discussed. For more details please see a16z.com/disclosures.

Fei-Fei Li (guest), Justin Johnson (guest), Martin Casado (host)

Sep 20, 2024, 48m

CHAPTERS

Why spatial intelligence is the next major AI bet
Fei-Fei Li frames visual-spatial intelligence as a fundamental pillar of intelligence—on par with (and in some ways more ancient than) language. She argues the field now has the right mix of compute, data understanding, and algorithms to “unlock” this frontier.
From AI winters to a multimodal “Cambrian explosion”
The conversation zooms out to place today’s consumer AI boom in historical context. Fei-Fei describes the transition from AI winter to deep learning’s rise and today’s rapid expansion from text into pixels, video, and audio.
Backgrounds that shaped the field: deep learning, vision, and scientific “North Stars”
Justin Johnson and Fei-Fei Li share personal origin stories that mirror AI’s evolution. Justin describes discovering deep learning via the “Cat paper” and its compute-plus-data recipe; Fei-Fei describes moving from physics into AI and computational neuroscience, with a focus on big questions.
ImageNet’s legacy: data scale as an overlooked unlock for vision
Fei-Fei recounts the ImageNet bet: pushing computer vision datasets from thousands to internet-scale. She argues that letting data drive models was a critical—underappreciated—driver of generalization and progress.
Compute as the accelerant: why AlexNet was really about GPUs
Justin emphasizes compute as the dominant force multiplier behind breakthroughs. Using AlexNet as an example, he highlights how enormous hardware gains compress what took days in 2012 into minutes today—changing what research and industry can attempt.
From supervised learning to today’s generative era
The group distinguishes the ImageNet-era supervised paradigm from newer approaches that can learn without explicit human labeling. They unpack how constrained ontologies (fixed category lists) gave way to broader, more open-ended generative and representation learning.
A continuum to gen AI: matching, captioning, style transfer, and early text-to-image
Fei-Fei and Justin trace a research arc showing that “generative AI” didn’t appear overnight. Justin’s PhD work moves from aligning images and words, to generating descriptions, to real-time style transfer, and finally to structured text-to-image via scene graphs and GANs.
Why World Labs now: turning a long-standing vision into a company
Fei-Fei explains World Labs as the next North Star after earlier goals (like image “storytelling”) became achievable sooner than expected. With new algorithms (e.g., the NeRF line of work), deeper data sophistication, and ample compute, she sees a rare window to focus the field on spatial intelligence.
Defining spatial intelligence: perceiving, reasoning, and acting in 3D/4D
Justin offers a crisp definition: spatial intelligence is a machine’s ability to understand and operate in 3D space and time—tracking objects, events, and interactions. This applies to both physical reality and generated/virtual worlds.
Spatial intelligence vs. multimodal LLMs: 1D tokens aren’t a native world model
They contrast language-centric architectures with spatial approaches. Multimodal LLMs can “see,” but their underlying representation is still largely 1D token sequences; spatial intelligence puts 3D structure at the center, aiming for better task fit and richer affordances.
Why 3D beats “just pixels”: affordances, interaction, and the leap from scenes to worlds
The discussion differentiates 2D outputs (images/video) from a model that truly supports 3D operations: moving cameras, manipulating objects, and interacting naturally. They outline a hierarchy from objects to scenes to full worlds that extend beyond a single frame and support continuous navigation.
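To make "moving cameras" concrete, here is a minimal sketch of what an explicit 3D representation allows: re-rendering the same scene from a new viewpoint by geometry alone. It is an illustration only, not World Labs' method or anything described in the episode; the `project` function, the focal length, and the two-point scene are all hypothetical.

```python
from typing import List, Tuple

Point3D = Tuple[float, float, float]
Point2D = Tuple[float, float]


def project(points: List[Point3D], camera_pos: Point3D, focal: float = 1.0) -> List[Point2D]:
    """Pinhole projection for an unrotated camera at camera_pos looking down +z."""
    image = []
    for x, y, z in points:
        # Express the point in the camera's coordinate frame.
        cx, cy, cz = x - camera_pos[0], y - camera_pos[1], z - camera_pos[2]
        if cz <= 0:
            continue  # the point is behind the camera, so it is not drawn
        image.append((focal * cx / cz, focal * cy / cz))
    return image


# Hypothetical scene: two points on the same line of sight, at different depths.
scene = [(0.0, 0.0, 2.0), (0.0, 0.0, 4.0)]

view_a = project(scene, camera_pos=(0.0, 0.0, 0.0))
view_b = project(scene, camera_pos=(1.0, 0.0, 0.0))  # camera moved one unit to the right

print(view_a)
print(view_b)
```

From the first camera position both points land on the same image location. After the camera moves, the nearer point shifts farther across the image than the distant one (parallax). A single 2D image does not carry the depth needed to produce that second view, which is the kind of operation the discussion attributes to 3D-native representations.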
Use cases: generative worlds, new media, AR/VR blending, and robotics
They map spatial intelligence to multiple high-impact domains. Generating interactive 3D worlds could transform media creation economics; AR/VR needs real-time 3D understanding to merge digital and physical; robotics requires spatial grounding to connect digital “brains” to physical environments.
A deep-tech platform strategy: building foundational models before chasing devices
Fei-Fei positions World Labs as a platform company supplying models for multiple markets rather than a single application. They acknowledge mass-market XR hardware isn’t fully ready, so the company will likely focus where near-term adoption is feasible while building core capabilities.
Building the team: multidisciplinary excellence across vision, graphics, systems, and data
They stress that spatial intelligence requires multiple specialized disciplines, not a monolithic “AI talent” pool. Founders and hires span 3D vision, generative modeling, computer graphics, systems/infra, and large-scale engineering.
Measuring success: real-world deployment and expanding horizons
Fei-Fei defines milestones as widespread adoption: the point when many businesses and builders use World Labs' models to meet their spatial-intelligence needs. Justin argues the endpoint is effectively unbounded because the world is a complex, evolving 4D system; progress will continually open new possibilities.
