a16z

“The Future of AI is Here” — Fei-Fei Li Unveils the Next Frontier of AI

Fei-Fei Li and Justin Johnson are pioneers in AI. While the world has only recently witnessed a surge in consumer AI, our guests have long been laying the groundwork for innovations that are transforming industries today. In this episode, a16z General Partner Martin Casado joins Fei-Fei and Justin to explore the journey from early AI winters to the rise of deep learning and the rapid expansion of multimodal AI. From foundational advancements like ImageNet to the cutting-edge realm of spatial intelligence, Fei-Fei and Justin share the breakthroughs that have shaped the AI landscape and reveal what's next for innovation at World Labs. If you're curious about how AI is evolving beyond language models and into a new realm of 3D, generative worlds, this episode is a must-listen.

Timestamps:
00:00 - Spatial Intelligence: A New Frontier
01:38 - Scaling AI: The Impact of ImageNet on Computer Vision
06:56 - The Role of Compute
09:16 - Data as the Key Driver
17:01 - Defining AI’s Ultimate Goal
18:58 - What is Spatial Intelligence? Unlocking 3D Understanding in AI
26:35 - Comparing Models: Spatial Intelligence vs. Language-Based AI
29:41 - 1D vs. 3D
32:39 - Building Immersive Worlds with Spatial Intelligence
35:11 - From Static Scenes to Dynamic Worlds
37:42 - The Future of VR and AR
40:42 - Creating Deep Tech Platforms
44:26 - Building a World-Class Team
45:54 - Measuring Success: Milestones in Spatial Intelligence

Resources:
Learn more about World Labs: https://www.worldlabs.ai
Find Fei-Fei on Twitter: https://x.com/drfeifei
Find Justin on Twitter: https://x.com/jcjohnss
Find Martin on Twitter: https://x.com/martin_casado

Stay Updated:
Let us know what you think: https://ratethispodcast.com/a16z
Find a16z on Twitter: https://twitter.com/a16z
Find a16z on LinkedIn: https://www.linkedin.com/company/a16z
Subscribe on your favorite podcast app: https://a16z.simplecast.com/
Follow our host: https://twitter.com/stephsmithio

Please note that the content here is for informational purposes only; should NOT be taken as legal, business, tax, or investment advice or be used to evaluate any investment or security; and is not directed at any investors or potential investors in any a16z fund. a16z and its affiliates may maintain investments in the companies discussed. For more details please see a16z.com/disclosures.

Fei-Fei Li (guest), Justin Johnson (guest), Martin Casado (host)
Sep 20, 2024 · 48m

CHAPTERS

  1. Why spatial intelligence is the next major AI bet

    Fei-Fei Li frames visual-spatial intelligence as a fundamental pillar of intelligence—on par with (and in some ways more ancient than) language. She argues the field now has the right mix of compute, data understanding, and algorithms to “unlock” this frontier.

    • Spatial intelligence is positioned as a core missing piece in today’s AI progress
    • Current moment is enabled by converging ingredients: compute, data, algorithms
    • The goal is deeper 3D/4D understanding and interaction, not just perception
    • Sets up World Labs’ North Star: building spatial intelligence
  2. From AI winters to a multimodal “Cambrian explosion”

    The conversation zooms out to place today’s consumer AI boom in historical context. Fei-Fei describes the transition from AI winter to deep learning’s rise and today’s rapid expansion from text into pixels, video, and audio.

    • AI has moved from winter to modern deep learning-driven spring
    • Industry adoption first deepened around language models
    • Now AI is expanding across modalities (text, images, video, audio)
    • This expansion creates new application and model opportunities
  3. Backgrounds that shaped the field: deep learning, vision, and scientific “North Stars”

    Justin Johnson and Fei-Fei Li share personal origin stories that mirror AI’s evolution. Justin describes discovering deep learning via the “Cat paper” and the compute+data recipe; Fei-Fei describes coming from physics into AI and computational neuroscience with a focus on big questions.

    • Justin: deep learning clicked as a generic recipe (algorithms + compute + data)
    • Fei-Fei: physics training encouraged audacious questions; turned to intelligence
    • AI research advanced even during public “winter” through ML/statistical methods
    • Their trajectories converge around vision as a pathway to general intelligence
  4. ImageNet’s legacy: data scale as an overlooked unlock for vision

    Fei-Fei recounts the ImageNet bet: pushing computer vision datasets from thousands to internet-scale. She argues that letting data drive models was a critical—underappreciated—driver of generalization and progress.

    • Early CV/NLP datasets were limited to thousands/tens of thousands of examples
    • ImageNet aimed for internet-scale categories and labeling
    • Data scale enabled models to generalize far better than prior approaches
    • Internet growth made large-scale dataset construction feasible
  5. Compute as the accelerant: why AlexNet was really about GPUs

    Justin emphasizes compute as the dominant force multiplier behind breakthroughs. Using AlexNet as an example, he highlights how enormous hardware gains compress what took days in 2012 into minutes today—changing what research and industry can attempt.

    • AlexNet (2012) trained for 6 days on two consumer GPUs (GTX 580)
    • Modern GPUs (e.g., GB200) offer thousands of times more compute
    • The same training run would take only minutes on today’s hardware
    • Fei-Fei: core ConvNet ideas existed since the 1980s—GPUs + data made them win
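
    The "days to minutes" claim above can be checked with simple arithmetic. The sketch below is a back-of-the-envelope estimate only: the 6-day figure comes from the summary, while the speedup factors are illustrative assumptions, not measured numbers for any particular GPU.

```python
# Back-of-the-envelope estimate; the speedup factors are illustrative assumptions.
ALEXNET_TRAINING_DAYS = 6          # from the summary: 6 days on two GTX 580 GPUs
MINUTES_PER_DAY = 24 * 60


def training_minutes(speedup: float) -> float:
    """Minutes the original run would take given a hypothetical overall speedup."""
    return ALEXNET_TRAINING_DAYS * MINUTES_PER_DAY / speedup


# Hypothetical speedups "in the thousands", as the summary puts it.
for speedup in (1_000, 2_000, 5_000):
    print(f"{speedup:>5}x faster -> {training_minutes(speedup):.1f} minutes")
```

    Six days is 8,640 minutes, so any overall speedup in the low thousands brings the run down to single-digit minutes.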
  6. From supervised learning to today’s generative era

    The group distinguishes the ImageNet-era supervised paradigm from newer approaches that can learn without explicit human labeling. They unpack how constrained ontologies (fixed category lists) gave way to broader, more open-ended generative and representation learning.

    • Supervised learning required explicit labels and predefined ontologies (e.g., 1,000 ImageNet classes)
    • Modern methods learn from weaker/implicit signals and massive unlabeled corpora
    • “Bitter lesson” framing: design to scale with compute rather than hand-crafted tricks
    • Discussion of implicit human structure in language and web-scale supervision (e.g., CLIP)
  7. A continuum to gen AI: matching, captioning, style transfer, and early text-to-image

    Fei-Fei and Justin trace a research arc showing that “generative AI” didn’t appear overnight. Justin’s PhD work moves from aligning images and words, to generating descriptions, to real-time style transfer, and finally to structured text-to-image via scene graphs and GANs.

    • Phase 1: image–text alignment (retrieval via structured representations like scene graphs)
    • Phase 2: pixels-to-words (captioning) as an early bridge between vision and language
    • 2015 style transfer sparked early ‘gen AI’ excitement; key issue was speed/productionization
    • Early text-to-image required structured inputs (scene graphs) because free-form language wasn’t ready
  8. Why World Labs now: turning a long-standing vision into a company

    Fei-Fei explains World Labs as the next North Star after earlier goals (like image “storytelling”) became achievable sooner than expected. With new algorithms (e.g., NeRF lineage), deeper data sophistication, and ample compute, she sees a rare window to focus the field on spatial intelligence.

    • World Labs’ mission: unlock spatial intelligence as a foundational capability
    • Motivation is both personal (North Star seeking) and technical (readiness)
    • Advances in algorithms and representation learning make the bet timely
    • Founding team includes pioneers tied to NeRF-era breakthroughs
  9. Defining spatial intelligence: perceiving, reasoning, and acting in 3D/4D

    Justin offers a crisp definition: spatial intelligence is a machine’s ability to understand and operate in 3D space and time—tracking objects, events, and interactions. This applies to both physical reality and generated/virtual worlds.

    • Core capability: understand positions and interactions over space-time (3D + time = 4D)
    • Beyond recognition: includes generation, interaction, and action
    • Applies to real-world perception and synthetic world creation
    • Goal is to move AI from data centers into embodied/world-facing contexts
  10. Spatial intelligence vs. multimodal LLMs: 1D tokens aren’t a native world model

    They contrast language-centric architectures with spatial approaches. Multimodal LLMs can ‘see,’ but their underlying representation is still largely 1D token sequences; spatial intelligence puts 3D structure at the center, aiming for better task fit and richer affordances.

    • LLMs represent information primarily as 1D sequences (tokens, context windows)
    • Other modalities often get ‘shoehorned’ into 1D representations
    • Spatial intelligence prioritizes 3D representation as a first-class primitive
    • Fei-Fei: language is generated and lossy; the physical world has independent structure and physics
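
    One way to picture the "shoehorning" point is to flatten a 3D grid into a 1D sequence and see how far apart spatial neighbors end up. The sketch below is a toy illustration under assumed values (a hypothetical 16×16×16 grid in row-major order); it does not describe how any particular model tokenizes its input.

```python
from typing import Sequence


def flat_index(coords: Sequence[int], shape: Sequence[int]) -> int:
    """Row-major position of an N-D coordinate in the flattened 1D sequence."""
    index = 0
    for coord, size in zip(coords, shape):
        index = index * size + coord
    return index


SHAPE = (16, 16, 16)      # hypothetical (z, y, x) grid, chosen for illustration
center = (4, 4, 4)

# Each neighbor is exactly one step away from `center` in 3D space.
neighbors = {"x": (4, 4, 5), "y": (4, 5, 4), "z": (5, 4, 4)}

gaps = {
    axis: flat_index(coords, SHAPE) - flat_index(center, SHAPE)
    for axis, coords in neighbors.items()
}
for axis, gap in gaps.items():
    print(f"1 step along {axis} in 3D = {gap} positions apart in the 1D sequence")
```

    All three neighbors are equally close in space, yet they sit 1, 16, and 256 positions away in the flattened sequence, so the spatial structure has to be relearned rather than being built into the representation.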
  11. Why 3D beats “just pixels”: affordances, interaction, and the leap from scenes to worlds

    The discussion differentiates 2D outputs (images/video) from a model that truly supports 3D operations: moving cameras, manipulating objects, and interacting naturally. They outline a hierarchy from objects to scenes to full worlds that extend beyond a single frame and support continuous navigation.

    • Even if perception is 2D, useful interaction often demands a 3D internal model
    • 3D-centric representations enable camera/object movement more naturally
    • Progression: objects → scenes (compositions) → worlds (continuous, navigable, dynamic)
    • Long-term aim includes physics, semantics, and full interactivity (not just static renderings)
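
    The camera-movement point can be made concrete with a minimal pinhole projection. The sketch below assumes a camera that only translates and always looks down the +z axis; the scene points, focal length, and function name are all hypothetical and chosen for illustration.

```python
from typing import List, Tuple

Point3D = Tuple[float, float, float]
Point2D = Tuple[float, float]


def project(points: List[Point3D], camera: Point3D, focal: float = 1.0) -> List[Point2D]:
    """Pinhole projection for a camera at `camera` looking down +z (no rotation)."""
    cx, cy, cz = camera
    image = []
    for x, y, z in points:
        depth = z - cz
        if depth <= 0:
            continue  # point is behind the camera; skip it
        image.append((focal * (x - cx) / depth, focal * (y - cy) / depth))
    return image


# A tiny hypothetical scene: two far points (depth 4) and one near point (depth 2).
scene = [(0.0, 0.0, 4.0), (1.0, 0.0, 4.0), (0.0, 1.0, 2.0)]

view_a = project(scene, camera=(0.0, 0.0, 0.0))
view_b = project(scene, camera=(1.0, 0.0, 0.0))  # same scene, camera moved right

print(view_a)
print(view_b)
```

    With an explicit 3D scene, a new viewpoint is just another call to `project`, and the near point shifts twice as far as the far points (parallax). A model that only stores 2D pixels has no such operation available and must synthesize the new view from scratch.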
  12. Use cases: generative worlds, new media, AR/VR blending, and robotics

    They map spatial intelligence to multiple high-impact domains. Generating interactive 3D worlds could transform media creation economics; AR/VR needs real-time 3D understanding to merge digital and physical; robotics requires spatial grounding to connect digital “brains” to physical environments.

    • World generation: interactive 3D environments rather than short 2D clips
    • New media economics: reduce cost/time from AAA-game-level production to on-demand creation
    • AR/VR: spatial computing requires spatial intelligence; future hardware could replace many screens
    • Robotics: spatial intelligence bridges perception, planning, and action in the real world
  13. A deep-tech platform strategy: building foundational models before chasing devices

    Fei-Fei positions World Labs as a platform company supplying models for multiple markets rather than a single application. They acknowledge mass-market XR hardware isn’t fully ready, so the company will likely focus where near-term adoption is feasible while building core capabilities.

    • World Labs intends to provide foundational spatial models across domains
    • Near-term product focus may avoid immature device ecosystems
    • Platform ambition: solve fundamental problems once, then generalize across use cases
    • Emphasis on ‘simplicity and generality’ through strong underlying primitives
  14. Building the team: multidisciplinary excellence across vision, graphics, systems, and data

    They stress that spatial intelligence requires multiple specialized disciplines, not a monolithic ‘AI talent’ pool. Founders and hires span 3D vision, generative modeling, computer graphics, systems/infra, and large-scale engineering.

    • Spatial intelligence requires expertise in ML, 3D geometry, graphics, and systems engineering
    • Graphics is a key complement—solving similar problems from the opposite direction
    • Founding team highlights NeRF (Ben Mildenhall) and early Gaussian-splat precursors (Christoph Lassner)
    • Team cohesion is driven by shared conviction that ‘now is the moment’ for spatial intelligence
  15. Measuring success: real-world deployment and expanding horizons

    Fei-Fei defines milestones as widespread adoption—when many businesses and builders use their models to unlock spatial needs. Justin argues the endpoint is effectively unbounded because the world is a complex evolving 4D system; progress will continually open new possibilities.

    • Success metric: models deployed broadly to solve real spatial-intelligence needs
    • Milestones are achievable even if the ultimate ‘North Star’ keeps moving
    • Spatial intelligence is open-ended due to the complexity of the physical universe
    • Core belief: good technology expands the space of what’s possible, creating new unknowns
