a16z: “The Future of AI is Here” - Fei-Fei Li Unveils the Next Frontier of AI
CHAPTERS
Why spatial intelligence is the next major AI bet
Fei-Fei Li frames visual-spatial intelligence as a fundamental pillar of intelligence—on par with (and in some ways more ancient than) language. She argues the field now has the right mix of compute, data understanding, and algorithms to “unlock” this frontier.
- •Spatial intelligence is positioned as a core missing piece in today’s AI progress
- •The current moment is enabled by three converging ingredients: compute, data, and algorithms
- •The goal is deeper 3D/4D understanding and interaction, not just perception
- •Sets up World Labs’ North Star: building spatial intelligence
From AI winters to a multimodal “Cambrian explosion”
The conversation zooms out to place today’s consumer AI boom in historical context. Fei-Fei describes the transition from AI winter to deep learning’s rise and today’s rapid expansion from text into pixels, video, and audio.
- •AI has moved from its winter into a modern, deep-learning-driven spring
- •Industry adoption first deepened around language models
- •Now AI is expanding across modalities (text, images, video, audio)
- •This expansion creates new application and model opportunities
Backgrounds that shaped the field: deep learning, vision, and scientific “North Stars”
Justin Johnson and Fei-Fei Li share personal origin stories that mirror AI’s evolution. Justin describes discovering deep learning via the “Cat paper” and the compute+data recipe; Fei-Fei describes coming from physics into AI and computational neuroscience with a focus on big questions.
- •Justin: deep learning clicked as a generic recipe (algorithms + compute + data)
- •Fei-Fei: physics training encouraged audacious questions; turned to intelligence
- •AI research advanced even during public “winter” through ML/statistical methods
- •Their trajectories converge around vision as a pathway to general intelligence
ImageNet’s legacy: data scale as an overlooked unlock for vision
Fei-Fei recounts the ImageNet bet: pushing computer vision datasets from thousands to internet-scale. She argues that letting data drive models was a critical—underappreciated—driver of generalization and progress.
- •Early CV/NLP datasets were limited to thousands/tens of thousands of examples
- •ImageNet aimed for internet-scale categories and labeling
- •Data scale enabled models to generalize far better than prior approaches
- •Internet growth made large-scale dataset construction feasible
Compute as the accelerant: why AlexNet was really about GPUs
Justin emphasizes compute as the dominant force multiplier behind breakthroughs. Using AlexNet as an example, he highlights how enormous hardware gains compress what took days in 2012 into minutes today—changing what research and industry can attempt.
- •AlexNet (2012) trained for about 6 days on 2 consumer GPUs (GTX 580)
- •Modern GPUs (e.g., GB200) offer thousands of times more compute than those cards
- •The same training run would take on the order of minutes on today’s hardware
- •Fei-Fei: core ConvNet ideas existed since the 1980s—GPUs + data made them win
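The days-to-minutes claim above can be checked with back-of-envelope arithmetic. The sketch below is only an illustration: the 6 days and 2 GPUs come from the summary, while the speedup factors are hypothetical stand-ins for "thousands" and the function name `training_minutes` is invented for this example. It also assumes perfect scaling, which real training runs rarely achieve.

```python
# Back-of-envelope estimate: how long would AlexNet's 2012 training run take
# if each device were N times faster? The speedup values are assumptions.

ALEXNET_DAYS = 6   # training time cited in the summary above
ALEXNET_GPUS = 2   # two GTX 580 cards


def training_minutes(speedup: float, gpus: int = 1) -> float:
    """Minutes to repeat the 2012 run on `gpus` devices, each `speedup`x a GTX 580.

    Assumes perfect scaling with compute, which is an idealization.
    """
    gpu_minutes_2012 = ALEXNET_DAYS * ALEXNET_GPUS * 24 * 60  # 17,280 GPU-minutes
    return gpu_minutes_2012 / (speedup * gpus)


if __name__ == "__main__":
    for speedup in (1_000, 2_000, 5_000):  # hypothetical "thousands" range
        print(f"{speedup}x faster: {training_minutes(speedup):.1f} minutes")
```

Under these assumptions, a single device in the 1,000x to 5,000x range brings the run down to somewhere between a few minutes and a quarter of an hour, consistent with the "minutes" framing.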
From supervised learning to today’s generative era
The group distinguishes the ImageNet-era supervised paradigm from newer approaches that can learn without explicit human labeling. They unpack how constrained ontologies (fixed category lists) gave way to broader, more open-ended generative and representation learning.
- •Supervised learning required explicit labels and predefined ontologies (e.g., 1,000 ImageNet classes)
- •Modern methods learn from weaker/implicit signals and massive unlabeled corpora
- •“Bitter lesson” framing: design to scale with compute rather than hand-crafted tricks
- •Discussion of implicit human structure in language and web-scale supervision (e.g., CLIP)
A continuum to gen AI: matching, captioning, style transfer, and early text-to-image
Fei-Fei and Justin trace a research arc showing that “generative AI” didn’t appear overnight. Justin’s PhD work moves from aligning images and words, to generating descriptions, to real-time style transfer, and finally to structured text-to-image via scene graphs and GANs.
- •Phase 1: image–text alignment (retrieval via structured representations like scene graphs)
- •Phase 2: pixels-to-words (captioning) as an early bridge between vision and language
- •2015 style transfer sparked early ‘gen AI’ excitement; key issue was speed/productionization
- •Early text-to-image required structured inputs (scene graphs) because free-form language wasn’t ready
Why World Labs now: turning a long-standing vision into a company
Fei-Fei explains World Labs as the next North Star after earlier goals (like image “storytelling”) became achievable sooner than expected. With new algorithms (e.g., NeRF lineage), deeper data sophistication, and ample compute, she sees a rare window to focus the field on spatial intelligence.
- •World Labs’ mission: unlock spatial intelligence as a foundational capability
- •Motivation is both personal (North Star seeking) and technical (readiness)
- •Advances in algorithms and representation learning make the bet timely
- •The founding team includes pioneers tied to NeRF-era breakthroughs
Defining spatial intelligence: perceiving, reasoning, and acting in 3D/4D
Justin offers a crisp definition: spatial intelligence is a machine’s ability to understand and operate in 3D space and time—tracking objects, events, and interactions. This applies to both physical reality and generated/virtual worlds.
- •Core capability: understand positions and interactions over space-time (3D + time = 4D)
- •Beyond recognition: includes generation, interaction, and action
- •Applies to real-world perception and synthetic world creation
- •Goal is to move AI from data centers into embodied/world-facing contexts
Spatial intelligence vs. multimodal LLMs: 1D tokens aren’t a native world model
They contrast language-centric architectures with spatial approaches. Multimodal LLMs can ‘see,’ but their underlying representation is still largely 1D token sequences; spatial intelligence puts 3D structure at the center, aiming for better task fit and richer affordances.
- •LLMs represent information primarily as 1D sequences (tokens, context windows)
- •Other modalities often get ‘shoehorned’ into 1D representations
- •Spatial intelligence prioritizes 3D representation as a first-class primitive
- •Fei-Fei: language is a generated, lossy signal, whereas the physical world has its own independent structure and physics
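One way to see why squeezing spatial data into a 1D sequence is awkward is to flatten a small 3D grid and check where spatial neighbors land. This is a toy illustration only, not a description of how any particular multimodal model tokenizes its input. The grid size, the row-major ordering, and the helper names `flat_index` and `sequence_gap` are all assumptions made for the example.

```python
# Toy illustration: flattening a 3D grid into a 1D sequence separates
# points that are neighbors in space. Grid size and ordering are assumptions.

SIZE = 16  # hypothetical 16 x 16 x 16 voxel grid


def flat_index(x: int, y: int, z: int, size: int = SIZE) -> int:
    """Row-major position of voxel (x, y, z) in the flattened sequence."""
    return (z * size + y) * size + x


def sequence_gap(a: tuple, b: tuple, size: int = SIZE) -> int:
    """How far apart two voxels end up once the grid is a 1D sequence."""
    return abs(flat_index(*a, size) - flat_index(*b, size))


if __name__ == "__main__":
    origin = (0, 0, 0)
    # Each neighbor below is exactly one step away from the origin in 3D.
    for axis, neighbor in (("x", (1, 0, 0)), ("y", (0, 1, 0)), ("z", (0, 0, 1))):
        gap = sequence_gap(origin, neighbor)
        print(f"neighbor along {axis}: {gap} positions apart in the sequence")
```

All three neighbors are equally close in space, yet in the flattened sequence they sit 1, 16, and 256 positions from the origin. A model working on the sequence has to relearn that adjacency, whereas a 3D-native representation keeps it explicit.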
Why 3D beats “just pixels”: affordances, interaction, and the leap from scenes to worlds
The discussion differentiates 2D outputs (images/video) from a model that truly supports 3D operations: moving cameras, manipulating objects, and interacting naturally. They outline a hierarchy from objects to scenes to full worlds that extend beyond a single frame and support continuous navigation.
- •Even if perception is 2D, useful interaction often demands a 3D internal model
- •3D-centric representations enable camera/object movement more naturally
- •Progression: objects → scenes (compositions) → worlds (continuous, navigable, dynamic)
- •Long-term aim includes physics, semantics, and full interactivity (not just static renderings)
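The point about camera movement can be made concrete with a toy pinhole camera. When a scene is stored as 3D points, a new viewpoint is just a re-projection. A single 2D image, by contrast, cannot distinguish two points that happen to line up along the viewing ray. The sketch below is a minimal illustration, not World Labs' method; the focal length, the scene points, and the function name `project` are all made up for the example.

```python
import math

# Toy pinhole camera looking down the +z axis. The focal length and the
# scene points are hypothetical values chosen for illustration.

FOCAL = 1.0  # arbitrary units


def project(point, cam_x=0.0, yaw=0.0, focal=FOCAL):
    """Project a 3D point into a camera placed at (cam_x, 0, 0).

    `yaw` rotates the camera about the vertical axis (radians).
    Returns image coordinates (u, v), or None if the point is behind the camera.
    """
    x, y, z = point
    x -= cam_x  # translate the world so the camera sits at the origin
    # Express the point in the camera frame (right and forward axes).
    cx = math.cos(yaw) * x - math.sin(yaw) * z
    cz = math.sin(yaw) * x + math.cos(yaw) * z
    if cz <= 0:
        return None
    return (focal * cx / cz, focal * y / cz)


if __name__ == "__main__":
    near, far = (0, 0, 2), (0, 0, 10)
    # From the original viewpoint both points land on the same pixel.
    print("camera at x=0:", project(near), project(far))
    # Moving the camera one unit sideways separates them (parallax).
    print("camera at x=1:", project(near, cam_x=1.0), project(far, cam_x=1.0))
```

From the first viewpoint the near and far points are indistinguishable in the image. After the camera moves, the nearer point shifts five times as far as the distant one, which is only computable because depth is stored in the representation.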
Use cases: generative worlds, new media, AR/VR blending, and robotics
They map spatial intelligence to multiple high-impact domains. Generating interactive 3D worlds could transform media creation economics; AR/VR needs real-time 3D understanding to merge digital and physical; robotics requires spatial grounding to connect digital “brains” to physical environments.
- •World generation: interactive 3D environments rather than short 2D clips
- •New media economics: reduce cost/time from AAA-game-level production to on-demand creation
- •AR/VR: spatial computing requires spatial intelligence; future hardware could replace many screens
- •Robotics: spatial intelligence bridges perception, planning, and action in the real world
A deep-tech platform strategy: building foundational models before chasing devices
Fei-Fei positions World Labs as a platform company supplying models for multiple markets rather than a single application. They acknowledge mass-market XR hardware isn’t fully ready, so the company will likely focus where near-term adoption is feasible while building core capabilities.
- •World Labs intends to provide foundational spatial models across domains
- •Near-term product focus may avoid immature device ecosystems
- •Platform ambition: solve fundamental problems once, then generalize across use cases
- •Emphasis on ‘simplicity and generality’ through strong underlying primitives
Building the team: multidisciplinary excellence across vision, graphics, systems, and data
They stress that spatial intelligence requires multiple specialized disciplines, not a monolithic ‘AI talent’ pool. Founders and hires span 3D vision, generative modeling, computer graphics, systems/infra, and large-scale engineering.
- •Spatial intelligence requires expertise in ML, 3D geometry, graphics, and systems engineering
- •Graphics is a key complement—solving similar problems from the opposite direction
- •Founding team highlights NeRF (Ben Mildenhall) and early Gaussian-splat precursors (Christoph Lassner)
- •Team cohesion is driven by shared conviction that ‘now is the moment’ for spatial intelligence
Measuring success: real-world deployment and expanding horizons
Fei-Fei defines success as widespread adoption: the point at which many businesses and builders use World Labs' models to meet their spatial needs. Justin argues the endpoint is effectively unbounded because the world is a complex, evolving 4D system; progress will continually open new possibilities.
- •Success metric: models deployed broadly to solve real spatial-intelligence needs
- •Milestones are achievable even if the ultimate ‘North Star’ keeps moving
- •Spatial intelligence is open-ended due to the complexity of the physical universe
- •Core belief: good technology expands the space of what’s possible, creating new unknowns