No Priors

No Priors Ep. 107 | With Physical Intelligence Co-Founder Chelsea Finn

This week on No Priors, Elad speaks with Chelsea Finn, co-founder of Physical Intelligence and currently Associate Professor at Stanford, leading the Intelligence through Learning and Interaction Lab. They dive into how robots learn, the challenges of training AI models for the physical world, and the importance of diverse data in reaching generalizable intelligence. Chelsea explains the evolving landscape of open-source vs. closed-source robotics and where AI models are likely to have the biggest impact first. They also compare the development of robotics to self-driving cars, explore the future of humanoid and non-humanoid robots, and discuss what’s still missing for AI to function effectively in the real world. If you’re curious about the next phase of AI beyond the digital space, this episode is a must-listen.

Sign up for new podcasts every week. Email feedback to show@no-priors.com

Follow us on Twitter: @NoPriorsPod | @Saranormous | @EladGil | @ChelseaFinn

Show Notes:
0:00 Introduction
0:31 Chelsea’s background in robotics
3:10 Physical Intelligence
5:13 Defining their approach and model architecture
7:39 Reaching generalizability and diversifying robot data
9:46 Open source vs. closed source
12:32 Where will PI’s models integrate first?
14:34 Humanoid as a form factor
16:28 Embodied intelligence
17:36 Key turning points in robotics progress
20:05 Hierarchical interactive robot and decision making
22:21 Choosing data inputs
26:25 Self driving vs robotics market
28:37 Advice to robotics founders
29:24 Observational data and data generation
31:57 Future robotic forms

Sarah Guo (host), Chelsea Finn (guest)
Mar 20, 2025 · 35m

CHAPTERS

  1. 0:05 – 3:09

    Chelsea Finn’s robotics journey: from Berkeley PhD to Stanford and Google Brain

    Sarah Guo opens by introducing Chelsea Finn and her research focus on general-purpose skills learned through interaction. Chelsea recounts her path into robotics and early work on end-to-end neural network control from pixels to torques, and why generalization quickly became the central challenge.

    • Motivation: real-world impact + fascination with machine perception and intelligence
    • Early PhD work: neural networks mapping images directly to robot control commands
    • Robotics success on single tasks vs. difficulty generalizing across objects/environments
    • Exploration of learning paradigms: RL, imitation learning, video prediction
    • Transition through Google Brain to building a Stanford lab and advising students
  2. 3:09 – 4:55

    Physical Intelligence’s core bet: one model to control many robots across embodiments

    Chelsea explains Physical Intelligence’s mission: a large neural network that can control any robot for many tasks in many scenarios. She contrasts this with the traditional approach of building a robot for one narrow application, and emphasizes transfer across robot platforms to avoid “throwing away” data after hardware changes.

    • Goal: general-purpose robot foundation model for diverse tasks and scenarios
    • Avoiding the trap of single-application robotics productization
    • Training on data from many robot types (different joints, arms, morphologies)
    • Cross-embodiment transfer enables reusing data even as hardware iterates
    • Long-term orientation over near-term single vertical optimization
  3. 4:55 – 7:37

    Model approach and early results: scaling real robot data + transformers + VLM pretraining

    The discussion turns to architecture and the practical starting point: scaling real-world robot data collection. Chelsea describes teleoperation as the “bread and butter,” outlines their October demo tasks, and explains why transformers and pretrained vision-language models help bring internet knowledge into robotics.

    • Robotics lacks a “Wikipedia of robot motions,” so data must be collected
    • Teleoperating robots in real environments as primary data scaling method
    • October results: laundry folding, table cleaning, box construction
    • Transformers as core architecture; leveraging pretrained vision-language models
    • Pretraining enables concept transfer (e.g., recognizing entities unseen in robot data)
  4. 7:37 – 9:46

    What unlocks generalization: diversity of environments, tasks, and objects (not just scale)

    Chelsea identifies the biggest missing ingredient: more diverse robot data. She distinguishes diversity from raw quantity, describes moving robots into many locations as both a data and operational learning strategy, and notes additional levers like web video and basic reasoning.

    • Top priority: increase diversity (locations, objects, tasks), not only dataset size
    • Prior dataset collected in only a few buildings—far less varied than the internet
    • Operational byproduct: learning what it takes to run robots in many real settings
    • Supplementary sources: web videos, pretrained models, human demonstrations
    • Reasoning needs are often “basic but essential” (constraints, preferences, allergies)
  5. 9:46 – 12:31

    Open vs. closed: why PI chooses openness early

    Sarah asks about open-source strategy, and Chelsea explains PI’s deliberate openness—sharing weights, technical details, and even collaborating with hardware companies. She argues openness accelerates field readiness, attracts top researchers, and that the existential risk is failure to solve the problem, not competition.

    • PI has open-sourced components and shared designs with hardware partners
    • Rationale: the field is early; future models will improve dramatically
    • Ecosystem-building: better robots and better operator expertise prepare the world
    • Talent magnet: strong researchers value publishing and open collaboration
    • Primary worry: robotics is hard and might not work at all; low tolerance for errors
  6. 12:31 – 14:31

    Where models may deploy first: autonomy, human oversight, and designing for mistake tolerance

    The conversation shifts to real-world adoption and why robotics differs from other ML deployments. Chelsea notes many ML outputs are checked by humans, whereas robots often act autonomously, requiring new deployment patterns that tolerate mistakes or enable close human-robot collaboration.

    • PI avoids betting on a single application due to hidden failure modes
    • Robots often must act without a human validating each action output
    • Need for “tolerance for mistakes” or collaborative human-in-the-loop workflows
    • Language interaction is motivated by enabling user guidance and corrections
    • Deployment challenge: autonomy raises the bar for reliability and safety
  7. 14:31 – 16:08

    Humanoids vs. practical manipulators: data collection and teleoperation as the constraint

    Sarah probes humanoid robots as a general form factor. Chelsea calls humanoids “cool but overrated,” arguing that with today’s data bottleneck, ease of teleoperation and cheap scalable platforms matter more than human-like shape.

    • Humanoids fit human environments but are difficult to teleoperate effectively
    • If data is the bottleneck, optimizing for fast, cheap data collection is crucial
    • Preference for simpler/cheaper robots with strong teleop interfaces
    • Embodiment choice should serve iteration speed and dataset breadth
    • Humanoids may be valuable later, but not necessarily the best near-term path
  8. 16:08 – 17:28

    Why embodied intelligence is underrated: motor control is real intelligence

    Chelsea contrasts the AI community’s emphasis on “reasoning” with the complexity of physical control. She argues that mundane household actions require deep intelligence, shaped by evolution, and that embodiment is central rather than secondary to building capable systems.

    • Motor control is complex and often underestimated compared to abstract reasoning
    • Human dexterity reflects extensive evolutionary optimization
    • Seemingly simple tasks (pouring water, making cereal) are hard for robots
    • Embodiment grounds intelligence in interaction and feedback
    • Physical intelligence may be a core path toward broader machine capability
  9. 17:28 – 19:59

    Robotics turning points: SayCan, web knowledge transfer, and cross-robot training

    Sarah asks what research triggered the current surge in robotics startups. Chelsea lists key inflection points where language models enabled planning, web data improved generalization, and aggregated multi-lab robot datasets demonstrated strong cross-embodiment transfer—evidence that scaling laws may apply.

    • SayCan: using LMs for high-level planning paired with low-level control
    • RT-2-style results: web-scale VLM knowledge improves robot concept generalization
    • RT-X: aggregating multi-lab robot data into a shared format to train generalist policies
    • Strong transfer: checkpoints run in other labs outperform locally-tuned models
    • ALOHA/Mobile ALOHA: teleop-driven data collection enabling dexterous manipulation
  10. 19:59 – 22:20

    Hi Robot: hierarchical + interactive language-guided control for long-horizon tasks

    Chelsea explains PI’s “hierarchical interactive robot” system designed for tasks lasting minutes and for mid-task user corrections. The approach splits control into a high-level model that interprets prompts and chooses next steps, and a low-level policy that executes motor commands over short horizons.

    • Long-horizon tasks benefit from explicit hierarchy rather than single-step reactive policies
    • Interactive prompting bridges the gap between simple commands and real user intent
    • High-level model: interprets prompt, plans next subtask (e.g., ‘pick up tomato’)
    • Low-level model: converts subtask into short sequences of motor commands
    • Demos: sandwich-making variants, grocery shopping, and table-cleaning behaviors
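
The two-level split described in this chapter can be pictured with a small Python sketch. It is only an illustration under stated assumptions, not Physical Intelligence's implementation: `run_episode`, `toy_high_level`, `toy_low_level`, and the comma-separated prompt format are all hypothetical names and conventions invented here, and the "models" are plain functions standing in for learned policies.

```python
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical interfaces: the high level maps (prompt, completed subtasks) to
# the next subtask or None when done; the low level maps a subtask to a short
# sequence of motor commands.
HighLevel = Callable[[str, List[str]], Optional[str]]
LowLevel = Callable[[str], List[str]]


def toy_high_level(prompt: str, completed: List[str]) -> Optional[str]:
    """Stand-in planner: treat the prompt as a comma-separated item list."""
    for item in (part.strip() for part in prompt.split(",")):
        subtask = f"pick up {item}"
        if item and subtask not in completed:
            return subtask
    return None  # nothing left to do


def toy_low_level(subtask: str) -> List[str]:
    """Stand-in controller: expand one subtask into a few motor phases."""
    return [f"{subtask}: {phase}" for phase in ("reach", "grasp", "lift")]


def run_episode(
    prompt: str,
    high_level: HighLevel,
    low_level: LowLevel,
    corrections: Optional[Dict[int, str]] = None,
    max_steps: int = 20,
) -> Tuple[List[str], List[str]]:
    """Alternate planning and execution; a correction replaces the prompt mid-task."""
    corrections = corrections or {}
    completed: List[str] = []
    commands: List[str] = []
    for step in range(max_steps):
        prompt = corrections.get(step, prompt)  # user interjects before this step
        subtask = high_level(prompt, completed)
        if subtask is None:
            break
        commands.extend(low_level(subtask))
        completed.append(subtask)
    return completed, commands


# The user drops the tomato after the first subtask has run.
done, cmds = run_episode(
    "bread, tomato, cheese",
    toy_high_level,
    toy_low_level,
    corrections={1: "bread, cheese"},
)
print(done)
```

Re-querying the high level after every subtask is what lets a mid-task correction take effect without restarting the episode; the low level never sees the full prompt, only the current subtask.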
  11. 22:20 – 25:20

    Choosing inputs: vision-first today, tactile ‘skin’ later, and memory before new sensors

    The discussion turns to sensor stacks and whether robots need more than vision. Chelsea says RGB plus wrist and scene cameras already go far; tactile sensors are currently expensive or fragile, and she’d prioritize adding memory/temporal context to policies before adding new modalities like audio or smell.

    • Effective baseline: RGB with external scene cameras + wrist-mounted cameras
    • Tactile sensors lag human skin in robustness, cost, and resolution
    • Wrist cameras often substitute for tactile information in manipulation
    • Multimodal redundancy (audio, smell) can improve robustness but isn’t the bottleneck
    • Near-term priority: add memory/temporal modeling; current policies are frame-based
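
To make the frame-based versus memory distinction concrete, here is a minimal Python sketch of one simple way to give a policy temporal context: a fixed-length window of recent observations. This is an assumption chosen for illustration, not the approach discussed in the episode; `FrameHistory` and its window size are hypothetical.

```python
from collections import deque
from typing import Deque, List


class FrameHistory:
    """Keep the last `window` observations so a policy can condition on them."""

    def __init__(self, window: int) -> None:
        if window < 1:
            raise ValueError("window must be at least 1")
        self._frames: Deque[str] = deque(maxlen=window)

    def push(self, frame: str) -> List[str]:
        """Add the newest frame and return the current context, oldest first."""
        self._frames.append(frame)  # the oldest frame is dropped automatically
        return list(self._frames)


history = FrameHistory(window=3)
context: List[str] = []
for t in range(5):
    context = history.push(f"frame_{t}")
print(context)
```

A frame-based policy corresponds to `window=1`: each decision sees only the latest observation. Larger windows trade extra input size for the ability to react to things that are no longer visible.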
  12. 25:20 – 26:21

    Robotics vs. self-driving: dimensionality, data scarcity, and more viable niche deployments

    Sarah compares robotics to self-driving’s trajectory and consolidation. Chelsea highlights robotics’ harder control precision and higher-dimensional action spaces, but notes robotics can succeed in narrower commercial distributions without solving the entire world, reducing safety and coverage requirements compared to driving.

    • Harder than driving in action dimensionality and precision requirements
    • Less initial data available compared to the driving ecosystem
    • Driving demands broad distributional coverage to be viable (many edge cases)
    • Robotics can win in constrained commercial settings without full generality
    • Self-driving progress (e.g., Waymo presence) is encouraging for embodied AI
  13. 26:21 – 29:14

    Market dynamics and founder advice: move fast, deploy early, learn in the real world

    Chelsea reflects on whether incumbents will dominate robotics as they did in self-driving. She argues the timing is better now due to deep learning advances, startups can move faster than big companies, and the key advice is rapid real-world deployment to learn and iterate quickly.

    • Self-driving consolidation may reflect being ‘too early’ a decade ago
    • Robotics may still be early, but capability trends are more promising now
    • Startups can avoid big-company friction (e.g., taking robots off-campus)
    • Large companies have capital but often move slower due to process constraints
    • Founder advice: deploy quickly, learn fast, and iterate based on real use
  14. 29:14 – 31:56

    Using observational human video and generating robot experience: why embodiment still matters

    Sarah asks about training on human observational data like YouTube. Chelsea sees it as valuable for expanding coverage, but insufficient alone—robots need experience in their own bodies to learn motor control, complemented by autonomous data collection and reinforcement learning-style bootstrapping.

    • Human video can help broaden task concepts and contexts
    • Analogy: watching experts isn’t enough; practice is required for motor skill
    • Robots need embodied experience: actions + proprioception + camera observations
    • Teleop ‘puppeteering’ creates aligned action-observation datasets (ALOHA-style)
    • Autonomous experience and RL bootstrapping may become increasingly important
  15. 31:56 – 35:14

    Future form factors: a ‘Cambrian explosion’ of specialized robots powered by shared intelligence

    The episode closes with speculation about what robots will look like in the future. Chelsea predicts many hardware types optimized for cost and tasks, enabled by general-purpose intelligence layers, while acknowledging manufacturing scale and supply-chain economics might push toward fewer standardized platforms.

    • Prediction: wide variety of robot platforms rather than one universal humanoid
    • Shared intelligence could enable rapid proliferation of specialized hardware
    • Analogy to kitchens: many devices optimized for different functions
    • Trade-off: specialization vs. manufacturing scale and supply-chain efficiency
    • Long-term possibility: robots enabling flexible manufacturing (‘robots all the way down’)
