Lex Fridman Podcast

Yann LeCun: Meta AI, Open Source, Limits of LLMs, AGI & the Future of AI | Lex Fridman Podcast #416

Yann LeCun is the Chief AI Scientist at Meta, professor at NYU, Turing Award winner, and one of the most influential researchers in the history of AI.

Please support this podcast by checking out our sponsors:
- HiddenLayer: https://hiddenlayer.com/lex
- LMNT: https://drinkLMNT.com/lex to get free sample pack
- Shopify: https://shopify.com/lex to get $1 per month trial
- AG1: https://drinkag1.com/lex to get 1 month supply of fish oil

TRANSCRIPT: https://lexfridman.com/yann-lecun-3-transcript

EPISODE LINKS:
Yann's Twitter: https://twitter.com/ylecun
Yann's Facebook: https://facebook.com/yann.lecun
Meta AI: https://ai.meta.com/

PODCAST INFO:
Podcast website: https://lexfridman.com/podcast
Apple Podcasts: https://apple.co/2lwqZIr
Spotify: https://spoti.fi/2nEwCF8
RSS: https://lexfridman.com/feed/podcast/
Full episodes playlist: https://www.youtube.com/playlist?list=PLrAXtmErZgOdP_8GztsuKi9nrraNbKKp4
Clips playlist: https://www.youtube.com/playlist?list=PLrAXtmErZgOeciFP3CBCIEElOJeitOr41

OUTLINE:
0:00 - Introduction
2:18 - Limits of LLMs
13:54 - Bilingualism and thinking
17:46 - Video prediction
25:07 - JEPA (Joint-Embedding Predictive Architecture)
28:15 - JEPA vs LLMs
37:31 - DINO and I-JEPA
38:51 - V-JEPA
44:22 - Hierarchical planning
50:40 - Autoregressive LLMs
1:06:06 - AI hallucination
1:11:30 - Reasoning in AI
1:29:02 - Reinforcement learning
1:34:10 - Woke AI
1:43:48 - Open source
1:47:26 - AI and ideology
1:49:58 - Marc Andreessen
1:57:56 - Llama 3
2:04:20 - AGI
2:08:48 - AI doomers
2:24:38 - Joscha Bach
2:28:51 - Humanoid robots
2:38:00 - Hope for the future

SOCIAL:
- Twitter: https://twitter.com/lexfridman
- LinkedIn: https://www.linkedin.com/in/lexfridman
- Facebook: https://www.facebook.com/lexfridman
- Instagram: https://www.instagram.com/lexfridman
- Medium: https://medium.com/@lexfridman
- Reddit: https://reddit.com/r/lexfridman
- Support on Patreon: https://www.patreon.com/lexfridman

Yann LeCun (guest), Lex Fridman (host)
Mar 7, 2024 · 2h 47m

CHAPTERS

  1. 0:00 – 2:29

    Open-source AI as a check on concentrated power

    Yann argues that proprietary AI creates a dangerous concentration of power over the “information diet” of society. He and Lex connect this to democratic values, free speech, and the need for diversity of AI systems rather than centralized control.

    • Proprietary AI as a bigger risk than most other AI fears
    • Security arguments for locking down AI can backfire socially
    • Information mediation by a few companies threatens democracy
    • Open access as a route to pluralism in AI assistants
  2. 2:29 – 5:57

    Why autoregressive LLMs hit a ceiling: world, memory, reasoning, planning

    LeCun lays out core competencies of intelligence—understanding the physical world, persistent memory, reasoning, and planning—and claims LLMs only approximate these. He emphasizes that LLMs can be useful while still being the wrong endgame architecture for human-level intelligence.

    • Four pillars: world understanding, memory, reasoning, planning
    • LLMs are powerful but lack key competencies in a deep way
    • Text training data is huge yet limited compared to sensory experience
    • Most early learning (human/animal) is non-linguistic
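
    The text-versus-sensory comparison in this chapter is essentially an order-of-magnitude argument, and a few lines of arithmetic make it concrete. The sketch below is illustrative only: every constant (corpus size, bytes per token, waking hours of a four-year-old, optic-nerve bandwidth) is a rough assumption and is not stated in this summary.

```python
# Back-of-the-envelope comparison of text vs. visual data volume.
# All constants are rough, assumed orders of magnitude for illustration only.

TOKENS = 1e13              # assumed size of an LLM training corpus, in tokens
BYTES_PER_TOKEN = 2        # assumed average bytes per token
AWAKE_HOURS = 16_000       # assumed waking hours of a four-year-old child
OPTIC_BYTES_PER_SEC = 2e7  # assumed optic-nerve bandwidth (about 20 MB/s)


def text_bytes(tokens: float = TOKENS, bytes_per_token: float = BYTES_PER_TOKEN) -> float:
    """Total bytes of text seen during training."""
    return tokens * bytes_per_token


def visual_bytes(hours: float = AWAKE_HOURS, rate: float = OPTIC_BYTES_PER_SEC) -> float:
    """Total bytes of visual input received while awake."""
    return hours * 3600 * rate


if __name__ == "__main__":
    t, v = text_bytes(), visual_bytes()
    print(f"text:   {t:.2e} bytes")
    print(f"vision: {v:.2e} bytes")
    print(f"ratio:  {v / t:.1f}x")
```

    Under these assumed figures the visual stream of a young child comes out roughly fifty times larger than the text corpus. The conclusion is only as good as the constants, which can be changed freely.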
  3. 5:57 – 17:58

    Grounding debate: can language alone build a world model?

    Lex pushes the idea that language is compressed wisdom and might encode the world implicitly. LeCun counters that intelligence needs grounding (physical or simulated) and that much of reality’s structure is not expressed explicitly in text, invoking Moravec’s paradox and robotics examples.

    • Intelligence requires grounding in rich environments
    • Language is low-bandwidth compared to perception and action
    • Moravec's paradox: skills that feel easy to humans (perception, movement) are hard for AI, while “hard” ones (exams, chess) are comparatively easy
    • Why LLMs can pass exams but can’t learn driving/dishwashing like kids
  4. 17:58 – 25:01

    Why video prediction by pixel generation fails (and what that reveals)

    LeCun explains the long-running difficulty of predicting future video frames: modeling distributions in high-dimensional continuous spaces is hard, and the future contains too many unpredictable details. Attempts with latent variables, GANs, VAEs, and reconstruction objectives did not yield robust representations.

    • Text prediction works differently than video prediction
    • Predicting distributions over frames/pixels is intractable in practice
    • Uncertainty in the world makes pixel-accurate prediction wasteful
    • Generative reconstruction objectives fail to learn useful representations
  5. 25:01 – 26:17

    JEPA: predicting in representation space instead of generating pixels

    LeCun introduces Joint-Embedding Predictive Architectures (JEPA) as an alternative to reconstruction-based self-supervision. The key idea is to predict abstract representations of the uncorrupted input from a corrupted view, enabling models to ignore unpredictable details and learn higher-level structure.

    • Joint embeddings from clean vs corrupted/translated views
    • Predict representations—not pixels—to avoid modeling irrelevant details
    • Abstraction helps discard noise (e.g., wind-blown leaves)
    • JEPA as a foundational step toward world models
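
    The difference between predicting pixels and predicting representations can be shown with a deliberately tiny toy. In this sketch the "encoder" and "predictor" are hand-fixed, hypothetical functions, not learned networks: the input has one predictable coordinate and one unpredictable one, and the encoder simply drops the unpredictable part. A real JEPA has to learn such an encoder.

```python
# Toy contrast between a pixel-space loss and a representation-space loss.
# encode() and predict() are hand-fixed, hypothetical stand-ins, not learned networks.
import random


def encode(x):
    """Hypothetical encoder: keep the predictable coordinate, drop the noisy one."""
    signal, _noise = x
    return [signal]


def predict(rep):
    """Hypothetical predictor: identity map in representation space."""
    return list(rep)


def sq_error(a, b):
    """Sum of squared differences between two equal-length sequences."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def make_views(signal, rng):
    """Two views of one scene: same signal, independently sampled detail."""
    clean = (signal, rng.uniform(-1.0, 1.0))
    corrupted = (signal, rng.uniform(-1.0, 1.0))
    return clean, corrupted


rng = random.Random(0)
clean, corrupted = make_views(signal=0.7, rng=rng)

# Reconstruction-style loss: penalized for detail that cannot be predicted.
pixel_loss = sq_error(corrupted, clean)
# JEPA-style loss: compare predicted and actual representations only.
jepa_loss = sq_error(predict(encode(corrupted)), encode(clean))
print(pixel_loss, jepa_loss)
```

    The pixel-space loss stays positive no matter how good the model is, because the detail differs between views. The representation-space loss can reach zero once the encoder discards that detail.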
  6. 26:17 – 35:49

    Training JEPAs without collapse: contrastive vs non-contrastive learning

    The conversation turns to the training problem: naively aligning the two embeddings collapses them to a constant representation. LeCun reviews contrastive learning (which relies on negative pairs) and the newer non-contrastive, distillation-style approaches that avoid negatives via architectural and regularization tricks.

    • Collapse problem in joint-embedding training
    • Contrastive learning: positives pulled together, negatives pushed apart
    • Non-contrastive methods: avoid negatives with additional constraints
    • Key methods referenced: BYOL, VICReg, DINO, I-JEPA
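
    Collapse is easy to reproduce numerically: an encoder that outputs the same vector for every input achieves a perfect alignment loss. The sketch below shows that, together with one possible non-contrastive remedy, a hinge on the per-dimension standard deviation in the spirit of VICReg's variance term. The values of `gamma` and `eps` and the example embeddings are assumptions chosen for illustration.

```python
# Collapse in joint-embedding training, and a variance penalty that detects it.
# The hinge-on-standard-deviation term is in the spirit of VICReg; gamma and eps are assumed.
import math


def invariance_loss(za, zb):
    """Mean squared distance between paired embeddings (two lists of vectors)."""
    n = len(za)
    return sum(sum((a - b) ** 2 for a, b in zip(u, v)) for u, v in zip(za, zb)) / n


def variance_penalty(z, gamma=1.0, eps=1e-4):
    """Average over dimensions of max(0, gamma - std), with std taken across the batch."""
    n, d = len(z), len(z[0])
    total = 0.0
    for j in range(d):
        col = [row[j] for row in z]
        mean = sum(col) / n
        var = sum((c - mean) ** 2 for c in col) / (n - 1)
        total += max(0.0, gamma - math.sqrt(var + eps))
    return total / d


# A collapsed encoder maps every input to the same vector.
collapsed_a = [[0.5, 0.5]] * 4
collapsed_b = [[0.5, 0.5]] * 4
# A healthier encoder spreads embeddings out across the batch.
spread = [[-2.0, 1.5], [1.0, -1.5], [2.0, 0.5], [-1.0, -0.5]]

print(invariance_loss(collapsed_a, collapsed_b))  # alignment looks perfect
print(variance_penalty(collapsed_a))              # but the variance term objects
print(variance_penalty(spread))                   # no penalty when embeddings vary
```

    Contrastive methods reach a similar effect by pushing explicitly chosen negative pairs apart. The variance term needs no negatives, which is the distinction the chapter draws.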
  7. 35:49 – 38:41

    DINO and I-JEPA: practical self-supervised vision representation learning

    LeCun describes how DINO relies on image-specific augmentations while I-JEPA can work by masking without needing image-aware transformations. The aim is robust, transferable representations that perform well when attached to supervised heads downstream.

    • DINO uses heavy image augmentations; needs image priors
    • I-JEPA masks regions; simpler, less augmentation-dependent
    • Distillation-style two-branch training dynamics
    • Representations validated via improved downstream recognition
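
    The "distillation-style two-branch" dynamic is commonly realized by letting one branch be a slowly moving average of the other, so the target changes more smoothly than the branch being trained. A minimal sketch of such an exponential-moving-average update follows, assuming parameters are a flat list of floats. The function name and the momentum values are illustrative, not taken from the episode.

```python
def ema_update(target, online, momentum=0.99):
    """Return new target parameters: momentum * target + (1 - momentum) * online."""
    return [momentum * t + (1.0 - momentum) * o for t, o in zip(target, online)]


target = [0.0, 0.0]
online = [1.0, -1.0]  # pretend these came from gradient steps on the online branch

# The target branch drifts toward the online branch without receiving gradients itself.
for _ in range(3):
    target = ema_update(target, online, momentum=0.5)
print(target)
```

    A momentum close to 1 makes the target branch change very slowly. The value 0.5 is used here only so the drift is visible in three steps.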
  8. 38:41 – 40:48

    V-JEPA: extending JEPA to video and learning intuitive physics signals

    LeCun presents V-JEPA, which masks spatiotemporal “tubes” across frames and predicts representations of the full video. He highlights early results: strong action recognition features and a nascent ability to detect physically impossible events (object teleportation/disappearance).

    • Masking a temporal tube across ~16 frames
    • First strong self-supervised video representations in this line of work
    • Downstream action classification as a proxy for representation quality
    • Early signs of capturing physical constraints (impossibility detection)
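
    A spatiotemporal "tube" is simply the same spatial region hidden in every frame of a clip. The sketch below builds such a mask with nested lists. The 8×8 grid, the region position, and the helper name `tube_mask` are assumptions for illustration; only the clip length of 16 frames comes from the summary above.

```python
def tube_mask(frames, height, width, top, left, size):
    """Boolean mask [frames][height][width]; True marks hidden positions.

    The same square region is hidden in every frame, forming a tube through time.
    """
    return [
        [
            [top <= r < top + size and left <= c < left + size for c in range(width)]
            for r in range(height)
        ]
        for _ in range(frames)
    ]


mask = tube_mask(frames=16, height=8, width=8, top=2, left=3, size=4)

# Count hidden positions per frame; the count is identical across time.
hidden_per_frame = [sum(cell for row in frame for cell in row) for frame in mask]
print(hidden_per_frame)
```

    Because the hidden region is identical in every frame, a model cannot fill it in by copying from a neighbouring frame and has to rely on the surrounding context instead.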
  9. 40:48 – 44:20

    From world models to planning: actions, prediction, and model-predictive control

    LeCun connects learned world models to planning by conditioning predictions on actions (e.g., steering angle) and using predicted future states to optimize objectives. He frames this as model-predictive control—planning at inference time using a learned dynamics model.

    • World model: state at T + action → predicted state at T+Δ
    • Action-conditioned prediction enables counterfactual simulation
    • Planning as inference-time optimization, not additional training
    • Model-predictive control as the classical template
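
    The planning loop described here can be sketched with a toy one-dimensional world. In this sketch `world_model`, `cost`, and the three-action set are hand-written, hypothetical stand-ins for learned components, and exhaustive search over short action sequences replaces whatever optimizer a real system would use.

```python
# Toy model-predictive control with a hand-written 1-D "world model".
# world_model, cost, and ACTIONS are hypothetical stand-ins for learned components.
from itertools import product

ACTIONS = (-1.0, 0.0, 1.0)


def world_model(state, action):
    """Predicted next state given the current state and an action."""
    return state + action


def cost(state, goal):
    """Squared distance to the goal; lower is better."""
    return (state - goal) ** 2


def plan(state, goal, horizon=3):
    """Search all action sequences; return the one with the lowest predicted total cost."""
    best_seq, best_cost = None, float("inf")
    for seq in product(ACTIONS, repeat=horizon):
        s, total = state, 0.0
        for a in seq:
            s = world_model(s, a)
            total += cost(s, goal)
        if total < best_cost:
            best_seq, best_cost = seq, total
    return best_seq


def run_mpc(state, goal, steps=6):
    """Replan at every step and execute only the first planned action."""
    trajectory = [state]
    for _ in range(steps):
        action = plan(state, goal)[0]
        state = world_model(state, action)  # here the "real" world equals the model
        trajectory.append(state)
    return trajectory


print(run_mpc(state=0.0, goal=4.0))
```

    Nothing is trained inside the loop. All the work happens at inference time by rolling the model forward, which is the sense in which planning is optimization rather than additional learning.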
  10. 44:20 – 1:06:07

    Hierarchical planning: why it’s necessary and still largely unsolved

    Using a New York-to-Paris example, LeCun argues that complex behavior needs multi-level goal decomposition and re-planning under uncertainty. He claims AI still lacks methods for learning the hierarchical representations that planning requires, so the abstraction levels end up being hand-designed.

    • Hierarchical decomposition from abstract goals to motor control
    • Planning must be adaptive because conditions are unknown
    • Two-level hierarchies exist but are hand-designed
    • Learning hierarchical planning representations remains an open problem
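
    What a hand-designed hierarchy looks like can be shown in a few lines. The sketch below expands an abstract goal into progressively more concrete steps using a lookup table. The `SUBGOALS` table and its entries are hypothetical and written by hand, which is exactly the limitation the chapter points to: nothing here is learned.

```python
# Hand-designed goal decomposition. The SUBGOALS table is hypothetical and written by
# hand, which illustrates the limitation: no level of this hierarchy is learned.

SUBGOALS = {
    "go from New York to Paris": ["get to the airport", "catch a flight to Paris"],
    "get to the airport": ["go down to the street", "hail a taxi"],
    "go down to the street": ["stand up", "walk to the elevator", "exit the building"],
}


def expand(goal, depth=0):
    """Depth-first expansion of a goal into (depth, step) pairs."""
    steps = [(depth, goal)]
    for sub in SUBGOALS.get(goal, []):
        steps.extend(expand(sub, depth + 1))
    return steps


def leaves(goal):
    """Primitive steps only, in execution order."""
    subs = SUBGOALS.get(goal)
    if not subs:
        return [goal]
    return [leaf for sub in subs for leaf in leaves(sub)]


for depth, step in expand("go from New York to Paris"):
    print("  " * depth + step)
```

    Each level only needs to plan in terms of the level directly below it. The open problem is learning the table, and the right levels of abstraction, from experience.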
  11. 1:06:07 – 1:15:48

    Hallucinations and shallow “reasoning” in LLMs: long tails and fixed compute

    LeCun explains hallucinations as compounding sampling errors in autoregressive generation and highlights the huge prompt space outside fine-tuning coverage. He argues LLM “reasoning” is primitive because computation per token is essentially fixed, unlike human deliberation that scales effort with difficulty.

    • Autoregressive sampling can drift; errors accumulate with length
    • Fine-tuning covers frequent queries but fails on the long tail
    • Jailbreaks illustrate brittleness to out-of-distribution prompts
    • LLMs don’t allocate more compute to harder problems by default
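
    The claim that errors accumulate with length can be illustrated with a simple probability model. Assume, as a strong simplification, that each generated token independently takes the answer off track with probability `per_token_error`. The chance of still being on track after n tokens is then (1 - per_token_error)^n. Both the independence assumption and the error rate used below are hypothetical.

```python
def p_on_track(per_token_error: float, length: int) -> float:
    """Probability that no token has derailed after `length` tokens, assuming independence."""
    return (1.0 - per_token_error) ** length


# Even a small per-token error rate decays exponentially with sequence length.
for n in (10, 100, 1000):
    print(n, round(p_on_track(0.01, n), 4))
```

    Real token errors are neither independent nor always unrecoverable, so this is a cartoon of the argument rather than a measurement.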
  12. 1:15:48 – 1:29:17

    A different blueprint: energy-based models and optimization in latent concept space

    LeCun proposes objective-driven systems that evaluate compatibility between prompt and candidate answer via a scalar “energy” and then optimize latent variables using gradient-based inference. The optimized latent ‘thought’ is then decoded into text, enabling deliberation, controllability, and language-independence.

    • Energy-based model: score compatibility of X (prompt) and Y (answer)
    • Inference via gradient descent in continuous representation space
    • Continuous optimization vs discrete hypothesis sampling
    • Training via contrastive or non-contrastive mechanisms; RLHF parallels reward modeling
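
    Inference by energy minimization can be sketched with a single scalar latent. In the toy below the energy is a hand-picked quadratic and `target_for` is a hypothetical stand-in for whatever makes an answer compatible with a prompt. A real system would use a learned energy over high-dimensional representations and then decode the optimized latent into text.

```python
# Inference as energy minimization over a continuous latent z.
# The quadratic energy and target_for() are hypothetical toy choices.

def target_for(x):
    """Hypothetical stand-in for what makes an answer compatible with prompt x."""
    return 2.0 * x + 1.0


def energy(x, z):
    """Scalar compatibility score; low energy means z is a good answer for x."""
    return (z - target_for(x)) ** 2


def grad_energy_z(x, z):
    """Analytic derivative of the energy with respect to z."""
    return 2.0 * (z - target_for(x))


def infer(x, z0=0.0, lr=0.1, steps=100):
    """Gradient descent on z with x held fixed; no model weights are changed."""
    z = z0
    for _ in range(steps):
        z -= lr * grad_energy_z(x, z)
    return z


z_star = infer(x=3.0)
print(z_star, energy(3.0, z_star))
```

    The `steps` argument is a direct knob for how much computation is spent on one answer, something a fixed forward pass per token does not offer.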
  13. 1:29:17 – 1:34:10

    RL skepticism and where RLHF fits: use RL sparingly after world models

    LeCun argues RL is sample-inefficient and should be minimized, with most learning coming from observation-based world modeling. He reframes RLHF’s success as primarily ‘human feedback’ (often supervised), and suggests such evaluators could be used more efficiently for planning rather than only fine-tuning.

    • RL is inefficient; prioritize self-supervised world models first
    • Use planning when possible; RL mainly for adaptation and exploration
    • Curiosity/play as targeted exploration to fix model inaccuracies
    • RLHF’s impact is mostly HF; reward models resemble objective functions
  14. 1:34:10 – 1:44:32

    Woke AI, bias, and censorship: why diversity requires open-source foundations

    Discussing the Gemini controversies, LeCun argues that a perfectly “unbiased” AI is impossible because bias is observer-dependent. The practical societal solution is a diversity of assistants, made feasible by open-source base models that groups can fine-tune for culture, language, values, and domain needs.

    • Bias cannot be eliminated universally; trade-offs are inevitable
    • Big companies over-constrain outputs to avoid offending audiences
    • Open source enables localized, culturally aligned fine-tuning
    • Examples: India’s many languages; medical access via local-language models in Africa
  15. 1:44:32 – 2:04:21

    How Meta can open-source LLaMA and still build a business

    LeCun explains Meta’s business model: services monetized via ads or business tools, leveraging a large user/customer base. Open-sourcing base models accelerates innovation through external contributions and creates an ecosystem Meta can integrate or acquire from.

    • Revenue comes from services and distribution, not model secrecy alone
    • Open source accelerates iteration via community improvements
    • Ecosystem of vertical fine-tunes benefits Meta’s customers too
    • LLaMA roadmap: bigger, better, multimodal; longer-term planning/world-model capabilities
  16. 2:04:21 – 2:28:43

    AGI/AMI timeline, doomers, and safety as iterative engineering

    LeCun argues “AGI” won’t be a single event but gradual progress across world models, memory, reasoning, and planning. He critiques doomer assumptions (sudden takeoff, inevitable dominance drive) and advocates guardrails embedded in objective-driven systems, comparing AI safety progress to turbojet reliability engineering.

    • Human-level AI is a multi-component integration problem; likely decades
    • Doomer fallacies: sudden emergence, inevitability of power-seeking
    • Guardrails as part of objective-driven optimization, not bolt-ons
    • Safety improves iteratively, like aviation engineering—not via a single proof
  17. 2:28:43 – 2:37:57

    Robots and embodiment: why home/factory autonomy still needs better world models

    The discussion shifts to humanoids and domestic robotics, with LeCun predicting major progress over the next decade but not immediately. He emphasizes that true household competence requires learned world models and planning beyond today’s specialized robotics approaches.

    • Robotics has waited decades for robust perception + planning breakthroughs
    • Specialized systems work; general domestic robots remain out of reach
    • Embodied AI demos exist (navigate, fetch), but generalization is hard
    • World models learned from video/action are key to next robotics wave
  18. 2:37:57 – 2:47:16

    Hopeful future: AI assistants as a printing-press-level upgrade for humanity

    LeCun ends on optimism: AI can amplify human intelligence by giving everyone capable assistants, analogous to how the printing press expanded access to knowledge and enabled societal progress. He acknowledges disruptions and conflicts but argues the net effect will be positive if power is decentralized and people are trusted.

    • AI assistants as scalable intelligence for everyone
    • Printing press analogy: smarter societies despite transitional turmoil
    • Fear of change is recurring; focus on real vs imagined dangers
    • Open-source + trust in people as the path to a healthier AI future
