
Google DeepMind Lead Researchers on Genie 3 & the Future of World-Building

Genie 3 can generate fully interactive, persistent worlds from just text, in real time. In this episode, Google DeepMind’s Jack Parker-Holder (Research Scientist) and Shlomi Fruchter (Research Director) join Anjney Midha, Marco Mascorro, and Justine Moore of a16z, with host Erik Torenberg, to discuss how they built it, the breakthrough “spatial memory” feature, and the future of AI-powered gaming, robotics, and world models.

They share:

  - How Genie 3 generates interactive environments in real time
  - Why its “spatial memory” feature is such a breakthrough
  - The evolution of generative models and emergent behaviors
  - Instruction following, text adherence, and model comparisons
  - Potential applications in gaming, robotics, simulation, and more
  - What’s next: Genie 4, Genie 5, and the future of world models

This conversation offers a first-hand look at one of the most advanced world models ever created.

Timecodes:

  0:00 Introduction
  0:29 The Evolution of Generative Models
  1:10 Real-Time Interactivity & User Experience
  4:35 Applications and Use Cases
  8:15 The Importance of Spatial Memory
  13:12 Emergent Behaviors & Model Capabilities
  19:45 Instruction Following & Text Adherence
  20:48 Comparing Genie 3 and Other Models
  21:56 The Future of World Models & Modalities
  32:23 Robotics, Simulation, and Real-World Impact
  37:58 Looking Ahead: Genie 4, 5, and Future World Models
  40:41 Are We Living in a Simulation?
Resources:

  - Find Shlomi on X: https://x.com/shlomifruchter
  - Find Jack on X: https://x.com/jparkerholder
  - Find Anjney on X: https://x.com/anjneymidha
  - Find Justine on X: https://x.com/venturetwins
  - Find Marco on X: https://x.com/Mascobot

Stay Updated:

  - Let us know what you think: https://ratethispodcast.com/a16z
  - Find a16z on Twitter: https://twitter.com/a16z
  - Find a16z on LinkedIn: https://www.linkedin.com/company/a16z
  - Subscribe on your favorite podcast app: https://a16z.simplecast.com/
  - Follow our host: https://x.com/eriktorenberg

Please note that the content here is for informational purposes only; should NOT be taken as legal, business, tax, or investment advice or be used to evaluate any investment or security; and is not directed at any investors or potential investors in any a16z fund. a16z and its affiliates may maintain investments in the companies discussed. For more details, please see a16z.com/disclosures.

Guests: Shlomi Fruchter, Jack Parker-Holder
Hosts: Erik Torenberg, Marco Mascorro, Justine Moore, Anjney Midha
Aug 16, 2025 · 42m · Watch on YouTube ↗

CHAPTERS

  1. Why Genie 3 went viral: real-time world generation from a few words

    The conversation opens with the hosts and DeepMind researchers reacting to the public response to Genie 3. They frame what feels “game-changing”: generating a believable world from text and letting people interact with it in real time.

  2. From Genie 1/2, Veo, and GameNGen to Genie 3: combining research threads

    Jack explains how Genie 3 emerged by unifying multiple internal lines of work: earlier Genie world generation, Veo’s high-quality video advances, and GameNGen’s real-time simulation lessons. The team describes ambitious goals and a surprisingly fast timeline to reach them.

  3. Real-time interactivity as a UX breakthrough (and why it feels different than video)

    Shlomi and Jack emphasize that interactivity isn’t just faster video generation—it fundamentally changes the experience. They describe the “walk around” moment, when latency drops low enough for the world to feel embodied and responsive.

  4. Use cases: entertainment, games, training agents, education—enabled by the same primitive

    The group explores applications, but the researchers argue the primary unlock is the core capability: generating a coherent world from language. Jack ties the motivation to reinforcement learning’s need for unlimited, diverse environments.

  5. Spatial memory & persistence: designing for minute-long consistency without explicit 3D reconstruction

    A standout capability is “spatial memory”—objects persist when you look away and return (e.g., painted wall remains painted). The team explains it was a headline goal, still surprising in practice, and they avoided explicit 3D representations to preserve generality.

  6. Emergent behaviors from scale: physics, terrain interaction, and “common sense” dynamics

    They discuss how scaling data and compute yields unexpected improvements that resemble world understanding: doors that open, water and puddles, skiing speed varying with slope, and realistic lighting and storms. These behaviors aren’t hand-engineered; they emerge from the breadth of training data.

  7. Instruction following & text adherence: controllability vs realism (and handling unlikely prompts)

    Text adherence improves dramatically in Genie 3, enabling detailed and even silly world descriptions. The team highlights a key tension: models tend to generate likely, coherent scenes, but users often want improbable or imaginative variations—so prompt-following must override the model’s priors.

  8. Why Genie 3 isn’t just ‘Veo 3 in real time’: differences in goals, features, and product posture

    Shlomi clarifies that Genie and Veo optimize different axes: Genie for navigation/action in a world, Veo for cinematic quality and other capabilities like audio. Genie 3 is positioned as a research preview rather than a product release.

  9. World models vs video models: modalities, speed, control, and why convergence isn’t guaranteed

    They zoom out to taxonomy: “modality” is only one axis; speed and control are orthogonal dimensions with real engineering trade-offs. Jack notes most users will specialize—filmmaking needs differ from agent training—so divergence may persist even if the tech shares roots.

  10. How much to optimize for downstream use cases vs pushing frontier capabilities

    Justine asks how use cases influence training decisions. Shlomi says they keep some applications in mind, but the main driver is pushing a technical vector: quality, speed, real-time, and control—then letting applications emerge through access and experimentation.

  11. Looking ahead: Genie 4/5, richer simulation, ‘stepping into’ worlds, and therapeutic training scenarios

    Asked about future directions, the researchers stay general: build more capable models, gather feedback, and expand realism and interactivity. Shlomi imagines experiential simulations for training and therapy (stage fright, phobias), while Jack frames world models as a path toward embodied AGI.

  12. Robotics and composability with agents (e.g., SIMA): bridging data bottlenecks and sim-to-real gaps

    Jack explains Genie 3 as an environment model, not an agent—meant to generate experiences for agents to learn from. They argue robotics is constrained by data, safety, and the sim-to-real gap; generative world models could combine real-world realism with simulation scalability, though non-visual physics/control remain gaps.

  13. Access and trajectory: when developers might get it, and whether progress will plateau

    They indicate a desire to broaden access but offer no timeline. On progress curves, Jack argues today’s capability is already compelling, yet far from the richness of real life—suggesting more breakthroughs and “new steps” (like in LLMs) are still ahead.

  14. Closing philosophy: are we living in a simulation?

    The episode ends with a playful but thoughtful detour into simulation hypothesis. Shlomi speculates that if reality is simulated, it likely isn’t on today’s digital hardware, hinting at continuity/analog properties and possible quantum constraints.
