
Google DeepMind Lead Researchers on Genie 3 & the Future of World-Building

Genie 3 can generate fully interactive, persistent worlds from just text, in real time. In this episode, Google DeepMind’s Jack Parker-Holder (Research Scientist) and Shlomi Fruchter (Research Director) join Anjney Midha, Marco Mascorro, and Justine Moore of a16z, with host Erik Torenberg, to discuss how they built it, the breakthrough “spatial memory” feature, and the future of AI-powered gaming, robotics, and world models.

They share:
- How Genie 3 generates interactive environments in real time
- Why its “spatial memory” feature is such a breakthrough
- The evolution of generative models and emergent behaviors
- Instruction following, text adherence, and model comparisons
- Potential applications in gaming, robotics, simulation, and more
- What’s next: Genie 4, Genie 5, and the future of world models

This conversation offers a first-hand look at one of the most advanced world models ever created.

Timecodes:
0:00 Introduction
0:29 The Evolution of Generative Models
1:10 Real-Time Interactivity & User Experience
4:35 Applications and Use Cases
8:15 The Importance of Spatial Memory
13:12 Emergent Behaviors & Model Capabilities
19:45 Instruction Following & Text Adherence
20:48 Comparing Genie 3 and Other Models
21:56 The Future of World Models & Modalities
32:23 Robotics, Simulation, and Real-World Impact
37:58 Looking Ahead: Genie 4, 5, and Future World Models
40:41 Are We Living in a Simulation?
Resources:
Find Shlomi on X: https://x.com/shlomifruchter
Find Jack on X: https://x.com/jparkerholder
Find Anjney on X: https://x.com/anjneymidha
Find Justine on X: https://x.com/venturetwins
Find Marco on X: https://x.com/Mascobot

Stay Updated:
Let us know what you think: https://ratethispodcast.com/a16z
Find a16z on Twitter: https://twitter.com/a16z
Find a16z on LinkedIn: https://www.linkedin.com/company/a16z
Subscribe on your favorite podcast app: https://a16z.simplecast.com/
Follow our host: https://x.com/eriktorenberg

Please note that the content here is for informational purposes only; should NOT be taken as legal, business, tax, or investment advice or be used to evaluate any investment or security; and is not directed at any investors or potential investors in any a16z fund. a16z and its affiliates may maintain investments in the companies discussed. For more details, please see a16z.com/disclosures.

Guests: Shlomi Fruchter, Jack Parker-Holder
Hosts: Erik Torenberg, Marco Mascorro, Justine Moore, Anjney Midha

Aug 15, 2025 · 42m · Watch on YouTube ↗

At a glance

WHAT IT’S REALLY ABOUT

DeepMind’s Genie 3 enables real-time interactive worlds with lasting memory

  1. Genie 3 combines multiple internal research threads (Genie 2-style world generation, GameNGen-style real-time simulation, and Veo-era text adherence) into a single interactive, real-time “world model.”
  2. The standout capability is spatial memory/persistence—objects and edits remain consistent when you look away and return—achieved without an explicit 3D scene representation.
  3. Scaling data and compute improves realism and physics-like behaviors (e.g., water, lighting, terrain interactions), producing outputs that can look real to non-experts.
  4. Genie 3 prioritizes interactivity, speed, controllability, and minute-long memory, while Veo prioritizes cinematic quality and additional modalities like audio, highlighting deliberate product/research trade-offs.
  5. The team argues world models could accelerate reinforcement learning and robotics by providing experience-rich, data-driven simulation that helps close the “sim-to-real” gap, though true real-world embodiment remains a major frontier.

IDEAS WORTH REMEMBERING

5 ideas

Real-time response changes how people perceive generated video.

The researchers emphasize that immediate feedback (keyboard-controlled navigation/actions) creates a “magical” sense of presence that non-interactive clips can’t match, even if the visuals are similar.

Spatial memory is a planned feature, but its quality still surprised the team.

They targeted “minute-plus” persistence as a headline goal, and even the creators found some examples (like painting a wall, leaving, and returning to see the paint unchanged) hard to believe on first viewing.

Genie 3 achieves consistency without explicit 3D reconstruction.

They intentionally avoided explicit 3D representations like NeRFs or Gaussian splatting so as not to limit generalization; instead, the model generates frame by frame while still maintaining surprisingly stable world state.

Scaling yields ‘world-knowledge’ behaviors that feel like physics understanding.

With more breadth of training, the model often does the intuitively correct thing—swimming when entering water, slowing uphill on skis, plausible puddle interactions—without special-case programming, though failures still occur.

Text adherence is a major unlock versus earlier Genie versions.

Genie 1/2 relied on image prompting (and suffered from “start state” mismatches); Genie 3’s direct text-to-world controllability benefits from internal DeepMind know-how (including lessons from Veo), enabling highly specific, even silly prompts to work.

WORDS WORTH SAVING

5 quotes

I felt it for the first time when our model, the GameNGen model, started working fast enough, and we were just like, "Oh my God. I can actually walk around."

Shlomi Fruchter

The TLDR is it was totally planned for, but still incredibly surprising when it worked that well, right? So that specific sample, when I saw it, it was hard to believe.

Jack Parker-Holder

Every time someone interacts with it for the first time and they test it, they look away and then look back, I'm always holding my breath. And then when it looks back and it's the same, I'm like, "Whoa."

Jack Parker-Holder

All of the applications basically stem from the ability to generate a world from just a few words.

Shlomi Fruchter

We designed it to be an environment rather than an agent, right? So Genie 3 is very much an environment model.

Jack Parker-Holder

Topics: Real-time interactive world generation · Spatial memory and persistence across frames · Emergent physics/terrain behaviors from scaling · Instruction following and text adherence improvements · Genie vs. Veo trade-offs (quality, audio, interactivity) · World models for RL agents and robotics · Future directions: longer memory, richer embodiment, multi-user worlds

High quality AI-generated summary created from speaker-labeled transcript.
