a16z: Google DeepMind Lead Researchers on Genie 3 & the Future of World-Building
At a glance
WHAT IT’S REALLY ABOUT
DeepMind’s Genie 3 enables real-time interactive worlds with lasting memory
- Genie 3 combines multiple internal research threads (Genie 2-style world generation, GameNGen-style real-time simulation, and Veo-era text adherence) into a single interactive, real-time “world model.”
- The standout capability is spatial memory/persistence—objects and edits remain consistent when you look away and return—achieved without an explicit 3D scene representation.
- Scaling data and compute improves realism and physics-like behaviors (e.g., water, lighting, terrain interactions), producing outputs that can look real to non-experts.
- Genie 3 prioritizes interactivity, speed, controllability, and minute-long memory, while Veo prioritizes cinematic quality and additional modalities such as audio—highlighting deliberate product/research trade-offs.
- The team argues world models could accelerate reinforcement learning and robotics by providing experience-rich, data-driven simulation that helps close the “sim-to-real” gap, though true real-world embodiment remains a major frontier.
IDEAS WORTH REMEMBERING
5 ideas
Real-time response changes how people perceive generated video.
The researchers emphasize that immediate feedback (keyboard-controlled navigation/actions) creates a “magical” sense of presence that non-interactive clips can’t match, even if the visuals are similar.
Spatial memory is a planned feature, but its quality still surprised the team.
They targeted “minute-plus” persistence as a headline goal, and even the creators found some examples (like painting a wall, leaving, and returning to see the paint unchanged) hard to believe on first viewing.
Genie 3 achieves consistency without explicit 3D reconstruction.
They intentionally avoided approaches like NeRFs or Gaussian splatting to prevent limiting generalization; instead the model generates frame-by-frame while still maintaining surprisingly stable world state.
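To make the "no explicit 3D" point concrete, here is a minimal, hypothetical sketch of the frame-by-frame loop such a model implies: each new frame is predicted from a window of recent frames plus the user's action, so persistence emerges from conditioning on history rather than from a maintained scene graph. `toy_world_model` is a trivial stand-in (it just shifts the view), not DeepMind's actual architecture.

```python
CONTEXT = 8  # frames of history the model conditions on (its "memory")

def toy_world_model(history, action):
    """Stand-in next-frame predictor.

    Rotates the last frame's columns by `action` (an integer pan),
    wrapping around — so revisiting a spot reproduces the same pixels,
    with no explicit map or 3D reconstruction anywhere.
    """
    last = history[-1]
    return [row[-action:] + row[:-action] for row in last]

def interact(start_frame, actions):
    """Autoregressive loop: predict, append, repeat for each user action."""
    history = [start_frame]
    for a in actions:
        history.append(toy_world_model(history[-CONTEXT:], a))
    return history

# "Look away and look back": pan right one step, then pan back left.
frame0 = [[0, 1, 2, 3], [4, 5, 6, 7]]
frames = interact(frame0, [1, -1])
assert frames[-1] == frame0  # the revisited view matches the original
```

In a real model the predictor is a large neural network and the context window is what bounds how long the world stays consistent—hence the "minute-plus" persistence framing above.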
Scaling yields ‘world-knowledge’ behaviors that feel like physics understanding.
With more breadth of training, the model often does the intuitively correct thing—swimming when entering water, slowing uphill on skis, plausible puddle interactions—without special-case programming, though failures still occur.
Text adherence is a major unlock versus earlier Genie versions.
Genie 1/2 relied on image prompting (and suffered from “start state” mismatches); Genie 3’s direct text-to-world controllability benefits from internal DeepMind know-how (including lessons from Veo), enabling highly specific, even silly prompts to work.
WORDS WORTH SAVING
5 quotes
I felt it for the first time when our model, like the, uh, actually GameNGen model started working fast enough, and we were just like, "Oh my God. It's actually-- I can actually walk around."
— Shlomi Fruchter
The TLDR is it was totally planned for, but still incredibly surprising when it worked that well, right? So that specific sample, when I saw it, it was hard to believe.
— Jack Parker-Holder
Every time someone interacts with it for the first time and they like test-- They look away and then look back, I'm always like holding my breath. And then it, and then it looks back and it's the same, I'm like, "Whoa."
— Jack Parker-Holder
All of the applications basically stem from the ability to generate a world that, that just, just from a few words.
— Shlomi Fruchter
We designed it to be an envi-environment rather than an agent, right? So, so Genie 3 is very much like an environment model.
— Jack Parker-Holder
High quality AI-generated summary created from speaker-labeled transcript.