Google DeepMind Developers: How Nano Banana Was Made

Google DeepMind’s new image model Nano Banana took the internet by storm. In this episode, we sit down with Principal Scientist Oliver Wang and Group Product Manager Nicole Brichtova to discuss how Nano Banana was created, why it’s so viral, and the future of image and video editing.

Timestamps:
00:00 Intro
02:00 The Origin of Nano Banana and How It Got Its Name
04:15 The “Wow” Moments and Viral Launch
06:20 Seeing Yourself in AI
08:40 How AI Is Changing Art and Creative Work
11:00 Control, Customization & Character Consistency
14:00 Building Interfaces for Artists and Everyday Users
17:10 AI in Education and Visual Learning
20:25 Multimodal AI and the Future of Creativity
24:10 2D vs 3D: The Debate Over World Models
27:20 The Challenge of Taste, Preference & Artistic Style
31:10 The Japan Phenomenon & Creative Communities
35:00 From Images to Video: The Next Frontier
41:00 Working With Artists and Designing With Intent
47:30 The Next Era of Image Models
53:50 Closing Thoughts

Follow Oliver on X: https://x.com/oliver_wang2
Follow Nicole on X: https://x.com/nbrichtova
Follow Guido on X: https://x.com/appenz
Follow Yoko on X: https://x.com/stuffyokodraws
Follow Justine on X: https://x.com/venturetwins

Stay Updated:
If you enjoyed this episode, be sure to like, subscribe, and share with your friends!
Follow a16z on X: https://x.com/a16z
Subscribe to a16z on Substack: https://a16z.substack.com/
Follow a16z on LinkedIn: https://www.linkedin.com/company/a16z
Listen to the a16z Podcast on Spotify: https://open.spotify.com/show/5bC65RDvs3oxnLyqqvkUYX
Listen to the a16z Podcast on Apple Podcasts: https://podcasts.apple.com/us/podcast/a16z-podcast/id842818711

Please note that the content here is for informational purposes only; should NOT be taken as legal, business, tax, or investment advice or be used to evaluate any investment or security; and is not directed at any investors or potential investors in any a16z fund. a16z and its affiliates may maintain investments in the companies discussed. For more details please see a16z.com/disclosures.

Nicole Brichtova, guest

Oct 28, 2025 · 54m

CHAPTERS

Why AI image editing feels empowering for creators
The conversation opens on a core promise of modern image models: shifting creators away from tedious, manual editing and toward higher-leverage creative decisions. The hosts frame AI as a new “medium” that can expand what artists can do rather than replace artistry.
From Imagen to Gemini: the origin of “Nano Banana”
The team traces Nano Banana’s lineage through DeepMind’s Imagen family and Gemini’s earlier image capabilities. Nano Banana emerged from combining Gemini’s conversational, interactive editing with Imagen’s visual quality, plus a nickname that was too memorable to drop.
The viral moment: launch dynamics and sudden demand
Oliver describes realizing the model’s breakout appeal only after its release on LMArena, when demand exceeded what the team had provisioned for. The team’s internal perception shifted from “we built a good editor” to “this is broadly useful and people will seek it out.”
Seeing yourself in AI: personalization as the emotional hook
Nicole recounts the first time a zero-shot edit produced an output that actually looked like her—without fine-tuning or multiple reference images. The discussion highlights why “it looks like me” is a uniquely sticky feature: it becomes personally meaningful once users try it on themselves, families, or pets.
What becomes of art: intent, iteration, and professional craft
The hosts ask whether art is “out-of-distribution,” prompting a broader claim: intent matters more than novelty. The model is positioned as a tool—professionals with taste and ideas will still outperform casual users, and iterative creation remains central.
Control, customization, and character consistency as make-or-break features
They dig into why many artists previously avoided AI tools: lack of control and inability to keep characters consistent across a narrative. The team discusses optimizing for customization, multi-image style transfer, and the challenges of long conversations where instruction-following can degrade.
UI/UX for everyone: from chat to pro node graphs
A major thread is interface design: how to serve casual users, prosumers, and professionals with vastly different tolerance for complexity. They contrast chat-based workflows (great onboarding) with node-based systems like ComfyUI (powerful, composable, but complex), and discuss smart UI suggestions as a bridge.
One model or many: ecosystems, workflows, and specialization
Asked whether a single provider/model will do everything or whether workflows will stitch many components together, Oliver argues diversity will persist. Different users value different behaviors (strict instruction-following vs. ideation “go crazy”), making specialization inevitable.
AI for education and visual learning: beyond text-only tutoring
They pivot to education, arguing that most learners benefit from visual explanations—not just text dialogue. The promise is multimodal tutoring that generates diagrams, figures, and step-by-step visuals, but it depends on better factuality and reliable text rendering.
Multimodal futures: pixels, SVGs, code+image hybrid outputs
The discussion explores whether pixels are the endgame representation or whether mixed formats (SVG, layers, parametrics) will matter for editability. They note an emerging advantage: models that natively generate both code and images can create hybrid artifacts—part raster, part structured.
2D vs 3D world models: projections, data, and robotics constraints
They address the debate over explicit 3D world models versus learning latent 3D from 2D projections. Oliver argues 2D projections may solve many creative/interface needs given available training data, while acknowledging robotics and physical interaction require stronger 3D grounding.
Evaluating likeness and quality: the “uncanny valley” and taste problem
Character consistency evaluation is described as unusually hard: faces you don’t know don’t reveal failure modes, but familiar faces do. More broadly, they argue that model quality can’t be collapsed into a single score—users weight dimensions differently, and lab “taste” influences releases.
Communities and force multipliers: Japan’s workflows, latency, and downstream creation
They discuss how creative communities amplify capabilities—highlighting Japan’s intense adoption, manga/anime-focused tooling, and workflow wrappers. They also identify “force multipliers” such as low latency and reliable quality, which enable rapid iteration and unlock larger downstream products like video.
From images to video and “visual deep research”: sequence, state, and interleaved storytelling
The conversation ties images to video as adjacent sequence-prediction problems, moving toward interactive, temporal creativity. Oliver calls out “interleaved generation” (multi-image story sequences with consistent characters) as an underused capability, and they discuss future models that self-critique and iterate at inference time.
Lower-bound quality, brand compliance, and what comes next for image models
They close on the next technical frontier: not making the best cherry-picked output, but improving the worst-case (“lemon picking”). Raising the floor unlocks high-trust use cases like education factuality and brand-compliant creative generation that follows long guideline documents.
