At a glance
WHAT IT’S REALLY ABOUT
DeepMind’s Nano Banana: multimodal image editing, creativity, control, and future
- Nano Banana emerged from combining Imagen’s high visual quality with Gemini’s multimodal, conversational editing to create a “best of both worlds” image model optimized for interactive use cases.
- The team’s biggest “wow” moments came from real user demand (LMArena traffic spikes) and from zero-shot personalization, where a single reference image could produce outputs that convincingly looked like the user.
- A central theme is creative empowerment through control: character consistency, iterative multi-turn editing, and customization reduce tedious work while keeping “intent” and taste as the differentiator for real art.
- They argue future products will span a spectrum from simple chat-based interfaces for consumers to node-based/pro workflows (e.g., ComfyUI), with models increasingly suggesting next steps rather than requiring users to learn hundreds of controls.
- Key technical frontiers include harder evaluation (multi-dimensional quality, “lemon picking” worst-case outputs), better long-context instruction adherence (e.g., brand guidelines), factual visual reasoning for education, and the transition from images to video and agentic “visual deep research.”
IDEAS WORTH REMEMBERING
5 ideas
Nano Banana is positioned as a hybrid: Gemini’s “smartness” plus Imagen-quality visuals.
The team reframed the problem from pure generation to conversational, multimodal editing while lifting visual fidelity to match their best image models; that shift is what made it broadly useful rather than merely impressive.
Personalization is an adoption catalyst because it creates emotional resonance.
Seeing yourself (or your kids/pets) generated convincingly from a single image is a “zero-shot” breakthrough that turns AI imagery from novelty into something people feel compelled to try and share.
Control (character consistency, iterative edits) reduces artist skepticism more than raw generation ability.
Early one-shot text-to-image felt like the model made most decisions; increasing controllability and multi-turn collaboration helps creatives express intent, making the tool feel like an extension of craft rather than replacement.
Interface design will bifurcate: chat for accessibility, workflow graphs for power users, and a big “middle.”
Consumers benefit from not having to learn new UIs, pros demand knobs and dials, and there’s an underserved prosumer tier needing more control than chat offers but less complexity than full pro software.
No single model will satisfy all creative goals; diversity and ensembles will persist.
Optimizing for strict instruction-following can reduce “ideation” value (and vice versa), so the ecosystem will likely look like multiple specialized models connected in workflows (e.g., Nano Banana as a node for storyboards/keyframes).
WORDS WORTH SAVING
5 quotes
So I think it’s like, to me, I think that the most important thing for art is intent.
— Oliver Wang
It was the first time when the output actually looked like me.
— Nicole Brichtova
Fun is kind of a gateway to utility, where, you know, people come to make a figurine image of themselves, but then they stay because it helps them with their math homework, or it helps them write something, right?
— Nicole Brichtova
I don’t think this is gonna be, like, a single to rule, a single model to rule them all.
— Oliver Wang
People look at these images and say, "Oh, it’s almost perfect. We must be done." And for a while, we were in this like cherry pick phase where we would, you know, everyone would pick their best images. So you look at those and, and they’re great, but actually what’s more important now is the worst image, is we’re in a lemon picking stage. ’Cause every model can cherry pick images that look perfect.
— Oliver Wang
High quality AI-generated summary created from speaker-labeled transcript.