a16z

Google DeepMind Developers: How Nano Banana Was Made

Google DeepMind’s new image model Nano Banana took the internet by storm. In this episode, we sit down with Principal Scientist Oliver Wang and Group Product Manager Nicole Brichtova to discuss how Nano Banana was created, why it’s so viral, and the future of image and video editing.

Timestamps:
00:00 Intro
02:00 The Origin of Nano Banana and How It Got Its Name
04:15 The “Wow” Moments and Viral Launch
06:20 Seeing Yourself in AI
08:40 How AI Is Changing Art and Creative Work
11:00 Control, Customization & Character Consistency
14:00 Building Interfaces for Artists and Everyday Users
17:10 AI in Education and Visual Learning
20:25 Multimodal AI and the Future of Creativity
24:10 2D vs 3D: The Debate Over World Models
27:20 The Challenge of Taste, Preference & Artistic Style
31:10 The Japan Phenomenon & Creative Communities
35:00 From Images to Video: The Next Frontier
41:00 Working With Artists and Designing With Intent
47:30 The Next Era of Image Models
53:50 Closing Thoughts

Follow Oliver on X: https://x.com/oliver_wang2
Follow Nicole on X: https://x.com/nbrichtova
Follow Guido on X: https://x.com/appenz
Follow Yoko on X: https://x.com/stuffyokodraws
Follow Justine on X: https://x.com/venturetwins

Stay Updated: If you enjoyed this episode, be sure to like, subscribe, and share with your friends!
Follow a16z on X: https://x.com/a16z
Subscribe to a16z on Substack: https://a16z.substack.com/
Follow a16z on LinkedIn: https://www.linkedin.com/company/a16z
Listen to the a16z Podcast on Spotify: https://open.spotify.com/show/5bC65RDvs3oxnLyqqvkUYX
Listen to the a16z Podcast on Apple Podcasts: https://podcasts.apple.com/us/podcast/a16z-podcast/id842818711

Please note that the content here is for informational purposes only; should NOT be taken as legal, business, tax, or investment advice or be used to evaluate any investment or security; and is not directed at any investors or potential investors in any a16z fund.
a16z and its affiliates may maintain investments in the companies discussed. For more details please see a16z.com/disclosures.

Nicole Brichtova (guest)
Oct 28, 2025 · 54m

CHAPTERS

  1. Why AI image editing feels empowering for creators

    The conversation opens on a core promise of modern image models: shifting creators away from tedious, manual editing and toward higher-leverage creative decisions. The hosts frame AI as a new “medium” that can expand what artists can do rather than replace artistry.

    • AI reduces repetitive production work so creators can spend more time on creative intent
    • Models function like new artistic tools/media (e.g., “watercolors for Michelangelo”)
    • Framing tension: empowerment vs. fear of automation in creative work
  2. From Imagen to Gemini: the origin of “Nano Banana”

    The team traces Nano Banana’s lineage through DeepMind’s Imagen family and Gemini’s earlier image capabilities. Nano Banana emerges from combining Gemini’s conversational, interactive editing with Imagen’s visual quality, plus a nickname that was too memorable to drop.

    • Imagen models were strong on visual quality; Gemini focused on conversational use cases
    • A joint push toward interactive, multimodal editing led to the new model
    • Goal: “best of both worlds,” combining Gemini smartness with Imagen aesthetics
    • The name “Nano Banana” stuck because it’s simple and memorable
  3. The viral moment: launch dynamics and sudden demand

    Oliver describes realizing the model’s breakout appeal only after its release on LMArena, when demand exceeded provisioning expectations. The team’s internal perception shifted from “we built a good editor” to “this is broadly useful and people will seek it out.”

    • Adoption signal: escalating QPS (queries per second) needs on LMArena compared to prior models
    • Users tolerated partial availability just to access the model
    • Virality tied to usefulness in conversational image editing, not just novelty
  4. Seeing yourself in AI: personalization as the emotional hook

    Nicole recounts the first time a zero-shot edit produced an output that actually looked like her—without fine-tuning or multiple reference images. The discussion highlights why “it looks like me” is a uniquely sticky feature: it becomes personally meaningful once users try it on themselves, families, or pets.

    • Zero-shot likeness from a single input image felt like a step-change vs. LoRA fine-tuning workflows
    • Emotional resonance increases when the subject is you/your family/your dog
    • Internal excitement grew as people made transformations (e.g., ’80s makeovers)
    • Personalization becomes a gateway to repeated use
  5. What becomes of art: intent, iteration, and professional craft

    The hosts ask whether art is “out-of-distribution,” prompting a broader claim: intent matters more than novelty. The model is positioned as a tool—professionals with taste and ideas will still outperform casual users, and iterative creation remains central.

    • Art isn’t necessarily out-of-distribution; it often builds on prior work
    • Intent is the key differentiator between outputs that feel like art vs. noise
    • Models raise baseline capability, but don’t automatically confer taste or vision
    • Creative work is iterative; conversational editing aligns with that process
  6. Control, customization, and character consistency as make-or-break features

    They dig into why many artists previously avoided AI tools: lack of control and inability to keep characters consistent across a narrative. The team discusses optimizing for customization, multi-image style transfer, and the challenges of long conversations where instruction-following can degrade.

    • Artist demand drivers: character/object consistency and fine-grained control
    • Multi-image conditioning enables style transfer and targeted edits
    • Interactive, multi-turn editing mirrors real creative workflows
    • Open issue: long-context conversations can reduce instruction adherence; room to improve
  7. UI/UX for everyone: from chat to pro node graphs

    A major thread is interface design: how to serve casual users, prosumers, and professionals with vastly different tolerance for complexity. They contrast chat-based workflows (great onboarding) with node-based systems like ComfyUI (powerful, composable, but complex), and discuss smart UI suggestions as a bridge.

    • Trade-off: simple phone/voice/chat interfaces vs. deep “knobs and dials” for pros
    • ComfyUI-style node graphs enable robust workflows and model/tool chaining
    • Prosumer gap: needs more control than chat, less complexity than pro suites
    • Future UI idea: models suggest next edits because language doesn’t map cleanly to visual intent
  8. One model or many: ecosystems, workflows, and specialization

    Asked whether a single provider/model will do everything or whether workflows will stitch many components together, Oliver argues diversity will persist. Different users value different behaviors (strict instruction-following vs. ideation “go crazy”), making specialization inevitable.

    • No “single model to rule them all” due to varied user goals and tasks
    • Models can be tuned toward precision vs. creativity/inspiration; trade-offs remain
    • Composable workflows (e.g., storyboards/keyframes feeding video models) will continue
    • Ecosystem likely includes multiple models and tools connected in pipelines
  9. AI for education and visual learning: beyond text-only tutoring

    They pivot to education, arguing that most learners benefit from visual explanations—not just text dialogue. The promise is multimodal tutoring that generates diagrams, figures, and step-by-step visuals, but it depends on better factuality and reliable text rendering.

    • Text-only tutoring is misaligned with how many students learn
    • Multimodal tutors can generate diagrams and visual cues alongside explanations
    • “Visual deep research” concept: models explore options and present curated outputs
    • Requirements: factuality, good text rendering, and trustworthy explanations
  10. Multimodal futures: pixels, SVGs, code+image hybrid outputs

    The discussion explores whether pixels are the endgame representation or whether mixed formats (SVG, layers, parametrics) will matter for editability. They note an emerging advantage: models that natively generate both code and images can create hybrid artifacts—part raster, part structured.

    • Pixels can approximate everything, but structured representations aid editability
    • Mixed outputs could combine raster images with SVG/parametric components
    • Native multimodal capability enables workflows like “generate code + generate image”
    • Examples include rendering webpages from code images and unconventional pixel grids (e.g., Excel-as-pixels)
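
    As a rough illustration of the hybrid raster-plus-structured idea in this chapter, the sketch below wraps a pixel layer and an editable vector text layer in a single SVG file. It is not how any Google model produces output. The helper names `solid_png` and `hybrid_svg` are hypothetical, and the solid-color PNG only stands in for a generated image.

```python
import base64
import struct
import zlib
from xml.sax.saxutils import escape


def solid_png(width: int, height: int, rgb: tuple[int, int, int]) -> bytes:
    """Encode a solid-color RGB PNG (stand-in for a model-generated raster)."""

    def chunk(tag: bytes, data: bytes) -> bytes:
        # PNG chunk layout: length, type, data, CRC over type + data.
        body = tag + data
        return struct.pack(">I", len(data)) + body + struct.pack(">I", zlib.crc32(body))

    # IHDR: width, height, bit depth 8, color type 2 (RGB), no interlace.
    header = struct.pack(">IIBBBBB", width, height, 8, 2, 0, 0, 0)
    row = b"\x00" + bytes(rgb) * width  # filter byte 0, then the pixels
    pixels = zlib.compress(row * height)
    return (
        b"\x89PNG\r\n\x1a\n"
        + chunk(b"IHDR", header)
        + chunk(b"IDAT", pixels)
        + chunk(b"IEND", b"")
    )


def hybrid_svg(png: bytes, width: int, height: int, caption: str) -> str:
    """Combine a raster layer with a vector text layer that stays editable."""
    data_uri = "data:image/png;base64," + base64.b64encode(png).decode("ascii")
    return (
        f'<svg xmlns="http://www.w3.org/2000/svg" width="{width}" height="{height}">\n'
        # Raster part: pixels embedded as a data URI (SVG 2 style href).
        f'  <image href="{data_uri}" width="{width}" height="{height}"/>\n'
        # Structured part: the caption remains text, so it can be edited later.
        f'  <text x="8" y="{height - 8}" font-size="12" fill="white">{escape(caption)}</text>\n'
        f"</svg>"
    )


if __name__ == "__main__":
    svg = hybrid_svg(solid_png(4, 4, (255, 200, 0)), 64, 64, "Nano Banana")
    print(svg[:60])
```

    The point of the split is editability: changing the caption means editing one text node, while the pixel layer is left untouched.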
  11. 2D vs 3D world models: projections, data, and robotics constraints

    They address the debate over explicit 3D world models versus learning latent 3D from 2D projections. Oliver argues 2D projections may solve many creative/interface needs given available training data, while acknowledging robotics and physical interaction require stronger 3D grounding.

    • 3D models offer consistency; 2D projection data is far more abundant
    • Video models already show strong latent 3D understanding (reconstruction works well)
    • Human creative tools and interfaces are historically 2D (cave walls to screens)
    • Robotics/locomotion likely needs explicit 3D for reliable physical navigation
  12. Evaluating likeness and quality: the “uncanny valley” and taste problem

    Character consistency evaluation is described as unusually hard: faces you don’t know don’t reveal failure modes, but familiar faces do. More broadly, they argue that model quality can’t be collapsed into a single score—users weight dimensions differently, and lab “taste” influences releases.

    • Best likeness eval: test on yourself/people you know; unfamiliar faces are misleading
    • Human perception is uneven; benchmark aggregation hides multidimensional trade-offs
    • Deployment involves priorities (don’t regress on key wins like consistency) and accepted gaps (e.g., text rendering)
    • “Taste” and preference meaningfully shape which models labs choose to ship
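
    To make the aggregation point concrete, here is a small sketch with invented numbers. The two models, their per-dimension scores, and the user weightings are all hypothetical. It shows how a single averaged score can tie two models that different users would rank in opposite orders.

```python
# Hypothetical per-dimension scores in [0, 1] for two made-up models.
SCORES = {
    "model_a": {"likeness": 0.9, "text_rendering": 0.5, "aesthetics": 0.7},
    "model_b": {"likeness": 0.5, "text_rendering": 0.9, "aesthetics": 0.7},
}


def weighted_score(scores: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted average of the per-dimension scores."""
    total = sum(weights.values())
    return sum(scores[dim] * w for dim, w in weights.items()) / total


def rank(weights: dict[str, float]) -> list[str]:
    """Model names ordered best-first under the given weighting."""
    return sorted(SCORES, key=lambda m: weighted_score(SCORES[m], weights), reverse=True)


# An unweighted benchmark average versus two users who care about different things.
uniform = {"likeness": 1, "text_rendering": 1, "aesthetics": 1}
portrait_user = {"likeness": 3, "text_rendering": 1, "aesthetics": 1}
poster_user = {"likeness": 1, "text_rendering": 3, "aesthetics": 1}

if __name__ == "__main__":
    for name, weights in [("uniform", uniform), ("portrait", portrait_user), ("poster", poster_user)]:
        print(name, [(m, round(weighted_score(SCORES[m], weights), 2)) for m in rank(weights)])
```

    Under the uniform weighting both models average 0.7, yet the portrait-focused weighting prefers `model_a` and the poster-focused weighting prefers `model_b`.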
  13. Communities and force multipliers: Japan’s workflows, latency, and downstream creation

    They discuss how creative communities amplify capabilities—highlighting Japan’s intense adoption, manga/anime-focused tooling, and workflow wrappers. They also identify “force multipliers” such as low latency and reliable quality, which enable rapid iteration and unlock larger downstream products like video.

    • Japan community built extensions/workflows (e.g., manga/anime prompting wrappers)
    • Latency is a major multiplier: fast iteration changes user behavior and output volume
    • Quality must clear a threshold; above it, speed and consistency compound value
    • Downstream unlocks: storyboard/keyframe generation feeding video creation pipelines
  14. From images to video and “visual deep research”: sequence, state, and interleaved storytelling

    The conversation ties images to video as adjacent sequence-prediction problems, moving toward interactive, temporal creativity. Oliver calls out “interleaved generation” (multi-image story sequences with consistent characters) as an underused capability, and they discuss future models that self-critique and iterate at inference time.

    • Images and video sit on a continuum; temporal reasoning enables “what happens if…” edits
    • Interleaved generation: request multiple images in one prompt to form a story sequence
    • Future: longer “thinking/iteration” at inference time (drafts, self-critique, refinement)
    • Vision: models return options and step-by-step breakdowns like manuals or design explorations
  15. Lower-bound quality, brand compliance, and what comes next for image models

    They close on the next technical frontier: not making the best cherry-picked output, but improving the worst-case (“lemon picking”). Raising the floor unlocks high-trust use cases like education factuality and brand-compliant creative generation that follows long guideline documents.

    • Shift from cherry-picking best images to improving worst-case reliability
    • Key future applications: factual, trustworthy educational visuals
    • Using long context windows to follow detailed brand guidelines (colors, fonts, rules)
    • Inference-time critique loops for compliance: generate → check against rules → revise
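
    The generate → check → revise loop in the last bullet can be sketched as follows. This is a minimal illustration under stated assumptions, not the team's method. A draft is reduced to a plain dictionary, `Rule`, `violations`, and `generate_with_critique` are hypothetical names, and each rule's `fix` function stands in for what would really be a model-driven revision. The brand values are invented.

```python
from dataclasses import dataclass
from typing import Callable

Draft = dict[str, str]  # simplified stand-in for a generated design


@dataclass(frozen=True)
class Rule:
    name: str
    check: Callable[[Draft], bool]  # True when the draft complies
    fix: Callable[[Draft], Draft]   # placeholder for a model-driven revision


def violations(draft: Draft, rules: list[Rule]) -> list[Rule]:
    """Rules the draft currently breaks."""
    return [rule for rule in rules if not rule.check(draft)]


def generate_with_critique(
    generate: Callable[[], Draft], rules: list[Rule], max_rounds: int = 3
) -> tuple[Draft, int]:
    """Generate a draft, then check and revise until it passes every rule.

    Returns the compliant draft and the number of revision rounds used.
    """
    draft = generate()
    for rounds_used in range(max_rounds):
        failed = violations(draft, rules)
        if not failed:
            return draft, rounds_used
        for rule in failed:
            draft = rule.fix(draft)
    if violations(draft, rules):
        raise ValueError("draft still violates the guidelines after max_rounds")
    return draft, max_rounds


# Invented brand guidelines, for illustration only.
BRAND_RULES = [
    Rule("primary color", lambda d: d["color"] == "#1A73E8", lambda d: {**d, "color": "#1A73E8"}),
    Rule("brand font", lambda d: d["font"] == "Inter", lambda d: {**d, "font": "Inter"}),
]

if __name__ == "__main__":
    first_draft = lambda: {"color": "#FF0000", "font": "Comic Sans"}
    print(generate_with_critique(first_draft, BRAND_RULES))
```

    Capping the number of rounds and raising on failure reflects the chapter's emphasis on the floor rather than the best case: a non-compliant draft is never returned silently.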
