EVERY SPOKEN WORD
60 min read · 12,353 words
- 0:00 – 2:00
Intro
- NBNicole Brichtova
These models are allowing creators to do, um, less tedious parts of the job, right? They can be more creative.
- SPSpeaker
Yeah.
- NBNicole Brichtova
And they can spend, you know, 90% of their time being creative versus 90% of their time, like, editing things and doing these tedious kind of manual operations.
- SPSpeaker
I'm convinced that this ultimately really empowers the artist, right? It gives you new tools, right? It's like, hey, we now have, I don't know, watercolors for Michelangelo. Let's see what he does with it, right? And amazing things come out. [laughs]
- SPSpeaker
Maybe start by telling us about the backstory behind the Nano Banana model. How did it come to be? How did you all start working on it?
- SPSpeaker
Sure. So, um, our team has worked on image models for some time. We developed the Imagen family of models, which goes back a couple of years. And actually, there was also an image generation model in Gemini before: the Gemini 2.0 image generation model. So what happened was the teams started to focus more on the Gemini use cases, so interactive, conversational, and editing.
- SPSpeaker
Mm.
- SPSpeaker
Um, and essentially, we teamed up and built this model, which became what's known as Nano Banana. So yeah, that's sort of the origin story. But...
- NBNicole Brichtova
Yeah. And I think maybe just some more background on that. So our Imagen models were always kind of top of the charts for visual quality.
- SPSpeaker
Mm-hmm.
- NBNicole Brichtova
And, you know, we really focused on these specialized generation and editing use cases. And then when 2.0 Flash came out, that's when we really started to see some of the magic of being able to generate images and text at the same time, so you can maybe tell a story. Just the magic of being able to talk to images and edit them conversationally. But the visual quality was maybe not where we wanted it to be. And so Nano Banana, or Gemini 2.5 Flash Image, um, we're-
- SPSpeaker
Nano Banana is way cooler.
- NBNicole Brichtova
It's, it's easier to say.
- SPSpeaker
[laughs]
- NBNicole Brichtova
It's a lot easier to say.
- SPSpeaker
It's, it's a name that stuck, right? [laughs]
- NBNicole Brichtova
Yes. It's the name that stuck. But it really became kind of the best of both worlds in that sense: the Gemini smartness and
- 2:00 – 4:15
The Origin of Nano Banana and How It Got Its Name
- NBNicole Brichtova
the multimodal kind of conversational nature of it, plus the visual quality of Imagen, and I feel like that's maybe what resonates a lot with people.
- SPSpeaker
Wow, amazing. Um, so I guess when you were testing the model as you were developing it, what were some wow moments where you thought, "I know this is gonna go viral, I know people will love this"?
- SPSpeaker
I... So I actually didn't feel like it was going to go viral until we had released on LMArena. And what we saw was that we budgeted, you know, a comparable number of queries per second to what we had for our previous models-
- SPSpeaker
Mm
- SPSpeaker
... that were on LMArena. And we had to keep upping that number as people went to LMArena to use the model. I feel like that was the first time I was really like, "Oh, wow, this is something that's very, very useful to a lot of people." It surprised even me. I don't know about the whole team, but we were trying to make the best conversational editing model possible, and then it really started taking off when people were going out of their way to use a website that would only give you the model some percentage of the time, and even that was worth it. So I think that was really the moment, at least for me, when I thought, "Oh, wow, this is gonna be bigger."
- SPSpeaker
That's actually the best way to condition people, like, only giving them a reward partially. [laughs]
- SPSpeaker
[laughs]
- SPSpeaker
Not all the time.
- NBNicole Brichtova
Not by design. Uh, I had a moment earlier. I've been trying similar queries on multiple generations of models over time, and a lot of them have to do with things I wanted to be as a kid. So, like, an astronaut, an explorer, or, you know, put me on the red carpet. I tried it on a demo that we had internally before we released the model, and it was the first time the output actually looked like me. And you know, you guys play with these models all the time. The only time I've seen that before is if you fine-tune a model, using LoRA or some other method, and you need multiple images, and it takes a really long time, and then you have to actually serve it somewhere. So this was the first time it was zero-shot: just one image of me, and it looks like me, and I was like, "Wow." And then we had decks that are just covered in my face as I was trying to convince other people that it was really cool. Um, and really, I think the moment more people realized
- 4:15 – 6:20
The “Wow” Moments and Viral Launch
- NBNicole Brichtova
that it was, like, a really fun feature to use is when they tried it on themselves. 'Cause it's, it's kind of fun when you see it on another person, but it doesn't really resonate with people emotionally.
- SPSpeaker
It makes it so personal, right? Yeah.
- NBNicole Brichtova
It makes it so personal when it's, like-
- SPSpeaker
Totally
- NBNicole Brichtova
... you, your kids, you know, your spouse.
- SPSpeaker
Mm.
- NBNicole Brichtova
And, and I think that's-
- SPSpeaker
Your dog, right?
- SPSpeaker
Yeah. [laughs]
- NBNicole Brichtova
Your dog, and that's really what started resonating internally, and then people just started making all these '80s makeover versions of themselves. And that's when we really started to see a lot of internal activity, and we were like, "Okay, we're onto something."
- SPSpeaker
It's, it's a lot of fun to test these models when-
- SPSpeaker
Yeah
- SPSpeaker
... we're making them because you just, you see all these amazing creative things that people make, like, "Oh, wow, I, I never thought that was possible."
- NBNicole Brichtova
Mm.
- SPSpeaker
So it's, it's really fun.
- SPSpeaker
No, it's... I mean, we've been playing with the whole family of models, and it's a crazy amount of fun. So think a bit about the long term: where does this lead? I mean, we built these new tools that I think will change visual arts forever, right? We suddenly can transfer style. We suddenly can generate consistent images of a subject. What used to be a very complex manual Photoshop process, suddenly I type one command and it magically happens. What's the end state of this? Do we have an idea yet? How will creative arts be taught in a university, you know, five years from now?
- SPSpeaker
You wanna take that?
- NBNicole Brichtova
[laughs] So I, I think it's going to be a spectrum of things, right? I think on the professional side, a lot of what we're hearing is that these models are allowing creators to do, um, less tedious parts of the job, right? They can be more creative.
- SPSpeaker
Yeah.
- NBNicole Brichtova
And they can spend, you know, 90% of their time being creative versus 90% of their time editing things and doing these tedious manual operations. So I'm really excited about that. I think we'll see kind of an explosion of creativity on that side of the spectrum. And then I think for consumers, there are probably two sides of the spectrum. One is-
- SPSpeaker
You know, you might just be doing some of these fun things like Halloween costumes for my kid, right?
- SPSpeaker
Mm.
- SPSpeaker
And, and the out- the goal there is probably just to like share it with somebody, right? Your family or your friends. Um, on the other
- 6:20 – 8:40
Seeing Yourself in AI
- SPSpeaker
side of the spectrum, you might have these tasks like putting together a slide deck, right? I started out as a consultant; we talked about it at the beginning. And you spend a lot of time on very tedious things-
- SPSpeaker
Right. Yeah
- SPSpeaker
... like trying to make things look good, trying to make the story make sense. I think for those types of tasks, you probably just have an agent that you give the specs of what you're trying to do.
- SPSpeaker
Yeah.
- SPSpeaker
And then it goes out and actually lays it out nicely for you. It creates the right visual for the information you're trying to convey. And it really is going to be this spectrum, I think, depending on what you're trying to do. Do you want to be in the creative process, actually tinker with things and collaborate with the model, or do you just want the model to go do the task with you as minimally involved as possible?
- SPSpeaker
So in this new world, then, what is art? I mean, somebody recently said art is when you can create an out-of-distribution sample. Is that a good definition, or is it aiming too high?
- SPSpeaker
I-
- SPSpeaker
Or do you think art is out of distribution or in distribution for the model?
- SPSpeaker
There we go. [laughs]
- SPSpeaker
[laughs]
- SPSpeaker
[laughs]
- SPSpeaker
[laughs]
- SPSpeaker
I think "out-of-distribution sample" is a little bit too restrictive. A lot of great art is actually in distribution for the art that came before it. So, what is art? It's a very philosophical debate, and a lot of people discuss this. To me, the most important thing for art is intent. What's generated from these models is a tool to allow people to create art. And I'm actually not worried about the high end and the creatives and the professionals, because I've seen, like, if you put me in front of one of these models, I can't create anything that anyone wants to see.
- SPSpeaker
Yeah, same here. [laughs]
- SPSpeaker
But I've seen what people can do who are creative and have intent and these ideas, and that's the most interesting thing to me: the things they create are really amazing and inspiring. So I feel like the high end, the professionals and the creatives, they'll always use state-of-the-art tools, and this is another tool in the tool belt for people to make cool things.
- SPSpeaker
I think one of the really interesting things I kept hearing about this model in particular from creatives and artists was that a lot of them felt they couldn't use a lot of AI tools before, because those tools didn't allow them the level of control they expected for their art. On one side, that was character or object consistency; they really used that to have a compelling
- 8:40 – 11:00
How AI Is Changing Art and Creative Work
- SPSpeaker
narrative for a story. And before, when you couldn't get the same character over and over, it was very difficult. And then the second thing I hear all the time from artists is they love being able to upload multiple images and say, "Use the style of this on this character," or, "Add this thing to this image," which I think was very hard to do even with previous image edit models. I'm curious: was that something you were really optimizing for when you trained this one? How did you think about it?
- SPSpeaker
I mean, yeah, definitely customizability and character consistency are things that we closely monitored during development, and we tried to do the best job we could on them. Um, I think another thing is the iterative nature of an interactive conversation. Art tends to be iterative as well: you make lots of changes, you see where it's going, and you make more. And this is another thing that I think makes the model more useful. And actually, that's an area where I also feel we can improve the model greatly. I know that once you get into really long conversations, it starts to follow your instructions a little worse. But that's something we're planning to improve on, to make the model more of a natural conversation partner, a creative partner in making something.
- SPSpeaker
One thing that's so interesting is that after you launched Nano Banana, we started to hear about editing models all the time, everywhere.
- SPSpeaker
Mm.
- SPSpeaker
Like, after you launched, the world woke up and went, "Editing models, they're great. Everyone wants one." [laughs] And that obviously goes into the customizability, the personalization of it. And then, uh, Oliver, I know you used to be at Adobe, and there's also software where we used to manually edit things. How do you see the knobs evolving now at the model layer versus what we used to do?
- SPSpeaker
Um, yeah, I mean, one thing that Adobe has always done, and the professional tools generally require, is lots of control, lots of knobs. So there's always a balance: we want someone to be able to use this on their phone-
- SPSpeaker
Mm-hmm
- SPSpeaker
... um, maybe with just like a, a voice interface.
- SPSpeaker
Mm-hmm.
- SPSpeaker
And we also want someone who can really, like a, a really professional or a creative to be able to do fine scale adjustments. I think we haven't exactly figured out how to enable both of
- 11:00 – 14:00
Control, Customization & Character Consistency
- SPSpeaker
those yet. Um, but there are a lot of people building really compelling UIs, and... Yeah, I think there are different ways it can be done.
- SPSpeaker
Mm-hmm.
- SPSpeaker
Um, I don't know, do you have thoughts on this? Well, I also hope that we get to a point where you don't have to learn what all these controls mean, and the model can smartly suggest what you could do next based on the context of what you've already done.
- SPSpeaker
Mm-hmm. Yeah.
- SPSpeaker
Right? Um, and that feels like it's kind of ripe for someone to tackle.
- SPSpeaker
Yeah.
- SPSpeaker
So what do the UIs of the future look like, in a way where you probably don't need to learn the 100 things you had to before-
- SPSpeaker
Mm-hmm
- SPSpeaker
... but like the tools should be smart enough to suggest to you what it can do based on what you're already doing.
- SPSpeaker
That's such an insightful take. I definitely had moments when I used Nano Banana where I was like, "I didn't know I wanted this," [laughs]
- SPSpeaker
[laughs]
- SPSpeaker
... but I didn't even ask for this style. I don't even have the words for what that style is called. So this is very insightful about how the image embedding and the language embedding are not one-to-one: we cannot map to all the editing tasks with language. So... Oh, go ahead.
- SPSpeaker
Yeah. Let me take a little bit of the counterpoint-
- SPSpeaker
Yeah
- SPSpeaker
... just to see where this goes. The other question is how complex the interface can be. That's limited in part by what we can express in software, how easy we can make something in software-
- NBNicole Brichtova
Mm-hmm
- SPSpeaker
... and to some degree it's also limited by how much complexity a user is willing to tolerate. And-
- NBNicole Brichtova
Yes
- SPSpeaker
... you know, if you have a professional, they only care about the result. They're willing to tolerate a vast amount of complexity. They have the training, they have the education, they have the experience to use it, right? Then we may end up with lots of knobs and dials. It's just very different-
- NBNicole Brichtova
Mm-hmm
- SPSpeaker
... knobs and dials. But I mean, today, if you use Cursor or something similar for coding, it's not that it has a super easy single-text-prompt interface. It has a good amount of, you know, add context here, different modes, and so on, right? So will we have the ultra-sophisticated interface for the power user, and how would that look?
- SPSpeaker
So I'm a big fan of ComfyUI and node-based interfaces in general.
- SPSpeaker
And that is complex. [laughs]
- NBNicole Brichtova
That is-
- SPSpeaker
And, and that's complex-
- SPSpeaker
Yes. Yeah
- SPSpeaker
... but it's also, it's very robust, and you can do a lot of things.
- SPSpeaker
It's incredible, yeah.
- SPSpeaker
And so, you know, after we released Nano Banana, we saw people building all these really complicated ComfyUI workflows, where they were combining a bunch of different models together-
- SPSpeaker
Yeah
- 14:00 – 17:10
Building Interfaces for Artists and Everyday Users
- NBNicole Brichtova
may want to be doing this, but they were too intimidated by the professional tools in the past. And for them, I do think there's a space where you need more control than the chatbot gives you, but you don't need as much control as the professional tools give you. What's that kind of in-between state?
- SPSpeaker
There's a ton of opportunity there.
- NBNicole Brichtova
There's a ton of opportunity there.
- SPSpeaker
Yeah.
- NBNicole Brichtova
Yeah.
- SPSpeaker
It is interesting you mentioned ComfyUI, 'cause it's at the far end of the workflow spectrum. [laughs] A workflow can have hundreds of steps and nodes, and you need to make sure all of them work. Whereas on the other side of the spectrum, there's Nano Banana: you describe something with words, and you get something out. I don't know the model architecture, stuff like that, but I guess, is your view that the world is moving toward an ensemble, one model hosted by one provider doing it all, or is the world moving toward everyone building a workflow, with Nano Banana as one of the nodes in a ComfyUI workflow?
- SPSpeaker
Um, I definitely don't think the broad range of use cases will be fully satisfied by one model at any point. So I think there will always be a diversity of models. I'll give you an example: we could optimize for instruction following in our models and make sure the model does exactly what you want. But that might be a worse model for someone who's looking for ideation or inspiration-
- SPSpeaker
Mm
- SPSpeaker
... where they want the model to kind of take over and-
- SPSpeaker
Go crazy. Yeah
- SPSpeaker
... and do other things. Go crazy. So I just think there are so many different use cases and so many types of people that there's a lot of room in this space for multiple models. That's where I see us going. I don't think this is gonna be a single model to rule them all.
- SPSpeaker
Makes sense. Let's go to the very other end of the spectrum from the professional. Do, do you think kindergarteners in the future will learn drawing by, by sketching something, you know, on a, on a little tablet, and then you have the-
- SPSpeaker
That's a good question
- SPSpeaker
... AI turn that into a beautiful image?
- NBNicole Brichtova
[laughs]
- SPSpeaker
And so that's how they first get in touch with art?
- NBNicole Brichtova
I don't know if you always want it to turn into a beautiful image, but I, but I think there's something there about the AI being, again, a partner and a teacher to you-
- SPSpeaker
Yeah
- NBNicole Brichtova
... in a way that you didn't have before. So I didn't know how to draw; I still don't. I don't really have any talent for it. But I think it would be great if we could use these tools in a way that actually teaches you the steps and helps you critique, and maybe shows you something like an auto-complete for images. Like, what-
- SPSpeaker
Exactly
- NBNicole Brichtova
... like, what's the next step I could take, right? Or maybe show me a couple of options, and how do I actually do this? So I hope it's more in that direction. I don't think we all want every five-year-old's image to suddenly look perfect. [laughs]
- SPSpeaker
[laughs] Right.
- SPSpeaker
[laughs]
- NBNicole Brichtova
We, we, we would probably lose something, um, in the process.
- SPSpeaker
As someone who, out of all my high school classes, struggled the most in the art and sketching class, I actually would've preferred it. But I know a lot of people want their kids to learn to draw, which I understand. [laughs]
- SPSpeaker
It's funny 'cause we've been trying to get the model to create, um, like, childlike crayon drawings-
- SPSpeaker
Mm
- SPSpeaker
... which is actually quite challenging.
- SPSpeaker
Yeah.
- SPSpeaker
Um, ironically, sometimes those are the things that are hard to make... because the level of abstraction is very large.
- 17:10 – 20:25
AI in Education and Visual Learning
- NBNicole Brichtova
in our evals right now-
- SPSpeaker
All right. [laughs]
- NBNicole Brichtova
... to, to, to try to see if we're getting better. Um-
- SPSpeaker
In general, I'm very optimistic about AI for education, and part of the reason is I think most of us are visual learners, right? Right now, as a tutor, basically all the AI can do is talk to you or give you text to read, and that's definitely not how all students learn. So I think these models have a lot of potential to help education by giving people visual cues. Imagine if you could get an explanation for something where you get the text explanation, but you also get images and figures that help explain how things work. Everything would be much more useful, much more accessible for students. So I'm really excited about that as a future direction.
- SPSpeaker
On that point, one thing that's very interesting to us is that when Nano Banana came out, it almost felt like part of the use case is as a reasoning model. Like, you have a diagram.
- NBNicole Brichtova
Mm-hmm.
- SPSpeaker
Absolutely, yeah.
- SPSpeaker
Right?
- NBNicole Brichtova
Mm-hmm.
- SPSpeaker
Like, you can explain some knowledge visually, so the model is not just approximating the visual aspect; there's a reasoning aspect to it, too. Do you think that's where we're going? Do you think all the large models will realize that, oh, to be a good LLM or, uh, VLM, we have to have both image and language and audio and so on?
- SPSpeaker
100%. I, I definitely think so. Um, the, the future for these AI models that I'm most excited by is where they are tools for people to accomplish more things.
- NBNicole Brichtova
Mm.
- SPSpeaker
Like, I think if you imagine a, a future where you have these agentic models that just talk to each other and do all the work, then it becomes a little bit less necessary that there's, like, this visual mode of communication.
- NBNicole Brichtova
Mm.
- SPSpeaker
But as long as there are people in the loop, and as long as the motivation for the task they're solving comes from people, I think it makes total sense that the visual modality is gonna be really critical for any of these AI agents going forward.
- SPSpeaker
Will we get to a point where, say, I ask you to create an image, and it sits there for two hours, reasons with itself, has drafts, explores different directions, and then comes back with a final answer?
- SPSpeaker
Yeah, absolutely.
- NBNicole Brichtova
And-
- SPSpeaker
If it's necessary, yeah. Like-
- NBNicole Brichtova
And, and maybe not just for a single image, but to the point of, you know, maybe you're redesigning your house, and maybe you actually-
- SPSpeaker
I see
- NBNicole Brichtova
... really don't wanna be involved in the process, right? But you're like, "Okay, this is what it looks like; this is some inspiration that I like," and then you send it to a model the same way you would send it to a designer.
- SPSpeaker
So it's the, the visual deep research.
- NBNicole Brichtova
The vis-
- SPSpeaker
Mm-hmm
- NBNicole Brichtova
... it's like visual deep research, basically.
- SPSpeaker
Yeah.
- NBNicole Brichtova
I really like that term. Um, and then it goes off and does its thing and searches for maybe the furniture that would go with your environment, and then it comes back to you, and maybe it presents you with options-
- SPSpeaker
Yes
- NBNicole Brichtova
... 'cause maybe, maybe you don't wanna sit for two hours and pick one thing.
- 20:25 – 24:10
Multimodal AI and the Future of Creativity
- SPSpeaker
or a world model that has explicit 3D representations, there are a lot of advantages. For example, everything stays consistent all the time.
- SPSpeaker
Yeah.
- SPSpeaker
Um, now the main challenge is that we don't walk around with 3D capture devices in our pockets, so in terms of the available data for training these models, it's largely the projection onto 2D. So I think both viewpoints are totally valid for where we're going. I come a bit from the projection side. I think we can solve almost all the problems, if not all of them, by working on the projection of the 3D world directly and letting the models learn latent world representations. We see this already: the video models have very good 3D understanding. You can run reconstruction algorithms over the videos you generate, and they're very accurate. And in general, if you look at the history of human art, it starts as the projection, right? People drawing on cave walls. All of our interfaces are in 2D. So I think humans are very well suited to working with this projection of the 3D world onto a 2D plane, and it's a really natural environment for interfaces and for viewing.
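The "projection" described here can be made concrete with a toy pinhole-camera model. This is purely illustrative, not anything from the conversation: the focal length and principal point below are made-up values.

```python
# Illustrative sketch: a pinhole camera maps 3D world points onto a
# 2D image plane -- the projection that image and video models learn from.

def project_point(point3d, focal=500.0, cx=320.0, cy=240.0):
    """Project a 3D camera-space point (x, y, z) to 2D pixel coordinates."""
    x, y, z = point3d
    if z <= 0:
        raise ValueError("point must be in front of the camera (z > 0)")
    u = focal * x / z + cx  # horizontal pixel coordinate
    v = focal * y / z + cy  # vertical pixel coordinate
    return (u, v)

# The same corner of a table seen 2 m and 4 m away: the farther point
# lands closer to the image center -- a depth cue the 2D projection keeps.
near = project_point((1.0, 0.5, 2.0))
far = project_point((1.0, 0.5, 4.0))
print(near)  # (570.0, 365.0)
print(far)   # (445.0, 302.5)
```

Reconstruction algorithms run this mapping in reverse: given enough consistent 2D views, they recover the 3D structure, which is why accurate reconstruction from generated video suggests the model has learned a good latent 3D representation.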
- SPSpeaker
That is very true. So I'm a cartoonist in my spare time, and drawing in 2D is just light and shadow, and then it presents as 3D; we kind of trick ourselves-
- NBNicole Brichtova
Mm
- SPSpeaker
... into believing it's 3D, when it's, you know, on a piece of paper. But then what a human can do, beyond what a drawing or a model can do, is navigate the world. Like, we see a table; we can't walk through it. I guess the question becomes, if everything is 2D, how do you solve that problem?
- SPSpeaker
Well, I don't think... Yeah, so if we're trying to solve the robotics problems, I think maybe the 2D, um, representation is useful for planning and visualizing kind of at a high level.
- SPSpeaker
Mm-hmm.
- SPSpeaker
Like, I think people navigate by, um, by remembering kind of 2D projections of the world. Like, you don't, you don't build a 3D map in your head. You're more like, "Oh, I know I see this building. I turn left."
- SPSpeaker
Yeah.
- SPSpeaker
So I think that, like, for that kind of planning, it's reasonable, but for the actual locomotion around the space, like, uh, definitely 3D is important there.
- SPSpeaker
Yeah.
- SPSpeaker
So robotics, yeah, they probably need 3D. [laughs]
- SPSpeaker
[laughs]
- SPSpeaker
That's the saving grace.
- SPSpeaker
Yeah.
- SPSpeaker
[laughs]
- SPSpeaker
Yeah. Um, so character consistency, which you mentioned earlier. I really love the example of how, when the model feels so personal, people are so tempted to try it. How did you unlock that moment? The reason I ask is that character consistency is so hard.
- NBNicole Brichtova
Mm-hmm.
- SPSpeaker
Uh, there's a huge uncanny valley to it.
- NBNicole Brichtova
Mm-hmm.
- SPSpeaker
Like, if it's someone I don't know and I see their AI generation, I'm like, "Okay, it's maybe the same person." But if it's someone I know and there's just a little bit of a difference, I actually feel very turned off by it, 'cause I'm like-
- SPSpeaker
Mm
- SPSpeaker
... "This is not a real person." So in that case, how do you know what you're generating is good? Is it mostly by user feedback, like, "I love this," or is it something else?
- NBNicole Brichtova
You look at faces you know.
- SPSpeaker
[laughs]
- SPSpeaker
Yeah.
- NBNicole Brichtova
And-
- SPSpeaker
But that's with your small sample size, right?
- NBNicole Brichtova
No, no, no, no, really.
- 24:10 – 27:20
2D vs 3D: The Debate Over World Models
- SPSpeaker
Really, it's very hard to know how good the character consistency of a model is. Is it good enough? Is it not good enough? I think there's still a lot of improvement we can make on character consistency, but for some use cases, we got to a point, and that's... You know, we weren't the first edit model by any means, but I think that-
- SPSpeaker
Right
- SPSpeaker
... like, once the quality gets above a certain level for character consistency, it can kinda just take off-
- SPSpeaker
Yeah
- SPSpeaker
... because it becomes useful for so much more. And I think as it gets better, it'll be useful for even more things, too.
- NBNicole Brichtova
Yeah.
- SPSpeaker
So.
- SPSpeaker
I think one of the really interesting things we're seeing across a bunch of modalities, of which image editing and generation is obviously one, is... I think the arenas and benchmarks are awesome, but especially when you have multidimensional things like image and video, it's very hard, as all of the models get better and better, to condense every quality of a model into one judgment. So, say you swap a character into an image and you change the style of the image. Maybe one model did the character swap and consistency much better and the other did the style much better. How do you say which output is better? It probably comes down to what the person cares most about and what they wanna use it for. Are there certain characteristics of the model that you value more than others when making those trade-offs, when deciding which version of the model to deploy or what to really focus on during training?
- SPSpeaker
Um, yes, there are. One of the things I like about this space is that there is no right answer, so actually quite a lot of, I don't know if it's taste, but preference goes into the models, and I think you can see the difference in preferences of the different research labs in the models they release.
- SPSpeaker
Mm.
- SPSpeaker
So when we're balancing two things, a lot of it comes down to, "Oh, well, I just like this look better," or, "This feature is more important to us."
- SPSpeaker
I'd imagine it's hard for you guys too, 'cause you have so many users, right? Google, being in the Gemini app, everyone in the world can use that, versus many other AI companies who just think, "We're only going for the professional creatives," or, "We're only going for the consumer meme makers." You have the unique and exciting but challenging task that literally anyone in the world could use this. How do you decide what everyone would want?
- NBNicole Brichtova
Yeah, and it is... Sometimes we do make these trade-offs. We do have a set of things that are sort of like super high priority that we don't want to regress on, right? So now because character consistency was so awesome and so many people are using it, we don't want our next models to get worse on that dimension, right? So we pay a lot of attention to it. We care a lot about images looking photorealistic when you want photos, and this is important. One, I think we'd all prefer that style. [laughs] Two, um, you know, for advertising use cases for example, like a lot of it is kind of photorealistic images of products and people, and so we wanna make sure that we can kinda do that. And then sometimes there are just things that like will kind of fall by the wayside. So for this first release, the model's not as good at text rendering as we would like it to be, and that's something that we want to fix in the future. But it was
- 27:20 – 31:10
The Challenge of Taste, Preference & Artistic Style
- NBNicole Brichtova
kind of one of those things where we looked at, okay, the model's good at X, Y, Z. It's not as good at this, but we still think it's okay to release, and it will still be an exciting thing for people to play with.
- SPSpeaker
So if, if you look in the past, right, we, we, we had for previous model generations, a lot of things we did with like sidecar models like ControlNet or something like that-
- SPSpeaker
Mm
- SPSpeaker
... where we, we basically figured out a way to provide structured data to the model to achieve a particular result. It seems like with these newer models that has taken a step back, just because they're so incredibly good at just prompting or, you know, giving a reference image and picking things up from there. Where will this go long term? Do you think this will come back to some degree? Um, you know, like, I mean, from the creator's perspective, right, having, I don't know, open-pose information, I can get a pose exactly right, right? For multiple characters, this seems very, very tempting, right? Or to rephrase it a little bit, uh, does the bitter lesson hold here, [laughs] at the end of the day, everything's just one big model and you just throw things in, or is there a little structure we can, we can offer to make this, uh, better?
- SPSpeaker
Um, I mean, I think that there will be, there'll always be users that want control that the model doesn't give you out of the box, but I think we, we tried to make it so that, um, you know, 'cause really what, really what an artist wants when they wanna do something is they want the intent to be understood.
- SPSpeaker
Yeah.
- SPSpeaker
And I think that, that these, um, AI models are getting better at understanding the intent of users. So often when you ask text queries now, the, the model gets what you're going for.
- SPSpeaker
Yeah.
- SPSpeaker
So, you know, I- in that sense, I think we can, we can get pretty far with understanding the intent of our users, and, um, and maybe some of that is personalization, like we need to know information about what you're trying to do or what you've done in the past. But I think once you can understand the intent, then you can, you can generally do the, the type of edit. Like, is this like a very structure preserving edit, or is this like a freeform kind of... Like, we can learn these, uh, these kinds of effects, I think. Um, but still of course there's one person who's gonna really care about every pixel, and like, "This, this thing needs to be slightly to the left and a little bit more blue," and like those people will use existing tools to, to do that.
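The contrast being discussed, explicit structured conditioning versus intent inferred from a prompt and a reference image, can be sketched as two hypothetical request payloads. Neither shape is a real Gemini or ControlNet API; every field name here is invented purely for illustration.

```python
# Old style: the artist supplies explicit structure (pose keypoints) that a
# sidecar model like ControlNet consumes to constrain generation.
structured_request = {
    "prompt": "two dancers mid-leap, studio lighting",
    "conditioning": {
        "type": "openpose_keypoints",
        # (x, y) keypoints per character, normalized to [0, 1]
        "characters": [
            [(0.30, 0.20), (0.28, 0.35), (0.33, 0.50)],
            [(0.70, 0.25), (0.72, 0.40), (0.68, 0.55)],
        ],
    },
}

# Newer style: structure is implied by a reference image plus natural
# language; the model is trusted to infer the intent.
prompt_request = {
    "prompt": "match the pose of the dancers in the reference image",
    "reference_images": ["dancers_reference.png"],
}

for req in (structured_request, prompt_request):
    print(sorted(req.keys()))
```

The trade-off in the conversation is exactly the gap between these two shapes: the first gives pixel-level control at the cost of the user extracting the structure themselves; the second leans on the model's understanding of intent.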
- SPSpeaker
[laughs]
- SPSpeaker
I mean, I, I, I think it's like, you know, I want an image with 26 people spelling out every letter of the alphabet or something like that.
- NBNicole Brichtova
Mm-hmm.
- SPSpeaker
Right? That's sort of the thing where I think we're still quite a bit away from getting that right, uh, you know, on the first try. On the other hand, with pose information, it could potentially get there.
- NBNicole Brichtova
Mm-hmm.
- SPSpeaker
Right.
- NBNicole Brichtova
But then the, then the question I guess is like, do you really want to be the one who's like extracting the pose and providing that as information? Or-
- SPSpeaker
That's a very good question. [laughs]
- NBNicole Brichtova
Or, or, or, or do you just want to provide some reference image and say like, "This is actually what I want." Like, "Model, model, model-"
- SPSpeaker
I see. Yeah. And I could be like-
- NBNicole Brichtova
"... model, go figure this out," right? [laughs]
- SPSpeaker
There, there are 26 people-
- NBNicole Brichtova
Yes. Yes
- SPSpeaker
... reading every letter of the alphabet now-
- NBNicole Brichtova
Yes. Yes
- SPSpeaker
... in a different, different style. Fair enough.
- NBNicole Brichtova
Yeah.
- SPSpeaker
Yeah. I think in that, in that case, I wouldn't spend a ton of time building a custom, um, interface for making this, this picture of 26 people. Seems like the kind of thing that we can, we can solve.
- SPSpeaker
So just transfer.
- SPSpeaker
Do you think the representation of what the AI images are will change? So the reason why I ask the question is that as artists there's different formats we play with. There's the SVGs. We have anchor points-
- SPSpeaker
Yeah
- 31:10 – 35:00
The Japan Phenomenon & Creative Communities
- NBNicole Brichtova
Mm-hmm.
- SPSpeaker
Um, and so, you know, in, in cases where you need to have your font, or you wanna change the text, or you wanna move things around just like with control points, um, it could be useful to have, um, kind of mixed generation, which consists of pixels and, um, SVGs and other, other forms. Um, but if we can do it all, if we can-- if, if the multi-turn interaction is enough, then I think you can get pretty far with pixels. Um, I will say that one of the things that's exciting about these, um, these models that have native capabilities is that you now have a model that can generate code and it can generate images.
- NBNicole Brichtova
Mm.
- SPSpeaker
So there's a lot of interesting things that come in that intersection, right? Like maybe I want it to write some code and then make, make some, some things be rasterized, some things be parametric.
- NBNicole Brichtova
Yeah.
- SPSpeaker
Like stick it all together, train it together. Like this would be very cool.
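The mixed parametric-plus-raster output described here can be sketched as a plain SVG that embeds model-generated pixels while keeping the text layer editable. This is a minimal illustrative sketch, not any actual model output format; `mixed_output` and its arguments are invented for the example.

```python
import base64

def mixed_output(raster_png: bytes, label: str) -> str:
    """Embed raster pixels inside an SVG alongside editable vector text."""
    b64 = base64.b64encode(raster_png).decode("ascii")
    return (
        '<svg xmlns="http://www.w3.org/2000/svg" width="200" height="120">'
        # The image region is rasterized: opaque pixels from the model.
        f'<image href="data:image/png;base64,{b64}" width="200" height="100"/>'
        # The text stays parametric: a user can re-edit the wording or font
        # without regenerating any pixels.
        f'<text x="10" y="115" font-family="sans-serif">{label}</text>'
        "</svg>"
    )

svg = mixed_output(b"\x89PNG...", "caption")  # placeholder raster bytes
print(svg.startswith("<svg"))  # → True
```

A model that can emit both code and images could in principle produce documents like this directly, which is the intersection being pointed at.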
- NBNicole Brichtova
That's such a good point, 'cause I did see a tweet of someone asking Claude Sonnet to replicate an image on an Excel sheet where every cell is a pixel. [laughs]
- SPSpeaker
[laughs]
- NBNicole Brichtova
Which is like a very fun exercise. It was like a coding model and like-
- SPSpeaker
Yeah
- NBNicole Brichtova
... it doesn't really know anything about, you know, images.
- SPSpeaker
Yeah.
- NBNicole Brichtova
Yet it worked.
- SPSpeaker
Yeah.
- NBNicole Brichtova
So-
- SPSpeaker
There's the classic pelican riding a bicycle test.
- NBNicole Brichtova
Right, yeah. [laughs]
- SPSpeaker
Yeah. Famous.
- NBNicole Brichtova
Yeah, totally.
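The spreadsheet-as-pixels trick mentioned above can be sketched in a few lines: map a tiny RGB image onto cell fill colors. The 2×2 image and the plain-dict output are illustrative; a real version would write the fills into an .xlsx with a library such as openpyxl, which is deliberately not used here to keep the sketch dependency-free.

```python
def column_letter(index: int) -> str:
    """0 -> A, 1 -> B, ... 26 -> AA (standard spreadsheet column naming)."""
    letters = ""
    index += 1
    while index:
        index, rem = divmod(index - 1, 26)
        letters = chr(ord("A") + rem) + letters
    return letters

def image_to_cell_fills(pixels):
    """pixels: list of rows of (r, g, b) tuples. Returns {cell: '#RRGGBB'}."""
    fills = {}
    for row_idx, row in enumerate(pixels, start=1):
        for col_idx, (r, g, b) in enumerate(row):
            fills[f"{column_letter(col_idx)}{row_idx}"] = f"#{r:02X}{g:02X}{b:02X}"
    return fills

tiny = [[(255, 0, 0), (0, 255, 0)],
        [(0, 0, 255), (255, 255, 255)]]
print(image_to_cell_fills(tiny))
# → {'A1': '#FF0000', 'B1': '#00FF00', 'A2': '#0000FF', 'B2': '#FFFFFF'}
```

The point of the anecdote holds either way: a coding model with no pixel decoder can still "render" an image by reasoning its way to this kind of mapping.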
- SPSpeaker
I have one on, on model, like on interfaces if that's okay. I don't... Sorry if I'm bringing up too much product stuff, guys. I'm just very curious on, on the product front. Like, um, I guess I'm curious how you think about like owning the interface where people are editing or generating images with Nano Banana versus really just w- wanting a ton of people to use the model for different things in the API. Like we, we've talked about so many different use cases like ads, you know, education, um, design, uh, like architecture. Each of those things could be-- there could be a standalone product built on top of Nano Banana that prompts the model in the right way or allows certain types of inputs or whatever. Is your guys' vision like that the kind of the product in the Gemini app is like a playground for people to explore, and then developers will build the individual products that are used for certain use cases? Or is that something you're also kind of interested in owning?
- NBNicole Brichtova
I think it's a little bit of everything. Um, so I definitely think that the Gemini app is an entry point for people to explore. And the nice thing about Nano Banana is I think it shows that fun is kind of a gateway to utility, where, you know, people come to make a figurine image of themselves, but then they stay because it helps them with their math homework, or it helps them write something, right? And, and so I think that's a really powerful kind of transition point. Um, there's definitely interfaces that we're interested in building and exploring as a company. And so, um, you know, you may have seen Flow from Josh's team in Labs, that's, that's really trying to rethink like what's the tool for AI filmmakers, right? And for AI filmmakers, image is actually a big part of the iteration journey, right? Because video creation's expensive. A lot of people kind of think in frames, um, when they initially start creating, and a lot of them even start in the LLM space for like brainstorming and thinking about what they wanna create in the first place. Um, and so there's definitely kind of plays that we have in that space of just us trying to think about, like what does this look like? Um, we have the advantage of kind of sitting close to the models and the interfaces so we can build that in a tight coupling. Um, and then there's definitely the, you know, we're probably not going to go build software for an architecture firm. Uh, my dad is an architect, and he would probably love that.
- SPSpeaker
[laughs]
- NBNicole Brichtova
Um, but I don't think that's something that we will do, but somebody should go and do that. Um, and that's why it's exciting because we do have the developer business, and we have the enterprise business, and so people can go use these models and then figure out like, what's the next generation workflow for like this specific audience so that it can help them solve a problem. So I, I think the answer is kind of like yes, [laughs] all three.
- SPSpeaker
Yeah. Yeah. I, I brought that up. I don't know if you guys have been following the reception of Nano Banana in Japan,
- 35:00 – 41:00
From Images to Video: The Next Frontier
- SPSpeaker
but, um, um, I'm sure you've had... It's, it's been insane, and it's so funny. Like I... Now half of my X feed is these really heavy Nano Banana users in Japan who have created like Chrome extensions called-- There's one called like Easy Banana that's specifically for using Nano Banana for like manga generation and specific types of anime and things like that. And like they go super deep into basically prompting the model for you and storing the outputs in various places, um, using obviously your, your underlying model to generate these like amazing anime that you would never guess were AI generated because like the level of, of precision and consistency and that sort of thing is just beyond what I've seen any single model be able to do today.
- SPSpeaker
I guess, um, what are some, like to Justine's point, what are some force multipliers that you guys have seen in the model? So what I mean by this is, for example, if you unlock character consistency, you can generate different frames, and then you can make a video, and then you can make a movie, right? Um, so these are the things that if you get it right and get it really well, there are so many more downstream tasks that can derive from it. Um, just curious, like how do you think about what are the force multipliers that you want to unlock? So the next-
- SPSpeaker
What's the next big one?
- SPSpeaker
What's the next-
- SPSpeaker
Yeah
- SPSpeaker
... yeah, big wave of people who can just use Nano Banana as the base model for all the downstream tasks?
- NBNicole Brichtova
So I think one, one current one actually is also the latency point, right?
- SPSpeaker
Mm-hmm.
- NBNicole Brichtova
'Cause I think, 'cause I think it's also just like, it makes it really fun to iterate with these models when it just takes 10 seconds to generate-
- SPSpeaker
Yeah
- NBNicole Brichtova
... the next frame, right? If you had to sit there and wait for two minutes, like you would probably just give up-
- SPSpeaker
Different experience
- NBNicole Brichtova
... and leave.
- SPSpeaker
Yeah.
- NBNicole Brichtova
Very different experience. So I think that's one, just like there has to be some quality bar because if it's just fast and the quality isn't there, then it also doesn't matter, right? Like you have to hit a quality bar, and then, um, then speed becomes a force multiplier. I think the general idea of just like visualizing information, to your education point from earlier, is sort of another one, right?
- SPSpeaker
Mm-hmm.
- NBNicole Brichtova
And that needs good text, it needs factuality, right? Because if you're gonna start making kind of visual explainers about something, um, it, it looks nice, but it also needs to be accurate.
- SPSpeaker
Right.
- NBNicole Brichtova
And so, and so I think that's probably kind of the next level where at some point then you could also just have a personalized textbook to you.
- SPSpeaker
Yeah.
- NBNicole Brichtova
Right? Where it's not just the text that's different, but it's also-
- SPSpeaker
Have you read the book-
- NBNicole Brichtova
... all the visuals. Yeah
- SPSpeaker
... The Diamond Age?
- SPSpeaker
Yes.
- SPSpeaker
That was basically-
- NBNicole Brichtova
Yeah.
- SPSpeaker
Yeah.
- NBNicole Brichtova
Yeah. Basically. Um, and then it should also internationalize really well, right?
- SPSpeaker
Mm.
- 41:00 – 47:30
Working With Artists and Designing With Intent
- NBNicole Brichtova
on the team who are just like very creative. Um, we have a team, um, who just works really closely with us on models that we're developing, and then they just like push the boundary. They'll do like crazy things with the models and-
- SPSpeaker
What's the most surprising thing you've seen here? [laughs] Like, I didn't know-
- SPSpeaker
What have you heard about?
- SPSpeaker
... our model can do this. Yeah.
- NBNicole Brichtova
I- this is even just kind of like simple things where people have been doing like texture transfer. Like they will take-
- SPSpeaker
Texture?
- NBNicole Brichtova
Yeah, like you take a portrait of a person and then you're like, "What would it look like but if it had the texture of this piece of wood?" And I'm like, I would've never-
- SPSpeaker
Wow
- NBNicole Brichtova
... I would've never thought of this being a use case because my brain just doesn't work that way. Um, but people like kind of just push the boundaries of what you're, what you can do with these things.
- SPSpeaker
That is an interesting-
- SPSpeaker
Yeah
- SPSpeaker
... uh, example of the world knowledge 'cause texture technically is 3D 'cause there's like-
- NBNicole Brichtova
Mm-hmm
- SPSpeaker
... the whole 3D-
- NBNicole Brichtova
Mm-hmm
- SPSpeaker
... aspect of it. There's a light and shadow of it, but this is a 2D transfer. Yeah, so that's very cool.
- SPSpeaker
I think for me, the, the thing I'm most excited by and maybe most impressed by is, um, are the, the use cases that test the reasoning abilities of the models.
- SPSpeaker
Mm-hmm.
- SPSpeaker
So, um, some people on our team figured out you could like give geometry problems to the model and like ask it to kind of, you know, solve for X here or fill in this missing thing or like present this, this from a slightly different, like a different view.
- SPSpeaker
Mm.
- SPSpeaker
And like these types of, um, of things that really require world knowledge and the reasoning ability of like a state-of-the-art language model are the things that are making me really go, "Wow, that's amazing," or, "I didn't think we would be able to do that."
- SPSpeaker
Can it, uh, generate compile code on a blackboard yet and [laughs] like if I take a picture of my, I don't know, like code-
- SPSpeaker
Yeah
- SPSpeaker
... on the laptop, would it know if it compiles on the image model?
- SPSpeaker
Um, I've, I've seen examples where people give it like a, an image of HTML code and have the model render the, the webpage and it can-
- SPSpeaker
Wow
- SPSpeaker
... it can do that.
- SPSpeaker
That is very cool.
- SPSpeaker
The coolest example I saw, so I came from academia, so I spent a lot of time writing papers and making figures. Um, one of our colleagues took a picture of one of the result figures from one of their papers, with a method that could do a bunch of different, um, types of applications in the paper, erased the results so you have just the inputs, and asked the model to solve all of these in picture form, in a figure of a paper.
- SPSpeaker
Mm.
- 47:30 – 53:50
The Next Era of Image Models
- SPSpeaker
do this. And I think-
- SPSpeaker
We still need artists.
- SPSpeaker
We still need artists, and I think artists will be able to also recognize when, when people have actually like put a lot of control and intent into it.
- SPSpeaker
I will still not be an artist, basically is what I'm saying. [laughs]
- NBNicole Brichtova
[laughs]
- SPSpeaker
Yeah.
- NBNicole Brichtova
Maybe you'll get... But, but it, it i- it is, there's a lot of craft and there's a lot of taste, right?
- SPSpeaker
Mm-hmm.
- NBNicole Brichtova
That you accumulate-
- SPSpeaker
Absolutely, yeah
- NBNicole Brichtova
... sometimes over decades, right?
- SPSpeaker
Yeah.
- NBNicole Brichtova
And I don't think these models really have taste, right? And so I think a lot of, like, a lot of the reactions that you mentioned maybe also come from that. And so we do work with a lot of artists across all the modalities that we work with, um, so image, video, um, music, because we really care about like building the technology step by step with them and trying to figure out... They really help us kind of like push the boundary of what's possible.
- SPSpeaker
Yeah, yeah.
- NBNicole Brichtova
A lot of people are really excited, but they, they really do bring a lot of their knowledge and expertise and kind of like 30 years of design knowledge. We just worked with, um, Ross Lovegrove, um, on fine-tuning a model on his sketches, so that he can then create something new-
- SPSpeaker
Mm-hmm
- NBNicole Brichtova
... out of that, and then we design an actual physical chair that we like have a prototype of. Um, and so there, there's a lot of people who want to kind of bring the expertise that they've built and kind of like the rich language that they use to describe their work, and, and have that dialogue with the model so that they can push their work kind of to the frontier. And it is... You know, it doesn't happen in like one prompt and two minutes. Um, it, it, it does require a lot of that kind of taste and human creation and, um, and craft that goes into building something that actually then, you know, becomes art.
- SPSpeaker
At the end, it's still a tool that requires the human behind it to, to express the feelings and the emotions and the story.
- NBNicole Brichtova
Yeah.
- SPSpeaker
And everything.
- NBNicole Brichtova
Yeah, absolutely.
- SPSpeaker
Ab- absolutely.
- NBNicole Brichtova
And that's what resonates with you when you probably look at it, right? Um-
- SPSpeaker
Yeah, exactly
- NBNicole Brichtova
... you will, you will have a different reaction when you know that there's a human behind it who has spent 30 years thinking about something, and then poured that into a piece of art.
- SPSpeaker
Yeah. I think there's also a bit of this, um, phenomenon that like most people who consume creative content, and maybe even ones that are, that care a lot about it, like they, they don't know what they're gonna like next. You need someone who has a vision and can do something that's interesting and different.
- SPSpeaker
That's right.
- SPSpeaker
And then you show it to people, and like, "Oh, wow, that's amazing." But like they wouldn't necessarily like think of that on their own.
- NBNicole Brichtova
Right.
- SPSpeaker
So when we're, you know, when we're optimizing these models, like one thing we could do is we could optimize for like the, the average preference of everybody.
- 53:50 – 54:00
Closing Thoughts
- NBNicole Brichtova
[outro music]
Episode duration: 54:11
Transcript of episode I8VUN141MjU