OpenAI: Inside image generation’s Renaissance moment — the OpenAI Podcast Ep. 19
- 0:00 – 0:36
Intro
- AMAndrew Mayne
Hello, I'm Andrew Mayne, and this is the OpenAI Podcast. On today's episode, we're talking about Images 2.0 with researcher Kenji Hata and product lead Adele Li. They'll discuss why the new model represents such a major leap forward, the evaluations that mattered most during development, and what people are creating with it now that it's widely available.
- ALAdele Li
[instrumental music] If DALL-E was the Stone Ages, Image Gen 2.0 is the Renaissance. It's not only great artistically and aesthetically, but it also incorporates, you know, science, art, architecture, all in one image.
- KHKenji Hata
We looked at it and we're like, "All right, this is better than Images 1." [laughs]
- 0:36 – 2:27
How Adele and Kenji came to work on Images
- AMAndrew Mayne
A- Adele, tell me a little bit about how you became a product manager here.
- ALAdele Li
So I joined OpenAI a little over two years ago, and before OpenAI, I was an investor my entire career.
- AMAndrew Mayne
Oh, wow.
- ALAdele Li
So I was in private equity and spent three years at Ve- Redpoint Ventures investing in AI and software companies. And when I first joined OpenAI, it was for a completely different role. I was thinking about, how do we build out our data and compute infrastructure? And over time, made my way over to the product side, and for the last six months have been working on Image Gen.
- AMAndrew Mayne
It, it's, it's interesting how you saw yourself going from one role then finding yourself into this space here, which is kind of cool, you know, to think about the idea that you have this sort of, you know, ability to be useful in different ways.
- ALAdele Li
Absolutely, and I think the role of a product manager is just to do the job that needs to be done-
- AMAndrew Mayne
Mm-hmm
- ALAdele Li
... no matter what it is. And for Image Gen in particular, it's been really awesome to flex a lot of different muscles when it comes to, uh, building products, working with researchers like Kenji, but also thinking about, like, what is the gap in the market today that we wanna fill, and what is the opportunity that we wanna grasp here? It's not the same market that it was a year ago-
- AMAndrew Mayne
Mm-hmm
- ALAdele Li
... when we first released Image Gen 1.0. Now it's a very different landscape. There are multiple image generation makers out there. Um, and ChatGPT is a very different company and, uh, product itself too, and so, um, really thinking about the evolution of Image Gen and its role within ChatGPT has been really, really exciting to me.
- AMAndrew Mayne
Kenji, how did you end up working on Images?
- KHKenji Hata
Uh, actually, like, when I first started at OpenAI, I also started about two years ago.
- AMAndrew Mayne
Mm-hmm.
- KHKenji Hata
Um, I was working on, like, some random audio project initially. Just-
- AMAndrew Mayne
Mm-hmm
- KHKenji Hata
... this was my first project, and then at the time, I just found my way just working on helping them work on, uh, Images 1.0, the-
- AMAndrew Mayne
Mm-hmm
- KHKenji Hata
... prior to the launch. Um, and so gradually I moved more and more onto the project, and then I became full-time on it, basically.
- 2:27 – 5:25
Images 2.0 launch reception
- AMAndrew Mayne
What has the reception been like right now for the model?
- ALAdele Li
In the last two weeks since we launched the model, usage is up more than 50%. More than 1.5 billion images are generated every week on ChatGPT.
- AMAndrew Mayne
Wow.
- ALAdele Li
And we've seen viral trends emerge across the world, um, all the way from trends in Asia for color analysis and stickers to US, where crayon and scribble are going viral. Um, but also a lot of people exploring emergent use cases, and I think it shows the dynamic range of the model but also how people are able to visually grasp the advancement of the model almost immediately. I think the visual, uh, communication and reaction that we've seen from our users, for them to say, "Hey, this is the best, highest fidelity, highest quality and aesthetic model that we've seen," has been really awesome.
- AMAndrew Mayne
This felt like a really big shift, almost worthy of maybe not even being, uh, uh, Images 2, but almost, like, just a new paradigm because just the capabilities are through the roof. What made that possible?
- ALAdele Li
When we started working on this project, I think we sat down and we discussed, what is the step change of capability and use cases that we wanted to build towards? Um, and we believe that image generation has the ability to do so much more than-
- AMAndrew Mayne
Mm-hmm
- ALAdele Li
... it, what it does today. You could distill every single output, uh, or visual content that you see today into an image, and so that was the mandate that we sought out to improve. Um, and f- with this 2.0 model, we've improved on various different dimensions. Um, one is text rendering.
- AMAndrew Mayne
Mm-hmm.
- ALAdele Li
The ability for text on a page is so much better fidelity. The language and words actually make sense, and they're actual words. Um, the second of all is multilingual.
- AMAndrew Mayne
Mm-hmm.
- ALAdele Li
So we've really focused on making this model work in various different languages, and we're already seeing that people across the world in Asia and Europe are really resonating with these advancements. Um, the third is photorealism. I think we really saw a lot of feedback from our previous models that, uh, the output wasn't very realistic or altered their face or their bodies, and so one of our mandates was, how do we actually make the image feel like more like yourself? And so all the things that you think that the model knows, it does because it has imbued the knowledge of the world into, um, its conscience and is able to visually communicate that back to you as a user. And so putting that all together, I think we really get a state-of-the-art image generation model that is the best aesthetic model out there on the market right now, um, that really represents a new paradigm for image generation, um, which is a huge part of, I think, AI progress at large, uh, that, that we have an opportunity to work on here.
- KHKenji Hata
We often listen back, uh, listen to feedback on social media too.
- AMAndrew Mayne
Mm-hmm.
- KHKenji Hata
So we kind of just take all these things and basically are just aware of it and try to make sure that they're mitigated or completely fixed in some cases in, in the next iteration.
- 5:25 – 9:34
Productivity use cases and 360 images
- AMAndrew Mayne
What kind of use cases are you seeing? What are you seeing people do with this now?
- KHKenji Hata
I think one that's particularly close to, like, the research team as a general is, like, infographics, text. Um, I think text in images is, like, so much better nowadays.
- AMAndrew Mayne
Mm-hmm.
- KHKenji Hata
So, um, I think it just opens up a lot more productive use cases, and at, from, like, the research side, we th- kind of think, you know, image generation used to al- always be about fun and maybe, like, unproductive things.
- AMAndrew Mayne
Mm-hmm.
- KHKenji Hata
But now we're really seeing steps forward into productivity and, uh, image generation for any type of use case that you can imagine it for.
- AMAndrew Mayne
So you mentioned text. I remember the early models. Uh, no disrespect to chimpanzees, but getting it to spell, like, OpenAI even looked like a chimp did it. And then now I'm looking at pages of text and finely detailed stuff, and I know that as models get smarter, variable binding, the ability to put things next to each other improves, but this was just a big improvement.
- KHKenji Hata
Yeah. But I don't think it's, like, completely unexpected.
- AMAndrew Mayne
Mm.
- KHKenji Hata
I think you, you see a lot of growth in between-
- AMAndrew Mayne
Mm
- KHKenji Hata
... uh, well, first you see between DALL-E 3 and, you know, GPT Images 1.
- AMAndrew Mayne
Mm-hmm.
- KHKenji Hata
There was, um, if you ask for a grid of random objects, you, you go from maybe, like, five to eight in DALL-E 3 to maybe around 16 in Images 1, and then with 1.5, we went to about 25 to 36-
- AMAndrew Mayne
Mm-hmm
- KHKenji Hata
... um, consistently. And I think now we could probably do over 100, I think.
- AMAndrew Mayne
Wow.
- KHKenji Hata
This is, like, a test that we might do internally is just, um, we just ask ChatGPT gen- give me a list of 100 random objects, right?
- AMAndrew Mayne
Yeah.
- KHKenji Hata
And then we just send that to our image generator and s- see how, how many are correct. And usually, you know, it'll get almost all 100 correct. Uh, and that's... But you see the, the constant growth-
- AMAndrew Mayne
Mm-hmm
- KHKenji Hata
... over time. Um, so I don't think it's, like, completely unexpected. It's just a steady pace.
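The grid test described above (ask for a list of random objects, render them as a grid, count how many come out right) can be sketched roughly as follows. This is an illustration under stated assumptions, not the team's internal tooling: `build_grid_prompt` and `score_grid` are hypothetical names, the prompt wording is invented, and the `observed` labels are assumed to come from a human grader or some other labeling step that the episode does not describe.

```python
import math
import re


def build_grid_prompt(objects):
    """Build a text prompt asking for one grid cell per object (hypothetical wording)."""
    if not objects:
        raise ValueError("need at least one object")
    # Smallest square grid that fits every object, e.g. 100 objects -> 10x10.
    side = math.isqrt(len(objects) - 1) + 1
    listing = ", ".join(objects)
    return (
        f"A {side}x{side} grid of distinct objects, one per cell, "
        f"in this order: {listing}"
    )


def _normalize(name):
    """Lowercase and collapse whitespace so 'Lamp ' matches 'lamp'."""
    return re.sub(r"\s+", " ", name.strip().lower())


def score_grid(requested, observed):
    """Return (fraction of requested objects found, list of missing objects)."""
    if not requested:
        raise ValueError("need at least one requested object")
    wanted = [_normalize(o) for o in requested]
    seen = {_normalize(o) for o in observed}
    missing = [o for o in wanted if o not in seen]
    return (len(wanted) - len(missing)) / len(wanted), missing


if __name__ == "__main__":
    requested = ["apple", "bicycle", "lamp", "teapot"]
    observed = ["Apple", "lamp ", "teapot", "kettle"]  # labels read off the image
    accuracy, missing = score_grid(requested, observed)
    print(build_grid_prompt(requested))
    print(accuracy, missing)
```

Tracking the score as the list grows from 16 to 36 to 100 objects would give the kind of steady curve described here, with the caveat that exact string matching is a crude stand-in for judging whether an object was drawn correctly.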
- AMAndrew Mayne
That was a test I used to use for, like, the really old models back with, like, Ada, Babbage, and Curie, like, list 100 science fiction books.
- KHKenji Hata
Yeah.
- AMAndrew Mayne
And then some of them would get, by the time it got to, like, 22, would just start repeating stuff-
- KHKenji Hata
Yeah, yeah
- AMAndrew Mayne
... as it was, the model reached the end of it. So we've seen stuff too, like 360, 360-degree panoramas. How did that happen?
- ALAdele Li
Yeah, that really came from the emerging capability of the model, which is the ability to render images in any aspect ratio.
- AMAndrew Mayne
Mm-hmm.
- ALAdele Li
We discovered that people were generating really long, amazing panoramics, you know, skinny bookmarks as well, and one of the cool capabilities with the model is that not only were you able to generate images in this panoramic aspect ratio, but you'd also render images in the style of 360.
- 9:34 – 10:51
Viral trends, authenticity, and imperfection
- AMAndrew Mayne
Funny too how one of the things that was trending was taking popular images or photos of people and then having the model make, like, kinda janky looking Microsoft Paint versions of them.
- ALAdele Li
Yes. Yeah.
- AMAndrew Mayne
And did you think that was something you would see, was that people are gonna use this [laughs] incredibly capable tool to then go make, you know, these silly looking things?
- ALAdele Li
Yeah. It's funny 'cause it takes a lot of intelligence to actually create something that is imperfect.
- AMAndrew Mayne
That's what I tell people all the time.
- ALAdele Li
Yeah. And it's definitely very interesting in the viral trends that we're seeing online right now. Um, one thing that I think people are really striving for is authenticity-
- AMAndrew Mayne
Mm
- ALAdele Li
... imperfection, nostalgia. We're seeing that in the MS Paint prompt-
- AMAndrew Mayne
Mm
- ALAdele Li
... crayons, um, all different kinds of generations that people are creating, and that really feels like the theme of consumers, is they wanna interact with AI in a very authentic, imperfect way. They wanna show their imperfections and use AI to help make them look good, but also show a more fun and goofy side of themselves, and I think that's self-expression via AI is something that we're really excited about. And, you know, I think it's really part of our mission as a company to make it easier for people to learn more and distribute that intelligence, but also, um, letting them express a version of themselves that maybe wasn't possible before.
- 10:51 – 14:06
Training breakthroughs and photorealism
- AMAndrew Mayne
Kenji, was there a moment with this model where you're saying to yourself, "Wow, I think, think this is ready to go"?
- KHKenji Hata
You know, as it's training, we take a checkpoint, and then, like-
- AMAndrew Mayne
Mm
- KHKenji Hata
... we just sample from it, right? And just see, okay, how good is this thing? And I think, like, we just sampled a checkpoint, a model, uh, an image, and we looked at it and we're like, "All right, this is better than Images 1." [laughs]
- AMAndrew Mayne
[laughs]
- KHKenji Hata
Like, we were just like, "Okay."
- AMAndrew Mayne
I remember watching the iteration of one of the early versions of DALL-E-
- KHKenji Hata
Yeah
- AMAndrew Mayne
... and how at first it was sort of the wispy, sort of weird, sort of the tendril sort of thing, and talking to one of the researchers like, "Is, is that gonna go away?" He's like, "I think two, probably two runs away from that." And then [fingers snap] just like that. The ability to predict that was amazing to me, and all of a sudden everything got crisp and clear.
- KHKenji Hata
Yeah.
- AMAndrew Mayne
And then also, like, looking at, y- you know, years ago I'd played with, like, you know, GANs and, like, doing those things. You'd, you have to squint and say, "I think it's a pickup truck," or something like that.
- KHKenji Hata
Yeah.
- AMAndrew Mayne
So it's interesting what you see as you say, "Okay, this just all of a sudden got much better." And-
- KHKenji Hata
Yeah. I mean, it was just very obvious. You just, you just take the early checkpoint, you just sample an image from it, and then you just sample an image from, uh, you know, Images 1, and you just look at the two, and you're just... There's just, there's-
- AMAndrew Mayne
Yeah. Why do I like this garbage? This is-
- KHKenji Hata
I forgot what the image was. It might have just been, like, a picture of, like, a woman at a sea- on the seaside, like-
- AMAndrew Mayne
Yeah
- KHKenji Hata
... you know, overlooking a seaside. We just looked at it and we're like, "All right." There's like no, no question
- AMAndrew Mayne
Yeah, that was the big, the big-
- KHKenji Hata
[laughs]
- AMAndrew Mayne
... the big jump was the photorealism of going from something that looked, that was more of a, a glossy, idealized magazine cover to something that looked like a really good photograph. So help me understand, like besides just more compute, how did this happen? How did you get a model that's much better and also that doesn't take an hour to generate an image? The times are still... I, I remember in the DALL-E days-
- KHKenji Hata
Mm-hmm
- AMAndrew Mayne
... like we would literally have to, you know, "Tell, tell us what you want," and then an hour later it'd be on Instagram, to now these things are in ChatGPT, and it's faster. How is it getting both more intelligent and you're maintaining the same speeds?
- KHKenji Hata
I think we learned a lot, uh, in each release, like between 1 and 1.5, now 2.
- AMAndrew Mayne
Mm-hmm.
- KHKenji Hata
And so we take each, each of the learnings that we've made, and we've, you know... Like for example, speed, right?
- AMAndrew Mayne
Mm-hmm.
- KHKenji Hata
Um, you know, one of the things is like, "Oh, can we make the model more token efficient?" Or-
- AMAndrew Mayne
Mm-hmm
- KHKenji Hata
... or, or something like that. And, uh, you know, we did a lot of work to make it, to make it pu- produce very good images with less tokens.
- 14:06 – 22:16
Evals, prompting, and creative control
- AMAndrew Mayne
Do you have any personal favorite benchmark tests you like to do, things you say, "I wanna see it make an image of this"?
- ALAdele Li
I have an eval that I call the me, me, me eval.
- AMAndrew Mayne
Okay. [laughs]
- ALAdele Li
It's essentially 100 photos of myself-
- AMAndrew Mayne
[laughs]
- ALAdele Li
[laughs] ... and my friends and my family. Um, and I put everyone in goofy positions. I have about a card or a birthday, um, for every single person. Um, and I think it's a really great eval in the sense that, uh, you only know the people around your, you know, faces the best.
- KHKenji Hata
Mm-hmm.
- ALAdele Li
Um, you also want to create funny things with the model and thi- do things that are relevant.
- KHKenji Hata
Mm-hmm.
- ALAdele Li
And so one thing for me as the product manager, um, that I'm testing is not only is the raw capability of the model really great-
- KHKenji Hata
Mm-hmm
- ALAdele Li
... but also does ChatGPT understand what I want in that context?
- KHKenji Hata
Mm-hmm.
- ALAdele Li
You know, ChatGPT remembers, you know, that I have a brother, that I have a mom and dad, um, and what they like to do. And so does the model accurately know how to insert pieces of personalization in the moments that matter in the images? These are things that I'm testing for.
- AMAndrew Mayne
How about you?
- KHKenji Hata
Besides the grid one I mentioned earlier, that's probably the one I've used the most. For a while, I think Divya and I were doing a lot about photorealism. [laughs]
- AMAndrew Mayne
Yeah.
- KHKenji Hata
We were trying real hard to push on that. Um, uh, just basically, I know Divya's favorite one was, like a woman holding an or- a jug of orange juice. I don't know if you've seen that.
- ALAdele Li
Yeah. [laughs]
- KHKenji Hata
[laughs] There's like so many images of a woman holding a jug of orange juice. Um-
- ALAdele Li
Well, actually feel like the researchers had a more standard set of images like, than they like to-
- KHKenji Hata
I think so too
- ALAdele Li
... to lead on.
- AMAndrew Mayne
Yeah, and you get like the standard, can it do somebody writing with their left hand and a watch on their right hand and a clock-
- ALAdele Li
The clock
- AMAndrew Mayne
... showing this. I think the big, the big leap of the image is like probably 1 or 1.5 was like a half-full glass of wine.
- ALAdele Li
Hmm.
- KHKenji Hata
Or the wine glass full to the rim?
- AMAndrew Mayne
Yeah.
- KHKenji Hata
Yeah, yeah, yeah.
- 22:16 – 22:27
Creative agents and what comes next
- AMAndrew Mayne
How do you see the progression of this? This is great, but typically any time I talk to somebody at OpenAI about what they're working on, they're like, "Yeah, this is good, but..."
- ALAdele Li
I think we're still super early in exploring all the different use
- 22:27 – 28:08
Images + Codex
- ALAdele Li
cases that people are really trying to push the model with. Um, and so one of the things that we're really excited about, um, is what is that next, um, stage for Image Gen, um, which is to create the creative agent.
- AMAndrew Mayne
Mm.
- ALAdele Li
Ultimately, the agent that can work alongside you, be your creative assistant, um, and really understand how you work, what your preferences are, what is the output that you wanna get to, um, and build the product and model ecosystem that helps users kind of have a personal interior designer, personal architect, um, personal, you know, wedding planner, et cetera, all in one Image Gen.
- AMAndrew Mayne
I'll tell you another thing that was kind of amazing was like, um, I write books, and so, like, every now and then I have a book come out, I've gotta change my social media headers. And I just went, and I said, "Oh, find my book cover and write, you know, create a, a po- you know, create appropriate size social media header that I can put on X or Facebook or whatever." Like, let's see. First shot. First shot. Right aspect ratio, everything.
- KHKenji Hata
We basically did that from the start or trained the models-
- AMAndrew Mayne
Mm
- KHKenji Hata
... to be good at that from the start. I remember, like, I worked on the initial de-risks of-
- AMAndrew Mayne
Mm
- KHKenji Hata
... of ev- basically it could do any aspect ratio that you ask.
- AMAndrew Mayne
Yeah.
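To make the aspect-ratio point concrete, here is a small sketch of the arithmetic behind "the right aspect ratio" for a header image. It is only an illustration: `aspect_ratio` and `fit_to_ratio` are hypothetical helpers, and the 1500 by 500 pixel target is an example value chosen for the sketch, not a figure from the episode or a statement about what the model accepts.

```python
from math import gcd


def aspect_ratio(width, height):
    """Reduce pixel dimensions to the simplest width:height ratio."""
    if width <= 0 or height <= 0:
        raise ValueError("dimensions must be positive")
    divisor = gcd(width, height)
    return width // divisor, height // divisor


def fit_to_ratio(ratio_w, ratio_h, long_side):
    """Return (width, height) in pixels with the given ratio and longer side."""
    if ratio_w <= 0 or ratio_h <= 0 or long_side <= 0:
        raise ValueError("ratio and size must be positive")
    if ratio_w >= ratio_h:
        return long_side, round(long_side * ratio_h / ratio_w)
    return round(long_side * ratio_w / ratio_h), long_side


if __name__ == "__main__":
    # Example target only: a wide banner of 1500x500 pixels (assumed value).
    ratio = aspect_ratio(1500, 500)
    print(ratio)                       # simplest ratio for the banner
    print(fit_to_ratio(*ratio, 1536))  # same ratio at a different long side
```

A ratio like this is what a request for "a header for X" implicitly pins down, which is why getting it right on the first attempt without stating dimensions stood out.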
- ALAdele Li
Yeah, you can now, um, really just easily specify the outcome that you want.
- AMAndrew Mayne
Yeah.
- ALAdele Li
Like in the case of yourself, you're like, "I want promotional material."
- AMAndrew Mayne
Yeah.
- ALAdele Li
"I don't have an idea. I didn't specify exactly what I wanted." But the model was able to do the research and then give it to you in the style and aspect ratio that was relevant to you, and that's super powerful. We're already seeing this. Um, you know, you're, you're an author. I've talked to real estate agents who are using Image Gen to help them create listings for their apartments or stage their listings.
- AMAndrew Mayne
Mm-hmm.
- ALAdele Li
Um, YouTube creators have talked to me about using Image Gen for their thumbnails and-
- AMAndrew Mayne
Mm
- ALAdele Li
... promotional content. I've talked to top artists who wanna use Image Gen to connect with their fans, and I think the ability for all different kinds of professions to start to use Image Gen to help them with visual creation is super powerful, especially if you're working in a visual and a creative industry.
- AMAndrew Mayne
Mm-hmm.
- ALAdele Li
Image Gen is such a hack in your professional toolkit. I think it has to be a part of everyone's everyday workflow in the future.
- AMAndrew Mayne
This does feel like the... I think it feels like the first time where anything I can reasonably come up with, it does a pretty good job of it.
- ALAdele Li
We think it's a new paradigm for-
- AMAndrew Mayne
Yeah
- ALAdele Li
... image generation altogether. Like, if, you know, we said this in the launch video, if DALL-E was the Stone Ages, Image Gen 2.0 is the Renaissance.
- AMAndrew Mayne
Yeah.
- ALAdele Li
Um, and I think that is so true because the model, it's not only great artistically and aesthetically, but it also incorporates, you know, science, art-
- AMAndrew Mayne
Mm
- ALAdele Li
... architecture, all in one image together, and I think that composition, um, and knowledge that the model has just means that the outputs are so much more trustworthy, um, are more powerful-
- AMAndrew Mayne
Mm-hmm
- 28:08 – 29:21
Prompt tips
- ALAdele Li
with Image Gen.
- AMAndrew Mayne
Any, any parting prompt tips for people?
- ALAdele Li
Well, one of the things I would suggest people try is Image Gen Thinking.
- AMAndrew Mayne
Okay.
- ALAdele Li
So if you navigate to the Thinking or Pro models-
- AMAndrew Mayne
Mm-hmm
- ALAdele Li
... we have a more powerful version of Image Gen in that experience, and in that model, uh, you actually are able to search the web, analyze files, um, leverage tools under the hood, um, which then yields a better quality and higher composition photo. And the suggestion that I have for prompting that experience is be open-ended.
- AMAndrew Mayne
Mm-hmm.
- ALAdele Li
I think the model will go and do the exploration itself to understand and try to reason, um, and find information that matters. And I also think giving it a sense of an aesthetic is also super helpful. Um, using, grounding that in a style has been really, um, fruitful for a great result.
- AMAndrew Mayne
Good one. Good one.
- KHKenji Hata
I think just being very particular about the style or, like, what you like in general. Like, for me, I like minimalist infographics.
- AMAndrew Mayne
Mm-hmm.
- KHKenji Hata
Sometimes I think the model can be a little dense.
- AMAndrew Mayne
Mm.
- KHKenji Hata
And so I just... Maybe I'm just a simplistic kinda guy.
- AMAndrew Mayne
[laughs]
- KHKenji Hata
So I just like very th- very clean, a very clean look, so I like that.
- AMAndrew Mayne
Adele, Kenji, thank you very much.
Episode duration: 29:22
Transcript of episode bH2nP-aCFjk