Y Combinator: Building The World's Best Image Diffusion Model
EVERY SPOKEN WORD
60 min read · 11,828 words
- 0:00 – 1:07
Intro
- SDSuhail Doshi
I think we thought the product was going to be one way, and then we literally ripped it all up in a month and a half or so before release. We were sort of, like, lost in the jungle for a moment. (laughs) Like a bit of a panic.
- GTGarry Tan
There's a lot of unsolved problems, basically. I mean, the, you know, even this version of it, you know, people are gonna try it, and then they might be blown away by it, but like, the next one's gonna be even crazier.
- SDSuhail Doshi
To get to SOTA, you basically have to be maniacal about, like, every detail. There are going to be some people that train their models and they get cool text generation, but the kerning is off. Are you the kind of person that will care about the kerning being off? Or are you the kind of person that is okay with it, like, or you don't even notice it?
- GTGarry Tan
Welcome back to another episode of The Light Cone. I'm Garry. This is Jared, Harj, and Diana. And collectively, we have funded companies worth hundreds of billions of dollars, usually just with one or two people just starting out, and we're in the middle of this crazy AI revolution. And
- 1:07 – 1:47
What is Playground?
- GTGarry Tan
so, we thought we would invite our friend Suhail Doshi, founder and CEO of Playground, which is the state-of-the-art image generation model, with also a state-of-the-art user experience, and it just launched. So, how you feeling, Suhail?
- SDSuhail Doshi
Very under pressure right now.
- GTGarry Tan
(laughs)
- SPSpeaker
(laughs)
- SDSuhail Doshi
Uh, excited though.
- GTGarry Tan
That's good then-
- SDSuhail Doshi
Yeah.
- SPSpeaker
Yeah.
- GTGarry Tan
... so you're sort of like a startup founder.
- SDSuhail Doshi
Yes.
- SPSpeaker
Right. Yeah.
- GTGarry Tan
(laughs) Which is normal. Maybe the best way to start off is to, uh, look at some examples of the images that you were able to generate. Um, and this is stuff sort of right off the presses.
- SPSpeaker
(laughs)
- GTGarry Tan
So,
- 1:47 – 7:04
What Garry was able to make using Playground
- GTGarry Tan
uh, at Y Combinator, I, uh, also am one of the group partners, so I fund a number of companies, uh, every batch. I funded about 15 for the summer batch. And so what we're looking at here is one of the T-shirt designs I made. As you can see, there's a GPU, and it was based on one of the core templates in your library. I like metal, so this, uh, very much (laughs) spoke to me. This one was off of a sticker design, and I guess I just really liked that sword, and what I was able to do is, uh, add GPU fans.
- SPSpeaker
(laughs)
- SDSuhail Doshi
Love it. I love it.
- GTGarry Tan
And so that's one of the noteworthy things about Playground. You can upload an image. It'll sort of extract, um, the essence of like, sort of the aesthetic, and some of the features of it-
- SPSpeaker
This one-
- GTGarry Tan
... and then you can remix it.
- SPSpeaker
... feels like a, feels like a tattoo. (laughs)
- SDSuhail Doshi
(laughs)
- GTGarry Tan
Yeah, exactly.
- SPSpeaker
(laughs)
- HTHarj Taggar
Do you remember what you prompted it with to get those?
- GTGarry Tan
Oh yeah, I- I basically... So the cool thing about Playground to create this, was I- I picked, uh, a default template that I liked, um, and I think it only had the sword and sort of this ribbon, and I said, "Make it say Houstan on the ribbon, and, um, add a GPU (laughs) with two fans." I was very specific. I wanted a two-fan GPU, and that's one of the things that you'll see in all these designs. This is actually the T-shirt that Houstan itself actually chose-
- SDSuhail Doshi
Mm-hmm.
- GTGarry Tan
... so, um, you know, it's a very summery vibe. I think this was based on something around summer and surfing, and we replaced the surfboard with the GPU.
- SDSuhail Doshi
I feel like you used a preset that we had.
- GTGarry Tan
I did, yeah.
- SDSuhail Doshi
One of the style presets, right?
- GTGarry Tan
All of these are from presets.
- SPSpeaker
They're pretty good.
- GTGarry Tan
I think the noteworthy thing that I was able to do is, um, I didn't have to like, prompt and re-prompt, and re-prompt, and sort of keep trying to refine the same text prompt. Like, I actually could just talk to a designer and it would just give me what I wanted. Going from left to right, for instance, by default, I think the template had this yellowish background, and I said, "Make it on white."
- SDSuhail Doshi
Mm-hmm.
- GTGarry Tan
And so that- that was like, a very unusual interaction that I, you know, I'm not used to. Like, usually you're used to either discord with Midjourney, or you're sort of used to a chat interface, or like, prompt and then twiddle things, and re-prompt, and re-prompt, and re-prompt. Whereas this felt much more natural language. I could just talk to, uh, you know, a machine designer that would, uh, take my, uh, feedback into account.
- SDSuhail Doshi
Yeah. Normally, when you make these kinds of images, you have to like, describe all of it, right? You'd have to say like, "I want it on this, you know, beige background, and I want this orange sunset," and then you'd have to even describe like, the lines of the sun, and you know... Or- or you don't describe very much, and then every time you try it's like, totally different from the other thing. So usually, you know, you either have to learn like, a magical incantation of words, uh, or versus being able to like, pick something that you start from.
- SPSpeaker
And then also, with these images, Garry, did you add this text in post-processing? Or is the model actually like, incorporating the text organically?
- GTGarry Tan
Oh, the- the model will both take your direction on what should be there, what its size is. It can, you can actually specify where in the design. You can say, "I want it in the middle. I want it at the top. Could we use a font that's bigger or smaller?" You know, uh, "Better leading? Could you kern it a little bit?" Like, you could just speak to it in plain English, and I'd never seen that in any image model to date.
- SPSpeaker
That's- that's crazy, 'cause the text is flawless.
- SDSuhail Doshi
Yep.
- SPSpeaker
And anyone who's used DALL-E knows that if you try to get it to write text- Mm-hmm. ... the text comes out like, garbled and zombie-like.
- SDSuhail Doshi
Yeah.
- HTHarj Taggar
It's- it's pretty incredible having just accurate text and then being able to position the text exactly where you want. That is very cool.
- 7:04 – 10:44
The focus on text accuracy
- HTHarj Taggar
Was it specifically trained to be good with text, or is that like an emergent property of just how you architected everything?
- SDSuhail Doshi
We definitely focused on making text accuracy really good. I think it's been kind of our number one focus, and part of, part of it is, is t- text to us is so interrelated with actually the utility of graphics in design, um, because a lot of things without text just mostly feel like art. But yeah, text was an ex- an extraordinary, uh, high priority, and it was really hard actually. There were, there was like a maybe, uh, a point there where, like, our text accuracy was 45%. We were sort of like lost in the jungle for a moment. (laughs)
- GTGarry Tan
(laughs)
- SDSuhail Doshi
Like a bit of a panic, but we figured it out.
- JFJared Friedman
I think one of the remarkable things on all these designs is that a lot of, uh... I was playing a lot with it as well. A lot of the outputs are very utilitarian and useful, because I play with Midjourney and all of those, and I think they're fun, but they're more like toys, more like art.
- SDSuhail Doshi
Mm-hmm.
- JFJared Friedman
But it's really hard to work with it if you actually wanted to design logos, T-shirts, font sizes. It, I could totally see this replacing Adobe Illustrator, right?
- SDSuhail Doshi
Right. Yeah. Yeah, I think that, you know, uh, part of... It's kind of funny, it's like the reason why we're, I'm partly so excited about graphic design is because actually when I was younger, when I was in high school, I used to do logo contests and I would try to win them. (laughs)
- GTGarry Tan
(laughs)
- SDSuhail Doshi
I think there's this site called like sitepoint.net or something, and, um, and I was just trying to make like a little bit of money before college, before going to college, and, uh, and so I did all these like logo designs and did all these tutorials, um, trying to win them. And, uh, and so during the training of this model, I tested it for logos and I started to be like, wow, it's actually way better than anything I could have made, and then I've also made like my own company logos typically, which are also very bad. And so it just feels to me like if you can get text and you can get these other kinds of use cases, um, you're probably going to be able to beat the like mid- at least the midpoint designer, um, graphic designer that's an illustrator, and then I think over time we should be able to get to like the 90th percentile designer, graphic designer.
- GTGarry Tan
So this is actually a really different use case that really hasn't been addressed. You know, I haven't seen image models try to design graphics or illustrations. It's less, uh, you know, generating really cool images that would replace stock art or something like that. Uh, it's more literally allowing you to create Canva type things-
- SDSuhail Doshi
Right.
- GTGarry Tan
... you know, whenever you want, and you don't have to mess around with it. It's, you know, plain English, just talk to the model, the model's gonna create what you want. I've never seen anything like that.
- SDSuhail Doshi
Yeah, I think we, we were just sort of like looking at what are the use cases for graphic design, and it's, you know, when... It ha- actually, interestingly, it has a lot of real world like physical impact, physical world impact, because there are like bumper stickers and then T-shirt... I think it was at Outside Lands the other weekend, and I was just looking at everyone's T-shirts (laughs) -
- GTGarry Tan
(laughs)
- SDSuhail Doshi
... looking at what they, what they have on them, and then I, I saw a, a bunch of women at Outside Lands had this thing, this T-shirt that said, uh, "I feel like 2007 Britney." (laughs)
- GTGarry Tan
(laughs)
- HTHarj Taggar
(laughs)
- SDSuhail Doshi
I just thought that was such a cool shirt and so we made the template for it and put it in the product, and... But it, there's just like-
- JFJared Friedman
Mm-hmm.
- SDSuhail Doshi
... so much cool real world impact, and there's, and I think that the world... I often sometimes think that, um, I'm almost, I'm als- a little disappointed that MySpace doesn't exist, for those that were on MySpace, 'cause it was such an expressive-
- GTGarry Tan
Yeah.
- SDSuhail Doshi
... social network, and I feel like humans really deeply care about that form of expression, and, um, and so that's really cool to be able to make a model that's like really focused on all those kinds of things.
- JFJared Friedman
But you're actually building
- 10:44 – 16:00
Building a marketplace for Playground
- JFJared Friedman
a product, it's not just research because when-
- SDSuhail Doshi
Yeah.
- JFJared Friedman
... with all these designs in, in, in Playground, you can actually go and purchase them, like the-
- SDSuhail Doshi
Yeah.
- JFJared Friedman
... stickers, the T-shirts, right?
- SDSuhail Doshi
Right.
- JFJared Friedman
Can you tell us about kind of this marketplace that you're building?
- SDSuhail Doshi
Yeah. So I think that, you know, one thing that we, we learned was that it's kinda hard for people to prompt, and because it's hard to prompt, it's hard, it's, we found also it's hard to teach people how to prompt, like, and the truth is, is that when you make these models, it's not like we even know how it works. We are also discovering with the community how the model kind of works, and so one of the things that we decided to do, uh, was, uh, you know, me and, and, and our designer, we decided that one core belief was that the product should be visual first, not text first, which is a huge departure from like language models and ChatGPT, because, uh, because our product is so visual, why should it not be? And so in order to make it visual first and to make it so that you don't have to learn how to prompt, uh, we decided that we would start from something like a template, which is something people already understand in a tool like Canva, right? It's not something that we necessarily invented, like there's templates everywhere, but I think that if you could start from a template and then we could make it really easy to modify that template, then it feels like we've already taken you like 80% of the journey. If it was like, "I feel like 2007 Britney," but then you wanted to change the celebrity and the year to a different person-
- GTGarry Tan
(laughs) Yeah.
- SDSuhail Doshi
... then you totally could. We wanted to make that very easy, um, but it also required a lot of integration with research, because how do you make these changes? How do you make them coherent? How do you keep things like similar? It's not as simple as, uh, you know, just 75, 77 tokens that you put into Stable Diffusion. The existing open source models aren't really capable of that, so it required kind of-... yeah, like the marrying of, like, what a good product should feel like and what ... and, and then, you know, enabling that with research, which is not always possible.
- JFJared Friedman
I think that's what Garry was getting at with you building this state-of-the-art UX, the UI for all these models, because up to this point, people just get raw access. It feels like kinda back in the days you would just SSH into their computer-
- SDSuhail Doshi
Yeah.
- JFJared Friedman
... and kind of work with it. That's how people interact with these models. But you kind of basically built a whole new browser into it. Nobody has done it, and you've done it really well. Can you talk about this idea of, uh, departing from raw model access?
- SDSuhail Doshi
Yeah. I think, I think just we observed the users over 18 months, like, failing, you know? And, and so it's, it's ... A- AI's a little bit weird right now because there's so much, there's such a big novelty factor, I would say, and it's exciting 'cause we're able to do things we've never been able to do before. And so as a result, you're gett- you can easily get millions of users using your product, and that's totally what happened to us. And so it feels almost like, "Oh, maybe I've got the product." But then when you actually go look at the data and how people are using it, there's just this, there's this constant failure (laughs) of people using the product. And so ...
- GTGarry Tan
Yeah. You're talking about sort of, uh, the prior version of Playground.
- SDSuhail Doshi
The prior version of Playground, yeah.
- GTGarry Tan
So it didn't have this type of model.
- SDSuhail Doshi
Yeah.
- GTGarry Tan
It didn't, um, it was really-
- SDSuhail Doshi
We mostly used Stable Diffusion.
- GTGarry Tan
... quite aesthetic, yeah.
- SDSuhail Doshi
We used, like, open source models, and then we started training some of our own that are very similar to Stable Diffusion as, like, a way to ramp up to where we are now. When we watched users prompt this model that, you know, obviously the two pieces of feedback were, um, you know, "This is fun. It's cool. Uh, I can get, like, a cat drinking beer." (laughs) And then you post it to Twitter, and it's exciting.
- GTGarry Tan
Yeah.
- SDSuhail Doshi
But then, "But why would people come back?" you know, is one big question. And then the second part is that people are using our service a lot, but they're not always using our service a lot because it's, like, a useful thing. It's because they're, they're getting ... they're not getting what they want, so they have to keep retrying.
- GTGarry Tan
Yeah, yeah, yeah.
- SDSuhail Doshi
Yeah. (laughs) You know, where, like, Google's trying to get you off the website?
- GTGarry Tan
Yeah.
- SDSuhail Doshi
You know, that sort of feeling? Like, it's almost bad that people are using it too much, in some, in some sense. And, um, you know, they just keep re- ... we call it rerolling, right? They just keep rerolling to get a different image or a slightly better image or fix, like, a paw or tail that's off, you know? And then the other thing that happened was that our model can take an extremely long prompt. Like, most of these models, you can only write 75 tokens, but with our model it's, like, 8,000. And most people, you're never really gonna go over 1,000 right now. I say that now, but we'll see. (laughs) 1,000 tokens is a lot. Um, and, and our model lets you be extremely descriptive. And so you can, you can really describe the, the texture of the table, skin texture. We have all those, like, puzzle prompts where it's like, "green triangle next to an orange cube," you know, and it works. Like, spatial reasoning is all there, actually, um, including text generation.
- GTGarry Tan
That's new. That's totally novel, and-
- SDSuhail Doshi
Yeah.
- 16:00 – 22:25
Prompts are like HTML for graphics
- SDSuhail Doshi
what we were going to spend our time on was prompt understanding and text generation accuracy, because we also felt like aesthetics were kind of saturating. Like, they're getting better, but they're also just kinda, like, not getting better at a fast enough rate. And users even vote and say e- even in the Midjourney Discord, you know, they'll poll their users and they'll say, "What do you want to be better?" And, like, aesthetics is, like, going, like, lower and lower on the rank of things that people care about. So we wanted to, like, try to leap on something that really mattered to users, which was prompt understanding and text generation accuracy for those kinds of use cases. And, uh ... But b- when you have a very long prompt, it's not really feasible to ask anybody, like, "Are you gonna write, like, an essay?" And so we started to realize that actually the prompt is, it's almost like a ... It's kinda like HTML for graphics, which I think is so cool.
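The "prompt is like HTML for graphics" idea can be sketched as a structured design that compiles down into one long, detailed text prompt. The schema and field names below are invented for illustration; this is not Playground's actual representation:

```python
# A sketch of "the prompt as HTML for graphics": a structured description
# that flattens into the long, detailed prompt the model is trained on.
# The schema and field names here are hypothetical.

design = {
    "background": "plain white",
    "elements": [
        {"kind": "text", "content": "I feel like 2007 Britney",
         "position": "center", "font": "bold serif, large"},
        {"kind": "image", "content": "a two-fan GPU", "position": "bottom"},
    ],
}

def compile_prompt(design: dict) -> str:
    """Flatten the structured design into one detailed text prompt."""
    parts = [f"a graphic on a {design['background']} background"]
    for el in design["elements"]:
        if el["kind"] == "text":
            parts.append(f"the text '{el['content']}' in {el['font']}, "
                         f"placed at the {el['position']}")
        else:
            parts.append(f"{el['content']} at the {el['position']}")
    return "; ".join(parts)

prompt = compile_prompt(design)
# -> "a graphic on a plain white background; the text 'I feel like 2007
#     Britney' in bold serif, large, placed at the center; a two-fan GPU
#     at the bottom"
```

Editing the template then becomes a small change to one field ("change the celebrity and the year") rather than rewriting an essay-length prompt.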
- JFJared Friedman
I think you've done a lot here because you completely have a novel architecture that really gets to magical prompting, because the experience of using Playground feels as if you're talking to a designer. It, it, it has a coherence. It listens to you, because with, with other ... I don't know, with Midjourney, if you wanna move the text or, or that, it doesn't.
- SDSuhail Doshi
Mm-hmm.
- JFJared Friedman
And the positional awareness is not there. I guess one of the insights you had when we chatted a bit earlier, one of the problems you learned, to create good designs, you have to have, um, a lot of description-
- SDSuhail Doshi
Mm-hmm.
- JFJared Friedman
... for the images.
- SDSuhail Doshi
Yeah.
- JFJared Friedman
And users are basically lazy, right?
- SDSuhail Doshi
Very. (laughs)
- JFJared Friedman
They might just tell you, "I want a nature scene." And if you input this into Midjourney, what would it give you? It's like ...
- SDSuhail Doshi
Yeah. It'll give you, like, this very beautiful, very rich, high-contrast nature scene.
- JFJared Friedman
But you've done something very interesting. We wanna talk a bit about how you've done, um, kinda aiding users and expanding upon the prompt to actually build something much better?
- SDSuhail Doshi
The first thing to kind of, like, improve our prompt understanding was just, like, making your data better. Pretty much it's, it's actually just that simple. And so one of the first things we wanted to do was, uh, we wanted to, like, have extremely detailed prompts. So when we train the model, we, we train on very, very detailed prompts. But we also want users to feel like they could just say, "nature scene". (laughs) And so sort of what you see here is just how detailed we can get. And actually, um, we're actually even more detailed than this, uh, these days. When we train the next model, it'll be even more than this. But once you get to, like, this level of detail, I mean, we're just teaching the model to, like, represent all of these concepts correctly, you know, whether something is in the center or whether there's, like, a background blur. One thing that we wanna get better at, and I think we're actually already pretty good at this, is emotional expression. Like, we have this, like, image of Elon Musk, and he's, like, disgusted. He's anxious. He's happy. He's sad. He's (laughs) confident. And, like, trying to see his expression in all these different ways. And, um, and so that's just like one thing that we want to make sure is represented in these prompts. There's obviously a lot more like, you know, spatial location. And so by doing this, uh, we can ensure that the model can be- could be a good experience if you raw prompted it as a user, if you just said nothing. And then most of the time, users are not really writing more than, like, maybe like caption three here or something like that. I mean, even that's kind of a lot.
- JFJared Friedman
That's a lot.
- SDSuhail Doshi
Right?
- JFJared Friedman
I think when I was playing, I was mostly like on five and six. (laughs)
- SDSuhail Doshi
Yeah, yeah, exactly. When you're playing around, like, the nor- norm- the normies are kind of doing five and six.
- JFJared Friedman
Yeah.
- SDSuhail Doshi
And then the, like, hardcore prompters are, like, copying each other's prompts, and then they end up more like one, but they don't even look like one. And one is a very unnatural way of typing, you know? Like, nobody's writing these, like, essays and paragraphs of text. It's too much work. And that was one thing that we didn't... We knew we were going to probably fail if we expected users to do this. So, this kind of led us to, uh, like a more visual approach where you're like picking something you already like in the world that we understand how that's represented in our model, and then we can, like, make those changes and edits and stuff like that.
- HTHarj Taggar
Is the benefit of like expanding the prompts this way that you're more likely to get what the user wants at the first go? Or is it that it just makes it easier for them to iterate on it to get to what they want?
- SDSuhail Doshi
You know, I don't even know that we necessarily needed to do this. But I think the reason why we did it was 'cause at- initially, we didn't know how good the model would be. And so we needed to serve users in the way that they already use the existing models. And so we didn't exactly know the breakthrough, like, interface. We hadn't gotten there yet. And so in order to make sure that we would work the way everyone wa- is happy with, we wanted to do this kind of like segmented out. It's almost like lossy prompting. And, uh, so that's why we do it. But I think, you know, it's not even that- it's not as necessary. But I think the- and then the other reason to do it- do it this way is once the prompts get extremely detailed, it's hard to have too much, like, variation between the images because you're kind of locking in on your image.
- HTHarj Taggar
Yep.
- SDSuhail Doshi
And so by having kind of ambiguity in the prompt, you can get like more variational ability. So there's like- we call it, like, image diversity. So that way you, you say squash dish, but it's like really different each time.
- HTHarj Taggar
Yeah.
- JFJared Friedman
I guess the cool thing about your product is you basically remove all of the prompt engineering, with zero guesswork, because you do it behind the scenes-
- SDSuhail Doshi
Yeah.
- JFJared Friedman
... with expanding and exploding into this multi-caption level system.
- SDSuhail Doshi
Mm-hmm.
- JFJared Friedman
Right? I guess what comes to mind is sort of back in the day, if you needed to navigate a website through the command terminal, maybe you'd curl and do GETs and POSTs, literally, like, typing the commands, until you had like a browser to actually have the right UI, right?
- SDSuhail Doshi
What I told my team was I said, "Yeah, we should be doing the prompt engineering for users." It should not be like the users are the prompt engineers and then they like- or the prompt graphic designers, (laughs) if you will, here. But like it shouldn't be like the users have to go- Like, we can't write a... What are we going to do? Write a manual on how to do this? You know, it's- it's just too tricky. Like 1% of humanity will understand that manual. And, uh, and the rest will be like, "I don't know how to use this. It's too difficult." So, I think it's really valuable that, you know, I told my team, I think it's very important we do all of that work. Like, we should have an extremely strong sense of how the model works rather than putting that on the users where, um, I think it's like infeasible. And then the other thing that we do is we now work with, uh, creators to help us like kind of construct these like different templates and different prompts around these templates and stuff like that. And they might be like the 1% of humanity that's willing to learn this on behalf of users. And this is totally normal. Like that's what Y- that's what YC does. (laughs) You know, we like build these great companies that, uh, you know, every- like billions of people in humanity use as a result of that.
- 22:25 – 26:13
Creating new design professions
- JFJared Friedman
I guess there's two things out of this that come out. One is you might be creating this whole new set of, uh, profession-
- SDSuhail Doshi
Mm-hmm.
- JFJared Friedman
... sort of, uh, back in the day with design you have Behance where people hire designers.
- SDSuhail Doshi
Yeah. Right.
- JFJared Friedman
Now people will, through Playground, hire like AI designers that are this-
- SDSuhail Doshi
Right.
- JFJared Friedman
... top 1%.
- SDSuhail Doshi
Well, we're doing it actually. So we are hiring them. (laughs)
- JFJared Friedman
Oh, you're hiring them?
- SDSuhail Doshi
Yes, we're hiring them. We're going to launch a creator program soon actually and the goal is to bring on creators that have good taste. That- that still matters, right? Like, you know, there's this image of a, you know, stuffed- a squash dish, but it's not like- not a very beautiful image. (laughs) And there is- I think taste is still real in the world and it's also in- in design. You know, in LLMs you get to like measure how well you did on a biology test and that's like a pretty objective thing. But for design it's constantly evolving. Like, design from 10 years ago can look dated unless you're like Dieter Rams. But, uh, but I think, you know, more fundamentally we want to bring on creators that, um, are going to help create graphics that other people can then use. And we're actually paying them.
- JFJared Friedman
I guess one thing that's cool, the second thing because of this, you actually are state of the art on many aspects for this model. So much of it was driven by a product because now in order to get the good captioning you probably are beating GPT-4o, right? In terms of image captioning.
- SDSuhail Doshi
We are beating... Yeah, we now have a new like SOTA, uh, captioner, yeah.
- JFJared Friedman
To generate these and that was-
- SDSuhail Doshi
Yeah.
- JFJared Friedman
... not just to be like a benchmark but actually a very practical purpose to build the model.
- SDSuhail Doshi
Yeah.
- JFJared Friedman
Can you maybe tell us a bit about what's underneath because PGV3, Playground V3, right?
- SDSuhail Doshi
Yeah.
- JFJared Friedman
Is all- all in-house and state of the art in many aspects.
- SDSuhail Doshi
Yeah. Yeah, so the whole architecture of the model we- we had to like rip everything out. Um, so like the normal Stable Diffusion architecture that people know about is like there's a variational autoencoder, a VAE, and then there's CLIP and then there's like this U-Net architecture for people that are in the know. And, uh, and then since then it's kind of evolved to using, um, you know, more transformers. Like there's this great paper by I think it was like William Peebles, um, that did DiT, which I think is like what people believe Sora is based on as well. And then so there's some new models that are using that. Um, we actually don't use any of those new architectures either. We did something completely from scratch, but one of the reasons why we had to kind of blow everything up was because you can't really get this kind of prompt understanding using CLIP, because there's just so much error in CLIP and it's also like just- ... bounded by just the architecture of that model. And then the second thing is, uh, we also, we also needed the text accuracy to be really high. So you can't just, like, use the off-the-shelf VAE from Stable Diffusion, because it cannot reconstruct small details. Like, I don't know if you guys ever noticed, but-
- JFJared Friedman
Like the hands and the logos. (laughs)
- SDSuhail Doshi
Hands, zoomed out faces, yeah. You n- you need something that ... You also need, like, a state-of-the-art VAE or something like a VAE, um, that's better than the existing one. Like, the existing one's, like, four-channel. Um, and, uh, and so there's all these, like, pieces, and they all, they all interact. Um, and they can all bound the overall performance of your model. And so we basically looked at every single piece, and then I think, like, four months ago, there was a ... (laughs) I think with the team, there was literally we were at, we were at the whiteboard with the research team, and there was, like, the non-risky architecture, which was kind of more similar to some of the open, the state-of-the-art open source models that are out these days, like Flux and stuff. And then there was, like, this other (laughs) architecture that shall not be named. And, um, and we were like, "Well, that's th- that's, like, the risky one, where we don't even know if it'll work, and if we try it for two or three months, we'll, like, waste compute, and if it d- and it might just, like, blow up, and then we'll be behind." And we just, like, put everything in that basket. (laughs)
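The standard Stable Diffusion-style pipeline being contrasted here (CLIP text encoder, U-Net/DiT denoiser, VAE decoder) can be sketched with stand-in components. Every function below is a dummy placeholder, not a real model; it only shows how the three pieces fit together and where the 77-token CLIP ceiling and the four-channel VAE latent appear:

```python
import numpy as np

rng = np.random.default_rng(0)

def clip_text_encoder(prompt: str, max_tokens: int = 77) -> np.ndarray:
    # CLIP-style encoders truncate: only the first ~77 tokens ever
    # condition the image, which is why long prompts are wasted on them.
    tokens = prompt.split()[:max_tokens]
    return rng.normal(size=(len(tokens), 64))  # stand-in embeddings

def denoiser(latent, cond, t):
    # stand-in for the U-Net/DiT: predicts noise from latent + conditioning
    return 0.1 * latent + 0.01 * cond.mean() * t

def vae_decode(latent):
    # stand-in for the VAE decoder mapping latents back toward pixels
    return np.tanh(latent)

def generate(prompt: str, steps: int = 20):
    cond = clip_text_encoder(prompt)
    latent = rng.normal(size=(8, 8, 4))   # 4-channel latent, as in SD's VAE
    for t in range(steps, 0, -1):         # iterative denoising loop
        latent = latent - denoiser(latent, cond, t / steps)
    return vae_decode(latent)

img = generate("a green triangle next to an orange cube")
```

The point of the whiteboard discussion is that every box in this sketch bounds the whole: a lossy text encoder caps prompt understanding, and a four-channel VAE caps fine detail like kerned text, no matter how good the denoiser is.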
- SPSpeaker
(laughs)
- GTGarry Tan
Nice.
- SDSuhail Doshi
Uh, we decided that, um, we had no choice, you know? It was like we were just going to fail if we didn't do it
- 26:13 – 30:06
Using tailwinds of what is happening in language
- SDSuhail Doshi
anyway.
- JFJared Friedman
I think what's remarkable is you're an order of magnitude better on text, and in a lot of these aspects you're basically SOTA. I think that's really impressive. Can we maybe talk a bit about, as much as you can, how you beat the text encoder? I mean, you teased that out a bit. You basically don't use CLIP, because traditional Stable Diffusion just, uh, uses the last layer, right? But you guys have done something completely new, where you allowed a basically almost infinite context window, because Midjourney's is only 256. And the prompt adherence. Like, you can actually talk to it like a designer. So tell us, uh, what you can about that.
- SDSuhail Doshi
Yeah, um ... (laughs)
- GTGarry Tan
(laughs)
- JFJared Friedman
As much as you can tell us about it. (laughs)
- SDSuhail Doshi
I think it's fair to ask the question. (laughs)
- GTGarry Tan
(laughs)
- JFJared Friedman
Share as much as you want. (laughs)
- SDSuhail Doshi
I think that to, to kind of get here, you know, there's some obvious things that you would do. The most obvious thing that you would do, uh, you know, is not use CLIP, but the second-most obvious thing is, um, kind of like using the tailwinds of what's already happening in language. You know, like the language models already so deeply understand, uh, everything about text. And so there's some models that use this, you know, they use like T5-XXL, which has this ... It's like another embedding, but it's like a much more rich embedding of language understanding. Kind of feel like language is just the first, it's just the f- it's like the first thing that happened, and, um, there's a whole bunch of AI companies that are gonna come about, whether they train their models or not, that are just gonna benefit from everything that's going on in language, in, in open source language. And so, you know, I think our model is able to have such great prompt understanding in part because of the big boom in language and all of the stuff that they give you, whether it's Google or Meta or what have you, is doing. And so we're just ... We can be slightly behind in terms of language for our prompt understanding, because the language stuff is already just so good. And it, and it will just continue to get better, and our models will also continue to get better. So that might be my, like, one small hint.
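The hint above, riding the tailwinds of language models, can be sketched as conditioning the diffusion model on a frozen language model's per-token hidden states (T5-XXL is one real encoder some models use this way) instead of CLIP's truncated embedding. The tokenizer, dimensions, and attention below are toy stand-ins, not any real model:

```python
import numpy as np

rng = np.random.default_rng(0)

def frozen_llm_hidden_states(prompt: str, d_model: int = 128) -> np.ndarray:
    """Stand-in for a frozen LLM encoder: one hidden vector per token,
    with no 77-token ceiling."""
    tokens = prompt.split()  # stand-in tokenizer
    return rng.normal(size=(len(tokens), d_model))

def cross_attention(latent_q, text_kv):
    """Toy cross-attention: every image latent attends over every text
    token, so word order and long-range detail in the prompt can matter."""
    scores = latent_q @ text_kv.T / np.sqrt(text_kv.shape[1])
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ text_kv

q = rng.normal(size=(16, 128))  # 16 image-latent "patches"
kv = frozen_llm_hidden_states("a long, very descriptive prompt " * 50)
out = cross_attention(q, kv)    # conditioning for each latent patch
```

The design win is that the hard language-understanding work is inherited for free from whatever Google, Meta, and the open-source community ship next, rather than retrained from scratch.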
- GTGarry Tan
(laughs)
- JFJared Friedman
Maybe the analogy, from playing with a lot of this and from chatting with you, is that the current state-of-the-art Stable Diffusion models, their language understanding feels like, in NLP land, like Word2Vec, right?
- SDSuhail Doshi
Mm-hmm.
- JFJared Friedman
Word2Vec was this, uh, paper that came out from-
- SDSuhail Doshi
Yeah.
- JFJared Friedman
... Google in 2013, and it didn't really understand text per se. It was more the latent space. The famous example was that it would take the, uh, vector of king-
- SDSuhail Doshi
Mm-hmm.
- JFJared Friedman
... and then you would subtract the vector of man and then add the vector-
- SDSuhail Doshi
Yeah.
- JFJared Friedman
... of woman-
- SDSuhail Doshi
Yeah.
- JFJared Friedman
... and the output would be the vector of queen.
- SDSuhail Doshi
Right, yeah.
- JFJared Friedman
Which is, like, very basic. I mean, still very cool, which I think is kind of where the current Stable Diffusion models before you are. But playing with your model, you basically did a leap. To ... for the audience, the leap is that you basically got GPT level of understanding. The Word2Vec-to-GPT jump was, I don't know, like-
- SDSuhail Doshi
Yeah, I would say it's like-
- JFJared Friedman
... six, six, ten years later?
- SDSuhail Doshi
Yeah, I'd say it's like GPT-3 level image model, like sort of prompt understanding now. Yeah, and I think there's, there's much more leap, there's another leap to go. Many more, actually, I would say.
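The famous Word2Vec analogy arithmetic can be shown with hand-built toy vectors. Real Word2Vec embeddings are learned from co-occurrence and have hundreds of dimensions; here the vocabulary and 2-D axes (gender, royalty) are invented purely to make the king - man + woman ~ queen mechanic concrete.

```python
import numpy as np

# Toy embeddings on two invented axes: (gender, royalty).
vocab = {
    "king":    np.array([ 1.0, 1.0]),
    "queen":   np.array([-1.0, 1.0]),
    "man":     np.array([ 1.0, 0.0]),
    "woman":   np.array([-1.0, 0.0]),
    "prince":  np.array([ 1.0, 0.8]),
    "duchess": np.array([-1.0, 0.7]),
}

def nearest(vec, exclude=()):
    """Return the vocab word whose vector has the highest cosine similarity to vec."""
    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    return max((w for w in vocab if w not in exclude),
               key=lambda w: cos(vocab[w], vec))

# The classic analogy: subtract "man", add "woman", find the nearest word.
target = vocab["king"] - vocab["man"] + vocab["woman"]
print(nearest(target, exclude={"king", "man", "woman"}))  # queen
```

With real embeddings you'd do the same query via a library like gensim's `most_similar`; the latent-space geometry, not any actual "understanding" of text, is what makes it work, which is Jared's point about the gap to GPT-level prompt comprehension.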
- JFJared Friedman
And that's impressive.
- GTGarry Tan
So it's, it's, yeah. It's safe to say that this is the worst the model will ever be.
- SDSuhail Doshi
For sure. Yeah, for sure. I mean, there's, there's, you know, small things that we already want to fix, like, you know, we wish that the model understood concepts like film grain. I mean, it could still go, still be better at spatial positioning. Um, even the, the, like, the model has issues with, like, the idea of, like, left and right, like, put the bear to the left. (laughs)
- GTGarry Tan
(laughs)
- 30:06 – 32:42
Problems with aesthetics evals
- SDSuhail Doshi
there was this really funny thing, uh, you know, I think a, like a week or two ago, that we realized about the model, which is when we started to do evals for aesthetics. You know, and the way we do this is we just show, like, two ... It's an A/B test. We show users two images, one from maybe a rival of ours, and then another image from our model. And we're constantly doing evals and constantly asking our users what they think so that we can get better. And anyway, one of the things that we realized was that there's this new thing that I don't think has been talked about, but, you know, apologize to the audience if this has been talked about: we have entanglement issues, which is that if the model adheres too well to the prompt, it can, like, have an effect on aesthetics. So, when we compare ourselves to, say, something like Midjourney, which, you know, we've actually evaled it, has great aesthetics, best in the world at that, one of the problems is that ... we will get dinged because our model is adhering more. So I'll give you an example. We have an image, and it's like an image of a woman, and it's kind of like a split pane, like she's on this side and on this side, so it's like a composite. And Midjourney doesn't respect that, it just shows the woman in one frame. The users will always pick that because it's more aesthetically pleasing compositionally versus this, like, split-pane thing. But our model is adhering to that, uh, prompt, right? And so the users ding us, and then we get a lower aesthetic score. (laughs)
- JFJared Friedman
Do me, like...
- SDSuhail Doshi
Because it's not listening. And so there's this entanglement problem, like what do you do? We had another image that was a, like hand-painted palm trees or something, and the users chose the other model because they were less hand-painted looking.
- GTGarry Tan
Hmm.
- SDSuhail Doshi
And the hand-painted ones do look less aesthetic, but our model is adhering. So we have this entanglement problem, and we don't know how to measure ourselves for aesthetics now. And there's no, I don't, I'm not aware of any, if anyone has any literature, please send it to me, but I'm not aware of any literature on this, and so we don't know what to do.
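The A/B eval he describes boils down to aggregating pairwise votes, and the entanglement shows up when a single "which do you prefer?" question mixes two axes. A toy sketch of one conceivable way to see that (this is not Playground's actual eval; the prompts, models, and vote data are invented): ask raters separately about adherence and aesthetics, then aggregate each axis on its own.

```python
from statistics import mean

# One record per side-by-side comparison. Instead of one overall
# preference, the rater answers two SEPARATE questions.
# All data below is invented for illustration: "ours" wins adherence
# on exactly the prompts (split-pane, hand-painted) where the rival's
# prettier but non-adherent image would win a single preference vote.
comparisons = [
    # (prompt, adherence winner, aesthetics winner)
    ("split-pane portrait of a woman", "ours",  "rival"),
    ("hand-painted palm trees",        "ours",  "rival"),
    ("a cat wearing a top hat",        "ours",  "ours"),
    ("neon diner sign at night",       "rival", "ours"),
]

def rate(model, axis_idx):
    """Win rate for `model` on one axis (1 = adherence, 2 = aesthetics)."""
    return mean(1.0 if c[axis_idx] == model else 0.0 for c in comparisons)

# A single entangled vote would blur these two numbers together:
print("adherence :", rate("ours", 1))   # 0.75
print("aesthetics:", rate("ours", 2))   # 0.5
```

In the entangled version, the first two rows would likely count as aesthetic losses even though they're adherence wins, which is exactly the "we get dinged for listening" problem he's describing.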
- JFJared Friedman
I think what it sounds like to me is basically your model is so SOTA that the current evals don't work, because it's actually following the rules.
- SDSuhail Doshi
Yeah, we're trying to figure out what, we have, we have to make a new, uh, a new eval (laughs) basically-
- JFJared Friedman
You're, you're too advanced. It's like...
- SDSuhail Doshi
... to figure out what to do.
- GTGarry Tan
You broke the test.
- SDSuhail Doshi
Yeah, you kind of broke the test. And, um, and yeah. So now, now it's, it's, it's a little weird externally. We don't, you know, it's like obviously we want to portray to the world, hey, you know, we have this great thing, and okay, we lose here, but like, but not really. And so I think we're going to...
- GTGarry Tan
But it does what you want.
- SDSuhail Doshi
Yeah, but it does what you want, and so I think we're going to try to, you know, we're going to talk about this in more detail, this kind of entanglement problem, because it's actually like a very interesting, more fundamental insight.
- GTGarry Tan
Yeah, it sounds
- 32:42 – 33:54
The commercial applications
- GTGarry Tan
like you're just building a completely different kind of company. Like the thread that comes up hearing everyone here is, if using Playground feels like you're talking to a graphic designer-
- SDSuhail Doshi
Mm-hmm.
- GTGarry Tan
Which then in my head actually buckets you in with the companies in YC that are really taking off, the ones that are replacing some form of labor.
- SDSuhail Doshi
Mm.
- GTGarry Tan
Um, which is just different to how people talk about Midjourney, right? It's not just like a tool to play around with. This is actually just going to be like the replacement for hiring a graphic design team potentially, which is like way more commercial.
- SDSuhail Doshi
Right. Yeah, yeah. I mean, we've been, we've been searching for like where is the utility, where, where, how are people using, you know, things like Midjourney. Um, and, uh, and, and I think that for me it's actually s- it's even simpler, it's just that I think we're just enabling the person to have more control over...
- JFJared Friedman
Mm.
- SDSuhail Doshi
... the whole thing. Like, I always feel bothered, you know, when you're like, you know, I produce music, and so if I make a song, like I have to go to a designer and say, "Can you make me album art?" And then I only get like four variations of it, and then I feel badly asking for a fifth if I don't like any of the four. But it, the more you just like put in control the person that's actually making the thing, they'll always, they'll be able to connect exactly the thing that they're looking for with, you know, the core product or song or whatever they're making.
- GTGarry Tan
So at YC we're always, uh,
- 33:54 – 40:30
When the users you get are not the users you want
- GTGarry Tan
telling founders, "Hey, you should talk to your users more," or obs- you know, and what you did was you had so many users, you couldn't just talk to them, you needed to look at how were they actually using it.
- SDSuhail Doshi
Yeah.
- GTGarry Tan
And at some point you realized, you know, somewhat uncomfortably, that they were generating near porn. (laughs)
- SDSuhail Doshi
Near porn, yeah.
- GTGarry Tan
(laughs)
- SDSuhail Doshi
We get a lot of near porn, uh, and, and porn. Um...
- GTGarry Tan
And then, you know, I think people sort of when they're exploring a space often run into that situation, like what happens when, you know, the users that you're getting aren't the users you actually want?
- SDSuhail Doshi
Yeah. We, (sighs) me and my COO talked about this. We were like, if we listened to what the users wanted, we would have to build a porn company essentially...
- JFJared Friedman
(laughs)
- GTGarry Tan
(laughs)
- SDSuhail Doshi
... which is not something that I think my wife-
- GTGarry Tan
(laughs)
- SDSuhail Doshi
... would be happy with...
- GTGarry Tan
(laughs)
- SDSuhail Doshi
... or my mother. Um, it was kind of this tricky thing where you're like, listen to your users, talk to your users, uh, and, and, and look, I'm not saying everybody does that with image models, for sure they don't, but, but a lot of them do. And so we had to kind of go ask ourselves, well then what can you do with these things? And the answer was like not much else. Nothing big and commercial enough. We can make a cool website that people use, um, and the problem is all the, all the web, all the image generator sites are plagued with this problem, and we all know it. We all, we all know. (laughs)
- JFJared Friedman
(laughs)
- SDSuhail Doshi
And there are huge safety problems, and it, you know, it turned out to be just like a business we didn't like. And that's a hard ... like that's a hard thing after, you know, 12, 18 months of working on something and you're just like, "Well, I don't really like this that much. And now what?" And when we looked around for use cases, we were like, oh, all the use cases have text.
- GTGarry Tan
Mm.
- SDSuhail Doshi
All the big ones.
- GTGarry Tan
Mm.
- SDSuhail Doshi
Practically all of them. Logos, posters, T-shirts, bumper stickers, everything. Everything has text because text is also a way to communicate with humans. That's why it became number one, like the number one priority.
- GTGarry Tan
I mean, this isn't the first time that you've s- sort of confronted this issue before. You know, in your prior startup Mixpanel, which you built into a company that, you know, makes hundreds of millions of dollars a year, one of the leaders in analytics from a really young age. You know, I think you started it when you were 19, and I remember because...
- SDSuhail Doshi
(laughs)
- GTGarry Tan
(laughs) ... I met you when you first started it.
- SDSuhail Doshi
(laughs)
- GTGarry Tan
That was another moment where here's this brand new technology, and there's sort of very commercial use cases that you could build a real business on, and then there were other use cases, in that case I think it was, um, sort of fly by night gaming operations that would come and sort of pop up on Facebook, steal a bunch of users, and then disappear. And you had to make some choices about...
- SDSuhail Doshi
Yeah.
- GTGarry Tan
... who you wanted your users to be. Like do you want it to be people who can actually pay you money for a real product, uh, over the long haul, or sort of, oh yeah, they're here and they're gone and we can make our graph go up. Like it's sort of a quandary that, uh, a lot of founders are facing. How did you approach that?
- SDSuhail Doshi
Yeah. I, I mean, I, that one's like burnt into my memory actually. Um, so we, you know, the, the simple story was just that like we got all these gaming companies back in the gaming heyday of, you know, Zynga and RockYou and Slide and all this stuff. And, um...And they would, we, we were making so much money off of them. But then, they would die because they had bad retention or... Games just have like a decay factor. And uh-
- GTGarry Tan
You could tell that they were gonna die because the-
- 40:30 – 48:30
Reflections on going through YC twice
- GTGarry Tan
thing about Playground, it was also a previous, more radical pivot you had because you had gone through-
- SDSuhail Doshi
(laughs)
- GTGarry Tan
... YC twice.
- SDSuhail Doshi
Yeah.
- GTGarry Tan
So you went through with Mixpanel, which became this successful company, making hundreds of millions of dollars. Then you went through with-
- SDSuhail Doshi
Mighty.
- GTGarry Tan
... Mighty.
- SDSuhail Doshi
Mm-hmm.
- GTGarry Tan
Can you tell us about that second time going through YC and then what was it? And then you pivoted into...
- SDSuhail Doshi
Yeah, so I did this browser, um, company called Mighty where our goal was to try to, like, stream a browser and ... The real goal was to try to make a new kind of computer, and um, we basically did it, but the problem was that, you know, we hit this wall where I didn't believe that it was gonna be a new kind of computer anymore. I just couldn't make it more than two times faster, and I just didn't feel like, if I couldn't get like a 10X or a 5X on this thing, or at least see that it could get a 10X, um, that it just wasn't a company that I wanted to work on anymore. And it's-
- GTGarry Tan
Oh, I remember. I, I had invested before I came back- (laughs)
- SDSuhail Doshi
Yeah, of course.
- GTGarry Tan
... to YC. (laughs) And one of the big, uh, ideas that really got me was that, uh, actually our MacBook Pros were really sucking at the time.
- SDSuhail Doshi
Yeah, they were. Yeah. There was no-
- GTGarry Tan
And so-
- SDSuhail Doshi
... M1 at the time.
- GTGarry Tan
Yeah, and-
- SDSuhail Doshi
Yeah.
- GTGarry Tan
... we'd actually... I don't think we even knew that Apple was going to-
- SDSuhail Doshi
We had no idea.
- GTGarry Tan
... release silicon yet. I mean, it's interesting. I think that in Silicon Valley, we maybe, um, underestimate how valuable strategy actually is. Mainly because strategy is so fun and so interesting and the MBAs who come into our sector like immediately seize on that and just-
- SDSuhail Doshi
(laughs)
- GTGarry Tan
... want, you know. It's like, "You need a strategy person as like the f- you know, as a co-founder." And it's like, "No, no, no. We don't actually need that." But that's not to say that strategy is not necessary. (laughs) In this particular case, like I think that we were trying to solve a real problem which was our c- our browsers really sucked.
- SDSuhail Doshi
Yeah.
- GTGarry Tan
And that cloud was getting very, very good. And then suddenly, you know, the maze changed when, uh-
- SDSuhail Doshi
Right.
- GTGarry Tan
... when Apple released silicon.
- SDSuhail Doshi
Well, they clearly thought so too. So you know. (laughs)
- GTGarry Tan
(laughs) Strategy was right in some sense, like the overarching problem of trying to make our computers faster, they were able to make a chip. (laughs) Yeah.
- SDSuhail Doshi
But, but still, you know, even, you know, even in the, even the face of the M1, we had kind of convinced ourselves like, "Well, doesn't really matter. Like, the Mac only has like 8.3% market share, desktop market share. The rest is Windows." And um, you know, I even met, you know, the, the prior CEO of Intel, Bob Swan, and you know, talked to him about like, why is Intel behind here and all that. And I was trying to figure out like, "Why is AMD and Intel behind? Where it's, where is it going?"... is, is anyone even gonna get close to the M1 or not? And so I think one problem is that, like, wanting them to be behind is, like, non-ideal-
- 48:30 – 53:35
Running a research lab/startup hybrid vs a pure startup
- HTHarj Taggar
How does it feel to run Playground, which is sort of part startup, part research lab versus just pure startup?
- SDSuhail Doshi
Well, one thing we try to do is, uh, we try to differentiate on not trying to go after AGI. (laughs)
- HTHarj Taggar
(laughs)
- SPSpeaker
(laughs)
- SDSuhail Doshi
That's one thing we try to s- uh, say we're not doing, um, because there's lots of people doing that. It feels really tractable, I guess, the research does. You know, where it's not always clear whether research will be like that. You know, I- I'm- I've kind of learned that you can't, you can't do research in a rush, so one big problem is that when you're building a startup, like, you want to ship everything, like, you want to just ship ... you want to ship it today. You want to fix the bug, you want to ship the feature. Like, you're just trying to move at such a fast pace, but that's not tenable with research in the same way. Research is moving fast. But it's not like you ship ... You can't ship your new model, you can't build and ship your model in a week. And so I think that's been, like, really challenging, and I've had to kind of adjust my brain for one team versus the other.
- HTHarj Taggar
Yeah, one thing I think is interesting about successful research labs in the past, if you look at Bell Labs, for example, it's almost like the, the CEO of the lab's main responsibility is shielding the lab from, like, the commercial interests that are-
- SDSuhail Doshi
Right. (laughs)
- HTHarj Taggar
... pushing for, like, things now.
- SDSuhail Doshi
Yeah, right.
- HTHarj Taggar
But as CEO of Playground, you're kind of both, like, protector of the researchers, but you're also the commercial interests. Like, how do you juggle those competing forces?
- SDSuhail Doshi
Yeah, I don't know that I, I've probably mastered it yet by any means, but, um ... I think I asked Sam Altman once, like, you know, to what degree he allowed the researchers at OpenAI to, like, wander, I guess. So I just d- wasn't really sure. You know, usually it's like there's, like, a task and you do it. But what about wandering? (laughs) How does wandering make sense in a, in a research or an engineer- engineering team? And, um, and he said there's like a, he's like, "There's quite a bit of wa- wandering." It ... So I took that to heart, and, um, and so I let the research team kind of wander and get to a point where they are able to show an impressive result. And then we kind of, like, kind of start to, like, really accelerate that. But until then, there's not much to do. (laughs)
- GTGarry Tan
Well, not all who wander are lost. (laughs)
- HTHarj Taggar
(laughs) Nice.
- SDSuhail Doshi
I love that. That should be a T-shirt and a product.
- GTGarry Tan
Yeah, that's right.
- HTHarj Taggar
Yeah, you should make a...
- SDSuhail Doshi
We will add that as a template.
- HTHarj Taggar
Make a Playground T-shirt, yep.
- SDSuhail Doshi
We can link it below in the video.
- HTHarj Taggar
(laughs)
- GTGarry Tan
I'll be a creator in the Playground marketplace.
- SDSuhail Doshi
Love it. You were asking, like, how do you, like ... Almost like, how do these two teams integrate in a startup? And I think that we just, like, have this channel now where we just see so much feedback that now the researchers can actually, like, look into the failure, and they can decide for themselves while wandering, "Do I want to fix that? That's surprising. Why did that happen?" Um, and so I want to try to, like, integrate these two, because I think that that's like a weird ... That's a more differentiating factor these days. I think that, like, the research labs are very lab-based, and they don't necessarily ... They're not always deeply looking into real user behavior, what are they really trying to do. But sometimes it's just, like, we need to get to this, like ... We need to get a high score on this eval, and we got to put it in the paper, and then we got to, like, get a really good score on LLM Arena. (laughs) And then there's, like, some KPI, you know, (laughs) to do that. But then, you know, does that thing matter? Does it correlate? Does the eval that we see out in the world, does it strongly correlate to usefulness to users? Like, I still want the LLMs to, like, help me make rap lyrics. But there's no eval for that. (laughs) So, you know, uh, who will do that? How will that happen? It's certainly possible to do that. But I ... If you notice, I, I always pick on this rap lyrics thing because to me it, like, belies a fundamental problem with how people are evaluating the models. Because the models should be extremely good at it, but they're not.
- JFJared Friedman
Maybe the problem is there's a gap between research and its commercialization, because all these public evals are academic, a very different use case than if you wanted to go beat Canva, let's say.
- SDSuhail Doshi
Yeah. I mean, I may be talking out of turn here, sorry to the LLM folks, but the ... If you go look at the evals for the language models, they're all like, you know, math, biology, legal questions. It's no wonder that the biggest use case of ChatGPT is homework.
- HTHarj Taggar
Hmm.
- JFJared Friedman
(laughs)
- SDSuhail Doshi
Because they, you know ... All the models are basically told, "Hit these numbers," right, initially. And maybe they're different now. They're probably more sophisticated now. But it's no wonder that the models are good at homework, and that's a huge category.
- GTGarry Tan
So, you made it to SOTA. People are watching right now and they're just asking, like, "How do I do it?" What's your answer to that?
- SDSuhail Doshi
There's, like, this feeling that all you need is a lot of data and, um, a lot of compute, and, uh, you just ... And then you just, uh, you run, you train these models and you'll get there, you know? Uh, they'll just generalize and suddenly everything will be great. I think there are a lot of smart software engineers, and so they fundamentally understand that these are the core components, ingredients to make, like, this great model. But it's vastly more complex than this. And
- 53:35 – 55:09
What it takes to make a state-of-the-art model
- SDSuhail Doshi
what I've, at least what I've, uh, experienced is that to get to SOTA, you basically have to be maniacal about, like, every detail of, um, you know, the model's, like, capability. For example, like, you can, like, look at text generation. There are going to be some people that train their models and they get cool text generation, but the kerning is off. Are you the kind of person that will care about the kerning being off? Or are you the kind of person that is okay with it? Or you don't even notice it. Like, do you have this just maniacal sense? Like, we look at skin texture. Like, my eyes feel burnt out basically (laughs) from looking at, like, the smallest little skin texture, you know, smoothing it. We, like, talk about these things as a research team day in and day out. We, like, argue about it. To build these SOTA models, you have to be so ... You have to care so much about ... In our world, it's image quality. And, and, you know, we even look at, like, little small things, like if there's even a slight film grain and it's missing, we go, "Oh, the prompt understa- ... The captioning model is bad, not good enough. We need to be better at this." And I think this maniacal mindset allows you ... If you do this 100 times, the model extrapolates even more. I think people don't quite internalize, like, the extrapolation of all of these dimensions together and how they work together to make everything better. Like, you don't know how making one thing better here will impact, like, another thing over there. We can't. It's hard to understand that. But I think that that's what's required to get to a SOTA model.
- GTGarry Tan
And
- 55:09 – 55:50
Outro
- GTGarry Tan
it is possible.
- SDSuhail Doshi
It is possible.
- GTGarry Tan
It is possible. It's not easy, though.
- SDSuhail Doshi
It's really hard, yeah. (laughs)
- GTGarry Tan
(laughs) Well, Suhail, thanks a lot for coming on The Light Cone. That's all we have time for, but you can try Playground right now, playground.com, uh, or in the App Store, Android, iOS. Uh, and this is actually the biggest flex, you didn't have any waitlists, it was just available on day one. So go try it out right now, and we'll see you guys next time. (instrumental music)
Episode duration: 55:51
Transcript of episode VyIOoqjm8HA