No Priors Ep. 60 | With Playground AI Founder Suhail Doshi
- 0:00 – 0:52
Introduction
- Sarah Guo
(music plays) Hi, listeners, and welcome to another episode of No Priors. Today, we're talking to Suhail Doshi, the founder of Playground AI, an image generator and editor. They've been open sourcing foundation diffusion models, most recently Playground 2.5. We're so excited to have Suhail on to talk about building this model in conjunction with the Playground community and the future of AI pixel generation. Welcome, Suhail.
- Suhail Doshi
Thanks for having me.
- Sarah Guo
Uh, so this is your third company. You started Mixpanel, Mighty, now you're working on Playground. Um, how did you decide this was the next thing?
- Suhail Doshi
I- I think like back in April of 2022, I think that was just a place at that time, it was like GPT-3... 5 kind of came out and then DALL-E 2 came out, um, and I was actually working on the second company, Mighty, and at that time I was trying to figure
- 0:52 – 3:01
Focusing on image generation
- Suhail Doshi
out how to like do something with AI inside of a browser address bar. But when I saw DALL-E 2 came out, it was just this like very big strange eye-opening moment where I think a lot of people didn't think that we'd be able to do like weird interesting art things, um, so soon. And so and I think then soon after that I think Stable Diffusion came out around June or July of that same year, and I got early access, maybe a couple weeks access to early... to, uh, SD 1.4, and it just kind of blew my mind what, what people could do with that. And I just thought that it seemed odd that all of this was being done in a Google Colab notebook. Shouldn't there be like a UI that makes it really easy? That sort of thing.
- Sarah Guo
From the start, were you just thinking, "We will open source. We will, um, train our own models from scratch"? Did you think about other, um, other modalities?
- Suhail Doshi
Um, yeah. It was... I mean there have been a lot of people that thought I should do like something in music, but I just... I... 'Cause music has been like a huge hobby of mine for like six years or so, I like produce music, but I just couldn't wrap my, my brain around like what useful thing I would end up making for people. Although now there's like a lot of very interesting, cool, useful things for music. (laughs) Um, and, uh, and then it seemed like a lot of people were very focused on language, and I had really enjoyed... I already work with lots of creative tools like when I was in high school I used to make logos or I would make music or whatever. Um, so I was excited that finally I could find something where it was a combination of creativity, tooling. Images have like really amazing like built-in distribution. People want to share those kinds of things. So it ended up just like being this perfect thing that I was excited to work on.
- Sarah Guo
How do you think the, like, overall landscape for, um, uh, competition is different in language versus images versus music? Right? Like how, how did you think about w-... in what ways you guys would want to like build advantage and stand out?
- Suhail Doshi
I think with language there's like... I don't know. I don't know how many language companies there are. You guys would probably know better than me, but it seems like there's like over 20, and then, and maybe like five or like eight of them have a billion dollars worth of funding. I also
- 3:01 – 5:58
Differentiating from other AI creative tools
- Suhail Doshi
didn't want to work on something if there were already extremely passionate people really working hard at that thing, people that I, like, really respected that were working on that thing. And so at the time with, with, um, images there was just sort of I think there was Midjourney, there was OpenAI doing some DALL-E stuff, and then you saw sort of Stable Diffusion. Um, but for some of these companies it didn't seem like there was going to be a longstanding concerted effort to keep making them better. It was sort of unclear like who was doing this as like a fun demo versus who was doing this as something they would like spend and invest tons of their time in. And so once I kind of figured out, um, you know, to what extent OpenAI was going to invest in it or to what extent it seemed like the folks at Stability AI were like sort of focused on like seven different kinds of things, and I just thought like, "Hmm, there's just not enough people that want to do this one thing and do it really, really great." So I think for me it was just about what... were there enough capable people that wanted to do this.
- Elad Gil
Can you talk a little bit about the specific direction you decided to take with Playground as well? And I know you thought really deeply about some of the applications or use cases for it. So I was just curious if you could share a bit more about that.
- Suhail Doshi
Yeah, I think one thing that has sort of been surprising and it hasn't changed too much actually from maybe around June or July or August of 2022, um, uh, was that like a lot of... a lot of people think about like text to image as just... it's... right now it's kind of not... it's not even text to image, it's like text to art.
- Sarah Guo
Sorry, what's the difference?
- Suhail Doshi
The difference is that, um, these models have, uh... they don't... they don't... they haven't quite reached the potential of like what... maybe what its utility could be. Right now for the most part we take... we formulate a prompt which is really just a caption of what the image is, and then it diffuses into an image, a set of pixels, but a lot of those pixels are primarily used for art. But what we haven't done is we haven't done anything beyond that. We haven't really done something like editing for example. Like why can't we take an image that you already have and why can't we like sort of insert something into that with like the correct lighting and stuff? Why can't we like stylize an existing thing? Why is there not a blend of like real and synthetic imagery into a single image that could then be used for a lot more things than just pure art? And so right now it's a lot of just people making art but not a lot of people... sometimes that reduces its like practicality or its utility.
- Elad Gil
Yeah. That makes sense. I think, um, one of the things that you folks did as well that I thought was really interesting is, uh, you built your own models, right? Or you trained your own models and a lot of people in this space just take Stable Diffusion and fine-tune it or do other approaches like that. And, um, you have... you've gotten... you just launched 2.5. The model is performing incredibly well, like it creates really beautiful imagery and it's super high quality. And I'll be curious if you could share a bit more about how you went about training your model and hiring a team specifically for that purpose and how you thought about it and approached it.
- Suhail Doshi
Yeah. It turns tur... turns out that, you know, I think when... i-... you know, with a lot of strong engineers, their first thought is that like you, you just take a model architecture. You find a lot of data. You get, you fund yourself with enough compute, and you just sort of like throw these things
- 5:58 – 8:31
Training a Stable Diffusion model
- Suhail Doshi
in, into like a mixture of, of, of sorts and like out comes something like DALL-E 2 or DALL-E 3. (laughs) It turns out that it's just way more complex than that, and way more complex than I even imagined. I had a sense that it was more complicated than that, but then it still further is more complicated than even that. Um, you know, so I think there are a couple things that we did. Uh, one of the things that we were really focused on with that model was that we wanted to see how far we could push the architecture of something that already existed. This was mostly like a test. It was a test to see, um, how far we could get as a research team before like the next model change, and so we wanted to take something that we knew was a recipe that worked already, which was Stable Diffusion XL's architecture, which is like a U-Net, right? Um, and CLIP and, and the same VAE that Robin Rombach trained, um, all this stuff, and then we sort of said, "Okay. What if we tried to get something that's just at least better than SDXL, the be- better than the open source model?" And we weren't really sure by how much, and so our only goal was to, like, just be better and try to deliver on the number one state of the art open source model that we could release. And, um, and so we kind of learned two things. One is that when we looked at some of the images from something like SDXL, we noticed that they were sort of this, like, average brightness. It was really confusing. Um, it didn't quite have, like, the right kind of color and contrast, and in fact, I became so used to this, I was so surprised about the average brightness when comparing it to the images of our model that I thought it was a bug during eval. Like, I literally was, like, looking at the images and I was like, "These cannot be the right images." And my team was sort of like, "Hey, I think you're actually just getting used to the images of the, the new model." And so we employed this thing called, um, like, this EDM formulation, which, like, samples the noise slightly differently, and it's like a really clever kind of math trick. And there's pro- there's a paper that you could probably read on it, um, but it, it's surprising how this, like, one little, like, very clever trick w- can produce, um, images that have, like, incredibly, uh, great color and contrast where like the blacks are really vibrant with like a bunch of different colors and this average brightness kind of goes away. So that's like one thing.
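For readers who want a concrete picture of the EDM idea, here is a minimal sketch of EDM-style noise-level sampling and loss weighting in PyTorch, using the defaults published in Karras et al.'s "Elucidating the Design Space of Diffusion-Based Generative Models." These are the paper's values, not Playground's actual training recipe.

```python
import torch

# EDM defaults from Karras et al. (2022); Playground's real settings may differ.
P_MEAN, P_STD, SIGMA_DATA = -1.2, 1.2, 0.5

def edm_training_noise(images: torch.Tensor):
    """Sample a per-image noise level from a log-normal distribution,
    noise the images, and return the EDM loss weight for each sample."""
    b = images.shape[0]
    # ln(sigma) ~ N(P_MEAN, P_STD^2): most probability mass lands at useful noise levels
    sigma = (torch.randn(b, 1, 1, 1) * P_STD + P_MEAN).exp()
    noised = images + torch.randn_like(images) * sigma
    # Weighting keeps the denoising loss comparable across noise levels
    weight = (sigma ** 2 + SIGMA_DATA ** 2) / (sigma * SIGMA_DATA) ** 2
    return noised, sigma, weight
```

Sampling and weighting the noise this way, rather than with the fixed schedule used in earlier Stable Diffusion training, is one reason a model can reach true blacks and bright highlights instead of the washed-out "average brightness" look described above.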
- Elad Gil
You know, that's a really interesting example of really optimizing for one aspect of creating, you know, aesthetically pleasing imagery, and there's a few other aspects like that. So I'm just curious, how much do you have to sort of hand tune different parameters versus it's just something that you get as you, um, you know, train a model or post train a model?
- Suhail Doshi
Yeah. I mean, there's just so many... There's different dimensions of these models. I mean, one is just like, there's its understanding of knowledge, but then for like aesthetics, it's really tricky.
- 8:31 – 15:00
Long term vision for Playground AI
- Suhail Doshi
Um, I, honestly, I think like the field itself is just so nascent that, like, every month there's like a new trick. There's like a new thing that we all sort of develop or find, find out. I think there are, there's an element of some of that being like a lot of different tricks. Like, there's like this new trick that hasn't been like ex- uh, um, well employed or well exploited yet by this guy named Tero, uh, Karras, and he basically does this weird thing called power EMA. Anyway, it like basically helps converge training really fast. And, and so that's like one trick, and then there's this EDM trick, and there's this thing called offset noise. And so there is a lot of tricks for things like color and contrast. There's even a trick called DPO for like, um, that I think works in the language model world and also the image world, right? Um, so I think there are all these... There are like lots of tricks that sometimes get you like 10, 20, sometimes 2X improvements, but I think the number one trick is like really just like that last phase of, you know, a supervised fine tune where you're finding like really great curated data. And it's hard to say how much of that is a trick 'cause it's actually just a lot of meticulous work. So I think there's a kind of a combination of some of these things being tricks and techniques, and then there's just like this other thing that's just like really hard, meticulous, like there has to be deep care. And with images, maybe more so than language, um, there has to be like taste and judgment.
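Of the tricks mentioned here, offset noise is perhaps the easiest to show. Below is a rough sketch of the widely shared community version in PyTorch; the 0.1 scale is a typical value from public write-ups, not necessarily what Playground uses.

```python
import torch

def offset_noise(latents: torch.Tensor, offset_scale: float = 0.1) -> torch.Tensor:
    """Gaussian noise plus a small per-channel constant offset.

    The constant component gives the model a low-frequency knob for overall
    brightness, so it can learn truly dark or truly bright images instead of
    everything regressing toward mid-gray."""
    b, c = latents.shape[:2]
    offset = torch.randn(b, c, 1, 1, device=latents.device, dtype=latents.dtype)
    return torch.randn_like(latents) + offset_scale * offset
```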
- Elad Gil
Yeah. How do you think about that from the perspective of the eval you do? Because not that many people have amazing taste aesthetically, right? And so I'm a little bit curious how you end up determining what is good taste, or is it just user feedback, you know, thumbs up, thumbs down? Like, how do you think about that?
- Suhail Doshi
One thing that I've noticed is that every time we do an eval, uh, we try to make our evals better and we try, try to make them better than the predecessor eval. And so one thing I always notice though with each successive run is that I find out much later after eval that the model has like all these gaps. So an- an example of a gap that we recently had was like, we did well on our eval, but one area that I thought we did poorly that I wish we had done better on was, uh, photorealism. Sometimes it would make faces look like they hadn't gone to sleep for three days or something (laughs). And so I think that most evals in the industry are relatively flawed. Like, a lot of them are doing like benchmarks on things that, um, maybe are valuable from the purposes of marketing, but are not necessarily well correlated with the, with what maybe users care about. And so like an- a simple example would be like with large language models, like there's probably a good reason, there's a reason why they're probably good at homework is because like a lot of the evals are like related to things that are related to things (laughs) that could be like homework, like solving like an LSAT or a bio test or a math test. And so some of these evals just don't have like the necessary coverage. So I think with things like judgment and taste, my feeling is that overall the evals need to get like way stronger. And so one thing that we tend to do is we just tend to like really look at a, like a lot of images across a lot of grids, and we're really like being exacting about like what thing could be off, but you have to look at like thousands of images across thou- uh, you know, lots of, lots of different grids across different checkpoints to basically find and pick like sort of release candidates. But I still think that our own evals are not sufficiently strong, um, and they could be better at like world knowledge, whether that's like its ability to reproduce a celebrity if that's w- that w- that's what you want, or paintings. Um, sometimes paintings are like difficult, um, or like 3D or like illustrations or logos. Those kinds of things are all like... Overall, I think coverage is like pretty tricky.
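As a rough illustration of the grid-based review described here, the sketch below renders the same fixed prompts with the same seed for each checkpoint so reviewers can compare candidates side by side. `generate` is a hypothetical stand-in for whatever sampling function the model exposes; it is not Playground's actual eval tooling.

```python
import torch
from torchvision.utils import make_grid, save_image

def save_eval_grid(generate, prompts, checkpoint_name, seed=0):
    """Render one grid per checkpoint: identical prompts and seed, so any
    visual difference comes from the model weights, not the sampling noise."""
    torch.manual_seed(seed)
    # generate(prompt) is assumed to return an (n, 3, H, W) tensor in [0, 1]
    images = torch.cat([generate(p) for p in prompts])
    grid = make_grid(images, nrow=len(prompts))
    save_image(grid, f"eval_{checkpoint_name}.png")
```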
- Sarah Guo
So one of the things that you guys do is, um, have like voting schemes or user studies within the product itself. So I don't know if it's grids, but you're asking users to, you know, um, express preferences more so than I, I think, uh, perhaps other research efforts are. Can you talk about just like generally your data curation strategy, if there's some sort of overall framework or if community is a big piece of it?
- Suhail Doshi
Generally, we try to keep something, like, very simple because we know that users are, they have, like, they're- they're there to make, like, images. They're not there to, like, necessarily, like, help us label images and so... Or annotate things or tell us everything about their preferences. And so we kind of have, like, a very sophisticated process of how we sort of curate images and, like, how we're collecting- collecting data from these users to help us kind of, like, rank and sort of, like, make sure we're choosing the right sort of things that we want to curate. And so I think these things are, like, very... They might seem, like, very simple when you encounter it but, like, beneath that is, like, something very, very complex. But yeah. It's a little tough to go into it too deeply because, yeah, it does feel like a little bit of a secret sauce, I suppose.
- Sarah Guo
Yeah. Well, at least I'll feel good that my, um, guess as to what's interesting is right.
- Suhail Doshi
(laughs)
- Sarah Guo
Can you just characterize, like, where Playground does really well today? Like, where you stand out and what sort of use cases you're focused on, like, winning?
- Suhail Doshi
We're- we're, maybe, like, probably, like, number two, I suspect, I guess, at, like, text to art at the moment just because we're training these models from scratch and we're get- we're closing the gap really rapidly, as- as rapidly as we can around all the various, like, kind of use cases. But I think that we'll probably diverge from some of the other companies in part because we care... I think we start... We're- we're gonna care a lot more about editing. Um, you know, people just have, like, a lot of images on their phone or they wanna take some image that they love, whether that's made as art or something, um, that they found and they wanna, like, tweak it a little bit. It's a little annoying that, like, you make this image and then you can't really change too much about it. You can't change the likeness of it. Maybe there's, like, a dog or, like, your face or, like, something, character consistency issues. It feels a little like a loot box right now and I think that because it's so much of a loot box, it feels, mm, it feels like it's too much effort, I guess, to get something that you really, really want. So I think more where we're navigating is, like, how can we help you take an image that you love, maybe it's your logo, maybe, or incorporate something like your logo or put it in some sort of situation that- that you would prefer. Um, text synthesis is, like, something that we wanna do, uh, for example. So those are some areas that we want to head towards, um, where there's, like, higher utility and- and less, like, you make an image and you just post it to Instagram or something like that.
- Elad Gil
Where do you want to take, uh, the company and the product over the next few years? Like, what- what is a long term vision of what you're doing?
- Suhail Doshi
Um, if people are out there kind of, like, working on scaling text, we're basically trying to focus on, like, scaling pixels and the first area that we've basically started on is just images. And the reason why we're working on- on images instead of say something like video or 3D
- 15:00 – 17:21
Evolution of AI architecture
- Suhail Doshi
or something like that is one part... One issue with 3D is that it tends to be better to work on 3D if you're, like, making the content, like you're making Pixar movies. Um, the tools in 3D tend to not, like, make as much money. And then the other thing with video is, like, videos is just extraordinarily computationally expensive to, um, to do inference or even training on. Um, and a lot of the video models first train, like pre-trained with like a billion images first anyway to, like, have a rich semantic understanding of, like, pixels. We just think that video... That images is, like, a, like, maybe the most obvious place to start because A, the utility is quite low and B, um, and B, it's, like, actually somewhat efficient computationally, like, to do. So long term, I think that we're trying to make a large vision model. There's not really, like, a word, I guess. Like, we have LLMs, but I'm not really sure what the word is for vision or pixels if you're trying to make like a, a, you know, a, um, a multitask, like, vision model. And so the goal would be to try to, like, do three- three areas of a large vision model would be to be able to create things, edit things, and then understand things. And so, like, understanding would be like GPT-4V or if you're using something open source like CogVLM or there's all these amazing vision language models that are happening. Um, and then editing and creating are things that we've kind of talked about. But it would be really amazing at some point that if you made this, like, really amazing large vision model that it could do things, like, not just, like, create things like art but, like, maybe, uh, like, help some kind of robot, like, tr- traverse, like, some sort of path or, like, maze and then there's, like, things in the middle that are sort of, like, you know, maybe you have, like, a video camera or surveillance system or something and it's, like, able to understand what's going on in that. Um, but I think right now we're just really focused on graphics.
- Elad Gil
And then how do you think about the underlying architecture for what you're doing? Uh, 'cause, you know, traditionally, a lot of the, um, models have been, uh, diffusion model based and then increasingly, you know, you see people now starting to use, uh, transformer based architectures for some aspects of image gen and things like that. How do you think about where all this is heading from an architectural perspective and what sort of models will exist in the next year or two?
- Suhail Doshi
My kind of, like, controversial take perhaps is that, um, you know, there's this thing called DiT which people allegedly believe, like, Sora is based on and, um, and then there are variants of DiT. There's this thing called, like, I think, MM-DiT which I think Stable Diffusion
- 17:21 – 22:30
Capabilities of multimodal models
- Suhail Doshi
Three is supposed to be based on by that research team at Stability AI. And my overall feeling is that transformers are definitely... I think transformers are definitely, like, the right direction but I don't think that we're going to get a lot of u- enough utility if we're not, like, somewhat trying to figure out a way to combine the great amazing knowledge of, like, a language model and/or... and then just, like, using, you know, something like DiT which is completely trained from some kind of video caption or image caption to an image 'cause there's not enough, like, interpretable knowledge, I suppose. Like, you're not able to interpret anything about the input which a language model is really great at but then there's, like, these models that are just trained on these captions that emit images and it's ki- it's kind of unclear, like, how we might marry these two things. Um, and so it'd be... It sure would be nice if, like, somehow we could combine these two things, so I think the architecture is, like, mostly going... is most likely going to change. I don't think that DiT is, like, the right architecture but transformers certainly.
- Elad Gil
And just for people in the... who are listening, DiT just stands for Diffusion Transformers, so... um, in case people are wondering.
- Sarah Guo
One belief held by some of the large labs focused, uh, mostly on language today is, in the end, we end up with like one truly multimodal general model, right? That it is like not, like we don't end up with a language model and a video model and an audio model and an image model. Uh, it is any modality in, giga-brain knowledge, reasoning, long context, and any modality out. Like, uh, do you, do you believe in that world view or how do you see it differently?
- Suhail Doshi
I definitely think the models are going to be multimodal. And in fact, like that's kind of what I mean about, like, some of these models that are just like strictly trained through, like, you know, a diffusion transformer. Um, the, like, a diffusion transformer that's only taking, like, caption image inputs ju- just, like, completely lacks sort of somewhat, some knowledge. And then conversely, if you look at just, like, the language models, um, you know, we know that language is at, at a much lower dimensionality than, say, like, an image which has, like, all these pixels that, like, sort of tell us about lighting or physics or spatial relationships or size and shapes. So, you know, for example, if you were to, like, take a glass and, like, shatter it on the floor and then I asked you to describe it, and I asked, uh, and then I described it, we would both come up with, like, completely different descriptions if Elad had to go and, like, draw it, right? So we know that, like, pixels have an enormous amount of high information density compared to language, and language is just really between me and you. It's like a compressed way that you and I can converse with each other at a, at a higher, at somewhat of a higher bandwidth, right? Um, like we have an abstract view of, like, those, what those words mean. So I think that the models, there has to be some, like language is really great 'cause it's compressed information, and then, like the, and then, like, vision is really great because it's so information-rich, but it's been hard to annotate until recently. Um, it's only because vision-language models exist that it's now suddenly, uh, a lot easier to, like, sort of label or annotate or understand what's going on in an image. So I think that these two things are very, are going to be very likely married. The only question is, is like, do, does, does language... To me, it's, uh, kind of a question of, uh, does language... Language has this wonderful, um, trait where it's like you can use language to control things, which is pretty cool because of its low dimensionality. But my question would be like, I wonder if language will hit a ceiling, uh, or has like a lower ceiling than, say, vision, because it's very easy to get lots of pixel data, and that pixel data is, like, very, very high density.
- Sarah Guo
Very easy to get more pixel, like additional pixel data, in addition to the already collected data from the internet that's gone into these models.
- Suhail Doshi
Yeah, I mean, r- there's an assumption maybe, one assumption I tend to question is like whether the internet data's sufficient. Like the internet's very big, but maybe there's like some kind of mode collapse even with internet data, whereas like with vision, at least you can like, you can like make a robot that just like travels down the street and like just keeps taking pictures of everything. (laughs) Um, you can get like infinite training data, um, uh, wi- with vision, but it might be trickier to, like, sort of filter and clean internet data, especially as like more synthetic data ends up on the internet.
- Elad Gil
One, one other area that, um, I know you spend a lot of time in is music, you know, you make your own music and produce it, and, um, you know, there've been a number of different applications, Riffusion, Suno, et cetera, that have sort of come up on the music side. I was just curious how you've been paying attention to that, what you think of it, and, you know, where you think that whole space evolves to.
- Suhail Doshi
Yeah, I love, I love audio. That'd be the other kind of thing that I would go work on if it weren't for Playground. Um, partly I didn't work on music because the music industry is, like, only... The whole industry is $26 billion, so it was a little hard for me to figure out, like, how big, like, a music thing could be. But I definitely think audio is going to be enormous, like things like ElevenLabs are very interesting. But anyway, I, um, yeah, I mean, the way that I've been tr- I've been trying to find ways to s- figure out how to use it as a user, because that gives me like a really, that gives me a stronger sense of maybe where things are going. And so one thing that I've been waiting for for many years is that,
- 22:30 – 24:31
Parallels between audio AI tools and image-generation
- Suhail Doshi
um, instrumentals in music are actually very easy, uh, to get or to make. Um, you know, there, there's a wide variety of, like, quality of course, but, uh, generally instrumentals in a song, like if you hear a song from Taylor Swift or whoever, a rap song, those beats or those instrumentals are fairly easy to make or e- and, and easy. What's hard is to get lyrics and vocals, and that's always been like a difficulty of mine, like how do I find a singer and then how do I get them to, like, write lyrics and then sing it? That's much, uh, that's a much more scarce resource in the music world. And, um, and so for the first time with like something like Suno, uh, AI, it was really cool because it's the first time that I heard, you know, them be able to make like a rap song where the rapper has, like, good flow. Uh, flow is just like the swing of lyrics to a beat and, um, or like a, you know, you hear like actually really good lyrics that feel like very emotional, have the right breathiness, doesn't sound like, um, (laughs) like it's all made on like auto-tune, I guess. And so, um, I have this, like, little flow where I, like, make a song in Suno and then I use a different AI tool, it's AI tools all the way down, I guess, to, like, split the stems and just grab the lyrics, but then throw away the instrumental, and then I get to, like, make a song with the instrumental and the, uh, vocal. Anyway, so I made, I put some songs on my Twitter where, where like I basically tried to do this and it sounds, you know, so I can get to like a higher quality song, I guess, because I make the instrumental. It doesn't, there are still some, like, weird errors in the songs, but, um, but that's been like a really cool way to use AI in my opinion.
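For anyone curious what the stem-splitting step of a workflow like this might look like, here is one possible version using the open-source Demucs separator; the episode doesn't name the tool Suhail actually uses, and the input file name below is made up.

```python
import subprocess

# Split a generated song into a vocal stem and everything else.
# "suno_song.mp3" is a hypothetical input file; Demucs is just one example tool.
subprocess.run(["demucs", "--two-stems=vocals", "suno_song.mp3"], check=True)
# Demucs writes vocals.wav and no_vocals.wav under ./separated/<model>/suno_song/;
# the vocal stem can then be layered over a new, human-made instrumental in a DAW.
```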
- Elad Gil
Suhail, thanks so much for sharing everything that you're working on at Playground with us.
- Sarah Guo
Find us on Twitter @nopriorspod. Subscribe to our YouTube channel if you wanna see our faces. Follow the show on Apple Podcasts, Spotify, or wherever you listen. That way you get a new episode every week. And sign up for emails or find transcripts for every episode at no-priors.com.
Episode duration: 24:31