No Priors Ep. 60 | With Playground AI Founder Suhail Doshi

No Priors | Apr 18, 2024 | 24m

Sarah Guo (host), Suhail Doshi (guest), Elad Gil (host)

Founding Playground AI and choosing images over language or music
Text-to-art vs. broader text-to-image utility and editing workflows
Training diffusion models from scratch and pushing SDXL architecture
Aesthetics, evaluation challenges, and data curation with user feedback
Strategic focus on image editing, consistency, and practical use cases
Long-term vision for large vision models (create, edit, understand pixels)
Perspectives on multimodal architectures, transformers, and AI in music/audio

In this episode of No Priors, featuring Sarah Guo and Suhail Doshi, the conversation covers Playground AI's bet on pixels, image editing, and open-source models.

Playground AI Bets Big On Pixels, Editing, And Open-Source Models

Suhail Doshi, founder of Playground AI, discusses why he chose to build a company focused on image generation and editing rather than language or music, seeing a gap in long-term, dedicated effort around pixels. He explains how Playground trains its own diffusion models from scratch, pushing existing architectures like SDXL with new sampling tricks, meticulous data curation, and strong aesthetic judgment. A major theme is moving from “text-to-art loot boxes” toward high-utility workflows centered on editing, consistency, and integrating real and synthetic imagery. Doshi also outlines a long-term vision for a large vision model that can create, edit, and understand pixels, and shares views on future architectures, multimodality, and adjacent areas like AI-generated music.

Key Takeaways

Differentiate by focusing deeply on an underserved modality.

Doshi deliberately avoided crowded language-model competition and music’s smaller market, instead committing to images where few were obsessively improving models and tooling over time.

Move from one-shot ‘text-to-art’ to iterative image editing.

Playground’s strategy is to shift away from random, loot-box-style generations toward workflows where users can reliably edit, stylize, and blend real and synthetic imagery for higher utility.

Architectural tricks matter, but curated data and taste dominate.

Techniques like EDM sampling, power EMA, offset noise, and DPO can yield big aesthetic gains, yet Doshi emphasizes that meticulous, high-quality supervised fine-tuning and aesthetic judgment are the primary performance drivers.

Current model evaluations underrepresent real user needs.

Benchmarks often miss practical gaps—like photorealistic faces or logos—so Playground relies on large-scale visual inspection, iterative eval improvements, and in-product user preference data to guide training.

Vision models should converge on create–edit–understand capabilities.

Playground’s long-term goal is a large vision model that can generate graphics, robustly edit existing content, and semantically understand images and video, similar in spirit to GPT-4V but focused on pixels.

Future architectures must marry language knowledge with pixel richness.

Doshi believes pure diffusion transformers trained only on caption–image pairs lack rich, interpretable world knowledge, and that the winning approach will combine transformer-based vision with language models in a multimodal system.

AI can unlock scarce components in creative workflows, like vocals and lyrics.

In music, he uses tools like Suno to generate vocals and lyrics—which are harder to source than instrumentals—and then recomposes around them, illustrating how AI can fill the scarcest, most bottlenecked creative roles.

Notable Quotes

“Right now it's kind of not even text to image, it's like text to art.”
— Suhail Doshi

“It turns out that it's just way more complex than that, and way more complex than I even imagined.”
— Suhail Doshi

“It feels a little like a loot box right now and I think that because it's so much of a loot box, it feels like it's too much effort to get something that you really, really want.”
— Suhail Doshi

“The number one trick is really just that last phase of supervised fine-tune where you're finding really great curated data.”
— Suhail Doshi

“We know that pixels have an enormous amount of high information density compared to language, and language is just really between me and you—it's like a compressed way that you and I can converse.”
— Suhail Doshi

Questions Answered in This Episode

How would Playground’s editing-first philosophy change workflows in design, marketing, and everyday photo use compared to current text-to-image tools?

What concrete steps could the industry take to build evaluations that correlate better with real-world aesthetic preferences and utility?

How might a unified large vision model practically interact with a large language model—would they be separate components or a single tightly integrated system?

What business models make sense around open-sourcing high-quality diffusion models while still sustaining the heavy compute and research costs?

If pixels provide far richer data than language, could vision-centric training ultimately surpass language-centric AGI approaches in capability or generality?

Transcript Preview

Sarah Guo

(music plays) Hi, listeners, and welcome to another episode of No Priors. Today, we're talking to Suhail Doshi, the founder of Playground AI, an image generator and editor. They've been open sourcing foundation diffusion models, most recently Playground 2.5. We're so excited to have Suhail on to talk about building this model in conjunction with the Playground community and the future of AI pixel generation. Welcome, Suhail.

Suhail Doshi

Thanks for having me.

Sarah Guo

Uh, so this is your third company. You started Mixpanel, Mighty, now you're working on Playground. Um, how did you decide this was the next thing?

Suhail Doshi

I- I think like back in April of 2022, I think that was just a place at that time it was like GPT-3.5 kind of came out and then DALL-E 2 came out, um, and I was actually working on the second company, Mighty, and at that time I was trying to figure out how to like do something with AI inside of a browser address bar. But when I saw DALL-E 2 came out, it was just this like very big strange eye-opening moment where I think a lot of people didn't think that we'd be able to do like weird interesting art things, um, so soon. And so and I think then soon after that I think Stable Diffusion came out around June or July of that same year, and I got early access, maybe a couple weeks access to early... to, uh, SD 1.4, and it just kind of blew my mind what, what people could do with that. And I just thought that it seemed odd that all of this was being done in a Google Colab notebook. Shouldn't there be like a UI that makes it really easy? That sort of thing.

Sarah Guo

From the start, were you just thinking, "We will open source. We will, um, train our own models from scratch"? Did you think about other, um, other modalities?

Suhail Doshi

Um, yeah. It was... I mean there have been a lot of people that thought I should do like something in music, but I just... I... 'Cause music has been like a huge hobby of mine for like six years or so, I like produce music, but I just couldn't wrap my, my brain around like what useful thing I would end up making for people. Although now there's like a lot of very interesting, cool, useful things for music. (laughs) Um, and, uh, and then it seemed like a lot of people were very focused on language, and I had really enjoyed... I already work with lots of creative tools like when I was in high school I used to make logos or I would make music or whatever. Um, so I was excited that finally I could find something where it was a combination of creativity, tooling. Images have like really amazing like built-in distribution. People want to share those kinds of things. So it ended up just like being this perfect thing that I was excited to work on.
