No Priors Ep. 60 | With Playground AI Founder Suhail Doshi
At a glance
WHAT IT’S REALLY ABOUT
Playground AI Bets Big On Pixels, Editing, And Open-Source Models
- Suhail Doshi, founder of Playground AI, discusses why he chose to build a company focused on image generation and editing rather than language or music, seeing a gap in long-term, dedicated effort around pixels. He explains how Playground trains its own diffusion models from scratch, pushing existing architectures like SDXL with new sampling tricks, meticulous data curation, and strong aesthetic judgment. A major theme is moving from “text-to-art loot boxes” toward high-utility workflows centered on editing, consistency, and integrating real and synthetic imagery. Doshi also outlines a long-term vision for a large vision model that can create, edit, and understand pixels, and shares views on future architectures, multimodality, and adjacent areas like AI-generated music.
IDEAS WORTH REMEMBERING
5 ideas
Differentiate by focusing deeply on an underserved modality.
Doshi deliberately avoided crowded language-model competition and music’s smaller market, instead committing to images where few were obsessively improving models and tooling over time.
Move from one-shot ‘text-to-art’ to iterative image editing.
Playground’s strategy is to shift away from random, loot-box-style generations toward workflows where users can reliably edit, stylize, and blend real and synthetic imagery for higher utility.
Architectural tricks matter, but curated data and taste dominate.
Techniques like EDM sampling, power EMA, offset noise, and DPO can yield big aesthetic gains, yet Doshi emphasizes that meticulous, high-quality supervised fine-tuning and aesthetic judgment are the primary performance drivers.
Current model evaluations underrepresent real user needs.
Benchmarks often miss practical gaps—like photorealistic faces or logos—so Playground relies on large-scale visual inspection, iterative eval improvements, and in-product user preference data to guide training.
Vision models should converge on create–edit–understand capabilities.
Playground’s long-term goal is a large vision model that can generate graphics, robustly edit existing content, and semantically understand images and video, similar in spirit to GPT-4V but focused on pixels.
WORDS WORTH SAVING
5 quotes
Right now it's kind of not even text to image, it's like text to art.
— Suhail Doshi
It turns out that it's just way more complex than that, and way more complex than I even imagined.
— Suhail Doshi
It feels a little like a loot box right now and I think that because it's so much of a loot box, it feels like it's too much effort to get something that you really, really want.
— Suhail Doshi
The number one trick is really just that last phase of supervised fine-tune where you're finding really great curated data.
— Suhail Doshi
We know that pixels have an enormous amount of high information density compared to language, and language is just really between me and you—it's like a compressed way that you and I can converse.
— Suhail Doshi
High quality AI-generated summary created from speaker-labeled transcript.