No Priors Ep. 60 | With Playground AI Founder Suhail Doshi

Multimodal models are making it possible to create AI art and augment creativity across artistic mediums. This week on No Priors, Sarah and Elad talk with Suhail Doshi, the founder of Playground AI, an image generator and editor. Playground AI has been open-sourcing foundation diffusion models, most recently releasing Playground V2.5. In this episode, Suhail talks with Sarah and Elad about how the integration of language and vision models enhances the multimodal capabilities, how the Playground team thought about creating a user-friendly interface to make AI-generated content more accessible, and the future of AI-powered image generation and editing.

Sign up for new podcasts every week. Email feedback to show@no-priors.com

Follow us on Twitter: @NoPriorsPod | @Saranormous | @EladGil | @Suhail

Show Notes:

0:00 Introduction
0:52 Focusing on image generation
3:01 Differentiating from other AI creative tools
5:58 Training a Stable Diffusion model
8:31 Long term vision for Playground AI
15:00 Evolution of AI architecture
17:21 Capabilities of multimodal models
22:30 Parallels between audio AI tools and image-generation

Sarah Guo (host) | Suhail Doshi (guest) | Elad Gil (host)

Apr 18, 2024 | 24m

WHAT IT’S REALLY ABOUT

Playground AI Bets Big On Pixels, Editing, And Open-Source Models

Suhail Doshi, founder of Playground AI, discusses why he chose to build a company focused on image generation and editing rather than language or music: he saw that few teams were making a long-term, dedicated effort around pixels. He explains how Playground trains its own diffusion models from scratch, pushing existing architectures like SDXL with new sampling tricks, meticulous data curation, and strong aesthetic judgment. A major theme is moving from “text-to-art loot boxes” toward high-utility workflows centered on editing, consistency, and integrating real and synthetic imagery. Doshi also outlines a long-term vision for a large vision model that can create, edit, and understand pixels, and shares views on future architectures, multimodality, and adjacent areas like AI-generated music.

IDEAS WORTH REMEMBERING

5 ideas

Differentiate by focusing deeply on an underserved modality.

Doshi deliberately avoided crowded language-model competition and music’s smaller market, instead committing to images where few were obsessively improving models and tooling over time.

Move from one-shot ‘text-to-art’ to iterative image editing.

Playground’s strategy is to shift away from random, loot-box-style generations toward workflows where users can reliably edit, stylize, and blend real and synthetic imagery for higher utility.

Architectural tricks matter, but curated data and taste dominate.

Techniques like EDM sampling, power EMA, offset noise, and DPO can yield big aesthetic gains, yet Doshi emphasizes that meticulous, high-quality supervised fine-tuning and aesthetic judgment are the primary performance drivers.

Current model evaluations underrepresent real user needs.

Benchmarks often miss practical gaps—like photorealistic faces or logos—so Playground relies on large-scale visual inspection, iterative eval improvements, and in-product user preference data to guide training.

Vision models should converge on create–edit–understand capabilities.

Playground’s long-term goal is a large vision model that can generate graphics, robustly edit existing content, and semantically understand images and video, similar in spirit to GPT-4V but focused on pixels.

WORDS WORTH SAVING

5 quotes

Right now it's kind of not even text to image, it's like text to art.

— Suhail Doshi

It turns out that it's just way more complex than that, and way more complex than I even imagined.

— Suhail Doshi

It feels a little like a loot box right now and I think that because it's so much of a loot box, it feels like it's too much effort to get something that you really, really want.

— Suhail Doshi

The number one trick is really just that last phase of supervised fine-tune where you're finding really great curated data.

— Suhail Doshi

We know that pixels have an enormous amount of high information density compared to language, and language is just really between me and you—it's like a compressed way that you and I can converse.

— Suhail Doshi

Founding Playground AI and choosing images over language or music

Text-to-art vs. broader text-to-image utility and editing workflows

Training diffusion models from scratch and pushing SDXL architecture

Aesthetics, evaluation challenges, and data curation with user feedback

Strategic focus on image editing, consistency, and practical use cases

Long-term vision for large vision models (create, edit, understand pixels)

Perspectives on multimodal architectures, transformers, and AI in music/audio

High quality AI-generated summary created from speaker-labeled transcript.
