Building The World's Best Image Diffusion Model

Suhail Doshi, a YC alumnus who previously founded Mixpanel and Mighty, has created a state-of-the-art (SOTA) AI image diffusion model with Playground. The app lets you talk to it as you would to a graphic designer and helps you create imagery and text for a wide variety of use cases. In this episode of Lightcone, Suhail sits down with the hosts to talk about his experience building Playground with his team, and what it takes to make a SOTA model.

Try Playground: https://playground.com/design

Read Playground V3 Paper: https://arxiv.org/pdf/2409.10695

Chapters (Powered by https://bit.ly/chapterme-yc)

0:00 Intro
1:07 What is Playground?
1:47 What Garry was able to make using Playground
7:04 The focus on text accuracy
10:44 Building a marketplace for Playground
16:00 Prompts are like HTML for graphics
22:25 Creating new design professions
26:13 Using tailwinds of what is happening in language
30:06 Problems with aesthetics evals
32:42 The commercial applications
33:54 When the users you get are not the users you want
40:30 Reflections on going through YC twice
48:30 Running a research lab/startup hybrid vs a pure startup
53:35 What it takes to make a state-of-the-art model
55:09 Outro

Suhail Doshi (guest), Garry Tan (host), Harj Taggar (host), Jared Friedman (host)

Sep 19, 2024 · 55m

WHAT IT’S REALLY ABOUT

Playground Reinvents Image Generation As A True Graphic Design Partner

The conversation centers on Playground v3, a state-of-the-art image diffusion model and design product optimized for real-world graphic design tasks rather than artistic toy use. Founder Suhail Doshi explains how the team rebuilt the entire stack—architecture, captioning, UX, and marketplace—to achieve unprecedented text accuracy, prompt understanding, and designer‑like interaction. They emphasize shifting from raw model access and prompt engineering toward visual templates, natural language edits, and a creator ecosystem. Alongside technical details, Doshi shares strategic lessons on choosing users, pivoting from failed directions, and marrying research rigor with product usefulness.

IDEAS WORTH REMEMBERING

5 ideas

Text accuracy and prompt adherence unlock real commercial design use cases.

By prioritizing flawless text rendering and precise spatial control, Playground moves beyond ‘toy art’ into practical workflows like logos, posters, and T‑shirts—areas where text is indispensable.

Visual-first, template-based UX removes the need for user prompt engineering.

Instead of forcing users to learn long, arcane prompts, Playground lets them start from curated templates or uploaded designs and then refine via plain-English instructions, dramatically reducing friction.

Owning the full stack—from captioner to architecture—is necessary for SOTA.

Simply scaling data and compute isn’t enough; Playground rebuilt the VAE, text encoder, and diffusion core, and created a new state-of-the-art captioner to capture small details like kerning, film grain, and facial expressions.

Being maniacal about tiny qualitative details compounds into better models.

The team obsessively debates skin texture, kerning, film grain, and positional nuances; this relentless refinement across hundreds of dimensions lets the model generalize to higher overall quality.

You can and should choose your users and markets—sometimes against early demand.

Playground deliberately avoided becoming a porn-focused platform and, drawing on Mixpanel’s experience with gaming analytics, redirected toward the much larger, more durable graphic design market (e.g., Canva-scale).

WORDS WORTH SAVING

5 quotes

To get to SOTA, you basically have to be maniacal about every detail.

— Suhail Doshi

We decided that one core belief was that the product should be visual first, not text first.

— Suhail Doshi

We should be doing the prompt engineering for users.

— Suhail Doshi

If we listened to what the users wanted, we would have to build a porn company.

— Suhail Doshi

It’s safe to say that this is the worst the model will ever be.

— Suhail Doshi

Playground v3’s focus on text accuracy, prompt understanding, and spatial reasoning

Product and UX shift from raw prompts to visual-first templates and natural-language editing

Technical overhaul: new architecture, captioning system, and departure from CLIP/Stable Diffusion

Use cases in graphic design (logos, T‑shirts, stickers) and competition with Canva/Midjourney

Building a creator marketplace and new ‘AI designer’ profession

Strategic lessons from previous startups (Mixpanel, Mighty) and choosing the right users/markets

Balancing research wandering with commercial product needs and new evaluation challenges

High quality AI-generated summary created from speaker-labeled transcript.
