Building The World's Best Image Diffusion Model

Y Combinator, Sep 19, 2024, 55m

Suhail Doshi (guest), Garry Tan (host), Harj Taggar (host), Jared Friedman (host)

Playground v3’s focus on text accuracy, prompt understanding, and spatial reasoning
Product and UX shift from raw prompts to visual-first templates and natural-language editing
Technical overhaul: new architecture, captioning system, and departure from CLIP/Stable Diffusion
Use cases in graphic design (logos, T‑shirts, stickers) and competition with Canva/Midjourney
Building a creator marketplace and new ‘AI designer’ profession
Strategic lessons from previous startups (Mixpanel, Mighty) and choosing the right users/markets
Balancing research wandering with commercial product needs and new evaluation challenges

In this episode of Y Combinator, featuring Suhail Doshi and Garry Tan, "Building The World's Best Image Diffusion Model" explores how Playground reinvents image generation as a true graphic design partner.

Playground Reinvents Image Generation As A True Graphic Design Partner

The conversation centers on Playground v3, a state-of-the-art image diffusion model and design product optimized for real-world graphic design tasks rather than artistic toy use. Founder Suhail Doshi explains how the team rebuilt the entire stack—architecture, captioning, UX, and marketplace—to achieve unprecedented text accuracy, prompt understanding, and designer‑like interaction. They emphasize shifting from raw model access and prompt engineering toward visual templates, natural language edits, and a creator ecosystem. Alongside technical details, Doshi shares strategic lessons on choosing users, pivoting from failed directions, and marrying research rigor with product usefulness.

Key Takeaways

Text accuracy and prompt adherence unlock real commercial design use cases.

By prioritizing flawless text rendering and precise spatial control, Playground moves beyond ‘toy art’ into practical workflows like logos, posters, and T‑shirts—areas where text is indispensable.

Visual-first, template-based UX removes the need for user prompt engineering.

Instead of forcing users to learn long, arcane prompts, Playground lets them start from curated templates or uploaded designs and then refine via plain-English instructions, dramatically reducing friction.

Owning the full stack—from captioner to architecture—is necessary for SOTA.

Simply scaling data and compute isn’t enough; Playground rebuilt the VAE, text encoder, and diffusion core, and created a new state-of-the-art captioner to capture small details like kerning, film grain, and facial expressions.

Being maniacal about tiny qualitative details compounds into better models.

The team obsessively debates skin texture, kerning, film grain, and positional nuances; this relentless refinement across hundreds of dimensions lets the model generalize to higher overall quality.

You can and should choose your users and markets—sometimes against early demand.

Playground deliberately avoided becoming a porn-focused platform and, drawing on Mixpanel’s experience with gaming analytics, redirected toward the much larger, more durable graphic design market (e. ...

Prompt expansion and rich training captions let users be ‘lazy’ but still get quality.

Internally, Playground explodes simple user prompts into very detailed multi-caption descriptions during training and inference, enabling strong results from short, natural inputs like “nature scene.”

Standard evals break when models start obeying prompts too well.

Because Playground adheres strictly to prompts (including constraints that hurt composition), users sometimes rate rival outputs as more ‘aesthetic,’ revealing a new entanglement problem between adherence and perceived beauty.
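
The entanglement described in the last takeaway can be made concrete with a small sketch. This is not Playground's evaluation method; it only assumes that raters cast pairwise votes separately on two axes, and that the model names ("A", "B"), the axis names, and the votes themselves are made-up toy values chosen for illustration.

```python
from collections import defaultdict

def win_rates(votes):
    """Return {axis: {model: share of pairwise votes won on that axis}}."""
    counts = defaultdict(lambda: defaultdict(int))
    for axis, winner in votes:
        counts[axis][winner] += 1
    rates = {}
    for axis, per_model in counts.items():
        total = sum(per_model.values())
        rates[axis] = {model: n / total for model, n in per_model.items()}
    return rates

# Toy, invented votes: each tuple is (rating axis, model that won the comparison).
votes = [
    ("adherence", "A"), ("adherence", "A"), ("adherence", "A"), ("adherence", "B"),
    ("aesthetics", "B"), ("aesthetics", "B"), ("aesthetics", "A"), ("aesthetics", "B"),
]

rates = win_rates(votes)
for axis, per_model in rates.items():
    print(axis, per_model)
```

In this toy data, model A wins most adherence votes while model B wins most aesthetics votes. A single combined preference score would hide that split, which is the kind of problem the takeaway points at.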

Notable Quotes

“To get to SOTA, you basically have to be maniacal about every detail.”
— Suhail Doshi

“We decided that one core belief was that the product should be visual first, not text first.”
— Suhail Doshi

“We should be doing the prompt engineering for users.”
— Suhail Doshi

“If we listened to what the users wanted, we would have to build a porn company.”
— Suhail Doshi

“It’s safe to say that this is the worst the model will ever be.”
— Suhail Doshi

Questions Answered in This Episode

How might Playground’s designer-like interface change the role and workflow of human graphic designers over the next five years?

What new evaluation methods could fairly separate prompt adherence from aesthetic preference in image models?

How far can template-based, visual-first interfaces scale before advanced users again need raw prompt or model access?

What guardrails and business choices are needed to prevent powerful image models from being pulled toward undesirable but high-demand use cases like explicit content?

If image prompt understanding is now at roughly ‘GPT‑3 level,’ what would a ‘GPT‑4 level’ leap look like for visual models in terms of capabilities and applications?

Transcript Preview

Suhail Doshi

I think we thought the product was going to be one way, and then we literally ripped it all up in a month and a half or so before release. We were sort of, like, lost in the jungle for a moment. (laughs) Like a bit of a panic.

Garry Tan

There's a lot of unsolved problems, basically. I mean, the, you know, even this version of it, you know, people are gonna try it, and then they might be blown away by it, but like, the next one's gonna be even crazier.

Suhail Doshi

To get to SOTA, you basically have to be maniacal about, like, every detail. There are going to be some people that train their models and they get cool text generation, but the kerning is off. Are you the kind of person that will care about the kerning being off? Or are you the kind of person that is okay with it, like, or you don't even notice it?

Garry Tan

Welcome back to another episode of The Light Cone. I'm Garry. This is Jared, Harj, and Diana. And collectively, we have funded companies worth hundreds of billions of dollars, usually just with one or two people just starting out, and we're in the middle of this crazy AI revolution. And so, we thought we would invite our friend Suhail Doshi, founder and CEO of Playground, which is the state-of-the-art image generation model, with also a state-of-the-art user experience, and it just launched. So, how you feeling, Suhail?

Suhail Doshi

Very under pressure right now.

Garry Tan

(laughs)

Speaker

(laughs)

Suhail Doshi

Uh, excited though.

Garry Tan

That's good then-

Suhail Doshi

Yeah.

Speaker

Yeah.

Garry Tan

... so you're sort of like a startup founder.

Suhail Doshi

Yes.

Speaker

Right. Yeah.

Garry Tan

(laughs) Which is normal. Maybe the best way to start off is to, uh, look at some examples of the images that you were able to generate. Um, and this is stuff sort of right off the presses.

Speaker

(laughs)

Garry Tan

So, uh, at Y Combinator, I, uh, also am one of the group partners, so I fund a number of companies, uh, every batch. I funded about 15 for the summer batch. And so what we're looking at here is one of the T-shirt designs I made. As you can see, there's a GPU, and it was based on one of the core templates in your library. I like metal, so this, uh, very much (laughs) spoke to me. This one was off of a sticker design, and I guess I just really liked that sword, and what I was able to do is, uh, add GPU fans.

Speaker

(laughs)

Suhail Doshi

Love it. I love it.

Garry Tan

And so that's one of the noteworthy things about Playground. You can upload an image. It'll sort of extract, um, the essence of like, sort of the aesthetic, and some of the features of it-

Speaker

This one-

Garry Tan

... and then you can remix it.

Speaker

... feels like a, feels like a tattoo. (laughs)

Suhail Doshi

(laughs)

Garry Tan

Yeah, exactly.

Speaker

(laughs)

Harj Taggar

Do you remember what you prompted it with to get those?

Garry Tan

Oh yeah, I- I basically... So the cool thing about Playground to create this, was I- I picked, uh, a default template that I liked, um, and I think it only had the sword and sort of this ribbon, and I said, "Make it say Houstan on the ribbon, and, um, add a GPU (laughs) with two fans." I was very specific. I wanted a two-fan GPU, and that's one of the things that you'll see in all these designs. This is actually the T-shirt that Houstan itself actually chose-
