How To Build Generative AI Models Like OpenAI's Sora

Y Combinator · Mar 28, 2024 · 34m

Harj Taggar (host), Diana Hu (host), Jared Friedman (host)

Capabilities and limitations of OpenAI’s Sora text-to-video model
Technical underpinnings: transformers, diffusion, and space-time patching for video
How YC startups train foundation models on limited budgets and compute
Data strategies: synthetic data, high-quality niche corpora, and compression
Domain-specific foundation models (code, weather, biology, EEG, CAD, robotics)
The evolving role of expertise vs. self-education in AI entrepreneurship
Competitive landscape: startups vs. Big Tech in vertical AI applications

In this episode of Y Combinator’s Light Cone podcast, hosts Harj Taggar, Diana Hu, and Jared Friedman explore how founders can build powerful generative AI models like OpenAI’s Sora without billion-dollar budgets.

Building Powerful Generative AI Without Billions: Lessons Beyond Sora

The episode reviews OpenAI’s Sora video demos, highlighting breakthroughs in visual quality, physics simulation, and long-term temporal consistency while still noting subtle artifacts and errors.

They explain, at a high level, how Sora likely works by combining transformers, diffusion models, and ‘space-time patches’ to handle video as sequences of visual tokens over time.

Most of the discussion focuses on how early-stage YC startups, with modest budgets and little formal ML background, are nevertheless training impressive foundation models by creatively hacking data, compute, and expertise constraints.

The hosts argue that generative AI is becoming a general-purpose function approximator for domains like video, code, weather, biology, CAD, and EEG, and that motivated founders can still build competitive, domain-specific models without OpenAI-scale resources.

Key Takeaways

Text-to-video models like Sora already show near-photorealism and credible physics.

Sora can maintain consistent scenes over long clips, handle complex prompts, spell text correctly in frames, and model realistic motion for entities like dogs and robots, despite noticeable artifacts (e.g., a slightly off shuffle in the robot’s gait and a floating dog late in one clip).

Sora-style systems likely merge transformers, diffusion, and temporal modeling via space-time patches.

OpenAI appears to use visual transformers and diffusion over variable-sized 3D ‘patches’ (x, y, time), effectively treating chunks of video as tokens to learn spatial and temporal consistency at scale.
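The patching idea can be sketched in a few lines: split a video tensor into fixed-size 3D blocks and flatten each block into a token vector. The patch sizes, tensor layout, and function name below are illustrative assumptions; OpenAI has not published Sora’s actual tokenizer.

```python
import numpy as np

def spacetime_patchify(video, pt=4, ph=16, pw=16):
    """Split a video tensor (T, H, W, C) into flattened space-time
    patches of shape (pt, ph, pw, C) -- the 'visual tokens' a
    Sora-style transformer would attend over."""
    T, H, W, C = video.shape
    assert T % pt == 0 and H % ph == 0 and W % pw == 0
    # Carve the video into a (T/pt, H/ph, W/pw) grid of 3D blocks.
    v = video.reshape(T // pt, pt, H // ph, ph, W // pw, pw, C)
    # Bring the grid axes to the front, then flatten each block.
    v = v.transpose(0, 2, 4, 1, 3, 5, 6)
    return v.reshape(-1, pt * ph * pw * C)  # (num_patches, patch_dim)

video = np.zeros((16, 64, 64, 3))
tokens = spacetime_patchify(video)
print(tokens.shape)  # (64, 3072): 4*4*4 patches, each 4*16*16*3 values
```

Each row is then treated like a word token in a language model, which is what lets the same transformer machinery learn spatial and temporal consistency jointly.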

You do not need billions of dollars or PhDs to train meaningful foundation models.

Multiple YC startups in a single batch (e.g., Sonato, whose text-to-song model was built by recent college grads in two months) are training meaningful foundation models on startup-sized budgets.

Smart data strategies can substitute for massive compute.

Teams compress video to low resolution, focus on smaller but higher-quality corpora, and generate synthetic data to stretch limited compute budgets.

Foundation models are becoming general-purpose function approximators across the physical and biological world.

Startups are using similar architectures to model weather (Atmo), proteins (Diffuse Bio), EEG brain signals (Pyramidal-like company), and CAD/structural physics (Draft 8), often outperforming legacy physics-based or hand-engineered systems in cost and speed.

Vertical, explainable, or physics-grounded models offer room to compete with tech giants.

Founders can win in specialized domains—like explainable models, hardware copilot tools, or CAD/robotics—where general-purpose models don’t yet excel, by deeply understanding the domain and tailoring architecture, data, and evaluation to it.

Self-education and rapid iteration can put founders on the cutting edge quickly.

Examples like Playground’s pivot into AI and Suhail Doshi locking himself in his apartment to read papers illustrate that dedicated study plus access to GPUs can put founders on the cutting edge in short order.

Notable Quotes

You can actually be on the cutting edge in relatively short order, and that's an incredible blessing.

Host (about self-taught founders in AI)

How do YC companies build foundation models during the batch with just $500,000?

Host (framing the central question of the episode)

This is literally built by 21-year-old new college grads, and they built this thing in two months.

Host (on Sonato’s text-to-song model)

If you're looking for a reason why you can't succeed, guess what? You're right.

Host (on mindset vs. opportunity in AI)

You can actually compete with OpenAI for very valuable, like verticals and use cases by training your own model without having to be Sam Altman or having a hundred million dollars.

Host

Questions Answered in This Episode

What concrete technical steps and resources would a small team need to build a vertical foundation model similar to the YC examples discussed?

How far can synthetic data be pushed before it degrades model quality, and how should founders validate when synthetic data is helping versus hurting?

In which domains is it still clearly better to fine-tune existing large models rather than train a custom model from scratch?

How will physics-grounded simulators like Sora change robotics development compared with traditional reinforcement learning in the real world or in game engines?

What standards or techniques will be needed to make vertical foundation models—especially those in biology and healthcare—safe, explainable, and regulatable?


Transcript Preview

Harj Taggar

A lot of the sci-fi stuff is actually now becoming possible. What happens when you have a model that's capable of simulating real world physics?

Diana Hu

Wouldn't it be cool if this podcast were actually an Infinity AI video?

Jared Friedman

One thing I noticed that, like, the lip syncing is, like, extremely accurate. Like, it really looks like he's actually speaking Hindi.

Diana Hu

How do YC companies build foundation models during the batch with just $500,000?

Jared Friedman

This is literally built by 21-year-old new college grads, and they built this thing in two months. I think he, like, locked himself in his apartment for a month and just read AI papers.

Speaker

You can actually be on the cutting edge in relatively short order, and that's an incredible blessing. Welcome back to another episode of The Light Cone. Today, we're talking about generative AI. First there was GPT-4, then there was Midjourney for image generation, and now we're making the leap into video. Harj, we got access to Sora, and we're about to take a look at some clips that they generated just for us.

Harj Taggar

Yeah, should we take a look? Okay, so here's the first one. The prompt is, "It's the year 2050. A humanoid robot, acting as a household helper, walks someone's golden retriever down a pretty, tree-lined suburban street." What do we think?

Speaker

I like how it actually spells out "helper." It's like a flex.

Jared Friedman

Yeah.

Speaker

Like, "I can spell now."

Jared Friedman

Yeah, which was not true with the image models, like-

Harj Taggar

It would always screw up the text in the image.

Jared Friedman

Yeah.

Harj Taggar

Yeah, that's true.

Jared Friedman

Stable Diffusion, DALL·E were, were notoriously bad at spelling text, so that is a major advance that no one's really talked about yet.

Harj Taggar

I mean, it's wild how high definition it is. Like, that's almost realistic.

Diana Hu

And the other really cool thing is the physics. The way the robot walks, for the most part, is-

Jared Friedman

Yeah.

Diana Hu

... very accurate.

Jared Friedman

Accurate.

Diana Hu

You do notice a little kind of, like, shuffle that's a little bit off, but for the most part, it's believable.

Jared Friedman

And the way the golden retriever moves, I have a golden retriever.

Harj Taggar

Yeah, but look at the tail.

Jared Friedman

So I can personally vouch that, like, they perfectly modeled the, like-

Harj Taggar

(laughs) Yeah, you have one, right? So you would know.

Jared Friedman

(laughs)

Diana Hu

Like your dog, right?

Jared Friedman

Yeah, this is perfect, is a perfect representation of how a golden retriever walks. I also like that, um, with, with DALL·E and Stable Diffusion, as you got... As you made your prompts longer and longer, it would just start ignoring it, and not actually doing exactly what you told it to do. And like, we gave it a very specific prompt here, and it did exactly the thing that we told it to.

Harj Taggar

You can see it's not... It's still not exactly perfect.

Jared Friedman

Y-

Harj Taggar

So, I think towards the end, you see as, like, a floating dog or something in there.

Jared Friedman

Okay. I- I- I was gonna call out a couple other imperfections here-
