
How To Build Generative AI Models Like OpenAI's Sora
Harj Taggar (host), Diana Hu (host), Jared Friedman (host)
In this episode of Y Combinator's Light Cone podcast, hosts Harj Taggar and Diana Hu explore how startups can build powerful generative AI models without billions of dollars, using OpenAI's Sora as the jumping-off point.
Building Powerful Generative AI Without Billions: Lessons Beyond Sora
The episode reviews OpenAI’s Sora video demos, highlighting breakthroughs in visual quality, physics simulation, and long-term temporal consistency while still noting subtle artifacts and errors.
They explain, at a high level, how Sora likely works by combining transformers, diffusion models, and ‘space-time patches’ to handle video as sequences of visual tokens over time.
Most of the discussion focuses on how early-stage YC startups, with modest budgets and little formal ML background, are nevertheless training impressive foundation models by creatively hacking data, compute, and expertise constraints.
The hosts argue that generative AI is becoming a general-purpose function approximator for domains like video, code, weather, biology, CAD, and EEG, and that motivated founders can still build competitive, domain-specific models without OpenAI-scale resources.
Key Takeaways
Text-to-video models like Sora already show near-photorealism and credible physics.
Sora can maintain consistent scenes over long clips, handle complex prompts, spell text correctly in frames, and model realistic motion for entities like dogs and robots, despite noticeable artifacts (e.g., a floating dog appearing near the end of one clip).
Sora-style systems likely merge transformers, diffusion, and temporal modeling via space-time patches.
OpenAI appears to use visual transformers and diffusion over variable-sized 3D ‘patches’ (x, y, time), effectively treating chunks of video as tokens to learn spatial and temporal consistency at scale.
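To make the space-time patch idea concrete, below is a minimal sketch in Python/NumPy of how a clip could be cut into 3D (time, height, width) patches and flattened into one token per patch. The function name, patch sizes, and divisibility assumption are illustrative only, not details from the episode or from OpenAI's actual implementation.

import numpy as np

def spacetime_patchify(video, pt, ph, pw):
    # video: float array of shape (T, H, W, C); pt/ph/pw are the patch
    # extents along time, height, and width. For simplicity this sketch
    # assumes T, H, and W divide evenly by the patch sizes.
    T, H, W, C = video.shape
    assert T % pt == 0 and H % ph == 0 and W % pw == 0
    # Carve the clip into a grid of 3D patches, then group patch dims together.
    grid = video.reshape(T // pt, pt, H // ph, ph, W // pw, pw, C)
    grid = grid.transpose(0, 2, 4, 1, 3, 5, 6)  # (nT, nH, nW, pt, ph, pw, C)
    # Flatten: one row per patch -- the "tokens" a transformer would consume.
    return grid.reshape(-1, pt * ph * pw * C)

# Example: a 16-frame 64x64 RGB clip becomes 128 tokens of 1,536 values each.
video = np.random.rand(16, 64, 64, 3).astype(np.float32)
tokens = spacetime_patchify(video, pt=8, ph=8, pw=8)
print(tokens.shape)  # (128, 1536)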
You do not need billions of dollars or PhDs to train meaningful foundation models.
Multiple YC startups in a single batch (e.g., Sonato's text-to-song model, Infinity AI's video generation) trained credible foundation models during the batch on roughly $500,000, without large research teams or PhDs.
Smart data strategies can substitute for massive compute.
Teams compress video to low resolution, focus on smaller but higher-quality corpora, and otherwise trade raw scale for data quality, so that modest compute budgets go much further.
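As a rough illustration of that trade (a hypothetical pipeline, not any specific startup's code), the Python sketch below shrinks a clip roughly 128x by average-pooling each frame down to low resolution and keeping every other frame:

import numpy as np

def compress_clip(frames, out_hw=64, stride=2):
    # frames: float array of shape (T, H, W, C). Keep every `stride`-th frame,
    # then average-pool each frame down to out_hw x out_hw. Assumes H and W
    # divide evenly by out_hw.
    frames = frames[::stride]                  # temporal subsampling
    T, H, W, C = frames.shape
    fh, fw = H // out_hw, W // out_hw          # pooling block sizes
    pooled = frames.reshape(T, out_hw, fh, out_hw, fw, C).mean(axis=(2, 4))
    return pooled                              # (T//stride, out_hw, out_hw, C)

clip = np.random.rand(32, 512, 512, 3).astype(np.float32)  # a stand-in clip
small = compress_clip(clip)
print(clip.nbytes // small.nbytes)  # 128: 64x spatial times 2x temporal savings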
Foundation models are becoming general-purpose function approximators across the physical and biological world.
Startups are using similar architectures to model weather (Atmo), proteins (Diffuse Bio), EEG brain signals (Pyramidal-like company), and CAD/structural physics (Draft 8), often outperforming legacy physics-based or hand-engineered systems in cost and speed.
Vertical, explainable, or physics-grounded models offer room to compete with tech giants.
Founders can win in specialized domains—like explainable models, hardware copilot tools, or CAD/robotics—where general-purpose models don’t yet excel, by deeply understanding the domain and tailoring architecture, data, and evaluation to it.
Self-education and rapid iteration can put founders on the cutting edge quickly.
Examples like Playground's pivot into AI and Suhail Doshi locking himself away to read papers illustrate that dedicated study plus access to GPUs can put motivated founders at the cutting edge in relatively short order.
Notable Quotes
“You can actually be on the cutting edge in relatively short order, and that's an incredible blessing.”
— Host (about self-taught founders in AI)
“How do YC companies build foundation models during the batch with just $500,000?”
— Host (framing the central question of the episode)
“This is literally built by 21-year-old new college grads, and they built this thing in two months.”
— Host (on Sonato’s text-to-song model)
“If you're looking for a reason why you can't succeed, guess what? You're right.”
— Host (on mindset vs. opportunity in AI)
“You can actually compete with OpenAI for very valuable verticals and use cases by training your own model, without having to be Sam Altman or have a hundred million dollars.”
— Host
Questions Answered in This Episode
What concrete technical steps and resources would a small team need to build a vertical foundation model similar to the YC examples discussed?
How far can synthetic data be pushed before it degrades model quality, and how should founders validate when synthetic data is helping versus hurting?
In which domains is it still clearly better to fine-tune existing large models rather than train a custom model from scratch?
How will physics-grounded simulators like Sora change robotics development compared with traditional reinforcement learning in the real world or in game engines?
What standards or techniques will be needed to make vertical foundation models—especially those in biology and healthcare—safe, explainable, and regulatable?
Transcript Preview
A lot of the sci-fi stuff is actually now becoming possible. What happens when you have a model that's capable of simulating real-world physics?
Wouldn't it be cool if this podcast were actually an Infinity AI video?
One thing I noticed that, like, the lip syncing is, like, extremely accurate. Like, it really looks like he's actually speaking Hindi.
How do YC companies build foundation models during the batch with just $500,000?
This is literally built by 21-year-old new college grads, and they built this thing in two months. I think he, like, locked himself in his apartment for a month and just read AI papers.
You can actually be on the cutting edge in relatively short order, and that's an incredible blessing. Welcome back to another episode of The Light Cone. Today, we're talking about generative AI. First there was GPT-4, then there was Midjourney for image generation, and now we're making the leap into video. Harj, we got access to Sora, and we're about to take a look at some clips that they generated just for us.
Yeah, should we take a look? Okay, so here's the first one. The prompt is, "It's the year 2050. A humanoid robot, acting as a household helper, walks someone's golden retriever down a pretty, tree-lined suburban street." What do we think?
I like how it actually spells out "helper." It's like a flex.
Yeah.
Like, "I can spell now."
Yeah, which was not true with the image models, like-
It would always screw up the text in the image.
Yeah.
Yeah, that's true.
Stable Diffusion, DALL·E were, were notoriously bad at spelling text, so that is a major advance that no one's really talked about yet.
I mean, it's wild how high definition it is. Like, that's almost realistic.
And the other really cool thing is the physics. The way the robot walks, for the most part, is-
Yeah.
... very accurate.
Accurate.
You do notice a little kind of, like, shuffle that's a little bit off, but for the most part, it's believable.
And the way the golden retriever moves, I have a golden retriever.
Yeah, but look at the tail.
So I can personally vouch that, like, they perfectly modeled the, like-
(laughs) Yeah, you have one, right? So you would know.
(laughs)
Like your dog, right?
Yeah, this is perfect, is a perfect representation of how a golden retriever walks. I also like that, um, with, with DALL·E and Stable Diffusion, as you got... As you made your prompts longer and longer, it would just start ignoring it, and not actually doing exactly what you told it to do. And like, we gave it a very specific prompt here, and it did exactly the thing that we told it to.
You can see it's not... It's still not exactly perfect.
Y-
So, I think towards the end, you see as, like, a floating dog or something in there.
Okay. I- I- I was gonna call out a couple other imperfections here-