Y Combinator: How To Build Generative AI Models Like OpenAI's Sora
At a glance
WHAT IT’S REALLY ABOUT
Building Powerful Generative AI Without Billions: Lessons Beyond Sora
- The episode reviews OpenAI’s Sora video demos, highlighting breakthroughs in visual quality, physics simulation, and long-term temporal consistency while still noting subtle artifacts and errors.
- They explain, at a high level, how Sora likely works by combining transformers, diffusion models, and ‘space-time patches’ to handle video as sequences of visual tokens over time.
- Most of the discussion focuses on how early-stage YC startups, with modest budgets and little formal ML background, are nevertheless training impressive foundation models by creatively hacking data, compute, and expertise constraints.
- The hosts argue that generative AI is becoming a general-purpose function approximator for domains like video, code, weather, biology, CAD, and EEG, and that motivated founders can still build competitive, domain-specific models without OpenAI-scale resources.
IDEAS WORTH REMEMBERING
5 ideas
Text-to-video models like Sora already show near-photorealism and credible physics.
Sora can maintain consistent scenes over long clips, handle complex prompts, spell text correctly in frames, and model realistic motion for entities like dogs and robots, despite noticeable artifacts (e.g., impossible roads, floating objects, wrong-side traffic).
Sora-style systems likely merge transformers, diffusion, and temporal modeling via space-time patches.
OpenAI appears to use visual transformers and diffusion over variable-sized 3D ‘patches’ (x, y, time), effectively treating chunks of video as tokens to learn spatial and temporal consistency at scale.
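The space-time patch idea can be sketched in a few lines: treat a video as a 4D tensor and carve it into small 3D blocks (a few frames by a few pixels), each flattened into one token for a transformer. This is a minimal NumPy sketch of the concept only; the patch sizes (`pt`, `ph`, `pw`) and the function itself are illustrative assumptions, not Sora's actual configuration.

```python
import numpy as np

def spacetime_patches(video, pt=4, ph=16, pw=16):
    """Split a video tensor of shape (T, H, W, C) into flattened
    space-time patches: each patch spans `pt` frames by `ph` x `pw`
    pixels, mirroring the 'chunks of video as tokens' idea above.
    Patch sizes here are assumptions for illustration."""
    T, H, W, C = video.shape
    assert T % pt == 0 and H % ph == 0 and W % pw == 0
    # Reshape into a grid of (time, height, width) patch blocks...
    patches = video.reshape(T // pt, pt, H // ph, ph, W // pw, pw, C)
    # ...group the patch-grid axes together, then flatten each
    # block into a single token vector.
    patches = patches.transpose(0, 2, 4, 1, 3, 5, 6)
    return patches.reshape(-1, pt * ph * pw * C)

# An 8-frame, 32x32 RGB clip yields 2 temporal x 2 x 2 spatial
# patches = 8 tokens, each of length 4 * 16 * 16 * 3 = 3072.
video = np.zeros((8, 32, 32, 3), dtype=np.float32)
tokens = spacetime_patches(video)
```

A transformer can then attend over these tokens jointly in space and time, which is what lets the model learn both spatial layout and temporal consistency from the same sequence.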
You do not need billions of dollars or PhDs to train meaningful foundation models.
Multiple YC startups in a single batch (e.g., Infinity AI, SyncLab, Sonato, Metalware, Phind) trained their own models in a few months using ~$500K or less, cloud credits, and focused problem definitions—often as recent grads who self-taught modern ML.
Smart data strategies can substitute for massive compute.
Teams compress video to low resolution, focus on smaller but higher-quality corpora (e.g., scanned hardware textbooks), generate synthetic training data for coding tasks, and use older, smaller base models (e.g., GPT‑2.5) to achieve strong vertical performance.
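As a concrete illustration of the "compress video to cut compute" strategy, here is a minimal NumPy sketch that reduces spatial resolution by block-averaging; the function and the factor of 4 are hypothetical stand-ins for whatever compression pipeline a given team actually uses.

```python
import numpy as np

def downsample_frames(frames, factor=4):
    """Reduce the spatial resolution of a clip (T, H, W, C) by
    averaging `factor` x `factor` pixel blocks. A 4x reduction per
    axis shrinks the per-frame data (and downstream token count)
    by 16x, at the cost of fine detail."""
    T, H, W, C = frames.shape
    assert H % factor == 0 and W % factor == 0
    # Group pixels into factor x factor blocks, then average each block.
    blocks = frames.reshape(T, H // factor, factor, W // factor, factor, C)
    return blocks.mean(axis=(2, 4))
```

For example, a 256x256 clip becomes 64x64, letting the same training budget cover far more clips, which is the trade the teams above are making.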
Foundation models are becoming general-purpose function approximators across the physical and biological world.
Startups are using similar architectures to model weather (Atmo), proteins (Diffuse Bio), EEG brain signals (Pyramidal-like company), and CAD/structural physics (Draft 8), often outperforming legacy physics-based or hand-engineered systems in cost and speed.
WORDS WORTH SAVING
5 quotes
You can actually be on the cutting edge in relatively short order, and that's an incredible blessing.
— Host (about self-taught founders in AI)
How do YC companies build foundation models during the batch with just $500,000?
— Host (framing the central question of the episode)
This is literally built by 21-year-old new college grads, and they built this thing in two months.
— Host (on Sonato’s text-to-song model)
If you're looking for a reason why you can't succeed, guess what? You're right.
— Host (on mindset vs. opportunity in AI)
You can actually compete with OpenAI for very valuable, like verticals and use cases by training your own model without having to be Sam Altman or having a hundred million dollars.
— Host
High quality AI-generated summary created from speaker-labeled transcript.