How To Build Generative AI Models Like OpenAI's Sora

If you read articles about companies like OpenAI and Anthropic training foundation models, it would be natural to assume that if you don’t have a billion dollars or the resources of a large company, you can’t train your own foundation models. But the opposite is true. In this episode of the Lightcone Podcast, we discuss strategies for building a foundation model from scratch in less than 3 months, with examples of YC companies doing just that. We also get an exclusive look at OpenAI's Sora!

Read more about the YC AI companies from this episode on our blog: https://www.ycombinator.com/blog/building-ai-models

Chapters (Powered by https://bit.ly/chapterme-yc)
00:00 - Coming Up
01:13 - Sora Videos
05:05 - How Sora works under the hood?
08:19 - How expensive is it to generate videos vs. texts?
10:01 - Infinity AI
11:23 - Sync Labs
13:41 - Sonauto
15:44 - Metalware
17:40 - Guide Labs
19:29 - Phind
24:21 - Diffuse Bio
25:36 - Piramidal
27:15 - K-Scale Labs
28:58 - DraftAid
30:38 - Playground
33:20 - Outro

Hosts: Harj Taggar, Diana Hu, Jared Friedman

Mar 28, 2024 · 34m

CHAPTERS

0:00 – 1:12
Why text-to-video feels like a sci‑fi turning point
The hosts frame generative video as the next leap after GPT‑4 and image models, and tease the broader implication: models that can simulate aspects of the physical world. They set up a hands-on look at Sora clips made for the show.
- Generative AI’s progression: text → images → video
- What it would mean for a model to simulate real-world physics
- Teaser of Sora access and custom clips
- Early hints at applications beyond entertainment
1:12 – 3:46
Sora demo #1: Robot walking a dog—prompt following, text rendering, and physics
They watch a suburban scene generated from a detailed prompt and analyze what’s impressive versus what breaks. The discussion focuses on higher fidelity, better text rendering, stronger prompt adherence, and more believable motion—while still spotting artifacts.
- Accurate spelling/text in-video as a notable advance over earlier image models
- Improved prompt adherence even with longer, specific prompts
- Believable motion/physics (robot gait, dog movement)
- Remaining artifacts: odd street geometry, floating/jumping objects, minor inconsistencies
3:46 – 5:05
Sora demo #2: Golden Gate drone shot—world knowledge, continuity, and simulation gaps
A second clip shows a cinematic orbit around the Golden Gate Bridge, highlighting Sora’s ability to reproduce recognizable landmarks with high definition. They also point out subtle failures: geometry alignment issues, wrong-side driving, and still-challenging fluid motion.
- Landmark recognition and cinematic composition
- High-resolution output and long-term visual coherence
- Errors: disjointed bridge column perspective, geographically off terrain
- Cars on wrong side of the road; fluid/waves still imperfect
5:05 – 6:45
How Sora works under the hood: transformers + diffusion + time
Diana offers a primer on the likely architecture: combining transformer approaches (token-like representations) with diffusion techniques used in image generation, then adding temporal consistency across frames. She explains OpenAI’s “space-time patches” concept as a video analogue to tokens.
- Hybrid approach: transformer model + diffusion model + temporal component
- Training on video using “space-time patches” (spatial + temporal chunks)
- Variable patch sizes across dimensions to represent video efficiently
- Analogy: patches function similarly to tokens for video
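To make the patch-as-token analogy concrete, here is a minimal sketch of cutting a video into non-overlapping space-time patches. OpenAI has not published Sora's implementation, so this only illustrates the general idea on raw pixels. The function name `patchify`, the toy 4x4x4 grayscale video, and the 2x2x2 patch size are all assumptions made for the example.

```python
def patchify(video, pt, ph, pw):
    """Split video[t][y][x] (nested lists) into flattened space-time patches.

    pt, ph, pw are the patch extents in time, height, and width.
    Each returned patch plays the role a token plays in a text model.
    """
    frames, height, width = len(video), len(video[0]), len(video[0][0])
    if frames % pt or height % ph or width % pw:
        raise ValueError("video dimensions must be divisible by the patch size")

    patches = []
    for t0 in range(0, frames, pt):
        for y0 in range(0, height, ph):
            for x0 in range(0, width, pw):
                # Flatten one pt x ph x pw block into a single vector.
                patches.append([
                    video[t][y][x]
                    for t in range(t0, t0 + pt)
                    for y in range(y0, y0 + ph)
                    for x in range(x0, x0 + pw)
                ])
    return patches


# Toy grayscale video: 4 frames of 4x4 pixels; each value encodes (t, y, x).
video = [[[t * 100 + y * 10 + x for x in range(4)] for y in range(4)]
         for t in range(4)]

patches = patchify(video, pt=2, ph=2, pw=2)
print(len(patches), "patches of", len(patches[0]), "values each")
```

Each patch spans two frames as well as a 2x2 pixel region, which is what distinguishes a space-time patch from the purely spatial patches used for still images.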
6:45 – 8:19
Research lineage and speculation: ViTs, world models, and robotics influences
They connect Sora’s ideas to prior work such as Vision Transformers (which split images into patches) and earlier world-model research that separates perception from memory over time. Because OpenAI is opaque about specifics, the hosts emphasize that this is informed speculation based on known research threads.
- Vision Transformer (ViT) precedent: patching images for transformer processing
- World-model ideas: perception vs temporal memory components
- Possible blend of robotics literature with modern transformer stacks
- Acknowledgment of limited public detail from OpenAI
8:19 – 8:55
Cost and scale: why video generation likely dwarfs text compute
They discuss why video is far more expensive than text: video adds dimensions and temporal length, implying much larger models and GPU needs. The conversation transitions to the surprising theme that startups can still achieve impressive results with far fewer resources.
- Video’s dimensionality increases compute needs dramatically vs text
- Speculation: models could be an order of magnitude larger than GPT‑4-class systems
- GPU scale discussion (tens of thousands for frontier models)
- Setup for how YC startups “hack” data/compute/expertise constraints
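A rough back-of-the-envelope calculation shows why the extra dimensions matter. Every number below is an assumption chosen for illustration (clip length, frame rate, resolution, patch size, and the 1,000-token text prompt); none comes from the episode or from OpenAI. The sketch counts patch tokens for a short clip and compares the number of pairwise interactions that full self-attention would compute, which grows with the square of the sequence length.

```python
def video_tokens(seconds, fps, height, width, pt, ph, pw):
    """Number of space-time patches for a clip, ignoring any remainder."""
    frames = seconds * fps
    return (frames // pt) * (height // ph) * (width // pw)


# Hypothetical inputs: a 10 s, 24 fps, 512x512 clip cut into 2x16x16 patches,
# compared against a 1,000-token text sequence.
text_tokens = 1_000
clip_tokens = video_tokens(seconds=10, fps=24, height=512, width=512,
                           pt=2, ph=16, pw=16)

token_ratio = clip_tokens / text_tokens
# Full self-attention compares every token with every other token,
# so its cost scales with the square of the sequence length.
attention_ratio = token_ratio ** 2

print(f"video tokens: {clip_tokens}")
print(f"tokens vs text: {token_ratio:.0f}x")
print(f"attention pairs vs text: {attention_ratio:.0f}x")
```

Under these assumed numbers the clip has roughly a hundred times more tokens than the text sequence, and the attention cost gap is far larger still. Real systems reduce this with compression and other tricks, so treat the figures as an intuition aid only.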
8:55 – 10:06
Myth-busting: YC startups building foundation models on $500k (and credits)
The hosts challenge the idea that only billion-dollar labs can train meaningful models. They introduce the framework of data, compute, and expertise—and tee up a tour of YC company demos that achieved strong results during the batch.
- The ‘you need billions’ meme is misleading
- Framework: data + compute + expertise as the key constraints
- YC batch examples as proof smaller teams can build frontier-like features
- Segue into company demos and tactics
10:06 – 11:23
Infinity AI: rapid avatar video from minimal data via adaptation
Infinity AI demonstrates generating a convincing deepfake-style video by typing a script, using a replica trained on the show’s first episodes. The hosts highlight that once a base model exists, adapting to a new identity can require surprisingly little data.
- Text/script to personalized talking-head video
- Model trained on hours of YouTube footage (a few episodes)
- Transfer/adaptation reduces per-person data requirements
- Demonstration of high believability from lightweight training data
11:23 – 13:42
Sync Labs: real-time lip sync with low-res data and fast GPU iteration
Sync Labs shows extremely accurate lip syncing (even to unfamiliar languages) and explains how they trained with limited hardware. The key hacks: compressing data with low-res video and leveraging YC’s Azure GPU cluster credits for rapid iteration.
- API for real-time lip syncing; high accuracy noted by hosts
- Training reportedly done on a single A100
- Data hack: low-res video drastically reduces compute/data requirements
- Compute/iteration hack: YC’s dedicated GPU cluster and large cloud credits
13:42 – 15:44
Sonauto: text-to-song—small team, self-taught expertise, strong intelligibility
They feature Sonauto’s text-to-song generation, emphasizing how rare and difficult this capability is and praising lyrical intelligibility and vocal realism. The founders’ story reinforces the theme that motivated newcomers can reach the cutting edge quickly.
- Text/lyrics to full song in a chosen style/performer persona
- Output stands out for understandable lyrics and convincing vocals
- Built by 21-year-old founders in ~two months
- Self-teaching and rapid execution as a repeatable pattern
15:44 – 19:29
Metalware & Guide Labs: domain-specific models with high-quality data and explainability
Metalware’s hardware-design copilot shows how constrained domains plus curated datasets can enable smaller models to work well. Guide Labs (as described) targets explainable foundation models to address black-box concerns, indicating breadth beyond generative media.
- Metalware: hardware copilot built by SpaceX hardware engineers who learned AI
- Data strategy: scan/curate high-quality textbook figures and content
- Compute strategy: smaller model choice enabled by better data and narrower task
- Guide Labs: focus on explainable outputs vs opaque deep learning
19:29 – 22:03
Phind and the synthetic data debate: generating better training signal at scale
Phind’s software copilot is positioned as outperforming traditional Q&A sources, enabled by clever synthetic data generation for programming-competition style tasks. They unpack why synthetic data once felt ‘circular’ and why it can work when models can reason, comparing to simulation-heavy self-driving training.
- Phind: programming copilot trained with synthetic competition-style data
- Synthetic data’s ‘mosquito drinking its own blood’ objection
- Reasoning and self-improvement flywheels as a possible explanation
- Analogy: self-driving trained heavily on simulated data
22:03 – 30:38
Physics simulators and world models: from Sora to weather, biology, brains, robotics, and CAD
They broaden from entertainment to real-world modeling: Sora-like physics intuition could unlock major domains. Examples include Atmo’s ML-based weather prediction, Diffuse Bio’s protein generation, Piramidal’s EEG temporal modeling, humanoid robotics (K‑Scale), and DraftAid accelerating CAD/engineering kernels.
- Atmo: ML weather model more efficient than traditional physics supercomputer approaches
- Diffuse Bio: generative protein/drug discovery; speed via custom kernels and expertise
- Piramidal: EEG as ‘space-time’ data; chunking reduces runtime complexity
- Robotics implications (K‑Scale; ties to Tesla/Optimus experience) and CAD acceleration (DraftAid)
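The point about chunking EEG data can be illustrated with a small sketch. The episode summary does not describe Piramidal's actual method, so this assumes one plausible reading: each fixed-length chunk of the signal becomes a single token, which shortens the sequence a transformer has to attend over. The 256 Hz sampling rate, the 64-sample chunk length, and both helper names are hypothetical.

```python
def chunk_signal(samples, chunk_len):
    """Split a 1-D signal into consecutive fixed-length chunks, dropping any tail."""
    usable = len(samples) - len(samples) % chunk_len
    return [samples[i:i + chunk_len] for i in range(0, usable, chunk_len)]


def attention_pairs(seq_len):
    """Pairwise comparisons made by full self-attention over seq_len tokens."""
    return seq_len * seq_len


# Hypothetical recording: 10 s of one channel at an assumed 256 Hz.
signal = list(range(2560))
chunks = chunk_signal(signal, chunk_len=64)

per_sample_cost = attention_pairs(len(signal))  # one token per raw sample
per_chunk_cost = attention_pairs(len(chunks))   # one token per chunk

print(f"{len(signal)} samples -> {len(chunks)} chunks")
print(f"attention pairs drop by {per_sample_cost // per_chunk_cost}x")
```

With a chunk length of c, the sequence shrinks by a factor of c and the quadratic attention cost shrinks by a factor of c squared, which is the same trade-off that space-time patches make for video.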
30:38 – 34:05
Playground and the closing message: pivots, self-learning, and competing with giants
They highlight Playground as a case study of a startup pivoting hard into AI, learning fast, and producing image models that compete with larger, better-funded rivals. The episode ends with encouragement: the field is young, and focused effort plus smart strategy can put small teams on the frontier.
- Playground 2.5: competitive image quality vs Midjourney/Stable Diffusion
- Pivoting into AI as a viable path; learning by deep immersion in papers
- Core takeaway: data/compute/expertise can be ‘hacked’ creatively
- Closing encouragement: you can build meaningful models without massive capital
