No Priors Ep.61 | OpenAI's Sora Leaders Aditya Ramesh, Tim Brooks and Bill Peebles

AI-generated videos are not just leveled-up image generators; they could be a big step forward on the path to AGI. This week on No Priors, the team from Sora is here to discuss OpenAI’s recently announced generative video model, which can take a text prompt and create realistic, visually coherent, high-definition clips up to a minute long. Sora team leads Aditya Ramesh, Tim Brooks, and Bill Peebles join Elad and Sarah to talk about developing Sora. The model isn’t yet available for public use, but the examples of its work are very impressive. The team believes we’re still in the GPT-1 era of AI video models. They are focused on a slow rollout to ensure that the model is in the best place possible to offer value to the user and, more importantly, that they have applied all the safety measures possible to avoid deepfakes and misinformation. They also discuss what they’re learning from implementing diffusion transformers, why they believe video generation is taking us one step closer to AGI, and why entertainment may not be the main use case for this tool in the future.

Show Notes:
0:00 Sora team Introduction
1:05 Simulating the world with Sora
2:25 Building the most valuable consumer product
5:50 Alternative use cases and simulation capabilities
8:41 Diffusion transformers explanation
10:15 Scaling laws for video
13:08 Applying end-to-end deep learning to video
15:30 Tuning the visual aesthetic of Sora
17:08 The road to “desktop Pixar” for everyone
20:12 Safety for visual models
22:34 Limitations of Sora
25:04 Learning from how Sora is learning
29:32 The biggest misconceptions about video models

Sarah Guo (host), Tim Brooks (guest), Bill Peebles (guest), Aditya Ramesh (guest), Elad Gil (host)

Apr 25, 2024 | 31m

CHAPTERS

0:00 – 0:55
Meet the Sora leads and why text-to-video matters
Sarah Guo introduces OpenAI’s Sora and the three leaders—Aditya Ramesh, Tim Brooks, and Bill Peebles. The conversation frames Sora as a major step in applying scalable transformer-style training to video generation.
- Sora generates HD, coherent video clips up to ~1 minute from text prompts
- Positioning video generation as part of OpenAI’s broader AGI pathway
- Quick introductions from Aditya, Tim, and Bill
0:55 – 2:14
Sora as a world simulator on the path to AGI
Bill argues that realistic video generation requires learning rich models of people, objects, and environments. He describes Sora as an early version of a “world simulator” that could eventually allow interactive, agent-like behavior inside simulated environments.
- Complex scenes (e.g., Tokyo winter street) illustrate modeling people, motion, and interactions
- To generate realism, the model must internalize how the world works
- Vision of personal simulators where agents can be tasked and return results
- Scaling Sora is framed as a route toward more general intelligence
2:14 – 3:11
Product roadmap: limited access, artists, and red teaming
Tim explains why OpenAI isn’t committing to a near-term product timeline. Instead, they’re distributing Sora to a small cohort of artists and red teamers to understand usefulness, impact, and safety before broader release decisions.
- No public timeline yet for a general consumer product
- Early access program for artists to guide usability and creative workflows
- Red teaming to surface misuse and safety issues prior to release
- Feedback informs research direction and eventual product decisions
3:11 – 5:46
Creator feedback and standout demos: controllability and storytelling
Aditya and Tim describe early creator feedback, focusing on the need for more control than text prompts alone. They also share inspiring examples that emphasize narrative filmmaking and quick iterative creativity.
- Key gap: controllability beyond text-only prompting; interest in richer inputs
- Artists push the model in unexpected, creatively ambitious directions
- Example: Shy Kids’ “Airhead” short highlights storytelling enablement
- Bill’s “Bling Zoo” sample shows rapid iteration and surreal multi-shot generation
5:46 – 8:29
Beyond films: new interaction modes and simulation-heavy applications
Elad and the team discuss how video models may unlock entirely new media formats—not just traditional movies. They also explore forward-looking uses like robotics and physical-world learning, where video provides grounding unavailable from text alone.
- Unclear timeline for professional-grade long-form content, but progress expected over years
- Models may enable new forms of interactive content beyond current media formats
- Video training captures physical details (joints, motion, contact) relevant to robotics
- World-model learning from raw video could support embodied intelligence
8:29 – 10:16
What a diffusion transformer is (and why it scales)
Tim gives a technical overview of Sora’s foundations: diffusion for generation and transformers for scalable learning. They emphasize that increased data and compute predictably improve outputs, echoing scaling behavior seen in language models.
- Diffusion: start from noise and iteratively denoise into a video sample
- Transformer backbone (GPT-like) enables scaling with data/compute
- Technical report shows better generations at higher compute levels
- Belief that scaling will improve simulation fidelity and longer-term coherence
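The "start from noise and iteratively denoise" loop can be sketched in a few lines. This is a toy illustration, not Sora's implementation: the noise schedule, the function names, and the closed-form `oracle_noise_estimate` (which stands in for a trained transformer by assuming the whole "dataset" is a single known clip) are all assumptions made for the example.

```python
import math
import random


def make_alpha_bars(num_steps):
    # Toy schedule (assumption): the fraction of signal kept falls linearly
    # from ~1 (almost clean) at step 0 to ~0 (almost pure noise) at the last step.
    return [0.999 - 0.998 * i / (num_steps - 1) for i in range(num_steps)]


def oracle_noise_estimate(x_t, alpha_bar, target):
    # Hypothetical stand-in for the trained network. When the data distribution
    # is a single known sample, the ideal noise estimate has a closed form.
    scale = math.sqrt(1.0 - alpha_bar)
    return [(x - math.sqrt(alpha_bar) * c) / scale for x, c in zip(x_t, target)]


def sample(target, num_steps=50, seed=0):
    rng = random.Random(seed)
    alpha_bars = make_alpha_bars(num_steps)
    # Start from pure Gaussian noise with the same size as the sample we want.
    x = [rng.gauss(0.0, 1.0) for _ in target]
    for t in range(num_steps - 1, -1, -1):
        a_t = alpha_bars[t]
        a_prev = alpha_bars[t - 1] if t > 0 else 1.0
        eps = oracle_noise_estimate(x, a_t, target)
        # Estimate the clean sample implied by the current noise estimate.
        x0 = [(xi - math.sqrt(1.0 - a_t) * e) / math.sqrt(a_t) for xi, e in zip(x, eps)]
        # Deterministic step to the next, less noisy level.
        x = [math.sqrt(a_prev) * c + math.sqrt(1.0 - a_prev) * e for c, e in zip(x0, eps)]
    return x


clip = [0.1, 0.5, 0.9, 0.3]  # a flattened "video" of four values
print(sample(clip))
```

In a real model the oracle is replaced by a learned network that has never seen the target, which is where the data and compute discussed in this chapter come in.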
10:16 – 13:29
Scaling laws for video and the tokenization breakthrough: spacetime patches
Bill explains why “tokens” are crucial for generalist training, and introduces spacetime patches as the video analog. This enables training on diverse aspect ratios and durations without forcing everything into a fixed-size format, expanding data utilization and model breadth.
- Transformers inherit scaling-law-friendly behavior from language modeling
- Spacetime patches: 3D “cubes” of video act as tokens for the transformer
- Avoids limiting training to fixed 256x256, fixed-duration clips
- Enables breadth: images + multiple aspect ratios (vertical/widescreen) and variable lengths
- Requires major infrastructure to ingest and process large, varied video datasets
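To make the spacetime-patch idea concrete, here is a minimal sketch of cutting a video into 3D cubes and flattening each cube into one token. The patch size, the nested-list video layout, and the helper names `make_video` and `spacetime_patches` are illustrative assumptions; the episode does not describe Sora's actual patch sizes or preprocessing.

```python
def make_video(frames, height, width):
    # Hypothetical stand-in for a decoded video: one number per pixel,
    # encoded so that each value reveals its (frame, row, column) position.
    return [[[f * 10000 + y * 100 + x for x in range(width)]
             for y in range(height)]
            for f in range(frames)]


def spacetime_patches(video, pt, ph, pw):
    # Split a [frames][height][width] video into cubes of pt x ph x pw values
    # and flatten each cube into a single token (a list of numbers).
    frames, height, width = len(video), len(video[0]), len(video[0][0])
    if frames % pt or height % ph or width % pw:
        raise ValueError("video dimensions must be divisible by the patch size")
    tokens = []
    for t0 in range(0, frames, pt):
        for y0 in range(0, height, ph):
            for x0 in range(0, width, pw):
                tokens.append([video[t][y][x]
                               for t in range(t0, t0 + pt)
                               for y in range(y0, y0 + ph)
                               for x in range(x0, x0 + pw)])
    return tokens


widescreen = spacetime_patches(make_video(4, 8, 16), 2, 4, 4)
vertical = spacetime_patches(make_video(8, 16, 8), 2, 4, 4)
print(len(widescreen), len(vertical))  # different clips yield different token counts
```

Every token has the same length regardless of the clip's shape, so a transformer can consume clips of different aspect ratios and durations as sequences of different lengths instead of requiring a fixed crop.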
13:29 – 15:24
End-to-end video-first architecture: designing for minute-long HD from scratch
Tim contrasts Sora’s approach with prior work that extends image generators into short videos. Sora began with the target of minute-long HD generation, driving a scalable, video-native representation and architecture choice.
- Prior video models often bolt video onto image generators
- Sora started with the explicit goal: “a minute of HD footage”
- Video-first design motivated simple, scalable decomposition of data
- General lesson: step back and build for where the solution should be in a few years
15:24 – 17:02
Visual aesthetic: why Sora looks good and where personalization could go
Elad asks how Sora’s striking look was tuned; Aditya says little explicit aesthetic tuning was done. They discuss steering via language and the future potential for personalization—e.g., letting creators upload portfolios so the model learns their style and studio jargon.
- Minimal explicit aesthetic tuning; results reflect learned data distribution
- Strong language understanding helps users steer visuals via descriptive prompts
- Future direction: personalized aesthetics based on a creator’s portfolio/assets
- Desire for models to learn firm-specific jargon and long-running visual identity
17:02 – 19:03
“Desktop Pixar” and new paradigms for entertainment, education, and communication
Sarah describes an emerging vision: real-time, personalized, richly visual storytelling for kids and families—“desktop Pixar.” Tim broadens this into custom educational explainers and visual communication tools, arguing video is central to how humans learn and share ideas.
- Real-time, interactive story narration paired with generated visuals
- Potential shift in entertainment and education toward personalized content
- Custom-tailored educational videos could explain concepts on demand
- Video generation as a future medium for clearer communication
19:03 – 20:06
Avatars as a use case—and why the team is focusing on core tech first
Elad raises digital avatars as a promising application area. Tim notes they haven’t pursued specific verticals yet; the priority is improving the underlying engine, likening current Sora to an early “GPT-1” stage for visual models.
- Digital avatars are compelling but not a current focus
- Team prioritizes fundamental capability improvements over downstream apps
- Sora framed as early-stage foundation model for video (“GPT-1 moment”)
- Goal: build a general engine that can later power many applications
20:06 – 22:33
Safety for video models: deepfakes, misinformation, and shared responsibility
The conversation turns to safety: what mitigations transfer from DALL·E 3 and what’s new for video. Aditya highlights misinformation and responsibility boundaries among model providers, platforms, and users, balancing expressive freedom with responsible rollout.
- Some mitigations port from DALL·E 3 (e.g., handling racy/gory content)
- Video introduces additional risks like misinformation and deceptive edits
- Open questions: who bears responsibility, model providers, platforms, or users?
- Tension between creative freedom and gradual, responsible deployment
22:33 – 25:03
Current limitations and what needs to improve for broad access
Bill outlines practical blockers to wide release: serving cost, latency, and safety readiness—especially during election cycles. They also discuss qualitative gaps like long-horizon interaction consistency and object permanence.
- Serving constraints: cost and generation time (minutes for long clips)
- Need to drive inference cost down to democratize access
- Safety readiness a gating factor, especially around elections and misinformation
- Quality gaps: complex object-object interactions and persistent state across time
25:03 – 31:24
Learning world models from video: 3D understanding, bitter lesson, and misconceptions
The team reflects on what Sora learns implicitly—3D structure and causal-ish state changes—purely from video prediction. They argue scaling ‘predict the data’ approaches is the most reliable path, and position Sora 1.0 as proof that GPT-like scaling will unlock surprising video capabilities quickly.
- Sora learns 3D and stateful effects (bite marks, paint trails) without explicit supervision
- Human intelligence relies on internal world simulation; Sora parallels that direction
- Bitter lesson framing: simple objectives + scale beat handcrafted complexity
- Expectation that video-model capabilities will improve rapidly along a scaling curve
- Public “update”: Sora is an existence proof that GPT-style scaling applies to video too