No Priors

No Priors Ep.61 | OpenAI's Sora Leaders Aditya Ramesh, Tim Brooks and Bill Peebles

AI-generated videos are not just leveled-up image generators; they could be a big step forward on the path to AGI. This week on No Priors, the team behind Sora discusses OpenAI's recently announced generative video model, which can take a text prompt and create realistic, visually coherent, high-definition clips up to a minute long. Sora team leads Aditya Ramesh, Tim Brooks, and Bill Peebles join Elad and Sarah to talk about developing Sora. The model isn't yet available for public use, but the examples of its work are very impressive. The team believes we're still in the GPT-1 era of AI video models, and they're focused on a slow rollout: they want the model to be in the best place possible to offer value to users and, more importantly, to apply every safety measure possible against deepfakes and misinformation. They also discuss what they're learning from implementing diffusion transformers, why they believe video generation takes us one step closer to AGI, and why entertainment may not be this tool's main use case in the future.

Show Notes:
0:00 Sora team introduction
1:05 Simulating the world with Sora
2:25 Building the most valuable consumer product
5:50 Alternative use cases and simulation capabilities
8:41 Diffusion transformers explained
10:15 Scaling laws for video
13:08 Applying end-to-end deep learning to video
15:30 Tuning the visual aesthetic of Sora
17:08 The road to "desktop Pixar" for everyone
20:12 Safety for visual models
22:34 Limitations of Sora
25:04 Learning from how Sora is learning
29:32 The biggest misconceptions about video models

Sarah Guo (host) · Tim Brooks (guest) · Bill Peebles (guest) · Aditya Ramesh (guest) · Elad Gil (host)
Apr 24, 2024 · 31m · Watch on YouTube

At a glance

WHAT IT’S REALLY ABOUT

OpenAI’s Sora: Building a World-Simulating Video Engine for AGI

  1. The episode features Sora leads Aditya Ramesh, Tim Brooks, and Bill Peebles discussing OpenAI’s new text-to-video model and its role on the path to AGI.
  2. They frame Sora not just as a generative video tool, but as an early world simulator that learns physics, 3D structure, and human/animal behavior directly from raw video.
  3. The team explains the core technical ideas—diffusion transformers and spacetime patches—chosen to make minute‑long HD, multi‑aspect‑ratio video scalable in the same way GPT is for text.
  4. They also cover potential applications (creative tools, education, robotics, avatars), current limitations (cost, speed, object interactions, safety), and why they see Sora as the “GPT‑1 moment” for visual models.

IDEAS WORTH REMEMBERING

5 ideas

Treat video models as world simulators, not just media generators.

By learning to predict future frames from raw video, Sora implicitly acquires knowledge of 3D structure, physics, and social interactions—similar to how humans mentally simulate scenarios—making it a plausible building block for AGI.

Architectural choices were made for long‑term scalability, not quick wins.

Instead of extending image models, the team designed Sora from scratch around diffusion transformers and spacetime patches, optimized from day one for minute‑long HD video, diverse aspect ratios, and broad data coverage.

Scaling laws appear to hold for video much like for language.

Using GPT‑style transformers on video tokens, they observe that more compute and data systematically improve quality, suggesting a clear path to stronger physical reasoning, longer sequences, and richer interactions as resources increase.
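The scaling-law observation can be made concrete with a toy fit. The loss and compute numbers below are purely illustrative placeholders (not figures from the episode), but they show the standard procedure: assume loss ≈ a · C^b and recover the exponent by linear regression in log-log space.

```python
import numpy as np

# Hypothetical (loss, compute) pairs -- illustrative values only,
# standing in for measurements at increasing training budgets.
compute = np.array([1e18, 1e19, 1e20, 1e21])  # FLOPs
loss    = np.array([3.2, 2.6, 2.1, 1.7])      # validation loss

# Power law loss ~= a * C^b is linear in log-log space:
# log(loss) = log(a) + b * log(C).
b, log_a = np.polyfit(np.log(compute), np.log(loss), 1)
a = np.exp(log_a)

print(f"loss ~= {a:.2f} * C^{b:.3f}")  # b < 0: loss falls as compute grows
```

A smooth negative exponent like this is what "scaling laws appear to hold" means operationally: it lets you extrapolate how much extra compute a target quality level would cost.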

Tokenizing video as latent spacetime patches unlocks generality.

Representing videos as 3D cubes in latent space lets a single model handle different resolutions, durations, orientations, and even still images, analogous to how text tokens let LLMs train on diverse textual formats.
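As a rough sketch of what "latent spacetime patches" could mean mechanically (the actual Sora tokenizer is not public; the tensor layout and patch sizes here are assumptions), a latent video tensor can be cut into small 3D cubes and each cube flattened into one token. A still image is then just the one-frame case:

```python
import numpy as np

def spacetime_patches(latent, pt=2, ph=4, pw=4):
    """Cut a latent video (T, H, W, C) into (pt, ph, pw) spacetime cubes
    and flatten each cube into one token. Patch sizes are illustrative."""
    T, H, W, C = latent.shape
    assert T % pt == 0 and H % ph == 0 and W % pw == 0
    x = latent.reshape(T // pt, pt, H // ph, ph, W // pw, pw, C)
    x = x.transpose(0, 2, 4, 1, 3, 5, 6)      # group the three patch axes together
    return x.reshape(-1, pt * ph * pw * C)    # (num_tokens, token_dim)

# A 16-frame 32x32 latent with 4 channels: 8*8*8 = 512 tokens of 2*4*4*4 = 128 dims.
tokens = spacetime_patches(np.zeros((16, 32, 32, 4)))
print(tokens.shape)  # (512, 128)

# A still image is a one-frame video with temporal patch size 1.
image_tokens = spacetime_patches(np.zeros((1, 32, 32, 4)), pt=1)
print(image_tokens.shape)  # (64, 64)
```

This mirrors how a ViT patchifies images, extended with a temporal axis; different durations, resolutions, and orientations simply change the number of tokens in the sequence, which is the generality the episode attributes to this representation.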

Creative empowerment and new media forms may be the first big wins.

Early artist tests show Sora lowering barriers to high‑quality visual storytelling and hint at entirely new interaction paradigms (e.g., personalized stories, dynamic educational content) beyond traditional film or clips.

WORDS WORTH SAVING

5 quotes

We really believe models like Sora are on the critical pathway to AGI.

Aditya Ramesh

This is the GPT‑1 of this new paradigm of visual models.

Tim Brooks

This is really the first generative model of visual content that has breadth in a way that language models have breadth.

Bill Peebles

The best way to learn intelligence in a scalable manner is to just predict data.

Tim Brooks

It understands 3D… we didn’t explicitly bake 3D information into it whatsoever, we just trained it on video data and it learned about 3D because 3D exists in those videos.

Tim Brooks

- Sora as a world model and its connection to AGI
- Early access, creator feedback, and product roadmap (no broad release yet)
- Core architecture: diffusion transformers and latent spacetime patches
- Scaling laws, data/tokenization strategy, and infrastructure for long HD video
- Creative and educational applications beyond traditional film and media
- Safety, deepfakes, misinformation, and responsibility allocation
- Current limitations and future improvements in physical and long-horizon coherence

High-quality AI-generated summary created from a speaker-labeled transcript.
