At a glance
WHAT IT’S REALLY ABOUT
Devi Parikh on generative video, multimodal AI, and creative control
- Devi Parikh, research director in generative AI at Meta and professor at Georgia Tech, traces her path from early “pattern recognition” work to leading-edge multimodal generative models. She explains the Make-A-Video project, which builds text-to-video by leveraging powerful image diffusion models and separating appearance from motion learning. Parikh outlines why video generation is progressing more slowly than images—citing infrastructure costs, representation challenges, architecture complexity, and immature data curricula—while emphasizing the importance of controllability and multimodal prompts for creative tools. She also reflects on AI’s role in democratizing creative expression, underexplored research directions like cross-modal models, and practical career advice such as not self-selecting out of opportunities.
IDEAS WORTH REMEMBERING
5 ideas
Leverage existing image models to bootstrap video generation.
Make-A-Video reuses powerful image diffusion models to learn visual appearance and language alignment from image–text pairs, then separately learns motion from unlabeled videos, reducing data needs and inheriting rich visual diversity (e.g., dragons, unicorns) without requiring matching text–video pairs.
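A minimal sketch of that two-stage factorization, assuming a PyTorch-style setup (the block names, shapes, and freezing mechanics below are illustrative, not the actual Make-A-Video code, which is a full text-to-video diffusion pipeline):

```python
# Toy sketch of appearance/motion factorization. Stage 1 trains the
# spatial (per-frame) weights on image-text pairs, e.g. as an image
# diffusion denoiser; stage 2 freezes them and trains only the
# temporal (cross-frame) weights on unlabeled video clips.
import torch
import torch.nn as nn

class SpatialBlock(nn.Module):
    """Per-frame appearance model, learnable from image-text data."""
    def __init__(self, dim):
        super().__init__()
        self.conv = nn.Conv2d(dim, dim, kernel_size=3, padding=1)
    def forward(self, x):  # x: (batch*frames, dim, H, W)
        return torch.relu(self.conv(x))

class TemporalBlock(nn.Module):
    """Cross-frame motion model, learnable from unlabeled video."""
    def __init__(self, dim):
        super().__init__()
        self.conv = nn.Conv1d(dim, dim, kernel_size=3, padding=1)
    def forward(self, x):  # x: (batch*pixels, dim, frames)
        return torch.relu(self.conv(x))

def video_forward(x, spatial, temporal):
    """Apply appearance per frame, then motion per pixel location."""
    B, T, C, H, W = x.shape
    x = spatial(x.reshape(B * T, C, H, W)).reshape(B, T, C, H, W)
    x = x.permute(0, 3, 4, 2, 1).reshape(B * H * W, C, T)
    x = temporal(x).reshape(B, H, W, C, T).permute(0, 4, 3, 1, 2)
    return x

spatial, temporal = SpatialBlock(16), TemporalBlock(16)
# Stage 1: ...optimize `spatial` on image-text pairs here...
# Stage 2: freeze appearance, learn motion from raw video alone.
for p in spatial.parameters():
    p.requires_grad = False
x = torch.randn(2, 8, 16, 32, 32)  # (batch, frames, channels, H, W)
print(video_forward(x, spatial, temporal).shape)  # -> (2, 8, 16, 32, 32)
```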
Video generation is fundamentally harder and will likely progress slower than images.
Despite rapid advances in image and language models, generated videos remain short, low-complexity 'animated images'; Parikh points to computational cost, redundancy across frames, high dimensionality, and the lack of good video representations or hierarchies as reasons why step-change breakthroughs are lagging.
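To make the dimensionality point concrete, a quick back-of-the-envelope comparison (the resolution and clip length are illustrative, not numbers from the episode):

```python
# Raw value counts for one image vs. a short clip at the same resolution.
image_values = 512 * 512 * 3          # one 512x512 RGB image: 786,432
clip_values = 24 * 5 * image_values   # 5 s at 24 fps: 94,371,840
print(clip_values // image_values)    # 120x more values to model,
                                      # much of it redundant across frames
```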
Better data recipes and curricula are crucial for scalable video training.
Beyond collecting more data, the field lacks robust strategies for sequencing training on simple versus complex clips (length, motion, scene changes), and for ‘massaging’ video datasets—skills that are more mature in language and image domains than in video.
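One way such a curriculum could look in practice, as a hypothetical sketch (the complexity proxies and weights are invented for illustration; the episode names the problem, not a recipe):

```python
# Hypothetical curriculum: order training clips from simple to complex
# using crude proxies for "complexity". The field has no settled recipe;
# this only illustrates the sequencing idea.
from dataclasses import dataclass

@dataclass
class Clip:
    path: str
    num_frames: int       # longer clips are harder
    motion_score: float   # e.g. mean optical-flow magnitude, in [0, 1]
    num_shots: int        # scene changes add complexity

def complexity(c: Clip) -> float:
    # Weighted mix of the three proxies; the weights are arbitrary here.
    return (0.4 * (c.num_frames / 240)
            + 0.4 * c.motion_score
            + 0.2 * (c.num_shots / 5))

clips = [
    Clip("static_scene.mp4", num_frames=48, motion_score=0.1, num_shots=1),
    Clip("sports_montage.mp4", num_frames=240, motion_score=0.9, num_shots=5),
    Clip("walking_dog.mp4", num_frames=120, motion_score=0.4, num_shots=1),
]
curriculum = sorted(clips, key=complexity)  # train on easy clips first
print([c.path for c in curriculum])
```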
Controllability is essential if generative models are to serve real creators.
Text prompts are a big improvement over random sampling but still too unpredictable; Parikh argues for richer multimodal inputs (sketches, reference images, audio, seed videos) plus iterative editing interfaces so users can refine outputs toward their exact creative intent.
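A hypothetical request object illustrating those richer controls (no such unified API exists today; the field names below are assumptions):

```python
# Sketch of a multimodal generation request: several conditioning
# signals plus an iterative edit loop, rather than one-shot text prompts.
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class GenerationRequest:
    text: str                              # the prompt
    sketch: Optional[str] = None           # rough layout image
    reference_image: Optional[str] = None  # target style/appearance
    audio: Optional[str] = None            # soundtrack to sync with
    seed_video: Optional[str] = None       # clip to extend or restyle
    edits: list[str] = field(default_factory=list)  # iterative refinements

req = GenerationRequest(
    text="a dragon gliding over a foggy harbor at dawn",
    sketch="layout.png",
    reference_image="style.jpg",
)
# The user refines the result rather than re-rolling from scratch:
req.edits.append("slow the wing beats")
req.edits.append("make the fog denser near the water")
```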
Multimodal, all-in-one models are an underexplored frontier.
Current systems tend to specialize (text-only, image-only, video-only), but Parikh envisions unified models that can ingest and generate across text, images, video, audio, and music—offering deeper understanding and more powerful creative and agentive capabilities.
WORDS WORTH SAVING
5 quotes
Text prompts are democratizing creative expression, and the Holy Grail is AI-generated and -edited video.
— Sarah Guo (host, paraphrasing the show’s framing)
Video right now is essentially an animated image... it's the same scene, the same set of objects moving around in reasonable ways.
— Devi Parikh
I think that might be harder in video, and I wonder if that is something that we're kind of fundamentally missing in terms of how we approach video generation.
— Devi Parikh
If we want these generative models to be tools for creative expression, then it needs to be generating content that corresponds to what someone wants to express.
— Devi Parikh
Don't self-select... It's on the world to say no to you.
— Devi Parikh