At a glance
WHAT IT’S REALLY ABOUT
Devi Parikh on generative video, multimodal AI, and creative control
- Devi Parikh, research director in generative AI at Meta and professor at Georgia Tech, traces her path from early “pattern recognition” work to leading-edge multimodal generative models. She explains the Make-A-Video project, which builds text-to-video by leveraging powerful image diffusion models and separating appearance from motion learning. Parikh outlines why video generation is progressing more slowly than images—citing infrastructure costs, representation challenges, architecture complexity, and immature data curricula—while emphasizing the importance of controllability and multimodal prompts for creative tools. She also reflects on AI’s role in democratizing creative expression, underexplored research directions like cross-modal models, and practical career advice such as not self-selecting out of opportunities.
IDEAS WORTH REMEMBERING
5 ideas
Leverage existing image models to bootstrap video generation.
Make-A-Video reuses powerful image diffusion models to learn visual appearance and language alignment from image–text pairs, then separately learns motion from unlabeled videos, reducing data needs and inheriting rich visual diversity (e.g., dragons, unicorns) without requiring matching text–video pairs.
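A minimal sketch of that two-stage factorization, assuming a PyTorch-style setup (the block names, shapes, and freezing mechanics below are illustrative, not the actual Make-A-Video code, which is a full text-to-video diffusion pipeline):

```python
# Toy sketch of appearance/motion factorization. Stage 1 trains the
# spatial (per-frame) weights on image-text pairs, e.g. as an image
# diffusion denoiser; stage 2 freezes them and trains only the
# temporal (cross-frame) weights on unlabeled video clips.
import torch
import torch.nn as nn

class SpatialBlock(nn.Module):
    """Per-frame appearance model, learnable from image-text data."""
    def __init__(self, dim):
        super().__init__()
        self.conv = nn.Conv2d(dim, dim, kernel_size=3, padding=1)
    def forward(self, x):  # x: (batch*frames, dim, H, W)
        return torch.relu(self.conv(x))

class TemporalBlock(nn.Module):
    """Cross-frame motion model, learnable from unlabeled video."""
    def __init__(self, dim):
        super().__init__()
        self.conv = nn.Conv1d(dim, dim, kernel_size=3, padding=1)
    def forward(self, x):  # x: (batch*pixels, dim, frames)
        return torch.relu(self.conv(x))

def video_forward(x, spatial, temporal):
    """Apply appearance per frame, then motion per pixel location."""
    B, T, C, H, W = x.shape
    x = spatial(x.reshape(B * T, C, H, W)).reshape(B, T, C, H, W)
    x = x.permute(0, 3, 4, 2, 1).reshape(B * H * W, C, T)
    x = temporal(x).reshape(B, H, W, C, T).permute(0, 4, 3, 1, 2)
    return x

spatial, temporal = SpatialBlock(16), TemporalBlock(16)
# Stage 1: ...optimize `spatial` on image-text pairs here...
# Stage 2: freeze appearance, learn motion from raw video alone.
for p in spatial.parameters():
    p.requires_grad = False
x = torch.randn(2, 8, 16, 32, 32)  # (batch, frames, channels, H, W)
print(video_forward(x, spatial, temporal).shape)  # -> (2, 8, 16, 32, 32)
```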
Video generation is fundamentally harder and will likely progress slower than images.
Despite rapid advances in image and language models, generated videos remain short, low-complexity 'animated images'; Parikh points to computational cost, redundancy across frames, high dimensionality, and the lack of good video representations or hierarchies as reasons why step-change breakthroughs are lagging.
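To make the dimensionality point concrete, a quick back-of-the-envelope comparison (the resolution and clip length are illustrative, not numbers from the episode):

```python
# Raw value counts for one image vs. a short clip at the same resolution.
image_values = 512 * 512 * 3          # one 512x512 RGB image: 786,432
clip_values = 24 * 5 * image_values   # 5 s at 24 fps: 94,371,840
print(clip_values // image_values)    # 120x more values to model,
                                      # much of it redundant across frames
```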
Better data recipes and curricula are crucial for scalable video training.
Beyond collecting more data, the field lacks robust strategies for sequencing training on simple versus complex clips (length, motion, scene changes), and for ‘massaging’ video datasets—skills that are more mature in language and image domains than in video.
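One way such a curriculum could look in practice, as a hypothetical sketch (the complexity proxies and weights are invented for illustration; the episode names the problem, not a recipe):

```python
# Hypothetical curriculum: order training clips from simple to complex
# using crude proxies for "complexity". The field has no settled recipe;
# this only illustrates the sequencing idea.
from dataclasses import dataclass

@dataclass
class Clip:
    path: str
    num_frames: int       # longer clips are harder
    motion_score: float   # e.g. mean optical-flow magnitude, in [0, 1]
    num_shots: int        # scene changes add complexity

def complexity(c: Clip) -> float:
    # Weighted mix of the three proxies; the weights are arbitrary here.
    return (0.4 * (c.num_frames / 240)
            + 0.4 * c.motion_score
            + 0.2 * (c.num_shots / 5))

clips = [
    Clip("static_scene.mp4", num_frames=48, motion_score=0.1, num_shots=1),
    Clip("sports_montage.mp4", num_frames=240, motion_score=0.9, num_shots=5),
    Clip("walking_dog.mp4", num_frames=120, motion_score=0.4, num_shots=1),
]
curriculum = sorted(clips, key=complexity)  # train on easy clips first
print([c.path for c in curriculum])
```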
Controllability is essential if generative models are to serve real creators.
Text prompts are a big improvement over random sampling but still too unpredictable; Parikh argues for richer multimodal inputs (sketches, reference images, audio, seed videos) plus iterative editing interfaces so users can refine outputs toward their exact creative intent.
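A hypothetical request object illustrating those richer controls (no such unified API exists today; the field names below are assumptions):

```python
# Sketch of a multimodal generation request: several conditioning
# signals plus an iterative edit loop, rather than one-shot text prompts.
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class GenerationRequest:
    text: str                              # the prompt
    sketch: Optional[str] = None           # rough layout image
    reference_image: Optional[str] = None  # target style/appearance
    audio: Optional[str] = None            # soundtrack to sync with
    seed_video: Optional[str] = None       # clip to extend or restyle
    edits: list[str] = field(default_factory=list)  # iterative refinements

req = GenerationRequest(
    text="a dragon gliding over a foggy harbor at dawn",
    sketch="layout.png",
    reference_image="style.jpg",
)
# The user refines the result rather than re-rolling from scratch:
req.edits.append("slow the wing beats")
req.edits.append("make the fog denser near the water")
```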
Multimodal, all-in-one models are an underexplored frontier.
Current systems tend to specialize (text-only, image-only, video-only), but Parikh envisions unified models that can ingest and generate across text, images, video, audio, and music—offering deeper understanding and more powerful creative and agentive capabilities.
WORDS WORTH SAVING
5 quotes
Text prompts are democratizing creative expression, and the Holy Grail is AI-generated and -edited video.
— Sarah Guo (host, paraphrasing the show’s framing)
Video right now is essentially an animated image... it's the same scene, the same set of objects moving around in reasonable ways.
— Devi Parikh
I think that might be harder in video, and I wonder if that is something that we're kind of fundamentally missing in terms of how we approach video generation.
— Devi Parikh
If we want these generative models to be tools for creative expression, then it needs to be generating content that corresponds to what someone wants to express.
— Devi Parikh
Don't self-select... It's on the world to say no to you.
— Devi Parikh