
No Priors Ep. 24 | With Devi Parikh from Meta
Sarah Guo (host), Devi Parikh (guest), Elad Gil (host)
Devi Parikh on generative video, multimodal AI, and creative control
Devi Parikh, research director in generative AI at Meta and professor at Georgia Tech, traces her path from early “pattern recognition” work to leading-edge multimodal generative models. She explains the Make-A-Video project, which builds text-to-video by leveraging powerful image diffusion models and separating appearance from motion learning. Parikh outlines why video generation is progressing more slowly than images—citing infrastructure costs, representation challenges, architecture complexity, and immature data curricula—while emphasizing the importance of controllability and multimodal prompts for creative tools. She also reflects on AI’s role in democratizing creative expression, underexplored research directions like cross-modal models, and practical career advice such as not self-selecting out of opportunities.
Key Takeaways
Leverage existing image models to bootstrap video generation.
Make-A-Video reuses powerful image diffusion models to learn visual appearance and language alignment from image–text pairs, then separately learns motion from unlabeled videos, reducing data needs and inheriting rich visual diversity. (A brief code sketch of this two-stage setup follows the takeaways below.)
Video generation is fundamentally harder and will likely progress slower than images.
Despite rapid image and language advances, video models remain short, low-complexity ‘animated images’; Parikh points to computational cost, redundancy across frames, high dimensionality, and lack of good video representations or hierarchies as reasons step‑change breakthroughs are lagging.
Better data recipes and curricula are crucial for scalable video training.
Beyond collecting more data, the field lacks robust strategies for sequencing training on simple versus complex clips (length, motion, scene changes), and for ‘massaging’ video datasets—skills that are more mature in language and image domains than in video.
Controllability is essential if generative models are to serve real creators.
Text prompts are a big improvement over random sampling but still too unpredictable; Parikh argues for richer multimodal inputs (sketches, reference images, audio, seed videos) plus iterative editing interfaces so users can refine outputs toward their exact creative intent.
Multimodal, all-in-one models are an underexplored frontier.
Current systems tend to specialize (text-only, image-only, video-only), but Parikh envisions unified models that can ingest and generate across text, images, video, audio, and music—offering deeper understanding and more powerful creative and agentive capabilities.
Audio and sound design are powerful but underinvested complements to visuals.
Text-to-audio models can sometimes generate convincing short soundscapes, yet remain unreliable and poor at handling long sequences or overlapping sounds; Parikh believes richer sound and music generation could greatly enhance expressiveness of generated media.
For researchers, intentional time management and not self-selecting are high‑leverage habits.
Parikh recommends calendaring all tasks (not just keeping to‑do lists) to force realistic time estimates, and urges students and researchers to apply broadly for jobs, internships, and fellowships instead of pre‑rejecting themselves—letting the world say no rather than doing it for them.
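To make the first takeaway concrete, here is a minimal sketch, assuming a PyTorch-style diffusion backbone, of how appearance learning can be separated from motion learning. The class names, the zero-initialization trick, and the freezing helper are illustrative assumptions, not Meta's actual Make-A-Video code.

```python
# Hypothetical two-stage setup: spatial (per-frame) layers are learned from
# image-text data in stage 1; temporal layers are added and learned from
# unlabeled video in stage 2, so only motion has to be learned there.
import torch
import torch.nn as nn


class SpatialBlock(nn.Module):
    """Per-frame attention block; trained in stage 1 on image-text pairs."""

    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch * frames, tokens, dim) -- frames are treated independently.
        out, _ = self.attn(x, x, x)
        return self.norm(x + out)


class TemporalBlock(nn.Module):
    """Attention across frames; added and trained in stage 2 on unlabeled video."""

    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        # Zero-init the output projection so the new block contributes nothing
        # at first and the model starts from the pretrained image behavior.
        nn.init.zeros_(self.attn.out_proj.weight)
        nn.init.zeros_(self.attn.out_proj.bias)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch * tokens, frames, dim) -- attention runs over the time axis.
        out, _ = self.attn(x, x, x)
        return self.norm(x + out)


def freeze_spatial_layers(model: nn.Module) -> None:
    """Stage 2: keep appearance/language knowledge fixed; learn motion only."""
    for module in model.modules():
        if isinstance(module, SpatialBlock):
            for param in module.parameters():
                param.requires_grad = False
```

The design choice this illustrates is the one Parikh describes: the spatial layers already encode appearance and text alignment from images, so the video stage only has to fit the cheaper, label-free problem of how frames change over time.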
Notable Quotes
“Text prompts are democratizing creative expression, and the Holy Grail is AI-generated and -edited video.”
— Sarah Guo (host)
“Video right now is essentially an animated image... it's the same scene, the same set of objects moving around in reasonable ways.”
— Devi Parikh
“I think that might be harder in video, and I wonder if that is something that we're kind of fundamentally missing in terms of how we approach video generation.”
— Devi Parikh
“If we want these generative models to be tools for creative expression, then it needs to be generating content that corresponds to what someone wants to express.”
— Devi Parikh
“Don't self-select... It's on the world to say no to you.”
— Devi Parikh
Questions Answered in This Episode
What kinds of new video representations or hierarchies might overcome current bottlenecks in generative video models?
How could user interfaces for multimodal prompts and iterative editing be designed so that non-technical creators gain real control without overwhelming complexity?
In what ways might unified multimodal models (spanning text, images, video, and audio) change how we build AI assistants or agents?
How should academia and smaller labs strategically contribute to a field increasingly driven by large models and industrial-scale compute?
What ethical and societal implications arise when AI-generated media becomes indistinguishable from real video and sound across social platforms?
Transcript Preview
(instrumental music plays) Text prompts are democratizing creative expression, and the Holy Grail is AI-generated and -edited video. Elad Gil and I sit down with Devi Parikh. She's a research director in generative AI at Meta, a leading researcher in multimodality in AI for visual, audio, and video, and she's an associate professor in the School of Interactive Computing at Georgia Tech. Recently, she worked on Make-a-Video 3D, which creates animations from text prompts. She's also a talented artist herself. Devi, welcome to No Priors.
Thank you. Thank you for having me.
Let's start with your background and how you got started in, um, computer vision. Uh, I- I've heard you say you choose projects based on what brings you joy. Is that how you got into AI research?
(laughs) Um, kind of, kind of, yeah. So my background is that I grew up in India, and then I moved to the U.S., uh, after high school. And I went, uh, to a small school called Rowan University in Southern New Jersey for my undergrad. And that is where I first got exposed to, um, what at the time was being called pattern recognition, we weren't even calling it machine learning, um, and got exposed to some research projects. There was a professor there who kind of showed some interest in me, thought I might have potential to contribute meaningfully (laughs) to research projects, um, and that's how I got exposed. And I really, really enjoyed what I was doing there, um, decided to go to grad school, to Carnegie Mellon. Um, I knew I was enjoying it, but I wasn't sure if I wanted to do a PhD, so at first, I wanted to just kind of get a master's degree with a thesis where I can do some research. But the year that I applied, uh, that, the ECE department at CMU decided that there wasn't going to be a master's track for a thesis, like either you can just take courses or you go to a PhD. And so they kind of slotted me onto the PhD track, um, which I wasn't so sure of, but my advisor there was reasonably confident that I'm going to enjoy it and I'm gonna want to keep going. Um, so yeah, that's how I got started in this space. At first, I was doing projects that didn't have a visual element to it.
How did you pick a thesis project?
So at first, I wasn't, I was working on projects that didn't have too much of a visual element to them, um, but when I got to CMU, my advisor's lab was working in image processing and computer vision and I always thought that it was pretty cool that everybody gets to kind of look at the outputs of their algorithms, um, and see what they're doing. Whereas if it's kind of non-visual, then yeah, you see these metrics, but you don't really have a sense for what's, what's, uh, happening, if it's working, if it's not. Um, and so that's how I got interested in computer vision, and that then defined, um, the topic of my thesis over the course of my PhD.