
No Priors Ep. 24 | With Devi Parikh from Meta
Sarah Guo (host), Devi Parikh (guest), Elad Gil (host)
Devi Parikh on generative video, multimodal AI, and creative control
Devi Parikh, research director in generative AI at Meta and professor at Georgia Tech, traces her path from early “pattern recognition” work to leading-edge multimodal generative models. She explains the Make-A-Video project, which builds text-to-video by leveraging powerful image diffusion models and separating appearance from motion learning. Parikh outlines why video generation is progressing more slowly than images—citing infrastructure costs, representation challenges, architecture complexity, and immature data curricula—while emphasizing the importance of controllability and multimodal prompts for creative tools. She also reflects on AI’s role in democratizing creative expression, underexplored research directions like cross-modal models, and practical career advice such as not self-selecting out of opportunities.
Key Takeaways
Leverage existing image models to bootstrap video generation.
Make-A-Video reuses powerful image diffusion models to learn visual appearance and language alignment from image–text pairs, then separately learns motion from unlabeled videos, reducing data needs and inheriting rich visual diversity. (A brief code sketch of this two-stage setup follows the takeaways below.)
Video generation is fundamentally harder and will likely progress slower than images.
Despite rapid image and language advances, video models remain short, low-complexity ‘animated images’; Parikh points to computational cost, redundancy across frames, high dimensionality, and lack of good video representations or hierarchies as reasons step‑change breakthroughs are lagging.
Better data recipes and curricula are crucial for scalable video training.
Beyond collecting more data, the field lacks robust strategies for sequencing training on simple versus complex clips (length, motion, scene changes), and for ‘massaging’ video datasets—skills that are more mature in language and image domains than in video.
Controllability is essential if generative models are to serve real creators.
Text prompts are a big improvement over random sampling but still too unpredictable; Parikh argues for richer multimodal inputs (sketches, reference images, audio, seed videos) plus iterative editing interfaces so users can refine outputs toward their exact creative intent.
Multimodal, all-in-one models are an underexplored frontier.
Current systems tend to specialize (text-only, image-only, video-only), but Parikh envisions unified models that can ingest and generate across text, images, video, audio, and music—offering deeper understanding and more powerful creative and agentive capabilities.
Audio and sound design are powerful but underinvested complements to visuals.
Text-to-audio models can sometimes generate convincing short soundscapes, yet remain unreliable and poor at handling long sequences or overlapping sounds; Parikh believes richer sound and music generation could greatly enhance expressiveness of generated media.
For researchers, intentional time management and not self-selecting are high‑leverage habits.
Parikh recommends calendaring all tasks (not just keeping to‑do lists) to force realistic time estimates, and urges students and researchers to apply broadly for jobs, internships, and fellowships instead of pre‑rejecting themselves—letting the world say no rather than doing it for them.
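To make the first takeaway concrete, here is a minimal sketch, assuming a PyTorch-style diffusion backbone, of how appearance learning can be separated from motion learning. The class names, the zero-initialization trick, and the freezing helper are illustrative assumptions, not Meta's actual Make-A-Video code.

```python
# Hypothetical two-stage setup: spatial (per-frame) layers are learned from
# image-text data in stage 1; temporal layers are added and learned from
# unlabeled video in stage 2, so only motion has to be learned there.
import torch
import torch.nn as nn


class SpatialBlock(nn.Module):
    """Per-frame attention block; trained in stage 1 on image-text pairs."""

    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch * frames, tokens, dim) -- frames are treated independently.
        out, _ = self.attn(x, x, x)
        return self.norm(x + out)


class TemporalBlock(nn.Module):
    """Attention across frames; added and trained in stage 2 on unlabeled video."""

    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        # Zero-init the output projection so the new block contributes nothing
        # at first and the model starts from the pretrained image behavior.
        nn.init.zeros_(self.attn.out_proj.weight)
        nn.init.zeros_(self.attn.out_proj.bias)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch * tokens, frames, dim) -- attention runs over the time axis.
        out, _ = self.attn(x, x, x)
        return self.norm(x + out)


def freeze_spatial_layers(model: nn.Module) -> None:
    """Stage 2: keep appearance/language knowledge fixed; learn motion only."""
    for module in model.modules():
        if isinstance(module, SpatialBlock):
            for param in module.parameters():
                param.requires_grad = False
```

The design choice this illustrates is the one Parikh describes: the spatial layers already encode appearance and text alignment from images, so the video stage only has to fit the cheaper, label-free problem of how frames change over time.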
Notable Quotes
“Text prompts are democratizing creative expression, and the Holy Grail is AI-generated and -edited video.”
— Sarah Guo (host)
“Video right now is essentially an animated image... it's the same scene, the same set of objects moving around in reasonable ways.”
— Devi Parikh
“I think that might be harder in video, and I wonder if that is something that we're kind of fundamentally missing in terms of how we approach video generation.”
— Devi Parikh
“If we want these generative models to be tools for creative expression, then it needs to be generating content that corresponds to what someone wants to express.”
— Devi Parikh
“Don't self-select... It's on the world to say no to you.”
— Devi Parikh
Questions Answered in This Episode
What kinds of new video representations or hierarchies might overcome current bottlenecks in generative video models?
How could user interfaces for multimodal prompts and iterative editing be designed so that non-technical creators gain real control without overwhelming complexity?
In what ways might unified multimodal models (spanning text, images, video, and audio) change how we build AI assistants or agents?
How should academia and smaller labs strategically contribute to a field increasingly driven by large models and industrial-scale compute?
What ethical and societal implications arise when AI-generated media becomes indistinguishable from real video and sound across social platforms?
Transcript Preview
(instrumental music plays) Text prompts are democratizing creative expression, and the Holy Grail is AI-generated and -edited video. Elad Gil and I sit down with Devi Parikh. She's a research director in generative AI at Meta, a leading researcher in multimodality in AI for visual, audio, and video, and she's an associate professor in the School of Interactive Computing at Georgia Tech. Recently, she worked on Make-a-Video 3D, which creates animations from text prompts. She's also a talented artist herself. Devi, welcome to No Priors.
Thank you. Thank you for having me.
Let's start with your background and how you got started in, um, computer vision. Uh, I- I've heard you say you choose projects based on what brings you joy. Is that how you got into AI research?
(laughs) Um, kind of, kind of, yeah. So my background is that I grew up in India, and then I moved to the U.S., uh, after high school. And I went, uh, to a small school called Rowan University in Southern New Jersey for my undergrad. And that is where I first got exposed to, um, what at the time was being called pattern recognition, we weren't even calling it machine learning, um, and got exposed to some research projects. There was a professor there who kind of showed some interest in me, thought I might have potential to contribute meaningfully (laughs) to research projects, um, and that's how I got exposed. And I really, really enjoyed what I was doing there, um, decided to go to grad school, to Carnegie Mellon. Um, I knew I was enjoying it, but I wasn't sure if I wanted to do a PhD, so at first, I wanted to just kind of get a master's degree with a thesis where I can do some research. But the year that I applied, uh, that, the ECE department at CMU decided that there wasn't going to be a master's track for a thesis, like either you can just take courses or you go to a PhD. And so they kind of slotted me onto the PhD track, um, which I wasn't so sure of, but my advisor there was reasonably confident that I'm going to enjoy it and I'm gonna want to keep going. Um, so yeah, that's how I got started in this space. At first, I was doing projects that didn't have a visual element to it.
How did you pick a thesis project?
So at first, I wasn't, I was working on projects that didn't have too much of a visual element to them, um, but when I got to CMU, my advisor's lab was working in image processing and computer vision and I always thought that it was pretty cool that everybody gets to kind of look at the outputs of their algorithms, um, and see what they're doing. Whereas if it's kind of non-visual, then yeah, you see these metrics, but you don't really have a sense for what's, what's, uh, happening, if it's working, if it's not. Um, and so that's how I got interested in computer vision, and that then defined, um, the topic of my thesis over the course of my PhD.