EVERY SPOKEN WORD
85 min read · 16,568 words · 0:00 – 8:31
Democratizing Creative Expression With AI-Generated Video
- Sarah Guo
(instrumental music plays) Text prompts are democratizing creative expression, and the Holy Grail is AI-generated and -edited video. Elad Gil and I sit down with Devi Parikh. She's a research director in generative AI at Meta, a leading researcher in multimodal AI for visual, audio, and video, and she's an associate professor in the School of Interactive Computing at Georgia Tech. Recently, she worked on Make-A-Video 3D, which creates animations from text prompts. She's also a talented artist herself. Devi, welcome to No Priors.
- Devi Parikh
Thank you. Thank you for having me.
- Sarah Guo
Let's start with your background and how you got started in, um, computer vision. Uh, I- I've heard you say you choose projects based on what brings you joy. Is that how you got into AI research?
- Devi Parikh
(laughs) Um, kind of, kind of, yeah. So my background is that I grew up in India, and then I moved to the U.S., uh, after high school. And I went, uh, to a small school called Rowan University in Southern New Jersey for my undergrad. And that is where I first got exposed to, um, what at the time was being called pattern recognition, we weren't even calling it machine learning, um, and got exposed to some research projects. There was a professor there who kind of showed some interest in me, thought I might have potential to contribute meaningfully (laughs) to research projects, um, and that's how I got exposed. And I really, really enjoyed what I was doing there, um, decided to go to grad school, to Carnegie Mellon. Um, I knew I was enjoying it, but I wasn't sure if I wanted to do a PhD, so at first, I wanted to just kind of get a master's degree with a thesis where I can do some research. But the year that I applied, uh, that, the ECE department at CMU decided that there wasn't going to be a master's track for a thesis, like either you can just take courses or you go to a PhD. And so they kind of slotted me onto the PhD track, um, which I wasn't so sure of, but my advisor there was reasonably confident that I'm going to enjoy it and I'm gonna want to keep going. Um, so yeah, that's how I got started in this space. At first, I was doing projects that didn't have a visual element to it.
- Sarah Guo
How did you pick a thesis project?
- Devi Parikh
So at first, I wasn't, I was working on projects that didn't have too much of a visual element to them, um, but when I got to CMU, my advisor's lab was working in image processing and computer vision and I always thought that it was pretty cool that everybody gets to kind of look at the outputs of their algorithms, um, and see what they're doing. Whereas if it's kind of non-visual, then yeah, you see these metrics, but you don't really have a sense for what's, what's, uh, happening, if it's working, if it's not. Um, and so that's how I got interested in computer vision, and that then defined, um, the topic of my thesis over the course of my PhD.
- Sarah Guo
So, you have been working in machine learning long enough that, as you said, it was called pattern recognition, and you've worked across a bunch of different, uh, modalities. How does, how has that changed your, your research path? Because like, things like diffusion models and GANs and large transformers, none of that existed when you were first starting, and I, I, I think you have managed to sort of translate, or transition your interests in a way that keeps you on the cutting edge. H- how has that happened?
- Devi Parikh
Yeah. So I think, I mean, it, you can always kind of look back and try and find patterns. Like when you're actually doing it, you don't necessarily have a grand strategy (laughs) of anything in mind. But when I look back, I think one common theme that led to me transitioning across topics a little bit, um, was that I was interested in seeing how we can get humans to interact with machines in more meaningful ways. And so kind of even my transition from kind of non-visual to visual modalities, in hindsight, I feel like was essentially that. I felt like you can't interact with these systems too much if it's sort of these abstract modalities that you're looking at. And then when I was working in computer vision, I wanted to find ways for humans to be able to interact with these systems more, so I started looking at kind of these attributes and adjectives of like, oh, something is funny or something is shiny, and using that as a mode of communication between humans and machines, both for humans to teach machines new concepts and for machines to be more interpretable in explaining why they're making the decisions that they're making. And that slowly led to the sort of more into natural language processing, where instead of these kind of just adjectives and attributes, looking at more natural language as a way of interacting, so a lot of my work in visual question answering, where you're answering questions about images, image captioning was coming from there. And then over time, I sort of started thinking of other ways to go even deeper in this interaction, that are there ways where AI tools can enhance sort of creative expression for people, give them more tools for expressing themselves. 
Um, and that's how I got interested in AI for creativity, and I was dabbling in kind of a few fairly random projects, um, a few years ago, kind of my bread and butter research was still multimodal vision and language, um, but then sort of I enjoyed what I was doing with AI for creativity, and a couple of years ago that took sort of a little bit more of a serious turn where I made it more of my sort of full-fledged research agenda, and that's how I started working more seriously on generative modeling, um, including transformer-based approaches, diffusion models for images, for video, for 3D video, uh, things of that sort.
- Sarah Guo
So you became a professor, and you also, you know, now work in industry at Meta. What brought you there?
- Devi Parikh
So this was, I've been at Meta for about seven years now. And so this had started when I was transitioning from Virginia Tech to Georgia Tech, so I was an assistant professor at Virginia Tech, then I was getting started at, at Georgia Tech. And in that, uh, transition, uh, I decided to spend a year at, at FAIR. Um, at the time it was called Facebook AI Research, now Fundamental AI Research at, at Meta. And I knew colleagues there from, some of them had been at Microsoft Research before that. I had interned at MSR, I had spent summers at MSR even as a faculty member... um, and so I had a lot of colleagues who I knew and so I thought it would be fun to kind of spend a year, collaborate with them, um, get to know what FAIR is like. Um, so that's what that was. It was supposed to be a one-year, uh, stint and then I was gonna go back to Georgia Tech (laughs) and kind of continue with my, uh, academic position. Um, but in that one year, um, I enjoyed it enough. Uh, I think FAIR enjoyed having me around enough where we tried to figure out, is there a way to keep this going for longer? Um, and so for, uh, many years after that, for five years or so, I was splitting my time. But every fall, I would go back to Georgia Tech, um, to Atlanta, spend the fall semester there, teach. And then the rest of the year, I would be in Menlo Park at Facebook, now Meta.
- Sarah Guo
Mm-hmm. And you transitioned from fundamental AI research to a new sort of generative AI group. Can you talk about why that's interesting to Meta or sort of what, what kind of things you're working on now?
- Devi Parikh
Yeah, yeah. I mean, yeah, that's a very exciting space right now. (laughs) There's a lot happening both, uh, within Meta and, and outside, as I'm sure many of the people listening to this are, are aware of. Um, but yeah, so that ... The new organization was created a few months ago, um, so not, not a long, not a long time ago. Um, and it's, uh, looking at things like large language models, image generation, video generation, generating 3D content, audio, music, um, yeah, all sort of, all sorts of modalities that you might think of. Um, and why is it interesting? I mean, like right now, if you think about all the content, there's so much content that we consume in all modalities and all sorts of surfaces, um, and it makes a lot of sense to ask that instead of ... um, maybe not instead of, but in addition to all of this consumption, can more of us be creating more of this content, right? And so, um, almost everything that you think of, images, video, you can ask this question. Like for any situation whether you're searching for something, trying to find something, it's relevant to ask, "Well, could I just create what it is that I have in my head?" Um, and so when you think of it that way, you can see how it can touch, um, a lot of different things across a variety of products and surfaces. So, yeah.
- Elad Gil
Yeah. That makes a ton of sense. I think we'll, um, come back and ask you some questions about images and audio and a few other things, since you've done so much interesting work across so many different areas. But maybe we can start a little bit with video generation. In part, y- you know, due to the fact that you had a really interesting recent project called Make-A-Video, and in that approach users can generate video with a text prompt. So you could type in, "Imagine a corgi playing with a ball," and it would generate a short video of a corgi playing with a ball. (laughs) Um, could you tell us a bit more about that project, how it started, and also how it works, and what's the basis for the technology there?
- Devi Parikh
Yeah, yeah. So Make-A-Video, um, it, it started
- 8:31 – 15:57
Challenges in Video Generation Research
- Devi Parikh
because, I mean, this was, um, a couple of years ago where, uh ... This was before DALL·E 2, by the way. (laughs) So this was like after DALL·E 1 had happened. And I feel like a lot of people don't even remember DALL·E 1 anymore. Like people don't even talk about that. It's fun to go check out what those images look like, um, and that had, that had blown our minds at the time. But now when you go back, you're like, "Wait, like that's not even interesting." But anyway, so we had seen a lot of progress in image generation, um, and so it seemed like the next kind of entirely open question where we hadn't seen much work at all was to see what can we do with video generation. And so that was kind of the, the inspiration behind that. Um, and for Make-A-Video, the approach, uh, specifically the thinking was we have these image generation models. By this time, we had seen a lot of progress with diffusion-based models, um, from a variety of different institutions. And so the idea was, is there a way of leveraging all the progress that's happened with images in a very direct way to sort of make video generation, uh, possible? And so that led to this intuition that what if we try and use images and associated text as a way of learning what the world looks like and how we, how people talk about the visual content, and then separate that out from trying to learn how the w- ... how things in the world move. So separate out appearance and language and that correspondence from motion, um, of how things, of how things move. Um, and so that is what led to Make-A-Video.
- Sarah Guo
Mm-hmm.
- Devi Parikh
Um, and so the ... there are sort of multiple advantages of thinking of it that way. One is, there's less for the model to learn because you're directly bringing in everything that you already know about images to start with. Um, the second is, all of the diversity that we have in our image data sets, that image models already know all sorts of fantastical depictions of like dragons and unicorns and things like that, uh, but you may not have as much, uh, video data easily available. All of that is inherited. So all the diversity of the visual concepts can come in through images, even if your video data set doesn't have all of that. Um, and the third benefit, uh, maybe the biggest one, is that because of the separation you don't need video and associated text as paired data. You have images and text as paired data, and then you just have unlabeled video to learn motion from. And so these were kind of three things that we thought were quite interesting in how we approached Make-A-Video. Um, and so concretely the way it works is that when you initialize the model, you're starting off with image generation, um, sort of parameters that have already been learned. So before you do any training for Make-A-Video, you, you ... we set it up so that it can generate a few frames that are not temporally coherent. So there is going to be independent images, like the corgi playing with the ball is just going to be independent images of corgis playing with blue balls, um, but they're not gonna be temporally coherent. And then what the network is trying to do as it goes through the learning process is to make these images temporally coherent so that at the end of training, um, it is generating a video rather than just unrelated images. And that's where the videos come in as training data.
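The factorization described here, per-frame spatial processing inherited from an image model plus a new temporal component that makes independently generated frames cohere, can be caricatured in a few lines of plain Python. This is purely an illustrative sketch, not Make-A-Video's actual architecture: `spatial_op` and `temporal_op` are hypothetical stand-ins for the learned spatial and temporal layers.

```python
import random

def spatial_op(frame):
    """Per-frame operation: a 3x3 box blur as a stand-in for
    frozen image-model layers. Acts on one frame independently."""
    h, w = len(frame), len(frame[0])
    out = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            vals = [frame[i2][j2]
                    for i2 in range(max(0, i - 1), min(h, i + 2))
                    for j2 in range(max(0, j - 1), min(w, j + 2))]
            out[i][j] = sum(vals) / len(vals)
    return out

def temporal_op(video, radius=1):
    """Per-pixel operation across time: a moving average over
    neighboring frames, smoothing each pixel's trajectory."""
    t_len, h, w = len(video), len(video[0]), len(video[0][0])
    out = []
    for t in range(t_len):
        lo, hi = max(0, t - radius), min(t_len, t + radius + 1)
        out.append([[sum(video[t2][i][j] for t2 in range(lo, hi)) / (hi - lo)
                     for j in range(w)]
                    for i in range(h)])
    return out

def pseudo_3d_block(video):
    """Factorized spatiotemporal block: a spatial pass per frame,
    then a temporal pass across frames."""
    return temporal_op([spatial_op(f) for f in video])

def frame_diff(video):
    """Mean absolute difference between consecutive frames,
    a crude proxy for temporal (in)coherence."""
    total, count = 0.0, 0
    for a, b in zip(video, video[1:]):
        for ra, rb in zip(a, b):
            for x, y in zip(ra, rb):
                total += abs(x - y)
                count += 1
    return total / count

# Start from temporally *incoherent* frames (independent noise),
# as if an image model had generated each frame on its own.
random.seed(0)
raw = [[[random.random() for _ in range(8)] for _ in range(8)]
       for _ in range(6)]
out = pseudo_3d_block(raw)
print(frame_diff(raw), frame_diff(out))
```

The measured frame-to-frame difference drops after the block runs: the spatial pass never looks across time, so it is the temporal pass that buys coherence between frames, which mirrors the division of labor in the explanation above.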
- Elad Gil
That's a, that's a great explanation.
- Sarah Guo
One, one question just like if we, if we use an example of something that, uh, is not going to be in your video training set, right? Um, so I want a, uh, a flying corgi, for example, right? How should I think of this in terms of like interpreted motion?
- Devi Parikh
Yeah. So one way of thinking of it could be that you may not have seen a flying corgi, but you've probably seen flying airplanes or flying birds and other things that fly, um, in images and in video. And so from images, you will have text associated with it, so you will have a sense for what things tend to look like when someone is saying, "Oh, this is X flying or Y flying." And then in videos, you will have seen the motion of what stuff looks like when it, when it flies. And in images, you will have seen what corgis look like. And so it's hard to kind of know for sure what it is that these models learn. Sort of interpretability is not (laughs) a strength of many of these, uh, deep, large architectures, but that could be one intuitive explanation for how the model is managing to figure out what a flying corgi might look like. Yeah.
- Elad Gil
Well, what are some of the major forward-looking aspects of this sort of project and research?
- Devi Parikh
I think there's a ton to do in the context of video generation. Like, if you look at Make-A-Video, it was very exciting. It was sort of first of its kind, um, capabilities at the time. But it's still, it's a four-second video. It's essentially an animated image, right? (laughs) It's kind of the same scene, the same set of objects that are moving around in reasonable ways, but you're not seeing objects appear, objects, uh, disappear. You're not seeing objects reappear. You're not seeing scene transitions. Um, none of this is, is, is in there. And so if you look at... if you think about the complexity of videos that you just regularly come across on various surfaces, this is far from that. And so there is a ton to be done in terms of making these videos longer, uh, more complex, having memory so that if an object reappears, it's actually consistent, it doesn't now look entirely different. Um, things of that sort, sort of being able to tell more complex stories through videos. All of this is, uh, entirely open.
- Elad Gil
Mm-hmm. Well, uh, and I know these things are always extremely hard to predict, but if you look forward a year or two, what do you think the state of the art will be in terms of length of video, complexity of the scenes that you can animate, things like that?
- Devi Parikh
Uh, yeah, that is, that is hard to say. And to be honest, I've actually been surprised that we haven't seen more of this already. Like, Make-A-Video was, I think, what, nine months or so ago, maybe approaching one year. And it's not like even from other institutions, it's not like we're seeing amazingly longer videos or significantly higher resolution or much more complexity. We're still kind of in this videos equals animated images and, yeah, maybe the resolution is a little bit bigger, quality is a little bit higher, um, but it's not like we've made significant breakthroughs, unlike, for example, what we've seen with images. Um, so I do, I, I... That has given me a sense that maybe this is harder than what we might think and sort of our usual curves of, like, with language models or image models where you're like, "Oh, just six more months and there's gonna be something else that's an entirely different step change over this." Um, I think that might be harder in video, and I wonder if that is something that we're kind of fundamentally missing in terms of how we approach video generation. Um, so it's not quite answering, uh, what you asked me-
- Elad Gil
That's true. Yeah.
- Devi Parikh
... but I do think that it might be a little bit slower than what we might have guessed just based on progress in other modalities.
- Elad Gil
What do you think is the main either challenge or bottleneck that you think has slowed progress in this field? Or not slowed it. I mean, obviously there's been, you know, every... a lot of people are working very hard on these problems. But to your point, it seems like sometimes you have these fundamental breakthroughs, and sometimes it's like an, it's an architecture, like transformer-based models versus traditional NLP, and sometimes it's, um, you know, iterating on a lot of other things that already exist in the preexisting approaches and just sort of solving specific engineering or technical challenges. If you were to sort of list out the bottlenecks to this, what do you think they're likely to be?
- Devi Parikh
Yeah. I think there's a few different things. One is videos are just sort of, from an infrastructure perspective, harder to work with, right? They're just sort of larger, more storage and sort of more expensive to process and more expensive to generate and all of that. So there's just that iteration cycle that is much slower with video than it would be with other modalities. So that is, is one. Um, the second is I
- 15:57 – 20:43
Challenges and Implications of Video Processing
- Devi Parikh
don't, I don't think we've still figured out the right representations for video, right? There is a lot of redundancy in video. One frame to the next frame, there's not a whole lot that changes. Um, we still kind of approach them fairly independently as sort of independent images. Even if you're generating it as kind of one after the other or if, even if you're generating in parallel and then making it finer-grained. Um, so I think maybe that could be something that helps with a breakthrough that if we really figure out how to represent videos efficiently. Um, and the third is this hierarchical architecture that if you want longer videos, there are just so many pixels that you're trying to generate, right? It's a very, very high-dimensional signal, um, compared to anything else, uh, that we're doing. And so just thinking through how do we even approach that, what sort of hierarchical representation, um, makes sense, especially if you want these scene transitions, if you want to have this consistency which may be a form of memory. Um, figuring those architectural pieces out, um, I think may be another piece of this puzzle. Um, and then finally, data, right? Data is kind of gold (laughs) in anything that we're, uh, we're trying to do. Um, and I don't know if as a community we- we've quite built the muscle of thinking through data, um, sort of massaging the data appropriately, um, and all of that in the context of video. We have that muscle quite a bit with language, quite a bit with images, uh, but with video we are perhaps not quite there yet.
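The redundancy point above can be made concrete with a toy delta encoding in plain Python: store a keyframe plus per-frame differences instead of every frame independently. This is a classic video-compression idea, shown here only to illustrate why treating frames as independent images wastes capacity; it is not how any particular generative model represents video, and the "cost" metric (count of nonzero values) is a deliberate simplification.

```python
def delta_encode(video):
    """Keep the first frame; replace each later frame with its
    element-wise difference from the previous frame."""
    deltas = [video[0]]
    for prev, cur in zip(video, video[1:]):
        deltas.append([[c - p for c, p in zip(rc, rp)]
                       for rc, rp in zip(cur, prev)])
    return deltas

def delta_decode(deltas):
    """Invert delta_encode by cumulative summation over time."""
    frames = [deltas[0]]
    for d in deltas[1:]:
        prev = frames[-1]
        frames.append([[p + x for p, x in zip(rp, rd)]
                       for rp, rd in zip(prev, d)])
    return frames

def nonzeros(video):
    """Crude 'cost': number of nonzero values across all frames."""
    return sum(1 for frame in video for row in frame for v in row if v)

# A 16-frame clip where a single bright pixel drifts across an
# otherwise static 8x8 scene - only one spot changes per step.
T, H, W = 16, 8, 8
video = []
for t in range(T):
    frame = [[1.0] * W for _ in range(H)]   # static background
    frame[t % H][t % W] = 5.0               # moving object
    video.append(frame)

encoded = delta_encode(video)
assert delta_decode(encoded) == video        # lossless round-trip
print(nonzeros(video), nonzeros(encoded))    # dense vs sparse
```

Independent frames pay for the static background sixteen times over, while the delta representation pays for it once plus a couple of values per step of motion, which is the intuition behind looking for video representations that don't treat frames as unrelated images.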
- Elad Gil
What, what would be the ideal training set for, for video in that case? Or what's lacking from the existing approaches?
- Devi Parikh
Yeah, I think what's lacking may not be so much, um, the data source itself, although that is certainly a challenge as it is with other modalities, but I think it might also be the data recipes, that do we want to start with training with sort of these very short videos where not much is happening, the scene isn't really changing, but then that also tends to limit the motion. There's just not much happening, and so you kind of end up with these kind of animated image-looking things. Um, and on the other hand, you have sort of very complex video, might be multiple minutes long with all sorts of scene transitions, and that's ideally what you want to shoot for. That's where you want to get. But if you just kind of directly throw all of that into the network, it's unclear if the models will be able to learn, um, all of that complexity well. So I think thinking through some sort of a curriculum, um, may be valuable here, and I don't think we've quite nailed that recipe down.
- Elad Gil
I feel like every generation of sort of technology shifts always runs into video as the hardest thing to do. And if you look at sort of just the first instantiation of the web, one of the reasons YouTube sold was the infrastructure point you made earlier, where just dealing with that huge amount of streaming and the cost associated with it and everything else, even in the prior generation of just, you know, can we host and stream this effectively, um, in part led to them, you know, getting sold to Google reasonably early in the life of a company. So it's, it's interesting how video is always that much more complicated.
- Devi Parikh
Yeah, yeah. And same thing for computer vision, right? Like here we're talking about generation, but even just understanding, um, with images, with image understanding, there was so much progress that was being made, and videos was always kind of not only trailing behind, but just sort of continued to be harder, even sort of the rate of progress, uh, was lower, not just the absolute progress. And I think, yeah, I think we're seeing some of that, uh, for generative models as well.
- Sarah Guo
Devi, I, I know this isn't within your, like, core field, but I'm sure you also pay attention. Like, how do you think advances in video, um, may, like, impact robotics?
- Devi Parikh
Hmm. So I think there, the video understanding piece is probably more relevant than the video generation piece. Um, and video is, like if you think of embodied agents, right, they are sort of moving around and consuming visual content, which inherently is video, right? They're not looking at static images. And so I think that video understanding piece is, is very relevant there. What's also interesting in the context of embodied agents or sort of robotics, uh, physical robots that are moving around, is that it's not passive consumption of videos, right? It's not like how you and I might be watching videos on YouTube or anything else. It's that the-
- Sarah Guo
I'm yelling at the screen. I'm not passive. (laughs)
- Devi Parikh
(laughs) Um, it's that the, the next visual signal that the robot will see will be a consequence of the action that the robot had taken, right? So if it chose-
- Sarah Guo
Mm-hmm.
- Devi Parikh
... to move a certain way, that's gonna change what the video looks like in the next few seconds. Um, and so there's that interesting feedback loop there where it knows what action it had taken. It sees how that changed the visual signal that it is, uh, that it is now getting as input. And so that connection makes it, um, adds a layer of interestingness to how it can process the video, sort of in contrast to with sort of regular computer vision disembodied tasks where we think of sort of it just streaming a video is just kinda happening and you're not controlling what you're seeing.
- 20:43 – 25:50
Control and Multi-Modal Inputs in Video
- Sarah Guo
You s- started by saying that human interaction, uh, was a big driving force in, in your research interests and, um, you know, going beyond like metrics as outputs and, um, even l- language as, as inputs. Um, how do you think about, uh, controllability in video and, like, how important text prompting is to sort of the next generation of creation?
- Devi Parikh
Yeah. I think that's, um, I think that's very important, exactly to your point, that if we want these generative models not just for video but for any modality to be, um, tools for creative expression, then it needs to be generating con- content that corresponds to what someone wants to express, right? Like, it has to bring somebody's-
- Sarah Guo
Mm-hmm.
- Devi Parikh
... voice to life, and that is not possible if there aren't good enough ways of controlling, um, these models. And so text is one way. That's better than random samples. (laughs) That's, that's one way in which I can, I can say what I want. But right now, for the most part, you type in a text prompt, you get an image back, a video back, um, and either you take it or leave it, right? Like, if you like it, that's great. If not, you just kind of try again, and maybe you tweak the prompt a little bit. You sort of, um, try a whole bunch of these prompt engineering tricks and, and, and hope that you get lucky, but it's not really a very direct, uh, form of control. Um, and so I think of more control at least in two different ways. One is to allow for prompts that are not just text but are multimodal themselves. So for image generation, for example, instead of just text, it would be nice if I can kinda sketch out what I want the composition of the scene to look like and, and the model would be expected to kinda respect that. For video, instead of just text as input, maybe I can also provide an image as input so that I can tell the system that this is the kind of scene that I want. Maybe I can provide sort of a little audio clip as input to convey that this is the kind of audio or sound that I want associated with it. Maybe I also bring in a short video clip, um, and expect the model to sort of bring in all of these different modalities in a reasonable way, um, to, to generate a video. So that's one piece where I can bring in more inputs as a way of more control. And the second piece is sort of the predictability part, that even if I bring in all of these modalities as input, if the model then goes off and kinda does its own thing with these inputs, maybe it's reasonable but that's not what I'm looking for. What do I do, right? Like, do I just go back and try again? 
It would be ideal if there's some way of having iterative editing mechanisms where whatever I get back, I have a way of communicating to the model what it is that I want changed, in what way, so that over iterations I can get to the content that I intended in sort of a fairly reasonable way without having to sort of spend hours learning a new tool or something like that, right? So if that can be done in a very intuitive interface, I think that would be pretty awesome.
- Sarah Guo
Where do you think we will get to in terms of the frontier of, like, controls for video generation over the next couple years or five years?
- Devi Parikh
I think control sort of tends to lag behind the core capability. Like, even with images, I feel like we first had to get to a point where these models can actually generate nice-looking images before we start worrying about, well, is it really doing what I wanted it to do, and I feel like we're not quite there with video yet.
- Sarah Guo
So get random good first?
- Devi Parikh
Exactly, exactly, exactly.
- Sarah Guo
All right.
- Devi Parikh
Like, at least get random good first, then maybe let me give it text, then let me give it these other prompts. Um, so I do think we'll first probably see more progress in just the core capabilities of sort of text-to-video generation, um, before we look at, uh, um, prompting. Although we are... And this is in the context of sort of-... me generating something from scratch, right, which is where I might want this iterative control and things like that. A parallel scenario is where I already have a video and I'm trying to edit it in an interesting way. I might want to stylize it. And all of that, I think we're already seeing that even in products, um, with-with Runway, for example, right? So I think that, we'll probably see much more of. Uh, we're already seeing that and I think we'll see more of, where you already have a video that you're starting with, and then you're trying to edit it, um, which has similarities too, but is a little bit in- is a little bit different in my mind, um, compared to sort of generating something from scratch and wanting control over that.
- Elad Gil
The other, um, potential, uh, part of output for videos obviously is text-to-speech or some sort of voice, or other ways to sort of accompany the video or animate it. What is your view in terms of the state of the art of text-to-speech systems and how those are evolving?
- Devi Parikh
I think I have, um, I have, I haven't tracked the text-to-speech quite as much. What I have tracked a little bit more closely is, um, things like text-to-audio, where you might say that the sound of a car driving down the street and what you expect is sort of a sound of a car driving down the street (laughs) to be generated. Um, and so there, uh, the state of the art right now is, um, sort of roughly, sort of a few seconds to tens of seconds long audio, um, and I would say that roughly it probably works reasonably well, uh, one in five times or so. Um, it's like, because there aren't concrete metrics, it's kind of hard to articulate, uh, where state of the art is, but, um, hopefully this
- 25:50 – 39:00
Audio's Role in Visual Content
- Devi Parikh
is, this is helpful. And I, I do think that, um, audio added to visual content, um, makes it much more expressive and much more delightful, and I do think that it tends to be, um, underinvested, um, uh, both for audio, similarly for music. I think it just makes the content much more expressive, much more delightful, um, but I feel like we don't do enough of that. Um...
- Elad Gil
Yeah, it's interesting too because there are actually very large sound effect libraries out there. They're very well labeled as well in terms of what the exact sound effect is and the length and the components and all the rest.
- Devi Parikh
Yeah.
- Elad Gil
And so, it's interesting that the state of the art hasn't quite caught up with, with, you know, what used to be a really interesting old business where you'd generate an enormous amount of IP for different sound effects and then you'd just license them out.
- Devi Parikh
Yeah. Yeah.
- Elad Gil
Which it seems like eventually that industry is likely to go away, so...
- Devi Parikh
And even with audio, similar to what, what we were talking about with video, there is the, right, like the same kinds of challenges and dimensions exist, that you want the piece to be longer, um, you may want compositionality, right? I might want to be able to say that, well first, it's the car driving down the street, and then there is a sound of, I don't know, a baby crying, and then something else. And maybe I'm saying that two of these sounds are happening simultaneously, which is not ... like, that's something that can happen in audio, where you can have this superimposition, um, but in video is not something that would, uh, where it's quite as natural. And so all of that isn't stuff that these models can do, uh, very well right now. If I described a complex sequence of sounds or if I tried to talk about these different sounds simultaneously, um, these models can't do that very well.
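The superposition mentioned here is literally additive: two sounds playing at once are sample-wise sums of their waveforms, while sounds in sequence are concatenated in time. A minimal sketch with synthetic sine tones; the frequencies, durations, and sample rate are arbitrary choices for illustration, and the "car" and "baby" names are just stand-ins for the example prompts above.

```python
import math

SR = 8000  # sample rate in Hz, arbitrary for this sketch

def tone(freq_hz, seconds):
    """A pure sine tone as a list of float samples."""
    n = int(SR * seconds)
    return [math.sin(2 * math.pi * freq_hz * t / SR) for t in range(n)]

def mix(a, b):
    """Simultaneous sounds: sample-wise superposition,
    zero-padding the shorter clip to the longer one."""
    n = max(len(a), len(b))
    a = a + [0.0] * (n - len(a))
    b = b + [0.0] * (n - len(b))
    return [x + y for x, y in zip(a, b)]

def sequence(a, b):
    """Sounds one after another: concatenation in time."""
    return a + b

car = tone(440, 0.5)   # stand-in for "car driving down the street"
baby = tone(660, 0.3)  # stand-in for "baby crying"

both = mix(car, baby)            # two sounds at once
one_then_other = sequence(car, baby)

# Superposition keeps the duration of the longer sound;
# sequencing adds the durations together.
print(len(both), len(one_then_other))
```

Because overlap in audio is just addition, compositional prompts like "X, then Y, with Z underneath" map onto simple waveform operations; the hard part, as noted above, is getting generative models to respect that structure from a text description.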
- Elad Gil
Where do you think we'll see the first application areas? Or what, what do you think are the first sort of use cases that we'll see immediately, and then how does that evolve over time?
- DPDevi Parikh
Yeah. I think, um, I'm not too much of a product person, so I feel like I don't know if I have the strongest intuitions there. Um, but kind of like I was touching on earlier, there are a lot of these situations where we find ourselves searching for things to express ourselves, um, and thinking about whether that can be generated, um, so that it's a closer reflection of what you're trying to communicate, is likely something we'll see. And I know we're not talking about sort of LLMs and conversational agents and all of that too much, but I think AI agents are going to be a thing that we'll see a whole bunch of, um, across many different surfaces. Um, and then thinking about what media creation looks like in the context of AI agents is, is another, uh, dimension to this.
- EGElad Gil
Yeah, that makes a lot of sense. I mean, there's, there's all sorts of obvious sort of near-term applications in terms of generating your own animated GIFs or, to your point, um, midstream video editing or, you know, different types of shorter-form animations or other things that you could do, marketing, et cetera. And so, uh, you know, it definitely feels like there are some near-term applications and some longer term ones, and then the thing I always find interesting about these sorts of technologies is the, the spaces where they kind of emerge in a way that you don't quite expect but end up being a primary use case. You know, sort of the Uber version of the, of the mobile revolution, where you push a button and a stranger picks you up in a car and you're fine with it, right? Um, and it feels like those sort of unexpected delightful experiences are, are gonna be very exciting in terms of a lot of areas of this field.
- DPDevi Parikh
Yeah. Yeah. And, and, and to your point that this technology is brand-new, right? So it's not like there are existing product lines or ways of thinking about product that we can kind of directly plug into and kind of see, oh, did the metric go up, did the metric go down? I think there's a lot of just kind of thinking about where do we anticipate people will be excited to use this? And I, as you said, I think I, there's a very good chance that there will be things that we don't necessarily foresee, um, but just kind of come up as very exciting spaces.
- SGSarah Guo
Um, I think an interesting cynicism has been that, like, there aren't that many artists out there, or like, people don't want to create imagery, much less video, when looking at some of these generative technologies. But, you know, the recent history of social media would say that's, like, certainly not true, right? Um, if you look at Instagram democratizing photography, or TikTok democratizing short-form video creation by just, like, reducing the number of parameters of control, right? As you said, like, you know, sound makes video much richer, but it's also really hard to produce any one of these pieces, so you just take one control away and, like, record with your phone, and you get something like TikTok. But I think it's really exciting, like the explosion in usage of things like Midjourney, right? Because the traction suggests there are an awful lot of people who are actually interested in generating high-quality imagery for a whole range of use cases, professional or not.
- DPDevi Parikh
Yeah. Yeah. And I think there's people across the entire spectrum, right? Like, on one hand, you can talk about artists who already had a voice, were already involved in sort of, um, creating art. And on the other end of the spectrum are people who don't necessarily have the skills, may not have had the training, but are still interested in being able to express their voices a little bit more creatively than they would have otherwise. And so I do think that there is one question of whether or not artists want to be engaging with this technology, and there is the other question of does it kind of just lift the tide for all of the rest of us (laughs) to be able to be more, um, expressive in what we can, uh, create and what we can communicate? Um, and so I think both of those ends are, are relevant here. Um, and with artists, there are artists whose sort of brand is AI artist, where they are explicitly using AI as the tool of choice, um, for expressing themselves, and their entire practice is around that. Someone like Sofia Crespo or Scott Eaton and others. Um, and this was before Midjourney or anything like that, right? Like, they've been doing this for years, um, even with, like, GANs, for example, (laughs) that existed, that were popular before diffusion models and all of that. Yeah.
- SGSarah Guo
Uh, you're an artist yourself, both, um, you know, digital, AI-driven, analog, um, some of it's behind you. Like, how does that impact, like, your view of this?
- DPDevi Parikh
I, I kind of always, um, hesitate a little bit to call myself an artist. I feel like somebody else should be deciding whether I'm an artist or not (laughs), but then there's this whole community.
- SGSarah Guo
We'll say you're an artist.
- DPDevi Parikh
Yeah. (laughs)
- EGElad Gil
By the way, we should mention, um, some of your lovely macrame art is on the wall behind you as well, so.
- DPDevi Parikh
Yeah.
- EGElad Gil
I think it looks great.
- DPDevi Parikh
Yeah, yeah.
- EGElad Gil
Yeah.
- DPDevi Parikh
Thank you. Thank you. And yeah, so to be honest, I don't know if I... It's hard to kind of look back and get a sense for whether that played a certain role in it or not. I know for sure that it plays a role in just how excited I am about this technology, that any time there's some new model out there, whether it's from the teams that I'm working with or if it's something external, I'm definitely very enthusiastic to want to try it out and see what it can do, what it can't do, and sort of tell people about it. And so just kind of my baseline level of excitement around this technology, um, is high, uh, in part because of all these other interests that I have. I'm pretty sure that my emphasis on control is probably also coming from that, where I feel like I want to be using these tools to kind of have them do the thing that I (laughs) want to do, um, and sort of text prompts are, are restrictive in that way. Um, and I mean, we talked about it in the context of control, that if you can bring in multiple modalities as input, that definitely gives you more control, but it also means that there is more space to be creative, right? I can now pick interesting images or interesting videos or interesting pieces of audio and pair that up with, like, this really interesting text prompt and just kind of see what happens. Like, if I put all of this in, you don't know what the model is necessarily going to do, and so it's also just more knobs to play with, um, as you're trying to interact with these. That, yeah, there's just more space to be creative if there's more ways or more knobs to control these models.
- SGSarah Guo
Yeah. I was talking to, um, Alex Israel, who's an LA-based artist, and, uh, he's not a technical guy, um, but an amazing artist, and he was describing this new video project he wants to do that involves use of AI. And I was very inspired by, uh, like, how specific the vision was, and, like, him thinking through the implementation a little bit for somebody who doesn't come from a technical field. And I imagine there will be a whole crop of people who look at these capabilities as, as another tool for expression.
- DPDevi Parikh
Yeah. Yeah. And so, and there are some people who have a very specific vision and they just want the tool to kind of help them get there, and then there are others whose process involves sort of bringing the model along, where the unpredictability and sort of not necessarily knowing what this model is going to generate is a part of their process and is a part of the final piece that they create. So some view them, view these models very much as tools, and then others tend to view them as more of a collaborator, um, in this process of, of creating, and it's always interesting to see what end of the spectrum different people lie on. Yeah.
- SGSarah Guo
Okay. So as we're nearing the end of our time together, we want to run through a few rapid fire questions, if that's okay.
- DPDevi Parikh
S- sounds good.
- SGSarah Guo
Um, maybe I'll start with, one, just given your breadth in the, in the field, um, is there an area of image, audio, video generation, understanding, control that you feel like is, um, just underexplored, uh, for, uh, people looking for research problems?
- DPDevi Parikh
Yeah. So one is the control piece that we already talked about quite a bit, and I think the other is multimodality, like, bringing all of these modalities together. Right now, we have models that can generate text, we have models that can generate images, models that can generate video, um, but there's no reason these all need to be independent. You can envision systems that are sort of ingesting all of these modalities, understanding all of it, and generating all of these modalities, um, and I haven't... I'm starting to see some work in that direction, but I haven't seen, um, a whole lot of it that goes across many different modalities.
- SGSarah Guo
You just got back from CVPR and presented there. Can you mention both what you were talking about and then sort of th- the project or work that most inspired you there?
- 39:00 – 39:49
Don't Self-Select & Devi’s tips for young researchers
- DPDevi Parikh
In terms of advice outside of, um, what I've written, one piece of advice that has stuck with me over the years is: don't self-select. That if you want something, go for it. If you want a job, apply for it. If you want a fellowship (for any students who might be listening), just apply for it. You want an internship? Just apply for it. Um, and yeah, like, don't assume, don't question, "Oh, am I good enough? Am I not?" Um, it's on the world to say no to you. If you are not a good fit, the world will tell you that. Um, and so yeah, there's nothing to lose by just kind of giving it a shot. So, don't self-select.
- SGSarah Guo
That's a great note to end on. Devi, thank you so much for joining us on No Priors.
- DPDevi Parikh
Thank you. Thank you for having me.
- SGSarah Guo
Uh, thanks for the time. (instrumental music)
Episode duration: 39:50
Transcript of episode x7n5Fdc3u0I