No Priors Ep. 69 | With HeyGen CEO and Co-Founder Joshua Xu
EVERY SPOKEN WORD
50 min read · 9,832 words
- 0:00 – 3:08
Introduction
- Narrator
Welcome, Joshua. We're so excited to have you here today. How are you?
- Joshua Xu
Hey, Sarah. I'm so excited to be here. Thanks for having me today.
- Narrator
It's our pleasure. Let's get started. Welcome to the Huberman Lab Podcast, where we discuss science and science-based tools for everyday life.
- Elad Gil
Uh...
- Narrator
I'm Sarah Guo, and I'm a professor of neurobiology and ophthalmology at the School of Medicine.
- Elad Gil
Um, Sarah, I think you're...
- Joshua Xu
Wait, Sarah, I'm so confused. What's going on here?
- Elad Gil
Is this thing on?
- Narrator
Today, we're here to discuss how AI can benefit your health, and what medicinal properties the technology holds.
- Joshua Xu
Sarah, I'm so lost. Isn't this the No Priors podcast where you interview technology superstars like Garry Tan and Alexandr Wang?
- Narrator
No, that's only for humans. We're really excited to have you. Welcome, Joshua.
- Joshua Xu
Yeah, excited to be here. Thank you for having me.
- Elad Gil
So, let's start with a little bit of backstory. You started this company, HeyGen. It's had this amazing growth trajectory and is being used by millions of people now. Um, what's the story of starting the company?
- Joshua Xu
Yeah, sure. Hello, everyone. My name is Joshua, and I'm co-founder and CEO of HeyGen. We founded the company roughly three and a half years ago, and before that I was working at Snapchat for about six and a half years. I studied robotics at Carnegie Mellon and joined Snap back in 2014. I initially worked on machine learning in Snapchat ads, ads ranking and recommendation, then I spent my last two years at Snap working on AI cameras. Snap leveraged a lot of AI technology to enhance the camera experience. In 2018, Snapchat released a baby filter and a Disney-style filter. That was the first time I saw a computer actually create and generate something that does not exist in the world. I was so fascinated by the technology back then, and I had a feeling it could change the way people create content. Snapchat is a camera company, and everybody creates content through the mobile camera. But we wanted to replace the camera, because we think AI can create the content, and AI could become the new camera. That's how we got started with HeyGen, and our mission is to make visual storytelling accessible to all.
- Elad Gil
I love it. The greatest minds of our generation, you know, inspired by "Your face is a cute kitten," or whatever. What does replacing the camera mean to you? Why do we need to do that? I use my camera a lot.
- Joshua Xu
I kind of grew up, career-wise, in the mobile camera space, where we worked on a lot of software and technology to help people feel comfortable and to make it easier to create content through the mobile camera. But there are still lots of people who aren't able to create good content using the camera today, and we felt that if we can replace the camera, we can remove the barrier to visual storytelling and visual content creation, and that will let us take a step ahead in the whole content creation space.
- Elad Gil
What
- 3:08 – 5:49
Applications of AI content creation
- Elad Gil
are some of the areas that you think the technology you've developed applies to? You started with different forms of virtual avatars: you can take a video of yourself and turn it into an avatar that you can then feed text to. It can speak in your voice; it can do all sorts of really interesting things for different areas. How did you both decide to start with avatars, and where do you think the main applications are?
- Joshua Xu
When we initially started the company, we tried to disassemble the whole video production process. It's really about camera and then editing. Camera is more about A-roll, which represents the human spokesperson, the avatar piece. Editing is more about B-roll: adding different assets, voiceover, music, transitions, animation, things like that. On editing, we learned from customers that it's not that expensive, because it's a pretty standard service. But camera is super expensive. Imagine the CEO of a company wants to record something. We probably need to schedule that two weeks ahead, bring in a camera crew, and have a studio to actually record it. And even for two minutes of footage, sometimes we need to record for 20 minutes, because people need to remember the script. That's the piece blocking a lot of businesses from creating new content. So that's where we got started: trying to replace that piece of the process and making the avatar replace the camera in video production.
- Elad Gil
Where do you think that goes in the future? People are already using HeyGen for all sorts of different application areas: marketing and sales, and in some cases internal webinars or learning or other things. I'm a little bit curious, is the eventual form of this that everybody has somebody who steps in for them on their Zooms, or is it used for entertainment purposes? How do you view the evolution of this sort of technology over time?
- Joshua Xu
Yeah, I'll say there are many possibilities out there. The problem we're tackling so far is the entry point of content creation, where all the content starts with the camera, and then people do a lot of editing after that. We can clearly see a path where people can assemble all this generative footage and apply AI editing to assemble the final video. And if we push the technology forward and make the performance much better, I think we'll be able to create experiences like generative video in a streaming way, and that could potentially replace a lot of the real-time conversation we have today, especially
- 5:49 – 7:34
Best use cases for HeyGen
- Joshua Xu
with GPT-4o and all this multi-modal real-time streaming technology together.
- Elad Gil
Uh, okay. We're still in asynchronous video creation land in 2024. How do people use HeyGen today? What are your favorite use cases?
- Joshua Xu
I would categorize the use cases of HeyGen into three: create, localize, and personalize. People can cast an avatar from our library, or create their own digital twin, then just select a template or type a script and generate a video. That works best for product explainers, how-to videos, learning and development, and some sales enablement training content. We can also take an existing video and localize it into more than 175 different languages and dialects; in this way we help customers really localize their content into local languages. And last but not least, people can use HeyGen to personalize their video messaging at scale. There are many, many very creative use cases on HeyGen today; we're a very horizontal platform. One of my favorite use cases is probably the recent launch with McDonald's: they ran a sweet campaign that let people send a message to a family member in different languages.
- Narrator
(Hindi)
- Sarah Guo
I love him so much. I would run out of words to express my love for him.
- Joshua Xu
You know, I just want to call it out, you know, AI is for everyone, grandma and grandchildren alike.
- Elad Gil
Yeah, that's really cool. I mean, that's a big brand in a public, consumer-facing use case.
- 7:34 – 11:17
Building for quality in AI video generation
- Elad Gil
How do you think about the quality of HeyGen today? I would have thought of that campaign as sort of the tip of the pyramid in terms of quality. How can you tell when the avatars are good enough and when they're not?
- Joshua Xu
Quality has always been the number one priority for the product, the business, and the technology. I always have a framework like this: there's an invisible quality line, let's say the threshold is 90. Anything below 90 is essentially unusable for customers, because we cannot really replace the real-life production process they have. We really need to focus on pushing video generation quality above that threshold, and I think for avatars today it is above that, so we can really help people replace the real camera and unleash the creative process that helps them scale their content production. Obviously there's much more room to improve, for example generating the full-body avatar and being able to bring all sorts of elements into a video. We're in the process of that.
- Elad Gil
What are you most excited about in terms of what's next, or new releases you have coming?
- Joshua Xu
There are many exciting things on our technology and product roadmap. I'm particularly excited about full-body generation for the avatar. Historically, avatar technology has focused on the upper body. It's really hard to generate gestures and body motion, but a lot of academic research has proven this is very possible now; we just need to take it through the last mile. Another thing I'm very excited about is the streaming avatar, especially with the latest release of GPT-4o, which really helps improve the performance of real-time interaction with text and voice. The HeyGen avatar could become a visualization layer for all those applications.
- Elad Gil
Obviously you need full gesture control and movement to get to any video of any kind.
- Joshua Xu
Mm-hmm.
- Elad Gil
But what do customers want to do in terms of full-body motion today? You had a demo of walking in the last couple of months.
- Joshua Xu
The way I look at it is that there's a spectrum of quality requirements across different use cases. Start from the left side of the spectrum: learning and development content, educational content. That's more like one-to-many broadcasting of educational training content. The quality bar there is lower, because the avatar can be more still, more professional. On the right side of the spectrum is what we call high-end marketing content, which is really dynamic. One example would be ad creative; people ship very, very dynamic content in ads, because that really helps improve the ROI of the content and makes it more engaging. I think enabling full-body rendering will help us bring the avatar, and the video, to the next level of engagement and authenticity, and that will unlock a lot of use cases across marketing and sales.
- Elad Gil
Newscasts and other things, to your point, often use the shot of people walking and talking as a standard canned shot, and there are these standard shots that, if you had full body, you could provide for all sorts of application areas. I guess, related to that, what is the technology that
- 11:17 – 14:49
The models powering HeyGen
- Elad Gil
you folks are using today? You mentioned some things like GPT-4o, but you've also built your own models in-house. How do you think about the technology stack you're using, and how does it have to evolve to do full body or other new things?
- Joshua Xu
There are three models: text, voice, and video. We work with OpenAI's ChatGPT on the text generation side, which also serves as the brain of the orchestration engine we built internally, and we work with OpenAI and ElevenLabs on the voice engine. But we built the entire video stack in-house, including avatar creation, video rendering, and B-roll generation. Over time, the whole technology trend has been moving in one direction: all these things get chained together, multi-modal, multimedia, all into one single model. One challenge I want to call out for full-body generation is how you connect the voice together with gesture and body motion. That will be unlocked by training the voice model and the video model together, so the connection is built inside the model itself. Historically that has been really, really hard, because we had to train the TTS model on one hand and then feed the TTS output into the video model, and it's pretty hard to build that connection. But with multi-modal training, it's very possible.
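(A minimal sketch, in Python, of the cascaded text-to-voice-to-video chain Joshua describes; every function name and body below is a hypothetical placeholder, not HeyGen's actual stack or API.)

```python
# Cascaded pipeline: each stage only sees the previous stage's output,
# which is why gesture cannot condition on prosody without joint training.
# All names and bodies below are illustrative stand-ins.

def generate_script(prompt: str) -> str:
    # Stand-in for an LLM call (e.g., ChatGPT) that drafts the spoken script.
    return f"Hello! Here is a quick explainer about {prompt}."

def synthesize_voice(script: str) -> bytes:
    # Stand-in for a TTS engine (e.g., ElevenLabs) returning audio.
    return script.encode("utf-8")

def render_avatar(audio: bytes, avatar_id: str) -> str:
    # Stand-in for an in-house video model that lip-syncs the avatar.
    # It receives only finished audio, the boundary Joshua describes;
    # a jointly trained audio-video model would remove this seam.
    return f"{avatar_id}_{len(audio)}bytes.mp4"

script = generate_script("our new feature")
audio = synthesize_voice(script)
print(render_avatar(audio, avatar_id="ceo_twin"))
```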
- Elad Gil
Obviously Sora is not available to developers and end users today, but there are world-class text-to-video generation models that are generic, not avatar-based. How does this technology differ from something like Sora?
- Joshua Xu
When we initially started HeyGen, we wanted to help businesses solve the video creation problem. What is a business looking for? Quality, control, and consistency. So when we look back at that north star and ask what technical path gets us there, there are essentially two potential paths. One is the text-to-image path through Sora, where you try to generate the entire thing end to end and get the entire video at once. The other approach, which is what we believe in at HeyGen, is to disassemble the whole video into different components. Largely that's A-roll and B-roll: B-roll represents all the different elements like voiceover, music, and transitions, and A-roll is the avatar. We tackle these components one by one and then build an orchestration engine around them to assemble the final video. We felt this technical path is more capable of delivering the quality, control, and consistency that brands are looking for, because there's some stuff we probably should not try to generate: logos and fonts, for example, need to be very accurate. Not to mention that, especially in a business context, we also need to learn the brand style, the color mapping, et cetera, from customers. The second approach gives us more flexibility and capability to build a system around it. In fact, we actually see Sora as a partner, because we're able to integrate it as one of the component generators and feed it into our orchestration engine for business applications.
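(A hedged sketch of the component-wise assembly just described; the data structures and functions are invented for illustration, meant only to show why exact assets like logos and fonts are composited rather than generated.)

```python
# Component-wise assembly rather than end-to-end generation: produce
# A-roll and B-roll separately, then composite exact brand assets on top.
# All names, types, and functions here are hypothetical placeholders.

from dataclasses import dataclass, field

@dataclass
class Brand:
    logo_path: str                       # never generated: must be pixel-exact
    font: str                            # likewise kept as a literal asset
    colors: list[str] = field(default_factory=list)

def generate_a_roll(script: str, avatar_id: str) -> str:
    # The avatar spokesperson clip (the in-house avatar model's job).
    return f"a_roll_{avatar_id}.mp4"

def generate_b_roll(topic: str) -> str:
    # Could come from an in-house generator or a partner model like Sora.
    return f"b_roll_{topic}.mp4"

def assemble(a_roll: str, b_rolls: list[str], brand: Brand) -> list[str]:
    # The orchestration engine: overlay exact assets instead of asking a
    # model to hallucinate logos, fonts, or brand colors.
    return [a_roll, *b_rolls, f"overlay:{brand.logo_path}", f"font:{brand.font}"]

brand = Brand(logo_path="acme_logo.png", font="Inter", colors=["#0A84FF"])
timeline = assemble(generate_a_roll("Q3 update", "ceo_twin"),
                    [generate_b_roll("warehouse")], brand)
print(timeline)
```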
- 14:49 – 16:39
Research approach
- Elad Gil
How do you think about research, then? If you focus on components of the experience, in particular the video stack as the thing you really want to own and be state-of-the-art in at HeyGen, how do you approach new capabilities from a research perspective? Is it looking at what's available in academia, looking at the problems customers give you, or something de novo?
- Joshua Xu
I would say it's a combination, and I would add one more thing: we need to play within the current limitations of the models and find the connection between what the customer is looking for and what the technology is capable of. When you really look at it, every AI model has some sort of limitation. The key question, in order to deliver a great product experience for the customer, is how we design the product around it so that we avoid the model's limitations and amplify its strengths. That's really important for finding new areas that unlock new creation experiences. One example is our video translation technology. It's a whole new way to translate your content compared to traditional dubbing: we preserve the user's natural voice and their facial expressions. But if you look at the model underneath that enables the video rendering, it's actually a lip-sync model. We figured out a way to combine it all together, with the voice as well as the translation from ChatGPT, and build a great experience around it, so we're creating a whole new experience for localized video and content.
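(A sketch of the video-translation recipe just described: LLM translation plus voice preservation plus a lip-sync model. Every function is a stand-in; the real pipeline is not public.)

```python
# Localization pipeline: keep the original footage and voice identity,
# swap only the language and the mouth motion. All calls are placeholders.

def transcribe(video_path: str) -> str:
    # Stand-in for speech recognition over the source video.
    return "Welcome to our quarterly update."

def translate(text: str, target_lang: str) -> str:
    # Stand-in for an LLM translation step (ChatGPT, per the discussion).
    return f"[{target_lang}] {text}"

def clone_voice(source_video: str, text: str) -> bytes:
    # Speaks the translated text in the original speaker's own voice.
    return text.encode("utf-8")

def lip_sync(video_path: str, new_audio: bytes) -> str:
    # The underlying model Joshua names: re-renders the mouth region so
    # the original face and expressions match the new audio track.
    return video_path.replace(".mp4", "_localized.mp4")

src = "keynote.mp4"
text_hi = translate(transcribe(src), target_lang="hi")
print(lip_sync(src, clone_voice(src, text_hi)))
```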
- Elad Gil
So there are lots of great,
- 16:39 – 18:31
Safeguarding against deep fakes
- Elad Gil
McDonald's-style, exciting commercial applications. But I think a lot of people also find deepfakes really scary, and the ability to abuse somebody's likeness or voice is scary. How do you think about safety, election safety, and abuse?
- Joshua Xu
Well, first of all, we do not allow any political or election content on our platform today. HeyGen's policies strictly prohibit the creation of unauthorized content, and we take abuse of the platform seriously. Our safety and security safeguards include very advanced user verification, including live video consent, a dynamic verbal passcode, and rapid human review behind all the avatars created on the platform. Trust and safety is critical to our business, and we're actively partnering across the industry to keep developing the tools and best practices to combat misinformation and advance AI safety. We actually build safety in as part of the design: if you look at the avatar creation process on HeyGen, we bake these safety concerns and safeguards into every single step of the creation process.
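(An illustrative version of the dynamic-verbal-passcode step mentioned above; the word list, ASR stub, and flow are invented for the sketch and are not HeyGen's actual safeguard implementation.)

```python
# Consent-check sketch: a fresh random passcode defeats pre-recorded video,
# since the speaker must say words that did not exist until moments ago.
import secrets

WORDS = ["maple", "orbit", "candle", "river", "falcon", "prism"]

def issue_passcode(n_words: int = 3) -> str:
    return " ".join(secrets.choice(WORDS) for _ in range(n_words))

def transcribe(video_path: str) -> str:
    # Stand-in for an ASR pass over the uploaded live consent video.
    return "maple orbit candle"

def verify_consent(video_path: str, expected: str) -> bool:
    # Per the flow described above, a human reviewer would additionally
    # confirm the face on video matches the avatar being created.
    return transcribe(video_path).strip().lower() == expected.strip().lower()

# In production `expected` would come from issue_passcode(); a fixed value
# is used here only so the stub ASR output matches.
print(verify_consent("consent_clip.mp4", expected="maple orbit candle"))
```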
- Elad Gil
That makes a lot of sense. Like I said, it's kind of interesting, because if you think about the positive version of this, and you talked about how you try to protect against the negative, the positive version is that you're running for office and you should be able to send a personalized message to each voter, literally into their inbox, with a short video clip of you talking to them specifically, or talking to issues they specifically care about. You could imagine using this technology in the future for hyper-personalized political campaigning, and as long as you can avoid the deepfake side of it, it could actually be quite valuable.
- 18:31 – 24:02
How AI video generation will change video creation
- Elad Gil
How do you think this ability to generate large-scale, differentiated, personalized content of individuals talking changes how people make or use video in general?
- Joshua Xu
If people can generate very engaging and authentic video content, they will basically create more videos and use video more to grow their business. We live in a video-first world, where every business wants to create more videos. The bottleneck in the industry today is that video is just very expensive to make, and it takes weeks or months to make a video. I think this will fundamentally change how people think about growing their business, how they communicate, how they do marketing and sales. So I do think there's a huge possibility to generate a very high degree of personalized video, especially with the full-body avatar able to deliver very dynamic, high-quality content. And I want to give one example: AI generation is not only about cost saving and time saving, which is one aspect of the value prop. We're actually seeing a lot of customers use it to unlock new use cases and do things they were not able to do before. I think that's the key driver of a lot of business outcomes today.
- Elad Gil
How do you think about it in the context of real time versus asynchronous? It feels like a lot of these technologies are focused right now on asynchronous use cases, and that's true of pure text-to-speech models as well. When do you think we move to real-time or close-to-real-time video avatars, and what are the uses of that?
- Joshua Xu
I look at it in two ways. One is the real-time application of the avatar; even now it's possible, and people can already experience that on HeyGen. We're making new updates to make it even faster, so it could potentially become, let's say, a virtual AI SDR or virtual support agent that helps take customer calls or provide support. The technology has always developed along this trend: two years from now, it would not be crazy for a lot of the asynchronous avatar generation pipeline to become real-time streaming capable. And I also see the world moving toward generating the entire video in real time as well, let's say five years from now. I have an opinion here: a generative image is still an image, but generative video is not a video. It's a new format. What I mean by that is, when we look at video today, we look at it as an MP4 file. It's immutable. For example, if you and I are on Instagram, we probably get recommended different ads, but as long as we're recommended the same business, we're looking at the same MP4 file. It doesn't need to be the same, though. Say I like avocado; I should be watching a Coca-Cola ad with avocado, showing me a new Coca-Cola story that fits me, and you could be looking at something else. That's not possible today because making video is expensive, but it could be very possible: we could generate, in real time, the video ad you like according to your user attributes. That would become a new format. Today's video player corresponds to only one MP4 file, but it doesn't need to be like that. The video player could take in user attributes and generate something in real time to match the best way to deliver the content to that customer.
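(A toy rendering of the "generative video player" idea: the player asks for a render conditioned on viewer attributes instead of fetching one fixed MP4. The attribute names and render call are hypothetical.)

```python
# One campaign, many outputs: the "player" keys the render on who is watching.

def render_personalized_ad(campaign: str, attributes: dict[str, str]) -> str:
    # A real system would stream frames from a generative model; here the
    # conditioning is just encoded into a fake asset name.
    variant = "-".join(f"{k}={v}" for k, v in sorted(attributes.items()))
    return f"{campaign}[{variant}].stream"

viewer_a = {"interest": "avocado", "lang": "en"}
viewer_b = {"interest": "football", "lang": "hi"}

print(render_personalized_ad("coca_cola_story", viewer_a))
print(render_personalized_ad("coca_cola_story", viewer_b))
```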
- Elad Gil
Yeah. One interesting analogy would be YouTube, which is one of the largest learning devices in the world today. It's static, immutable video for everyone, but it's pretty clear from Bloom's studies and everything else that personalized education is going to be the more effective path. People want to learn by video, but it's very hard, too expensive, to make that video personalized. This feels like an opportunity for a very different educational future too.
- Joshua Xu
Yeah. One of the use cases we've seen from customers: Pepsis Group generated more than 100,000 videos, thank-you videos sent to all their employees globally, localized into different languages and personalized with each person's name and details like when they joined the company, things like that. Historically, that was delivered with one video: maybe the CEO or the executive team hops on camera and records something saying thank you for 2023. But now that message and communication can really be personalized at a very big scale.
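(A minimal sketch of personalization at that scale: one script template, one render call per employee. The employee data and generate_video function are placeholders, not a real API.)

```python
# Batch-personalized thank-you videos, along the lines described above.

employees = [
    {"name": "Priya", "lang": "hi", "joined": "2019"},
    {"name": "Diego", "lang": "es", "joined": "2021"},
]

def generate_video(script: str, lang: str) -> str:
    # Stand-in for one localized avatar render.
    return f"thanks_{lang}.mp4"

for e in employees:
    script = (f"Thank you, {e['name']}, for an amazing 2023. "
              f"We're glad you've been with us since {e['joined']}.")
    print(generate_video(script, e["lang"]))
```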
- Elad Gil
So, one thing you mentioned
- 24:02 – 26:29
Challenges in building the model
- Elad Gil
is the various aspects of research you're doing, both building your own video models and using third-party APIs. What's been difficult or hard from a research perspective?
- Joshua Xu
Unlike a lot of other models, building a video model is pretty hard because you have to integrate aesthetics into the AI model. Video generation is not only about solving a mathematical problem; it's about creating something the customer loves and appreciates. So a model with a lower optimized cost function doesn't necessarily produce a better visual outcome. That's the piece that makes it really hard to evaluate, but also really important for delivering the last mile of value to the customer. Evaluation in general is also hard. We have to rely on in-product signals, for example A/B tests, to know which model is actually better, because only the customer can be the judge here. And this process is just not differentiable from a mathematical standpoint, so we have to build a system around it and feed that data back into our model training so we can continuously improve.
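(A sketch of the in-product evaluation loop just described: since "looks better" is not differentiable, compare two model versions by customer behavior. The two-proportion z-test below is standard statistics; the numbers and the "success" metric are made up.)

```python
# A/B comparison of two video-model versions by a customer-behavior signal,
# e.g. whether the customer kept/published the generated video.

from math import sqrt

def ab_winner(success_a: int, n_a: int, success_b: int, n_b: int,
              z_crit: float = 1.96) -> str:
    p_a, p_b = success_a / n_a, success_b / n_b
    p = (success_a + success_b) / (n_a + n_b)           # pooled rate
    se = sqrt(p * (1 - p) * (1 / n_a + 1 / n_b))        # pooled std. error
    z = (p_b - p_a) / se
    if z > z_crit:
        return "model B"    # e.g., the new checkpoint wins on keep-rate
    if z < -z_crit:
        return "model A"
    return "no significant difference"

print(ab_winner(success_a=420, n_a=1000, success_b=465, n_b=1000))
```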
- Elad Gil
Did this approach come to you from your work at Snapchat on consumer products, or is it something you had to come up with in the context of HeyGen itself?
- Joshua Xu
I would say it's very similar, especially to when we worked on the camera software. How do we know whether this parameter works better or that one does? We could definitely come up with some very objective metrics: this is the lighting score, this is the resolution. But we figured out that higher resolution doesn't mean better image quality for customers. If you look at the iPhone, it doesn't always have the best resolution compared to a lot of other phones, but it produces the images most people like, which is why they use the iPhone to capture them. There are some very similar lessons we learned in the early days at Snap.
- Sarah Guo
What can you say about how big HeyGen is today?
- Joshua Xu
We're a little over 40 people, but we're serving over 40,000 paying customers on the platform today. And I think what's so interesting about our customers is that these are not
- 26:29 – 27:26
HeyGen team and company
- Joshua Xu
the typical AI early adopters. These are mainstream companies, from European manufacturers to small businesses to global nonprofits to Fortune 500 companies, which reflects the problem we're solving.
- Elad Gil
Given that you have a thousand customers per employee, which is an incredibly impressive metric, are there specific key roles you're hiring for, or other things that members of our audience might want to apply for?
- Joshua Xu
Sure, yeah. We're hiring across different teams: product, design, engineering, AI research, and go-to-market.
- Sarah Guo
Uh, this has been a great conversation. Thanks, Joshua.
- Elad Gil
Thanks so much.
- Joshua Xu
Yeah. Thank you. Thank you for having me.
- Sarah Guo
(instrumental music) Find us on Twitter @nopriorspod. Subscribe to our YouTube channel if you want to see our faces, follow the show on Apple Podcasts, Spotify, or wherever you listen. That way you get a new episode every week. And sign up for emails or find transcripts for every episode at no-priors.com.
Episode duration: 27:26