No Priors Ep. 69 | With HeyGen CEO and Co-Founder Joshua Xu
At a glance
WHAT IT’S REALLY ABOUT
HeyGen’s AI Avatars Aim To Replace Cameras And Personalize Video
- HeyGen CEO Joshua Xu explains how the company is building AI-generated avatars and video tools that aim to replace traditional cameras, making visual storytelling accessible to everyone.
- The product focuses on three main use cases—creating, localizing, and personalizing videos—serving thousands of mainstream businesses for training, marketing, and internal communication, as well as large campaigns for clients like McDonald’s and PepsiCo.
- Xu contrasts HeyGen’s modular, controllable approach to video generation with end‑to‑end text‑to‑video models like Sora, arguing brands need quality, control, and consistency rather than purely generative novelty.
- He also discusses upcoming full‑body and real‑time streaming avatars, the challenges of video model research, and HeyGen’s safety practices to mitigate deepfakes and political misuse.
IDEAS WORTH REMEMBERING
AI avatars can remove major bottlenecks in business video production.
Recording executives or spokespeople is time‑consuming and costly; avatar-based A‑roll lets companies generate high-quality, on‑brand video from text, dramatically speeding up training, explainers, and marketing content creation.
Decomposing video into components gives brands more control than pure text‑to‑video.
HeyGen separates A‑roll (avatars) from B‑roll (voiceover, music, transitions, brand assets) and orchestrates the pieces instead of generating everything in one shot, preserving accuracy for logos, fonts, and visual identity while still leveraging generative models.
Localization and personalization at scale unlock entirely new use cases.
Beyond saving cost, customers now do things that were previously impossible, like McDonald’s consumer campaigns in many languages or PepsiCo sending 100,000 individualized thank‑you videos, each localized and personalized with names and details.
Future video will be dynamic and user-specific, not a single static file.
Xu argues generative video is a new format: instead of one immutable MP4 for everyone, video players could render tailored content in real time based on each viewer’s attributes, especially in advertising and education.
Full‑body and real‑time avatars depend on tighter multimodal model integration.
Achieving natural gestures and body motion synchronized with speech requires jointly training voice and video models, moving beyond today’s pipeline of separate TTS feeding into video, and leveraging multimodal architectures like those behind GPT‑4o.
WORDS WORTH SAVING
We wanted to replace the camera because we think AI can create the content, and AI could become the new camera.
— Joshua Xu
Editing is not that expensive…but camera is super expensive.
— Joshua Xu
When we initially started HeyGen, we want to help the business solve the video creation problem. What is a business looking for? They're looking for quality, they're looking for control, they're looking for consistency.
— Joshua Xu
Generative image is still image, but generative video is not a video. It is a new format.
— Joshua Xu
Video generation is not only about solving a mathematical problem, it's actually about creating something the customer love and appreciate.
— Joshua Xu
High quality AI-generated summary created from speaker-labeled transcript.