No Priors Ep. 69 | With HeyGen CEO and Co-Founder Joshua Xu
CHAPTERS
- 0:05 – 0:54
AI-generated cold open: spoofing a podcast intro to set the theme
The episode opens with an intentionally confusing, AI-generated “Narrator” introduction that mimics another show format, prompting Joshua and Elad to react in real time. The bit tees up the core topic: synthetic media and how convincingly AI can generate video and voice.
- •AI narration deliberately misidentifies the show and hosts
- •Joshua and Elad call out the confusion, signaling an AI-generated demo
- •Sets the stage for a discussion about generative video and authenticity
- 0:54 – 2:20
Joshua Xu’s origin story: from Snapchat AI camera to “AI as the new camera”
Joshua explains his background in robotics and his years at Snapchat working on ads ML and AI camera features. Seeing early generative filters convinced him that AI could become a new “camera,” inspiring HeyGen’s mission to make visual storytelling accessible.
- •Career path: CMU robotics → Snapchat ML/AI camera
- •Early generative filters as the ‘aha’ moment for synthetic content
- •Vision: replace the traditional camera with AI-generated capture
- 2:20 – 3:07
What “replacing the camera” really means: removing friction from content creation
Elad challenges the premise, and Joshua clarifies that the goal is lowering barriers for people who can’t easily create good on-camera content. The product aims to make visual communication faster, easier, and more scalable than traditional filming.
- •Many people struggle to create polished content on camera
- •AI can reduce skill, comfort, and production barriers
- •Scaling visual storytelling is the north-star outcome
- 3:07 – 4:34
Why HeyGen started with avatars: the most expensive part of video production
Joshua breaks down video production into A-roll (spokesperson/camera) and B-roll (editing assets). What HeyGen learned from customers is that editing is relatively standardized, while filming is costly and slow, so the company focused on replacing A-roll with avatars first.
- •Deconstructing production: A-roll vs B-roll
- •Filming requires scheduling, crews, studios, repeated takes
- •Avatars target the biggest bottleneck: creating the spokesperson footage
- 4:34 – 5:56
Near-term vs long-term: from async creation to streaming, interactive avatars
The discussion moves to where the product could go next—assembling generative components into complete videos and eventually enabling real-time, streaming avatar experiences. Joshua ties this to broader multimodal progress and the potential to replace some live interactions.
- •Path from component generation to AI-assisted end-to-end assembly
- •Streaming avatars as real-time visualization for multimodal assistants
- •Potential to change how people handle conversations and communication
- 5:56 – 6:35
HeyGen’s core use cases: create, localize, and personalize
Joshua groups HeyGen usage into three buckets: creating new videos from scripts/templates, localizing existing videos across many languages, and personalizing video messages at scale. He highlights practical business content like explainers, training, and sales enablement.
- •Create: script-to-video with stock or custom “digital twin” avatars
- •Localize: translate and dub/retime into 175+ languages/dialects
- •Personalize: individualized videos for outreach or messaging at scale
- 6:35 – 7:26
Brand examples in the wild: McDonald’s campaign and ‘AI for everyone’
Joshua shares a favorite consumer-facing example: a McDonald’s campaign enabling multilingual family messages. The hosts underscore that these tools aren’t just for tech insiders—they can serve mainstream, emotional communication use cases.
- •McDonald’s activation enabling cross-language personal messages
- •Demonstrates consumer-friendly framing, not just enterprise tooling
- •Reinforces mainstream accessibility (‘grandma and grandchildren alike’)
- 7:26 – 8:50
Quality as the gating factor: crossing the ‘usable’ threshold
Elad probes how HeyGen decides when avatar quality is good enough for public brand use. Joshua describes an internal “invisible line” of quality—below it the output isn’t useful—and notes ongoing work toward higher fidelity, including full-body avatars and richer scene elements.
- •Quality is the top priority because it must substitute real production
- •A threshold model: output below a ‘90’ quality bar isn’t deployable
- •Roadmap includes full-body avatars and more complete scene composition
- 8:50 – 9:57
What’s next: full-body avatars and better real-time interaction
Joshua highlights two major product bets: generating full-body gestures/motion (historically very hard) and improving streaming avatars for real-time conversational experiences. He points to academic progress and GPT-4o-era multimodal interaction as accelerants.
- •Full-body motion to increase authenticity and engagement
- •Streaming avatars to support real-time text/voice interaction
- •‘Last mile’ productization of research into deployable features
- 9:57 – 11:29
Matching motion to use case: quality spectrum from training to ads
Joshua explains that different applications demand different realism. Educational content can tolerate stillness, while high-end marketing/ads require dynamic, engaging movement—making full-body rendering a key unlock for broader marketing and sales applications.
- •Use-case spectrum: lower requirements (training) → higher (ads/marketing)
- •Dynamic motion improves engagement and ROI for creative
- •Full-body avatars enable more authentic, varied shot types
- 11:29 – 12:49
HeyGen’s model stack: text + voice partnerships, video built in-house
Joshua outlines the three-part stack—text, voice, and video—and explains which pieces are outsourced vs owned. HeyGen uses external providers for text/voice while building the video/avatars/rendering stack internally, and he describes the trend toward joint multimodal training.
- •Three layers: text (brain/orchestration), voice, video
- •Partners: OpenAI for text; OpenAI/ElevenLabs for voice
- •In-house: avatar creation, video rendering, B-roll generation
- •Key technical challenge: coupling voice with gesture/motion via joint training
- 12:49 – 14:50
Why not generate everything end-to-end like Sora? Control, consistency, and brand constraints
Elad asks how HeyGen differs from generic text-to-video models. Joshua argues businesses need control and consistency—logos, fonts, brand styles—so HeyGen decomposes video into components and assembles them with an orchestration engine; he also frames Sora-like models as potential partners for components.
- •Business requirements: quality, control, and consistency
- •Component approach (A-roll/B-roll) allows precise brand fidelity
- •Some elements shouldn’t be generated (logos/fonts) due to accuracy needs
- •Sora positioned as integrable component generator, not a competitor
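One way to picture the component approach described above is a plan that marks which pieces are model-generated and which are supplied verbatim. This is only an illustrative sketch: the `Component` type, the `plan_video` function, and the file names are hypothetical and do not come from the episode or from HeyGen's product.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Component:
    name: str
    kind: str        # "a_roll", "b_roll", or "brand_asset"
    source: str      # script text, prompt, or file path (all hypothetical)
    generated: bool  # True if a model produces it, False if used as supplied


def plan_video(script: str, brand_assets: List[str]) -> List[Component]:
    """Build an ordered component plan; brand assets are never generated."""
    plan = [
        Component("avatar_take", "a_roll", source=script, generated=True),
        Component("supporting_footage", "b_roll", source=script, generated=True),
    ]
    # Logos and fonts must be exact, so they are passed through untouched.
    plan += [
        Component(path, "brand_asset", source=path, generated=False)
        for path in brand_assets
    ]
    return plan


plan = plan_video("Welcome to the product tour.", ["logo.png", "brand_font.ttf"])
for component in plan:
    print(component.kind, component.name, "generated" if component.generated else "as supplied")
```

Keeping the `generated` flag explicit makes the brand constraint checkable: an assembly step could refuse any plan in which a `brand_asset` is marked as generated.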
- 14:50 – 16:36
Research and product design: building around model limitations to create new experiences
Joshua describes a blended approach: customer needs, academic progress, and an honest view of model constraints. He emphasizes designing product flows that amplify strengths and hide weaknesses, using video translation as a case study built from lip-sync plus translation plus voice.
- •Research inputs: customers + academia + platform constraints
- •Product design can ‘route around’ model limitations
- •Video translation example: lip-sync + translation + voice preservation
- •Goal: new user experiences, not just better raw models
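The video translation case study can be read as a chain of smaller capabilities. The sketch below composes three placeholder stages in that spirit; the stage functions, the state keys, and the ordering are assumptions made for illustration, and each stage only builds a descriptive string rather than calling any real model.

```python
from typing import Callable, Dict, List

State = Dict[str, str]
Stage = Callable[[State], State]


def translate_stage(state: State) -> State:
    # Placeholder: tag the text with the target language instead of translating.
    return {**state, "text": f"[{state['target_lang']}] {state['text']}"}


def voice_stage(state: State) -> State:
    # Placeholder: describe audio rendered in the original speaker's voice.
    return {**state, "audio": f"audio({state['text']}, voice={state['speaker']})"}


def lip_sync_stage(state: State) -> State:
    # Placeholder: describe the source video re-synced to the new audio.
    return {**state, "video": f"lipsync({state['video']}, {state['audio']})"}


def run_pipeline(state: State, stages: List[Stage]) -> State:
    """Apply each stage in order, passing the accumulated state along."""
    for stage in stages:
        state = stage(state)
    return state


result = run_pipeline(
    {"text": "Hello", "target_lang": "es", "speaker": "joshua", "video": "talk.mp4"},
    [translate_stage, voice_stage, lip_sync_stage],
)
print(result["video"])
```

Because each stage is a plain function over shared state, a weak stage can be swapped or wrapped without touching the others, which is one way a product flow might route around a model limitation.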
- 16:36 – 18:46
Trust & safety: anti-deepfake policies, consent, verification, and review
The conversation turns to abuse risks and election safety. Joshua states HeyGen prohibits political/election content and details safeguards like user verification, live consent, verbal passcodes, and rapid human review, with safety embedded into each creation step.
- •No political/election content allowed on the platform
- •Safeguards: advanced verification, live video consent, verbal passcodes
- •Rapid human review and multi-step safety checks in creation flow
- •Industry collaboration on misinformation and AI safety practices
- 18:46 – 24:01
How generative video changes communication: scalable personalization and a ‘new format’
Joshua argues the biggest shift is not just saving time and money but enabling new use cases—mass personalized video for business communication, marketing, and education. He predicts a future where video isn’t a fixed MP4 but can be generated dynamically per viewer, with examples like personalized ads and PepsiCo-scale employee messaging.
- •Lower cost/time expands video usage across business functions
- •Personalization becomes practical at massive scale (e.g., employee messages)
- •Real-time generation could tailor ads/learning content per viewer attributes
- •Claim: generative video becomes a new interactive format, not just a file
- 24:01 – 26:12
Hard problems building video models: aesthetics, evaluation, and closing the feedback loop
Joshua explains that video modeling is difficult because visual appeal is hard to capture with simple objective metrics, and lower loss doesn’t guarantee better output. HeyGen relies on product signals and A/B tests to evaluate models, then feeds learnings back into training—an approach he relates to lessons from Snapchat’s camera work.
- •Aesthetics are hard to encode; optimization doesn’t equal desirability
- •Evaluation requires in-product signals and user judgment
- •A/B testing is central to deciding which models are ‘better’
- •Experience parallels consumer camera tuning at Snapchat
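To make the A/B testing point concrete, here is a minimal sketch of comparing two model variants on a single product signal using a standard two-proportion z-test. The choice of test, the signal (for example, the share of generated videos a user keeps), and the counts are all assumptions for illustration; the episode does not say which statistics HeyGen uses.

```python
import math
from typing import Tuple


def two_proportion_z(
    success_a: int, total_a: int, success_b: int, total_b: int
) -> Tuple[float, float]:
    """Return (z, two-sided p-value) for the difference in rates, B minus A."""
    rate_a = success_a / total_a
    rate_b = success_b / total_b
    # Pooled rate under the null hypothesis that both variants perform the same.
    pooled = (success_a + success_b) / (total_a + total_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / total_a + 1 / total_b))
    z = (rate_b - rate_a) / std_err
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Made-up counts: model A kept 400 of 1000 times, model B kept 460 of 1000 times.
z, p_value = two_proportion_z(400, 1000, 460, 1000)
print(f"z = {z:.2f}, p = {p_value:.4f}")
```

A test like this only says whether users responded differently to the two variants; it does not explain why, which is where the feedback loop into training described above comes in.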
- 26:12 – 27:26
Company snapshot and hiring: small team, large customer base
Sarah asks about company size; Joshua notes HeyGen is ~40+ people serving ~40,000 paying customers across mainstream industries. The episode closes with roles they’re hiring for across product, design, engineering, research, and go-to-market.
- •Team size: ~40+ people
- •Scale: ~40,000 paying customers; broad, mainstream adoption
- •Hiring across: product, design, engineering, AI research, GTM