Stanford Online
Stanford CS153 Frontier Systems | Mati Staniszewski from ElevenLabs on The Future of Voice Systems
CHAPTERS
CS153 kickoff: ElevenLabs’ origin story and the Discord bot that “blew up”
Anjney introduces Mati Staniszewski (founder/CEO of ElevenLabs) and recounts discovering ElevenLabs as a viral text-to-speech Discord bot. Mati shares an early cultural detail: the company initially ran internally on Discord to stay close to creators and move fast.
Gaming/creator ecosystems as a leading indicator for AI use cases
They discuss how gaming communities often incubate future mainstream product patterns. Mati explains why creator/developer proximity and fast feedback loops shaped ElevenLabs’ product-led growth and roadmap.
The original problem: fixing bad dubbing and unlocking global content access
Mati traces ElevenLabs’ starting obsession to Poland’s “single narrator” dubbing style and the broader pain of low-quality localization. The founding vision becomes universal access to content in any language with preserved performance, emotion, and identity.
Anatomy of AI dubbing: the cascaded pipeline and why it was hard in 2022
Mati lays out the end-to-end dubbing system: transcription/speaker handling, translation, and voice generation, each of which must be strong for the final result to be good. In 2022, translation quality and the pipeline as a whole were not yet good enough, so the team narrowed its scope.
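As a rough illustration of the cascaded structure described in this chapter, the three stages can be sketched as pluggable functions chained in order. This is a minimal sketch, not ElevenLabs' implementation: the names `Segment`, `dub`, and the `toy_*` stubs are hypothetical, and each stub stands in for a real model.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Segment:
    """One transcribed span of speech, tagged with who said it."""
    speaker: str
    text: str


# Hypothetical stage signatures; real systems would wrap models here.
Transcriber = Callable[[bytes], List[Segment]]   # audio -> speaker-tagged text
Translator = Callable[[str, str], str]           # (text, target_lang) -> text
Synthesizer = Callable[[str, str], bytes]        # (text, speaker) -> audio


def dub(audio: bytes, target_lang: str,
        transcribe: Transcriber,
        translate: Translator,
        synthesize: Synthesizer) -> List[bytes]:
    """Run the cascade: transcribe, then translate, then generate voice.

    An error in any stage flows into every later stage, which is why
    each one has to be strong for the output to be usable.
    """
    clips = []
    for seg in transcribe(audio):
        translated = translate(seg.text, target_lang)
        clips.append(synthesize(translated, seg.speaker))
    return clips


# Toy stand-ins so the sketch runs end to end (assumptions, not real models).
def toy_transcribe(audio: bytes) -> List[Segment]:
    return [Segment(speaker="spk0", text=audio.decode("utf-8"))]


def toy_translate(text: str, target_lang: str) -> str:
    return f"[{target_lang}] {text}"


def toy_synthesize(text: str, speaker: str) -> bytes:
    return f"{speaker}:{text}".encode("utf-8")


result = dub(b"hello", "pl", toy_transcribe, toy_translate, toy_synthesize)
```

Keeping the stages as separate callables mirrors the scope decision in the summary: a team can swap in or improve one stage (for example, voice generation) without rebuilding the others.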
Product pivot: focus on text-to-speech as the “common denominator”
User conversations revealed immediate demand for voiceover fixes, narration, audiobook creation, and script-to-audio workflows. ElevenLabs prioritized solving high-quality, emotional TTS in English first, rather than building the full transcription/translation stack immediately.
Early research strategy: open source inspiration, papers, and pragmatic compute
Mati describes how early progress came from combining open-source models and academic ideas not yet widely applied to audio. Tortoise (James Betker) is highlighted as a breakthrough baseline, while ElevenLabs focused on speed, stability, and longer-form generation.
IP strategy and speed: why they skipped patents
They reflect on deciding not to patent early innovations due to cost, rapid iteration, and limited strategic value. Mati frames IP defense as less useful than out-innovating, while acknowledging the nuisance of patent trolls.
Capability timeline: 2022 TTS → 2023 voice cloning/marketplace → 2024 localization/dubbing → 2025 real-time agents
Mati offers a year-by-year map of major audio breakthroughs and how ElevenLabs expanded scope. The arc moves from expressive TTS, to scalable multilingual narration and voice cloning, to high-quality localization, and finally to low-latency interactive voice agents.
Cascaded vs fused/omni models: tradeoffs in emotional understanding, reliability, and latency
They dig into why current voice systems often lose emotion/intent by converting audio to plain text, and what it takes to preserve paralinguistic cues. Mati frames the core architectural decision—cascaded components vs fused models—through three constraints: expressivity, reliability, and latency.
Data and control: labeling emotion and enabling “director-style” steerability
Mati explains that emotion-aware agents require labeled data and controllable generation. He describes building labeling pipelines and releasing agent features that detect emotion, pass it to the LLM, and render responses with appropriate tone; later, voice generation becomes steerable like a director guiding an actor.
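The detect, pass, and render flow described in this chapter can be sketched as follows. This is a minimal sketch under stated assumptions: the emotion labels, the `tone_map` values, the prompt tag format, and the function names are all hypothetical, and the `llm` argument is any callable standing in for a real model.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class Utterance:
    text: str
    emotion: str  # label from a hypothetical upstream emotion classifier


def build_prompt(u: Utterance) -> str:
    """Carry the detected emotion alongside the transcript so it is not
    lost when audio is reduced to plain text."""
    return f"[user_emotion={u.emotion}] {u.text}"


def choose_tone(emotion: str) -> str:
    """Map the user's emotion to a delivery tone for the reply.
    The mapping is illustrative only."""
    tone_map = {"frustrated": "calm", "sad": "warm", "happy": "upbeat"}
    return tone_map.get(emotion, "neutral")


def agent_turn(u: Utterance, llm: Callable[[str], str]) -> Dict[str, str]:
    """One agent turn: the LLM sees the emotion tag, and the reply is
    paired with a tone instruction for the voice generator."""
    reply = llm(build_prompt(u))
    return {"text": reply, "tone": choose_tone(u.emotion)}


# Toy LLM that echoes its prompt, so the sketch runs without a model.
turn = agent_turn(Utterance("My order is late.", "frustrated"),
                  llm=lambda prompt: f"reply to: {prompt}")
```

The returned `tone` field plays the role of the director's note: the text says what to say, and the tone says how the voice model should deliver it.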
Collaboration culture at the frontier: ElevenLabs, Sesame, and the open ecosystem
Anjney highlights Mati’s collaborative approach—publicly acknowledging peers and sharing insights—even amid competition. The discussion connects ecosystem collaboration (including open-sourcing models like Sesame’s CSM) to accelerating frontier progress beyond VC-defined categories.
Scaling the business: revenue growth, team structure, and forward-deployed engineering
Mati shares striking business metrics and the organizational model behind execution. He attributes predictability to forward-deployed teams that deliver value inside iconic enterprises, plus small autonomous teams that move quickly and accept being wrong.
Pricing, safety, and security: value-based packaging, watermarking, and the end of voice authentication
Mati outlines value-based pricing principles and addresses voice replication risks. He describes layered safety: traceability, fraud/scam detection, watermarking/detection tools, and guidance that voice should not be used as a secure authentication factor.
Sovereign scale and geopolitics: Ukraine deployment, China distillation risk, and open vs closed trends
Mati describes working with Ukraine’s digital-government ecosystem to add voice access to citizen services during wartime, and discusses strategic alignment with Western allies. The conversation then turns to distillation attacks, China’s competitive model ecosystem, and tensions between openness, IP norms, and safety standards.
Five-year vision: leading audio foundation models, platform-as-infrastructure, and on-device TTS
Mati projects ElevenLabs’ future as both a research leader in conversational audio and a platform enabling businesses/builders to deploy, monitor, and govern voice systems. He also previews on-device progress—useful for constrained TTS—while emphasizing that full interactive reliability remains a cloud/platform challenge.