Stanford Online | Stanford CS153 Frontier Systems | Mati Staniszewski from ElevenLabs on The Future of Voice Systems
At a glance
WHAT IT’S REALLY ABOUT
ElevenLabs’ voice AI evolution: from dubbing to real-time agents
- ElevenLabs began with an AI-dubbing vision inspired by poor Polish voiceover conventions, then narrowed to text-to-speech as the highest-leverage component that unlocked multiple creator and enterprise use cases.
- Mati outlines the core “voice system” pipeline (speech-to-text → translation/LLM reasoning → text-to-speech) and explains why early product feedback pushed the company to perfect TTS quality, expressivity, and voice cloning first.
- The conversation contrasts cascaded vs fused (end-to-end) multimodal architectures, arguing cascaded systems win today for enterprise reliability, tool-use traceability, and guardrails, while fused systems can win on latency and fluidity.
- ElevenLabs’ growth is attributed to tight user feedback loops, a PLG-to-enterprise motion with forward-deployed engineering, small autonomous teams, and pricing anchored on customer value rather than model inference cost.
- Safety and governance themes include voice cloning abuse mitigation, watermarking/detection, discouraging voice-based authentication, resisting distillation attacks, and building trusted ecosystems like voice marketplaces and licensed celebrity voices.
IDEAS WORTH REMEMBERING
Start with the customer’s real pain, not the most ambitious end-state.
ElevenLabs aimed at full AI dubbing but discovered creators first needed simpler voiceover fixes (re-recording lines, narration corrections), which pulled the roadmap toward TTS as the foundational wedge.
In voice systems, “quality” includes context and performance—not just intelligibility.
Human-like speech required modeling emotional delivery, dialogue awareness, and long-context coherence; early models could sound great on short clips but failed on longer passages or consistency.
Cascaded architectures remain the enterprise default because they’re governable.
Separating STT, LLM/tooling, and TTS makes it easier to debug, audit steps, apply guardrails, and ensure correct authentication/payment flows—critical for support, banking, and regulated domains.
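The governability argument above can be sketched in code. This is a minimal, hypothetical illustration of a cascaded pipeline, not an ElevenLabs implementation: each stage is a separate function, so an audit log and a guardrail check can sit between stages. All function names, the `Turn` dataclass, and the stand-in outputs are illustrative assumptions.

```python
# Hypothetical sketch of a cascaded voice pipeline: each stage is a
# separate, inspectable step, so guardrails and audit logs can sit
# between them. Names and outputs are illustrative, not real APIs.
from dataclasses import dataclass, field

@dataclass
class Turn:
    audio_in: bytes
    transcript: str = ""
    reply_text: str = ""
    audit_log: list = field(default_factory=list)

def speech_to_text(turn: Turn) -> Turn:
    # Stand-in for a real STT model.
    turn.transcript = "I'd like to check my balance."
    turn.audit_log.append(("stt", turn.transcript))
    return turn

def guardrail(turn: Turn) -> Turn:
    # Step-level guardrail: block disallowed intents before the LLM.
    if "password" in turn.transcript.lower():
        raise ValueError("blocked: credential request")
    turn.audit_log.append(("guardrail", "pass"))
    return turn

def llm_reason(turn: Turn) -> Turn:
    # Stand-in for LLM reasoning / tool use.
    turn.reply_text = f"Let me look that up. You asked: {turn.transcript}"
    turn.audit_log.append(("llm", turn.reply_text))
    return turn

def text_to_speech(turn: Turn) -> Turn:
    # Stand-in for TTS synthesis.
    turn.audit_log.append(("tts", f"{len(turn.reply_text)} chars synthesized"))
    return turn

def run_pipeline(audio: bytes) -> Turn:
    turn = Turn(audio_in=audio)
    for stage in (speech_to_text, guardrail, llm_reason, text_to_speech):
        turn = stage(turn)
    return turn

turn = run_pipeline(b"...pcm audio...")
for step, detail in turn.audit_log:
    print(step, "->", detail)
```

The point is structural: because each step produces an inspectable artifact (transcript, guardrail verdict, reply text), failures can be attributed to a specific stage, which a fused end-to-end model cannot offer.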
Fused (end-to-end) voice models primarily win on latency and natural turn-taking.
Mati suggests fused models can reach ~300ms responsiveness, but they currently sacrifice reliability and step-level observability, making them better for low-stakes or companion-style experiences.
Emotion-aware agents require labeled data more than clever prompts.
A major bottleneck was the lack of datasets that map speech to states like stressed/happy/sad; ElevenLabs invested heavily in labeling to detect emotion, pass it as context, and control response style.
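The "emotion as context" pattern above can be sketched as follows. This is a hypothetical illustration under stated assumptions: the keyword heuristic stands in for a classifier trained on labeled speech-emotion data, and the prompt format and style parameters are invented for the example.

```python
# Hypothetical sketch of emotion-as-context: a detected emotional state
# is threaded into the LLM prompt and the TTS style controls, rather
# than relying on clever prompting alone. Labels, prompt format, and
# style parameters are illustrative assumptions.

def detect_emotion(transcript: str) -> str:
    # Stand-in for a model trained on labeled speech->emotion data;
    # a trivial keyword heuristic for demonstration only.
    if any(w in transcript.lower() for w in ("urgent", "angry", "frustrated")):
        return "stressed"
    return "neutral"

def build_prompt(transcript: str, emotion: str) -> str:
    # Emotion is passed as explicit context, not left for the LLM to infer.
    return (f"[caller emotion: {emotion}]\n"
            f"Respond calmly and briefly.\nCaller: {transcript}")

def tts_style(emotion: str) -> dict:
    # Map detected state to synthesis controls (illustrative parameters).
    return {"stressed": {"pace": "slow", "tone": "reassuring"},
            "neutral": {"pace": "normal", "tone": "friendly"}}[emotion]

emotion = detect_emotion("I'm frustrated, this is urgent!")
print(emotion)             # stressed
print(tts_style(emotion))  # {'pace': 'slow', 'tone': 'reassuring'}
```

This mirrors the bottleneck Mati describes: the heuristic is the easy part to fake, but the real system needs labeled data to make `detect_emotion` reliable.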
WORDS WORTH SAVING
A very peculiar thing that happens in Poland is that if you watch a foreign movie in Polish, all the voices, whether that's a male voice or a female voice, get narrated with one single character.
— Mati Staniszewski
But more so than not in those early days, I think you need to be extremely problem obsessed. What is the problem that they are having? And the variation of what you think the problem is to what the customer actually thinks is a problem is slightly different.
— Mati Staniszewski
It's very easy to get stuck in this loop where other startups are your competition, and ultimately it does not matter for the mission you are solving. It's a long-term game, and many of those people will come through different intersections of your path in the future.
— Mati Staniszewski
We think this is not the future, and you should step away from this and not use that as an authentication method. We think that, from the security perspective, it's the wrong approach.
— Mati Staniszewski
Always think about the value you deliver to the customer and work backwards from there, never from the cost of how much it runs to do something.
— Mati Staniszewski
High quality AI-generated summary created from speaker-labeled transcript.