Stanford Online | Stanford CS153 Frontier Systems | Mati Staniszewski from ElevenLabs on The Future of Voice Systems
At a glance
WHAT IT’S REALLY ABOUT
ElevenLabs’ voice AI evolution: from dubbing to real-time agents
- ElevenLabs began with an AI-dubbing vision inspired by poor Polish voiceover conventions, then narrowed to text-to-speech as the highest-leverage component that unlocked multiple creator and enterprise use cases.
- Mati outlines the core “voice system” pipeline (speech-to-text → translation/LLM reasoning → text-to-speech) and explains why early product feedback pushed the company to perfect TTS quality, expressivity, and voice cloning first.
- The conversation contrasts cascaded vs fused (end-to-end) multimodal architectures, arguing cascaded systems win today for enterprise reliability, tool-use traceability, and guardrails, while fused systems can win on latency and fluidity.
- ElevenLabs’ growth is attributed to tight user feedback loops, a PLG-to-enterprise motion with forward-deployed engineering, small autonomous teams, and pricing anchored on customer value rather than model inference cost.
- Safety and governance themes include voice cloning abuse mitigation, watermarking/detection, discouraging voice-based authentication, resisting distillation attacks, and building trusted ecosystems like voice marketplaces and licensed celebrity voices.
IDEAS WORTH REMEMBERING
Start with the customer’s real pain, not the most ambitious end-state.
ElevenLabs aimed at full AI dubbing but discovered creators first needed simpler voiceover fixes (re-recording lines, narration corrections), which pulled the roadmap toward TTS as the foundational wedge.
In voice systems, “quality” includes context and performance—not just intelligibility.
Human-like speech required modeling emotional delivery, dialogue awareness, and long-context coherence; early models could sound great on short clips but failed on longer passages or consistency.
Cascaded architectures remain the enterprise default because they’re governable.
Separating STT, LLM/tooling, and TTS makes it easier to debug, audit steps, apply guardrails, and ensure correct authentication/payment flows—critical for support, banking, and regulated domains.
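The governability argument above can be sketched in code. This is a minimal, hypothetical illustration of a cascaded pipeline, not an ElevenLabs implementation: each stage is a separate function, so an audit log and a guardrail check can sit between stages. All function names, the `Turn` dataclass, and the stand-in outputs are illustrative assumptions.

```python
# Hypothetical sketch of a cascaded voice pipeline: each stage is a
# separate, inspectable step, so guardrails and audit logs can sit
# between them. Names and outputs are illustrative, not real APIs.
from dataclasses import dataclass, field

@dataclass
class Turn:
    audio_in: bytes
    transcript: str = ""
    reply_text: str = ""
    audit_log: list = field(default_factory=list)

def speech_to_text(turn: Turn) -> Turn:
    # Stand-in for a real STT model.
    turn.transcript = "I'd like to check my balance."
    turn.audit_log.append(("stt", turn.transcript))
    return turn

def guardrail(turn: Turn) -> Turn:
    # Step-level guardrail: block disallowed intents before the LLM.
    if "password" in turn.transcript.lower():
        raise ValueError("blocked: credential request")
    turn.audit_log.append(("guardrail", "pass"))
    return turn

def llm_reason(turn: Turn) -> Turn:
    # Stand-in for LLM reasoning / tool use.
    turn.reply_text = f"Let me look that up. You asked: {turn.transcript}"
    turn.audit_log.append(("llm", turn.reply_text))
    return turn

def text_to_speech(turn: Turn) -> Turn:
    # Stand-in for TTS synthesis.
    turn.audit_log.append(("tts", f"{len(turn.reply_text)} chars synthesized"))
    return turn

def run_pipeline(audio: bytes) -> Turn:
    turn = Turn(audio_in=audio)
    for stage in (speech_to_text, guardrail, llm_reason, text_to_speech):
        turn = stage(turn)
    return turn

turn = run_pipeline(b"...pcm audio...")
for step, detail in turn.audit_log:
    print(step, "->", detail)
```

The point is structural: because each step produces an inspectable artifact (transcript, guardrail verdict, reply text), failures can be attributed to a specific stage, which a fused end-to-end model cannot offer.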
Fused (end-to-end) voice models primarily win on latency and natural turn-taking.
Mati suggests fused models can reach ~300ms responsiveness, but they currently sacrifice reliability and step-level observability, making them better for low-stakes or companion-style experiences.
Emotion-aware agents require labeled data more than clever prompts.
A major bottleneck was the lack of datasets that map speech to states like stressed/happy/sad; ElevenLabs invested heavily in labeling to detect emotion, pass it as context, and control response style.
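The "emotion as context" pattern above can be sketched as follows. This is a hypothetical illustration under stated assumptions: the keyword heuristic stands in for a classifier trained on labeled speech-emotion data, and the prompt format and style parameters are invented for the example.

```python
# Hypothetical sketch of emotion-as-context: a detected emotional state
# is threaded into the LLM prompt and the TTS style controls, rather
# than relying on clever prompting alone. Labels, prompt format, and
# style parameters are illustrative assumptions.

def detect_emotion(transcript: str) -> str:
    # Stand-in for a model trained on labeled speech->emotion data;
    # a trivial keyword heuristic for demonstration only.
    if any(w in transcript.lower() for w in ("urgent", "angry", "frustrated")):
        return "stressed"
    return "neutral"

def build_prompt(transcript: str, emotion: str) -> str:
    # Emotion is passed as explicit context, not left for the LLM to infer.
    return (f"[caller emotion: {emotion}]\n"
            f"Respond calmly and briefly.\nCaller: {transcript}")

def tts_style(emotion: str) -> dict:
    # Map detected state to synthesis controls (illustrative parameters).
    return {"stressed": {"pace": "slow", "tone": "reassuring"},
            "neutral": {"pace": "normal", "tone": "friendly"}}[emotion]

emotion = detect_emotion("I'm frustrated, this is urgent!")
print(emotion)             # stressed
print(tts_style(emotion))  # {'pace': 'slow', 'tone': 'reassuring'}
```

This mirrors the bottleneck Mati describes: the heuristic is the easy part to fake, but the real system needs labeled data to make `detect_emotion` reliable.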
WORDS WORTH SAVING
A very peculiar thing that happens in Poland is that if you watch a foreign movie in Polish, all the voices, whether that's a male voice or a female voice, get narrated with one single character.
— Mati Staniszewski
But more so than not in those early days, I think you need to be extremely problem obsessed. What is the problem that they are having? And the variation of what you think the problem is to what the customer actually thinks is a problem is slightly different.
— Mati Staniszewski
It's very easy to get stuck in this loop where other startups are your competition, and ultimately it does not matter for the mission you are solving. It's a long-term game, and many of those people will come through different intersections of your path in the future.
— Mati Staniszewski
We think this is not the future, and you should step away from this and not use that as an authentication method. We think that, from the security perspective, it's the wrong approach.
— Mati Staniszewski
Always think about the value you deliver to the customer and work backwards from there, never from the cost of how much it runs to do something.
— Mati Staniszewski
High quality AI-generated summary created from speaker-labeled transcript.