
Stanford CS153 Frontier Systems | Mati Staniszewski from ElevenLabs on The Future of Voice Systems

For more information about Stanford's online Artificial Intelligence programs, visit: https://stanford.io/ai

To follow along with the course schedule and syllabus, visit: https://cs153.stanford.edu/

In week two of CS153 ("AI Coachella"), Anjney Midha interviews Mati Staniszewski, founder and CEO of ElevenLabs, tracing the company’s origins from an early Discord text-to-speech bot to a fast-growing frontier audio and speech platform. Mati explains ElevenLabs’ initial focus on solving AI dubbing inspired by Poland’s single-voice film narration, the shift to prioritizing emotional, natural-sounding text-to-speech for creators, and the evolution from cascaded pipelines (transcription, translation/LLM, and speech generation) toward real-time voice agents. They discuss tradeoffs between cascaded versus fused multimodal systems, efforts to detect and convey emotion, safety and voice authentication limits, on-device model deployment, collaboration with teams like Sesame, and business lessons on PLG plus enterprise deployment, team structure, pricing from customer value, and growth to over $430M revenue with ~450 employees.

Guest Speaker: Mati Staniszewski is the CEO and co-founder of ElevenLabs, the AI voice/audio platform. Born in 1995 in a town outside Warsaw, Poland, he attended Copernicus Bilingual High School in Warsaw before earning a degree in mathematics from Imperial College London. While at Imperial, he organized Mathscon, a UK student-led mathematics conference. His earlier career included roles at Opera Software, BlackRock (where he worked in the Portfolio Analytics Group and helped launch the Aladdin Wealth platform), and Palantir Technologies (as a Deployment Strategist managing large-scale public- and private-sector implementations). In 2022, he co-founded ElevenLabs with his high school friend Piotr Dabkowski. He has raised hundreds of millions from investors, including Sequoia, Andreessen Horowitz, and Salesforce Ventures, with the company valued at $11 billion as of February 2026. He joined the board of Klarna in 2025 and was named to Forbes 30 Under 30 Europe in 2024 and TIME's 100 Most Influential People in AI in 2025.

Follow the playlist: https://youtube.com/playlist?list=PLoROMvodv4rN447WKQ5oz_YdYbS74M5IA&si=DOJ5amlyRdyMJBhG

Anjney Midha (host), Mati Staniszewski (guest)

May 4, 2026 · 1h 6m

CHAPTERS

0:07 – 3:06
CS153 welcome + early ElevenLabs origin story on Discord
Anjney introduces Mati Staniszewski and recounts discovering ElevenLabs as a rapidly growing Discord text-to-speech bot. Mati shares that the early company itself ran on Discord, using bots to move fast and stay close to users.
- •ElevenLabs was first noticed as a Discord TTS bot that spread quickly
- •Founding team avoided meetings/email; experimented with running the company on Discord
- •Gaming/creator communities as early indicators for broader tech adoption
- •Early focus on tight feedback loops with real users
3:06 – 5:13
Why community-driven PLG mattered: finding real problems and unexpected use cases
Mati explains ElevenLabs’ initial strategy: build foundational voice research and ship product via product-led growth (PLG). Staying close to creators and developers surfaced use cases and clarified the real customer pain versus founders’ assumptions.
- •Two-part mission: foundational audio research + applied product layer
- •PLG motion emphasized rapid iteration with creators/developers
- •Community reveals emergent use cases faster than top-down planning
- •Problem-obsession: customers define the problem differently than founders expect
- •Voice marketplace as a mechanism for community contribution
5:13 – 7:07
ElevenLabs 101: the Poland dubbing pain that sparked the mission
Mati describes the unique Polish “single narrator voice” dubbing experience that motivated a better future for multilingual content. The founding insight: content should be accessible in any language while preserving emotion and performance.
- •Polish dubbing often uses one monotone narrator for all characters
- •Motivation: preserve original performance while translating across languages
- •European context made language/localization pain more salient than in the US
- •Founders started in London/Warsaw, informed by multilingual market needs
7:07 – 8:19
Decomposing AI dubbing into a pipeline: transcription → translation → speech generation
ElevenLabs initially targeted automatic dubbing and broke it into key model components. They learned success required all pieces—speaker/cleanup, transcription, translation, and expressive TTS—to work together at high quality.
- •Three core model components: transcription, translation, text-to-speech
- •Need speaker identification + noise/background removal upstream
- •Translation quality (pre-GPT era) was a major bottleneck
- •End-to-end dubbing quality depends on weakest link in the cascade
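The cascade described in this chapter can be sketched as three functions composed in sequence. This is a minimal illustration only, not ElevenLabs' implementation: every name here (`Segment`, `transcribe`, `translate`, `synthesize`, `dub`, `weakest_stage`) is hypothetical, the "models" are toy stand-ins that operate on text, and treating end-to-end quality as the minimum stage score is a simplifying assumption used to illustrate the weakest-link point.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class Segment:
    speaker: str   # label from an assumed upstream speaker-identification step
    text: str      # transcript text, later replaced by its translation
    language: str


def transcribe(audio_chunks: List[Dict[str, str]]) -> List[Segment]:
    # Stand-in for a speech-to-text model: the "audio" here is already text.
    return [Segment(c["speaker"], c["words"], "en") for c in audio_chunks]


# Toy phrase table standing in for a translation model (hypothetical data).
TOY_TRANSLATIONS = {("hello", "es"): "hola", ("thank you", "es"): "gracias"}


def translate(segments: List[Segment], target: str) -> List[Segment]:
    # Unknown phrases pass through untranslated, which is how a weak stage
    # silently degrades everything downstream of it.
    return [
        Segment(s.speaker, TOY_TRANSLATIONS.get((s.text.lower(), target), s.text), target)
        for s in segments
    ]


def synthesize(segments: List[Segment]) -> List[str]:
    # Stand-in for expressive TTS: returns a tagged string instead of audio.
    return [f"[{s.speaker}/{s.language}] {s.text}" for s in segments]


def dub(audio_chunks: List[Dict[str, str]], target: str) -> List[str]:
    # The cascade: transcription -> translation -> speech generation.
    return synthesize(translate(transcribe(audio_chunks), target))


def weakest_stage(scores: Dict[str, float]) -> Tuple[str, float]:
    # Assumption: overall quality is capped by the lowest-scoring stage.
    return min(scores.items(), key=lambda kv: kv[1])


if __name__ == "__main__":
    print(dub([{"speaker": "A", "words": "Hello"}], "es"))
```

Because each stage only consumes the previous stage's output, any one of them can be swapped or evaluated in isolation, which is the property the later chapters attribute to cascaded systems.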
8:19 – 13:09
The pivot: from full dubbing to best-in-class text-to-speech for creators
Customer discovery showed creators wanted simpler, immediate value: voiceover fixes, corrections, and narration—before full localization. ElevenLabs narrowed scope to the common denominator: expressive, natural TTS (initially in English).
- •Studios asked first for voiceover corrections and patching bad recordings
- •Shifted focus to a single component rather than the whole dubbing stack
- •Goal: human-like, emotional delivery with context awareness
- •Initial wedge: English TTS for creators + developer API
13:09 – 15:29
Early research approach: open source, papers, and Tortoise as a catalyst
Mati outlines how the team evaluated open/closed solutions and adapted ideas from broader ML literature into audio. Tortoise (open source) demonstrated human-like short-form quality but was too slow and unstable for long-form, guiding ElevenLabs’ architecture choices.
- •Surveyed open source, closed offerings, and academic papers for transferable ideas
- •Open source often ahead; hyperscalers still strong in classic TTS
- •Tortoise showed high-quality short snippets but had latency + long-form stability limits
- •Borrowed from diffusion/transformer-era innovations entering audio
15:29 – 17:23
Compute constraints, startup pragmatism, and why patents didn’t matter
They trained early models with relatively small budgets (helped by compute credit programs) and avoided spending on patents. Mati argues rapid iteration makes patents less valuable and shifts focus to shipping and defending pragmatically if needed.
- •Early training spend: tens of thousands to ~low hundreds of thousands USD
- •Used programs like NVIDIA Inception (era of accessible GPU credits)
- •Chose not to file patents; fast-moving innovation makes them obsolete
- •Awareness of patent-troll dynamics; focus on execution over legal moats
17:23 – 21:59
From TTS to full audio stack: transcription, agents, and beyond (2022–2026 roadmap)
Mati provides a year-by-year evolution of audio capabilities: 2022 expressive TTS, 2023 scaling voices and creator tools, 2024 localization via transcription+translation+dubbing, and 2025 real-time agents. He predicts 2026 focuses on reducing latency and moving from cascaded to more fused/continuous architectures.
- •2022: context-aware, emotional TTS breakthrough
- •2023: multilingual expansion, voice cloning, marketplace, author tools
- •2024: strong localization pipeline (STT + translation + TTS) enables high-quality dubbing
- •2025: real-time voice agents with interactive turn-taking
- •2026: further latency cuts; potential move toward fused/continuous systems
21:59 – 25:37
Cascaded vs fused ‘omni’ voice systems: tradeoffs in quality, reliability, and latency
Anjney challenges why current voice modes miss emotion/intent; Mati frames the core architectural decision: keep cascaded components or train fused models end-to-end. ElevenLabs favors cascaded systems for enterprise reliability and tool orchestration, while fused systems can win on speed for less reliability-critical use cases.
- •Cascaded: STT → LLM → TTS; easier debugging/guardrails/tooling
- •Fused: speech-in to speech-out; faster, but harder to control and audit
- •Three axes: emotionality/quality, reliability (tool calls/guardrails), and latency
- •Enterprise customers prioritize reliability over 300 ms responsiveness
- •Hybrid future: fused for info-seeking, cascaded for authenticated transactions
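The hybrid idea in the last bullet can be read as a routing decision per conversational turn. The sketch below is one hypothetical way to express it; the `Turn` fields and the `choose_architecture` rule are assumptions made for illustration and are not described in the talk.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Turn:
    intent: str
    needs_tool_call: bool = False        # e.g. the agent must call an external system
    needs_authentication: bool = False   # e.g. the user must be verified first


def choose_architecture(turn: Turn) -> str:
    # Assumed rule: anything that requires guardrails or auditing goes through
    # the cascaded STT -> LLM -> TTS path; everything else can use the faster
    # fused speech-to-speech path.
    if turn.needs_authentication or turn.needs_tool_call:
        return "cascaded"
    return "fused"


if __name__ == "__main__":
    print(choose_architecture(Turn("ask opening hours")))
    print(choose_architecture(Turn("transfer money", needs_authentication=True)))
```

A real router would likely weigh all three axes from the chapter (quality, reliability, latency) rather than two boolean flags.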
25:37 – 30:43
Making agents emotionally aware: labeling data, sentiment signals, and controllable delivery
Mati explains why emotion understanding is hard: limited labeled data mapping audio to emotions. ElevenLabs is investing in labeling and released an ‘expressive’ agent approach that detects emotion during transcription, passes it to the LLM, and generates appropriately styled speech.
- •Key hurdle: scarce labeled emotion/sentiment training data in speech
- •Approach: detect emotions on transcription side; condition LLM + TTS delivery
- •Goal: agents respond reassuringly to stress, excitedly to excitement, etc.
- •Expressivity seen as solvable for both cascaded and fused approaches
- •Industry reference points (e.g., Sesame) pushing ‘voice Turing test’ quality
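The flow described above (detect emotion at transcription time, pass it to the LLM, then pick a matching delivery style) can be sketched as follows. All names are hypothetical, and the keyword-based `detect_emotion` is a toy stand-in for a model that would work on audio; only the stress-to-reassuring and excitement-to-excited mappings come from the chapter.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class Transcript:
    text: str
    emotion: str  # label attached on the transcription side


# Mapping taken from the chapter: reassure stressed users, match excitement.
STYLE_FOR_EMOTION = {"stressed": "reassuring", "excited": "excited"}

# Toy keyword list (an assumption); a real system would classify the audio.
STRESS_WORDS = ("worried", "urgent", "help")


def detect_emotion(text: str) -> str:
    lowered = text.lower()
    if any(word in lowered for word in STRESS_WORDS):
        return "stressed"
    if "!" in text:
        return "excited"
    return "neutral"


def build_llm_prompt(transcript: Transcript) -> str:
    # The detected emotion travels with the text so the LLM can adapt its reply.
    return f"[user emotion: {transcript.emotion}] {transcript.text}"


def respond(user_text: str) -> Dict[str, str]:
    transcript = Transcript(user_text, detect_emotion(user_text))
    return {
        "llm_prompt": build_llm_prompt(transcript),
        "tts_style": STYLE_FOR_EMOTION.get(transcript.emotion, "neutral"),
    }


if __name__ == "__main__":
    print(respond("I am worried about my account"))
```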
30:43 – 35:10
Ecosystem culture: collaborating with ‘competitors’ and accelerating the frontier together
Anjney highlights Mati’s openness in crediting and collaborating with other teams (notably Sesame). The discussion frames frontier progress as people-and-collaboration-driven rather than logo-landscape rivalry, with open models and shared learning accelerating the field.
- •Leadership mindset: view frontier work as collective progress
- •Concrete collaboration examples with Sesame; mutual investing and knowledge sharing
- •Open sourcing (e.g., Sesame CSM) expands what builders can do
- •Categories/landscape slides can obscure real technical collaboration dynamics
35:10 – 39:31
ElevenLabs business scale: revenue growth, team structure, and execution model
Mati shares major growth metrics and describes how a ~450-person team operates via small, high-ownership squads. The model emphasizes speed, autonomy, and learning from customers across many vertical deployments.
- •Reported revenue milestones: ~$330M in 2025; >$430M run-rate after adding >$100M ARR this quarter
- •Team size ~450; major hubs in London, New York, Warsaw, SF
- •Org design: teams <10 people with high ownership; tolerance for being wrong
- •Execution focus: fast iteration across research + product + deployment
39:31 – 42:32
Predictable growth via deployment + pricing by value (not cost)
Mati explains that enterprise growth becomes more predictable when you can estimate delivered customer value and scale forward-deployed engineering capacity. On pricing, he argues strongly for value-based pricing—aiming to capture a fraction of created value—rather than cost-plus compute accounting.
- •Forward-deployed engineering bridges ‘lab AI’ to applied customer outcomes
- •Growth constrained by hiring high-IQ/EQ operators who can execute with humility
- •Enterprise revenue becomes predictable by capacity/value delivered per year
- •PLG/self-serve is harder to forecast; innovation cadence drives upside
- •Pricing principle: start from customer value, not model/compute cost
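The difference between the two pricing approaches is easiest to see as arithmetic. The functions and figures below are hypothetical placeholders chosen for illustration; the talk does not give a margin or a capture fraction.

```python
def cost_plus_price(compute_cost: float, margin: float) -> float:
    # Cost-plus: start from what the model costs to run and add a markup.
    if margin < 0:
        raise ValueError("margin must be non-negative")
    return compute_cost * (1 + margin)


def value_based_price(customer_value: float, capture_fraction: float) -> float:
    # Value-based: start from the value delivered and capture a fraction of it.
    if not 0 < capture_fraction <= 1:
        raise ValueError("capture_fraction must be in (0, 1]")
    return customer_value * capture_fraction


if __name__ == "__main__":
    # Placeholder inputs: 1,000 of compute at a 50% markup versus
    # 100,000 of customer value with 25% captured.
    print(cost_plus_price(1_000, 0.5))
    print(value_based_price(100_000, 0.25))
```

With these placeholder inputs the same deployment prices at 1,500 under cost-plus and 25,000 under value-based pricing, which shows why the starting point matters more than the markup.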
42:32 – 59:29
Safety, fraud, and governance: voice cloning, watermarking, and nation-state dynamics
Audience Q&A turns to risks of easy voice replication: fraud, authentication, and public detection. Mati outlines internal safeguards (traceability, moderation) and calls for broader watermarking/detection standards; the conversation extends to government deployments (Ukraine) and concerns about distillation/IP pressure from China.
- •Security: trace generated audio to the creator; detect/stop abuse and fraud
- •Need public/industry detection tools: AI-generated identification + watermarking
- •Voice authentication is unsafe for banking; should be phased out
- •Ukraine deployment: voice access to services and citizen apps under wartime constraints
- •China: mitigate distillation attacks; recognize strong regional models; open ecosystem vs closed frontier trend
59:29 – 1:06:25
Creative adoption + on-device models: controllability, economics, and the next platform layer
Mati addresses why studios hesitate: fear of backlash, economics/IP, and lack of directorial control—now improving with controllability (“more dramatic, slower”). He also reveals progress on running TTS on-device and explains ElevenLabs’ long-term role as a platform that bundles models with integration, orchestration, and evaluation tooling for real deployments.
- •Studios prefer ‘middle-to-middle’ workflows over one-shot prompt-to-output to avoid slop
- •Recent breakthrough: directorial controls for delivery style and pacing
- •Adoption constrained by economics/IP models for voice licensing and reuse
- •On-device TTS is now feasible (language-constrained), but cloud remains richer for full agent stacks
- •Platform vision: integrations (telephony/CRM), tool-calling, testing/evals, monitoring, and continuous improvement
