Nikhil Kamath

The $11B Bet That Voice Will Replace Everything | Mati Staniszewski x Nikhil Kamath | WTF Online

ElevenLabs is the AI company that makes machines sound human. They dubbed the Lex Fridman–Modi podcast into Hindi. They built an AI Gordon Ramsay that teaches you how to cook while you cook. They automated 60,000 customer support calls for Meesho. The company is worth $11 billion and is racing OpenAI to own the future of how humans talk to machines. The CEO is a 29-year-old from Poland named Mati Staniszewski who started it because every foreign film in Poland had one man dubbing every character. I sat down with him in Davos. We got into why headphones will matter more than phones, why the real opportunity for young entrepreneurs isn't building AI models but going deep into one boring domain (automotive, healthcare, e-commerce) and deploying voice agents better than anyone else. Then it took a turn. We ended up talking about why social media is fundamentally broken, why no foreign algorithm should define the mood of India's youth, and why nobody has built an AI-native social product yet. We decided to try.
00:00 Introduction
06:35 Voice as the next tech interface
13:24 Competing with OpenAI and big labs
20:12 Preserving emotion in dubbed content
27:39 Building profitable voice businesses today
35:09 AI valuations and global opportunity
42:29 Geopolitics reshaping trust in tech platforms
49:51 Designing a new social media platform
56:40 Incentivising authenticity over negativity

#nikhilkamath
Co-founder of Zerodha and Gruhas
Host of 'WTF is' & 'People By WTF' Podcast
Twitter: https://x.com/nikhilkamathcio/
Instagram: https://www.instagram.com/nikhilkamathcio/
LinkedIn: https://www.linkedin.com/in/nikhilkamathcio?utm_source=share&utm_campaign=share_via&utm_content=profile&utm_medium=ios_app
Facebook: https://www.facebook.com/nikhilkamathcio/

#matistaniszewski
LinkedIn: https://uk.linkedin.com/in/matiii
Twitter: https://x.com/matiii

Watch 'WTF is' Podcast on Spotify: https://tinyurl.com/4nsm4ezn
Watch 'People by WTF' Podcast on Spotify: https://tinyurl.com/yme92c59
Watch 'WTF Online' on Spotify: https://tinyurl.com/4tjua4th

#WTFiswithnikhilkamath #PeopleByWTF #WTFOnline

Nikhil Kamath (host), Mati Staniszewski (guest)
Mar 10, 2026 · 59m

At a glance

WHAT IT’S REALLY ABOUT

Why voice is the next interface—and how to build on it

  1. Voice will likely become a dominant interface once speech feels human-level—interruptible, emotional, low-latency, and intelligent—so devices can “fade into the background.”
  2. ElevenLabs positions itself as both a research company (foundational audio models) and a product company (creative tools and enterprise voice agents), reducing dependence on LLM platform providers.
  3. High-quality localization/dubbing is shifting from “robotic translations” toward preserving emotion and intent, with the next step being automated lip re-animation to match translated speech.
  4. Near-term profitable opportunities are in domain-specific voice agents for traditional industries (healthcare, automotive, e-commerce, financial services), where integration and workflow deployment are the real moat.
  5. The conversation broadens into geopolitics, trust, and social media: Kamath argues global platforms may fragment, while Staniszewski emphasizes network effects and trust, and both explore how voice/AI could enable a healthier, authenticity-first social product.

IDEAS WORTH REMEMBERING

5 ideas

Voice wins when it matches human conversational dynamics.

Staniszewski argues adoption hinges on speech that’s fast, interruptible, emotionally accurate, and paired with strong “assistant intelligence”; otherwise people won’t tolerate voice agents everywhere.

Great voice UX requires “knowledge plumbing,” not just a good TTS model.

He highlights integrations (CRM, enterprise systems), memory, and deployment channels (WhatsApp, phone numbers, devices) as essential to make voice agents truly useful in real workflows.

Form factor is still the unsolved bottleneck for ubiquitous voice.

They discuss headphones (especially behind-the-ear), pendants, wristbands, and glasses; the technology may exist, but adoption depends on comfortable, always-available hardware with battery, sensors, and social acceptability.

Localization value comes from preserving emotion—and soon, fixing lips.

ElevenLabs’ dubbing approach aims to retain intonation and contextual emotion across languages, but mismatched sentence structure breaks visual timing; lip animation/re-animation is framed as the next major layer.

Entrepreneurs can build big businesses by pairing voice with “boring” domain expertise.

Staniszewski recommends targeting lagging industries (healthcare, automotive) and starting with deployable wedges like customer support, then expanding into richer in-product experiences once trust and distribution are established.

WORDS WORTH SAVING

5 quotes

To make this possible, there's at least three things that need to happen: ... voice quality and interaction ... knowledge access ... and the form factor.

Mati Staniszewski

Dubbing is... a small market... the interactive use case is the biggest market... that will shift everything.

Nikhil Kamath / Mati Staniszewski

An hour podcast will be few dollars... but... we did employ a wider set of people that would actually check every translation.

Mati Staniszewski

I believe social media is broken today... an app which is not governed by an algorithm... triggering hate... I don't think is conducive in the long term.

Nikhil Kamath

If you could have a platform that inspires curiosity and learning... and doesn't lead to just two sides screaming at each other—[that would be] incredible.

Mati Staniszewski

Voice as the next tech interface
Three prerequisites: quality, knowledge, form factor
Headphones vs glasses vs pendants; ambient computing
ElevenLabs: creative tools vs enterprise agents
Emotion-preserving dubbing and localization workflows
OpenAI/platform risk and defensibility beyond models
Geopolitics, data residency, trust, and new social media design

High quality AI-generated summary created from speaker-labeled transcript.
