Nikhil Kamath

The $11B Bet That Voice Will Replace Everything | Mati Staniszewski x Nikhil Kamath | WTF Online

ElevenLabs is the AI company that makes machines sound human. They dubbed the Lex Fridman–Modi podcast into Hindi. They built an AI Gordon Ramsay that teaches you how to cook while you cook. They automated 60,000 customer support calls for Meesho. The company is worth $11 billion and is racing OpenAI to own the future of how humans talk to machines. The CEO is a 29-year-old from Poland named Mati Staniszewski who started it because every foreign film in Poland had one man dubbing every character. I sat down with him in Davos. We got into why headphones will matter more than phones, and why the real opportunity for young entrepreneurs isn't building AI models but going deep into one boring domain (automotive, healthcare, e-commerce) and deploying voice agents better than anyone else. Then it took a turn. We ended up talking about why social media is fundamentally broken, why no foreign algorithm should define the mood of India's youth, and why nobody has built an AI-native social product yet. We decided to try.
00:00 Introduction
06:35 Voice as the next tech interface
13:24 Competing with OpenAI and big labs
20:12 Preserving emotion in dubbed content
27:39 Building profitable voice businesses today
35:09 AI valuations and global opportunity
42:29 Geopolitics reshaping trust in tech platforms
49:51 Designing a new social media platform
56:40 Incentivising authenticity over negativity

#nikhilkamath
Co-founder of Zerodha and Gruhas
Host of 'WTF is' & 'People By WTF' Podcast
Twitter: https://x.com/nikhilkamathcio/
Instagram: https://www.instagram.com/nikhilkamathcio/
LinkedIn: https://www.linkedin.com/in/nikhilkamathcio?utm_source=share&utm_campaign=share_via&utm_content=profile&utm_medium=ios_app
Facebook: https://www.facebook.com/nikhilkamathcio/

#matistaniszewski
LinkedIn: https://uk.linkedin.com/in/matiii
Twitter: https://x.com/matiii

Watch 'WTF is' Podcast on Spotify: https://tinyurl.com/4nsm4ezn
Watch 'People by WTF' Podcast on Spotify: https://tinyurl.com/yme92c59
Watch 'WTF Online' on Spotify: https://tinyurl.com/4tjua4th

#WTFiswithnikhilkamath #PeopleByWTF #WTFOnline

Nikhil Kamath (host), Mati Staniszewski (guest)
Mar 11, 2026 · 59m

CHAPTERS

  1. ElevenLabs in India, and why hardware (Nothing) might be the voice wedge

    Nikhil and Mati open with team presence in India and segue into hardware as the missing link for mainstream voice. They discuss Nothing’s positioning and how AI-native audio devices could make real-time translation and always-on assistance feel natural.

    • ElevenLabs’ growing team footprint in India (Bengaluru/Mumbai)
    • Why hardware is hard—and why Nothing is an interesting bet
    • Earbuds/headphones as a potential “first AI-native” everyday device
    • Real-time translation as a killer use case, gated by form factor
  2. The “meeting companion” vision: voice agents, pendants, and opt-in transcription

    Mati describes internal experiments using voice agents to capture event/meeting feedback and action items. The conversation explores what an ideal wearable note-taker could look like—assuming explicit consent and strong privacy norms.

    • Voice agent for capturing feedback and shortening post-meeting workflows
    • Opt-in transcription and the need for clear “safe environment” norms
    • Pendant/wearable constraints: size, battery, capture quality
    • Why form factor determines adoption as much as model quality
  3. Voice as the next interface: what must improve for a Jarvis-like experience

    Mati outlines why voice is a natural interaction layer and what it will take to move computing “into the background.” He breaks the transition into three requirements: human-level interaction quality, knowledge/memory access, and the right device form factor.

    • Voice will become a major interface across support, education, devices, and robots
    • Three prerequisites: natural interaction, knowledge+memory, and deployable hardware
    • Need for low-latency, interruptible, emotionally expressive voice
    • Vision: phone back in pocket; ambient voice assistance
  4. Which devices win: headphones, glasses, pendants—and even silent speech

    They debate the most likely winners among AI hardware formats. Mati leans toward behind-the-ear audio devices, and mentions emerging “silent speech” approaches that infer speech from mouth movements to interact privately.

    • Likely multi-form-factor future (glasses/headphones/pendants/wristbands)
    • Headphones as the most natural near-term interface; phone remains a companion
    • On-device constraints and why additional sensors may require more wearables
    • Silent-speech interaction via mouth-movement detection
  5. Speculating on the Jony Ive + Sam Altman device, and why OpenAI voice matters

    Nikhil and Mati speculate about what OpenAI and Jony Ive might build, focusing on behavior change vs. adoption friction. They also note OpenAI’s increased emphasis on voice, raising the competitive bar for specialists like ElevenLabs.

    • Adoption likely starts with a familiar handheld form, then expands to companions
    • Strategic importance of pre-installed AI-first hardware for distribution
    • OpenAI’s push into voice as a competitive pressure test
    • Why hardware/software bundling could reshape the AI landscape
  6. Competing with big labs: ElevenLabs’ moat (research + product) and platform split

    Nikhil raises the fear of platform providers moving up the stack; Mati responds with ElevenLabs’ strategy: owning core audio research and shipping products that create durable customer value. He explains their two major lines: creative tools and enterprise/agent platforms.

    • Risk: large labs replicate successful apps built on their platforms
    • ElevenLabs’ approach: foundational audio research plus product execution
    • International voice quality as a differentiator (incl. Indian & European languages)
    • Two platforms: Creative (voiceovers/localization) and Agents (CX/training/education)
  7. Creator workflows and dubbing at scale: fixing audio, then localizing across languages

    Mati explains practical creator use cases: patching missing lines with voice reconstruction and dubbing full podcasts into multiple languages. They discuss the Lex Fridman–PM Modi dubbing workflow and how cost differs between fully automated dubbing and human-reviewed, high-stakes translations.

    • Post-production voice fixes: add/replace lines seamlessly in the same voice
    • End-to-end dubbing workflow with human-in-the-loop review when needed
    • Lex Fridman–PM Modi episode as a stress test of quality and accuracy
    • Cost spectrum: automated dubbing is cheap; precision review adds labor costs
  8. Preserving emotion in dubbed content—and the next step: lip reanimation

    Nikhil challenges the core problem: dubbing often loses emotion and sounds robotic. Mati explains how ElevenLabs attempts to preserve intonation and context, and why sentence structure differences create lip-sync mismatches—leading to the need for lip animation/reanimation.

    • Emotion/intonation preservation via contextual generation, not just translation
    • Structural differences between languages shift timing (e.g., German word order)
    • Current limitations: facial/lip movements may not match translated phrasing
    • Near-term future: automatic lip reanimation integrated into localization workflows
  9. Voice agents that make money today: domain-first businesses in lagging industries

    Asked what entrepreneurs can build now, Mati recommends pairing voice tech with deep domain expertise—especially in “non-innovating” industries like automotive and healthcare. The key is deployment, integrations, compliance, and replicability beyond a single client.

    • Best entry: combine voice agents with strong domain knowledge
    • Targets: automotive, healthcare, financial services, e-commerce
    • Start with customer support, then expand into richer in-product experiences
    • Build once, then abstract into a repeatable product for multiple customers
  10. Real-world deployments: Meesho scale and the ‘AI concierge’ future of commerce

    Mati shares ElevenLabs’ work helping Meesho automate high-volume multilingual customer calls. They then zoom out to a broader vision: voice-driven shopping where an assistant helps discovery, comparison, and ordering—turning e-commerce into a guided conversation.

    • Meesho: large-scale voice support (tens of thousands of calls) in Hindi/English
    • Use cases: shipping/refund/order queries handled via voice agent
    • Next step: on-site AI concierge for product discovery and guided purchase
    • Voice shifts e-commerce from browsing to conversational intent fulfillment
  11. Education, loneliness, and how voice changes what humans should learn

    Nikhil and Mati explore voice agents as companions (including loneliness/intimacy) and as personalized tutors. They debate how always-available knowledge changes education—shifting value from memorization toward learning-to-learn and social development.

    • Loneliness as a major future consumer use case for voice companionship
    • Personal tutors modeled after great teachers (e.g., Feynman-style instruction)
    • If knowledge is always in-ear, schooling shifts away from memorization
    • Future education split: AI learning efficiency + human-only social time
  12. AI valuations and global innovation: bubble talk, Europe/India opportunity, and risks

    Nikhil questions whether AI valuations outpace revenue fundamentals. Mati argues this cycle is different due to clearer value and revenue, while acknowledging frothy pockets (e.g., GPU-layer providers) and emphasizing that global teams outside the Valley can build category leaders.

    • Why some AI valuations may still be rational: high upside and real demand
    • Skepticism around certain infrastructure middle-layers (GPU resellers)
    • Global talent and company creation: Europe/India as major sources of innovation
    • Research breakthroughs can justify premium multiples in early stages
  13. Geopolitics, data residency, and trust: will platforms fragment by country?

    Nikhil predicts a more multipolar tech world where countries demand local data, local platforms, and reduced dependence on global incumbents like WhatsApp. Mati partially disagrees, arguing that foundation models and some infrastructure will remain concentrated while fine-tuning and product layers localize, which makes trust and network effects central.

    • Thesis: rising geopolitical tension may break trust in global tech platforms
    • Data residency likely becomes standard; local versions of products proliferate
    • Counterpoint: foundation-model training remains concentrated due to resources
    • Trust + network effects as the main defenses for global platforms
  14. Designing a new social network: authenticity, verification, voice-first interaction, and incentives

    Nikhil shares plans for an India-first social platform that reduces algorithmic outrage and foreign influence over youth culture. Mati proposes voice-enabled “companion” consumption (summaries, deeper questions), multilingual access, strong human verification, and carefully designed incentives that don’t reward knee-jerk negativity.

    • Problem framing: current social media is brand/influencer-heavy and outrage-optimized
    • Proposed approach: tighter entry/verification to reduce bots and trolls
    • Voice layer: assistant-like summaries, voice comments, and interactive engagement
    • Hardest challenge: incentives that reward authenticity/curiosity (or neutrality) over outrage
