YC Root Access

This Startup Built the Infrastructure Powering Voice AI

In this episode of Founder Firesides, YC Managing Partner Jared Friedman talks to Dylan Fox, the founder of AssemblyAI (S17), which has raised $160M to date. AssemblyAI is the voice AI infrastructure platform powering 10,000 companies, including Granola, Zoom, and Delta Airlines.

https://www.assemblyai.com/
Apply to Y Combinator: https://www.ycombinator.com/apply
Work at a startup: https://www.ycombinator.com/jobs

Chapters:
02:08 - What AssemblyAI actually does
05:23 - Dylan learns to code and discovers ML
07:11 - The Amazon Echo moment
09:32 - Why Dylan built voice AI infrastructure
13:02 - Building AI before anyone cared
16:50 - The 2021 inflection point
24:13 - Real-time voice agents are here
28:26 - Inside AssemblyAI’s new voice models
45:33 - Lessons from hypergrowth
52:00 - The future of voice AI

Jared Friedman (host), Dylan Fox (guest)
Mar 5, 2026 · 53m

CHAPTERS

  1. AssemblyAI today: the voice AI infrastructure layer at massive scale

    Dylan explains AssemblyAI’s core role: providing the APIs and primitives companies use to build voice AI products. He shares scale metrics (developers, customers, voice hours) and illustrates how AssemblyAI powers everything from note-takers to enterprise workflows.

  2. Customer examples: note-takers, contact centers, and enterprise deployments

    Concrete customer stories make the platform tangible: popular note-taking tools and large contact-center stacks depend on AssemblyAI under the hood. The discussion highlights the “embedded” nature of infrastructure—end users often don’t realize voice is being processed by AssemblyAI.

  3. From self-taught coder to early machine learning practitioner

    Dylan recounts learning to code via books while building small college startups, then moving into machine learning professionally. He places his journey in the pre-LLM era, when classical ML gave way to early neural-network momentum around 2014–2015.

  4. The Amazon Echo moment: voice finally felt reliable

    Buying an Amazon Echo in 2015 was a turning point: voice recognition worked across the room and in noisy environments, creating new user habits. That experience triggered Dylan’s curiosity about building with voice—only to discover the developer tooling ecosystem was missing.

  5. Why build voice AI infrastructure: “Twilio/Stripe for voice” vs. CD-ROM incumbents

    Dylan contrasts modern developer platforms (Twilio/Stripe) with the painful incumbent voice stack (Nuance’s expensive, outdated SDK workflow). AssemblyAI’s founding idea emerges: apply deep learning advances to voice and deliver it through a world-class developer experience.

  6. Building before the market existed: early YC years and slow iteration cycles

    During its YC batch in 2017, AssemblyAI faced a brutal reality: models were hard to improve quickly, and the market for voice applications was tiny. Dylan explains the broader missing ecosystem (LLMs, real-time tech, mobile networks) that had to mature before voice apps could flourish.

  7. The 2021 inflection point: COVID data, transformers, and NLP stacking

    AssemblyAI’s breakout begins around 2021 as more voice data moved online during remote work and model quality/cost improved. Transformers, more data, and adjacent NLP capabilities made it easier to build useful voice applications beyond raw transcription.

  8. From batch audio to real-time: the threshold moment for voice agents

    Early deployments were mostly non-real-time (post-call analytics, meeting transcription). Dylan explains why real-time is harder and why it only recently crossed a practical threshold—unlocking voice agents and other interactive experiences now “good enough” to deploy broadly.

  9. Where voice AI is heading: agents, robotics/hardware, and ambient intelligence

    Dylan outlines the fastest-growing application categories: voice agents that customers can’t reliably distinguish from humans, voice interfaces for robots and consumer devices, and ambient capture in healthcare and sales. The focus shifts from novelty to ROI-driven deployment.

  10. Inside the new generation of voice models: Universal-3 Pro and ‘promptable STT’

    AssemblyAI introduces a model positioned between traditional speech-to-text and multimodal LLMs: reliable transcription plus instruction-following for controllability. The emphasis is on staying ‘on the rails’ for speech tasks while adding configurable behavior developers can shape via prompts.

  11. Live demo: verbatim capture, alphanumerics, whisper robustness, and translation prompts

    Dylan demonstrates low-latency verbatim transcription, strong performance on emails and alphanumeric strings, and better-than-typical robustness under difficult audio (including whispering). He also shows prompt-driven behavior changes such as translating speech into Spanish during transcription.

  12. Controllability for real apps: cross-talk handling and ‘what to ignore’ vs ‘what to capture’

    A key theme is that different products need different behaviors—sometimes you want background speakers, sometimes you don’t. Dylan shows prompting the model to mark cross-talk segments without transcribing them, highlighting granular control that developers can tailor per use case.

  13. Hypergrowth lessons: hiring, capital pressure, and staying lean for speed

    Dylan reflects on scaling from a tiny team to rapid growth after major fundraising. He shares hiring and organizational lessons: avoid hiring just to ‘explore,’ define role non-negotiables, prioritize mission/market passion, and minimize process overhead to preserve speed.

  14. Living in the voice-first future: company-wide knowledge from transcripts and feedback

    AssemblyAI uses its own technology to capture and organize internal knowledge, making customer truth accessible to everyone. Dylan argues AI-augmented organizations will outcompete those that rely on manual layers of interpretation between customers and builders.
