YC Root Access: This Startup Built the Infrastructure Powering Voice AI
At a glance
WHAT IT’S REALLY ABOUT
AssemblyAI built voice infrastructure enabling real-time agents and audio understanding
- AssemblyAI provides voice AI infrastructure—APIs and models—used by products like note takers, contact centers, and emerging real-time voice agents at massive scale.
- Fox’s motivation came from experiencing the Amazon Echo’s reliability and noticing that existing speech vendors (e.g., Nuance) offered a poor, inaccessible developer experience, nothing like what Stripe and Twilio gave developers.
- The company started extremely early (YC 2017) and endured years of slow progress until a 2021–2022 inflection driven by better models, more voice data from COVID-era remote work, and a maturing ecosystem (LLMs, WebRTC, etc.).
- Voice is shifting from batch transcription to real-time interaction, with recent breakthroughs crossing thresholds of accuracy, latency, and cost that unlock widespread deployment.
- AssemblyAI’s roadmap centers on “smarter” promptable speech models (e.g., Universal-3 Pro) that remain reliable for transcription while adding instruction-following and richer audio understanding.
IDEAS WORTH REMEMBERING
Voice AI becomes valuable when it’s developer-accessible infrastructure, not just a model.
AssemblyAI’s wedge is a Stripe/Twilio-like experience for voice primitives so teams can ship note takers, call analytics, and agents without building speech stacks in-house.
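To make that Stripe/Twilio-like experience concrete, here is a minimal sketch against AssemblyAI’s public v2 transcription endpoint: submit a job, poll for completion, read the text. The API key and audio URL are placeholders, not values from the episode.

```python
# Minimal sketch: submit an audio file for transcription and poll for the result.
# Endpoint shapes follow AssemblyAI's public REST docs; key and URL are placeholders.
import time
import requests

API_KEY = "YOUR_ASSEMBLYAI_API_KEY"  # placeholder
BASE = "https://api.assemblyai.com/v2"
headers = {"authorization": API_KEY}

# Kick off an async transcription job for a hosted audio file.
job = requests.post(
    f"{BASE}/transcript",
    headers=headers,
    json={"audio_url": "https://example.com/meeting.mp3"},  # placeholder URL
).json()

# Poll until the job finishes (a webhook would avoid polling in production).
while True:
    result = requests.get(f"{BASE}/transcript/{job['id']}", headers=headers).json()
    if result["status"] in ("completed", "error"):
        break
    time.sleep(3)

print(result.get("text") or result.get("error"))
```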
Product breakthroughs hinge on crossing practical thresholds—not just incremental accuracy gains.
Fox describes adoption surging once real-time models became “good enough” on accuracy, latency, and cost; before that, even compelling demos didn’t create sustained usage.
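For a picture of what “real-time” means at the API layer, here is a sketch against AssemblyAI’s v2 real-time WebSocket endpoint: stream PCM audio up, print partial and final transcripts as they arrive. The endpoint and message fields follow its public docs as of this writing and may have changed since; the API key is a placeholder.

```python
# Sketch: stream 16 kHz, 16-bit mono PCM frames over a WebSocket and print
# transcripts as they arrive. Fields follow AssemblyAI's v2 real-time docs.
import asyncio, base64, json
import websockets  # pip install websockets

API_KEY = "YOUR_ASSEMBLYAI_API_KEY"  # placeholder
URL = "wss://api.assemblyai.com/v2/realtime/ws?sample_rate=16000"

async def stream(pcm_chunks):
    # websockets>=14 renames extra_headers to additional_headers.
    async with websockets.connect(URL, extra_headers={"Authorization": API_KEY}) as ws:
        await ws.recv()  # session-begins handshake message

        async def send():
            for chunk in pcm_chunks:  # raw PCM frames as bytes
                await ws.send(json.dumps({"audio_data": base64.b64encode(chunk).decode()}))
                await asyncio.sleep(0.1)  # pace roughly in real time
            await ws.send(json.dumps({"terminate_session": True}))

        async def receive():
            async for msg in ws:
                data = json.loads(msg)
                if data.get("message_type") in ("PartialTranscript", "FinalTranscript"):
                    print(data["text"])

        await asyncio.gather(send(), receive())

# Example: asyncio.run(stream(frames)), where frames yields raw PCM byte chunks.
```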
Ecosystem readiness can matter more than a single company’s technology.
Early on, voice apps lacked enabling layers (LLMs, vector databases, WebRTC, bandwidth/5G), so most use cases stayed non-real-time and narrowly transcription-focused.
COVID accelerated voice AI by increasing recorded voice data and normalizing audio workflows.
Remote work, meetings, and podcasts expanded the amount of internet voice data and made transcription + downstream analysis (summaries, sentiment) more economically compelling.
Promptable speech models are a middle path between STT and multimodal LLMs.
Universal-3 Pro is positioned as reliable for transcription-like tasks while also following instructions (e.g., translate to Spanish, mark crosstalk), without the “off-the-rails” behavior of general multimodal LLMs.
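To make “promptable” concrete: a request might pair audio with an instruction. The speech_model value and prompt field below are illustrative assumptions built from the episode’s examples, not documented API parameters.

```python
# Hypothetical illustration only: the "speech_model" and "prompt" fields are
# assumptions for illustration, not documented AssemblyAI parameters.
import requests

headers = {"authorization": "YOUR_ASSEMBLYAI_API_KEY"}  # placeholder
job = requests.post(
    "https://api.assemblyai.com/v2/transcript",
    headers=headers,
    json={
        "audio_url": "https://example.com/support-call.mp3",       # placeholder
        "speech_model": "universal-3-pro",                          # assumed identifier
        "prompt": "Translate the speech to Spanish and mark crosstalk.",  # assumed field
    },
)
print(job.json())
```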
WORDS WORTH SAVING
We really are focused on helping other companies have the voice AI infrastructure and primitives they need to just innovate around voice.
— Dylan Fox
They would mail you a CD-ROM with a developer SDK on it, and I didn’t even have a CD-ROM drive on my laptop.
— Dylan Fox
If you said AI… it was like a scam. Deep learning was the thing to say.
— Dylan Fox
Real-time voice agents… work really well now. It’s kind of wild.
— Dylan Fox
My goal is having as little operational overhead as possible… teetering… on the line of complete chaos.
— Dylan Fox