A New Approach To AI Models

During last month’s NeurIPS 2025 conference, YC’s Ankit Gupta sat down with Karan Goel, founder and CEO of Cartesia, to explain why today’s AI architectures may be fundamentally limited. They discuss why transformers behave more like retrieval systems than learning systems, how state space models enable compression and abstraction, and why multimodal intelligence may require a whole new approach. The conversation also covers why Cartesia chose AI voice as a wedge product, and how research-driven companies can balance deep technical bets with real-world product discipline.

Apply to Y Combinator: https://www.ycombinator.com/apply
Work at a startup: https://www.ycombinator.com/jobs

Chapters:
00:11 Introducing Cartesia
00:26 From Architecture Research to Startup
01:20 What “Architecture Research” Really Means
02:18 Why Transformers Hit a Ceiling
03:33 State Space Models Explained
04:21 Intelligence as Compression
05:47 Retrieval vs. Abstraction
06:41 Hybrid Architectures and the Future
07:13 Why Cartesia Chose Voice AI
08:25 What Multimodality Actually Means
09:20 Audio as a Recipe for Other Modalities
10:09 Tokens, Representations, and Learning Signals
11:37 Learning Representations End-to-End
12:29 Building for the “Average Human”
13:54 Research vs. Product Reality
15:18 One Vision, Ruthlessly Executed
16:28 Product as a Truth Serum for Research
17:25 Startup Gravity Applies to Research Too

Ankit Gupta (host), Karan Goel (guest)

Jan 9, 2026 · 18 min

CHAPTERS

Meet Cartesia: architecture-first AI, known today for voice
Karan Goel introduces Cartesia as a two-year-old company founded by former Stanford PhD researchers focused on “architecture research.” While many people recognize Cartesia for developer-focused voice AI models, he frames the company’s core identity as building new model approaches and commercializing them through products.
What “architecture research” means (and why it matters now)
Karan contrasts recent AI progress, dominated by scaling a proven architecture, with the earlier era when new architectures such as transformers drove step changes. He explains their grad-school motivation: identify what breaks when transformers are scaled “to their logical conclusion,” especially for efficient, human-like intelligence.
Why transformers may hit a ceiling for human-like intelligence
Karan argues the transformer paradigm has architectural limitations for long-timescale, multimodal, interactive intelligence. He frames transformers as great for today’s recipes, but not necessarily the endpoint for models that need to consolidate knowledge over long horizons the way humans do.
State Space Models (SSMs) as an alternative direction
He introduces state-space models as a recurrent-family approach and credits his co-founder Albert as a pioneer of the modern SSM wave. The chapter positions SSMs as a serious architectural exploration motivated by different tradeoffs than attention-based models.
Intelligence as compression: a core mental model
Karan proposes compression as a primitive underlying intelligence: to reason over huge amounts of information, systems must build abstractions that consolidate meaning across modalities. He uses the example of a “cup” as it appears in text, in the physical world, and in spoken audio to illustrate the need for unified representations.
Retrieval vs abstraction: transformers and SSMs as opposite extremes
He characterizes transformers as retrieval-oriented: they keep history in raw form and query it via keys/values/queries, enabling high-fidelity access. SSMs compress history into a “fuzzier” state—losing some fidelity but gaining abstraction—highlighting a central tradeoff between recall and compression.
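The tradeoff described above can be made concrete with a toy sketch. It is not Cartesia's code or any production architecture: the `KVCache` and `LinearState` classes, the one-hot keys, and the decay constants `a` and `b` are all assumptions chosen for illustration. The cache stores every past item verbatim and retrieves one with softmax attention, while the recurrent state folds the same history into a single number.

```python
import math

class KVCache:
    """Transformer-style memory: every past (key, value) pair is kept verbatim."""

    def __init__(self):
        self.keys = []
        self.values = []

    def write(self, key, value):
        self.keys.append(key)
        self.values.append(value)

    def read(self, query):
        # Softmax attention over the whole stored history.
        scores = [sum(q * k for q, k in zip(query, key)) for key in self.keys]
        top = max(scores)
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        return sum(w / total * v for w, v in zip(weights, self.values))

    def size(self):
        return len(self.keys)  # grows with the length of the history


class LinearState:
    """SSM-style memory: h_t = a * h_{t-1} + b * x_t, a fixed-size summary."""

    def __init__(self, a=0.9, b=0.1):
        self.a, self.b, self.h = a, b, 0.0

    def write(self, x):
        self.h = self.a * self.h + self.b * x

    def read(self):
        return self.h  # a blend of the history, not any single item

    def size(self):
        return 1  # constant, however long the history is


n = 8
cache, state = KVCache(), LinearState()
for i in range(n):
    one_hot = [1.0 if j == i else 0.0 for j in range(n)]
    cache.write(one_hot, float(i))
    state.write(float(i))

# A sharp query aimed at item 3 recovers its value almost exactly.
query = [20.0 if j == 3 else 0.0 for j in range(n)]
recalled = cache.read(query)
summary = state.read()
```

Here `recalled` lands very close to the stored value for item 3, at the cost of memory that grows with every write. `summary` is a single decayed average: cheap to keep and update, but the individual items can no longer be read back out of it.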
Hybrid architectures and the search for the “ultimate” multimodal model
Karan notes the emergence of hybrid models that combine strengths of transformers and SSM-like components. He argues the real question is the ultimate architecture for long-timescale, multimodal learning—beyond stitching pieces together—especially for models that can learn and act over long horizons.
Why Cartesia focused on voice: a grounded wedge into multimodality
He reframes “multimodal” beyond flashy video: audio-to-text is inherently multimodal because it maps continuous signals to discrete symbols. Cartesia chose audio+text as a focused slice of the broader signal-to-symbol problem, allowing concrete productization without “biting off the entire pie.”
Audio as a transferable recipe for other modalities (video, robotics, more)
Karan argues many domains share a common bottleneck: how to represent continuous signals as tokens/representations for learning and prediction. He claims that solving representation learning well for audio generalizes to video, images, and robotics signals (trajectories, joint angles), because they share the same core structure: learning over signals.
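As a point of reference for the signal-to-symbol problem, here is a minimal sketch of a hand-engineered tokenizer: uniform quantization of a waveform into integer ids. It is an illustration only, not the approach discussed in the conversation; the function names, the 256-level vocabulary, and the sine-wave "audio" are all assumptions.

```python
import math

def tokenize(samples, levels=256):
    """Map samples in [-1, 1] to integer token ids in [0, levels - 1]."""
    tokens = []
    for s in samples:
        s = max(-1.0, min(1.0, s))  # clip out-of-range samples
        tokens.append(min(levels - 1, int((s + 1.0) / 2.0 * levels)))
    return tokens

def detokenize(tokens, levels=256):
    """Map each token id back to the centre of its quantization bin."""
    return [(t + 0.5) / levels * 2.0 - 1.0 for t in tokens]

# Hypothetical signal: one cycle of a sine wave sampled 16 times.
signal = [math.sin(2 * math.pi * n / 16) for n in range(16)]
tokens = tokenize(signal)
restored = detokenize(tokens)
max_error = max(abs(a - b) for a, b in zip(signal, restored))
```

The scheme is fixed in advance: the bin boundaries do not depend on the data, and the reconstruction error is bounded by half a bin width. The same function would apply unchanged to a joint-angle trajectory or a pixel row, which is one way to read the claim that these domains share a common structure.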
Rethinking tokens: toward end-to-end learned representations
He describes Cartesia’s intersection of architecture + tokenization: moving away from hand-engineered pipelines and toward models that learn hierarchical representations internally. The aim is to reduce dependence on fixed tokenization schemes and let the model discover the right abstractions end-to-end.
Building for the “average human”: voice agents as long-horizon, interactive work
Karan frames their product ambition as building systems that can do what ordinary people do—high-context, interpersonal, action-oriented tasks over long periods. He uses the example of a call-center agent that must onboard quickly, interact naturally, and improve over years, emphasizing long-term adaptation and reliability.
Research vs product reality: one vision, executed with discipline
Karan contrasts academia’s many parallel visions with startups’ constraint: there’s only room for one. He explains the need to preserve exploration inside a focused company direction, and argues that product requirements force clarity about what matters and what doesn’t.
Product as a truth serum—and “startup gravity” applies to research companies too
He argues that customers impose intellectual honesty: you don’t ship an architecture “just because”; you have to prove it improves outcomes. He closes by warning that research startups are still subject to core startup dynamics (distribution, speed, iteration, and discipline), so founders should adopt operational wisdom (like YC’s) even in research-heavy settings.
