At a glance
WHAT IT’S REALLY ABOUT
Cartesia’s bet: beyond transformers via compression-driven multimodal architectures for voice
- Cartesia was founded by Stanford PhD researchers to commercialize “architecture research,” not just scale existing transformer recipes.
- Goel argues transformers are strong at retrieval over raw context but hit ceilings on long-context efficiency and abstraction, motivating compression-oriented alternatives like state-space models (SSMs).
- SSMs trade some fidelity for compressed representations that enable abstraction; the emerging direction is hybrid architectures that combine attention's retrieval strengths with SSMs' compression strengths.
- Cartesia chose audio+text because it is inherently multimodal (signal-to-symbol) and provides a grounded, solvable path to general multimodal learning recipes.
- Running a research-driven startup requires a single, ruthlessly executed vision, with product constraints acting as a “truth serum” that enforces empirical honesty over novelty-for-novelty’s-sake.
IDEAS WORTH REMEMBERING
5 ideas
Architecture research is about changing the core learning recipe, not just scaling it.
Goel frames the last decade as converging on strong transformer recipes, but says key remaining challenges (efficiency, long context, multimodal interaction) require new architectural primitives.
Transformers skew toward retrieval over raw context, which limits abstraction under long horizons.
He characterizes attention as making past information available “in raw form,” which is powerful for recalling specifics but weak at building compressed, reusable world models.
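To make the "raw form" point concrete, here is a minimal single-head attention decoding step (a generic NumPy sketch, not Cartesia's or any specific model's code): every past key and value is kept verbatim in a cache, so memory grows with context length and each step re-scans the whole history.

```python
import numpy as np

def attention_step(q, K_cache, V_cache):
    """One single-head attention decoding step.

    The KV cache keeps every past token's key/value in raw form, so
    memory grows linearly with context and recall of specifics is exact.
    """
    scores = K_cache @ q / np.sqrt(q.shape[-1])   # compare query against ALL past keys
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ V_cache                      # weighted lookup over raw stored values

# Toy usage: the cache grows by one row per generated token.
d = 8
K_cache, V_cache = np.zeros((0, d)), np.zeros((0, d))
for _ in range(16):
    k, v, q = np.random.randn(d), np.random.randn(d), np.random.randn(d)
    K_cache, V_cache = np.vstack([K_cache, k]), np.vstack([V_cache, v])
    out = attention_step(q, K_cache, V_cache)     # cost and memory scale with history
```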
Compression is positioned as a fundamental primitive for intelligence.
The discussion ties human-like intelligence to consolidating many forms of experience (text, audio, physical meaning) into abstractions that remain useful across time and tasks.
SSMs emphasize compressed internal state, trading fidelity for abstraction.
SSMs are described as living on the opposite extreme from transformers: they “lose fidelity” by compressing history, but gain generalized representations that can support longer-range reasoning.
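For contrast, a minimal diagonal linear state-space recurrence (a generic sketch of the SSM idea, not Cartesia's S4/Mamba implementation): all history is folded into a fixed-size state, so memory stays constant however long the sequence gets, at the cost of exact recall.

```python
import numpy as np

def ssm_scan(u, A, B, C):
    """Discrete diagonal linear SSM: x_t = A * x_{t-1} + B * u_t,  y_t = C . x_t.

    The state x has a fixed size regardless of sequence length, so the
    entire history is compressed into it: constant memory, O(1) per step.
    """
    x = np.zeros_like(A)
    ys = []
    for u_t in u:                  # u is a 1-D sequence of scalar inputs
        x = A * x + B * u_t        # fold the new input into the compressed state
        ys.append(C @ x)           # read out from the state, never from raw history
    return np.array(ys)

# Toy usage: the state stays size 4 whether the sequence has 10 steps or 10 million.
n = 4
A = np.full(n, 0.9)                # diagonal (elementwise) state transition
B, C = np.ones(n), np.ones(n) / n
y = ssm_scan(np.random.randn(1000), A, B, C)
```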
The near-term frontier is hybrids that blend retrieval and compression.
Goel points to a trend of architectures combining strengths of both paradigms, suggesting neither pure transformer nor pure SSM is the final answer for long-timescale multimodal agents.
WORDS WORTH SAVING
5 quotes
I think machine learning and AI is just putting data into architectures to make really cool models on lots of compute, right?
— Karan Goel
So I think that transformers are fundamentally limited by their inability to model and compress representations in this way, and they're sort of context window machines, right? They're very retrieval-oriented machines, right?
— Karan Goel
SSMs have a fuzzier representation of the world, so they try to compress all this information, which means you lose fidelity, but at the same time you gain something, which is by compression you build abstraction.
— Karan Goel
So a simple way to say it is we wanna get rid of tokens... and have the model learn this representation internally.
— Karan Goel
I think in a company, it's almost the opposite, where there's only really room for one vision.
— Karan Goel