YC Root Access

A New Approach To AI Models

Ankit Gupta and Karan Goel on Cartesia’s bet: moving beyond transformers via compression-driven multimodal architectures for voice.

Ankit Gupta (host), Karan Goel (guest)
Jan 9, 2026 · 18m
Cartesia’s origin and positioning in voice AI
Definition and motivation of “architecture research”
Transformer limits: context windows and retrieval bias
State-space models: recurrent compression and abstraction
Hybrid architectures combining retrieval and compression
Multimodality as signal-to-symbol learning (audio↔text)
Product discipline vs. academic exploration in research teams
AI-generated summary based on the episode transcript.

In this episode of YC Root Access, “A New Approach To AI Models,” Ankit Gupta and Karan Goel explore Cartesia’s bet: moving beyond transformers via compression-driven multimodal architectures for voice. Cartesia was founded by Stanford PhD researchers to commercialize “architecture research,” not just scale existing transformer recipes.

At a glance

WHAT IT’S REALLY ABOUT

Cartesia’s bet: beyond transformers via compression-driven multimodal architectures for voice

  1. Cartesia was founded by Stanford PhD researchers to commercialize “architecture research,” not just scale existing transformer recipes.
  2. Goel argues transformers are strong at retrieval over raw context but hit ceilings when they must abstract efficiently over long contexts, motivating compression-oriented alternatives like state-space models (SSMs).
  3. SSMs trade some fidelity for compressed representations that enable abstraction, and the emerging direction is hybrid architectures that combine retrieval strengths with compression strengths.
  4. Cartesia chose audio+text because it is inherently multimodal (signal-to-symbol) and provides a grounded, solvable path to general multimodal learning recipes.
  5. Running a research-driven startup requires a single, ruthlessly executed vision, with product constraints acting as a “truth serum” that enforces empirical honesty over novelty-for-novelty’s-sake.

IDEAS WORTH REMEMBERING

5 ideas

Architecture research is about changing the core learning recipe, not just scaling it.

Goel frames the last decade as converging on strong transformer recipes, but says key remaining challenges (efficiency, long context, multimodal interaction) require new architectural primitives.

Transformers skew toward retrieval over raw context, which limits abstraction under long horizons.

He characterizes attention as making past information available “in raw form,” which is powerful for recalling specifics but weak at building compressed, reusable world models.
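
To make the “raw form” point concrete, here is a minimal pure-Python sketch of softmax attention for a single query. The function name `attend` and the toy keys and values are illustrative assumptions, not anything from the episode or from Cartesia’s models, and the usual scaling factor is left out for brevity.

```python
import math

def attend(query, keys, values):
    """Softmax attention for one query over a stored history (toy sketch)."""
    # Score the query against every stored key with a dot product.
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    peak = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    weights = [w / total for w in weights]
    dim = len(values[0])
    # The output is a weighted average of the stored values.
    return [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]

# Every past step stays in memory as its own (key, value) pair.
keys = [[10.0, 0.0, 0.0], [0.0, 10.0, 0.0], [0.0, 0.0, 10.0]]
values = [[1.0], [2.0], [3.0]]

# A query aligned with the second key recovers the second value almost exactly.
recalled = attend([0.0, 10.0, 0.0], keys, values)
```

Here `recalled[0]` is very close to 2.0: the specific item comes back with high fidelity, but only because all three items are still stored individually, so memory grows with the length of the history.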

Compression is positioned as a fundamental primitive for intelligence.

The discussion ties human-like intelligence to consolidating many forms of experience (text, audio, physical meaning) into abstractions that remain useful across time and tasks.

SSMs emphasize compressed internal state, trading fidelity for abstraction.

SSMs are described as living on the opposite extreme from transformers: they “lose fidelity” by compressing history, but gain generalized representations that can support longer-range reasoning.
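
As a rough illustration of “losing fidelity” through compression, the sketch below folds a whole history into one number with a simple linear recurrence. This is a deliberately tiny stand-in (a scalar exponential moving average with an assumed `decay` of 0.5), not a real state-space model and not Cartesia’s architecture.

```python
def compress(history, decay=0.5):
    """Fold a sequence into one number: state = decay * state + (1 - decay) * x."""
    state = 0.0
    for x in history:
        state = decay * state + (1.0 - decay) * x
    return state

# Two different histories...
first = compress([1.0, 0.0, 0.0, 0.0])
second = compress([0.0, 0.5, 0.0, 0.0])

# ...end up in the same compressed state, so the original detail is gone.
collided = first == second
```

Both histories collapse to the same state (0.0625), so the exact past cannot be recovered. In exchange, the state stays the same size however long the history gets, which is the tradeoff the summary describes.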

The near-term frontier is hybrids that blend retrieval and compression.

Goel points to a trend of architectures combining strengths of both paradigms, suggesting neither pure transformer nor pure SSM is the final answer for long-timescale multimodal agents.

WORDS WORTH SAVING

5 quotes

I think machine learning and AI is just, um, putting data into architectures to make, um, really cool models, uh, on lots of compute, right?

Karan Goel

So I think that transformers are fundamentally limited by their, um, inability to model and compress, I think, compress representations in this way, and they're, um, sort of like context window machines, right? Like, uh, they're very retrieval-oriented machines, right?

Karan Goel

SSMs have a fuzzier representation of the world, so they try to compress all this information, which means you lose fidelity, uh, but you at the same time gain something, which is by compression you build abstraction.

Karan Goel

So a, a simple way to say it is we wanna get rid of tokens... and have the model learn this representation internally.

Karan Goel

I think in a company, it's almost the opposite, where there's only really room for one vision.

Karan Goel

QUESTIONS ANSWERED IN THIS EPISODE

5 questions

When you say transformers are “retrieval-oriented,” which concrete failure modes have you observed in real products (e.g., voice agents) that stem from that bias?

What does “compression” operationally mean in your models—are you measuring it via information bottlenecks, state size, perplexity under constrained memory, or something else?

Where do today’s state-space approaches still fall short compared to attention (e.g., exact recall, tool use, controllability), and how do hybrids address those gaps?

You mentioned wanting to “get rid of tokens” and learn representations end-to-end—what learning signals or objectives make that stable for raw audio, and what breaks first?

In audio→text problems, what aspects of alignment (timing, prosody, speaker traits) matter most for downstream agent performance versus just transcription accuracy?

Chapter Breakdown

Meet Cartesia: architecture-first AI, known today for voice

Karan Goel introduces Cartesia as a two-year-old company founded by former Stanford PhD researchers focused on “architecture research.” While many people recognize Cartesia for developer-focused voice AI models, he frames the company’s core identity as building new model approaches and commercializing them through products.

What “architecture research” means (and why it matters now)

Karan contrasts recent AI progress, dominated by scaling a proven architecture, with the earlier era when new architectures drove step-changes (e.g., transformers). He explains their grad-school motivation: identify what breaks when transformers are scaled “to their logical conclusion,” especially for efficient, human-like intelligence.

Why transformers may hit a ceiling for human-like intelligence

Karan argues the transformer paradigm has architectural limitations for long-timescale, multimodal, interactive intelligence. He frames transformers as great for today’s recipes, but not necessarily the endpoint for models that need to consolidate knowledge over long horizons the way humans do.

State Space Models (SSMs) as an alternative direction

He introduces state-space models as a recurrent-family approach and credits his co-founder Albert as a pioneer of the modern SSM wave. The chapter positions SSMs as a serious architectural exploration motivated by different tradeoffs than attention-based models.

Intelligence as compression: a core mental model

Karan proposes compression as a primitive underlying intelligence: to reason over huge amounts of information, systems must build abstractions that consolidate meaning across modalities. He uses examples like “cup” across text, physical world, and spoken audio to illustrate the need for unified representations.

Retrieval vs abstraction: transformers and SSMs as opposite extremes

He characterizes transformers as retrieval-oriented: they keep history in raw form and query it via keys/values/queries, enabling high-fidelity access. SSMs compress history into a “fuzzier” state—losing some fidelity but gaining abstraction—highlighting a central tradeoff between recall and compression.
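
The tradeoff can be written down in standard textbook notation. The symbols below are the conventional ones for causal attention and a discrete linear state-space recurrence; they are assumptions added for illustration, not formulas quoted in the episode, and the attention scaling factor is omitted.

```latex
% Causal attention: the output at step t reads from every stored (k_i, v_i), i <= t.
\[
  y_t = \sum_{i \le t} \frac{\exp(q_t^\top k_i)}{\sum_{j \le t} \exp(q_t^\top k_j)} \, v_i
\]

% Linear state-space recurrence: the past is summarized in a fixed-size state h_t.
\[
  h_t = \bar{A} \, h_{t-1} + \bar{B} \, x_t, \qquad y_t = C \, h_t
\]
```

In the first form the stored history grows with $t$, which is what makes high-fidelity recall possible. In the second, the dimension of $h_t$ is fixed, so older inputs survive only through whatever the recurrence retains of them.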

Hybrid architectures and the search for the “ultimate” multimodal model

Karan notes the emergence of hybrid models that combine strengths of transformers and SSM-like components. He argues the real question is the ultimate architecture for long-timescale, multimodal learning—beyond stitching pieces together—especially for models that can learn and act over long horizons.
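
One hypothetical way to picture “combining strengths” is a memory that keeps a short window of raw items for exact lookup and folds everything older into a compressed summary. The `HybridMemory` class, its `window` and `decay` parameters, and the averaging rule are all assumptions made for this sketch; it is not how Cartesia or any particular published hybrid is built.

```python
from collections import deque

class HybridMemory:
    """Toy memory: exact recent window plus a compressed summary of older items."""

    def __init__(self, window=2, decay=0.5):
        self.recent = deque(maxlen=window)  # raw items, retrievable exactly
        self.summary = 0.0                  # fixed-size compressed state
        self.decay = decay

    def write(self, x):
        if len(self.recent) == self.recent.maxlen:
            # The oldest raw item is about to leave the window,
            # so fold it into the summary before it is dropped.
            evicted = self.recent[0]
            self.summary = self.decay * self.summary + (1.0 - self.decay) * evicted
        self.recent.append(x)

memory = HybridMemory(window=2)
for x in [4.0, 8.0, 1.0, 2.0]:
    memory.write(x)

recent_items = list(memory.recent)  # the last two inputs, kept verbatim
older_summary = memory.summary      # a lossy blend of the first two inputs
```

After the four writes, the last two inputs are still available exactly, while the first two survive only as a single blended number. Total memory stays bounded by the window size plus one state value.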

Why Cartesia focused on voice: a grounded wedge into multimodality

He reframes “multimodal” beyond flashy video: audio-to-text is inherently multimodal because it maps continuous signals to discrete symbols. Cartesia chose audio+text as a focused slice of the broader signal-to-symbol problem, allowing concrete productization without “biting off the entire pie.”

Audio as a transferable recipe for other modalities (video, robotics, more)

Karan argues many domains share a common bottleneck: how to represent continuous signals as tokens/representations for learning and prediction. He claims that solving representation learning well for audio generalizes to video, images, and robotics signals (trajectories, joint angles), because they share the same core structure: learning over signals.
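
For a sense of what “representing continuous signals as tokens” can look like in its simplest hand-engineered form, the sketch below uniformly quantizes samples into a small integer vocabulary. The `quantize` function and the choice of 8 levels are assumptions for illustration; this is the kind of fixed scheme the episode contrasts with learned representations, not a description of Cartesia’s tokenizer.

```python
import math

def quantize(samples, levels=8):
    """Map samples in [-1.0, 1.0] to integer token ids in [0, levels - 1]."""
    tokens = []
    for s in samples:
        clipped = max(-1.0, min(1.0, s))
        # Scale [-1, 1] to [0, levels], then clamp the top edge into the last bin.
        tokens.append(min(levels - 1, int((clipped + 1.0) / 2.0 * levels)))
    return tokens

# One cycle of a sine wave stands in for a continuous audio signal.
signal = [math.sin(2 * math.pi * n / 8) for n in range(8)]
tokens = quantize(signal)
```

The same function would apply unchanged to any bounded one-dimensional signal, such as a normalized joint angle over time, which is the sense in which the representation problem is shared across modalities. The bin boundaries here are fixed in advance rather than learned.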

Rethinking tokens: toward end-to-end learned representations

He describes Cartesia’s intersection of architecture + tokenization: moving away from hand-engineered pipelines and toward models that learn hierarchical representations internally. The aim is to reduce dependence on fixed tokenization schemes and let the model discover the right abstractions end-to-end.

Building for the “average human”: voice agents as long-horizon, interactive work

Karan frames their product ambition as building systems that can do what ordinary people do—high-context, interpersonal, action-oriented tasks over long periods. He uses the example of a call-center agent that must onboard quickly, interact naturally, and improve over years, emphasizing long-term adaptation and reliability.

Research vs product reality: one vision, executed with discipline

Karan contrasts academia’s many parallel visions with startups’ constraint: there’s only room for one. He explains the need to preserve exploration inside a focused company direction, and argues that product requirements force clarity about what matters and what doesn’t.

Product as a truth serum—and “startup gravity” applies to research companies too

He argues customers impose intellectual honesty: you don’t ship an architecture “just because,” and must prove it improves outcomes. He closes by warning that research startups are still subject to core startup dynamics—distribution, speed, iteration, and discipline—so founders should adopt operational wisdom (like YC’s) even in research-heavy settings.
