YC Root Access

A New Approach To AI Models

During last month’s NeurIPS 2025 conference, YC’s Ankit Gupta sat down with Karan Goel, founder and CEO of Cartesia, to explain why today’s AI architectures may be fundamentally limited. They discuss why transformers behave more like retrieval systems than learning systems, how state space models enable compression and abstraction, and why multimodal intelligence may require a whole new approach. The conversation also covers why Cartesia chose AI voice as a wedge product, and how research-driven companies can balance deep technical bets with real-world product discipline.

Apply to Y Combinator: https://www.ycombinator.com/apply
Work at a startup: https://www.ycombinator.com/jobs

Chapters:
00:11 - Introducing Cartesia
00:26 - From Architecture Research to Startup
01:20 - What “Architecture Research” Really Means
02:18 - Why Transformers Hit a Ceiling
03:33 - State Space Models Explained
04:21 - Intelligence as Compression
05:47 - Retrieval vs. Abstraction
06:41 - Hybrid Architectures and the Future
07:13 - Why Cartesia Chose Voice AI
08:25 - What Multimodality Actually Means
09:20 - Audio as a Recipe for Other Modalities
10:09 - Tokens, Representations, and Learning Signals
11:37 - Learning Representations End-to-End
12:29 - Building for the “Average Human”
13:54 - Research vs. Product Reality
15:18 - One Vision, Ruthlessly Executed
16:28 - Product as a Truth Serum for Research
17:25 - Startup Gravity Applies to Research Too

Ankit Gupta (host), Karan Goel (guest)
Jan 9, 2026 · 18m

CHAPTERS

  1. Meet Cartesia: architecture-first AI, known today for voice

    Karan Goel introduces Cartesia as a two-year-old company founded by former Stanford PhD researchers focused on “architecture research.” While many people recognize Cartesia for developer-focused voice AI models, he frames the company’s core identity as building new model approaches and commercializing them through products.

    • Cartesia was founded by Stanford PhD researchers
    • Company roots are in “architecture research,” not just applications
    • Public perception: a voice AI company building models for developers
    • Goal: commercialize new research directions through real products
  2. What “architecture research” means (and why it matters now)

    Karan contrasts recent AI progress—dominated by scaling a proven architecture—with the earlier era where new architectures drove step-changes (e.g., transformers). He explains their grad-school motivation: identify what breaks when transformers are scaled “to their logical conclusion,” especially for efficient, human-like intelligence.

    • Architecture research = designing the model structures that learn from data/compute
    • Industry trend: scaling/engineering on one dominant architecture
    • Their question (2019–2020): what limits appear at full scale?
    • Human intelligence inspires goals: efficiency, long context, multimodality, interaction
  3. Why transformers may hit a ceiling for human-like intelligence

    Karan argues the transformer paradigm has architectural limitations for long-timescale, multimodal, interactive intelligence. He frames transformers as great for today’s recipes, but not necessarily the endpoint for models that need to consolidate knowledge over long horizons the way humans do.

    • Transformers are powerful but may be the wrong endpoint for human-like systems
    • Target capabilities: very long context, multimodal grounding, interactive use
    • Efficiency matters (“intelligence per watt” comparison to humans)
    • Motivation to pursue fundamentally different architectures
  4. State Space Models (SSMs) as an alternative direction

    He introduces state-space models as a recurrent-family approach and credits his co-founder Albert Gu as a pioneer of the modern SSM wave. The chapter positions SSMs as a serious architectural exploration motivated by different tradeoffs than attention-based models.

    • SSMs framed as recurrent models used for deep learning
    • Albert (co-founder) helped pioneer modern SSMs
    • SSMs represent a distinct architectural family vs transformers
    • Belief: AI still has big unsolved problems beyond reusing the same recipes
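
    To make the "recurrent model" framing concrete, here is a minimal sketch of the textbook discrete linear state-space recurrence in one dimension. The function name `ssm_scan` and the coefficients `a`, `b`, `c` are illustrative assumptions; the talk gives no equations, and this is not Cartesia's model.

```python
def ssm_scan(inputs, a=0.9, b=0.1, c=1.0):
    """Run a one-dimensional linear state-space recurrence over a sequence.

    State update: h_t = a * h_{t-1} + b * x_t
    Readout:      y_t = c * h_t
    The coefficients a, b, c are illustrative assumptions, not learned values.
    """
    h = 0.0  # the entire memory of the past is this single number
    outputs = []
    for x in inputs:
        h = a * h + b * x      # fold the new input into the running state
        outputs.append(c * h)  # read the output from the state only
    return outputs


# Feed a single impulse followed by silence.
impulse_response = ssm_scan([1.0, 0.0, 0.0])
```

    With these assumed coefficients, the response to the impulse decays geometrically: the state keeps a fading summary of the input rather than the input itself.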
  5. Intelligence as compression: a core mental model

    Karan proposes compression as a primitive underlying intelligence: to reason over huge amounts of information, systems must build abstractions that consolidate meaning across modalities. He uses examples like “cup” across text, physical world, and spoken audio to illustrate the need for unified representations.

    • Compression enables abstraction and consolidation
    • Models should unify concepts across text/audio/vision/action
    • Humans operate over long timescales by building compressed internal models
    • The goal is not only recall, but interactive, durable understanding
  6. Retrieval vs abstraction: transformers and SSMs as opposite extremes

    He characterizes transformers as retrieval-oriented: they keep history in raw form and query it via keys/values/queries, enabling high-fidelity access. SSMs compress history into a “fuzzier” state—losing some fidelity but gaining abstraction—highlighting a central tradeoff between recall and compression.

    • Transformers: raw-context access, strong retrieval behavior
    • Attention mechanism encourages selective retrieval from stored context
    • SSMs: compress history into internal state, trading fidelity for abstraction
    • Core tension: perfect recall vs abstract, compact world models
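
    The recall-versus-compression tradeoff described above can be sketched with two toy memories. Everything here is a hypothetical illustration: `attention_read`, `CompressedState`, the `scale` and `decay` values, and the one-hot keys are assumptions chosen for clarity, not details from the talk.

```python
import math


def attention_read(query, keys, values, scale=10.0):
    """Softmax-weighted read over every stored (key, value) pair."""
    scores = [scale * sum(q * k for q, k in zip(query, key)) for key in keys]
    top = max(scores)  # subtract the max score for numerical stability
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return sum(w * v for w, v in zip(weights, values)) / total


class CompressedState:
    """Fixed-size memory: an exponential moving average of the values seen."""

    def __init__(self, decay=0.5):
        self.decay = decay
        self.state = 0.0

    def update(self, value):
        self.state = self.decay * self.state + (1 - self.decay) * value


# Retrieval-style memory: history is kept in raw form and grows with it.
keys = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
values = [1.0, 5.0, 9.0]
recalled = attention_read([1, 0, 0], keys, values)  # close to 1.0

# Compression-style memory: one number, whatever the history length.
memory = CompressedState()
for v in values:
    memory.update(v)
summary = memory.state  # 5.875, a blend of all three values
```

    The attention read recovers the first value almost exactly but has to store all three pairs. The compressed state uses constant memory and can only return a blend weighted toward recent values.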
  7. Hybrid architectures and the search for the “ultimate” multimodal model

    Karan notes the emergence of hybrid models that combine strengths of transformers and SSM-like components. He argues the real question is the ultimate architecture for long-timescale, multimodal learning—beyond stitching pieces together—especially for models that can learn and act over long horizons.

    • Hybrid models are combining retrieval and compression strengths
    • Many variants exist; implementation/inference details matter
    • Focus: architectures that support long-timescale learning and use
    • End goal: best architecture for multimodal, interactive intelligence
  8. Why Cartesia focused on voice: a grounded wedge into multimodality

    He reframes “multimodal” beyond flashy video: audio-to-text is inherently multimodal because it maps continuous signals to discrete symbols. Cartesia chose audio+text as a focused slice of the broader signal-to-symbol problem, allowing concrete productization without “biting off the entire pie.”

    • Multimodality = signal + discrete symbols (not only image/video)
    • Speech recognition/transcription is already a multimodal mapping
    • Audio+text offers a tractable, high-impact research/product wedge
    • Grounded problem selection enables shipping while advancing research
  9. Audio as a transferable recipe for other modalities (video, robotics, more)

    Karan argues many domains share a common bottleneck: how to represent continuous signals as tokens/representations for learning and prediction. He claims that solving representation learning well for audio generalizes to video, images, and robotics signals (trajectories, joint angles), because they share the same core structure: learning over signals.

    • Shared multimodal problem: representing signals for model training
    • Audio tokenization parallels video/image tokenization and robotics state encoding
    • Key question: what is the best learned representation of a signal?
    • A strong “audio recipe” can generalize to broader multimodal/embodied settings
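
    As a concrete picture of the signal-to-symbol step, the sketch below turns a sampled waveform into discrete token ids with uniform quantization and maps them back. This is a fixed, hand-engineered scheme shown only for illustration; `quantize`, `dequantize`, and the choice of 8 levels are assumptions, and it is not Cartesia's tokenizer.

```python
import math


def quantize(samples, levels=8):
    """Map samples in [-1, 1] to integer token ids in [0, levels - 1]."""
    tokens = []
    for s in samples:
        s = max(-1.0, min(1.0, s))            # clip to the expected range
        idx = int((s + 1.0) / 2.0 * levels)   # pick the bin the sample falls in
        tokens.append(min(idx, levels - 1))   # s == 1.0 belongs to the top bin
    return tokens


def dequantize(tokens, levels=8):
    """Map token ids back to the centre of each quantization bin."""
    return [(t + 0.5) / levels * 2.0 - 1.0 for t in tokens]


# One period of a sine wave stands in for a continuous signal such as audio.
signal = [math.sin(2 * math.pi * n / 16) for n in range(16)]
tokens = quantize(signal)
reconstruction = dequantize(tokens)
max_error = max(abs(a - b) for a, b in zip(signal, reconstruction))
```

    The reconstruction error is bounded by half a bin width (here 1/8), which shows the cost of a fixed discretization: detail finer than the bin size is lost before any model sees the data.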
  10. Rethinking tokens: toward end-to-end learned representations

    He describes Cartesia’s intersection of architecture + tokenization: moving away from hand-engineered pipelines and toward models that learn hierarchical representations internally. The aim is to reduce dependence on fixed tokenization schemes and let the model discover the right abstractions end-to-end.

    • Tokenization is central to scaling beyond text
    • Critique of hand-engineered signal processing as the interface to models
    • Goal: models learn hierarchies/abstractions internally, end-to-end
    • “Get rid of tokens” as a direction: reduce rigid discretization assumptions
  11. Building for the “average human”: voice agents as long-horizon, interactive work

    Karan frames their product ambition as building systems that can do what ordinary people do—high-context, interpersonal, action-oriented tasks over long periods. He uses the example of a call-center agent that must onboard quickly, interact naturally, and improve over years, emphasizing long-term adaptation and reliability.

    • Target intelligence: practical, interactive competence (not Olympiad-style IQ)
    • Voice agents/call-center tasks stress long context and real-world interaction
    • Desired behavior: onboard day one, operate and improve for years
    • Humans excel at systems + people + context; AI architectures should match that
  12. Research vs product reality: one vision, executed with discipline

    Karan contrasts academia’s many parallel visions with startups’ constraint: there’s only room for one. He explains the need to preserve exploration inside a focused company direction, and argues that product requirements force clarity about what matters and what doesn’t.

    • Academia: many visions and curiosity-driven exploration
    • Startup: single coherent vision; limited tolerance for random exploration
    • Need to create space for exploration without losing focus
    • Execution focus: “prosecute that point of view to the end of the earth”
  13. Product as a truth serum—and “startup gravity” applies to research companies too

    He argues customers impose intellectual honesty: you don’t ship an architecture “just because,” and must prove it improves outcomes. He closes by warning that research startups are still subject to core startup dynamics—distribution, speed, iteration, and discipline—so founders should adopt operational wisdom (like YC’s) even in research-heavy settings.

    • Customers don’t care about novelty; they care about performance
    • Product constraints force rigorous evaluation of research claims
    • Avoid delusion about model quality/impact while keeping ambitious vision
    • Research startups still follow “startup gravity”; YC-style discipline applies
