No Priors

Baseten CEO Tuhin Srivastava on Custom Models and Building the Inference Cloud

Baseten CEO and co-founder Tuhin Srivastava sits down with Sarah Guo and Elad Gil to discuss the rapid growth of AI inference demand, Baseten’s 30x growth, and why inference is becoming the strategic “last market.” Srivastava argues the application layer will persist because companies with unique user signals can encode value into workflows and post-train specialized models, citing examples like Abridge and support workflows. The conversation covers GPU capacity constraints, Baseten’s multi-cloud fabric across 18 clouds and 90 clusters, long-term contracting dynamics, the importance of the software layer for stickiness, evolving workloads, multichip possibilities, and operational lessons at scale.

Sign up for new podcasts every week. Email feedback to show@no-priors.com. Follow us on Twitter: @NoPriorsPod | @Saranormous | @EladGil | @Tuhinone

Chapters:
00:31 Baseten growth
01:55 Why the app layer wins
05:57 Serving frontier customers
07:55 Open-source model mix
09:21 Chinese models and geopolitics
13:07 Custom inference dominates
14:22 Post-training acquisition
17:10 When to invest in custom models
18:35 Supply crunch and data centers
22:25 Longer GPU contracts
24:09 What makes a winner
26:07 Multi-chip future
28:19 Runtime roadmap
31:08 Scaling edge cases
33:48 Hiring and leadership
36:44 Operations pager culture
38:19 Efficiency drives demand
40:41 Concierge everything future
42:34 Conclusion

Sarah Guo (host) · Tuhin Srivastava (guest) · Elad Gil (host)
May 1, 2026 · 42m · Watch on YouTube ↗

At a glance

WHAT IT’S REALLY ABOUT

Baseten CEO on custom models, scaling inference, and compute constraints

  1. Baseten’s growth is driven by the rapid expansion of the application layer and the mainstreaming of post-training/RL techniques that let companies “own” and specialize inference.
  2. Most production inference on Baseten is custom (about 90–95% of tokens), with customers modifying, compiling, and optimizing open-source weights for both quality and performance rather than running vanilla models.
  3. GPU supply is extremely tight with minimal slack, pushing providers toward multi-year contracts with significant prepay and making operational reliability and access to quality suppliers as critical as software.
  4. Customers prioritize model capability first and cost second, using a mix of frontier closed models and increasingly strong open-source models (including Chinese-origin models), while navigating security and geopolitical considerations.
  5. Baseten’s product direction emphasizes an end-to-end loop: inference produces data and eval signals, which feeds post-training, which in turn drives more inference—plus runtime innovations like cache-aware routing, prefill/decode separation, and agent sandboxes.

IDEAS WORTH REMEMBERING

5 ideas

Workflow moats beat model moats for most application companies.

Srivastava argues the durable advantage is proprietary user signal embedded in end-to-end workflows (e.g., clinician edits and EMR integration), which labs can’t easily access to post-train long-horizon systems.

The market is still early: enterprise inference adoption is mostly ahead of us.

By inference volume, he estimates ~99% is still from AI-native app companies today, implying a large upcoming wave as enterprises move from API trials to custom-model deployment.

Custom models are already the production default for serious users.

Baseten sees ~90–95% of tokens as “custom” inference—customers fine-tune/post-train and also compile/quantize/optimize for latency and cost, not just accuracy.

Capability-first buying drives model choice; cost optimization comes later.

Customers start with the best-performing model for value creation, then optimize cost/latency—meaning infrastructure must support rapid switching and deep optimization across many models.

GPU scarcity is deeper than most narratives suggest, and “good supply” is rarer than raw supply.

He describes mid-90s utilization as normal and notes many new suppliers lack data-center and inference-SLA maturity, shrinking the set of truly reliable providers to a small top tier.

WORDS WORTH SAVING

5 quotes

I think everyone is really realizing that you can put AI everywhere.

Tuhin Srivastava

To the extent that it is encoded in workflows, that is where they will be able to develop a moat.

Tuhin Srivastava

It is all custom.

Tuhin Srivastava

No post-training pre-product-market fit is what I'd say.

Tuhin Srivastava

As much as we hear about it, I don't think people realize how bad it really is.

Tuhin Srivastava

30x growth and inference as a massive market
Why the application layer persists vs. frontier labs
Serving AI-native companies that sell into enterprises
Open-source model mix and frontier capability race
Chinese models: security concerns and geopolitics
Custom inference, compilation, and performance tuning
GPU supply crunch, contracts, and cost of capital
Runtime roadmap: KV-cache routing, prefill/decode split, speculation
Scale edge cases and hyperscaler limitations
Hiring leadership and operations/pager culture
Jevons paradox and rising demand from efficiency
“Concierge everything” agentic future

High quality AI-generated summary created from speaker-labeled transcript.
