No Priors: Baseten CEO Tuhin Srivastava on Custom Models and Building the Inference Cloud
CHAPTERS
- 0:31 – 1:55
Baseten’s 30x growth and why inference demand is exploding
Tuhin explains Baseten’s rapid growth as a reflection of AI getting embedded “everywhere,” with open-source model quality crossing a key threshold. He frames Baseten as an index on the expanding application layer as more teams bring intelligence in-house and serve a growing long tail of specialized models.
- 1:55 – 5:57
Why the application layer still wins against frontier labs
The discussion tackles whether labs will capture the whole stack. Tuhin argues durable application businesses come from proprietary user signal and workflow integration, not just model weights—making it hard for frontier model companies to displace deeply embedded apps.
- 5:57 – 7:55
Serving frontier customers to learn enterprise requirements indirectly
Baseten prioritizes its highest-scale, most demanding customers and uses them as a proxy for enterprise needs. Many of these customers sell into regulated enterprises and “translate” those requirements back into Baseten’s platform roadmap.
- 7:55 – 9:21
The open-source model mix: best-model-first, then optimize cost
Tuhin describes a capability-first mindset: customers start with frontier performance and later optimize latency and cost. Baseten sees broad experimentation across model families, including Chinese-origin models and specialized modalities like text-to-speech (TTS).
- 9:21 – 13:07
Chinese models, security concerns, and geopolitics
Elad raises security and “Trojan horse” concerns about Chinese-origin models. Tuhin argues that network isolation and the absence of evidence of tampering keep the practical risk low, while emphasizing the strategic importance of the US maintaining strong open-source alternatives.
- 13:07 – 14:22
Custom inference dominates: almost nobody runs vanilla weights
Baseten’s workload is overwhelmingly custom: customers modify models for quality and performance rather than serving unmodified open-source weights. Tuhin outlines Baseten’s product lines and emphasizes that compilation/optimization is as central as fine-tuning.
- 14:22 – 17:10
Post-training acquisition: why Baseten bought a research team
Tuhin explains acquiring Parsed to add post-training expertise and move closer to customers earlier in their lifecycle. He highlights how post-training and inference are deeply linked (e.g., quantization choices depend on training), enabling an iterative improvement loop.
- 17:10 – 18:35
When to invest in custom models: avoid post-training pre-PMF
Customers ask when to move from frontier APIs to custom models. Tuhin advises proving value with best-in-class models first, then optimizing once there’s real product-market fit and a clear user signal to train against.
- 18:35 – 22:25
Supply crunch reality: running at 90%+ utilization across 18 clouds
Tuhin describes how severe GPU scarcity is in practice, with very little slack compute anywhere. Baseten’s “runtime fabric” across many providers improves reliability and failover, and doubles as a competitive advantage in sourcing capacity quickly.
- 22:25 – 24:09
Longer GPU contracts, prepay, and why cost of capital matters
The market is shifting toward multi-year commitments with meaningful prepayment, especially for frontier GPUs like B200s. This turns inference into a capital-and-financing game, affecting strategy from procurement to potential IPO timing.
- 24:09 – 26:07
What makes an inference winner: software stickiness + compute access + ops
Tuhin argues “GPUs as a service” is commoditized, but inference platforms with a strong software layer are sticky. Winning requires both software excellence and secured compute supply, plus the operational maturity to deliver mission-critical SLAs.
- 26:07 – 31:08
Multi-chip future: diversification is desired, but Nvidia speed and ecosystem dominate
The conversation turns to whether inference will diversify beyond Nvidia. Tuhin expects inference-specific chips over time, but emphasizes Nvidia’s supply chain, CUDA ecosystem, and time-to-market advantages—plus how exclusive supply deals can stifle broader ecosystems.
- 31:08 – 33:48
Runtime roadmap and scaling edge cases: sandboxes, KV-cache routing, and weird failures
Tuhin outlines Baseten’s runtime priorities: support for new workload types (agents, diffusion, video), sandboxes, and performance techniques like prefill/decode separation. At scale, the team encounters real-world systems failures and immature LLM runtime primitives that require ongoing engineering.
- 33:48 – 38:19
Hiring, leadership, and pager culture in an always-on cloud
Tuhin explains the shift from a very flat org to bringing in leaders who can own “whole problems.” He stresses clear hiring rubrics (first-principles thinking, low ego, collaboration) and describes the intense operational culture required to run inference reliably, including pervasive on-call expectations.
- 38:19
Efficiency drives more demand (Jevons paradox) and the ‘concierge everything’ future
As inference gets cheaper, developers embed more intelligence—especially via longer-running agents—rather than stopping at “good enough.” Tuhin predicts a future of ubiquitous personalized assistants (“concierge everything”) and argues companies that don’t adapt face existential risk.
Who’s adopting AI first: AI-native apps vs enterprise in-house builds
Elad contrasts AI-native application companies with enterprises building internally. Tuhin estimates AI-native companies still dominate inference volume today, but enterprise adoption is now visibly progressing from tools → APIs → custom models.