No Priors

Baseten CEO Tuhin Srivastava on Custom Models, and Building the Inference Cloud

Baseten CEO and co-founder Tuhin Srivastava sits down with Sarah Guo and Elad Gil to discuss the rapid growth of AI inference demand, Baseten’s 30x growth, and why inference is becoming the strategic “last market.” Srivastava argues the application layer will persist because companies with unique user signals can encode value into workflows and post-train specialized models, citing examples like Abridge and support workflows. The conversation covers GPU capacity constraints, Baseten’s multi-cloud fabric across 18 clouds and 90 clusters, long-term contracting dynamics, the importance of the software layer for stickiness, evolving workloads, multichip possibilities, and operational lessons at scale.

Sign up for new podcasts every week. Email feedback to show@no-priors.com

Follow us on Twitter: @NoPriorsPod | @Saranormous | @EladGil | @Tuhinone

Chapters: 00:31 Baseten growth · 01:55 Why the app layer wins · 05:57 Serving frontier customers · 07:55 Open-source model mix · 09:21 Chinese models and geopolitics · 13:07 Custom inference dominates · 14:22 Post-training acquisition · 17:10 When to invest in custom models · 18:35 Supply crunch and data centers · 22:25 Longer GPU contracts · 24:09 What makes a winner · 26:07 Multi-chip future · 28:19 Runtime roadmap · 31:08 Scaling edge cases · 33:48 Hiring and leadership · 36:44 Operations pager culture · 38:19 Efficiency drives demand · 40:41 Concierge everything future · 42:34 Conclusion

Sarah Guo (host) · Tuhin Srivastava (guest) · Elad Gil (host)
May 1, 2026 · 42m · Watch on YouTube ↗

CHAPTERS

  1. 0:31 – 1:55

    Baseten’s 30x growth and why inference demand is exploding

    Tuhin explains Baseten’s rapid growth as a reflection of AI getting embedded “everywhere,” with open-source model quality crossing a key threshold. He frames Baseten as an index on the expanding application layer as more teams bring intelligence in-house and serve a growing long tail of specialized models.

  2. 1:55 – 5:57

    Why the application layer still wins against frontier labs

    The discussion tackles whether labs will capture the whole stack. Tuhin argues durable application businesses come from proprietary user signal and workflow integration, not just model weights—making it hard for frontier model companies to displace deeply embedded apps.

  3. 5:57 – 7:55

    Serving frontier customers to learn enterprise requirements indirectly

    Baseten prioritizes the highest-scale, most demanding customers and uses them as a proxy for enterprise needs. Many Baseten customers sell into regulated, demanding enterprises and “translate” requirements back into Baseten’s platform roadmap.

  4. 7:55 – 9:21

    The open-source model mix: best-model-first, then optimize cost

    Tuhin describes a capability-first mindset: customers start with frontier performance and later optimize latency and cost. Baseten sees broad experimentation across model families, including Chinese-origin models and specialized modalities like TTS.

  5. 9:21 – 13:07

    Chinese models, security concerns, and geopolitics

    Elad raises security and “Trojan horse” concerns about Chinese-origin models. Tuhin argues network boundaries and lack of evidence reduce practical risk, while emphasizing the strategic importance of the US maintaining strong open-source alternatives.

  6. 13:07 – 14:22

    Custom inference dominates: almost nobody runs vanilla weights

    Baseten’s workload is overwhelmingly custom: customers modify models for quality and performance rather than serving unmodified open-source weights. Tuhin outlines Baseten’s product lines and emphasizes that compilation/optimization is as central as fine-tuning.

  7. 14:22 – 17:10

    Post-training acquisition: why Baseten bought a research team

    Tuhin explains acquiring Parsed to add post-training expertise and move closer to customers earlier in their lifecycle. He highlights how post-training and inference are deeply linked (e.g., quantization choices depend on training), enabling an iterative improvement loop.

  8. 17:10 – 18:35

    When to invest in custom models: avoid post-training pre-PMF

    Customers ask when to move from frontier APIs to custom models. Tuhin advises proving value with best-in-class models first, then optimizing once there’s real product-market fit and a clear user signal to train against.

  9. 18:35 – 22:25

    Supply crunch reality: running at 90%+ utilization across 18 clouds

    Tuhin describes how severe GPU scarcity is in practice, with very little slack compute anywhere. Baseten’s “runtime fabric” across many providers helps reliability and failover, but also becomes a competitive advantage in sourcing capacity quickly.

  10. 22:25 – 24:09

    Longer GPU contracts, prepay, and why cost of capital matters

    The market is shifting toward multi-year commitments with meaningful prepayment, especially for frontier GPUs like B200s. This turns inference into a capital-and-financing game, affecting strategy from procurement to potential IPO timing.

  11. 24:09 – 26:07

    What makes an inference winner: software stickiness + compute access + ops

    Tuhin argues “GPUs as a service” is commoditized, but inference platforms with a strong software layer are sticky. Winning requires both software excellence and secured compute supply, plus the operational maturity to deliver mission-critical SLAs.

  12. 26:07 – 31:08

    Multi-chip future: diversification is desired, but Nvidia speed and ecosystem dominate

    The conversation turns to whether inference will diversify beyond Nvidia. Tuhin expects inference-specific chips over time, but emphasizes Nvidia’s supply chain, CUDA ecosystem, and time-to-market advantages—plus how exclusive supply deals can stifle broader ecosystems.

  13. 31:08 – 33:48

    Runtime roadmap and scaling edge cases: sandboxes, KV-cache routing, and weird failures

    Tuhin outlines Baseten’s runtime priorities: support for new workload types (agents, diffusion, video), sandboxes, and performance techniques like prefill/decode separation. At scale, the team encounters real-world systems failures and immature LLM runtime primitives that require ongoing engineering.

  14. 33:48 – 38:19

    Hiring, leadership, and pager culture in an always-on cloud

    Tuhin explains the shift from a very flat org to bringing in leaders who can own “whole problems.” He stresses clear hiring rubrics (first-principles thinking, low ego, collaboration) and describes the intense operational culture required to run inference reliably, including pervasive on-call expectations.

  15. 38:19 – 42:34

    Efficiency drives more demand (Jevons paradox) and the ‘concierge everything’ future

    As inference gets cheaper, developers embed more intelligence—especially via longer-running agents—rather than stopping at “good enough.” Tuhin predicts a future of ubiquitous personalized assistants (“concierge everything”) and argues companies that don’t adapt face existential risk.

  16. Who’s adopting AI first: AI-native apps vs enterprise in-house builds

    Elad contrasts AI-native application companies with enterprises building internally. Tuhin estimates AI-native companies still dominate inference volume today, but enterprise adoption is now visibly progressing from tools → APIs → custom models.
