No Priors: Baseten CEO Tuhin Srivastava on Custom Models and Building the Inference Cloud
CHAPTERS
- 0:31 – 1:55
Baseten’s 30x growth and why inference demand is exploding
Tuhin explains Baseten’s rapid growth as a reflection of AI getting embedded “everywhere,” with open-source model quality crossing a key threshold. He frames Baseten as an index on the expanding application layer as more teams bring intelligence in-house and serve a growing long tail of specialized models.
- 1:55 – 5:57
Why the application layer still wins against frontier labs
The discussion tackles whether labs will capture the whole stack. Tuhin argues durable application businesses come from proprietary user signal and workflow integration, not just model weights—making it hard for frontier model companies to displace deeply embedded apps.
- 5:57 – 7:55
Serving frontier customers to learn enterprise requirements indirectly
Baseten prioritizes its highest-scale, most demanding customers and uses them as a proxy for enterprise needs. Many of these customers sell into regulated enterprises and “translate” those requirements back into Baseten’s platform roadmap.
- 7:55 – 9:21
The open-source model mix: best-model-first, then optimize cost
Tuhin describes a capability-first mindset: customers start with frontier performance and later optimize latency and cost. Baseten sees broad experimentation across model families, including Chinese-origin models and specialized modalities like text-to-speech (TTS).
- 9:21 – 13:07
Chinese models, security concerns, and geopolitics
Elad raises security and “Trojan horse” concerns about Chinese-origin models. Tuhin argues that network isolation and the absence of evidence of tampering keep the practical risk low, while emphasizing the strategic importance of the US maintaining strong open-source alternatives.
- 13:07 – 14:22
Custom inference dominates: almost nobody runs vanilla weights
Baseten’s workload is overwhelmingly custom: customers modify models for quality and performance rather than serving unmodified open-source weights. Tuhin outlines Baseten’s product lines and emphasizes that compilation/optimization is as central as fine-tuning.
- 14:22 – 17:10
Post-training acquisition: why Baseten bought a research team
Tuhin explains acquiring Parsed to add post-training expertise and move closer to customers earlier in their lifecycle. He highlights how post-training and inference are deeply linked (e.g., quantization choices depend on training), enabling an iterative improvement loop.
- 17:10 – 18:35
When to invest in custom models: avoid post-training pre-PMF
Customers ask when to move from frontier APIs to custom models. Tuhin advises proving value with best-in-class models first, then optimizing once there’s real product-market fit and a clear user signal to train against.
- 18:35 – 22:25
Supply crunch reality: running at 90%+ utilization across 18 clouds
Tuhin describes how severe GPU scarcity is in practice, with very little slack compute anywhere. Baseten’s “runtime fabric” across many providers improves reliability and failover, and doubles as a competitive advantage in sourcing capacity quickly.
- 22:25 – 24:09
Longer GPU contracts, prepay, and why cost of capital matters
The market is shifting toward multi-year commitments with meaningful prepayment, especially for frontier GPUs like B200s. This turns inference into a capital-and-financing game, affecting strategy from procurement to potential IPO timing.
- 24:09 – 26:07
What makes an inference winner: software stickiness + compute access + ops
Tuhin argues “GPUs as a service” is commoditized, but inference platforms with a strong software layer are sticky. Winning requires both software excellence and secured compute supply, plus the operational maturity to deliver mission-critical SLAs.
- 26:07 – 31:08
Multi-chip future: diversification is desired, but Nvidia speed and ecosystem dominate
The conversation turns to whether inference will diversify beyond Nvidia. Tuhin expects inference-specific chips over time, but emphasizes Nvidia’s supply chain, CUDA ecosystem, and time-to-market advantages—plus how exclusive supply deals can stifle broader ecosystems.
- 31:08 – 33:48
Runtime roadmap and scaling edge cases: sandboxes, KV-cache routing, and weird failures
Tuhin outlines Baseten’s runtime priorities: support for new workload types (agents, diffusion, video), sandboxes, and performance techniques like prefill/decode separation. At scale, the team encounters real-world systems failures and immature LLM runtime primitives that require ongoing engineering.
- 33:48 – 38:19
Hiring, leadership, and pager culture in an always-on cloud
Tuhin explains the shift from a very flat org to bringing in leaders who can own “whole problems.” He stresses clear hiring rubrics (first-principles thinking, low ego, collaboration) and describes the intense operational culture required to run inference reliably, including pervasive on-call expectations.
- 38:19
Efficiency drives more demand (Jevons paradox) and the ‘concierge everything’ future
As inference gets cheaper, developers embed more intelligence—especially via longer-running agents—rather than stopping at “good enough.” Tuhin predicts a future of ubiquitous personalized assistants (“concierge everything”) and argues companies that don’t adapt face existential risk.
Who’s adopting AI first: AI-native apps vs enterprise in-house builds
Elad contrasts AI-native application companies with enterprises building internally. Tuhin estimates AI-native companies still dominate inference volume today, but enterprise adoption is now visibly progressing from tools → APIs → custom models.