No Priors: Baseten CEO Tuhin Srivastava on Custom Models and Building the Inference Cloud
At a glance
WHAT IT’S REALLY ABOUT
Baseten CEO on custom models, scaling inference, and compute constraints
- Baseten’s growth is driven by the rapid expansion of the application layer and the mainstreaming of post-training/RL techniques that let companies “own” and specialize inference.
- Most production inference on Baseten is custom (about 90–95% of tokens), with customers modifying, compiling, and optimizing open-source weights for both quality and performance rather than running vanilla models.
- GPU supply is extremely tight with minimal slack, pushing providers toward multi-year contracts with significant prepay and making operational reliability and access to quality suppliers as critical as software.
- Customers prioritize model capability first and cost second, using a mix of frontier closed models and increasingly strong open-source models (including Chinese-origin models), while navigating security and geopolitical considerations.
- Baseten’s product direction emphasizes an end-to-end loop (inference produces data and eval signals, which feed post-training, which in turn drives more inference) plus runtime innovations like cache-aware routing (sketched below), prefill/decode separation, and agent sandboxes.
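For readers unfamiliar with cache-aware routing, here is a minimal, hypothetical Python sketch of the idea (the names and data structures are illustrative assumptions, not Baseten's actual API): route each request to the replica whose KV cache already holds the longest matching prefix of the prompt, so the least prefill work is recomputed.

# Hypothetical sketch of cache-aware routing; not Baseten's implementation.
def shared_prefix_len(a: str, b: str) -> int:
    """Length of the common prefix of two prompts."""
    n = min(len(a), len(b))
    for i in range(n):
        if a[i] != b[i]:
            return i
    return n

def route(prompt: str, replica_caches: dict[str, list[str]]) -> str | None:
    """Pick the replica whose cache shares the longest prefix with the prompt.

    A production router would also weigh load, queue depth, and cache TTLs;
    this captures only the prefix-matching core of the idea.
    """
    best_replica, best_len = None, -1
    for replica, cached_prompts in replica_caches.items():
        longest = max((shared_prefix_len(prompt, c) for c in cached_prompts), default=0)
        if longest > best_len:
            best_replica, best_len = replica, longest
    return best_replica

# Example: replica-a has already served a prompt with the same system prefix,
# so its KV cache can be reused and prefill cost drops.
caches = {
    "replica-a": ["You are a helpful assistant. Summarize:"],
    "replica-b": ["Translate the following to French:"],
}
print(route("You are a helpful assistant. Summarize: Q3 earnings call", caches))  # replica-a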
IDEAS WORTH REMEMBERING
5 ideas
Workflow moats beat model moats for most application companies.
Srivastava argues the durable advantage is proprietary user signal embedded in end-to-end workflows (e.g., clinician edits and EMR integration), which labs can’t easily access to post-train long-horizon systems.
The market is still early: enterprise inference adoption is mostly ahead of us.
By inference volume, he estimates ~99% is still from AI-native app companies today, implying a large upcoming wave as enterprises move from API trials to custom-model deployment.
Custom models are already the production default for serious users.
Baseten sees ~90–95% of tokens as “custom” inference—customers fine-tune/post-train and also compile/quantize/optimize for latency and cost, not just accuracy.
Capability-first buying drives model choice; cost optimization comes later.
Customers start with the best-performing model for value creation, then optimize cost/latency—meaning infrastructure must support rapid switching and deep optimization across many models.
GPU scarcity is deeper than most narratives suggest, and “good supply” is rarer than raw supply.
He describes mid-90s utilization as normal and notes many new suppliers lack data-center and inference-SLA maturity, shrinking the set of truly reliable providers to a small top tier.
WORDS WORTH SAVING
5 quotes
I think everyone is really realizing that you can put AI everywhere.
— Tuhin Srivastava
To the extent that it is encoded in workflows, that is where they will be able to develop moat.
— Tuhin Srivastava
It is all custom.
— Tuhin Srivastava
No post-training pre-product-market fit is what I'd say.
— Tuhin Srivastava
As much as we hear about it, I don't think people realize how bad it really is.
— Tuhin Srivastava