No Priors

Baseten CEO Tuhin Srivastava on Custom Models, and Building the Inference Cloud

Sarah Guo and Elad Gil talk with Baseten CEO Tuhin Srivastava about custom models, scaling inference, and compute constraints.

Sarah Guo (host) · Tuhin Srivastava (guest) · Elad Gil (host)
May 1, 2026 · 42m · Watch on YouTube ↗
- 30x growth and inference as a massive market
- Why the application layer persists vs. frontier labs
- Serving AI-native companies that sell into enterprises
- Open-source model mix and frontier capability race
- Chinese models: security concerns and geopolitics
- Custom inference, compilation, and performance tuning
- GPU supply crunch, contracts, and cost of capital
- Runtime roadmap: KV-cache routing, prefill/decode split, speculation
- Scale edge cases and hyperscaler limitations
- Hiring leadership and operations/pager culture
- Jevons paradox and rising demand from efficiency
- “Concierge everything” agentic future
AI-generated summary based on the episode transcript.

In this episode of No Priors, Sarah Guo and Elad Gil sit down with Baseten CEO Tuhin Srivastava to discuss custom models, scaling inference, and building an inference cloud under severe compute constraints. A central thread: Baseten’s growth is driven by the rapid expansion of the application layer and the mainstreaming of post-training/RL techniques that let companies “own” and specialize inference.

At a glance

WHAT IT’S REALLY ABOUT

Baseten CEO on custom models, scaling inference, and compute constraints

  1. Baseten’s growth is driven by the rapid expansion of the application layer and the mainstreaming of post-training/RL techniques that let companies “own” and specialize inference.
  2. Most production inference on Baseten is custom (about 90–95% of tokens), with customers modifying, compiling, and optimizing open-source weights for both quality and performance rather than running vanilla models.
  3. GPU supply is extremely tight with minimal slack, pushing providers toward multi-year contracts with significant prepay and making operational reliability and access to quality suppliers as critical as software.
  4. Customers prioritize model capability first and cost second, using a mix of frontier closed models and increasingly strong open-source models (including Chinese-origin models), while navigating security and geopolitical considerations.
  5. Baseten’s product direction emphasizes an end-to-end loop: inference produces data and eval signals, which feeds post-training, which in turn drives more inference—plus runtime innovations like cache-aware routing, prefill/decode separation, and agent sandboxes.

IDEAS WORTH REMEMBERING

5 ideas

Workflow moats beat model moats for most application companies.

Srivastava argues the durable advantage is proprietary user signal embedded in end-to-end workflows (e.g., clinician edits and EMR integration), which labs can’t easily access to post-train long-horizon systems.

The market is still early: enterprise inference adoption is mostly ahead of us.

By inference volume, he estimates ~99% is still from AI-native app companies today, implying a large upcoming wave as enterprises move from API trials to custom-model deployment.

Custom models are already the production default for serious users.

Baseten sees ~90–95% of tokens as “custom” inference—customers fine-tune/post-train and also compile/quantize/optimize for latency and cost, not just accuracy.

Capability-first buying drives model choice; cost optimization comes later.

Customers start with the best-performing model for value creation, then optimize cost/latency—meaning infrastructure must support rapid switching and deep optimization across many models.

GPU scarcity is deeper than most narratives suggest, and “good supply” is rarer than raw supply.

He describes mid-90s utilization as normal and notes many new suppliers lack data-center and inference-SLA maturity, shrinking the set of truly reliable providers to a small top tier.

WORDS WORTH SAVING

5 quotes

I think everyone is really realizing that you can put AI everywhere.

Tuhin Srivastava

To the extent that it is encoded in workflows, that is where they will be able to develop moat.

Tuhin Srivastava

It is all custom.

Tuhin Srivastava

No post-training pre-product-market fit is what I’d say.

Tuhin Srivastava

As much as we hear about it, I don't think people realize how bad it really is.

Tuhin Srivastava

QUESTIONS ANSWERED IN THIS EPISODE

5 questions

You mentioned workflow moats like Abridge’s EMR integration—what specific “user signals” most reliably translate into post-training gains, and how do you measure that value?

If 90–95% of your tokens are custom, what are the most common modifications customers make (fine-tuning, RL/post-training, quantization, compilation), and which deliver the biggest ROI?

You say customers choose capability first, cost second—at what usage level or margin pressure does “cost-first” decision-making start to dominate?

On Chinese-origin open-source models: what concrete security controls do you recommend (network isolation, provenance checks, red-teaming, eval suites) before production deployment?

You described only a few “gold tier” clouds—what operational signals (SLA history, hardware telemetry, networking, incident response) determine whether a supplier is trustworthy for inference?

Chapter Breakdown

Baseten’s 30x growth and why inference demand is exploding

Tuhin explains Baseten’s rapid growth as a reflection of AI getting embedded “everywhere,” with open-source model quality crossing a key threshold. He frames Baseten as an index on the expanding application layer as more teams bring intelligence in-house and serve a growing long tail of specialized models.

Why the application layer still wins against frontier labs

The discussion tackles whether labs will capture the whole stack. Tuhin argues durable application businesses come from proprietary user signal and workflow integration, not just model weights—making it hard for frontier model companies to displace deeply embedded apps.

Who’s adopting AI first: AI-native apps vs enterprise in-house builds

Elad contrasts AI-native application companies with enterprises building internally. Tuhin estimates AI-native companies still dominate inference volume today, but enterprise adoption is now visibly progressing from tools → APIs → custom models.

Serving frontier customers to learn enterprise requirements indirectly

Baseten prioritizes the highest-scale, most demanding customers and uses them as a proxy for enterprise needs. Many Baseten customers sell into regulated, demanding enterprises and “translate” requirements back into Baseten’s platform roadmap.

The open-source model mix: best-model-first, then optimize cost

Tuhin describes a capability-first mindset: customers start with frontier performance and later optimize latency and cost. Baseten sees broad experimentation across model families, including Chinese-origin models and specialized modalities like TTS.

Chinese models, security concerns, and geopolitics

Elad raises security and “Trojan horse” concerns about Chinese-origin models. Tuhin argues network boundaries and lack of evidence reduce practical risk, while emphasizing the strategic importance of the US maintaining strong open-source alternatives.

Custom inference dominates: almost nobody runs vanilla weights

Baseten’s workload is overwhelmingly custom: customers modify models for quality and performance rather than serving unmodified open-source weights. Tuhin outlines Baseten’s product lines and emphasizes that compilation/optimization is as central as fine-tuning.
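
The episode doesn’t walk through the mechanics, but as one illustration of the “quantize” step described here, a minimal post-training int8 weight-quantization sketch (plain NumPy, illustrative only, not Baseten’s stack) looks like:

```python
import numpy as np

def quantize_int8(weights: np.ndarray):
    """Symmetric per-tensor int8 quantization: store 8-bit weights plus one
    float scale, cutting weight memory ~4x vs. float32."""
    scale = float(np.abs(weights).max()) / 127.0  # largest weight maps to 127
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

# Sanity check: reconstruction error should be small relative to weight scale.
w = np.random.randn(4096, 4096).astype(np.float32)
q, s = quantize_int8(w)
print("mean abs error:", float(np.abs(dequantize(q, s) - w).mean()))
```

Production systems typically use finer-grained (per-channel or per-block) scales to preserve accuracy, which is one reason quantization choices interact with how the model was trained.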

Post-training acquisition: why Baseten bought a research team

Tuhin explains acquiring Parsed to add post-training expertise and move closer to customers earlier in their lifecycle. He highlights how post-training and inference are deeply linked (e.g., quantization choices depend on training), enabling an iterative improvement loop.

When to invest in custom models: avoid post-training pre-PMF

Customers ask when to move from frontier APIs to custom models. Tuhin advises proving value with best-in-class models first, then optimizing once there’s real product-market fit and a clear user signal to train against.

Supply crunch reality: running at 90%+ utilization across 18 clouds

Tuhin describes how severe GPU scarcity is in practice, with very little slack compute anywhere. Baseten’s “runtime fabric” across many providers helps reliability and failover, but also becomes a competitive advantage in sourcing capacity quickly.
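
As a toy illustration of the failover idea (provider names and the health table below are hypothetical; Baseten’s actual “runtime fabric” is not public):

```python
# Priority-ordered failover across GPU providers: walk a ranked list and
# route to the first provider that passes a health/capacity probe.
PROVIDERS = ["cloud-a", "cloud-b", "cloud-c"]                  # ranked by preference
HEALTH = {"cloud-a": False, "cloud-b": True, "cloud-c": True}  # stand-in probe

def route(request_id: str) -> str:
    for provider in PROVIDERS:           # try providers in priority order
        if HEALTH.get(provider, False):  # real systems probe capacity and SLAs
            return f"{request_id} -> {provider}"
    raise RuntimeError("no healthy capacity on any provider")

print(route("req-123"))  # cloud-a is down, so this lands on cloud-b
```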

Longer GPU contracts, prepay, and why cost of capital matters

The market is shifting toward multi-year commitments with meaningful prepayment, especially for frontier GPUs like B200s. This turns inference into a capital-and-financing game, affecting strategy from procurement to potential IPO timing.
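
To see why cost of capital matters, here is a back-of-envelope sketch; the rate, prepay share, and cost of capital below are made-up assumptions, not figures from the episode:

```python
# Back-of-envelope: how prepay plus cost of capital inflates effective GPU cost.
hourly_rate = 5.00        # assumed contract price per GPU-hour
term_years = 3            # multi-year commitment
prepay_fraction = 0.30    # assumed share of the contract paid up front
cost_of_capital = 0.12    # assumed annual cost of capital

hours = term_years * 365 * 24
total = hourly_rate * hours
prepay = prepay_fraction * total

# Crude approximation: prepaid cash is tied up for ~half the term on average.
financing = prepay * cost_of_capital * (term_years / 2)
effective = (total + financing) / hours

print(f"nominal ${hourly_rate:.2f}/hr -> effective ${effective:.2f}/hr "
      f"(+{100 * (effective / hourly_rate - 1):.1f}% from financing)")
```

Even at these modest assumptions the financing drag is a few percent of total cost, which compounds quickly at inference-provider scale.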

What makes an inference winner: software stickiness + compute access + ops

Tuhin argues “GPUs as a service” is commoditized, but inference platforms with a strong software layer are sticky. Winning requires both software excellence and secured compute supply, plus the operational maturity to deliver mission-critical SLAs.

Multi-chip future: diversification is desired, but Nvidia speed and ecosystem dominate

The conversation turns to whether inference will diversify beyond Nvidia. Tuhin expects inference-specific chips over time, but emphasizes Nvidia’s supply chain, CUDA ecosystem, and time-to-market advantages—plus how exclusive supply deals can stifle broader ecosystems.

Runtime roadmap and scaling edge cases: sandboxes, KV-cache routing, and weird failures

Tuhin outlines Baseten’s runtime priorities: support for new workload types (agents, diffusion, video), sandboxes, and performance techniques like prefill/decode separation. At scale, the team encounters real-world systems failures and immature LLM runtime primitives that require ongoing engineering.
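
As a minimal sketch of the cache-aware routing idea, one common approach (assumed here, not necessarily Baseten’s router) is prefix-hash routing: requests that share a prompt prefix land on the replica whose KV cache is already warm, so the prefix is prefilled once instead of per request.

```python
import hashlib

REPLICAS = ["replica-0", "replica-1", "replica-2"]  # hypothetical pool

def cache_aware_route(prompt: str, prefix_chars: int = 1024) -> str:
    """Hash the prompt prefix so requests sharing a prefix (e.g., the same
    system prompt) land on the replica whose KV cache is already warm."""
    digest = hashlib.sha256(prompt[:prefix_chars].encode()).digest()
    return REPLICAS[int.from_bytes(digest[:4], "big") % len(REPLICAS)]

# Two requests with the same long system prompt route to the same replica.
system = "You are a careful medical scribe. " * 40
print(cache_aware_route(system + "Summarize visit for patient A."))
print(cache_aware_route(system + "Summarize visit for patient B."))
```

Prefill/decode separation is complementary: the compute-bound prefill phase and the memory-bandwidth-bound decode phase can run on separately sized worker pools.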

Hiring, leadership, and pager culture in an always-on cloud

Tuhin explains the shift from a very flat org to bringing in leaders who can own “whole problems.” He stresses clear hiring rubrics (first-principles thinking, low ego, collaboration) and describes the intense operational culture required to run inference reliably, including pervasive on-call expectations.

Efficiency drives more demand (Jevons paradox) and the ‘concierge everything’ future

As inference gets cheaper, developers embed more intelligence—especially via longer-running agents—rather than stopping at “good enough.” Tuhin predicts a future of ubiquitous personalized assistants (“concierge everything”) and argues companies that don’t adapt face existential risk.
