No Priors Ep 56 | With Baseten CEO and Co-Founder Tuhin Srivastava

At a time when users are being asked to wait unthinkable seconds for AI products to generate art and answers, speed is what will win the battle heating up in AI computing. At least according to today’s guest, Tuhin Srivastava, the CEO and co-founder of Baseten, which gives customers scalable AI infrastructure, starting with inference. In this episode of No Priors, Sarah, Elad, and Tuhin discuss why efficient code solutions are more desirable than no code, the most surprising use cases for Baseten, and why all of their jobs are very defensible from AI.

Show Notes: (0:00) Introduction (1:19) Capabilities of efficient code enabled development (4:11) Difference in training inference workloads (6:12) AI product acceleration (8:48) Leading on inference benchmarks at Baseten (12:08) Optimizations for different types of models (16:11) Internal vs open source models (19:01) Timeline for enterprise scale (21:53) Rethinking investment in compute spend (27:50) Defensibility in AI industries (31:30) Hardware and the chip shortage (35:47) Speed is the way to win in this industry (38:26) Wrap

Sarah Guo (host), Tuhin Srivastava (guest), Elad Gil (host)
Mar 21, 2024 | 38m

CHAPTERS

  1. 0:05 – 1:15

    Baseten’s mission: fast, scalable AI inference infrastructure

    Sarah introduces Tuhin Srivastava and frames Baseten as an AI infrastructure company focused first on inference. Tuhin explains what Baseten provides and why they started it in 2019 as a “picks-and-shovels” business for ML builders.

    • Baseten targets infrastructure for teams deploying large models, starting with inference
    • Company motivation: solve recurring ML deployment pains and build enabling infrastructure
    • Perspective shift: last 12 months brought a major market pull for AI infra
  2. 1:15 – 2:15

    “Efficient code,” not no-code: abstractions that still let engineers turn knobs

    Tuhin explains Baseten’s evolution away from no-code tendencies toward “efficient code.” The goal is tight abstractions that accelerate common tasks while preserving flexibility for advanced tuning and scaling.

    • Engineers prefer code because it’s powerful and controllable
    • No-code can limit under-the-hood tuning and make hard problems harder
    • Baseten focuses on intuitive abstractions: easy things easy, hard things possible
    • Designing to avoid the “graduation problem” as teams scale
  3. 2:15 – 4:10

    What runs on Baseten: from weekend projects to AI-native products

    Tuhin gives concrete examples of applications and customer profiles using Baseten, from small teams seeking leverage to established companies adding AI features. He highlights use cases where fast latency and minimal infra effort are business-critical.

    • Customers span tiny side projects to foundation model companies
    • Examples: Descript, Patreon, and smaller teams shipping quickly
    • Low-latency workloads (sub-200–300ms) enabled without months of infra work
    • Domain-expert companies benefit when infra isn’t proprietary
  4. 4:10 – 6:12

    Inference vs training: different SLAs, workflows, and hardware needs

    Responding to GPU-cluster hype, Tuhin contrasts training and inference requirements. Inference emphasizes reliability, repeatable deployment workflows, and latency; training emphasizes large clustered compute and high-performance networking.

    • Inference cares about co-location, latency, and reliability; downtime is unacceptable
    • Training is less location-sensitive but more sensitive to multi-GPU networking
    • Inference workflows are more repeatable across customers (CI/CD, versioning, cold starts)
    • Training still often looks like SSH-based bespoke workflows
  5. 6:12 – 8:44

    Market acceleration and the “speed advantage” driving buy-vs-build

    Elad asks what surprised Tuhin most; Tuhin points to rapid market change post-2022 and how fast teams must execute to stay relevant. This increases willingness to buy infrastructure rather than build it internally.

    • 2019–2022 was comparatively quiet; 2022–2023 demand surged
    • Speed is increasingly the #1 competitive advantage in AI products
    • Organizations show higher propensity to buy infrastructure to move faster
    • Compute demand keeps rising as models and services scale
  6. 8:44 – 12:00

    Why Baseten leads on inference benchmarks: staying current from kernels to TRT-LLM

    Sarah asks about Baseten’s benchmark performance; Tuhin breaks down why serving fast is hard and what drives throughput/latency improvements. He emphasizes GPU utilization, scaling strategies, and tight integration with rapidly evolving research and open source.

    • Inference difficulty spans workflow, scalability/reliability, and performance optimization
    • Key levers: GPU utilization, multi-GPU scaling, and state-of-the-art decoding techniques
    • Speculative decoding and other recent research materially improve speed
    • Close work with NVIDIA’s TRT-LLM and low-level optimizations contribute to gains
  7. 12:00 – 13:16

    Optimizing beyond LLMs: diffusion and speech models moving toward real time

    Elad asks about other model types; Tuhin discusses serving optimizations for Whisper (including TensorRT variants) and diffusion image generation. Customer demand is increasingly for near-real-time experiences, even when quality is high.

    • Variants like FasterWhisper and Whisper-on-TRT enable faster speech inference
    • Diffusion customers (e.g., image generation) care about not waiting multiple seconds
    • Optimization progress exists, but is less “juiced” than LLMs so far
    • Customer requirements are pushing all modalities toward real-time UX
  8. 13:16 – 15:41

    Path to faster UX: better hardware, batching/decoding tricks, and smaller distilled models

    Sarah asks what will reduce user waiting time; Tuhin outlines the improvement roadmap. Gains come from hardware jumps (A100→H100→H200), software serving techniques, and model size/architecture evolution toward specialized smaller models and local execution.

    • Hardware step-functions deliver immediate speedups (H100 vs A100)
    • Software optimizations: continuous/dynamic batching, speculative decoding
    • Models likely get smaller and more task-specialized via distillation
    • Local inference is increasingly plausible (e.g., Cody running locally)
  9. 15:41 – 17:58

    When to use public endpoints vs dedicated vs self-hosted open source models

    Sarah asks for guidance on deployment choices; Tuhin describes a common migration path as teams hit cost, latency, and control limits. He explains why dedicated compute, and eventually running in a customer’s own cloud, become compelling, especially at enterprise scale.

    • Typical progression: OpenAI/Anthropic APIs → private deployments → open source → dedicated/self-hosted
    • Drivers: speed, cost, and not needing maximum model capability
    • Dedicated compute avoids noisy-neighbor issues and offers better SLAs
    • Privacy/compliance and scale economics push larger orgs toward self-hosting in AWS/GCP
  10. 17:58 – 21:25

    Enterprise adoption timelines: early copilots now, big spend later—and risks of top-down hype

    Elad probes enterprise scaling; Tuhin notes copilots are already widespread, with broader experimentation underway. He warns near-term forecasts may be overstated due to top-down budget pressure, but expects multi-year adoption to be dramatically larger than most expect.

    • Copilots/coding assistants are already the first enterprise beachhead
    • Next wave: experimentation with foundation model APIs in production workflows
    • Risk: repeating the 2018–2020 “ML trap” of spend not tied to user value
    • Likely underestimation over 3–5 years even if 12–18 month projections are high
  11. 21:25 – 24:27

    Rethinking compute spend and margins: inference as a dominant P&L line item

    Sarah and Tuhin discuss how AI changes SaaS economics: fewer people, more inference spend, and different margin structures. Tuhin shares an anecdote where compute was a company’s second-largest expense after payroll, highlighting how central consumption becomes.

    • Traditional SaaS gross margin expectations don’t map cleanly to AI products
    • Inference can become one of the largest operating expenses
    • Upfront payments/commitments can be hard when compute is already a major cost center
    • Strong optimization and markups can still yield healthy margins in consumption models
  12. 24:27 – 31:10

    Defensibility in fast-ramping AI markets: workflows, contracts, and market structure

    Elad reflects on the unusual speed of revenue ramps and the resulting defensibility questions. The discussion touches on how differentiation may come from workflow integration, oligopoly structures, and contract dynamics—especially in deal-driven sectors like healthcare.

    • Fast ramps can indicate huge demand but also raise commoditization risk
    • Workflow defensibility can matter as much as model capability
    • Many markets become oligopolies rather than winner-take-all monopolies
    • Contract structure (multi-year deals, lock-in) influences sustainable advantage
  13. 31:10 – 35:14

    Hardware reality: GPU shortages, heterogeneity skepticism, and why NVIDIA still dominates

    Sarah asks about chip supply and heterogeneous hardware; Tuhin says availability has improved for some, but premium chips still involve negotiation and delay for many customers. He’s cautious about AMD/non-NVIDIA adoption due to ecosystem friction and added complexity for an abstracted inference platform.

    • Shortage shifted from “everything” to mainly premium chips (H100/A100) for customers
    • Baseten can source faster partly due to scale and long-term commitments
    • Cloud-provider procurement can still take weeks and escalations
    • Hardware heterogeneity is attractive in theory, but CUDA-ecosystem portability and debugging risk are real
  14. 35:14 – 38:32

    Build vs buy conclusion: speed wins, and infrastructure isn’t the differentiated asset

    Elad asks how build-vs-buy will evolve; Tuhin argues speed and reliability dominate, making internal infra builds costly distractions. He claims the proprietary moat is models/data/workflow, while infrastructure is repeatable—and shares examples of teams abandoning DIY stacks quickly.

    • Speed and uptime directly affect end-user experience and competitiveness
    • Differentiation: models, data, and workflow—not serving infrastructure
    • Common pattern: teams attempt DIY, then return after hitting complexity (“Docker dumpster fire”)
    • Examples: migrating a 4-person infra team’s stack in 36 hours; billion-tokens/day scale is unrealistic to DIY for tiny teams
