The Twenty Minute VC

Steeve Morin: Why Google Will Win the AI Arms Race & OpenAI Will Not | E1262

Steeve Morin is the Founder & CEO @ ZML, a next-generation inference engine enabling peak performance on a wide range of chips. Prior to founding ZML, Steeve was the VP Engineering at Zenly for 7 years leading eng to millions of users and an acquisition by Snap.

In Today’s Episode We Discuss:

(00:00) Intro
(00:59) How Will Inference Change and Evolve Over the Next 5 Years
(06:24) Challenges and Innovations in AI Hardware
(14:07) The Economics of AI Compute
(16:57) Training vs. Inference: Infrastructure Needs
(24:56) The Future of AI Chips and Market Dynamics
(36:25) Nvidia's Market Position and Competitors
(40:47) Challenges of Incremental Gains in the Market
(41:39) The Zero Buy-In Strategy
(42:18) Switching Between Compute Providers
(43:23) The Importance of a Top-Down Strategy for Microsoft and Google
(44:49) Microsoft's Strategy with AMD
(49:35) Data Center Investments and Training
(52:20) How to Succeed in AI: The Triangle of Products, Data, and Compute
(52:48) Scaling Laws and Model Efficiency
(54:34) Future of AI Models and Architectures
(01:03:38) Retrieval Augmented Generation (RAG)
(01:07:51) Why OpenAI’s Position is Not as Strong as People Think
(01:15:20) Challenges in AI Hardware Supply

Subscribe on Spotify: https://open.spotify.com/show/3j2KMcZTtgTNBKwtZBMHvl?si=85bc9196860e4466
Subscribe on Apple Podcasts: https://podcasts.apple.com/us/podcast/the-twenty-minute-vc-20vc-venture-capital-startup/id958230465
Follow Harry Stebbings on X: https://twitter.com/HarryStebbings
Follow Steeve Morin on X: https://twitter.com/steeve
Follow 20VC on Instagram: https://www.instagram.com/20vchq
Follow 20VC on TikTok: https://www.tiktok.com/@20vc_tok
Visit our Website: https://www.20vc.com
Subscribe to our Newsletter: https://www.thetwentyminutevc.com/contact

#20vc #harrystebbings #steevemorin #zml #openai #nvidia #amd #ai #inference #startups #founder #ceo

Steeve Morin (guest), Harry Stebbings (host)
Feb 24, 2025 · 1h 18m

CHAPTERS

  1. 0:00 – 0:58

    Cold open: NVIDIA’s “fake moats,” owning compute, and why inference will dominate

    Steeve lays out his contrarian view: NVIDIA’s software moat is overhyped, and the real long-term advantage belongs to companies that own products, data, and compute. He forecasts a future where inference dwarfs training and argues Google is the “sleeping giant.”

    • NVIDIA shapes what developers think matters (e.g., CUDA as a perceived moat)
    • If you don’t own compute, you’re structurally margin-constrained
    • Forecast: in ~5 years, compute demand becomes ~95% inference, ~5% training
    • Competitive advantage comes from combining products + data + compute
    • Google is positioned to win because it has all three and broad distribution
  2. 0:58 – 3:20

    What ZML does: “any model on any hardware” without compromise

    Harry asks Steeve to explain ZML and where it sits in the stack. Steeve frames ZML as an infrastructure-layer ML framework that enables hardware-agnostic performance across vendors.

    • ZML’s goal: run any model on any hardware (NVIDIA/AMD/TPU/etc.)
    • Agnosticism only works if there’s no performance or reliability compromise
    • Positioning: infrastructure layer between models and compute providers
    • Focus on better/faster/more reliable execution rather than lock-in
    • Abstraction is the key to reducing ecosystem friction
  3. 3:20 – 6:28

    Multi-model backends and why teams will run heterogeneous hardware

    The conversation moves from “one model” to a reality of many specialized backends behind an API. Steeve argues hardware choice can drive order-of-magnitude efficiency differences, so teams will increasingly mix providers.

    • Closed models behave like “constellations” of backends, not single weight blobs
    • Systems already route tasks (e.g., images to diffusion, text to LLMs)
    • Future deployments feel like one API but switch models internally
    • Hardware choice can dramatically change cost/performance (e.g., large-model economics)
    • The practical barrier: access/supply of non-NVIDIA alternatives
  4. 6:28 – 10:23

    Why NVIDIA rebounded: availability, H100 inference economics, and the agents/latency pivot

    Harry probes why NVIDIA recovered from market shocks faster than AMD. Steeve attributes it to near-term availability, but warns the H100 inference value proposition can’t justify pricing forever—especially as agents and reasoning shift workloads from throughput-bound to latency-bound.

    • NVIDIA rebounds because its chips are broadly available right now
    • Concern: H100 pricing set for training workloads is misaligned with inference economics
    • Potential “H100 bubble”: price/perf gap vs prior generation (A100)
    • Agents + reasoning change the goal from throughput to end-to-end latency
    • Latency-bound workloads can expose GPU limitations per-stream
  5. 10:23 – 14:17

    Hardware reality check: GPUs as a ‘good trick,’ TPUs, and why interconnect wins training

    Steeve explains why GPUs succeeded despite not being purpose-built for AI, and why purpose-built accelerators matter. He highlights that in training, the interconnect (e.g., Mellanox/InfiniBand) can matter more than raw compute, while inference cares far less about cluster interconnect.

    • GPUs were repurposed for AI from graphics parallelism (GPGPU origin)
    • LLMs stress memory movement; on-chip data locality becomes critical
    • Single-stream performance improves when avoiding slow external memory transfers
    • Google TPUs represent more AI-native architectural choices
    • Training success often hinges on interconnect speed; inference usually doesn’t
  6. 14:17 – 16:57

    Who captures the margins: the chip market structure, TPUs/Trainium, and cloud markups

    The discussion turns to market structure and where profits accrue. Steeve breaks compute into rent vs buy options and argues cloud customers get squeezed by stacked margins—making dedicated cloud chips (TPUs/Trainium) strategically attractive if software friction falls.

    • Market buckets: buy/rent GPUs; rent TPUs; buy TPUs (and emerging dedicated chips)
    • In cloud, “dedicated” options: Google TPUs and Amazon Trainium
    • Stacked margins: foundry + NVIDIA + hyperscaler squeeze end users
    • Optionality matters—avoiding dependence on one provider improves economics
    • TPUs succeed internally at Google but face external adoption friction
  7. 16:57 – 23:28

    Training vs inference infrastructure: production reliability, autoscaling, and why teams overbuy

    Steeve contrasts research-like training with production-grade inference. He explains that inference deployments rely on duct-tape today, and that autoscaling and provisioning are major unlocks—but hard due to scarcity and risk of losing capacity.

    • Training: iteration speed; “more is better”; research dynamics
    • Inference: production; “less is better”; reliability and operational simplicity
    • Key infra difference: interconnect needs are high for training, avoidable for inference
    • Autoscaling is a major efficiency lever, yet underdeveloped in AI stacks
    • Scarcity + risk incentivize overbuying reserved capacity and wasting spend
  8. 23:28 – 30:05

    Compute oversupply risk and the next inference chips: Cerebras, SRAM economics, and new entrants

    Harry asks about a future compute overhang and whether NVIDIA really wins inference. Steeve predicts potential near-term oversupply and explains why specialized inference hardware (e.g., SRAM-heavy designs) can be fast but expensive—opening room for new chip startups if they hit GPU-like pricing.

    • Signals of oversupply: discounts and cold outreach from compute providers
    • Collateral risk: amortization plans backed by GPUs can unwind painfully
    • NVIDIA may still serve inference because the chips exist and are accessible
    • Cerebras-style SRAM locality boosts single-stream performance but drives cost/yield issues
    • Potential challengers mentioned: Etched and VSORA, if they reach GPU-comparable pricing
  9. 30:05 – 36:25

    3–5 year inference future: latent-space reasoning, HBM vs SRAM, and compute-in-memory

    Steeve outlines why reasoning and agents will reshape compute requirements. He introduces latent-space reasoning and argues memory access patterns—HBM, SRAM, and ultimately compute-in-memory—will determine who can deliver low-latency reasoning at scale.

    • Latent-space reasoning: “thinking” without emitting token-by-token chains
    • GPUs struggle as memory access to external HBM becomes a bottleneck
    • SRAM helps but doesn’t scale indefinitely; you need a balanced architecture
    • Compute-in-memory is a potential next frontier (Rain.ai, Fractile)
    • Goal: much higher single-stream speed to support fast reasoning and agent pipelines
  10. 36:25 – 38:18

    NVIDIA moves up the stack—and why software abstraction erodes chip lock-in

    Harry asks if NVIDIA will climb the stack into cloud/model layers; Steeve points to NIM. He argues long-term competition shifts to software that abstracts hardware quirks, turning “moats” into commodity competition on specs.

    • NVIDIA is moving up-stack (e.g., NIM) to defend value capture
    • CUDA as a moat is framed as marketing-driven rather than fundamental
    • If software abstracts idiosyncrasies, providers compete on real specs and price
    • Cloud economics make 90% vendor margins hard to sustain long-term
    • Supply power (and fear of losing allocation) is a temporary enforcement mechanism
  11. 38:18 – 52:34

    AMD’s GTM problem, switching costs, and the “zero buy-in” strategy

    Steeve explains why better price/performance alone doesn’t trigger switching: maintaining multiple stacks is painful and organizationally risky. His remedy is “zero buy-in”—making switching between compute targets effectively frictionless so incremental improvements can win.

    • Switching costs are both technical (stacks) and financial (amortization commitments)
    • To justify switching, buyers often must commit to enormous volumes
    • Incremental gains aren’t enough when adoption requires major migration effort
    • “Zero buy-in” means you can switch targets (AMD/NVIDIA/TPU/etc.) with minimal work
    • If buy-in is near zero, 30% better performance/cost is enough to win workloads
  12. 52:34 – 1:02:29

    Top-down adoption: Microsoft’s AMD push, data center capex, and the product–data–compute triangle

    Steeve argues infra-only strategies fail unless pulled by an application with distribution. He uses Microsoft’s AMD purchases and OpenAI inference economics as an example, then broadens to why Google’s ownership of product, data, and compute is strategically decisive.

    • Bottom-up infra pitches fail; top-down product pull drives adoption
    • Microsoft buying AMD supply can materially improve inference unit economics
    • Inference scaling is non-linear on multi-GPU setups; memory capacity can dominate throughput
    • Hyperscaler capex still targets training, benefiting NVIDIA near-term
    • Strategic framework: products + data + compute; Google uniquely has all three at global scale
  13. 1:02:29 – 1:03:38

    Scaling laws vs efficiency: DeepSeek’s wake-up call, new architectures, and distillation norms

    The discussion shifts to model progress: brute-force scaling meets physics and networking limits, encouraging efficiency-first approaches. Steeve highlights DeepSeek’s impact, raises the possibility of non-transformer models and world models (JEPA), and normalizes distillation as ‘fair game.’

    • Large clusters hit practical limits (networking/physics), pushing efficiency focus
    • DeepSeek demonstrates that engineering efficiency can “create” virtual compute
    • Possible discontinuities: non-transformer architectures and world/energy-based models
    • JEPA/world-model thinking aims to go beyond language, which offers only a narrow window on reality
    • Distillation is positioned as legitimate—and sometimes yields smaller models that excel on specific tasks
  14. 1:03:38 – 1:06:33

    RAG in practice: embeddings, vector search, context limits, and why efficiency drives model size

    Steeve explains Retrieval Augmented Generation (RAG) as a pragmatic way to inject relevant knowledge at runtime. He notes practical constraints (chunking, context windows) and argues the strongest pressure toward smaller models is still operational efficiency, not RAG alone.

    • RAG pipeline: embed query → vector search → retrieve text → prepend as context
    • It’s effective but bounded by context window size and chunking strategy
    • Many “summarize this link” workflows are effectively RAG-style augmentation
    • RAG doesn’t inherently force smaller models; cost/speed constraints do
    • Emerging frontier mentioned: “attention-level search” for better retrieval integration
  15. 1:06:33 – 1:18:09

    Geopolitics, regulation, and capex narratives: DeepSeek, export controls, Europe, Mistral, Stargate, and quickfire

    In the closing stretch, Steeve frames China’s constraints as an innovation driver and downplays simplistic narratives about regulation and competitive collapse. He critiques mega-capex announcements as ‘more of the same,’ then ends with quickfire views on the biggest infra shifts and NVIDIA’s operational risks.

    • Constraint-driven innovation: geopolitics can accelerate efficiency breakthroughs
    • Export controls may slow China short-term but can strengthen adaptation long-term
    • Europe regulation fears are overstated relative to the need to win first
    • Skepticism on giant capex pledges (e.g., “Stargate”): vertical scaling vs efficiency
    • Quickfire: latency-bound reasoning/agents as the near-term infra shift; Blackwell issues and order cancellations as a key NVIDIA risk
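
The RAG pipeline summarized in chapter 14 (embed query → vector search → retrieve text → prepend as context) can be illustrated with a minimal Python sketch. Everything here is an assumption for illustration, not something taken from the episode: the `embed` function is a toy bag-of-words stand-in for a real embedding model, the linear scan stands in for a vector database, and the names `retrieve`, `build_prompt`, and `CHUNKS` are hypothetical.

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    """Toy embedding (assumption): lowercase bag-of-words counts.
    A real system would call an embedding model here."""
    words = [w.strip(".,?!:;") for w in text.lower().split()]
    return Counter(w for w in words if w)

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def retrieve(query: str, chunks: list[str], k: int = 2) -> list[str]:
    """Vector search step: rank chunks by similarity to the query, keep top k.
    This is a linear scan; a vector index would replace it at scale."""
    q = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:k]

def build_prompt(query: str, chunks: list[str], k: int = 2) -> str:
    """Prepend the retrieved chunks as context ahead of the question."""
    context = "\n".join(f"- {c}" for c in retrieve(query, chunks, k))
    return f"Context:\n{context}\n\nQuestion: {query}"

# Hypothetical pre-chunked knowledge base.
CHUNKS = [
    "TPUs are accelerators designed by Google for machine learning.",
    "Inference serves a trained model to users in production.",
    "Training a model relies on fast interconnect between chips.",
]

if __name__ == "__main__":
    print(build_prompt("What are TPUs?", CHUNKS, k=1))
```

The parameter `k` and the chunk sizes are where the limits noted in chapter 14 show up: every retrieved chunk consumes context window, so chunking strategy and `k` bound how much knowledge can be injected per request.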
