Skip to content
Dwarkesh PodcastDwarkesh Podcast

Reiner Pope on Dwarkesh Patel: Why Token Cost Tracks Batch

Weight fetches dominate token cost until batch crosses 300 times MoE sparsity; past that crossover, compute binds and cost per token hits its lower bound.

Dwarkesh PatelhostReiner Popeguest
Apr 29, 20262h 13mWatch on YouTube ↗

Episode Details

EPISODE INFO

Released
April 29, 2026
Duration
2h 13m
Channel
Dwarkesh Podcast
Watch on YouTube
▶ Open ↗

EPISODE DESCRIPTION

Did a very different format with Reiner Pope – a blackboard lecture where he walks through how frontier LLMs are trained and served. It's shocking how much you can deduce about what the labs are doing from a handful of equations, public API prices, and some chalk. It’s a bit technical, but I encourage you to hang in there - it’s really worth it. There are less than a handful of people who understand the full stack of AI, from chip design to model architecture, as well as Reiner. It was a real delight to learn from him. Reiner is CEO of MatX, a new chip startup (full disclosure - I’m an angel investor). He was previously at Google, where he worked on software efficiency, compilers, and TPU architecture. 𝐄𝐏𝐈𝐒𝐎𝐃𝐄 𝐋𝐈𝐍𝐊𝐒

• Wrote up some flashcards and practice problems to help myself retain what Reiner taught. Hope it's helpful to you too! https://reiner-flashcards.vercel.app/

• Download markdown of transcript here to chat with an LLM: https://gist.github.com/dwarkeshsp/79100f0fdeed69d76241903bb0604dbe

• Transcript: https://www.dwarkesh.com/p/reiner-pope 𝐒𝐏𝐎𝐍𝐒𝐎𝐑𝐒

• Jane Street needs constant access to incredibly low-latency compute. I recently asked one of their engineers, Clark, to talk me through how they meet these demands. Our conversation—which touched on everything from FPGAs to liquid cooling—was extremely helpful as I prepped to interview Reiner. You can watch the full discussion and explore Jane Street’s open roles at https://janestreet.com/dwarkesh

• Google’s Gemma 4 is the first open model that’s let me shut off the internet and create a fully disconnected "focus machine". This is because Gemma is small enough to run on my laptop, but powerful enough to actually be useful. So, to prep for this interview, I downloaded Reiner’s scaling book, disconnected from wifi, and used Gemma to help me break down the material. Check it out at https://goo.gle/Gemma4

• Cursor helped me turn some notes I took on how gradients flow during large-scale pretraining into a great animation. At first, I wasn’t sure the best way to visualize the concept, but Cursor’s Composer 2 Fast model let me iterate on different ideas almost instantaneously. You can check out the animation in my recent blog post: https://www.dwarkesh.com/p/what-i-learned-april-15. And if you have something to visualize yourself, go to https://cursor.com/dwarkesh 𝐓𝐈𝐌𝐄𝐒𝐓𝐀𝐌𝐏𝐒 0:00:00 – How batch size affects token cost and speed 0:31:59 – How MoE models are laid out across GPU racks 0:47:02 – How pipeline parallelism spreads model layers across racks 1:03:27 – Why Ilya said, “As we now know, pipelining is not wise.” 1:18:49 – Because of RL, models may be 100x over-trained beyond Chinchilla-optimal 1:32:52 – Deducing long context memory costs from API pricing 2:03:52 – Convergent evolution between neural nets and cryptography

SPEAKERS

  • Dwarkesh Patel

    host

    Host of the Dwarkesh Patel podcast, interviewing guests about technology and AI.

  • Reiner Pope

    guest

    CEO of MatX, discussing LLM training and inference/serving economics and systems.

EPISODE SUMMARY

In this episode of Dwarkesh Podcast, featuring Dwarkesh Patel and Reiner Pope, Reiner Pope on Dwarkesh Patel: Why Token Cost Tracks Batch explores blackboard guide to LLM training, inference costs, batching, and networking Inference cost and latency are largely determined by a roofline-style max between compute time (active parameters) and memory time (loading weights plus KV-cache reads), making batch size the key lever for amortizing weight-fetch overhead.

RELATED EPISODES

The better AI gets, the smaller its share of the economy might get – Alex Imas and Phil Trammell

The better AI gets, the smaller its share of the economy might get – Alex Imas and Phil Trammell

What rebuilding AlphaGo teaches us about self-play, RL, and future of LLMs - Eric Jang

What rebuilding AlphaGo teaches us about self-play, RL, and future of LLMs - Eric Jang

Jensen Huang – Will Nvidia’s moat persist?

Jensen Huang – Will Nvidia’s moat persist?

Terence Tao – How the world’s top mathematician uses AI

Terence Tao – How the world’s top mathematician uses AI

Elon Musk – "In 36 months, the cheapest place to put AI will be space”

Elon Musk – "In 36 months, the cheapest place to put AI will be space”

Adam Marblestone – AI is missing something fundamental about the brain

Adam Marblestone – AI is missing something fundamental about the brain

Get more out of YouTube videos.

High quality summaries for YouTube videos. Accurate transcripts to search & find moments. Powered by ChatGPT & Claude AI.