Dwarkesh Podcast
Reiner Pope on Dwarkesh Patel: Why Token Cost Tracks Batch
Weight fetches dominate token cost until batch size crosses 300 times the MoE sparsity; past that crossover, compute becomes the bottleneck and cost per token hits its lower bound.
Episode Details
EPISODE INFO
- Released: April 29, 2026
- Duration: 2h 13m
- Channel: Dwarkesh Podcast
EPISODE DESCRIPTION
Did a very different format with Reiner Pope – a blackboard lecture where he walks through how frontier LLMs are trained and served. It's shocking how much you can deduce about what the labs are doing from a handful of equations, public API prices, and some chalk. It’s a bit technical, but I encourage you to hang in there - it’s really worth it. There are fewer than a handful of people who understand the full stack of AI, from chip design to model architecture, as well as Reiner. It was a real delight to learn from him. Reiner is CEO of MatX, a new chip startup (full disclosure - I’m an angel investor). He was previously at Google, where he worked on software efficiency, compilers, and TPU architecture.
𝐄𝐏𝐈𝐒𝐎𝐃𝐄 𝐋𝐈𝐍𝐊𝐒
• Wrote up some flashcards and practice problems to help myself retain what Reiner taught. Hope it's helpful to you too! https://reiner-flashcards.vercel.app/
• Download markdown of transcript here to chat with an LLM: https://gist.github.com/dwarkeshsp/79100f0fdeed69d76241903bb0604dbe
• Transcript: https://www.dwarkesh.com/p/reiner-pope
𝐒𝐏𝐎𝐍𝐒𝐎𝐑𝐒
• Jane Street needs constant access to incredibly low-latency compute. I recently asked one of their engineers, Clark, to talk me through how they meet these demands. Our conversation—which touched on everything from FPGAs to liquid cooling—was extremely helpful as I prepped to interview Reiner. You can watch the full discussion and explore Jane Street’s open roles at https://janestreet.com/dwarkesh
• Google’s Gemma 4 is the first open model that’s let me shut off the internet and create a fully disconnected "focus machine". This is because Gemma is small enough to run on my laptop, but powerful enough to actually be useful. So, to prep for this interview, I downloaded Reiner’s scaling book, disconnected from wifi, and used Gemma to help me break down the material. Check it out at https://goo.gle/Gemma4
• Cursor helped me turn some notes I took on how gradients flow during large-scale pretraining into a great animation. At first, I wasn’t sure the best way to visualize the concept, but Cursor’s Composer 2 Fast model let me iterate on different ideas almost instantaneously. You can check out the animation in my recent blog post: https://www.dwarkesh.com/p/what-i-learned-april-15. And if you have something to visualize yourself, go to https://cursor.com/dwarkesh
𝐓𝐈𝐌𝐄𝐒𝐓𝐀𝐌𝐏𝐒
0:00:00 – How batch size affects token cost and speed
0:31:59 – How MoE models are laid out across GPU racks
0:47:02 – How pipeline parallelism spreads model layers across racks
1:03:27 – Why Ilya said, “As we now know, pipelining is not wise.”
1:18:49 – Because of RL, models may be 100x over-trained beyond Chinchilla-optimal
1:32:52 – Deducing long context memory costs from API pricing
2:03:52 – Convergent evolution between neural nets and cryptography
SPEAKERS
Dwarkesh Patel
Host: host of the Dwarkesh Podcast, interviewing guests about technology and AI.
Reiner Pope
Guest: CEO of MatX, discussing LLM training and inference/serving economics and systems.
EPISODE SUMMARY
In this episode of Dwarkesh Podcast, Reiner Pope gives Dwarkesh Patel a blackboard guide to LLM training, inference costs, batching, and networking. Inference cost and latency are largely determined by a roofline-style max between compute time (which scales with active parameters) and memory time (loading weights plus KV-cache reads), making batch size the key lever for amortizing weight-fetch overhead.
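The roofline idea in the summary can be made concrete with a small sketch. This is not code from the episode: the `Setup` fields, the function names, and every number below are illustrative assumptions, and the hardware figures are picked only so that the FLOPs-to-bandwidth ratio is 300, which lines up with the "300 times MoE sparsity" crossover quoted at the top of the page. Whether that matches the episode's own derivation is also an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Setup:
    # Every field is a hypothetical, illustrative value, not a measurement.
    peak_flops: float        # FLOP/s the hardware can sustain
    mem_bandwidth: float     # bytes/s of memory bandwidth
    total_params: float      # all parameters, including inactive experts
    active_params: float     # parameters actually used per token
    bytes_per_param: float   # e.g. 2.0 for 16-bit weights
    kv_bytes_per_seq: float  # KV-cache bytes read per sequence per step


def step_time(s: Setup, batch: int) -> float:
    """Roofline estimate of one decode step: max(compute time, memory time)."""
    # Compute: about 2 FLOPs per active parameter per token in the batch.
    compute = 2.0 * s.active_params * batch / s.peak_flops
    # Memory: all weights are fetched once per step, plus per-sequence KV reads.
    weight_bytes = s.total_params * s.bytes_per_param
    memory = (weight_bytes + batch * s.kv_bytes_per_seq) / s.mem_bandwidth
    return max(compute, memory)


def time_per_token(s: Setup, batch: int) -> float:
    """Step time amortized over the batch; cost per token is proportional to it."""
    return step_time(s, batch) / batch


def crossover_batch(s: Setup) -> float:
    """Batch size where compute time equals weight-fetch time (KV cache ignored)."""
    hardware_ratio = s.peak_flops / s.mem_bandwidth
    sparsity = s.total_params / s.active_params
    return hardware_ratio * (s.bytes_per_param / 2.0) * sparsity


# Hypothetical MoE model: 8x sparsity, 16-bit weights, KV reads set to zero.
example = Setup(
    peak_flops=6e14,
    mem_bandwidth=2e12,
    total_params=8e11,
    active_params=1e11,
    bytes_per_param=2.0,
    kv_bytes_per_seq=0.0,
)

if __name__ == "__main__":
    print("crossover batch:", crossover_batch(example))
    for b in (1, 256, 2400, 4800):
        print(b, time_per_token(example, b))
```

Under these assumed numbers, time per token falls as the batch grows toward the crossover, because the fixed weight fetch is shared by more tokens, and then stays flat once compute is the binding term. Setting `kv_bytes_per_seq` above zero adds a memory cost that grows with batch size and is not amortized.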
RELATED EPISODES