Dwarkesh Podcast

Chip design from the bottom up – Reiner Pope

A new blackboard lecture with Reiner Pope on how chips actually work, starting with basic logic gates and working up to why GPUs, TPUs, FPGAs, and the human brain each look the way they do. Reiner is CEO of MatX, a new chip startup (full disclosure: I’m an angel investor). He was previously at Google, where he worked on software efficiency, compilers, and TPU architecture.

EPISODE LINKS

  * Transcript: https://www.dwarkesh.com/p/reiner-pope-2

SPONSORS

  * Crusoe was one of only five GPU clouds that made the gold tier in SemiAnalysis' most recent ClusterMAX report. Gold-tier providers like Crusoe delivered 5-15% lower TCO than silver-tier clouds, even with identical GPU pricing. This is because optimizations like early fault detection and rapid node replacement don't necessarily show up in the sticker price, but still matter a ton in the real world. Learn more at https://crusoe.ai/dwarkesh
  * Cursor is where I do most of my work, from reading research papers to visualizing technical concepts to coding up internal tools for the podcast. Most recently, I used it to build two different review interfaces for my essay contest, one that anonymizes submissions for scoring and another that lets me see applicants' essays next to their resumes and websites. Whatever you're working on, you should try doing it in Cursor. Get started at https://cursor.com/dwarkesh
  * Jane Street let me ask Ron Minsky and Dan Pontecorvo, two senior Jane Streeters, a bunch of questions about how they use AI. We discussed everything from the types of models they're training to how they think about the future of trading to why they're more bullish than ever on hiring technical talent. You can watch the full conversation and learn more about their open positions at https://janestreet.com/dwarkesh

TIMESTAMPS

  00:00:00 – Building a multiply-accumulate from logic gates
  00:16:20 – Muxes and the cost of data movement
  00:25:59 – How systolic arrays work
  00:39:00 – Clock cycles and pipeline registers
  00:51:40 – FPGAs vs ASICs
  01:03:14 – Cache vs scratchpad
  01:07:16 – Why CPU cores are much bigger than GPU cores
  01:11:49 – Brains vs chips
  01:15:22 – A GPU is just a bunch of tiny TPUs

Dwarkesh Patel (host), Reiner Pope (guest)
May 22, 2026 · 1h 20m

CHAPTERS

  1. From logic gates to multiply‑accumulate (MAC): why AI chips start here

    The conversation begins at the lowest level of chip design—logic gates connected by physical wires—and builds toward the key arithmetic primitive for neural nets: multiply-accumulate. Reiner explains why MACs map directly onto matrix multiplication and why accumulation often uses higher precision than the multiply step.
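
    To make the link between MACs and matrix multiplication concrete, here is a minimal Python sketch. It is an illustration written for this summary, not code from the episode; the function names and the choice of unsigned integer operands are assumptions.

```python
def mac(acc, a, b):
    """One multiply-accumulate step: acc + a * b."""
    return acc + a * b


def matmul(A, B):
    """Matrix multiply written as nothing but repeated MACs."""
    rows, inner, cols = len(A), len(B), len(B[0])
    C = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            acc = 0
            for k in range(inner):
                acc = mac(acc, A[i][k], B[k][j])
            C[i][j] = acc
    return C


def accumulator_bits(operand_bits, num_terms):
    """Bits needed to hold a sum of num_terms products of unsigned operands.

    Assumption for illustration: unsigned integers, worst-case operand values.
    """
    max_operand = (1 << operand_bits) - 1
    return (num_terms * max_operand * max_operand).bit_length()
```

    Under these assumptions, a single product of two 4-bit operands fits in 8 bits, while a 16-term dot product needs 12 bits, which is one way to see why the accumulator is wider than the multiplier inputs.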

  2. Hand-building multiplication: partial products and AND gates

    Reiner walks through long multiplication for two 4-bit values, showing how hardware generates partial products and then sums them. The key insight is that each partial product bit is just an AND of one bit from each operand.
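
    The partial-product grid described above can be sketched in a few lines of Python. This is a software model for illustration only; `partial_products` and `sum_grid` are hypothetical names, and the grid is summed with ordinary integer arithmetic rather than with adder hardware.

```python
def partial_products(a, b, bits=4):
    """Return grid[i][j] = (bit j of a) AND (bit i of b).

    Row i corresponds to one line of long multiplication and carries
    weight 2**i; column j within the row carries weight 2**j.
    """
    return [
        [((a >> j) & 1) & ((b >> i) & 1) for j in range(bits)]
        for i in range(bits)
    ]


def sum_grid(grid):
    """Add up every partial-product bit at its positional weight."""
    return sum(
        bit << (i + j)
        for i, row in enumerate(grid)
        for j, bit in enumerate(row)
    )
```

    For any two 4-bit inputs, `sum_grid(partial_products(a, b))` equals `a * b`, so the 16 AND gates capture everything needed except the summation.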

  3. Summing the grid efficiently: full adders, compressors, and Dadda multipliers

    Most of the hardware cost is in summing the partial products rather than creating them. Reiner introduces the full adder as a 3-to-2 compressor and explains how repeatedly applying these compressors yields an area-efficient Dadda-style multiplier.
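
    A rough sketch of the compression idea follows. Note that it uses a simple greedy schedule (apply a full adder to every group of three bits in a column, each round) rather than the exact Dadda height schedule, so it shows the 3-to-2 principle and not the area-optimal layout. All names are hypothetical.

```python
def full_adder(a, b, c):
    """3-to-2 compressor: three bits of equal weight in, sum and carry out."""
    s = a ^ b ^ c
    carry = (a & b) | (a & c) | (b & c)
    return s, carry


def multiply_with_compressors(a, b, bits=4):
    """Multiply by compressing partial-product columns with full adders.

    Returns (product, number_of_full_adders_used).
    """
    # columns[w] holds all bits that carry weight 2**w
    columns = [[] for _ in range(2 * bits)]
    for i in range(bits):
        for j in range(bits):
            columns[i + j].append(((a >> j) & 1) & ((b >> i) & 1))

    adders = 0
    while any(len(col) > 2 for col in columns):
        nxt = [[] for _ in range(len(columns) + 1)]  # room for a carry-out
        for w, col in enumerate(columns):
            k = 0
            while len(col) - k >= 3:
                s, c = full_adder(col[k], col[k + 1], col[k + 2])
                nxt[w].append(s)       # sum stays in the same column
                nxt[w + 1].append(c)   # carry moves one column left
                adders += 1
                k += 3
            nxt[w].extend(col[k:])     # leftover bits pass through
        columns = nxt

    # At most two bits per column remain: finish with one ordinary addition.
    row0 = sum(col[0] << w for w, col in enumerate(columns) if len(col) >= 1)
    row1 = sum(col[1] << w for w, col in enumerate(columns) if len(col) >= 2)
    return row0 + row1, adders
```

    Each full adder removes exactly one bit from the grid, so the loop always terminates; the final two-row addition stands in for the carry-propagate adder a real multiplier would use.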

  4. Bit-width scaling and FP4 vs FP8 “fungibility” in real chips

    They connect the gate-level picture to GPU marketing metrics like FP4/FP8 throughput. Reiner explains why different precisions aren’t fully fungible in hardware, why multiplier cost scales roughly quadratically with bit-width, and why real FP4-to-FP8 throughput ratios deviate from that rule because of floating-point exponent handling and data movement constraints.
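
    The quadratic rule of thumb follows directly from the partial-product grid: an n-bit by n-bit integer multiply produces n × n partial-product bits. The sketch below only counts those bits for a plain integer multiplier; it is an assumption-laden illustration and does not model floating-point mantissa widths, exponent logic, or data movement, which is why real ratios differ.

```python
def partial_product_bits(bits):
    """Number of AND-gate outputs in an n-bit by n-bit integer multiplier."""
    return bits * bits


# Doubling the operand width quadruples the partial-product grid.
ratio_8_vs_4 = partial_product_bits(8) / partial_product_bits(4)
```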

  5. The hidden cost: muxes, register files, and why data movement dominates

    Reiner models a traditional CPU/CUDA-core datapath: a register file feeding an ALU, with muxes selecting operands. Implementing “select register i” requires many gates (mask + OR reduction), and duplicating this for multiple operands makes movement far more expensive than the arithmetic itself.
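
    The "mask + OR reduction" mux can be sketched as follows. This is a simplified model: it counts only 2-input AND and OR gates, ignores the decoder that produces the select lines, and the 32-entry, 32-bit register file used in the note below is a hypothetical size chosen for illustration.

```python
def select_register(registers, index):
    """Read register `index` the way a mask-and-OR mux would."""
    result = 0
    for r, value in enumerate(registers):
        select = 1 if r == index else 0   # one-hot select line
        masked = value if select else 0   # one AND gate per bit
        result |= masked                  # OR reduction across registers
    return result


def mux_gate_count(num_registers, width):
    """2-input gates for one read port, excluding the select decoder."""
    and_gates = num_registers * width
    or_gates = (num_registers - 1) * width
    return and_gates + or_gates
```

    For the hypothetical 32 × 32-bit register file, `mux_gate_count(32, 32)` gives 2016 gates for a single operand read, and a multiply-accumulate needs several such reads per operation.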

  6. Systolic arrays / tensor cores: baking loops into hardware to amortize movement

    To fix the imbalance, tensor cores (systolic arrays) hardwire larger chunks of the matrix-multiply loop nest. By keeping weights local and streaming activations through, they increase compute per register-file access and reduce expensive cross-boundary wiring.
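
    Below is a small cycle-by-cycle model of one common arrangement, a weight-stationary array in which activations move left to right and partial sums move top to bottom. The dataflow direction, the input skew, and all names are assumptions made for this sketch; the episode may describe a different layout.

```python
def systolic_matmul(X, W):
    """Cycle-level model of a weight-stationary systolic array computing X @ W.

    Cell (k, n) permanently holds W[k][n]. Each cycle it takes an activation
    from its left neighbour and a partial sum from the cell above.
    """
    M, K, N = len(X), len(W), len(W[0])
    act = [[0] * N for _ in range(K)]    # activation register per cell
    psum = [[0] * N for _ in range(K)]   # partial-sum register per cell
    Y = [[0] * N for _ in range(M)]

    for t in range(M + K + N - 2):
        new_act = [[0] * N for _ in range(K)]
        new_psum = [[0] * N for _ in range(K)]
        for k in range(K):
            for n in range(N):
                if n == 0:
                    m = t - k  # row k is fed with a k-cycle skew
                    a = X[m][k] if 0 <= m < M else 0
                else:
                    a = act[k][n - 1]  # arrives from the left neighbour
                above = psum[k - 1][n] if k > 0 else 0
                new_act[k][n] = a
                new_psum[k][n] = above + a * W[k][n]
        act, psum = new_act, new_psum

        # Finished sums leave the bottom row, one column later per cycle.
        for n in range(N):
            m = t - (K - 1) - n
            if 0 <= m < M:
                Y[m][n] = psum[K - 1][n]
    return Y
```

    In this model each activation is fetched from outside the array once and then reused by every cell in its row, which is the amortization of data movement the chapter describes.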

  7. Loading weights efficiently: bandwidth vs time and the “trickle feed” approach

    They address a practical question: if weights are local, how do they get there? The answer is to load them slowly through daisy-chained shifts, minimizing wiring bandwidth (die area) even if it takes multiple cycles, because weights are reused many times.
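
    A minimal sketch of the daisy-chained load: one value enters the head of the chain per cycle and every stored value shifts one position deeper. The function names are hypothetical, and the overhead formula assumes loading is not overlapped with compute, which real designs may avoid.

```python
def trickle_load(weights):
    """Shift weights into a chain of cells, one new value per cycle.

    Only the head of the chain is wired to the outside, so the wiring cost
    is one value wide regardless of chain length.
    """
    chain = [None] * len(weights)
    cycles = 0
    for w in reversed(weights):      # deepest cell's weight goes in first
        chain = [w] + chain[:-1]     # everything shifts one cell deeper
        cycles += 1
    return chain, cycles


def load_overhead(load_cycles, compute_cycles):
    """Fraction of time spent loading, assuming no overlap with compute."""
    return load_cycles / (load_cycles + compute_cycles)
```

    Loading a chain of length n takes n cycles, but if the weights are then reused for many more cycles of compute, the overhead fraction becomes small.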

  8. Sizing decisions that dominate chip design: array size vs register file size

    Reiner frames many chip-architecture choices as “sizing” problems. Bigger systolic arrays improve amortization, while bigger register files improve flexibility and real workload performance; these compete for die area and power budgets.

  9. Clock cycles, synchronization, and pipeline registers: what sets frequency

    They move from spatial compute to timing: the clock synchronizes massive on-chip parallelism. Registers sample values on clock edges; clock frequency is limited by worst-case delay through combinational logic between registers (critical path), and designers insert pipeline registers to shorten that delay.
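
    The relationship between critical path and clock period can be sketched numerically. The gate delays below are made-up picosecond values for illustration, and register setup and clock-to-output overheads are ignored.

```python
def min_period_ps(stages):
    """Minimum clock period: the slowest register-to-register stage.

    `stages` is a list of stages, each a list of gate delays in picoseconds
    for the logic chained in series between two registers.
    """
    return max(sum(stage) for stage in stages)


def max_frequency_ghz(stages):
    """Highest clock frequency the slowest stage allows."""
    return 1000 / min_period_ps(stages)


# Hypothetical path of four gates in series (delays in ps).
unpipelined = [[100, 200, 150, 50]]
# Same gates with one pipeline register inserted after the second gate.
pipelined = [[100, 200], [150, 50]]
```

    With these hypothetical numbers the unpipelined path needs a 500 ps period (2 GHz), while the pipelined version needs only 300 ps, at the cost of one extra cycle of latency.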

  10. When pipelining breaks correctness: feedback loops and clock-limiting paths

    Not all logic can be arbitrarily pipelined. Reiner explains that feedback (recurrence) paths—like an accumulator feeding itself—restrict where registers can be inserted without changing semantics, and these loops often determine the maximum safe clock rate.
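
    The accumulator case can be shown with a small model. `loop_latency` is a hypothetical parameter giving the number of registers in the feedback loop; a value of 1 is the ordinary accumulator.

```python
def accumulate(xs, loop_latency=1):
    """Model an adder whose output feeds back through `loop_latency` registers.

    Returns the contents of the loop registers after all inputs are consumed.
    """
    pipe = [0] * loop_latency
    for x in xs:
        fed_back = pipe[-1]                   # value leaving the loop this cycle
        pipe = [fed_back + x] + pipe[:-1]     # adder output enters the loop
    return pipe
```

    With one register, `accumulate([1, 2, 3, 4])` holds the running sum 10. With two registers in the loop, the same inputs produce `[6, 4]`: the hardware now computes two interleaved sums (even and odd positions) instead of one, so the extra pipeline register changed what the circuit computes.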

  11. FPGAs vs ASICs: LUTs, programmable mux fabrics, and the 10× overhead

    Reiner explains the business and architectural trade-off: FPGAs are reprogrammable and deterministic but inefficient. Their flexibility comes from LUTs (truth tables) and pervasive muxing; implementing a simple gate can take tens of gates worth of LUT+mux machinery compared to a few gates in an ASIC.
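
    A LUT can be modelled as a stored truth table indexed by its inputs. The sketch below is illustrative; the names are hypothetical, and the mux count assumes the table is read out through a tree of 2-to-1 muxes, which is one plausible structure rather than a description of any specific FPGA.

```python
def make_lut(func, k):
    """Program a k-input LUT: store func's output for every input combination."""
    table = []
    for index in range(2 ** k):
        inputs = [(index >> pos) & 1 for pos in range(k)]
        table.append(func(*inputs) & 1)
    return table


def lut_eval(table, inputs):
    """Evaluate the LUT: the inputs select one stored bit."""
    index = sum(bit << pos for pos, bit in enumerate(inputs))
    return table[index]


def lut_mux_count(k):
    """2-to-1 muxes in a tree that selects one of 2**k stored bits."""
    return 2 ** k - 1
```

    A 2-input AND programmed into a 4-input LUT still occupies all 16 table bits and a 15-mux selection tree, whereas a fixed-function chip would implement it with a single AND gate.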

  12. Determinism and memory systems: cache vs scratchpad (and why CPUs vary)

    They discuss why CPUs often have non-deterministic latency: caches introduce hit/miss variability depending on history and interference. Accelerators like TPUs often use scratchpads where software explicitly controls on-chip vs off-chip memory accesses, improving predictability.
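
    The history dependence of cache latency can be seen in a toy model. This sketch assumes a direct-mapped cache with one address per line and made-up hit and miss latencies; none of these parameters come from the episode.

```python
HIT_CYCLES = 1      # hypothetical on-chip access latency
MISS_CYCLES = 100   # hypothetical off-chip access latency


def cache_latencies(addresses, num_lines=4):
    """Latency of each access through a tiny direct-mapped cache."""
    tags = [None] * num_lines
    latencies = []
    for addr in addresses:
        line = addr % num_lines
        if tags[line] == addr:
            latencies.append(HIT_CYCLES)
        else:
            tags[line] = addr            # evict whatever was there
            latencies.append(MISS_CYCLES)
    return latencies


def scratchpad_latencies(addresses):
    """Reads from a scratchpad that software has already filled explicitly."""
    return [HIT_CYCLES for _ in addresses]
```

    In `cache_latencies([0, 0, 4, 0])` the three reads of address 0 cost 100, 1, and 100 cycles, because the access to address 4 evicted it in between. The scratchpad reads always cost the same, since the software decided ahead of time what lives on chip.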

  13. Why CPU cores are “big,” and why a GPU looks like many tiny TPUs

    Reiner contrasts CPU, GPU, and TPU organization. CPUs spend significant area on caches and control features like branch prediction; GPUs strip much of that control overhead and tile many smaller compute+memory units (SMs), while TPUs use fewer, larger coarse-grained matrix units plus vector units.
