Chip design from the bottom up – Reiner Pope

New blackboard lecture with Reiner Pope: how do chips actually work - starting with basic logic gates, and working up to why GPUs, TPUs, FPGAs, and the human brain each look the way they do. Reiner is CEO of MatX, a new chip startup (full disclosure - I’m an angel investor). He was previously at Google, where he worked on software efficiency, compilers, and TPU architecture.

EPISODE LINKS
* Transcript: https://www.dwarkesh.com/p/reiner-pope-2

SPONSORS
* Crusoe was one of only five GPU clouds that made the gold tier in SemiAnalysis' most recent ClusterMAX report. Gold-tier providers like Crusoe delivered 5-15% lower TCO than silver-tier clouds, even with identical GPU pricing. This is because optimizations like early fault detection and rapid node replacement don't necessarily show up in the sticker price, but still matter a ton in the real world. Learn more at https://crusoe.ai/dwarkesh
* Cursor is where I do most of my work, from reading research papers to visualizing technical concepts to coding up internal tools for the podcast. Most recently, I used it to build two different review interfaces for my essay contest, one that anonymizes submissions for scoring and another that lets me see applicants' essays next to their resumes and websites. Whatever you're working on, you should try doing it in Cursor. Get started at https://cursor.com/dwarkesh
* Jane Street let me ask Ron Minsky and Dan Pontecorvo, two senior Jane Streeters, a bunch of questions about how they use AI. We discussed everything from the types of models they're training to how they think about the future of trading to why they're more bullish than ever on hiring technical talent. You can watch the full conversation and learn more about their open positions at https://janestreet.com/dwarkesh

TIMESTAMPS
00:00:00 – Building a multiply-accumulate from logic gates
00:16:20 – Muxes and the cost of data movement
00:25:59 – How systolic arrays work
00:39:00 – Clock cycles and pipeline registers
00:51:40 – FPGAs vs ASICs
01:03:14 – Cache vs scratchpad
01:07:16 – Why CPU cores are much bigger than GPU cores
01:11:49 – Brains vs chips
01:15:22 – A GPU is just a bunch of tiny TPUs

Dwarkesh Patel (host) · Reiner Pope (guest)

May 22, 2026 · 1h 20m

CHAPTERS

0:00 – 3:41
Logic gates to multiply-accumulate: why MACs are the core AI primitive
Reiner starts from first principles: logic gates and wires as the primitive building blocks of chips. He motivates multiply-accumulate (MAC) as the fundamental operation in matrix multiplication, and explains why accumulation often needs higher precision than multiplication in AI workloads.
- •Chips are built from simple gates (AND/OR/NOT) connected by physical wires
- •Matrix multiplication reduces to repeated multiply-accumulate in nested loops
- •Accumulation needs higher precision due to repeated summation and error growth
- •Example frame: 4-bit multiply plus 8-bit add as a toy MAC
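As a rough illustration of the bullets above, matrix multiplication can be written as three nested loops around a single multiply-accumulate. This is a minimal sketch, not the lecture's own code: the names `mac` and `matmul` are hypothetical, and wrapping the accumulator at 8 bits is an assumed overflow policy.

```python
def mac(acc, a, b):
    """One multiply-accumulate: 4-bit x 4-bit product added into an 8-bit accumulator."""
    assert 0 <= a < 16 and 0 <= b < 16, "operands must fit in 4 bits"
    return (acc + a * b) & 0xFF  # wrap to 8 bits (assumed overflow policy)


def matmul(A, B):
    """Matrix multiply expressed purely as repeated MACs in a loop nest."""
    n, k, m = len(A), len(B), len(B[0])
    C = [[0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            for t in range(k):
                C[i][j] = mac(C[i][j], A[i][t], B[t][j])
    return C


A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
print(matmul(A, B))  # [[19, 22], [43, 50]]
```

A single 4-bit product can already be as large as 15 × 15 = 225, which nearly fills 8 bits, so a sum of many products needs a wider accumulator than either operand.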
3:41 – 10:11
Building a 4-bit multiplier by hand: partial products and full adders
Using long multiplication, they derive the set of partial products and the multi-operand addition needed to complete the MAC. Reiner introduces the full adder as a 3-to-2 compressor and shows how repeated application reduces a grid of bits into the final sum.
- •Long multiplication creates p×q partial products (via AND gates)
- •Full adder = adds three 1-bit inputs → outputs sum + carry (3-to-2 compression)
- •Summation is the dominant work vs forming partial products
- •Column-wise compression repeats until only the final result remains
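The compression idea above can be sketched in software. The example below assumes a simple column-by-column ordering rather than the optimized Dadda schedule, and it ties the third adder input to 0 when a column has only two bits left; `full_adder` and `multiply_by_compression` are hypothetical names.

```python
def full_adder(a, b, c):
    """3-to-2 compressor: three 1-bit inputs -> (sum bit, carry bit)."""
    s = a ^ b ^ c
    carry = (a & b) | (a & c) | (b & c)
    return s, carry


def multiply_by_compression(x, y, p=4, q=4):
    """Multiply a p-bit x by a q-bit y using only AND gates and full adders."""
    # Partial products: bit i of x AND bit j of y carries weight 2**(i + j).
    columns = [[] for _ in range(p + q + 1)]
    for i in range(p):
        for j in range(q):
            columns[i + j].append(((x >> i) & 1) & ((y >> j) & 1))

    adders = 0
    # Compress each column to a single bit, pushing carries into the next column.
    for w in range(p + q):
        col = columns[w]
        while len(col) > 1:
            a, b = col.pop(), col.pop()
            c = col.pop() if col else 0  # two bits left: third input tied to 0
            s, carry = full_adder(a, b, c)
            col.append(s)
            columns[w + 1].append(carry)
            adders += 1

    # Every remaining bit contributes its column weight to the result.
    result = sum(bit << w for w, col in enumerate(columns) for bit in col)
    return result, adders


product, adders_used = multiply_by_compression(13, 11)
print(product)  # 143
```

Each adder preserves the weighted sum of the bits it touches (a + b + c = sum + 2 × carry), which is why the bits left at the end spell out the product.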
10:11 – 12:57
Dadda multipliers and scaling laws: why bit-width gets expensive fast
Reiner identifies this approach as a Dadda multiplier and counts its hardware cost in gates. They derive a clean relationship: a p-bit by q-bit MAC needs roughly p×q full adders, so gate count grows quadratically with bit-width, which is why reducing precision brings large wins.
- •4×4 example: 16 AND gates for partial products; summation dominates area
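- •Full-adder count follows from bit counting: each full adder removes one bit on the way from the input bits to the output bits
- •General result: ~p×q full adders for a p-bit by q-bit multiply, counted in the MAC setting where the accumulator bits join the summation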
- •Reinforces quadratic scaling with bit-width as a key AI efficiency lever
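The counting argument above fits in a few lines. This is an idealized sketch: it ignores half adders and the final carry-out bit, which shift the exact count slightly in a real design, and `mac_cost` is a hypothetical helper name.

```python
def mac_cost(p, q):
    """Idealized gate counts for a p-bit x q-bit multiply added into a (p+q)-bit accumulator."""
    and_gates = p * q                 # one AND gate per partial product
    bits_in = p * q + (p + q)         # partial products plus accumulator bits
    bits_out = p + q                  # result width (carry-out ignored: an assumption)
    full_adders = bits_in - bits_out  # each full adder removes exactly one bit
    return and_gates, full_adders


for bits in (4, 8, 16):
    print(bits, mac_cost(bits, bits))
# 4 (16, 16)
# 8 (64, 64)
# 16 (256, 256)
```

Under this model, halving the bit-width of both operands cuts the adder count by 4×.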
12:57 – 16:31
FP4 vs FP8 throughput claims: fungibility limits and why ratios aren’t exact
Dwarkesh questions NVIDIA-style spec sheets implying easy conversion between FP formats. Reiner explains that hardware isn’t fully fungible across precisions, and that real-world ratios reflect both arithmetic scaling and data-movement constraints; NVIDIA’s newer specs partially acknowledge non-ideal ratios.
- •FP4/FP8 hardware is not inherently interchangeable; designers choose allocations
- •Quadratic arithmetic scaling suggests big gains from lower precision
- •Data movement (packing, bus widths) also affects observed throughput ratios
- •Floating point adds exponent complexity; practical speedups differ from ideal 4×
16:31 – 25:18
Muxes and the hidden cost of data movement (register file → ALU)
They zoom out to a classic CPU/GPU core datapath: a register file feeding an ALU/MAC. Reiner quantifies how multiplexers (Muxes) needed to select operands can cost more area than the arithmetic itself, making data movement the dominant expense.
- •Core model: register file + compute unit; arbitrary operands require selection logic
- •An n-input, p-bit Mux can be built from AND/OR masking and reduction
- •Reading three operands implies three Muxes (cost scales with register-file size)
- •In many regimes, moving/selecting data costs far more than the MAC gates
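To make the Mux cost concrete, here is a minimal sketch of an n-input, p-bit Mux built from AND masking and OR reduction, with a rough gate count. It assumes 2-input gates, ignores select decoding, and treats all three operands as p bits wide; the function names and the 32-register example are hypothetical.

```python
def mux(inputs, select, p):
    """n-input, p-bit mux: AND each input with its enable mask, then OR everything together."""
    all_ones = (1 << p) - 1
    out = 0
    for index, value in enumerate(inputs):
        enable = all_ones if index == select else 0  # one-hot select, replicated across p bits
        out |= value & enable
    return out


def mux_gate_count(n, p):
    """p AND gates per input, plus (n - 1) p-bit OR stages of 2-input gates."""
    return n * p + (n - 1) * p


def register_read_vs_mac(n_regs, p):
    """Gates in three operand muxes vs. the idealized full-adder count of a p x p MAC."""
    return 3 * mux_gate_count(n_regs, p), p * p


print(mux([3, 7, 9, 12], select=2, p=4))  # 9
print(register_read_vs_mac(32, 8))        # (1512, 64)
```

The two counts are not in the same units, since a full adder is itself several gates, but the gap shows how operand selection can outweigh the arithmetic it feeds.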
25:18 – 34:49
From scalar MACs to systolic arrays: baking loops into hardware
Reiner explains how tensor cores/systolic arrays arose to amortize register-file and Mux overhead by hardwiring more of the matrix-multiply loop nest. The key is increasing compute per byte moved by keeping weights local and streaming activations through a regular array.
- •Goal: make the fixed-function compute block much larger to amortize I/O costs
- •Systolic array effectively unrolls higher-level loops of matrix multiply in hardware
- •Weights stay resident locally; activations stream in; partial sums flow through
- •Per-cycle bandwidth into an X×Y array can be kept ~O(X) instead of O(XY) by keeping weights resident and loading new ones slowly
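The dataflow described above can be simulated cycle by cycle. The sketch below assumes a weight-stationary layout with weights already loaded, activations entering from the left with a one-cycle skew per row, and partial sums flowing downward; `systolic_matmul` and its register names are hypothetical, not taken from the lecture.

```python
def systolic_matmul(A, W):
    """Weight-stationary systolic array computing A @ W one clock cycle at a time."""
    T, X, Y = len(A), len(W), len(W[0])
    a_reg = [[0] * Y for _ in range(X)]  # activation register inside each cell
    p_reg = [[0] * Y for _ in range(X)]  # partial-sum register inside each cell
    out = [[0] * Y for _ in range(T)]

    for cycle in range(T + X + Y - 2):
        new_a = [[0] * Y for _ in range(X)]
        new_p = [[0] * Y for _ in range(X)]
        for x in range(X):
            for y in range(Y):
                if y == 0:
                    t = cycle - x  # row x receives input vector t, skewed by x cycles
                    a_in = A[t][x] if 0 <= t < T else 0
                else:
                    a_in = a_reg[x][y - 1]  # activation arrives from the left neighbour
                p_in = p_reg[x - 1][y] if x > 0 else 0  # partial sum arrives from above
                new_a[x][y] = a_in
                new_p[x][y] = p_in + a_in * W[x][y]  # the cell's resident weight
        a_reg, p_reg = new_a, new_p

        # The bottom row emits one finished output per column, skewed by column index.
        for y in range(Y):
            t = cycle - (X - 1) - y
            if 0 <= t < T:
                out[t][y] = p_reg[X - 1][y]
    return out


print(systolic_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]]))  # [[19, 22], [43, 50]]
```

Only X activation values enter the array per cycle, yet all X×Y cells perform a MAC, which is the compute-per-byte gain the bullets describe.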
34:49 – 38:54
Compute vs communication as a universal theme (from gates to clusters)
Dwarkesh connects chip-internal tradeoffs to data-center scale: maximize compute relative to communication everywhere in the stack. Reiner reinforces that number format, locality, and array sizing all express the same underlying principle.
- •Same optimization lens applies from on-die wiring to multi-chip inference
- •Precision choice affects both compute cost and data-movement volume
- •Systolic arrays exploit reuse and locality to reduce effective communication
- •Architecture is largely about managing data movement, not just FLOPs
38:54 – 44:21
Clock cycles, registers, and pipelining: what sets frequency
Reiner defines the clock cycle as a global synchronization mechanism for massive parallel circuits. He explains critical-path timing, how pipeline registers split logic to raise frequency, and why feedback loops constrain how far pipelining can go.
- •Clock synchronizes state updates across the chip at regular intervals
- •Critical path: logic must settle before the next clock edge captures outputs
- •Inserting a pipeline register mid-path can roughly double frequency, at the cost of area and power for the extra registers
- •Feedback loops (recurrences) limit pipelining: a register added inside a loop changes what the loop computes
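Both points above can be sketched numerically. The delay figures are hypothetical, chosen only to show the shape of the trade-off, and the function names are illustrative.

```python
def max_frequency_ghz(stage_delays_ns, register_overhead_ns):
    """The slowest stage plus register overhead sets the clock period."""
    period_ns = max(stage_delays_ns) + register_overhead_ns
    return 1.0 / period_ns


# Hypothetical numbers: 1.0 ns of logic and 0.05 ns of register overhead.
unpipelined = max_frequency_ghz([1.0], 0.05)
pipelined = max_frequency_ghz([0.5, 0.5], 0.05)
speedup = pipelined / unpipelined  # close to, but below, 2x


def accumulate(xs, loop_registers=1):
    """Running sum with a configurable number of registers inside the feedback loop."""
    regs = [0] * loop_registers
    for x in xs:
        # The adder sees the oldest register value, not last cycle's sum.
        regs = [regs[-1] + x] + regs[:-1]
    return regs


print(accumulate([1, 2, 3, 4], loop_registers=1))  # [10]
print(accumulate([1, 2, 3, 4], loop_registers=2))  # [6, 4]
```

With two registers in the loop, the circuit no longer computes one running sum. It produces two interleaved sums (2 + 4 and 1 + 3), which is the sense in which pipelining a recurrence changes its meaning.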
44:21 – 51:34
Why fully asynchronous ‘Factorio-style’ chips are hard in practice
Dwarkesh asks why chips can’t just compute ‘when done’ without a global clock. Reiner explains hazards from path delay variation and signals arriving misaligned at reconvergent logic, motivating synchronous design and margining.
- •Reconvergent paths (F and G feeding H) require aligned timing of values
- •Manufacturing variation changes delays; without a clock, reconverging logic can combine values from different epochs
- •Designs carry heavy timing margins so that they stay reliable across conditions
- •Clock-domain crossings are one of the rare places probabilities must be reasoned about
51:34 – 1:03:13
FPGAs vs ASICs: programmability via LUTs and why it costs ~10×
They switch to why FPGAs are used (e.g., in high-frequency trading, HFT): low-latency determinism with fast iteration, despite worse cost/energy than ASICs. Reiner explains FPGA fabric as registers + LUTs + huge Mux networks, where configuration sets Mux selects and LUT truth tables.
- •ASICs are cheaper and faster per unit but carry massive non-recurring engineering (NRE) and tape-out costs
- •FPGA = configurable wiring via many Muxes plus LUTs acting as programmable gates
- •Programming an FPGA is largely setting Mux controls and LUT truth-table bits
- •LUT implementation resembles a big Mux (e.g., 16:1) → large gate overhead vs ASIC gates
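A lookup table is easy to model in software. The sketch below assumes a 4-input LUT whose 16 configuration bits are stored in one integer; `lut4` and `program_lut` are hypothetical names, and real FPGA toolchains generate these bits from a hardware description rather than from Python lambdas.

```python
def lut4(config, a, b, c, d):
    """4-input LUT: the inputs select one of 16 stored truth-table bits (a 16:1 mux)."""
    index = a | (b << 1) | (c << 2) | (d << 3)
    return (config >> index) & 1


def program_lut(fn):
    """Derive the 16 configuration bits for any 4-input boolean function."""
    config = 0
    for index in range(16):
        a, b, c, d = index & 1, (index >> 1) & 1, (index >> 2) & 1, (index >> 3) & 1
        config |= (fn(a, b, c, d) & 1) << index
    return config


AND4 = program_lut(lambda a, b, c, d: a & b & c & d)
XOR4 = program_lut(lambda a, b, c, d: a ^ b ^ c ^ d)
print(hex(AND4), hex(XOR4))  # 0x8000 0x6996
print(lut4(XOR4, 1, 0, 0, 0))  # 1
```

The same hardware implements AND, XOR, or any other 4-input function depending only on the stored bits, which is where both the flexibility and the gate overhead come from.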
1:03:13 – 1:07:15
Deterministic latency vs peak CPU performance: caches as the main culprit
Dwarkesh revisits HFT’s preference for deterministic timing and asks why CPUs aren’t predictable. Reiner points to design choices like caches: huge speedups on average but nondeterministic hit/miss behavior, and contrasts that with scratchpad-managed memories typical in accelerators.
- •CPUs can be made deterministic, but mainstream designs optimize average performance
- •Cache hit/miss depends on history and contention → variable memory latency
- •Scratchpad model: software explicitly moves data between local memory and HBM/DDR
- •Determinism is easier when memory behavior is explicit rather than implicit
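The history dependence of cache latency shows up even in a toy model. This sketch assumes a direct-mapped cache with one address per line and made-up hit and miss latencies; `DirectMappedCache` and the cycle counts are hypothetical.

```python
HIT_CYCLES, MISS_CYCLES = 4, 200  # hypothetical latencies


class DirectMappedCache:
    """Toy direct-mapped cache: each address maps to exactly one line."""

    def __init__(self, num_lines):
        self.tags = [None] * num_lines

    def access(self, address):
        line = address % len(self.tags)
        if self.tags[line] == address:
            return HIT_CYCLES
        self.tags[line] = address  # evict whatever was there
        return MISS_CYCLES


cache = DirectMappedCache(num_lines=4)
latencies = [cache.access(addr) for addr in [0, 1, 0, 4, 0]]
print(latencies)  # [200, 200, 4, 200, 200]
```

Address 0 costs 4 cycles on its second access and 200 on its third, because address 4 evicted it in between. With a scratchpad, software decides explicitly what is resident, so each access to local memory has a known cost.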
1:07:15 – 1:11:49
Why CPU cores are big: caches, branch predictors, and control complexity
They broaden the discussion to what ‘von Neumann’ means in modern hardware and why CPUs have far less parallelism than GPUs/accelerators. Reiner explains that CPU area is dominated by caches and branch prediction machinery needed to sustain high single-thread performance amid control flow.
- •CPU parallelism is limited (cores × vector width) compared to GPU-style tiling
- •Large die area goes to caches and register files more than ALU arithmetic
- •Branch predictors enable deep pipelines despite control-flow uncertainty
- •GPUs strip/relax control complexity to pack more simple cores and throughput
1:11:49 – 1:20:19
Brains vs chips and the ‘GPU is many tiny TPUs’ framing
They compare biological and silicon computation, focusing on clock speed, batch size, and power as switching energy. Reiner then contrasts GPU vs TPU top-level organization, concluding that a GPU resembles a grid of small TPU-like units (SMs with tensor cores). The trade-off is the GPU's flexibility and local bandwidth versus the TPU's larger, coarse-grained matrix units.
- •Switching power comes from charging/discharging capacitances; idle costs are low
- •Lowering clock reduces transitions but doesn’t magically yield huge efficiency gains
- •TPU: fewer, larger matrix units plus vector unit; GPU: many small SM ‘mini-TPUs’
- •Trade-off: TPU allows larger systolic arrays; GPU offers richer local wiring/bandwidth within SMs
- •MatX mention: “splittable systolic array” as a way to combine benefits
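The switching-power bullets above can be tied to the standard dynamic-power relation P = α·C·V²·f (activity factor, switched capacitance, supply voltage, clock frequency). The sketch below holds voltage fixed, which is an assumption, and uses made-up component values purely for illustration.

```python
def dynamic_power_watts(activity, capacitance_f, voltage_v, frequency_hz):
    """Switching power: activity * C * V^2 * f."""
    return activity * capacitance_f * voltage_v ** 2 * frequency_hz


def energy_per_cycle_joules(activity, capacitance_f, voltage_v):
    """Energy spent charging and discharging capacitance in one clock cycle."""
    return activity * capacitance_f * voltage_v ** 2


# Hypothetical values: 10% activity, 1 nF switched capacitance, 0.8 V supply.
fast = dynamic_power_watts(0.1, 1e-9, 0.8, 1e9)
slow = dynamic_power_watts(0.1, 1e-9, 0.8, 0.5e9)
energy = energy_per_cycle_joules(0.1, 1e-9, 0.8)
print(slow / fast)  # 0.5
```

Halving the clock at a fixed voltage halves power, but the energy per cycle does not depend on frequency at all, so the energy per operation is unchanged.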