Dwarkesh Podcast

Chip design from the bottom up – Reiner Pope

New blackboard lecture with Reiner Pope: how do chips actually work? We start with basic logic gates and work up to why GPUs, TPUs, FPGAs, and the human brain each look the way they do. Reiner is CEO of MatX, a new chip startup (full disclosure: I'm an angel investor). He was previously at Google, where he worked on software efficiency, compilers, and TPU architecture.

EPISODE LINKS

* Transcript: https://www.dwarkesh.com/p/reiner-pope-2

SPONSORS

* Crusoe was one of only five GPU clouds that made the gold tier in SemiAnalysis' most recent ClusterMAX report. Gold-tier providers like Crusoe delivered 5-15% lower TCO than silver-tier clouds, even with identical GPU pricing. This is because optimizations like early fault detection and rapid node replacement don't necessarily show up in the sticker price, but still matter a ton in the real world. Learn more at https://crusoe.ai/dwarkesh
* Cursor is where I do most of my work, from reading research papers to visualizing technical concepts to coding up internal tools for the podcast. Most recently, I used it to build two different review interfaces for my essay contest, one that anonymizes submissions for scoring and another that lets me see applicants' essays next to their resumes and websites. Whatever you're working on, you should try doing it in Cursor. Get started at https://cursor.com/dwarkesh
* Jane Street let me ask Ron Minsky and Dan Pontecorvo, two senior Jane Streeters, a bunch of questions about how they use AI. We discussed everything from the types of models they're training to how they think about the future of trading to why they're more bullish than ever on hiring technical talent. You can watch the full conversation and learn more about their open positions at https://janestreet.com/dwarkesh

TIMESTAMPS

00:00:00 – Building a multiply-accumulate from logic gates
00:16:20 – Muxes and the cost of data movement
00:25:59 – How systolic arrays work
00:39:00 – Clock cycles and pipeline registers
00:51:40 – FPGAs vs ASICs
01:03:14 – Cache vs scratchpad
01:07:16 – Why CPU cores are much bigger than GPU cores
01:11:49 – Brains vs chips
01:15:22 – A GPU is just a bunch of tiny TPUs

Dwarkesh Patel (host), Reiner Pope (guest)
May 22, 2026 · 1h 20m

At a glance

WHAT IT’S REALLY ABOUT

From logic gates to TPUs: why AI chips favor systolic arrays

  1. The conversation builds an AI chip from first principles—logic gates to a multiply-accumulate (MAC)—to show why matrix multiplication maps naturally onto hardware.
  2. Bit-width drives compute cost roughly quadratically, which is a major reason low-precision (FP4/FP8) arithmetic is so advantageous for neural networks, though in floating point the exponent handling keeps the savings below the ideal quadratic scaling.
  3. Moving data (muxes/register-file reads) can cost far more area and energy than the arithmetic itself, motivating fixed-function blocks like systolic arrays that amortize data-movement overhead.
  4. Systolic arrays improve compute-per-communication by keeping weights local and streaming activations through, even if weights must be loaded slowly to minimize bandwidth (and die area).
  5. Clocking and pipeline registers set performance via critical paths and feedback loops, while architectural choices (FPGA vs ASIC, cache vs scratchpad, GPU vs TPU organization) trade flexibility, determinism, and data-movement efficiency.
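
As a rough illustration of the clocking point in item 5, the sketch below treats a pipeline as a list of combinational delays between registers. The clock period can be no shorter than the slowest stage (the critical path), and inserting a pipeline register splits one stage into two. The function names and the delay values are hypothetical, and the model ignores register setup time and clock skew.

```python
def min_clock_period(stage_delays_ns):
    """The clock must wait for the slowest register-to-register path."""
    return max(stage_delays_ns)


def split_stage(stage_delays_ns, index, first_part_ns):
    """Insert a pipeline register inside stage `index`.

    The stage is cut into two stages: one with delay `first_part_ns`
    and one with the remaining delay. Returns a new list of delays.
    """
    total = stage_delays_ns[index]
    if not 0 < first_part_ns < total:
        raise ValueError("split point must fall inside the stage")
    return (
        stage_delays_ns[:index]
        + [first_part_ns, total - first_part_ns]
        + stage_delays_ns[index + 1:]
    )


# Hypothetical delays in nanoseconds: the middle stage is the critical path.
stages = [1.0, 3.0, 1.5]
pipelined = split_stage(stages, 1, 1.5)

print(min_clock_period(stages))     # period before the extra register
print(min_clock_period(pipelined))  # period after the extra register
```

In this toy model the extra register halves the clock period at the price of one more cycle of latency, which is the basic trade a pipeline register makes.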

IDEAS WORTH REMEMBERING

5 ideas

MACs map directly onto matrix multiplication’s inner loop.

Matrix multiply is a nested loop where each output element repeatedly performs output += a*b; hardware therefore optimizes around fused multiply-accumulate at massive scale.
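
The nested loop described above can be sketched in a few lines of Python. This is only an illustration of where the multiply-accumulate sits, not a model of any particular chip; the helper names `mac` and `matmul` are made up for this example.

```python
def mac(acc, a, b):
    """One multiply-accumulate step: return acc + a * b."""
    return acc + a * b


def matmul(A, B):
    """Multiply an m x k matrix by a k x n matrix using only MAC steps."""
    m, k, n = len(A), len(B), len(B[0])
    C = [[0] * n for _ in range(m)]
    for i in range(m):          # each output row
        for j in range(n):      # each output column
            for t in range(k):  # inner loop: one MAC per iteration
                C[i][j] = mac(C[i][j], A[i][t], B[t][j])
    return C


print(matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]]))
```

An m×k by k×n product performs m·n·k MAC steps, which is why hardware built for this workload is organized around doing many of them at once.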

Arithmetic cost grows roughly with bit-width squared, so lower precision is disproportionately cheaper.

A p×q multiply produces p·q partial products and needs ~p·q full-adder compressions, so narrower formats are dramatically smaller and faster; in floating point, exponent handling keeps the savings below this ideal scaling.
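
To make the p·q count concrete, here is a small sketch of an unsigned multiply expressed as single-bit AND terms. Each term is one partial-product bit with a binary weight, and summing them gives the product. The function names are hypothetical, and the sketch covers only unsigned integers, not the exponent logic of floating-point formats.

```python
def partial_products(a, b, p, q):
    """Return the p*q partial-product bits of a p-bit by q-bit unsigned multiply.

    Each entry is (bit, shift): bit is a_i AND b_j, shift is its weight i + j.
    """
    terms = []
    for i in range(p):
        for j in range(q):
            bit = ((a >> i) & 1) & ((b >> j) & 1)
            terms.append((bit, i + j))
    return terms


def multiply(a, b, p, q):
    """Sum the weighted partial-product bits to recover a * b."""
    return sum(bit << shift for bit, shift in partial_products(a, b, p, q))


print(len(partial_products(13, 11, 4, 4)))    # 4-bit x 4-bit: number of AND terms
print(len(partial_products(200, 150, 8, 8)))  # 8-bit x 8-bit: number of AND terms
print(multiply(13, 11, 4, 4))
```

Doubling the operand width from 4 to 8 bits quadruples the number of partial-product bits (16 versus 64), which is the quadratic growth the summary refers to.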

Selecting operands can be more expensive than computing on them.

A register-file read requires muxing among many sources; an n-to-1 mux over p-bit words costs ~n·p ANDs plus ~(n−1)·p ORs, and MACs may require multiple such reads per operation.
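
The gate count quoted above can be checked with a short sketch. It assumes the mux receives one-hot select lines (the decoder that produces them is not counted), ANDs every bit of every input word with its select line, and ORs the results together. The function names and the 32-entry, 8-bit example are illustrative assumptions, not figures from the episode.

```python
def mux_gate_count(n, p):
    """Approximate gate count of an n-to-1 mux over p-bit words."""
    return {"and": n * p, "or": (n - 1) * p}


def mux(words, select, p):
    """Behavioral n-to-1 mux built from the same AND/OR structure."""
    out = 0
    for k, word in enumerate(words):
        enable = 1 if k == select else 0      # one-hot select line for input k
        mask = (1 << p) - 1 if enable else 0  # AND every bit with the select line
        out |= word & mask                    # OR the gated words together
    return out


print(mux_gate_count(32, 8))      # hypothetical 32-entry file of 8-bit words
print(mux([3, 5, 9, 12], 2, 4))   # selects the word at index 2
```

Under these assumptions a single read port on a 32-entry, 8-bit register file already needs about 500 gates, and a MAC that reads several operands pays that cost several times per operation.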

Systolic arrays win by amortizing register-file and routing overhead across many MACs.

Instead of repeatedly fetching arbitrary operands for each MAC, the design “bakes in” higher-level loop structure so a large grid of MACs reuses local data and reduces expensive global selection/wiring.

Keeping weights local is key; loading them slowly can be optimal.

Weights are reused across many input vectors, so they’re stored in local registers near compute; to avoid wide, area-expensive buses, weights can be daisy-chained in over many cycles (optimize bandwidth/area over load time).
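
The sketch below simulates one common weight-stationary arrangement, cycle by cycle, to show how weights stay put while activations stream through. It is an assumption-laden toy: each processing element holds one weight `W[k][j]`, activations move one column to the right per cycle, partial sums move one row down per cycle, and the slow daisy-chained weight load is not modeled. The function name and dataflow details are illustrative and may differ from the design discussed in the episode.

```python
def systolic_matmul(X, W):
    """Compute X @ W on a simulated K x N weight-stationary array.

    X is T x K (T input vectors), W is K x N (weights held in the array).
    """
    T, K, N = len(X), len(W), len(W[0])
    a = [[0] * N for _ in range(K)]  # activation register in each element
    p = [[0] * N for _ in range(K)]  # partial-sum register in each element
    Y = [[0] * N for _ in range(T)]

    for c in range(T + K + N - 2):   # cycles until the last result drains
        new_a = [[0] * N for _ in range(K)]
        new_p = [[0] * N for _ in range(K)]
        for k in range(K):
            for j in range(N):
                if j == 0:
                    t = c - k        # row k's input is skewed by k cycles
                    a_in = X[t][k] if 0 <= t < T else 0
                else:
                    a_in = a[k][j - 1]              # from the left neighbor
                p_in = p[k - 1][j] if k > 0 else 0  # from the neighbor above
                new_a[k][j] = a_in
                new_p[k][j] = p_in + W[k][j] * a_in  # local MAC, local weight
        a, p = new_a, new_p
        for j in range(N):           # bottom row emits finished sums
            t = c - j - (K - 1)
            if 0 <= t < T:
                Y[t][j] = p[K - 1][j]
    return Y


print(systolic_matmul([[1, 2], [3, 4], [5, 6]], [[1, 0, 2], [0, 1, 3]]))
```

Each element only ever talks to its immediate neighbors, and every weight is reused for all T input vectors without being fetched again, which is the reuse that makes a slow weight load tolerable.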

WORDS WORTH SAVING

5 quotes

So this circuit I've described here, almost all of the cost, like seven-eighths of the cost, is in the reading and writing the register file, and only a tiny fraction of the cost is in the logic unit itself. So this is the problem to solve.

Reiner Pope

There is a global clock signal which drives all of these registers, and it says at a certain instant in time when the clock strikes, whatever value happens to be on this wire at that instant, that's what's gonna get stored in there.

Reiner Pope

So, taking that analogy, the thing that you need to be mindful of is if I've got two different paths through some logic... and the result from F and G have to sort of meet up at H ... The thing that can go wrong is that F can get there early, and it meets, like, the previous value of G or the next value of G or something like that.

Reiner Pope

The trade-off is that the first FPGA costs you ten thousand dollars, whereas the first ASIC you make costs you thirty million dollars because it requires an entire tape-out.

Reiner Pope

So, in a CPU you have the CPU... and then you have a cache system here... but whether or not you get a cache hit is dependent on the sort of ambient environment of the CPU. Like what other programs are running, what has run recently, what is the random number generator inside the cache system doing? And so that is a big source of non-determinism in the runtime of a CPU.

Reiner Pope

Multiply-accumulate as the core AI primitive
Quadratic scaling of area/cost with precision (bit-width)
Mux/register-file data movement overhead
Systolic arrays (tensor cores) and weight-stationary reuse
Clock cycles, critical paths, and pipeline register insertion
FPGAs: LUTs and programmable interconnect vs ASIC efficiency
Cache vs scratchpad and deterministic latency
Why CPU cores are large (cache/branch prediction) vs GPU cores
GPU as many small TPUs vs TPU as fewer large matrix units

High quality AI-generated summary created from speaker-labeled transcript.
