EVERY SPOKEN WORD
75 min read · 15,122 words- 0:00 – 16:20
Building a multiply-accumulate from logic gates
- DPDwarkesh Patel
I'm back with Reiner Pope, who is the CEO of MatX, which is a new AI chip company. Last time we were talking about what happens inside a data center. Now I wanna understand what happens inside an AI chip. How does a chip actually work? Full disclosure, by the way, I am an angel investor in MatX, um, so hopefully you have designed a good chip [laughs] .
- RPReiner Pope
[laughs]
- DPDwarkesh Patel
Okay [laughs] .
- RPReiner Pope
So I'll start with, uh, sort of the very smallest fundamental unit of, of, of chip design, then we'll sort of build up into what an overall, um, like, actual production chip, uh, what are the components of that?
- DPDwarkesh Patel
Yep.
- RPReiner Pope
At the very bottom level of a chip, the primitives that we work with are, uh, logic gates, um, which are very simple things like and, or, not. Um, and then these are connected together by, by wires that have to be laid out, uh, physically as metal traces on a chip. The main function that, that AI chips want to, uh, compute is, uh, multiplication of matrices, and really inside that is the fundamental primitive is multiply-accumulate of just, like, of, of pairs of numbers. So we're gonna sort of demonstrate what that calculation looks like by hand, and then sort of infer what, what a, uh, circuit would look like for that. It'll turn out to be sort of easiest if, if I do multiplication, uh, accumulator of something like a, um, a four-bit number, um, w- with another four-bit number. Um, and then we're going to... The, the actual clearest primitive is actually, um, multiply-accumulate. So there's a multiply these two terms, and then we're going to add in, so product of these two terms, and then we're going to add in an eight-bit, uh, number.
- DPDwarkesh Patel
And, and can I ask a clarifying question? Why is this the, um, natural primitive for, um, you know, w- whatever computation happens inside a computer?
- RPReiner Pope
Yeah. So, um, there, there's a few reasons for this. Uh, it, it's a little bit more efficient, um, but the, the reason it's natural for AI chips is that if you look what's happening during a matrix multiply, the, uh, what is matrix multiply in very short, it is, um, uh, there's a for loop over i and over j-
- DPDwarkesh Patel
Yep
- RPReiner Pope
... and over k of output i,k plus equals to, um, input i,j times other input j,k. And so multiply-accumulate happens at every single step-
- DPDwarkesh Patel
Got it
- RPReiner Pope
... of, of a matrix multiply.
- DPDwarkesh Patel
Makes sense.
- RPReiner Pope
And then the other observation is that, um, the precision will almost always be higher in the accumulation step than in the multiplication step. Um, this is maybe specific to AI chips, uh, but you're, you're multiplying low precision numbers, but then when you accumulate, uh, errors accumulate quickly, and so you need more precision here. So this is why we've chosen to do a four-bit multiplication and an eight-bit addition.
- DPDwarkesh Patel
Let me make sure I understood that. There's two ways to understand that. One is that the value will be larger than the inputs, and the other is that if it was a floating point number, it would be... Maybe that, that part is, like, less intuitive to me, but it's maybe the same principle?
- RPReiner Pope
It, it is really the same principle.
- DPDwarkesh Patel
All right.
- RPReiner Pope
Um, I guess the... So, I mean, I guess the, the separate principle is that as you are summing up this number, you are summing up a whole bunch of numbers, and so you've got a lot of rounding errors accumulating.
- DPDwarkesh Patel
That makes sense.
- RPReiner Pope
Whereas in this case, there's, like, there's, there's only one multiplication in that chain, and so there's not a lot of rounding errors accumulating in the multiplication.
- DPDwarkesh Patel
Why are you summing up a whole bunch of numbers? It's just two numbers, right?
- RPReiner Pope
Just, I mean, the summation happens. It's repeated-
- DPDwarkesh Patel
Ah, I see
- RPReiner Pope
... j, j many times.
- DPDwarkesh Patel
So, so, yeah, any errors accumulate. I see.
- RPReiner Pope
Yes.
- DPDwarkesh Patel
Yep. Makes sense.
- RPReiner Pope
So how would we perform this calculation by hand? Um, I mean, as a human, we would probably separate it into two, but we can sort of do it all, all in one, um, using long multiplication. So the multiplication t- term first. We're gonna multiply this number, th- this four-bit number here, by every single bit position in the other four-bit number. So we write that out. Um, first one, zero, zero, one multiplied by this bit position. Um, that is this number itself. Then shifted across by one, we're multiplying by zero. That gives us an all zeros number. Shifted, uh, across even one more to multiply by this one, we get one, zero, zero, one. And then finally for this last, last bit position, we get an all zeros, uh, number again. So this, this sort of gives us a bunch of terms that we're gonna have to add for the, um, for the multiplication. And then while we're doing that, uh, summation of this, we might as well add in the, the, the actual accumulator term as well. And so we just copy that directly across. So, so this is the sum. It's a, it's a five-way sum that we're going to want to commu- compute. So firstly, what did, what logic gates did it, uh, take us to even get to this intermediate step? We needed to, uh, produce all 16 of these, um, partial products.
- DPDwarkesh Patel
Mm-hmm.
- RPReiner Pope
How do I produce one of these partial products? So let's take this, this number one, for example, here. It is one, so what was... How do we produce this number by multiplying this number by this one over here? We can actually produce that by an AND gate. This, this number is one if both this bit is one and this bit is one.
- 16:20 – 25:59
Muxes and the cost of data movement
- RPReiner Pope
we'll, we'll walk back in time a little bit and, and see how did GPUs prior to Tensor Cores work, um, as which is the same way as the way that CPUs worked, in fact.
- DPDwarkesh Patel
Mm.
- RPReiner Pope
So which is like where do we stick this multiply accumulate unit? So generically I'll des-describe like a CUDA core or a CPU. You'll have some register file which stores some number of entries, maybe, maybe it's like eight, um, eight entries, um, uh, of, of like in this case I guess four-bit numbers, but typically like thirty-two bit numbers or something like that of, uh, which are numbers. Um, so this is the like inside the CUDA core I'll have some register file of some depth, and then I will have my multiply accumulate circuit multiply, uh, and accumulate circuit. And what it's gonna do, it's gonna like, it's gonna take three arbitrary registers from this register file, perform the multiply accumulate, and then write back to the register file. So it's gonna maybe write to this one, but it, it was able to read from this one, this one and, and another random one. So it'll, it'll take three inputs like this.
- DPDwarkesh Patel
Mm-hmm.
- RPReiner Pope
So this is the core data path of, of many, um, processors. Most processors look like this. You've got some set of re-registers and then you've got some set of, uh, logic units or ALUs. We wanna analyze the cost of the data movement from the register file to, to the ALU and back. So ultimately there's gonna be some circuit that says, well, I, I don't always have to select this guy, I might select any of the registers in it at any point in time. And so sort of a first question is how can I build a circuit? The circuit that I'm gonna look for is, uh, is, is a Mux. So, um, in this case it's gonna have eight inputs, one from each entry of the register file. Um, and it's gonna have one output which is actually, um, producing this output.And then, like, what is the cost of this thing? It's, it's, like, all we have to, to build it out of is and and or. Uh, and so how do we build it? We, we do the dumbest thing possible. We, like, form a mask saying we... Okay, when we want to read, like, the third entry, um, we're gonna and every single entry with either 1 or 0 based on whether that's what we want to read, and then we're gonna or all of them together.
- DPDwarkesh Patel
Okay. And ju- just to make sure I understand the basics. What the Mux is doing is it's just, like, selecting.
- RPReiner Pope
Just selecting.
- DPDwarkesh Patel
Just selecting an input.
- RPReiner Pope
Yeah. So, like, invisible to software, it's like you say I want input number three, that means there's a Mux.
- DPDwarkesh Patel
Yeah.
- RPReiner Pope
And so, like, what is the cost of this Mux? So an n input Mux, uh, operating on p bits... Well, I'm going to... So I have n, n rows. That's just eight rows, and I've got, uh, like each row is p bits wide. Well, I have to and every single bit, so I get n times p many and gates. Every single input I have to say, "Am I gonna, like, mask it out or not?" Um, and then, and then I'm gonna or them all together. And so, uh, there's gonna be, like, n minus 1 times p many or gates, which is saying, uh, I've got all of these different things. Almost all of them are 0s, but I need to sort of collapse them down into, into, like, from my eight options down into one option. And so every step I need to or, like, one row into, into an existing row.
- DPDwarkesh Patel
Got it. Yeah. It's actually kind of funny that you would sort of, um, you don't think at the level of hardware. You sort of just think like, "Oh, I'll just select element three," and something as simple as that is, uh, a sort of, like, in and of itself a quite complicated circuit.
- RPReiner Pope
Yeah. I mean, this is the first step of all of the hidden data movement costs-
- DPDwarkesh Patel
Right
- RPReiner Pope
... that show up. Yeah, yeah. Um, and so, like, the thing, like, we're, we're just gonna, like, compare, like, I have to pay this cost, and I've got one Mux here, and then in fact I have two more copies of that for each of the three inputs to my multiply accumulate operation. And so I have this cost, which is like, like, 3 times n times p and gates over here, um, compared to this p times q, um, like, uh, sort of gates in the actual circuit that I... That is doing the thing I care about. And if we plug in actual numbers, like this n being 8, um, like, I get, like, 24 times p gates over just, just in the data movement compared to, like, if q is 4, like, 4 times p gates just in the, in, in the adder, uh, uh, multiply adder.
- DPDwarkesh Patel
And, sorry, where is the 3 coming from?
- RPReiner Pope
Three different inputs here.
- DPDwarkesh Patel
Got it. Okay.
- RPReiner Pope
So the case I... Like, really just what I'm hinting at here is that, like, all of this work, which scales, like, as, as the size of the register file, and this is a very small register file, all of this work just moving the data from the register file to the, to the, to the logic unit, um, is many, many times more expensive than the logic unit.
- DPDwarkesh Patel
In the most recent ClusterMAX report, SemiAnalysis ranked almost 100 different GPU clouds. Crusoe was one of only five that made the gold tier. SemiAnalysis found that gold-tier providers like Crusoe had a total cost of ownership that was 5 to 15% lower than silver-tier ones, even when they had identical GPU pricing. This makes sense because total cost of ownership is downstream of a bunch of different things that don't necessarily show up in the sticker price, but that Crusoe has optimized. Things like how well you detect faults and how quickly you replace failed nodes. For example, Crusoe was one of the first clouds to adopt NV Sentinel, Nvidia's own GPU monitoring and self-healing software, for enhanced GPU uptime, utilization, and reliability. This lets Crusoe make use of everything that Nvidia has learned about why chips fail across all their different fleets and deployments so that Crusoe can catch faults earlier in the process. And once they identify a failure, Crusoe can swap in a healthy node in less than 10 minutes. Because you're not running bare metal, Crusoe doesn't have to spend time installing an operating system or configuring drivers. They can just spin up a new VM on an already running and pre-qualified host. If you want to learn more about this or the other reasons that Crusoe made gold tier, go to crusoe.ai/dwarkesh. It may be helpful to just see what a Mux looks like, maybe like a two-bit or a four-bit Mux.
- RPReiner Pope
Yeah. Great. So we'll take some inputs. We'll, we'll, we'll have maybe, um, uh, like... We'll just do a, a, a two-way Mux. So we've got, um, two different numbers. Um, uh, we've got these two inputs, and then we have a, um... So these, these are the inputs, um, that are being selected between, and then we have a selector which says... Which can either be like, "I want this one," or, or it could be, "I want the other one." So this is a one hot encoding. Um, so, so this is what, what we all start with. And then we... The output we want to produce, like, let, let's focus on this case. So, so this is the actual input we got.
- DPDwarkesh Patel
Yeah.
- RPReiner Pope
Um, we just wanna produce this guy as, as, as the result. Um, and so, like, sort of very laboriously what we do is we and this bit with all of these. And so, um, that produces, like, anding this bit with this row, and likewise we and this bit with this row. That produces all 0s. So this was the... There's four ands here. There's four ands here. And then finally we just or these two together, and this gives a 1. We or these two together, this gives a 1. We or these two together, it gives a 0. We or these two together and it gives a 1. And so this is the four ors. So, like, this actually ends up looking a little bit like addition, in fact. Like, we, we did exactly the set of same, same set of ands here. Um, sort of we've anded all of these things together, but then instead of-Collapsing it by using these full adder circuits, we just could get a very simple collapsing with OR gates
- DPDwarkesh Patel
And how-- But that, I guess that doesn't look like N times P
- RPReiner Pope
Um, so, so yeah, so this was, this was with, um, N equals two, um, inputs
- DPDwarkesh Patel
Ah, right
- RPReiner Pope
Um, in the general case, we will have, um, N, uh, like, uh, and so this is, this is N rows, um, and then we'll have, uh, P bits, um, uh, per row. Um, so that gives us the, um, the N pi- N times P many AND gates. So this circuit I've described here, almost all of the cost, like, uh, seven-eighths of the cost, uh, is, is in the reading and writing the register file, and only a tiny fraction of the cost is in the logic unit itself. So this is the problem to solve. Um, this, this essentially was the, the state of play, um, prior to the Volta generation of NVIDIA GPUs. This is what, what- the- this kind of thing is what was inside the CUDA cores. Um, and this, and this sort of problem statement is, uh, what motivated introduction of tensor cores, which are more generically called systolic arrays
- DPDwarkesh Patel
Mm.
- RPReiner Pope
So if we think about how are we gonna solve this problem, like we, we're spending almost all of our circuit area on something that we just really don't care about and is, and is like hidden to the software programmer, and, and the thing that we actually care about is, is, is not much of the area. Um, well, make this one bigger somehow while, while keeping this at the same size. That's the goal.
- 25:59 – 39:00
How systolic arrays work
- RPReiner Pope
So sort of the evolution was like, uh, we had baked this much into hardware, um, uh, in this stage. Like this single line is a multiply accumulate, um, and this was, this single thing was baked into hardware. Um, the idea of a systolic array is to sort of go two levels of loop up and, uh, and, and bake, uh, this entire loop out here in-into hardware.
- DPDwarkesh Patel
Mm.
- RPReiner Pope
And so the idea being that if we have a much bigger granularity fixed function piece of logic, um, maybe the taxes we pay on the input and output are much smaller
- DPDwarkesh Patel
It sounds like you're suggesting that if you have, um, if you go up one step in the, uh, in the matrix multiply loop, that there's some-- You, you can tilt the balance more towards compute than communication
- RPReiner Pope
That's right. So there's two effects that we're gonna take advantage of here. One is just that we can do more stuff before-- per every trip through a register file
- DPDwarkesh Patel
Mm-hmm
- RPReiner Pope
Um, and then the other thing we're gonna take advantage of is in fact, um, uh, in, in, in some of this loop, we can take advantage of, for example, uh, some- certain things staying fixed. So, uh, let's sort of visually we're gonna look at this matrix multiplication. So, so this portion of the loop corresponds to a matrix vector multiplication, in fact. So we'll, we'll take a matrix, um, and multiply it by, uh, a vector. So how do we do this? We take every column gets multiplied by the vector and then summed. So we're gonna sum sort of along columns. Um, so this zero and three gets multiplied by the three and seven and gets summed, and then the one and two gets multiplied by the three and seven and gets summed. So there is a multiply accumulate associated with every single one of these, um, entries in the matrix. So we'll just draw out these four multiply accumulates
- DPDwarkesh Patel
And just to make sure I understand why there's four multiply accumulates. So if each entry in the, uh, column that corresponds to the output vector is, uh, a dot product, and in this case it'll be like two multiplications and then the addition of those two multiplications. So like you're accumulating-
- RPReiner Pope
Yeah. So the addition-- So really there's only one addition per dot product, but like we like to start with zero
- DPDwarkesh Patel
But including the initialization of zero
- RPReiner Pope
Yeah
- DPDwarkesh Patel
Yeah
- RPReiner Pope
So the, the-- what we're gonna aim for is to have, um, so we've got to-- we want to have quadratically more compute. We do. We have. We have got sort of, um, X times Y, uh, as much compute as, as we had before. Um, um, but we're gonna want to somehow aim for having only X times as much, um, like, uh, communication. And, and this is sort of the, the intention so that we get this, uh, advantage, uh, term going as Y. So we've laid down the multiplications, bringing in like we're gonna want to bring in a vector of size two, and so that sort of already m- is in line with our columns target. That's fine. However, we need to somehow manage the communication of this matrix which, which, which does, which exceeds our budget of, of X. And so, uh, the idea is that in the-- in an AI context, this, this matrix is actually gonna stay fixed for a long period of time. And so instead of like bringing it in from the outside, so we've got some register file sitting over here, we, we don't want to have like the amount of stuff coming out of this register file. This is the term that we want to go, uh, sort of as, as X in some sense. Um, we don't want to bring this full matrix in from the register file every, every cycle because we don't have enough, um, that would cost too much in, in terms of wiring from the register file. And so instead we're gonna store, um, our, our key trick is that this matrix can be stored locally, uh, to the systolic array. And so, um, where we will store these numbers zero, one, two and three in, in just like a gate called a register that like physically stores these numbers and we're gonna reuse these numbers, uh, over and over again for a, for a large number of different mat- uh, different vectors
- DPDwarkesh Patel
And so the optimization here is that like the nature of matrix multiplication is you can store this like, uh, square quadratic thing directly where the logic is happening
- RPReiner Pope
Mm-hmm
- DPDwarkesh Patel
Um, and which is like higher dimension than theUh, or has an extra dimension compared to the inputs, which you keep swapping in and out.
- RPReiner Pope
That's right.
- DPDwarkesh Patel
I mean, this is the nature of what a matrix multiplication is, is that you do, uh, a lot of multiplication to get one value out. Like a dot product is a l- the result of a lot of multiplications. And so that optimization means that you, you're just, like, you can stuff a lot of, like, multiplication in before you get some value out of it.
- RPReiner Pope
That's right. That's right.
- DPDwarkesh Patel
Yeah.
- RPReiner Pope
So, like, just to complete the picture here of, of concretely how that looks, um, uh, I s- I swapped the three and the two here, um, three and two. Um, so just, like, this zero and three is gonna multiply by the three and the seven. And so, um, uh, we're gonna form a dot product sort of along columns here. So somehow we're gonna feed a three and a seven in here. These participate in sort of this feeds into this multiplication and also feeds into this multiplication. Likewise, the three feeds into here and also into here. And then we're gonna sum, uh, sort of along here with, like, starting at the top of a column, we feed in zeros, and then coming out the bottom, uh, we get results coming out. So sort of just to visually see what we've got, um, there's a dot product that is performed along columns in a matrix, and that, that sort of maps exactly to what is done spatially in the systolic array here. Um, so this is one dot product summed vertically, and this is a second dot product also summed vertically. Um, and then what is the data that needs to go into and out of the register file? We have X amount of data that's coming out here on the output, um, and we also have sort of, uh, this data coming from the input, X amount of data from the input. Um, and so with respect to the input and output vectors, at least, um, we, we've sort of met our goal of having only X as, uh, as much-
- DPDwarkesh Patel
Yeah
- RPReiner Pope
... data going in and out of the, um, uh, the register file. This leaves open the question of, like, I said that the, the weight matrices, weight matrix is stored locally in the systolic array. How did it get there in the first place? Um, is sort of a, like, at some point you need to boot your chip and populate this data, and so where did that come from? The trick is just we just do it very slowly. So we, we very slowly trickle feed it in into the systolic array. The, um, sort of the simplest strategy is that we, we sort of run this daisy chain that says, like, feed a number into here, and then, and then on the next clock cycle, it'll move down to the next entry of the systolic array. And so we can do that in every, um, in, in every column in parallel, and that gives us, um, sort of, uh, this is also gonna come from here, and this is gonna be another factor of approximately X units of, of, of bandwidth coming in.
- DPDwarkesh Patel
Uh, can, can you just-- would you mind repeating that sentence one more time?
- RPReiner Pope
So, like, we're sort of like we know that we're gonna be bringing in, uh, numbers only rarely into the matrix.
- DPDwarkesh Patel
Yeah.
- RPReiner Pope
Um, and so we just want to come up with any construction at all such that the amount of wiring that actually feeds into sort of crosses this boundary of-
- DPDwarkesh Patel
Yeah
- RPReiner Pope
... this systolic array, like this boundary right here, we just want to keep that bounded to X and not be-
- DPDwarkesh Patel
Yeah
- 39:00 – 51:40
Clock cycles and pipeline registers
- DPDwarkesh Patel
Where does the clock cycle of a chip come in? What, what determines what that is?
- RPReiner Pope
Yeah.
- DPDwarkesh Patel
And what is a clock cycle of a chip?
- RPReiner Pope
So I guess at, at baseline, it's, it's sort of worth observing that chips are incredibly, incredibly parallel, right? You've got-
- DPDwarkesh Patel
Yeah
- RPReiner Pope
... a hundred billion transistors in a chip. A key thing that you need to do whenever you have, uh, like massive parallelism i- is you need to synchronize between the different, uh, parallel units. Um, in software, typically you, like you, uh, you have these very expensive synchronization methods like a, a mutex. So one thread will finish what it's doing. It will inc- like it will, uh, grab a lock on somewhere stored in memory and then notify the other thread that it's done. Um, on chips, we take a very different approach and say that, uh, uh, every like nanosecond or so, um, all, all circuitry in the chip will kind of pause for a moment and then, and then synchronize every... So it actually synchronizes every single nanosecond or so. And so that is the clock cycle.
- DPDwarkesh Patel
Mm-hmm.
- RPReiner Pope
Um, the entire chip typically all in sort of one fell swoop, um, goes in lockstep to the next, uh, operation that happens. And so what this looks like in circuitry is that, um, you will have... It's typically drawn... So the clock is sort of mediated by registers, which are these storage devices that we've drawn elsewhere. Um, and the way to think of it is that I have, I have some storage, uh, which is storing like a bit, which might be zero or one, and then I have some sort of cloud of logic, which maybe is like this systolic array or this multiplier or something like that. Um, and then I've got some-- And that's gonna produce some output. So my inputs, I've got a bunch of inputs feeding to this cloud of logic, and then eventually later there's gonna be some output register that this writes to. There is a global clock signal which drives all of these registers, and it says at a certain instance in time when the clock, uh, uh, strikes, um, uh, whatever value happens to be on this wire at that instant, that's what's gonna get stored in there. And so the, the sort of the challenge here is like I, I would like to have my clock speed run as fast as possible because i- if I can run at two gigahertz, I can get twice as many operations done per, per, per second than if I r- run at one gigahertz. Um, but what that ends up meaning is that, uh, I'm very sensitive to the delay through this cloud of logic because, uh, any computation that, uh, is gonna happen in here needs to sort of finish before the next clock cycle hits.
- DPDwarkesh Patel
Right.
- RPReiner Pope
So, uh, a major point of sort of optimization o- on, on any chip then is, is to make, is to sort of make, uh, this delay, um, delay from here, uh, uh, as, as short as possible.
- DPDwarkesh Patel
Interesting. And, um, is there ever... Because w- the constraint here seems to be that if you add too much logic, then you might risk missing the clock cycle.
- RPReiner Pope
That's right.
- DPDwarkesh Patel
But if you don't add enough, then you're, you know, leaving potential compute on the table. Um, is there ever a situation where you're like you'd take a probabilistic chance that a compute- computation finishes and... Or is it just like, no, e- either it's gonna finish by clock cycle or not?
- RPReiner Pope
Yeah. In standard chip design, you, you margin it such that, I mean, there is a probability, but it's like many, many standard deviations, like way-
- DPDwarkesh Patel
Right
- RPReiner Pope
... standard deviations out, such that for all intents and purposes, it is a reliable part. It will, it will always, uh, meet the clock. Um, there are some weird exceptions to that. There are clock domain crossings where you go from one clock to the other clock, and then you actually do have to reason about this probability, but-
- DPDwarkesh Patel
Interesting
- RPReiner Pope
... um, uh, i- in the main path, uh, you just like, you margin it such that you'll get there like 25% of the clock cycle in advance, um, uh, so that it's very unlikely that it does.
- DPDwarkesh Patel
And this, and this, um, the clock, uh-Where the, uh, clocks synchronize, I guess, where the registers are, this is not something you determine as a chip designer. This is sort of just, like, an artifact of, hey, I want whatever sequence of logic. And then the software you use to convert your Verilog into the thing you te- send to TSMC, that just determines, like, hey, in order to make this work, you gotta kind of, you gotta put a register here, here, and here to make sure that there's a, there's no one step that is, like, too long, such that it makes the whole clock cycle of the entire chip longer than it has to be.
- RPReiner Pope
Yeah. So this is actually a huge part of the work of, of designing a chip, actually, is, is inserting them. So it is done in a combination of manually and automatically.
- DPDwarkesh Patel
Mm.
- RPReiner Pope
Um, so I mean, like, just, like, to show the very sort of dumb, uh, version of, like, what you can do here, you can take this logic and split it in half. And so, like, say, actually, instead of just one, uh, cloud of logic, I'm gonna, um, have two smaller clouds of logic, um, which do the same thing, but split them up by a register.
- DPDwarkesh Patel
Right.
- RPReiner Pope
Um, feeding in like this. Um, and this is like, like, if you split it, like, in the middle, you can hit twice the clock frequency. That's great. You get twice the performance, um, at the cost of this extra register, and so at the cost of some more, uh, storage.
- DPDwarkesh Patel
And so stepping back, why do we need to synchronize the whole chip? Like, if you have, like, if you imagine playing Factorio or something, there's no, like, global clock cycle. It just, shit is done when it's done. There's iron on the plate. You can take it if you want.
- RPReiner Pope
Yeah. So, uh, taking that analogy, um, the, the, the, the thing that you need to be mindful of is if I've got two different paths through some logic, so I have to do a computation like F here, um, and then computation G here, um, and then they're co- gonna come and meet for computation H somewhere here.
- DPDwarkesh Patel
Yeah.
- RPReiner Pope
Um, and so, uh, there's gonna be manufacturing variance here. Uh, in some chips, F will take a little longer. Maybe in some chips, G will take a bit, a little bit longer. And so if I've got some signal that's propagating through here, um, and the result from F and G have to sort of meet up at H, um, what can ... The, the thing that can go wrong is that F can get there early, and it meets, like, the previous value of G or the next value of G or something like that.
- DPDwarkesh Patel
Ah. And H needs to know when to start.
- RPReiner Pope
Exactly. Yes.
- 51:40 – 1:03:14
FPGAs vs ASICs
- DPDwarkesh Patel
Okay. So I remember talking to, um, an FPGA, uh, engineer at Jane Street, Clark, um, who actually helped me prep for your... the, the previous interview we did together, and he was explaining why they use FPGAs and, um, uh, I, I imagine that for high-frequency trading, the throughput is less important than latency, and so having very specific control over the clock cycle in a deterministic way is the most important thing. I, um, maybe it'd be interesting to talk about how you... w- why you can't just achieve that with an ASIC or why F- FPGA is the, uh, yeah, why you might use an FPGA to have deterministic clock cycles for like high fr- high-frequency trading.
- RPReiner Pope
Yeah. So I mean, firstly, like let's consider the sort of the business case for an FPGA versus an ASIC. Um, FPGAs and ASICs l- use largely the same sort of, uh, conceptual model, which is that I have, um, a series of gates built from ANDs, ORs, XORs, those like very s- small primitives, um, connected together with a fixed clock cycle, um, uh, and connected together with wires, uh, that are running at a fixed clock cycle. So anything you can express in an FPGA, you can express in an ASIC too, and it'll be about an order of magnitude cheaper and, and, like, uh, better energy efficiency on an ASIC than an FPGA. Uh, the trade-off is that the first FPGA costs you ten thousand dollars, whereas the first ASIC you make costs you thirty million dollars because-
- DPDwarkesh Patel
Right
- RPReiner Pope
... uh, because of, uh, it, it requires an entire tape-out. So sort of the business use case for an FPGA would be that I want something that has this very deterministic latency and, and, and, and fast runtime and high parallelism. Uh, but, um, uh, but, you know, is, is it, I, I'm gonna change it very frequently, change what I do every month or something like that. Uh, and so then I don't wanna pay the tape, tape-out cost every time. Now, how does an FPGA actually implement... It's sort of like it emulates the ASIC programming model, but in, in a, in a fixed piece of hardware. And so how does that work actually? So what it has, uh, at, at the, at the base is, um, it's got a, it's got a, um, it's got the two components we just talked about. It's got these registers as, as, uh, as storage devices, and then it's got, uh, these, these are called LUTs, lookup tables, um, uh, which actually provide all of the gates.
- DPDwarkesh Patel
Hmm.
- RPReiner Pope
So and then, and then we're gonna see even the, the sort of the third component, we, we then have like, uh, sort of a, a swarm of these, um, registers and LUTs. Um, and all of these are available, and then they are connected by sort of this big set of, um, sort of, uh, Muxes. So in front of, uh, every single one of these, we've got, uh, something like one of these Muxes, which selects one input from everywhere else, um, sort of selecting from all of these different things. Um, you know, we've got a whole bunch of different options feeding into, i- into, into, into all of these things. So, um, so what, what, what this allows is essentially a, uh... when I program my FPGA, I can say that I'm gonna take all of these components and I'm gonna sort of superimpose on top of this a particular wiring which like goes through this LUT and then feeds into this LUT and then goes to this register and then feeds into this LUT or something like that.
- DPDwarkesh Patel
Hmm.
- RPReiner Pope
So what I've drawn in orange is, like, how you... Like, FPGA means field programmable gate array. This is the, the orange is what has been programmed in the field, whereas the white is all of the wires that must exist in the FPGA in order to actually t- uh, to make the device in the first place.
- DPDwarkesh Patel
What does it mean to be programmed into a field?
- RPReiner Pope
Like, programmed in the field, so, like, the device has been deployed in a data center, it's sitting in the field, and then you can-
- DPDwarkesh Patel
Ah
- RPReiner Pope
... come, come and program it.
- DPDwarkesh Patel
Not field as in, like, electric field.
- RPReiner Pope
No, no.
- DPDwarkesh Patel
Field as in, like, out there in the world field.
- RPReiner Pope
Yeah, yeah, in the world.
- DPDwarkesh Patel
Okay. Um, and so if I see, look at the, how the, uh, the field programming comes out of the first, uh, lookup table and goes into the second one, how is it... H- yeah, how-
- RPReiner Pope
W- like, like, where, where are the wires that made that happen, I guess is-
- DPDwarkesh Patel
Yeah, yeah
- RPReiner Pope
... yeah. So I, I, I got a little bit, like, lazy in drawing all of these. E- every single device here has a Mux sitting in front of it, um, uh, uh, which can select from all of the, like, nearby, um, it, like, uh, circuits that are available.
- DPDwarkesh Patel
Yeah.
- RPReiner Pope
Um, and so, uh, the actual configuration of the FPGA is, like, amounts to it, it is the Mux control. So, like, we, in this Mux here, we have the data inputs, and then we have, like, the control that selects.
- DPDwarkesh Patel
Yeah.
- RPReiner Pope
Uh, and so, like, there's a little storage device sitting next to every single one of these, um, uh-
- DPDwarkesh Patel
Oh
- RPReiner Pope
... Muxes saying, "This is where you're gonna source your input from."
- DPDwarkesh Patel
Got it.
- RPReiner Pope
Um, and so programming it consists of, like, configuring every single one of these Muxes.
- DPDwarkesh Patel
So that makes sense. What is happening inside of the lookup table?
- RPReiner Pope
Yeah. So the purpose of the lookup table, so it's gonna also have a little bit of control feeding, telling it what to do as well. The purpose of the lookup table is to function, to be able to configurably take the role of an and gate, or gate, XOR-
- 1:03:14 – 1:07:16
Cache vs scratchpad
- DPDwarkesh Patel
One important point he made to me is that the reason they prefer FPGAs to CPUs is because they get deterministic clock cycles. They know when a packet will come in and go out. Um, why is it not a guarantee in CPUs?
- RPReiner Pope
Hmm. So you can actually design a s- CPU that has deterministic latency as well. Um, uh, and in fact, like the, the processes that are inside a lot of, uh, AI chips actually also have deterministic latency too. Groq has advertised this. TPUs have that in the core as well.
- DPDwarkesh Patel
Right
- RPReiner Pope
Um, the challenge is getting sort of deterministic latency and high speed at the same time. And so, um, where does the non-determinism in latency come from? Um, non-deterministic latency, um, comes from specific design choices in a CPU. Um, uh, it's actually possible to remove those design choices and make a CPU that has deterministic latency. Um, those are not very attractive in the market, and so people don't make those CPUs anymore. Um, but, uh, but, but, but actually in some sense, like, uh, deterministic latency is maybe a sort of a simpler designing start, starting point. Um, and then, and then like some chip designers have added, uh, things in to, to be non-deterministic. To take a concrete example of that, um, the, probably the most important example is on a CPU just like the, the, the, uh, the CPU cache itself. So, uh, in a CPU you have the, the CPU. This is, this is the CPU die itself, and then there is a memory, uh, off on the side. This is the DDR memory, um, off on the side. And then you have a cache system here. Inside it is the cache, uh, that sort of remembers recent accesses to DDR and, and, and stores them. And so, uh, when, when I'm running through my CPU instructions, uh, every time I, uh, I have an instruction that accesses memory, it first checks in cache, uh, was, was the data stored in cache, and then if not, it goes, uh, it fetches out to DDR.
- DPDwarkesh Patel
Yep
- RPReiner Pope
This is a huge optimization. The cache is like two orders of magnitude faster than the DDR. Um, if you never, uh, if, if, if you never like use the cache, like all, basically all programs would run 100 times slower. So, uh, the presence of a cache is absolutely necessary for a CPU to, to run at reasonable speed. Um, but whether or not you get a cache hit is dependent on the sort of ambient environment of the CPU. Like what other programs are running, what has run recently, what is the random number generator inside the cache system doing? And so, so that is a big source of non-determinism in, in the runtime of a CPU.
- DPDwarkesh Patel
Yeah
- RPReiner Pope
Um, so this is sort of the memory system for a CPU. Um, the, the big thing that you can do differently is, uh, instead of having the hardware say, "I'm gonna like read ca- uh, read memory," and then decide, the hardware decides whether or not it comes, comes, comes from cache or not, you can actually bake this in, this decision into software. So, uh, a different design philosophy is, is to, um, so and you see this in maybe, for example, TPUs. Um, the, the TPU instead has... I mean, I'll draw the same diagram, but I'll call it a scratchpad. And so the, the main difference is, um, so this would be like a TPU and then like HBM in this case rather than DDR, but it's still an off-chip memory. Um, and instead of like the software saying, "First access like memory," and then the hardware decides, um, you've got some instructions that, that go here, this is like one kind of instruction, and then a totally different kind of instruction that goes to HBM.
- DPDwarkesh Patel
Yeah
- RPReiner Pope
And so this style is, is generically known as scratchpad, um, inst- instead of cache. The key distinction being that you have like one kind of instruction that says read or write scratchpad, and a totally different instruction that says read or write HBM.
- DPDwarkesh Patel
So scratchpad being the cache?
- RPReiner Pope
Yeah. This, this thing here is the, is the scratchpad.
- DPDwarkesh Patel
All right.
- RPReiner Pope
Um.
- DPDwarkesh Patel
Just to be clear.
- 1:07:16 – 1:11:49
Why CPU cores are much bigger than GPU cores
- DPDwarkesh Patel
So stepping way back, people say computers have the "Von Neumann, uh, uh, Von Neumann architecture" where there's this-Uh, serial processing of information, and maybe just because we've been talking about parallel accelerators, but I just don't ... Like, the FPGA is super parallel.
- RPReiner Pope
Yeah, yeah.
- DPDwarkesh Patel
The, um, the AI, the kinds of AI accelerators, the TPUs-
- RPReiner Pope
Yeah, yeah
- DPDwarkesh Patel
... are super parallel. Um, even CPUs are super parallel, if you think about all the cores they have. And so is it actually, like, in what sense is modern hardware actually the von Neumann architecture? Is that actually a fair way to describe modern hardware?
- RPReiner Pope
Um, I think it's a fair way to describe CPUs. Like, just the amount of parallelism, like, on a CPU, the amount of parallelism you get is about 100 cores times maybe, like, 16 way vector units, so 106, uh, about 1,000 way parallelism on a CPU.
- DPDwarkesh Patel
Yeah. I- I- I ... So one question is: what is the ... There is a die that is being used for the CPU.
- RPReiner Pope
Mm-hmm.
- DPDwarkesh Patel
And if there's fewer threads, just as a matter of, like, transistor, uh, voltages are, like, switching on and off, is it just that there's, like, literally one control flow, like a small part of the die where, like, voltages are switching on and off? Or, like, in what ... How, how do you actually occupy the die area of a CPU if there's onl- as opposed to-
- RPReiner Pope
If, if there's so few cores, like, what are you ... Am I spending all of the die on-
- DPDwarkesh Patel
What is happening in there? Yeah
- RPReiner Pope
... yeah. The cores are just much bigger and more complicated. Um, so, uh, I mean, like, so I guess we, we should compare, like, a CPU core, which takes up one 16- one 100th of the die to, like, I mean, to a LUT. Like, a LUT is just only these 16 gates.
- DPDwarkesh Patel
Right.
- RPReiner Pope
So, like, it's clear why there are so many more LUTs in an FPGA than cores in a CPU. Um, but then sort of maybe the, the, like, w- why are there more CUDA cores, for example, than, than, than CPU cores, I think would be, like, why a ... Like, what's the difference between a CPU and a GPU or something like that-
- DPDwarkesh Patel
Yeah
- RPReiner Pope
... would, would, would be a, a, a big difference. Um, inside the CPU, you have, um ... So one big use of ... So the sort of the top unit, uh, uses of area inside a CPU are the cache. Um, very little is actually the ALUs. Um, like, mostly it's, like, these register files rather than the logic units. Um, and then the, uh ... Both of these things have equivalents in a, in a GPU, and so that's not a big difference. But the thing that does not have an equivalent in a GPU is the, um, sort of this branch predictor. And so there is a whole big area in the CPU which is, uh, sort of, uh, just a whole bunch of predictors that are saying, um, when will my next branch be and where are the, where is the branch target for that? And so, uh, stripping a lot of that out, uh, as, as well as sort of making these register files tighter in, in a sense, um, is, uh, driving a lot of where the, like, GPU gains over the CPU.
- DPDwarkesh Patel
And what is the purpose of the branch predictor? To, like, execute both branches at once? Or what does it do?
- RPReiner Pope
So the issue is that when I've got a series of instructions like, um, like, instructions, instructions, instructions, instructions, I, uh, if I have a branch like here, if this instruction is branch, um, the, the actual processing step of processing an instruction, um, takes a really long amount of time. It takes, like, maybe five nanoseconds or something like that. Um, so, like, uh, the time to actually notice that I've got a branch and then, like, uh, uh, evaluate the Boolean whe- whether it's true and then, and then update the program counter to the new target and then read from the instruction memory for that, um, that, that could take, like, like, actually five nanoseconds to, to finish. And so in reality, this may finish somewhere down here. Um, I don't want to ... Like, but, like, I want to run a clock speed that is much faster than what five nanoseconds allows. Like, I, I ... Like, five nanoseconds is 200 megahertz clock speed. I would like to run at 1 or 2 gigahertz or something like that. And so, um, I need to run other instructions while the branch is, is being evaluated. And so I, like, I really just wanna keep running the following instructions that happen after me. Um, but that might have been wrong. Like, if the branch ended up being taken, then I need to know that instead of evaluating these instructions, I actually need to, like, jump to wherever the target is and, and run, run these instructions instead. And so the purpose of the branch predictor is, like, like, genuinely to predict based on, like, before you even get to this instruction to be, like, five cycles earlier to predict there was gonna be a branch that's gonna happen.
- 1:11:49 – 1:15:22
Brains vs chips
- DPDwarkesh Patel
So if I think about how the brain works versus what you're describing here, at a high level, the differences might be that while you can do structured sparsity in these accelerators and then save yourself some, um, area that you would have otherwise had to dedicate to these gates, in the brain, there's unstructured sparsity. You know, any neur- neuron can connect to any other neuron and not in, like, ways where they'd have the column aligned or whatever.
- RPReiner Pope
Yeah, yeah.
- DPDwarkesh Patel
Um, then there's the fact that memory and computer co-located. I guess you could say in a way the memory and computer co-located on these dies.
- RPReiner Pope
This, this is exactly the co-location in some sense of the memory and computer.
- DPDwarkesh Patel
That's right. That's right. Yeah. Yeah, so may- maybe that actually isn't a big difference. And the other, maybe a big, a, a big difference is that the clock cycle on the brain is much slower than on, um, on computers. And partly that's to preserve energy because the faster the clock cycle, the bigger the voltage, uh, needs to be in order to identify, uh, for the signal to settle and to, like, i- identify what state a transistor is in.
- RPReiner Pope
Yeah, that's right. That's right.
- DPDwarkesh Patel
I don't know if you have other high-level takes about, like, how ... Any commentary on what, you know, what the brain might be doing versus how these chips work.
- RPReiner Pope
Yeah, I mean, so, so let's take the clock speed one, uh, first, actually.
- DPDwarkesh Patel
Yeah.
- RPReiner Pope
Um, the clock speed is quite high, uh, on, on a chip because that, I mean, drives higher throughput, um, like-When we compare a, like, a GPU running some, some workload, it's running batch size 1,000 or something like that.
- DPDwarkesh Patel
Yeah.
- RPReiner Pope
Whereas, like, the brain is not running batch size 1,000. [laughs]
- DPDwarkesh Patel
Right. [laughs]
- RPReiner Pope
There's only one of me. And so you could sort of imagine saying, "Well, take a GPU and, like, instead of running at a g-gigahertz, run it at megahertz," or something like that, and, and that would start to look maybe a little bit more like, like, um, uh, sort of equivalent things that you're talking about in the brain. Um, there is, in the way that silicon works, um, there are... Like, that does not give you an 1,000x advantage in energy efficiency. So, um, what it ends up looking like is you can, uh, like, you sort of just end up running this circuit, uh, once to stabilization, and then it'll sit idle for a long period of time. Um, it doesn't consume a lot of energy while it's sitting idle because, uh, most of the energy is consumed in, in sort of toggling, uh, bits from zero to one and back. Um, uh, so actually, let's, let's talk about the energy consumption, uh, of, of a circuit like this. The way to think of, uh, a bit being stored is you've actually deposited some charge in, in a capacitor somewhere, s-sitting somewhere in, in the chip, uh, implicitly. So it, it becomes charged when it is, um, bit becomes a one, and then it, it becomes discharged when it next goes to a zero.
- DPDwarkesh Patel
Yeah.
- RPReiner Pope
Um, and that cycle of, like, charging the capacitor and then dumping that charge out to ground, that is where the energy is consumed.
- DPDwarkesh Patel
Yeah.
- RPReiner Pope
This is called the dynamic or switching power. Uh, this is most of the e-energy consumption of a chip. There is some other energy consumption just coming from the fact that, uh, um, uh, insulators aren't perfect insulators, but we'll dis-discard that.
- DPDwarkesh Patel
Right.
- RPReiner Pope
Most of the energy consumption, uh, actually comes from just the charging and discharging of, or like toggling from zero to one and, and back to zero.
- DPDwarkesh Patel
Right.
- RPReiner Pope
Um, so if you run a chip much slower and you only clock it once every 1,000 clock cycles or something, you will have 1,000 times fewer transitions. It'll be about 1,000 times less energy consumption. Um, but, but, like, but not a substantial advantage in energy efficiency.
- 1:15:22 – 1:20:18
A GPU is just a bunch of tiny TPUs
- DPDwarkesh Patel
Um, okay, so you described how a TPU works at a high level. Um, what is the difference, uh, at a high level between how a GPU and a TPU work?
- RPReiner Pope
Yeah. So I mean, I think there's sort of a high-level organization principle that is different. Um, and then there's sort of inside the cores where they're different. But w-we'll look sort of, uh, outside the, uh, like, at, at the high level, um, so we'll take a GPU, um, and a TPU and what does, like, sort of the top-level block structure look like. Um, if you think of this as the whole chip, um, in each case, um, the organization of, of the GPU, um, uh, is mostly a bunch of, uh, almost identical units, which are these, um, these are the SMs. Um, and then they've got a, an L2 memory in the middle, and then a bunch more of these SMs on the bottom. Um, and so, uh, there's sort of this f-fairly regular grid of, of cores. Um, uh, and then, like, if we look at a, at a TPU in, uh, in comparison, um, you, you s- uh, you end up m- with much coarser grained units of, uh, logic, and so you, you end up with something like, um, uh, some large number of, um, maybe, like, maybe just a few M- uh, matrix units. These are the, um, these are the big, like, um, systolic arrays, and then in the middle you've got some vector unit, and then you've got your, um, matrix units at the bottom. So, um, now sort of like matrix units with a vector unit in the middle, sort of this is the whole TPU chip. You can sort of think of take- scaling this thing down into a really tiny unit with a smaller matrix unit, smaller vector unit, and that is sort of what a, a, an SM is. So sort of at a very high-level point of view, the, the GPU has a lot of tiny, tiny TPUs, um, sort of tiled across the whole, uh-
- DPDwarkesh Patel
Oh, interesting
- RPReiner Pope
... the whole chip.
- DPDwarkesh Patel
So, like, you're suggesting the tensor core within a streaming SM is analo- analogous to an MXU?
- RPReiner Pope
Yeah, it's, uh, very, very similar, yeah.
- DPDwarkesh Patel
I see. And so if you had more, like, more lack of structure, having a bunch of tiny TPUs makes a lot of sense, whereas if you kind of just have, like, huge matrix multiplications, you're like, "Why don't we just, why don't we avoid the co- the cost of having the individual SMs with their own registers and warp schedulers and things like that? Why don't we just, like, make a huge thing and, like, amortize those costs-
- RPReiner Pope
Yeah
- DPDwarkesh Patel
... across the whole thing?"
- RPReiner Pope
And I mean, I think this shows up in how large you can grow things. We've, we've sort of seen this theme, like especially with the systolic array, where larger systolic array amortizes the register file costs better.
- DPDwarkesh Patel
Yeah.
- RPReiner Pope
Um, this sort of design allows you to have larger systolic arrays. This- whereas the sort of GPU design constrains you to having small units, uh, o-of everything. Um, there is a trade-off, however. Um, the, there ends up being, um, because of this sort of coarse-grained separation of things there, you need to move a lot of data from the vector unit to the matrix units. Um, and so, like, you need to move a lot of data through, through a sort of a, like, two, two lines of parameter here. Um, whereas if you sort of look at the equivalent thing here, you've got vector units everywhere, and you need to move data through this line, through this line, through this line, through this line, through this line, through this line. So the amount of data you can move between a vector unit and a matrix unit is actually much higher in, in a GPU than in a, in a TPU because, because, like, it's, like, instead of having to, like, move all the data through these just two lines, you're moving all this data through, like, 16 lines or something of, of, o-of, of wiring instead in a GPU.
- DPDwarkesh Patel
Right. But also you might have to move i- a-across less area.
- RPReiner Pope
Which, I mean, is also a saving, like it's an energy saving as well.
- DPDwarkesh Patel
Yeah, that's right. Yeah.
- RPReiner Pope
So, so data ends up moving, like, if you can operate entirely within s- an SM-
- DPDwarkesh Patel
Yeah
- RPReiner Pope
... the data movement is much smaller. But then the moment you want to operate across SMs-
- DPDwarkesh Patel
That's right
- RPReiner Pope
... like, it, it becomes sort of more c- more complicated and expensive.
- DPDwarkesh Patel
Yeah. So you don't have to comment, but one might expect that a thing MatX might try to do is to get the GPU, like smaller structure of, um, systolic arrays surrounded by, um, SRAM, but also at the same time make it so that, like, the things you need in an SM to support the CUDA architecture, uh, but take a bunch of, bunch of space you might discard.
- RPReiner Pope
Yeah. Um, we've talked publicly about something which we call a splittable systolic array, which is sort of, in, in some sense you can think of as like big systolic arrays that can be small systolic arrays as well.
- DPDwarkesh Patel
Cool. Um, okay, I think that's a good note to close on. Reiner, thank you so much.
- RPReiner Pope
Thanks, Dwarkesh.
Episode duration: 1:20:19
Install uListen for AI-powered chat & search across the full episode — Get Full Transcript
Transcript of episode oIk3R-sMX5o
Get more out of YouTube videos.
High quality summaries for YouTube videos. Accurate transcripts to search & find moments. Powered by ChatGPT & Claude AI.
Add to Chrome