Reiner Pope – Chip design from the bottom up

May 22, 20261h 20m · 14,451 words

Open in Steadcast for Mac Apple Podcasts Overcast

Show notes

New blackboard lecture with Reiner Pope: how do chips actually work - starting with basic logic gates, and working up to why GPUs, TPUs, FPGAs, and the human brain each look the way they do. Reiner is CEO of MatX, a new chip startup (full disclosure - I’m an angel investor). He was previously at Google, where he worked on software efficiency, compilers, and TPU architecture. Watch this one on YouTube so you can see the chalkboard. Read the transcript.

Highlighted moments

almost all of the cost, like seven eighths of the cost is in the reading and writing the register file. And only a tiny fraction of the cost is in the logic unit itself.

25:18

“the thing that does not have an equivalent in a GPU is the, um, sort of this branch predictor. And so there is a whole big area in the CPU, which is, uh, sort of, uh, just a whole bunch of predictors that are saying, um, when will my next branch be and where the, where's the branch target for that?”

1:09:46

“whereas if you sort of look at the equivalent thing here, you've got vector units everywhere and you need to move data through this line, through this line, through this line, through this line, through this line, through this line. So the amount of data you can move between a vector unit and a matrix unit is actually much higher in, in a GPU than in a, in a TPU”

1:18:53

Transcript

0:00I'm back with Rainer Pope, who is the CEO of Maddox, which is a new AI chip company. Last time we were talking about what happens inside a data center. Now I understand what happens inside an AI chip. How does a chip actually work? Full disclosure, by the way, I am an initial investor in Maddox. So hopefully you have designed a good chip. Also, if you're listening to this on an audio platform, it's much preferable to watch this Blackboard lecture on a platform where you can see what's happening. So switch over to YouTube or Spotify. So I'll start with sort of the very smallest fundamental unit

0:34of chip design, and we'll sort of build up into what an overall, like actual production chip, what are the components of that. At the very bottom level of a chip, the primitives that we work with are logic gates, which are very simple things like and or not. And then these are connected together by wires that have to be laid out physically as metal traces on a chip. The main function that AI chips want to compute is multiplication of matrices, and really inside that is the fundamental primitive is multiply-accumulative, just like of pairs

1:07of numbers. So we're going to sort of demonstrate what that calculation looks like by hand, and then sort of infer what a circuit would look like for that. It'll turn out to be sort of easiest if I do multiplication-accumulative, something like a four-bit number with another four-bit number. And then we're going to, the actual clearest primitive is actually multiply-accumulate. So there's a multiply these two terms, and then we're going to add in, so product of these two terms,

1:43and then we're going to add in an eight-bit number. And can I ask a clarifying question? Why is this the natural primitive for whatever computation happens inside a computer? Yeah, so there's a few reasons for this. It's a little bit more efficient, but the reason it's natural for AI chips is that if you look what's happening during a matrix multiply,

2:15what is matrix multiply in very short? It is, there's a for loop over i and over j and over k of output i k plus equals to input i j times other input j k. And so multiply-accumulate happens at every single step of a matrix multiply.

2:48Makes sense. And then the other observation is that the precision will almost always be higher in the accumulation step than in the multiplication step. This is maybe specific to AI chips, but you're multiplying low precision numbers, but then when you accumulate, errors accumulate quickly, and so you need more precision here. So this is why we've chosen to do a four-bit multiplication and an eight-bit addition. Let me make sure I understood that. There's two ways to understand that. One is that the value will be larger than the inputs, and the other is that if it was a floating-point

3:19number, it would be, maybe that part is like less intuitive to me, but it's maybe the same principle? It is really the same principle. I guess the sort of, I mean, I guess the separate principle is that as you are summing up this number, you are summing up a whole bunch of numbers, and so you've got a lot of rounding errors accumulating. That's right. Whereas in this case, there's like, there's only one multiplication in that chain, and so there's not a lot of rounding errors accumulating in the multiplication. Why are you summing up a whole bunch of numbers? It's just two numbers, right? Just, I mean, this summation happens, it's repeated j-j many times.

3:49Ah, I see. So, yeah, any errors accumulate. I see. Yeah, makes sense. So how would we perform this calculation by hand? I mean, as a human, we would probably separate it into two, but we can sort of do it all in one using long multiplication. So the multiplication term first, we're going to multiply this number, this 4-bit number here, by every single bit position in the other 4-bit number. So we write that out. First, 1, 0, 0, 1 multiplied by this bit position.

4:20That is this number itself. Then shifted across by 1, we're multiplying by 0. That gives us an old 0s number. Shifted across even one more to multiply by this one, we get 1, 0, 0, 1. And then finally, for this last bit position, we get an old 0s number again.

4:47So this sort of gives us a bunch of terms that we're going to have to add for the multiplication. And then while we're doing that summation of this, we might as well add in the actual accumulated term as well. So we just copy that directly across.

5:08So this is the sum. It's a five-way sum that we're going to want to compute.

5:17So firstly, what logic gates did it take us to even get to this intermediate step? We needed to produce all 16 of these partial products. How do I produce one of these partial products? So let's take this number 1, for example, here. It is 1. So how do we produce this number by multiplying this number by this 1 over here? We can actually produce that by an AND gate. This number is 1 if both this bit is 1 and this bit is 1.

5:47If either of them is 0, then the multiplication of 0 times anything is 0. So to produce all of this stuff, we ended up consuming 16 AND gates.

6:00Or in the general case, if I were doing a P bit multiply times a Q bit multiply, then this will be like P times Q, many ANDs. Exactly.

6:21Finally, I sum them. Actually, most of the work is going to happen in the summing. And so let me describe the other logic gate that we use here. AND is almost the simplest logic gate that exists on a chip. It's almost the smallest.

6:37At the other extreme, typically the very largest logic gate that you'll use is something called a full adder.

6:46And what this does is sort of it does like coming from software, you might think that a full adder like adds 32 bit numbers together. In this case, it just adds three single bit numbers together. And so you can think of it as like adding 0, 1 and 1 together.

7:01Now, when I add these together, the result can be 0, 1, 2 or 3. So I can express that in binary using just two bits. So as input it has three bits and as output it has two bits, which in this case the number two in binary is 1, 0. So this is also known as a 3 to 2 compressor

7:26because it takes three bits of input and produces two bits of output. The two inputs are an X and a Y value and then some carry that came in from like. Sorry, the three inputs are all bits that are in the same sort of bit position, like three bits that are in a column here. Yeah. And then the two outputs, I have sort of drawn them vertically here and horizontally here to kind of match this vertical versus horizontal layout here, which is expressing that things that are in the same column

7:56or in the same like bit position. Yeah. Whereas things that are in adjacent columns, like this is a carry out where, whereas this was the sum. So if the inputs in the full adder are, let's say like 1, 0, 1, then the output would still be 1, 0. If it was 1, 1, 1, it'd be 1, 1. It was 0, 0, 0. It'd be 0, 0. It was like 0, 1, 0. It'd still be 0, 1. So yeah, it's just counting essentially the number of, the number of things and expressing that in binary. So this circuit actually can sort of capture what we as humans naturally do

8:28when we're doing something along a column. So I can show that sort of, I'll show sort of one iteration of using the full adder to sum. The way I sum here is going to be a little bit unnatural for humans. Humans, we would sort of sum along the column and then remember the carry. But instead of remembering the carry, we'll actually just explicitly write it out. So in this, we proceed from the right-most column towards the left. On the right-most column, we sum the 1 and the 1. And that produces like a 0 here and a carry of 1.

9:01So we've sort of used this full adder circuit on this pair of bits and produced a pair of bits as output. Now we can do the same thing with this column. We've got a column of 1, 2, 3, 4 numbers. And so maybe we'll like take the first three of them, run a full adder on them and that gives us a 0 and a 0 as output. So like the sum of these is 0, 0. So that's the full compressor, full adder applied to all of these bits. As I've used up bits, I'm going to sort of just cross them out

9:32to indicate that I've handled them.

9:36Let's just keep going a little bit more. So we'll go here. I'll take these three numbers. I add them. That gives me a 1 and a 0. I've dealt with these three numbers. And now I take 1, 2, and I can even take these three numbers, for example, right now and add them. And that gives me a 1 and a 0 and I've dealt with these numbers.

9:58So I can sort of like the way I should view this is that I have this whole grid of numbers that need to be added. I'm going to just keep applying full adders to all the bits that are here, constantly removing three numbers from a column and then writing out two numbers as output. Keep going with this over and over and over and again until I eventually get like some, just one single number coming out here.

10:22Something like that. This is probably the wrong sum. So this approach that I've described here, this is called a data multiplier. And this is sort of like the standard for how you do area efficient multipliers using full adders. Let's try and quantify the circuit size of this just so we have got a sense of like how big things are so we can compare it to them later.

10:55How many full adders do I, did I use? I started with how many numbers? How many numbers? I have the 16 partial products, which is the product of all of these terms with all of these terms. Plus the eight terms that I'm adding here. So I started off with 24 bits. And then on, I produced eight bits on the output eventually. And in every step I was sort of crossing off three numbers and writing two numbers out as a result.

11:27And so every single use of a full adder eliminates one of the bits here. And so how many full adders, it must be the 24 minus the eight. So that there were 16 full adders in this circuit. In general, this is true in the general case as well. There will be P times Q many full adders in this circuit. Let me make sure I understand the logic of that. So the input bits 24 is P times Q plus P plus Q. That's right.

11:57And the output bits is just P plus Q. And so P times Q plus P plus Q minus P plus Q equals P times Q. That's right. So I think this explains sort of, or at least hints that the second reason why we chose to do a multiply accumulate. First reason being that's actually what shows up in matrix multiplication. But second reason being it gave us this very slick P times Q, very simple algebra. So we've sort of described like this, this whole procedure, every single atomic step that I took here becomes a logic gate and then sort of the wires connected together.

12:36Like when I had these three inputs that, that I salvaged to produce these two outputs. Like if I think of mapping this to a physical device, it would be a wire that runs sort of connecting all three of these things together into a logic gate that produced this output. Okay, so this is the, the main primitive at different bit widths that, that is, that is inside an AI chip. Um, uh, we're going to build up from here to, uh, how would you use that to run all of the other operations you want?

13:07This might be the wrong time to ask this question, but whenever NVIDIA reports that this chip can do X many FP4 or half as many FP8, it seems to imply that those circuits are fungible. Uh, that there's not as dedicated, uh, that there's not as dedicated like FP4 versus FP8. Um, but the way you, you're mapping it out here, it seems like you would need, if it has to be mapped out in the logic, you would need a dedicated, um, FP4 multiply accumulate and then a dedicated FP8 accumulate.

13:39Basically, can you, can you funge them? As drawn, they're actually not particularly fungible. Um, this is actually one of the main choices you have to make when designing a chip, which is how much of FP4, how much of FP8 do I have? And then sometimes I'll make that consideration from the point of view of like, uh, what do I think that is the customer requirement? Another way to take an angle on that is to say, um, what is the, what is the power budget for, uh, equalize the power budget between FP4 and FP8. But so then when they report those numbers and they just happen to be the case of like, it does 2x as many FP4 as FP8, they just happen to choose like, give, um, equivalent die areas to all the floating points.

14:18And as a result, that ended up being like, why is the ratio exactly 2x? Yeah, exactly. Yeah. So, um, part of it is, I mean, surely that won't be exactly like, exactly equivalent to die area. Um, there's, uh, there's a data movement reason actually. And we'll maybe come back to this when we sort of, uh, look through how it goes into and out of memories. Yeah. Um, there's something really nice just from a software level of the fact that I can pack two 4-bit numbers into, into the same storage as an 8-bit number. And so when I store that to a memory or something like that, uh, it's the, the sizing of the, um, the buses that I wire within the chip, um, actually makes that work out really, really nicely.

14:55Actually, come to think of it, it's not just 2x. It would be, uh, the, the amount of area it takes, it sounds like is, um, quadratic. It's quadratic, in fact. With the, yeah. With the bit length. So, uh, that's why smaller precision is like even more favorable. This is a really big reason. So, um, in fact, Nvidia made a change, um, historically, uh, up until B100 or B200, every time you have the bit precision, you, you double the, um, the flop count.

15:26Um, that ratio is exactly like, for the reason you said, because of this quadratic scaling, that ratio is actually slightly wrong. Um, it should be like an even bigger, you should get an even bigger speed up than you, than you might otherwise think. Um, Nvidia's like product specs have sort of started acknowledging that in B300 and beyond, where the FP4 is three times faster than the FP8. Right. Though it should be 4x. Yeah. What I've shown here is like the simplest case of integer multiply. When you're, when you're dealing with floating point, as you do in FP4 and FP8, um, there's this sort of other term, which is the, the, the exponent that just complicates this calculation.

16:03Okay. So, so what can we see already from this? It's like, I think the, the big observation you've made is that there's this quadratic scaling with bit width, um, which, which is like very effective and, and is the single reason why, um, low precision arithmetic has worked so well for neural nets. Um, but the other thing we're going to do now is we're going to compare sort of the area spent on the multiplication itself with all of the circuitry that is around it.

16:29So we'll, we'll walk back in time a little bit and, and see how did GPUs prior to tensor cores work, um, as, which is the same way as the way that CPUs worked in fact. Hmm. So which is like, where do we stick this multiply accumulate unit? Um, so generically I'll describe, describe like a CUDA core or a CPU. You'll have some register file, which stores some number of entries.

16:59Maybe, maybe it's like eight, um, eight entries, um, uh, of, of like, in this case, I guess, four bit numbers, but typically like 32 bit numbers or something like that, of, uh, which are numbers. Um, so this is the, like, inside the CUDA core, I'll have some register file of some depth. And then I will have my multiply accumulate circuit, multiply, uh, and accumulate circuit. And what it's going to do, it's going to like, it's going to take three arbitrary registers from this register file, perform the multiply accumulate, and then write back to the register file.

17:33So it's going to maybe write to this one, but it, it was able to read from this one, this one, and, and another random one. So it'll, it'll take three inputs like this. So this is the core data path of, of many, um, processes. Most processes look like this. You've got some set of registers and then you've got some set of, uh, logic units or ALUs. We want to analyze the cost of the data movement from the register file to, to the ALU and back.

18:04So ultimately there's going to be some circuit that says, well, I don't always have to select this guy. I might select any of the registers in it, at any point in time. And so sort of a first question is how can I build a circuit? The circuit that I'm going to look for is, uh, is a MUX. So, um, in this case, it's going to have eight inputs, one from each entry of the register file. Um, and it's going to have one output, which is actually, um, producing this output.

18:35And then like, what is the cost of this thing? It's, it's like all we have to, to build it out of is the output. Uh, to build it out of is AND and OR. Uh, and so how do we build it? We, we do the dumbest thing possible. We like form a mask saying we, okay, when we want to read like the third entry, um, we're going to AND every single entry with either one or zero based on whether that's what we want to read. And then we're going to OR all of them together. Okay. Just to make sure I understand the basics, what the MUX is doing is it just like selecting.

19:05Just selecting. Just selecting an input. Yeah. So like invisible to software is like you say, I want input number three. That means there's a MUX. Yeah. And so like, what is the cost of this MUX? So an N input MUX, uh, operating on P bits. Well, I'm going to, so I have N, N rows, that's this eight rows and I've got like each row is P bits wide. Well, I have to AND every single bit.

19:36So I get N times P many AND gates. Every single input I have to say, am I going to like mask it out or not? Um, and then, and then I'm going to OR them all together. And so, uh, there's going to be like N minus one times P many OR gates. Which is saying, uh, I've got all of these different things. Almost all of them are zeros, but I need to sort of collapse them down into, into like from my eight options down into one option. And so every step I need to OR like one row into, into an existing row.

20:09Got it. Yeah. It's actually kind of funny that you would sort of, um, you don't think at the level of hardware, you sort of just think like, oh, I'll just select element three. And something as simple as that is, uh, a sort of like in and of itself a quite complicated circuit. Yeah. I mean, this is the first step of all of the hidden data movement costs. Right. Yeah. Um, and so like the thing, like we're, we're just going to like compare, like I have to pay this cost and I've got one marks here. And then in fact, I have two more copies of that for each of the three inputs to my multiply accumulate operation.

20:41And so I have this cost, which is like, like three times N times P and gates over here, um, compared to this P times Q, um, like, uh, sort of gates in the actual circuit that I, that is doing the thing I care about. And if we plug in actual numbers, like this N being eight, um, like I get like 24 times P gates over just, just in the data movement compared to like, if Q is four, like four times P gates just in the, in, in the adder, uh, multiply adder.

21:14And sorry, where's the three coming from? Three different inputs here. Okay. Okay. So the case I like really just what I'm hinting at here is that like all of this work, which scales like as, as the size of the register file, and this is a very small register file. All of this work, just moving the data from the register file to the, to the, to the logic unit, um, is many, many times more expensive than the logic unit. In the most recent cluster max report, semi-analysis ranked almost 100 different GPU clouds. Crusoe was one of only five that made the gold tier.

21:46Semi-analysis found that gold tier providers like Crusoe had a total cost of ownership that was five to 15% lower than silver tier ones, even when they had identical GPU pricing. This makes sense because total cost of ownership is downstream of a bunch of different things that don't necessarily show up in the sticker price, but that Crusoe has optimized. Things like how well you detect faults and how quickly you replace failed nodes. For example, Crusoe was one of the first clouds to adopt Envy Sentinel, NVIDIA's own GPU monitoring and self-healing software for enhanced GPU uptime utilization and reliability.

22:19This, that's Crusoe, makes use of everything that NVIDIA has learned about why chips fail across all their different fleets and deployments so that Crusoe can catch faults earlier in the process. And once they identify a failure, Crusoe can swap in a healthy node in less than 10 minutes. Because they're not running bare metal, Crusoe doesn't have to spend time installing an operating system or configuring drivers. They can just spin up a new VM on an already running and pre-qualified host. If you want to learn more about this or the other reasons that Crusoe made Gold Tier, go to crusoe.ai.dwarkesh.

22:52It may be helpful to just see what a mux looks like, maybe like a 2-bit or a 4-bit mux. Yeah, great. So we'll take some inputs. We'll have maybe, like, we'll just do a two-way mux. So we've got two different numbers. We've got these two inputs. And then we have a, so these are the inputs that are being selected between. And then we have a selector.

23:23Which says, which can either be like, I want this one. Or it could be, I want the other one. So this is a one-hot encoding. So this is what we all start with.

23:40And then the output we want to produce, like, let's focus on this case. So this is the actual input we got. Yeah. We just want to produce this guy as the result. And so, like, sort of very laboriously what we do is we AND this bit with all of these. And so that produces, like, ANDing this bit with this row. And likewise, we AND this bit with this row. That produces all zeros. So this was the, there's four ANDs here.

24:12There's four ANDs here. And then finally, we just OR these two together. And this gives a one. We OR these two together. This gives a one. We OR these two together. This gives a one. We OR these two together. Gives a zero. We OR these two together. And it gives a one. And so this is the four ORs. So, like, this actually ends up looking a little bit like addition, in fact. Like, we did exactly the same set of ANDs here. Sort of we've ANDed all of these things together. But then instead of collapsing it by using these full ladder circuits,

24:45we just get a very simple collapsing with OR gates. And how, but that, I guess that doesn't look like N times P. So, yeah. So this was with N equals two inputs. Ah, great. In the general case, we will have N, like, sorry, this is N rows. And then we'll have P bits per row. So that gives us the N times P many AND gates.

25:18So this circuit I've described here, almost all of the cost, like seven eighths of the cost is in the reading and writing the register file. And only a tiny fraction of the cost is in the logic unit itself. So this is the problem to solve. This essentially was the state of play prior to the Volta generation of NVIDIA GPUs. This is what, this kind of thing is what was inside the CUDA cores. And this sort of problem statement is what motivated the introduction of tensor cores,

25:52which are more generically called systolic arrays. So if we think about how are we going to solve this problem, like we're spending almost all of our circuit area on something that we just really don't care about and is hidden to the software programmer. And the thing that we actually care about is not most of the area. Well, make this one bigger somehow while keeping this at the same size. That's the goal. So sort of the evolution was like we had baked this much into hardware in this stage, that this single line is a multiply accumulate. And this single thing was baked into hardware.

26:22The idea of a systolic array is to sort of go two levels of loop up and bake this entire loop out here into hardware. And so the idea being that if we have a much bigger granularity fixed function piece of logic, maybe the taxes we pay on the input and output are much smaller. It sounds like you're suggesting that if you go up one step in the matrix multiply loop, that you can tilt the balance more towards compute than communication.

26:54That's right. So there's two effects that we're going to take advantage of here. One is just that we can do more stuff per every trip through a register file. And then the other thing we're going to take advantage of is, in fact, in some of this loop we can take advantage of, for example, some certain things staying fixed. So let's sort of visually we're going to look at this matrix multiplication. So this portion of the loop corresponds to a matrix vector multiplication, in fact. So we'll take a matrix and multiply it by a vector.

27:31So how do we do this? We take every column gets multiplied by the vector and then summed. So we're going to sum sort of along columns. So this 0 and 3 gets multiplied by the 3 and 7 and gets summed. And then the 1 and 2 gets multiplied by the 3 and 7 and gets summed. So there is a multiply accumulate associated with every single one of these entries in the matrix. So we'll just draw out these four multiply accumulates.

28:04And just to make sure I understand why there's four multiply accumulates. So if each entry in the column that corresponds to the output vector is a dot product, and in this case it will be like two multiplications, and then the addition of those two multiplications are like you're accumulating. Yeah, so the addition, so really there's only one addition per dot product, but like we'd like to start with zero. But if you include the initialization of zero.

28:34Yeah. Yeah. So what we're going to aim for is to have, so we want to have quadratically more compute. We do. We have. We have got sort of x times y as much compute as we had before. But we're going to want to somehow aim for having only x times as much communication. And this is sort of the intention so that we get this advantage term going as y.

29:08So we've laid down the multiplications. Bringing in, like we're going to want to bring in a vector of size 2. And so that sort of already is in line with our comms target. That's fine. However, we need to somehow manage the communication of this matrix, which exceeds our budget of x. And so the idea is that in an AI context, this matrix is actually going to stay fixed for a long period of time. And so instead of like bringing it in from the outside,

29:40so we've got some register file sitting over here. We don't want to have like the amount of stuff coming out of this register file. This is the term that we want to go sort of as x in some sense.

29:54We don't want to bring this full matrix in from the register file every cycle because we don't have enough. That would cost too much in terms of wiring from the register file. And so instead we're going to store, our key trick is that this matrix can be stored locally to the systolic array. And so where we'll store these numbers 0, 1, 2, and 3, in just like a gate called a register that like physically stores these numbers.

30:26And we're going to reuse these numbers over and over again for a large number of different vectors. And so the optimization here is that like the nature of matrix multiplication is you can store this like square quadratic thing directly where the logic is happening and which is like higher dimension than the, or has an extra dimension compared to the inputs which you keep swapping in and out.

30:56That's right. I mean, this is the nature of what a matrix multiplication is, is that you do a lot of multiplication to get one value out. Like a dot product is the result of a lot of multiplications. And so that optimization means that you, you're just like, you can stuff a lot of like multiplication in before you get some value out of it. That's right. That's right. Yeah. So like just to complete the picture here of, of concretely how that looks, I swapped the three and the two here, three and two. So just like this zero and three is going to multiply by the three and seven.

31:29And so we're going to form a dot product sort of along columns here. So somehow we're going to feed a three and a seven in here. These participate in sort of this feeds into this multiplication and also feeds into this multiplication. Likewise, the three feeds into here and also into here. And then we're going to sum, uh, sort of along here with like starting at the top of a column,

32:00we feed in zeros and then coming out the bottom, uh, we get results coming up. So sort of just to visually see what we've got. Um, there's a dot product that is performed along columns in a matrix. And that, that sort of maps exactly to what is done spatially in the systolic array here. Um, so this is one dot product summed vertically, and this is a second dot product also summed vertically. Um, and then what is the data that needs to go into and out of the register file? We have X amount of data that's coming out here on the output.

32:32Um, and we also have sort of, uh, this data coming from the input, X amount of data from the input. Um, and so with respect to the input and output vectors, at least, um, we, we sort of met our goal of having only X as, uh, as much data going in and out of the, um, uh, the register file. This leaves open the question of like, I said that the, the weight matrices, weight matrix is stored locally in the systolic array. How did it get there in the first place?

33:03Um, is sort of, uh, like at some point you need to boot your chip and populate this data. And so where did that come from? The trick is just, we just do it very slowly. So we, we very slowly trickle feed it in, into the systolic array. The, um, sort of the simplest strategy is that we, we sort of run this daisy chain that says like feed a number into here. And then, and then on the next clock cycle, it'll move down to the next entry or the systolic array. And so we can do that in every, um, in, in every column in parallel. And that gives us, um, the sort of, uh, this is also going to come from here.

33:36And this is going to be another factor of approximately X units of, of, of bandwidth coming in. Uh, can you, can you just, would you mind repeating that sentence one more time? So like, we're sort of like, we know that we're going to be bringing in, uh, numbers only rarely into the matrix. Yeah. Um, and so we just want to come up with any construction at all, such that the amount of wiring that actually feeds into, sort of crosses this boundary of the systolic array. Like this boundary right here, we just want to keep that bounded to X and not be, not go as XY.

34:06And so a particularly simple strategy is that we, um, we sort of bring in a, a number into the top row of the systolic array. That's what we can do in one clock cycle. And then for like, uh, for Y consecutive clock cycles, we're going to be bringing in the top row every time and then sort of shift all of the other rows down by one. And that keeps the, the sort of the wiring that needs to come from this expensive register file, uh, only down to a factor of X rather than X. I see. Okay. So, uh, there's two questions in terms of communication.

34:38There's like communication time, and then there's communication bandwidth. Yes. And you're saying, since we're only gonna be loading this in once, let's maximize, let's minimize bandwidth. That's right. Cause minimum bandwidth equals die area and let's just like load it in slowly over like smaller lanes. Cause we're just going to keep this value in there for a while. Exactly. Interesting. So it's interesting to me that, uh, when we were talking last time about inference across many chips, the big high level thing we're trying to optimize for is increase the amount of compute.

35:11Mm-hmm. Per memory bandwidth, that is to say per communication. And here also we're trying to increase the amount of like, uh, actual multiplies or actual additions relative to transporting, um, information from registers to, uh, the logic. So in both cases, you're trying to maximize compute relative to communication. Yeah. Um, this, this shows up sort of all, all the way up and down the stack. Um, this is sort of close to the bottom, um, sort of like it took to the gates. Um, there's sort of a, a version that's maybe even closer to the gates of just like, um, even the precision of number format that you choose to use.

35:46Um, we saw that same effect. Uh, there's like a square cube law or a like squared versus linear term going on both in just purely the precision of this, um, ALU, but then also in, in terms of the value. Um, but then also in, in terms of the size of this, the, the matrix. Yeah. Very interesting. So, um, so this unit is sort of the next bigger unit. We had like the multiplication, the circuit. And then, uh, on top of that, we have, um, a, like a pretty large systolic array. I drew it as two by two, but in like, um, for example, older TPUs, they were described as 128 by 128, uh, of, of this circuit, uh, shown here.

36:21Um, and this circuit ends up being, um, this is the, the most efficient known, uh, mechanism for, uh, the, uh, circuit for implementing a matrix multiply. I see. So we, we've talked about, um, sort of, it seems obvious that you should try to maximize compute relative to communication. Um, what are non-obvious trade-offs that, um, actually you are, you, you know, keep you up at night about what should we do X or should we do Y? And it's non-obvious what the answer is. Yeah. So, I mean, I think most of the decisions in chip design are, um, sizing decisions.

36:56Mm. And so, uh, already in what we've drawn so far, um, like, so AI chips all have this, this circuit in it. They have a systolic array and then somewhere near it, a register file, uh, providing inputs and outputs. Um, the two sort of, uh, like even within this scope, sizing questions that you have are how big should I make my systolic array? And how big should I make the, um, the, the register file? The, um, so, and, and then the trade-off for the size of the, size of the systolic array, actually, these two questions are coupled is, um, one way to think of it is to say, I'm gonna like have a budget for, um, uh,

37:41how much, what percentage of my chip area I want to spend on data movement. So maybe I just say that I want this to be 10% and the systolic array to be 90%. Um, and then sort of, I can size my register file. Bigger register files are more flexible. I, they allow me to run sort of more, I can get more application level performance out. Um, but, uh, but then they sort of take away from this area spent on the systolic array. Yeah, makes sense. I recently ran an essay contest where I asked people to write about what I consider to be some of the biggest open questions about AI.

38:12The submission window closed last week, so I used cursor to create a couple of different interfaces to help me review the entries. One interface anonymizes submissions and hides unnecessary information. It lets me group responses by question, add notes, and record my scores. The other interface helps me review entrants who also want to be considered for the researcher role that I'm hiring for. The UI puts the applicant's essay right next to their resume and their personal website so that I can see everything at once. Cursor's Harness is really good at helping these models see and improve their UIs. I watched it render these interfaces in the built-in browser, take screenshots, click through sections, and keep iterating.

38:47At this point, cursor is where I do most of my work. Whether I'm reading and visualizing a bunch of research papers, or coding up an interface to review applications, or making flashcards for my Blackboard lectures, Cursor just makes it very easy for an AI to look at whatever I'm looking at, and help me understand it, and work with me on it. So whatever you're working on, you should do it in Cursor. Go to cursor.com slash Dworkesh. Where does the clock cycle of a chip come in? What determines what that is?

39:18Yeah. And what is a clock cycle of a chip? So I guess at baseline, it's sort of worth observing that chips are incredibly, incredibly parallel, right? You've got 100 billion transistors in a chip. A key thing that you need to do whenever you have massive parallelism is you need to synchronize between the different parallel units. In software, typically, you have these very expensive synchronization methods like a mutex. So one thread will finish what it's doing. It will grab a lock somewhere stored in memory and then notify the other thread that it's done.

39:50On chips, we take a very different approach and say that every nanosecond or so, all circuitry in the chip will kind of pause for a moment and then synchronize every... So it actually synchronizes every single nanosecond or so. And so that is the clock cycle. The entire chip, typically all in sort of one fell swoop, goes in lockstep to the next operation that happens. And so what this looks like in circuitry is that you will have...

40:25It's typically drawn... So the clock is sort of mediated by registers, which are these storage devices that we've drawn elsewhere. And the way to think of it is that I have some storage, which is storing like a bit, which might be zero or one. And then I have some sort of cloud of logic, which maybe is like this systolic array or this multiplier or something like that. And then I've got some... And that's going to produce some output. So my inputs, I've got a bunch of inputs feeding to this cloud of logic. And then eventually later there's going to be some output register that this writes to.

40:57There is a global clock signal, which drives all of these registers. And it says at a certain instance in time when the clock strikes, whatever value happens to be on this wire at that instant, that's what's going to get stored in there. And so the sort of the challenge here is like, I would like to have my clock speed run as fast as possible because if I can run at two gigahertz, I can get twice as many operations done per second than if I run at one gigahertz.

41:35But what that ends up meaning is that I'm very sensitive to the delay through this cloud of logic because any computation that is going to happen in here needs to sort of finish before the next cycle hits. So a major point of sort of optimization on any chip then is to sort of make this delay from here as short as possible. Interesting. And is there ever...

42:06Because the constraint here seems to be that if you add too much logic, then you might risk missing the clock cycle. That's right. But if you don't add enough, then you're, you know, leaving potential compute on the table. Is there ever a situation where you're like, you'd take a probabilistic chance that a computation finishes and... Or is it just like, no, either it's going to finish by clock cycle or not? Yeah.

42:36In standard chip design, you margin it such that, I mean, there is a probability, but it's like many, many standard deviations, like way standard deviations outside. Such that for all intents and purposes, it is a reliable part. It will, it will always meet the clock. There are some weird exceptions to that. There are clock domain crossings where you go from one clock to the other clock and then you actually do have to reason about this probability, but... Interesting. In the main path, you just like, you margin it such that you'll get there like 25% of the clock cycle in advance.

43:06Um, uh, so that it's very unlikely. In this, in this, um, the clock, uh, where the, uh, clocks synchronize, I guess where the registers are, this is not something you determined as a chip designer. This is sort of just like an artifact of, hey, I want whatever sequence of logic. And then the software you use to convert your Verilog into the thing you send to TSMC, that just determines like, hey, in order to make this work, you got to kind of, you got to put a register here, here and here to make sure that there's a, there's no one step that is like too long, such as it makes the whole clock cycle of the entire chip longer than it has to be.

43:46Yeah. So this is actually a huge part of the work of, of designing a chip actually is, is inserting them. So it is done in a combination of manually and automatically. Um, so, I mean, like just like to show the very sort of dumb, uh, version of like what you can do here, you can take this logic and split it in half. And so like, say actually instead of just one, uh, cloud of logic, I'm gonna, um, have two smaller clouds of logic, um, which do the same thing, but split them up by a register. Right.

44:17Um, feeding in like this. Um, and this is like, like if you split it, like in the middle, you can hit twice the clock frequency. That's great. You get twice the performance, um, at the cost of this extra register. And so at the cost of some more, uh, storage. And so stepping back, why do we need to synchronize the whole chip? Like if you have, like, if you imagine playing Factorio or something, there's no like global clock cycle. It just shit is done when it's done. There's iron on the plate. You can take it if you want. Yeah. So, uh, taking that analogy, um, the, the, the, the thing that you need to be mindful

44:49of is if I've got two different paths through some logic. So I have to do a computation like F here, um, and then computation G here. Um, and then they're gonna come and meet for computation H somewhere here. Yeah. Um, and so, uh, there's gonna be manufacturing variance here. Uh, in some chips, F will take a little longer. Maybe in some chips, G will take a little, a little bit longer. And so if I've got some signal that's propagating through here, um, and the result from F and G have to sort of meet up at H, um, what can, the, the thing that can go wrong is that F can get there early.

45:24And it meets like the previous value of G or the next value of G or something like that. Ah, and H needs to know when to start. Exactly. Like when, when has this next iteration of, and so this explains why different chips made at the same process node, the same like TSMC, uh, technology, um, can have different clock cycles. Yeah. Like two chips made at three nanometer might have different clock cycles based on, uh, whether they were able to optimize making sure that like there's no one critical path that is so long that, um, it slows down the whole chip's clock cycle.

46:01That's right. This optimization that I, that I showed here, this is just the, uh, this is sort of, um, pipeline register insertion it's called. Yeah. Um, we've inserted in the middle of the pipeline to register here. Um, this is a sort of pure trade off between clock speed and an area. Yeah. Um, this is the easy case. There is a harder case too, which is, um, uh, sort of drawn it as a pipeline of logic here. But in, in other cases you may have some, uh, uh, some calculation which actually feeds back in on itself.

46:31So, uh, it runs some function F and then writes back to, uh, itself like this. So for example, this might be this addition, like you've got some number that you're adding into every clock cycle. Um, and so this, this could be like a, um, this could be like plus, uh, we're adding in some number every clock cycle. Mm-hmm. So this, like this little circuit, uh, um, it, essentially it's just gonna sum all of the numbers that you presented on different clock cycles. And the challenge is if this plus takes too long, what can I do?

47:04If I like split it in, if I try and put a pipeline register, like right in the middle of it, um, like, like here in the middle of it, um, this will end up changing the computation that's done. Instead of forming a running sum of everything that comes here, I will actually have two different running sums. I'll end up having a running sum of the even numbers and a running sum of the odd numbers. So sort of this constraint where I have, um, a loop in my logic, which all chips have somewhere. Um, this is actually the thing that is the hardest thing to, to, to address and sets the clock cycle.

47:38I don't understand why it'd be a problem to have that, or I'm not sure even what it would mean to have a register there. Like, cause it's, it's a sort of, uh, atomic operation, right? Yeah. Well, so plus is not really atomic. Uh, like I think, um, as we just demonstrated. Yeah. Yeah. It took a whole lot of work to do a summation. And so like you can take the early parts of that work and then, and then stick a register in the middle and then do take the late parts of that work. Okay. Yeah. Uh, and I guess it's then up to, so TSMC offers a, uh, PDK, which says, hey, here's the primitives of, um, logic that we can grant you in the, in the chip.

48:11And it's up to them to determine that no, no primitive is bigger than like the clock cycle they're hoping a process node targets. But other than that, is there like what further optimize? Okay. Can't you just say like, Hey, here's all the primitives from TSMC. And if, keep adding registers in between the primitives as much as is needed until you get to your desired clock cycle. Yeah. As a logic designer, like the chip architects set the clock cycle. So just for, for one example, the primitives you get from TSMC are on the order of like and gates or full adders.

48:45Um, they depends a lot on voltage and frequent and, and, and which library you choose and so on. But generally they, you can typically have about like 10 or 20 or 30 of these in, in a, in a clock cycle sequentially. So these primitives are very, very fast, like 10 seconds or something like that. Uh, and so as a logic designer, I mean, like in principle, if you literally just had like, um, like register and then and gate, um, kind of in a loop like that, you could get an insanely fast clock speed, like more than four or five, six gigahertz, something like that.

49:19Um, but if you take this, this like really sort of like simple circuit, um, and you look at the area you're spending here, like, um, this is maybe like one, I mean, this is, this is called one gate equivalent, um, in size. So like unit of one in area, and this thing is like unit of eight in area or something like that. And so like, this is just, um, again, almost all of your cost has been this like synchronization or communication cost compared to the actual logic. And so, uh, so this would be a case where you've gone too far.

49:51You've made your clock speed really, really fast at the cost of spending almost all of your area on pipeline registers. Interesting. So what you're hinting at is a dynamic where you can have really fast clock speed, but you're not getting that much work done. Yeah. Yeah. Um, and so you can have like low latency, but, or, but low bandwidth or throughput rather. Yeah. The, it, it hurts your throughput in fact, because like, uh, like the, the throughput of your chip, you can think of as the product of, um, how much do I get done per clock cycle, which is based on this area efficiency thing times how many clocks I get per second.

50:25This is, this is actually so similar to the thing we were discussing last time about like batch size, um, where if you have a low batch size, then you can, any one user can receive their, uh, next token really fast. But the total number of tokens that are processed in say an hour will be kind of lower than it could otherwise be. Yeah, exactly. You get, you get less parallelism out if you, if you drive your clock speed up really high. Language models are starting to compete against the best human forecasters. I sat down with two senior Jane Streeters, Ron Minsky and Dan Pontecorvo and asked, at some point, does AI just do what Jane Street does?

51:01There's a world that we should take seriously where, you know, we're going to build large language models or some other AI systems that are like strictly smarter than all humans on the planet and more capable at all cognitive tasks. Trading in particular feels to me as like kind of AGI complete, sort of like NP complete, because at the end of the day, trading involves figuring out what things are worth, which means making predictions about the future. Jane Street isn't betting against AI. They just signed a $6 million compute deal.

51:31But Ron's view is that the edge keeps moving. I have never been more desperate to hire more engineers and more traders than I am today. You know, you have the usual thing of like the other hard parts that we don't yet know how to automate. Well, that ends up being where the competitive edge lies. You can find these open positions and watch the full interview at JaneStreet.com slash Thorkesh. Okay, so I remember talking to an FPGA engineer at Jane Street, Clark, who actually helped me prep for the previous interview we did together.

52:01And he was explaining why they use FPGAs. And I imagine that for high frequency trading, the throughput is less important than latency. And so having very specific control over the clock cycle in a deterministic way is the most important thing. Maybe it'd be interesting to talk about how you, why you can't just achieve that with an ASIC or why FPGA is the, yeah, why you might use an FPGA to have deterministic clock cycles for like high frequency trading.

52:34Yeah. So, I mean, firstly, like, let's consider the sort of the business case for an FPGA versus an ASIC. FPGAs and ASICs use largely the same sort of conceptual model, which is that I have a series of gates built from ANDs, ORs, XORs, those like very small primitives connected together with a fixed clock cycle and connected together with wires that are running in a fixed clock cycle. So anything you can express in an FPGA, you can express in an ASIC too.

53:05And it'll be about an order of magnitude cheaper and like better energy efficiency on an ASIC than an FPGA. The tradeoff is that the first FPGA costs you $10,000, whereas the first ASIC you make costs you $30 million because it requires an entire tape out. So sort of the business use case for an FPGA would be that I want something that has this very deterministic latency and fast runtime and high parallelism. But, you know, I'm going to change it very frequently, change what I do every month or something like that.

53:41And so then I don't want to pay the tape out cost every time. Now, how does an FPGA actually implement? It's sort of like it emulates the ASIC programming model, but in a fixed piece of hardware. And so how does that work actually? So what it has at the base is it's got the two components we just talked about. It's got these registers as storage devices, and then it's got these, these are called LUTs, lookup tables, which actually provide all of the gates.

54:16So, and then, and then we're going to see if in the sort of the third component, we then have like sort of a swarm of these registers and LUTs. And all of these are available and then they are connected by sort of this big set of sort of MUXs. So in front of every single one of these, we've got something like one of these MUXs, which selects one input from everywhere else, sort of selecting from all of these different things.

54:54You know, we've got a whole bunch of different options feeding into, into, into, into all of these things. So, um, so what, what, what this allows is essentially a, uh, when I program my FPGA, I can say that I'm going to take all of these components and I'm going to sort of superimpose on top of this, a particular wiring, which like goes through this LUT and then feeds into this LUT and then goes to this register and then feeds into this LUT or something else. So what I've drawn in orange is like how you, like FPGA means field programmable gate array.

55:32This is the, the orange is what has been programmed in the field, whereas the white is all of the wires that must exist in the FPGA in order to actually take, uh, to make the device in the first place. What does it mean to be programmed into a field? Like programmed in the field. So like the device has been deployed in a data center, it's sitting in the field and then you can come and program it. That field isn't like electric field, that field isn't like out there in the world field. Okay. Um, and so if I see, look at the, how the, uh, the field programming comes out of the first, uh, lookup table and goes into the second one.

56:04How is it? Yeah. How? Like, like where, where are the wires that made that happen? I guess. Yeah. So I, I, I got a little bit like lazy and drawing all of these. Every single device here has a MUX sitting in front of it. Um, uh, uh, uh, which can select from all of the like nearby, um, like, uh, circuits that are available. Um, and so, uh, the actual configuration of the FPGA is like amounts to it. It is the MUX control.

56:34So like we, in this MUX here, we have the data inputs and then we have like the control that selects. Uh, and so like there's a little storage device sitting next to every single one of these, um, uh, MUXs saying this is where you're going to source your input from. Right. Um, and so programming it consists of like configuring every single one of these MUXs. So that makes sense. What is happening inside of the lookup table? Yeah. So the purpose of the lookup table, so it's going to also have a little bit of control feeding, telling it what to do as well.

57:07The purpose of the lookup table is to function, to be able to configurably take the role of an AND gate or gate XOR, any of those different things. So there's many ways you could consider doing that. Um, the way it is done in sort of traditional FPGAs is to say, um, it will support. So it will be a lookup table will be, um, it will have four bits of input, one bit of output.

57:41How many different functions are there from four bits to one bit? There are 16 different functions. Um, and so, uh, you can actually just tabulate this as like 16 different, like 16 different numbers. You've got a table of 0, 1, 1, 1, 0, 0, 1, 16 entries. And so what it does is, um, this, this table is stored in this blue configuration bit. Um, and then, uh, it views these four bits as binary, looks up the relevant row of the table, and emits that bit.

58:15So this is a truth table view of, of lookup tables essentially. Okay. So the lookup table, if you think about an AND gate, OR gate, NOR gate, XOR gate, these are all like take as input. Those are like two input functions. Yeah. Sometimes we have like more complicated, like a three input function would be a three way XOR. Right. Or a four way XOR. Yeah. And in this case, how many, it just depends how, how big it is, but. Typical size for LUTs is four input. Got it. Which is sort of just a sweet spot between, um, there's another computer communication trade-off like here.

58:49Like if it, if it has too few inputs, then you need to use more LUTs. Yeah. If it, if it has too many. But basically the lookup table is like a truth table. It's a truth table. Yeah. And with a truth table, you can program in any gate you want. That's right. Yeah. And so it, instead of lookup table, it just thinks like a programmable gate. That's right. And so, I mean, one of the things you can do here is you can see why, where the rule of thumb that, uh, an FPGA is like an order of magnitude more expensive than an ASIC comes from. Um, is to count how many gates would be inside this lookup table. Yeah.

59:19So we can view this lookup table essentially as one of these MUXs. Um, and so it is a MUX with, has to select between 16 different values. Uh, and so it is a MUX with, uh, sort of N equals 16, uh, options. P equals one bits. And so what we saw way earlier is that, um, this circuit costs like N times P many gates.

59:50And so it's like, um, so it costs like N times P equals 16 AND gates. And also 16 ORs. This circuit being the MUX. Uh, yeah, exactly. The MUX. The MUX is the core. The MUX that goes into the lookup table. Uh, so the lookup table itself, you can think of as being actually a big MUX that like selects from all 16 rows down to one output. Yeah. Okay. That is the lookup table. But the way you've drawn it here, there's like a MUX and then a lookup table. It's MUX is all the way down.

1:00:20So, I mean, there's a, there is a second MUX that is inside here. This MUX is, is this MUX. Got it. Okay. And then the, the other MUX is just saying, uh, Where it came from in this sort of mess of, of, of gates. Right. And then the second MUX is, okay, now you have one value, but that value is still, um, still a four bit value. Yeah. So I've selected four bits from the soup. Right. And then, and then I use those four bits to select which entry in the lookup table. Yeah. Uh, I am going to use. Right.

1:00:50Okay. So like, suppose in the first MUX, there's like eight nearby eight, you're, you're pulling from eight nearby registers as input. Um, and so that's like a total of like 32 bits going in. Um, and then out of that, uh, four bits come out. Those four bits go into the second MUX, which is inside the lookup table. So actually I would say in, yeah, in this case, these registers are single bit registers. So if there are eight, like eight nearby registers and lookup tables, then I have eight bits total coming in, in, in nearby.

1:01:21I select from eight down to four different values. So there's actually like four different MUXs, one associated with each of these input, like little MUX associated with each of these input bits. Each of them is selecting one out of eight. And where are those eight coming from? Um, nearby registers and other LUTs. And each register is one bit. Yes. Yeah. And so I guess you, AMD or whoever makes these, um, uh, FEJs still has to be opinionated about what, uh, what registers should connect to which registers.

1:01:55Uh, and then you can program in the actual gates, but they add a wire in the connection, like the communication topology, right? Yeah. So there's the sort of like, you get flexibility, uh, in a local grain thing, there's a sort of nearby neighborhood where you can select from. Um, but then more grossly, like more coarsely, longer distant connections, they, they form an opinion. Right. Yeah. And the reason it's 10 X lower is why? So if you look at the cost of like building this lookup table, it's like 32 gates.

1:02:25Yeah. And then it can give me the equivalent of like, what's one, an interesting thing I can do here. I can do a four way and gate. And so that's like, I am using 32 gates of lookup table to sort of implement like a four way and means like, what is a four way? And I would do like, and, and, and then, and of, and.

1:02:46So to like, this is a, a circuit that I could implement in an ASIC directly using these three and gates. Um, but using a lot, I can also implement it, but it's going to take like these 32 gates instead of three. Yeah. And so the overhead is really coming from the, uh, like, um, the fact that the lookup table, the mux in the lookup table is, there's a more concise way to describe a truth table. Then listing out every single possible combination of inputs, uh, which is just to like write out the gate.

1:03:19Yeah. Like to, to like place down the polysilicon and the, and the lines and so on. That's right. Yeah. Interesting. Yeah. Yeah. Yeah. Yeah. Yeah. Yeah. Yeah. Yeah. Yeah. Um, why is it not a guarantee in CPUs? Hmm. So you can actually design a CPU that has deterministic latency as well. Um, uh, and in fact, like the, the processes that are inside a lot of, uh, AI chips actually also have deterministic latency too.

1:03:51Grok has advertised this TPUs have that in the core as well. Um, the challenge is getting sort of deterministic latency and high speed at the same time. And so, um, where does the non-determinism and latency come from? Um, non-deterministic latency, um, comes from specific design choices in a CPU. Um, uh, it's actually possible to remove those design choices and make a CPU that has deterministic latency. Um, those are not very attractive in the market.

1:04:21And so people don't make those CPUs anymore. Um, but, uh, but, but, but actually in some sense, like, uh, deterministic latency is maybe a sort of a simpler designing start starting point. Um, and then, and then like some chip designers have added, uh, things into, to be non-deterministic. To take a concrete example of that, um, the, probably the most important example is on a CPU, just like the, the, uh, the CPU cache itself. So, uh, in a CPU, you have the, the CPU, this is, this is the CPU die itself.

1:04:58And then there's a memory, uh, off on the side. This is the DDR memory, um, off on the side. And then you have a cache system here inside it is the cache. Uh, that sort of remembers recent accesses to DDR and, and, and, and stores them. And so, uh, when, when I'm running through my CPU instructions, uh, every time I, uh, I have an instruction that accesses memory, it first checks in cache, uh, was, was the data stored in cache? And then if not, it goes, uh, it fetches out to DDR.

1:05:30Yep. This is a huge optimization. The cache is like two orders of magnitude faster than DDR. Um, if you never, uh, if, if, if you never like use the cache, like all, basically all programs would run a hundred times slower. So, uh, the presence of a cache is absolutely necessary for a CPU to, to run at reasonable speed. Um, but whether or not you get a cache hit is dependent on the sort of ambient environment of the CPU. Like what other programs are running, what has run recently, what is the random number generator inside the cache system doing.

1:06:01And so, so that is a big source of non-determinism in, in the runtime of a CPU. Yeah. Um, so this is sort of the memory system for a CPU. Um, the, the big thing that you can do differently is, uh, instead of having the hardware say, I'm going to like read, uh, read memory and then decide the hardware decides whether or not it comes, it comes, comes from cache or not. You can actually bake this in this decision into software. So, uh, a different design philosophy is, is to, um, so, and you see this in maybe for example, TPUs.

1:06:35Um, the, the TPU instead has, I mean, I'll draw the same diagram, but I'll call it a scratch pad. And so the, the main difference is, um, so this would be like a TPU and then like HBM in this case rather than DDR, but it's still an off chip memory. Um, and instead of like the software saying first access like memory and then the hardware decides, um, you've got some instructions that go here. This is like one kind of instruction and then a totally different kind of instruction that goes to HBM.

1:07:05Yeah. And so this style is, is generically known as scratch pad, um, instead of cache. The key distinction being that you have like one kind of instruction that says read or write scratch pad, and a totally different instruction that says read or write HBM. So it's a scratch pad being the cache. Yeah. This, this thing here is the, is the scratch pad. Um, just to be clear. Just to be clear. So stepping way back, people say computers have the quote unquote John Moin, uh, uh, Von Neumann architecture where there's this, uh, serial processing of information.

1:07:38And maybe just because we've been talking about parallel accelerators, but I just don't like the FPGA is super parallel. Yeah. Um, the, the, the kinds of AI accelerators, the TPUs are super parallel. Um, even CPUs are super parallel if you think about all the cores they have. And so is it actually like, in what sense is modern hardware actually the Von Neumann architecture? Is that actually a fair way to describe modern hardware? Um, I think it's a fair way to describe CPUs. Like just the amount of parallelism, like on a CPU, the amount of parallelism you get is about a hundred cores times maybe like 16 way vector unit.

1:08:12So 106, uh, about a thousand way parallelism on a CPU. Yeah. One question is what is the, there is a die that is being used for the CPU. Mm-hmm. And if there's fewer threads, just as a matter of like transistor, uh, voltages are like switching on and off. Is it just that there's like literally one control flow, like a small part of the die where like voltages are switching on and off? Or like in what, how, how do you actually occupy the die area of a CPU if there's a, as opposed to-

1:08:44If there's so few cores, like what do you, am I spending a lot of time in there? What is happening in there? The cores are just much bigger and more complicated. Um, so, uh, I mean like, so I guess we, we should compare like a CPU core, which takes up one 16, one, one hundredth of the die to like, I mean to a lot, like a lot is just only these 16 gates. Right. So like, it's clear why there are so many more lots in an FPGA than cores in a CPU. Um, but then sort of maybe the, the, like why are there more CUDA cores, for example, than, than, than CPU cores, I think would be like, why, like what's the difference between a CPU and a GPU or something like that would, would be a big difference.

1:09:23Um, inside the CPU you have, um, so one big use of, so the, sort of the top unit, uh, uses of area in a side of CPU are the cache. Um, very little is actually the ALUs. Um, like mostly it's like these register files rather than the logic units. Um, and then the, uh, both of these things have equivalents in a, in a GPU. And so that's not a big difference, but the thing that does not have an equivalent in a GPU is the, um, sort of this branch predictor.

1:09:53And so there is a whole big area in the CPU, which is, uh, sort of, uh, just a whole bunch of predictors that are saying, um, when will my next branch be and where the, where's the branch target for that? And so, uh, stripping a lot of that out, uh, as, as well as sort of making these register files tighter in, in a sense, um, is, uh, driving a lot of where the, like GPU gains. And what is the purpose of the branch predictor to like execute both branches at once or what does it do?

1:10:23So the issue is that when I've got a series of instructions, like, um, like instructions, instructions, instructions, instructions, I, uh, if I have a branch like here, if this instruction is branch, um, the, the actual processing step of processing an instruction, um, takes a really long amount of time. It takes like maybe five nanoseconds or something like that. Um, so like, uh, the time to actually notice that I've got a branch and then like, uh, uh, uh, evaluate the Boolean, whether it's true and then, and then update the program counter to the new target and then read from the instruction memory for that.

1:11:02Um, that, that could take like, like actually five nanoseconds to, to finish. And so in reality, this may finish somewhere down here. Um, I don't want to like, but like, I want to run a clock speed that is much faster than what five nanoseconds allows. Like, like five nanoseconds is 200 megahertz clock speed. I would like to run at one or two gigahertz or something like that. And so, um, I need to run other instructions while the branches is being evaluated. And so I, like, I really just want to keep running the following instructions that happen after me.

1:11:37Um, but that might've been wrong. Like if the branch ended up being taken, then I need to know that instead of evaluating these instructions, I actually need to like jump to wherever the target is and run, run these instructions instead. And so the purpose of the branch predictor is like, like genuinely to predict based on like, before you even get to this instruction to be like five cycles earlier to predict there was going to be a branch that's going to happen. So if I think about how the brain works versus what you're describing here at a high level, the differences might be that while you can do structured sparsity in these accelerators,

1:12:11and then save yourself some, um, area that you would have otherwise had to dedicate to these gates. In the brain, there's unstructured sparsity, you know, any neuron can connect to any other neuron and not in like ways where they'd be column lined or whatever. Yeah. Yeah. Um, then there's the fact that memory and computer co-located. I guess you could say in a way the memory and computer co-located on- This is exactly the co-location in some sense of the memory and computer. That's right. That's right. Yeah. Yeah. So maybe, maybe that actually isn't a big difference. And the other, maybe a big, a big difference is that the clock cycle on the brain is much slower than on, um, on computers.

1:12:50And partly that's to preserve energy because the faster the clock cycle, the bigger the voltage, uh, needs to be in order to identify for the signal to settle and to like identify what state of transistor is that. Yeah, that's right. That's right. I don't know if you have other high level takes about like how, any commentary on what, you know, what the brain might be doing versus how these chips work. Yeah. I mean, so, so let's take the clock speed one, uh, first actually.

1:13:20Yeah. Um, the clock speed is quite high, uh, on, on a chip because that, I mean, drives higher throughput. Um, like when we compare a, like a GPU running some, some workload, it's running batch size 1000 or something like that. Whereas like the brain is not running batch size 1000. There's only one of me. And so you could sort of imagine saying, well, take a GPU and like, instead of running at a gigahertz, run at a megahertz or something like that. And, and that would start to look maybe a little bit more like, like, um, uh, sort of equivalent things that you're talking about in the brain.

1:13:55Um, there is in the way that silicon works, um, there are like, that does not give you an 1000 X advantage in energy efficiency. So, um, what it ends up looking like is you can, uh, like you sort of just end up running this circuit, uh, once to stabilization. And then it'll sit idle for a long period of time. Um, it doesn't consume a lot of energy while it's sitting idle because, uh, most of the energy is consumed in, in sort of toggling, uh, bits from zero to one and back.

1:14:29Um, uh, so actually let's, let's talk about the energy consumption, uh, of, of a system. Uh, of, of, of a circuit like this. The way to think of, uh, a bit being stored is you've actually deposited some charge in, in a capacitor somewhere, sitting somewhere in, in the chip, uh, implicitly. So it, it becomes charged when it is, um, bit becomes a one and then it becomes discharged when it next goes to a zero. Yeah. Um, and that cycle of like charging the capacitor and then dumping that charge out to ground, that is where the energy is consumed. Yeah. This is called the dynamic or switching power.

1:15:00Uh, this is most of the energy consumption of a chip. There is some other energy consumption just coming from the fact that, uh, um, uh, insulators aren't perfect insulators, but we'll just discard that. Right. Most of the energy consumption, uh, actually comes from just the charging and discharging of, or like toggling from zero to one and back to zero. Right. Um, so if you run a chip much slower and you only clock it once every thousand clock cycles or something, you will have a thousand times fewer transitions. It'll be about a thousand times less energy consumption. Um, but, but like, but not a substantial advantage of energy efficiency.

1:15:32Um, okay. So you've described how a TPU works at a high level. Um, what is the difference, uh, at high level between how a GPU and a TPU work? Yeah. So, I mean, I think there's sort of a high level organization principle that is different. Um, and then there's sort of inside the cores what are different, but we'll look sort of, uh, outside the, uh, like at the high level. Um, so we'll take a GPU, um, and a TPU and what does like sort of the top level block structure look like?

1:16:02Um, if you think of this as the whole chip, um, in each case, um, the organization of, of the GPU, um, uh, is mostly a bunch of, uh, almost identical units, which are these, um, these are the SMs. Um, and then they've got, uh, an L2 memory in the middle and then a bunch more of these SMs on the bottom. Um, and so, uh, there's sort of this fairly regular grid of cores. Um, uh, and then like, if we look at a, at a TPU in, uh, in comparison, um, you, uh, you end up with much coarser grained units of, uh, logic.

1:16:44And so you, you end up with something like, um, uh, some large number of, um, maybe like, maybe just a few, uh, matrix units. These are the, um, these are the big, like, um, systolic arrays. And then in the middle, you've got some vector unit and then you've got your, um, matrix units at the bottom. So, um, now sort of like matrix units with a vector unit in the middle, sort of, this is the whole TPU chip.

1:17:16You can sort of think of take scaling this thing down into a really tiny unit with a smaller matrix unit, smaller vector unit. And that is sort of what, uh, uh, an SM is. So sort of at a very high level point of view, the, the GPU has a lot of tiny, tiny TPUs, um, sort of tiled across the whole, uh, whole chip. Oh, interesting. So like you're suggesting the tensor core within a streaming SM is analogous to an MXU. Yeah.

Reiner Pope – Chip design from the bottom up

Show notes

Highlighted moments

Transcript

More from Dwarkesh Podcast

Adam Brown – A deep but accessible introduction to general relativity

Grant Sanderson – AI and the future of math

The next big breakthrough will be AIs learning on the job

The data black hole at the center of AI

Ada Palmer – Machiavelli is the most misunderstood thinker of all time