Reiner Pope – The math behind how LLMs are trained and served

April 29, 2026 · 2h 13m · 22,975 words

Show notes

Did a very different format with Reiner Pope: a blackboard lecture where he walks through how frontier LLMs are trained and served. It’s shocking how much you can deduce about what the labs are doing from a handful of equations, public API prices, and some chalk. It’s a bit technical, but I encourage you to hang in there – it’s really worth it.

Fewer than a handful of people understand the full stack of AI, from chip design to model architecture, as well as Reiner does. It was a real delight to learn from him. Recommend watching this one on YouTube so you can see the chalkboard.

Reiner is CEO of MatX, a new chip startup (full disclosure - I’m an angel investor). He was previously at Google, where he worked on software efficiency, compilers, and TPU architecture.

Download markdown of transcript here to chat with an LLM. Wrote up some flashcards and practice problems to help myself retain what Reiner taught. Hope it's helpful to you too!

Sponsors

* Jane Street needs constant access to incredibly low-latency compute. I recently asked one of their engineers, Clark, to talk me through how they meet these demands. Our conversation, which touched on everything from FPGAs to liquid cooling, was extremely helpful as I prepped to interview Reiner. You can watch the full discussion and explore Jane Street’s open roles at janestreet.com/dwarkesh
* Google’s Gemma 4 is the first open model that’s let me shut off the internet and create a fully disconnected “focus machine”. This is because Gemma is small enough to run on my laptop, but powerful enough to actually be useful. So, to prep for this interview, I downloaded Reiner’s scaling book, disconnected from wifi, and used Gemma to help me break down the material. Check it out at goo.gle/Gemma4
* Cursor helped me turn some notes I took on how gradients flow during large-scale pretraining into a great animation. At first, I wasn’t sure of the best way to visualize the concept, but Cursor’s Composer 2 Fast model let me iterate on different ideas almost instantaneously. You can check out the animation in my recent blog post. And if you have something to visualize yourself, go to cursor.com/dwarkesh

Timestamps

(00:00:00) – How batch size affects token cost and speed
(00:32:09) – How MoE models are laid out across GPU racks
(00:47:12) – How pipeline parallelism spreads model layers across racks
(01:03:37) – Why Ilya said, “As we now know, pipelining is not wise.”
(01:18:59) – Because of RL, models may be 100x over-trained beyond Chinchilla-optimal
(01:33:02) – Deducing long context memory costs from API pricing
(02:04:02) – Convergent evolution between neural nets and cryptography

Get full access to Dwarkesh Podcast at www.dwarkesh.com/subscribe

Highlighted moments

“if you do not batch together many users, the cost and the economics you get can be like a thousand times worse than if you do batch many users together.”

Jump to 4:33 in the transcript

“The fact that they are charging 5x less for pre-fill than decode does suggest that they are bottlenecked on memory bandwidth to quite a degree”

Jump to 1:48:05 in the transcript

Transcript

Introduction to Reiner Pope

0:00Today, I'm interviewing Reiner Pope, who is CEO of MatX, which is a new chip startup. Previously, he was doing TPU architecture and many other things at Google. This is a very different format from my usual interviews. This is going to be a blackboard lecture. We're going to get up in a second. We, in fact, built this whole new studio with specifically this format in mind. And so it's a pleasure to get to inaugurate it with you. We're going to be talking about model architecture, ML infra, and many other things. And the reason I think it's an important topic is because once you actually understand how training and inference work in a cluster, as we'll see, a lot of things

0:34about why AI is the way it is, why AI architectures are the way they are, why API prices are the way they are, and, fundamentally, why AI progress is the way it is, start making sense. And you need to understand the details to get there, and you need a blackboard to understand

Reiner Pope Interview

0:47the details. So, Reiner, thank you so much for doing this. Yeah, very happy to be here. Just a heads up, this is a lecture with graphs and equations and all that stuff. So, if you can, I would really recommend watching it on a video platform like YouTube. Okay, full disclosure, I am an angel investor in MatX, but that's unrelated to this podcast. Reiner, maybe to kick us off, I'll ask this question. So, we have a couple of companies, like Claude and Codex and Cursor, offering something like a fast mode, where for 6x the price, they'll stream tokens at 2.5x the speed.

1:20So, mechanically, I'm curious what's going on here. One, why is it the case that you can pay more to get lower latency? Two, could you keep going? Could you pay 100x more and somehow get even faster speeds, or much, much faster speeds? And three, could you go the other way? Could you have something like a Claude or Codex slow mode, where if you are willing to wait for minutes on end, you could get even cheaper prices? So, maybe this will help motivate the kind of analysis that you'll be doing through the

Batch Size and Latency

1:48lecture. Great. I mean, to jump to the conclusion a little bit, the big effect is batch size, but what we're going to do now is quantify exactly what that looks like and what its implications are for latency and cost. There's going to be another effect, which you can call speculative decoding or multi-token prediction. We can maybe come back to that later, but I think the first thing that we'll talk through is batch size. So, what I'd like to introduce is sort of the two principles of analysis. Firstly, we're going to look at a roofline analysis of how we run a transformer model on a cluster of chips.

2:19We'll take, let's say, a Blackwell NVL72 cluster, so a rack of 72 GPUs. And so, the roofline analysis means we look at memory bandwidth and compute performance. And then, the other side of that is that we're going to look at just two simple factors of the model, which are the time to operate on the weights and then the time to operate on the context, the KV cache.

Memory and Compute Time

2:45So, let's jump in. What we're going to try and do is estimate the time that it takes to run an inference of a certain shape. Now, we're not perfect here. We can't exactly predict the time. And so, instead, we're going to approximate. We're going to say that the time must be greater than or equal to a certain quantity. And we're going to consider two different aspects. We're going to look at the time it takes to do the memory fetches and then the time it takes to do the compute.

3:16And it'll turn out that this actually gives us a very strong predictive power, even with

Compute Time Calculation

3:19a simple one. So, one by one, what is the time that it takes to do the compute? So, there are really two things I need to do in the compute. I need to multiply by all of the active parameters. And then, I need to do some work on the attention. So, multiplying by all the active parameters. I have a certain batch size that I'm running. And then, I've got a number of active parameters in my model.

3:47And then, I'm just going to divide this by the compute throughput, which is the flops of the chip. So, this is a hardware constant. So, this actually accounts for all of the compute time for all of the weight matrix multiplies. There's a little caveat here. We've sort of ignored the time to do any of the attention computation, but that, in general, will be quite small in comparison to this. So, we'll ignore this. Maybe I'll just interrupt from time to time to ask some very naive questions or to clarify some basic points.

4:17But, just for the audience, you're not serving one user at a time. The batch refers to the fact that you're serving many different users at the same time. And that's a whole batch. Yeah, so, I can motivate the batch at least a little bit. We will see exactly why batch is such a favorable optimization. But what will turn out to be the case is that if you do not batch together many users, the cost and the economics you get can be like a thousand times worse than if you do batch many users together. And we'll be able to see that quite explicitly.

4:47And then, the number of active parameters. This is saying, if I look at, for example, a DeepSeek model, the DeepSeek V3 model has about 37 billion active parameters and then 700 billion total parameters. So, we're focusing on just the ones that are active for a single token. Okay. So, we modeled compute performance. I'm going to keep writing equals. But in all of these cases, you can think of this time as being at least this much. And maybe there'll be some terms we ignored.
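As a rough sketch of the compute bound just described: the function name and the hardware throughput below are assumptions made for illustration; only the 37 billion active-parameter figure comes from the discussion.

```python
def compute_time_s(batch_size, active_params, multiplies_per_s):
    """Lower bound on the time spent in weight matrix multiplies for one decode step.

    Attention compute is ignored, as in the discussion above.
    """
    return batch_size * active_params / multiplies_per_s


ACTIVE_PARAMS = 37e9        # active parameters per token (DeepSeek V3 figure quoted above)
MULTIPLIES_PER_S = 1.2e16   # assumed hardware throughput, purely illustrative

for batch_size in (1, 256, 2048):
    t = compute_time_s(batch_size, ACTIVE_PARAMS, MULTIPLIES_PER_S)
    print(f"batch={batch_size:5d}  t_compute >= {t * 1e3:.3f} ms")
```

The bound is linear in batch size with no offset, which is the property the later plots rely on.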

Memory Fetch Time

5:13On the memory side, what do we need to do with memory? We need to fetch all of the weights. And so, there is some time to fetch all of the total number of parameters, not just the active parameters.

5:30So, there's weight fetch time. And then, in addition, there's a KV cache fetch time. This actually depends on batch size. So, for every element in the batch, we have to fetch an entire context length worth of tokens. And then, there's a size per token, so, like, bytes for one token. And that's a model parameter. And maybe, just backing up, let's explain what the KV cache is real quick.

6:01Yeah, so, when I do a forward pass, let me draw, actually, how the autoregressive inference works. So, this is doing decode.

6:10So, I have a bunch of tokens of text. I'm drawing a tensor because, ultimately, the tokens are represented as some tensor in some embedding dimension. And then, in this direction, I have the sequence length. The work of running a decode is, I have to run each token through a whole bunch of matrix multiplies over a bunch of different layers. And, in general, I'm going to have to do that work over all of these tokens.

6:44But then, one step of decode is actually to produce just this one additional token out here. And so, what I'm going to do there is I'm going to run a full forwards pass of multiplying by all of the weight matrices in the entire model. But then, I've got this attention mechanism where this token sort of, it's, like, looking at all of the past tokens in this way. And what is it looking at specifically? It is looking at some internal representation that the model has produced of the tokens. And we call that the kv cache.

7:16So, this process of attending, this single token attending to all of the history of tokens, that's attention. It is mostly dominated by memory fetches rather than matrix multiplies. So, we've got the amount of memory that we're fetching, shown over here. And then, of course, that's divided by the memory bandwidth. So, the memory bytes per second.
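The memory side can be sketched the same way. This is a minimal illustration, not a measured model: the 700 billion total parameters, the half-byte FP4 weights, and the 20 terabytes per second bandwidth are figures mentioned elsewhere in the conversation, while the KV bytes per token, the batch size, and the context length are assumed values.

```python
def memory_time_s(batch_size, context_len, total_params,
                  bytes_per_param, kv_bytes_per_token, mem_bw_bytes_per_s):
    """Lower bound on the memory-fetch time for one decode step."""
    weight_bytes = total_params * bytes_per_param              # fetched once per step
    kv_bytes = batch_size * context_len * kv_bytes_per_token   # fetched for every sequence
    return (weight_bytes + kv_bytes) / mem_bw_bytes_per_s


t = memory_time_s(
    batch_size=256,              # assumed
    context_len=4096,            # assumed
    total_params=700e9,          # total parameters, as quoted in the discussion
    bytes_per_param=0.5,         # FP4 weights
    kv_bytes_per_token=8192,     # assumed, depends on the attention design
    mem_bw_bytes_per_s=20e12,    # bandwidth figure quoted later for Rubin-generation HBM
)
print(f"t_memory >= {t * 1e3:.2f} ms")
```

Only the KV term scales with batch size; the weight term is the constant offset that batching amortizes.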

Plotting Latency and Cost

7:44So, in fact, these equations here are actually enough for us to now draw some fit lines. And so, the things that we'd like to look at are sensitivity to batch and also, which we'll draw separately, sensitivity to context length. So, we said that the big effect you can get is, like, some trade-off in latency versus cost in batch size. So, let's draw them out. I think there's just really two graphs we want to draw. We'll first just draw batch size versus time here.

8:18So, when we look at the shape of this, we've got a maximum of, well, the sum and then another term. So, let's look at these terms one by one and how they scale the time for compute and memory and how they show up. So, let's first look at this compute time. This is just purely linear in batch size with no offset. So, it is some curve like this. This is T compute.

8:52And then, on the memory side, we've got some portion here that is just constant, some base offset here, which is the weight fetch.

9:06Weight fetch. And then, finally, we have this term here, which is the KV fetch, which is linear in batch size. So, it looks like that. So, it's the sum of this plus this, maxed with this. So, let's at least first draw the sum.

9:38So, the two memory times in conjunction end up looking like this sloped line.

9:44And then, the overall maximum, which I'll draw a little thicker here, is the maximum of these two curves. Makes sense. Okay, so what does this mean, actually? So, this is a latency plot.
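The latency curve on the board is the maximum of the compute line and the summed memory line. A small sketch of that roofline bound follows; the dictionary layout and every hardware and KV number in it are assumptions chosen only to make the shape visible.

```python
def step_time_s(batch_size, context_len, hw, model):
    """Roofline lower bound for one decode step: max(compute time, memory time)."""
    t_compute = batch_size * model["active_params"] / hw["multiplies_per_s"]
    t_weights = model["total_params"] * model["bytes_per_param"] / hw["mem_bw_bytes_per_s"]
    t_kv = batch_size * context_len * model["kv_bytes_per_token"] / hw["mem_bw_bytes_per_s"]
    return max(t_compute, t_weights + t_kv)


# Assumed, illustrative values (not measurements of any real system).
HW = {"multiplies_per_s": 1.2e16, "mem_bw_bytes_per_s": 20e12}
MODEL = {"active_params": 37e9, "total_params": 700e9,
         "bytes_per_param": 0.5, "kv_bytes_per_token": 8192}

for b in (1, 64, 1024, 8192, 32768):
    print(f"batch={b:6d}  step time >= {step_time_s(b, 4096, HW, MODEL) * 1e3:8.2f} ms")
```

At small batch sizes the weight fetch sets a nearly flat floor; at large batch sizes the linear compute term takes over.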

10:02So, if I grow my batch size, I get initially some not very strong dependence on batch size. And so, there's some lower bound on latency here. Latency lower bound.

10:15Lower bound. So, this already partially answers the question. For a given hardware configuration, and then we can talk about varying hardware configuration, there is a lower bound on latency, which is simply that I need to read all of my total parameters from memory into the chips. And that takes a certain amount of time. If I use all of my memory bandwidth, I can't do any better than that. It seems like the way you've drawn the slopes for compute time and how the KV grows, and what implication the KV has on memory time, that, as batch size...

10:57Yeah, what if this were above or below? Yeah, or is that necessarily the case? Because if this is always true, then as batch size grows, compute always dominates KV, which suggests that if you have a big enough batch size, maybe memory is never an issue. Yeah, this is really sensitive to the context length. So, I think we should come back and explore this. Yeah. As you vary the context length, the KV fetch time will go up and up. And so, that'll cause a transition from compute limited to memory limited. And is there something especially significant about the slope being exactly the slope of the compute time?

11:34Yeah, whenever we have balance points, it kind of says that you're getting it exactly right. And so, for the particular context length where the slopes match, that says I am equally memory bound and compute bound, which is a really desirable place to be. Yeah, yeah. But suppose, and this is a very simple algebra problem, the optimal is 100K context length, and you go to 200K context length. Does your MFU go down to like 50%? Does it have a humongous impact on MFU?

12:06Yeah, it does. To be, like, slightly outside of the optimal context length range, the Goldilocks zone? That's right. So, that is true as modeled here. There's a key point here that I'm modeling the memory fetch as linear in context length. That actually depends on model architecture. It is true for many, or all, of the model architectures with dense attention. Yeah. Sparse attention actually scales much better than that. Got it. And is sparse attention what everybody uses in practice? I'm pretty excited about sparse attention.

12:37It's hard to know what the labs are using. DeepSeek has published a sparse attention mechanism. I'll just put in a plug for sparse attention: some of the DeepSeek papers that have published sparse attention end up putting a square root in this term.
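To make the 100K versus 200K question concrete, here is a toy calculation. It assumes the weight fetch is negligible, normalizes compute time to 1, and treats the fraction of the step spent compute bound as a stand-in for MFU; the function name and the "sqrt" option (modelling the square-root scaling mentioned for sparse attention) are simplifications introduced here.

```python
import math


def mfu_proxy(context_len, balanced_context_len, kv_scaling="linear"):
    """Compute time divided by step time, with compute time normalized to 1.

    KV fetch time equals compute time at balanced_context_len and grows either
    linearly (dense attention) or with the square root of context length.
    """
    ratio = context_len / balanced_context_len
    t_kv = ratio if kv_scaling == "linear" else math.sqrt(ratio)
    return 1.0 / max(1.0, t_kv)


print(mfu_proxy(200_000, 100_000, "linear"))  # dense attention: 0.5
print(mfu_proxy(200_000, 100_000, "sqrt"))    # square-root scaling: about 0.71
```

Under these assumptions, doubling the context past the balance point halves utilization with dense attention but costs only about 30% with square-root scaling.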

Sparse Attention and Model Quality

12:48Okay.

12:48So, so far we've looked at the latency. It's kind of hard to read off cost from this. So, what does cost mean? To run this inference, I'm going to use the GPU for a certain number of seconds, like one millisecond or 20 milliseconds or something like that. And I have to pay the rental cost for that time. So, like, it's $2 an hour per GPU or something like that.

13:13So that's the cost of this inference. But how many tokens have I processed during that inference? That is the batch size. And so what we actually want to plot is the cost versus batch size, which is like T over B versus batch size. This is the cost per token.

13:37So, we have to imagine dividing each of these three curves by B, so multiplying by this reciprocal. And so what we end up with is that the compute curve was linear; we divide by B, and that makes it a constant here. And this is T compute. The KV fetch was linear; now it becomes a constant as well. KV fetch.

14:10And then the weight fetch was constant, and now we divide by B, and so it becomes this hyperbola. And so, again, we're going to compute the max of the sum.

14:37So the sum of these two terms shifts the hyperbola up. The sum of the KV fetch and the weight fetch gives us a sort of higher hyperbola that's like this. And then we're going to take the max with the compute.

14:54Here. So we end up with this, this being the overall shape that we care about.

15:01So again, we see some limiting behavior. The cost initially starts very high at a batch size of one. Actually, it almost goes to infinity, because we've got so many weight fetches, which are not amortized over a large batch size. But then as we increase the batch size, the weight fetches become amortized over so many different batch elements that their cost grows very small. And eventually the compute time ends up driving the cost. So there is a limiting lower bound on cost.
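The cost-per-token curve can be sketched by dividing the step time by the batch size and multiplying by the rental price. The $2 per GPU-hour figure is the ballpark from the discussion; the three timing constants are assumed values for a single hypothetical GPU and are not taken from any real deployment.

```python
GPU_PRICE_PER_HOUR = 2.0   # dollars, the ballpark quoted above

# Assumed timing constants for illustration only.
T_WEIGHTS_S = 17.5e-3          # time to fetch all weights once
T_KV_PER_SEQ_S = 1.7e-6        # KV fetch time per sequence in the batch
T_COMPUTE_PER_SEQ_S = 3.1e-6   # matrix-multiply time per sequence in the batch


def step_time_s(batch_size):
    """Roofline step time: max of compute time and total memory time."""
    return max(batch_size * T_COMPUTE_PER_SEQ_S,
               T_WEIGHTS_S + batch_size * T_KV_PER_SEQ_S)


def cost_per_token(batch_size):
    """Rental cost of one decode step divided by the tokens it produces."""
    return step_time_s(batch_size) * GPU_PRICE_PER_HOUR / 3600.0 / batch_size


for b in (1, 100, 10_000, 100_000):
    print(f"batch={b:7d}  cost per token ~ ${cost_per_token(b):.3e}")
```

With these numbers the cost at batch size 1 is several thousand times the large-batch floor, which is set by the compute time per sequence.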

15:34Which is this one here. Yeah. So Claude Code slow or Codex slow or whatever would just live on this line. And it wouldn't help much because you're not able to amortize the KV values over a much bigger batch. Yeah. They're unique per batch. The compute is also unique per batch. And so what is the minimum work you can do per batch after amortizing everything else away?

16:01So this point where you are no longer memory bandwidth bound: practically, how big a batch do you need? How big are the batches practically for frontier models? You can just solve for that, actually. And it's not even particularly sensitive to model architecture. So let's go ahead and do that. So what we're asking is when the memory time is equal to the compute time. That's what that question is.

16:32For now, because we're focused on what the batch size is, and really the question is when the weights are amortized over the multiplies, I'm going to focus on comparing the weight fetch time to the weight multiply time. I'm going to disregard the KV fetch term just to simplify the analysis so we can get a kind of clean answer out. So we're going to equate these two times.

17:03Yep. So writing that out, we get number of total parameters over memory bandwidth is equal to batch size times number of active parameters divided by the compute performance. So looking over here, everything on the top, these are model parameters; everything on the bottom, these are hardware parameters.

17:37It turns out to be nice to rearrange them such that we have the hardware parameters on one side. So this is equivalent to flops over memory bandwidth being equal to batch size times number of active parameters divided by the number of total parameters. So this is a hardware parameter. This actually ends up being a dimensionless constant.
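Written out, the balance condition and its rearrangement look as follows. The symbols are shorthand introduced here for the spoken quantities: N_total and N_active are the total and active parameter counts, B is the batch size, W_mem is the memory bandwidth, and C is the compute throughput. As in the spoken version, the bytes-per-parameter factor is left implicit.

```latex
\frac{N_{\text{total}}}{W_{\text{mem}}} = \frac{B \, N_{\text{active}}}{C}
\quad\Longleftrightarrow\quad
\frac{C}{W_{\text{mem}}} = B \, \frac{N_{\text{active}}}{N_{\text{total}}}
```

The left-hand side of the second form depends only on hardware, and the right-hand side only on the model and the chosen batch size.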

18:10If you look in terms of flops, what are the dimensions of this? This is multiplies per second, this is bytes per second. So that's not quite dimensionless, but what you do is you say, let's say I'm doing FP4. So I count how many FP4 multiplies per second, times the fact that each FP4 is half a byte. And so I can actually make this end up being dimensionless, and this ends up being, on most GPUs, somewhere around 300.

18:46And sorry, has that ratio changed over time as we've gone from model generation to model generation where the flops keeps increasing? So that's a hardware parameter: to what extent has the hardware changed? So, from like A100 to H100 to B100, the flops has increased substantially. Similarly, the memory bandwidth has also increased substantially, and the ratio has remained reasonably stable. Okay. And we can express this one as well. This is a sparsity parameter. Yeah. And I might even phrase it slightly differently. Let's solve for batch size in total. So we're just moving this back over to the other side.
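Solving the balance condition for batch size gives a one-line estimate. The sketch below uses the roughly 300 hardware ratio and the 32-of-256-experts example from this discussion; the throughput and bandwidth values fed to `hardware_ratio` are assumed numbers picked to reproduce that ratio, and both function names are hypothetical.

```python
def hardware_ratio(multiplies_per_s, bytes_per_param, mem_bw_bytes_per_s):
    """Dimensionless ratio: parameter bytes the compute units can consume per
    second, divided by the bytes per second that memory can deliver."""
    return multiplies_per_s * bytes_per_param / mem_bw_bytes_per_s


def critical_batch_size(hw_ratio, total_params, active_params):
    """Batch size at which weight fetch time equals weight multiply time."""
    return hw_ratio * total_params / active_params


ratio = hardware_ratio(1.2e16, 0.5, 20e12)   # assumed FP4 throughput and bandwidth
print(ratio)                                  # about 300
print(critical_batch_size(300, 256, 32))      # 2400.0
```

Since only the ratio of total to active parameters enters, expert counts can stand in for parameter counts when the experts dominate the model.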

19:17We end up with batch size needs to be bigger than approximately 300 times sparsity. So, for example, in DeepSeek, I activate 32 out of 256 experts. So this would be like eight for DeepSeek. Got it. Okay. So this actually gives you a ballpark, which is remarkably accurate to practice. Generally people will go a little bit larger than this. They don't really want to be exactly at the balance point because real world efficiencies

19:47aren't as good as a roofline analysis would say. But, like, take this and maybe double it or triple it. Okay. So basically it's like 2,000 to 3,000 tokens per batch, but then if you included the KV cache, the implication would be that the optimal batch size should grow larger. So we solved for the equivalence between when compute time is equal to memory time. If I add in something that consumes more memory bandwidth, then I have

20:19less available for the weight loads. And so I need to grow the memory bandwidth more and therefore the batch size more. This seems incredibly small, like a batch. This would be like less than one sequence, right? Yeah. Okay. So keep in mind that I'm talking about the number of tokens that I'm generating one more token for. So it's actually 2000 unique sequences. Okay. You're thinking about a single forward pass on these sequences. Yes. So you think of it like the batch is the number of sequences rather than

20:50like, that's right. Okay. Cool. Yeah. To prep for Reiner, I chatted with two of Jane Street's engineers, Clark and Axel. Clark, who works on low latency trading systems, walked me through why Jane Street uses FPGAs to make sure that they have predictable nanosecond latencies. You can just build these like giant grids of compute very easily that do exactly what you need to touch a hundred megabytes of SRAM and then get your response back in tens of nanoseconds very easily. And that's basically impossible on a CPU. He then went on to explain why CPUs just wouldn't work for this kind of thing.

21:22And so if you have a clock that's going every three nanoseconds, you actually have several bytes of information at a time to make your decision. That's as opposed to a CPU where you'll just collect up a whole packet, you know, let's say a 1500 byte packet. And then you say, okay, this packet is ready. Here you go, CPU. You can start thinking about it now. FPGAs allow you to react to the earliest part of the packet as it arrives, rather than having to wait for the full thing. We also talked about liquid cooling, network design, and many other things. If you're interested in this stuff, Jane Street is hiring. You can check out their open roles at janestreet.com/dwarkesh.

21:55And if you want to watch the full prep conversation, we posted it there too. If you've got a frontier model and you are actually doing inference, surely they must have more than 2000 concurrent users. Yeah. Is there any added latency from the fact that you need to have the whole batch fill up? Or is it, if you have a reasonable amount of users, it's so unlikely that you wouldn't, it would not take you a hundred milliseconds to fill up the next 2000 slots? Yeah. The way to think about this, I guess we think of it as like, when does the train depart

22:26as a model? Yeah. I've picked a batch size that I'm going to run at. Maybe I pick, you know, this batch size. Yeah. And by the way, this intersection point is the same intersection point here. So I picked this batch size. I know that it's going to take, for example, maybe something like 20 milliseconds, which is a common place to end up landing. So this is a timeline of what is running on the GPU. It's going to start a new batch every 20 milliseconds, regardless.

22:57And so, sorry, this is 20, this is 40, I guess. You can think of this as a schedule for the train: a new train departs every 20 milliseconds. Any passengers who are ready board the train. If the train is full, then they wait for the next train. If the train is not full, the train's going to go anyway. And so in terms of what that means for queuing latency, it means that the worst case is that a request arrives just after the train departed.
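The train schedule can be sketched with a few lines of integer arithmetic. This toy model assumes a fixed 20 millisecond step, a decode step that takes exactly one interval, and trains that are never full; the function name is hypothetical.

```python
def decode_latency_us(arrival_us, step_us=20_000):
    """Time from a request's arrival until the step that carries it finishes.

    A new step departs at every multiple of step_us. A request boards the
    first departure at or after its arrival, and that step runs for step_us.
    """
    departure_us = -(-arrival_us // step_us) * step_us   # round up to the next departure
    finish_us = departure_us + step_us
    return finish_us - arrival_us


print(decode_latency_us(0))   # arrives exactly at a departure: 20000
print(decode_latency_us(1))   # just missed the train: 39999
```

The worst case approaches twice the step time: almost one full interval of waiting plus one interval of running.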

23:27It has to wait for the next train. So that's up to 20 milliseconds. And then it has to wait for that train to complete. And so the worst case latency is 40 milliseconds. Sure. How is the 20 milliseconds derived? I mean, it's a rule of thumb, but where it comes from is not fully explained yet. So far, we've focused on memory bandwidth and compute time. When we look at memory, the other consideration is that we want to use all of the memory capacity we have. And so generally, we're going to use all of that memory capacity to store the weights

24:00or the KVs. And so, in the time of doing a forward pass, maybe we want to read all of the memory capacity into the chip. And so that is capacity divided by bandwidth. That tends to be 20 milliseconds on many different generations of HBM. The units make sense. You would have bytes divided by bytes per second. Yeah. So for example, on, I think, the Rubin generation, it is something like 288 gigabytes divided by 20 terabytes per second.

24:34And this looks like it comes out to about 15 milliseconds.
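The capacity-over-bandwidth rule of thumb is a single division. The sketch below uses the 288 gigabyte and 20 terabytes per second figures quoted above as approximate; the function name is hypothetical.

```python
def hbm_sweep_time_s(capacity_bytes, bandwidth_bytes_per_s):
    """Time to read the entire HBM once at full bandwidth."""
    return capacity_bytes / bandwidth_bytes_per_s


t = hbm_sweep_time_s(288e9, 20e12)
print(f"{t * 1e3:.1f} ms")   # prints 14.4 ms
```

A decode step much longer than this would have time to read the weights and KV cache more than once, which buys nothing.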

24:42Yeah. Let me just make sure I understand what it's saying. I mean, I understand the units; I can do the sort of unit analysis. But what it's saying is, we can evacuate and replace the HBM in this amount of time. And so we don't want to be in a situation where the HBM is not big enough, so that we're not actually able to write everything we want to it or take everything out of it. Or we don't want to be in a situation where our ability to write back and forth is so big

25:16or so small compared. Yeah. There's sort of two scenarios. Why don't we pick a latency that is bigger than 15 milliseconds? If I think about what that means, it means I actually have time to read the HBM like twice. Yep. By the way, most HBM accesses are reads, not writes. It's almost all reads because the weight matrices are read only. And then almost all of the KV cache accesses are reads. So, let's say I run 30 milliseconds, I can read all of HBM twice. But what's the point of that? Like, I don't want to read the weight matrices twice.

25:46I don't want to read the KVs twice. Yeah, makes sense. Makes a ton of sense. Okay. So a couple of quick questions, actually. One, if it is the case that the optimal batch size is something like 2000, and is that actually true? It's totally dependent on sparsity. It's not dependent on the model size or anything. I mean, sparsity shows up in model size. But beyond that, it only depends on sparsity, not on scale. But that's a very interesting result. And that seems to imply that you can... One question is, how much of a push towards centralization is it that you would have these economies of scale from inference, from batching?

26:19Yeah. But it seems like it's not that big a deal. Like, I don't know, is 2000 users at the same time a lot? It doesn't seem like a lot. We can do a bit of analysis on this. You can think of it in terms of number of users, but maybe a more productive way to think of it is in terms of number of tokens per second. So what does this batch size mean in terms of tokens per second of the system? Tokens per second is going to be equal to the batch size divided by T: we run a batch's worth of tokens, and we do that every T, where T is the 15 milliseconds, 20 milliseconds number.

26:57So this ends up being batch size itself times about 60. So like 64 times B. And so this ends up being around 2000 times 64. So like 128K tokens per second. So this is sort of in more digestible units. Like, it's hard to reason about concurrent users, but what is the global traffic for a system?

27:29When you look at some of the announcements, sometimes the API providers will brag about how much traffic they have. The numbers that I remember from some announcements of Gemini last year were in the hundreds of millions of tokens per second worldwide. So this is about a thousandth of that range. But I mean, Gemini is big, so a thousandth of Gemini is actually a lot. To be competitive at scale, you need to be able to serve at least a thousandth of Gemini.
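The throughput arithmetic above can be checked directly. The batch size of 2000 and the roughly 64 steps per second follow the discussion; the global traffic figure is an assumed round number at the low end of "hundreds of millions", used only for scale.

```python
def tokens_per_second(batch_size, steps_per_second):
    """Each decode step emits one token per sequence in the batch."""
    return batch_size * steps_per_second


system_rate = tokens_per_second(2000, 64)     # 128000 tokens per second
ASSUMED_GLOBAL_RATE = 100e6                   # assumed: 100 million tokens per second
share = system_rate / ASSUMED_GLOBAL_RATE

print(system_rate)
print(f"share of assumed global traffic: {share:.5f}")
```

One balanced deployment therefore corresponds to roughly a thousandth of that assumed global rate.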

27:59Yeah, yeah. That's interesting. Cool. Okay, so the more sparsity you have, the less compute you need.

28:13And it does seem that as batch sizes get bigger, compute ends up being the bottleneck, according to this analysis. So then the question is, how far can you take sparsity? That is to say, as the sparsity ratio increases, as you have fewer and fewer active parameters relative to total parameters, how much is performance of the model degrading? And is it degrading faster than your saving compute by increasing the sparsity factor? Yeah, so performance, quality of the model, rather than speed of the model.

28:44Yeah. So unfortunately, we're not able to answer that analytically. That is an empirical question of model quality.

28:53Best I can do is pull up a paper and answer that empirically. Okay. Should we follow the paper now or so? Yeah. So this paper, this is Unified Scaling Laws for Routed Language Models. It's a somewhat old paper by this stage, but one of the things that they did is look at, if I keep increasing sparsity, what is the model quality impact? This answer is very sensitive to the actual choice of mixture of experts. Mixture of experts has been around for a really long time. I think it was even back in 2017.

29:21But the techniques have changed a lot. DeepSeek mixture of experts was a big change in how it worked. There have been older papers, which are GShard and Switch Transformer. So the actual empirical results are going to depend on all of that. But on one of the older techniques that is shown here, you can see that if I hold constant the number of active parameters at a certain size, and then I increase the sparsity, which they call expert count here, the quality keeps increasing. And then if you imagine drawing a horizontal line from 1.3B dense across, you end up seeing that, for example, in this case,

29:54the 64 expert, 370 million activated parameters model is as good as a dense 1.3 billion model. So in some sense, it's actually not amazing returns where you need to increase total parameters 100-fold to get the equivalent of 10x as many active parameters. Yeah, I mean, actually, even more so. Yeah, it's a huge increase in parameter count for a modest increase in... Yeah, so in this case, actually, it's, what is it, 4x? 64x for 4x. Yeah, so while it is true, I guess, that you get this benefit of being able to economize on your compute time

30:35if you increase sparsity, naively, it would seem like, oh, that's a trade-off worth making. But if this, you're decreasing this by 2x and then having this go up by 8x every time you double sparsity... So is that good or bad, actually? Even from a memory point of view, keep in mind, you are doubling this portion of the memory fetches, which is amortized by batch, and so just keep running a larger batch size.
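To make the "64x for 4x" exchange concrete, here is a rough back-of-the-envelope sketch using only the figures quoted from the chart. The total parameter count is an assumption: it treats every activated parameter as belonging to an expert, which ignores attention and other shared weights and so overestimates somewhat.

```python
# Figures as quoted in the conversation.
dense_params = 1.3e9      # dense baseline that the MoE model matches in quality
active_params = 370e6     # activated parameters per token in the MoE model
expert_count = 64         # what the paper calls "expert count"

# Assumption (not from the source): total ~= expert_count * active_params.
# Shared, non-expert weights are ignored, so this overestimates the total.
total_params = expert_count * active_params

effective_gain = dense_params / active_params   # quality-equivalent saving in active params
weights_vs_dense = total_params / dense_params  # stored weights vs. the equivalent dense model

print(f"effective active-parameter gain: {effective_gain:.1f}x")
print(f"approximate total parameters: {total_params / 1e9:.1f}B")
print(f"stored weights vs. equivalent dense model: {weights_vs_dense:.1f}x")
```

Under these assumptions the compute saving is roughly 3.5x, paid for with roughly 18x more stored weights than the dense model of equal quality.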

31:06From the point of view of the analysis we've done here, this is pure win. Keep doing it. Keep doing it until you run out of available users, basically. So there's actually this equivalence: if I have a lot of users, I can go to a much sparser model. So from that point of view, it's a reasonable trade-off. The other trade-off that shows up here is memory capacity. So far we've only reasoned about memory bandwidth, but sparsity also consumes memory capacity.
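The "more users lets you go sparser" equivalence can be sketched with a simple roofline-style model. This is a minimal sketch, not the exact analysis from the blackboard: it assumes compute time is 2 FLOPs per active parameter per token, memory time is one read of all weights per step, and it ignores KV cache traffic. The hardware numbers are hypothetical placeholders.

```python
def crossover_batch(total_params, active_params, flops_per_s, mem_bytes_per_s,
                    bytes_per_param=1.0):
    """Batch size at which compute time equals weight-fetch time for one step.

    compute time = 2 * batch * active_params / flops_per_s
    memory time  = total_params * bytes_per_param / mem_bytes_per_s
    Setting them equal and solving for batch gives the expression below.
    """
    sparsity = total_params / active_params
    hardware_ratio = flops_per_s * bytes_per_param / (2.0 * mem_bytes_per_s)
    return sparsity * hardware_ratio

# Hypothetical accelerator: 1e15 FLOP/s and 4e12 bytes/s of memory bandwidth.
FLOPS = 1e15
MEM_BW = 4e12

batch_8x = crossover_batch(8e9, 1e9, FLOPS, MEM_BW)    # sparsity factor 8
batch_16x = crossover_batch(16e9, 1e9, FLOPS, MEM_BW)  # sparsity factor 16
print(batch_8x, batch_16x)
```

In this model, doubling the sparsity factor at fixed active parameters doubles the batch size needed to stay compute bound, which is why the available number of concurrent users becomes the limit.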

31:36So let me just make sure I understood. You're saying we want to spend less time computing, therefore we do more sparsity. To make that work, we need bigger batch sizes, which means we need more memory capacity. Yeah, so... To have more sparsity. Yeah, so maybe this would be a good point to actually talk about how a mixture of experts layer is typically laid out on a rack of GPUs or something.

32:07Yeah, yeah, makes sense. Yeah, where were we? Sparse mixture of experts. Yes. Maybe how we lay that out on a GPU. Yep. So let's zoom in on the mixture of experts layer first and sort of draw what that looks like. So we typically will have some kind of a router layer, which is making the decision of where we route the tokens to. So we have tokens coming in here. They go through a router layer, and then we have a bunch of different experts.

32:43I'll draw a few more to line some up. And then the router will make a decision, which experts am I going to route to? And it'll be a small fraction of them, maybe one in 32. So maybe it'll make a decision to route to this one, maybe this one, and maybe this one.

33:04These experts, so each expert itself is a normal MLP. It has an up projection and then a down projection and a non-linearity in between. And then finally, we sort of do the inverse operation. So where we were broadcasting things out here, we're going to bring them back in and sum them up. So bringing them in like this.

33:28And then finally, we have our residual connections. The token is also passed through here, and it gets added to the result of the MOE layer. So this is a normal MOE layer. What I want to talk through is how this is mapped to a GPU rack and what this means for communication. Because I think this will start to show some of the limits of how sparse we can go. So the standard practice here, and it is the best solution, is to use expert parallelism.
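The layer just described (router, a few chosen experts, each a plain MLP, sum, residual) can be sketched in a few lines of dependency-free Python. This is a toy illustration under stated assumptions: the sizes are tiny, the weights are random, the non-linearity is ReLU, and the chosen experts are combined with softmax gate weights. None of those specifics come from the conversation.

```python
import math
import random

def matvec(matrix, vec):
    """Multiply a (rows x cols) list-of-lists matrix by a length-cols vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]

def rand_matrix(rows, cols, rng):
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]

class ToyMoELayer:
    def __init__(self, d_model=4, d_ff=8, n_experts=8, top_k=2, seed=0):
        rng = random.Random(seed)
        self.top_k = top_k
        self.router = rand_matrix(n_experts, d_model, rng)  # one score per expert
        self.up = [rand_matrix(d_ff, d_model, rng) for _ in range(n_experts)]
        self.down = [rand_matrix(d_model, d_ff, rng) for _ in range(n_experts)]

    def route(self, token):
        """Return the top_k expert ids and their softmax-normalised gate weights."""
        scores = matvec(self.router, token)
        order = sorted(range(len(scores)), key=lambda e: scores[e], reverse=True)
        chosen = order[: self.top_k]
        exps = [math.exp(scores[e]) for e in chosen]
        total = sum(exps)
        return chosen, [v / total for v in exps]

    def expert(self, e, token):
        """A plain MLP: up projection, ReLU non-linearity, down projection."""
        hidden = [max(0.0, h) for h in matvec(self.up[e], token)]
        return matvec(self.down[e], hidden)

    def forward(self, token):
        chosen, weights = self.route(token)
        combined = [0.0] * len(token)
        for e, w in zip(chosen, weights):  # only the chosen experts do any work
            out = self.expert(e, token)
            combined = [c + w * o for c, o in zip(combined, out)]
        return [t + c for t, c in zip(token, combined)]  # residual connection

layer = ToyMoELayer()
token = [1.0, -0.5, 0.25, 2.0]
experts_used, gate_weights = layer.route(token)
output = layer.forward(token)
print(experts_used, output)
```

The point of the sketch is the shape of the computation: the unchosen experts never see the token, so they cost no compute, but their weights still have to live somewhere.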

34:00So that means different experts go on different GPUs. So if we take something like a DeepSeek model, they have 256 experts.

34:10Let's say we want to run that on a Blackwell rack. So there are 72 GPUs.

34:16We have a divisibility problem. This is not a power of two. So we'll just like simplify and say we're only going to use 64 of them. Just ignore the other eight. It's not a big deal. And so we have four experts per GPU. Very simple.

34:33For the sake of the diagram, I'll actually just say, let's say we have two experts per GPU. So these are the GPU boundaries. Every pair of experts is on its own GPU. And then we can look at the communication cost. There are some tokens stored centrally here. They get routed to all of these experts. And so there is some communication cost paid here. There's the same communication cost paid on the output.
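The placement arithmetic above (256 experts, 64 of the 72 GPUs used, four experts per GPU) can be sketched as follows. The contiguous expert-to-GPU mapping and the function names are illustrative assumptions, not a description of any particular serving stack.

```python
N_EXPERTS = 256     # expert count quoted for the DeepSeek-style model
GPUS_IN_RACK = 72   # Blackwell rack size quoted above
GPUS_USED = 64      # power of two that fits; the other 8 GPUs are ignored

EXPERTS_PER_GPU = N_EXPERTS // GPUS_USED

def gpu_for_expert(expert_id):
    """Contiguous placement: experts 0-3 on GPU 0, 4-7 on GPU 1, and so on."""
    return expert_id // EXPERTS_PER_GPU

def destinations(routed_experts):
    """Distinct GPUs a token must be sent to, given the experts the router picked."""
    return sorted({gpu_for_expert(e) for e in routed_experts})

print(EXPERTS_PER_GPU)                 # 4
print(destinations([3, 17, 18, 250]))  # [0, 4, 62]
```

Because the router's choices depend on the token, the set of destination GPUs changes from token to token, which is what produces the all-to-all traffic pattern.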

35:03And then the hope is that this does not become communication limited.

35:09Now, what is the traffic pattern here? The traffic pattern here is that any GPU, in fact, will be talking to any other GPU depending on the decisions made by the model. So this is an all-to-all traffic pattern. So when you say any GPU, is the router on more than one GPU? Yeah, so I drew this as one router. In reality, you would actually have many copies of the router. And so you would have as many routers as GPUs, in fact. As the incoming traffic.

35:42Yeah. So these are 64 GPUs. These are 64 GPUs. It's actually the same GPUs. We just draw them as separate because they're serving different purposes. So at this point, any GPU can be sending to any other GPU. So this is the all-to-all pattern of communication that shows up, and how the Blackwell racks are configured is a perfect fit for the communication pattern that the MoE actually wants to do. However, if you think, maybe I want to do,

36:13like, maybe one rack is too slow and I want to do two racks, then I have this challenge that maybe I've got some sort of rack boundary drawn here like this. And I no longer, in fact, have all-to-all communication between all the GPUs in two racks. And so the rack-to-rack communication ends up being a substantial bottleneck. So the fundamental thing here is that one rack actually bounds the size of an expert layer you can do.

36:44And so this has been part of what's been driving towards larger and larger interconnect domains. Yeah. Before we go on, it may be worth you explaining what exactly a rack is, the differences in bandwidth between racks and within a rack, and the all-to-all versus not all-to-all nature of communication within versus outside. Yeah. And this is a place where it starts to be very different, in fact, between NVIDIA, for example, and Google, and then others, including us.

37:13So generally, a rack is a, it is a physical structure. It's a few meters tall, a meter or two wide, depends on configuration. And it stores some number of GPUs or XPUs, which is typically about 64. The, what constrains it being a certain size is power delivery, weight, and cooling ability. It ends up being about this size in many cases because of these physical constraints.

37:45So then when I deploy a data center, a data center may have thousands of these racks. So I've got one of these tall racks, it's got a bunch of GPUs in it, and so on. And then I put another rack next to it. You make it sound so easy. Yeah, right? I just, like, drop them in. In NVIDIA's case, the communication topology is that they put the GPUs on the outside of the rack,

38:15and then they put these switches on the inside of the rack. So what this ends up being is that there's a set of switches in here. These are the NVSwitches. Mm-hmm. And then they run a bunch of cables. Every single GPU has cables going to the switches in the middle. So

38:39every GPU goes to the switches in the middle, and then the switches have connections to all the GPUs, so all of the GPUs can talk to all the other GPUs in just two hops: going to the switch, then going to the other GPU. Now, when I want to leave the rack, I end up going via a different path. The GPUs also have a much slower connectivity, which is typically about eight times slower. So the green that I drew here, in the GPU case, is NVLink. More generally,

39:09it's called the scale-up network. You will typically also have a scale-out network, which allows you to connect to some data center switch. And then all of the GPUs will have some connectivity up to some data center switch somewhere. This is the scale-out network, and it tends to be about

39:42eight times slower in bandwidth.

39:47So the challenge, if you want to, for example, lay out a mixture of experts layer across two racks, is that half of the GPUs here are going to be wanting to talk to the GPUs here. And so, just on average, when I look at where the tokens on these GPUs want to go, half of the tokens want to go inside the rack. That's great. They can use the fast scale-up network. But half the tokens are going to want to leave the rack and go to the other rack. And that's not as good. They're going to need to use a much slower network.
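A rough sketch of why the cross-rack half dominates. It assumes routing is uniform across racks, that the scale-up and scale-out links run concurrently, and that scale-out has one eighth of the scale-up bandwidth as quoted above. The units are arbitrary and the function name is illustrative.

```python
def all_to_all_time(bytes_per_gpu, scale_up_bw, n_racks=1, slowdown=8.0):
    """Time for one GPU to send its all-to-all traffic.

    With uniform routing across n_racks, a fraction 1 - 1/n_racks of the
    traffic must leave the rack over the slower scale-out network.
    """
    cross_fraction = 1.0 - 1.0 / n_racks
    scale_out_bw = scale_up_bw / slowdown
    time_in_rack = bytes_per_gpu * (1.0 - cross_fraction) / scale_up_bw
    time_cross_rack = bytes_per_gpu * cross_fraction / scale_out_bw
    # Assumption: both networks transfer at the same time, so the slower one wins.
    return max(time_in_rack, time_cross_rack)

one_rack = all_to_all_time(1.0, 1.0, n_racks=1)
two_racks = all_to_all_time(1.0, 1.0, n_racks=2)
print(one_rack, two_racks)  # 1.0 4.0
```

Under these assumptions, spreading the same expert layer across two racks makes the all-to-all about four times slower per GPU, even though only half of the traffic crosses the rack boundary.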

40:19And so that becomes the bottleneck on the all-to-all pattern. A different choice would be, well, why don't I have a big switch here and connect everything to a much bigger switch that actually combines the two racks together? There are many ideas in this direction, but in general, the reason you have this sort of hierarchy of switches rather than one big switch is to manage the cabling congestion. You just need to run a large number of cables. Sorry, is this,

40:50is that question you just asked basically, why isn't there a bigger scale-up? Yeah, exactly. Why not just have a million chips in scale-up? What has changed that has allowed NVIDIA to go from Hopper, which was eight, to Blackwell, which is 72? And now Rubin will be, I don't know, is it 500 or something? Yeah, 500 or something. Yeah. What has allowed that to happen? From Hopper to Blackwell, it is mostly just the decision to switch from trays as the form factor.

41:21One of these is a tray. Just switching to racks as the form factor. That's a product decision. Yeah. There wasn't a substantial technical barrier there. Switching from 64 to 500 or so, there's a bit of Jensen math there, but there is at least a genuine 4x increase, which is coming from a much more complicated and difficult rack design. And so that is actually new physical design to run more cables.

41:51And the cable complication is just the cost of figuring out which cable hops to which, or like which signal. Yeah. I mean, let's sort of zoom in on this and look at the wire density. I'll draw this diagram once more, so we have a cleaner and larger version to work with. Let's say I have some switches in the middle. Yep. And let's say initially I'm going to start with just two GPUs on each side, or two trays of GPUs on each side. And let's say maybe

42:22each tray wants to have two cables coming out of it. So I physically run vertical cables that look like this, running onto the switches. Now if I want to double the number of GPUs in a rack, I need to run literally twice the density of cables. So I need to run these as well.

42:51This may be a naive question, but if you look at a physical data center, it seems like there's a lot of space within a rack. I don't know, it's just that the cables are really big and... Yeah. So there is space outside the rack. Inside the rack, as they become more optimized, these racks are very tight. So there's connector density going from the tray into the rack's backplane. And then the backplane itself has a really high density.

43:22There are other physical constraints, including the bend radius of cables. You don't want to snap them, and so on. Yeah. Okay. So it's literally the physical space to put a cable that's constraining it. Yeah. I had no idea. Interesting. That seems surprising that, although the rack is so big, we can't just stuff more cables in there. Yeah. So, I mean, rack design is not my expertise, but when I talk to folks about what constraints they're up against, it's a combination of things. What are the big physical things you're optimizing for? Space,

43:52weight of the rack. It's actually really heavy. And so you need enough metal to not sag and fall. But then you add more metal and it's heavier. And then power and cooling. Modern racks are pushing all of those to very extreme physical limits.

Deep work is by its nature quite aversive. So even things which seem like work, like Slack and email, can be easy ways to distract yourself. So I often wish that I could just turn the internet off. But if I'm prepping for an interview, even if I have the papers and books on hand,

44:24it's still super useful to be able to do a back and forth on the LLM. So I can break down concepts and research follow-ups. Google's new Gemma 4 is the first open model that allows me to have this kind of fully disconnected focus machine. It's small enough to run on my laptop, but good enough to actually be useful. So to prep for this episode, I downloaded Reiner's scaling book and shut off the internet. I was able to have Gemma help me understand the material and answer my questions. If you want an LLM that you can run locally on your laptop or even your phone, you should check out Gemma 4.

44:54When was GPT-4 released again? It was 2022 or 2023? Three. Three. Yeah. And it was rumored to be over one trillion parameters. And it seems like only now, within the last six months, have models been getting released that have significantly more parameters than the model released three years ago, when supposedly there should have been this scaling in the meantime. Is the reason that we were just waiting for racks with enough memory to hold a five trillion parameter model along with its KV cache for enough users, for a lot of sequences?

45:30Or if you're doing RL, kind of a similar consideration of actually holding the KV cache for the whole batch of problems you're trying to solve. So if you look at Hopper, you had eight Hoppers, and I think that's 640 gigabytes as of 2022. With Blackwell finally, which was deployed, what, 2020? Very recently, maybe last year. Last year? Yeah. You finally have a scale-up domain with on the order of 10 to 20 terabytes, which is enough for a 5T model plus KV cache.
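The capacity argument can be checked with simple arithmetic on the figures quoted in the question. This is a sketch only: one byte per parameter is an assumption (it corresponds to 8-bit weights), and the 10 to 20 terabyte range is taken as stated rather than from a spec sheet.

```python
def kv_cache_headroom(total_params, domain_bytes, bytes_per_param=1.0):
    """Bytes left in a scale-up domain for KV cache after storing the weights.

    A negative result means the weights alone do not fit.
    """
    weight_bytes = total_params * bytes_per_param
    return domain_bytes - weight_bytes

MODEL_PARAMS = 5e12          # the hypothetical 5T-parameter model from the question

hopper_domain = 640e9        # eight Hoppers, as quoted
blackwell_low = 10e12        # low end of the quoted Blackwell range
blackwell_high = 20e12       # high end of the quoted Blackwell range

hopper_headroom = kv_cache_headroom(MODEL_PARAMS, hopper_domain)
low_headroom = kv_cache_headroom(MODEL_PARAMS, blackwell_low)
high_headroom = kv_cache_headroom(MODEL_PARAMS, blackwell_high)

for name, headroom in [("Hopper x8", hopper_headroom),
                       ("Blackwell (low)", low_headroom),
                       ("Blackwell (high)", high_headroom)]:
    status = "fits" if headroom >= 0 else "does not fit"
    print(f"{name}: {status}, headroom {headroom / 1e12:.2f} TB")
```

Under the one-byte-per-parameter assumption, the weights alone overflow the eight-GPU Hopper domain by several terabytes, while the larger domain leaves 5 to 15 terabytes for KV cache.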

46:02Yeah. Deploying in larger scale-up domains is a huge unlock. I mean, I've drawn here the sort of NVIDIA Blackwell deployment. The Google deployment has actually had very large scale-up domains for a long time. And does that also explain why Gemini seemed to be ahead? Like, was it Gemini 2.5 that was successful? It just seems like Gemini has had a successful pre-train for longer than some of the other labs. Not having been there at the time, I'm not sure how much is coming from successfully deploying higher sparsity ratios, which it could be. It could also be, I mean, there's a whole bunch of actual modeling things, like, specifically, how do you do the mixture of experts?

46:39Yeah. We've seen that the DeepSeek mixture of experts, which said to activate more experts but finer-grained experts, was a big innovation. I'm sure that there are many other innovations on the model architecture, as well as on the training data. It's kind of hard to disentangle all of them. But what shows up in terms of the limits of what you can do is that the active parameters, as we saw, are limited by the compute cost. And then the total parameters are limited by the scale-up size.

47:11Yep. When you're operating within a single scale-up domain, is that a consideration specifically for either forward or backward? Or specifically for pre-fill versus decode? Or is it preferred to always be within a scale-up, whatever kind of workload you have, whether you're doing a pre-training run or whether you're doing RL generation or whether you're doing inference for users?

47:41Yeah, really interesting. So, okay, so to answer that question, we're going to need to talk about the communication patterns that, so we've talked about the mixture of expert communication pattern. That is this all-to-all.

47:56There's all-to-all.

48:00All-to-all. All-to-all very strongly favors full connectivity, which is what we've kind of just shown here, and favors being within one rack.

48:12There are other kinds of parallelism besides expert parallelism, which we just showed here. In the literature there is tensor parallelism. With the trend towards smaller experts, this has become much less relevant, so we can ignore that. But the other two things that we have available are data parallelism and pipeline parallelism. And they can actually be a much better fit for using multiple racks. So, let's focus on pipeline parallelism specifically. This is one layer of MoE.

48:44I'm going to have like 100 more layers up above.

48:48I could decide at this point, for example, to move to a different rack, to change racks. Now, is that going to become a communication bottleneck? So, we can actually just solve for when this becomes a communication bottleneck. But before we do that algebraically, let's just sort of visualize it and sketch the path. So, this is another MoE layer, and we're going to have another MoE layer here, and so on. So, let's say I change rack here, and then some number of layers later, I change rack here as well.

49:28So, the methodology that we're going to use to determine whether we have a communication bottleneck at this point where we change rack is that we're going to compare the scale-out bandwidth requirements to the scale-up bandwidth requirements.

49:52So, let's try this, I mean, the hint is going to be that there's a lot more sends here, like we're sending many things here, whereas we're only sending one thing here, and then we're also maybe doing it many times. So, that's going to be the, what makes the difference. Can I try to guess? Yeah. Just out of curiosity to see if I'm actually understanding. It seems like you're sending, like, batch size into the rack. In here? Yes. But the communication within a rack is sort of batch size times number of GPUs.

50:27Yeah, so number of activated GPUs, right? So, like, I don't send to this GPU at all, right? So, there's an explosion from one to, like, three times larger here in this diagram. Yeah. The key thing is that I didn't even need to send to this GPU at all, and so that's a big saving. I see, yeah. Okay, so we're going to talk through to what extent scale-up is a bottleneck over scale-out. So, we will directly jump to the ratio of the time spent on scale-up

51:08over the time spent on scale-out. So, this is the quantity we're talking about.

51:18And the first consideration is that scale-up is eight times faster than scale-out, generally. And so, at a baseline, if the amount of data sent were the same, we would have this one over eight, which is coming from the bandwidths.

51:37But then we have some amount of expansion in how much data we're sending. So, if one token comes in here, then this one token gets routed to, in the DeepSeek case, maybe 32 experts or 16 experts. It gets routed to some number of experts. So, this is the number of activated experts. And then the same thing applies on multiple different layers.

52:15So, maybe I'm going to run two layers. So, there's also multiple times, number of layers per stage.

52:28And there's a need to multiply the whole thing by two for the up and down. Yes, yes, and there's a factor of two. Thank you.

52:37So, what we would like is for the scale-up time to be greater than or equal to the scale-out time, because the scale-up time is the more important and precious resource. So we would like this number to be greater than or equal to one. And this really doesn't seem hard. There's just a factor of eight that we need to overcome. So, we need the product of these three things to be bigger than eight.
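Putting the factors from the board together, the ratio is (1/8) × (number of activated experts) × (layers per stage) × 2, and the goal is for it to be at least one. A small sketch of that arithmetic follows. The function names are illustrative, and the factor of eight is the typical bandwidth ratio quoted above rather than a property of any specific system.

```python
import math

def scale_up_to_scale_out_ratio(activated_experts, layers_per_stage,
                                bandwidth_ratio=8.0):
    """Time on scale-up divided by time on scale-out for one pipeline stage.

    1/bandwidth_ratio : scale-up is that many times faster than scale-out
    activated_experts : each token is sent to this many experts per layer
    layers_per_stage  : the all-to-all repeats once per layer within the stage
    2                 : tokens are sent out to the experts and brought back
    """
    return (1.0 / bandwidth_ratio) * activated_experts * layers_per_stage * 2

def min_layers_per_stage(activated_experts, bandwidth_ratio=8.0):
    """Smallest whole number of layers per stage that makes the ratio >= 1."""
    return max(1, math.ceil(bandwidth_ratio / (2 * activated_experts)))

print(scale_up_to_scale_out_ratio(8, 1))  # 2.0
print(min_layers_per_stage(8))            # 1
print(min_layers_per_stage(2))            # 2
```

With eight activated experts, a single layer per stage already clears the bar, so crossing a rack boundary between pipeline stages does not become the bottleneck in this model.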

53:00Typically, we have a fairly large number of activated experts. It could be eight by itself. And then we can increase the number of layers per stage a lot until we satisfy this. I see. So what this ends up looking like is that I can, in fact, have an entire pipeline of racks, where one rack does one layer, and then I move on to the next rack and do another layer, and then I move on to the next rack and do another layer. It's interesting to me that the best parallelism strategy in practice
