Steadcast
Dwarkesh Podcast cover art
Dwarkesh Podcast

Eric Jang – Building AlphaGo from scratch

May 15, 20262h 37m · 29,825 words

Show notes

Eric Jang walks through how to build AlphaGo from scratch, but with modern AI tools. Sometimes you understand the future better by stepping backward. AlphaGo is still the cleanest worked example of the primitives of intelligence: search, learning from experience, and self-play. You have to go back to 2017 to get insight into how the more general AIs of the future might learn. Once he explained how AlphaGo works, it gave us the context to have a discussion about how RL works in LLMs and how it could work better – naive policy gradient RL has to figure out which of the 100k+ tokens in your trajectory actually got you the right answer, while AlphaGo’s MCTS suggests a strictly better action every single move, giving you a training target that sidesteps the credit assignment problem. The way humans learn is surely closer to the second. Eric also kickstarted an Autoresearch loop on his project. And it was very interesting to discuss which parts of AI research LLMs can already automate pretty well (implementing and running experiments, optimizing hyperparameters) and which they still struggle with (choosing the right question to investigate next, escaping research dead ends). Informative to all the recent discussion about when we should expect an intelligence explosion, and what it would look like from the inside. Watch on YouTube . Read the transcript . And check out the flashcards I wrote to retain the insights. Sponsors * Cursor ‘s agent SDK let me build a pipeline to generate flashcards for this episode. For each card, I had an agent read the transcript, ingest blackboard screenshots, generate an SVG visual, and run everything through a critic. A durable agent is much better at this kind of work than a chain of LLM calls, and Cursor’s SDK made it easy. Check out the cards at flashcards.dwarkesh.com and get started with the SDK at cursor.com/dwarkesh * Jane Street gave me a real deep-dive tour of one of their datacenters. I got to ask a bunch of questions to Ron Minsky, who co-leads Jane Street’s tech group, and Dan Pontecorvo, who runs Jane Street’s physical engineering team. They were willing to literally pull up the floorboards and take out racks to explain how everything works. Check out the full tour at janestreet.com/dwarkesh Timestamps (00:00:00) – Basics of Go (00:08:17) – Monte Carlo Tree Search (00:32:04) – What the neural network does (01:00:33) – Self-play (01:25:38) – Alternative RL approaches (01:45:47) – Why doesn't MCTS work for LLMs (02:01:09) – Off-policy training (02:12:02) – RL is even more information inefficient than you thought (02:22:16) – Automated AI researchers Get full access to Dwarkesh Podcast at www.dwarkesh.com/subscribe

Highlighted moments

In AlphaGo, you don't train the policy network to imitate the MCTS action. You train it to imitate the MCTS distribution.
Jump to 2:19:28 in the transcript

Transcript

Introduction to AlphaGo

0:00Today I'm here with Eric Zhang, who was most recently vice president of AI at 1x Technologies, before that senior research scientist at what is now Google DeepMind Robotics. And you've been on sabbatical for the last few months. One of the things you've been doing is rebuilding and improving and hacking on AlphaGo. And so today what we're going to do is you're going to explain building AlphaGo from scratch and what it tells us about the future of AI research and development. But before we get to that, why is AlphaGo interesting? Why is this the project you decided to do on sabbatical rather than just hanging out at the beach?

0:33Sure, yeah. I like making things. And AlphaGo and GoAI is one of those things that really got me into the field. When I saw the kind of early breakthroughs on AlphaGo in 2014, 2015, 2016, and so forth,

Deep Learning Breakthrough

0:46it was just profound to see how smart AI systems could become and the kind of computational complexity class that they could tackle with deep learning. This is a problem that has long been understood to be kind of intractable for a search, and yet it was solved through deep learning. And so that was quite mysterious to me, and I've always wanted to understand that phenomena a little bit better. My training is often in deep neural nets for robotics, where the decisions

1:17made by the neural networks are a bit more intuitive. But AlphaGo is a sort of problem where the decisions are actually the result of a very, very deep search. And it's always been very mysterious to me how like a 10-layer network can sort of amortize the simulation of something so deep in the game tree. Yeah, interesting.

Compute Reduction

1:35So if you plot out how much compute it took to build various iterations of strong GoBots over the years, you can see that in 2020, there was a open source project called KataGo by David Wu from Jane Street, who basically achieved a 40x reduction in compute needed to train a really strong GoBot tablet rasa. I'm not certain if it's stronger than AlphaGo Zero or AlphaZero or MuZero, but it's very, very strong. And this is what most Go practitioners today train against when they're playing an AI. And thanks to LLM coding, what took a whole team of research scientists at

2:07DeepMind and, you know, millions of dollars of research and compute can now be done for, you know, a few thousand dollars of rented compute. By the way, if you're listening to this on an audio platform, this is a Blackboard lecture, so I highly recommend switching over to a video platform like YouTube, if you can, to look at the math and the graphs and the Go board.

How Go Works

2:26Okay, I guess we should first discuss how Go works. Great. So, yeah, how does the game work? Um, so the game of Go is a very simple one that can be implemented, uh, quickly and easily in a computer. The, the objective of the game is basically to put down black and white stones and try to occupy as much territory in the game as possible. So I might start by putting down a black stone. Uh, black always goes first. So go ahead. And so the way you capture an opponent's stones is that for every, um, intersection, if you can surround all four of its neighbors with, um, with your stones, then, um, then this one is sort of cut off from oxygen, if you will. And then

3:02it, uh, it, it, it is, it is a dead, dead stone. So, so then now I control these four stones as well as this empty intersection here. So there's like slight variations between Chinese, Japanese, and what is called Trump Taylor rules. Um, Trump Taylor rules are designed to be completely unambiguous for Go. So this is what all Go AIs train against and resolve against. So in typical Go, like when the humans play, you're actually not allowed to put this white stone down here. It would be instant suicide. Um, in Trump Taylor, it's actually fine. You put it down and then it immediately resolves to death. So the outcome is sort of the same. Let's go ahead and start over and, and play, play a few

3:36stones and then I'll explain some more. So I'll just start there. All right. I'm like basically playing randomly here, but I'm trying to get around your stones and see if I can close by them. Sounds good. Yep. Hmm. So this move, um, basically exposes one empty neighbor for your white stone and it's very akin to a check in chess, where if you don't respond immediately by putting one here,

4:08then I can immediately capture this. Ah, I see. Okay. Cuz it is, it is sort of the diagonals that determine whether you're grounded in it. The cross section, not the diagonals. So, so this one is surrounded on three sides. Yep. And so, um, you're at threat of losing that stone if you don't play one immediately there. Now you can see that I'm starting to pressure you because by putting a stone, uh, here, now you are forced to, um, put one here. Otherwise, you would have this two block. Yes. To yourself. And then if you think, think through, like, what happens if you were to respond here, you can probably, you know, search into the future and deduce what I'll

4:41do in response, uh, once you, once you do that. You have a lot of confidence in my abilities, but I'm guessing you'd put the black here. That's right. And then I would capture all three of the stones. So I should just assume that this is gone. This little block is gone. Yes. So in Go, it's actually okay to let an opponent capture, um, some stones if, for example, it allows you to position, uh, to capture more stones in somewhere else on the board. Yeah. And, and this is what makes Go a very beautiful game is that, um, you can kind of, uh, lose the battle but win the war. Interesting.

Game End Conditions

5:08Right. And, and as the board size increases, the complexity of these kind of, like, micro versus macro dynamics, uh, gets, gets more interesting. Presumably you'd put one here. Mm-hmm. And so now I would capture this entire group. Okay. And this, this would be mine. Okay. There's one more, um, uh, case that I want to demonstrate, which, uh, actually I had a bug in my code, uh, recently, which is the following situation. So let's consider a formation like this, right? And then, you know, we have other pieces on the board in play or whatever. Um, and so, um, let's talk a little

5:41bit about how the game ends, right? Um, in this territory, who controls these, these areas? Is it white or is it black? White. It's actually black because I have actually surrounded this whole area. Yeah. Yeah. And it's, um, very, uh, assuming I have, like, other black stones here, it's actually very hard for you to break this out of the control of these stones. Yeah. So when the final score is tallied, would, would these ones also count as being in, um... Yeah. Great question. So, so, um, this is where different rule sets have different ways of scoring. And so we should talk a little bit about how, like, uh, you resolve scores between humans and how

6:15you resolve scores between a computer code. Mm-hmm. Uh, because there's actually some ambiguity in how humans evaluate this. So most humans would look at this board configuration and conclude that, like, black has kind of totally surrounded white. And so white has no chance of life. Yeah. We could play out more here, but then at the end, I would capture everything. Yeah. Um, however, if you have a way of breaking this formation and connecting white to something outside of it, then it can flip, right? Yeah. And so this is where it's, you know, a little bit hard for a computer to decide these kind of things, right? So how do humans do it, right? Like, it's, it's worth thinking a little bit about how

6:46humans resolve this because this will actually map later to how we think about the deep neural network. Um, humans basically say, uh, I think the game is done. And then you, you have to also say, I think the game is done. And then we'll say, like, I think these are, these are my stones. And then you have to agree. If you don't agree, then we keep playing. Yeah. So, um, essentially once two humans, their, uh, so-called value function, um, agree on a consensus, then the, then the Chinese rules, uh, result that. Yeah. Interesting. So in Trump Taylor scoring, um, it's perfectly unambiguous. So it can be decided, you know,

7:19algorithmically by computer. So if, if let's say you, you have this at the end, end game, the way you score this is that you first count how many stones you control and that's unambiguous. Then you count how many empty intersections that are not touched by your opponent's stones. So these intersections would not count for either player because both of all of these intersections are connected to both white stones and black stones, right? If, um, this were like this, then white would get three points. Now, this is, uh, a little odd because a human would know that

7:52white is actually losing these points. Yeah. But, um, Trump Taylor scoring would consider white to have all of these points as well as these points. Got it. Okay. Right. So, so that is a very big difference in, um, how computer go scores things and how humans score things. How does the game end? The game ends when either a player chooses to resign or both players pass consecutively. Cool. Yep. So that's the rules. Nice. All right. Now help me correct this with AI.

Neural Network Explanation

8:17Great. Okay. Let's understand how, um, AlphaGo actually works and how somebody in the audience might be able to implement it. Great. Yeah. Let's start with, um, kind of an intuition about the underlying, um, you know, search process used to make moves. And we'll layer on, uh, ideas from deep learning to make it much more efficient and tractable. So, Go is a game where there's just two players. We're gonna draw a person here and we're gonna draw an AI here. And, um, let's say this

8:48person is playing black, so they go first. So, we're gonna draw, they go here. And then now the AI, um, is going to make a move based on what it sees here. So, there's a question of, like, how you encode these inputs into the AI. Maybe you could use ones and zeros, but you want to represent, um, you know, black, white, and empty. So, so you would need at least three different values here, right? So, maybe you could use zero ones and twos or something. So, so the AI might see something like, you know, zero, zero,

9:20zero, zero, uh, one. Great. So, so this is the input to the AI, uh, on its turn. Yeah. So, so the AI can choose, let's just pick three possible random moves that can go, and I just drew these at random. And so, which, which move is best here, right? Well, we don't know until the game ends. Um, there's no, Go does not have any kind of local reward of which move here is good. And this is what makes Go a very

9:52difficult game, is that you don't actually know who won until you really get to the end of the game. So, how deep is this tree, right? Well, in a 19 by 19, um, Go board, there are, uh, you know, roughly to the order of 361 moves on any given, uh, move. And of course, as it fills up, you have less moves. Um, and, and the, the number of steps in the game can be somewhere from 250 to 300 moves. And maybe experts might, uh, decide to end the game, uh, well before that. But, uh, you know, under Trump Taylor's scoring, you actually have to play things all the way to the end. So this could be like 300 moves

10:25or something. Right. So like 300, um, like depth of the tree. Yeah. So if you keep on expanding possible moves here, so in, in this move, the AI is going, and then, you know, here the human would go.

10:39And then, you know, there's, there's some

10:47and so forth. You can find that like essentially what you end up with is an enormous explosion in the possible game outcomes originating from just this one state. So this is something to the order of like, you know, 361 to 300, power of 300, which is far more than the number of atoms in the universe. Right. Like it's, it's just, uh, it's just, and of course, actually there are redundancies and symmetries. So it's not actually 300, but, but that's sort of the, if you were to do a naive tree

11:17where there were no merging of children, then actually you end up with a tree about this big. What do you mean by merging of children? Right. Let me, uh, use this board here. So if we start here, and then you play here, and then, um, I play here, and then you play here, that is equivalent to, I start here, you play here, I play here, and then you play here, right? Yeah. So, so both of them arrived at the same spot, but through different paths. So this child node can be thought about as a shared ancestor. Yeah. And I guess it's not 361, it starts at 361, but it decreases by one each time.

11:50And the branching factor decreases by one each time. Yeah. Yes, yes. But in any case, this is a very, very, very large tree. Yeah. And, uh, this is also why, you know, computer scientists for many years thought that Go was not a tractable problem this century, because the amount of compute you would need to exhaustively search every possible possibility, um, is just too large. Yeah. Um, if you could, Go is actually a deterministic game. So, um, on any given state, you can actually compute what the, uh, best possible strategy you can, uh, you can make is, uh, in order to win the game. You can search all the possible futures where you win, and then just make sure you always

12:23stay in, in, in that, you know, set of futures. Um, so AlphaGo's kind of core conceptual breakthrough was using neural nets to make this search problem tractable. So, before we get into, you know, how neural networks are involved, let's talk a little bit about how we can, you know, assuming we had a very powerful enough computer, um, search this, uh, this tree to find the best move. Right? So, in the beginning, um, you're not gonna build out the whole tree, uh, because storing that tree would be very expensive. Instead, you might do something like interactively figure out which, um,

12:56which leaves of this tree are worthy of exploring and expanding into the future to see, you know, what else is there. So, um, there are some early algorithms in, uh, bandit literature like, you know, UCB, uh, one, which is not exactly appropriate for a, you know, sequential game like Go, but very much inspired the action selection, um, algorithm used in, uh, in AlphaGo. So, so UCB one looks like, um, on every move, we're gonna take the best action, uh, or, you know, the arg max over a that maximizes,

13:29um, um, um, you know, the Q of A, and I'll explain what Q of A is in a moment, plus some sort of exploration bonus. So, on every node, we're going to track a few quantities. So, so let's, you know, consider each of these a node. This is, this is the, the root node, um, where you're making decisions from, um, and these are the children of the root node. And, um, we're gonna say each node is basically

14:03a data structure that is, um, it stores a visit count of this, um, this node, this, this child node. Is how often the parent visited this node. Yes. And we'll call this an action. So, so one thing that is easy to trip on and is like, if you come from, uh, you know, robotics or, um, other kinds of reinforcement learning is like, where are the actions, right? I'm only talking about nodes. Nodes here represent states. And because this is a perfectly deterministic game with no randomness,

14:35you can actually just infer the action based on the child. So, so if I go here, that implies an action, and this is the state that we resolve it, right? So, so the LLMs, if you ask to, uh, you know, vibe code a, uh, MCTS implementation, it'll most likely design the right data structure here. But, um, you know, it's up, it's sort of a chef's choice. You can actually rewrite the, the tree structure however you like. This was what, um, Claude 4.6 wrote for me when I, when I asked it. And it was a very reasonable choice. So, um, so then, you know, Q represents the, um, mean action

15:10value of this action. And I'll use a subscript a to denote that this kind of corresponds to taking a specific action to, to get here, right? From, from the, from the root node. Um, so, so like, uh, if we, if we have root, basically taking a gets us to this, this node here. Um, and then we're going to also store the probability of taking this action. Again, from the parent. From the parent, yes. Like, like what are the odds that we sample this one? Yeah. And, and this will become relevant

15:41later. You know, like, uh, we've, we've talked about a deterministic tree for now. So I'll, I'll bring probabilities into this later. And then finally, we have a sort of, uh, dictionary of children.

15:51Which is just like, uh, you know, more of these nodes. You know, in a, you know, sort of classic linked list style reference tree. So, um, this is the basic data structure to implement a tree. And, um, in AlphaGo, um, they use a slightly different action selection criteria called, um, Pucked. And it's short for predicted upper confidence with trees. And, uh, this is basically, um, when you, when you select which, which child to take, you do argmax a of

16:23q of s a plus, uh, constant. So the equation and forms are actually pretty similar. Um, these are both scoring criteria, right? Like you want to argmax this quantity and you want to argmax this quantity to determine which action to take. So let's break down the intuition of like, how you select actions here. This is the

16:56mean action value. So how good is a given child on average? Um, and, and, and if you actually, you know, knew the whole tree, then, uh, this is all you need, right? To select the best action. You don't really need to do more than that. But if you're interactively building this tree as you're, figuring out what the q values should be, then what you have to have to do is occasionally try some other actions, you know, as a sort of explore versus exploit trade off. So, um, in both UCB and, uh, Pucked, there is this term here that basically rewards, um, taking actions that you haven't taken

17:27before. So as we mentioned before, each node stores the visit count of taking that specific action, right? So everything is initialized to zero. And so for a given action, let's just say like call it and like action a, um, initially it's zero. And, and so as n is increasing, if, if let's say we've already made 10, um, 10, uh, action selections from that root node, uh, but we haven't picked a yet, then this term actually starts to become quite large for a. Yeah. Right. And conversely, if we have chosen a 10 times out of 10, then now this term is quite small. It diminishes very quickly.

18:02And the same thing is actually true here. Just make sure I'm understanding it. Maybe I can, uh, put it in my own words. Let's just focus on UCB. What we're saying here, you can think of it conceptually as two different things, the Q and then this exploration term. Let's just be clear about what Q is. Q is basically saying, Hey, once we do these rollouts, so you're actually running all these simulations, you go down the tree and then you figure out, okay, if I end up at the terminal value of this tree, do I win this game or not? And then you do this,

18:35you average whether I win this game or not across all the, you know, the leafs of this tree starting from this node, that average you put in Q. Correct. And so you're saying the Q is basically representing, will I win this game or not? Uh, what does this probably believe I'll win this game starting at this node? That's your sort of, um, that is your sort of exploit. That is like saying, I've run these simulations. I think this is a good move or not. And then this other term is saying, um, have I explored this branch enough yet relative to the other actions I could be exploring?

19:06Or I have already explored, uh, if I haven't explored this branch yet, you know, maybe I think it has a low score, but I just haven't explored that many branch leaves of this, uh, down this, uh, leaves down this, uh, down this node in this tree. So I should maybe like try this, even though the Q, this sort of exploit is telling me that this is not that valuable. And because LN of N grows slower than N, uh, basically as over time, you will move from the arg max being dominated by this exploration term, which is the second term here to the arg max being dominated by the Q term, which is like,

19:40okay, I've done enough simulations. I'm quite confident that like, this is the branch to go down. Yes, that's right. So, um, the motivation for UCB was to come up with an algorithm where if you don't know the payoff of the arms, uh, the different actions you can select to begin with, this, uh, strategy basically with given some exploration term here bounds your regret, uh, in terms of how wrong you can possibly be. Um, I don't know the proof. I don't also know if this one is proved to have a logarithmically or, or like, uh, you know, square root bounded regret or anything, but I think the

20:12algorithm was just derived to look something like this. And you can tell that these terms are, they grow a little bit differently. And this is actually just to account for the fact that Go has many more actions in every given move compared to your standard bounded problem. Yep. Um, so, uh, one small clarification to make is that you talked a little about simulations on and probabilities and forth. Um, we should remember that Go fundamentally is a deterministic game. So the notion of, like, where does the notion of probability come from here, right? Um, uh, if you had a very powerful computer,

20:42there is no probabilities. You just, you can just compute the true average of what the, the mean action value is. So where does the probability come in? Well, it turns out that, um, uh, as in, you know, computer Go before, uh, AlphaGo, we've always done some sort of Monte Carlo method where we have some, we, we take the, um, expected Q value averaged over a randomly selected tree. Um, and that randomly selected tree is where probabilities come in. So the interpretation of Q is,

21:13what is the expected action value under the, um, under the random distribution induced by some random search process. Makes sense. And so where does the random search process come in? That's where, uh, you know, P of action comes in. Yeah. So if we assume a very naive algorithm where you have a uniform probability of taking any valid action, then this would just be one over, you know, the, the number of valid moves in, uh, in the, in this, uh, setup. And you would be kind of taking this average over this very diffuse tree, right? And, and this is,

21:43uh, this is a valid, um, integral you can take, but it's very slow because you're going to consider a lot of trees that have very low value. And, uh, it's essentially almost like a important sampling problem where you want to, there's only a few actions and, and, and sort of, uh, paths that can contribute, you know, high value and almost everything else is low value. So, so that's a sort of a tricky, um, problem here. Okay. Um, so this is the action selection criteria for how you decide which moves to move down. Now, as you move down, um, in, in tree search, you will eventually

22:15run into a node where, um, it's quite clear you've won or lost, right? At the, at the very, very end of the game when, when there are no valid moves to play left under, under Trump Taylor scoring, you can decide whether you like, you know, won or lost, right? So you, you either win or you lost. And so this is basically, um, you know, the, the final return of the whole game, right? Um, and so the, the question here is like, we, we can assign a, um, a value, you, to a terminal leaf node of the tree,

22:50but how do we assign the values for nodes prior to that, the parents? And it turns out, um, you know, what you simply do is you just take the, um, your, your mean action value is essentially your average. So let's suppose these were leaf nodes. Um, sorry, these were all leaf nodes. The, the mean action value of this node, you know, this action here is just the average of whether you won or lost at the, at the leaf nodes. And, uh, correspondingly, you can kind of walk up the chain and say like, well,

23:22the mean action value of this node, let's call this like QB and this is action B is just the average of a weighted average of these ones here. Yeah. Right. And, and the weighted average is, um, it could be dependent on if you have a different sampling distribution or not, but the, that, the basic intuition is that you want to resolve the game where you have a deterministic win or lose, and then you can kind of go backwards. Uh, this is called the backup step and assign values to these, uh, these, these, these nodes or actions, um, uh, corresponding to the averaged over,

23:53over the final terminal leaf. Yeah. Okay. So, um, if you were to do this without neural networks, it would still be intractable. Um, you would, you would have trouble finding, you know, which, um, actions to sample. A lot of the actions would contribute very low value, especially if you're like, you know, trying to fight your way out of a losing position and only a few actions give you high value. So the search in practice is still very, very expensive. Um, but, but the, the idea is that like, if you can, because go follows a tree structure, you can actually, you know, inform a very good

24:26estimate of the value of this node based on the, uh, values of, uh, uh, downstream, assuming they're all correct and assuming you've searched deep enough. Your explanation earlier about the, um, the sorts of states where it's obvious to a human who's going to win, but it's not obvious to, or like use deterministically still how to play it out actually drove home the intuition of why the value function both is trainable and to why it's necessary in order to actually be able to learn this game effectively. I mean, it's worth defining value in the first place, but it sounds

24:58good. Yeah. Yeah. So we, we talked about, uh, you know, this U value being, you know, your final resolution of whether you won or lost, and this is the terminal leaf node condition. Um, now humans don't play all the way to the sort of edges of the tree, the leaves of the tree, right? They kind of stop, you know, some dozens of moves before, maybe, maybe even a hundred moves before in, in sort of high level play. So how do they know, right? Like you can think about humans as implicitly having a neural network called a value function that basically, um, you know, takes in, uh, a board state,

25:30and then it kind of evaluates, um, you know, he win. And so the human glances at the board and they know like, I'm probably going to lose, right? And, and they're essentially running a neural network that looks at a board and implicitly they are amortizing a huge number of possible game playouts and, and taking that average and then deciding whether the board is winnable or not, and then whether they should concede or, or, you know, keep playing or not. And, uh, this is remarkable. If you think about like the, um, the beauty of something like this, it's like a neural network in a, in a

26:05human can somehow do all of this simulation at a glance and then just know like within a few seconds without actually playing every single game logically based on just kind of like crystallized knowledge and experience that like they can do this. And so this gives us a hint that like in games like Go, um, there are ways to basically radically speed up the search process. And this is one of the fundamental intuitions behind why AlphaGo works is that you can train a value function to look at a board and quickly resolve the game without playing out all of these trees into the, you know, into a

26:39very deep search depth. Yep. Makes sense. I will say for the audience, um, I sort of found, uh, uh, for previous episodes when I was prepping and it seemed somewhat relevant to understand how AlphaGo works, I would find it very, very confusing. And, but it's the kind of thing where once you understand the problem in this way, and then you'll build the next few pieces, it is actually much more understandable and it will make a lot of sense. And it's okay to be confused right now. Uh, but it's, it's probably simpler to understand by the end of this lecture than you anticipate. So

27:10I'll just make that note for the audience. Uh, yeah, the important intuition at a high level, just to, you know, step back about where we're going with all this is that, um, classically for games like Go, you could build a tree, but we don't have computers powerful enough for that. Yeah. And, um, estimating the value of every action that you could possibly take is also hard because you don't know until the end of the game. You could take averages, uh, by playing them to the end, but that's also hard because you don't know which actions to take to sample these averages.

27:41So conceptually there's kind of two problems. There, there's the breadth of the tree and then there's the depth of the tree. And AlphaGo gives us a way to basically, uh, shrink both of those to be very attractive. Yeah. That's, that's essentially the kind of core idea behind it. Okay. So we, uh, we take this idea that like, you know, humans can glance at a board and instantly predict whether we win. And maybe that gives us the opportunity to really truncate the, how deep we, we search. Yeah. And then, you know, we also know that humans can look at a board and, um,

28:12and, uh, decide, you know, um, what, what boards, you know, like intuitively at a glance, what moves might be good on a good board, right? So, so these are kind of two things that we can use deep neural networks for to accelerate this search process. Um, let's go back, before we talked about neural nets, let's just go back to how this play out works. So we've only talked about making one move, right? So, so the AI looks at this encoded go board, it has a tree, um, it searches for, you know, deeply into the tree to find out which of its actions might be the best. And then it takes

28:44that action. And then now, you know, it goes back to the human. So maybe now the human sees, uh, a go board that looks like, you know, like this. And, um, and then they, um, they make their move. So maybe they put, um, they put, um, they put their stone here. And then now we, um, we go back to the AI,

29:07which now looks at a new encoded board. So I've used two to denote the AI's playing as white, and one to denote the human playing as black and zero as empty. And then now on the AI's turn, it does the MCTS tree search all over again, from scratch, right? So, so it throws away this old tree that it searched last round, and now there's a new root node, and it begins to search anew.

29:39And then so and so forth. So MCTS is basically a, you can think about it like a search algorithm that is, um, deciding what moves to play best, aided by neural networks. Um, and, and it's, it's done on every, every move. Okay, great. So let's talk about the neural network part of this. And while you're racing, another sort of thing that was important for me to understand was this MCTS data structure with nodes and children of nodes and whatever, um, this is done per move

30:11and reinstantiated once a move is made. So a human makes a move, then the AI looks at this and is trying to basically run a bunch of simulations, uh, to figure out, okay, which move should I make next? And those simulations just, a simulation is basically like exploring one more node in this MCTS tree. And at the end, um, once all these, once all this, you know, you run a thousand simulations, that informs then this, um, I guess, as you will explain this probability of what move to make next. That's what you store. You sort of choose the best move given those probabilities. You discard

30:46all of that. Then the next player makes a move and you restart this process at the beginning of every move. Correct. One small addendum. You don't discard all of that. You keep one thing behind that we'll use later. Yeah. Just like I did for Reiner, I wanted to make flashcards for this episode so that people could retain these concepts. And ideally an LLM could generate some candidates for me to then refine, but to actually get high quality suggestions, I needed to design a whole pipeline where the AI could take and ingest screenshots of the blackboard and the right timestamps and then make SVG diagrams

31:19in case visuals were helpful and then run their writing and drawing through a critic and then revise the card in response to this feedback. It's very hard to accomplish this just by sacking LLM calls. This sort of step-by-step recipe works much better if you have a durable agent that's been engaging with the task across all the previous stages. So I used the Cursor SDK to spin up an agent for each card. The Cursor hardness saved me a bunch of work in designing some custom context scaffold or figuring out how to design tool calls for taking screenshots or making animations. These agents all run in the cloud, so I don't have to worry about leaving my laptop open. I just get

31:53an email when I have candidates to review. You can check out my cards at flashcards.dwarkesh.com. You can start building with the agent's SDK at cursor.com.dwarkesh. Okay, so now we have a basic intuition of how moves are made with search. We're going to talk

Monte Carlo Tree Search

32:10about how neural networks can speed this up by providing an analog to the human intuition. Mm-hm. So there's two networks. There is the value network, which takes in a state and it predicts, am I going to win or lose? It's a binary classification problem. So then we're going to have a policy network, which induces a distribution over good actions to take. Mm-hm. So I'm going to draw a one-dimensional flattened

32:40move distribution, but this is really like a square kind of grid, right? So maybe like it thinks actions are like, these are the kind of probability distribution over good actions. And both of these are categorical classification problems, right? So you can train this like any classifier with deep learning, you know, cross entropy, loss, that kind of stuff. So the specific architecture does not actually matter too much. I tried a few different architectures,

33:10transformers work, Resnets work. For small data regimes, my experience is that Resnets still kind of outperform transformers and kind of give you more bang for the buck at lower budgets, but this may not be true. Wait, why is that? They provide the inductive bias of like local convolutions. Yeah. And generally, transformers start to outperform residual convolutional networks when you want more global context. I see, okay. So one interesting finding from the Katago paper was that they found it actually quite useful to pool together global features together and aggregate global features like throughout the network to kind of

33:47give the network a global sense of how to like connect value from one side of the board to another side of the board. What does it mean to aggregate global features? Yeah. So if you have a go, a very large 19 by 19 go board and you, you know, you've got some, some sort of battles going on here and you've got some battles going on here. When you pass this through a convolutional neural network, the receptive fields of the convolutional network are going to be good at computing local things and making that invariant. But they won't be able to kind of connect these two features easily, right?

34:22They need to sort of be pooled together and attend to each other somehow. So the argument about, you know, why transformers are good for computer vision tasks like with, you know, vision transformers and so forth is that because they have a sort of global attention across the whole thing, they can more easily draw these connections. Right. But you do need more data there so that you can kind of learn through data the, the sort of invariant local, local features. Got it. I've tried very hard to make transformers work for this problem because I was kind of curious if transformers would present some sort of breakthrough and go and just remove a lot of those tricks.

34:56But try as I might, I actually haven't figured out a way to make transformers better than resnets for now. So one, sorry, one more potential question.

35:04It makes sense why transformers with their like global pooling of information would be better if you need to consider information that is not just spatially or yeah, CNNs give you a sort of bias that the things that are next to you are especially irrelevant. And then they're sort of aggregated up. Yeah, exactly. Yes. But suppose, okay, so for games where it isn't that relevant, what is happening locally, you just kind of have to consider the whole thing. You're saying transformers would work better. How about games where, so we're talking about the spatial dimension. How about the temporal dimension where right now we're only considering the

35:38previous move because it is a deterministic full information game where, but what if it was something like poker or diplomacy where really a bluff they made a while back is sort of relevant to understanding now and isolating to decide to make your next move. And so you need to consider all those previous states. Would that then change the consideration of what inductive bias is most relevant and what architecture is most relevant? Right. Great question. So Go is a perfect information game. Yeah. And in perfect information games, there does exist a Nash equilibrium strategy for which

36:12you can do no worse than any other strategy. So if you know that your opponent has a particular bias, like they love to play aggressively, you can actually in principle counter that specific strategy better than a Nash equilibrium policy. But to counter any given strategy, there does exist a single Nash equilibrium that can be decided solely using the current state. So that is a design choice that most Go agents, AlphaGo chose to do, which in hindsight turned out to work very well because

36:43the Nash equilibrium seems to be superhuman. Like no human strategy seems to be able to beat it. Now, there are variations of this where you would actually need to consider temporal history. And this is a very exciting research area that I would encourage people to kind of fork my repo and try these things out, which is if you were to play, let's say, 2v2 Go, then you actually need to model your partner's behavior. And you may not have information on how they play, so you need to aggregate some information on how they play so that you can respond accordingly. Yeah. Right. Like these are situations where it's no longer a perfect information game.

37:17Yeah. And then in those cases, in games of imperfect information or partial observability, then you do need some context to build a model. Yeah. Yeah. And I think that's a place where things can get very, very exciting in terms of like self-play or diplomacy style. Yeah, interesting. Okay. So returning back to the neural network, the architecture again is not super important.

MCTS Algorithm

37:37You can get it to work with transformers, you can get it to work with Resnets. I found that for low budget experiments, Resnets work a little better. You can also use kind of a Karpathy style auto research hyperparameter tuning to make your architecture pretty good. And so you don't have to worry too much about that. You just need to sort of set up the problem so that you have a sort of target optimization. Yeah. Okay. So we're going to pick just a somewhat arbitrary architecture that worked for what I did. But again, this part is not super important. You have your encoded board state. And we're going to just

38:08choose to let's say do three, like similar to an RGB, we're going to have three kind of channels. One channel to include black, one channel to include white, and then one channel maybe to encode

38:24empties or maybe like a masked region if you want to train on multiple board sizes. I'm actually not going to talk about multiple board sizes for now. That's a little bit too complicated. So we'll just say like, you know, we've got this two or three channel RGB-like image. And then we go into a, you know, a ResNet. And then we have two branching heads. One head predicts the value function. And this is like a single logit. So this is like R1. And then we have the policy,

38:59which is, you know, R361. So this is the architecture. And we're going to basically train this to predict the outcomes of games given the board state. And we're also going to train this to predict what are good moves. Yeah. Right. So the OG AlphaGo paper, or called AlphaGo Li, initialized this network with a supervised learning data set of expert human play. Later, they removed this restriction by having the model teach itself how to play well.

39:32But I find it actually from a matter of like implementation for your audience, super, super nice to always kind of initialize your experiments to something that's easy. And then like, you know, get the problem working before, you know, trying to bite off the whole thing and learn a tabular resin. You generally want to kind of initialize, just as in deep learning, initialization is everything, right? You always want to initialize your research project to something as close to success as possible, especially if you're, you know, doing something new that you haven't done before. Like always pick something that works and then get it to do something better rather than start from

40:04something that doesn't work at all and then, you know, try to make it work. So under that philosophy, it's a great idea to start from something that like, you know, has a good initialization. So we're going to take human expert plays and train this model to predict, you know, good actions, right? So we're going to take all of the winning games, all the moves in which a human won and, sorry, an expert won and then predict those actions. And then regardless of board state, like, you know, whether you won or lost, you're going to predict the outcome. So you might be wondering like,

40:35okay, well, some of the early boards, you know, where basically only one stone has been put down, how could you possibly know whether who the winner of this game is, right? Well, if you have, you know, hundreds of thousands of games, then on average, you'll probably see that boards that start like this have a sort of half of the games that branch off from this will win and half of the games that branch off from this will lose. So that'll actually be fine. When you train this model to predict those, the logit will sort of converge to, you know, 0.5. And so for these things,

41:05it's sort of expected that once you train the model, a starting board state will look like 0.5. And then as you progress towards the end of the game, it'll actually look something like, you know, if this is 0.5, the win probability will sort of either go like this, or it'll go like this. Right. And this is sort of your move number. Yeah. And so as you, you know, get hundreds of steps into the game, it becomes much more clear, like who's more likely to win or who's more likely to lose under your expert data distribution.

41:36I didn't understand the significance of why this way of thinking about value is especially relevant to the expert data. It is not relevant to the expert data. It's true for any data that you train it on. Yeah. So if you were to learn a tabular rasa, you would also expect this to fall out. Yeah. So if you just do this, like, so imagine, you know, you're vibe coding AlphaGo and you, you gather some expert data sets from like how to go online, or you, you know, you have a data set of human players and you train this model. Actually, it turns out this model is already a pretty good Go player.

42:08It'll most likely beat most human players, right? So, so like, if you just take this policy recommendation and take the argmax over, you know, it's, if this is the, you know, probabilities, if you take the argmax and you just take this action as your Go play, it'll be a very, very fast Go player that doesn't think in terms of like reasoning steps. It just kind of shoots from the hip and it'll be a very strong Go player, which is already quite miraculous if you think about like, you know, 10 neural network layers, maybe under like 3 million parameters can already do something

42:42that impressive. And so you can start this way and it's important when implementing this to kind of just verify that this is probably true. It's good to verify that your Go rules are implemented correctly, that like, you know, you can run these simulations relatively quickly and just as almost like a sort of a checkpoint that like you want to make sure that you can actually do this basic step before you try to layer on more complex things like search. Yeah. So, but yeah, we can do a lot better than taking the raw neural network and playing the moves. And

43:14this is how we can apply it to Monte Carlo Tree Search. So let's apply the neural network to to improve Monte Carlo Tree Search. So we start with our root node. And we now have a four step iterative process to do MCTS. So this tripped me up when I was first reading the paper and trying to understand it. But essentially what we're going to do is we're going to choose a number of simulations. So like, you know, num simulations.

43:49And this number varies. This can be, you know, somewhere between 200 to 2048. I believe in

Improving MCTS

43:57in the AlphaGo Li match, they use tens of thousands of simulations per move because they really wanted to boost the strength of the model as much as possible. Yeah. But in training, you don't actually need too many. And Kotago, I think, uses something on this order as well. Do you know if they used, if you watch a documentary, they had a laptop out during the game? Yeah. They didn't use a laptop itself. It was like on some... It was on some TPU pod, I think. Cool. Yeah. But now... Honestly, kind of unfair. Well... Like, Lee is not using like one E22 flops to do a move, you know? Fair enough. Interestingly enough,

44:29modern Go bots don't need that much compute at test time. And what we'll actually find out as we talk about how the MCTS policy improvement works is that over time, the raw network actually takes all of the burden of that big TPU pod and just pushes it into the network. And you can do actually all of that work with one, you know, neural network forward pass. But the TPU pod will always add the extra oomph on top. And so that's what they wanted for the match. So we're going to pick this kind of like num simulations thing. And for every simulation, we're going to basically do

45:04several things simultaneously. We're going to see which moves are the best in the current tree. We're going to add extra leaves to the tree if we get to a point where we need to add a leaf. And we're going to update the action values for the tree. So that's what every simulation involves these kind of like four step process. So the four step process is basically selection,

45:28expansion, evaluation, and backup. So at the beginning of our Monte Carlo tree search, our tree is very basic. It only has the root node, or our current board that our AI wants to play at. And so we're going to basically select the best action for this. So when this root node is created, we also know that we can evaluate this under our neural network and get the quantities, you know, v theta, as well as our probability over actions.

46:09And I'm going to say root.

46:11So for all of the actions here, we can create a bunch of children, right? So this one has, well, in this case, I'm drawing a three by three board with one board missing. So basically, there are, you know, eight possible children associated with this root node. And each of these has an associated probability of taking that action, right?

46:48So there's p8, p1, p2, et cetera. Okay, so at the beginning of our Monte Carlo tree search, we have our root node, and we can initialize it with some children, right? Because we know it's the policy network evaluated on the root node gives us on a three by three board with one existing stone placed, eight possible children that this AI could take. So with each of the children, their policy network also gives us the probability of selecting that child. So the first step is to do the selection of the tree. And again, this is a very shallow tree.

47:20All we have so far is a tree of depth one, essentially, right? So our first move is to select by maximizing or argmaxing the pucked criteria, which is basically, you know, q, q s a plus c pucked times p of a divided by n over 1 plus n a. So for each of these, we're going to, you know, n a is zero for all the actions initially, n is zero.

47:56And so we're going to basically just, you know, pick according to this.

48:04Initially, what is going to be the, you know, chosen action here is most likely going to be biased towards, you know, the highest likelihood action here, right? Because these are sort of uniform for everybody. So let's suppose p1 was the highest probability node. So you selected this one here. Now, you got to this node, and you realize that it's not a leaf node, right? There are more games, it's not a terminal game, so you cannot resolve the final resolution. So the next step that you do is

48:34expansion. So you will then run this node, this board state through the policy network. Note that this is the AI's move, right? Like AI is making this move. And so when we expand this tree, we're now thinking about what the human might do, or any opponent might do, right? So this is like, you know, your opponent. The tree expansion process actually is completely, so when we evaluate the

49:05node here, we're going to now evaluate the node from the perspective of this player.

49:11So then this one has possible actions that we could take, and we expand basically the leaf nodes here. So for each of these nodes that we could, you know, arrive at, we're going to now check how good those nodes are, right? So maybe from here, like the human could play here, the human could play here, or human could play here. And we're going to store essentially the V theta for each of these things. So V theta of, you know, node one, or like node one prime, V theta node one.

49:57And so we're basically using our neural network to make an intuitive guess of how good is this board from the perspective of this player. And fortunately, because it's a zero sum game, it's easy to deduce that, you know, the value for this player at this step is just one minus the value for, you know, from this perspective. So it's easy to flip the search process depending on which player you're at. And so this is the expansion step. You've taken a non-leaf node and expanded it and

50:28evaluated the value. And this is essentially a quick guess as to like, if I were to play to the end, am I going to win or not, right? So you can almost think about the V theta as a shortcut for searching to the end of the tree for any given simulation.

50:44And then we're, and this is essentially the evaluation step. We're evaluating the quality of each of these boards. In original AlphaGo Lee, they actually did something kind of interesting, which is that they took this value and they averaged it with the value of a real Go playout. So they actually played a real game from here all the way to the end. So like, I'm just going to draw this squiggly line to indicate some path. And they kind of like play this all the way to Trump Taylor resolution

51:14to the end of a full board. And so this is like a zero or one, right?

51:25And so they took this value and they just averaged it with this one here. So the formula they did was like alpha times V theta of some node plus sort of like one minus alpha of a true randomly sampled playout. And you might be wondering like, okay, well, how do they play this out, right? Like it would be very, very costly to do another search on this playout, like almost like a tree within a tree. So they

51:56don't do this. Instead, they just take the policy network and play it against itself. So they just take this as both players and they just play it all the way to the end. And this is something that helps ground the estimates here in reality because you can get a single sample estimate of like whether you win or not. You can think about in the end game where the board is almost resolved that this one actually becomes quite useful because the random, the play according to the policy will most likely decide a pretty reasonable guess of the game. And so you're not facing a problem where

52:27this one kind of becomes untethered from reality. It turns out this is totally unnecessary. So in all subsequent papers after AlphaGo Li, they just got rid of this. Yeah. And so in my implementation, I also did the same and it speeds things up a lot because you don't have to roll these games out on every single simulation. Yeah, yeah. Okay. So again, just to reinforce my own understanding and just to re-explain it. For the audience, by the way, in case it's not obvious, the P there in the select, that is the probability coming from the network in this case. Correct. The policy network here. Yeah. Okay. So fundamentally,

53:00a simulation, just think of it as like rolling out one more node in the search process. Almost. So a simulation is easy to think about when the whole tree already exists, right? You just walk down the tree using the puck selection criteria and then you keep going. Yeah. Now, in AlphaGo, the data structure is such that we begin with a tree that has basically only depth one,

53:31which is its only children. And you want to iteratively build out the tree as you're also selecting actions down the tree. So that's the kind of core thing here is that because Go is such a combinatorially complex game, you cannot afford to build the tree in advance and then search it. You must search while building the tree, right? Okay. So let me just finish up with actually the last step, which is the backup, right? So once you've scored these things, you basically take the mean, the value, the Q value assigned to the node here for taking this action is now just the average across your

54:05evaluated values. Yeah. So that's what is known as the backup step. And once you evaluate this, you can actually kind of recursively go back. So if you know the action value of this node, you can then take the average on its parent and so on and so forth. So you have this kind of four step process where you are choosing the best action that you know of so far, then you may run into a node where you haven't been to before,

54:38so you need to grow the tree a bit. And then you run it through the network to guess whether you're going to win or not. And then you walk all the way back up to the root node to update your values on what the best moves are. So as you do this iteratively, this selection criteria will cause you to visit the, because you're always selecting according to this criteria, you're always going to be selecting the best action you think at any given branch, right? So the final visit counts of like how often you chose these things will reflect your correct policy distribution as induced through this search

55:13process. And so the visit count that you store in the node earlier actually becomes the sort of vote for like which way we should finally select an action here. Yeah. So as a sort of test of understanding, it's worth thinking a little bit about whether we could make this even simpler, right? Like could we actually maybe even get rid of this one and still make the thing work? So recall that, you know, when you do an expansion and then an evaluation at let's say this node, you are checking the sort of win probability of each of the child nodes, right? And so if this one is, you know, like one and these are zero,

55:48you do kind of know something about which action might be better to take. And so why would you need, still need this, right? Like why not just normalize this one into some distribution and call that your policy distribution? This is fine. You can do this. And this probably does work. But in practice, having a single forward pass that gives you a pretty good guess is how the breadth is pruned out.

56:20There is a sort of duality here. Like it would be weird if let's say the policy recommended an action that disagreed with the value, right? If let's say the policy said this was very high probability, but this one said it was, you know, low value, then there's actually something kind of fundamentally wrong between your policy head and your value head. So they are linked, and you probably could get rid of this if you came up with a different way to recover this from just the value evaluations. Right. But just to make sure I understand, the reason you don't do that is so that you don't have to do 360 independent

56:52forward passes to like, okay, here's the value of everything. Let's target max over it, right? Instead, you can just do one forward pass and get like the probabilities of all of them. You can usually batch these somewhat efficiently. So it probably is not a huge computational burden in practice. But yes, you would have to pass up to 361 boards into a single mini batch update to evaluate all the values here, then normalize them. Now, there's actually a more important reason why we still do this, which is how Monte Carlo Tree Search is used to feedback on itself and sort of

57:26recursively improve its own predictions and search capabilities. And that's where this one, having this as an explicit entity you're modeling rather than an implicit normalization over your value is a good idea. Makes sense, okay. Okay, so we talked about the simulations and basically, you know, what you end up with as you roll out the number of simulations is a tree that kind of looks like,

57:51I'm drawing a very low dimensional version of this. Of course, in the real game, it's much more high dimensional. But like, you'll end up with basically a tree structure that like has

58:03a lot of leaves that kind of terminate and are not visited again because their value is deemed to be too low. But then, you know, along one path, there will be a set of actions with very, very high visit counts that kind of gravitate towards that one set of decisions as you increase n. So this is kind of like the mental picture of what the tree in Monte Carlo Tree Search looks like. And you should contrast this with like an exhaustive tree, like in tic-tac-toe, where you could say like, you know, there's nine actions and then eight and then seven and six. And so it's a sort of like nine factorial

58:36sized tree. The Monte Carlo Tree Search in Go is very, very sparse, right? It only considers the paths that you've expanded children nodes on. Okay, so now that we have the search algorithm that applies the value function as well as the policy function, we can now talk about how the Monte Carlo Tree Search algorithm can actually act as a improvement operator on top of these guys here. 20 years ago, Jane Street's data center fit in the corner of an office. Ron Minsky,

59:10who co-leads the tech group there, told me about how it all got started. One of our compute clusters we called The Hive. And I remember the first mission of The Hive was literally like six Dell boxes stacked on top of each other at the end of the row. And the trading systems themselves, we also had there because we actually wanted the ability to make sure we could turn the damn thing off. I mean, there were ups and downs, like literally at some point, you know, one of the people who was cleaning the office unplugged one of the trading systems in the middle of the day as they were vacuuming. So, you know, in the end, it is in fact better to have it all in a data

59:41center. Jane Street's data centers have come a long way since those six Dells. And I got to tour one of them in Texas with Ron and Dan Panticorvo, who leads Jane Street's physical engineering team. You know, these cabinets, these GB300 cabinets consume at peak about 140 KWE. Compare that to traditional air cooled, you're talking about 10 to 40 KW. It's a lot more. We got deep into the details of running one of these data centers, things that I had never considered before. It's filled with a liquid, a mix of distilled or deionized water and propylene glycol, 25% of propylene glycol. That's to inhibit any bacteria or algae growth.

1:00:14I don't love the world where we have to worry about bacteria growing in our servers. I got to see way more of what actually happens in the data center than I've ever seen before. Jane Street was willing to literally pull up the floorboards and take out the racks and take me to the back where all the chillers are. You can check all of this out at janestreet.com slash thwarkash where we posted the full tour. Okay, so we now talk about the RL part of like how this thing gets stronger by playing itself, right?

RL Part of AlphaGo

1:00:40Let's say we play a game where the AI, so you make a move.

1:00:51AI will kind of compute the search and then this is this sort of visit count distribution. Let's say this is your policy, your policy, initial policy recommendation at this node. And then after MCTS, it gets more confident about one of these actions, right? And so maybe the distribution looks a bit more peaky like this based on the search. Now, of course, you can tune the search process so that it ends up more diffuse, but that's probably not a good idea. MCTS should get more confident about specific actions than others, but it, of course,

1:01:25might place a lot of weight on, you know, other actions initially. And then as you increase this number of SIMs, it should converge to a very peaky distribution. So this is your new, let's call this like pi, let's wrap this in like a MCTS operator

1:01:42of, you know, A given S, right? So after applying MCTS process, your policy recommended distribution looks like this. It's a bit more peaky than the previous one. And so then you take the ArcMax or maybe you just sample from this. It doesn't have to be the ArcMax and then you make your move. And then you throw away the tree and then you begin a new on the next move, right? So again, like you, you know, you compute a new distribution.

1:02:13So initially maybe your guess looks like this, and then you refine it through MCTS.

1:02:28There should be one more X on the board, right? I'm sorry, that's correct, yes.

1:02:34To something that looks like this, right? So on every move, you have your initial guess from your policy network. And then the search process that combines your policy network and your value network arrives at a more confident action that you take. And then so and so forth. And then the game ends and one person wins and one person loses. So the way that, the beauty of how AlphaGo trains itself is that it actually can take this final search process, the outcome of the search process, and tell the policy network,

1:03:09hey, like, you know, instead of having MCTS do all this, you know, legwork to arrive here, why don't you just predict that from the get-go, right? Like, why don't you like, you know, not use this guess and just predict this to begin with? And if you have this guess to begin with in your policy network, then MCTS has to do a lot less work to get things to work. And so if we draw like a sort of test time scaling plot, so let's say like this is like number of simulations.

1:03:34Let's say, you know, at zero simulations, your sort of implicit win rate is like, is like, I don't know, here. And then, and then, without any simulation, if you just take this raw action, that this is what your winning rate is. And let's say as we increase the number of sims, maybe you kind of have a win rate that looks like this, right? So when you search for, let's say, a thousand simulation steps, that gets you to a policy here that gets you to here, which is great.

1:04:07But if you were to distill this MCTS policy network back into your sort of shoot from the hip policy network, then you could actually, you know, start here. Like, if let's say this was, you know, zero by distillation, then if you spend another 1,000 sim steps, then you actually kind of get to here. It's almost like if you could just, you know, amortize the first 1,000 steps actually into the policy network instead of the search process, then you can begin at a much better starting point

1:04:41and then get a much better result for the number of sims that you put. The save more type nature of test time scaling as the number of simulations increases, the increase in win rate is smaller. Is that true even for the distilled network? That is to say, is there some gain of like, okay, we start from the distilled, we get these early gains again, or is that just inherent to like the nature of MCTS? To be honest, I actually don't know the test time scaling behavior of MCTS simulations. And I believe it might actually be quite sensitive to how strong this one is in practice.

1:05:12I'm just drawing a monotonically increasing function that gets to one. Okay, cool. So don't pay too much attention to the shape of the curve. Just know that it's monotonic with respect to sim.

1:05:21Okay, so the idea of MCTS is very brilliant, which is like, we're going to, we got something better by applying search. Yeah. And we're going to now, on our next iteration of updating this network, just train this to approximate the outcome of 1,000 steps of search. Yeah. And so instead of starting here, we get to now have a neural network start here, and then the play gets stronger once we then apply another 1,000 steps on top of it. And you can keep going, right? So the training algorithm for AlphaGo is to basically take the games where you've

1:05:54applied the search on every move that the policy encountered, whether you won or lost, and that's quite important. And you're just going to train the model to imitate the search process. So there's an analogy to robotics, actually, which is the dagger algorithm.

1:06:11First, I'm going to draw like a schematic of like, let's say, you know, the states, right? So S0, S1, S2, S3. So let's say, you know, we took a series of actions in an MDP to get a trajectory. And these actions may be suboptimal, right? Maybe we lost at the end of this game.

1:06:36So there is a family of algorithms that basically take trajectories and relabel the actions to better trajectories. So maybe a better action here would have been to take, you know, A0 prime. A better action here would have been to take A1 prime and yet another one like A2 prime, A3 prime. So what MCTS is doing is basically saying, like, you play this game where you eventually lost,

1:07:06but on every single action, I'm going to give you a strictly better action that you should take instead. It does not guarantee that you are going to win, but it does guarantee that, you know, if you take these tuples as training data so that you retrain your policy network to predict these ones instead of these ones, you're going to do better. And this is very related to Dagger in robotics and imitation learning, where you want to collect an intervention here. And even if you're in a not great state, for example, like a self-driving car that, you know,

1:07:38veers off the side of the road, there is still a valid action that kind of corrects you and brings you back. Yeah. Okay. So, pedantic question, but is there a guarantee that MCTS must be better than the policy? For example, you could imagine early on in training, because MCTS is informed by the value network, early on in training, when the value network hasn't been well-trained on finished games, that like MCTS is worse than sort of randomly exercised policy. So is it just like a heuristic that MCTS is better than policy or is that like, is there some guarantee?

1:08:10Right. In practice, it is a heuristic and it does work also in practice, but let me illustrate an example where MCTS can give you a worse distribution than your policy network. Yeah. So, and this can often happen if your self-play algorithm has trained to a good point, but then somehow it collapses because it's not trained on diverse data or something. Right. So, let's say we have a board state where the policy recommendations here are very good. So, like, you know, pi of AS is like, great.

1:08:42But somehow, because maybe we're playing on a lot of games where the bots just resign instead of playing all the way to the Trump Taylor resolution, they kind of forget how to evaluate those kind of late stage plants, right? Like in the case that we showed with the corner play, maybe like 100% of our training data in our replay buffer has lost examples of how to evaluate the value function at those states. So you might end up in a scenario where your terminal value is like very bad. And if the terminal values of the leaves are not good, then this will actually propagate all the way up

1:09:19and cause your puck selection criteria and your backups to be off. And then you end up visiting a very, very different distribution than what your policy initially recommended. Also, if your number of sims is low, then you might also have a variance issue where you just don't explore enough, right? Like it's only guaranteed to converge when you kind of take end to infinity. So, variance in, you know, your search process as well as inaccuracies in your evaluation can definitely screw with the quality of your policy network. And so that's why it's not a guarantee

1:09:51to improve. But, and that is why I think, I suspect why AlphaGo Lee had the playouts to the end in their training algorithm so they could ground this thing in real plants. In practice, what you could also do is just like for 10% of the games, you prevent the bots from resigning and you just say like resolve it to the end. So you get some training data in your replay buffer to really resolve those kind of like late stage playouts that normal human players would kind of not play to. Yeah, yeah. So, so this is why MCTS kind of, if you assume that the value functions are correct,

1:10:24why it gives you a better policy is because, and it's a very critical, you know, chain of assumptions. Assuming that this is accurate, then your search process should give you a better

Training Algorithm

1:10:34recommendation than your initial guess. Right. Okay, so if you have a cold started policy, if you have an AlphaZero type thing, really what's happening for the first few epochs is the policy is kind of useless. And what you're really just doing is, hey, but let's play full games. And once we have played full games, for the preceding moves, we'll have labeled who won, who didn't win. And the loss for AlphaZero has two components, which is like, how good is the policy relative to MCTS? And how good is the value

1:11:06prediction relative to who actually won the game from this move? And this is this sort of like, you can think of this being applied to every single action or every single move. Correct. And really what's happening in the beginning of AlphaZero training is just like, we're trying to get the value function to actually predict who will win the game if you're, if you find yourself in this state and you're this player. And functionally, that's all that's happening. And later on, once that's well trained, now the policy is also improving. Correct. Okay. One trick I did find to be pretty useful, and this is not a peer-reviewed claim, so just like take this with a grain of salt, is like, I found it useful in my own

1:11:39implementation to do the following. You want to first make sure that this is good before you invest a lot of cycles doing MCTS, right? Like, it doesn't really make a lot of sense to do search on garbage value predictions. So you want to kind of start at a good place where this works. AlphaGo lead does a very good thing where it just takes human games and then you like train on it and it just works, right? Totally works. You can also take an open source GoBot, play it against itself, generate data, also works. So if you have some like offline data set that

1:12:11has realistic good play, you can easily learn the late stage value functions pretty well. And that's what you kind of need to start the search process. Sorry, can you just read the sentence one more time? Sure. So it's quite easy to evaluate a late stage Go game. Like when almost all the pieces are on the board, like it's almost like a decidable problem, right? Because there's the lower and lower uncertainty as to like the depth of the tree. So most games played to the end by reasonable people will be good training data to train a good value function at terminal

1:12:42parts of the tree. Got it, okay. Then as you play more games, the search will back up good values into the sort of intermediate nodes of the tree. And then like as you increase the amount of data, your value head gets a good intuition of like what is a healthy board state versus a not healthy board state. Yeah. That those are much more subtle to judge in the mid game than the beginning or the end. So the most difficult part to score is like not the beginning or the, because the beginning is just like obviously 0.5 and then at the end it's like pretty obvious who's winning. So the hard part that

1:13:12you want to learn in the value function is like who is winning in the middle. And so this is actually very analogous to TD learning. Yes. And there's a beautiful connection to TD learning that we can talk about in a bit as opposed to, you know, contrasting with Monte Carlo Tree Search. So you first want to get good value functions and expert data can kind of give you a quick shortcut. I recommend for, you know, practitioners just do that first just to, you know, initialize to a good starting point. And then if you want to do the alpha zero thing or kind of tabula rasa learning, then what you can try to do is on a small board, play random games, just take a random agent. And if you play like,

1:13:47you know, 50,000 games, you'll actually learn a pretty good value function as well. Because on a nine by nine board, there's actually, you can see enough of the common patterns with random play. And then if you train a model that kind of can train on both nine by nine and 19 by nine data, and Katago was a proposed one of these architectures, then there's some pretty good transfer learning from the value head evaluated at nine by nine to the 19 by nine. Right. Because this, this unlike other games as much like a very much a sense of like, there's not like a new kind of piece that is introduced when you increase the size or something.

1:14:20If we take it to its limit and consider like a very tiny, like four by four Go board. Yeah. Like if you play 50,000 games, you're going to have a lot of end states that look like human play, right? Like it's just like tic-tac-toe at that point. Um, so, so if you like broaden this a little bit to like nine by nine, five by five or nine by nine, it's not unrealistic to imagine that like purely random play will actually generate pretty reasonable looking boards. Um, and then so you can score those pretty easily. And so that is what gives you the bootstrapping to be able to then improve your policy with search. But it's very, very critical that MCTS has accurate value, uh, estimates and you need to ground the value. Ultimately,

1:14:54MCTS will fall apart if you don't have a grounding, uh, function for, um, for the value. I, I'd be curious about how much compute you save by training the value and policy on the same network that because they share the same representations, how much more efficient learning is. Cause that would be interesting if, um, they're basically kind of, uh, we've just talked about how they're kind of making similar predictions or they shouldn't be in line with each other. And so I'd be curious if like, actually, yeah, you just, you're, you're like having the amount of compute you had to do by giving them the same network. Right. AlphaGo Li, the original AlphaGo paper had two separate networks.

1:15:25And then, um, in all subsequent papers, um, they, they merged them into two heads. And presumably this saves compute. But answering that question in a very rigorous scientific way is actually, uh, it's, it's a simple question, but, but in practice actually takes, like, if you really want to chase that question down to its, its limit, it's, uh, it takes quite a bit of work to, you know, really resolve that. Yeah. Um, but intuitively, yes, they share a lot of representations. So, so, and, you know, as we mentioned, there is, there is a sort of like, your, your policy network and your value network when doing evaluation should kind of agree, right? So there, there, there really should be this sort of

1:15:56consistency between them. Yeah. Could be if this is the wrong way to think about it. I feel like when I learn how an LLM works and how simple RLVR is, at least as an algorithm, how simple it is, I'm sort of stunned by the kinds of things it can do, uh, that it can learn how to build very complicated code repositories and whatever, simply from getting like a yes, no. And here, I feel like the, if you understand it more deeply of like just predicting MCTS and

1:16:26it actually seems, awful go seems less impressive in retrospect, the more you understand it, because you're like, oh, you're putting in a lot of bias by just saying how much you do, you're like telling it how it should titrate exploration as things go on. You're building this very explicit, uh, tree search for it. And so I don't know if you share that intuition where it actually, the more you understand it, the less impressive the accomplishment in 2017 seems. I, uh, I personally disagree. I, I think they're profound for different reasons and I don't understand the LMRL like enough to like kind

1:16:58of comment on your podcast about it. Um, but, um, I think AlphaGo, so, so yeah, why is it a profound accomplishment? I think maybe it's worth stepping back a little bit and just like, uh, it is different than modern RL and we can talk a little bit about like some of the algorithmic choices there. But, um, I think the most profound thing here is that a, um, a 10 layer neural network pass, so basically, uh, 10 steps of, 10 steps of reasoning. Yeah. And of course, the reasoning is not just one trail of thought. It could be like the distributed representations and a lot of thoughts going

1:17:31on at the same time. But, uh, by construction, let's say a 10 layer neural network can only do 10 sequential steps of thinking, right? Yeah. Um, 10 steps of neural network paralyzed distributed representation thinking is able to amortize and approximate to a very, very high fidelity, a nearly intractable search problem. Yeah. So, so this was a breakthrough that I think most people don't even understand today, like fully comprehend like how profound that accomplishment is. And this

1:18:01is what also girds like, um, AlphaFold for example, right? Like, um, uh, where you have a very, very difficult physical simulation process that you would need to roll out so many micro scale simulations. And yet, like, 10 steps of a somewhat small neural network can somehow capture what feels like a, you know, MP class, uh, problem into a single, um, problem. And, and so, I, I, it, it actually makes me wonder if, you know, our understanding of problems like P equals NP, or, you know, these very fundamental,

1:18:34like computational hardness problems are incomplete, right? Like, like, it's, it's not like, you know, obviously this is not a proof of like P equals NP or anything, but, but there's something to it that like kind of is very disturbing where like what felt like a very hard problem can fall to a very, very simple macroscopic simulation. Yeah. That is a very interesting insight that a lot of problems which are proven to be NP hard, like, I don't know if Go is proven to be NP hard, but protein folding, et cetera, have been like neural networks can solve them because they're

1:19:06NP hard in the worst case, but we're not dealing with the worst. We're usually not concerned with the worst case. We're, you know, like these problems have a lot of structure to them. Yeah. I think that the, the, the kind of question we should be asking ourselves is like, we've been formulating, you know, solutions to NP hard problems as in like kind of worst case complexity. And I wouldn't say, you know, this solves Go, right? It doesn't give us a exact solution of the optimum. But in practice, like it is extremely useful. And the same thing has been shown in like alpha tensor, alpha fold, where like, yes, there is a very hard problem that in the worst

1:19:38case seems intractable, and yet we're able to make like almost arbitrary amounts of progress. So, so here's a sort of like, you know, in the limit, what is, what, what might this look like, right? Well, if you want to simulate, you know, something very complex like weather or predict the future, like, you know, do we live in a simulation or not? The computing resources you need to build a very complex simulation might be much smaller than you think, based on, you know, our ability to amortize a lot of that computation into the forward pass of a single network.

1:20:09Interesting. So to me, AlphaGo was the first paper that kind of like really showed this like profound level of, you know, simulation being compressed into a small amount of compute. I feel totally not at all qualified on the computational complexity of the math to comment on this. But I wonder if there's an important role of chaos here, where if, what is the problem with weather? And why does it take 10x the amount of resources to predict weather a day out, and continually so for every more day out,

1:20:41is because it's a chaotic system. And so small perturbations can totally change the final estimate as time goes on. And I guess it's interesting. Well, I guess you would expect that for Go and protein folding as well. So here's an analogy to weather that might be relevant in Go. So the problem of like, you know, here's our current board state. Yeah.

1:21:07Given what we know about both players, what is the board state in the future? Yeah. Right? This is, this is extremely sensitive to initial conditions, like a single stone place here can kind of disrupt the entire prediction. Yeah. Right? So this is hard. This is kind of intuitively the chaotic problem. And yet somehow, so this is, this is hard. Somehow we can predict who's going to win. Yeah. Like, and this captures a lot of possibilities here. And so there's this more macroscopic quantity

1:21:40that we really care about, which is the average or expectation or some sort of global macrostructure over a lot of like, you know, possible futures. That's an interesting way to think about it. And so, so in, and whether it could be the same thing, right? Like we don't exactly care like what the, you know, velocity of wind 6,000 feet above a specific latitude, longitude is. We kind of care like, where's the hurricane? Or, you know, you know, things like that. And, and I would say like in chaos, you know, there's a classic like Lorenzo tractor, which kind of looks like this, right? Yes, you, you don't, if you start anywhere on the Lorenzo tractor, you don't know where you're going to end

1:22:13up. But you, you do know that the thing looks like this. Yeah, yeah. Right. And, and so there's this kind of beauty of like, sometimes we don't necessarily care about the microscale things, we actually care about the macroscopic structure. That's interesting. And, and these things can be predictable. And, and contrast that say to something like a hash function, which is also incredibly independent on initial conditions, but doesn't have a macrostructure, or at least hopefully if like the arguments work. Yes, one would hope. And so there's like no equivalent of a value function or like broadly, how's the weather going to be that is interesting there. It's really just about what is the move? What is

1:22:48a board going to look like 100 moves from now exactly? Yes, intuitively, that seems correct. I, and then again, this is also out of my, my area of expertise, but I, I find it interesting that like cryptography has not been able to like, um, the tools of cryptography and, uh, you know, um, hashing have also not been able to prove that like, uh, you cannot come up with fast approximations, like you cannot come up with fast approximations, right? If that, if they were able to do that, then you could prove p is not equal to mp. Yeah, yeah. In fact, we know that there's structure

1:23:19in many cryptographic protocols, obviously like, uh, uh, RSA cryptography, there is structure and that structure is what quantum computers exploit to break them, right? I see. Um, Reiner has a very interesting blog post, which we talked about in the episode where he, uh, talks about how, if you look at, at a high level, what cryptographic protocols look like and what neural networks look like, it's extremely similar where you have sequential layers of jumbling information together. And it's because there's this conversion devolution in the algorithms where in cryptography,

1:23:51you want the final state to be incredibly sensitive to initial conditions so that it can come out sort of looking jumbled based on if you change anything. And then neural networks, you similarly want everything to be dependent on all the information because you want to process all the information and consider how it relates to itself. Yeah. You have the maximum power of a neural network at the edge of chaos. Um, I think there's some like research papers from, uh, Joshua Stroldick on this. Yeah. Yeah. Yeah. Like, like there's something kind of quite fundamental about like chaos that is, it's not just like hopeless noise. It's like there's something kind of

1:24:24useful, right? In, in, um, in chaotic systems. Yeah. At least at, at that boundary. Um, but yeah, this is just my, like, think about this as a philosophy. I don't, I don't actually know the math, uh, well enough to comment on it. Anyway, if we, um, go back to, um, we'll talk about LMRL in a little bit because there's some connections there, but let's just go back to like the MCTS. Like what is it doing? It is not, um, crucially, it is not saying we're going to, uh, increase the probability of winning directly. It's not going to say like we're going to, um, up weight all actions that won and down weight

1:24:55all actions that didn't win. Yeah. Um, importantly, what it is doing is saying for every action we took, we did a pretty exhaustive search, uh, on MCTS to see if we could do better. And we're just going to make every action that we took better by predict, like having the policy network predict that outcome instead. And, and so this is, um, a very, very nice idea because you have one super vision target for every single action. Yeah. So the, the variance of your, your learning signal is very low compared to the alternative naive, uh, RL thing. So, so let's actually consider what, let, let's consider a very

1:25:28naive algorithm, uh, uh, that looks a lot more like, you know, modern LMRL today, where, where we do something like, um, let's take the winner of a self-play game and encourage it to do more of that. Okay. So, uh, it, it's worth kind of thinking a little, a little bit about like, okay, what are some alternatives that we could do to train self-play agents instead of MCTS, right? Like, you know, we, we use a lot of, uh, LLM style RL these days, like, is that relevant? Could we do that instead? Um, so, so let, let's think through this a little bit. Let's suppose we have a very naive algorithm where we take a league of agents of different checkpoints and we play them against each

1:26:01other. And, um, for the, for the games where a single player wins, uh, we're going to reinforce those actions up and then, and then, uh, and train, retrain the policy network to imitate those, those guys instead of, uh, instead of the MCTS objective. Um, so what ends up happening is, let's say you have a chain of actions that led to a win. And you're, you're, you have a matchup between two agents that are basically the same. So in fact, let's just assume that like, uh, you know,

1:26:34policy A and policy B are like evenly matched, right? So their true, their true win rate is, is like 50%. Um, so let's say you play a hundred games

1:26:50and then, uh, each game, let's say lasts, you know, 300, um, moves.

1:26:57And, um, you're doing some sort of like evolution strategy or some way to perturb these things to get them, get them to do different things, or maybe you don't and you just play them against each other and you see like occasionally this one might actually have a better strategy than this one, right? And so, so let's say, um, you know, 51 games, um, policy A wins.

1:27:21And then 49 games, policy B wins. And this is just due to random luck or maybe you perturbed policy A in some way that let it do this. And just to have a very, very simple model, let's pretend that for, uh, for, uh, for like, uh, 49 of the games, they played exactly equally. Um, I'm sorry, for,

More from Dwarkesh Podcast

Alex Imas and Phil Trammell – What remains scarce after AGI?

Jun 4, 20261h 16m

Reiner Pope – Chip design from the bottom up

May 22, 20261h 20m

David Reich – Why the Bronze Age was an inflection point in human evolution

May 8, 20262h 13m

Reiner Pope – The math behind how LLMs are trained and served

Apr 29, 20262h 13m

Jensen Huang – TPU competition, why we should sell chips to China, & Nvidia’s supply chain moat

Apr 15, 20261h 43m