The Cognitive Revolution

The RL Fine-Tuning Playbook: CoreWeave's Kyle Corbitt on GRPO, Rubrics, Environments, Reward Hacking

May 1, 2026 · 1h 46m · 21,146 words

Show notes

Kyle Corbitt, founder of OpenPipe, breaks down reinforcement learning and custom fine-tuning for modern AI models. He explains how RL differs from supervised fine-tuning, why GRPO and LLM-as-judge post-training matter, and how these techniques can improve performance, latency, and cost on open source models. The conversation also covers reward hacking, evaluation design, LoRA adapters, and how Chinese labs are using distillation to fast-follow frontier models.

Sponsors:

Sequence: Sequence handles the full revenue workflow for complex pricing, from quoting and metering to invoicing, revenue recognition, and collections. Book a public demo at https://sequencehq.com and use code Cognizant in the source field to save 20% off year one.

AvePoint: AvePoint is building the control layer for AI agents so you can securely govern, audit, and recover every action at scale. Design trusted agentic outcomes from day one at https://avpt.co/tcr

VCX: VCX, by Fundrise, is the public ticker for private tech, giving everyday investors access to high-growth private companies in AI, space, defense tech, and more. Learn how to invest at https://getvcx.com

Claude: Claude by Anthropic is an AI collaborator that understands your workflow and helps you tackle research, writing, coding, and organization with deep context. Get started with Claude and explore Claude Pro at https://claude.ai/tcr

Highlighted moments

I have a hard time seeing them as, like, a durable, long-term, kind of venture-shaped business. I think they're potentially really, really good businesses for the founders if they don't take capital and just kind of take the profits while they're good.
Jump to 1:07:24 in the transcript

Transcript

Introduction

0:00Hello, and welcome back to The Cognitive Revolution. Today, my guest is Kyle Corbitt, founder of the reinforcement learning and custom fine-tuning company OpenPipe, which CoreWeave acquired last year. I open this conversation with a bit of a confession. I've done a lot of supervised fine-tuning work over the last few years, both for Waymark in the early days of getting GPT-3 to write decent video scripts, and for research projects such as the emergent misalignment paper. But I've done essentially no hands-on RL work, both because my perception has been that

0:32Frontier models are probably my best option in any case, and because I'm afraid, perhaps irrationally, of reward hacking. Kyle says that while it may or may not be worth the extra work and slower iteration time, he does believe that using RL on an open-source model probably would deliver me better performance, and would certainly reduce both latency and inference cost dramatically.

Reinforcement Learning

0:53With that motivation in mind, Kyle proceeds to offer a masterclass on all things RL, which repeatedly challenged my premises and, in multiple instances, updated my understanding. He explains how RL differs from SFT in terms of the weight updates it makes to the models, how this difference makes RL fine-tuning less likely to cause catastrophic forgetting, what distinguished the DeepSeek GRPO algorithm from its predecessors, and what additional improvements on GRPO people are using in industry today.

1:24We talk about the distillation strategies that Chinese labs are using to fast-follow American frontier models, and he argues that their use of LLM-as-judge in the context of RL post-training is a bigger deal than supervised fine-tuning. He also explains why he thinks that compute is the primary constraint preventing Chinese companies from catching up, and why he believes that we're already in a recursive self-improvement loop. He describes the cottage industry of reinforcement learning environment companies that has sprung up to serve frontier labs, and why, though it is a good business to be in for now,

1:55he's declined to invest in any of them. He surveys the use cases that are most commonly deployed by CoreWeave customers, and he offers a lot of advice on how to run RL in practice, including how to develop and iterate on evaluation rubrics, whether to train N models for N tasks or a single model to perform multiple tasks, how the flagrant nature of reward hacking makes it relatively easy to deal with, at least when you're focused on specific, narrow tasks, and how CoreWeave's use of LoRA adapters drives efficiency and convenience for their customers.

2:29Kyle is both a technical expert and successful commercial practitioner, and from start to finish, this is a super high-signal conversation on a classic training technique that has become an industry unto itself. And so, I hope you learn as much as I did from CoreWeave's RL fine-tuning guru, Kyle Corbitt. Kyle Corbitt, founder of OpenPipe, now after an acquisition leading the serverless training team at CoreWeave, welcome to The Cognitive Revolution.

2:59I am super excited to be here. Thank you. I'm excited to have you. This has been a long time coming since we've met almost a year ago now, and I'm glad to finally be doing it. So, and that's all on me, by the way, just so everybody knows.

Guest Introduction

3:12So, you are a specialist in reinforcement learning. What I want to do in the next hour and a half or so is get basically a comprehensive survey, crash course, and rundown of what is going on in reinforcement learning, how we should understand it and, like, what the techniques are looking like, who's using it and where and for what purposes and who's having success and not and, you know, what makes a difference and all those things. And so, I guess I was just going to start by telling you my story very briefly and then allowing you to react to that and tell me if I'm, like, way off base or not.

3:47My story, in short, is I've done a lot of model fine-tuning over time, mostly on managed platforms, not so much, like, on open-weights models, just a little bit of that, more so on, like, the OpenAI platform. But almost entirely supervised fine-tuning, not really much at all reinforcement learning fine-tuning. And the story I'm telling myself, which you're invited to pick apart, is, for one thing, increasingly these days, like, I can just use base models with few-shot prompting and that's getting me a lot of what I need.

4:18But even before that was possible, the problems that I was working on in the context of my company, Waymark, are sort of taste-driven problems where we always kind of felt like we'd be better off going to our creative team and say, hey, give us 100 great examples. We'll fine-tune on that and hope that the AI can follow your lead rather than try to go through some sort of seemingly more complicated, maybe more powerful, but kind of, like, harder to wrap our heads around notion of, like,

4:48well, if we get the AI to do it and then we compare and we score, maybe there's an LLM as a judge. We're kind of like, I don't know. It feels a little bit sometimes like a shell game and I'm not sure, like, how much I should, where I should invest or how much I should trust that process. Whereas I know, at least if the AI is, like, imitating my creative team, there's some, like, decent true north there. And then the other thing I'm kind of afraid of, although I'm not so sure it's a big problem in my context, is reward hacking, but I am kind of afraid of reward hacking in general. How would you advise me on whether or not I'm making a good decision or not?

5:19You know, should I be using reinforcement learning or am I thinking about it the right way? Yeah, no, that's a great question. I think one that lots of folks think about. Maybe my first question for you would be, how are the results you were getting from your existing process? So you mentioned, first of all, that, like, these days you mostly just do prompting, but, like, when you were doing fine-tuning with SFT, do you feel like you were seeing the models improve substantially on that? Do you feel like it was, you know, and this is a very high bar, which I imagine it wouldn't clear, but did you feel like it was, like, behaving as well as your creative team

5:51and matching the quality of those examples it was given post-training? I would definitely not say it was matching the best work that our creative team could do, but definitely a notable improvement on the base model. I would say our typical complaint was probably most often that, and this would definitely vary through different generations, but more recently it was, like, able to do the job perfectly well, so to speak, and I think that's true today, too, with prompting,

6:22but few moments where you're like, damn, that was awesome, incredible turn of phrase, or, you know, nailed it, really nailed it, you know, in the way that sometimes you just get something from the creative team that's like, oh, wow, like, that was a really good creative idea that impressed me, surprised me, delighted me. I wouldn't say we see too much of that coming from the models, even today. Okay, yeah, okay, so that was going to be my next question. So, yeah, even with the latest frontier models, you know, that sort of spark or wow moment,

6:53it sounds like is not something you see commonly.

6:57Rarely at best, I would say, yeah. Yeah, so here's what I think. Like, I think it is likely that you would have been able to get better performance out of the models with reinforcement learning than with SFT, and there's a few different factors here that sort of model it. I mean, one is that, like, OpenAI's support for RL was half-hearted at best at any given point, and I think technically they still do it, but that entire kind of, like, you know, model customization platform feels very much in maintenance mode at this point generally. So I think on that side, like, yeah, that might just not have worked.

7:29In a parallel universe where you were using an open source model and you're using, like, a Qwen model or something like that, then I would say with a fairly high degree of confidence that if you're able to get decent results out of SFT, the ceiling of the best results you can get with reinforcement learning is going to be higher. And that's true even if the data you're using for SFT is high-quality data, human data. And the reason why is because it's just, like, the whole trick to RL, like, the whole reason RL works or is something that people invest in at all

7:59is because it turns out it really does matter how well your data distribution matches kind of the model's, you know, like, standard mode of thinking or just, like, what it's picked up from pre-training. And what RL gets you is it just gets you, you know, more, it is working within those channels that are already carved quite deeply within the model. And when you work within those channels, you can just get a lot further because you're not trying to overwrite what it's doing. And so, and yeah, you might say, like,

8:31well, you know, overwriting is what we're trying to do. You know, we're trying to get it to do something it's not good at, which is fair. But it ends up being quite destructive. It's actually really interesting if you, like, you know, look at the weights, like, if you're doing an SFT, it's just, like, even with very few examples and even with a very low learning rate, like, it's just, like, throwing the weights all to pieces and, like, the average differences are so much larger than doing RL. And that's a big part of why you get this catastrophic forgetting because it's kind of, like, overriding other pathways. And it's just, like, you're trying to get the model

9:01to do something that's, like, quite different than what it was trained to do, whereas RL is going to let you stay in those grooves and get a lot further. So, yes, I do think that would have worked. Now, in your specific case, would it be worth it? Would it get you to a place where it's, like, oh, this is, like, better than just using the frontier? My guess is probably not. So I think concretely for your task, if the trade-off you're making is, hey, we're going to take an open source model and use RL to try and make it better at this versus, hey, we're just going to take whatever the best off-the-shelf model is and do prompt engineering, you know,

9:32and we're allowing ourselves to expand to, like, you know, the best frontier models to that point, I suspect for a creative writing task, you would end up in a position where you're better off using the frontier models. And, yeah, like, we can sort of, like, get into, like, there are definitely tasks where I would say the exact opposite and say that, you know, the RL could do well. I would also say that, like, this is, like, obviously, like, all dependent on the amount of compute. Like, I think, theoretically, anything is possible. If you buy yourself a data center and spend a couple billion dollars on this task, you would be able to surpass the frontier. But the trade-off point would be fairly long,

Compute Constraints

10:04I suspect, along that curve for a task of this shape. I'd like to understand this grooves thing better. And, I mean, I do know what you're gesturing at. When I think about, like, how much the weights change with fine-tuning, I usually think of that as kind of more a function of, like, some sort of divergence penalty, some sort of tethering of the, you know, the model as it's evolving to the base, to the starting point, I think you can do that on any kind of fine-tuning, right?
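A minimal Python sketch of the kind of divergence penalty raised here, assuming it is a per-token KL term between the fine-tuned model's next-token distribution and the base model's, added to the task loss. The function names and the `beta` weight are hypothetical, chosen for illustration; real training code would compute this on tensors inside the training loop.

```python
import math


def softmax(logits):
    """Convert raw logits into a probability distribution (numerically stable)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def token_kl(policy_logits, reference_logits):
    """KL(policy || reference) over the vocabulary at one token position."""
    p = softmax(policy_logits)
    q = softmax(reference_logits)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def penalized_loss(task_loss, policy_logits_seq, reference_logits_seq, beta=0.1):
    """Task loss plus beta times the mean per-token KL to the reference model.

    beta is an illustrative weight, not a recommended value.
    """
    kls = [token_kl(p, r) for p, r in zip(policy_logits_seq, reference_logits_seq)]
    return task_loss + beta * sum(kls) / len(kls)


# Toy vocabulary of three tokens, two positions.
reference = [[1.0, 2.0, 3.0], [0.5, 0.5, 0.5]]
unchanged = [[1.0, 2.0, 3.0], [0.5, 0.5, 0.5]]
drifted = [[3.0, 2.0, 1.0], [0.5, 0.5, 2.5]]

print(penalized_loss(1.0, unchanged, reference))  # no drift, so no penalty is added
print(penalized_loss(1.0, drifted, reference))    # drift raises the loss
```

The penalty only measures how far the output distribution has moved. It has no way of telling a harmless rephrasing apart from a change that was needed to reach the right answer.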

10:35So how is it that if I have kind of a similar divergence penalty term in my loss function, why does the supervised, like, why is it more destructive than the reinforcement learning? Yeah. No, that's a totally fair question. So, yeah, I think what you're talking about is, you know, there's a term called, like, a KL divergence penalty, which is, you know, a sort of, like, auxiliary term you can add to any loss function saying, hey, you know, prevent the, it doesn't actually prevent the model weights from drifting. What it prevents is specifically,

11:06like, the log probs that are generated, you know, at each token position from drifting too far from the base model. And this is often considered best practice because, you know, it can help you from getting, you know, like, catastrophic forgetting and, like, moving too far away. However, what it's not, like, the fundamental issue is that, let me put it this way, like, there are often different ways to get to the right answer, right? And this is, like, the easiest example here is if you're talking about a reasoning trace where, you know, it's like,

11:37hey, you're doing the math problem and you're training this model with RL to solve the math problem. And there's probably, like, you know, an infinite number of ways you could reason through from a problem description to the answer. And some of them are going to be paths that the model already is comfortable with and is, like, you know, like, oh, like, you know, these, you know, eight tokens in a row, like, even the base, you know, the model you're starting from would have generated them anyway. And then, you know, the next token, yeah, maybe it would have gotten that one wrong. And so there, you know, the learning signal is teaching you to move

12:07that one slightly. But fundamentally, like, what RL just structurally optimizes for is changing the fewest tokens, the fewest log probs necessary to get to that right answer. Whereas what SFT does, if you're doing SFT on, you know, you're, say, distilling a larger reasoning model into a smaller reasoning model. And this is particularly true if, like, the smaller model you're going on had different pre-trained distributions. So, you know, you would expect that kind of, like, it's kind of, like, built-in, you know, intuitions or inclinations

12:38are different. Is you're not respecting those kind of, like, pieces of the reasoning that it would have gotten right anyway. You're overriding the whole thing with the reasoning from the larger model. And by overriding the entire thing, it's, like, this is quite confusing potentially for the backprop algorithm because the backprop is just seeing, like, oh, all of these tokens need to change. And, like, maybe some of them didn't actually need to change. You know, like, maybe, like, the direction the model would have gone with this token actually was also fine. And so, but you're, like, changing all the weights

13:08to, like, get to this new one. And really, there was this other token that was much more important that did, in fact, need to change to get to the right answer. But, like, that one's, like, kind of just mixed in with all these other random unrelated changes. And so that, like, general intuition generalizes to other task shapes as well, including creative writing where, you know, the, maybe there's, like, two different ways to phrase this and they're both fine, right? And the model would have chosen one and your creative team chose another and they're both okay. And you don't really want to, like, sort of, like, waste your model updates because every time you update the weights,

13:40you know, there's a potential for catastrophic forgetting and sort of just off-target effects in general. And so you don't want to, like, waste those model updates on, like, changing something that was already fine. You want to really direct them to upweighting the things that, like, the model wouldn't have gotten right on its own or very rarely, more specifically, would have gotten right on its own and, like, very, and focus kind of your sort of, like, updating budget on those. So the KL divergence doesn't give you that, you know, if what you're doing is just penalizing KL divergence, it doesn't distinguish between things the model is already doing fine

14:11and you just, like, you know, there's, like, a different way you happen to have it in your training data versus things that the model really was getting wrong. Okay. Very interesting. Hey, we'll continue our interview in a moment after a word from our sponsors. Most billing platforms were built to send invoices and assume your pricing is simple and predictable. But if you're building an AI product, a fintech tool, or a developer platform in 2026, your pricing is anything but. Usage tiers, consumption billing, and bespoke enterprise contracts

14:42are now the norm and you're probably managing it all across disconnected tools and fragmented systems. Sequence handles the entire revenue workflow from contract to cash. Quoting, invoicing, metering, revenue recognition, plus Sequence agents that automate the manual finance work that usually takes teams days each month while also helping them to collect cash faster. Companies like Cognition, incident.io, Runway, and OpenRouter use Sequence

15:13to run their full revenue process between CRM and ERP without the spreadsheet mess. If your pricing has gotten more complicated than your current billing setup can handle, check out SequenceHQ.com and use the code Cognism in the source field when you book a public demo to save 20% off year one. AI is rapidly moving from assistants to agents and it's causing a sea change. AI isn't just helping anymore, it's taking action. And here's the reality. You don't get outcomes from Agentic AI

15:44unless you trust it to operate at scale. That's why AvePoint is building a control layer for AI. This foundational layer helps you govern what agents can access, secure how they operate, make activity auditable, and recover when something goes wrong. All as one connected system. See every agent, app, and workflow and what they touch. Govern with policy and guardrails that work at machine speed, and recover quickly so a mistake doesn't become an outage.

16:14That control layer creates trust and trust is what unlocks the right outcomes, letting you automate more work, move faster, and deploy agents with confidence instead of hesitation. If you're scaling agents and want those outcomes by design, learn more about AvePoint at avpt.co/tcr. That's avpt.co/tcr.

16:40When you describe,

GRPO Algorithm

16:43you said more specifically, you know, not something that the model can't get right, but something that it rarely gets right. That's key because when we do things like GRPO, you've got to have at least one right answer, right, to have any sort of advantage. I guess it also depends on whether you're doing binary scoring or some, you know, more rubric-based evaluation. But I guess several different questions

17:13coming to mind at once. Can you give me a little bit more intuition and maybe we could do this for GRPO and you can maybe describe like, I'm not sure if GRPO is still like the hotness that it was a year and change ago. I'm not entirely sure if that was something that broke out for kind of memetic social media reasons or if it really was like a huge advance over its immediate predecessors. But can you give me a little bit more intuition for, okay, I understand that in this

17:44algorithm we are running multiple rollouts. Some of them are going to get to a right answer or if it's a rubric score, they're going to get a higher score than others. And then there's a computation that creates the group relative advantage, which is to say, you know, we want to shift toward the patterns that gave us the right answer or the higher scoring answer. How is it though that that is, because it still ultimately goes to a token by token thing, right? So how is it that if I have

18:15like eight different chains of thought and they're all kind of different and at any given token position, like we might even have very different parts of speech, right? At a given token position it could be a preposition here and a verb there and whatever; we're, like, in very kind of different moments in the chain of thought. But my understanding is that the advantage calculation does still ultimately cash out to like token level advantage. So how is it that like, where's the, I'm a little bit lost on the alchemy of like why this translates in the end

18:45to really only updating the, you know, making change on those tokens that really mattered. How is, I'm missing a little logic there. Let me, let me take several parts of this question and I will finish on the one you were there at the end and then hopefully that'll give you the chance to ask follow-ups if my explanation doesn't make sense. Okay, so first of all, like, yeah, I think the reason GRPO specifically like that algorithm and that acronym like, you know, very concretely took off was not necessarily because it

19:16was like a big quantum leap on what came before. It was because DeepSeek did a lot of engineering work around actually scaling it and released an actual artifact, a model that worked really well with it. Like, that was kind of the reason why, you know, there was a whole constellation of other algorithms that probably would have worked just about as well. There was one that came out a little bit before called RLOO, which like basically is the same as GRPO and likely would have worked just as well if you'd scaled it. After GRPO, very shortly after, like, you know,

19:47within a few months, certainly at the R1 release, you know, there were various, numerous improvements made upon it which really do probably deserve their own algorithm. So there was a paper called DAPO, you know, there's GSPO came out from the Qwen lab, I believe, and then CISPO was another one that came out shortly after that are all like significant improvements. And then there's a bunch of like minor tweaks that don't even have like named things. But so I would say, yeah, the algorithm that people use today in practice is actually as far, you know, probably further away from like

20:17GRPO as initially described as GRPO was from kind of like what came before it. But we all just still call it GRPO because that was kind of the name that stuck.

20:26Okay, so moving on to kind of like how it actually works and like let's, yeah, I'll talk through, I think this will be helpful to build your intuition on, you know, how the advantages are calculated and everything. So maybe I'll talk first about what came before GRPO because GRPO is kind of interesting in that like a big part of its like development was that it threw away something that everyone had used before and some people still use, which so sort of the spiritual, you know, grandfather of all

20:56RL that people do on LLMs is an algorithm called PPO that was developed by John Schulman in 2017, I believe, actually pre-LLMs, or before they were big. It was used for games and stuff, and the key thing about PPO is it's sort of like you have your policy, which is what you call the model you're training. It's taking a bunch of actions and the key thing you need to do is every time it takes an action you have to like kind of score how good or bad this action is. If it's a good action then you want to, you know, you basically want to update your weights to make it more likely to do that action

21:27and if it's a bad action you want to update your weights to make it do less, right? The way you, and also importantly this is something that happens at a, you know, at an action by action basis but your reward in PPO is sort of like can be very long term, right? So it could be at the end of like a very long sequence of actions you finally find out, you know, commonly this was used with games, right? And so you'd say, hey, at the end of the game or after a minute of gameplay, like what's my score or something like that? So what PPO does is a few different things

21:58and it's actually, of course, building on older work as well, you know, there's an algorithm called REINFORCE which is trying to solve the same problem. PPO, you know, adds some extra terms to keep it stable, keep it in sort of like a trust region where, you know, you're kind of like hopeful that the model hasn't changed too much as you're updating it. But the way that it, but kind of the key, the key thing that PPO does, and actually this is not unique to PPO, this is from older than PPO, but you want to calculate the advantage at every single action. So every single time it takes an action, you want to say, hey, was this a good or bad action?

22:29And the way it does that is by actually training a couple of different models in parallel. So you have the policy model, which is just your normal model that's generating the actions. And then you have a separate model which is called the value model or the critic model. And the value model is actually predicting, saying, hey, based on the set of actions up to this point, you know, what do I think the score is going to be in the long term? Like basically, it's predicting for this action, like what do I believe is the value of this action? How will, what impact will this action have on the score in the long term? Okay?
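The critic idea can be sketched in a few lines of Python. This is a deliberately simplified illustration, not PPO itself: it assumes a single score that arrives at the end of the episode, no discounting, and a critic whose per-step predictions are already given. The function name and the numbers are hypothetical.

```python
def critic_advantages(predicted_values, final_score):
    """Return one advantage per action: realized score minus the critic's forecast.

    predicted_values[t] is the critic's guess at the final score given the
    state just before action t. A positive advantage means things turned out
    better than the critic expected at that point.
    """
    return [final_score - v for v in predicted_values]


# The critic was pessimistic early and grew more confident as the episode went on.
predictions = [0.2, 0.3, 0.9]
advantages = critic_advantages(predictions, final_score=1.0)

for step, adv in enumerate(advantages):
    print(f"action {step}: advantage {adv:+.2f}")
```

Actions taken when the critic expected a low score receive the largest positive advantage, so they are reinforced most strongly. Production PPO implementations typically use a discounted, smoothed estimator (generalized advantage estimation) rather than this raw difference.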

22:59And it's predicting that for every single action in the sequence. And then eventually you do get to see like what the actual score is. And then basically if the score ends up much higher than you expected, then you can say, oh, some of these actions clearly were much more valuable than we expected. So if, you know, if it's like, hey, my, you know, my, my, my critic model thought it would, it would have a low score and it actually has a high score, then, then I want to like make it much more likely that I have a high or that this action happens in the future. Okay. Now, moving on to

23:31GRPO, the sort of like key difference here is instead of having, figuring out what the value of any specific action is, oh, actually, before I go into GRPO, I should, I should mention this all translates directly into LLMs and, and it's, and the translation that people do, people have tried actually a lot of different translation, but the one that most people do, and it's, you know, kind of like the, the simplest thing that works is every single token generated is an action, right? So, so we're using the exact same concepts as we were using before and just saying like, hey, every token, you know, the state up to that point

24:02is, is the full context and then this token is an action and the next token is another action. Okay. What we do with GRPO is, it turns out that figuring out that, that kind of like value model and keeping it up to date is sort of painful. It's just like tricky to get right. It's like another set of hyperparameters you have to tune. It's like, okay, you know, like we have to keep this model updated or else, or else training doesn't work well. And what GRPO did, and they actually were not the first ones to do this, but, you know, they, they, they sort of get the credit because they were the first ones

24:32to do it at scale and prove it worked well. And they said, hey, we're just going to like completely throw away the value model. And what we're going to do is we're going to try it. So the way we're going to figure out whether a given action is, you know, like, like basically like a trajectory of actions is better or worse than what the model would have done. Otherwise is we're just going to run a bunch of them in parallel, right? So we are going to, with the exact same setup to the same initial conditions, we're going to run whatever, four or eight or 512. Like there's lots of different, you know,

25:03like hyperparameters to tune here as well. Different runs in parallel. And we're going to see how often the model succeeds and how often it fails. And the reason we want to do this is because you don't want, so let's, let's say you just run a single, you, so, so, you know, we've thrown away the, the critic model and we do a single run through with GRPO and we get a score and the score is one. Hey, it got it right. You don't know from that run whether the model like just would always get this right or, or this is like one in a million times

25:34that it got it right. And if you're just like naively updating your model because it got it right, but it would have always got it right. You're just reinforcing, like you're, you're sort of, it's a spurious correlation, right? Where it's just like, Hey, it made some random choices. Choices didn't affect the score at all because, you know, it just always would get it right. And if you're up weighting those random choices it made, then you're, you know, just like kind of moving around in a, in a, in a pretty random direction. So what GRPO lets you do is it says, okay, you know, the sort of like advantage that we allocate to each of these tokens

26:04is going to be based on how much better this run did than sort of the average. Like, really what you want to compare it to is the average: if we ran the model, the current model, infinite times on this, how much better did this run do than that average? Obviously we're not going to run infinite times, so we approximate that by doing it, you know, N times. Okay. So then getting to sort of the end of your question, which is like, you're right. Like when we're actually updating the model weights, we are doing this at a, at a token by token basis, right? So somehow we have to say

26:34for every single token, you know, we want to update the weights such that this token is, is more likely if the advantage is positive or less likely if the advantage is negative. And this is a big problem in reinforcement learning. It's called the credit assignment problem, right? Because really what you want to do is you want to, you want to assign credit and upweight just the key tokens that like were, were critical to this going right and not upweight the tokens that like, you know, it just like always would have been right and, and didn't really contribute

27:04anything to the solution. And, and so the, the sort of like, I guess the key insight of GRPO is to, to sort of just like, do a very unsatisfying thing and kind of just punt on that a little bit. It's not a full punt. So, so what you do is you look at how likely every token was to be produced, right? Because you're, you're sampling at a high temperature when, when you're doing these. And so some tokens that it produces are very common. Some tokens are not very common. And basically what you do is you say,

27:35hey, if I got a high score, then I want to give more credit to the tokens that, you know, just by random chance were less common because I assume that the high score is, is most likely, you know, if, if my score is much higher than the average score across the entire group, right? Then I assume that it probably was because there was some rare thing that I did in this that I didn't do in other cases. And that rare thing led to me doing well. And the exact same thing in the opposite direction. If I get much lower score than the average on the group, then the rare things are the things that I'm going to penalize

28:06the most because I'm like, hey, that's probably what put me there. Now you could ask the question, well, it's like there could be many rare tokens. If you've got like thousands, you know, tens of thousands of tokens in a reasoning trace, how do you decide which rare token is most important? And you don't. You just throw up your hands and you say, all the rare tokens get upvoted the same way. This is, like I said, a very unsatisfying answer. And I think that's one of the reasons why there was like an almost 10 year gap between PPO that had this value model that tried to, you know, determine on a token by token basis and like GRPO where it's like, hey, we're just going to throw that all away
