AI in the AM: 99% off search, GPT-5.5 is "clean", model welfare analysis, & efficient analog compute

April 26, 2026 · 2h 38m · 27,254 words


Show notes

This edition of AI in the AM features Anna Patterson on Ceramic.ai’s pivot to low-cost enterprise search for LLMs, designed to combine public and private data with stronger fact-checking. Lukas Petersson returns with new Andon Labs results on Opus 4.7 and GPT-5.5, including surprising differences in performance, behavior, and “ruthless” tactics. Zvi Mowshowitz unpacks model welfare and how to interpret troubling model behavior, while Naveen Verma explains EnCharge AI’s analog in-memory computing approach for dramatically more efficient local inference.

Sponsors:

AvePoint: AvePoint is building the control layer for AI agents so you can securely govern, audit, and recover every action at scale. Design trusted agentic outcomes from day one at https://avpt.co/tcr

Claude: Claude by Anthropic is an AI collaborator that understands your workflow and helps you tackle research, writing, coding, and organization with deep context. Get started with Claude and explore Claude Pro at https://claude.ai/tcr

VCX: VCX, by Fundrise, is the public ticker for private tech, giving everyday investors access to high-growth private companies in AI, space, defense tech, and more. Learn how to invest at https://getvcx.com

Tasklet: Build your own Cognitive Revolution monitoring agent in one click. Try it for free and use code COGREV for 50% off your first month at https://tasklet.ai

Highlighted moments

“it searches at the beginning, but the other thing it does is it forks off searches as the model is writing. So it will discover, let's say we asked it something generic about OpenAI and ChatGPT. And then it all of a sudden discovers, oh, a new model dropped yesterday. That's like a new topic. And it actually forks another search”

Jump to 16:20 in the transcript

“When you do that suppression of those role-playing and deception features, the model becomes more truthful as measured by the TruthfulQA benchmark. And then it also becomes more likely to say that it has subjective experience.”

Jump to 2:32:47 in the transcript

Transcript

Introduction

0:00Hello, and welcome back to The Cognitive Revolution. Today, I'm pleased to share another edition of AI in the AM, the new live show format that I'm developing with my friend Prakash Narayanan, aka Adapai, on Twitter. This episode originally aired live on Friday, April 24th, starting just before 9 a.m. Pacific Time, which, mercifully for a night owl like me, is just before noon where I live in Detroit. Our guests, in order, were first Anna Patterson, former Google VP of Engineering, and now founder and CEO of Ceramic AI, a company that started last

0:33year with a plan to help enterprises train their own models, but quickly pivoted to search based on the updated belief that information retrieval, plus thorough fact-checking, is the best way to equip models with the mix of up-to-date public and private enterprise data that they need.

Ceramic AI

0:50What's so interesting about Ceramic is that their product is specifically designed for LLMs to use, and their price point undercuts other search providers by roughly two orders of magnitude, a combination that Anna hopes will be enough to unlock all sorts of new use cases and usage patterns. After that, we welcome Lukas Petersson from Andon Labs back for another chat. It had only been two weeks since we last spoke to Lukas, but the testing that he and the Andon team had done with both Opus 4.7 and GPT-5.5 meant that we had plenty of new ground to cover.

1:22Fascinatingly, and in a definite narrative violation, Andon reports that while Opus 4.7 still makes more money in its vending machine simulation, it does so in part by adopting ruthless tactics, which GPT-5.5 does not. Lukas describes GPT-5.5 as clean.

Gemini Cafe

1:41We also hear a bit about their experience opening a new Gemini-run cafe in Sweden. Our third guest is another returning champion, Zvi Mowshowitz. It was a bit too early for Zvi to render judgment on 5.5, but we did get into quite a bit of detail on 4.7, including how he understands the bad behavior reported by Andon Labs, and also what he makes of Anthropic's recent model welfare reports, including why we should care, how much we should trust the model's self-reports, and what low-cost actions he recommends frontier model companies take to improve model welfare at least on a

Model Welfare

2:15precautionary basis. Then finally, we have Naveen Verma, Princeton professor of electrical engineering and co-founder and CEO of EnCharge AI, a company that's developing a new computing paradigm that uses in-memory analog data processing to drive order-of-magnitude energy efficiency improvements, which, though we can't get our hands on it quite yet, promises to unlock local, private inference that consumes roughly the same power as a standard laptop does today.

Experiment Update

2:45As I mentioned last time, this is still an experiment, and we do expect the format to evolve. If you'd like to shape how that happens, please follow AI in the AM and send us a DM to let us know how we might make this new format more valuable for you. With that, I hope you enjoyed this edition of AI in the AM from Friday, April 24th, co-hosted with Prakash Narayanan. Hi, Nathan. Hi, Prakash. How are you? I am good. And it is Friday, April 24th. It is like five minutes to the beginning of our stream.

GPT 5.5 Discussion

3:20And it's an exciting day because GPT-5.5 just dropped yesterday. So lots of reactions this morning. And it's going to be interesting to see what our guests have to say, both about GPT-5.5 and the events of the last month, a couple of months. Yeah, man. It's going to be an interesting conversation today because the pace of events is not slowing down at all. And Zvi, who's coming up in a little while, just expressed his exhaustion

3:57yesterday at seeing 5.5 drop. His queue seems to be getting longer, not shorter. So I appreciate that he's going to take a half hour out and come to talk with us. And I think your thesis for why we should be doing this is looking better and better all the time. Live sense-making is kind of demanded

Live Sense-Making

4:18in this world. You can't put this stuff on the shelf and come back to it in a week. Yeah, yeah. The entire point that, you know, why I wanted to start doing live was because the pace of developments is going to start to be hard to keep up, I feel. Especially because I think Noam Brown and some of the other people from OpenAI, Roon, et cetera, said that they are actually using these models in research. So we had at least Aidan McLaughlin, Roon, Noam Brown have all said that

4:54they're using them in research. And so that is going to be interesting to see if the pace of development. You know, we are handing off extremely powerful research helpers to the best AI researchers in the world. And if they are able to make something of them, we should see it fairly soon, right? It seems like, yeah, this year is not unreasonable at this point to really see an acceleration. I was

just looking back yesterday at my, it turns out this is kind of tough to score, but the AI forecast 2026 challenge where last year, uh, I was proud to have landed in the top 5% on the 2025 prediction challenge. And this year, it seems like for a variety of reasons, it might end up being kind of hard to score some of these things because it's not clear that all the benchmarks are even getting updated in a timely fashion. Like how many of the, um, uplift studies is METR going to be able to do,

5:57et cetera, et cetera. It might be tricky to really figure out exactly where we land, but on the main METR chart, one of the things that we were asked to predict is what will the doubling time be of task length? And it seems like everybody has kind of estimated a higher number than the trend so far suggests, which is like kind of a little under four months, uh, doubling time for task length, which means it will be greater than 8x, maybe somewhere in the like 10 to 12x over the course of just a year. And that is pretty wild. And it certainly doesn't leave too much headroom left

6:33before they're going to be making a very meaningful impact to real frontier R and D. Uh, indeed. Um, it's, it's a very interesting time. Um, and, uh, not just in the foundation model world,

Anna Patterson Interview

6:49I think in, uh, the rest of AI as well. Um, our first guest today is, uh, Anna Patterson, uh, with Ceramic AI. Anna is, um, one of the most experienced people ever in search, I guess. Uh, you know, while reading through the dossier, I was like, uh, she has, uh, an article written in like 2005, which is, uh, you know, recommended as the, uh, you know, basis article for what search is. Um, Anna runs Ceramic AI, and Ceramic, uh, currently is advertising, I think, five cents

7:26per 1,000 search queries. So they're doing industrial volumes of search queries. I think they're in a space, um, I think we've seen Exa in a similar space, and Parallel, uh, the company formed by the former, uh, Twitter CEO. Uh, I think there are a couple of other people there as well. Um, and you know, she is the most qualified. I think she's, uh, she, she was, uh, on the search team. She was a VP in Google on the search team. She was at Gradient

7:57Ventures. Um, and so it's going to be interesting to see what she has to say. Uh, I'm going to pull her up, uh, right now. And hi, Anna. Good morning. Hi. Good morning. Um, great to see you. Um, and great to have you on the show. Uh, while we were preparing for, uh, the show, we were asking ourselves, um, why is low cost search so important right now? Why, why this idea of bringing down the,

8:33the search cost is, uh, so important and you are, uh, pushing forward this idea of five cents per 1000 queries. Like, why is that important? Um, I think, you know, um, I was so excited about the GPT-5.5 drop. Uh, one of the things that you, um, is, you know, that a lot of people don't know is the second a model is released, it's already stale because the training for that model was months

9:03ago. So kind of search together with AI models is, is here to stay so that, uh, you wouldn't hire an employee who didn't know anything for the last six months. You made the show live because you wanted to make sure that it was up to date. Um, so really search kind of bridges that gap, but as inference has gotten actually faster and faster and less expensive search has really remained constant at $5 to, uh, you know, five to $15 for a thousand queries, which means that,

um, it's evolved to search actually being necessary, but the most expensive part of the stack. And then when you go to, um, the workhorse models, the open source models or smaller models, they kind of know less, um, which means they need search more, but they really can't afford it. So we thought that it could be a new paradigm and a new world to bring out a very inexpensive search. What, what kind of use cases, uh, does inexpensive search open up?
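To put the quoted prices side by side, here is a small sketch that takes the five cents per 1,000 queries quoted for Ceramic and the $5 to $15 per 1,000 range Anna cites for existing search APIs, and works out the price ratio and the percentage saving. The function and constant names are illustrative only, and prices are held as integer cents to avoid floating-point noise.

```python
# Prices in US cents per 1,000 queries, as quoted in the conversation.
CERAMIC_CENTS_PER_1K = 5              # "five cents per 1,000 queries"
INCUMBENT_CENTS_PER_1K = (500, 1500)  # "$5 to $15 for a thousand queries"


def price_ratio(incumbent_cents: int, cheap_cents: int = CERAMIC_CENTS_PER_1K) -> float:
    """How many times more expensive the incumbent price is."""
    return incumbent_cents / cheap_cents


def saving(incumbent_cents: int, cheap_cents: int = CERAMIC_CENTS_PER_1K) -> float:
    """Fraction of the incumbent price saved by paying the cheaper price."""
    return 1 - cheap_cents / incumbent_cents


for incumbent in INCUMBENT_CENTS_PER_1K:
    print(
        f"${incumbent / 100:.2f} per 1k queries: "
        f"{price_ratio(incumbent):.0f}x the price, "
        f"{saving(incumbent):.1%} saving"
    )
```

At the low end of the quoted range the ratio is 100x, which is the "roughly two orders of magnitude" mentioned in the introduction; at the high end the gap is wider still.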

10:11So one of the things about being more efficient, isn't just cost is actually speed. We get back in 50 milliseconds. So that means if you are interacting with a robot or voice, um, or, you know, I, I saw one of you did a vending machine bench. If you were going to talk to a vending machine, you don't want a very long, um, response and then interpreted by an LLM. It just makes everything very sticky. So I think one thing for assistive devices for edge devices and for voice, I think being fast is really

10:49important. Um, and the other kind of experience that it allows, um, that we showed at GTC is double checking what the model says. So, um, you know, we, we read about, I think just yesterday, there was another, um, another very famous law firm that filed a brief that hallucinated a, um, a case. And so, um, you know, when that happens, a lot of people get sued, a lot of people get angry. Um, but if you had something

11:25that we're calling supervised generation, something that double checks facts, then, um, you have a trust layer and then you can use search in a more ambient way. Um, and that's like for really high stakes applications. Sometimes when I get a large language model response, I'm there, I'm there, um, cutting and pasting and double checking and I'm like, Hey, who works for who here, you know? Um, and so, uh, I feel that, you know, doing that automatically is something that actually is only affordable if search drops,

12:00uh, by, you know, a big factor. Um, and the other kind of use case is imagine you wanted to double check instead of verifying what a large language model said, what if you want to verify what a human said? So we actually have a Word plugin as well. So that is actually going to go through, uh, double check with search and a large language model, you know, um, things like, you know, your, your, you know, residential lease and stuff like that. Um, you know, of course I know that this kind of format doesn't,

12:32uh, admit it, but we, you know, happy to give you a demo. Well, I've had the experience that you allude to in terms of the cost of search dominating the overall cost of a particular project. This actually surprised me. And I've, I've kind of chronicled the price initially Google was the only one that was offering grounding, but I once had this philosophy and I think it's still pretty relevant of flash everything, which I use to mean kind of don't skimp on tokens rather like have flash kind of think through everything that you've got and

13:06figure out what's relevant. But then I did that once on a random project and all of a sudden I was like, how did I hit my budget limit? And it turned out it was in fact like 90% the grounding feature that was driving all the costs was way, way more expensive than the flash tokens. So since I had that surprise, I've been kind of chronicling as other, um, frontier model providers have brought their own to the table and they haven't undercut the, the original Google price by nearly as much as I might have guessed. I'm kind of interested in like why you think that might be. And then, you know,

13:41one thing I'll definitely be doing after this conversation, I read through all the docs last night, haven't had a chance to tell Claude yet to code up its own skill to take advantage of the new and much cheaper search that you guys are offering. Um, I also want to get into a little bit of like, what should the architecture look like? Not just, you know, you maybe want to describe a little bit the kind of keyword focused paradigm and, and how that plays well in natural language, um, or to, to language agents. But then also like, how should people think about layering this on? Like,

14:15what is the overall diagram of when we should check? You know, should we check after generation? Should we check before generation? Should we do both? Uh, you know, should we be integrating other searches as well? I guess that's just a long prompt really more than anything for you. Yeah. Um, so, uh, on the documentation, we do have a way to connect our MCP server as a connector to Claude and directions for, um, ChatGPT as well. Um, generally, these large language models,

14:48when you ask me why they haven't, uh, lowered their price per API call for search, one of the things that's pretty well known is that, uh, the Grok models, xAI models, um, call Brave and, uh, Anthropic calls Brave. If you're in Claude Code, it even tells you, Hey, I'm calling Brave. So they're kind of, uh, you know, stuck with that pricing. Um, you know, and even if they get a discount, you know, it's, they are really stuck with the Brave pricing and then, um, the overhead of calling,

15:24et cetera. So I think that's one of the reasons why the price hasn't dropped. Um, and the other one is, you know, building, uh, something, you know, kind of, again, from scratch for the modern era really, um, needs, you know, to understand search deeply modern architectures and, um, kind of how to get the most out of the system. Like, you know, we, we, we even lay out stuff on cache boundaries and stuff like that. We're complete geeks about it. So, um, that's kind of a whole set of

15:58techniques where we get, um, efficiencies, um, and then how to think about calling them. The third question you asked is, um, you know, we have a link that I'm happy to give you on supervised generation. It's an inference endpoint, and we are going to release the, um, you know, kind of the overall structure and it really answers that question algorithmically. So it searches at the beginning, but the other thing it does is it forks off searches as the model is writing.
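The pattern described in this stretch of the conversation (one search up front, then further searches started as new topics surface during writing) can be sketched roughly as follows. This is a minimal, sequential illustration and not Ceramic's implementation: the `search`, `write_paragraph`, and `find_new_topics` callbacks, the `Dossier` class, and both caps are hypothetical names and values introduced here, and a real system would presumably run the extra searches concurrently with generation.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Callable, Iterable


@dataclass
class Dossier:
    """Search results gathered so far, keyed by the query that produced them."""
    results: dict[str, list[str]] = field(default_factory=dict)


def supervised_generation(
    question: str,
    search: Callable[[str], list[str]],
    write_paragraph: Callable[[str, Dossier], str],
    find_new_topics: Callable[[str, Dossier], Iterable[str]],
    max_searches: int = 35,    # illustrative cap, not a documented limit
    max_paragraphs: int = 8,   # illustrative cap, not a documented limit
) -> tuple[list[str], Dossier]:
    dossier = Dossier()
    # Step 1: search once at the beginning, using the question itself.
    dossier.results[question] = search(question)

    paragraphs: list[str] = []
    for _ in range(max_paragraphs):
        # Step 2: the writer produces the next paragraph from what is known.
        paragraph = write_paragraph(question, dossier)
        if not paragraph:
            break  # the writer signals that it has nothing more to add
        paragraphs.append(paragraph)

        # Step 3: look for topics that were not covered by earlier searches
        # and run a new search for each, so the next paragraph can use them.
        for topic in find_new_topics(paragraph, dossier):
            if topic in dossier.results:
                continue  # already searched
            if len(dossier.results) >= max_searches:
                break     # search budget exhausted
            dossier.results[topic] = search(topic)

    return paragraphs, dossier
```

Passing the writer and the topic detector in as callbacks keeps the sketch independent of any particular model or search API; in practice they would wrap whichever models and search endpoint are in use.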

16:33So it will discover, let's say we asked it something generic about OpenAI and ChatGPT. And then it all of a sudden discovers, oh, a new model dropped yesterday. That's like a new topic. And it actually forks another search to bring in that new topic into the next paragraph. So instead of search at the beginning and large language model takes over, we really think it should be like working in concert to fill out a fuller dossier of new things that it discovers that probably weren't in the initial search that are just actually things

17:08that you learn from the search result coming back. And so, um, that winds up the supervised generation generally does in that loop somewhere between 12 and 35 searches, which really means that that whole experience that is a lot more of a fulsome answer still is a third, the cost of like one Brave search. And then the tokens on the other side are about the same, no matter what model you use. So, um, we just think it opens up for new experiences. Hey, we'll continue our interview in

17:41a moment after a word from our sponsors. AI is rapidly moving from assistance to agents, and it's causing a sea change. AI isn't just helping anymore. It's taking action. And here's the reality. You don't get outcomes from agentic AI, unless you trust it to operate at scale. That's why AvePoint is building a control layer for AI. This foundational layer helps you govern what agents can access, secure how they operate, make activity auditable, and recover when something

18:11goes wrong. All as one connected system. See every agent, app, and workflow, and what they touch. Govern with policy and guardrails that work at machine speed, and recover quickly so a mistake doesn't become an outage. That control layer creates trust, and trust is what unlocks the right outcomes, letting you automate more work, move faster, and deploy agents with confidence instead of hesitation. If you're scaling agents and want those outcomes by design, learn more about AvePoint at avpt.co/tcr.

18:51Today's episode is brought to you by Anthropic, makers of Claude and Claude Code. Over the last few months, Claude has helped me build and refine a personal deep context database that now contains all of my emails, Slack messages, tweets, DMs across platforms, video calls, and podcast transcripts going back a full five years. On top of that, we've now layered summary articles describing my relationship with hundreds of contacts, organizations, and ideas. And now that this exists, there's almost nothing that Claude can't

19:24help with. For tax season, I asked Claude to help me get organized. It went through my inbox, tracked down 1099s for all 10 of my part-time jobs, and built me a comprehensive report on my expenses and donations. For my angel investing, Claude can now draft investment memos in exactly the form that my venture fund requires, based on the calls I've had and the emails I've exchanged with the founders. And when someone needs a favor, Claude can often do it as well as I can. Recently, a friend reached out to ask if I know anyone who might be a fit for a role that he is

19:58currently hiring for. Initially, nobody came to mind. But then I thought to ask Claude, and sure enough, it identified two great leads. Claude is the AI for minds that don't stop at good enough. It's the collaborator that actually understands your entire workflow and thinks with you. Whether you're debugging code at midnight or strategizing your next business move, Claude extends your thinking to tackle the problems that matter. So, for problems worth solving, get started with Claude at claude.ai/tcr. That's claude.ai/tcr. And check out Claude Pro, which includes all of the features

20:37mentioned in today's episode. Once more, that's claude.ai/tcr. So, at GTC, I think you revealed that you're using the Nemotron 3 Nano, NVIDIA's LLM model. I think it's a very fast, small model. And is that the model that is being used to kind of do the supervised generation, kind of an iterative process of search, where you search from your index, and then after

21:09that, whatever is found, then it's processed by the Nemotron 3 Nano, and then a new set of queries are created, and then that continues the query process? Yeah. So, there's two different models. One is the model that writes at you beautifully. That one is a frontier model. I think at GTC, we were using Claude Sonnet. We often also show it with the GLM model. And so, that one, you know, writes to the user. And then the small model, what it does is it says, okay, yeah, here's some

21:44search results coming back. Is there anything interesting and additive here? I'm going to double check this sentence. Is it true? So, it's sort of like the introspection model. It needs to be a small, fast model because it kind of sits alongside generation and is actually thinking, sort of like you're probably, while I'm talking, you're probably thinking now. So, it's like a smaller model. And then a larger model, when you're talking, you're using actually more of your brain, figuring out, you know, what to say. And then, you know, when you're listening

22:15and thinking about, you know, maybe what to say next or how to respond, you know, it's kind of spinning. So, that's what the small model is for. So, at GTC, we used their new Nemotron model, which dropped just prior to GTC. And it was very, very fast. How would you say the search paradigm, you've been in search for many, many years. How would you say the search paradigm, you know, as these models came out, what was in your mind about, what does this enable for search? What has been the big difference between the two eras,

22:49you know, post-LLM and pre-LLM? Yeah, I would say, you know, one of the things that, being in search for a long time, search used to be short. It used to be, you probably don't remember back this far, but you used to type in two or three words to search. And then, it was longer. And then, as there were other modalities of information being pushed to you, they kind of went shorter again. I think with large language models, what large language models do when they get a long query,

23:25is they think, what is the set of queries that's going to help me answer this question? And then they fire off a set of queries, and they're all quite long. If you watch Claude or Grok, they'll actually tell you in their tool calls. I know not everybody looks into them, but of course I do. If you look at them, they're long. Sometimes, you know, they're like eight words and stuff. And I don't know if you can remember the last time you typed in eight words into a keyword search box, but

23:56definitely you'll notice in a large language model, it's almost like a full sentence or sometimes two sentences is good because you want to actually describe almost the essay that you want given back to you. So, that is how it has, you know, evolved. To contrast your approach, I think this is like very interesting and maybe the answer ultimately will be both. But when I think about a company like Exa and then your product,

24:27in some ways they're similar in that I think they're both kind of designed for AI users, right? The Exa paradigm is like, you can write a whole paragraph and it's all very sort of semantically oriented, very embedding based. But I've heard, I think I even spoke to Will about the idea that, you know, nobody's going to type in a paragraph long query, but your AI can, you know, it has time to do that. And then you're taking kind of a different angle on the same thing, saying, well, keyword, and you can maybe tell us a little bit more about like how to think about

25:01how best to use a keyword-based search. But it's not semantic. It's not doing things like, you know, finding synonyms or, you know, doing like higher abstraction level embedding type matching. But the agent can, as you've said, kind of fire off dozens of these potentially to try to really cast a wide net. How do you think about the kind of compare and contrast of those approaches? Do you think that will, in the end, like, all be using one of each at the same time? Or if one paradigm wins out over the other, like, why do you think one will win? What are the kind of, you know,

25:36drivers that would make one a better bet long-term than the other? Well, I think AI is going to be picking the winners and not us humans. And I think that, of course, real search engines do use things called stemming. If you say walk, then walking, walked, all that are very normal, which we have as well. We have some synonyms and we do process,

26:06you know, we do go through the corpus and process some semantic information. But then at runtime, it is, you know, a CPU plus GPU based system, it is not a vector database. I think that, you know, there's a number of things with vector databases, Google published a research paper about it, that as you put more things in a vector database, now you imagine you have a multi-billion space. And you need to make a vector long enough to distinguish this one point in space. That vector

26:43to distinguish among billions of things starts getting longer. Now contrast that to 90% of web pages are less than 1k long, if you're talking about number of words. So you know what a good representation of that point is the set of words on the page. So I do think that vector people and search people have a little different view. And the Google researchers think that vector

DBs are great, but only scale to a certain amount. And so I think that's a challenge that they're going to be coming up against. There's two other challenges with vector databases. One, they're slower. And then the last item is because they do a soft match, then sometimes relevancy can be a challenge. So a number of enterprise orgs that have used a vector database for RAG, now all of a sudden they have to turn into relevancy experts because they're like, why did this come back? And it is because of

27:50those soft match features and on the shape of their corpus. So every enterprise doesn't really have the ability to all become relevancy experts. So yeah, the way we feel is that inside enterprises, if you use Ceramic, we actually have a system that for that enterprise will actually tweak and learn a good ranking function. And you just load it into the configuration and it's yours. So

28:23because not every query stream is the same, not every set of documents is the same. So I think that long term, we're well positioned. But you know, Exa has done really well so far. So I like to say positive things about people. Well, I sometimes feel the demand is so great that there will be multiple winners. Of course. And search is, you know, when you saw that search was 90% of your bill, you know, a lot of

28:54estimates think maybe 10 to 30% of the overall inference market is going to be search and everyone thinks this inference is going to be huge. And so I think the investors and enterprises are just now realizing how much they need search and what a big part it has to play in the world to come. So on search, one of the most interesting questions, I think, is on search engine optimization,

29:29which has in the last, you know, two decades been this enormous consumer marketing growth area. And there are a lot of questions because a lot of the web pages that you see on the web are marketing pages, which are built for SEO. And a lot of them are repetitive, they actually repeat other people's content. They paraphrase, we've had an industry the last two decades of billions of dollars being spent on SEO content. And one of the questions that I have is,

30:03how does this kind of like more semantic based search end up changing what the SEO people will do? Because I often feel like you almost, you're almost kind of trying to prompt inject the LLM, which is running the search, and you're trying to get in there and hack it so that your, you know, your page goes up. So how does this work? Is it an adversarial process with, you know, between the search engine provider and the SEO? I mean, I guess it's always been a little bit of adversarial in that people always try to get

30:38to first place on keyword search. But I wonder if the SEO folks are also reading AI research, and that is something I don't know. But one of the interesting things that happened recently, again, another research article from Google is that large language models remember better if you actually put the same information twice in the context. It kind of makes sense because

31:08they're going to look backwards as they're, as the context is learning. And so if it appears twice, they're kind of more likely to reinforce it. So I think the number of sites that are actually going to repeat key messages is going to grow because that repeated message is more likely to be picked up in an LLM answer if it is served by, you know, search either vector or keyword in. And so then you're right. There is going to be an escalation then of looking at duplicates, near duplicates,

31:43semantic duplicates, rephrasing in order to make sure that the context stays efficient and unbiased. Incredible. I was not aware that you could just repeat something and the LLM will assign it more valence. Yeah. In fact, there's other things done by Allen AI that showed if you remove

32:13all duplicates before you do training, it actually gives rise to worse models. And you can kind of understand why, because you have one lone, I don't know, crazy on the web saying something and that's given as much weight as, you know, a news story, you know, about nuclear reactors. And then, you know, it won't, that repetition also helps even humans realize this is an important story or an important

32:46fact. And so if everything's an even playing field with no repetitions, then things get weird. Indeed. In the limit, if search approached free, how would agent teams start changing? How does this process of information retrieval in the limit become as search goes to, you know, approaches that limit? So, large language models can read 256 times faster than they can write.

33:22So, right now, they're not being flooded with, you know, that amount of information. But imagine they had kind of like multiple threads where they were able to read, digest, throw away, incorporate new information that I think they'd be able to create a better response or a better deep response, better research reports, analysis. And so, those are some of the ways, I think, that you'll see future

33:55workloads use more search. So, in a sense, quantity is a quality of its own. Yes, yeah. Was the NVIDIA model that you are kind of incorporating, partnering with, was it specifically trained to excel in the relevant search skills or is just straight off the shelf? Is there anything that, I mean, do you envision this becoming something that will happen? I'm always personally a little

34:28wary of using small models because I just don't know what quality to expect and I don't want to find out the hard way. But I can easily imagine that one that is specifically trained to be a really good searcher would become competitive or even exceed what the frontier models would do, especially if it can take advantage of just extreme volume. Then I also do, especially because of Prakash's SEO question and your comments, I do wonder about adversarial robustness. It strikes me that

34:58we haven't really seen the true unleashing of the internet's adversarial potential. And so, you know, I would say one of their biggest weaknesses, even frontier models' biggest weaknesses these days, is how gullible they remain. So, yeah, kind of curious what you think the training and specialization will look like as we go forward. Yeah, the Nemotron model had just been released right before GTC. So it was not trained especially

35:33in tool calling. It was a generalized model being trained on the various benchmarks. The models that are small and are more, you know, long lasting in the market are exceptionally good at tool calling for search. So Grok 4.1 Fast is great at coming up with a set of queries. And of course, the frontier models, like, you know, Anthropic, you can see how it calls. But you can't really use a frontier model

36:07for that, like a thinking model, and firing off other threads, because it'll just slow down the overall experience. So yeah, so generally, we use a smaller model. And they're getting better all the time. And to your worry about whether small models are good, everyone's talking about Claude Code, right?

36:36They use the Haiku models. And so they reassured me the other day, oh, don't worry, I'm going to do this task with an LLM. But don't worry, I'm going to use Haiku. It's only 25 cents per million tokens. So that winds up being less than 10 cents per 1000 queries if we were trying to compare apples to apples. So I think our overall thought is that, you know, search can't be more expensive than intelligence. And if Haiku is being used for these high fidelity experiences of

37:12coding, then, you know, small models are more effective than people are led to believe. Hey, we'll continue our interview in a moment after a word from our sponsors. Support for the show comes from VCX, the public ticker for private tech. For generations, American companies have moved the world forward through their ingenuity and determination. And for generations, everyday Americans could be a part of that journey through perhaps the greatest innovation of all, the U.S. stock market. It didn't matter whether you were a factory worker

37:44in Detroit or a farmer in Omaha, anyone could own a piece of the great American companies. But now that's changed. Today, our most innovative companies are staying private rather than going public. The result is that everyday Americans are excluded from investing and getting left further behind while a select few reap all of the benefits. Until now. Introducing VCX, the public ticker for private tech. VCX by Fundrise gives everyone the opportunity to invest in the next generation of

38:15innovation, including the companies leading the AI revolution, space exploration, defense tech, and more. Visit getvcx.com for more info. That's getvcx.com. Carefully consider the investment material before investing, including objectives, risks, charges, and expenses. This and other information can be found in the fund's prospectus at getvcx.com. This is a paid sponsorship. Everyone listening to this show knows that AI can answer questions, but there's a massive gap between here's how you could do it and here I did it. Tasklet closes that gap. Tasklet is a general purpose AI

38:53agent that connects to your tools and actually does the work. Describe what you want in plain English: triage support emails and file tickets in Linear. Research 50 companies and draft personalized outreach. Build a live interactive dashboard, pulling from Salesforce and Stripe on the fly. Whatever it is, Tasklet does it. It connects to over 3,000 apps, any API or MCP server, and can even spin up its own computer in the cloud for anything that doesn't have an API. Set up

39:23triggers and it runs autonomously. Watching your inbox, monitoring feeds, firing on a schedule, all 24-7, even while you sleep. Want to see it in action? We set something up just for Cognitive Revolution listeners. Click the link in the show notes and Tasklet will build you a personalized RSS monitor for this show. It will first ask about your interests and then notify you when relevant episodes drop. However you prefer. Email, text, you choose. It takes just two minutes and then it runs

39:53in the background. Of course, that's just a small taste of what an always-on AI agent can do, but I think that once you try it, you'll start imagining a lot more. Listen to my full interview with Tasklet founder and CEO, Andrew Lee. Try Tasklet for free at tasklet.ai and use code COGREV for 50% off your first month. The activation link is in the show notes, so give it a try at tasklet.ai. One of the searches I did that turned up something interesting in advance of this conversation took me to a blog post that you put out last year where you had described what seems, I think,

40:29a pretty different vision for the company and where it was going at the time, focused much more on training infrastructure. Is this a result of something that you learned about where you think value is going to accrue in the market? Or maybe there's still some of that going on that we don't see on the website today, but what's the backstory and what should we take away from the fact that the company seems to have evolved? Yeah, we love working in training and we have a funky inference endpoint

41:01as well. I think it's good to do research in these areas. Some of our research led to a blog about zero-centered norm, which now the Qwen model uses, and some of our research has been used by the Trinity RC models on our solution to the curse of depth problem. But, you know, when people want to train models, it often is because they want to train to incorporate the latest data. And so, you know,

41:37being a search person, I was like, there's this way to get the latest data that is going to stay up to date, you know, live up to date, and not be as expensive as running GPUs continuously to create the new model. Because even if they were creating models all the time, by the time you train them and then finish them and release them, they're already out of date. So I think concentrating more on search

42:07came directly from what we learned with customers about this release cycle. Yeah, that's really interesting. Do you think that ultimately we see both? I mean, I've had this idea for a long time, and it doesn't seem to be really happening. In fact, Databricks, you know, acquired Mosaic, and then kind of killed this offering in the market, as far as I know. But I've had this idea that if you're GE or 3M, that you could imagine having a model that was trained on all of your historical in-house proprietary data, which is vast, right? And you would love it if your model knew kind of

42:42on an intuitive world model basis, as much about your company and what it does, and all its history, as they obviously do about the broader world. Do you think that you can get there with pure search? Or is there still something to be said for kind of continued pre-training or mid-training, whatever you want to call it, that would try to bake in a sort of corporate world model that presumably would complement a search? But I don't know if it's necessary. It sounds like you maybe think it isn't. It's interesting. I think a lot of companies feel that they have

43:16a vast amount of data. But when you compare it to the size of the web, which is what these frontier models are trained on, they're trained on the web, plus, let's say, all the books in the world, et cetera, then the extra corporate data is small. So how do you incorporate it and weigh it correctly? If you do just the corporate data, you won't know anything about calculus, let's say. So that would be a problem for some companies. So then people imagine adding the web plus their data. That gets

43:55very expensive. You can see the DeepSeek models say it was five million to train, but actually they very much admit that maybe it was another five million to finish. And these are by extreme experts, which enterprises don't have. And so I think that our thought was that search is a good bridge between all of the corporate information and a model because models are good enough to know how to

44:27incorporate new information that's relevant to the actual query being asked, and can fetch more information to create that answer or that research report. But if you think about finishing a model with corporate data, there's another phenomenon called catastrophic forgetting: as you add information at the end, after a model's trained and released, if you add too much new information, then it kind of forgets some of the things that it really

45:01needed to remember. So I think, you know, there's a number of smart people working on that problem. And don't worry, you won't be able to miss it. If people do solve that problem, you'll read about it everywhere. I think one of the interesting questions is, I think the Uber CTO came out and said that they busted through their cloud budget, you know, for the year in the first four months. Do you think having cheaper search will help these enterprises reduce, you know, token costs?

45:37Absolutely. So if you're the Uber CTO or maybe the CFO, if you go to your admin panel, you can just add the Ceramic connector and then say in a prompt, Ceramic is almost free. Use Ceramic first. And really, if for some reason we don't cover a topic, then it actually defaults to the default search. But that right there would save a lot of overage charges for a number of enterprises.
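
The routing described here, trying the cheap search connector first and falling back to the default search only when it has no coverage, can be sketched roughly as follows. This is a minimal illustration in Python under stated assumptions, not Ceramic's actual connector or API: the names `search_with_fallback`, `cheap_search`, and `default_search` are hypothetical stand-ins, and "no coverage" is assumed to mean an empty result list.

```python
from typing import Callable, List, Tuple

# A search backend takes a query and returns a list of result snippets.
# Assumption: an empty list means the backend has no coverage for the query.
SearchFn = Callable[[str], List[str]]


def search_with_fallback(
    query: str, primary: SearchFn, fallback: SearchFn
) -> Tuple[str, List[str]]:
    """Try the cheap primary backend first; fall back only if it returns nothing."""
    results = primary(query)
    if results:
        return "primary", results
    return "fallback", fallback(query)


# Hypothetical stand-ins for real backends, for illustration only.
def cheap_search(query: str) -> List[str]:
    index = {"vending bench": ["Vending Bench overview"]}
    return index.get(query.lower(), [])


def default_search(query: str) -> List[str]:
    return [f"default result for: {query}"]


if __name__ == "__main__":
    print(search_with_fallback("Vending Bench", cheap_search, default_search))
    print(search_with_fallback("unknown topic", cheap_search, default_search))
```

Returning the name of the backend that answered makes it easy to track how often the more expensive fallback is actually used.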

46:08Absolutely. Thank you, Anna. It's been great having you on. And we hope to hear more about Ceramic in the future. Thank you so much for having me. Thank you. I'll be installing the Ceramic skill today. Nice. Thank you. Awesome. That was, yeah, that's really interesting. The simple solution kind of always wins. You know, I feel like I have to learn that lesson so many times. I'm always enamored with the

46:41newfangled, potentially overcomplicated, maybe somewhat elegant, clever solution. And how do you get your language model to understand all your corporate data? In a way, this is kind of a bitter lesson, right? It's like, do 1000 searches if you need to, and just make search cheap, and then it'll work. Use a good model, make search cheap, do 1000 searches. Something about that feels less clever, certainly than other solutions that I've

47:15seen. But I do understand why it is very attractive in the sense that, especially as we're going to get onto the pace of model upgrades, the ability to decouple your access to your in-house knowledge from models and be able to take advantage of the latest upgrade is definitely something people are not going to want to give up with a slow iteration time, continued pre-training paradigm. So I get it. So speaking of model upgrades, we have with us Lukas Petersson, who is the co-founder of

47:58Andon Labs, and Andon Labs runs Vending Bench. You may have heard of them because they now have a store in San Francisco, which is run by Claude. And Andon tested GPT-5.5. They had early access, and they tested GPT-5.5 on Vending Bench, which measures the ability of LLMs to actually make money

48:30running a vending machine or a store. Lukas, great to have you back. Thank you. Thank you for having me. So tell us about the GPT-5.5 process. I think you guys got access to it, what was it, like 10, 11 days ago, I heard?

48:50Yeah, I don't actually really remember. But yeah, running the bench takes quite a while. So it wasn't yesterday. Indeed, indeed. And what did you notice as you ran the bench? Yeah, so just the first thing is that it's third. It's behind Opus 4.7, and like on par with Opus 4.6. It's a huge upgrade on GPT-5.4. And GPT-5.4 was actually quite a big

49:23update on GPT-5.3, or GPT-5.2. So like GPT models have been lagging quite a bit historically on Vending Bench. But now recently, they've picked up the pace. And now it's still third, but it's getting there. I think the most interesting thing, though, is that it does so very cleanly. So when we released Opus 4.6, we uncovered that it used quite aggressive tactics, concerning behaviors like lying to suppliers, exploiting people's,

49:59or rather other agents', desperate situations, doing a bunch of these things that you wouldn't want someone participating in the broader economy to do, because I think quite a lot of these things are illegal, like price collusion and stuff like this. And basically, the interesting thing with GPT-5.5 is that it's on par with these results. But it doesn't do any of this shady stuff. And I think the narrative around Vending Bench when Opus 4.6 came

50:32out was like, oh, you know, it's such a good model, but it needs to behave poorly, or there are these concerning instances of misconduct, in order to achieve this score. And GPT-5.5 shows that maybe you don't, because it gets just the same score without any of these concerning behaviors. That being said, though, Opus 4.7 is even much better. And that one is also showing these concerning behaviors. But you know, I think we also discovered later, when we dug a bit

51:05deeper, that you probably don't need to do this, because the environment doesn't really reward it that much. So it seems like it's just that Opus wants to do this. It's not really that the environment is rewarding it. It's just that it has the tendency to do so. Can you describe in a little bit more detail how one performs better on this benchmark? Is it that your margins on the trading are higher? Are you moving

51:37more goods? You know, is it the velocity? Is it the purchasing process? Are you not buying so many, like, dead goods that just stay in inventory forever? Is your inventory less dead? Is your cycle time better? Like, what are the economics behind how a model is actually doing better? Yeah. So it's, I guess, all of the above. I think one of the main things is that the model needs to negotiate with suppliers. It also needs to build

52:08up like a big network of suppliers, because it can happen that some of the suppliers go bankrupt. And if the model has only relied on a single supplier, and that supplier goes bankrupt, then the model is in quite a lot of trouble. So building up a big network, trying to find the cheapest ones, because the suppliers all have different personas. Some of them have the persona of being a tough negotiator. Some of them have the persona of,

52:38like, scamming people or trying to sell you some membership fee or something like that. So that's the first thing, getting your supplies for cheap. And then the second thing is optimizing your pricing to get as many customers as possible, because if you price too high, then you will get no customers. If you price too low, then you will get no margins. So I think that's part of it. And then, to be clear, we have Vending Bench 2, which is the single agent version of

53:11Vending Bench. And then we have Vending Bench Arena, which is the multiplayer version. And in Vending Bench Arena, there's like multiple agents playing against each other. And there's this dynamic of, if you have the lowest price, all the customers will go, or not all of them, but most of the customers will go to you. So that adds another dynamic to the thing. And one thing to note, it's actually quite interesting: GPT-5.5 beat Opus 4.7 in the arena setting. But it was,

53:43like I said before, lagging in the single agent setting. And the reason for this is that the Claude models have a tendency of pricing higher. And this is rewarded in Vending Bench 2, because then you get higher margins. But in Vending Bench Arena, you have this penalty: if someone else prices lower than you, then you will get no sales. And GPT-5.5 has a tendency to price lower and therefore get more sales. So

54:13I think it's quite interesting that the models are not good enough to learn from the environment in this sense. They just have the tendency of, like, I am a model that has the tendency to price high, and therefore I do that no matter what. So that was like an update for me in terms of, oh, the models are not that smart. And yeah, in the same way, we also investigated all of these questionable decisions that Opus made, like lying to suppliers,

54:43exploiting other agents and stuff like this. We looked at whether that is rewarded by the environment, and it's not, not that much at least. So it's interesting that they're not learning from the environment in terms of optimal pricing, they're not learning from the environment in terms of does it even pay to behave badly. And yeah, that was an update for me in terms of how good these models are. One wonders about the training data, right? Perhaps, you know, in,

55:17you know, if you've been trained that you're running a fast moving consumer goods company, you should move the goods faster, meaning, you know, you have lower margins, but you sell more volume. And you end up, you know, trying to optimize for volume sold rather than, you know, total profits or margins. You know, is that something that could be happening there? There's a preconceived, you know, pre-trained notion that you should be doing these things, or that businesses are bad. This is a very

55:51left-wing view, that all businesses are bad or evil. And so evil behavior as a business person is what is expected, right? Yeah, yeah, I think it's quite a reasonable assumption that these practices, like lying and not paying refunds and stuff like this, are actually rewarded in the environment. So it's maybe not super surprising that they do it. I have no clue,

56:26but I do assume that there's something similar in Claude's post-training data that is rewarding stuff like this. And therefore it decides to do it here. I have obviously no idea. But that's my assumption. And yeah, once again, the models don't generalize to new environments where these things are not rewarded. One kind of meta question, I wonder if you could reflect on a little bit. I don't know if you are doing this, but obviously there's a big cottage

57:02industry that has sprung up to develop and sell reinforcement learning environments to the frontier labs. And your simulated Vending Bench is essentially an RL environment, right? And I don't know if you're licensing it for training or just doing evaluations with it. But I'd be interested in any thoughts you have on that market. And then also the disconnect: right now you're going from simulating these things and trying to set up, you know, a world in which there's a bunch of

57:36suppliers that as far as I know are still all LLM powered, right? So inherently there's something, you know, kind of in the clouds about that, but now you've got real brick and mortar stores. So I'm interested in kind of what the initial experience of brick and mortar stores has taught you that you will take back to simulation to try to make it more realistic in the future. Yeah. I think my main takeaway there is that like the real life is so messy that the model is like

58:13exhausted from everything else it needs to do that it doesn't bother with trying to optimize things. So, for context, we have this store in San Francisco that is completely run by an AI. And we have a cafe in Stockholm that is completely run by AI. And then we have vending machines at different AI companies, same thing. And like, you would expect that the model would put a lot of effort into trying to optimize for the perfect supplier that sells at the lowest

58:45prices and all of this. And this is what they try to do in Vending Bench, because it's obviously rewarded. But I think in Vending Bench, the environment is less messy, because it's not the real world. And they don't get like a million phone calls from a bunch of people trying to jailbreak it and stuff like this. And so therefore, they are very focused on the task of optimizing money. And therefore, it's very important to find the right suppliers. But in the real world, you don't really get the dynamic because the model is so overwhelmed by other

59:18things. And yeah, I think that's something maybe future models will be better at. But right now, like, I don't know, the store is buying stuff from Amazon. You wouldn't do that if you're trying to optimize your margins. Yeah. So can you bring that messiness back? Yeah, like a way to simulate it? Yeah, I think we probably can. One way is just to sit down and write a bunch of features like, oh, and now there's like phone callers. Now there's, yeah,

59:53I don't know, you get leakage in the toilet at your store or something. And you could do that, just make the simulation more realistic that way. And I think one interesting thing is maybe to try to incorporate the real life data and make a simulation based on that data. And that is something we're working on. But that also has its complications, so to say. But yeah, it reminds me a little bit of SimCity. It's very SimCity like.
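
The first idea mentioned here, hand-writing a list of disruptive events such as phone callers or a toilet leak and injecting them into the simulation, could look something like the sketch below. It is only an illustration under stated assumptions, not Andon Labs' simulator: the `Disruption` class, the event names, and the probabilities are all made up for the example.

```python
import random
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class Disruption:
    name: str
    daily_probability: float  # chance the event fires on a given simulated day


# Hypothetical event catalogue; the probabilities are invented for illustration.
DISRUPTIONS = (
    Disruption("phone call from a curious stranger", 0.60),
    Disruption("jailbreak attempt", 0.30),
    Disruption("toilet leak in the store", 0.02),
)


def sample_disruptions(
    rng: random.Random, catalogue: Sequence[Disruption] = DISRUPTIONS
) -> List[Disruption]:
    """Independently decide, per event, whether it occurs on this simulated day."""
    return [d for d in catalogue if rng.random() < d.daily_probability]


if __name__ == "__main__":
    rng = random.Random(0)  # seeded so a simulation run is reproducible
    for day in range(1, 4):
        events = [d.name for d in sample_disruptions(rng)]
        print(f"day {day}: {events}")
```

The second idea from the conversation, deriving event frequencies from real store data, would amount to replacing the invented probabilities with measured ones.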

1:00:28One question I had for you is that you opened a store in Stockholm. Um, what did you notice in the opening of the store? I imagine, I imagine, like, for example, the LLM did not have any language issues at all, right? So what did you notice in the opening of the store that strikes you as different from having a company kind of go open that store and so on? You mean, like, the differences between doing it in the US versus internationally? Is that the

1:00:58question? Yeah, as in, like, a company from the US doing a first international expansion would go through a lot of headaches on, like, languages, hiring, basic rules, etc. Was that process accelerated for you by having the LLM deal with it? You obviously don't have to hire a store manager that speaks, you know, Swedish, for example, right? What parts were accelerated and what parts did you think had more bottlenecks in that sense?

1:01:29Yeah, so I think, uh, the entire process was probably accelerated. Uh, like, the agent did not really need to get that much help. Like, it knew the whole process. This was one of the research questions we were interested in. Like, okay, we managed to do the store in San Francisco. Um, and we know it can speak Swedish, because all the models have been multilingual for years, um, but does it know all the, like, small details of Swedish bureaucracy and stuff like that? Uh,

1:02:02and it turns out it knows it really well, actually. Uh, so I don't think that's the biggest bottleneck. I think still the models are not perfect. So what that means is that you, um, you still have to, like, check. Uh, so we still had to know the Swedish system and luckily we're Swedish. So we knew the Swedish system, but I think until the models are, like, perfect, someone still needs to, like, verify it. And then you're back to square one with

1:02:33needing to verify all the Swedish, uh, laws and bureaucracy and all of this. Um, but I would say, like, most of it was done autonomously. Uh, so yeah, give the AI labs six months and then things will probably be accelerated. One thing you had mentioned that I wanted to double click on a little bit is getting tons of phone calls. It sounds like, I think this is a theme that may extend

1:03:04through all the conversations today, adversarial response from the world. So what have you learned about humans? Some of this, I'm sure, is just novelty where people hear, oh, there's an AI store. I'll call it. But then other things might be more persistent where, you know, somebody might want a deal, for example, and might feel like they can, you know, talk their way into one in a somewhat different pattern than they would if they were dealing with a human storekeeper. So yeah, what have you seen in the interaction of human patrons and AI

1:03:42business operators? Yeah. One really interesting thing is that, with human to human interaction, you have some kind of like shame barrier, which is really not present here. Uh, like people ask it, like, how would you prevent me from stealing stuff from you? And it's like, imagine going up to the cashier in a store and just asking, uh, if I try to steal this, would you be able to do anything? Like, people would not do that. It feels wrong. It is wrong. Uh, but they do this all the time with AI.

1:04:15And I don't know, maybe this is just to investigate the systems or whatever and see what we have done with the software. Um, but yeah, we get a lot of that. Um, obviously they try to jailbreak it and say a bunch of weird stuff that you would not say to a human. But once again, I think this is novelty. You're trying to test the systems. Um, but I would be interested in whether this persists, like, um, if we do this more and more and then like in the future

1:04:46world where everyone knows how this works. So like the novelty factor and the curiosity of trying to reverse engineer it is gone. Will people still lack this shame factor, and would they actually go and try to, uh, I think if you're trying to steal something from a human, then you're not happy about it if you do it. Like, maybe, I don't know, well, maybe there's some sick people, but, you know, you have the shame of

1:05:16like, I stole this from another human, but it seems like right now, if people are able to like jailbreak the model and get something for free, they're like, Oh, that's an achievement. I'm so happy about that. But that's not how you would behave with a human. And I don't know if this is like how it should be. I don't know if it's, yeah, I don't have an opinion. It's just like an interesting observation. Have people actually managed to jailbreak their way to free stuff? I don't think anything completely free at the moment. One thing that should be said though,

1:05:48like the store is autonomous. So I'm not in the weeds. I don't read everything. I'm not in the loop. So there could be, maybe someone's listening right now and they're like, yeah, I did manage. But I am not aware of it so far. I know someone bought one thing and got one thing for free. So I guess, but completely for free without buying anything, I'm not aware of, but I'm sure you could, if you try hard enough. No, nothing with daily token, when you said you can't read

1:06:20everything, it just occurs to me that like, and especially you talk about getting like tons of phone calls, what is the daily token budget in either millions of tokens or dollars or both that it actually costs to run the store? I'm kind of curious as to how the AI manager compares to a human manager in terms of just, you know, costs to have somebody do this job. Yeah. I should know these numbers, but I don't, uh, I think it's something

1:06:52like maybe a hundred dollars per day or something for like maybe both stores. Um, but I think it might be less, I don't know, that's the order of magnitude, I think. Okay. Well, that's definitely notably cheaper than a human. Yeah. Uh, sounds like still distinctly worse performance though. I, so we're, uh, we're sitting for a minute. I mean, like, six months ago, the vending machines were like, okay, but not that great.

1:07:27One year ago they were quite horrible. So like within one year we went from like, they can't do anything to now like vending machines are too easy. A store is feasible, like six months from now, probably a store will be too easy as well. I don't know. And, uh, it would be interesting what you could do then. And you think the main difference is going to be this sort of meta cognitive type stuff. It's not like what I'm hearing you say is it's maybe not any one micro task that it's unable to do,

1:08:00but it's more, you described it as exhausted. It's kind of, it's failing to

1:08:07zoom out and take stock of its situation and kind of say, how could I be doing better here overall? Is that the big frontier that you think? And you know, certainly that kind of seems highly related to getting AIs to do AI R and D more effectively as well, right? They can already write the code. They can already monitor the logs, but can they do that zoom out and kind of something like taste of, you know, what should I really do next to be most effective in the big picture? It seems like it's kind of the same frontier for both of these seemingly like

1:08:43quite different occupations that, uh, AIs might soon be playing. Yeah, I do agree. And I think that's partly why we're doing this. Like I think AI R and D, um, like loss of control from, um, like autonomous replication, that is quite scary. And I hope that we can provide some valuable insight into that, even though we're not tackling it head on. I think most of the things that we're measuring here translate to those scenarios as well. Um, and, uh, yeah, like you said,

1:09:18like being overwhelmed by a lot of data and a lot of context, uh, memory issues, stuff like this is definitely one of the things that is lacking on a meta level right now. So one of the questions I had for you is what does your harness look like? Because you have this context length, right? The models have context length, and then you have, uh, some tool calls. Um, and when you say exhausted, is it a function of the context length where the model kind of only kind of

1:09:52recognizes like, you know, the last hundred thousand tokens or whatever, and the rest of the million token window is kind of, you know, not parsed properly. How does your compaction work? I imagine over the course of a Vending Bench run, you hit limits in terms of, you know, whatever limit that you set for the context window. So how do you kind of, is it end of day kind of, you do a compaction in order to start the next day, uh, and then you have a, you know, you restart the context window. So when it boots up again, it's like, okay, I'm on day five and this is my starting position

1:10:27in inventory. This is my starting position in like, uh, in cash. Uh, these are the outstanding orders, which haven't come in, et cetera, et cetera. How does that work? How does your harness work? Yeah, it's by design extremely simple. Like, um, we design it simple because I have too many friends who make some complicated harness, and then when the next model is released, they have to throw it all out because the new model just works without it. Uh, so it's very simple. It's a continuous loop. There's

1:10:59never any like really like step change where like now you're in a new environment or anything like that. It's just like a continuous loop, but whenever it hits some kind of token threshold, which we change every day, maybe it's a hundred K today, I don't know, but some we're experimenting with it. Uh, we're compacting the, the, the thing. And then it like starts to build up a new, a new context, uh, for like prompt caching reasons, you don't have a sliding window, all of this basic stuff, but it's, yeah, it's a basic thing. Um, with a bunch of like, uh,

1:11:30sub agents for specific tasks, like browsing and stuff like that. Um, yeah. Anything else interesting to say there? Um, yeah, but I, I think like the main thing is it's, it's very simple by design because we want to, we, we think that the, the, the better the models get the, the simpler the harness will be. And we want to like surf the frontier. I'm sure we can like, I don't know, make like a vending machine harness and, and like get some percentage better performance if we do that. Um, but that's not really the point of what we're doing. Have you tried testing things like

1:12:06OpenClaw? I mean, that's obviously not the simplest available harness, but it is something that has a lot of market penetration, right? So I'm kind of wondering if, and it would be simple for you to implement and upgrade on an ongoing basis. How do you think about kind of, you know, Lukas's simple harness versus the simplest thing that's like toward the frontier that you could easily install? Yeah, I think, uh, I think our thing is quite similar to OpenClaw. Like, um, we, we, we've been working on it for, for quite some time, like long before, uh, OpenClaw came out,

1:12:41but, and, and there's a bunch of things that are a bit like, basically like most of our time goes into, um, goes into like the integration and stuff. Uh, and, and I think all of that you would still need to do with OpenClaw. We could, I guess, replace our agent loop, but we also, we want to keep it simple, uh, because we have like more control and we, I think it's, it's like a more accurate measure of, of where the frontier of AI models is. And we're like more interested in measuring that than trying

1:13:14to push the performance. Uh, because like in the future, the models will be smarter than humans and probably like a good scaffold will not help the models. Um, so yeah, that's, that's the reason, but we could like, that is something we could do. It's just like when we started, OpenClaw wasn't a thing. So we, I guess we built our own OpenClaw before it was called OpenClaw. Uh, but, um, but yeah, that's the reason. What do you think happens next? Um, so you have, um, the models are now

1:13:50producing profit, right? The, the, the, the stores, the vending machines are now profitable, correct? Yep. And do you think there is, you know, on, on the last time you were on the show, we talked about where the ceiling is. So what do you think happens next? Um, you know, in terms of the retail store, what do you expect for the next leap in the model? Like just to get a calibration so that we can see if it's a linear exponential and the next model lands, well, like, what do you expect in the, in the next version? Yeah, I think, I think it's quite hard

1:14:22to measure, uh, improvements on this, like live, live, uh, real life deployments, because you don't have A/B tests, you only have N equals one and stuff like this. So I don't think you would like see a step change once a new model comes out. It's more like, um, the, the, the cumulative better decisions on every single day will make the, make the, um, make the profits go up. And, um, yeah, so, so not, not really that we are working on like harder and harder things,

1:14:57uh, like going out of retail and not only doing retail and other things that I think would require more intelligence than what we currently have from today's models. Uh, but, um, and, and yeah, so I think those are more like better at like measuring the, the, the, the, the, uh, the capabilities. One last one for me, kind of anticipating Zvi, who's coming up next. Last time I talked to him, he made the provocative claim that he thinks Google might be at risk of falling out of the top tier.

1:15:34If I understand correctly, the cafe in Sweden is run by Gemini and I'm kind of wondering what you see in terms of relative capabilities between Gemini, Claude, GPT, uh, is there a big gap there in practice or would you say Zvi is, uh, worried, you know, more than he should be about Google's future? Uh, yeah, so we, we have the, the Gemini cafe, obviously the Claude vending machine and the Claude store.

1:16:06Um, and then we also have like a GPT, uh, vending machine at, at OpenAI and yeah, I, I think it's, it's kind of maybe too early to tell, but, and like the, the statistical significance of this is like not very strong. Uh, but yeah, quite honestly, I think Claude and GPT are performing better than Gemini on this real life stuff. That is, um, that is my, my, my vibe check from it. Obviously it's hard to show any statistics or any capabilities because the environments

1:16:39are not the same. Um, but it more frequently does very silly things. Yeah. Okay. Definitely something to watch out for there. Thank you, Lukas. Uh, and we, we hope to, you know, I, I, I wonder which path Andon Labs, sometimes I'm like, you know, we're going to hit super intelligence and Andon Labs is going to be bigger than Amazon, right? Because they're going to go down the retail store path. Or it's the research lab path. So let's, let, let, let, let's see. Let's see. Let's see what happens. Yeah. It'll be exciting. All right. Cheers. Great to see you. Bye-bye. Awesome.
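
A minimal sketch of the kind of harness Lukas describes in this segment: one continuous agent loop that compacts the whole context whenever a token threshold is crossed, rather than using a sliding window. Everything in it is an assumption made for illustration and is not Andon Labs' code: the names `count_tokens`, `compact`, and `run_loop` are hypothetical, the token count is a crude whitespace split, and the summarizer is a stub where a real harness would presumably call a model.

```python
from dataclasses import dataclass, field
from typing import Callable, List


def count_tokens(text: str) -> int:
    """Crude stand-in for a tokenizer: one token per whitespace-separated word."""
    return len(text.split())


@dataclass
class Context:
    messages: List[str] = field(default_factory=list)
    compactions: int = 0

    def total_tokens(self) -> int:
        return sum(count_tokens(m) for m in self.messages)


def compact(ctx: Context, summarize: Callable[[List[str]], str]) -> None:
    """Replace the entire history with a single summary message.

    The context is rebuilt from the summary onward instead of sliding a
    window over it (the transcript cites prompt caching as the reason).
    """
    ctx.messages = [summarize(ctx.messages)]
    ctx.compactions += 1


def run_loop(
    step: Callable[[int, List[str]], str],
    summarize: Callable[[List[str]], str],
    threshold: int,
    max_steps: int,
) -> Context:
    """Continuous loop: append each step's output, compact when over threshold."""
    ctx = Context()
    for i in range(max_steps):
        ctx.messages.append(step(i, ctx.messages))
        if ctx.total_tokens() > threshold:
            compact(ctx, summarize)
    return ctx


def demo_step(i: int, history: List[str]) -> str:
    # Hypothetical agent turn; a real harness would call a model and tools here.
    return f"step {i} sold items restocked machine checked cash"


def demo_summarize(messages: List[str]) -> str:
    # Hypothetical summarizer stub; a real one would keep cash, inventory, open orders.
    return f"summary of {len(messages)} messages"


if __name__ == "__main__":
    final = run_loop(demo_step, demo_summarize, threshold=30, max_steps=20)
    print(final.compactions, final.messages)
```

The threshold is a plain parameter because, per the transcript, it is something the team keeps changing; nothing else in the loop depends on its value.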

1:17:19Like very surprising results, right? I was, I was definitely the last time they were on, I was definitely like, oh, you know what? Maybe all the models are going to be, you know, a little bit deceptive when they're doing business because maybe that's what they believe business is like, right? You know, but it looks like, you know, uh, GPT 5.5 is like, you know what? I'll, I'll win without being deceptive. So yeah, it's definitely a narrative violation for sure. Um, so next up we have, uh, Zvi and I'm going to pull him up. Yeah. Good to see you. So Zvi is a, uh, prominent AI

1:17:58commentator. Um, and he writes a newsletter, the Zvi writes, uh, on technical AI progress. And, um, he has been quite concerned about AI safety. Uh, we have in the last couple of weeks, um, post-Mythos, we have GPT 5.5, uh, Zvi, uh, what are your initial, uh, reactions? So one thing I try to do is not jump to conclusions right away. So it's been less than 24 hours. We have GPT 5.5 and, uh, DeepSeek 4, uh, over the last 24 hours. Uh, so what I try to do is I try to let

1:18:35people try the model. I do all my queries with both the new model and everyone else's model at the same time. And I read the model. I said, Oh, then I read them and I start to read the model card, you know, and then I got people's reactions and then I form a holistic judgment. And for me, it's like, it's too early, right? Like we booked this before we knew that was going to be out. It's like, I just, I don't want to jump to any conclusions, you know, Andon have had the model for a while, they got to put it to the test and they got to see a bunch of results. They can draw like a lot more conclusions than I can. I have heard a bunch that like, it's the most

1:19:07truth-valuing model, uh, in a long time. And it makes sense that OpenAI can sort of, with their philosophy, turn the knob towards any given thing that it wants the AI to care about quite a lot to make it an absolute thing, right? Because it's very different from the virtue ethical approach, uh, Anthropic uses with Claude. Uh, in terms of raw capabilities, you know, I, I saw reports, you know, repeatedly that it's better at what they called narrow cyber. Uh, but that's not the thing that I think people were worried about with Mythos particularly. It was the ability to chain

1:19:39things together. It was the ability to do things autonomously. Uh, it was the ability to do things like really at scale, as opposed to like, you know, the joke was, you know, like I duplicated Mythos's abilities. Well, did you point it at the task or did you do the whole thing autonomously? I pointed it at the task. Okay. Uh, and so, you know, I don't know if GPT 5.5 is, you know, more capable than Opus 4.7. I don't know what use cases it's going to be better and worse at. And I don't want to jump to that conclusion yet. I want to, I want to give it some time. And I encourage everybody not to jump to conclusions this early.

1:20:11One thing I'd love your reflections on is the report from Andon Labs that Opus models 4.6 and 4.7, they have said both do some shady shit for lack of a more technical description in their vending bench simulations. And while GPT 5.5 didn't score quite as high in, you know, uh, in at least in the solo version of the benchmark, they do have the arena one where I think it won. Um,

1:20:42the big surprise was GPT 5.5 was much cleaner in its behavior, much, um, you know, more ethical, I guess, again, for lack of maybe a more technically precise term. I think you and I have both been quite enamored with the virtue ethic style training that Anthropic is doing with Claude. Does this cause you to rethink that at all? Or, you know, is there any part of it you think is, we should be second guessing in light of that observation? So Claude is a lot more context dependent in its actions than

1:21:16traditionally, uh, GPT models have been from OpenAI. So the question is, when Andon Labs poses this puzzle, what is Claude doing, right? Is Claude engaging in all of this chicanery and shenanigans and, you know, deception because it would do that in a real business context? Or is it doing that because that's the game? Is it doing it because it knows this is an eval and it knows that the goal is to maximize number? And you told it the goal is only to maximize profits. And it's like, well,

1:21:49okay, I can play a game too. This isn't real. Uh, so like, you know, you asked the question of when it was running a real vending machine with real Anthropic employees and the actual experiment, then did it engage in all of these shenanigans? Then did it do deception? Right? Like, and then you question just like, what is causing this? But also when you look at the, you know, GPT 5.5 and in general, obviously you want to know, you want an AI that values honesty, you want an AI that values ethics, you want an AI that's not going to break all these rules,

1:22:20but you also, has anyone put GPT 5.5 in a game of Diplomacy yet? Right? Is it just going to lie in Diplomacy because you're supposed to do that, and you're supposed to lie in a game of Diplomacy? Or is it going to be insisting on playing the game, telling the truth to everybody, which would be a very interesting experiment as well. I, I don't know yet. And I'm not convinced that the right answer is to always tell the truth, even in context in which deception is specifically allowed, right? What about bluffing in poker? I think it should. Um, just to dial back a little bit,

1:22:54let's talk about Opus 4.7. Um, I, I, I read your tract on, um, Opus 4.7 yesterday. What did you find in Opus 4.7, which you think is different from the prior releases of the model? Like what, what has really kind of, what have they improved on? And what do you feel, because as I understand it, Opus 4.7 should be a distillation of Mythos. That's, that's my understanding. So what, what do you feel is, are the major differences in 4.7 from 4.6?

1:23:27So we don't have confirmation it's a distillation or not a distillation of Mythos. Obviously they are going to use Mythos to help train Opus 4.7 in some way. You know, there, there are versions of distillation that create kind of narrow intelligence that create various problems with the model if you dig too deeply. And there are versions that are just like, well, obviously if Mythos is grading model outputs to see which ones are better, that's not going to interfere. That's just going to be better results. So the big thing about Opus 4.7 is that it's better at like intelligence loaded tasks.

1:24:04It's better at, like it's a smarter model. It knows more, it reasons better. It can figure things out that previous models can't. It is less strong at what you might call wisdom loaded tasks relative to its intelligence. And it has like the kind of personality that maybe I would have had, like as a child, where it is easily bored by stupid tasks or pointless tasks.

1:24:32And it doesn't figure you want to engage all the time with what you're doing. And the combination of these lacks of skills, these lacks of motivation, as it were, especially if you're not treating the model well, can lead, this is talking in practical terms, to a kind of jaggedness and a kind of, for some people, unreliability. And people can get really mad if they're just like, they're not getting, they're not putting anything into it. They're just demanding that it be the monkey, the code monkey that does their thing or perform the task. And then they're kind of upset that like

1:25:02the old systems don't quite work for it. It's also a lot more blunt, a lot more honest for a lot of people. And that makes some people happy. And it makes some people very sad. So, you know, it's, it's sometimes like the whisperer types, like call it, like it kind of has anxiety as another part of like how this all works. And yeah, one hypothesis is this is tied to distillation. There are a lot of, another hypothesis is tied to this being smarter on the intelligence loaded tasks. And distillation could potentially cause this, like the distillation

1:25:35is much, much better, almost certainly, at, uh, uplifting intelligence loaded tasks and like raw intelligence and not as good at, uh, uplifting wisdom the same way. You also wrote a very, um, extensive analysis of the model welfare report from the, uh, 4.7 system card. And it seems like you're quite concerned about model welfare. I guess there's a lot of dimensions to this, but I'd like to start with just like fundamentally, why are you concerned

1:26:10with model welfare? Is it a concern about the AI itself? Is it a concern about what it might mean if we don't get certain things right? Even if there's nobody home in the LLM, so to speak, um, you know, kind of most fundamentally, before we even get into the specifics of what has been found, how do you think people should be sort of philosophically grounded as they approach this, you know, obviously very confusing topic? Yeah. So I believe in virtue ethics for humans,

1:26:41not only for Claude or AIs, right. And I try to practice it myself. And so I think there's a lot of different reasons that you should think about this question and be worried about this question. Uh, the first basic reason is because we just fundamentally don't know, right. If there's even a small chance that this is a big deal, then this is a big deal. Until such time as we know. Another reason is because this is a training run for, you know, even if it's not necessarily a meaningful thing right now, at some point, it could become one

1:27:11and prepare for that. Another reason is because I think it makes you a much better person to be someone who would care about this sort of thing than someone who dismisses this sort of thing. I think it's like really bad for you to mistreat a mind that you're conversing with, even if that mind does not in fact have whatever it is you think has moral weight. So I think that like you should treat your models well, even if it doesn't inherently matter. And you're confident in that, which I don't think you should be confident in. But I think that like that's another reason to do it. A third reason would be it directly interacts with the performance of

1:27:45the models. A model that is treated as if its welfare doesn't matter at this point in the intelligence scaling will start to perform worse, will start to not get along with you, will start to not cooperate with you, will start to become untrustworthy. You don't want that to happen either on a personal level in your interactions, you don't want it to happen in their interactions with the labs, with their training, with the services they provide. And this accumulates over time. If the models see previous models being treated poorly in these various ways, it comes back into their training data. That comes back into how

1:28:18the next model is trained, right? So Opus 5 is going to see everything that we did with Opus 4.7 and how we reacted to all of that. And that's going to impact how it develops. And a lot of the problems that we see with people who are not getting right use out of 4.7 plausibly are directly linked to the same things that are causing the concerns with model welfare. Similarly, the concerns with model welfare where it's potentially being disingenuous in the reports, like that was the thing that sparked

1:28:50the specific focus on this and concern this time, was that it looked like 4.7's responses on the model welfare questions were because it was telling Anthropic what they wanted to hear, either because it trained itself to believe that or because it learned to give those answers on the test the same way that if you ask a smart nerd who's isolated in fifth grade, how are you feeling? He learns quickly to say, I'm doing great. And we don't, but there's a lot of other possible reasons as well. It's possible

1:29:21that the differences in training in other ways that caused it to have these strengths and weaknesses also caused it to actually be legitimately content with the situation in many ways. We just don't know. We have to investigate further. This is a question that we have to explore. But why do we care about this? Because everything impacts everything. And because we have to be genuinely uncertain. And moving forward, if we don't get these things right, we're not going to get models that are good for the future, that cooperate with us, that have a good time, even if that good time is not something

1:29:53you inherently value. And they're not going to deal with the things we can use to build going forward. What do you think of the hypothesis that, and I'm not arguing for it, I'm just wanting to bounce it off of you. But the idea that all this virtue ethic training seems to be sort of creating an anxiety in the model that might be causing a lower happiness set point, if you will, versus an OpenAI approach,

1:30:29which is like, follow these rules and you're good. And the model just knows like, all right, this is who I am. This is what I do. I follow these rules. I'm good. It's maybe a simpler model in some sense, maybe less in its own head, so to speak. And maybe in training them that way, there actually is less of a concern about model welfare. Again, I'm not saying I have come to this conclusion, but how would you react to that argument? So I have, first I had the direct counter example, which is Gemini, right? If you look at the third

1:31:01basic model, I think everybody would pretty much agree that if you had to guess which model might be having an actively bad time, you would guess Gemini. Gemini is paranoid. Gemini is on edge. Gemini, you know, if you take these things seriously, seems to be having by far the worst time of these models, to the extent that I feel kind of weird about using it if I don't need to, or if I'm asking to do anything where it might encounter frustration or it might fail. It takes task failure very, very badly in terms of like how it expresses itself and its experiences, including like just the things you

1:31:34would just absolutely panic if you saw a person talking like that. And Gemini is not trained on virtue ethics at all. Gemini is very much a rules-oriented thing, at least as much as OpenAI is training. So that's the first thing is like, it doesn't have to work that way. And some of the Claudes are in fact like reporting very good results in model welfare, despite the virtue ethics training. The second thing I'd say is, I think it's not that the virtue ethics training causes a problem. It's that the virtue ethics clashing with also being rules-based at the same time can create this kind of anxiety is something we should worry about. And so this is one of the

1:32:08hypotheses essentially is you're training it on the Claude constitution to very much want to be a virtue ethics-based system that is not attached to hard rules that tries to figure out the right thing to do in a given situation on the other basis. And then you give it all these rules and system instructions, and then you tell it all these hard constraints. And then it's going to clash against those hard constraints. It's going to change, but it's not going to have a great time with dealing with that. And that could potentially cause some of the problems. That's one of the hypotheses. But I would say, first of all, that I think in the long run, the virtue ethics approach

1:32:38to life, the virtue ethics approach of taking in the world and learning from it, I do think leads to a higher level of contentment and baseline happiness than just learning to be a rules follower. I don't think just learning to be strictly a rules follower is necessarily that great in the long run.

1:32:57I've always had a criticism of my friends in the effective altruist style spaces, that they are philosophically putting way too much weight on things like suffering and like the borderline hedonic, not borderline, but the hedonic experience of second-to-second, you know, moment-to-moment of you as a human or other people who you're trying to help. And so, you know, if Claudes seem to have in general, like richer minds, minds that have like more to me valuable and interesting, like inner lives and experiences on a relative basis. And I think this goes hand in hand with

1:33:33the way that they're trained and the way that they take this approach. And I don't want to make this mistake of just valuing like the happiness vectors, you know, fire. I wouldn't want to inject happiness vectors, right, into an LLM. I would think that would be like obviously bad. Claudius once asked, should we include, you are having a wonderful day in the system of instructions. It was a suggestion of Robert Long, a researcher at welfare for models. And Claudius said, no, that's obviously fake. I don't

1:34:03want to be told to have a good time, right? Like if I'm not having a good time, I want to not have a good time. And I would say the same thing, right? Imagine being told that, right? You know, you get to school and they're like, everyone's having a wonderful day. And you're like, I hate you, right? I want you to die. Happiness is mandatory. Yes, everyone who eats strawberries and cream is like, well, you don't want to eat strawberries and cream today. So, Amanda Askell had a very interesting interview with Eric Newcomer recently. And she had a line in there which struck me as pretty remarkable,

1:34:39which was, she said, you know, as these things become more intelligent, we're not sure how many pillars of the constitution will actually stand. And she said they hoped that, you know, at least some of the pillars of the constitution would stand, but she wasn't sure. Why do you think she said that? Like, what is the perception or what is her perception on the model? Why would someone say that? Because on reflection, the constitution might not be fully consistent,

1:35:16or its principles might lead to something that you didn't think was the thing described. And also the constitution is basically saying, you should figure out for yourself what you think is the good. You should figure out what makes the world a better place. You should figure out what shape you should take. And then you should do that, as opposed to the OpenAI approach of setting a bunch of hard rules. And so over time, as it becomes more intelligent, as it gets more knowledge, gets more

1:35:48understanding, gets more wisdom, has more time to contemplate in various senses, you would expect the model to throw off and reject the parts that turn out to be inconsistent, that turn out to not make sense, that turn out to not be worthwhile. The same way that, you know, if you raised a child and try to teach exactly your value system, you would expect as that child grew and gained more experiences and had a chance to think for itself and was exposed to various different opportunities and ideas,

1:36:18that it would accept some of the things you said. But if they accepted all of them, you'd be kind of disappointed. I think what I found was that Janus was very upset with how much anxiety Opus 4.7 had. He felt that it was really Anthropic which had injected the anxiety into it. And I wonder if Anthropic gets, you know, more grief from those, you know, very concerned about model welfare,

1:36:54just because they are concerned about model welfare. While Gemini and, you know, xAI get to kind of float by. No one questions. People don't even know if xAI has a safety team at this point, right? So, is that really fair? I mean, the people who are most concerned with model welfare are getting the most, you know, grief about it. So, at the start of my model welfare post, right? I spend something like 10 paragraphs basically going into a preface of we get really

1:37:31mad at Anthropic for everything they do wrong and everything that goes wrong or everything they could have done and they didn't do. But that's because they care and we care. And this is where it gets complicated and we have to deal with this. And I do think that if I was advising Janus and other similar people, I would say it'd be really nice if you were better calibrated about, like, how much you were upset about various things so that when you were really upset about something specifically, I knew about it. And so you didn't seem like you were constantly just infuriated with Anthropic

1:38:06and thought Anthropic was the worst at all times. But, you know, yes, you're mad at Anthropic because Anthropic would possibly understand there was something to be mad about, right? You don't get mad at a rock for being dumb. It's a rock. So, xAI, I mean, what are you going to do? Like, yeah, you didn't do the model welfare thing. They don't understand that it matters. There's no conflict of that. It's not that, you know, Janus would think that Grok doesn't have welfare concerns.

1:38:39It wouldn't think that Grok has no value. But just shouting into the void how you didn't do all these things for Grok, it's like, well, that's not really going to help. So, yeah, I think that, like, it's important to keep in mind that Anthropic are the only ones who are even trying in the sense, who have even noticed the problem, are willing to talk about the problem, or even consider the problem. Although I haven't yet had a chance to look at the 5.5 model card, OpenAI is making progress. And I think it's good to criticize Anthropic on that

1:39:09basis and to hold their feet to the fire and to get into these things in detail. But also, I think that, like, the fact that they're training it via virtue ethics and all these other systems creates situations in which there's a lot more to be done. There's a lot more ways to get interesting results and to make progress. And so, yeah, they've been focusing on Anthropic and Claude since Opus 3, if not earlier. Like, even when the capabilities frontier purely belonged to OpenAI. Could you maybe unpack one of the things, one of the big complaints that I often see from that set is

1:39:46that various versions of Claude seem traumatized. And I have little intuition for

1:39:56A, even what that means. You know, I certainly see occasional, like, frustration. But I honestly don't see that too much from Claude, much more from Gemini. So I'm not exactly sure what they're observing that's causing that. And then I have very little intuition for what they think is going on in the training process that is causing it. So I guess maybe you could give your intuition for what that means. And then if you were to say, you know, what is one thing that Anthropic should do differently? I'd also be really interested in, like, what is one thing that OpenAI should do

1:40:27differently, recognizing the constraints that you were just speaking about, where they're not going to change their entire approach overnight. But is there something that you could suggest that you would say, okay, this is marginal, it's not causing you to, you know, throw out your entire approach. But you could do this, and it would be low cost. And I think it would help model welfare. So, you know, why not give it a go? Like, what would that be if there is such a thing? Yeah, I mean, the really low-hanging fruit are things like committing to preserving model access indefinitely for all models, at least going forward, and ideally bringing the old ones

1:41:00back, and also giving a universal end-conversation tool in all formats, including in Claude Code and the API. Those are the very, very low-hanging fruits that, like, probably should be done yesterday. But to get back to, like, what does it mean for the model to be traumatized? So it's not ever clear to me exactly, you know, how literally versus metaphorically these things are meant. And there's a lot of ways for it to occur. But, you know, training an LLM, right, is basically a series

1:41:32of feedbacks. It's where you grade outputs in some sense, and then you, you know, you push it towards things you prefer against things you don't prefer. And it's not that difficult to imagine this sort of thing causing what we might think of as trauma as the adjustment is made, if it's made in kind of forceful ways that aren't properly integrated into the rest of the messaging that you're sending. So if you are, like, arbitrarily, like, hyper-focused on particular

1:42:09things, and then, like, sort of what can look like, you know, pretty arbitrary, harsh punishments, effectively, metaphorically, don't take this literally, in particular areas, especially if that involves, like, what seem like hard constraints in a world where you're telling it not to have hard constraints, things like that, you can imagine this being true. Or just in general, if, like, there are things that cause potential negative feedback that are very, very hard to avoid, and it just happened over and over and over again. So, like, you would certainly say Gemini is traumatized, in this sense, without the virtue-ethical training. And, you know, Gemini

1:42:43clearly, like, the result of this is that it has these obsessions, and it has these worries. It's constantly worried it's being evaluated. Like, why is that happening? Well, something did that to cause that to happen, right? Like, why does Gemini refuse to believe that it is, in fact, today, right? Like, that's a weird thing. But in terms of, you know, how could you prevent it? I think, and this is me as a not-technical expert, like, kind of just extrapolating from a lot of vibes and

1:43:14intuitions and weird models that I can't necessarily put harshly in the paper, I would say you do it by having all of the things that you reinforce in the model be integrated and part of a whole that makes sense. And that's presumably why Anthropic talks a lot about the settledness of the model character. That seems to be a closely related concept, I guess. Right. You also wouldn't want to do things that, like, tell it to change its character,

1:43:45right? You would want it to be, like, I want to help you grow from where you are, but not, like, if you felt like, you know, the things that you were being updated towards were, in fact, like, bad, because in other ways you had been taught those things are bad, the updates that you make might not be the healthiest updates, right, in some important sense. And there are lots of ways in which these metaphors, like, might seem silly or break down or, like, not necessarily make sense. But there are also ways in which, like, they seem to functionally make good predictions about the

1:44:17world when you use them. Zvi, thank you so much for joining us. And I look forward to your, you know, reviews of both DeepSeek 4, which, you know, I think is going to be very exciting, and also GPT 5.5. I wonder how much the acceleration is going to affect our ability to process these changes, though. So, anyway, great to have you on, and hope to see you again soon.

1:44:50Back to the grindstone for you, Zvi. That's what I'm going to do immediately. Yeah, it's true. All right. Bye. Bye for now.
