
Moonlake: Causal World Models should be Multimodal, Interactive, and Efficient — with Chris Manning and Fan-yun Sun
April 2, 2026 · 1h 6m · 11,827 words
Show notes
We’ve been on a bit of a mini World Models series over the last quarter: from introducing the topic with Yi Tay, to exploring Marble with World Labs’ Fei-Fei Li and Justin Johnson, to previewing World Models learned from massive gaming datasets with General Intuition’s Pim de Witte (who has now written down their approach to World Models with Not Boring), to discussing the Cosmos World Model with Andrew White of Edison Scientific on our new Science pod, to writing up our own theses on Adversarial World Models. Meanwhile Nvidia, Waymo and Tesla have published their own approaches, Google has released Genie 3, and Yann LeCun has raised $1B for AMI and published LeWorldModel.
Today’s guests have a radically different approach to World Modeling from every player we just mentioned. While Genie 3 is impressive, its many flaws demonstrate the issues with that approach: terrain clipping, noninteractivity (single player, no physics, no objects other than the player move), and a maximum of 60 seconds of immersion. Moonlake AI (inspired by the DreamWorks logo) is the diametric opposite: immediately multiplayer, incredibly interactive, indefinite lifetime, and capable of MANY different kinds of world models by simulating environments, predicting outcomes, and planning over long horizons. This is enabled by bootstrapping from game engines and training custom agents.
In Towards Efficient World Models, Chris Manning and Ian Goodfellow join Fan-yun in explaining why their approach to efficiency with structure and causality, instead of just blind scaling, is sorely needed:
SOTA models still show physical or spatial understanding glitches, such as solid objects floating in mid-air or moving “inside” other solid objects. If the goal is to plan for the next action, how often is a high-resolution pixel view necessary for modeling the world? Our bet is that there is a disproportionately large share of economically valuable tasks where such detail is not required.
After all, humans with a wide variety of sensory limitations have little difficulty doing almost everything in the world. Furthermore, for a large number of purposes, describing a scene or a situation in a few words of language (“the car’s tires squealed as it cornered sharply”) is sufficient for understanding and planning. Experiments also show that humans only partially process visual input in a top-down, task-directed way, often making use of abstracted object-level modeling. In almost all cases, partial representations combined with semantic understanding are sufficient. … If the goal is to facilitate the understanding of causality in multimodal environments, then the world model, whether it is used in the virtual world or the physical world, must prioritize properties such as spatial and physical state consistency maintained over long time periods, and an ability to evolve the world that accurately reflects the consequences of actions.
That’s what Moonlake is building. Game engines are the right starting abstraction for efficiently extracting causal relationships, and Moonlake is building the interfaces and community (including their new $30,000 Creator Cup) to kickstart the flywheel of actions-to-observations. We were fortunate enough to attend their sessions at GDC 2026 (the Mecca of Game Devs), and were impressed by the huge variety and flexibility of the worlds people were already building with Moonlake’s tools! Live videos on the pod.
Full Video Pod on YouTube!
Timestamps
00:00 Benchmarking Gets Hard
00:47 Meet Moonlake Founders
01:26 Why Build World Models
03:12 Structure Not Just Scale
05:37 Defining Action Conditioned Worlds
07:32 Abstraction Versus Bitter Lesson
14:39 Language Versus JEPA Debate
20:27 Reasoning Traces And Rendering Layer
37:00 Gameplay Over Graphics
38:02 Fiction Rules And World Tweaks
39:15 Code Engines Beat Learned Priors
41:10 Diffusion Scaling Limits
43:23 Symbolic Versus Diffusion Boundary
46:14 Platform Vision Beyond Games
50:24 Spatial Audio And Multimodal Latents
54:23 NLP Roots Hiring And Moonlake Name
Transcript
[00:00:00] Cold Open
[00:00:00] Chris Manning: I think this whole space is extremely difficult as things are emerging now. And I mean, it’s not only for world models, I think it’s for everything, including text-based models, right? ’Cause in the early days it seemed very easy to have good benchmarks, ’cause we could do things like question answering benchmarks. [00:00:20] But these days so much of what people are wanting to do is nothing like that, right? You’re wanting to get some recommendations about which backpack would be best for your trip in Europe next month. It’s not so easy to come up with a benchmark, and it’s the same problem with these world models.
[00:00:41] Meet the Founders
[00:00:41] swyx: Okay. We’re back in the studio with Moonlake’s two leads. I guess there’s other founders as well, but: Sun and Chris Manning. Welcome to the studio.
[00:00:54] Fan-yun Sun: Thanks. Thanks, Chris. Thanks for having us.
[00:00:56] swyx: You guys have burst onto the scene with a really refreshing [00:01:00] new take on world models. [00:01:01] I would just want to, I guess, ask how the two of you came together. Chris, you’re a legend in NLP and just AI in general. You’re his grad student, I guess?
[00:01:10] Fan-yun Sun: Actually my co-founder.
[00:01:11] swyx: Oh, yeah.
[00:01:12] Fan-yun Sun: I should give a lot of credit to my co-founder, Sharon.
Yeah. She was actually working with Professor Fei-Fei and [unclear], and then she ended up working with Ron and Chris Manning here. [00:01:22] And so I got connected to Chris initially, actually, through my co-founder.
[00:01:26] What is Moonlake?
[00:01:26] swyx: What is Moonlake? Actually, I’m also very curious about the name, but why go into world models?
[00:01:33] Fan-yun Sun: So I was working a lot with Nvidia Research, actually, during my PhD years on essentially generating interactive worlds to train reinforcement learning agents or embodied AI agents. [00:01:44] And then there’s two observations, one in academia and one in industry. In industry, folks at Nvidia are actually paying a lot of dollars to purchase these types of interactive worlds, whether it’s for the sake of evaluation or training the robots, or policies or models. And [00:02:00] then, in academia, the same thing is happening. [00:02:02] And more specifically, when I was working with Nvidia on the synthetic data foundation model training project, we were generating a lot of this synthetic data and showing that, hey, this synthetic data is actually as useful as real-world data when it comes to multimodal pre-training. [00:02:16] But then, like I said, there’s a lot of dollars being paid out to external vendors or other folks to manually curate these types of data. It was very clear to us that, okay, on our way to, let’s call it embodied general intelligence, models need to learn the consequences behind their actions, which means that they need interactive data, and the demand for those types of data is growing exponentially. [00:02:38] But everybody’s sort of thinking about it from a pure, say, video generation perspective or something else. But we feel like the true opportunity is actually building reasoning models that can do these things like how humans do these things today.
So that’s a little bit on the genesis of Moonlake, and I think the reason I got into world models was partly [00:02:59] a philosophical [00:03:00] take on the world, where I, like, believe the simulation theory and stuff like that. But on the other hand, it’s really just like, oh, there’s an opportunity there that I feel like nobody’s doing the way I think it should be done.
[00:03:10] Structure, Not Scale: The Vision
[00:03:10] Chris Manning: I can say a little bit about that. [00:03:12] Yeah. So the overall goal is the pursuit of artificial intelligence, and most of my career has been doing that in the language space, and that’s been just extremely productive. As we all know, the story of the last few years, I don’t have to tell about how much we’ve achieved with large language models. [00:03:31] But although they have been extremely effective for ramping language and general intelligence, it’s clearly not the whole world. There’s this multimodal world of vision, sound, taste that you’d like to be dealing with, more than just language. And then the question is how to do it. And despite a huge investment in the computer vision space, right, as a research field computer [00:04:00] vision has been, for decades, far, far larger than the language space, actually, [00:04:05] I think it’s fair to say that vision understanding sort of stalled out, right? You got to object recognition and then progress just wasn’t being made, right? If you look at any of these vision language models, it’s the language that’s doing 90% of the work and the vision barely works. And so there’s really an interesting research question as to why that is, and at heart, the ideas behind Moonlake are an attempt to answer that, believing that there can be a really rich connection between a more symbolic layer of abstracted understanding of visual domains, which isn’t in the mainstream vision models, which are still trying to operate on the surface level of pixels.
[00:04:50] swyx: I think in one of your blog posts, you put it as structure, not scale. Is that a general thesis?
[00:04:57] Chris Manning: Yeah. Well, scale is good too.
[00:04:58] swyx: Yeah. Scale is good too.
[00:04:59] Chris Manning: [00:05:00] Lots of data is good as well, and scale, but nevertheless you want the structure, yeah, to be able to learn much more efficiently.
[00:05:07] swyx: Yeah. The other thing I really liked also is you put out an example of what your kind of reasoning traces look like, [00:05:12] right? Which you would, distill is the word that comes to mind. I don’t even think that’s a good description, but it would involve, for example, geometry, physics, affordances, symbolic logic, perceptual mappings, and what have you. But that is the kind of example that involves, let’s call it spatial reasoning, world model reasoning, as compared to normal LLM reasoning. [00:05:35] Yeah.
[00:05:36] Defining World Models vs Video Generation
[00:05:36] Vibhu: But also, taking a step back: how do you guys define world models? A lot of people see, okay, you can do diffusion, you can do video generation. But you guys put out quite a few blog posts. You put out an essay recently, we can even pull it up, about efficient world models. You have a pretty structural definition here, but for the general audience that doesn’t super follow the space, right, [00:05:55] what’s the difference in what we see from, like, a video generation model to [00:06:00] a world gen, a simulator? How do you kind of paint that?
[00:06:02] Chris Manning: Yeah, so I think this is actually a little bit subtle, because people look at these amazing generative AI video models, Sora, Veo 3, Genie, one of these things, and they think, oh, this is amazing. [00:06:17] We’ve solved understanding the world, because you can produce these generative AI videos. But the reality is that although the visuals do look fantastic, those visuals aren’t actually accompanied by an understanding of the 3D world, understanding how objects can move, what the consequences of different actions are, and that’s what’s really needed for spatial intelligence. [00:06:49] So I mean, a term we sometimes use is that you need action-conditioned world models. That you only actually have a world model if you can predict, [00:07:00] given some action is taken, what is going to change in the world because of it. And in particular, that becomes hard over longer time scales. So if you’re simply trying to [00:07:12] predict the next video frame, that’s not so difficult. But what you actually want to do is understand the consequences, the likely consequences, of actions minutes into the future. And to do that, you actually need much more of an abstracted semantic model of the world.
[00:07:32] The Bitter Lesson & Data Abstraction
[00:07:32] swyx: Yeah, the question comes where you want to have more structure than is available in just predicting the next token. [00:07:41] And typically, well, let’s call it the experience of the last five years, has been that that is just washed away by scale, right? So what is the right middle ground here, where you don’t ignore the bitter lesson, but you also can be more efficient than what we’re doing today?
[00:07:57] Chris Manning: One possibility [00:08:00] is, look, if we just collect masses and masses and masses and masses of video data, this problem will be solved. [00:08:11] Under certain assumptions that could be true, but there are sort of multiple avenues in which it could not be true. The first is that what’s really essential is understanding the consequences of actions, producing an action-conditioned world model. And if you are simply collecting observational video data, which is the easy stuff to collect, when you’re sort of mining online videos, you don’t actually [00:08:41] know the actions that are being taken to see how the video is changing. And so if you are never collecting actions directly and you are having to try and infer them from what happened in the observed video, that’s not impossible, but it’s very [00:09:00] hard, and it’s not really established that you can get that to work at any scale yet. [00:09:05] And so there’s a lot of premium on collecting action-conditioned video data, which is part of why there’s been a lot of interest in using simulation, so that you can be collecting data where you do know the actions, which is in quite limited supply. But there’s also, in the limit of as much data as you could possibly have, [00:09:28] maybe the problem is eventually solvable, but even though we collect huge amounts of text data, it is always at a great level of abstraction, right? Language is a human-designed, abstracted representation where there’s meaning in each token and it’s representing an abstraction of the world, right? [00:09:51] As soon as you are describing someone as a professor, and as soon as you are saying that they’re condescending, right, these are very [00:10:00] abstracted descriptions of the world. It’s not what you’re observing at pixel level, and to get to that kind of degree of abstraction starting from pixels is orders of magnitude of extra data and processing. [00:10:14] And so, although we absolutely want to exploit, get as much data as possible, use the bitter lesson, nevertheless, if there are ways in which you can work with five orders of magnitude less data than people working purely from pixels, you’re gonna be able to make a lot more progress, a lot more quickly. [00:10:34] And that’s the bet here. And so you could just say that’s only wanting to be able to do it more efficiently, do it more quickly, do it more cheaply. But I think it’s actually more than that. I think one should be making the analogy to how human beings work at one level. You know?
Yes, we have these high [00:11:00] resolution eyes and we can look and see a scene like a video, but all of the evidence from neuroscience and psychology is that most of what comes into people’s eyes is never processed, [00:11:13] right? That you are doing fairly fine-grained processing of exactly what you’re focusing on. But as soon as it’s away from that, of, yeah, there’s another guy over there, you’re sort of only processing top-down this very abstracted semantic description of the world around you. And so that’s what human beings are doing. [00:11:33] They’re working with semantic abstractions. And so I think it is just the right representation. ’Cause we also have other goals. We want to be able to do real-time worlds, so that means there’s a limit to how much processing you can do, and we want to do long-term planning and consistency. And again, that favors abstraction. [00:11:55] I mean, I guess there was actually a recent blog post that [00:12:00] came out from our friends at Physical Intelligence, and they were sort of heading in the same direction. They were saying, oh, to the pi
[00:12:06] swyx: Pi model.
[00:12:07] Chris Manning: Yeah. Yeah. To maintain a long-term memory of what’s happening in the world, so we can do longer term, we’re actually storing text of what’s been happening in the world. [00:12:19] Right. It is not such a successful strategy of trying to keep it all at a pixel level.
[00:12:24] Vibhu: And yeah, I mean, you can see it in video models, like that temporal consistency. We’re at a scale of train on all the video data we have. We have it for maybe 30 seconds, a few minutes. That’s not the same as a game state played for half an hour, [00:12:37] right? I thought you guys break it down pretty well. You have a blog post about building multimodal worlds with an agent. I dunno if you guys wanna talk about this.
This is one of the things I read, I thought...
[00:12:48] swyx: Yeah, it’s the thing I talked about with the reasoning chain. Yeah.
[00:12:51] Vibhu: So there’s like different phases to this. [00:12:53] It seems like it’s more of an agent, a scaffold, a very different approach than just type in a prompt, and you don’t have the same consistency. [00:13:00] Also, for people that are listening, I would highly recommend reading it. It breaks down the problem in a different light, right? [00:13:06] So like, what do you need to consider when you’re talking about video, like world game models, right? What do you need to consider? What are the factors? What are the elements? What’s the state? So I don’t know if you guys have stuff to talk about for this one.
[00:13:19] Fan-yun Sun: Yeah. Actually, I wanted to add on a little bit, yeah, [00:13:22] on our previous point, which is, just like, we changed topics so quickly. I do feel like sometimes people confuse, like, oh, we’re taking a method with abstraction, that means they don’t believe in the bitter lesson. That’s just false, right? We are believers in the bitter lesson. But then I feel like the question that we always discuss is, what is the right abstraction level today? [00:13:42] The analogy I like to make is, let’s just say we can encode and decode, represent all of images, videos, and audio in bytes. Then the most bitter-lesson approach is to train a next-byte prediction model as opposed to a next-token prediction model, where it’s just like, okay, it’s natively multimodal. But it’s like, yeah, [00:14:00] to Chris’s point, it’s the scale and compute you need to achieve that. [00:14:03] So that’s why we always come back to, okay, what is the most efficient way to do it? And reasoning models, to the point of this blog post, is a showcase of, hey, we’re actually just reasoning about the world and reasoning about the aspects of the world that matter for me to learn what I want to learn from this world model.
[00:14:21] swyx: Yeah, it’s like you’re improving the encoder of whatever you’re trying to model. And a better representation would just represent the important things in less space, yeah, which would just be more efficient.
[00:14:33] Fan-yun Sun: Yeah.
[00:14:34] swyx: So yeah, I fully agree that it is not antagonistic to the bitter lesson. [00:14:38] I do wanna mention one more thing. Are there any philosophical differences with the JEPA stuff that Yann is working on? I gotta go there. You’re imagining some latent abstraction. I’m like, okay, fine. Let’s talk about it, right? It’s an elephant in the room.
[00:14:52] Chris Manning: Yeah.
[00:14:53] JEPA & Philosophical Differences with LeCun
[00:14:53] Chris Manning: There are philosophical differences. Yann LeCun is a dear friend of mine, but [00:15:00] he has never appreciated the power of language in particular, or symbolic representations in general. Yann is a very visual thinker. He always wants to claim that he thinks visually and there are no words, symbols, or math in his head. [00:15:21] Maybe that’s true of Yann. It’s certainly not the way I think. But at any rate, the world according to Yann is: the basic stuff of the world and of intelligence is visual, and language is just this low bit rate communication mechanism between humans, and it doesn’t have much other utility, and it’s far inferior to the high bit rate video that comes into your eyes. [00:15:53] And I think he’s fundamentally missing a number of important things [00:16:00] there. Think of this evolutionary argument looking at animals, right? The closest analogies, the things with chimps, right? So chimpanzees have fairly similar brains to human beings. They have great vision systems, they have great memory systems.
[00:16:18] They’ve got better memory than we do for short-term memories. They can plan, they can build primitive tools. But humans are massively ahead in what we understand about the world, what we can plan, what we can build. And essentially what took off for us was that humans managed to develop language, and that gave a symbolic knowledge representation and reasoning level, which just gave this sort of vaulting of what could be done with the intelligence in brains. [00:16:59] So the [00:17:00] philosopher Dan Dennett refers to language as a cognitive tool and argues that humans, unique among the creatures in the world, have managed to build their own cognitive tools, and language is the famous first example. But other things like mathematics and programming languages are also cognitive tools. [00:17:21] They give you an ability to think in abstractions, in extended causal reasoning chains. And that allows you to do much more. And we use that for spatial representation and intelligence and planning and gameplay as well. So we believe, and this is underlying the specific technologies that Moonlake is making, that symbolic representations are powerful, [00:17:50] and you want to use that in your understanding of the visual world when you want a causal understanding, when you want to maintain long-term [00:18:00] consistency and prediction. And as I understand it, that’s just not in Yann LeCun’s worldview. So I think that’s the fundamental philosophical difference. Then there’s the specific model [00:18:11] he’s been advancing, JEPA. That’s a reasonable research bet as a direction to head for building out a model of the visual world. To my mind, it’s sort of one reasonable research bet. It’s not really established that it’s the best one that everyone should be following.
[00:18:32] swyx: At least developed at scale, at Meta. [00:18:34] But it’s not just vision, right? I mean, JEPA is just joint embedding prediction, it can be applied to anything really.
And people have done it. The argument is that there is a latent representation that is probably more suited to the task, so why not let machines do it for us instead of predefining it at all? [00:18:50] And isn’t something like a JEPA-shaped thing the right answer? And if not, why not?
[00:18:55] Chris Manning: So I think there’s a part of JEPA that’s right, which is [00:19:00] you do want to have a joint embedding that gives you a consistent model of the world. And Yann’s argument is you can never get that from autoregressive language models, ’cause they’re sort of left to right, churning out one token at a time. [00:19:22] I guess this is where we’re in the research arguments of the field. I’m not actually convinced that’s right. ’Cause although the token production is this autoregressive process that’s heading left to right, I guess it doesn’t have to be left to right. But anyway, in a sequence of tokens, we could have right-to-left Arabic. [00:19:40] But although that’s true, all of the weights of the model that are internal to the transformer, they are a joint model of the model’s understanding of the world. And so I think you can think of the weights of the model as a form of joint representation, [00:20:00] and therefore it is plausible to think that could be the basis of a world model, which avoids Yann’s objections.
[00:20:10] swyx: I think I follow, and obviously that would touch on what Moonlake eventually ends up doing as well, right? Which is hard to tell, because you put out the end results, but we don’t know the inputs that go into it. So that’s something that we have to figure out over time.
[00:20:25] Vibhu: Yeah. I mean, I guess this kind of breaks down some of the outputs. Do you wanna walk us through it?
[00:20:31] Reasoning Traces & Interactive Worlds
[00:20:31] Fan-yun Sun: Yeah. So this really just walks us through the reasoning traces of, like, okay, let’s just say we wanna build a world. In this context, it’s really just a game demo that shows the variety of interactions that this world model can build. [00:20:45] And yeah, it’s really just the reasoning traces of, like, okay, it’s prompted to create a bowling game. How did it achieve what you saw, that level of causality, interaction, and consistency, right? So yeah, this is almost just like an example of [00:21:00] the reasoning traces.
[00:21:01] swyx: Very detailed.
[00:21:01] Fan-yun Sun: Yeah.
[00:21:01] Vibhu: Very, very detailed. [00:21:02] You don’t even realize it, right? Like when a video is generated, what happens when a ball strikes a pin, right? So first, there’s audio in that, like audio triggers happen, the score increments, the world changes. Like pins have to start dropping. There’s a timer that goes on. It’s very similar to how now we’re used to reasoning for language models. [00:21:20] There’s a whole state of what happens. So geometry, physics, all this stuff. And then yeah, there’s kind of that single prompt. So asset generation, all this stuff. It’s a nice view to see what’s going on.
[00:21:32] swyx: I think Sun is also too polite to point out that both Google’s Genie demos as well as World Labs’ Marble do not have interactive worlds.
[00:21:41] Fan-yun Sun: That’s the benefit of having a reasoning model, right? Because you can say, oh, maybe in this particular context, I want to learn how to bowl. And then you can say, okay, then what is important when it comes to learning how to bowl? Okay, maybe it’s like, I need to understand the basics of, like, physics, and I want to throw it over [00:22:00] them. [00:22:00] I wanna know that when it resets it’s a new game. So yeah, basically, you know to pick up the ball, you know that the ball’s gonna cause the pins to fall down.
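Editor’s sketch (not from the episode): the bowling walkthrough can be read as an action-conditioned transition over a small symbolic state, where each action yields a next state plus side effects such as audio triggers and score changes. The Python below is a minimal illustration under stated assumptions. `BowlingState`, `step`, and the `pins_hit` argument (a stand-in for whatever a physics engine would report) are hypothetical names, and nothing here describes Moonlake’s actual implementation.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class BowlingState:
    """Hypothetical symbolic game state (an assumption, not Moonlake's schema)."""
    pins_standing: int = 10
    score: int = 0
    ball_in_hand: bool = False
    events: tuple = ()  # side effects triggered by the last action


def step(state, action, pins_hit=0):
    """Return the next state given an action.

    `pins_hit` stands in for the outcome a physics engine would compute.
    """
    if action == "pick_up_ball":
        return replace(state, ball_in_hand=True, events=())
    if action == "throw" and state.ball_in_hand:
        knocked = min(pins_hit, state.pins_standing)
        events = ("pin_strike_audio",) if knocked else ()
        return replace(
            state,
            pins_standing=state.pins_standing - knocked,
            score=state.score + knocked,  # score tracks pins knocked down
            ball_in_hand=False,
            events=events,
        )
    if action == "reset":
        return BowlingState()  # a reset starts a new game
    return replace(state, events=())  # action has no effect in this state


state = BowlingState()
for action, hit in [("pick_up_ball", 0), ("throw", 7), ("pick_up_ball", 0), ("throw", 5)]:
    state = step(state, action, pins_hit=hit)
    print(action, state)
```

Because the state is explicit, the same rules hold however long the session runs, which is the long-horizon consistency property discussed in the conversation.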
You know that what’s important to this particular bowling game is to score, and you know that the score corresponds to the number of pins that fell down. [00:22:19] So it’s just like, if it’s a model that sort of knows what a bowling game looks like, but doesn’t actually allow you to practice over and over again and to understand what it takes to actually get a high score, then it sort of doesn’t actually allow you to learn what you set out to learn within the world model. [00:22:38] And I think this is really just one example showing the advantages of the approach that we’re taking over, let’s call it the zeitgeist today, when people talk about world models.
[00:22:51] Chris Manning: Right. So it sort of seems like the question to ask when there’s a world model is: [00:22:58] can I not [00:23:00] only just wander around the world and look at the beautiful graphics, can I interact with the objects in the world and see the right consequences of actions?
[00:23:11] Vibhu: And you also understand what the consequences would be if you do something, right? So it’s not just like, okay, there’s one thing, if I pick it up, something will happen. [00:23:19] But there’s 50 options and I know I can expect, I can infer, what would happen if I do any of them, right? So it’s very different when you can actually see it, play around with it.
[00:23:28] Beyond Unity: Cognitive Tools for World Building
[00:23:31] swyx: There’s two cheeky elements of that. I mean, I guess the less ambitious one is, let’s really establish for listeners, why is this fundamentally different than writing Unity code, right? [00:23:40] Like just creating a model to translate a prompt into Unity code.
[00:23:44] Fan-yun Sun: So there is an underlying physics engine, yeah. In that sense, there’s some overlap with Unity, but the way we think about it is, like, physics engines, tools, or code are cognitive tools, borrowing Chris’s term, right? Tools [00:24:00] that the model can employ as a means to an end. [00:24:04] So today maybe you say, okay, in this particular context we care about physics, we care about the long-term causal consequences. Then yes, we employ a physics engine. And then maybe tomorrow we say, okay, we’re training, let’s just say, drones, where we only care about really fluid dynamics and the visual aspect of the world. [00:24:25] Then yeah, maybe the model actually doesn’t have to use a physics engine. Or maybe it employs other types of representation or physics engine to achieve the task. So yes, writing code for Unity is sort of similar to a tool that our model can employ, but our goal is for a model to take a representation-conditioned reasoning [00:24:46] approach or process
[00:24:47] swyx: Yeah.
[00:24:47] Fan-yun Sun: internally.
[00:24:48] swyx: Yeah. Using these things as just like general tool calls, right? Which I think is very interesting. The other, more ambitious one is some kind of recursive element where it becomes multiplayer, right? Like here, there’s a single player element, you’re not [00:25:00] modeling any other people involved. [00:25:01] And that is a whole other thing.
[00:25:04] Fan-yun Sun: But in fact, we can really do multiplayer. Oh yeah, okay. I haven’t seen any double situations. So you actually just prompt our model to say, hey, configure to multiplayer. Then it’ll do, like, this. You’ll be able to configure multiplayer,
[00:25:16] swyx: Great.
[00:25:17] Fan-yun Sun: persistency, database for you. [00:25:18] Easy. Yeah.
[00:25:19] Vibhu: So what are some of the current limitations in where we’re at? So there’s one approach of, okay, scale up video predictors. Obviously there’s data issues. With approaches like this, is it data constraints? What are the next steps? Is it real time?
Like, so there’s one side of, write an agent to write Unity code, but okay, I want to be streaming a game in real time. [00:25:38] I want to have characters being also like agents. But where do we kinda see this scaling up, right?
[00:25:44] Fan-yun Sun: Yeah, there’s definitely a data constraint. Like, the more data, the better this reasoning model can almost basically act as humans do, to operate a variety of tools and software to build whatever’s necessary. [00:25:57] And then there’s a sort [00:26:00] of fidelity constraint, which we’re actually solving with another model, which we can talk about later. It’s not as easy to get to photorealism with the approach that we’re taking. But we think there are better solutions to that, which we can dive into later.
[00:26:15] Vibhu: The one thing you note here is it’s a diffusion model, right? So there’s a few approaches: diffusion, Gaussian splatting. Yeah, so Rie, a diffusion model, you guys wanna
[00:26:25] Fan-yun Sun: Yeah.
[00:26:25] Vibhu: introduce?
[00:26:26] Fan-yun Sun: Yeah, totally.
[00:26:26] Rie: Neural Rendering & Skins for Worlds
[00:26:26] Fan-yun Sun: So within our world modeling framework, there are two models that we train, right? [00:26:31] There’s the multimodal reasoning model that we just talked about, which essentially handles mainly the causality, the persistency, and the logic determinism of the world. And then Rie is our bet on saying, okay, while that model can take care of all these things that we just talked about, its limitation compared to existing, say, video models is that it doesn’t have as high a pixel [00:27:00] quality right off the gate, right? [00:27:02] And Rie is to say, hey, we can actually take whatever persistent representation we generate with our multimodal reasoning model and learn to restyle it into photorealistic styles or arbitrary styles you want.
So this model is almost to say, hey, I’m going to respect the persistency and interactivity of the world that you created, but my only job is to make sure that its pixel distribution is close to what we want. [00:27:29] Vibhu: Yeah. [00:27:30] swyx: Great example right there. You kept the KL divergence. [00:27:33] Fan-yun Sun: Oh. Where? [00:27:34] swyx: No, no. I mean, this is a classic case of how you don’t stray too far from the source material. You kept the KL, which is, oh yeah, kind of cool. Yeah. [00:27:43] Fan-yun Sun: Yeah. [00:27:44] swyx: I mean, and the [00:27:44] Chris Manning: difference is, and I mean Sun was pointing at this, sort of saying it’s in one way a more difficult path, but a better path. Typically the diffusion models are producing the whole scene and it looks lovely, [00:28:00] but there isn’t spatial understanding behind it, which is what allows for the real-time graphics gameplay, the spatial intelligence, understanding the consequences in worlds, whereas this is taking a path where it is assuming an abstracted semantic model of the world’s state. [00:28:20] And then the diffusion model is being used on top of that to produce the high-quality graphics. [00:28:27] swyx: Is there an intended practical or business use for this, or is it a demonstration of capabilities? [00:28:34] Fan-yun Sun: We actually believe that this is gonna be the next paradigm of rendering. So it’s gonna replace how ra raizer, it’s gonna replace DLSS as it is today, because it has these pixel priors learned from the world, such that you can literally play any game in photorealistic styles, which is what a lot of people want when they play GTA, right? [00:28:51] Like, [00:28:51] Vibhu: all the mods, all the people adding perfect lighting and all this. [00:28:54] swyx: So [00:28:54] Fan-yun Sun: skins [00:28:55] swyx: for worlds, let’s call it [00:28:56] Fan-yun Sun: skins, let’s call it skins for worlds. 
I, [00:28:58] Vibhu: it’s also, like, you can call it skins, you can call it [00:29:00] customization. You can play it how you want, right? [00:29:01] Fan-yun Sun: Yeah, exactly. And I think another thing that we pointed out specifically in this blog is the programmability of it, right? [00:29:09] So what this means is that historically the render is always a derivative of the game state, right? You’re saying, oh, here’s the game state, I’m rendering out a frame. But here I’m saying this render can actually be part of the gameplay loop. I can say something along the lines of: upon getting 10 [00:29:26] apples, my weapon of choice, my bullets, are gonna turn into apples. And that’s possible because we can basically dynamically have certain game states trigger the preconditions to the render, such that the rendering is now part of the game loop too. One thing is to just say, okay, it’s the appearance. [00:29:47] But the second thing is also to say there are these novel interactions that are possible because this renderer now actually has priors of the world. [00:29:57] swyx: It is up to the artist to figure out what to do with it. [00:29:59] Fan-yun Sun: It [00:30:00] is up to the creators. Yes. [00:30:01] swyx: Yeah. [00:30:01] Fan-yun Sun: And I also think that’s actually another big argument that we’re making, and the reason we’re taking the bet we’re making is that a lot of the time, whether it’s for embodied AI or gaming, you want a layer where humans can inject their intentions. [00:30:15] So, for example, in the context of gaming, it’s obviously my creative intent, but maybe in the context of embodied AI, it’s like, oh, I take this foundational policy and I want to fine-tune it to deploy in my house. So you want to have a layer where a human can say, oh, here’s the distribution of things I want to create to achieve my goal. 
[00:30:35] And I think 3D graphics as it is today is basically the layer for people to say, hey, what do I care about in this world? And it allows human intent to be expressed in these worlds much more explicitly and distributionally, as opposed to just saying, hey, I’m gonna generate something arbitrary. [00:30:54] And it’s just prompts. [00:30:55] swyx: It’s one of those things where I think you’re going to build up a series of models, right? [00:31:00] This is just one of them; this is probably the highest-utility or highest-frequency one, I don’t know what to call it, where you can immediately drop this in on any game and you don’t need anything else [00:31:10] that you guys do. But I could see that. I think the human intent is something that people are not even used to, because we’re so used to static worlds, or worlds that just don’t react, or, I don’t know. You’re kind of blowing my mind right now. I wonder if you’ve talked to people at GDC. Hmm. [00:31:27] And what are they gonna do with it? [00:31:30] Fan-yun Sun: Yeah. Now the stance that we take on this front is, we’re not gonna be more creative than our users, so ship [00:31:35] swyx: it out. [00:31:35] Fan-yun Sun: Yeah. But we wanna make sure that we’re building things in a way that really allows them to express their intent. [00:31:41] swyx: The thing that you said about, here’s the distribution that I want. [00:31:45] I think text may be too low-bandwidth to really demonstrate that, because I’m probably just gonna want to drop in a bunch of reference assets and then you can figure it out from [00:31:58] Vibhu: there. But you probably wanna do a mixture of [00:32:00] both, right? Like you throw in a few images. I want this style. [00:32:02] Yeah. I want it to look like this. So it’s a mixture, right? [00:32:05] Chris Manning: I think it’s a mixture. 
I mean, yeah, there’s clearly a visual component to this, and it’s not that everything can be text, ’cause of course you want to give a visual look. But there’s also a massive amount, giving the overall picture of the look of the world and the behavior of things, that you can express in a few words of text, [00:32:32] and that would be very time-consuming and difficult to do via visual means. So I think, yeah, you want a combination of both. [00:32:40] Evaluating World Models [00:32:40] Vibhu: So one question I kind of have is, how do we go about evaluating world models? There are many axes, right? One is, okay, I have preferences: how well do we adhere to prompts? One is the simulation. [00:32:50] One is, is there core logic that’s broken? We know how to evaluate diffusion: there’s fidelity, there’s [00:33:00] stuff like that. But what are some of the challenges that most people probably aren’t thinking about? [00:33:04] Fan-yun Sun: Yeah, I think this is a great question and probably one of the hardest questions in world models, because I think it always comes back to what you are building this world model for. [00:33:13] And depending on your end goal and purpose, the evaluation should differ. So in the context of games, the most direct way of measuring is how much time people are actually spending in the world that you create. And if your goal is, for example, in the context that we just talked about, deploying an embodied agent, then your end [00:33:33] metric is, okay, after training in these worlds that you generate, how robust it is when you actually deploy to the target environment. But it’s hard to measure these end metrics. So today people have what I call proxy metrics that basically try to measure what we really care about, which is the end metrics, but frankly it’s different for every use case. 
[00:33:57] Yeah, [00:33:57] Vibhu: which seems like quite a challenge, right? In [00:34:00] language models or video models or image models, your benchmarks are proxies, right? People aren’t actually asking instruction-following or tool-use questions. They’re proxies for how well it will do downstream. But for this, should teams, should companies have their own individual benchmarks outside of games? [00:34:16] If you think of stuff like video production, movies, stuff like that, that also wants to use world models, should they sort of internalize their own proxy? Is this something you guys do? Where does that go? [00:34:28] Chris Manning: Yeah, I think this whole space is extremely difficult as things are emerging now. [00:34:35] And I mean, it’s not only for world models, I think it’s for everything, including text-based models, right? ’Cause in the early days it seemed very easy to have good benchmarks, ’cause we could do things like question answering benchmarks, could you answer the question based on these documents, and various other kinds of pieces of logical reasoning or math. [00:34:58] But these are sort of small component tasks, [00:35:00] and there were visual equivalents, things like object recognition, right? These days so much of what people are wanting to do, also with language models, is nothing like that, right? You’re wanting to have an interaction with the language model and get some recommendations about which backpack would be best for your trip in Europe next month. [00:35:25] And it’s not the same kind of thing, right? And it’s not so easy to come up with a benchmark as to whether this large language model gives you an effective interaction for guiding you in a good way for shopping, right? And it’s the same problem with these world models. So if we take the game design case, well, success is that a game designer can
[00:35:57] produce what they are [00:36:00] imagining in a reasonable amount of time. And that’s really the kind of macro task. That’s a very hard thing to turn into a benchmark, and I think a lot of this is actually going to turn into people voting with their feet. Right? I mean, I guess that’s what’s happening at the large language model level, right? [00:36:23] When people are choosing to use GPT-5 or Gemini or Claude, individuals are trying out these different models and deciding, oh, I like the kind of answers that GPT-5 gives me, or no, I feel like I get more accurate detail from Claude, right? [00:36:43] Vibhu: It’s a lot of [00:36:43] Chris Manning: vibe check, a lot of people just using it. [00:36:45] It’s vibe checking, I realize that, but it’s actually whether people feel it’s giving them utility in what they want. Right. [00:36:52] Vibhu: And the interesting thing there is, a lot of people prefer the visual, right? This looks pretty, which is not the objective of what this is [00:37:00] for, right? If a game designer is working on something, they care about the game engine, right? [00:37:04] The state. It can look like whatever; you can fix that up later. Or you can have a really good game state and you can quickly edit it into 20 different versions, like, keep the state, [00:37:14] Chris Manning: right? [00:37:14] Vibhu: So [00:37:14] Chris Manning: that’s a really important distinction, and it speaks to Moonlake’s strength, right? So, yeah, great visuals are lovely to look at for a few seconds, but games are really all about the concept, the gameplay. [00:37:33] And a lot of the time that doesn’t actually even require great visuals. I mean, there are lots of very successful games which have relatively primitive visuals, and there are other games where people have spent millions producing photorealistic visuals, and the game sucks, right? 
So, keeping those two axes apart is really important in thinking about what’s important in a [00:38:00] world model for different uses. [00:38:02] swyx: This conversation is reminding me of some game review and fiction discussions I’ve had in my non-AI-related life. Some people might know Brandon Sanderson, who’s a very famous fiction author and is a big game reviewer. And he’s a big fan of video games where you change one thing about what you might normally assume about the world. [00:38:22] For example, Baba Is You, I don’t know if you might have come across that, where the rules change as you play the game. And also games where you can do things like reverse time selectively or change gravity selectively. And this also reminds me of other kinds of world models that are created by authors. [00:38:38] Ted Chiang is my typical example, where he’ll take the world that you know today, but change one thing about it, and then create a consistent world based on that. Which is a long-winded way for me to ask: is it easy to create alternative worlds that don’t exist, where you change one thing, and then let’s run a whole bunch of people through it to see if it works? [00:38:58] Chris Manning: My first answer will [00:39:00] be that that seems a lot easier and more conceivable to do using technology like Moonlake’s than with some of the other world models out there. Whether Sun can actually make it happen, I’ll let him give a second answer. [00:39:15] swyx: I guess for you, you’re constrained by the game engine tool, right? [00:39:18] At the end of the day, that’s the thought partner that you have. If it’s never allowed to reverse time, or if gravity only ever works one way, then, well, that’s it. 
But sometimes gravity might change, [00:39:33] Fan-yun Sun: but it’s a lot easier to change with code as opposed to a model that is learned primarily on data of [00:39:42] real-world and virtual worlds. For example, Genie is actually trained on a lot of real-world data and a lot of virtual gaming data, and maybe it’s easier to say, okay, I wanna change the visuals or the time period of the world, but you can’t change gravity, for [00:40:00] example. [00:40:00] Vibhu: I feel like you can to light bounds, right? Everything comes down to, code is a better way to execute it, but the models aren’t that diverse and creative, right? You can say, okay, make gravity slower. It can do that, but it’s limited to your representation of how you text it out, right? They’re only gonna do a few iterations, whereas programmatically, if there’s a game engine under the hood, you can kind of go wild, right? [00:40:22] So, I dunno, one of the limitations of most models is that they’re very overtrained to one style. Right. And extracting diversity is pretty difficult. At least that’s something we’ve seen. [00:40:35] Fan-yun Sun: I mean, are there examples you have in mind where, with existing models, it would be easier to do without using code? [00:40:43] Certain types of creative intent or state transitions? [00:40:47] swyx: Clipping. Other world models are very good at clipping through things. My legs clipping through a rock, because it’s just bad. [00:41:00] You would have to struggle very hard with your stuff to actually make that happen. [00:41:04] Which I think is maybe a topic that you actually prepared on, Gaussian splatting versus the other stuff. [00:41:09] Vibhu: Yeah. Yeah. It’s just, for those not super familiar, right? There’s Gaussian splatting, there is diffusion. Like what works, what scales up. 
I feel like in February when Sora 1 came out the blog post was literally titled like, [00:41:21] swyx: you bring it up. [00:41:22] You never know. [00:41:23] Vibhu: Video generation models as world simulators. It’s super bitter-lesson-pilled. Yeah, a lot of it is emergence, right? So, not to go through their blog post, basically their whole thing was, as you scale up, all this consistency, all this stuff just kind of solves itself. It’s a very simple premise, right? [00:41:41] They just scaled up diffusion. And from there, this is Feb 2024, it’s already been two years, which is basically five years in AI time. How much more do we need to just scale up, or do we hit a data cap? But I think we already talked about this a lot, right? This is back to the beginning discussion of what’s [00:42:00] appropriate for the time. [00:42:01] And that seems like your approach, right? [00:42:03] Fan-yun Sun: Yeah. The point I’m trying to make is that there are many, many different types of world simulators, and having a world simulator that can produce pixel coherency is very, very useful for games and marketing and all these things, but it’s not as useful as people think when it comes to causal reasoning, [00:42:25] when it comes to embodied AI. Yeah, this title is true. We’re not saying that it’s not a great world simulator, but in the blog that we wrote, the bet is more that there’s going to be a disproportionately large share of value in real-world and virtual tasks where high-resolution pixel fidelity is not needed. [00:42:47] Yes, video models have their value. [00:42:50] swyx: Yeah. This is at the absolute limit of my physics understanding, but one example that comes to mind is basically having to solve the equivalent of a three- [00:43:00] body problem in a deterministic way, whereas the video models just approximate it, which is good enough. Yeah. [00:43:08] Right. 
There’s some point at which your approach kind of runs into, like, you now have to simulate the world, please, thank you very much. And you’re trying to do that, but only to the extent that the game engine lets you, and game engines cannot do some things. [00:43:23] Fan-yun Sun: Yeah, no, I mean, I think the interesting or more technical question here actually is where you draw the boundary between [00:43:32] what’s handled with, let’s say, a diffusion prior and what’s handled with symbolic priors. [00:43:38] swyx: Yes. [00:43:38] Fan-yun Sun: Okay. [00:43:38] swyx: Okay. [00:43:39] Fan-yun Sun: Right. Let’s go there. Because this boundary can actually be fluid. I think maybe what you’re trying to get at is, okay, people are saying pixel prior for everything. But what we’re saying is, okay, there’s a boundary that we draw where we think it provides the most economic value for the domains and things that we care about today. [00:43:59] [00:44:00] And it’s something that we actually ask internally all the time: okay, given new equations that we learn, or new elements of the world that we learn, or maybe some other knowledge that we acquire in the process of developing the models, should we still be maintaining this line exactly as it is today? [00:44:22] Or should we move it a little bit left or a little bit right? Right. Sometimes we realize that, oh, maybe customers or folks want certain things that are better handled with preop pryor as opposed to a symbolic prior, then, [00:44:34] swyx: yeah. Your skin thing is an example of moving it right. [00:44:37] Yeah. [00:44:37] Or left. Yeah, [00:44:37] Fan-yun Sun: exactly. [00:44:38] swyx: I dunno what the left and right are. [00:44:39] Fan-yun Sun: Yeah, yeah, yeah. No, the model. [00:44:42] swyx: Yes. [00:44:42] Fan-yun Sun: Actually we have a few iterations of them. 
They’re actually at slightly different boundaries. [00:44:45] swyx: I know. You should do that. That’s a cool dimension to show. [00:44:49] Fan-yun Sun: Yeah. [00:44:50] swyx: Is quantum mechanics the diffusion prior of our world? [00:44:55] Right. It’s like, that’s the boundary of classical mechanics versus quantum. Right? Like, that’s it. At one [00:45:00] point God plays dice and at the other point he doesn’t. [00:45:02] Fan-yun Sun: I dunno if Chris, you wanna say it, but generally I feel like physics is better with symbolic priors. [00:45:08] Chris Manning: Even quantum physics. [00:45:09] Fan-yun Sun: Even quantum physics. [00:45:11] swyx: Yeah. This is starting to get into MLST territory, is what I call it, where he likes to get philosophical. We’re quite friendly. [00:45:18] Vibhu: I mean, we need to get, we need to get singularity. I heard some of that. [00:45:23] swyx: No, no, I think that is actually really helpful, and man, I just want you to productize this. Like, as a product guy, I’m just like, oh, also [00:45:32] Vibhu: a gamer, I [00:45:33] swyx: wanna, it’s like, as a researcher, it’s cool. [00:45:35] Like this is the theoretical, you have a very good, I don’t know, way of thinking about these things, but I just wanna see you express it. I do think fundamentally, when you open up new tools, like, okay, use human intent and incorporate it into how you render, [00:45:52] artists are gonna have to take like two to three years to figure out what to do with this. And you just don’t know. [00:45:57] Chris Manning: Right. But I think this [00:46:00] gives a much more approachable and controllable world for society, which is the beauty of NLP, and that will enable it to be adopted and used. [00:46:10] And we are very hopeful about that. Yeah, [00:46:13] Fan-yun Sun: yeah. Yeah. 
I mean, we are very focused actually on commercialization, in the sense that we do really believe in the data flywheel approach. Yeah. Where we put this in the hands of the creators and the users, and then they will teach us what capability our model should improve. [00:46:27] And that’s why we are actually, like, products and beta [00:46:31] swyx: Yeah. Focusing on gaming. What’s the adjacent thing to gaming? [00:46:34] Fan-yun Sun: Embodied AI is adjacent, basically. So maybe I’ll start with where we see the platform in three years. Yeah. Which is, okay, the users would tell us what they want to achieve. [00:46:45] The end goal could be, hey, I wanna make something to teach my kids the value of humility. Or it could be, hey, I wanna fine-tune my drones to be really good at rescue situations. It could be vacuum robots. I want to train [00:47:00] my manipulation or vacuum robot to be very robust to my office, right? [00:47:04] But it’s like, whatever it is, scenario robust to [00:47:06] swyx: my office [00:47:07] Fan-yun Sun: or navigate very robustly in my office. But then, whatever end goal you want, our world model will say, okay, given what you want to achieve, let me generate a distribution of environments such that I can train and evaluate whatever it is you want. [00:47:24] Yeah. Right. Maybe for the purpose of games, it’s just the end simulation and that’s the end product. For certain policies, it’s like, I can train it within these environments and then help you see where your policy is failing or not. Yeah. And then, so I think, [00:47:37] swyx: so in that case, much more of a training tool [00:47:40] than in other training [00:47:41] Vibhu: evaluation? Both. Right? [00:47:43] swyx: Sure. Same thing. [00:47:43] Fan-yun Sun: Yeah, same thing. 
I think it’s just this world model that allows people to train any policy that can act in any multimodal environment. [00:47:51] swyx: Would it be harder to reward hack? Is there an angle here where it is harder to reward hack? I’ll just put it generally, because I think that’s obviously a key [00:48:00] problem that a lot of people face when training agents in these environments, and I don’t know, can you solve it? [00:48:07] Chris Manning: I think not necessarily. To the extent that there’s a misspecified reward, it seems like it could be hacked in a more symbolic world or in a more pixel-based world. I dunno if Sun’s got any thoughts, but I don’t think that’s really being solved. [00:48:26] swyx: The other thing that comes to mind is you could just build a better Sora as a video generator model, right? [00:48:31] Because then you would move the diffusion side a bit further to the right, I think, if I got the directionality correct. And that’s it. [00:48:40] Vibhu: It’s better on domains, right? Like on consistency over now, or for sure it exists versus something doesn’t, right. [00:48:46] Chris Manning: So [00:48:46] swyx: yeah. Yeah. Is [00:48:49] Vibhu: is the question more like, like [00:48:51] swyx: I’m just riffing on, like, what can you build, you know, [00:48:54] with the stuff that you have? I do think that the mind of the academic does go immediately to training [00:49:00] and evaluation, but art tends to take unusual directions. Like you might end up, [00:49:06] Chris Manning: okay. Yeah. But the question is, can you use this piece of software to develop compelling gameplay? And I don’t think you can take Sora and produce compelling gameplay, right? [00:49:19] If you want to have a world that you can wander around in a bit, you are good. 
But what are your abilities to have gameplay mechanics implemented the way you’d like them to be, and to have things stay with the long-term history of your gameplay that influences future actions? I think there’s just nothing there for that. [00:49:39] swyx: Yeah, I do tend to agree. I’m just trying to sort of test the boundaries. I would also make the observation that as the AAA games industry has developed, the line between what is a movie and what is a game has blurred. And you do end up basically producing a two-hour movie as part of your game. [00:49:57] Fan-yun Sun: No, honestly, there are so many [00:50:00] applications in adjacent markets that our world model can go into. Yeah. It’s sort of fun to riff on. Although on the execution side, we need to stay focused on, okay, what are the capabilities we want to unlock over time? [00:50:11] And there’s a roadmap for that. But yeah, if we’re just riffing on the possibilities, I feel like they’re endless. Yeah, it’s like classic [00:50:18] swyx: and the embeddings for “possibility” and “endless,” in my mind, are very close. Yeah. I do wanna focus on one weird choice. I don’t know if it’s weird. [00:50:28] Maybe I’ve got something here. Audio, right? You could have just said no audio. And audio in my mind has a lot of recursion, whereas in video you can just do raycasting and that’s computationally much simpler. Audio just seems way harder. I don’t know if you wanna just comment on the special 3D audio [00:50:46] problem. Did you really have to do it? I guess you do to be immersive, but a lot of people do treat it as, well, you just stick a TTS model on top of [00:50:57] Vibhu: Well, there’s a lot more to game audio than [00:51:00] just speech. Right. It’s not just [00:51:01] swyx: TTS. Yeah. TTS, SFX, BGM, spatial. In my mind, echoes [00:51:06] Chris Manning: Yeah. [00:51:06] swyx: and reflections. 
[00:51:07] And I don’t even know what else. I don’t know what other problems there are in this space. [00:51:13] Fan-yun Sun: Yeah, I think this is more pointing to the benefits of using a game engine as a tool that’s available to the model, right? Because part of the spatial audio is from the code that is underlying the simulation. [00:51:32] And while we do give our model access to other types of audio models as tools, [00:51:39] swyx: none of them would be spatial, I think. [00:51:41] Fan-yun Sun: But that’s exactly sort of my point too. We’re giving our model an abstraction or a suite of tools such that it’s able to achieve that. And you can argue that spatial audio is an emergence out of the tools and abstractions that we provide to the agents. [00:51:59] And I think that’s the beauty of [00:52:00] this approach. There are a lot of things, kind of like how humans built technology, that are like Lego blocks that build on top of each other. And it’s the same thing here. There are gonna be things that just sort of emerge from being able to put these things together in combinatorially interesting ways, [00:52:14] Chris Manning: right? [00:52:15] So this integrated audio model exploits the understanding and semantics of the Moonlake world, right? Whereas in general for the gen AI video models, there’s no actual integration across to audio at all, right? Someone might stick some music or a soundscape or whatever else on top of their video, [00:52:44] so it’s not a silent video, but they’re in no way connected into a consistent world model. And there’s nothing that says, okay, an action is happening in the video, therefore there should be a sound [00:53:00] coming from this part of the visual field. [00:53:03] swyx: Yeah. [00:53:03] Vibhu: Is that different than Sora 2? Does it not have audio? 
[00:53:06] Not to say it’s not like [00:53:08] swyx: amazing [00:53:08] Vibhu: isn’t a spatial [00:53:09] swyx: audio. [00:53:09] Vibhu: It doesn’t, [00:53:10] swyx: no. I’ve played around with it enough. It just sounds like someone put an ElevenLabs voice on top of it and just tried to do the lip sync. [00:53:18] Vibhu: Oh, yeah. I’ve seen, okay, generate a dog at the beach, with reactions to a big wave, and move [00:53:23] swyx: around. [00:53:23] It’s definitely like, have the dog move away from the camera and see if the sound goes down. It doesn’t. ’Cause they don’t have spatial audio. [00:53:32] Fan-yun Sun: Our world model, the one we’re training, is basically working towards the goal of having a combined latent representation across all these different modalities, [00:53:42] right? Such that it can reason across these different modalities. So for example, if I close my eyes and you play a sound of a car skidding away from me, I can almost visually extrapolate that trajectory in my mind. And that’s the type of capability we want our model to be able to reason with, right? [00:53:59] And that’s the reason that [00:54:00] we’re taking this multimodal reasoning approach. We want this combined latent space that can [00:54:05] swyx: Yeah. Oh, you said latent space. We like that. Here we have to play the bell every time someone says latent space. No, you gotta train a Daredevil one, where it’s only audio, but you have to work out [00:54:15] where everything is. [00:54:19] Cool. I think that was about it for our Moonlake coverage. I do think that we have a couple of Chris Manning questions on IR and just any other sort of attention topics or NLP topics. [00:54:31] Vibhu: Okay. [00:54:31] swyx: Go ahead. 
[00:54:32] Chris Manning’s Journey: From NLP to World Models [00:54:32] Vibhu: Well, no, I mean, yeah, it’s just fun. We talked a bit about how you guys met, but you were basically the godfather of NLP per se, right? [00:54:39] You spent your whole career there, from early embeddings to early attention. You did attention for machine translation in 2015, everything. You had information retrieval, so RAG before RAG. We just wanna shout that out and admire a lot of that. Right? So what prompted the switch over to world models? [00:54:56] How’d all that come about? [00:54:58] Chris Manning: To some extent the answer [00:55:00] is the enthusiasms and creativity of students, but there’s a bit of a history there, right? So, yeah. Clearly most of my career has been doing stuff with language, and how I got into research was thinking, ah, this is just so amazing how humans can produce speech and understand each other in real time. [00:55:21] And somehow they manage to learn languages as little kids. How could this possibly happen? And so, yeah, starting off I was very focused on language, but as it got into the 2010s, I’d been working on question answering, and then I started to get interested in visual question answering. [00:55:42] And that was an area where it was very noticeable that the visual understanding was bad. Right. These were the days when it sort of seemed like there was almost no visual [00:56:00] understanding. You were just getting answers that came from priors. So, if you asked how many people are sitting at the table, it’d always answer two, regardless of how many people you could see in the picture. [00:56:11] And so it seemed like, oh, these models actually aren’t able to get semantic information out of images. And so I was interested in that problem and tried to work more on that. And so then that required
Knowing more about what’s happening in vision and how you can represent visual information. [00:56:34] And then things start, there started to be this revolution of, doing generative AI images. And then I had students that started looking at that before the era of Moon Lake. I was also working with Demi Gore, who founded pika. And so, and [00:56:50] swyx: Ian obviously [00:56:52] Chris Manning: with gans. Yeah. Though Ian was never my student, but yeah, Ian I was very aware for the, the whole decade there of Ian with Gans. [00:56:59] [00:57:00] Yeah. And I mean, Ian was a Stanford undergrad, but yeah, [00:57:03] Vibhu: richard des u.com, I believe he was your student. [00:57:06] Chris Manning: Yeah. Yeah. And there were, there were links across at that stage as well. So there were several papers in that era of doing, I mean, so Andre Cap was a, PhD student at the same time as Richard. [00:57:20] And so there was some joint language vision work in that era as well. It seems kind of ancient by modern standards, but yeah, we’re trying to go from sort of textural dependency graphs to visual scenes [00:57:32] Vibhu: at a time. The glove embeddings really took over a lot of. T-F-I-D-F, like one hot encoding, all that. [00:57:38] The early vision language models we saw were like lava style adapters, right? It’s, it’s technically still just embedding latent space. Let’s add image, let’s like mixed modality. So, and that, that’s one of the things you super put out there too, right? [00:57:51] swyx: Yeah. [00:57:51] Vibhu: Yeah. [00:57:52] swyx: Yeah. [00:57:52] Hiring, Closing & The Name “Moon Lake” [00:57:55] swyx: Well, thank you for all of that. Thank you for all advancing the worlds on, world modeling. [00:57:56] I honestly, do think that if people deeply understand everything we just [00:58:00] covered, they will see what’s coming. I think you guys have, made some, a really significant contribution here. What are you hiring for? What is the, what do people find? 
We, we agreed that the CTA was a hiring call. [00:58:10] Yeah. Don’t we have AGI? You don’t need, you don’t need engineers anymore, right? [00:58:14] Fan-yun Sun: Yeah. On the model side we are actually striving towards basically a self-improving system. But what that means is that we need people to set up the self-improving system. So more, more specifically people who have the intersection of knowledge within code generation and computer vision and graphics, right? [00:58:30] Yeah. That’s, that’s sort of the core research background that we look for within OTM and, and the majority of the team today do have like both backgrounds. [00:58:38] swyx: When you say computer vision and graphics, are they the same thing or is it computer vision one thing, graphics another thing? And how intertwined are they? [00:58:46] Chris Manning: They’re intertwined but different. [00:58:49] swyx: Yeah. [00:58:49] Chris Manning: And I think this relates to some of the themes that we’ve been talking about, that the more explicit underlying [00:59:00] world models that are being constructed inside Moon Lake really draw on the computer graphics tradition. And so it’s then combining that with the visual understanding of vision. [00:59:16] swyx: Got it. Yeah. All right. So if you’ve written a game engine, come talk to us, right? [00:59:21] Fan-yun Sun: Oh yeah, definitely. Definitely. But I do think that the line is blurred, like increasingly blurred these days where it’s like if you have a general understanding of computer vision and graphics, [00:59:31] swyx: I think for your standards it is, for me it feels like vision is, is. [00:59:35] I’ll leave that to the big labs. Graphics, I, I, I can get that, you would want to do that from more first principles, but vision, there’s so many vision models off the shelf that I can take, but probably not good enough for your [00:59:45] Fan-yun Sun: I see, I see. 
If, if you’re sort of like making that distinction then maybe we, we care a little bit more about having graphics [00:59:51] swyx: knowledge. [00:59:51] Yeah, exactly. [00:59:52] It could be like, sometimes a hiring call can be as simple as like, if you know the answer to blah, you should talk to me. Like the sort of core known hard [01:00:00] problem in, in your world. [01:00:01] Fan-yun Sun: Ah, I see. Yeah. In that case, if you, yeah, definitely. If you’ve written a game engine before, if you’ve RL’d a variety of coding models on different objectives, like [01:00:13] swyx: easy, [01:00:13] Many of those, yeah. [01:00:14] Fan-yun Sun: If you’ve done multimodal latent space alignment, I, I intentionally included latent [01:00:20] swyx: space. [01:00:20] Fan-yun Sun: Again, [01:00:21] swyx: our poor editor has a thing every time. Yeah. Latent space alignment. Honestly, is it that hard? [01:00:26] I, I, there’s some scripts out there that I’ve saved for the day I someday have to do it, but I don’t have to do it. [01:00:31] But it’s [01:00:32] Fan-yun Sun: done, I think. Yeah. There, there’s, there are versions of that that are done. But I, I think we are aligning audio, text, language and video. Yeah. Right. Like, and basically we have these world models that are able to act as agents to like act in these worlds and extract long horizon videos and encode that back to the model to sort of self-improve. [01:00:52] So it’s an insanely exciting, but also technically challenging problem. Yeah. So people who wanna do their life’s best work, then Moon Lake’s [01:01:00] the place. [01:01:01] Vibhu: How big are you guys? Where are you guys based? [01:01:02] Fan-yun Sun: We’re currently based in San Mateo, although we’re moving up to SF. We’re about 18 folks right now. [01:01:08] swyx: My ending question was gonna be why, what, what is the name? [01:01:10] What’s behind the name? [01:01:11] Vibhu: Yeah. [01:01:12] Fan-yun Sun: Oh, [01:01:14] Vibhu: Very cool. Graphics and design, by the way. 
[01:01:16] Fan-yun Sun: Actually at the, at the time when the, when the, when we started the company, we were thinking a lot about how do we make a company name that gives people the vibe of like, OpenAI, but for like, almost like Industrial Light & Magic vibes. [01:01:28] Wow. Because it’s like we care about creativity and using that as a funnel to solve AGI. So then we were, we, we brainstormed a lot around like Dreamworks, right? Like Industrial Light & Magic. And, so there’s a few, few basically, space of things that we feel like are very, very semantically close to the company’s identity. [01:01:47] swyx: Yeah. [01:01:48] Fan-yun Sun: And then it ended up being Moon Lake, partly because of the Dreamworks vibe, the Dreamworks, moon [01:01:54] swyx: Lake. [01:01:55] Fan-yun Sun: Exactly. Yep. So that was a little bit of that inspiration. And then the moon was sort of [01:02:00] like a, it basically was like about the reflection. The reflection part also implies the self-improvement loop. [01:02:07] Wow. That we sort of like, that we really believe in, and that’s the path towards multimodal general intelligence. So that’s, that’s that. I’ll leave that as I love a good [01:02:15] swyx: name. I love a good name. This is great. It’s a [01:02:16] Vibhu: very [01:02:17] swyx: good name. It’s very good. No, I’m glad I asked the question. I will also say, one of my favorite books or biographies ever is Creativity, Inc., [01:02:24] with Ed Catmull’s story about Pixar and how he was rejected as a Disney animation artist. So then he went into computing and brute forced his way back in. No, I love that story. Yeah. Disney. [01:02:37] Fan-yun Sun: Yeah. And Walt Disney is also like one of my favorite founders. He’s like, his, his story. Like at the time you’re like, okay, I’m gonna create this like [01:02:44] 
Like people can’t, don’t even have that technology to create it virtually, but they’re like, you know what, let’s just build it physically such that people can, [01:02:50] swyx: so he is the first world modeler. [01:02:52] Fan-yun Sun: No, I, I I tell people that like, theme parks are world models too. [01:02:56] swyx: Mm. Yeah. Yeah. Yeah. I mean, it’s a small world or it’s [01:03:00] a, like the Epcot center with all the little, replicas of the countries. [01:03:03] Yeah. Those are very interesting. Okay. Well thank you, we’ve covered, a huge amount. Thank you for your time and thank you for inspiring us. [01:03:10] Fan-yun Sun: Thank you [01:03:10] swyx: for having us. Thank you. It’s fun [01:03:11] Fan-yun Sun: chatting. Yeah. It’s been a good time. This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit www.latent.space/subscribe
Transcript
Introduction to World Models
0:00think this whole space is extremely difficult as things are emerging now and i mean it's not only for world models i think it's for everything including text-based models right because you know in the early days it seemed very easy to have good benchmarks because we could do things like question answering benchmarks but you know these days so much of what people are wanting to do is nothing like that right you're wanting to get some recommendations about which backpack would
0:31be best for you for your trip in europe next month it's not so easy to come up with a benchmark and
Message to Listeners
0:38it's the same problem with these world models before we get into today's episode i just have a small message for listeners thank you we will not be able to bring you the ai engineering science and entertainment content that you so clearly want if you didn't choose to also click in and tune into our content we've been approached by sponsors on an almost daily basis but fortunately enough of you actually subscribe to us to keep all this sustainable without ads and we want to keep it that way but i just have one favor to ask all of you the single most powerful completely free thing you can do is
1:13to click that subscribe button it's the only thing i'll ever ask of you and it means absolutely everything to me and my team that works so hard to bring the in space to you each and every week if you do it
Guest Introduction
1:23i promise you we'll never stop working to make the show even better now let's get into it okay we're back in the studio with moon lakes uh two leads i guess there's there's other founders as well but uh sun and chris manning welcome to the studio thanks thanks for having us you've got you guys have uh you know come burst onto the scene with a really refreshing new take of world models um i would just want to uh sort of i guess ask how you the two of you came together chris you're a
1:55legend in nlp and just ai in in general uh you're you're his grad student i guess actually my co-founder oh yeah i should give a lot of credit to my co-founder sharon yeah um she was she was actually working with professor fei-fei li and jodgen and then she ended up working with um ron and chris manning here and then so i got connected through to chris initially actually through my co-founder
What is Moon Lake
2:17what is moon lake what what is uh actually i'm also very curious about the name but like why going into world models so i was working a lot with actually nvidia research during my phd years on essentially generating interactive worlds to train reinforcement learning agents or embodied ai agents and then there's two observations one in academia and one in industry in industry like folks at nvidia are actually paying a lot of dollars to purchase these types of interactive worlds whether it's for the sake of evaluation or training the robots or policies or models and then
2:52in academia same thing is happening and more specifically when i was actually working with nvidia on the synthetic data foundation model training project we were actually generating a lot of synthetic data and showing that hey you can actually these synthetic data are actually as useful as real data when it comes to multimodal pre-training but then like i said there's a lot of dollars being paid out to like external vendors or like other folks to manually curate these types of data it was very clear to us that okay on our way to let's call it embodied general intelligence
World Models and Intelligence
3:22models need to learn the consequences behind their actions which means that they need interactive data and the demand for those types of data are growing exponentially but everybody's sort of thinking about it from a pure say video generation perspective or something else but we feel like the the true actually opportunity is actually building reasoning models that can do these things like how humans do these things today so that's a little bit on the genesis of moon lake and i think the reason i got into world models was partly a philosophical take of the on the world where i like
3:55you know believe in the simulation theory and stuff like that but on the other on the other hand it's really just like oh like there's an opportunity there that i feel like nobody's doing it the way i think should be done i can say a little bit about that yeah so the overall goal is the pursuit of
Pursuit of Artificial Intelligence
4:09artificial intelligence and you know most of my career has been doing that in the language space and that's been just extremely productive as we all know the story of the last few years i don't have to tell about how much we've achieved with large language models but although they're being extremely effective for ramping language and general intelligence it's clearly not the whole world there's this multimodal world of vision sound taste that you'd like to be dealing with more than just
4:43language and then the question is how to do it and despite you know a huge investment in the computer vision space right as the research field computer vision has been for decades far far larger than the language space actually i mean i think it's fair to say that you know vision understanding sort of stalled out right you got to object recognition and then progress just wasn't being made right if you
5:14look at any of these um vision language models it's the language that's doing 90% of the work and the vision barely works and so there's really an interesting research question as to why that is
Philosophical Differences
5:27and at heart um the ideas behind moon lake are an attempt to answer that believing that there can be a really rich connection between a more symbolic layer of abstracted understanding of visual domains which aren't in the mainstream vision models which are still trying to operate on the surface level of pixels i think one of your blog posts you put it as structure not scale is that a general thesis yeah well scale is good too yeah scale is good too lots of data is good as well nevertheless you want the
6:02structure yeah to be able to much more efficiently learn yeah the other thing i really liked also is you put out an example of what your kind of reasoning traces look like right which you would distill is the word that comes to mind i don't even think that's a good good description but it would involve for example geometry physics affordances symbolic logic perceptual mappings um and what what have you but like that that is the kind of example that involves let's call it spatial reasoning world model reasoning as compared to normal lm reasoning yeah but also like taking it a step back so how do you
Defining World Models
6:38guys define world models you know a lot of people see like okay you can do diffusion you can do video generation but uh you guys put out quite a few blog posts you put out a essay recently we can even pull it up about efficient world models um you have a pretty like structural definition here but for the general audience that don't super follow the space right what's what's the difference in what we see from like a video generation model to a world gen a simulator how do you kind of paint that yeah so i think this is actually a little bit subtle because you know people look at these amazing generative ai video models
7:13sora veo 3 genie one of these things and they think oh this is amazing this is sort of you know we've solved understanding the world because you can produce these generative ai videos but the reality is that although the visuals do look fantastic those visuals actually aren't accompanied by an understanding of the 3d world understanding how objects can move what the consequences of different
World Model Applications
7:46actions are and that's what's really needed for spatial intelligence so i mean a term we sometimes use is that you need action-conditioned world models that you only actually have a world model if you can predict given some action is taken what is going to change in the world because of it and in particular that becomes hard over longer time scales so if you're simply you know trying to predict the next video
8:17frame that's not so difficult but what you actually want to do is understand the consequences likely consequences of actions minutes into the future and to do that you actually need much more of an abstracted semantic model of the world yeah the question comes where you want to have more structure than is available in just predicting the next token um and typically well let's let's call it the experience that the last five years has been that that is just washed away by scale right
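A minimal sketch of the action-conditioned idea described above, in Python. Everything here is a hypothetical illustration rather than Moonlake's representation: the toy `BlockState`, the action names, and the `step` and `rollout` helpers are all assumptions. The point is only that the next state is a function of the current state and the action taken, so the same starting state can be rolled forward under different action sequences to compare consequences.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class BlockState:
    """Toy symbolic state (an assumption): a block's 1-D position and whether it is held."""
    position: int
    held: bool


def step(state: BlockState, action: str) -> BlockState:
    """Action-conditioned transition: the next state depends on the state AND the action."""
    if action == "grab":
        return BlockState(state.position, True)
    if action == "release":
        return BlockState(state.position, False)
    if action in ("left", "right") and state.held:
        delta = -1 if action == "left" else 1
        return BlockState(state.position + delta, True)
    # Moving without holding the block (or an unknown action) changes nothing.
    return state


def rollout(state: BlockState, actions: Iterable[str]) -> BlockState:
    """Predict the consequence of a whole plan by chaining single-step predictions."""
    for action in actions:
        state = step(state, action)
    return state


start = BlockState(position=0, held=False)
plan_a = ["right", "grab", "right", "right", "release"]
plan_b = ["right", "right", "right"]  # never grabs, so the block never moves
final_a = rollout(start, plan_a)
final_b = rollout(start, plan_b)
```

Because the state is a small symbolic record rather than a frame of pixels, a long plan costs one cheap function call per action, which is the kind of efficiency argument made in the conversation.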
8:51so what is the right middle ground here that uh you don't ignore the bitter lesson but also you can be more efficient than what we're doing today you know one possibility is look if we just collect masses and masses and masses and masses of video data this problem will be solved um under certain assumptions that could be true but there are sort of multiple avenues in which it could not be
9:27true the first is what's really essential is understanding the the consequences of actions producing an action conditioned world model and if you're simply um collecting observational video data which is the easy stuff to collect when you're sort of mining online videos you don't actually know the actions that are being taken to see how the video is changing and so if you're never collecting directly actions and
9:59you're having to try and infer them from what happened in the observed video that's not impossible but it's very hard and it's not really established that you can get that to work at any scale yet and so there's a lot of premium on collecting action-conditioned video data which is part of why there's been a lot of interest in using simulation so that you can be collecting data where you do know the actions which is in quite limited supply but there's also in the limit of as much data as you could possibly have you know
10:35maybe the problem is eventually solvable but even though we collect huge amounts of text data text data is always at a great level of abstraction right language is a human designed abstracted representation where there's meaning in each token and it's representing an abstraction of the world right as soon as you're describing someone as a professor and as soon as you're saying that they're condescending right you know these are very abstracted descriptions of the world is not at sort of what
11:12you're observing at pixel level and so to get to that kind of degree of abstraction starting from pixels is orders of magnitude of extra data and processing and so although you know we absolutely want to exploit get as much data as possible use the bitter lesson nevertheless if there are ways in which you can work with five orders of magnitude less data than people working purely from pixels you're going to be able to make a lot more progress a lot more quickly and that's the bet here and so you could just say
11:48that's only wanting to be able to you know do it more efficiently do it more quickly do it more cheaply but i think it's actually more than that i think one should be making the analogy to how human beings work at one level you know yes we have these high resolution eyes and we can look and see a scene like a video but all of the evidence from neuroscience and psychology is that most of what comes into
12:22people's eyes is never processed right that you're doing fairly fine processing of exactly what you're focusing on but you know as soon as it's away from that of yeah there's another guy over there that you're sort of only processing top down this very abstracted semantic description of the world around you and so you know that's what human beings are doing they're working with semantic abstractions and so i think it is just the right representation because we also have other goals we want to be
12:55able to do you know real-time worlds that means there's a limit to how much processing you can do and we want to do long-term planning and consistency and again that favors abstraction i mean i guess there was actually a recent blog post that came out from our friends at physical intelligence and you know they were sort of heading in the same direction they were saying oh my model yeah yeah to maintain a long-term memory of what's happening in the world so we can do longer term we're actually storing text
13:30of what is um you know been happening in the world right it's not such a successful strategy of trying to keep it all at a pixel level and yeah i mean you can see it in video models like that temporal consistency we're at a scale of train on you know all the video data we have we have it for maybe 30 seconds a few minutes that's not the same as a game state played for half an hour right i thought you guys break it down pretty well you have a you have a blog post about uh building multimodal worlds with
14:00an agent i don't know if you guys want to talk about this this is one of the things i read i thought yeah this is the thing i talked about with the reasoning chain yeah so there's like different phases to this it seems like it's more of an agent a scaffold uh very different approach than just you know type in a prompt and you you don't have the same consistency it also like for people that are listening you know i would highly recommend reading it it breaks down the problem in a different light right so like what do you need to consider when you're talking about video like world game models right how would what do you need to consider what are the factors what are the elements what's the state
14:33so i don't know if you guys have stuff to talk about for this one yeah um actually i wanted to add on a little bit yeah on our previous point which is just like i do feel like sometimes people confuse like oh like we're taking a method with abstraction that means they don't believe in the bitter lesson like like that's just false right like we are believers in the bitter lesson but then i feel like the question that we always discuss is like what is the right abstraction level today the analogy i like to make is like let's just say we can encode and decode represent all of images videos audio
15:07in bytes then the most bitter lesson approach is to train the next byte prediction model as opposed to the next token prediction model where it's just like okay it's natively multimodal because you just um but it's like well yeah like to to chris's point it's like the scale and compute you need to achieve that um um so that's why we always come back to like okay what is the most efficient way to do it and and reasoning models to to the point of this blog post is a showcase of like hey we're actually just like reasoning about the world and reasoning about the aspects of the world that that
15:38matter for me to learn what i want to learn from this world model um yeah it's like you're improving the encoder of whatever you're uh trying to model and like a better representation would just represent the important things in less space yeah which would just be more efficient yeah um so yeah i i fully agree that it is not um antagonistic to uh bitter lesson i do want to mention one more thing um is there any philosophical differences with the jepa stuff that uh yann lecun is working on i gotta go
16:09there are you you're you're mentioning like some latent abstraction i'm like okay fine let's let's talk about it right like it's an elephant in the room yeah there are philosophical differences um yann lecun is a dear friend of mine um but he has never appreciated the power of language in particular or symbolic representations in general yann is a very visual thinker he always wants to claim that
16:41he thinks visually and there are no words symbols or math in his head um maybe that's true of yann it's certainly not the way i think um but at any rate you know um the world according to yann is the basic stuff of the the world and of intelligence is visual and language is just this low bit rate communication mechanism between humans and it doesn't have much other utility and it's far inferior to the high bit rate
17:16video um that comes into your eyes and i think he's fundamentally missing a number of important things there right think of this evolutionary argument looking at animals right that the closest analogy is the things with chimps right so chimpanzees you know have fairly similar brains to human beings they have great vision systems they have great memory systems they've got you know better memory than
17:49we do of short-term memories they can plan they can build primitive tools but you know humans massively ahead in what we understand about the world what we can plan what we can build and essentially what took off for us was that humans managed to develop language and that gave a symbolic knowledge representation and
18:19reasoning level which just gave this sort of vaulting of what could be done with the intelligence in brains so the philosopher dan dennett refers to language as a cognitive tool and argues that you know humans unique among the creatures in the world have managed to build their own cognitive tools and language is the famous first example but other things like mathematics and programming languages are also cognitive tools they
18:52give you an ability to think in abstractions in extended causal reasoning chains and that allows you to do much more and we use that for spatial representation and intelligence and planning and gameplay as well so we believe and this is you know underlying the specific technologies that moon lake is making that symbolic representations are powerful and you want to use it in your understanding of the visual world
19:26when you want a causal understanding when you want to maintain long-term consistency and prediction and you know as i understand it that's just not in yann lecun's world view so i think that's the fundamental philosophical difference um then there's the specific model he's been advancing jepa i mean that's a reasonable research bet as a direction to head for building out a model of the visual world
19:57to my mind it's sort of one reasonable research bet it's not really established it's the best one that everyone should be following at least developed at scale at meta but it's not just vision right like i mean jepa is a you know just joint embedding prediction can be applied to anything really and and people have done it if the argument is that there is a latent representation or that is that is probably more suited to the task then why not let machines do it for us instead of pre-defining it at all and isn't something like a jepa shaped thing the right answer and if not why not so i think there's a part of
20:31jepa that's right which is you do want to have a joint embedding that gives you a consistent model of the world and yann's argument is you can never get that from autoregressive language models because they're sort of left to right churning out one token at a time i guess this is where we're um you know the research arguments of the field you know i'm not actually convinced that's right because although the
21:07token production is this autoregressive um process that's heading you know left to right i guess don't have to be left to right but anyway in sequence of tokens we could have right to left arabic um but um you know although that's true all of the weights of the model that are internal to the transformer they are a joint model of the model's understanding of the world and so i think you can think of the weights of the model as a form of joint representation and therefore it is plausible to
21:44think that that could be the basis of a world model which avoids um yann's objections i think i follow and obviously that will touch on what moon lake eventually ends up doing as well right like which it's hard to tell because you put out the end results but we don't know the inputs that go into it so it's it's like you know that that's that's something that we have to figure out over time yeah i mean i guess this kind of breaks down some of the outputs do you want to walk us through it
22:14yeah so this this really just walks us through the reasoning traces of like okay let's just say if we want to build the world in this context it's really just a game demo that that shows the variety of interactions that this world model can build and yeah it's really just a reasoning traces of like okay so you're prompted to create a bowling game like how did it achieve what you saw that level of causality interaction and consistency right um so yeah this is almost just like a an example of like a reasoning traces very detailed very very detailed you gotta like you don't even
22:47realize it right like when a video is generated what happens when a ball strikes a pin right so first like there's audio in that like audio triggers happen score increments uh the world changes like pins up to start dropping there's a timer that goes on um you know it's just like very similar to how now we're used to reasoning for language models there's a whole state of what happens so geometry physics uh all this stuff and then yeah there's kind of that single prompt so asset um physication all this stuff it's it's like a it's a nice view to see what's going on i think sun is also too polite to point out
23:21that uh both like google's genie uh demos as well as uh world labs's marble do not have interactive worlds uh that's the benefit of having a reasoning model right because you can you can say oh like maybe in this particular context i want to learn how to bowl and then you can say okay then what is it important when it comes to learning how to bowl okay maybe it's like i need to understand the the basic of like physics and i want to throw it over them i want to know that when i when it resets
23:52it's it's a new game so i know that yeah basically you know you know you know to pick up the ball you know the ball's gonna cause the pins to fall down you know that what's important to this particular bowling game is to score and you know that the score corresponds to the number of pins that fell down so it's just like if it's a model that sort of knows what it looks like knows what a bowling game looks like but doesn't actually allows you to practice over and over again and to understand that oh like what it takes to actually get a high score then it sort of doesn't actually allow you
24:26to learn what you set out to learn within the world model right and and i think this is really just one example of showing like the advantages of the approach that we're taking over most the let's call it the zeitgeist is today when people talk about clinical world models right so it sort of seems like the question to ask when there's a world model is can i not only just wander around the world and look at the beautiful graphics can i interact with the objects in the world and see the right consequences
25:02of actions and you also understand what the consequences would be if you do something right so it's not just like okay there's one thing if i pick it up something will happen but you know there's there's 50 options and i know i can expect i can infer what would happen if i do any of them right so very different when you can actually see it play around with it there's two cheeky elements of that i mean the the sort of i guess less ambitious one is um let's really establish for listeners uh why is this fundamentally different than writing unity code right like just creating a model to translate a prompt
25:37into unity code so there is an underlying physics engine yeah um in that sense there's some overlap with unity but the way we think about it is like a physics engine or tools or code are cognitive tools like borrowing chris's term right like tools that the model can employ as means to an end so today maybe you say okay in this particular context we care about physics we care about the long-term causal consequences then yes we employ a physics engine and then
26:08maybe tomorrow we say okay we're training let's just say drones where we only care about really fluid dynamics and the visual aspect of the world then yeah maybe the model actually doesn't have to use a physics engine or maybe it employs other types of representation or physics engine to achieve the task so yes writing code for unity is sort of similar to a tool that our model can employ but our goal is for the model to take a representation conditioned reasoning approach
26:41or process yeah internally yeah using these things as just like general tool calls right which i think is very interesting the other more ambitious one is some kind of recursive element where it becomes multiplayer right like here there's a single player element you're not modeling any other people involved and that is a whole other thing but in fact we can already do multiplayers oh yeah yeah okay i haven't seen any demos actually just like prompt our our model to say hey like configure it to multiplayer then it'll do like this you'll be able to configure multiplayer great
27:11persistency your database for you easy yeah so what are like some of the current limitations and where we're at so there's one approach of like okay scale up video predictors obviously there's data issues uh you know with approaches like this uh is it data constraints what are like the next steps is it real time like so there's one side of you know write an agent to write unity code but okay i want to be streaming a game real time i want to have characters being also like agentic but where where do we kind of see this scaling up right yeah there's definitely a data constraint like the more data the better
27:45this reasoning model can almost basically act as humans to like operate a variety of tools and software to build whatever is necessary and then there's a sort of fidelity constraint which we're actually solving with another model reverie which we can talk about later but it's like well it's not as easy to get to photorealism with the approach that we're taking um but we think there are better solutions to that which we can dive into later the one thing you know here is it's a diffusion model right so there's a few approaches uh diffusion gaussian splatting
28:20um yeah so reverie diffusion model you guys want to introduce yeah totally so within our world modeling framework we think there are two models that we train right like there's the multimodal reasoning model that we just talked about that essentially handles mainly the causality the persistency and logic determinism of the world and then reverie is our bet on saying okay like while that model um can take care of all these things that we just talked about
28:51its limitations compared to existing say video models is that it doesn't have as high of a pixel fidelity right out of the gate right and reverie is to say hey we can actually take whatever persistent representation that we generate with our multimodal reasoning model and learn to restyle it into photorealistic styles or arbitrary styles you want so this model is almost to say hey i'm going to respect the persistency and interactivity of the world that you created but my only job is to make
29:24sure that its pixel distribution is close to what we want yeah yeah for example you cap the kl divergence oh where no no no i mean this is a classic like um how you don't stray too far from the source material as you cap the kl which is oh yeah yeah i mean and the difference is and i mean sun was pointing at this where it's sort of saying it's in one way a more difficult path but a better path that you know
29:54typically the diffusion models are producing the whole scene and it looks lovely but there isn't spatial understanding behind it which is allowing for the real-time graphics gameplay the spatial intelligence understanding the consequences of worlds where this is taking a path where it is assuming an abstracted semantic model of the world the world state and then the diffusion model is then being
30:25used on top of that to produce the high quality graphics is there an intended practical or business use for this or is it like a like a demonstration of capabilities we actually believe that this is going to be the next paradigm of rendering so it's going to replace how rasterizer is it's going to replace dlss today because it not only has these pixel prior that's learned from the world such that you can literally play any game in photorealistic styles which is a lot of people's desire when they do gta right like um all the mods all the people adding perfect lighting and all this so skins for
30:57worlds let's call it skins you can call it skin you can call it customization you can play it how you want right yeah exactly and i think another thing that we really pointed out specifically in this blog is the programmability of it right so what this means is that this renderer well historically the renderer is always a derivative of the game state right you're saying okay here's the game state i'm rendering out a frame but here i'm saying actually this renderer can be part of the gameplay loop i can say something along the lines of upon getting 10
31:29apples i'm gonna my weapon of choice my bullet's gonna turn into apples and that's that's possible because we can say we can basically dynamically have certain game state trigger the the preconditions to the renderer such that the rendering is now part of the game loop too one thing is to just say okay it's it's it's the appearance but the second thing is also to say there's these novel interactions that are possible because this renderer now has actually priors of the world it is up to the artist
32:00to figure out what to do with it it is up to the creators yeah yeah and i also think that's actually another big argument that we're making and the reason that we're taking the bet we're making is that a lot of the times whether it's for embodied ai or gaming like you want a layer where humans can inject their intentions right so for example let's just say in the context of gaming it's obviously like my creative intent but maybe in the context of embodied ai it's like oh like i take this foundational policy and i want to actually fine tune it to deploy in my house so you want to almost say have a layer where humans can say oh here's the distribution of things i want to create
32:36to achieve my goal and i think 3d graphics as it as it is today is basically the layer for people to say hey what do i care about in this world and it allows um basically human intent to be expressed in these worlds much more explicitly and distributionally as opposed to just saying hey i'm going to generate like arbitrary and it's like just prompts you know it's one of those things where like i think you're going to build up a series of models right this is just one of this is probably like the highest utility
33:07or heaviest frequency one i don't know what to call this where like yeah you can immediately drop this in on any game and you don't need anything else that that you guys do but um i could see i could see that i think the the human intent is something that people are not even used to because we're so used to static worlds or um you know worlds that just don't react or i don't know it's it you're kind of blowing my mind right now with like well i wonder if you've talked to people at gdc and what are they going to do with it yeah now the stance that we take on this front is like we're not going to be more
33:41creative than our users ship it out yeah um but we want to make sure that we're building things in a way that really allows them to express their intent the thing that you said about here's the distribution that i want i think text may be too low of a bandwidth to to really demonstrate because i you know that i'm probably just going to want to drop in a bunch of reference assets and then you can figure it out from there you probably want to do a mixture of both right like you throw in a few images i wanted this style i wanted to look like this it's a mixture right i think it's a mixture i mean
34:15yeah i mean there's clearly a visual component of this and it's not that you know everything can be text because of course you want to give a visual look but there's also a massive amount of giving the overall picture of the look of the world and the behavior of things that you can express in a few words of text and be very time consuming and difficult to do by a visual means so i think
34:47yeah you want a combination of both so one question i kind of have is how do we go about evaluating
Evaluating World Models
34:52world models so like there's many axes right one is like okay i have preferences how well do we adhere to prompts one is the simulation one is like do things is there core logic that's broken so coming from we know how to evaluate diffusion there's fidelity there's stuff like that but what are some of the challenges that most people probably aren't thinking about yeah i think this is like a great question and probably one of the hardest questions in world models because like i think it always comes back to what are you building this world model for and depending on your end goal and purpose the
35:24evaluation should differ so in the context of games then the most direct way of measuring is how much time are people actually spending in these worlds that you create and if your goal is to say for example in the context that we just talked about like deploying an embodied agent then your end metric is then okay after training in these worlds that you generate how robust is it when you actually deploy to the target environment but then you know it's hard to measure these end metrics so today people have like these proxy metrics that i call that basically try to measure
36:00what we really care about which is the end metrics but then frankly it's different for every use case um yeah which seems like quite a challenge right like in in language models or video models image models your benchmarks are proxies right people aren't actually asking instruction following tool use questions they're proxies of how well it will do downstream but for this so like you know should should teams should companies have their own individual benchmarks outside of games if you think of stuff like okay video production movies stuff like that that also want to use world models should should
36:33they sort of internalize like their own proxy is this something you guys do where does that kind of yeah i think this whole space is extremely difficult as things are emerging now and i mean it's not only for world models i think it's for everything including text-based models right because you know in the early days it seemed very easy to have good benchmarks because we could do things like question answering benchmarks and could you answer the question based on these documents and the various other kinds of
37:06you know do pieces of logical reasoning or math but again these are sort of and there are sort of visual equivalents of things like object recognition right for these small component tasks but you know these days so much of what people are wanting to do also with language models is nothing like that right you're wanting to have an interaction with the language model and get some recommendations about which backpack would be best for you for your trip in europe next month and it's not the same kind of thing right
37:41and it's not so easy to come up with a benchmark as to does this large language model give you an effective interaction for guiding you in a good way for shopping right so and it's the same problem with these world models so if we take the game design case well success is that a game designer can produce what they are imagining in a reasonable amount of time and that's really the kind of
38:19macro task but you know that's a very hard thing to turn into a benchmark and i think a lot of this is actually going to turn into people voting with their feet right i mean i guess that's what's happening you know at the large language model level right when people are choosing to use you know gpt-5 or gemini or claude you know individuals are trying out these different models and deciding
38:49oh i like the kind of answers that gpt5 gives me or no i feel like i get more accurate detail from claude right it's a lot of it's sort of like checking i realize that but it's actually whether people feel it's giving them utility and what they want right and the the interesting thing there is like a lot of people prefer the visual right this looks pretty which is not the objective of what this is for right it's if a game designer is working on something they care about the game engine state it's it can look
39:22whatever you can fix that up later or you can have a really good game state and you can quickly edit it to 20 different versions like keep state right so that's a really important distinction um and for speaking to moonlake's strength right so yeah i mean you know great visuals are lovely to look at for a few seconds but games are really all about the concept the gameplay and you know a lot of the time
39:54that doesn't actually even require great visuals i mean there are just lots of very successful games which have relatively primitive visuals and there are other games where people have spent millions producing photorealistic visuals and the game sucks right so keeping those two axes apart is really important and thinking about what's important in a world model for different uses this conversation is reminding me of some game review and fiction discussions i've um had in my sort of non-ai related
40:30life uh some people might know brandon sanderson who's a very famous uh fiction author uh and is a big game reviewer and he's a big fan of video games where you change one thing about what you might assume about the world for example baba is you i don't know if you might have come across that where like the rules change as you play the game and also like where you know you can do things like reverse time selectively or like change gravity selectively and i think this also reminds me of other kinds of world models that are created by
41:02authors where ted chiang is my typical example where he will take the world that you know today but change one thing about it and then create a consistent world based on that uh which is a long-winded way for me to say it's easy to create alternative worlds that don't exist but you change one thing and then let's run a whole bunch of people through it to see if it works my first answer will be that seems a lot easier and more conceivable to do using technology like moonlake's than with some of the other world models out there um where sun can actually
41:37make it happen i'll let him give the second answer if i guess for you you're constrained by the game engine tool right like at the end of the day that's the that's the thought um partner that you have if i ask for something where like if it never is allowed to reverse time or if gravity only ever works one way then well that's it but sometimes gravity might change but it's a lot easier to change with code as opposed to a model that is learned primarily on data of real world and virtual
42:09worlds that are i guess like for example genie like they're actually training on a lot of real world data and a lot of virtual gaming data and it's hard to say well maybe it's easy to say okay i want to change the visuals and like the time period of the world but you can't change gravity for example i feel like you can to light bounds right everything comes down to like code is a better way to execute it but the models aren't that diverse and creative right you can say okay make gravity slower it can do that but it's limited to your representation of how you text it out right like they're only going
42:42to do a few iterations whereas programmatically you know if there's a game engine under the hood you can you can kind of go wild right so one of the i don't know one of the limitations of most models is that they're very over trained to one style right and extracting diversity is pretty difficult at least that's something we've seen i mean are there examples you have in mind where existing models you know like it would be easier to do that's not using code like certain types of creative intent or
43:12like uh state transitions clipping uh other models other world models are very good at clipping through things clipping my legs clipping through a rock because because it's you know it's just it's just bad
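A minimal Python sketch of the point made above, that gravity is easier to change in code and that an engine-backed world does not clip through solids. Everything here is hypothetical and for illustration only: `Body`, `step`, `simulate`, the one-dimensional setup, and the constants are assumptions, not Moonlake's implementation or any real engine's API. Gravity is just a parameter, and a hard floor rule stands in for collision handling.

```python
from dataclasses import dataclass


@dataclass
class Body:
    y: float   # height above the ground plane, in metres
    vy: float  # vertical velocity, in metres per second


def step(body: Body, gravity: float, dt: float, floor_y: float = 0.0) -> Body:
    """Advance one tick with semi-implicit Euler, then resolve floor contact."""
    vy = body.vy + gravity * dt
    y = body.y + vy * dt
    if y < floor_y:
        # Symbolic rule: a solid body never ends a tick below the floor.
        y, vy = floor_y, 0.0
    return Body(y, vy)


def simulate(gravity: float, ticks: int = 200, dt: float = 0.01) -> list[float]:
    """Drop a body from 1 m and record its height after every tick."""
    body = Body(y=1.0, vy=0.0)
    heights = []
    for _ in range(ticks):
        body = step(body, gravity, dt)
        heights.append(body.y)
    return heights


# Same world, two different rules: changing gravity is a one-argument edit.
earth = simulate(gravity=-9.81)
reversed_gravity = simulate(gravity=+9.81)
```

Flipping the sign of `gravity` changes how the world behaves without retraining anything, while the floor rule holds in both runs.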
43:27like yeah you would have to struggle very hard with your stuff to actually make that happen uh which i think is it may be a topic that you actually prepared on uh gaussian splatting versus uh the other stuff yeah yeah it's just for those not super familiar right there's gaussian splatting there is diffusion like what works what scales up i feel like in february when sora one came out the blog post was literally titled like we bring it up every day you know video generation models as world simulators uh it's super bitter lesson pilled yeah a lot of it is emergence
44:02right so uh not to go through their blog post basically their whole thing was as you scale up all this consistency all this stuff just kind of solves it's a very simple premise right they just scaled up diffusion and from there you know this is this is feb 2024 how much can we it's already been two years which is basically five years you know how much more in ai time do we need to just scale up or or do we hit a data cap but i think we already talked about this a lot right like this is back to the beginning discussion of what's appropriate for the time
44:33and that seems like your approach right yeah the point i'm trying to make is that there are very many many different types of world simulators and like having a world simulator that can produce pixel coherency is very very useful for games and you know marketing and all these things but it's not as useful as people think when it comes to causal reasoning when it comes to embodied ai and yeah like it this this title is true like we're not saying that it's it's like you know not a great world simulator but
45:06actually in the blog that we wrote the bet is more so that there's going to be a disproportionately large share of value in real world tasks or in virtual tasks where high resolution pixel fidelity is not needed and yes video models have their values yeah this is at the absolute limit of my physics understanding but one example that comes to mind is basically having to solve like basically the equivalent of a three-body problem in a deterministic world whereas the video models would just approximate it good enough
45:39yeah right like there's there's some point at which your approach kind of runs into like the well you now have to simulate the world please thank you very much and like you're trying to do that but only to the extent that the game engine lets you and like game engines cannot do some things yeah no i mean i i think the the interesting or more technical question here actually is where do you draw the boundary between what's handled with let's say diffusion prior and where when what's
46:11handled with symbolic priors yes okay right because like this boundary can actually be fluid like i think maybe what you're trying to get at is like okay people are saying pixel prior everything but what we're saying is okay there's a boundary that we draw where this is where we think provides the most economical value for the domains and things that we care about today and i actually do think and it's something that we do internally all the time which is like okay given new equations that we learn or new
46:45elements of the world and that we we learn or maybe some other knowledge that we acquire in the process developing the models should we still be maintaining this line exactly as it is today or should we move it a little bit left or a little bit right right like sometimes that we realize that oh like maybe customers or folks like want certain things that are better handled with pixel prior as opposed to um symbolic prior yeah your skin thing is it is an example moving it right yeah um or left yeah i don't know what the left right is yeah yeah no the the the the reverie model yes actually we have a few
47:20iterations of them they're actually at slightly different i know uh you should you should do that that's a cool dimension to show yeah is quantum mechanics the diffusion prior of our world
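A small Python sketch of the boundary discussed above, with deterministic game logic on the symbolic side and only a bundle of conditioning handed to a learned renderer on the pixel side. It reuses the apples example from earlier in the conversation. All names (`GameState`, `symbolic_step`, `render_conditions`) and the threshold handling are hypothetical, and the renderer itself is deliberately left out; this is not how Reverie is implemented.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GameState:
    apples: int = 0
    projectile: str = "bullet"


def symbolic_step(state: GameState, action: str) -> GameState:
    """Deterministic game logic: the symbolic side of the boundary."""
    apples = state.apples + (1 if action == "collect_apple" else 0)
    # Game rule from the conversation: at 10 apples, bullets become apples.
    projectile = "apple" if apples >= 10 else state.projectile
    return GameState(apples=apples, projectile=projectile)


def render_conditions(state: GameState, style: str) -> dict:
    """Conditioning passed to a learned renderer (not implemented here)."""
    return {"style": style, "projectile_appearance": state.projectile}


state = GameState()
for _ in range(10):
    state = symbolic_step(state, "collect_apple")
conditions = render_conditions(state, style="photoreal")
```

Moving the boundary would mean shifting responsibilities between `symbolic_step` and whatever consumes `conditions`.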
47:33uh right it's like that's the boundary of classical mechanics versus quantum right like that's it at one point god plays dice and the other point doesn't i don't know i don't know of course you want to say but i think i think generally i feel like physics is better with symbolic priors um even quantum physics even quantum physics yeah this starts getting into the um mlst territory is what i call it where uh he likes to get philosophical uh we're quite friendly i mean we need to get we need to get singularity i heard some of that no no i think that is actually really helpful and uh man i just want you
48:09to productize this like as a product guy i'm just like well as a gamer you know a researcher you know like it's cool like this this is a theory theoretical like you have a very good i don't know like the way of thinking about these things but i just want to see you like you know express it i do think like you're fundamentally things when you leave open new tools like okay use use human intent to incorporate it into how you render well artists are going to have to take like two to three years to figure out what to do with this and you just don't know but i think you know this is um gives a much more
48:45approachable and controllable world which is the beauty of uh nlp that will enable it to be adopted and used and we're very hopeful about that yeah yeah i mean we are we are very focused actually on commercialization in the sense that like we do we do really believe in the data flywheel app approach yeah where um we put this in the hands of the creators and the users and then they will teach us when what capability our model should improve and that's why we are we are actually you know like products in
49:15beta yeah focusing on gaming what's like the adjacent thing to gaming embodied ai basically so we can we can i'll maybe start with where we see the platform in three years which is like okay the users would tell us what they want to achieve the end goal could be hey just i want to make something to teach my kids the value of humility um or it could be hey i want to fine-tune my drones to be really good at rescue situations it could be vacuum robots i want to like train my manipulation or like
49:47vacuum robot to be very robust to my office right but it's like whatever it is in my office like navigate very robustly with in my office but then it's like whatever end goal that you want our world model will say okay given what you want to achieve let me generate a distribution of environments such that i can train and evaluate whatever it is you want right maybe for the purpose of games it's just the end simulation and that's the end product for certain policies it's like i can
50:17train it within these environments and then help you see where your policy is failing or not and then you know so i think so in that case much more of a training tool than in other evaluation both right sure same thing yeah i think it's just this world model that allows people to train any policy that can act in any multiple environments would it be harder to reward hack is there an angle here where it is harder to reward hack like it's just i'll just put it generally because i think that's a that's obviously a key problem that a lot of people face when in training agents in these environments and
50:52i don't know can you solve it i think not necessarily i mean to the extent that there's a misspecified reward it seems like it could be hacked in a more symbolic world or in a more pixel based world um i don't know if sun's got any thoughts but i don't think that's really being solved the other thing that comes to mind is just you could just build a better sora as a video generation model right because then you would move the diffusion uh side a bit further to
51:24the right i think if i got the directionality correct um and that's it it's better on domains right like on consistency over an hour for sure it exists versus something doesn't right so yeah
51:37there's a question more like like i'm just riffing on like how do you what can you build you know with the stuff that you have i do think that the mind of the academic does go immediately to training and evaluation but like art tends to take unusual directions like you might end up okay yeah but the question is can you use this piece of software to develop compelling gameplay and i don't think you can take sora and produce compelling gameplay right if you want to have a world that you can wander around in a
52:11a bit you're good but what are your abilities to have gameplay mechanics implemented the way you'd like them to be and to have things stay you know with the long-term history of your gameplay that influences future actions i think there's just nothing there for that yeah i do tend to agree i'm just trying to sort of test the boundaries i would also make the observation that as triple a games industry has developed the line between what is a movie and what is a game has blurred um and you you you do end
52:42up basically producing a two-hour movie as part of your game no honestly there's so many actually applications in adjacent markets that our world model can go into yeah but yeah it's sort of fun to riff on although on the execution side we sort of we need to stay focused with like okay what are the capabilities we want to unlock over time and there's a roadmap for that but yeah if we're just riffing on sort of like the possibilities i feel like whether it's endless yeah classic and then the embedding for a possibility and endless in my mind is very close yeah i do want to uh focus on one like weird
53:18choice i don't know if it's weird maybe i'm i got something here uh audio right you could have just said no audio and audio in my mind has a lot of recursion whereas in in video you can just do ray casting and that's much computationally much simpler audio just seems way harder i don't know if you want to just comment on just the spatial 3d audio problem did you really have to do it i guess you do to be immersive but like a lot of people do treat it as like well you just stick a uh tts model on top of well there's a lot more to game audio than just speech right it's not
53:53just yeah tts sfx bgm spatial in my mind echoes yeah and reflections and i don't even know what else i don't know i don't know what other problems in this space yeah i think this point is sort of more pointing to the benefits of using a game engine as a tool that's available to the model right because like part of the spatial audio is from the code that is underlying the
54:25simulation and while we do give our model access to other types of audio models as tools none of them would be spatial i think right but that's exactly sort of my point too we're giving our model an abstraction or a suite of tools such that it's able to achieve that and you can argue that sort of spatial is like an emergence out of the tools and abstraction that we provide to the agents and i think that's the beauty of this approach is like there's a lot of things
54:57kind of like how humanity's built technology and they're like lego blocks that build on top of each other and it's the same thing here like there's going to be things that just sort of emerge from being able to put these things together in like combinatorially interesting ways right this integrated audio model exploits the understanding and semantics of the moonlake world right and whereas in general for the gen ai video models there's no actual integration across to audio at all right that someone
55:32might stick some music or stick a soundscape or whatever else on top of their video so it's not a silent video but they're in no way connected into a consistent world model and there's nothing that's okay an action is happening in the video therefore there should be a sound that's coming from this part of the visual field yeah is that different than sora 2 does it not have audio not to say it's not like there's no spatial audio it doesn't no i've played around with it enough it just sounds like
56:09someone put an elevenlabs voice on top of it and just tried to do the lip sync yeah i've seen okay generate a dog at the beach and reactions to big wave and move around yeah it's definitely like so have the dog move away from camera and see if the sound goes down or it doesn't right because they don't have spatial audio so we do want to basically like our model like the one we're training is basically towards the goal of having a combined latent representation across all these different modalities right such that you can like reason across these different modalities um so for example
56:42if i close my eyes and like you play a sound of like cars skidding away from me i almost can like visually extrapolate that trajectory in my mind and i think that that type of capability we want our model to be able to reason right and that's the reason that we're sort of taking this multi-modal reasoning approach it's like we want this combined latent space that can yeah oh you said latent space we like that here we have to play the bell every time that someone says latent space uh no you got to train a daredevil one where you it's only audio but you have to work out where everything is cool i think that was uh that's about it for our moonlake coverage uh i do think that
57:19we have like a couple of uh chris manning questions on ir and uh just any any other sort of attention topics or nlp topics okay i don't know i mean yeah it's just fun uh you know we talked a bit about how you guys met but you basically you you are like the godfather of nlp per se right you spent the whole career from early embeddings early early attention you did 2015 attention for machine translation everything uh you you had information retrieval so rag before rag you know we just want to shout
57:50that out and admire a lot of that right so what prompted the switch over to world models how how'd all that come about to some extent it is um the enthusiasms and creativity of students but there's a bit of a history there right so yeah so clearly most of my career has been doing stuff with language and you know how i got into research was thinking ah this is just so amazing how humans can produce speech and understand each other in real time and somehow they managed to learn languages when
58:23they're kids how could this possibly happen and so yeah starting off i was very focused on language but you know as it sort of got into the 2010s i started you know going i'd been working on question answering and then i started to get um interest in visual question answering and that was an area where it was very noticeable that the visual understanding was bad right you know these were the days when
58:58like it sort of seemed like there's almost no visual understanding you're just getting answers that came from priors so you know if you've asked how many people are sitting at the table it'd always answer two regardless of how many how many people you could see in the picture and you know so it seemed like ah these models actually aren't able to get semantic information out of images and so i was interested in that problem and tried to work more on that and so then that required knowing more about
59:32what's happening in vision and how you can represent visual information and then things start you know there started to be this revolution of um doing generative AI images and then i had students that started looking at that before the era of Moonlake i was also working with Demi Guo who founded Pika um and so and Ian obviously with GANs yeah though Ian was never my student but yeah i i was very aware for the the whole decade there of Ian with GANs yeah and i mean Ian was a Stanford undergrad but yeah Richard
1:00:08Socher of You.com i believe he was your student um yeah and you know there were there were links across at that stage as well so i mean you know there were several papers in that era of doing i mean so Andrej Karpathy was a PhD student at the same time as Richard and so there was some joint language vision work in that era as well you know it seems kind of ancient by modern standards but yeah we're trying to go from sort of textual dependency graphs to visual scenes at a time the GloVe embeddings really took over
1:00:42a lot of TF-IDF like one hot encoding all that the early vision language models we saw were like LLaVA-style adapters right it's it's technically still just embedding latent space let's add image it's like mixed modality so and that's that's one of the things you super put out there too right yeah yeah yeah well thank you for all of that thank you for advancing the world on uh world modeling uh i honestly uh do think that if people deeply understand everything we just covered they will see what's coming and i think you guys have you know made some it's a really significant
1:01:13contribution here what are you hiring for you know what is the what do people find you know we agreed that the CTA was a hiring call yeah i mean don't we have AGI you don't need you don't need engineers anymore right yeah on the model side we we are actually striving towards basically a self-improving system but what that means is that we need people to set up the self-improving system um more specifically people who have the intersection of knowledge within code generation and computer vision and graphics right yeah that's that's sort of the core research background that
1:01:45we look for within our team and the majority of the team today do have like both backgrounds when you say computer vision and graphics are they the same thing or is it computer vision one thing graphics another thing and how intertwined are they they're intertwined but different yeah and i think you know this relates to some of the themes that we've been talking about that the more explicit underlying world models that are being constructed inside Moonlake really draw on
1:02:21the computer graphics tradition and so it's then combining that with the visual understanding of vision got it yeah all right so if you've written a game engine come talk to us right oh yeah definitely but i do think that the line is blurred like increasingly blurred these days where it's like if you have a general understanding of vision and graphics i think for your standards it is uh for me it feels like vision is is you know i'll leave that to the big labs graphics i i can get that you know
1:02:51you would want to do that for more first principles but vision there's so many vision models off the shelf that i can take but probably not good enough for you i see i see if you're sort of like making that distinction then then maybe we we care a little bit more about having graphics knowledge um it could be like you know sometimes a hiring call can be as simple as like if you know the answer to blah you should talk to me you know like the the the sort of core known hard problem in uh in your world uh i see yeah in that case if you yeah definitely if you've written a game engine before if you've
1:03:24RL'd a variety of coding models on different objectives like easy um many of those yeah if you've done multimodal latent space alignment i intentionally included a poor editor has to edit thing every time uh yeah latent space alignment honestly is it that hard well i there's some scripts out there that i've saved for the day i someday someday have to do it but i don't have to do it but it's done i think yeah there's there's versions of that that are done but i i think we are aligning audio text language
1:03:58and video right like and basically we have these world models that are able to act as agents to like act in these worlds and extract long horizon videos and encoding that back to the model to sort of self-improve so it's an insanely exciting but also technically challenging problem so people who want to do their life's best work you know that makes a place how big are you guys where are you guys based we're currently based in San Mateo although we're moving up to SF um we're about 18 folks right now my ending question was going to be why what is the name what's my name oh um very cool graphics and design
1:04:34by the way actually at the at the time when the when the when we started the company we were thinking a lot about how do we make a company name that gives people the vibe of like OpenAI but for like almost like Industrial Light & Magic vibes because it's like we care about creativity and using that as a funnel to solve AGI so then we were we brainstormed a lot around like DreamWorks right like Industrial Light & Magic and um so there's a few few basically uh space of things that we feel like are very very
1:05:04semantically close to the company's identity yeah and then it ended up being Moonlake partly because of the DreamWorks vibe you know the DreamWorks uh moon lake exactly yeah um so that was a little bit of that inspiration and then the moon was sort of like a it basically was like about the reflection the reflection part also implies the self-improvement loop wow we sort of like really believed in and that's the path towards multimodal general intelligence so that's that's that's that i'll
1:05:35leave it i love a good name i love a good name this is great it's a very good name it's very good lore i'm glad i asked the question i will also say you know one of my favorite story uh books or biographies ever is uh Creativity, Inc. with Ed Catmull's story about Pixar and how he you know was rejected as a Disney animation artist so then he went into computing and brute forced his way into back into Disney yeah and Walt Disney is also like one of my favorite founders he's like his story like at the time you're like okay i'm gonna create this like immersive park like people can't don't even have
1:06:09that technology to create it virtually but like you know what let's just build it very physically such that people can so he's the first world modeler no i i tell people that like theme parks are world models too yeah yeah yeah i mean you know It's a Small World or it's uh like the Epcot center with all the little um replicas of the countries yeah those are very interesting um okay well thank you uh we've covered uh you know a huge amount thank you for your time and thank you for inspiring us thank you for having us fun chatting yeah it's been a good time