Steadcast
Latent Space

🔬 Training Transformers to solve 95% failure rate of Cancer Trials — Ron Alfa & Daniel Bear, Noetik

April 20, 2026 · 1h 25m · 15,009 words

Show notes

Today, we explain this piece of “clickbait” from our guest! TL;DR: 95% of cancer treatments fail to pass clinical trials, but it may be a matching problem. If we better understood which patients have which tumors, and which tumors will respond to which treatments, success rates could improve dramatically and millions of lives could be saved, with the treatments we ALREADY have. See our full episode dropping today: Why Big Pharma is licensing AI Models

Tolstoy famously wrote, ‘All healthy cells are alike; each cancer cell is unhappy in its own way.’ Or something like that.

Cancer might be the most misunderstood disease out there. It’s not one disease, it’s a family of diseases: hundreds, maybe thousands, of unique diseases, each with its own underlying biology. With this lens, saying you’ll “cure cancer” is like saying you’ll solve Legos. We keep hearing AI will cure cancer, but sadly it may not be so easy.

Today’s guests, Ron Alfa and Daniel Bear from Noetik, think they can use AI to break through a core bottleneck in the treatment development process. GSK recently signed a $50M deal for their technology that also includes an (undisclosed) long-term licensing deal for Noetik’s models, like the recently announced TARIO-2, an autoregressive transformer trained on one of the largest tumor spatial transcriptomics datasets in the world. Whole-plex spatial transcriptomics is the richest way to read a tumor, and approximately 0% of cancer patients going through standard care ever get one. TARIO-2 can now predict a ~19,000-gene spatial map from the H&E assay every patient already has.

Most big AI plays in biotech have focused on discovery, and usually result in an in-house development effort (meaning tools companies usually become drug companies). This deal stands out in that it is a software licensing deal, and represents a commitment to a platform rather than a drug.
With attention on other software tools for drug development (see the Boltz episode and Isomorphic, for example), it is starting to look like Pharma's appetite for biotech tools has finally started to grow. Why the sudden interest?

Cancer is hard

Biology is hard, cancer is harder. But despite this, we’ve made incredible progress. So many cancers that would have been death sentences twenty years ago are now routinely survivable. It used to be that our main strategy was just chemotherapy: poison you and hope the tumor dies before you do. Now, there are many treatments that actually kill a tumor and leave the rest of you intact! Immune checkpoint inhibitors like Keytruda and Opdivo target the defenses of dozens of tumor types. CAR-T therapy adds modified T-cells to your blood that can target B-cell malignancies very accurately. Antibody-drug conjugates such as trastuzumab deruxtecan (Enhertu) combine a drug with an antibody, allowing it to target very specific (cancer) cells. We truly live in marvelous times.

With that said, we still have a long way to go. For every type of cancer with a miracle treatment, we have many more that are still death sentences. The world spends $20-30 billion a year trying to cure cancers, with hundreds of clinical trials yearly. Yet progress is slow, with a 95% failure rate in clinical trials.

The lab doesn’t translate to the clinic

Are we leaving something on the table? Enter Noetik and Ron Alfa. Ron’s core thesis is that many of these “failed” treatments actually work! But we’re not looking at the right patients with the right tumors. If only we had a way to really understand the unique types of cancer biology and which patients will respond to which treatments, we might be able to show a much higher success rate. Millions of lives (and billions of dollars) may ride on this.

The Hard part: Blind Faith in Data Collection

Ron and Noetik had the conviction to spend almost two years just collecting data. Lots, and lots, and lots, of data.
Noetik has acquired thousands of actual human tumors and collects a large multimodal dataset of hundreds of millions of images that allows them to create a detailed map of the cell makeup in the local environment. These are real human tumors, not Frankenstein mouse models or immortal cell lines. This data is then fed into a massive self-supervised model, creating a “virtual cell”. This model has a deep understanding of cancer biology: Noetik has worked carefully to show it can distinguish different types of tumors. Maybe even tumors we didn’t identify as distinct previously! More recently they figured out how to scale up their model and data, and see no limit in their scaling laws!

Noetik’s models can simulate how a patient will respond to experimental treatments. They are working with partners to test promising drugs that were demonstrated to be safe, but not effective. If these models work as hoped, Noetik will bring new cancer treatments to patients without developing a new drug! Their models will also guide the discovery process towards drugs that are more likely to make it through clinical trials. You can imagine why this is so attractive to GSK.

We’ll see…

Ron and Dan make pretty persuasive arguments that their models will truly assist in cohort selection in useful ways, and this seems valuable. And we think it’s pretty clear that

* Translation from lab to clinic is the biggest bottleneck for drug development.
* Better cohort selection using biomarkers is likely to improve translation from lab to clinic.

Noetik has already had some success here. We’ll see if they’re able to translate that into a reliable advantage.

Stepping back a bit from the technology, curing cancer is a pretty unambiguously positive application of AI. It is also a very hard problem to solve. Our guess is that most people have been impacted by cancer or will be at some point soon.
And we hope that learning about the amazing work that companies like Noetik are doing will inspire a generation of AI engineers to work on the hardest and most exciting problems that society faces.

This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit www.latent.space/subscribe

Highlighted moments

So you can think of one of those data points as an image, except instead of being an RGB image that has three color channels, now all of a sudden it has like 20,000 colors.
Jump to 26:07 in the transcript
you only really see the benefits of using larger models when you're looking at longer context lengths. And here, longer context really means, again, like you're seeing more tissue at once, more area at once.
Jump to 58:29 in the transcript
it doesn't follow that, like, any data set is a machine learning data set. It doesn't follow that that data set is going to solve the problem you're trying to solve.
Jump to 1:11:51 in the transcript
my bet is that the same is going to be true for these models too is that, like, by modeling sort of at the level of functional tissue where you have a bunch of cells interacting in, like, a disease context that that's going to get you to the problem of predicting kind of the patient-level behavior much faster than trying to first model a cell and then stitch a bunch of those cells together.
Jump to 1:21:57 in the transcript

Transcript

Introduction to Noetik

0:00So we basically opened the lab, we hired a team, we got all the instruments, we started sourcing tumor samples. There was no prior here that any of this would work, like zero. We just started generating data and like sourcing human tumors, processing. We built this whole processing pipeline to get the tumors into like these arrays and the formats. So you've got like these two-week runs where you're processing two slides and we're just churning data for months and we couldn't even train a model. So we sort of just built all this and then like let's say 18 months later,

0:32hey, I wonder, can we train a model off? And then it was not, you know, like it wasn't obvious. Yeah, there wasn't really like anything major to go off of. I mean, there were like transformers developed for single cell data. There just like weren't really data sets out there that people had been able to develop on. We do a lot of like custom model building.

Host Introduction

0:53Hi there, I'm RJ Honicky and this is Brandon Anderson. We're the co-hosts of the Latent Space Science Podcast. And today we're really happy to be in the studio with some of the people from Noetik. I'm Ron Alfa, co-founder and CEO of Noetik, physician-scientist by training. My hobbies are making hot takes about AI curing cancer. Hi, I'm Dan Bear. I'm VP of AI at Noetik. I'm a biologist by training. I did PhD work in neuroscience and then moved into comp neuro, computer vision, self-supervised learning and

1:29have been doing AI research at Noetik for the past few years.

Noetik Founding

1:32Maybe we should start with what is Noetik? Why did you found it? What is the difference between Noetik and the other virtual cell companies? Maybe just start with a little bit of a contrarian thesis, which is really the reason for founding Noetik. We all know the numbers that 90%, 95% of cancer drugs fail in the clinic. Why do they fail? So our thesis is they fail not because we're bad at pharmacology, not because we're bad at target selection or making the drug. We're actually better at that process than we have ever been in

2:05the history of drug development. Most of those drugs fail, we'd argue, because we're bad at selecting which patients those drugs are going to work in. And oftentimes you see trials where some patients respond to these drugs, and there is no placebo effect in cancer. And if you have a patient that responds, that tells you something, that there's some biology that's active there, but you have a problem in patient selection. And so really that's the thesis behind Noetik: can we build models that can fundamentally understand patient biology from the very beginning and help you

2:36position molecules in the right patient population?

Patient Selection

2:39So you're actually using the models partly at least to select the patient cohort, not just so you can imagine working either way. You could design, oh, I think that this molecule will do well because I know something about the patient population. But you could also say, I think that this patient population is the match for this molecule. And that's sort of the power of the models is like once you've trained these models on patient data, you can use them on both sides of the equation. So you can use them for discovering new targets directly from the patient data, which people often refer to as

3:13reverse translation. So starting from humans and then trying to understand which targets to go after, and then you can use that to develop molecules. But you can also use them directly on patient data. If you have, you know, let's say a phase two or phase three trial, you can use these models to understand which patients or what underlying biology of the patients in the trial is a predictor of response. And we've been doing a ton of that recently.

Rescuing Trials

3:43Are you doing a lot of like rescuing trials that had a bad effect? We are doing a lot of looking at like data from phase two, phase three trials, and then using the models essentially to run inference on patient biopsies and understand whether there's underlying biology that would help us design the next trial. We haven't shared any of that yet, but you'll see this soon. So cancer is kind of like infamous in that like, there are many, many different types of cancers. Whenever someone says like cure cancer, that is almost a meaningless vacuous statement. So your point is

4:16even amongst cancer, or you pick a specific type of cancer, and then a subtype and a subtype, there's a bunch of different patient populations and each one of them will respond differently to drugs. And your point is you can figure this out right now, that like some subpopulation will do well and respond to this drug when you think generally speaking, the rest of the population would not, even though we have historically classified this as like all one type of cancer, one indication, or so on. Yeah, that's exactly right. And I would maybe even go further and say like, nobody actually

4:49knows what the subtypes are. There are cancers that originate in a certain tissue like the lung that, you know, have been classified into subtypes based on pathologists looking at them for, you know, more than a century. And, you know, those subtypes certainly have some connection to the real like carving nature at its joints, like what are the actual functional subtypes of disease there. But our thesis is kind of that if you look at the data, a much richer kind of data, so the multimodal

5:22data that we're generating in our lab, we're going to see that actually, you know, what people thought was one subtype of lung cancer is really three distinct subtypes of cancer. And that is going to be critical for figuring out which patients should get which drugs.

Cell Lines Limitations

5:38Yeah, maybe I'll just go back to, like, one of your first questions. And, you know, I was saying like, you know, many drugs fail in patients because we don't understand which patients they will work in in oncology. Why do we end up in that situation? So whenever you make a new drug, you do a set of experiments in cell culture, cells in the dish. Those cells are often cell lines. These cell lines have existed for 40, 50 years and they're immortalized. So they have

6:10genomes that allow them to persist, that have abnormal numbers of chromosomes. They have gene expression patterns that don't represent any known cell in like the human body, really. These are sort of Frankensteinian cells. They're cancer-derived, though? They were originally cancer? Yeah, right. They're mostly cancer. And then, and so you can do your experiments in these cell lines in a dish, or then you can move these into animal models. And in oncology, you often have, you know, sort of a panel of different animal models with, you know, different cancer types that you'll test

6:43these in. And we, in doing these experiments, we sort of convince ourselves that, that some of these cell lines are, let's say, lung cancer cell lines or colon cancer cell lines. And then even that some of them in, in the mouse context are colon cancer cell lines and lung cancer. And then we, in the mouse, we implant them under the skin and like weird places and we treat the mice with drugs and we, we see how they respond. But ultimately there's a big gap because they don't translate to,

7:14to patient biology most of the time. So these cancer cell lines, most of them don't even, you know, even if they are derived from a colon cancer, they don't even have the mutations that human colon cancers have in many cases. Uh, and so, and pharma has done this for, you know, 20, 30 years where you, you develop a drug, you test it against, you know, hundreds of these. It's not a hard experiment. You can send this out to any CRO. They'll test your drug against hundreds of, of different cancer cell lines. And then you can sit back and say, okay, well, which of the 50 colon lines responded

7:49to my drug and which of the 50 ovarian cancer lines? And you could try and map that to human biology. But the problem is these cell lines as an abstraction do, do not relate in any way to, to human patients. And so what happens is ultimately, no matter what you do preclinically, the molecule gets in the clinic and the clinical team says, look, we don't really know how to design this trial because none of the data that you've produced gives us any insight on, on which patients to run. So we're going to, we're going to basically enroll an open-label study. So we're going to enroll all tumors, all patients that are, uh, you know, enrollable in

8:23the, in this trial. And we're going to see where we get signal. Imagine doing that in an early phase trial where let's say you have 50 patients and you're, you're trying to do, you know, test different doses and you don't really know the dose of the drug and you don't know what the safety margins are. And you're also trying to figure out where is my signal. Um, and then what if I told you that let's say in, in just lung cancer, hypothetically, let's say there's only 10 different subtypes of lung cancer and you don't even know if it's lung. It could be any. So, you

8:53know, this is what happens. And oftentimes you get to the end of these early stage trials and you don't see very many responders as you would expect, um, you know, statistically, and then these molecules get canceled.
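To make the dilution argument above concrete, here is a minimal arithmetic sketch in Python. The 50 patients and 10 subtypes come from the conversation; the 60% response rate inside the one responsive subtype, the assumption that subtypes are equally common, and the bar of 10 responders are hypothetical numbers chosen only for illustration.

```python
from math import comb

def expected_responders(n_patients, subtype_fraction, response_rate):
    """Expected responders when only one subtype responds to the drug."""
    return n_patients * subtype_fraction * response_rate

def prob_at_least(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p): chance of seeing k or more responders."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))

N_PATIENTS = 50        # trial size used in the conversation
N_SUBTYPES = 10        # number of subtypes used in the conversation
RATE_IN_SUBTYPE = 0.6  # hypothetical response rate inside the one responsive subtype
SUCCESS_BAR = 10       # hypothetical: responders needed to call the trial a success

# All-comers enrollment: a patient responds only if they happen to have the
# responsive subtype (assumed equally common) AND they respond to the drug.
p_all_comers = RATE_IN_SUBTYPE / N_SUBTYPES
all_comers_expected = expected_responders(N_PATIENTS, 1 / N_SUBTYPES, RATE_IN_SUBTYPE)
all_comers_success = prob_at_least(SUCCESS_BAR, N_PATIENTS, p_all_comers)

# Enriched enrollment: every patient has the responsive subtype.
enriched_expected = expected_responders(N_PATIENTS, 1.0, RATE_IN_SUBTYPE)
enriched_success = prob_at_least(SUCCESS_BAR, N_PATIENTS, RATE_IN_SUBTYPE)

print(f"all-comers: ~{all_comers_expected:.0f} responders expected, "
      f"P(success) = {all_comers_success:.4f}")
print(f"enriched:   ~{enriched_expected:.0f} responders expected, "
      f"P(success) = {enriched_success:.4f}")
```

Under these made-up numbers, the same drug looks like a handful of responders in an all-comers trial and like a clear signal in an enriched one, which is the patient-selection point being made in the conversation.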

Noetik System

9:04So you're imagining that your Noetik system, you help the pharmaceutical company to characterize, we expect that people with a certain genetic profile or even transcriptomic profile will, will respond to this drug. And then you go and you actually sequence from the patient and you say, yes, this is a match, or no. Is that the sort of grand vision? Yeah. I mean, I would say we are even less biased than that. We are saying, okay, well, we want the model to learn, let's say from

9:39lung cancers, we want the model to learn like how many different therapeutically relevant subtypes of lung cancer there are, just from self-supervised learning from the data. And those subtypes could be driven by large genetic changes. They could be driven by, you know, immune changes. It could be really driven by any biology that the model, uh, is learning in the process of training, you know, and we do see, you know, different types. I mean, feel free to contradict this, like as the actual doctor here, but like, you know, the,

10:10the biomarkers that, you know, people have been using are, you know, biased towards simplicity. You know, does the patient have this particular mutation? Sometimes like staining for this single protein or, you know, do transcriptomics like to, to look for a particular gene signature, but like, there's no reason to think that biology or like biology of cancer is that simple that you're going to capture, you know, most of the meaningful variation with such simple biomarkers. And,

10:45you know, most of them, they have like weak correlations with, you know, clinical success, but the hypothesis here really is, like, again, if you were to carve nature at its joints and figure out what's really going on, say there are, you know, these five subtypes, that the correlation there between which patients you give a particular drug and whether you have success is much, much stronger than if you're forcing yourself to go with these like very simple biomarkers.

Data Generation

11:15You mentioned the lab, you do a lot of data generation in the lab. So why do you think that that versus using existing public repositories or whatever is appropriate? Yeah, we generate all our data in, in the lab, everything from sourcing tumor samples themselves to processing them and generating the data. Maybe another, another hot take I have just in AI and bio is you're sort of not at the order of magnitude of data that you are in other spaces of building

11:47training models. And so it becomes really hard to brute force these problems just by collecting data. We have a couple pretty good examples of where someone has designed a data set. So PDB was designed and has been built over the past 50 years or so. And so it's not an accident that that data set exists. Someone decided that we are going to design this data set. We're going to collect this data over decades and decades. And then with the intuition that potentially this would help solve protein

12:19folding down the road. And, and, and, and it did. So it's not just that PDB is a bunch of random data that, you know, has been, that people have organized from, from the web. I think that in bio, you really need to be intentional about the data that you generate and how you generate it, um, and have some foresight around, well, what are the models we're, we're going to want to train and what are the models I need to learn from, um, from the very beginning. So that's why we've taken, taken this approach. Yeah. And I mean, like a good comparison is to the ImageNet data set, which kicked off the deep

12:51learning revolution in computer vision with convolutional neural networks, like actually demonstrating that, you know, neural networks can do better than other methods on object categorization. ImageNet is at least the, the part of it that people were developing models on is 1.2 million images, very carefully curated. These are high quality images, not like random images from the internet or like multiple data sets cobbled together. And labeled. Yeah. And labeled. And I

13:24think with the data that we're generating, we're around that scale right now. But, you know, of course people have gone much, much larger in image data sets and language data sets, text data sets, obviously for LLMs. So we think that we need to get the data up to that scale before we can really see the meaningful progress on the algorithm side. The scale of language data. Yeah. Language is really the only modality where people are seeing these very impressive scaling results. Um, and part of that

14:00has to be just the scale of data that's there and that the models are trained on. That can't be the only thing because, you know, there's a lot of like video data as well. People are training on like thousands of hours of video data and, you know, haven't seen kind of the scaling results that you have in language modeling, but having the right scale of data is necessary, if not sufficient, to like really make progress here. Can I offer a contrary view to that? Sure. So, I mean, there's this whole

14:30concept about the jagged frontier of LLMs and sort of AI, and how like in certain regions they can be really good at solving some problems and then remarkably stupid at solving other problems. And maybe the argument for what's happening is that a lot of these frontier models are just becoming massively, like everything is becoming in distribution. Like if everything starts out OOD, if you just get more data, it now becomes in distribution. Is it possible that for biological systems, because these are, there are underlying physical processes here, that you can basically

15:03make things more in distribution earlier and that you can actually cover the space? I kind of have some follow-ups with PDB, but I'm just curious at this point. Yeah, I mean, I think it's a good question, like sort of how much data and what kind of diversity do you need like in biology to solve, you know, say like the drug translation problem, like figuring out which drugs are going to work in which patients. My intuition from working in biology like for a while is that we're still pretty far from

15:37that, like because, you know, we're building data sets that are focused on right now cancer and, you know, have generated data from thousands of patients in a few major cancer subtypes. But there's like every other disease, there's healthy tissue, there's even other species, you know, there's a lot of biology to learn, especially if you think about it as we have to learn kind of the spatial and functional patterns of tens of thousands of genes, tens of thousands of proteins, how their spatial arrangement contributes

16:13to the function of organs and so forth. You know, my hunch is that biology is like pretty complex and that we still need to generate a lot more data. But yeah, I don't know. But as a cancer company, do you think you could actually do this hypothetically for cancer? I mean, for at least some, you know, subclasses? Definitely. Yeah. I think that we've done experiments that suggest that, you know, if we can generate data from several hundred patients in all of the major cancer indications and some of the less major indications,

16:46that that will result in a model that can generalize pretty well to kind of any type of cancer we would throw at it. Backing up, what is the data you're collecting? Because my understanding is you use some pretty specialized instruments and are gathering very specific data sets. So how did you come to that decision about how much data, how much to spend on it, and what types of data? I'll give a hat tip to my previous employer, Recursion. So we spent six years at Recursion from the very beginning.

17:18And a lot of what we were doing in the early days was figuring out like the things we didn't understand about the data sets and figuring out what the problems would be in the data sets. So batch effects, controls, how to orient samples on plates, things like that. Flash forward to founding of Noetik, started the company, you know, already with some principles around how we should think about building the data set. What are some things that we know matter? So for example, over many years, we learned that images are actually a really powerful data set for machine learning for many reasons.

17:51One, they scale. So we can put patient samples on slides and on a single slide, we can capture many patients' worth of biology. Well, the images themselves are very rich sources of biological information beyond that. Now we have a very information-dense modality, and we can decrease the cost of data generation. So then we can increase the amount of data generation over the whole data set. And that's always been a really big benefit to image-based modalities over, let's say, sequencing, where every time you run a sequencing run, basically your n is, you know, a patient.

18:27That was one, one way to think about it. The other was, how do we design these data sets so we can control for things that we know are really important, such as batch effects. So for example, if I have a slide, we do a, let's say, a spatial transcriptomics run on that slide. You stain the slide, do a bunch of, you know, wet lab processing. So if you put it into a machine, you get data out. If you do that on two different days, there are going to be different variables that impact the data.

18:59That's going to be a large source of variation in data sets. So you want to be able to control for things like batch effects. So really you want to have patients represented on multiple different slides so you can process them in different batches. So you want to be able to control for things like this so you can go downstream and look at the data and say, okay, well, once we have, let's say, patient-level embeddings, we can ask, well, do the patient-level embeddings represent, let's say, patient response to immune therapy or do they represent staining batches?
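The question posed above (do patient-level embeddings track response or staining batch?) can be sketched as a simple nearest-neighbor check. This is only an illustrative sketch on synthetic data, not Noetik's method: the embeddings, the labels, and the helper `nearest_neighbor_agreement` are all hypothetical.

```python
import math
import random

def nearest_neighbor_agreement(embeddings, labels):
    """Fraction of samples whose nearest neighbor (Euclidean) shares their label."""
    hits = 0
    for i, point in enumerate(embeddings):
        nearest = min(
            (j for j in range(len(embeddings)) if j != i),
            key=lambda j: math.dist(point, embeddings[j]),
        )
        hits += labels[i] == labels[nearest]
    return hits / len(embeddings)

# Synthetic stand-in for patient-level embeddings (hypothetical data):
# responders are shifted along the first axis; staining batch has no effect.
rng = random.Random(0)
embeddings, response, batch = [], [], []
for patient in range(40):
    is_responder = patient < 20
    point = [rng.gauss(5.0 if is_responder else 0.0, 0.5)]
    point += [rng.gauss(0.0, 0.5) for _ in range(3)]
    embeddings.append(point)
    response.append("responder" if is_responder else "non-responder")
    batch.append(f"batch_{patient % 2}")  # each group is split across both batches

response_agreement = nearest_neighbor_agreement(embeddings, response)
batch_agreement = nearest_neighbor_agreement(embeddings, batch)
print(f"neighbors share response label: {response_agreement:.2f}")
print(f"neighbors share batch label:    {batch_agreement:.2f}")
```

If the batch agreement came out far above chance while the response agreement did not, that would suggest the embeddings encode staining batch rather than biology. The check only works because every group is spread across batches, which is the design choice described in the conversation.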

19:30So you're, you're actually taking one patient and you're spreading them across multiple slides so that you can get, like, a sort of calibration across the slides. So yes, our data looks very different from anyone else's in the space generating data on histology or digital pathology types of specimens.

Spatial Transcriptomics

19:48Um, so we, we receive a sample, we sample those samples dozens of times to build these arrays. Um, and each array has, um, you know, hundreds of different patient samples randomized. Um, and every patient is represented on multiple different arrays. And so we're getting a lot of different representations of each patient that we're sending through the data processing pipeline. And then that lets you downstream be able to answer some of these questions and control for some of these variables. You mentioned some terms I just want to define for people: spatial transcriptomics.
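The array design described above (each patient sampled several times, placed on multiple different arrays, with positions randomized) can be sketched as follows. This is a toy layout generator under stated assumptions, not Noetik's pipeline: the function name `build_arrays`, the replicate count, and the array count are hypothetical, and the sketch does not balance how many cores land on each array.

```python
import random
from collections import defaultdict

def build_arrays(patient_ids, n_arrays, replicates, seed=0):
    """Place each patient on `replicates` distinct arrays, then shuffle positions."""
    if replicates > n_arrays:
        raise ValueError("replicates cannot exceed n_arrays")
    rng = random.Random(seed)
    arrays = defaultdict(list)
    for pid in patient_ids:
        # rng.sample picks distinct arrays, so a patient never repeats on one array.
        for array_id in rng.sample(range(n_arrays), replicates):
            arrays[array_id].append(pid)
    for cores in arrays.values():
        rng.shuffle(cores)  # randomize core order (position) within each array
    return dict(arrays)

patients = [f"P{i}" for i in range(1, 7)]
layout = build_arrays(patients, n_arrays=4, replicates=2)
for array_id in sorted(layout):
    print(array_id, layout[array_id])
```

Because every patient ends up in more than one processing batch, downstream analysis can separate patient-to-patient variation from array-to-array variation.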

20:18Yeah. What is that? Yeah. So, I mean, this was your first question: what are the data types? Yeah. Um, so you just sit back, and this is not my background in terms of spatial. Again, everything we did previously was cell biology in a dish. If you just sat back and you said, okay, I want to train a foundation model that understands human biology, what does that mean? How would you go after that problem? And that was really the starting point for the company: okay, from first principles, how would we do this? So you probably want tissue-level biology.

20:48You want to understand how cells are organized into tissues. You probably want some modality that is relevant in clinical use, so you can relate clinical data to what your models are learning. That's why we generate pathology H&E. So that's, you know, every patient gets a tumor removed and then they get this H&E stain. And that's what the pathologist... Can you explain what H&E is? Um, it's basically two different dyes, hematoxylin and eosin. And it, you know, really just creates a contrast over the tissue.

21:20So you've probably seen these like purplish pathology specimens. So pathologists can look at those and they can identify different cellular structures. And they've used those to classify tumors based on, you know, the classical classifications of, you know, adenocarcinomas, small cell carcinomas, things like that, but basically cellular structures. Okay. So there's like specific patterns that would show up when you add these two stains and it is well established that like you classify tumors based on.

21:52Based on, yeah. Pathologist classifications. And this is what every, basically every tumor, um, you know, that gets processed in the hospital will get this H&E staining. And it's how the pathologist typically classifies a tumor. Um, from, from the first level. So, okay. So you want that. You probably also want to understand cell types. It's really hard to understand cell types from just that stain because it doesn't reveal that much that a human can use to classify cell types at least. So you could say, well, I, I want to know whether there are immune cells and different

22:25subtypes of immune cells. Um, we want to have some layer of cell biology. Okay. So, and you want to know about immune cells because like you have these cancer cells and oftentimes the immune response dictates whether or not, like, you have an effective treatment or... It's like the immune environment of the tumor, we know, is a core, uh, constituent of, of, of whether a patient's going to respond or not. Um, so you want to, okay, you want to give the model this. So the models are going to get this tissue-level information. There's not enough cell-level information in there for the model to learn enough cell

22:56biology of all different subtypes. So we also want to present it with some cell-level information. So we use, um, protein stains, so standard, you know, fluorescence. So you basically use antibodies against a small set of, uh, cell markers to label, you know, different T cells, B cells, you know, standard subtypes of cells in the tumor microenvironment. So in this stain, just for those who aren't familiar with the stain: the antibody has a fluorescing protein, and when you hit it with a certain frequency of light, then

23:27it fluoresces so you can tell the antibody bound to a certain protein. And now it has a fluorescing protein attached to it. Yep. And in terms of the data, so from, from the, from the tissue layer, you have an RGB image; from the next layer, you have, uh, a multi-channel image with each channel representing, you know, let's say one color. And so for example, certain immune cells are each in, in a different channel. So you have this multi-channel image now. Okay. So that's great. We've got tissue, we've got cells, but if we actually want to make drugs, um, we need some, some type

24:01of molecular information. We need to tie all of this down to what's happening in the genome. What is the cell doing? What are the mechanistic principles of, of the biology? So then we get spatial transcriptomics. So that's spatially resolved RNA. So DNA is transcribed into RNA, which is, uh, translated into proteins. So we get basically the RNA, um, in a spatially resolved pattern for the same cells that we're seeing in all of these other layers. So now you have between a thousand and 19,000 different genes.

24:34And again, these are all image layers that are spots of where those RNAs are and in which cells. And this one works a little bit like how we talked about protein, where you have a segment of RNA and then you have a fluorescing protein. And usually there's some sort of combinatorial thing: if you see these four colors at this amplitude, then that means this gene, because they're right next to each other or something like that. So for the detection method, you're basically binding a probe to each one of those RNAs and then you're cycling it.

25:04And it takes weeks to run one of those assays. So you're cycling the machine; it'll cycle across each species, it'll amplify, and you'll get a signal for each RNA species. Now, at this point, you have this very rich data layer where you have the tissue, you have the cells, and you have the molecular information. And you can use all of that to train the model. You can think of it as essentially the central dogma, if you will. And we also have DNA. We genotype just so we understand the genomic alterations in these tumors.

25:36All right. So you get the stack of images, basically, that you can train models on, with an understanding of the genes and the proteins that are being expressed at the time the sample is taken, all in the image information. And then you can train your models with that. Yeah. I mean, the spatial transcriptomics is particularly dense because, let's say there are 20,000 genes in the genome. Now we're running assays that are detecting nearly all of them in a single sample.

26:07So you can think of one of those data points as an image, except instead of being an RGB image that has three color channels. Now, all of a sudden it has like 20,000 colors. So it's like a very meaty computer vision problem to try to look at those data and figure out what makes patient A different from patient B and then go from that to which drug is going to work in which one. And so you, you have a hot take about virtual cell.
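As a rough illustration of the "stack of images" idea described above, here is a minimal Python sketch that represents one sample as aligned layers concatenated along the channel axis. The 3 RGB channels and the roughly 19,000 gene channels come from the conversation; the 16 protein markers, the 4x4 grid, and all names are assumptions made for illustration, not Noetik's actual data format.

```python
# Hypothetical sketch: one tissue sample as a stack of aligned image layers.
# The 4x4 grid and the 16 protein markers are illustrative assumptions.
HEIGHT, WIDTH = 4, 4


def blank_layer(channels, height=HEIGHT, width=WIDTH):
    """Return a channels x height x width nested list filled with zeros."""
    return [[[0.0] * width for _ in range(height)] for _ in range(channels)]


layers = {
    "h_and_e": blank_layer(3),    # RGB pathology stain (tissue level)
    "protein": blank_layer(16),   # one channel per antibody marker (assumed count)
    "rna": blank_layer(19_000),   # one channel per gene (spatial transcriptomics)
}


def stack_channels(layers):
    """Concatenate all layers along the channel axis, in a fixed order."""
    stacked = []
    for name in ("h_and_e", "protein", "rna"):
        stacked.extend(layers[name])
    return stacked


sample = stack_channels(layers)
total_channels = len(sample)
print(total_channels, "channels over a", HEIGHT, "x", WIDTH, "grid")
```

Compared with a 3-channel RGB image, the same spatial grid here carries thousands of channels, which is the sense in which the speaker calls it an image with "20,000 colors".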

26:37Like I want to understand how, okay, so you, you know, you have this big pile of data that every single sample has a massive data set with it. And then you have many, many samples. So how do you turn that into useful knowledge? Maybe just what is a, what is a virtual cell? Everyone's always, you know, asking that question. I think there are, there are really two ways to think about it. You know, one is we want to be able to simulate all the biochemical processes in a cell. So we want to have this sort of comprehensive foundation model where we understand, you know,

27:09if some signal from outside the cell interacts with the cell, then here are the millions of intracellular chemical reactions that are going to happen. And you could sort of predict them, you know, from the model. So that's one view. I think that's interesting. It's sort of an interesting intellectual pursuit. I don't think we have all of the modalities of data that you would need to solve that problem. I tend to see the virtual cell problem as something more practical. We're trying to make drugs that work in patients.

27:41So from a virtual cell perspective, really what we want to do is understand cell biology in, in some heuristic that's useful for, for making drugs. And the heuristic could be, you know, a way to understand drug targets or a way to, you know, map your cell level biology up to patient level biology. And so the way we've designed these first virtual cell models is really just to simulate the biology of a cell in some context. And the biology of that cell being, you know, let's say the, the cell being in some context

28:13and the output being, you know, the transcriptome in that context, or the protein in that context. And these types of input-output relationships allow us to essentially design experiments. And so really the very simplistic thing that we're doing is that the model can simulate the biology of a cell, or many cells, in different contexts and allow you to run some simulations in that regime. Yeah. I mean, I think most of the things that people are calling virtual cell models

28:45right now are focused on single-cell gene expression. So transcriptomics data, RNA data, and they're largely geared toward the problem of predicting what's going to happen to the transcriptome, so the set of genes expressed, when you hit cells with either a small molecule, a drug, or a genetic perturbation. And typically this is cells grown in vitro, like either cell culture or primary cells, something like that.

29:15I think that- Genetic perturbation being where I knock out a gene or add it, generally, and see how that impacts the expression of the various RNAs. Exactly. So I think my view, and I think Ron shares it too, is that that may be of interest in some cases, but the problem we're really trying to solve is predicting what's going to happen in a patient. And just modeling data that comes from a patient is, in my mind, much more likely to

29:47translate to what happens when you give a patient a drug than something that's happening in cell culture. Is there other clinical data that you're pulling into the model besides the actual, so you're calling the context of the cell just the surrounding cells, but is there other data, like "this drug caused a bad reaction" kind of stuff? Yeah. I mean, we're pulling in data from the entire patient. So not just, you know, the very local neighborhood of the patient. So far we haven't done much integration of, you know, like electronic health records or,

30:23you know, other information that one could get about the patient. And that's pretty intentional. Like we really want these models to learn basic biology. Again, like the central dogma, not just the central dogma, but you know, the basic biology of genes, proteins, cells, tissue in a self-supervised way. So purely from the data that we're generating and not be biased by, you know, what the doctor wrote about that patient.

30:54Because, you know, our thesis is kind of like most of the therapeutically predictive and important information is not contained in those very small number of, you know, patients who have been treated with a given drug and whatever the doctors thought was important to write down given the state of knowledge at that time. So it's much more about trying to discover what's really there in patient biology than go based on the text that people have written about it.

31:25So you have this self-supervised model. You feed it a lot of data. You have essentially some clusters of patients now. How do you translate those clusters of patients to making decisions? Like you go to a pharma company and you say, we can repurpose or we can suggest this subtype should be the focus of your phase two trials. Like what is the process for that? What data do they need to provide you and how do you translate your models? So it depends on what the problem is.

31:55I think it's important. So one of the more interesting aspects of these models is they are useful for a broad array of use cases as you know, as we were talking about from the very beginning. So you as the pharma company could say, okay, well, I have this molecule and the target of the molecule is X. And I want to design my clinical trial. The molecule has seen zero patients so far. All I know is the target and some biology around the target. So we can run simulations using the models and our cohorts of patients.

32:28And let's say we were to look in lung cancer: we can run simulations around the target and ask, okay, in which sets of patients would this target be important, across a cohort of lung cancers and colon cancers and across all of oncology. And you might see, and we see this sometimes, that you probably don't want to put your target in lung cancer. Maybe you want to put it in ovarian cancer, because it's not really important in lung cancer. Yeah. What are you simulating here? So, do you say that this drug is expected to knock down this gene and therefore

33:02you want to look for clusters where knocking down this gene inhibits tumor growth rather than enhancing tumor growth? I mean, that's certainly one way we could do it. There are other types of simulation where you might just want to ask, like, if there were an immune cell here, like a T cell, which is responsible for actually killing tumor cells, what would happen to it, or what genes would it express, or what proteins would it express in this particular patient's tumor microenvironment?

33:34And, you know, that's what we've called these virtual cell simulations. We have a model called Octo Virtual Cell that does this. And that can give quite powerful answers to the question of, are these drugs going to work in these patients? Because you might find, actually, as Ron was saying, the thing that this drug targets is just not important in this particular patient's tumor, in that it's not going to have any effect on the T cells or the macrophages or some other cell type there.

34:06Then, you know, there's the type of simulation you alluded to where you can ask the model, what would happen to this patient's tumor if you were to knock down this particular target gene or its protein product? And you might be looking for cases where the model predicts that removing that gene or that protein is going to have a large effect, like either increase the immune system function, its ability to fight that tumor or, you know, decrease the tumor's ability to grow or some other readout

34:40that you think is correlated with clinical success. I just want to call out that maybe the simplest use case is the one where there's a company that has a drug and they've given it to some patients and we know some of those patients responded. And then it just becomes a question of: does the space of patients that the model has learned via self-supervision tell us that all of the responsive patients are in one of these clusters and not the other nine clusters or something?
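The "simplest use case" described here, checking whether known responders concentrate in one learned cluster, can be sketched in a few lines of Python. The cluster assignments and response labels below are invented toy data, and `response_rate_by_cluster` is a hypothetical helper, not a function from any Noetik tooling.

```python
from collections import defaultdict


def response_rate_by_cluster(cluster_ids, responded):
    """Return {cluster: fraction of patients in that cluster who responded}."""
    counts = defaultdict(lambda: [0, 0])  # cluster -> [responders, total]
    for cluster, flag in zip(cluster_ids, responded):
        counts[cluster][0] += int(flag)
        counts[cluster][1] += 1
    return {c: r / n for c, (r, n) in counts.items()}


# Toy data (assumed): 8 treated patients, each assigned to a cluster
# derived from the model's embedding space.
cluster_ids = [0, 0, 0, 1, 1, 2, 2, 2]
responded = [True, True, False, False, False, False, False, False]

rates = response_rate_by_cluster(cluster_ids, responded)
best_cluster = max(rates, key=rates.get)
print(rates, "-> candidate responder cluster:", best_cluster)
```

In practice one would also want a statistical test on the enrichment, since a handful of responders can land in one cluster by chance.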

35:13So if we know that, then there's a pretty straightforward hypothesis that this is the right cluster. So that's the scenario where you would sequence something. What would you collect about those? So you have a cohort that responded and one that didn't. Yeah. So this is getting back to something Ron mentioned earlier, which is this type of data called H&E. It's a stain, the standard pathology stain that makes these, you know, pinkish and purplish looking images. Right now, what we do is we've built models that are trained on kind of all of the multimodal data we generate.

35:50But then once they're trained, at inference time all they need is an H&E image. And that could be something that we generate in our lab or it could just be, you know, a digital image that they have from a trial that was run years ago. And the reason that that is so powerful and flexible is, again, because H&E is kind of like the lingua franca of pathology and especially oncology. So almost every patient who's been given a clinical stage drug is going to have that.

36:23You can look at the two cohorts, the responders and the non-responders, and say these H&Es live in this part of the latent space and these H&Es do not. Yeah, exactly. And I think, you know, one way we've gone further than that even is, given the H&E, the models can say, I predict that these genes are expressed at this location in this patient. So not only do we have these clusters, these embeddings that say, you know, all of the responders to this drug are over here, all of the non-responders are over there, but we can actually see, okay, for the responders, these are the genes that are expressed much more highly, or predicted to be expressed much more highly, in the responder cluster versus the non-responder cluster.
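One simple way to express "genes predicted to be expressed much more highly in the responder cluster" is a per-gene log2 fold change between the two groups. The sketch below assumes each patient is summarized as a dict of predicted expression values; the gene names (`TARGET`, `GENE_B`), the numbers, and the pseudocount are all illustrative assumptions rather than anything stated in the episode.

```python
import math
from statistics import mean


def log2_fold_change(responders, non_responders, pseudocount=1.0):
    """Per-gene log2 ratio of mean predicted expression (responders / non-responders)."""
    result = {}
    for gene in responders[0]:
        r = mean(patient[gene] for patient in responders)
        n = mean(patient[gene] for patient in non_responders)
        # The pseudocount avoids division by zero for genes that are not expressed.
        result[gene] = math.log2((r + pseudocount) / (n + pseudocount))
    return result


# Toy predicted expression per patient (hypothetical genes and values).
responders = [{"TARGET": 7.0, "GENE_B": 1.0}, {"TARGET": 7.0, "GENE_B": 1.0}]
non_responders = [{"TARGET": 1.0, "GENE_B": 1.0}]

lfc = log2_fold_change(responders, non_responders)
ranked = sorted(lfc, key=lfc.get, reverse=True)
print(ranked, lfc)
```

A sanity check like the one the speaker mentions would be that the drug's own target appears near the top of such a ranking for the responder group.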

37:12And so that adds a major, like, level of interpretability there because, you know, we can see things like, okay, like, good, the responders are actually expressing the protein target of this drug. So we would be worried if that weren't the case, but, you know, we can see it is. On the other hand, we also see that, you know, the biology is very, very complicated. So kind of explaining why these simple biomarkers, like looking at a single gene or a single protein, just really don't capture, you know, what is predictive of therapeutic response.

37:47Yeah, so I have like a million directions I want to go here. H&E, that actually gives you a pathway to a diagnostic then as well. Exactly, yeah. Right, yeah. And so you can imagine that after the drug hopefully makes it to the market, a doctor says, oh, you have cancer, I'm very sorry, we're going to do an H&E stain of your tumor, and then we're going to put it in the model and it says, oh, you know, this one will work for you, but this one won't. That's right. And so we're using the same approach, actually, today; we're looking at many different mechanisms from different collaborations that we have in place.

38:28You know, one of them we've announced with a company called Agenus. These are all different mechanisms. The input is still H&E, you know, and in some of the same indications. So using H&E, we're asking whether drug A works in some sets of patients, whether drug B works in other sets of patients. And so you can take that, you know, to its natural progression and say, well, okay, if you can use that same input, just H&E, for experimental drugs, why not use it also for drugs that are on the market already? In a sense, the same assay can be very predictive across many different cancers and many different potential therapeutics.

39:04There are lots of models out there that take H&Es and go to gene expression, open source, whatever. Or they do, you know, so-so. I've read in your Twitter feed and whatever that you feel that you have a data moat, right? And so why is Noetik's model better? Sure. I mean, I think, you know, the scale of data that we've trained these models on is like, you know, pretty different from a lot of what's out there. Like, the reality is there's just not that much of this kind of paired H&E plus other data modalities.

39:37Typically, you know, there are some data sets generated by academic labs, others where, you know, they might have maybe like a hundred or a few hundred patients worth of data with paired spatial transcriptomics. That might even be an overestimate. In comparison, we're generating these data that are, you know, multiple patients per slide, individual patients distributed across multiple slides. We've generated now, you know, more than a hundred million cells spatially resolved, spatial transcriptomics.

40:10That's all paired with H&E and protein as well, at least an order of magnitude larger than any of the other data sets that we've seen out there. And I think that makes like a pretty enormous difference. I mean, we've seen with our own models that if you drop down to 40% or 10% of that data used in training, the models get a lot worse. And they especially get worse at kind of generalizing to other types of cancer from the ones that they've been trained on.

40:42So I think that's a big piece of it. I also think that, you know, the algorithmic side of it is important. You know, we've developed custom architectures specifically for training on this multimodal data. And again, my background is in computer vision and specifically in self-supervised learning there. And so we've tried to develop, you know, self-supervised learning approaches for these data that are really adapted for solving this problem of, you know, figuring out what is different in one patient versus another.

41:13And then simulating what would happen if you were to, like, knock down a particular gene or protein or something. So this is why we call these world models where we're trying to build models that can simulate what's going to happen if you take a particular action. I think that's another big differentiator for these models. And then, again, the interpretability as well is probably a third one. It's funny because you were just talking about how one of the other strategies people take for this is to do perturbations on cells and then watch the response.

41:49And now your experience plus, like, your strategy is you can simulate this sort of counterfactual perturbation idea without even having to collect the data to do that. And you can see this. Well, there's, yeah, there's a big piece that we haven't talked about yet, which is actually we are running perturbation experiments, except they're in vivo perturbations using a platform based in mouse. We have another platform where we are, it's called PerturbMap, Ron, if you want to describe any of it.

42:25But basically, this is a platform for generating highly multiplexed knockouts of individual genes. So the same kind of, like, CRISPR knockouts that people are doing for individual cells in vitro, except when we knock out a gene in a cancer cell, that cancer cell gets injected into a mouse. It's barcoded so we know which gene was knocked out, and it's being injected alongside, like, roughly 100 other cell types with different genes knocked out.

42:55So you end up with mice that have tumors that are barcoded and have 100 different genetic perturbations in them. We can actually use that to validate our models and ask: is what the models are predicting in humans via simulation actually borne out when you do these perturbations in a mouse system? Sorry, there's a lot to validate there. Barcode? Yeah, so, sorry, barcoding. This is a technology in which an individual gene is knocked out with CRISPR, but this also introduces a set of protein tags in that cell that get expressed.

43:37It's a combinatorial code, so gene X might have, you know, proteins A, B, and C. Gene Y, when it's knocked out, has proteins D, E, and F, and we can tag those proteins or label them with antibodies so that when we go and look in the mouse, we know exactly which gene was knocked out based on which of those protein tags were expressed. So you knock out a gene, but you also added a gene that has the barcode proteins encoded on them.
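The combinatorial barcode described here, where each knocked-out gene is identified by which protein tags are expressed, amounts to a lookup from a set of tags to a gene. The Python sketch below uses the speaker's own example (gene X with tags A, B, C; gene Y with tags D, E, F); the six-tag pool, the three-tags-per-gene scheme, and the `decode` helper are assumptions for illustration only.

```python
from math import comb

# Assumed pool of antibody-detectable protein tags.
TAGS = ["A", "B", "C", "D", "E", "F"]

# Codebook following the example in the conversation: each knocked-out gene
# is paired with a unique set of three tags.
codebook = {
    frozenset({"A", "B", "C"}): "gene_X",
    frozenset({"D", "E", "F"}): "gene_Y",
}


def decode(detected_tags):
    """Map the set of tags detected in a tumor to the gene that was knocked out."""
    return codebook.get(frozenset(detected_tags), "unknown")


# With 3 tags chosen out of 6, the number of distinct barcodes available:
capacity = comb(len(TAGS), 3)

print(decode(["C", "A", "B"]), decode(["A", "D"]), capacity)
```

The point of the combinatorial design is that a small number of tags yields many distinct barcodes, so roughly 100 knockouts can be told apart in one animal with a modest antibody panel.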

44:08Yeah, exactly. And I mean, the system's designed so that everything we're doing here is tissue level. It could be in vivo, you know, tumors that came from a human, in the form of the whole tissue. And then in this mouse system, you have hundreds of tumors in the lungs of a mouse; if you look at these images, it's a mouse lung with, like, literally hundreds of tumors in it. And each tumor has a distinct biology that's driven by the biology of the knockout, of the gene that's being perturbed, and we can capture, basically, the biology of each tumor in a spatially resolved way.

44:46So what you can see is, okay, well, certain tumors in humans, let's say, don't have immune cells in them. And so those tumors are very aggressive, and they don't respond to immune therapies. You can generate those same tumors in this mouse system, and again, they don't have immune cells in them. And you can do it genetically, so you can start to map kind of the causal gene relationships between these different immune, or just broadly, tumor genotypes.

45:16Or biological profiles, if you will, to what you see in the human. And then you can treat those mice with drugs, and you see how hundreds of tumors in a single mouse respond to treatment with one drug. Or you can treat many different, you know, let's say 50 different knockouts across a panel of mice with 50 different drugs. And you can start to build this intersectional pharmacology and, you know, genetic experiment. On Twitter and in various places, I've heard you say, Noetik is: no cell lines, no mouse models.

45:48Maybe you even said that, you know, a few months ago. And then we just said we have a mouse model. Yes. And injecting cells, like, to... In the noel, not under the sky. So, yes. So, you know, fundamentally, we think it's really important to build models that are trained on human data. And we are sourcing all these human tumors to build, you know, human-centric models. So, that is also true. From the very beginning, we have asked this question of, you know, let's say we want to develop a drug from the very beginning.

46:21And let's say the FDA, and I know things have changed a little bit with the FDA, but let's say the FDA wants you to have some data in an animal that says your new mechanism works in some animal system. What do you do? You're kind of stuck because you've now generated, you know, arguably the best data that you can in the human system. And then the FDA says, well, cool, but does it work in the mouse? How does it work in the mouse? And so you have to back into this system that doesn't translate.

46:52And so, from the very beginning of the company, this has been, you know, sort of a question. And so, probably at the same time we started generating the human data, we started building this mouse platform with the aim of drawing connectivity between these two systems. And so, we wanted a platform that, one, allows you to map a diversity of human tumors, because we know that if we just run a mouse model with one tumor, that tumor has no climate activity. So, in the mouse system, we want to have diversity of tumors, and we want to see a mapping of diverse tumor biology to the tumor biology that we're seeing in the human across many different locations.

47:28So, we licensed this system, and we've been building it, so you can see many different perturbations that produce a lot of the tumor biologies, plural, that you see in the human. And then we also want to be able to get from this mouse system to biologically relevant, let's say, targets or genes in the human as well. So, one of the fundamental problems in mouse systems is we share many genes with mice, but there are a lot of genes and biological processes we don't share with mice, as is obvious.

48:01And so, oftentimes you run into these when you're developing drugs. It's okay, you have a target, you have some biology that works really well in mice, maybe that doesn't even exist in humans, or like maybe that pathway is like useless in humans. And so, one of the things we've started to develop that we'll share more about soon is a way to use one of these models to essentially infer human biology from the mouse directly. And so, we're in silico humanizing the mouse. So, all the outputs in terms of the transcriptome from the mouse are in the form of the human genes.

48:34And so, when we read out this mouse system, we're reading out in the form of a human ortholog. How do you validate that? I mean, that's a pretty impressive claim if you can do it, but man, it seems like a tricky validation task. In my experience, both here at Noetik and at my previous employer, I could say, Recursion, a lot of the approach when you're building these types of models is that you're trying to ask whether the models are recognizing biology that you know to be true.

49:07So, for example, in the human context, we know that 12% of patients with lung cancer respond to immune checkpoint inhibitors. Do the models recognize those patients? Can they recover those patients without training? Yeah. And we see that. And then when you go look at those patients, we see that the underlying features of those patients map to what we know about those patients and, you know, the clinic. In the mouse system, we have control genes.

49:37So, we ask, if you look at the mouse tumor embedding space, do the tumors that should be really cold look really cold from the human inference? Cold in the sense of, like, they don't have immune cells. No mice. Yeah. Yeah. And then hot in the sense of, like, lots of immune cells. So, we try to build systems where you have these handholds. And then, you know, the more of these examples that you know to be true that work, that you see, the more confidence you have. Obviously, when you're in the regime of something very new, it's still uncertain to some degree.

50:11So the bridge between the mouse and the human is that you build a world model on a human, you build a world model on a mouse, and then you say, what are the parallel structures in the two latent spaces? Is that kind of the intuition here? That's one thing that we're doing. But actually, this is, like, even simpler, which is that we've trained models on human H&E, spatial transcriptomics, etc., and then are just inferencing them on mouse H&E, which is easy to generate.

50:42And apparently, mouse H&E looks enough like human H&E that the models think it's perfectly valid H&E and make predictions about: is this, like, immune hot, like, immune infiltrated, versus cold versus fibrotic versus some other tumor phenotype? And those predictions are accurate. So, you know, these are, like, some of the controls that Ron mentioned. So, you know, we know that in mice and humans and everything, if you knock down tumor cells' ability to present antigens to immune cells, you know, those are very cold, like, immune cells are nowhere near those tumors.

51:20And, you know, that's exactly what we see in the mouse, and that's exactly what the models, the in silico humanized models predict. And, you know, then there are other examples where, again, we're recovering the biology that we expect to see there. And then there are findings that are novel, but also make total biological sense. For instance, we have done knockouts in the mouse of, let's say, half a dozen genes that are all in the same pathway.

51:52So you might predict that knocking down those genes is going to produce the same phenotype because they're on the same pathway. And what is a pathway? Yeah, so a pathway is, like, protein A signals to protein B signals to protein C. And, you know, there's, like, a chain of events that leads to the cell having some behavior, you know, changes in its metabolism, its growth, etc. So these are, I don't know if you've ever seen these crazy-looking protein signaling diagrams that, you know, make you want to stay away from biology.

52:24But, you know, like, you know, people have, you know, worked down a lot and they know that these two proteins interact physically and signal to each other and so forth. And so, you know, one of... There's some chain of those interactions that this protein binds to this protein and that causes it to upregulate a gene that causes this other protein to be formed, blah, blah, blah. Until you get to some phenotype, meaning the cell changed the way it looks or the... Exactly. And so, you know, based on decades of biological literature doing experiments on these, there's a very strong biological prior that if you hit gene A, gene B, gene C, and they're all in the same pathway, you should get similar phenotypes.

53:08I mean, this is kind of how, like, old-school genetics was done. And we see that with these in silico humanized mouse models, which is amazing to me as a biologist, that you have a model that's trained on human data, then you show it some mouse histology, and it's able to say these five different tumor genotypes all look like they have the same phenotype. And lo and behold, there are, you know, five genes that are in the same pathway. So you guys, switching gears a little bit, because we want to talk about models on Latent Space Podcast, you guys recently, there was an interesting blog post, Tario model.

53:48It's some transformer-based model. Do you want to talk about that? Sure. Yeah. So this is, like, new model architecture that we developed post sort of the first virtual cell model, OctoVC, that we developed. So Tario, this model is, you know, just a different transformer architecture. One major difference between it and, you know, our prior models, I guess, if this is a model podcast, this is getting into, like, the self-supervised learning objective.

54:19So, you know, for a while, including with OctoVC, we were training models on what's called the masked autoencoding loss function or objective, where you have a piece of data, you chunk it up into small chunks, you mask out some of those chunks. And the training task is the model has to predict the masked out chunks from the revealed chunks, like BERT. Yeah, exactly, like BERT. What are the chunks? Because this is multimodal. And, like, I would imagine the different channels contain wildly different levels of information.

54:53And I remember seeing something like 99% masking in OctoVC if I'm... Yeah, yeah, so... And I was like, that was kind of surprising because when you have, you know, 19,000 channels and maybe some of the channels are fairly, like, most of the signal is fairly sparse. Yeah. Then it seems like it'd be either there's a huge redundancy here in your data, or you really risk, like, just throwing maybe out what the path. Yeah. What are the chunks? That totally depends on which modalities we're talking about.

55:25So spatial transcriptomics, one chunk or one token might be the level of expression for a particular gene at a particular spatial location. For protein images, multiplex protein images, again, it might be, you know, the image patch for that particular protein at a particular location and so on. And, you know, for, like, histology images, again, those are usually just patches of the image. So pretty standard, like, vision transformer style.
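To make the notion of a "chunk" concrete, here is a hedged Python sketch that turns a tiny spatial expression map into tokens, one per gene per spatial location, as the speaker describes for spatial transcriptomics. The `Token` fields, the gene names, and the 2x2 grid are illustrative assumptions; the actual tokenization used in OctoVC or Tario is not specified in this conversation.

```python
from typing import NamedTuple


class Token(NamedTuple):
    modality: str   # e.g. "rna", "protein", "h_and_e"
    channel: str    # gene or protein name (hypothetical here)
    row: int        # spatial location
    col: int
    value: float    # expression level at that location


def tokenize_expression(expression, modality="rna"):
    """expression: dict mapping gene name -> 2D grid of expression levels."""
    tokens = []
    for gene in sorted(expression):
        for r, row_values in enumerate(expression[gene]):
            for c, value in enumerate(row_values):
                tokens.append(Token(modality, gene, r, c, value))
    return tokens


# Toy 2x2 tissue region with two made-up genes.
expression = {
    "GENE_A": [[0.0, 2.0], [1.0, 0.0]],
    "GENE_B": [[5.0, 0.0], [0.0, 3.0]],
}
tokens = tokenize_expression(expression)
print(len(tokens), tokens[0])
```

For histology or protein images the analogous token would be an image patch rather than a single value, in the usual vision-transformer style mentioned above.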

55:57Well, the masking and the maybe surprising result that, like, you can and actually need to mask out large amounts of the data to get the model to learn anything interesting. If you ran the hypothetical where you only mask out, like, 10% of the image, you know, maybe more like BERT, for instance, in language modeling, what do the models learn? And, you know, they learn these kind of, like, boring behaviors, like how to, like, continue an edge a little bit, you know, between two, like, regions of an object or something.

56:35So they can learn that task very well, but they don't end up learning anything about sort of the holistic structure of the image data. And we found pretty early on at Noetik that the same thing was true with these multimodal, like, transformers, where if you mask out a lot of it, there are actually pretty strong correlations between where protein A is expressed and where protein B is expressed. And forcing the models to learn them is really what gives it this predictive power.
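The masked autoencoding setup discussed here, hiding most tokens and asking the model to predict them from the few that remain, can be sketched as a simple split of a token list. The 90% ratio below is only an example value (the host recalls something like 99% for OctoVC), and `split_masked` is a hypothetical helper, not the actual training code.

```python
import random


def split_masked(tokens, mask_ratio, seed=0):
    """Randomly hide a fraction of tokens; return (visible, targets)."""
    rng = random.Random(seed)
    n_masked = round(len(tokens) * mask_ratio)
    masked_idx = set(rng.sample(range(len(tokens)), n_masked))
    visible = [t for i, t in enumerate(tokens) if i not in masked_idx]
    targets = [t for i, t in enumerate(tokens) if i in masked_idx]
    return visible, targets


tokens = list(range(100))  # stand-ins for image patches or gene tokens
visible, targets = split_masked(tokens, mask_ratio=0.9)
print(len(visible), "visible tokens,", len(targets), "to predict")
```

With a low ratio, most targets have visible neighbors and can be filled in by local interpolation; with a high ratio, the model has to rely on longer-range structure, which is the behavior the speaker says matters.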

57:05And so Tario, though, yeah, is an autoregressive model. Yeah, exactly. So, yeah, that was going to be the buy-in. And so, you know, prior models, including OctoVC, were of this masked autoencoding style of training objective. Tario is an autoregressive model, which, if you think about it, is kind of a particular choice of masked autoencoding, except, instead of randomly masking parts of the data, you're always asking the model to predict the next token in a sequence.
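The remark that next-token prediction is "a particular choice of masked autoencoding" can be made concrete: at each step the mask is not random but covers everything from the current position onward, and the target is the first hidden token. A minimal Python sketch, with hypothetical helper names and a toy sequence:

```python
def causal_mask(length, step):
    """True marks a hidden position: everything from `step` onward is masked."""
    return [position >= step for position in range(length)]


def next_token_examples(sequence):
    """Build (visible prefix, next token) training pairs from one sequence."""
    return [(sequence[:t], sequence[t]) for t in range(1, len(sequence))]


sequence = ["a", "b", "c", "d"]  # stand-ins for tissue tokens in some fixed order
examples = next_token_examples(sequence)
print(causal_mask(len(sequence), 2))
print(examples)
```

How a 2D tissue region is ordered into a 1D sequence is a design choice this sketch does not address, and the conversation does not say how Tario does it.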

57:37We know that this is something that scales very well with LLMs, like training on the next token prediction task. And it's still an open question: how do you get models of other data modalities to scale the way that LLMs have scaled? And Tario was not actually our first attempt, but one of our subsequent attempts to bring that autoregressive, like next token prediction task into modeling spatial transcriptomics data. We found that when we used this architecture and this task, we started to see, you know, much better scaling behavior where bigger models, and especially at longer context lengths, were really outperforming, you know, the smaller models at shorter context lengths.

58:23Because they can see further in the image? Yeah, that's probably a big part of it. I think, like, you know, there's actually a pretty subtle but very interesting result in that blog post with Tario, which is that you only really see the benefits of using larger models when you're looking at longer context lengths. And here, longer context really means, again, like you're seeing more tissue at once, more area at once. And I'm not, like, super deep into the language modeling literature, but I don't know if there's an analogous thing with, like, language models where, like, you only see these scaling behaviors at longer context.

59:05So it could be that what we're finding here is that, like, with patient data, you really do need to incorporate sort of more of the patient's spatial context to really get the models to learn these more complicated nonlinear patterns in, you know, the spatial transcriptomics and take advantage of it. Is it possible part of this is because you have some number of low expression genes and that the behavior is driven entirely by some better intermodeling of low expression genes?

59:36Yeah, definitely possible that, like, the more context you have, like, the more likely you are to catch kind of these low expression but highly predictive genes, etc. I would guess it's a combination of that and larger area. Like, we've done some experiments just, like, comparing models with the same amount of context but in smaller or larger areas. And there definitely seems to be an advantage to looking at larger regions of tissue as well. I want to hear about, you did a big deal recently, you got a lot of press, and I think have the distinction of being one of the only AI for bio tooling companies that is making money, so.

1:00:20Accidental. No. So could you tell whatever you can disclose about that? We'd love to hear. Yeah, so we were really excited to announce a deal with GSK, where we licensed them OctoVC, which is our virtual cell foundation model. So we announced that back in January. It's a $50 million deal, includes an upfront payment, milestones, and then separate from that, also includes an annual license fee, model licensing fee. You know, I think this was an attractive deal for both parties, for us and for GSK, because, you know, really the deal focuses on models that we've trained already on lung cancer, colon cancer, allows us to, you know, provide them with access to the models.

1:01:05You know, GSK is one of, you know, the top AI teams in biopharma. So, you know, they know how to use these types of capabilities, they can use them for their internal use, they can also use them to fine-tune on their data. So that was a really big sell for GSK as well, because, you know, GSK and every pharma is sitting on mountains and mountains of so-called translational data. So the types of data that we're training the models on come from clinical trials, you know, pathology specimens across many different therapeutics.

1:01:39You know, everyone's sitting on a lot of this data, and it's been very hard to unlock. And so all of a sudden, you know, GSK can use our models both to do simulations and to do therapeutic discovery, but they can also fine-tune the models on their data. And in a way, the model then becomes, you know, sort of GSK's version of the model. This was super exciting. You know, it was the first, you know, at least the first announced foundation model licensing deal in the space. And, you know, frankly, it was one, you know, we've been trying to do for a long time, even before Noetik.

1:02:10You know, I think a lot of companies have been trying to do these types of deals. And it's been, I think, historically slow for adoption on the pharma side. And it's been slow to demonstrate, like, a very clear value proposition for different types of capabilities. And so what's unique about this deal is it looks, you know, it doesn't look exactly like a software, you know, licensing framework for, let's say, a small amount of money with a number of seats that you license. It looks like a real business development deal in the industry where there's a very significant multi-million dollar cash upfront and near-term payment.

1:02:46But then the substrate of the deal is not a molecule. It's not doing therapeutic discovery work together. The substrate is actually a model, which is what really made this pretty unique. Why do you think there's appetite for this suddenly? And it seems like almost whiplash that, you know, it seems like only maybe a year or two ago that bio was dying and whatever. And now suddenly there's this deal, Boltz is getting a ton of attention. There's so much attention on Isomorphic.

1:03:17And people are AI-pilled to some extent. Increasingly more. I mean, maybe not totally, but increasingly more. People are, you know, in pharma, you know, across the industry are seeing the value of different capabilities. They're able to use some of the open source capabilities and they're able to demonstrate the value to themselves internally. And if you look at a pharma company, you know, these companies are working on dozens and dozens of programs. And so, you know, my opinion, just frankly, my opinion is that I think pharma increasingly wants to be able to access models,

1:03:48not just for one collaboration where you and I are working together on this one program. They want to be able to access the technology across the whole pipeline. And so I think that's going to create sort of a driving force for not just, you know, bespoke project-driven licensing, but actual broad licensing where a pharma can access the technology in many different therapeutic programs. Yeah. And I think also, you know, with the structure prediction models, protein structure prediction, binding prediction models,

1:04:18there is like this massive public data set. There are increasing amounts of data. People can generate data to augment that. So, you know, there's enough data to the point where people can train very good models, but maybe not just on the data that any one biopharma company has. And I think that the same is true, but even more so for the types of models that we are building, which are, you know, foundation models at the patient biology level where, like, you know, no one company,

1:04:49I mean, these companies may have a lot of data, but it's, you know, scattered, it's siloed, and pulling everything together to, like, train an actual foundation model may not be as easy as it sounds, like within a single company. Whereas we have just said, you know what, we're going to generate enough data ourselves to actually train a real foundation model. And that's the nice thing about being a startup here is, like, we can make that bet that, like, you actually do benefit from generating all of this data in a, you know, uniformized way,

1:05:24like very high quality, et cetera, and then use that to develop and train the models. And my opinion is that you need to have data at that scale before you can even think about developing models that actually work. It's like you can't do the AI R&D, like, or build the algorithms until you have a good enough data set to tell you whether your favorite algorithmic idea is actually working or not. That's a major advantage for us is, like, we have enough data to see, like, is my idea or someone else's idea

1:05:59about how to build a model, like, actually leading to improvements there. Yeah, I mean, this is a good point. I mean, so, like, sometimes people ask me, well, why doesn't GSK just generate your data? So we just started generating data for years. There was no model. It was like, how many years? Like, how, like, two years, maybe, a year and a half, at least, before we had the first trained models working? Like, maybe a year and a half we had the first. So, I mean, certainly, yeah, like, the OctoVC model, like, we trained in 2024.

1:06:29So, yeah, that's like two years after, yeah. So we're, what, four years in or so. So this is year four, and so we basically opened the lab. We hired a team. We got all the instruments. We started sourcing tumor samples. There was no prior here that any of this would work. Like, zero. Big, crazy, like, I was just going for it. And, like, we just started generating data and, like, sourcing human tumors, processing. We built this whole processing pipeline to get the tumors into, like, these arrays and the formats. And it takes weeks to, you know, it takes literally two weeks for a machine to run a couple slides on the spatial transcriptomics.

1:07:06So you've got, like, these two-week runs where you're processing two slides. And we're just churning data for months. And we couldn't even train a, we didn't even have enough data to train a model for, like, at least a year and a half. And then you're building, like, processing pipelines. You have to align all the data. You've got to, like, post-process it off the machine. So we sort of just built all this. And then, like, let's say 18 months later, hey, I wonder if this stuff. And then it was not, like, it wasn't obvious. There wasn't, like, oh, we're going to, like, off the shelf, you know, train this on some, like, open source architecture.

1:07:39You know, we've had, you know, Dan and the team have done a ton of work. Yeah, there wasn't really, like, anything major to go off of. I mean, there were, like, transformers developed for single-cell data. But, like, incorporating spatial data into that was, you know, again, there just, like, weren't really data sets out there that people had been able to develop on. So we do a lot of, like, custom model building. And I enjoy that. I think people enjoy that. Because we're doing a lot of hiring. A lot to build, custom models.

1:08:10I'm like, I'm, yeah, really unique, innovative, involved. Sorry, who are you looking for? Like, what kind of people? Anybody excited about doing ML research on, again, this kind of alien landscape of data where you really have to figure out what's working from first principles. And obviously, the work we do should have very, very large impact. So definitely not restricted to people who have a biology background. You know, people who just like tackling very challenging machine learning problems and are, you know, open to learning the minimum amount of biology necessary to, like, make progress.

1:08:49I think, you know, would be great candidates. Talking to you guys reminds me a lot of the Leash Bio labs, which I know that both of you are part of the Recursion Mafia. I'm not, yeah. Well, yeah, yeah. I'll bring you this, yeah. Yeah, yeah, yeah. Yeah, we're going to be on the show in the future, too. So, yeah, yeah. We're looking forward to that. But, like, it's interesting because both of you seem to have really similar philosophies and that, like, you have deep convictions that, like, you're just going to start collecting data before you know this is going to work.

1:09:20And you are going to just brute force it, go, go, go. And eventually it will work. And, you know, you have signs. I don't know. I think that's really impressive. I wonder, is there something about Recursion, which is in the water, which has led to this sort of thinking of just, like, we're going to commit to doing things at scale and it may not work at first. You have to hit a certain point before it will. I mean, we failed a lot at the beginning. Yeah. You mean at Recursion. At Recursion, yeah. And so you, and we had, I said we had to build it from first principles, and we really did.

1:09:50And so we spent many years trying to figure out, like, what should the data look like? Ian, myself, we're all involved in kind of platform development, how to design, you know, these data sets, how to design the experiments, iterative cycles over the years, seeing, you know, things that did work, things that didn't work. And so at the end of, you know, coming out of Recursion, I think what a lot of folks there had was, like, an understanding of what are the things we need to think about so that even if I want to design a different data set, you know, today, that's, like, totally different.

1:10:20What are the things that we learned and we had to learn, like, over mistakes, over, like, not mistakes, but, like, trial and error, basically, over that many months, that we would try to insert in our new approach? And so I don't know that every, everything that I've predicted at Noetik, in terms of, like, how to generate the data set, has been important, necessarily. I know that we could start at the very beginning and say, okay, well, let's make sure we do these 10 things. I know every one of these 10 things was important before. Let's at least make sure we do these 10 things.

1:10:51I don't know that all 10 things are important for us today, but I would presume that, you know, many of them are. And it lets you sort of leapfrog that process of trial and error a little bit. Certainly, we do have trial and error still, but hopefully we're not having to, you know, solve, like, you know, 15 problems. Maybe we're only solving three problems, four problems over time. So for small biotech startups, which are probably in the AI space, who are collecting their own data, their own data moat, like, do you have any advice or any suggestions about how to be more successful there?

1:11:25I think you sort of need to, I mean, you think ahead to, okay, what am I trying to do on the machine learning side? And, like, what is the right data for solving this problem? I think oftentimes I see, like, a lot of companies are like, okay, well, I want to generate X data set. I'm just going to generate X data set, and I'm going to do machine learning on that. Like, that might not be the right data set. You might not have designed it the right way. You know, it doesn't follow that, like, any data set is a machine learning data set.

1:11:55It doesn't follow that that data set is going to solve the problem you're trying to solve. So, I don't know, for me, it was really, and even founding Noetik, it was, okay, what problem are we trying to solve? And then what are the data that are going to help solve that problem, rather than, like, you know, going from the data directly to try to solve? I also, sorry, I also have a quick piece of advice, which is, like, you know, pay attention to where the technology is and, you know, where it's changing rapidly. So, you know, I finished my PhD in 2016.

1:12:27I did a lot of looking at spatial RNA, like, via this technique called in-situ hybridization, same technique that is, like, at the base of what we're doing. I could look at maybe two genes at a time on a single sample, and that took me a full week of manual work. And, you know, I came to Noetik, like, five years later, six years later, and all of a sudden, you know, there are platforms where you can look at 1,000 genes or 20,000 genes at once.

1:13:02You know, it's a single machine that can run this assay. It's expensive, but it's just, like, data beyond the wildest dreams of Dan Bear in 2016. And that is only improving, like, rapidly. So, I think it's important to see what the technology of today, you know, allows and also where it's going in terms of what data to generate. And what does that pitch look like? So, I'm going to generate data for a year and a half, and then I spend $50 million, and then...

1:13:34I mean, it wasn't 50. It was maybe closer to 10. So, yeah, I mean, it is, and just, so, yeah, so you have to do that if you, if, I mean, if you're going into a regime where there's no data, yeah, and you want to do something different, then, I mean, there's no shortcut to it, right? You're going to have to generate the data set. And so, you're not going to know the answer until it's there. And, I mean, and that's why a lot of companies are not going into that space where there are no data sets, because, you know, I think it can be challenging to do that.

1:14:07Yeah. I mean, I think a lot of smaller biotech AI startups will try this pattern where they first will either start with a public open source data set, or they will try a pilot where they internally collect a small amount of data and see if something works or doesn't. And oftentimes, there's almost, like, a critical point where below this, you're just not going to get a new signal, and you have to have conviction that you need to collect up to a certain point before you start, like, really driving something, like, fundamentally valuable.

1:14:39Yeah. Yeah, I mean, imagine trying to train a foundation model on, like, not enough data. Yeah, yeah. And then that's, it's sort of your clinical trial call, right? GPT-2, GPT-3, GPT-3, you know, GPT-1, 2, and 3, like, there was a clear progression there. As each one of them, you could see there was something which worked with scale, and there was this insight to, oh, we're going to scale this up. Yeah. You know, some kinds of biological data, like, the process of collecting lots of data is just very expensive to begin with. You can't just take something off the shelf and expect that you're going to hit the threshold of, you know, GPT-3, like, usefulness.

1:15:15Yeah. Yeah, so, yeah, it takes some conviction. It definitely takes conviction. I think, you know, it also takes sort of, like, a scientific belief that there's a lot out there, like, that we just don't know yet, and that you're not going to capture the biology you need to by having, right now, like, an agent that reads all of the biological literature. Because, again, that's just, like, a tiny slice of what's out there. Like, this is, I don't know if it's a great analogy or if I'm going to botch the history here.

1:15:46But, like, in astronomy, it required, like, Tycho Brahe, like, collecting this enormous amount of astronomical data at his observatory that then was the substrate for Kepler, you know, figuring out the first laws of motion of the planets. And then, you know, that was superseded by, like, Newton's laws and so forth. But, like, I don't, I sometimes don't know how you even get started without, like, this large repository of really high-quality data to begin with, and, you know, maybe there's, like, a tragedy of the commons problem here of, like, who's going to generate that data and who's going to capture the value of it.

1:16:25But I'm very glad that we're taking that bet and, you know, we're seeing it pay off. Yeah, I mean, this is not my expertise, but if, you know, hypothetically speaking, yeah, how much of PDB do you need to train? I mean, there were some people who argued that, yeah, and then you can get some pretty good models with, I think, 1%. 1%? Yeah. And there are people going back in the 1990s who argued that there was, the PDB was already complete in the sense of, like, if you had a sufficiently smart algorithm, you could have done a pretty reasonable job of protein folding even back then.

1:16:58Interesting. So, you don't need a lot to get a pretty big boost, but the community was sort of independently collecting PDB data for quite some time. Yeah. Without necessarily being convinced that this was going to lead to solving protein folding. Yeah. But then it was also usually quite, most of those structures were quite useful in and of themselves, so maybe that's the counterpoint, is oftentimes just knowing a protein was very helpful for some useful data set. And we did see, we did see a transition from, like, early data, but how many samples did we need?

1:17:31I'm guessing probably on the order of a few hundred before there was, like. Yeah, there was definitely a moment, like, very soon after I joined, where, like, the data set just kind of doubled in size overnight because there was, like, a huge bolus. And, like, the models immediately got a lot better at that point, and, you know, now we've run these more controlled experiments of seeing, you know, what happens if you train on 10% of the data versus 40% versus 100%. And what happens if you hold out all of the pancreatic cancer or all of the breast cancer?
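
The two kinds of controlled experiment mentioned here, training on a fraction of the data and holding out an entire indication, can be sketched as simple data splits. The toy cohort, the helper names `subsample` and `hold_out_indication`, and the random seed are all hypothetical; this says nothing about how the actual experiments were set up.

```python
import numpy as np


def subsample(sample_ids, fraction, rng):
    """Pick a random fraction of the samples for one data-scaling run."""
    n = max(1, int(round(len(sample_ids) * fraction)))
    return sorted(rng.choice(sample_ids, size=n, replace=False).tolist())


def hold_out_indication(indications, held_out):
    """Train on every other indication, evaluate on the held-out one."""
    train = [i for i, ind in enumerate(indications) if ind != held_out]
    test = [i for i, ind in enumerate(indications) if ind == held_out]
    return train, test


# Hypothetical toy cohort: one indication label per sample.
indications = ["lung", "colon", "pancreatic", "breast", "lung",
               "colon", "breast", "lung", "pancreatic", "colon"]
sample_ids = list(range(len(indications)))
rng = np.random.default_rng(0)

# Data-scaling runs at 10%, 40%, and 100% of the cohort.
scaling_runs = {f: subsample(sample_ids, f, rng) for f in (0.1, 0.4, 1.0)}

# Generalization run: the model never sees pancreatic samples in training.
train_idx, test_idx = hold_out_indication(indications, "pancreatic")

for fraction, ids in scaling_runs.items():
    print(f"{fraction:.0%} run uses {len(ids)} samples")
print("held-out pancreatic samples:", test_idx)
```

Comparing held-out performance across these splits is one way to judge how much scale and how much diversity a model needs before it generalizes to an unseen cancer type.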

1:18:03So, you know, we have a much better idea of what kind of diversity and scale we need now. I guess I would say if we were sticking to cancer, maybe we're not, like, that far off. I think, you know, again, if we end up generating a few hundred patients in a bunch of major and, you know, some minor indications, which we're, you know, going to do this year, like, maybe that's enough to generalize to kind of all cancer. Because there is a lot of shared biology in, you know, cancer and immune cells across different tissues and different, you know, mutations and so forth.

1:18:41But if you think about all of the disease biology that there is for a model to learn, you know, maybe that's, like, another order of magnitude. But even being able to solve all cancer biology would be pretty impressive. Yeah, to cure cancer would be great. Well, if it's all cancer biology, I did not say cure cancer, you know, so it's a different place. But at least if you go, Madeline, just sort of, like, just take one drug. If you could look at one drug mechanism across the whole of oncology, that's incredibly powerful.

1:19:11I mean, imagine what Merck has done with Keytruda. Like, Merck has run hundreds of trials with Keytruda. Like, it might even be over a thousand trials of Keytruda in different populations to find, you know, all these different indications. Okay, the subset of ovarian cancers, the subset of lung cancers, the subset of colon cancers. That's all been done, you know, by enrolling trials. If you can look at that biology from model embeddings and at least have a very well-defined starting point for, okay, if I'm going to run a trial,

1:19:46well, it doesn't have to be as broad as it would need to be if I didn't have any answer, then that can be a really powerful tool for, you know, a diversity of mechanisms. Yeah. Maybe it's just, like, last point, like, going back to the virtual cell hot takes. Like, you know, if your goal is to build, like, an actual mechanistic model of an individual cell and then build up from one cell to an entire tissue and then, you know, tissue to patient and so forth, like, you might need a lot more data and a lot more data modalities than, you know, just, like, gene expression or something like that.

1:20:22But, you know, we're taking much more of, like, a top-down approach of we're trying to first solve the problem of what is determining heterogeneity among actual patients and which of that variability is predictive of drug response. And my intuition is that you don't need to model the mechanism at the subcellular level necessarily to solve that problem of which patient should get which drug or, you know, which targets are important in which patients.

1:20:54And I saw a similar debate play out in neuroscience and computational neuroscience where, for a long time, people were really trying to build these biophysical models of individual neurons and then they were going to stitch them together into models of, you know, the brain and so forth. And what actually ended up working in, you know, in terms of building computational models of the brain and behavior is this abstraction, you know, we're just going to treat individual neurons as, you know, linear, nonlinear units

1:21:27and, you know, put them together in neural networks that are connected by, you know, linear weight matrices and, you know, stack a bunch of layers together and then build neural network models of the brain that abstract away kind of all of the details of biophysically what a neuron is doing. And, you know, those are now by far the most predictive models of how a given neuron is going to respond to real-world stimuli in a real brain. And I think that my bet is that the same is going to be true for these models too

1:22:02is that, like, by modeling sort of at the level of functional tissue where you have a bunch of cells interacting in, like, a disease context that that's going to get you to the problem of predicting kind of the patient-level behavior much faster than trying to first model a cell and then stitch a bunch of those cells together. Yeah, that makes sense to me. It's a good analogy of the good analogies. Do you have any call to action for the listeners? Yeah. I mean, I would say, one, everyone should be excited about biology.
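
The abstraction from the neuroscience analogy above, treating each neuron as a linear-nonlinear unit and stacking layers connected by weight matrices, can be sketched in a few lines. The ReLU nonlinearity, the layer sizes, and the random weights are assumptions for illustration; the speaker only says "linear, nonlinear units".

```python
import numpy as np


def ln_layer(x, weights, bias):
    """One layer of linear-nonlinear units: weighted sum, then a ReLU."""
    return np.maximum(0.0, weights @ x + bias)


def ln_network(x, layers):
    """Stack layers; each is a (weight matrix, bias vector) pair."""
    for weights, bias in layers:
        x = ln_layer(x, weights, bias)
    return x


rng = np.random.default_rng(0)
layers = [
    (rng.normal(size=(8, 16)), np.zeros(8)),  # 16 inputs -> 8 units
    (rng.normal(size=(4, 8)), np.zeros(4)),   # 8 units -> 4 units
]
stimulus = rng.normal(size=16)  # stand-in for a real-world stimulus
response = ln_network(stimulus, layers)
print(response.shape)
```

Nothing in this model describes ion channels or membrane dynamics. Each unit is reduced to a weighted sum and a threshold, which is the level of abstraction the speaker argues ended up being most predictive.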

1:22:35You know, sometimes a lot of my hot takes on X recently are just that I feel like there's a huge amount of enthusiasm in sort of, like, the mainstream tech ecosystem and, like, people aren't really following a lot of, like, what's happening in the biology space. But at the same time, like, you're hearing, you know, frontier labs saying, we're going to cure cancer. Yeah, people should actually look at the folks working on curing cancer or working on aging or working on areas of biology. These are really exciting, you know, problems. There are real, like, significant ML problems in the space.

1:23:07One call to action is I would love for people to just, like, be more stoked about learning about applications of machine learning in, like, biological sciences and, like, solving some of these hard problems because I think these are the problems that are going to, like, massively impact humanity in, like, the next 10 years. And we're just, like, really at the very beginning. Like, you know, maybe we're in the, like, first inkling of the ChatGPT moment for bio, but it's, like, very much just the very beginning. So we'd like... Catch it while you can make... Yeah. Yeah.

1:23:38In line with that to, like, really dig in and learn more about the details. I think, you know, a lot of the times it's presented as we have these protein folding models, we have these binding models, you know, we have AI for science agents that are, you know, like, reading all of the literature and automating these computational biology workflows. And I think it's important to realize that there are a lot of problems in AI for biology, AI for biochemistry, etc.

1:24:08And some of them, and they're very important, but, like, solving any one of those is not going to, like, solve the problem of how do we develop better therapeutics. And, you know, we're focused on, you know, a pretty particular slice of that process, which is, again, translating things that we know work well in some patients into actual, like, successful drug trials where we know exactly which patients to give them to. And that requires building foundation models at a particular level, you know, the patient

1:24:42level. But people should not be under the impression that, like, this is all going to be solved immediately because, you know, AI agents like LLMs are going to just read the literature and figure out what the right drug is. Like, there are a lot more data to generate. There's a lot more ML problems to solve. And there's the need to translate those methods into actual successful drugs. And there's a lot of different places to contribute. There's a lot to do. Yeah, there.

1:25:12Great. Thank you very much. Here we are.
