Steadcast
Latent Space

🔬Why There Is No "AlphaFold for Materials" — AI for Materials Discovery with Heather Kulik

March 24, 2026 · 35 min · 6,026 words

Show notes

Materials science is the unsung hero of the science world. Behind every physical product you interact with are decades of research into getting the properties of materials just right. Your gym clothes contain synthetic fibers developed over decades. The glass screen, diodes, and chip substrate technology needed to read this blog post were only viable thanks to many teams of materials scientists.

Our guest, Prof. Heather Kulik, was one of the first materials scientists to realize that there was alpha in combining computational tools with data-driven modeling: she did AI for science before it was cool. She has a hard-fought perspective on how to succeed in this field. Yes, she believes the wins are real. To get there, you must work hard to deeply integrate domain expertise with AI techniques, and also maintain a discriminating mind. Ultimately what matters is that you succeed in the lab, and nature doesn’t care how hyped a model is. These lessons personally resonated with the Latent.Space Science team and our own experience. This episode is a must-watch for all aspiring AI for science practitioners.

A few highlights:

Designing new polymers with AI: Heather’s group recently used AI to design new polymers that are significantly tougher. These materials were created and tested in the lab, and the scientists who built them were surprised by the designs. The AI had figured out that certain building blocks could break in a novel way, through a purely quantum mechanical effect. After the group convinced their lab collaborators to actually synthesize the design, the material turned out to be four times tougher!

The twenty-two-atom ligand challenge: When asked about the role and need of human scientists, Heather points out that AI has a strong understanding of academic chemistry, but is still lacking intuition. Every time an LLM is updated, Heather asks it to design a ligand that contains exactly twenty-two heavy atoms. She has yet to find one that can succeed at this seemingly simple task that any expert could do in a second! Is this the chemistry counterpart to counting ‘r’s in strawberry?

Side note: Heather joked that this comment would date itself immediately, so we decided to see if this was still true three months after recording. We found some interesting results! We asked both Claude and ChatGPT to design a 22-atom ligand for both a metal-organic framework (MOF) and a kinase protein.
* For the kinase, both models got it right: Claude pulled out RDKit in a Python script and iterated on several designs, whereas ChatGPT just one-shotted it.
* For MOFs, both models got it wrong, generating ligands with 21, 23, or 24 atoms, but stubbornly never 22.
Is there something different about how LLMs reason in the materials and bio domains?

Materials vs biology: The two biggest domains of AI in science have been biology and materials. We asked Heather if there could be an AlphaFold moment for materials. Her answer reframes how we should think about the field:
* First, the datasets in materials science are woefully lacking in comparison to the bio world. The closest to ground truth in most cases are noisy DFT datasets. These are just approximations to the real world! The datasets that are accurate are all boring. As Heather quipped, “We have really good datasets for really boring chemistry.” Furthermore, good experimental structures are hard to come by and require interpretation. So generating high-quality, novel datasets at scale would really drive the field forward.
* More philosophically, AlphaFold is making predictions in a fairly limited space: there are just twenty amino acids. Sure, even here AlphaFold doesn’t get everything right, but it seems plausible that one could learn the entire design space. For materials, each element is a new set of interactions and chemistry, with little to no transferability.
This is a massive open problem in materials science that we hope some of the smartest AI scientists will want to work on!

The difficulties of trusting the literature: Heather’s team has spent the last few years using NLP and later LLMs to extract data from the literature. Even a few thousand data points from these papers can be valuable for guiding her group’s work. One surprising result: sometimes the reported values for a property (say, a temperature) do not match the graphs in the same papers! So there’s lots of potential in using LLMs to mine data from the literature, but it has to be done with care.

The role of academia in an ever-changing world: One theme that has been running through many of our conversations has been the changing role of the academic, and the scientist, in science. When startups are raising hundreds of millions of dollars and hyperscalers and Big Pharma are all ramping up AI-for-science efforts, the academic researcher needs, more than ever, both resources and judgement about which problems to chase. Resources include data that is organized for machine learning, access to high-throughput experimentation labs, and compute. These are all things that academics can build together. More importantly, Heather emphasizes curiosity about problems that haven’t hit the radar of the heavily capitalized AI companies. After so many years on the forefront of AI for science, Heather’s judgement that chemical engineering and materials science still need curious people asking questions with no clear path to money is a welcome beacon in the AI fog.

The full video podcast is on YouTube!

This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit www.latent.space/subscribe

Highlighted moments

I just ask it, please design me a ligand that has 22 atoms. I can never get an answer that has 22 atoms.
Jump to 0:31 in the transcript
We have really good data sets out there for really boring chemistry.
Jump to 17:08 in the transcript
every time someone comes out with kind of a new data set trained on a, and they call it a foundation potential foundation model, it looks really good. And then you get it into your lab and you say, okay, I want to use it for this problem I'm really excited about. And it starts doing kind of wacky things like molecules fall apart.
Jump to 20:00 in the transcript
you can get the temperature at which a material will break down two ways. One, you can get it from the graph. And two, you can get it from what the authors say about how they interpret the graph. And those two things do not line up.
Jump to 27:39 in the transcript

Transcript

Introduction to Chemistry

0:00There's a school of thought that why should I bother to learn chemistry or physics or whatever when ChatGPT, you know, has PhD-level understanding of that anyway. ChatGPT is super good at Wikipedia-level chemistry knowledge. I'm really interested in molecular design. Like, how do you find a new ligand that can go into a transition metal complex? And what that means is some combination of atoms, and it's going to bind to the metal and it's going to change its properties.

0:31The thing I constantly do every time an LLM is updated is I just ask it, please design me a ligand that has 22 atoms. I can never get an answer that has 22 atoms.

Guest Introduction

0:45Hi, we're really excited to have Heather Kulik here. She's a professor of chemical engineering at MIT. Heather has done some amazing work in materials science and computational chemistry. But we're particularly excited to have her today because she has for almost her entire career been working at the intersection of data-driven methods and AI, applying them to improve and understand materials. And she has a lot of really interesting opinions about what works and how you approach these problems to get the most out of them.

1:24So, yeah, we're really excited to have you here. And yeah, maybe to get started, can you just tell us about like one of the coolest things you've done in your opinion for an AI engineering audience?

Accelerated Discovery

1:36Yeah.

1:36So, my group, we work a lot in accelerated discovery of new materials. When I first started out, we were just really using AI to make predictions we'd normally make with computational models, just make them faster. But the question I would often get when we were doing that was, okay, but what's surprising? What's sort of something from AI that like I wouldn't have already known if I were a really smart chemist or a really smart material scientist? And, you know, you make all these computational predictions.

2:08Has anyone actually made in the lab something that you predicted? Recently, I was able to do a really nice demonstration where the answer to both of those questions, you know, was very clear from the work. So, we were able to screen with artificial intelligence a set of thousands, tens of thousands of materials where each individual experiment, if it were done in the lab, would have taken months to years. And through AI, we uncovered this sort of unexpected chemical phenomenon that led to an emergent property in what's known as a polymer network, so plastics, that would make the polymer about four times tougher.

2:50And when we showed the design that AI had come up with to the experimentalists, they were really surprised. They would have never come upon this on their own. And then we were able to convince them to make it in the lab. And, in fact, it was this tougher material. And where this has applications is if we can make plastics tougher, then we, you know, can get more use out of them. And it will ultimately address some of the problems we have with overall durability and use of plastics. So, I think that's an example of some of the promise of AI in materials discovery.

Machine Learning Applications

3:23Cool.

3:23So, can you dig into that a little bit? What was the surprising chemical discovery there? So, it's sort of hard for me to think about how to explain it without getting too deep into the chemistry. But basically, these are molecules that have to break apart. And when they break apart, they make the overall structure that they're in tougher. So, a little part of the material breaks and that helps to dissipate the force. Normally, the way you would think about making it easier to break apart these small molecular components might be to create a hinge.

3:58So, they can kind of peel open instead of sliding apart. But what we discovered was that there was a fully quantum mechanical phenomenon. There was really no way for us to predict this, you know, based on anything else where the electrons just move around in a different way so that at this moment where the molecule is going to break apart, it's a lot more stabilized. These types of concepts, they're sort of similar to what's kind of known about how catalysts and enzymes work. But it had never before been shown in these polymer materials.

4:30So, this is sort of like the fuse in the Bay Bridge that allows the bridge to keep its structural integrity during an earthquake by having a controlled break. Is that kind of... Yeah, yeah. So, we weren't the first ones to discover that phenomenon on its own. The general phenomenon, that putting in little places that could break makes the network stronger, was published in Science a couple years ago. But the specific way we came up with to design the material to do this, that was our new contribution.

5:02How did you... You mentioned that, you know, you started off in accelerating kind of existing methods using, you know, sort of enhanced computation. What caused you to take that leap to more machine learning-based methods? So, you know, I was drawn to data-driven discovery pretty early on, sort of before I even knew the phrase machine learning. And I guess I was just really excited by what you could learn from patterns in data. Back then, we were trying to call it cheminformatics.

5:33And just sort of trying to think about, you know, in what ways could you unearth trends in data? Because I started my career actually working kind of one molecule at a time or one material at a time. And I was just impatient. I wanted to be able to sort of understand not just one molecule at a time and write one paper about it, which is something people would have been happy to do back when I was starting my career in the mid-2000s, but to actually kind of unearth broader trends in how you understand how material is going to behave.

6:10Somewhere around 2015, 2016, I realized it was a bad idea to call things cheminformatics. And it was a good idea to start calling things machine learning. And I had a brilliant student, Jon Paul Janet, who's now, I think, an assistant director at AstraZeneca in Sweden, running their inverse design program. He and I originally talked about all sorts of ways of thinking about materials design, and he very quickly adapted that into training neural networks.

6:41And that's sort of when, you know, I thought we were in the first sort of hype cycle, the first wave, but I think compared to what's going on right now, it was a tiny baby wave. I read in your paper that that was actually a class project or something. Yeah, yeah, that's right. You know, he just said, I have to do something for my homework, and that's how we got into it. I've also read in your paper that you've done a lot of work, like, slightly more recently on active learning and using.

7:12Can you talk a little bit about that? Yeah, yeah. So even that polymer example I was giving, that would have been active learning in principle, but we sort of stopped after one generation because we had exhausted the space. But I think one of the areas where machine learning kind of just with what's out there right now has the most promise in chemical sciences is in solving multidimensional challenges. So right now we're working on a project in metal-organic frameworks where we're trying to solve trade-offs relevant for direct capture of CO2 from the air.

7:51And so in order to find a material that's good for that, we would worry about its cost, its stability in, say, aqueous humid environments, its ability to take in CO2 over other molecules, its mechanical stability, is it going to hold up under force, is it thermally stable, can you heat it up, and will it be okay? I'm just naming a few. But in total, right now in an active learning campaign, we're working on seven different objectives. And usually, just even for a not-so-accurate machine learning model,

8:26you get, you know, at least 100 to 1,000-fold speed-up for every dimension you're optimizing over. So the real promise is going to be in searching for that needle in a haystack with, say, seven objectives and doing something where you're not waiting for the models to be accurate before you start doing that optimization. That's really the promise of active learning. Yeah, that has an interesting parallel in my mind to the pharma world where you have a lot of computation work

8:56in the discovery process, but actually getting the drug into people's hands is often the bottleneck. And also, you know, what happens to the drug when it sits on the shelf for three months? That kind of thing, yeah. These metal-organic frameworks, what are the kinds of things that they're used for? They're used most in gas storage, sensing, and separations. They're used in combination with polymer composites.

9:26They have really strong promise for CO2 capture especially, but people have looked at them for catalysis. The limitation on catalysis has been, you know, how stable are they? So, one of the things we've spent a lot of time on is trying to be able to predict their stability. But they're used for all sorts of things, even drug delivery. What they have the opportunity to do is really place precise chemical groups in specific orientations that can ultimately allow for what's known as host-guest interaction.

10:02So, basically, kind of create a glove to have a targeted interaction with a guest molecule in the metal-organic framework. I see. And just for the non-chemist, metal-organic framework, the Legos for chemistry, is that... Yeah, yeah. Metal-organic frameworks, I think, are going to be a little bit more of a household name among some engineers because the discoverers of those materials just won the Nobel Prize in chemistry this year. So, as much as that can make something in chemistry a household name, but they're basically like Tinker Toys or Legos.

10:40And they have different building blocks that can be combined in basically infinite ways to create very precise chemistry. I see.
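One common way to frame the kind of multi-objective trade-off described in this section (cost, humidity stability, CO2 selectivity, mechanical and thermal stability, and so on) is Pareto dominance: keep only the candidates that no other candidate beats on every objective at once. The sketch below is a minimal illustration of that idea, not a description of the group's actual workflow. The candidate names and scores are made-up placeholders, only three objectives are shown, and every objective is assumed to be oriented so that larger is better.

```python
from typing import Dict, List, Sequence


def dominates(a: Sequence[float], b: Sequence[float]) -> bool:
    """True if a is at least as good as b on every objective and strictly
    better on at least one (all objectives oriented so larger is better)."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_front(candidates: Dict[str, Sequence[float]]) -> List[str]:
    """Return the names of candidates that no other candidate dominates."""
    front = []
    for name, scores in candidates.items():
        beaten = any(
            dominates(other_scores, scores)
            for other_name, other_scores in candidates.items()
            if other_name != name
        )
        if not beaten:
            front.append(name)
    return front


# Hypothetical model-predicted scores: (stability, CO2 selectivity, negative cost).
# These numbers are placeholders for illustration only.
predicted = {
    "mof_a": (0.9, 0.2, 0.5),
    "mof_b": (0.6, 0.8, 0.4),
    "mof_c": (0.5, 0.7, 0.3),  # worse than mof_b on every objective
    "mof_d": (0.9, 0.2, 0.4),  # ties mof_a twice, loses on the third
}

survivors = pareto_front(predicted)
```

In an active learning loop, a filter like this would typically be applied to model predictions to decide which candidates deserve an expensive calculation or experiment next, with the model retrained on the new results.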

Pre Machine Learning Techniques

10:48Maybe for context, could we step back in, like, what are the techniques you were using before you started or maybe in parallel with machine learning? And how does machine learning help you advance those? Like, what are the roles of the two? So, I started my career studying what's known as transition metal catalysis. If you look at the periodic table, the middle of it contains a bunch of metals. A good example would be iron. And all of those things sitting in the middle of the periodic table, they have what's referred to as an open shell.

11:22So, the electrons in those materials are not paired and, as a result, they're more reactive. For instance, different combinations of these metals give rise to the catalysts that are used in a large number of transformations, including the ones that feed and sustain most of the world's population, such as the Haber-Bosch process for ammonia synthesis. And the way, going back 20, 30, 50 years, that people understood these materials and could enable their rational design is through quantum mechanical modeling.

12:02Quantum mechanical modeling, by using approximations to the Schrödinger equation, can be very accurate, but it's very computationally costly. And so, a single quantum mechanical prediction, depending on the level of fidelity used, could take hours to days to weeks. And that's what I would have normally been doing before I got started in AI. Some of what we do these days is accelerating those quantum mechanical predictions. And an area that I'm particularly excited about is that not all quantum mechanical approximations are equal.

12:41And you can actually use ML models to kind of predict what the best approximation to use is, depending on the material studied. Is that, like, closer is better, or is it not really distance-related? In terms of which method is the right method to use? Yeah. So, it actually turns out to be quite complex. You can't just determine it from heuristics. So, in one area, we actually use the quantum mechanical wave function as input to neural networks to predict what the right method to use is and learn that mapping.

13:15I see. That's probably going to be in the best part. That sounds like a challenge. Yeah, it's going to be in the best part in the next, uh... The cool, like, 22-atom ligand challenge. Yeah. Go. I have a spicy question I want to ask. So, there's a school of thought that why should I bother to learn chemistry or physics or whatever when ChatGPT, you know, has PhD-level understanding of that anyway? And shouldn't I just focus on being really good at using AI for stuff?

13:45So, I want to hear your thoughts. My personal experience, and this will date itself immediately, is that ChatGPT is super good at Wikipedia-level chemistry knowledge. But one of my favorite things to actually throw at GPT, as an anecdote: I'm really interested in molecular design. Like, how do you find a new ligand that can go into a transition metal complex? And what that means is some combination of atoms, and it's going to bind to the metal and it's going to change its properties.

14:19And so, the thing I constantly do every time an LLM is updated is I just ask it, please design me a ligand that has, uh, 22 atoms. So, the first time I've done that, there are many ligands out there that have 22 atoms. And then I say, I want it to bind to the metal with two nitrogen atoms. I can never get an answer that has 22 atoms. So, then you can try a range and see how many times you can get a range. And so, that's maybe a trivial thing, but that's something that an expert chemist, you know, could do in a second.

14:52Um, so there are really good introductions to chemistry that I think you can get through conversations with an LLM. Um, you can get a lot of insight into an area you're unfamiliar with. And, for sure, things have improved a lot. Like, when I first tried typing in, you know, which exchange correlation functional should I use for this type of chemistry, the answers were completely wrong. They looked right, but they were completely wrong. I think things have gotten better because that knowledge is out there on the internet and it's in the training data. But I think there's a lot of things that, um, probably backing up a moment.

15:27Um, you should learn chemistry well enough to know when, when these models are right or wrong. Um, and if you don't know any chemistry at all, it's hard to know if you're, if you're assessing correctly. But I think that there are a lot of things that you don't have time to do a deep dive into that you can now get from, say, an LLM that can augment knowledge. But I think you have to start from somewhere and then use it as a tool rather than starting from zero and relying blindly on what an LLM will say. But one of my favorite things, if someone can get in one shot an LLM to generate me a 22 atom ligand, I would, I would love to see it.
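Checking an answer to the 22-atom ligand challenge is mechanical once the ligand is written as a SMILES string. The sketch below is a minimal, standard-library-only counter that assumes simple SMILES input (bracket atoms plus the usual organic subset). It counts every non-hydrogen atom, and it only checks that at least two nitrogen atoms are present, which is necessary but not sufficient for binding a metal through two nitrogens. The function names and the example molecule are illustrative choices, not anything from the conversation. For real work, a cheminformatics toolkit such as RDKit would be the more robust choice.

```python
import re
from collections import Counter

# Simplified SMILES tokenizer: bracket atoms first, then two-letter halogens,
# then one-letter aliphatic and aromatic atoms of the organic subset.
ATOM_PATTERN = re.compile(r"\[[^\]]+\]|Cl|Br|[BCNOPSFI]|[bcnops]")
# Element symbol inside a bracket atom, skipping an optional isotope number.
BRACKET_ELEMENT = re.compile(r"\[\d*([A-Z][a-z]?|[a-z]{1,2})")


def heavy_atom_counts(smiles: str) -> Counter:
    """Count non-hydrogen atoms per element in a simple SMILES string."""
    counts = Counter()
    for token in ATOM_PATTERN.findall(smiles):
        if token.startswith("["):
            match = BRACKET_ELEMENT.match(token)
            if match is None:
                raise ValueError(f"unrecognized bracket atom: {token}")
            element = match.group(1).capitalize()
        else:
            element = token.capitalize()
        if element != "H":  # explicit hydrogens are not heavy atoms
            counts[element] += 1
    return counts


def meets_ligand_spec(smiles: str, n_heavy: int = 22, min_nitrogen: int = 2) -> bool:
    """Check the heavy-atom total and a minimum nitrogen count."""
    counts = heavy_atom_counts(smiles)
    return sum(counts.values()) == n_heavy and counts["N"] >= min_nitrogen


# Illustrative candidate: a 2,2'-bipyridine core with two pentyl chains.
candidate = "CCCCCc1ccnc(c1)-c1cc(CCCCC)ccn1"
# Plain 2,2'-bipyridine, which is too small to pass.
bipyridine = "c1ccnc(c1)-c1ccccn1"

candidate_total = sum(heavy_atom_counts(candidate).values())
bipyridine_total = sum(heavy_atom_counts(bipyridine).values())
```

A check like this makes it easy to score an LLM's answers over many attempts, or over a target range of atom counts, rather than inspecting each proposed structure by hand.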

16:07What do you think are the biggest gaps that machine learning has, from your experience? Like, if you are an aspiring ML engineer looking to take on a new problem from the machine learning side, what do you think someone could work on that would really help the chemistry side? There are a lot of challenges out there where the data sets aren't large enough or diverse enough. And so I think they've attracted less interest. So the ones closest to my heart are reactivity predictions. So predicting which reactions will occur and why, especially in complex phenomena, like with multiple elements, and sort of predicting those transformations.

16:50Uh, another thing that I think there's not enough data on is just more diverse chemical bonding and more diverse chemistry. For me, that's transition metals, but there's also questions of warm, dense materials, sort of exotic phenomena. We have really good data sets out there for really boring chemistry. So, you know, probably even if you're not a chemist, you're familiar with organic molecule data sets and organic molecules binding to proteins.

17:22Those are the common data sets out there. There's lots of challenges out there where the physics is much more complex, things like how does matter behave when, uh, you shine light on it and you excite it into excited states. All sorts of things like that receive relatively little attention because, you know, there may not be a benchmark or a leaderboard yet for that. And so maybe it's on us chemists to generate more data sets so those leaderboards are out there. But there's definitely, you know, a lot of interesting chemistry for which there has been less attention.

17:57So in the protein world, there's CASP, right? And people have been working on that for a while and this led to AlphaFold. Like, without CASP, AlphaFold probably wouldn't exist. Is there an equivalent to CASP in the materials science world? So there are all sorts of repositories of fairly low-fidelity DFT data on crystalline materials. So the Materials Project, the Open Catalyst Project, these do provide good leaderboards, but one of the limitations is that the data comes from not very high-fidelity density functional theory.

18:30So I'd say that's a second, um, challenge is that we're all, all the smartest ML engineers right now are learning on data that is not going to be reflective of experiment. There aren't big experimental data sets. For example, um, one of the advantages of things like CASP is that it comes from an experimental ground truth. Um, whereas, uh, that aspect just isn't available in materials as much. We were talking about CASP and, you know, the role of CASP and AlphaFold.

19:02Do you think that there is, um, like a problem, a way of phrasing this, that we could start collecting data at scale, that we could, um, you know, really have a community challenge, which breaks open some open problem in your mind. And maybe like, uh, maybe actually even stepping back beyond that, what, what would you want to have if there was like an AlphaFold, um, for materials, what would you want it to do? One kind of murky area. So maybe I'm not going to directly answer, answer this question.

19:34And one murky area for us is electronic structure calculations are expensive and, uh, they should, in principle, give you the right answer. They should, from first principles, give you the right answer of how a material is going to behave. And a lot of people are scaling these up right now with, uh, machine learned interatomic potentials on training data. And every time someone comes out with kind of a new data set trained on a, and they call it a foundation potential foundation model, it looks really good.

20:09And then you get it into your lab and you say, okay, I want to use it for this problem I'm really excited about. And it starts doing kind of wacky things like molecules fall apart. I won't name names, but, um, there was one that made a huge splash this summer and people started declaring, oh, this method is dead. This method is dead. We're all going to just use these neural network models now. In my hands, the one I'm still not naming is only about five times faster than my fastest DFT calculation on a GPU.

20:41And it also doesn't work all the time. Um, so I would say we need a more transparent way of, of trying to figure out, um, if these models can really replace conventional physics-based modeling. Um, if they could, if I could just give up ever doing a DFT calculation again and just rely on machine learned potentials. And if they were, you know, two orders of magnitude faster than the traditional approach, that would change, that would change how we're doing science. But there needs to be a little more rigor on what we consider, you know, just fitting data when that data maybe lacks quality.

21:20Or there needs to be a little bit tougher requirement for how we say this model can really replace the physics-based modeling. Yeah. So one of our theses is that the interface between bits and atoms is really the bottleneck, right? The actual activity of trying things in the lab is the bottleneck. And you've addressed that to some extent with active learning. Um, but I think that there's also an extent to which just pure process and automation, good operational practice,

22:01those are important things. You can push to automation on the one side, but on the other side, that creates brittleness. So how do you think about kind of bridging that gap to experimental chemistry and, um, using that sort of as nature's computer to figure out things for your design process?

Bridging Gap to Experimental Chemistry

22:23Yeah.

22:23So, so there are a lot of really smart people working in, um, high throughput synthesis and experimentation and autonomous labs. I think the thing that, uh, that's interesting to me in that space at least is that there are some types of experiments that at least as of the last conference I went to on this are really hard for autonomous, uh, high throughput experimentation, but are really easy for a human and vice versa. And then there's, you know, the serendipity that a human might experience in the lab that a couple of people have tried to think about like, well, how do you, how do you introduce that noise into high throughput experimentation?

23:04So I think that's a challenge. Uh, your question also brought to mind another point that I am by no means an expert on, but most people who actually work on getting materials to the device scale, say something that would be in your television or something like that, will tell you that it's not just the material, it's the process. And I think we're at ground zero, where we're nowhere when it comes to, like, well, how do we machine learn not just the structure and the properties, but also the role that processing plays.

23:39Um, I don't think we know anything about how to do that. Maybe for non-experts, like with protein structure, it's really easy to imagine, like, oh, you can see these proteins, and we can run some simulations and see them wiggling around. And, uh, the structures look really pretty. What does the data look like for materials science? There are the computations like DFT, which I think give you something that looks like a crystal structure you can imagine. But then is there also experimental data where you can observe that crystal structure? Or is this mostly sort of probes where you're measuring individual properties, which are kind of collective and not fine-grained?

24:14So experimental structures are available and the example I was giving is something we know is stable and we've seen a structure of it before and it will fall apart with some of these models. The challenge here is that, um, what AlphaFold has done really well is predict structures of globular proteins, primarily with 20 natural amino acids. I could actually point to lots of cases where AlphaFold fails, too, for more interesting chemistry. Um, the challenge is that you have a lot more than 20 building blocks when it comes to materials.

24:47Um, and so there's lots of different ways to think about chemical bonding. And right now, no potentials are really robustly encoding all of that bonding, especially with respect to metal-organic bonding. Yeah, maybe a different way of saying it is, like, AlphaFold is solving ground state structures. Like, it's not looking at dynamics, which is, I think, consistent with some of your statements about needing quantum mechanics for catalytic enzymes. But you're saying that even at, like, just ground state properties, there are too many parameters and there's not a clear set of interactions that is limited to a small number of building blocks.

25:27The bonding is, is highly variable across all of material space. Now, there's simple regions of material space. You can pick aluminum. Aluminum is very boring, and you can write down, people in the 60s could write down on, on paper, you know, how you need to model aluminum. That's something that is pretty easy to fit a neural network potential to. Um, but then, if you want to get over to iron oxide, and then, if you want to get over to, um, high entropy alloys, there are definitely cases where people are using these methods.

25:58Um, but I'd say a big challenge is that when you go to bigger length scales and time scales, there's no real way to know if you're right or wrong. The experimental data is not there. Even interpreting, say, an image of an experimental surface, which you would want to do, requires some degree of interpretation of that image. So it's just hard to know from experiment or from other computations if these types of models are correct.

26:31 And they're certainly not correct across all of chemical space. And I'd say they could fail more catastrophically than AlphaFold does, though there are definitely failures of AlphaFold too. Switching gears a little bit, I read in your paper also that you had done some work with integrating textual information from papers into your models, so it's kind of the AI that we all know and love right now.

Integrating Textual Information

26:52 Can you talk about what kind of lift that gives the models, and how did you actually do that integration? Yeah, so we started, I guess, about five years ago. When we first started doing it, we were just doing sort of standard natural language processing and graph digitization. These days we use LLMs, but just to try to extract data sets of properties from the literature, wherever people are widely reporting properties.

27:22 And what we noticed is that there's a lot you can learn from these models. Even on the scale of a few thousand data points, you can do things like predict the temperature at which a MOF will break apart based on experimental reports. But one of the funniest things I think we noticed is that you can get the temperature at which a material will break down two ways. One, you can get it from the graph. And two, you can get it from what the authors say about how they interpret the graph. And those two things do not line up.
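The graph-versus-text mismatch described here can be checked mechanically once both values have been extracted. The following is only a minimal sketch of that kind of consistency check, not the group's actual pipeline: the record fields, the example materials and temperatures, and the 25 °C tolerance are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class StabilityRecord:
    """One literature report of a breakdown temperature (hypothetical schema)."""
    material: str
    t_graph_c: float  # value read off the digitized graph, in deg C
    t_text_c: float   # value the authors state in the text, in deg C


def flag_disagreements(records, tol_c=25.0):
    """Return records whose graph and text values differ by more than tol_c."""
    return [r for r in records if abs(r.t_graph_c - r.t_text_c) > tol_c]


# Made-up example data, for illustration only.
records = [
    StabilityRecord("MOF-A", 350.0, 360.0),
    StabilityRecord("MOF-B", 280.0, 400.0),
    StabilityRecord("MOF-C", 420.0, 415.0),
]

flagged = flag_disagreements(records)
for r in flagged:
    print(f"{r.material}: graph={r.t_graph_c} C, text={r.t_text_c} C")
```

Flagged records would then go to a human for review, which matches the manual-checking overhead described in the conversation.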

27:54 So one of the challenges, I think, with literature extraction from papers would be the obvious mistakes people make. You know, no one's perfect. But the other would be just that people interpret their results in different ways. And so if we're building models based on those interpretations, that's a challenge. In terms of LLMs, they've come a long way in terms of literature extraction, but they're still definitely prone to false positives. And I think the amount of time we spend checking on LLMs to make sure that the data we're ingesting is accurate is definitely an overhead on those types of workflows.

28:32 I see. And what about the way that it might bias the discovery process, right? Because you have this known literature, and your job as a chemist, kind of sort of, is to find new stuff. But if your computational method is pulling in literature, then maybe it's biasing you towards the previously reported results instead of something new. You know, one of the ways we try to address that is we try to train a model on that literature, but then apply it to new structures that have never been seen before, and try to really look at how far we can extend the model.

29:05 But we are trying to answer this in general. There are repositories out there of experimental data where you can have a sense of when it was published, what the structure is, and what it was used for. And we're really trying to build generative models on top of that now, to try to be able to say: if I know about the first 30 years of a field, can a model trained on that predict the next 20? I think that's an open question. And which model is best at that? Maybe it won't get all of them, but of those discoveries that we think are new in the most recent 20 years, maybe some of them are trivial for a model to generalize to, whereas others are not.
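The "first 30 years predicts the next 20" question amounts to a time-split evaluation: train only on records published before a cutoff year, then ask how many later discoveries the model recovers. Below is a minimal sketch of that bookkeeping under stated assumptions: the record layout (`id`, `year`), the example years, and the `predicted_ids` set standing in for a generative model's output are all hypothetical.

```python
def temporal_split(records, start_year, train_years):
    """Split records into those published before and on/after the cutoff year."""
    cutoff = start_year + train_years
    train = [r for r in records if r["year"] < cutoff]
    test = [r for r in records if r["year"] >= cutoff]
    return train, test


def fraction_recovered(predicted_ids, test):
    """Fraction of later-published structures that appear among the predictions."""
    later = {r["id"] for r in test}
    if not later:
        return 0.0
    return len(later & set(predicted_ids)) / len(later)


# Made-up publication records, for illustration only.
records = [
    {"id": "a", "year": 1980},
    {"id": "b", "year": 1999},
    {"id": "c", "year": 2004},
    {"id": "d", "year": 2005},
    {"id": "e", "year": 2018},
]

# First 30 years (1975-2004) for training, everything after for testing.
train, test = temporal_split(records, start_year=1975, train_years=30)

# Stand-in for structures proposed by a model trained only on `train`.
predicted_ids = {"d", "x"}
recovered = fraction_recovered(predicted_ids, test)
print(f"recovered {recovered:.0%} of later discoveries")
```

Splitting by publication year rather than at random is the point of the design: a random split would let the model see structures from the very period it is supposed to predict.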

29:47 I think in an ideal case where we have the available literature data and we don't know, we could use uncertainty quantification to then identify, okay, these would be the most interesting materials to get into our data set. I see. And those data sets, just for people who are interested in getting involved, what are some of the best ones to get started with? I don't know about the best. We've curated a few thousand data points of metal-organic framework thermal stability, as well as metal-organic framework activation stability and water stability.

30:18 Other groups have curated other measures of stability. They're all out there. They're on our website. That kind of thing. Awesome. Do you imagine it being useful to create an initiative, or like a multi-institutional funding source or something, which really is trying to get data in a high-throughput, automated way? What would your dream be if you could organize something which really will drive the field forward, in your mind?

30:49 I think the National Science Foundation has one initiative. I've also heard about foundations being interested in putting together cloud labs. So things where users can make use of high-throughput automation on demand. I definitely think having user facilities where a computational researcher like me could design an experiment and have it executed would be awesome.

31:19 Having all that data collected in sort of a public way would be great. You know, the way that research right now gets published into papers, it's very hard to then extract back out. But we spend a lot of energy trying to get it back out. And so some of this is also a need for, you know, maybe systematization of how results get reported, so that they can be machine-learning ready from day one when they're published. Some research subfields are trying to do that, but it's not really developed across materials science.

31:53 But for sure, you know, I think there will be more sort of shared facilities where people can make use of data from high-throughput experimentation. And that would be really, really awesome. I don't know if it'll come from companies donating equipment, from the National Science Foundation, or from, you know, private foundations. Yeah, there is a large philanthropic push in the biotech space. It seems like people haven't quite picked up on this as such an important field, especially with things like materials for climate change.

32:23 You can imagine that as a particular, very important problem that could use a lot of push. Yeah, that kind of brings up the question. There's been a ton of very recent materials investment in private companies and startups. Where does that leave, in your mind, the role of the academic in chemistry? I ask myself that all the time, or more recently in the past year. So in particular, there's a lot of compute that companies have access to that academics don't.

32:58 So I ask myself, you know, what can we do that's more creative, that doesn't require just brute-force compute? And I think there is a lot of stuff that we can still do. But we have to ask those questions. For sure, Microsoft, Meta, those are kind of the companies that have basically infinite resources. And as an academic, I don't have infinite resources, you know, but we have an interest in problems that haven't crossed the radar of those companies yet.

33:34 And whenever someone poses a problem to me now, versus a few years ago, I try to make sure that we're not just trying to do something that throwing a lot of compute at it would solve. Yeah. I think we're kind of running out of time, but I would like to give you an opportunity for a call to action. What would you like our listeners to know about or do? What should they do to get involved, or is there something that you're really passionate about?

34:05 I think I will stick to something kind of niche. So I think there is still a place for chemistry. I will say that. But my group develops a code for transition metal complex structure generation and metal-organic framework screening. It's called molSimplify. When we're working on MOFs, we call it MOFSimplify. There are website versions of it that you can look up without installing anything, but it's also on Conda and GitHub. And if you do have an interest in transition metal complexes, you know, just try it out.

34:37 It includes machine learning predictions, but it also makes novel structures. And I'm just really interested to hear if people are using it. I know a lot of companies are using it, but we sort of find out after the fact. So if you're interested more in this materials space, I'm definitely interested and open to feedback. Grateful. Awesome. Getting involved. Thank you very much. Take care, doctor. Thank you very much.
