Many attempts have been made to harness the power of new artificial intelligence and large language models (LLMs) to predict the outcomes of new chemical reactions. These have had limited success, in part because until now they have not been grounded in an understanding of fundamental physical principles, such as the law of conservation of mass. Now, a team of researchers at MIT has come up with a way of incorporating these physical constraints into a reaction prediction model, greatly improving the accuracy and reliability of its outputs.
The new work was reported Aug. 20 in the journal Nature, in a paper by recent postdoc Joonyoung Joung (now an assistant professor at Kookmin University, South Korea); former software engineer Mun Hong Fong (now at Duke University); chemical engineering graduate student Nicholas Casetti; postdoc Jordan Liles; physics undergraduate student Ne Dassanayake; and senior author Connor Coley, who is the Class of 1957 Career Development Professor in the MIT departments of Chemical Engineering and Electrical Engineering and Computer Science.
"The prediction of reaction outcomes is a very important task," Joung explains. For example, if you want to make a new drug, "you need to know how to make it. So, this requires us to know what product is likely" to result from a given set of chemical inputs to a reaction. But most previous efforts to carry out such predictions look only at a set of inputs and a set of outputs, without looking at the intermediate steps or enforcing the constraint that no mass is gained or lost in the process. In actual reactions, mass can be neither created nor destroyed.
Joung points out that while large language models such as ChatGPT have been very successful in many areas of research, these models do not provide a way to limit their outputs to physically realistic possibilities, such as by requiring them to adhere to conservation of mass. These models use computational "tokens," which in this case represent individual atoms, but "if you don't conserve the tokens, the LLM model starts to make new atoms, or deletes atoms in the reaction." Instead of being grounded in real scientific understanding, "this is kind of like alchemy," he says. While many attempts at reaction prediction only look at the final products, "we want to track all the chemicals, and how the chemicals are transformed" throughout the reaction process from start to end, he says.
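The kind of check that an unconstrained model can fail is easy to state. As a minimal sketch, not taken from the FlowER code, the following Python snippet counts the atoms on each side of a reaction and compares the totals. The function names count_atoms and is_mass_balanced are hypothetical, and the parser assumes simple molecular formulas without parentheses or charges.

```python
import re
from collections import Counter


def count_atoms(formula):
    """Count atoms in a simple formula such as 'CH4' or 'H2O'.

    Assumption: no parentheses, charges, or hydrates in the formula.
    """
    counts = Counter()
    # Each match is an element symbol followed by an optional count.
    for element, number in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[element] += int(number) if number else 1
    return counts


def is_mass_balanced(reactants, products):
    """Return True if both sides contain the same atoms in the same numbers."""
    left = sum((count_atoms(f) for f in reactants), Counter())
    right = sum((count_atoms(f) for f in products), Counter())
    return left == right


# Methane combustion: CH4 + 2 O2 -> CO2 + 2 H2O
balanced = is_mass_balanced(["CH4", "O2", "O2"], ["CO2", "H2O", "H2O"])

# A "prediction" that silently drops one water molecule
unbalanced = is_mass_balanced(["CH4", "O2", "O2"], ["CO2", "H2O"])
```

Here the first reaction balances and the second does not, because two hydrogen atoms and one oxygen atom have vanished from the product side. A check like this can only flag such an error after the fact; it does nothing to stop a model from making it.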
In order to address the problem, the team made use of a method developed back in the 1970s by chemist Ivar Ugi, which uses a bond-electron matrix to represent the electrons in a reaction. They used this system as the basis for their new program, called FlowER (Flow matching for Electron Redistribution), which allows them to explicitly keep track of all the electrons in the reaction to ensure that none are spuriously added or deleted in the process.
In this matrix, nonzero values represent bonds or lone electron pairs, and zeros represent the absence of either. "That helps us to conserve both atoms and electrons at the same time," says Fong. This representation, he says, was one of the key elements in building mass conservation into their prediction system.
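To make the bookkeeping concrete, here is a minimal Python sketch of bond-electron matrices for a simple acid-base step, in which hydroxide and a proton combine to give water. It is an illustration only, not FlowER's code. It assumes the classic Dugundji-Ugi convention, in which off-diagonal entries hold bond orders and diagonal entries hold the number of free valence electrons on each atom; FlowER's exact encoding may differ. The names ATOMS, VALENCE, total_electrons, formal_charges, reaction_matrix, and conserves_electrons are hypothetical.

```python
# Atom order is shared by reactants and products: O, H (already on O), H (the proton).
ATOMS = ["O", "H", "H"]
VALENCE = {"O": 6, "H": 1}  # valence electrons of the neutral atoms

# Assumed convention (Dugundji-Ugi):
#   matrix[i][j], i != j : formal bond order between atoms i and j
#   matrix[i][i]         : free (non-bonding) valence electrons on atom i
REACTANTS = [  # OH- and H+
    [6, 1, 0],
    [1, 0, 0],
    [0, 0, 0],
]
PRODUCTS = [  # H2O
    [4, 1, 1],
    [1, 0, 0],
    [1, 0, 0],
]


def total_electrons(be):
    """Sum of all entries equals the number of valence electrons."""
    return sum(sum(row) for row in be)


def formal_charges(be, atoms):
    """Formal charge per atom: neutral valence count minus the row sum."""
    return [VALENCE[a] - sum(row) for a, row in zip(atoms, be)]


def reaction_matrix(reactants, products):
    """Entry-wise difference (products minus reactants): the electron redistribution."""
    return [[p - r for r, p in zip(r_row, p_row)]
            for r_row, p_row in zip(reactants, products)]


def conserves_electrons(reactants, products):
    """Electrons are conserved when the reaction matrix entries sum to zero."""
    return total_electrons(reaction_matrix(reactants, products)) == 0


delta = reaction_matrix(REACTANTS, PRODUCTS)
ok = conserves_electrons(REACTANTS, PRODUCTS)
```

In this sketch, delta shows oxygen giving up two free electrons (a lone pair) to form one new O-H bond, and its entries sum to zero. Because the same atoms index the rows and columns on both sides, atoms are conserved by construction, and any proposed product matrix whose entries do not sum to the reactant total can be rejected as creating or destroying electrons.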
The system they developed is still at an early stage, Coley says. "The system as it stands is a demonstration - a proof of concept that this generative approach of flow matching is very well suited to the task of chemical reaction prediction." While the team is excited about this promising approach, he says, "we're aware that it does have specific limitations as far as the breadth of different chemistries that it's seen." Although the model was trained using data on more than a million chemical reactions, obtained from a U.S. Patent Office database, those data do not include certain metals and some kinds of catalytic reactions, he says.
"We're incredibly excited about the fact that we can get such reliable predictions of chemical mechanisms" from the existing system, he says. "It conserves mass, it conserves electrons, but we certainly acknowledge that there's a lot more expansion and robustness to work on in the coming years as well."
But even in its present form, which is being made freely available through the online platform GitHub, "we think it will make accurate predictions and be helpful as a tool for assessing reactivity and mapping out reaction pathways," Coley says. "If we're looking toward the future of really advancing the state of the art of mechanistic understanding and helping to invent new reactions, we're not quite there. But we hope this will be a steppingstone toward that."
"It's all open source," says Fong. "The models, the data, all of them are up there," including a previous dataset developed by Joung that exhaustively lists the mechanistic steps of known reactions. "I think we are one of the pioneering groups making this dataset, and making it available open-source, and making this usable for everyone," he says.
The FlowER model matches or outperforms existing approaches in finding standard mechanistic pathways, the team says, and makes it possible to generalize to previously unseen reaction types. They say the model could potentially be relevant for predicting reactions for medicinal chemistry, materials discovery, combustion, atmospheric chemistry, and electrochemical systems.
In their comparisons with existing reaction prediction systems, Coley says, "using the architecture choices that we've made, we get this massive increase in validity and conservation, and we get a matching or a little bit better accuracy in terms of performance."
He adds that "what's unique about our approach is that while we are using these textbook understandings of mechanisms to generate this dataset, we're anchoring the reactants and products of the overall reaction in experimentally validated data from the patent literature." They are inferring the underlying mechanisms, he says, rather than just making them up. "We're imputing them from experimental data, and that's not something that has been done and shared at this kind of scale before."
As for the next step, he says, "we are quite interested in expanding the model's understanding of metals and catalytic cycles. We've just scratched the surface in this first paper," and most of the reactions covered so far don't involve metals or catalysts, "so that's a direction we're quite interested in."
In the long term, he says, "a lot of the excitement is in using this kind of system to help discover new complex reactions and help elucidate new mechanisms. I think that the long-term potential impact is big, but this is of course just a first step."
The work was supported by the Machine Learning for Pharmaceutical Discovery and Synthesis consortium and the National Science Foundation.