AI Model Enhances Drug Design With Physics Insights

When machine learning is used to suggest new potential scientific insights or directions, algorithms sometimes offer solutions that are not physically sound. Take for example AlphaFold, the AI system that predicts the complex ways in which amino acid chains will fold into 3D protein structures. The system sometimes suggests "unphysical" folds, configurations that are implausible under the laws of physics, especially when asked to predict the folds for chains that differ significantly from its training data. To limit this type of unphysical result in the realm of drug design, Anima Anandkumar, Bren Professor of Computing and Mathematical Sciences at Caltech, and her colleagues have introduced a new machine learning model called NucleusDiff, which incorporates a simple physical idea into its training, greatly improving the algorithm's performance.

Anandkumar and her colleagues describe NucleusDiff in a paper that appears as part of a "Machine Learning in Chemistry" special feature published by the Proceedings of the National Academy of Sciences (PNAS).

The goal in structure-based drug design is to come up with small molecules, called ligands, that will bind well to a biological target, typically a protein, causing some kind of desired change in activity. Drug-design AI models are trained on datasets containing tens of thousands of examples of such protein-ligand pairings as well as information about how well they latch on to each other, an important measurement called binding affinity. But importantly, NucleusDiff goes a step further.

"With machine learning, the model is already learning many of the aspects of what makes for good binding, and now we throw in some simple physics to make sure we rule out all the unphysical things," Anandkumar explains. In the case of NucleusDiff, the model ensures that atoms stay at an appropriate distance from one another, accounting for physical concepts such as the repulsive forces that prevent atoms from overlapping or colliding.
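The minimum-separation idea described above can be illustrated with a simple penalty term: atom pairs closer than some physical lower bound contribute to a loss, while well-separated pairs contribute nothing. This is a hypothetical sketch only; the function name, the threshold, and the quadratic form of the penalty are illustrative assumptions, not NucleusDiff's actual formulation.

```python
import numpy as np

# Illustrative minimum separation in angstroms (an assumed value,
# not taken from the paper).
MIN_DIST = 1.2

def collision_penalty(coords, min_dist=MIN_DIST):
    """Sum of squared violations for atom pairs closer than min_dist.

    coords: (n_atoms, 3) array of 3D atomic positions.
    """
    diffs = coords[:, None, :] - coords[None, :, :]       # (n, n, 3) offsets
    dists = np.sqrt((diffs ** 2).sum(-1))                 # pairwise distances
    iu = np.triu_indices(len(coords), k=1)                # each pair once
    violations = np.clip(min_dist - dists[iu], 0.0, None) # positive iff too close
    return float((violations ** 2).sum())

# Two well-separated atoms incur no penalty; two overlapping atoms do.
ok = np.array([[0.0, 0.0, 0.0], [2.0, 0.0, 0.0]])
bad = np.array([[0.0, 0.0, 0.0], [0.5, 0.0, 0.0]])
print(collision_penalty(ok))   # 0.0
print(collision_penalty(bad))  # > 0
```

A term like this, added to a generative model's training loss, steers the model away from configurations where atoms collide.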

"We have some nice physical theory behind the algorithm, but it's also intuitive," Anandkumar says. "Surprisingly, without these constraints, all these AI models tend to predict that there is collision, that the atoms come too close. By adding simple physics, we increased the model's accuracy."

Rather than accounting for the distance between every single pair of atoms in a molecule (a task that would be prohibitively computationally expensive), NucleusDiff estimates a manifold, or envelope: a rough approximation of the distribution of atoms and the probable locations of electrons in the molecule. On that manifold, it then establishes key anchoring points to monitor, ensuring that the atoms never get too close to one another.
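The cost-saving idea in the paragraph above, checking a handful of anchor points rather than all atom pairs, can be sketched as follows. The anchor-selection scheme here (farthest-point sampling) is an assumption chosen for illustration; it is not necessarily how NucleusDiff places its anchors on the manifold.

```python
import numpy as np

def farthest_point_anchors(coords, k):
    """Greedily pick k well-spread anchor indices from (n, 3) coords.

    Each step adds the point farthest from all anchors chosen so far,
    giving anchors that roughly cover the molecule's envelope.
    """
    anchors = [0]
    dists = np.linalg.norm(coords - coords[0], axis=1)
    for _ in range(k - 1):
        nxt = int(dists.argmax())
        anchors.append(nxt)
        dists = np.minimum(dists, np.linalg.norm(coords - coords[nxt], axis=1))
    return np.array(anchors)

rng = np.random.default_rng(0)
atoms = rng.normal(size=(200, 3))          # mock atomic positions
anchors = farthest_point_anchors(atoms, k=16)
# Distance checks now cover 16 anchors (120 pairs) instead of
# all 200 atoms (19,900 pairs).
print(len(anchors))  # 16
```

The payoff is the quadratic-to-small-constant reduction in pairwise checks, which is what makes enforcing the constraint tractable during generation.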

The team trained NucleusDiff on a training dataset called CrossDocked2020, which includes about 100,000 protein-ligand binding complexes. They tested it on 100 of those complexes and found that it significantly outperformed state-of-the-art models in terms of binding affinity while also reducing the number of atomic collisions to almost zero. Next, the researchers used the new model on a newer target that was not included in the training dataset: the 3CL protease, a COVID-19 therapeutic target. Again, NucleusDiff showed increased accuracy and a reduction of atomic collisions by up to two-thirds as compared to other leading models.

The work fits within a larger push on campus by Anandkumar and others, through an initiative called AI4Science, to integrate more physics into data-driven AI models built for a variety of topics, from climate prediction to robotics and from seismology to astrophysical modeling.

"If we rely purely on training data, we do not expect machine learning to work well on examples that are significantly different from the training data," Anandkumar says. In fact, she says, it is a standard principle of machine learning that the outputs typically fall within the realm of the examples provided in the training data. But in many scientific domains like drug design, researchers are looking for novel results (e.g., new molecules).

"We see a lot of machine learning fail in coming up with accurate results on new examples that are different from training data, but by incorporating physics, we can make machine learning more trustworthy and also work much better," says Anandkumar.

The paper is titled "Manifold-Constrained Nucleus-Level Denoising Diffusion Model for Structure-Based Drug Design." Additional authors are Liang Yan of Fudan University, who completed the work as a research intern and visiting student at Caltech; Shengchao Liu, Christian Borgs, and Jennifer Chayes of UC Berkeley; Weitao Du of the Alibaba DAMO Academy in Bellevue, Washington; Weiyang Liu of the Max Planck Institute for Intelligent Systems in Germany; Zhuoxinran Li of the University of Toronto; and Hongyu Guo of the National Research Council Canada. The work was supported by the Bren endowed chair and by the AI2050 Senior Fellowship program at Schmidt Sciences.
