AI Taps Tree of Life for Rare Disease Diagnosis

Center for Genomic Regulation

Researchers have created an artificial intelligence model that can identify which mutations in human proteins are most likely to cause disease, even when those mutations have never been seen before in any person.

The model, called popEVE, was built using evolutionary data from hundreds of thousands of different species together with data on genetic variation across the human population. The vast evolutionary record allows the tool to see which parts of each of the roughly 20,000 human proteins are essential for life and which can tolerate change.

That allows popEVE to not only identify disease-causing mutations but also rank how severe they are across the body. The findings, published today in Nature Genetics by researchers at Harvard Medical School and the Centre for Genomic Regulation (CRG) in Barcelona, could transform how doctors diagnose genetic disease.

One in two people with a rare disease never receive a clear diagnosis. popEVE could change that by helping doctors focus on the most damaging variants first. Another benefit is that it can work with the patient's genetic information alone. That has important implications for rare disease medicine in healthcare systems with limited resources, making diagnoses faster, simpler and cheaper than before.

"Clinics don't always have access to parental DNA and many patients come alone. popEVE can help these doctors identify disease-causing mutations, and we're already seeing this from collaborations with clinics," says Dr. Mafalda Dias, co-corresponding author of the study and researcher at the Centre for Genomic Regulation.

Every individual's genome contains many small differences which make them unique. These include missense mutations, changes that alter one amino acid in a protein. Many are harmless, but some cause severe conditions or disorders. The challenge is working out which are benign and which are harmful.

However, not all harmful mutations are equally harmful. Some cause mild symptoms, others severe disability and some are fatal in childhood. Many AI tools can predict whether a mutation is dangerous, but they don't place it on a sliding scale of severity.

For conditions "as rare as one", there are no case histories to consult. Even if the world's entire population were sequenced, these patients' mutations would be completely new. Traditional methods that depend on spotting patterns across groups of patients or in large cohorts cannot help in these one-off cases.

That's why a team led by Debora Marks at Harvard Medical School and Jonathan Frazer and Mafalda Dias at the Centre for Genomic Regulation (CRG) turned to evolution instead.

Over billions of years, evolution on Earth has already run countless experiments, testing which changes a protein can tolerate and which are too damaging to survive. Computational models can learn which amino acid positions are critical for life by comparing protein sequences across many different species.
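The idea of reading conservation from cross-species comparisons can be illustrated with a small sketch. It is a far simpler stand-in than the models used in EVE or popEVE, and the toy alignment and function names are hypothetical: it measures, for each position in a set of aligned sequences, how much the amino acid varies between species. Positions with zero variation are candidates for being functionally critical.

```python
import math
from collections import Counter


def column_entropy(column):
    """Shannon entropy (in bits) of the amino acids seen at one aligned position."""
    counts = Counter(column)
    total = len(column)
    return sum(-(n / total) * math.log2(n / total) for n in counts.values())


def conservation_profile(alignment):
    """Entropy per position; 0.0 means every species carries the same residue."""
    # zip(*alignment) turns rows (species) into columns (positions).
    return [column_entropy(col) for col in zip(*alignment)]


# Hypothetical toy alignment: one row per species, one column per position.
toy_alignment = [
    "MKTA",
    "MKSA",
    "MRTA",
    "MKTG",
]

profile = conservation_profile(toy_alignment)
for position, entropy in enumerate(profile, start=1):
    print(f"position {position}: entropy {entropy:.3f} bits")
```

In this toy case the first position never changes, so its entropy is zero, while the other three positions each tolerate one substitution. Real alignments span thousands of species and need handling for gaps and sequence weighting, which this sketch omits.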

This was the idea behind EVE (Evolutionary model of Variant Effect), an algorithm released by the researchers back in 2021. It used evolutionary patterns to classify mutations in human disease genes as benign or harmful. EVE performed as well as, or better than, many lab-based experiments, and has since been used in clinical genetics to help interpret uncertain variants.

But while EVE could judge the impact of mutations within a gene, its scores weren't directly comparable between genes. A variant that looked severe in one protein couldn't be fairly compared with a variant in another. That's a problem because doctors need to know which mutation in a patient's genome is the most damaging.

The latest model in the EVE family, popEVE, solves that problem by combining evolutionary data with information from the UK Biobank and gnomAD, two vast repositories. These datasets show which variants are present in healthy people, helping the model calibrate its predictions for humans.

The result is the first model that can meaningfully rank mutations across the entire human proteome, the complete set of roughly 20,000 proteins encoded within the human genome. A mutation in gene A can now be compared directly with one in gene B on the same severity scale. That allows doctors, for the first time, to focus on the potentially most damaging variants first.
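What a shared severity scale makes possible can be shown with a minimal sketch. The gene names, protein changes and scores below are invented for illustration, and the convention that a higher score means more damaging is an assumption of this example rather than a description of popEVE's actual output.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Variant:
    gene: str
    change: str
    severity: float  # hypothetical shared-scale score; higher = more damaging (assumed)


def rank_variants(variants):
    """Order variants from most to least damaging on the shared scale."""
    return sorted(variants, key=lambda v: v.severity, reverse=True)


# Invented example: three missense variants found in one patient, in different genes.
patient_variants = [
    Variant("GENE_A", "p.Arg12Cys", 0.31),
    Variant("GENE_B", "p.Gly88Asp", 0.92),
    Variant("GENE_C", "p.Ala5Val", 0.07),
]

ranked = rank_variants(patient_variants)
for v in ranked:
    print(f"{v.gene} {v.change}: {v.severity:.2f}")
```

Sorting like this is only meaningful because the scores are comparable between genes. With scores that are calibrated only within each gene, the same sort would mix incompatible scales.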

To validate popEVE, the researchers analysed genetic data from more than 31,000 families with children affected by severe developmental disorders. In 98% of cases where a causal mutation had already been identified, popEVE correctly ranked that variant as the most damaging in the child's genome. It outperformed state-of-the-art competitors like DeepMind's AlphaMissense.

When the researchers looked for new candidate disease genes, popEVE uncovered 123 that had never been linked to developmental disorders before. Many are active in the developing brain and interact physically with known disease proteins. 104 of these were observed in just one or two patients.

One of popEVE's strengths is that it avoids penalising people whose ancestry is underrepresented in genetic databases, which are predominantly biased towards people of European ancestry. This is a problem in other tools, which can flag variants as possibly disease-causing simply because they haven't been seen before.

popEVE avoids this by treating all human variants equally. It asks only whether a mutation has been seen before in humans, regardless of whether that was once in a specific population or a thousand times in European populations, and as a result it produced fewer false positives.
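The difference between counting how often a variant appears and asking only whether it has appeared at all can be sketched as follows. The variant names and counts are invented, and the function is an illustrative assumption about the principle described above, not popEVE's implementation.

```python
def seen_in_humans(allele_count: int) -> bool:
    """Presence/absence signal: any observation counts the same as many."""
    return allele_count > 0


# Invented counts: seen once, seen a thousand times, and never seen.
observations = {"variant_x": 1, "variant_y": 1000, "variant_z": 0}

labels = {name: seen_in_humans(count) for name, count in observations.items()}
for name, seen in labels.items():
    print(f"{name}: seen before = {seen}")
```

Under this kind of signal, a variant observed once in a sparsely sampled population is treated the same as one observed a thousand times in a heavily sampled one, so low sampling depth alone does not make a variant look suspicious.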

"No one should get a scary result just because their community isn't well represented in global databases. popEVE helps fix that imbalance, something the field has been missing for a long time," says Dr. Jonathan Frazer, co-corresponding author of the study and researcher at the Centre for Genomic Regulation.

The authors of the study stress that popEVE only interprets DNA changes that alter proteins. Many other types of mutations exist, so it doesn't cover all types of genetic variation. It also doesn't replace clinical judgement. Doctors must still use medical histories and symptom analysis to aid diagnosis.
