Northwestern University biophysicists have developed a new computational tool for identifying the gene combinations underlying complex illnesses like diabetes, cancer and asthma.
Unlike single-gene disorders, these conditions are influenced by a network of multiple genes working together. But the sheer number of possible gene combinations is huge, making it incredibly difficult for researchers to pinpoint the specific ones that cause disease.
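To give a rough sense of that scale, the short calculation below counts how many distinct gene sets of a given size exist. The figure of 20,000 genes is an assumed round number for human protein-coding genes, not a value taken from the study, and the function name is purely illustrative.

```python
from math import comb

# Assumption: roughly 20,000 protein-coding genes (a commonly cited round figure,
# not a number reported in the study).
N_GENES = 20_000


def count_gene_sets(n_genes: int, set_size: int) -> int:
    """Number of distinct unordered gene sets of the given size."""
    return comb(n_genes, set_size)


for size in (1, 2, 3, 4):
    print(f"sets of {size} gene(s): {count_gene_sets(N_GENES, size):,}")
```

Under this assumption there are already almost 200 million possible pairs, and the count grows by orders of magnitude with each additional gene, which is why testing every combination directly is impractical.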
Using a generative artificial intelligence (AI) model, the new method amplifies limited gene expression data, enabling researchers to resolve patterns of gene activity that cause complex traits. This information could lead to new and more effective disease treatments involving molecular targets associated with multiple genes.
The study will be published during the week of June 9 in the Proceedings of the National Academy of Sciences.
"Many diseases are determined by a combination of genes, not just one," said Northwestern's Adilson Motter, the study's senior author. "You can compare a disease like cancer to an airplane crash. In most cases, multiple failures need to occur for a plane to crash, and different combinations of failures can lead to similar outcomes. This complicates the task of pinpointing the causes. Our model helps simplify things by identifying the key players and their collective influence."
An expert on complex systems, Motter is the Charles E. and Emma H. Morrison Professor of Physics at Northwestern's Weinberg College of Arts and Sciences and the director of the Center for Network Dynamics. The other authors of the study, all associated with Motter's lab, are postdoctoral researcher Benjamin Kuznets-Speck, graduate student Buduka Ogonor and research associate Thomas Wytock.
Current methods fall short
For decades, researchers have struggled to unravel the genetic underpinnings of complex human traits and diseases. Even non-disease traits like height, intelligence and hair color depend on collections of genes. Existing methods, such as genome-wide association studies, try to find individual genes linked to a trait. But they lack the statistical power to detect the collective effects of groups of genes.
"The Human Genome Project showed us that we only have six times as many genes as a single-cell bacterium," Motter said. "But humans are much more sophisticated than bacteria, and the number of genes alone does not explain that. This highlights the prevalence of multigenic relationships, and that it must be the interactions among genes that give rise to complex life."
"Identifying single genes is still valuable," Wytock added. "But there is only a very small fraction of observable traits, or phenotypes, that can be explained by changes in single genes. Instead, we know that phenotypes are the result of many genes working together. Thus, it makes sense that multiple genes typically contribute to the variation of a trait."
Not genes but gene expression
To help bridge the long-standing knowledge gap between genetic makeup (genotype) and observable traits (phenotype), the research team developed a sophisticated approach that combines machine learning with optimization.
Called the Transcriptome-Wide conditional Variational auto-Encoder (TWAVE), the model leverages generative AI to identify patterns from limited gene expression data in humans. Accordingly, it can emulate diseased and healthy states so that changes in gene expression can be matched with changes in phenotype. Instead of examining the effects of individual genes in isolation, the model identifies groups of genes that collectively cause a complex trait to emerge. The method then uses an optimization framework to pinpoint specific gene changes that are most likely to shift a cell's state from healthy to diseased or vice versa.
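The idea behind the optimization step, searching for a group of gene changes that together move a cell from a diseased state toward a healthy one, can be illustrated with a toy sketch. This is not TWAVE itself and does not involve a generative model: the two-dimensional "expression state," the gene names, the response vectors and the assumption that perturbation effects simply add together are all hypothetical simplifications chosen for illustration.

```python
from itertools import combinations


def shift_distance(start, target, responses, genes):
    """Euclidean distance to the target after adding each chosen gene's response."""
    moved = list(start)
    for gene in genes:
        moved = [m + r for m, r in zip(moved, responses[gene])]
    return sum((m - t) ** 2 for m, t in zip(moved, target)) ** 0.5


def best_gene_set(start, target, responses, max_size=2):
    """Exhaustively try every gene set up to max_size; return (genes, distance)."""
    best = ((), shift_distance(start, target, responses, ()))
    for size in range(1, max_size + 1):
        for genes in combinations(sorted(responses), size):
            distance = shift_distance(start, target, responses, genes)
            if distance < best[1]:
                best = (genes, distance)
    return best


# Hypothetical toy data: a 2-D summary of expression for each state, and the
# assumed shift in that summary when each gene is perturbed.
diseased = [1.0, 1.0]
healthy = [0.0, 0.0]
responses = {
    "geneA": [-1.0, 0.0],
    "geneB": [0.0, -1.0],
    "geneC": [-0.6, -0.6],
}

genes, distance = best_gene_set(diseased, healthy, responses)
print(genes, distance)
```

In this toy setup no single gene reaches the healthy state, but perturbing geneA and geneB together does, which mirrors the article's point that groups of genes, rather than individual ones, can drive a trait. Exhaustive search is only feasible here because the example has three genes; at genome scale a more efficient optimization strategy would be needed.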
"We're not looking at gene sequence but gene expression," Wytock said. "We trained our model on data from clinical trials, so we know which expression profiles are healthy or diseased. For a smaller number of genes, we also have experimental data that tells how the network responds when the gene is turned on or off, which we can match with the expression data to find the genes implicated in the disease."
Focusing on gene expression has multiple benefits. First, it bypasses patient privacy issues. Genetic data — a person's actual DNA sequence — is inherently unique to an individual, providing a highly personal blueprint of health, genetic predispositions and family relationships. Expression data, on the other hand, is more like a dynamic snapshot of cellular activity. Second, gene expression data implicitly accounts for environmental factors, which can turn genes "up" or "down" to perform various functions.
"Environmental factors might not affect DNA, but they definitely affect gene expression," Motter said. "So, our model has the benefit of indirectly accounting for environmental factors."
A path to personalized treatment
To demonstrate TWAVE's effectiveness, the team tested it across several complex diseases. The method successfully identified the genes — some of which were missed by existing methods — that caused those diseases. TWAVE also revealed that different sets of genes can cause the same complex disease in different people. That finding suggests personalized treatments could be tailored to a patient's specific genetic drivers of disease.
"A disease can manifest similarly in two different individuals," Motter said. "But, in principle, there could be a different set of genes involved for each person owing to genetic, environmental and lifestyle differences. This information could orient personalized treatment."
The study, "Generative prediction of causal gene sets responsible for complex traits," was supported by the National Cancer Institute (grant number P50-CA221747) through the Malnati Brain Tumor Institute, the NSF-Simons National Institute for Theory and Mathematics in Biology (National Science Foundation grant number DMS-2235451 and Simons Foundation grant number MP-TMPS-00005320) and the National Science Foundation (grant number MCB-2206974).