AI Model Decodes DNA to Trace Ancestry

Researchers at the University of Oregon have developed an artificial intelligence tool that can read genetic code the way large language models like ChatGPT read text. Scanning the genome for biological mutation patterns, the computer model traces pairs of genes back in time to their last common ancestor.

It's the first language model designed for population genetics, said Andrew Kern, a computational biologist in the UO College of Arts and Sciences. As described in a paper published April 10 in the Proceedings of the National Academy of Sciences, the AI tool offers scientists a fast and flexible alternative to classical methods for reconstructing evolutionary history.

In practice, it can help researchers like Kern determine, for example, when disease-resistance genes emerged in a population or when a species evolved key traits.

"Advances in generative AI and the architectures behind them are potentially useful to a number of fields outside a chatbot," said Kern, an Evergreen professor of biology. "We're borrowing strengths from the world of AI and applying them in this different context that's largely been untapped."

Training AI on the language of DNA

Genomes are often compared to a written language, with combinations of DNA's four-letter alphabet - A, T, C and G - forming the basis for genes and chromosomes. Kern and his lab are most interested in what's misspelled, which scientists call mutations: changes in DNA sequences, like swapped or missing letters, that accumulate over time as part of evolution.

Often harmless, mutations can be passed down from generation to generation, leaving a trail of breadcrumbs for tracing ancestral relationships.

Traditional methods based on math and statistics are the gold standard for translating mutations into ancestry. They're difficult to beat in most cases, said Kevin Korfmann, lead author of the study and a former postdoctoral researcher at the UO. But those classical probabilistic approaches can be slow and can struggle with large or incomplete genomic datasets, he added.

So the researchers turned to AI to interpret the language of life more efficiently, modifying GPT-2, an earlier version of the machine learning architecture behind ChatGPT. Instead of being trained on large volumes of English text, the language model was trained on simulations of genetic evolution across different species - including bacteria, rodents, mosquitoes and primates - so it could learn to recognize mutation patterns.

"We can't repeat evolution, so one of the key workflows we have is developing simulations," Korfmann said. "The simulations mimic evolutionary processes, and then we use the outcomes as training data for our deep learning models."
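To give a sense of what simulation-based training data can look like, here is a minimal sketch in Python. It is not the researchers' simulator. It assumes a textbook model in which two lineages in a constant-size diploid population coalesce after an exponentially distributed time averaging 2N generations, and mutations then accumulate at a constant rate. The function names and parameter values are hypothetical, chosen only for illustration.

```python
import random


def sample_poisson(rng, lam):
    """Draw a Poisson(lam) count by counting unit-rate arrivals before time lam."""
    count = 0
    elapsed = rng.expovariate(1.0)
    while elapsed < lam:
        count += 1
        elapsed += rng.expovariate(1.0)
    return count


def simulate_pair(rng, pop_size, mut_rate, seq_len):
    """Simulate one (mutation count, coalescence time) example for a pair of sequences.

    Assumption: constant diploid population of `pop_size` individuals, so the
    pairwise coalescence time is roughly exponential with mean 2 * pop_size
    generations.
    """
    coal_time = rng.expovariate(1.0 / (2 * pop_size))
    # Both lineages accumulate mutations independently since the common ancestor.
    expected_mutations = 2 * mut_rate * seq_len * coal_time
    return sample_poisson(rng, expected_mutations), coal_time


def make_training_set(n_examples, seed=0, pop_size=10_000,
                      mut_rate=1e-8, seq_len=100_000):
    """Build a list of (mutation count, true coalescence time) pairs."""
    rng = random.Random(seed)
    return [simulate_pair(rng, pop_size, mut_rate, seq_len)
            for _ in range(n_examples)]


if __name__ == "__main__":
    for mutations, coal_time in make_training_set(5):
        print(f"{mutations} differences, true coalescence time "
              f"{coal_time:.0f} generations")
```

Because the simulator knows the true coalescence time behind every example, a model trained on such pairs can be scored against the right answer, which is not possible with real genomes.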

In general, two stretches of DNA that differ by many mutations likely trace back to a distant common ancestor, whereas those that differ by few mutations likely share a more recent one. This helps explain why chimpanzees, whose DNA is very similar to ours, are considered among humans' closest living relatives, while sea sponges are among the most distant animal relatives, having diverged genetically more than 700 million years ago.

Based on those mutation patterns and other biological principles, the AI model can predict when gene pairs last shared a common ancestor, known as the "coalescence time."
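The rule of thumb that more mutations mean an older common ancestor can be made concrete with a back-of-the-envelope calculation. The sketch below is a classical molecular-clock estimate, not the AI model described in the study. It assumes mutations arise at a constant per-site, per-generation rate and that no site mutates twice, so the expected number of differences between two aligned sequences of length L is 2 × rate × L × T. The function name and example values are hypothetical.

```python
def estimate_coalescence_time(seq_a, seq_b, mut_rate):
    """Estimate generations back to the common ancestor of two aligned sequences.

    Assumption: a constant mutation rate per site per generation and no
    repeated mutations at the same site, so differences ~ 2 * mut_rate * L * T.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    if not seq_a:
        raise ValueError("sequences must not be empty")
    differences = sum(1 for a, b in zip(seq_a, seq_b) if a != b)
    return differences / (2 * mut_rate * len(seq_a))


if __name__ == "__main__":
    # Toy sequences and an unrealistically high rate, chosen for readability.
    print(estimate_coalescence_time("ACGTACGTAC", "ACGTTCGTAC", mut_rate=1e-3))
```

A single-number estimate like this ignores much of what real data contain, such as how mutations are arranged along the chromosome, which is the kind of pattern the researchers' model is trained to read.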

Sidestepping data bottlenecks

In tests, the tool performed as well as state-of-the-art statistical methods, which surprised the research team.

"You never really know what's going to work when you're essentially borrowing techniques from a totally different world and applying them to a new problem," Kern said. "But this was a case where things worked really well."

The computer model was also dramatically faster. While traditional methods can take hours or even days to decode a single mosquito chromosome, the new approach can do it in minutes. That efficiency is especially beneficial for scientists handling large amounts of genetic sequence data.

"Compared to classical inferential approaches, the AI tool doesn't have to reason about every mutation individually," Korfmann said. "It just reads the patterns because all of the expensive statistical work was done up front, during training, which sidesteps the bottleneck."

The model's simulation-based training also lets scientists work with DNA datasets that are incomplete, with stretches of genetic code missing - an issue Kern frequently faces when working with mosquito genetic databases for his research on malaria transmission.

That versatility comes at a crucial moment for malaria control, Kern said. For decades, insecticides have been a cornerstone of efforts to control malaria-spreading mosquitoes. But evolution, as Kern puts it, "did its thing."

"Insecticide resistance is being observed in all of these mosquito populations today," he said. "A major challenge in preventing the spread of malaria has been understanding the evolution of insecticide resistance. Now, we can go in with our AI model, ask how long ago these resistance genes arose in the population, and learn about the evolutionary history of this critical carrier of malaria."

Looking ahead, Kern and Korfmann aim to extend the model beyond tracing shared ancestry between two lineages, toward reconstructing full genealogical trees across multiple lineages. Some traditional methods can already do this, but Kern said they'd like to pursue that goal from a machine-learning angle.

"There's so much going on in the machine learning field that we haven't applied yet in our field," Korfmann said. "There's tons of translational work to do to get these novel algorithms working in biology."
