Accuracy Test for Protein Language Models Shines Light Into AI 'Black Box'

Emory University

AI language models, used to generate human-like text to power chatbots and create content, are also revolutionizing biology by treating complex biological data like a language. Language models are increasingly used, for example, to find patterns in DNA and proteins to make predictions and speed research into biological complexity.

A critical gap, however, is the lack of a method to estimate the reliability of these predictions.

Computational biologists at Emory University have bridged this gap, developing a simple way to test the accuracy of a language model's understanding of proteins. Nature Methods published their system, which scores the reliability of a model's predictions by comparing how it "embeds," or numerically codifies, synthetic random proteins versus proteins found in nature.

"To the best of our knowledge, our framework is the first generalized method to quantify protein sequence embedding reliability," says Yana Bromberg, senior author of the paper and Emory professor of biology and computer science.

"Our method is a simple, elegant solution to a complex problem," adds R. Prabakaran, first author of the study and a postdoctoral fellow in the Bromberg lab. "It's a foundational method with a lot of scope for a range of language models in science."

The new method gives a clearer view into the embedding process, or how a language model codifies and "files" different types of data.

"We are shining a light into the black box of AI," Bromberg says. "Better understanding how a language model works allows you to find ways to keep improving its reliability and to develop better models."

Understanding gene and protein complexity

Proteins, long chains of amino acids folded into 3D shapes, are essential to nearly all cellular work. A protein's unique sequence of amino acids, determined by DNA, dictates its specific 3D shape and its function, including such diverse roles as catalyzing reactions, enabling muscle contraction or defending against pathogens.

Bromberg is a pioneer in applying machine learning for protein and genomic analysis, including understanding how protein language model embeddings capture biological information. The goal is not just to help unravel the complexities of the human genome, which consists of 3.2 billion base pairs of DNA. The Bromberg lab is developing computational techniques needed to study metagenomes — the collection of genetic material from all organisms, including microbes, living within a particular community. For example, Bromberg and Prabakaran built a DNA language model for metagenomic analyses recently published in Nucleic Acids Research.

The vastly complicated interactions of the metagenome play key roles in the health of both individual organisms and ecosystems.

"If you think of the genome as a tree, the metagenome is a forest," Prabakaran explains. "And just like a tree, you don't live in isolation. A whole ecosystem of microbes is living in community with your body. If you want to fully understand nature, you need to understand metagenomics."

A shortage of data

AI tools are essential to study this complexity.

The large language models underlying language content creation, such as ChatGPT, are trained on available examples of human writing to make predictions for appropriate text. Protein language models are trained on available protein sequences so they can make predictions to aid in the study of all existing protein sequences.

One problem is the relative shortage of data. While the sequences of more than 200 million known proteins are collected in databases such as UniProt, it is estimated that trillions more proteins exist.

"I'm especially interested in microbes," Bromberg says. "And most microbes, probably more than 90 percent, we've never seen or studied before. Every microbe will share some fraction of its genome common to other microbes. The question is, can we take some fraction of protein sequences from any microbial community and use it to make reliable predictions about the sequences and functions of all proteins?"

To answer that question, the researchers needed to develop a way to test the reliability of protein language models.

A language model's 'junkyard'

The key to the new testing method lies in understanding how evolution shapes proteins. Evolution leaves a distinct signature on proteins by conserving amino acid sequences that are inherently important for life. Protein language models are trained to make predictions by exposing them to actual proteins found in nature, which contain this evolutionary signature.

The researchers decided to compare how a protein language model would classify, or encode, this biologically meaningful information alongside randomly generated synthetic proteins.
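As a rough illustration of what "randomly generated synthetic proteins" can mean, the sketch below samples amino acids uniformly at random. This is an assumption for demonstration purposes; the study's actual procedure for constructing random sequences (e.g., shuffling natural sequences or sampling from background amino-acid frequencies) may differ.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids

def random_protein(length, seed=None):
    """Build a synthetic 'protein' by sampling amino acids uniformly at random.

    Such sequences lack the evolutionary signature of natural proteins,
    which is what makes them useful as a contrast set.
    """
    rng = random.Random(seed)
    return "".join(rng.choice(AMINO_ACIDS) for _ in range(length))

seq = random_protein(120, seed=0)
```

A fixed seed makes the sequence reproducible, which matters when the same synthetic set must be re-embedded across experiments.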

Language models embed data by compressing it into an abstract, latent space, grouping similar items together.

Visualizing this latent space as a scatter plot, where protein "point" proximity indicates similarity, revealed that the language model grouped proteins found in nature by various subtypes and segregated them primarily into one area of the latent space — away from synthetic proteins, with which the model was not familiar.

The researchers dubbed this separate area the "junkyard," and hypothesized that it represented a subspace of low-quality, less biologically meaningful embeddings.
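The scatter-plot view described above can be mimicked with a PCA projection of embedding vectors to two dimensions. In this minimal sketch the "embeddings" are synthetic stand-in vectors (a real analysis would use vectors produced by a protein language model); the cluster locations and dimensions are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in embedding vectors: natural proteins form one cluster,
# random synthetic proteins sit in a displaced region (the "junkyard").
natural = rng.normal(loc=0.0, scale=0.5, size=(100, 64))
synthetic = rng.normal(loc=3.0, scale=0.5, size=(100, 64))
X = np.vstack([natural, synthetic])

# Project to 2D with PCA (computed via SVD) to get scatter-plot coordinates.
Xc = X - X.mean(axis=0)
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
coords = Xc @ Vt[:2].T  # shape (200, 2); rows 0-99 natural, 100-199 synthetic
```

Plotting `coords` colored by sequence type would show the two groups occupying separate regions of the reduced latent space, as the article describes.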

A 'random neighbor' score

They further proposed that the degree of overlap in latent space between a protein's nearest neighbors and the embeddings of non-biological sequences is inversely correlated with the model's confidence in the embedding.

The researchers then quantified this relationship into what they call a "random neighbor score," which reflects the number of random, synthetic sequences among a given protein's nearest neighbors. A lower random neighbor score indicates higher model confidence in the embedding; a higher score signals uncertainty.
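The idea of counting synthetic neighbors can be sketched as a k-nearest-neighbor computation. This is a simplified illustration under assumed choices (Euclidean distance, k=10, score expressed as a fraction); the paper's exact definition may differ, and the embeddings here are again synthetic stand-ins rather than real model outputs.

```python
import numpy as np

def random_neighbor_score(query, embeddings, is_random, k=10):
    """Fraction of the query's k nearest neighbors (Euclidean distance)
    that are randomly generated sequences. Higher score = less reliable
    embedding, on the article's logic."""
    dists = np.linalg.norm(embeddings - query, axis=1)
    nearest = np.argsort(dists)[:k]
    return float(is_random[nearest].mean())

rng = np.random.default_rng(0)
# Stand-in embeddings: natural proteins near the origin,
# synthetic proteins in a well-separated region.
natural = rng.normal(0.0, 0.5, size=(100, 32))
synthetic = rng.normal(4.0, 0.5, size=(100, 32))
embeddings = np.vstack([natural, synthetic])
is_random = np.array([False] * 100 + [True] * 100)

low = random_neighbor_score(natural[0], embeddings, is_random)
high = random_neighbor_score(synthetic[0], embeddings, is_random)
```

A query embedded among natural proteins yields a low score (confident embedding), while one landing in the synthetic region yields a high score (the "junkyard").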

To assess whether the random neighbor score can serve as an indicator of predictive performance, they analyzed low-quality embeddings across a range of tasks for which protein language models are used. They found that these low-quality embeddings often failed to capture meaningful biology.

Sharpening the tools

Applying the new method will allow for more precise measurement of the accuracy of a scientific language model's embedding process.

"You can think of it like a surgeon choosing the sharpest knife for a future surgery," Bromberg says.

This refinement of reliability can be used in the development phase of language models to enhance the machine-learning process.

"We need better quality control at every step in this process," Prabakaran notes. "The errors will keep multiplying if you keep building onto junk data."

The new method is a biologically grounded uncertainty measure, as opposed to measures borrowed from computer science, he adds.

The work was supported by a grant from the National Science Foundation.
