ISTA Unveils Algorithm to Boost Biobank Data Analysis

Institute of Science and Technology Austria

Extracting and analyzing relevant medical information from large-scale databases such as biobanks poses considerable challenges. To exploit such 'big data', earlier attempts focused on large-scale sampling algorithms that model individual data points. However, because these algorithms sample the entire dataset millions of times, their theoretically very high precision comes at a prohibitive computational cost and therefore remains unattainable in practice. To overcome this, scientists previously developed approaches that sacrifice accuracy for speed.

In a bid to optimize both precision and performance, researchers from the groups of Matthew Robinson and Marco Mondelli at the Institute of Science and Technology Austria (ISTA) developed an algorithm that can extract and analyze information from the world's most extensive biobank with unprecedented accuracy and speed. Their method, presented here using the model complex trait of human height, could ultimately advance personalized medicine, particularly diagnostics, and might even find use in forensics.

Algorithmic innovation using human height

The team's approach draws on the recently established mathematical framework known as "approximate message passing" (AMP), to which Mondelli has made significant contributions. Their new method, dubbed "genomic Vector Approximate Message Passing" or gVAMP, enhances the framework's ability to extract complex information from the dataset at hand.

"Whereas other methods tend to analyze one snippet at a time before combining the results, gVAMP functions as a 'joint estimation' method. Therefore, it provides a detailed overview of the effects on a trait in the context of all variants across massive-scale genetic datasets," says ISTA PhD student Al Depope, the study's first author. "We can speak of an algorithmic innovation."
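The difference between analyzing one variant at a time and estimating all effects jointly can be illustrated with a small simulation. The following is a minimal sketch, not gVAMP itself: it swaps in ordinary least squares as the joint estimator, and the sample sizes, the correlation value `rho`, and all variable names are assumptions chosen for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
n_people, n_variants, rho = 2000, 50, 0.8  # illustrative sizes, not from the study

# Hypothetical genotype-like matrix in which neighbouring variants are correlated.
X = np.empty((n_people, n_variants))
X[:, 0] = rng.standard_normal(n_people)
for j in range(1, n_variants):
    fresh = rng.standard_normal(n_people)
    X[:, j] = rho * X[:, j - 1] + np.sqrt(1 - rho**2) * fresh

# Only five variants truly affect the simulated trait.
true_effects = np.zeros(n_variants)
true_effects[[5, 15, 25, 35, 45]] = 1.0
y = X @ true_effects + rng.standard_normal(n_people)

# One variant at a time: each column is regressed on the trait separately.
marginal = (X * y[:, None]).sum(axis=0) / (X**2).sum(axis=0)

# Joint estimation: all variants are fitted in a single model
# (plain least squares here, standing in for a joint method).
joint, *_ = np.linalg.lstsq(X, y, rcond=None)

marginal_error = float(np.sum((marginal - true_effects) ** 2))
joint_error = float(np.sum((joint - true_effects) ** 2))
print(f"squared error, one at a time: {marginal_error:.2f}")
print(f"squared error, joint:         {joint_error:.2f}")
```

In this toy setting, the one-at-a-time estimates assign part of a causal variant's effect to its correlated neighbours, whereas the joint fit attributes the effect to the variants that actually carry it.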

To develop their method, the team chose human height, an established model for the genetic analysis of complex traits.

"Examining human height allowed us to explore the limits of computational scalability with gVAMP, both in the number of genome sequences as well as the number of variants involved," says Depope.

Indeed, the trait is influenced by a whopping 17 million variants, which the team could analyze simultaneously in hundreds of thousands of whole-genome sequences from anonymized volunteers contained in the UK Biobank, the world's most comprehensive dataset of biological, health, and lifestyle information.

"What I find particularly important is the interpretability of our algorithm when applied in biology. In addition to allowing us to predict people's height from their DNA more accurately than before, it also allows us to pinpoint the specific DNA regions involved," says ISTA postdoc and co-author Jakub Bajzik.

Outperforming existing methods

When gVAMP predicts human height and the contribution of individual genetic variants, the algorithm generates these estimates for the first time. As a result, there is no pre-existing reference data on human height against which to benchmark the method.

"Essentially, the question here is 'how do we know that gVAMP picked out the true variants?'" Depope explains.

To evaluate the strength of their method, the ISTA researchers performed a data simulation—a common approach in the field. They developed an artificial trait with roughly the same number of genetic variants as human height and performed an extensive simulation study on multiple datasets, benchmarking the algorithm's performance against other methods. By doing so, they demonstrated that gVAMP largely outperforms existing methods in both accuracy and processing time.
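The logic of such a simulation can be sketched in a few lines: because the trait is artificial, the truly causal variants are known, so a method's output can be scored against them. The example below is a hedged illustration under stated assumptions. It uses plain least squares rather than gVAMP or any of the benchmarked methods, and the dataset sizes, effect size, allele frequencies, and the "top-k" selection rule are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
n_people, n_variants, n_causal = 1000, 200, 10  # illustrative sizes only

# Hypothetical genotypes: 0, 1 or 2 copies of an allele, then standardized.
allele_freq = rng.uniform(0.1, 0.5, size=n_variants)
G = rng.binomial(2, allele_freq, size=(n_people, n_variants)).astype(float)
G = (G - G.mean(axis=0)) / G.std(axis=0)

# Artificial trait: a known set of causal variants plus random noise.
causal = rng.choice(n_variants, size=n_causal, replace=False)
effects = np.zeros(n_variants)
effects[causal] = 0.5
trait = G @ effects + rng.standard_normal(n_people)

# Stand-in method under test: a joint least-squares fit of all variants.
estimate, *_ = np.linalg.lstsq(G, trait, rcond=None)

# Score the method: how many of the strongest estimated effects are truly causal?
selected = np.argsort(-np.abs(estimate))[:n_causal]
recall = len(np.intersect1d(selected, causal)) / n_causal
print(f"fraction of true variants recovered: {recall:.2f}")
```

Repeating this scoring for several methods on the same simulated data, while also recording run time, gives the kind of accuracy and speed comparison described above.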

"Our method achieves state-of-the-art accuracy while remaining efficient enough to perform a true joint analysis across massive-scale genetic datasets in mere days. This allows us to uncover the underlying biology previously hidden by limited scale," says Depope. "The algorithmic innovation is exactly what makes this scale of analysis possible, as well as the resulting biological insights."

From personalized medicine to forensics?

The interdisciplinary study combines expertise in information theory, mathematics, genomics, and software engineering. Bajzik's background in computer science complemented Depope's focus on theory and math. Robinson, who specializes in state-of-the-art statistical models for genomic data, co-supervised the project with Mondelli, who seeks to develop robust inference methods in information theory to address data-driven challenges in engineering and natural sciences.

Currently, the team is building on this work to extend it to personalized medicine and diagnostics applications. These could include predicting the time of disease onset, its severity, and when specific symptoms are likely to develop. In addition, the researchers seek to extend the method to consider protein and epigenetic data, information not conveyed by the genomic sequences alone. Ultimately, gVAMP's potential in personalized medicine applications could also help clinicians select targeted patient profiles for clinical trials.

But the method could even find other applications, according to Depope.

"I think our algorithm might also be useful in forensics to predict a suspect's height from the DNA found on a crime scene," he says.

Publication:

Al Depope, Jakub Bajzik, Marco Mondelli, and Matthew R. Robinson. 2026. Joint modelling of whole genome sequence data for human height via approximate message passing. Cell Genomics. DOI: 10.1016/j.xgen.2026.101162

Funding information

This project was supported by funding from a Lopez-Loreta Prize, an SNSF Eccellenza Grant (PCEGP3-181181), an ERC Starting Grant (INF2, project number 101161364), and by core funding from ISTA. High-performance computing was supported by the Scientific Service Units (SSU) of ISTA through resources provided by Scientific Computing (SciComp).
