Algorithm Breakthrough to Transform Biomedical Research

King’s College London

New approach solves one of computing's most enduring problems up to thirty times quicker and four times more efficiently than current methods.

Algorithmic Governance (2)

A new algorithm is solving one of computing's most enduring problems up to thirty times quicker and four times more efficiently than straightforward methods, in what could be a step change for virus classification and biomedical research.

Presented at IEEE ICDE (International Conference on Data Engineering) 2026, researchers from King's College London and their collaborators introduced Splitting-Counting-Merging (SCM), an algorithmic approach to the Subtree Mode Problem, a specialised version of the well-known Range Mode problem in computer science.

The only practical solution of its kind in terms of time and computational power, SCM also allows researchers to generate a similarity score between a viral sequence of unknown type and a dataset of sequences of a known virus. This could help biomedical professionals identify new strains of disease accurately and efficiently.

High quality data is at the heart of the AI revolution, and to maintain accurate models integrated within our society, we'll need more data to train them. By going beyond the 10,000 data point cap that previous approaches couldn't practically solve, we're setting up future researchers to be able to push discovery and efficiency across a realm of domains."

Dr Grigorios Loukides

The mode is the most frequent value in a collection of values and is the average used when large outliers could throw out a more traditional average, such as the mean. If you wanted to find out the mode of a sub-tree of values in a wider collection of hierarchical data, this would be a Subtree Mode problem, which is the problem variant introduced in the SCM work.

Hierarchical tree of influenza
This is a typical, biological example of a hierarchically structured dataset necessary to find the Subtree Mode. The black dots, or internal nodes, represent groups of patients with different types of influenza while the coloured outer nodes, or leaves, represent individuals and their strains of the influenza virus. The Subtree Mode aims to find for each internal node the colour of its leaves that appears most often.

This type of hierarchical data organisation is common in areas such as information retrieval for search engines and biological applications like genome sequencing or virus classification.

These datasets can contain billions of values, which existing techniques of finding the Subtree Mode based on Range Mode are unequipped to handle. These methods are based on operations which require impractically high levels of computing power and several hours to process datasets over 10,000 values.

By taking a novel approach to breaking down the tree-like structure of the dataset as opposed to its content, SCM tackles datasets of up to 7.3 billion values to find the Subtree Mode at thirty times the speed and four times the computational efficiency of the Range Mode based techniques. The algorithm is also domain agnostic, meaning it can be applied readily to scenarios as diverse as data mining and virus classification.

Author Dr Grigorios Loukides, Senior Lecturer in Computer Science at King's College London, said "The computational problems facing information retrieval and virus classification are fundamentally, very similar - what value is most seen in a particular range; be they keywords like 'football' in a Google search or a particular strand of DNA in part of a virus. This allows our approach to assist the growing interdisciplinarity between software engineers and biomedical professionals.

What excites me most is turning a process that took hours into seconds. With SCM, annotating phylogenetic trees, which are essential for tracking disease evolution and resistance, becomes nearly instantaneous. That speed means we can identify new viral strains and adapt treatments before an outbreak gets ahead of us."

Dr Eileen Guan, biomedical researcher based at King's

"But what is important here is scale. High quality data is at the heart of the AI revolution, and to maintain accurate models integrated within our society, we'll need more data to train them. By going beyond the 10,000 data point cap that previous approaches couldn't practically solve, we're setting up future researchers to be able to push discovery and efficiency across a realm of domains."

In classical classification approaches, a researcher would need to have at least two different species of virus and say what another sequence is more like. Using samples of COVID-19 as a test bed, the team found that with SCM they could provide an estimate of how likely a single sample is to be of this virus. By helping scientists in this task, this approach could help researchers develop new treatments or retool exist ones quicker.

The team also hope that the approach could accelerate biomedical research by recognising patterns in diagrams called annotated phylogenetic trees.

In medical research, the sequence of DNA for individuals or groups are often organised. These tree-like diagrams are then overlayed with information on factors which might impact disease outcomes, like age, ethnicity, location, or antibiotic resistance.

Annotated phylogenetic tree
This annotated phylogenetic tree reflects antibiotic resistant status and how that interacts with single positional changes in DNA.

These would be used to spot patterns, such as how a disease might impact a certain branch of people more or evolve over time. However, these important tools have so far only been created by hand because of an inability to solve the Range Mode in a large dataset. With SCM, phylogenetic tree annotation that would have taken hours can now be done in seconds.

Dr Eileen Guan, a biomedical researcher based at King's and co-author off the paper, said "What excites me most is turning a process that took hours into seconds. With SCM, annotating phylogenetic trees, which are essential for tracking disease evolution and resistance, becomes nearly instantaneous. That speed means we can identify new viral strains and adapt treatments before an outbreak gets ahead of us."

/Public Release. This material from the originating organization/author(s) might be of the point-in-time nature, and edited for clarity, style and length. Mirage.News does not take institutional positions or sides, and all views, positions, and conclusions expressed herein are solely those of the author(s).View in full here.