Machine learning allows for ‘ultra-fast,’ ‘highly accurate’ classification of COVID-19 virus genomes

Kathleen Hill

Using machine learning, a team of computer scientists and biologists from Western University identified an underlying genomic signature for 29 different COVID-19 virus DNA sequences.

This new data discovery tool will allow researchers to quickly and easily classify a deadly virus like COVID-19 in just minutes – a pace of high importance for strategic planning and for mobilizing medical resources during a pandemic.

The study also supports the scientific hypothesis that the COVID-19 virus (SARS-CoV-2) originated in bats as a Sarbecovirus, a subgroup of Betacoronavirus.

The findings were published today in PLOS ONE.

The “ultra-fast, scalable, and highly accurate” classification system, driven by machine learning, uses new graphics-based, specialized software and a decision tree approach to illustrate the classification and arrive at the best choice out of all tested possible outcomes.

Kathleen Hill, an Associate Professor in Western’s Department of Biology, co-led the study with collaborators from the Departments of Computer Science and Statistical and Actuarial Sciences at Western and from the University of Waterloo’s Department of Computer Science.

The machine learning method achieves 100 per cent accurate classification of the COVID-19 virus sequences and, more importantly, discovers the most relevant relationships among more than 5,000 viral genomes, again within minutes.

“All we needed was the COVID-19 virus DNA sequence to discover its own intrinsic sequence pattern. We used that signature pattern and a logical approach to match that pattern as close as possible to other viruses and achieved a fine level of classification in minutes – not days, not hours but minutes,” says Hill.
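The idea Hill describes, deriving a pattern from the sequence itself and matching it to the closest known viruses, can be illustrated with a small sketch. This is not the study's software: the k-mer frequency signature, the Euclidean nearest-match rule, the function names and the toy sequences are all assumptions chosen for illustration.

```python
from collections import Counter
from itertools import product
import math


def kmer_signature(seq, k=3):
    """Return the normalized frequency of every length-k word over A, C, G, T.

    Illustrative stand-in for a "genomic signature"; the study's actual
    representation may differ.
    """
    seq = seq.upper()
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts[m] for m in kmers)
    if total == 0:
        return [0.0] * len(kmers)
    return [counts[m] / total for m in kmers]


def distance(a, b):
    """Euclidean distance between two signatures of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def classify(query, references, k=3):
    """Return the label of the reference sequence closest to the query.

    `references` maps a group label to a list of sequences (hypothetical data).
    """
    q = kmer_signature(query, k)
    best_label, best_dist = None, math.inf
    for label, seqs in references.items():
        for s in seqs:
            d = distance(q, kmer_signature(s, k))
            if d < best_dist:
                best_label, best_dist = label, d
    return best_label


# Toy sequences, not real viral genomes.
references = {
    "group_A": ["ACGTACGTACGTACGT"],
    "group_B": ["TTTTGGGGTTTTGGGG"],
}
result = classify("ACGTACGAACGTACGT", references)
print(result)
```

Because the signature is computed from the sequence alone, no alignment against other genomes is needed in this sketch, which is one reason approaches of this kind can run quickly.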

This classification tool has already been used to analyze more than 5,000 unique viral genomic sequences, including the 29 COVID-19 virus sequences available on January 27, 2020.

Hill believes the tool, which is able to classify any newly discovered virus sequence, COVID-19 or otherwise, will be an essential component in the toolkit for vaccine and drug developers, frontline healthcare workers, researchers and scientists during this global pandemic and beyond.
