How To Rapidly Search the World's Microbial DNA

By making the world's microbial DNA easier to explore, LexicMap helps researchers track outbreaks, study antibiotic resistance, and understand microbial diversity

[Image: a search bar with bacterial sequences and a globe representing the world's microbial DNA]
LexicMap. Image credit: Karen Arnott/EMBL-EBI

A new sequence alignment tool, LexicMap, lets scientists search for a DNA sequence against millions of bacterial and archaeal genomes in minutes.

Open-access databases such as the European Nucleotide Archive (ENA) contain over 2.4 million bacterial genomes, and this number continues to grow rapidly. Until now, searching these vast resources has been slow and computationally demanding, limiting scientists' ability to track antibiotic resistance, study outbreaks, or explore microbial diversity.

A paper published in the journal Nature Biotechnology introduces a new algorithm called LexicMap. By using an innovative method to index genetic data, LexicMap enables researchers to quickly search for DNA sequences or mutations across the world's growing DNA databases. This opens up new opportunities in epidemiology, ecology, and evolutionary biology.

"Evolution gradually changes genes through mutation, so biologists often want to scan through all the world's DNA data to look for matches and how they differ through mutations," said Zamin Iqbal, Professor of Algorithmic and Microbial Genomics at the University of Bath and visiting Group Leader at EMBL-EBI. "As the data explosion has outstripped our algorithms, we have had to live with search engines that search a fraction of our data."

Breaking the scalability barrier

Over the last decade, the team behind LexicMap have been developing high-quality data resources for the research community and, in parallel, improved search algorithms for microbial DNA. They also work as part of a global consortium, AllTheBacteria, to assemble and annotate all 2.4 million bacterial and archaeal genomes in the ENA. LexicMap is the first alignment algorithm that can search all these data rapidly and with a low computational burden.

"Google search is a routine part of modern life, and we cannot imagine dealing with the internet without it," said Wei Shen, Associate Professor at Chongqing Medical University and former visiting scientist at EMBL-EBI. "Alignment to a DNA database is the biology equivalent of Google search, and LexicMap now makes that scalable to the full volume of global bacterial data. If you have found a new drug resistance gene, you might want to know how prevalent it is amongst bacteria, and now you can search through the world's data for it in just a few minutes."

Tracking microbial threats

By making microbial genomes easier to search, LexicMap opens up new possibilities for research and public health.

"Having the ability to search all publicly available bacterial genomes in minutes changes what's possible," said John Lees, Group Leader at EMBL-EBI. "If you're developing a new antibiotic and discover a resistance mutation, you need to know how common it is in the real world. Now, for the first time, you can search over 2 million genomes - the entire global collection - in minutes to find out."

The LexicMap tool has already been integrated into the AllTheBacteria project, which curates and indexes high-quality assemblies of all known bacterial genomes. This gives researchers an easy way to explore one of the largest collections of microbial DNA ever assembled.

EMBL Sabbatical Visitor Fellowships

During his time at EMBL-EBI, Wei Shen, the lead author on this study, received support through the EMBL Sabbatical Visitor Fellowships. These fellowships offer researchers the opportunity to spend time at EMBL, collaborate with experts, and work on projects that benefit from EMBL's world-class facilities and resources. They are designed to foster international collaboration, drive scientific innovation, and support researchers in advancing their work.

Funding

This study was supported by grants from the National Natural Science Foundation of China (82341112), Chinese Scholarship Council scholarship (202308500105 to W.S.), EMBL Visitor/Sabbatical Programme fellowship, Remarkable Innovation-Clinical Research Project, Joint Project of Pinnacle Disciplinary Group (to W.S.), and Kuanren Talents Program of The Second Affiliated Hospital of Chongqing Medical University.
