Computer scientists at ETH Zurich have developed a digital tool capable of searching through millions of published DNA records in a matter of seconds. This can significantly accelerate research into antibiotic resistance and unknown pathogens.
In brief
- "MetaGraph", a new ETH tool, enables fast searching of DNA sequences: efficiently, accurately and at low cost.
- To achieve this, the researchers use indices that structure large data volumes better and make them easy to search.
- As an open-source tool, MetaGraph is freely accessible, offering a wide range of potential applications.
DNA sequencing revolutionised biomedical research decades ago: it allows rare hereditary diseases to be identified in patients and specific mutations to be detected in tumour cells. In recent years, new sequencing methods (next-generation sequencing) in particular have resulted in numerous scientific breakthroughs. In 2020/2021, for example, they enabled the rapid decoding and global monitoring of the SARS-CoV-2 genome.
Meanwhile, more and more researchers are making the results of sequenced DNA publicly available. This has produced huge volumes of data, which are stored in central databases such as the American SRA (Sequence Read Archive) or the European ENA (European Nucleotide Archive). Around 100 petabytes of data are stored there - roughly the same amount as all the text on the internet, one petabyte being the equivalent of one million gigabytes.
To date, biomedical scientists have needed massive computing power and other resources to search through this volume of DNA sequences and compare them with their own, which made efficient searching in such mountains of data practically impossible. Computer scientists at ETH Zurich have now solved this problem.
Full-text search instead of downloading entire data sets
The scientists have developed a method that greatly shortens and simplifies this search. The "MetaGraph" digital tool searches the raw data of all DNA or RNA sequences stored in the databases - just like a conventional internet search engine. After entering a sequence of interest as full text into a search field, researchers can find out within seconds or minutes, depending on the query, where it has already appeared.
"It's a kind of Google for DNA," says Professor Gunnar Rätsch, data scientist at the Department of Computer Science at ETH Zurich. Until now, researchers had to rely on descriptive metadata to search the databases. To access the raw data, they had to download the respective data sets. These searches were incomplete, time-consuming and expensive.
"MetaGraph" is comparatively inexpensive, as the researchers state in their study. The representation of all public biological sequences would fit on a few computer hard drives, while larger queries should cost no more than 0.74 dollars per megabase.
Because the DNA search engine developed by the ETH researchers is also precise and efficient, it can help to accelerate genetic research - for example, in the case of little-researched pathogens or new pandemics. The tool could thus become a catalyst for research into antibiotic resistance, for example by identifying resistance genes or bacteriophages (useful viruses that can destroy bacteria) in the databases.
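To put the quoted cost ceiling of 0.74 dollars per megabase into perspective, the following minimal sketch estimates an upper bound for a query of a given size. It assumes that cost scales linearly with query length and that one megabase equals one million bases. The function and constant names are hypothetical and are not part of MetaGraph.

```python
# Rough upper-bound cost estimate for a sequence query.
# Assumptions (not from the study): cost scales linearly with query size,
# and one megabase is exactly 1,000,000 bases.

USD_PER_MEGABASE = 0.74          # ceiling quoted in the article
BASES_PER_MEGABASE = 1_000_000


def estimate_max_query_cost(num_bases: int) -> float:
    """Return an upper-bound cost in US dollars for querying num_bases bases."""
    if num_bases < 0:
        raise ValueError("num_bases must be non-negative")
    return num_bases / BASES_PER_MEGABASE * USD_PER_MEGABASE


# Hypothetical query sizes, chosen only for illustration.
for n in (1_000_000, 50_000_000):
    print(f"{n:,} bases: at most about ${estimate_max_query_cost(n):.2f}")
```

Under these assumptions, a query of one megabase would cost at most 0.74 dollars; actual costs depend on the query and the infrastructure used.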
Compression by a factor of 300
In the study published on 8 October 2025 in the journal Nature, the ETH researchers demonstrate how MetaGraph works: the tool indexes the data and presents it in compressed form. This is achieved by way of complex mathematical graphs that improve the structure of the data - similar to spreadsheet programmes such as Excel. "Mathematically speaking, it is a huge matrix with millions of columns and trillions of rows," says Rätsch.
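The general idea of indexing sequences so that they can be searched without scanning the raw data can be illustrated with a toy example. The sketch below is not MetaGraph's actual data structure or API. It swaps in a plain inverted index that maps short fixed-length subsequences (k-mers) to the labels of the samples containing them, and it applies no compression. All function names, sample labels and sequences are made up for illustration.

```python
from collections import defaultdict


def kmers(sequence: str, k: int):
    """Yield every overlapping substring of length k."""
    for i in range(len(sequence) - k + 1):
        yield sequence[i:i + k]


def build_index(samples: dict[str, str], k: int) -> dict[str, set[str]]:
    """Map each k-mer to the set of sample labels in which it occurs."""
    index = defaultdict(set)
    for label, seq in samples.items():
        for kmer in kmers(seq, k):
            index[kmer].add(label)
    return dict(index)


def search(index: dict[str, set[str]], query: str, k: int,
           min_fraction: float = 1.0) -> dict[str, float]:
    """Return samples containing at least min_fraction of the query's k-mers."""
    query_kmers = list(kmers(query, k))
    if not query_kmers:
        return {}  # query is shorter than k
    hits = defaultdict(int)
    for kmer in query_kmers:
        for label in index.get(kmer, ()):
            hits[label] += 1
    total = len(query_kmers)
    return {label: count / total
            for label, count in hits.items()
            if count / total >= min_fraction}


# Hypothetical toy data: three tiny "samples".
samples = {
    "sample_A": "ACGTACGTGA",
    "sample_B": "TTGACGTTAC",
    "sample_C": "GGGGCCCC",
}
index = build_index(samples, k=4)
print(search(index, "ACGTGA", k=4))                    # exact matches only
print(search(index, "ACGTGA", k=4, min_fraction=0.3))  # partial matches too
```

Each lookup touches only the index entries for the query's k-mers, never the full sequences. That is why an index makes searching fast, and why keeping the index itself compact matters at the scale of public sequence archives.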
The idea of making large amounts of data searchable with the help of indices is standard practice in computer science. What is new about the work of the ETH researchers, however, is the complex linking of raw data and metadata and the compression by a factor of about 300. This is similar to a book summary: it no longer contains every word, but all the main storylines and connections remain intact - more compact, yet without any relevant loss of information.
"We are pushing the limits of what is possible in order to keep the data sets as compact as possible without losing necessary information," says Dr André Kahles, who, like Rätsch, is a member of the Biomedical Informatics Group at ETH Zurich. In contrast to other DNA search tools currently being researched, the ETH researchers' approach is scalable: the larger the amount of data queried, the less additional computing power the tool requires.
Half of the data is already available now
The ETH researchers first presented MetaGraph in 2020 and have been continuously improving it ever since. The tool is already available for queries. It provides a full-text search engine for millions of sequence sets from DNA and RNA, as well as proteins from viruses, bacteria, fungi, plants, animals and humans. At present, just under half of the sequence data sets available worldwide are indexed. According to Gunnar Rätsch, the rest should follow by the end of the year. Given that MetaGraph is available as open source, it could also be of interest to pharmaceutical companies that have large amounts of internal research data.
Kahles even believes it is possible that the DNA search engine will one day be used by private individuals: "In the early days, even Google didn't know exactly what a search engine was good for. If the rapid development in DNA sequencing continues, it may become commonplace to identify your balcony plants more precisely."
References
Karasikov, M., Mustafa, H., Danciu, D., Kulkov, O., Zimmermann, M., Barber, C., Rätsch, G., & Kahles, A.: Efficient and accurate search in petabase-scale sequence repositories. Nature 2025, doi: 10.1038/s41586-025-09603-w