An international team led by researchers at the University of California San Diego and the University of California, Riverside has developed a free, web-based platform designed to make public metabolomics data more accessible. By allowing users to search for chemical structures across billions of chemical spectra (the unique signatures of molecules) spanning thousands of studies, the tool has the potential to make "big-data" metabolomics as straightforward as a standard internet search. It can be used to discover new metabolites, track drug exposures and connect specific molecules to diseases or environmental sources. The study was published in Nature Biotechnology.
Metabolomics is the large-scale study of small molecules (metabolites like amino acids and lipids) that are the end products of cellular processes. It provides a holistic snapshot of what is happening inside a cell, tissue, organ or entire organism, including biochemical changes driven by genetics, diet, environmental factors or disease.
Until now, searching for specific molecules in public repositories required expert knowledge and was limited to isolated datasets. The new tool, called StructureMASST, enables researchers, clinicians and even the public to type in a chemical name, a SMILES string (a line of text that encodes a molecule's structure) or a sub-structure pattern to instantly locate where those molecules have been documented across human, animal, plant and environmental samples, from recently extinct animals and long-dead dinosaurs to microbial communities on the International Space Station.
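To give a sense of what a SMILES string looks like, the sketch below tallies the non-hydrogen atoms in the SMILES for caffeine. It is only an illustration of the notation, not part of StructureMASST: the function name `heavy_atom_counts` is made up for this example, and the parser deliberately handles only simple, unbracketed atoms.

```python
import re
from collections import Counter

# Atoms that SMILES allows without brackets. Two-letter symbols come first
# so that "Cl" is not misread as carbon. Lowercase letters are aromatic atoms.
ATOM_PATTERN = re.compile(r"Cl|Br|[BCNOPSFI]|[bcnops]")


def heavy_atom_counts(smiles: str) -> Counter:
    """Count non-hydrogen atoms in a simple SMILES string (illustrative only)."""
    if "[" in smiles:
        # Bracket atoms (charges, isotopes, metals) are outside this sketch.
        raise ValueError("bracket atoms are not handled in this sketch")
    return Counter(token.capitalize() for token in ATOM_PATTERN.findall(smiles))


# SMILES for caffeine (molecular formula C8H10N4O2).
CAFFEINE = "CN1C=NC2=C1C(=O)N(C(=O)N2C)C"

counts = heavy_atom_counts(CAFFEINE)
print(dict(counts))
```

Digits in the string mark ring closures and parentheses mark branches, which is how a single line of text can describe a ring system like caffeine's. Hydrogens are implied rather than written out.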
"It will tell you what organs it's found in, which organisms can produce it, what health conditions it's associated with, and what molecules are connected to it," said senior author Pieter C. Dorrestein, PhD, professor at UC San Diego Skaggs School of Pharmacy and Pharmaceutical Sciences and the Departments of Pharmacology and Pediatrics at UC San Diego School of Medicine.
StructureMASST leverages a massive knowledge base that integrates data from all of the major public metabolomics repositories. To make the data easily searchable, the researchers used indexing technology to tag each chemical spectrum from the repositories with its known associations when available, similar to how web search engines work. Tags include the organism (e.g., human, mouse, bacterium), health condition or disease (e.g., inflammatory bowel disease, diabetes, Alzheimer's), sample type (e.g., blood, saliva, soil), geography and environment (e.g., urban vs. rural, marine, soil), gender/sex, and experimental design (e.g., control vs. treatment, dose, time point, disease stage).
"Search engines allow you to input text and quickly retrieve all the information associated with it because the entire worldwide web has been indexed," said Dorrestein, who also directs the UC San Diego Collaborative Mass Spectrometry Innovation Center. "We do essentially the same thing that these web search engines have done, but for molecules."
And like a search engine, indexing enables queries that return results in seconds or a few minutes, a small fraction of the time other methods take. Indexing also makes it possible to search by disease. For example, a search for Alzheimer's disease would retrieve every spectrum linked to the condition across all repositories.
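The general idea of tag-based indexing can be sketched in a few lines of Python. This is a toy illustration, not the StructureMASST implementation: the spectrum IDs, records and function names below are all invented, and the real system indexes billions of spectra with far richer metadata.

```python
from collections import defaultdict

# Hypothetical records: every ID and tag value here is invented for illustration.
SPECTRA = [
    {"id": "spec-001", "organism": "human", "condition": "Alzheimer's disease", "sample_type": "blood"},
    {"id": "spec-002", "organism": "mouse", "condition": "Alzheimer's disease", "sample_type": "brain"},
    {"id": "spec-003", "organism": "human", "condition": "diabetes", "sample_type": "blood"},
]


def build_index(records):
    """Map each (field, value) tag to the set of spectrum IDs that carry it."""
    index = defaultdict(set)
    for record in records:
        for field, value in record.items():
            if field != "id":
                index[(field, value.lower())].add(record["id"])
    return index


def search(index, **tags):
    """Return the sorted IDs of spectra that match every requested tag."""
    id_sets = [index.get((field, value.lower()), set()) for field, value in tags.items()]
    return sorted(set.intersection(*id_sets)) if id_sets else []


index = build_index(SPECTRA)
by_disease = search(index, condition="Alzheimer's disease")
human_blood = search(index, organism="human", sample_type="blood")
print(by_disease)
print(human_blood)
```

Because the index is built once in advance, each query only looks up and intersects precomputed sets instead of rescanning every record, which is why indexed searches stay fast as the data grows.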
After building StructureMASST, the researchers put the molecular search engine through its paces with real-world examples, including well-known compounds, natural products and pharmaceuticals:
- Caffeine: A single query using the molecular structure of caffeine returned more than 6,000 spectra files, detecting the stimulant not just in samples from coffee plants but also in human blood, milk and even microbial cultures.
- Environmental exposure: The tool revealed that the environmental metabolite surfactin, produced by Bacillus subtilis bacteria, is more common in people living in remote, traditional villages than in urban populations, highlighting how lifestyle and environment shape the human metabolome.
- Bacterial siderophores: Sub-structure searches revealed that iron-scavenging compounds produced by certain bacteria are present in human patients with chronic conditions like cystic fibrosis and rheumatoid arthritis, suggesting that these molecules may play a role in immune regulation or trigger opportunistic infections within the human body.
- Drug distribution: Using the tool to track the cardiac drug amiodarone and its metabolites across dozens of human tissues provided a detailed view of drug exposure and metabolism that could inform safety monitoring.
In addition to its search capabilities, StructureMASST includes built-in quality control features that flag erroneous data in public libraries that could otherwise lead to false conclusions. It is also being continuously updated as the scientific community contributes new information.
By transforming massive, publicly deposited molecular data into practical insights, StructureMASST could become an essential tool for advancing medicine, basic biology and environmental science. It will help generate hypotheses, uncover new information about metabolism, and speed up the discovery of molecular biomarkers of disease and therapeutic targets.
Additional co-authors on the study include: Yasin El Abiead, Jeong In Seo, Vincent Charron-Lamoureux, Wilhan Donizete Goncalves Nunes, Haoqi Nina Zhao, Kine Eide Kvitne, Simone Zuffa, Helena Mannochio-Russo, Harsha Gouda, Abubaker Patan, Shipei Xing, Jasmine Zemlin, Ipsita Mohany, Julius Agongo, Andres Mauricio Caraballo Rodriguez, Victoria Deleray, Jeremy Carver, Lindsey A. Burnett, Eoin Fahy and Shankar Subramaniam at UC San Diego; Michael Strobel, Mingxun Wang and Daniel Petras at UC Riverside; Cristina Bez at the International Center for Genetic Engineering and Biotechnology; Abzer K. Pakkir Shah at University of Tübingen; Jarmo-Charles Kalinski at Rhodes University; Nikiforos Alygizakis at Environmental Institute (Slovak Republic); and Ozgur Yurekten, Thomas Payne and Juan Antonio Vizcaíno at the European Bioinformatics Institute.
Disclosures: Dorrestein is an advisor to and holds equity in Cybele, BileOmix and Sirenas. He is also a scientific co-founder of, an advisor to, and holds equity in and/or has received income from Ometa, Enveda and Arome, with prior approval by UC San Diego. He also consulted for DSM Animal Nutrition & Health in 2023.
The study was funded, in part, by the Chan Zuckerberg Initiative (grant 2024-350548), the National Research Foundation of Korea (grant RS-2025-02373133), the National Institutes of Health (NIH) (grants K99ES037746, 5U24DK133658, 2R01GM107550, U24DK141185 and U2CDK119886) and, jointly, the National Science Foundation (NSF) and the United Kingdom's Biotechnology and Biological Sciences Research Council (BBSRC) (award 2152526).