The U.S. National Science Foundation (NSF) National Synthesis Center for Emergence in the Molecular and Cellular Sciences (NCEMS) at Penn State has awarded four researchers from across the world for their efforts in building machine learning-based approaches to streamlining publicly available datasets for reuse. The researchers - two individuals and one team of two - won a competition hosted by the center and designed to showcase data harmonization solutions.
"Harmonizing the Data of your Data," was launched by NCEMS as part of the center's larger aim to leverage the power of artificial intelligence (AI) and data science to harmonize publicly available datasets and gain new insights into molecular and cellular biology.
Working groups at NCEMS were challenged to streamline public databases that stored mass spectrometry proteomics data - measurements of proteins, their content and abundance - as well as the associated metadata, or specific details that provide context about the stored data.
"Modern proteomics experiments generate enormous amounts of data, but those data are much more useful when they come with clear contextual information: what was measured, under which conditions, from which samples and how the experiment was performed. Much of that information is currently hidden in scientific papers or recorded in inconsistent ways," said Wout Bittremieux, assistant research professor from the University of Antwerp who is leading the NCEMS working group tackling this problem. The working group includes Shomir Wilson, associate professor of human-centered computing and social informatics in the College of Information Sciences and Technology, and Iddo Friedberg, associate professor of veterinary microbiology and preventative medicine at Iowa State University.
The working group built a hybrid tool that could take in raw data, extract new information from text using large language models, reprocess the information and harmonize the data products within the database using AI agents, making it more accessible for researchers to reuse, requantify and reanalyze.
"All cells have proteins in them with differing levels of abundance under different conditions," said Ian Sitarik, Institute for Computational and Data Sciences (ICDS) Research Innovations with Scientists and Engineers (RISE) bioinformatician and computational biologist. "This mass spectrometry data allows researchers to get an estimation of the protein abundance level and how it changes under different conditions, for example, looking to see which proteins abundance changes when the organism has a particular disease."
With a mission of supporting community-scale research, NCEMS tasked computer science-focused researchers beyond the center with the same challenge via Kaggle, the online platform owned by Google where scientists, engineers and researchers host competitions for individuals or teams interested in tackling challenges in data science and machine learning.
"When public institutions are creating databases, they often don't have the support to maintain the databases for more than a couple of years, leading to a hodgepodge of partial and inconsistently structured resources," Sitarik said. "Proteomics mass spectrometry databases are no different, and we wanted to gain community insights into the best methods to standardize, enhance and harmonize this metadata using both advances in machine learning and reanalysis of the raw spectral data. This could lead to hypothesis generation, new research insights, finding common themes and more that would not have been possible otherwise."
The competition had over 5,000 submissions, 717 entrants, 270 participants and 250 team submissions overall. Out of more than 20 complete submissions to the competition, four researchers were awarded support. First place received $5,000, second place received $3,000 and third place received $1,000.
The awardees include:
First place: Jason Karpeles, senior principal data scientist at PMG
Second place: Juan Gil, software engineer, and Kimberly Duran, quality and continuous improvement analyst, both of Prometio Group
Third place: Rohan Vinaik, research associate at StemCellerant
"This competition was very successful," Sitarik said. "The unique approaches the contestants came up with to tackle this metadata harmonization problem allows us to build a more robust and high-quality resource for the whole scientific community to use."
NCEMS is a center supported by the NSF and Penn State's ICDS and Huck Institutes of the Life Sciences.