Community Approach to Fixing Biology's Big Data Problems

Berkeley Lab

As Sarai Finks was sorting through datasets of bacterial genome sequences last year, she became frustrated by missing information. Finks studies how dietary changes affect the communities of bacteria-infecting viruses that live in our guts, and how these changes in turn affect our health. She needed to know more about the microbiome samples represented in the datasets - in particular, more details about the human environments they came from. Where did the person live? What kinds of foods did they eat? What did they drink? The lack of specifics made it difficult for Finks to connect all the dots about the organisms and how they interact with other microbes as gut conditions shift due to diet.

On one hand, the mere existence of a wealth of biological data that Finks did not have to generate herself was a boon. On the other hand, the data was a mess - inconsistent and incomplete.

Finks' moment of frustration is familiar to any researcher who studies microbial communities, also known as microbiomes. Understanding microbiomes gives us insight into big topics like the origins of disease and carbon sequestration in the soil, and helps answer intriguing questions, like how life thrives in the dark depths of the ocean. To discover which organisms are in a given microbiome and what each of these inhabitants is doing, scientists gather samples and analyze the DNA, RNA, and proteins within, sometimes going as far as trying to identify every organic compound that is present. These studies generate giant datasets of molecular information and genetic sequences that all differ in their organization, style and language of notation, and underlying software, depending on the team that created them.

Other researchers may benefit greatly from this data, especially those without the time or resources to perform original sample analyses, but actually using it can feel like trying to read an encyclopedia written in a different language, with pages missing.

"Data standards in microbiome research are critical to enable cross-study comparison, share and reuse data, and to build upon existing knowledge of what microbes do in their environments and where they occur," said Emiley Eloe-Fadrosh, Program Lead of the National Microbiome Data Collaborative (NMDC). The NMDC was founded in 2019 by a diverse group of experts - with funding from the Department of Energy - to help address the ongoing data challenges through community engagement and the creation of new tools and standardized practices. The NMDC is led by Eloe-Fadrosh and scientists from Lawrence Berkeley National Laboratory (Berkeley Lab), Los Alamos National Laboratory, and Pacific Northwest National Laboratory.
