MGnify Genomes offers new ways to explore microbial communities in soil, marine sediments, and the gut

If you could shrink yourself to the size of a microbe, you would discover that every handful of soil and every bucket of seawater is akin to a bustling city filled with millions of microbes. Until recently, most of these 'microbial metropolises' were impossible to map because the majority of microbes cannot be cultured in the lab.
This makes it extremely challenging for researchers to reconstruct microbial communities living in specific environments such as soil, water, or the gut. The rise of metagenomic data and advances like MGnify Genomes are shedding light on microbial communities that were previously inaccessible.
What is metagenomics?
Metagenomics is the study of collective genetic material, also known as the metagenome, sampled from specific environments.
EMBL-EBI's MGnify is one of the world's largest resources for microbiome data analysis. It also contains richly annotated, microbe-derived genomes, a treasure trove for scientists working in antimicrobial resistance, agritech, or biodiversity.
Lorna Richardson (LR), Coordinator for Microbiome Informatics, and Tatiana Gurbich (TG), Microbial Genomes Project Lead, both part of the MGnify team at EMBL-EBI, share their insights on how metagenomics is helping us understand microbes as individuals and as part of their environments.
Can you tell us about the biome catalogues available in MGnify?
TG: The biome catalogues in MGnify are collections of microbial genomes compiled to represent the known microbes in specific environments, such as ocean sediment, soils, or the digestive systems of humans and animals.
The catalogues include genomes assembled from metagenomic samples, as well as genomes from microbes grown and sequenced individually, which we refer to as isolates. Catalogues also contain information about predicted proteins and genes, along with descriptions of their likely functions, for example, involvement in antimicrobial resistance (AMR).
For species represented by multiple genomes, we generate pangenomes, which can be a useful resource for understanding gene prevalence and diversity within a species.
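To make the idea of gene prevalence concrete, here is a minimal Python sketch. It is not MGnify's pangenome workflow: the gene names, the genome identifiers, and the 90% cut-off separating "core" from "accessory" genes are all illustrative assumptions.

```python
from collections import Counter

def gene_prevalence(genomes):
    """Return, for each gene, the fraction of genomes in which it occurs."""
    counts = Counter(gene for genes in genomes.values() for gene in set(genes))
    total = len(genomes)
    return {gene: count / total for gene, count in counts.items()}

def split_core_accessory(prevalence, core_threshold=0.9):
    """Split genes into core and accessory sets.

    core_threshold is an assumed, illustrative cut-off.
    """
    core = sorted(g for g, p in prevalence.items() if p >= core_threshold)
    accessory = sorted(g for g, p in prevalence.items() if p < core_threshold)
    return core, accessory

# Hypothetical gene content for four genomes of the same species.
species_genomes = {
    "genome_1": {"dnaA", "gyrB", "blaTEM"},
    "genome_2": {"dnaA", "gyrB"},
    "genome_3": {"dnaA", "gyrB", "tetM"},
    "genome_4": {"dnaA", "gyrB", "blaTEM"},
}

prevalence = gene_prevalence(species_genomes)
core, accessory = split_core_accessory(prevalence)
print("core:", core)
print("accessory:", accessory)
```

In this toy example the two genes found in every genome end up in the core set, while the genes carried by only some genomes are accessory. That is the kind of within-species diversity a pangenome is meant to expose.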
Why are biome catalogues interesting?
LR: Many microbes found in metagenomic samples are hard to grow in the lab. As a result, they may not be represented in traditional genomic databases such as EMBL-EBI's Ensembl. Because metagenomics enables us to reconstruct microbial genomes directly from environmental DNA, we can build biome catalogues that better represent the diversity of microbes found in metagenomic samples.
In nature, microbes live in communities. Scientists can gain new insights by analysing them in isolation as well as within their community. Metagenomics and the MGnify biome catalogues help scientists infer which microbes show up in a sample, how abundant they are, which species coexist, and what the individuals can do.
What do MGnify biome catalogues enable scientists to do?
TG: The MGnify biome catalogues can act as reference datasets. If you have a soil sample that you want to contextualise, you can compare it to relevant MGnify genome catalogues. For example, you can use catalogues to identify what taxa are contained in your metagenomic sample and which gene functional categories are represented. Or, if you're generating your own metagenome-assembled genomes (MAGs), you can use catalogues to check if you have anything novel. If the genome has already been assembled before, MGnify catalogues can show you what other environments the species has been seen in. All catalogue data can be downloaded for further exploration.
Who uses the MGnify biome catalogues?
TG: Last year, over 12,000 people from around the world accessed the Genomes section of MGnify. This is a conservative estimate. Some catalogues are of particular interest for agricultural research, where public data are quite scarce.
What biome catalogues are currently available?
LR: The resource is steadily growing. Right now, MGnify represents 18 biomes and over half a million genomes in total. We have several catalogues from human-associated biomes, the largest and most widely used being the human gut catalogue, which contains nearly 300,000 genomes.
We have genomes assembled from rhizosphere soil samples for tomato, corn and barley cultures, which were generated by the Horizon2020 FindingPheno project. The goal of the project is to understand what drives desirable crop traits and use the information to improve crops. The data generated by the project are openly available.
We also have marine water and marine sediment catalogues developed for the BlueRemediomics project, which harnesses marine microbes to develop high-value, sustainable products and services. There are also several animal-associated catalogues, such as for the pig gut and the cow rumen. We even have a honeybee gut catalogue.
How do you decide what biomes to create catalogues for?
LR: It depends entirely on what data are available. Scientists usually publish the raw genetic data from their studies in public databases such as EMBL-EBI's European Nucleotide Archive. The MGnify team can then use these raw data, as well as any MAGs shared by scientists, to generate a biome catalogue. Often this is done as part of a project EMBL is involved in; a recent example is the HoloFood project, which explored gut microbiomes in farmed animals.
Scientists all over the world are generating MAGs at scale. We always encourage researchers to make data publicly available so the catalogues can best represent all this knowledge, and so the community can benefit from it.
We don't create biome catalogues as a standard request-based service yet, but anyone who has an interesting MAG dataset can reach out to us by using the MGnify support form to see if we can turn it into a catalogue.
How do you develop biome catalogues?
TG: We start with sequencing data from metagenomic samples of an environment, such as the soil around corn roots, known as the maize rhizosphere, or with genomes that have already been generated from that environment. In the case of raw metagenomic data, we assemble the sequences and generate MAGs.
We process the genomes using a pipeline that performs quality control, clusters the genomes at the species level, selects the best genome as the species representative, and annotates it with functional information about the genes and proteins.
We also gather all proteins and all genes into biome-specific protein and gene catalogues. For each genome in the catalogue, we provide extensive metadata, such as the sample the sequences were taken from, the geographic location, and the quality of the assembly.
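The processing steps described above (quality control, species-level clustering, and choosing a representative) can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the MGnify pipeline itself. The quality thresholds, the scoring formula, the 95% average nucleotide identity (ANI) species cut-off, and the precomputed ANI values are all assumed for the example, and the annotation step is left out.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Genome:
    accession: str        # hypothetical identifier
    completeness: float   # percent, 0-100
    contamination: float  # percent, 0-100

def passes_qc(genome, min_completeness=50.0, max_contamination=5.0):
    """Quality control with assumed, illustrative thresholds."""
    return (genome.completeness >= min_completeness
            and genome.contamination <= max_contamination)

def quality_score(genome):
    """Assumed score: reward completeness, penalise contamination."""
    return genome.completeness - 5 * genome.contamination

def cluster_by_species(genomes, ani, threshold=95.0):
    """Greedy clustering on pairwise ANI (percent).

    Genomes are visited from best to worst quality, so the first
    member of each cluster is also its best-scoring genome.
    """
    clusters = []
    for genome in sorted(genomes, key=quality_score, reverse=True):
        for cluster in clusters:
            pair = frozenset((genome.accession, cluster[0].accession))
            if ani.get(pair, 0.0) >= threshold:
                cluster.append(genome)
                break
        else:
            clusters.append([genome])
    return clusters

# Hypothetical input genomes and pairwise ANI values.
genomes = [
    Genome("MAG_A", 98.0, 1.0),
    Genome("MAG_B", 90.0, 2.0),
    Genome("MAG_C", 75.0, 0.5),
    Genome("MAG_D", 40.0, 1.0),  # fails the completeness threshold
]
ani = {
    frozenset(("MAG_A", "MAG_B")): 97.2,
    frozenset(("MAG_A", "MAG_C")): 82.0,
    frozenset(("MAG_B", "MAG_C")): 81.5,
}

kept = [g for g in genomes if passes_qc(g)]
clusters = cluster_by_species(kept, ani)
representatives = [cluster[0].accession for cluster in clusters]
print(representatives)
```

Sorting by quality before clustering is a deliberate shortcut: it means the species representative falls out of the clustering step without a separate selection pass. A production pipeline would compute ANI with a dedicated tool rather than read it from a lookup table.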
How do you update the catalogues to include the latest information?
TG: We have a workflow for adding or updating genomes as new data become available. Over time, the biome catalogues will continue to grow as scientists generate more complete genomes, helping us build an ever clearer map of the microbial life in environments that matter to science and society.
Through our biome catalogues, the 'microscopic cities' beneath our feet and within our bodies start to take shape. With each new catalogue, we move closer to understanding the microbes and the remarkable environments they create. This isn't just science for curiosity's sake; it's also essential for understanding and treating disease and developing sustainable products and services.