From the Amazon to Asia, groundbreaking research maps the microbial diversity of our guts, spotlighting the need for inclusive global data.
Study: Integration of 168,000 samples reveals global patterns of the human gut microbiome
In a recent study published in the journal Cell, researchers identified global and technical factors influencing human gut microbiome variation using a large-scale, uniformly processed dataset of 168,464 samples.
Background
The human microbiome plays a critical role in health and disease, with differences in composition linked to conditions such as colorectal cancer and inflammatory bowel disease. Variation in microbiome composition is influenced by factors such as host genetics, diet, antibiotic use, and geographic region.
Dietary habits, antibiotic consumption, and cultural practices vary globally, and each affects the gut microbiota. For example, the paper notes microbiome shifts in people who immigrated to the U.S. from places such as Thailand and Latin America. However, most research focuses disproportionately on high-income countries, leaving many populations underrepresented.
Technical factors like DNA extraction methods and primer selection further complicate analysis. Reference databases like SILVA (SILVA ribosomal RNA gene database project) are biased toward Western microbiomes, potentially underestimating diversity in underrepresented regions. Further research is essential to comprehensively understand microbiome variation and its implications for global health equity.
About the Study
The study retrieved publicly available sequencing data from the Sequence Read Archive (SRA) under the “human gut metagenome” category as of October 2021. Metadata associated with these samples was reviewed, and samples categorized as “genomic” or “metagenomic” with a “library strategy” of “amplicon” were included, totaling 245,627 samples. Further filtering removed BioProjects with errors, multiple sequencing platforms, or fewer than 50 samples, resulting in 234,875 samples from 811 BioProjects. Pyrosequencing data and samples processed with non-Illumina technologies were excluded to ensure consistency. Metadata inconsistencies, such as mislabeled sequencing instruments, were addressed to retain relevant samples.
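The selection rules in this paragraph can be sketched in a few lines of Python. This is only an illustration, not the authors' pipeline: the record keys (`bioproject`, `platform`, `library_source`, `library_strategy`) are hypothetical names chosen for the example, and the removal of BioProjects "with errors" is not modeled because the article does not say how errors were detected.

```python
from collections import defaultdict

MIN_SAMPLES = 50  # project-size threshold stated in the article
ALLOWED_SOURCES = {"genomic", "metagenomic"}


def is_amplicon_sample(sample):
    """True if a record looks like an amplicon sample (hypothetical keys)."""
    return (
        sample["library_source"].lower() in ALLOWED_SOURCES
        and sample["library_strategy"].lower() == "amplicon"
    )


def filter_bioprojects(samples):
    """Keep amplicon samples from BioProjects that use a single
    sequencing platform and contain at least MIN_SAMPLES samples."""
    by_project = defaultdict(list)
    for sample in samples:
        if is_amplicon_sample(sample):
            by_project[sample["bioproject"]].append(sample)

    kept = []
    for members in by_project.values():
        platforms = {m["platform"] for m in members}
        if len(platforms) == 1 and len(members) >= MIN_SAMPLES:
            kept.extend(members)
    return kept
```

Grouping by BioProject before applying the thresholds mirrors the order described above: individual samples are selected first, then whole projects are dropped.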
Sequencing data were downloaded using the SRA Toolkit, and paired-end and single-end reads were processed with the Divisive Amplicon Denoising Algorithm 2 (DADA2). Low-quality reads, such as those shorter than 20 nucleotides or containing ambiguous bases, were removed. Taxonomic assignments were made using the SILVA database (v138.0), with taxonomy updated to reflect the latest nomenclature changes. Filtering steps excluded samples with insufficient reads, high proportions of unassigned taxa, or excessive chimeric reads (>25% in some BioProjects).
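As a rough illustration of the read filter described above, the sketch below drops reads shorter than 20 nucleotides or containing anything other than A, C, G, or T. It is not DADA2 itself, which applies further quality-based trimming and denoising; the function names are made up for this example.

```python
MIN_LENGTH = 20                 # minimum read length mentioned in the article
VALID_BASES = frozenset("ACGT")  # anything else (e.g. N) counts as ambiguous


def passes_filter(read):
    """True if the read is long enough and has no ambiguous bases."""
    seq = read.upper()
    return len(seq) >= MIN_LENGTH and set(seq) <= VALID_BASES


def filter_reads(reads):
    """Return only the reads that pass the length and ambiguity checks."""
    return [read for read in reads if passes_filter(read)]
```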
For most samples, country and region of origin were inferred from metadata, and geographic diversity was analyzed by consolidating data into eight global regions. Regions followed the United Nations Sustainable Development Goals (SDG) groupings, which use, for example, “Eastern and South-Eastern Asia” rather than “Eastern Asia.” Taxonomic richness and microbiome variation across regions were examined.
Study Results
To generate the Human Microbiome Compendium, researchers identified 245,627 publicly available 16S rRNA gene amplicon sequencing samples from the BioSample database maintained by the NCBI. The focus was on Illumina-based assays, excluding pyrosequencing and long-read sequencing data. Using DADA2, taxonomic tables were generated for each BioProject, quantifying amplicon sequence variants (ASVs) and classifying them to the genus level based on the SILVA reference. The final dataset included 168,464 samples from 68 countries, encompassing 5.57 terabases of sequencing data processed through a uniform pipeline.
Automated annotation tools and manual curation were used to infer metadata such as country of origin, DNA extraction kits, and amplicon choice. This enabled global-scale quantification of gut microbiome composition. A filtered subset of 150,721 high-quality samples was created by removing samples with fewer than 10,000 reads and filtering out rare taxa. Bacillota (formerly Firmicutes) was identified as the most prevalent phylum, found in 99.9% of samples, followed by Pseudomonadota (formerly Proteobacteria), Actinomycetota (formerly Actinobacteria), and Bacteroidota (formerly Bacteroidetes). Alpha diversity, measured by the Shannon diversity index, showed broad variation, with a median of 2.33 and values as high as 5.07. Rarefaction analysis revealed that genus-level taxa are still being discovered, particularly in underrepresented regions.
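For readers unfamiliar with the Shannon diversity index, it is H = -Σ pᵢ ln pᵢ, where pᵢ is the share of reads assigned to taxon i. The minimal sketch below assumes the natural logarithm, which the article does not state, and uses invented counts.

```python
import math


def shannon_index(counts):
    """Shannon diversity (natural log) from a list of per-taxon read counts."""
    total = sum(counts)
    if total <= 0:
        raise ValueError("counts must contain at least one positive value")
    h = 0.0
    for count in counts:
        if count > 0:  # taxa with zero reads contribute nothing
            p = count / total
            h -= p * math.log(p)
    return h
```

Four equally abundant genera give ln 4 ≈ 1.39, and a sample dominated by a single genus gives 0. Under the same natural-log assumption, the maximum of 5.07 reported above matches the evenness of about 159 equally abundant taxa, since exp(5.07) ≈ 159.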
Geographic differences in microbiome composition were examined using metadata available for 92.4% of samples. Europe and Northern America accounted for the majority of samples (60.5%), while regions like Central and Southern Asia (3.4%) and Sub-Saharan Africa (3.7%) were markedly underrepresented. Latin America and the Caribbean exhibited the highest alpha diversity (median Shannon diversity index = 2.69), while Central and Southern Asia had the lowest (median = 1.68). Faith’s Phylogenetic Diversity (PD) analysis showed that combining taxa from underrepresented regions with those from Europe/Northern America increased evolutionary branch length by up to 68.6%. Principal coordinates analysis (PCoA) using the Aitchison distance revealed distinct clusters corresponding to world regions, underscoring the strong influence of geography on microbiome composition.
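The Aitchison distance used for the ordination is the Euclidean distance between two samples after a centered log-ratio (CLR) transform. The sketch below shows the calculation for a pair of samples. The pseudocount of 1 used to handle zero counts is an assumption made for this example, since the article does not say how zeros were treated.

```python
import math


def clr(counts, pseudocount=1.0):
    """Centered log-ratio transform: log of each value minus the mean log."""
    logs = [math.log(c + pseudocount) for c in counts]
    mean_log = sum(logs) / len(logs)
    return [value - mean_log for value in logs]


def aitchison_distance(a, b, pseudocount=1.0):
    """Euclidean distance between the CLR-transformed count vectors a and b."""
    if len(a) != len(b):
        raise ValueError("samples must cover the same taxa in the same order")
    clr_a = clr(a, pseudocount)
    clr_b = clr(b, pseudocount)
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(clr_a, clr_b)))
```

Because the CLR works on ratios between taxa, two samples with the same proportions but different sequencing depths are at distance zero when no pseudocount is needed, which is one reason this metric suits compositional data.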
Technical factors, including DNA extraction methods, bead beating (mechanical lysis), amplicon choice, and sequencing depth, were found to influence microbiome variation significantly. For example, taxa such as Enterobacter (higher in V3–V4 amplicons) and Akkermansia (higher in V4 amplicons) exhibited differential abundances depending on the hypervariable region of the 16S rRNA gene used for sequencing. The interaction between region and amplicon choice had a more substantial effect (R² = 0.010) than the amplicon alone. Regions like Latin America and Sub-Saharan Africa had the highest proportions of unidentified taxa, linked to reference database biases, suggesting undersampling and the potential for unobserved microbial diversity.
Random forest classifiers were trained to predict the geographic region of origin for individual microbiome samples. They achieved high accuracy for regions like Australia and New Zealand (AUC = 0.944), while Europe and Northern America had lower predictive accuracy (AUC = 0.797), likely due to overrepresentation creating overlapping clusters.
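The AUC values quoted here can be read as the probability that the classifier scores a randomly chosen sample from the target region higher than a randomly chosen sample from elsewhere. The sketch below computes that quantity directly from labels and scores. It illustrates the metric only and is not the study's classifier; the inputs are hypothetical.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison.

    labels: truthy for samples from the target region, falsy otherwise.
    scores: classifier scores, higher meaning more likely the target region.
    """
    positives = [s for label, s in zip(labels, scores) if label]
    negatives = [s for label, s in zip(labels, scores) if not label]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative sample")

    wins = 0.0
    for pos in positives:
        for neg in negatives:
            if pos > neg:
                wins += 1.0
            elif pos == neg:
                wins += 0.5  # ties count as half a win
    return wins / (len(positives) * len(negatives))
```

A value of 1.0 means every in-region sample outranks every out-of-region sample, and 0.5 is what random scoring would give. The pairwise loop is quadratic, so a rank-based formula would be preferable for datasets of the size described in the article.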
Conclusions
Researchers integrated data from 168,464 publicly available 16S rRNA gene amplicon sequencing samples from 482 BioProjects to study global variation in the human gut microbiome. Most samples originated from Europe and Northern America, regions so extensively sampled that most of their microbial taxa have likely already been observed, while other regions, such as Latin America and Eastern and South-Eastern Asia, exhibit remarkable diversity with many taxa still undiscovered. Each region occupies a unique niche within the ordination space, as revealed by multidimensional scaling and machine learning classification.
Significant microbiome differences were found across regions, including higher Bacteroides abundance in Europe/Northern America and elevated Prevotella in Sub-Saharan Africa and Latin America. Technical factors such as amplicon choice influenced findings, with primer bias affecting taxa like the methanogenic archaeon Methanobrevibacter. This compendium serves as a valuable resource for exploring microbiome diversity and advancing global microbial ecology research.