Microbial clades and genomic potential across all soils
We collated soil metagenomic sequences from a wide range of environmental conditions across global biomes (
Fig. 1;
Tables S1 and S2). Bulk density, cation exchange capacity (CEC), nitrogen (N), pH, soil organic carbon (SOC), and clay content ranged from 0.28 to 1.56 kg/dm³, 7.7 to 68.7 cmol C/kg, 0.25 to 22.4 g/kg, 4.4 to 8.3, 2.4 to 510.9 g/kg, and 2.7% to 57.1%, respectively.
To reveal the full range of taxa and functional potential encoded by global soils, we first assessed microbial attributes [i.e., gene annotations by the Kyoto Encyclopedia of Genes and Genomes (KEGG) (
81,
82), the Pfam (
83) and TIGRFAM protein databases (
84), and read-based taxonomy; see Materials and Methods] with the highest normalized abundance across all soils (
Fig. 2). Multiple annotation databases provide complementary information and are a standard operating procedure by the U.S. Department of Energy Joint Genome Institute (JGI) (
85,
86). Abundant KEGG orthologies (KO) were associated bacterial secretion systems (K12510), redox reactions (K00104, K00334, K03882, and K00380), and cellular growth processes (K02335, K02049, K00334, K03882, and K03088). Abundant Pfam were comprised of functions related to the tolerance of common soil stressors including temperature (PF11999, PF08496) and salinity (PF12838), as well as functions involved in microbial growth (PF08511, PF07357, PF13247, and PF00005). The most abundant TIGRFAM were also associated with soil stressors, for example, TIGR00229 (PAS Domain S-box protein) is involved in sensing changes in light, redox potential, oxygen, and other stressors (
87). Additional highly abundant TIGRFAM across all soils included a sigma-factor that may be associated with microbial responses to environmental fluctuation (TIGR02937) and other signaling and sensing processes (TIGR00254, TIGR00231). Finally, abundant microbial genera included diverse microorganisms including Burkholderia, Mycobacterium, Streptomyces, Acetobacter, and Pseudonocardia.
Variation in genomic potential across environmental gradients
To reveal biogeographic patterns in microbiomes across global soils, we extracted microbial attributes that were associated with soil factors and ecological biomes using random forest regression models (see Materials and Methods;
Table S1). Soil pH, bulk density, and biome type were the strongest correlates of microbial attributes (i.e., they generated well-fitting models with comparatively high explanatory power relative to other factors;
Table S1). The cross-validated random forest model for each of the variables above had a root mean square error of less than 0.8 (often <0.2) and/or at least 60% variance explained. Models for CEC, N, SOC, and clay content did not meet these thresholds. Results were conceptually consistent across different microbial attributes. A full description of all genomic fingerprints is in Extended Data S1.
pH separated soil microbiomes by nutrient- and vitamin B-related genes (
Fig. 3). More alkaline soils were characterized by vitamin B-, nutrient-, and phosphatase-related activities, with genomic fingerprints that included K01662, PF01872, K08717, TIGR00378, PF01966, and PF00481 for instance, as well as many genera of Alphaproteobacteria. More acidic soils were signified by ammonia- and complex organic polymer- (e.g., wood-) related reactions, with genomic fingerprints that included K14333, K01684, TIGR03404, and TIGR02093, and microbial clades known to use ammonia as a substrate (Rhodopseudomonas and Klebsiella).
Bulk density separated soil microbiomes by genes associated with soil structure, hydrology, and nutrient content. The genomic fingerprint of low-bulk density soils comprised genes involved in organic N and methane cycling, as well as signatures of anoxia (
Fig. 3). It included K00992, K11261, K11959, TIGR03323, TIGR01392, and microorganisms associated with anoxic and/or aquatic environments (Gardnerella, Cetia, Salmonella, Olleya, Mycoplasma, Escherichia, Enterobacter, and Chitinophaga). Conversely, high-bulk density soils were signified by genes involved in microbial transport, stress, pigmentation, and infection mechanisms. Notable attributes associated with high-bulk density soils included K00854, K11733, K03799, TIGR01263, TIGR01957, TIGR01203, PF00582, PF03631, and four Alphaproteobacteria.
Four biomes were highlighted with distinct genomic signatures by our models: (i) flooded grasslands and savannas, (ii) tundra, (iii) boreal forests, and (iv) montane grasslands/shrublands (
Fig. 4). The soil genomic fingerprint of flooded grasslands/savannas was primarily characterized by positive associations with organic N (K13497, K07395, K19702, and K02042), phosphorus (P, K02042, K00854, K02800, and PF04273), stress tolerance (heat and arsenic: TIGR04282 and TIGR02348), and motility (PF00669, PF00482, and Paeniclostridium). The genomic fingerprint for tundra had the greatest number of positive gene associations, including those associated with methanogenesis (K11261, TIGR03323, TIGR03321, TIGR03322, TIGR03306, TIGR01392, and TIGR02145), nitric oxide production (PF08768), P transfer (K03525, K00997, K00992, K13497, and PF08009), and pathogenesis (PF11203, PF05045), supported by the presence of anaerobic (Gardnerella, Escherichia, Kosakonia, Glaesserella, Mycoplasma, Treponema, and Paeniclostridium) and possible plant growth-promoting genera (Kosakonia).
Boreal forests and montane grasslands/shrublands were primarily negatively associated with microbial attributes. Boreal forests displayed negative relationships with the normalized abundance of magnesium (Mg, K03284)- and P (K06217)-related genes and carbohydrate and organic N metabolism (K02800, K11472, PF00208, and PF04685), as well as some metal-related genes (TIGR00378, TIGR04282, and TIGR00003). Montane systems were negatively characterized by genes related to Mg and P (K01520, K04765, and TIGR01203), carbohydrate metabolism (K01816), virulence (PF09299), and motility (PF00669), as well as organic N (K07395, K00600, and PF00208) and S (K03644). Both boreal forests and montane grasslands/shrublands were negatively associated with anaerobic genera (Gardnerella, Thielavia, Coniochaeta, Escherichia, Salmonella, Kosakonia, Shigella, Mycoplasma, Treponema, Glaesserella, Enteractinococcus, Paeniclostridium).
Interestingly in all analyses across pH, bulk density, and biomes, we observed phylogenetic coherence in soil microbiomes—the normalized abundances of more closely related microorganisms tended to move congruently in response to the biome type and environmental factors (
Fig. 4B; Extended Data S1).
Unsupervised machine learning to disentangle environmental interactions associated with soil genomic potential
Finally, to detect emergent patterns in microbial biogeography, we used unsupervised machine learning to reveal a collection of soils with distinct genomic fingerprints. These soils and their resident microbiomes were associated with coordinated changes in SOC content, total N, and CEC and with opposite coordinated changes in bulk density and clay content (
Fig. 5; Extended Data S2). We identified KO cluster 7, genus cluster 6, and Pfam cluster 1 as containing soil samples that exhibited the strongest coordinated changes (defined by the number of significant correlations and
R2 values; see Materials and Methods). TIGRFAMs did not exhibit these patterns.
Selected microbial KO and genera were positively correlated with SOC, N, and CEC and negatively associated with bulk density and clay, while Pfam cluster 1 exhibited opposite patterns (i.e., negative associations with SOC, N, and CEC and positive associations with bulk density and clay). Soil microbiomes in KO cluster 7 and genus cluster 6 contained high normalized abundances of genes associated with energy generation (K00334, K00339, K00324, and K08738), filamentous bacteria/biofilms (K11903, Nocardia, Streptomyces, Mycobacterium, and Leptolyngbya), chemically complex organic matter decomposers (Nocardia, Rhodococcus, and Streptomyces), and N-fixing organisms and plant symbionts (Rhizobium, Bradyrhizobium, and Nostoc). Additionally, genes associated with N-cycling processes (K07685, K10041) and microbial interactions (e.g., signaling and secretion, K12537, and K02661) changed most dramatically across environmental gradients. The most variable taxa across environmental gradients were diverse—they included N-cycling (Uliginosibacterium, Microvirgula), autotrophic and/or anaerobic (Methanosalsum, Salana, Pseudobacteroides, Sulfurirhabdus, Microvirgula, Desulfallas, Pelotomaculum, and Risungbinella), and plant-related microorganisms (Pseudobacteroides, Zeaxanthinibacter, and Parapedobacter). Samples in Pfam cluster 1 had high relative abundances of organic N- (PF01979, PF08352, and PF01842), P- (PF00406, PF01380), and signaling-related (PF01627, PF09413) genes, while many genes thought to be associated with eukaryotes changed most dramatically across environmental gradients in Pfam annotations (PF04178, PF03188, and PF01793).