INTRODUCTION
Mounting evidence implicates the gut microbiome as a critical component of human health. For example, research demonstrates that gut microbiota contribute to immunity, nutrition, and behavior (1, 2). Additionally, gut microbiomes of diseased individuals tend to harbor different taxa and contain different genes than those of healthy individuals (3). These observations motivate the hypothesis that human health depends, in part, upon the taxonomic composition of and biological functions executed by gut microbiota. Accordingly, researchers have sought to identify the properties of the human gut microbiome that signify health and disease. Such signatures are valuable to resolve because they provide important context for the development of disease diagnostics, clarify disease etiology, and generate insight into how microbiomes could be amended to restore health.
Prior investigations focused on defining how the gut microbiome signifies health or disease. For example, the Human Microbiome Project defined the structure and function of the gut microbiome in clinically healthy, urban North Americans (4). Other investigations used clinical 16S rRNA gene sequence data to determine how the structure of the gut microbiome of diseased individuals differs from that of healthy individuals (3, 5). More recently, a smaller set of investigations used shotgun metagenomes to resolve how both the structure and functional diversity of the gut microbiome associate with disease (6–13). However, almost all prior investigations focused on a single disease population and a matching control. Very few studies integrate data across multiple populations, incorporate data from other studies, or compare patterns across various disease types. Consequently, it is unclear which associations are robust to population or study effects. Moreover, we possess limited insight into which associations are specific to a disease type versus those that are common to myriad diseases. These limitations hinder our ability to develop robust clinical diagnostics from microbiome data and obscure our understanding of the potential mechanisms through which the microbiome contributes to a specific disease or health in general.
Integrating data across investigations through a meta-analysis overcomes these limitations (14–16). Though their application in microbiome science remains limited, meta-analyses provide important clarity in microbiome research. For example, meta-analysis of 16S rRNA gene sequence-based investigations surrounding human obesity revealed that originally reported associations between the taxonomic composition of the gut microbiome and obesity were inconsistent across studies (17) and appear to manifest only weak statistical effects (18). Additionally, a meta-analysis of 16S rRNA gene sequence data quantified the microbiome’s taxonomic association with disease across several populations that span a variety of diseases to reveal that some microbiome characteristics are disease specific while others are common to multiple diseases (15). The application of meta-analyses to shotgun metagenomic data is even more restricted, in part due to the limited amount of clinical metagenomic data currently available. One study integrated metagenomes to assess the predictive capacity of the taxonomic profile of the microbiome for several diseases, finding that integrating multiple data sets improved prediction capabilities (16). These studies highlight the importance of data integration in contributing to our understanding of the role of the microbiome in health and disease.
While these studies have proven insightful, their focus on taxonomy may limit our understanding of how the microbiome relates to health. Metagenomes afford insight into the types of genes contained, and consequent biological pathways encoded, by the microbiome. Resolving the association between microbiome functions and health may prove critical to determining the mechanisms through which the microbiome promotes health or contributes to diseases. Moreover, such analyses may reveal robust indicators of disease given observations that different microbes can elicit analogous functional effects on the host (19, 20). For example, the application of meta-analysis to the functional diversity of the gut microbiome in a study of type 2 diabetes revealed gene families contained in the microbiome that consistently associate with disease across two continents (21). The integration of metagenomic data sets in this study revealed the confounding contribution of antidiabetic medicine to the results, emphasizing the need to consider additional factors, such as medication, in assessments of the gut microbiome’s relationship to health and disease.
Here, we describe the first meta-analysis of microbiome gene functions that spans multiple disease types and populations. For this meta-analysis, we identified all publicly available human shotgun metagenomic microbiome data with diseased and nondiseased subjects, which consist of ∼2,000 metagenomes that span 8 studies and 7 diseases. We selected a case and control population for each disease from the available samples and applied a regression-based statistical framework to assess how the functional capacity of the microbiome varies in association with each disease and across diseases in general. Where possible, we modeled data spanning multiple studies with a study variable to control for potential study effects. Our study (i) reveals that functional diversity indicates disease, but usually with weak effect; (ii) resolves microbiome functions that associate with multiple diseases as well as functions that indicate specific diseases; (iii) documents the importance of considering study-specific parameters when deriving diagnostics based on the functional diversity of the gut microbiome; and (iv) explores the ability of the functional composition to predict disease status.
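The role of a study variable in such a model can be illustrated with a minimal sketch. It swaps in ordinary least squares for whatever regression family the analysis actually uses, and all sample values, study labels, and function names are hypothetical. Centering the abundance and the disease indicator within each study before fitting is equivalent to including study as a fixed effect, so the resulting slope is a study-adjusted case-versus-control difference.

```python
from collections import defaultdict

# Hypothetical toy data: (study, is_case, module abundance).
# Study B has higher abundances overall AND more cases,
# so study membership confounds the disease comparison.
samples = [
    ("A", 0, 10.0), ("A", 0, 11.0), ("A", 0, 12.0), ("A", 1, 15.0),
    ("B", 0, 21.0), ("B", 1, 24.0), ("B", 1, 25.0), ("B", 1, 26.0),
]

def naive_difference(rows):
    """Mean abundance in cases minus mean abundance in controls, ignoring study."""
    cases = [y for _, x, y in rows if x == 1]
    controls = [y for _, x, y in rows if x == 0]
    return sum(cases) / len(cases) - sum(controls) / len(controls)

def study_adjusted_slope(rows):
    """OLS slope for disease status with study treated as a fixed effect.

    Implemented by centering x and y within each study and then
    regressing centered y on centered x.
    """
    by_study = defaultdict(list)
    for study, x, y in rows:
        by_study[study].append((x, y))
    sxy = sxx = 0.0
    for members in by_study.values():
        mean_x = sum(x for x, _ in members) / len(members)
        mean_y = sum(y for _, y in members) / len(members)
        for x, y in members:
            sxy += (x - mean_x) * (y - mean_y)
            sxx += (x - mean_x) ** 2
    return sxy / sxx

naive = naive_difference(samples)
adjusted = study_adjusted_slope(samples)
print(f"naive difference: {naive:.2f}, study-adjusted difference: {adjusted:.2f}")
```

In this toy data the unadjusted difference (9.00) is more than double the study-adjusted difference (4.00), because much of the apparent disease signal is really a difference between studies.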
DISCUSSION
Our integrative analysis reveals the functional attributes of the gut metagenome that relate to human health and disease. We show that healthy microbiomes tend to encode higher protein family richness, significantly different functional compositions, and increased constraint on the variation in that composition compared to disease-associated microbiomes. However, effect sizes are frequently weak and not all diseases manifest these trends. Moreover, we identify specific functional modules that associate broadly with disease and, therefore, may be important to maintaining host health. Additionally, we resolve disease-specific markers that help clarify disease etiology and assess the ability of potential biomarkers to classify health status. Ultimately, the microbiome functions that we identify as being enriched in healthy individuals and disrupted in diseased individuals may illuminate how the microbiome contributes to host health.
Disease tends to associate with a reduction in the number of distinct protein families encoded in the microbiome. However, this trend is not universal: some diseases (i.e., liver cirrhosis and rheumatoid arthritis) show no significant difference in richness, and others (i.e., colorectal cancer) exhibit an increase in richness in diseased subjects. Decreased taxonomic richness commonly associates with disease, and some studies have associated decreased functional richness with disease (35). While this holds true for several diseases (i.e., Crohn’s disease, obesity, type 2 diabetes, and ulcerative colitis), it is not a ubiquitous characteristic of the microbiome in a diseased subject.
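As a minimal sketch of what protein family richness measures, the snippet below counts the families with nonzero abundance in each sample and compares group means. The family identifiers and counts are hypothetical, and the sketch ignores sequencing depth, which a real comparison would normally need to account for (for example, by rarefying samples to a common depth).

```python
from statistics import mean

def richness(abundances):
    """Number of protein families with nonzero abundance in one sample."""
    return sum(1 for count in abundances.values() if count > 0)

# Hypothetical toy profiles: protein family ID -> read count.
controls = [
    {"K00001": 12, "K00002": 3, "K00003": 7, "K00004": 1},
    {"K00001": 9, "K00002": 0, "K00003": 4, "K00004": 2},
]
cases = [
    {"K00001": 15, "K00002": 0, "K00003": 0, "K00004": 1},
    {"K00001": 8, "K00002": 2, "K00003": 0, "K00004": 0},
]

control_richness = [richness(sample) for sample in controls]
case_richness = [richness(sample) for sample in cases]

# A positive value means controls carry more distinct families on average.
mean_difference = mean(control_richness) - mean(case_richness)
print(control_richness, case_richness, mean_difference)
```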
The integration of metagenomic data enabled comparison of the differences in the gut microbiome’s functional composition across a variety of diseases. We find that while the microbiome’s functional composition associates with host health, the strength of the association varies substantially by disease and is generally small. This suggests that these diseases are not defined by a substantial restructuring of the functional composition of the gut microbiome. Rather, if the microbiome contributes to diseases, it tends to do so through changes in the abundance of specific protein families, which may be different in each diseased subject. Consequently, health is not necessarily defined by the sum total of the functional capacity of the microbiome.
Among the many complexities of the gut microbiome is the variation in functional composition observed even in healthy populations that can be attributed to factors unique to a population (e.g., their geographic location) or investigation (e.g., how samples were processed). These factors may impact the apparent relationship between the microbiome and health state. These so-called study effects may thus potentially confound the discovery of microbiome signatures that robustly indicate disease, especially when data that are collected from only a single population or investigation are used to uncover these indicators. However, no investigation has yet measured how study effects impact discoveries that result from associating the microbiome’s functional diversity with health state. To date, only colorectal cancer, obesity, and type 2 diabetes have been investigated using clinical shotgun metagenomic data that were generated from multiple, distinct populations and research studies. Integrating data across these studies, we find that study accounts for approximately 18.14% and 14.92% of the variation in functional composition between cases and controls for obesity and type 2 diabetes, respectively, while disease status accounts for only 1.2% and 1.7%, respectively. This finding aligns with prior observations of study effects in analyses of the taxonomic composition of the gut microbiome (15, 26, 36).
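One common way to express how much compositional variation a factor accounts for is a PERMANOVA-style R², computed from pairwise dissimilarities as one minus the ratio of within-group to total sums of squares. The sketch below is an illustration under stated assumptions, not the procedure used for the percentages above: the profiles and labels are hypothetical, Bray-Curtis is an assumed choice of dissimilarity, each factor is evaluated on its own rather than jointly, and no permutation test is performed.

```python
from itertools import combinations

def bray_curtis(a, b):
    """Bray-Curtis dissimilarity between two abundance vectors."""
    return sum(abs(x - y) for x, y in zip(a, b)) / (sum(a) + sum(b))

def sum_sq_dist(indices, profiles):
    """Sum of squared pairwise dissimilarities among samples, divided by their count."""
    indices = list(indices)
    total = sum(bray_curtis(profiles[i], profiles[j]) ** 2
                for i, j in combinations(indices, 2))
    return total / len(indices)

def r_squared(profiles, labels):
    """Share of total dissimilarity attributed to a grouping (PERMANOVA-style R^2)."""
    ss_total = sum_sq_dist(range(len(profiles)), profiles)
    ss_within = 0.0
    for group in set(labels):
        members = [i for i, label in enumerate(labels) if label == group]
        ss_within += sum_sq_dist(members, profiles)
    return 1.0 - ss_within / ss_total

# Hypothetical toy profiles (three functional modules per sample).
# Study shifts modules 1 and 2 strongly; disease shifts module 3 slightly.
profiles = [
    [10, 2, 1], [11, 2, 1], [10, 2, 2], [11, 2, 2],   # study A
    [2, 10, 1], [2, 11, 1], [2, 10, 2], [2, 11, 2],   # study B
]
study = ["A"] * 4 + ["B"] * 4
status = ["control", "control", "case", "case"] * 2

r2_study = r_squared(profiles, study)
r2_status = r_squared(profiles, status)
print(f"R^2 study: {r2_study:.3f}, R^2 disease status: {r2_status:.3f}")
```

In this constructed example the study grouping captures nearly all of the variation while disease status captures very little, mirroring the direction (though not the magnitude) of the pattern described above.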
The phrase “study effects” is an umbrella term often used to describe any unknown source of variance. Comparison of the technical and biological replicates in this data set reveals that the variation between these replicates is less than the variation between unrelated samples, indicating that certain study effects (i.e., batch effects) are unlikely to be the source of variance between samples. The variance in functional composition is more plausibly due to factors associated with geographical location, such as diet and cultural practices. Unfortunately, we do not currently possess the appropriate data set to test this explanation. Future studies should seek to generate metagenomic data from more diverse populations that span distinct countries. Despite the large contribution of study effects, disease status remains an important factor in explaining the variance between samples.
Analysis of the microbiome’s functional beta-dispersion reveals that most diseases have increased intersample variation in the microbiomes of the case populations relative to the microbiomes of the control populations. This pattern of increased dispersion in disease-associated microbiomes was previously observed in studies of taxonomic diversity and dubbed the Anna Karenina principle (AKP) (28). AKP hypothesizes that certain stressors elicit stochastic effects on the taxonomic composition of the microbiome to yield increased variation in the stressed group relative to the control group. Our beta-dispersion analysis shows that the AKP also applies to the functional profiles of the gut microbiome in diseased hosts. This observation indicates that the increased dispersion observed in the taxonomic analysis of diseased microbiomes is unlikely to be the result of redundant functional compositions across communities, since if that were the case we would expect to find little to no increase in dispersion in the functional profiles. That said, our observation does not preclude the possibility that different taxa encode a small set of redundant proteins that associate with the disease state. For example, several taxa within the phylum Proteobacteria (e.g., Escherichia, Pantoea, and Sutterellaceae) appear to contribute to the abundance of lipopolysaccharide (LPS) biosynthesis and transport modules. Additionally, our finding that there tends to be lower functional dispersion among healthy individuals indicates that there may exist greater constraints on how the microbiome operates among healthy individuals.
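Beta-dispersion can be sketched as the average distance of each sample to its group centroid, with a larger value indicating more sample-to-sample variation. The example below is a simplification under stated assumptions: the relative-abundance profiles are hypothetical, and it measures Euclidean distance directly on those profiles, whereas dispersion analyses commonly compute centroid distances in the principal-coordinate space of a chosen dissimilarity.

```python
import math

def centroid(samples):
    """Per-feature mean across a group of samples."""
    n = len(samples)
    return [sum(column) / n for column in zip(*samples)]

def dispersion(samples):
    """Mean Euclidean distance from each sample to the group centroid."""
    center = centroid(samples)
    return sum(math.dist(sample, center) for sample in samples) / len(samples)

# Hypothetical relative abundances of three functional modules.
controls = [[0.50, 0.30, 0.20], [0.52, 0.28, 0.20], [0.48, 0.32, 0.20]]
cases = [[0.70, 0.10, 0.20], [0.30, 0.50, 0.20], [0.50, 0.30, 0.20]]

control_dispersion = dispersion(controls)
case_dispersion = dispersion(cases)
print(f"controls: {control_dispersion:.4f}, cases: {case_dispersion:.4f}")
```

Both toy groups share the same centroid, so they would look identical in a comparison of average composition; only the spread around that centroid distinguishes them, which is the pattern the Anna Karenina principle describes.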
Our robust and integrative modeling approach reveals specific associations between microbiome function and health by identifying commonly perturbed functions that impact host health. Interestingly, most of the common indicators (i.e., indicators of four or more diseases) are increased in abundance in the microbiomes of diseased subjects relative to the microbiomes of control subjects, suggesting that these shared disease associations may be due to the elevated presence of some microbiome functions rather than their loss in the microbiome. For example, subjects with colorectal cancer, liver cirrhosis, Crohn’s disease, and obesity have increased abundance of a module for lipopolysaccharide (LPS) biosynthesis (M00060). LPS is a well-known proinflammatory molecule; increased LPS biosynthesis by gut microbiota could contribute to intestinal inflammation observed in subjects with these diseases. Additionally, some common indicators may clarify collective features of the intestinal environment across disease. For example, several modules for iron transport (M00318, M00190, M00240, M00243, M00317, and M00319) are increased in the microbiomes of subjects with Crohn’s disease, liver cirrhosis, obesity, and type 2 diabetes. Iron is an important cofactor for both humans and microbes and is often the subject of conflict between host and pathogen (37). Another common indicator is acetate production (M00377 and M00618), which is increased in the gut microbiome of subjects with rheumatoid arthritis, Crohn’s disease, obesity, and type 2 diabetes. Microbially produced short-chain fatty acids (SCFAs), particularly acetate and butyrate, are thought to act as signaling molecules between the gut microbiome and host and may play a role in host metabolism (38). Unlike butyrate, which seems to play a protective role in the gut (39), acetate is thought to interact with the host parasympathetic nervous system to modulate insulin secretion and may promote obesity (40, 41). Our finding that acetate production modules are consistently elevated across diseases supports prior work linking microbe-produced acetate to disease.
The integration of data from distinct diseases enables differentiation of disease-specific and disease-common indicators, which can clarify the etiology of specific diseases and advance their diagnosis. For example, rheumatoid arthritis cases carry an increased abundance of a methane production module (M00617) relative to controls. Increased abundance of methane-producing microorganisms was reported in patients with multiple sclerosis, an autoimmune disease that affects the central nervous system (42). These findings suggest that methane production by gut microbiota may associate with autoimmune conditions. Additionally, modules for degradation of glycosaminoglycans (GAGs) (M00076, M00077, M00078, and M00079) are uniquely elevated in subjects with Crohn’s disease. Increased degradation of GAGs in Crohn’s disease subjects has been reported previously (43) and may be caused by gut microbiota.
The observed indicators of disease also clarify the potential role of the microbiome in various diseases. By focusing on what the microbiome is capable of doing, rather than which taxa are present, and how this functional capacity associates with health, we can develop testable hypotheses about how the microbiome may mediate health and disease. For example, our work reveals robust associations between the functional composition of the gut microbiome and obesity. Among the indicators for obesity are modules for acetate production (M00377, M00579, and M00618). Recent research connects acetate production by gut microbiota to metabolic syndrome via interaction with the host parasympathetic nervous system to promote insulin secretion (40). These results are especially valuable in light of recent work that demonstrates an effect of the microbiome in metabolic diseases (44) but inconsistent (17) or weak (18) associations between the taxonomic composition of the gut microbiome and obesity. Notably, the overall functional diversity of the microbiome similarly manifests weak associations with obesity, but the aforementioned protein families robustly resolve the disease. Consequently, these specific indicators may serve as important leads in future studies of how the gut microbiome contributes to obesity and metabolic syndromes.
The random forest analysis demonstrates that the functional composition of the microbiome can aid in classifying disease status and may serve in disease diagnosis. However, the relatively large margin of error observed for some diseases or for classifying health versus disease in a general sense indicates that such diagnosis may be pertinent only for diseases with stronger microbiome signatures (i.e., Crohn’s disease or liver cirrhosis). As seen with colorectal cancer, the severity of the disease may also play a role in the potential for diagnostics.
Collectively, our analysis discerns how the gut microbiome’s functional capacity relates to host health. Through integration of data spanning multiple health states, we observe broad patterns of microbiome changes in disease that clarify how the gut microbiome contributes to health. For example, the metabolic modules that are commonly perturbed during disease may reflect mechanisms through which the gut microbiome interacts with physiology to promote health. Future studies should explicitly test whether the genes encoding these microbiome functions are actively expressed and critical to maintaining health. Moreover, disease associates with a personalized alteration in the functional composition in the microbiome, as indicated by our beta-diversity and beta-dispersion analyses. This result indicates that microbiome-based therapies may need to consider patient-specific parameters to ensure efficacy. Additionally, we uncover disease-specific indicators that not only serve as diagnostic leads but also clarify potential microbiome-mediated etiologies of disease. Future studies should similarly seek to test the effects of these microbiome functions on health. Expansion of metagenomic sampling across populations and health states is critical to advancing our understanding of how the functions encoded in the gut microbiome associate with disease, but improvements to existing analysis methodologies may be necessary to ensure that results are robust to technical considerations (e.g., the compositional nature of sequence data). Additionally, efforts to expand the functional characterization of microbial genes will enhance the sensitivity and specificity of imputed characterizations of microbiome functional capacity. Ultimately, integrative data analysis can expand our understanding of the role of the microbiome in maintaining health but requires more comprehensive patient data, standardized methodologies, and extended patient populations to maximize its utility.
ACKNOWLEDGMENTS
We thank Jesse Zaneveld for insightful discussions on beta-dispersion, Svetlana Lyalina for guidance with compound-Poisson modeling, and Chris Gaulke, Andrey Morgun, and Natalia Shulzhenko for helpful feedback.
This project was supported in part by NSF grant BIO-1557192, NSF grant DMS-1563159, NIH grant R01-DK103761-01A1, a Tarter award fellowship to C.R.A., and institutional funds to T.J.S.
C.R.A., S.N., K.S.P., and T.J.S. designed experiments and analyses. S.N. downloaded, processed, and annotated the metagenomes. C.R.A. analyzed data, interpreted results, produced figures, and wrote the paper. S.N., K.S.P., and T.J.S. all contributed to manuscript writing and editing. All authors read and approved the final manuscript.