INTRODUCTION
Diarrheal diseases remain a major public health issue worldwide, especially in developing countries where poor sanitary conditions and limited access to clean water exacerbate the burden (
1). Although most diarrheal cases self-resolve relatively quickly (
2,
3) and hence the causative agent(s) is never identified, there are instances where acute diarrheal infections lead to death, making detailed investigations of the causative agents and their signatures necessary. Such cases include not only acute child diarrhea in the developing world but also foodborne outbreaks linked to contaminated food and other causes in developed countries (
2–4).
Diagnostic testing for enteric pathogens has historically relied on culture-based techniques or culture-independent molecular assays such as PCR that are based on cultured material. Accordingly, reference (cultured) isolates are the foundation for public health surveillance (
5–7). It is important to realize, however, that culture-based techniques are typically time-consuming and costly and require significant expertise, while they frequently miss moderately or distantly related relatives of the reference isolates, since many bacterial microorganisms are difficult to growth or are even nonculturable, although still viable. For example, some pathogens, including
Escherichia coli, can quickly lose the ability to be cultured upon exposure to laboratory conditions (
8). In addition, culture-based techniques may detect transient pathogens that are not related to the current diarrheal episode. Several of these limitations also apply to (culture-derived) culture-independent methods: in cases where PCR detection is applied for example, there is still a high risk of detecting traces of asymptomatic carriage or residual DNA from previous infections or missing divergent (target) sequence of close or distant relatives to the reference isolate (
9,
10). Further, and perhaps more importantly, culture does not provide quantitative estimates of a pathogen's abundance, virulence potential, and diversity, and it does not allow the characterization of the gut microbiome response, which could be important for diagnosis.
In the United States, a total of 38.4 million cases of foodborne illness per year cannot be attributed to specific causes, and the proportion caused by yet-to-be-described microbial agents is unknown (
9,
11,
12). In developing country settings, even using highly sensitive methods such as qPCR to detect a broad array of pathogens, >10% of moderate-to-severe diarrhea cases cannot be traced to an etiologic agent, and the fraction of false-positive signal for cases with detected pathogens, due to transient pathogens for instance, remains essentially speculative (
12). Culture-independent metagenomic approaches can be used to robustly assess diarrheal infections.
E. coli is a gut commensal of vertebrates, including humans (
13); it can nevertheless cause a broad range of diseases, including intestinal and extraintestinal infections, through the acquisition of accessory genes. The relevance of this species in diarrheal disease has been extensively illustrated. For example, the Global Enteric Multicenter Study (GEMS) demonstrated that
Shigella (a distinct lineage within the
E. coli phylogenetic clade) and enterotoxigenic
E. coli (ETEC) producing heat-stable toxin (ST), either alone or in combination with heat-labile toxin (LT), are among the most important pathogens associated with moderate-to-severe diarrhea in children younger than 5 years old in developing countries (
11,
12). ETEC, enteropathogenic
E. coli (EPEC), and
Shigella are responsible for an estimated 18% of diarrhea deaths in children <5 years old globally (
12). Despite the relevance of
E. coli as a diarrheagenic pathogen, little is known about the effect(s) of pathogenic
E. coli infections on the gut microbiome, such as how the structure and diversity of the community changes during active infections.
Although commensal
E. coli is typically a relatively minor component of the colonic gut microbiome in humans, where it represents <0.1% of the total bacterial cells (estimated at ∼10
8 cells/g) (
13,
14), during infection with an invading pathogenic
E. coli strain or another enteric pathogen, the overall signal of
E. coli in the gut microbiome can increase substantially (
15,
16), allowing the detection and recovery of genomes based on culture-independent metagenomic techniques. In addition, quantifiable shifts in the proportion and diversity of the entire gut microbial community in response to the enteric infection disturbance have been observed (
17–20). However, the number of samples analyzed in these previous studies were limited, and the methodological approaches employed were based almost exclusively on taxonomic marker genes such as 16S rRNA that have limited resolution. Furthermore, and perhaps more importantly, in all previous studies the etiological agent of infection was inferred by DNA sequencing and was not validated by an independent approach. Accordingly, it remains unclear whether different pathogens such as different
E. coli pathotypes produce distinct alterations in the indigenous microbiome due to their characteristic mechanism of infection and virulence factors they excrete and employ during infection. Alterations of the gut microbiome composition upon infection have been shown to take place quickly, within 1 to 2 days or a few hours in some cases (
16,
20). Thus, metagenomic sampling following the onset of diarrhea, even in a cross-sectional sampling strategy, could potentially provide insights into pathotype-specific signatures and quick results for diagnostics.
During the interaction with the intestinal epithelial cells, in the phase of attachment and colonization, distinctive machineries of pathogenicity are known to be used by each of the six known
E. coli pathotypes, depending on whether the pathogen invades the cell, produces biofilms, or secretes toxins (
21). For example, ETEC is thought to adhere to the small bowel mucosa and deliver secretory enterotoxins (
22). Enterohemorrhagic
E. coli (EHEC) adheres to the colonic mucosa and transduces a signal, resulting in secretory diarrhea. Concurrently, the organism releases
Shiga toxin, resulting in local and systemic effects (
21). Enteroaggregative
E. coli (EAEC) adheres to intestinal epithelial cells and produces a thick mucous gel (biofilm) and causes intestinal secretion and damage (
23). Diffusely adherent
E. coli (DAEC) has been shown to elicit elongation of microvilli
in vitro, although this effect has not been demonstrated
in vivo, and has been considered not as virulent as other pathotypes (
24). EPEC elicits the attaching and effacing lesion in the small bowel, resulting in intestinal secretion (
25) and enteroinvasive
E. coli (EIEC) invades the colonic mucosa, giving rise to inflammatory enteritis (
26). Among the six pathotypes, EPEC and ETEC have received special attention in developing countries because they are believed to be major pathogens causing diarrhea in children 5 years old and younger (
1,
14). It remains to be elucidated whether these biological differences in infectious mechanisms disturb the gut microbial community in different, distinguishable ways.
We devised a novel bioinformatic approach that combined traditional, isolate-based, and PCR techniques with metagenomic and epidemiological data to identify diarrheal cases where E. coli was most likely the causative agent and evaluate whether pathotype-specific signatures in the disease-state gut microbiome exist that distinguish among different E. coli infections. For this purpose, we took advantage of a large epidemiological, case-control study of diarrhea that was carried out over a period of 18 months in Northern coastal Ecuador (named EcoZUR for “E. coli en Zonas Urbanas y Rurales”). The study was uniquely suited to address these questions as it included data on diarrheal disease outcome, a large collection of pathogenic E. coli isolates from diarrheal and nondiarrheal (control) samples, and other sociodemographic and clinical data from study subjects. Most genetic studies of pathotypes have been carried out using collections of isolates unrelated in time and space. Our study circumvents these previous limitations and allowed us to also observe E. coli strain relatedness.
We previously reported on risk factors for diarrhea and pathotype distribution (
27) in the EcoZUR study. Here, we report on a subset of the EcoZUR samples that comprised cases of infected children with three major
E. coli pathotypes (DAEC, ETEC, and EPEC) and their age-matched controls (no diarrhea). Our study addressed three main objectives: (i) to describe and compare the overall gut microbiome diversity between cases of diarrhea and controls using both 16S rRNA marker genes and whole shotgun metagenomic data; (ii) to identify cases of diarrhea where
E. coli was presumably the etiological agent based on a combination of metagenomics, isolate genome sequencing, and epidemiological data; and (iii) to determine whether pathogen-specific signatures in the disease gut microbiome exist that distinguish between DAEC, EPEC, and ETEC infections.
RESULTS
Study design and demographics of the study population.
We analyzed samples collected in the EcoZUR study, a case-control study of diarrhea carried out in four sites across a rural-urban gradient in northern Ecuador between April 2014 and September 2015. Participants were recruited from Ecuadorian Ministry of Health hospitals and/or clinics in Quito, Ecuador’s capital, with approximately 1.6 million inhabitants; Esmeraldas, a coastal city in the northwest of Ecuador with ∼160,000 inhabitants; Borbón, a town in Esmeraldas Province with ∼7,000 inhabitants; and rural villages located in the Borbón region with <800 inhabitants each. Cases comprised individuals who presented with diarrhea, defined as three or more loose stools in a 24-h period, and controls comprised age- and site-matched individuals for any other complaint, with no record of diarrhea or vomiting for the previous 7 days. Both cases and controls were excluded if they reported antibiotic usage in the prior week or if they had not lived in the study location for at least 6 months. Data on demographics, medical history, water, sanitation, hygiene (WASH) practices, animal contacts, and recent travel history were collected from all participants.
Cases and controls were similar on key demographic variables (e.g., age and race), indicating that the case-control matching was robust (
27). Sociodemographic and WASH conditions varied as expected across the rural-urban gradient, with urban sites reporting higher levels of improved sanitation and improved drinking water and rural sites reporting a higher proportion of individuals receiving government assistance, and a lower proportion of individuals with fixed employment and higher education levels. Additional information and previous results from the EcoZUR study can be found elsewhere (Smith et al. [
27]). Briefly, we found that travel to urban destinations was associated with higher risk of diarrhea and diarrheagenic
E. coli infections.
Fresh stool samples were incubated in
E. coli-specific media (MacConkey agar media) and tested for different pathotypes based on the presence of specific virulence genes determined by PCR (
Table 1; see additional details in Materials and Methods). All diarrhea samples that resulted in a PCR-positive signal for the presence of any marker gene characterizing DAEC, ETEC, or EPEC pathotypes and obtained from young children between 1 and 6 years old were selected for further analysis. These samples were matched with randomly selected control samples within the same age and location categories, without any diarrheagenic
E. coli infection detected through PCR. This strategy resulted in a total of 80 samples that were taxonomically screened by amplicon sequencing of the 16S rRNA gene (38 diarrhea and 42 control samples). All diarrhea samples and a randomly selected subset of 23 control samples were subjected to whole shotgun metagenomic sequencing (see Fig. S1 and Table S1 in the supplemental material). The diarrheal samples included 16 samples that were PCR positive for
afaB, the virulence marker of DAEC; 10 samples that were positive for
bfpA, the marker gene for typical EPEC; and 12 samples that were positive for
eltA and/or
sta, marker genes for ETEC. All
E. coli isolates recovered from the selected diarrheal samples were also sequenced for genomic characterization and comparison (see Table S3 in the supplemental material).
High frequency of coeluting human DNA in diarrhea samples.
Analysis of the composition of metagenomic reads showed that specimens collected in this study differed in the proportional amount of microbial and human DNA sequenced. The average percentage of human reads detected in the diarrhea group was 17.8%, while in the control group it was only 0.07%. Samples with a large fraction of detected human contamination belonged mostly to DAEC-positive (as determined by culture and PCR test for pathotype-specific genes) and EPEC-positive groups (
Fig. 1A). Our analysis also revealed no correlation between the fraction of human reads detected in our libraries and the severity of the disease, measured as the number of days with diarrhea previous to the sampling day or the detection of blood in the specimen. However, we did observe a significant increase in the fraction of human reads with detection of mucus in stool specimens versus those with no mucus detected (Welch’s two-sample test,
P = 0.004).
After removal of the human reads, between 221 Mbp and 3.2 Gbp of reads per metagenome remained for analysis (average = 1.78 Gbp) (Table S2). One ETEC sample (Q53) had only ∼1,200 reads after quality control and was therefore removed from further analysis. To evaluate the fraction of the total extracted DNA from the stool sample that was sequenced (i.e., determine the coverage of the microbial community by sequencing), we estimated the sequencing coverage using Nonpareil 3, a read redundancy-based algorithm (
28). Although the sequencing coverage varied among samples and pathotype groups, generally ≥80% of the microbial community was covered in the majority of samples, except for one EPEC sample (R126) that had very low coverage and was therefore discarded from further analysis (
Fig. 1B; Table S2). The coverage estimates also reflected differences in the complexity of the microbial communities among pathotypes, showing that the DAEC group (defined by the recovery of a DAEC isolate) contained, in general, metagenomes with simpler microbial communities compared to the two other pathotypes (Kruskal-Wallis rank sum test, χ
2 = 9.07, df = 3,
P = 0.02), whereas control samples had more complex (diverse) gut microbial communities compared to all diarrheal samples (Wilcoxon rank sum test, W = 278,
P = 0.03). Overall, our community coverage results suggested that despite the high frequency of coeluting host DNA in some disease samples, our metagenomic sequencing effort was adequate to assess the microbial community disturbance during diarrhea and to detect and recover abundant microbial community members and the putative pathogen.
Differences in microbial community composition and diversity between diarrhea and control samples.
The overall microbial community compositions and diversities in diarrhea and control groups were assessed based on 16S rRNA gene amplicon sequencing. Comparison of the normalized relative abundance at the phylum level indicated that three major phyla dominated the gut microbial communities in both diarrheal and control individuals, namely,
Bacteroidetes,
Firmicutes, and
Proteobacteria. In general,
Bacteroidia and
Clostridia were the two most abundant classes in all samples. In contrast to the control group, the majority of samples in the diarrhea group also exhibited a large proportion of sequences belonging to
Gammaproteobacteria, a bacterial group comprising several enteric pathogens (Wilcoxon rank sum test, W = 1,231,
P = 1.733e–05), which indicated a possible underlying bacterial infection in individuals with diarrhea (Fig. S2A). Alpha-diversity analysis based on Faith’s phylogenetic diversity revealed significant differences between case and control data sets. Consistent with several other recent studies (see, for example, references
15,
16,
17,
18,
19, and
29), control stool samples showed, in general, a significantly higher microbial diversity than the cases (Kruskal-Wallis test, H = 15.9,
P = 6.06e–05) (Fig. S2B). Analysis of the overall community dissimilarity at the genus level also revealed significant differences between diarrhea and control groups (permutational multivariate analysis of variance [PERMANOVA], pseudo-F = 2.47,
P = 0.002). Diarrhea samples clustered more closely together compared to control samples, although some overlap between several samples of the two groups was also observed (Fig. S2C).
Differences in the overall relative abundance of the OTU taxonomically assigned to E. coli in diarrhea compared to control samples, by about 1 order of magnitude (mean of 6.39% of total reads in cases versus 0.81% in control, Welch’s two-sample test, P = 0.009), were also detected. Few samples had as much as 30% of their 16S rRNA gene-containing read sequences assigned to E. coli, although other diarrheal samples had very low E. coli abundances, comparable to those in control samples (Fig. S2D). Conversely, three samples from the control group had as much as a 3 to 5% relative abundance of E. coli, which might indicate an asymptomatic E. coli infection and/or higher abundance of commensal E. coli. Therefore, we next sought to identify samples where pathogenic E. coli was most likely the causative agent of diarrhea by examining the companion shotgun metagenomic and epidemiological data.
Identification of disease samples where E. coli was most likely the causative agent of diarrhea.
Our observations of the overall community composition, diversity, and estimated taxon abundance in the diarrheal samples based on 16S rRNA gene amplicon data (see, e.g., Fig. S2D) indicated that E. coli was likely not the causative agent of disease in all diarrhea samples, even though an E. coli pathotype isolate was cultured from all diarrhea samples and at least one virulence marker gene was detected by PCR. In other words, isolation and PCR detection of E. coli pathotypes in stool samples from individuals with diarrhea might have reflected the recovery of a rare isolate in situ that was not involved in infection, or a stage of infection with E. coli but not necessarily diarrhea caused by E. coli. Therefore, we next aimed to determine the diarrheal cases that were most likely attributable to pathogenic E. coli. For this purpose, we assessed the metagenomic data sets, recovered metagenome-assembled population genomes (MAGs) from the metagenomes, and the genomes of isolates for five main lines of evidence.
(i) Metagenomic abundance. We estimated the
in situ metagenomic abundance of the
E. coli pathotype isolate (or the MAG that originated from the same sample as the isolate) and a reference commensal
E. coli strain HS (
NC_009800.1), using the diarrhea metagenomic data set from which the isolate was recovered. We expected that the estimated abundance of the pathogenic isolate would be higher than the commensal strain in the diarrhea metagenome, as well as in control metagenomes in a competitive, read-based search. Competitive mapping, e.g., mapping reads against a single combined data set containing both commensal and pathogenic genomes, was preferred over regular mapping to avoid double counting reads that map to very conserved regions that could potentially lead to overestimation of the calculated abundance.
(ii) Virulence factors. We evaluated the presence or absence of a large array of E. coli virulence factors, including pathotype-specific marker genes and enterotoxins in the metagenomic data sets of diarrhea and control samples, based on the sequencing coverage of the genes shown by metagenomic reads relative to the coverage of the rest of the genome (e.g., these factors sometimes did not assemble as part of the corresponding MAG; hence, a read-based approach was used; see Materials and Methods for details). We expected that the abundance of virulence factors would be higher in the diarrhea metagenomes than in the control metagenomes.
(iii) Intrapopulation diversity. We estimated the degree of
E. coli intrapopulation diversity by calculating the average nucleotide identity of the metagenomic reads (defined as ANIr) mapping to the reference genome (pathotype isolate or reference commensal in competitive blast searches) with a percent identity between 90 and 100%, i.e., reads representing the total
E. coli population in a sample. We expected that the degree of intrapopulation diversity (or clonality) of the pathogenic
E. coli population would be lower (more clonal) compared to the
E. coli population in control samples, as is often the case for active infection-associated pathogens (
30).
(iv) Membership in disease-associated clonal complexes. We examined the phylogenetic clonal complex that the pathotype isolate was assigned to (or the MAG that originated from the same sample as the isolate). For the isolate of interest, we expected that other isolates within the same clonal complex would be more frequently associated with disease than control samples. For this purpose, we built a core-genome phylogeny based on ∼1,200 orthologous genes for a total of 263
E. coli isolates obtained from the EcoZUR project. Clonal complexes within the phylogenetic tree were identified based on their ANI values (
31) using the PAM algorithm (partitioning around the medoids) with
k medoids, where
k was determined by the local gain in the average Silhouette width for each level of clustering (see Fig. S3) (
32). Clonal complexes corresponded to sets of strains clustered together in the core genome phylogeny, typically with >99% ANI among them (versus <99% ANI between clonal complexes).
(v) Virulent E. coli MAGs. Finally, we performed binning of the assembled metagenomic contigs in order to recover high-quality E. coli MAGs (Table S2). In addition, we assessed the presence of virulence genes and enterotoxins in the recovered MAGs and built a phylogenetic tree using isolates and MAGs derived from the same sample to evaluate whether the isolates obtained in culture were good representatives of the indigenous population(s). We expected to recover complete, high-quality E. coli MAGs carrying canonical virulence genes.
Taking these five lines of evidence together, our expectation was that the putative E. coli pathogen (isolate and/or MAG) was likely the causative agent of disease in the sample that it was recovered from when (i) the E. coli population presented a higher estimated metagenomic abundance compared to the commensal E. coli strain HS; (ii) the E. coli population had more virulence genes than its counterparts from control samples, including the detection of key enterotoxins and pathotype-specific marker genes; (iii) the corresponding diarrheal metagenome harbored a more clonal population of E. coli compared to control metagenomes; (iv) the isolates were assigned to a clonal complex in the core genome phylogenetic tree that was enriched in isolates from other disease (as opposed to control) samples; and (v) we were able to recover E. coli MAGs harboring the typical pathotype-specific virulence genes. We tested these expectations for the three pathotype groups with the highest numbers of diarrhea cases in our data set: DAEC, ETEC, and EPEC.
DAEC as the causative agent of diarrhea.
Our results indicated that approximately 50% of the samples from which DAEC isolates were obtained (i.e., 8 samples of 16 total), showed metagenomic signatures consistent with the isolate being the causative agent (
Fig. 2). These samples (Q51, Q196, E158, Q56, Q65, E230, E124, and E27) exhibited the following signatures: (i) higher abundance of the pathogenic isolate compared to the reference commensal strain or the total
E. coli population in the control samples, i.e., 27.81% versus 0.6%, on average (
Fig. 2A); (ii) detection of 40 or more virulence factors that were present in metagenomic contigs or were binned into MAGs at similar or higher sequence coverage than the MAG (
Fig. 2B); (iii) reduced intrapopulation sequence diversity with ANIr values of ≥99% for the isolate and typically ANIr values of <99% for the reference commensal genome (
Fig. 2C); (iv) the recovered pathotype isolate(s) was generally grouped in 11 phylogenetic clonal complexes composed of more isolates originating from cases of diarrhea than control samples (
Fig. 2D); and finally, (v) the recovery of seven high-quality
E. coli MAGs that encoded the pathotype (
afaB) and other
E. coli virulence factors. We also observed that the isolates within these clonal complexes were evenly distributed between rural and urban settings, which suggested that distinct DAEC genotypes were each associated with small-scale diarrheal outbreaks in northern Ecuador (see Fig. S3 in the supplemental material).
Although two DAEC-positive diarrhea samples showed ANIr values of >99% and the presence of the
afaB gene (e.g., Q310 and Q49), they presented lower metagenomic abundances of the isolate compared to the control samples and/or other positive samples and relatively lower numbers or the absence of virulence genes. In addition, for these samples, no MAGs were recovered, a finding consistent with the relatively low abundance of
E. coli. Therefore, these samples were not as conclusive with respect to whether the isolate was the etiological agent. Of particular interest was the observation that two of these samples (E27 and B89) were also positive for rotavirus and two (B274 and Q310) had a substantial number of reads mapping to adenovirus, a nonenveloped, double-stranded DNA virus causing acute gastroenteritis primarily in children (
33). These results indicated that despite the PCR detection of a DAEC marker gene in these individuals, other viral pathogens, rather than
E. coli, might have been responsible for the diarrhea phenotype.
To evaluate whether or not the DAEC MAGs were representative of the DAEC isolates recovered from the same samples, as well as the diversity of the population(s) present in the disease samples, we performed a phylogenetic reconstruction with MAGs and isolates. Core-gene phylogeny revealed that MAGs and isolate genomes clustered together in the majority of samples (
Fig. 3A), with the average ANI between the pair of MAG and isolate genome originating from the same sample being 99.92%, except for sample Q196, where the estimated ANI was 97.29%. Further evaluation showed that the recovered MAG from sample Q196 was a high-quality genome, with 97.1% completeness, 1.85% contamination, and a 43.45 estimated index of strain heterogeneity (SH; scale between 0 and 100). This finding suggests that the isolate represented a minor member of the total
E. coli population in sample Q196.
Next, we evaluated whether any correlation existed between the percentage of human reads detected in the DAEC metagenomes and the estimated metagenomic abundance of the isolate for the set of eight samples that had strong evidence of DAEC-caused diarrhea. Our results showed a relatively strong positive linear correlation between the two variables (Pearson’s
R = 0.65,
P = 0.08) (
Fig. 4). This observation indicated that the fraction of human reads observed in the metagenome might be directly related to the infection by pathogenic DAEC strains.
ETEC and EPEC as causative agents of diarrhea.
A similar approach was applied to the samples that were positive for ETEC and EPEC by isolation and PCR to identify representative cases of infection (see Fig. S4 and S5 in the supplemental material). The signal of
E. coli infection was, in general, more clear in ETEC (compared to DAEC) but much less clear in EPEC. Seven of ten samples (70%) had strong evidence of infection caused by the ETEC isolate (E184, Q294, B295, B45, B62, B244, and B255). In these seven metagenomic samples, at least one of the two ETEC marker genes, i.e., heat-labile (
eltA) and/or heat-stable (
sta) enterotoxins, was detected. The remaining three samples (i.e., B109, B68, and B200) did not show strong evidence of
E. coli being the infectious agent (Fig. S4), since they presented a very low abundance of
E. coli and/or the absence of at least one ETEC marker gene and/or relatively low clonality (ANIr < 99%). The recovery of high-quality ETEC MAGs was possible for six samples. All ETEC isolates and MAGs cluster together in the core genome phylogeny in only three clonal complexes, independent of the geographic origin of the genomes (Fig. S3 and
Fig. 3B).
For EPEC-positive samples, the diagnostic marker gene (eaeA) was recovered in only two of the eight samples (E187 and E162). However, even for the latter two samples, the analysis in intrapopulation diversity revealed no clonal population. Recovery of high-quality E. coli MAGs was possible in only one sample (R135), but <50% of the total EPEC hallmark or other virulence genes were detected in the metagenomes or the MAG (Fig. S5). Given that no strong metagenomic signature of pathogenicity was observed in the majority of EPEC samples, we excluded this pathogroup from further analysis that focused on the detection of further pathotype-specific gut microbiome signatures.
Gut microbiome signatures of DAEC and ETEC infections.
The identified samples with strong evidence of DAEC or ETEC infection were analyzed further in order to elucidate differences between diarrhea and control groups on the gut microbiome such as gene content or microbial community composition alterations. In total, seven discriminative taxa between diarrhea and control samples were detected: Prevotella copri (Kruskal-Wallis test, P = 0.05), E. coli (P = 2.26e–5), Alloprevotella tannerae (P = 0.04), Campylobacter concisus (P = 0.01), Haemophilus haemolyticus (P = 0.05), Fusobacterium mortiferum (P = 0.02), and Bifidobacterium longum (P = 0.04) (Fig. S6A). P. copri was most strongly enriched in the control group, while E. coli was the phylogroup mostly differentiating the diarrhea group, revealing a negative correlation in the abundance pattern for these two species (Spearman’s rho = −0.39, P = 0.012) (Fig. S6C). In addition, a principal-component analysis (PCA) of species abundance indicated that microbial communities from DAEC versus ETEC infections clustered separately (Fig. S6B).
To explore whether pathotype-specific alterations (signatures) in the gut microbiome exist that discriminate DAEC versus ETEC infections, we evaluated whether or not differences in gene content and/or composition existed after removing reads assigned to
E. coli from the metagenomic libraries and focusing on the samples with strong metagenomic evidence for infection identified above. The initial taxonomic characterization revealed at least four species that were discriminatory of DAEC versus ETEC infections. Specifically,
Fusobacterium mortiferum (
P = 0.025) and
Campylobacter concisus (
P = 0.011) were significantly more abundant in ETEC infections (
Fig. 5A and
B), while
Bifidobacterium longum (
P = 0.040) and
Alloprevotella tannerae (
P = 0.046) were significantly more enriched in DAEC infections (
Fig. 5C and
D). A PCA based on taxonomic composition at the species level also revealed that metagenomes associated with ETEC infections tended to be taxonomically more similar among themselves, whereas DAEC samples showed more diversity. Notably, one sample positive for DAEC (E124) also showed the
eltA gene in the metagenome, which is associated with ETEC, and tended to be more dissimilar to any other DAEC or ETEC metagenome, indicating that this individual could have suffered from a coinfection with DAEC and ETEC strains. Further within-species diversity analysis based on single nucleotide polymorphism (SNP) patterns of conserved genes (strain level) suggested that at least two
E. coli strains coexisted in the community rather than one strain harboring DAEC and ETEC virulence genes. One strain dominated the population with a ∼94% relative abundance, while the second strain represented ∼6% of the population. On the other hand, the metabolic gene profiling of the gut microbiome did not reveal substantial differences between infections by the two pathotypes. Therefore, our results revealed that significant, albeit rather narrow, taxonomic but not metabolic signatures might exist in the gut microbiome that differentiate DAEC from ETEC infections, which warrants further investigation.
DISCUSSION
Diagnostic testing for diarrheal pathogens has relied for decades on culture-based techniques that do not provide a quantitative estimate of the pathogen or the response of the gut microbiome to the infection. Consequently, a significant fraction of diarrhea episodes remains undiagnosed, and others are spuriously associated with infections of putative pathogens, due to the challenges associated with the accurate detection of the etiological agent using traditional techniques. The signatures of pathogen-specific disturbances in the ecology of the healthy gut microbiome also remain uncharacterized. Here, we provide a novel metagenome-based methodology that employs pathogen population
in situ abundance, the level of intrapopulation diversity, and virulence gene content (see, for example,
Fig. 2) to study diarrheal cases and to provide diagnostic resolution that is usually not attainable by traditional methods.
Application of our methodology to child diarrheal samples and age-matched controls collected in northern Ecuador showed that DAEC was likely the causative agent of several diarrheal cases, and the DAEC isolates recovered from these samples were assigned to 11 distinct clonal complexes. These complexes were apparently associated with small-scale diarrheal outbreaks given that the great majority of isolates in the complexes (>75% of total) were recovered from disease as opposed to control samples (despite the similar number of case and control samples and isolates in our data set), and the subjects that provided the disease samples were from both rural and urban settings.
Even though DAEC is not thought to be a highly virulent
E. coli pathotype (
24,
34), DAEC infections were found to be accompanied by coelution of large amounts of human DNA and conferred small (in terms of number of taxa affected) but significant shifts in the composition of the gut microbiome relative to the control or infections caused by other pathotypes (e.g., ETEC) according to our study. These findings echoed the findings by Huang et al. (
16), who applied shotgun metagenomic to stool samples collected from two geographically isolated foodborne outbreaks in the United States, where the etiologic agent was identified by culture-dependent methods as distinct strains of
Salmonella enterica subsp.
enterica serovar Heidelberg. Similar to our study, the acute
Salmonella infections described by the Huang et al. study were accompanied by a high frequency of coeluting human DNA sequences, significant shifts in the gut microbiome composition and diversity relative to healthy control samples, signatures of high abundance of the pathogen in the metagenomic diarrheal sample, and reduced intrapopulation diversity. Hence, it appears that the DAEC infections identified by our metagenomic analysis were likely caused by a DAEC strain.
Although the number of samples analyzed here was limited and thus the compositional shifts identified should be considered preliminary results only, our results do indicate that there may be consistent signatures in the gut microbiome that could provide reliable additional diagnostic features to distinguish different etiologic agents of diarrhea. The negative correlation in abundance between
E. coli and
P. copri was a notable such signature that should be explored in future work in order to elucidate the underlying mechanism(s) for the observed anticorrelation pattern. Of particular interest also was the observation that, in general, ETEC samples presented lower average metagenomic abundance of the pathogen (5.1%) than DAEC samples (27.8%) by ∼5-fold, on average, for the cases with clear evidence of
E. coli pathotype infection. This observation suggests that ETEC infection may require a lower pathogen load in order to elicit disease than does DAEC infection, providing potentially an additional diagnostic metagenomic feature of ETEC versus DAEC infections. Consistent with the latter results, ETEC pathogens are thought to cause infection by the production of enterotoxins (
22,
35) that alter the concentration of important cellular messengers such as cyclic AMP, cyclic GMP, and Ca
2+, while DAEC is strongly attached to the cell surface, where it induces a cytopathic effect characterized by the development of long cellular extensions (
24). These differences in mechanisms of pathogenicity might explain, at least in part, the differences in pathogen abundance and associated changes in the gut microbiome (see, for example,
Fig. 5) that were observed in ETEC- and DAEC-dominated metagenomes.
A critical question in diagnostic testing for enteric pathogens is how often conventional culture-dependent techniques, such as those implemented for isolating
E. coli from stool samples, readily capture the targeted pathogen as opposed to transient or dormant/inactive populations. Our metagenomic and/or epidemiological data showed that from the total set of diarrheal DAEC (
n = 16) samples analyzed, the isolate was highly similar to the recovered MAG in only ∼30% of the cases (
n = 5). The same results for ETEC-positive and EPEC-positive samples (by isolation and PCR) were 70 and 0%, respectively, for a total average of roughly about 50%. Thus, it appeared that in about half of the diarrheal samples the isolate likely was not the causative agent. This may help explain the results of many studies that have observed high rates of enteropathogen infection in individuals without diarrhea symptoms (
36).
Beyond pathogen detection, our metagenomic study also generated complete or nearly complete pathogen MAGs directly from the samples, which allowed a more comprehensive characterization of the virulence and diversity of the infectious E. coli population. In addition to application in diagnostic testing, population genome binning can also be used to recover pathogenic genotypes and, when coupled to epidemiological information, to identify person-to-person transmission events and outbreak dynamics, which represent important tasks in public health investigations.
Although a small number of samples were excluded from the pathogen-specific signatures of the infection based on the relatively low coverage observed in comparison to other positive samples or to controls, it is important to highlight that finding a 0.05- to 0.1-fold coverage, i.e., the minimum threshold required for robust detection of a target genome in a complex gut metagenome (
37), still translates to a relatively large number of cells
in situ. The average
E. coli genome size is 5 Mbp, and our metagenome libraries were, on average, ∼1.78 Gbp in size after removing contaminant host DNA and low-quality reads. Finding 0.05-fold coverage for an
E. coli genome (∼5% of the estimated genome size) would require 0.25 Mbp of
E. coli sequenced reads or ∼0.014% of the total metagenome size. Our extracted DNA came from an average of 1.26 × 10
6 cells in total based on the normalization of the average extracted DNA concentrations (6.45 ng/μl) with the estimated molecular weight of an
E. coli bacterium 5.14 × 10
−6 ng (
38–40). Therefore, 0.014% of 1.26 × 10
6 would be 1.77 × 10
2 in total (i.e., >100 cells), which is still a large number of cells that could potentially cause a disease. Thus, the limit of detection of metagenomics, as applied here, was not low enough to detect relatively low-abundance microorganisms and should be combined with methods that offer a lower detection limit for this purpose, such as qPCR and isolation, for a more comprehensive assessment of enteric infections. Furthermore, sample heterogeneity could also account for the lack of metagenomic evidence of
E. coli infection for the ∼50% of disease samples identified above as false-positive calls by isolation and PCR. Our typical sample size was ∼0.5 to 0.8 g, which represents a small portion of the total stool, and we might therefore have undersampled the (potential)
E. coli population present in the gut. Multiple replicate samples and high sampling volumes (i.e., 2 to 5 g or more) should be used to avoid such sample heterogeneity issues in the future. Regardless of the potential effects of sample heterogeneity, our findings collectively highlight the potential of metagenomics as a diagnostic tool for infectious diseases, the strengths of combining traditional culture-based and PCR techniques with shotgun metagenomics, and the applicability of our bioinformatic framework to the study of enteric pathogens.