Identification and quality of MAGs and isolate genomes.
In total, eight isolate-MAG pairs were analyzed from eight distinct human fecal samples (
Tables 1 and
2;
Fig. 1). The estimated completeness of the genomes based on the Microbial Genomes Atlas (MiGA) workflow ranged from 91% to 95.5% (average, 94.5%) for the isolates and from 81.1% to 95.5% (average, 92%) for the MAGs (
Table 1). Contamination estimates were 0.0% for all isolate genomes (as expected for pure isolate DNA sequencing) and ranged from 0.9% to 7.2% for the MAGs. The high level of estimated completeness of all isolates was also consistent with the high sequencing depth of their genomes obtained by the genome reads (average, 22×) (
Tables 1 and
2;
Fig. 2). MAGs were recovered from the same sample as the isolate (no coassembly was performed), using the large contigs of the assembly (longer than 500 bp) and MaxBin 2.0 with default settings (see Materials and Methods for details). Genome-aggregate average nucleotide identity (ANI) within all isolate-MAG pairs was 98.9% or above, with an average of 99.77%, indicating that the MAGs obtained belonged to the same population as the isolates and were members of
E. coli, with the exceptions of samples Q196 (97.63% ANI) and Q51 (96.99% ANI) (
Table 3), in which the MAGs apparently represented a strain(s) of the population distinct from the isolate.
The number of isolate genes shared with the corresponding (paired) MAG recovered from the same sample ranged from 2,118 in Q51 to 4,371 in E230, representing, on average, 74% of the isolate genes, with the exception of Q51, in which they represented only 42% of the genes (
Table 3). Similarly, MAG genes shared with the isolate represented, on average, 75% of total MAG genes, with the exception of Q51, where they represented only 24%. The variation in these values correlated well with the ANI between the isolate genome and its corresponding MAG; genome pairs showing relatively low ANI values also shared fewer genes. For instance, the isolate and MAG originating from Q51 showed 96.99% ANI. E230 was a different case in that a high number of genes (6,990) was recovered in the isolate genome compared to the average number of genes in
E. coli genomes or the MAGs studied here (average, 4,902; standard deviation [SD], 154) (
Tables 1 and
2), suggesting that there was a coculture of two strains, which would explain the low ratio of shared genes (ANI between the isolate and the MAG was 99.84%). Sample E158 was also an exception to the above-mentioned rule (high ANI; high percentage of shared genes) because the low ratio of shared genes was presumably due to low MAG completeness (<90%).
The rest of the MAG genes (nonshared) either had a distant match or did not match with isolate population genes (MAG specific) (
Fig. 1) and could be considered the result of natural variability within the natural population present in the sample or could be due to assembly or binning errors as shown below. Given the ANI and genome content results, the Q51 isolate and its corresponding MAG clearly did not represent the same genotype (or strain) and thus were removed from further analysis. Although Q196 had a low ANI with its MAG (97.63%), similar to Q51, it was included in further analysis because the number of genes shared between the isolate and the MAG was not that low, representing 70.26% and 60.88% of the total isolate and MAG genes, respectively.
Frequencies of population core, variable, and isolate-specific genes.
Metagenomic reads were mapped onto the corresponding isolate genome from the same sample in order to reveal the core genes within the natural
E. coli population in the sample and identify genes that were variable in the population, i.e., those that were carried by some but not all members of the population, as well as isolate-specific genes (those with no metagenomic reads mapping onto them) (
Fig. 1;
Fig. 3, panel 1.2; Fig. S1). Metagenomic reads were also mapped on the corresponding MAG sequence (
Fig. 3, panel 1.3) in order to assess whether or not the MAG represented the metagenomic population as well as the isolate did (whether the average nucleotide identity of mapped reads [ANIr] to the MAG and the ANIr to the isolate sequence were similar). Indeed, average identity and coverage level values were similar between the MAG and the isolate, showing <10% difference in most cases (Table S1) except for the E230, Q51, and Q196 samples, which represented heterogeneous populations of pathogenic and commensal
E. coli, as shown below. In all cases, including the samples with heterogeneous populations, the isolate genome was preferred over the MAG as the reference in the recruitment plot for the analysis of core variable and strain-specific genes because of the uncertainty about the extent to which the MAG may represent sequences, including contamination from other taxa.
Variable genes were defined as those with coverage significantly lower than the mean genome coverage (
P < 0.01) but >10% of the mean coverage; isolate-specific genes were defined as those showing ≤10% of the mean genome coverage, including no coverage by reads (completely absent). To calculate the level of coverage that provided a probability (
P value) of <0.01 (i.e., that the gene coverage was significantly lower than the mean genome coverage, with the null hypothesis being that the gene is present in the population and core), the resulting distribution of coverage values for all genes of a MAG or an isolate genome based on the recruitment plot with metagenome reads was fit to a log-normal distribution using the enveomics.R package, v1.4.1 (
10). Core genes were defined as those that showed coverage similar to the average coverage of the whole genome (i.e.,
P ≥ 0.01).
Variable genes were low in frequency, in general, ranging from 0% to 2.24% of the total isolate genes (average, 0.997%), while isolate-specific genes ranged from 0.10% to 13.31%, with an average of 3.12% (
Fig. 4;
Table 3). Annotation of population-variable genes revealed that they were related to glycosyltransferases, membrane transporters, secretion proteins, transcriptional regulators, and type III effector proteins that could be related to increased virulence of the population and/or increased fitness in the gut (Table S2), implying that variable genes could provide different members of the population with different adaptive traits.
Samples B200, E230, and Q196 had the highest numbers of isolate-specific genes, which accounted for >1% of the total genes in the genome, in contrast to the other four isolates analyzed (
Fig. 4;
Table 3). Note that the isolate genome recovered from sample Q196 is likely to represent an
E. coli population other than the abundant genotype present in the sample, as evidenced by a low ANI between the isolate genome and the MAG (97.6%). Annotation of these isolate-specific genes showed that >30% represented uncharacterized proteins, even reaching 49% in Q196, while 4.65% to 10.51% of isolate-specific genes were attributable to mobile elements, mainly phage proteins, transposases, and integrases (Table S3). The relative frequency of uncharacterized proteins among isolate-specific genes was significantly higher than that in the total isolate genome (
P, 0.04 by the chi-square test). This was not the case for the mobile elements (
P, 0.12 by the chi-square test), although ratios for mobile elements in samples B200 and E230 were much higher (>2-fold) among isolate-specific genes than in the whole genome. However, 20 to 38% of the isolate-specific genes identified were at the edges of contigs (<100 bp from the edge), in comparison to 5 to 40% for the whole genome, and this could be problematic for metagenomic read coverage estimation (fewer reads can typically be mapped at the ends of contigs with high stringency thresholds such as those used here). Therefore, we did not include isolate-specific genes from the isolates in the quality assessment of the MAGs (
Fig. 1); in addition, such genes should not be expected to be captured by MAGs, since they are carried by only a few cells, or even a single cell, of the total population.
As also discussed previously with regard to the E230 isolate genome, an unexpectedly high number of genes (6,990) was observed compared to the average number of genes in
E. coli isolate genomes and the MAGs studied here (average, 4,902, SD, 154) (
Table 1). Taxonomic profiling of the isolate-specific genes in E230 showed that all these genes were assignable to
E. coli, suggesting that there was a coculture of two strains that was sequenced, and this presumably accounted for the high numbers of isolate-specific genes not captured by the metagenome and core genes not detected by the MAG (
Fig. 2 and
4). This was also supported by the high fragmentation of the E230 genome assembly (1,630 contigs;
N50, 5,050 bp) (
Table 2). Hence, the relatively lower performance of the MAG in the case of E230 (see below) was most likely due to a mixed coculture of the isolate genome rather than a low-quality MAG.
Population core and variable genes missed by MAGs (completeness).
We next examined if the population core and variable genes identified by the read recruitment analysis above using the isolate genome as the reference were captured by the corresponding MAGs from the same sample. Core genes not captured by the MAGs represented a fraction ranging from 11.31% (B45) to 34.7% (E158) (24.84%, on average) of the total core genes, while the fraction of variable genes missing in the MAGs was 13.63% to 69.56% (50.17%, on average) (
Table 3; Fig. S2). As expected, MAGs missed all isolate-specific genes in these comparisons; these genes are too rare
in-situ to be assembled, since they are specific to the isolate in question.
A fraction of missed core genes (ranging from 12% to 79%) and variable genes (from 19% to 39%) in our samples found matching MAG homologs in BLASTN whole-genome comparisons of isolates with MAGs. However, these matches showed ≤90% nucleotide identity or ≤90% alignment length, indicating that either these genes were indeed real population-divergent homologs—genes not shared with the isolate—or they resulted from assembly errors (Fig. S3). The remaining missed core (16% to 71%) and variable (64% to 71%) population genes that did not find any homolog among the MAG genes were assembled but binned in other
Enterobacteriaceae MAGs from the same fecal sample (Fig. S3), revealing binning errors when closely related populations coexist in the same samples. Genes with relatively low sequence identity or alignment length and genes binned in other MAGs accounted for >95% of the total core genes and/or variable genes missed by the MAG. In two of the samples (E230, Q196), the percentage of genes that had no match with any MAG gene exceeded 30% (Fig. S3). Not surprisingly, these two samples were also among the samples with the highest numbers of isolate-specific genes (
Table 3;
Fig. 4) while isolate Q196 also had a low ANI value (<99%) with its corresponding MAG, suggesting the copresence of different genotypes.
Importantly, assessment of MAG completeness using either MiGA (based on the presence/absence and single/multiple copy of 111 essential universal genes) or CheckM (based on 1,173 marker genes conserved in the
Enterobacteriaceae family) showed that all MAGs were of high completeness (86.5% to 91%) based on recently proposed standards (
9), and the estimated completeness was slightly higher by CheckM than by MiGA (
Table 1;
Fig. 2). However, when genome completeness was estimated based on true-positive recovery rates for (i) core genes only (recovered population core genes/total population core genes, using the isolate as the reference in the read recruitment plot) and (ii) core and variable genes combined (recovered population core and variable genes/total core and variable genes), the rates ranged from 65.29% to 88.68% and 64.95 to 88.65%, respectively (
Fig. 2). These rates were lower than the completeness assessments of MiGA and CheckM, by 16.9% and 20.4%, respectively, on average (
Fig. 2).
Frequency of contamination from non-E. coli sequences in MAGs and underlying causes.
MAG genes that were not shared with the population after whole-genome comparison to the isolate genes with substantial metagenome coverage (core and variable population genes), ranged from 3.97% (B45) to 48.05% (E158) (
Table 4). These genes were separated into two categories: (i) genes that had distant matches showing either >70% but ≤90% identity or alignment length, or both, with the isolate genes and (ii) genes with no matches with isolate genes at ≥70% identity and alignment length (
Fig. 1), i.e., MAG-specific genes. The first category accounted, in total, for 25.8% to 45.02% of the nonshared MAG genes (
Table 4), indicating true gene variability for cases of lower identities and potential assembly issues for cases of low alignment length values. The second category accounted for more than half of the nonshared MAG genes in all cases (
Table 4). In the MAG-isolate pairs with low ANI values (
Table 3), some of these MAG-specific genes could represent real population-variable genes not present in the isolate. However, we posit that this fraction, representing between 2.35% (B45) and 26.41% (E158) of the entire genome (average, 14.42%, excluding Q196), corresponds, for the most part, to MAG contamination: genome fragments erroneously included in the MAG.
A considerable fraction (20% to 45%; average, 30.6%, excluding Q196) of the MAG-specific genes were found in contigs shorter than 1,000 bp (
Table 4) (reported toward the end of the MAG sequence file) and usually showed more-variable coverage than the rest of the genome (higher or lower [examples can be found in
Fig. 3]), further corroborating the proposition that these genes represented binning errors. In agreement with this interpretation, best-match analysis of MAG-specific genes against available reference genomes showed that a substantial fraction, ranging from 11.87% (E158) to 46.63% (E230) of these MAG-specific genes or 2.42% to 10.03% of total MAG genes, did not match
Enterobacteriaceae (
Table 4); the remaining genes matched
Enterobacteriaceae genomes. Some of the genes not matching
Enterobacteriaceae were also parts of short contigs (7% to 36%) (
Table 4). Importantly, both CheckM and MiGA failed to estimate MAG contamination by other bacterial families in most cases, i.e., their contamination estimates were 0.9% to 3.6%, because these tools are based on universal genes only, and the majority of the extraneous genes were not universal or core genes.
The highest numbers of MAG-specific genes were observed in the MAGs recovered from the E158, B200, Q196, and E184 metagenomes (>20% of total MAG genes). Overall, these MAGs had high completeness and low contamination (
Table 1) but exhibited the highest numbers of non-
Enterobacteriaceae genes, followed by the E230 MAG, which also exhibited 9% MAG-specific genes (
Table 4).
Evaluation of the effect of contig length on MAG quality.
Our MAGs overall showed high completeness (>80%) and low contamination (<10%) according to MiGA (
Table 1), and the
N50 values (50% of the assembly is found in contigs longer than the
N50 value) varied among the MAGs and were lower than the
N50 values for the isolates in most cases (
Table 2). The importance of contig number and length for retrieving accurate and complete MAGs has been discussed previously (
18), and those studies suggested that fewer than 10 long contigs should ideally be used as candidates for complete MAGs (i.e., fully circularized genomes). In agreement with this suggestion, we found that the number of contigs making up the MAG sequence was positively correlated (
R = 0.91) with the percentage of genes that were MAG specific in our data set (Fig. S4), implying that using longer contigs decreases false-positive binning errors. To more accurately quantify this parameter, we performed binning again, using different minimum lengths (1 kb, 2 kb, and 5 kb) for including contigs in the binning step. Our evaluation using MiGA showed that MAG completeness was not significantly decreased in this range of contig lengths except for the E158-MAG, while contamination decreased and quality increased in almost all cases (Table S4). A contig length of 1,000 bp offered the best compromise between lower contamination and minor decreases in quality and completeness, which was roughly consistent with what was reported previously based on different binning algorithms (
19). Similarly, the number of nonshared MAG genes decreased in most cases, as expected (Table S4;
Fig. 5), while isolate genes captured by the corresponding MAG decreased slightly and even increased in some cases, indicating that using longer contigs (lower fragmentation) also allowed for more-reliable binning (Table S4;
Fig. 5).