We used three classification methods in conjunction with manual curation to analyze the contigs in our samples: (i) geNomad, a deep neural network method, to classify contigs as mobile genetic elements (i.e., plasmids and viruses) and to assign viral taxonomy; (ii) MMseqs2, which assigns taxonomy based on protein genes, with Genome Taxonomy Database (GTDB) as the reference database (so classification was biased toward bacteria and archaea); and (iii) Kaiju to classify eukaryotic contigs using the Kaiju database that contains a subset of the National Center for Biotechnology Information (NCBI) BLAST nr database containing all proteins from archaea, bacteria, viruses, fungal, and microeukaryotic reference genomes (db_nr_euk). Manual curation was used to detect spuriously assigned contigs based on the small subunit ribosomal ribonucleic acid genes (small subunit [SSU] rRNAs). For example, a contig could be classified as viral but has a bacterial SSU. In these cases, we classified contigs as eukaryotic, mitochondrial, or chloroplast by using NCBI BLAST (
41) on the SSU genes.
Not unexpectedly, using multiple classification methods leads to discrepancies, but we found that the databases had the largest impact on the results (Supplemental Methods). If an organism’s genome is not close to anything in the database, many of these methods attempt to determine the last common ancestor. However, often classifications at the superkingdom level (e.g., a sequence classified as bacteria with no phylum or lower taxonomic rank assigned) are ambiguous and get classified differently, depending on the algorithm and database. For example, if only MMseqs2 with the GTDB database was used, 85,455 contigs are classified as bacterial. However, if this is combined with the classification results for viruses and eukaryotic sequences, the number of bacterial contigs is halved to 40,986 contigs (Supplemental Methods). This underscores the importance of continuing to get complete genomes from metagenomes to better populate reference databases and to treat low confidence classifications with skepticism.
Combining all of the results led to the conclusion that of the contigs we assembled, 4% are eukaryotic; 0.3%–0.7% are archaeal; 38% are bacterial; 40% are viral; and 14% are unclassified/ambiguous (
Fig. 2A). A more conservative estimate of the number of viral contigs is 25%, which also increases the number of unclassified contigs to 29%. We encourage the reader to see the Supplemental Methods for full results from each of the classification methods, assumptions, and reasoning for reaching our final classification of the contigs. The unclassified/ambiguous contigs had a median length of ~5,000 bp; possibly many of these contigs could not be classified because they were too short, rather than the databases not having representative sequences. Sorting through the different classification methods also points out the hazards of relying on one method for classification, even if the focus of a study is purely prokaryotic, viral, or eukaryotic. Where different methods agree, there is higher confidence that the classification is correct.
Bacterial taxa comprise nearly all prokaryotic contigs, while archaeal taxa are rare
Unsurprisingly, given that we prefiltered the water at 11 µm, archaea and bacteria comprised a large proportion of the contigs. There were 40,986 bacterial contigs, which is 38% of all contigs. We only obtained one contig with a complete archaeal SSU gene, which was classified as in the
Nitrosopumilus genus by both MMSeqs2 and SILVA (
Tables S1 and S2). This is consistent with a recent 16S study where a
Nitrosopumilus-like operational taxonomic unit (OTU) was the only dominant archaeal OTU in the South Bay of the SFE (
42). However, the archaeal SSU gene in our data set likely represents a different species from that represented by the OTU or the two high-quality MAGs that Rasmussen et al. reported in a subsequent metagenomics study (
26), based on an SSU alignment (Fig. S3). This suggests that there is still an unexplored location- or condition-dependent dominance of specific
Nitrosopumilus strains in the South Bay.
Despite the existence of only one full-length archaeal SSU gene, MMseqs2 classified contigs in 12 other archaeal phyla (
Fig. 2B). We are not prepared to assert that the archaeal diversity is as high as this. We used the GTDB taxonomy, which is sometimes based on highly fragmented and incomplete bins, and we considered some taxonomic collapse possible, i.e., the high number of closely related species in the GTDB database may have been caused by sequencing error and assembly artifacts. However, coverage of the contigs involved is in the 5–10× range and thousands of bases in length, and we believe there is a significant archaeal population at USGS Station 36. We consider it likely that much of the population is in the sediment rather than being planktonic.
Plasmids have a 3:1 ratio with archaeal and bacterial species
Based on the geNomad assignments, and removing the contigs that have SSUs, there are 1,963 plasmids. Since we estimate that there are 450–500 archaeal and bacterial species (see next section for justification of this estimate), plasmids appear to be at a 3:1 ratio with the number of genomes. This ratio underscores the importance of ongoing efforts to link plasmids to chromosomes in metagenomes (
15,
43). Of the plasmids that had MMseqs2 classification phylum level or below, three of the contigs were classified at the archaeal and 1,591 were classified as bacterial, consistent with the estimated ratio of archaeal to bacterial species.
Plasmids are thought to be better retained by Illumina sequencing due to size selection and the inability to sequence circular contigs with Nanopore native ligation library preparation. However, in the Illumina assembly, geNomad identified only 244 contigs (average length, 3,663 bp) as plasmids, as compared to the 1,963 plasmids it identified in the Nanopore assembly (average length 15,264 bp) (Tables S3 and S4). Since the Illumina sequencing was about 14% of the depth of the Nanopore assembly (21 Gbp vs 150 Gbp) and the assembly is about a third of the Nanopore assembly (0.95 Gbp vs 2.95 Gbp), fewer identified plasmids are expected in the Illumina assembly. However, in terms of the number of total contigs in the assemblies (813,394 vs 107,977 contigs of Illumina vs Nanopore assemblies), the fraction of contigs identified as plasmids is far fewer in the Illumina assembly (0.03% vs 1.8%). In addition, the shorter length of contigs likely led to generally lower confidence predictions by geNomad (Fig. S4). From this analysis, despite using both size selection and native ligation in the Nanopore library preparation, we obtained more plasmids in the Nanopore assembly, and the Nanopore assembly allows for higher confidence identification of plasmids. This result may be due to the extremely deep Nanopore sequencing, although 21 Gbp for an Illumina metagenome is generally deeper than what other studies obtain.
Eukaryotic species, mitochondria, and chloroplasts
Using the eukaryotic specific database with the Kaiju classification of contigs, as well as manual curation of eukaryotic, mitochondrial, and chloroplast SSUs, allowed us to determine that we sequenced environmental DNA of eukaryotes, such as mammals, polychaete worms, clams, mussels, sea anemone, zooplankton, ciliates, and diatoms (
Table S1; Fig. S5). We also detected smaller eukaryotes that came through the 11-µm prefilter, such as
Ostreococcus tauri. The top five species based on contig counts classified by Kaiju were ciliate
Stylonychia lemnae (896 contigs), diatom
Thalassiosira pseudonana (524 contigs), green alga
Ostreococcus sp. “
Lucimarinus” (396 contigs), green alga
Picochlorum sp. BPE23 (273 contigs), and
Ostreococcus tauri (272 contigs). Eleven species out of 256 comprised 82% of the eukaryotic contigs classified by Kaiju (
Table S5). The Kaiju database only had eukaryotic sequences from fungi and microeukaryotes and did not report classifications from larger organisms.
We assembled some complete, circular mitochondria and chloroplast genomes and classified them based on their SSUs (
Table S1). In general, nearly all chloroplasts were initially classified as cyanobacteria (19 of 22) based on SILVA. Only one mitochondrion was correctly classified by SILVA; the rest needed manual curation. We found that if we had only relied on the MMseqs2 classification of contigs, we would have misclassified the mitochondria and chloroplasts as bacteria. Most were classified as bacteria at the superkingdom level, but four of the mitochondria were classified at the species level (UBA4416 sp016787365, PWPS01 sp003554915, and CAIVPM01 sp018335935) and two at the genus level (
Pelagibacter and UBA2645). Similarly, there are some chloroplasts that are classified in the class Cyanobacteria. Current theory holds that mitochondria arose from a bacterium becoming an endosymbiont in an archaeal cell (
44), so similarities of between these genomes and bacterial ones are not unexpected. However, these results are a reminder to check if a MAG from a metagenome is a mitochondrion or a chloroplast before labeling it as a new species.
Estimation of the number of prokaryotic genomes at USGS Station 36
Key questions in metagenomics concern what organisms are present in a sample and what the relative taxonomic abundances are. Not only do the answers to these questions provide an idea of the types of functions occurring in the sample, but also they provide species abundances to analyze with environmental data. In order to estimate how many genomes we could aim to assemble, we used marker genes to estimate the number of prokaryotic genomes present and then we used the coverage of the contigs as proxies for relative abundance.
Two common choices for marker genes are the SSU ribosomal RNA gene and 30S ribosomal protein S3 (rpsC) (
45). SSUs have the advantage that they are the basis for much of modern taxonomy (
46) prior to the advent of GTDB (
47) using full genomes or genome bins for taxonomic classification, and there are existing databases that can be used to place SSUs in the Tree of Life (
48,
49). At the same time, the disadvantage to using SSUs is that sometimes they have copy numbers greater than one, which distorts relative abundance measurements. Using rpsC can overcome this issue since it is a single-copy gene, but the lack of databases providing taxonomic information is problematic. Nanopore sequencing can mitigate these issues because the reads are long enough to span ribosomal operons; thus, SSUs (and the rest of the ribosomal operon) do not cause assembly fragmentation (
50). Moreover, if the sequencing is deep enough, the contigs are often sufficiently long to contain both SSU and rpsC genes, thus allowing rpsC genes to be assigned a taxonomy consistent with that of the SSUs.
To evaluate how much of the microbiome we were able to decompose into individual genomes, we used SSUs, rpsC, and contig taxonomy via MMSeqs2 (using GTDB as the database) to get a count of distinct organisms. First, we evaluated the SSUs in the Nanopore assembly. We found a total of 561 bacterial and archaeal SSUs. Of those, 37 were reported as truncated by cmsearch; i.e., the covariance model could not be fit at one or both ends. We manually curated the 37 and concluded that they were assembly artifacts or from extremely rare organisms and decided to exclude them from further consideration. This left 524 SSUs.
There are two confounding factors when using SSUs to estimate genome abundances: (i) one genome can have multiple SSUs that have different sequences, and (ii) genomes from different species or strains can have identical SSUs (
51). Accounting for contigs that contain multiple SSU sequences (
29;
Fig. 3;
Table S2), we found that the number of estimated genomes becomes 495. We found 17 contigs that have at least two different SSU sequences. Searching for identical SSUs between different contigs, we found 49 groups of SSUs. We did not reduce the number of estimated genomes based on contigs having identical SSUs because it is not possible to tell if these contigs belong to the same genome or to different strains. For example, two
Glaciecola contigs, contig_37285 (772,808 bp) and contig_57621 (575,645 bp), both have two SSUs, and all four of these SSUs are identical. However, an alignment of these two contigs using LAST (
52) did not result in any significant alignment longer than 5,743 bp, which is likely the alignment of ribosomal operons. This suggests that these contigs either represent two pieces of the same genome or they are from different strains/species, despite having identical SSUs. For these reasons, some of the SSUs in
Fig. 3 are identical, and our estimate of the number of genomes based on SSUs is still 495 but with a potential lower bound of 447.
We next annotated all of the predicted protein-coding genes in the Nanopore assembly and extracted the rpsC genes. We found a total of 548 contigs with rpsC genes. However, the Nanopore assembly also contained genomes for 59 mitochondria and 23 chloroplasts, and these generally contain rpsC genes as well. Eighteen of the rpsC genes were on contigs known to have mitochondrial SSUs, so the upper bound on the number of prokaryotic genomes represented by rpsC genes is 530, and the lower bound is 466.
Finally, we used MMseqs2 taxonomy on all of our contigs with the GTDB database. We selected the contigs with species-level taxonomic assignments, at least 10 labels that agreed with the contig assignment, and coverage of at least 10 in order to get a conservative taxonomic assignment. This yielded an estimate of 489 as a lower bound for the number of distinct species. Note that this taxonomic assignment amounts to a clustering of the longer contigs using the GTDB organisms as seeds. A set of contigs all being assigned to the same GTDB organism at the species level does not necessarily mean that it is the same species or strain; it may just be a statement that the contigs are closer to the GTDB organism than to anything else in the database, and we infer that they are therefore likely to be close to each other as well.
Based on the three estimates, we conclude that the USGS Station 36 sample contained approximately 450–500 distinct bacterial organisms to which we can confidently assign taxonomy.
Table S2 lists the organisms in decreasing order of relative abundance, i.e., coverage of the contig. We also examined diversity based on the Illumina assembly and found a total of 467 prokaryotic SSUs. However, only 36 were not found to be truncated by cmsearch. It is worth noting that while the estimate based on the Illumina assembly leads to a consistent estimate of the number of distinct species, it does not generally allow assignment to lower taxonomic ranks (
Table S6). Most of the fragments were significantly shorter than 1,000 bp, and only the extreme sensitivity of cmsearch allowed them to be seen (
53). A comparison of the taxonomy at the class level between the Illumina and Nanopore SSUs generally shows agreement for the number of prokaryotic SSUs that can be classified (Fig. S6), although ~42% of the total Illumina SSUs are unclassified, likely because they are truncated.
We generated a phylogenetic tree based on the bacterial and archaeal SSUs to study the phylogeny and included coverage of the contig the SSU was on as a proxy for abundance (
Fig. 3). Six species comprised 30% of the relative abundance of bacteria and archaea: a SAR86 species (10.3%), a SAR11 Clade Ia species (7.8%), three
Actinomarina species (3.9%, 3.0%, and 2.9%), and a
Litoricola species (2.8%). Notably, all of these species except for the
Actinomarina had closely related relatives that were not highly abundant. These results suggest that in order to study the population dynamics of the estuary microbiome, we will need to look at the individual genome level. Although we were able to tell which species were the most abundant, obtaining high-quality genome bins is necessary to help understand why some may be more successful than others.