INTRODUCTION
Protists are a group of extremely diverse organisms that dominate the living biomass and energy flow of planktonic microbial eukaryotic communities (
1–3). They possess a wide range of distinct morphologies and physiologies and play important ecological roles in aquatic ecosystems as primary producers, consumers, predators, decomposers, parasites, and links to higher trophic levels (
1,
4,
5). A major goal in protistan community ecology is to understand how diverse taxa within a community interact to influence overall ecosystem function, an objective that would benefit greatly from an ability to assess the entire protistan community using a single approach.
Characterizing the entire microbial eukaryotic community in an environment presents significant obstacles because of the very high diversity of natural protistan communities. Attaining this goal is nearly impossible using traditional, morphology-based approaches due to the existence of cryptic and morphologically nondescript species, the many taxonomic schemes employed for various protistan groups, and the different collection, fixation, and processing procedures on which they depend. The application of DNA sequencing and the movement toward a “molecular taxonomy” for protists, however, has begun to provide morphology- and culture-independent approaches to the investigation of protistan community diversity (
4,
6,
7). Molecular characterization of protistan communities has focused largely on sequencing the small-subunit ribosomal gene (the 18S rRNA gene) rather than other genes (e.g., the cytochrome
c oxidase I gene) more commonly used for barcoding other eukaryotes (
8,
9). The application of such molecular approaches has lagged behind similar investigations of prokaryotic communities (
10) but has already led to significant new discoveries in the field of protistan ecology, such as the presence of a large diversity of rare taxa in different environments (
11,
12) and several previously undetected protistan lineages (
13,
14).
The DNA sequence information used to characterize the species composition and diversity of natural protistan assemblages must be translated into taxonomic information that reflects species-level distinction if it is to be useful for protistan ecologists. Sequences are generally grouped into operational taxonomic units (OTUs) based on their similarity to each other or to reference sequences in databases. The use of DNA sequences for protistan ecology started with Sanger sequencing of relatively long segments (>500 bp) of the 18S rRNA gene or the full-length gene (≈1,800 bp) (see, e.g., references
15,
16, and
17). These sequence lengths provided sufficient information for the development of OTU-calling programs that allowed species-level discrimination (
18–20). More recently, however, next-generation sequencing (NGS) approaches have led to a shift toward the use of shorter but much more numerous sequences, generally from hypervariable regions of the 18S rRNA gene (see, e.g., references
21,
22, and
23). The relationship of these shorter sequences to longer reads and their appropriateness for use in diversity analysis have been questioned (
24–28), but few direct comparisons have been made between the two types of sequences for microbial taxa, especially for microbial eukaryotes.
We obtained data sets of full-length and 454 pyrotag (V9 region) sequences of the 18S rRNA genes from a set of globally distributed oceanic samples in order to compare the information generated by these two independent sequencing approaches. We compared OTU calling for a data set of ∼190,000 pyrotag sequences to OTUs established using ∼6,500 full-length sequences. Our results revealed that pyrotag sequences called at 98% sequence similarity yielded a number of OTUs similar to that from full-length sequences called at 97% sequence similarity (approximately species level). Moreover, pyrotag sequences from a single pyrotag OTU aligned perfectly to full-length sequences from multiple full-length sequence OTUs, indicating that pyrotag sequences from one hypervariable region may not be suitable for differentiating all species present within an entire protistan community. Singletons (i.e., OTUs with only 1 sequence represented in the entire data set) strongly influenced the predictions of species richness for both types of sequence data but had little effect on the overall higher-level taxonomic composition. The protistan species richness observed from the pyrotag and full-length sequence data sets differed markedly, but the patterns of community similarities of samples as indicated by nonmetric multidimensional scaling (nMDS) plots were similar for the two data sets. Our study indicates that while the use of NGS for generating data sets has great potential for the investigation of microbial eukaryote diversity, some caution must be exercised when the data set is used for estimating species richness.
MATERIALS AND METHODS
Sample collection and extraction.
Ten samples were collected from five different marine environments at various depths (5 to 2,500 m) between August 2000 and November 2005 as part of a Global Protistan Survey (
Table 1). Water samples were collected using Niskin bottles mounted on a rosette equipped with conductivity, temperature, and depth (CTD) sensors.
Water samples of 2 to 20 liters were prefiltered through 200-μm and 80-μm mesh to remove most metazoa and were then filtered (<10 mm Hg) onto 45-mm GF/F (Whatman) filters. The filters were placed in 2 ml of 2× lysis buffer (100 mM Tris [pH 8], 40 mM EDTA [pH 8], 100 mM NaCl, 1% SDS) and were stored frozen (−20°C) on shipboard or flash frozen in liquid nitrogen for later DNA extraction. DNA was extracted according to the procedures detailed by Countway et al. (
15).
Pyrotag sequencing and quality filtering.
Pyrosequencing of the V9 hypervariable region of the 18S rRNA gene was performed according to the procedures described by Amaral-Zettler et al. (
29). The primers used to sequence the V9 region were 1380F (5′-CCCTGCCHTTTGTACACAC-3′), 1389F (5′-TTGTACACACCGCCC-3′), and 1510R (5′-CCTTCYGCAGGTTCACCTAC-3′). The emulsion PCR was performed using standard Roche protocols, and sequencing was carried out on the Genome Sequencer FLX system (Roche, Basel, Switzerland) using the GS LR70 long-read sequencing kit (Roche).
Sequence reads that did not have (i) an exact match to the proximal primer (1380F or 1389F) and (ii) the presence of the distal primer (1510R) were removed, as were reads that had one or more ambiguous bases (N′s). Further quality filtering was done using the free software package mothur, version 1.30.0 (
http://www.mothur.org/), by following the standard operating procedures for 454 sequences (
30). These procedures included removing pyrotags with an average quality score of <35 using a 50-bp sliding window and those with >8 homopolymers present. Preclustering with a 1-bp mismatch allowance was performed to minimize the effects of sequencing or amplification errors on OTU clustering (
31), and chimeras were removed using the UCHIME algorithm (
32). Sequences are available at the Visualization and Analysis of Microbial Population Structure (VAMPS) website, hosted by the Marine Biological Laboratory (VAMPS.mbl.edu), and in the NCBI Sequence Read Archive under SRA number SRP001225.
Full-length sequencing and quality filtering.
Full-length 18S rRNA gene sequences were amplified by PCR using the universal eukaryote primers Euk-A (5′-AACCTGGTTGATCCTGCCAGT-3′) and Euk-B (5′-GATCCTTCTGCAGGTTCACCTAC-3′). The PCR and cloning procedures are based on those of Countway et al. (
33) with slight modifications, including the use of AmpliTaq Gold for PCR amplification, and amplicons were cloned using a TOPO-TA kit (Invitrogen). Sequencing of the full-length 18S rRNA gene was conducted by the Joint Genome Institute (JGI, Walnut Creek, CA) using two vector primers (T3 and T7) and an internal primer (Euk-570F).
Sequences that did not have both the proximal (Euk-A) and distal (Euk-B) primers, or that could not be assembled due to failure of the internal sequencing primer, were removed from the data set. Sequences were then passed through the pintail algorithm to remove chimeras (
34).
OTU calling.
The mothur software package (version 1.30.0) was also used for calling OTUs for both the pyrotag and full-length sequence data sets (average neighbor method) to ensure that the OTU-calling procedures for the two sequence data sets were comparable. An aligned SILVA eukaryotic full-length 18S rRNA gene reference database (version 102; aligned and provided by the mothur software package) was used to aid in the alignment of full-length and pyrotag sequences using mothur prior to OTU calling. All full-length sequences were pooled, and OTU calling was performed at a sequence similarity of 97% based on the observation that OTUs called using full-length sequences at 97% by mothur yielded a number of OTUs most comparable to that called by the Microbial Eukaryote Species Assignment (MESA) program at 95% sequence similarity (
20; S. K. Hu and D. A. Caron, unpublished data). Calling OTUs at 95% sequence similarity in MESA was designed to create OTUs of full-length or nearly full length 18S rRNA gene sequences at approximately species-level distinctions using a data set of well-curated and morphologically well described protistan species and their sequences (
20).
Comparison of the numbers of OTUs formed.
The pyrotag and full-length sequence data sets were first compared by examining the number of OTUs formed by the two data sets. Pyrotag sequences were randomly subsampled to yield the same number of sequences in each sample as the full-length data set (i.e., to standardize to the full-length sequence data set). This subsampled pyrotag data set was used to call OTUs at different sequence similarities (95 to 100%) using mothur, and the resulting numbers of OTUs were compared to the numbers of OTUs called for the full-length sequences using mothur at 97% sequence similarity.
The analysis described above allowed the comparison of OTU formation between the pyrotag and full-length sequence data sets using the same number of sequences, in order to avoid the potential bias inherent in using a much larger pyrotag sequence data set. One of the strengths of NGS, however, is the substantially greater number of sequences that can be generated from a sample. Subsequent comparisons between the pyrotag and full-length sequence data sets, therefore, included all pyrotag sequences. Richness for the two data sets (all full-length or all pyrotag sequences) was examined through the construction of rarefaction curves for both data sets. Rarefaction curves of the observed numbers of OTUs were generated for OTUs called at 97% sequence similarity for both data sets and additionally at 98% and 99% similarity for the pyrotag data set. Rarefaction curves were constructed with and without the inclusion of singletons in the two data sets.
Taxonomy of pyrotag and full-length sequence OTUs.
The best BLAST result for a representative sequence from each OTU using the SILVA database (version 111) was used to provide taxonomic information for pyrotag and full-length sequence OTUs. The representative sequence from each OTU was selected by mothur and was the sequence with the smallest distance to all other sequences within the OTU. If more than one sequence matched this condition, then the sequence with the smallest average distance to other sequences in the OTU was selected.
Mapping pyrotags to full-length sequences.
All pyrotag sequences from the 20 pyrotag OTUs that had the most pyrotag sequences were mapped to the full-length sequences in order to investigate how an OTU generated with pyrotag sequences related to OTUs generated with full-length sequences. The pyrotag sequences were mapped to full-length sequences using the Burrows-Wheeler Aligner algorithm with no mismatch allowance (
35), and the number of full-length OTUs matched perfectly by the pyrotag sequences in a single pyrotag OTU was tallied. In the event that a pyrotag sequence mapped perfectly to multiple full-length sequences (i.e., the full-length sequences had identical V9 regions), one of these matching full-length sequences was randomly selected as the perfect match to the pyrotag sequence (i.e., each pyrotag sequence matched with only one full-length sequence for subsequent analysis).
nMDS analysis of the pyrotag and full-length data sets.
The results of nonmetric multidimensional scaling (nMDS) analysis using the pyrotag and full-length sequence data sets were compared in order to investigate potential differences in community similarities inferred from the two types of sequences. The number of sequences in each data set was standardized separately for the two data sets (450 for the full-length data set and 4,736 for the pyrotag data set) by randomly subsampling the sequences in each sample down to the number of sequences in the sample with the least number of sequences for each data set. This standardization procedure was performed to avoid the potential bias of having different numbers of sequences in each sample. In addition to the analyses using the full-length and pyrotag sequence data sets that included singletons, a third analysis was performed on the pyrotag data set with the singletons removed (yielding a data set of 4,472 pyrotags per sample). All subsampling of sequences was performed using mothur. The standardized data sets were square-root transformed to down-weight highly represented OTUs. The PRIMER software package (version 6) was used to calculate the Bray-Curtis community composition similarity values that were based on the OTU composition of each sample. A matrix of pairwise Bray-Curtis similarity values was constructed for each data set (i.e., full-length sequences, pyrotag sequences, and pyrotag sequences without singletons) and was then used to perform CLUSTER (group average mode with the SIMPROF test for significance) and nMDS analyses. The results of the SIMPROF significance test from the CLUSTER analysis were overlaid on the nMDS plots, which are visual representations of the similarities among communities.
Nucleotide sequence accession numbers.
RESULTS
The number of pyrotags obtained after applying quality-filtering procedures ranged from 4,736 to 37,034 per sample, resulting in a total of 190,696 pyrotag sequences for all 10 samples (
Table 1). Pyrotag sequences were ∼110 bp long (mean, 110 bp; 25% quartile, 110 bp; 75% quartile, 112 bp) after quality-filtering procedures (including the removal of primers). A total of 6,579 full-length sequences were available after quality filtering. The number of full-length sequences in each sample ranged from 450 to 879.
Numbers of OTUs from pyrotag and full-length sequence data sets.
We compared the number of OTUs formed from pyrotag sequences called at different sequence similarities (95 to 100%) to the number of OTUs formed from full-length sequences called at 97% sequence similarity in order to investigate which sequence similarity used to call OTUs for the pyrotag data set resulted in a number of OTUs most comparable to that generated by the full-length sequence data set. The pyrotag data set for this analysis was standardized by randomly subsampling the number of pyrotags in each sample to match the number of full-length sequences in each sample (
Table 1) in order to avoid potential biases from using a much larger pyrotag data set. OTU calling for all full-length sequences (6,579 sequences) at 97% similarity in mothur resulted in a total of 1,400 OTUs, while the same number of pyrotag sequences called at various sequence similarities, ranging from 95 to 100% sequence similarity, resulted in 1,113 to 1,695 OTUs (i.e., ≈80 to 120% of the number of OTUs called for the full-length sequences) (
Fig. 1). OTU calling of the pyrotag sequences at 98% sequence similarity yielded a total of 1,419 OTUs, which was most similar to the number of full-length OTUs called at 97%. Singleton OTUs (i.e., OTUs composed of only one sequence in the data set) constituted a large portion (>50%) of the OTUs in all treatments of both data sets. Calling OTUs using full-length sequences at 97% sequence similarity generated the highest proportion of singletons (72% of the total number of OTUs) compared to the number of singletons generated from the standardized pyrotag sequence data called at different levels of similarity (range, 54 to 64% of the total number of OTUs). The pyrotag singletons in this analysis included some “secondary” singletons that were generated as a consequence of the subsampling process. The use of the standardized data set yielded an average of 59% singleton OTUs over the range of sequence similarities used to call OTUs (
Fig. 1), whereas calling of OTUs using the entire pyrotag data set yielded an average of 51% singleton OTUs.
OTU rarefaction curves from the pyrotag and full-length sequence data sets.
Rarefaction curves were constructed using all sequences available for the pyrotag (190,696 sequences) and full-length (6,579 sequences) sequence data sets in order to examine sequence coverage (
Fig. 2). Full-length OTUs were called at 97% similarity, while pyrotag OTUs were called at 97%, 98%, and 99% sequence similarities. Calling pyrotag sequences at 98% sequence similarity resulted in a number of OTUs that was most comparable to that from the full-length sequence data set in the previous standardized comparison (see the preceding section), and OTU definitions of 97% and 99% were included in order to investigate how differences of 1% in sequence similarity would affect the analysis results. The entire pyrotag data set generated a large number of OTUs (at 97% sequence similarity, 6,414 OTUs; at 98%, 7,775 OTUs; at 99%, 11,234 OTUs), while the full-length sequence data set generated 1,400 OTUs, as noted above. The rarefaction curve for the full-length sequence data set was similar to the pyrotag rarefaction curves for the portion of the curves where the data sets (i.e., number of sequences sampled) overlapped (
Fig. 2B). The curve for the full-length sequences appeared to have a slightly different trajectory than those for the pyrotag sequences (
Fig. 2B), but the curve for the full-length sequences was always bracketed by the curves generated by the pyrotag OTU data sets called at 97%, 98%, and 99% sequence similarities.
The removal of singleton OTUs altered the form of the rarefaction curves dramatically. The effects of removing singletons were investigated because there have been suggestions that some singleton OTUs may be artifactual (
31,
36,
37) (for more detail, see Discussion). The number of OTUs observed decreased substantially for both data sets with the removal of singletons, as expected (
Fig. 2C and
D). The number of OTUs observed for the pyrotag data set was reduced to 40 to 52% of the total number of OTUs when singletons were removed (from 11,234 to 4,530 OTUs at 99% sequence similarity; from 7,775 to 3,828 OTUs at 98%; from 6,414 to 3,349 OTUs at 97%). The number of OTUs observed in the full-length sequence data set without singletons was reduced to 28% of the total number of OTUs (from 1,400 to 387 OTUs). The rarefaction curve for the full-length data set without singletons had a much more shallow slope and a lower maximum than the pyrotag rarefaction curves for the same number of sequences sampled (
Fig. 2D).
Taxonomic compositions of pyrotag and full-length sequence data sets.
The higher-level taxonomic compositions (supergroup level) of the entire pyrotag and full-length sequence data sets were tallied in order to (i) compare the taxonomic compositions obtained by calling pyrotag OTUs at different levels of similarity, (ii) compare the taxonomic compositions of the pyrotag and full-length sequence data sets, and (iii) examine the effect of the removal of singletons on the taxonomic information inferred using both data sets. This analysis focused on pyrotag OTUs called at 97%, 98%, and 99% sequence similarities for the reasons given in the preceding section.
Calling OTUs using the pyrotag data set at three different levels of sequence similarity had very little effect on the higher-level taxonomic composition of the sequences (
Fig. 3). Similarly, the presence or removal of singleton OTUs did not substantially influence the taxonomic composition derived from either the full-length or the pyrotag sequence data set. However, the full-length sequence data set yielded a larger proportion of sequences identified as rhizarians than the pyrotag sequence data set (14 to 15% of full-length sequences and ≈3% of pyrotag sequences) and smaller proportions of sequences identified as stramenopiles (1 to 3% of full-length sequences and 12 to 13% of pyrotag sequences) and as cryptophytes and haptophytes (≈0% of full-length sequences and ≈3% of pyrotag sequences). Other groups (e.g., alveolates) constituted similar proportions of the two data sets.
Mapping pyrotags to full-length sequences.
All pyrotag sequences from a single pyrotag OTU were mapped to full-length sequences at 100% sequence similarity in order to investigate how individual pyrotag OTUs related to full-length OTUs. Full-length OTUs for the analysis were called at 97% sequence similarity. This analysis focused on 20 pyrotag OTUs that had the most pyrotag sequences. Pyrotag OTUs were called at 95%, 97%, and 99% sequence similarities (
Table 2).
Pyrotag sequences from nearly all of the 20 pyrotag OTUs with the most sequences mapped to full-length sequences from multiple full-length OTUs. The results were similar among the three sequence similarities examined for the pyrotags, but lower sequence similarity tended to hit more full-length OTUs. In only three cases did the pyrotags from a well-populated pyrotag OTU match a single full-length OTU (the 17th, 18th, and 19th most populated OTUs when pyrotag sequences were called at 99%, 97%, and 95% sequence similarities, respectively). The pyrotag OTUs with the most sequences formed at 95%, 97%, or 99% sequence similarity consisted of high numbers of pyrotags (34,787 to 49,709), and pyrotags from these three OTUs mapped perfectly to full-length sequences that were distributed among 127, 109, or 75 full-length OTUs, respectively (
Table 2). While sequences from a single pyrotag OTU aligned perfectly to sequences from multiple full-length OTUs, the full-length OTUs were generally from the same higher-level (e.g., class or order) taxonomic group (e.g., Dinoflagellata) (
Table 3).
nMDS analysis using pyrotag and full-length sequences.
nMDS plots were constructed using the pyrotag and full-length sequence data sets to (i) compare community similarities among samples using pyrotag and full-length sequences (
Fig. 4A and
B) and (ii) examine the effect of the removal of singletons on the clustering of communities for the pyrotag data set (
Fig. 4B and
C). OTUs for the pyrotag data set were called at 98% sequence similarity, while OTUs for full-length sequences were called at 97%, for the reasons provided above. Each data set was standardized to the number of sequences in the sample with the lowest number of sequences (for the full-length data set, 450 sequences/sample; for the pyrotag data set, 4,736 sequences/sample; for the pyrotag data set without singletons, 4,472 sequences/sample). The Bray-Curtis similarity values estimated for the full-length data set were generally lower (average, 5%) than those estimated for the pyrotag data sets (average with singletons, 14%; average without singletons, 13%), but the overall patterns and the clustering of some communities in the nMDS plots were similar for the three data sets (
Fig. 4). For example, communities from the Ross Sea (RS) consistently clustered together, and communities from deep water at the East Pacific Rise (EPR) (sampling depths, 1,500 m and 2,500 m) consistently clustered together as well in both the pyrotag and full-length sequence data sets. One difference between the pyrotag and full-length sequence data sets concerned the community from the Arctic Ocean (AO) collected at 35 m. The two Arctic samples (35 and 500 m) were not significantly different (
P, >0.05 by the SIMPROF test) in the full-length data set (
Fig. 4A) but were significantly different in the pyrotag data sets. There were also some differences in the clustering of communities collected from the Gulf Stream (GS), EPR, and Eastern North Pacific (ENP) among the three data sets as indicated by the results of the SIMPROF significance test (i.e., subclusters on the nMDS plots), although these samples occupied the same positions on the plots in all three analyses.
The removal of singletons from the pyrotag data set had little effect on the resulting nMDS plot except that the community collected from the ENP at 5 m was statistically similar to the community collected from the GS at 15 m in the nMDS plot constructed with the pyrotag data set after the removal of singletons (
Fig. 4C). On the other hand, the community collected from the ENP at 5 m was not statistically different from the community collected from the EPR at 20 m in the nMDS plot constructed with the entire pyrotag data set (
Fig. 4B).
ACKNOWLEDGMENTS
This work was supported by grants from the National Science Foundation (OCE-0550829, MCB-0703159, MCB-0084231, OCE-1136818) and the Gordon and Betty Moore Foundation. Pyrosequencing was provided by the International Census of Marine Microbes (ICoMM) with financial support from a W. M. Keck Foundation award to the Marine Biological Laboratory at Woods Hole. The full-length sequencing work conducted by the U.S. Department of Energy Joint Genome Institute (JGI) is supported by the Office of Science of the U.S. Department of Energy under contract no. DE-AC02-05CH11231.
We thank the staff from JGI, including Jim Bristow for sponsoring the sequencing, Susannah Tringe and Ed Kirton for processing the sequences, and Ed Kirton for assembling the sequences. William C. Nelson assisted with early bioinformatics analysis of the full-length sequence data, especially the implementation of the chimera-checking algorithm in our informatics pipeline.