INTRODUCTION
Identifying meaningful operational taxonomic units (OTUs) is a significant bottleneck in the analysis of 16S rRNA sequences from complex microbial communities, particularly for large data sets generated by next-generation sequencing. Spurious sequences created by PCR or sequencing errors can greatly inflate the total number of OTUs (i.e., the alpha diversity) of a sample if not treated properly (1, 2). Although attempts have been made to address the problem of inflated alpha diversity from erroneous OTUs (1, 3–5), there have been few attempts to make OTUs that more accurately reflect ecologically cohesive bacterial populations.
Most methods of forming OTUs from next-generation sequencing data use a single genetic cutoff. The most common approach is to cluster sequences into groups based on sequence identity or genetic distance alone (taxonomy-independent [6], taxonomy-unsupervised [7], or de novo [8] clustering). Sequences are usually aligned using a pairwise or multiple-alignment algorithm to create a distance matrix, and sequences are then clustered based on a sequence identity cutoff. Many heuristics have been developed to decrease the computational demand of OTU calling, with various degrees of accuracy, such as CD-HIT (9), UCLUST (8), DySC (10), and ESPRIT (11). Another approach is to bin sequences into groups within a well-curated database of known sequences (taxonomy-dependent [6], phylotyping [12], or closed-reference [13] clustering). Sequences that do not match the database are lost, even though they could represent important, novel organisms. To overcome this problem, novel sequences can be retained as distinct clusters (open reference), but this comes at the expense of speed and convenience. All of these commonly applied techniques rely on a genetic cutoff, typically >97% sequence identity, to inform OTU clustering.
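The single-cutoff idea described above can be illustrated with a minimal sketch of greedy, abundance-sorted clustering. This is not the implementation used by CD-HIT, UCLUST, DySC, or ESPRIT; it assumes the reads are already aligned to equal length, defines identity as the fraction of matching positions, and treats the 0.97 threshold as inclusive. The function names `identity` and `greedy_cluster` are hypothetical.

```python
from collections import Counter


def identity(a: str, b: str) -> float:
    """Fraction of matching positions between two pre-aligned, equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return matches / len(a)


def greedy_cluster(reads, cutoff=0.97):
    """Assign each unique sequence to the first centroid it matches at >= cutoff.

    Unique sequences are visited from most to least abundant, so abundant
    sequences become centroids (an assumption of this sketch).
    """
    counts = Counter(reads)
    # Most abundant first; ties broken alphabetically for reproducibility.
    ordered = sorted(counts, key=lambda s: (-counts[s], s))
    clusters = {}  # centroid sequence -> list of member sequences
    for seq in ordered:
        for centroid in clusters:
            if identity(seq, centroid) >= cutoff:
                clusters[centroid].append(seq)
                break
        else:
            # No existing centroid is similar enough: start a new OTU.
            clusters[seq] = [seq]
    return clusters
```

Note that the same cutoff is applied to every lineage here, which is the limitation the following paragraphs discuss.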
Although it is common to use a single sequence identity cutoff for clustering, more insight can be gained by adjusting the sequence clustering for individual taxonomic lineages (14, 15) or by using multiple genetic cutoffs for analysis (16, 17). Hunt et al. (14) developed a program called AdaptML to infer population boundaries from the ecological information on isolated strains. Different populations were often identified within what would generally be considered one species. Using two closely related populations predicted by AdaptML, Shapiro et al. (18) were able to investigate the early events of bacterial speciation. Koeppel et al. (15) used a program called EcoSim to infer units of bacterial diversity by estimating evolutionary parameters, such as periodic selection and drift, derived from phylogenetic relationships of isolated strains. This method can detect more total populations than are supported by AdaptML using ecology alone (19). Both Youngblut et al. (16) and Nemergut et al. (17) repeated their analyses at various levels of clustering. Youngblut et al. (16) found that using an inappropriate genetic cutoff would have changed their results. All of these studies demonstrate that more biological insight can be obtained from diversity studies when the clustering is done at different levels for different taxonomic lineages.
Sequencing errors, PCR errors, and chimeras are significant issues in next-generation 16S rRNA libraries of microbial diversity. Inflated diversity estimates have been problematic with 454 pyrosequencing (1, 3–5, 20) and Illumina data sets (21, 22). Many attempts have been made to reduce the impact of chimeric sequences and PCR and sequencing errors on estimates of total diversity (3–5). With good-quality filtering and strict error-correcting software, many errors can be detected and removed from the data set, reducing the effective error rate. However, these methods do not help in identifying how these “cleaned” sequences should be grouped into OTUs for downstream analyses.
We hypothesized that identifying the appropriate grouping for each taxonomic lineage and detecting many methodological errors can be accomplished using the distribution of sequences across samples. Bacteria in different populations will respond uniquely to variation in environmental conditions, resulting in different distributions across sampled environments. This has been demonstrated for different taxa under a range of conditions (14, 15) and during disturbance (16). Conversely, 16S rRNA sequences derived from the same population will have the same distribution across sampled environments, whether the sequences come from slightly different copies of the 16S rRNA gene in the same organism, from variation of the 16S rRNA sequence within a population, or from sequences generated randomly in error. Thus, whether the underlying distribution is the same for ecological (i.e., the same population of bacteria) or methodological (i.e., sequencing-error) reasons, sequences sharing that distribution should be considered a group and merged into one OTU.
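One way to make "the same distribution across sampled environments" concrete is a chi-square statistic on a 2 × k contingency table of read counts (two sequences, k samples). The sketch below is an illustration under that assumption and is not presented as the test implemented in the published software; the function name `chi_square_statistic` is hypothetical.

```python
def chi_square_statistic(counts_a, counts_b):
    """Chi-square statistic and degrees of freedom for two count profiles.

    counts_a and counts_b hold the read counts of two sequences in the same
    ordered set of samples. A statistic near zero means the two sequences are
    distributed proportionally across samples.
    """
    if len(counts_a) != len(counts_b):
        raise ValueError("both profiles must cover the same samples")
    total_a, total_b = sum(counts_a), sum(counts_b)
    if total_a == 0 or total_b == 0:
        raise ValueError("each sequence needs at least one read")
    grand_total = total_a + total_b

    statistic = 0.0
    informative_samples = 0
    for a, b in zip(counts_a, counts_b):
        column_total = a + b
        if column_total == 0:
            continue  # a sample containing neither sequence carries no information
        informative_samples += 1
        for observed, row_total in ((a, total_a), (b, total_b)):
            expected = row_total * column_total / grand_total
            statistic += (observed - expected) ** 2 / expected
    return statistic, informative_samples - 1
```

For example, the profiles `[100, 50, 10, 0]` and `[10, 5, 1, 0]` are exactly proportional and give a statistic of zero, whereas `[100, 0]` and `[0, 100]` give a large statistic. The chi-square approximation is unreliable when expected counts are small, so low-abundance sequences would need a different treatment (for instance, a simulated null distribution).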
Our goal was to develop a simple algorithm using the distribution of 16S rRNA sequences across samples to inform the creation of OTUs for large next-generation sequencing studies. This method accommodates differences in the level of genetic differentiation across taxa and reduces the number of redundant OTUs from sequences within the same population or created by sequencing error. To apply this method to 16S rRNA surveys created from next-generation sequencing, we developed an algorithm that uses distribution information, the relative abundances of sequences within all samples, and genetic distance to inform clustering. We compare this method (distribution-based clustering [DBC]) to commonly applied closed-reference (i.e., phylotyping), open-reference (i.e., a hybrid of phylotyping and de novo clustering), and de novo clustering methods using experimental mock-community data sets. We test the accuracy and sensitivity of all clustering methods in identifying true input sequences, clustering sequencing and methodological errors with the input sequences they are derived from, and retaining the information contained in the distribution of sequences across samples. Distribution-based clustering reflects the true distribution of input templates or organisms more accurately than OTUs from methods using sequence identity alone. Finally, we compare the results of each clustering method on a set of unknown samples from a stratified lake, showing that DBC calls fewer OTUs than either the de novo or open-reference method yet is able to discriminate OTUs differing by a single base pair that show evidence of differing ecological roles. The source code, test data, and user guide are freely available for download at https://github.com/spacocha/Distribution-based-clustering.
DISCUSSION
We present a novel method of calling OTUs that uses the ecology of the organisms they represent to inform the clustering. Typically, only genetic information is considered when forming OTUs. Incorporating information such as abundance and distribution into the OTU formation process creates OTUs that more accurately cluster sequences by the template or organism of origin and improves the information content of the resulting OTUs.
The gross trends in the data are similar, regardless of clustering algorithms. Principal-coordinate analysis (PCoA) plots, which identify the most obvious differences between samples, were similar across clustering methods (see Fig. S7 and S8 in the supplemental material). PCoA is particularly effective when the variable of interest (e.g., depth or disease state) is associated with major changes in community structure but is less effective at detecting subtle variations in community structure. Furthermore, it cannot pinpoint the specific sequences that drive these associations. Other approaches, such as univariate tests, including the Mann-Whitney U test and Fisher's exact test, and statistical learning techniques, such as random-forest classification, can test for associations between bacterial species abundance and environmental metadata (36). Optimizing the clustering algorithm to detect such associations will increase the chances of gaining important biological insight. Thus, accurate OTU formation may not be as critical when trends in the data can be discerned at higher taxonomic levels, such as the ratio of Bacteroidetes to Firmicutes in obesity (37). However, differences between closely related organisms are crucial for identifying evolutionary and ecological mechanisms (18). In such cases, distribution-based clustering may be one of only a few tools that can be used to distinguish the signal from the noise of sequencing errors.
Run time is currently a severe limitation to implementing distribution-based clustering on very large data sets. Although many improvements can be made to the algorithm itself to increase the speed of the program (likely with lower accuracy), any implementation will likely be more computationally intensive than other methods, since it involves processing additional information. Steps can be taken to reduce the total run time, such as increasing the abundance skew (e.g., 100-fold more abundant representative sequences), decreasing the total-distance cutoff allowed for forming clusters (e.g., a cutoff of 0.05), or filtering out low-abundance sequences (e.g., singletons). All of these steps decrease the total number of pairwise comparisons and reduce the run time. However, they will also decrease the accuracy of the algorithm at removing incorrect OTUs (see Fig. S4 in the supplemental material).
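The effect of these three settings on the number of pairwise comparisons can be sketched as a pre-filter that lists the (parent, child) sequence pairs still eligible for a distribution test. This is an illustrative sketch, not the published implementation: the function name `candidate_pairs`, its parameter names, and its default values are hypothetical, and the distances are assumed to be precomputed and keyed by unordered pairs of sequence identifiers.

```python
def candidate_pairs(abundance, distance, min_skew=1.0, max_distance=0.10,
                    min_abundance=1):
    """List (parent, child) pairs that pass the abundance and distance filters.

    abundance: dict mapping sequence id -> total read count over all samples
    distance:  dict mapping frozenset({id1, id2}) -> genetic distance
    A pair is kept when the parent is at least min_skew times as abundant as
    the child and the two sequences are within max_distance of each other.
    """
    # Drop low-abundance sequences first (min_abundance=2 removes singletons).
    kept = [seq for seq, n in abundance.items() if n >= min_abundance]
    pairs = []
    for parent in kept:
        for child in kept:
            if parent == child:
                continue
            if abundance[parent] < min_skew * abundance[child]:
                continue  # parent is not abundant enough relative to child
            if distance[frozenset((parent, child))] > max_distance:
                continue  # too divergent to be considered for merging
            pairs.append((parent, child))
    return pairs
```

With four sequences of abundance 1000, 50, 5, and 1, permissive settings leave six candidate pairs, whereas a 100-fold skew, a 0.05 distance cutoff, and singleton removal can reduce this to a single pair, which shows how quickly the workload shrinks as the filters are tightened.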
There are some cases where the distribution-based clustering method should be used with caution. Distribution-based clustering predicts the most accurate OTUs when sequences are distributed in an ecologically meaningful way across samples, as in the mock community or in a stratified lake. However, methodological issues creating nonrandom errors across samples (e.g., different error rates across sequencing cells or runs) will increase the number of erroneous sequences that distribution-based clustering will keep as distinct OTUs (see Table S5 in the supplemental material). Nevertheless, distribution-based clustering still creates the most accurate OTUs of all clustering methods, even with the methodological errors found in the analysis. Users should also consider whether grouping sequences using a statistical test of similarity will impact the statistics of their downstream analyses.
Although no method formed OTUs that were as accurate as the distribution-based method with these mock communities, there are situations when different methods might be a more appropriate choice. Closed-reference clustering has the advantage of speed and convenience, especially for downstream processing, because information about the reference sequences can be precomputed (e.g., phylogenetic trees and taxonomic information). De novo clustering may be a good choice for higher-taxonomic-level analyses, as overclustering species should not affect phylum-level changes across samples, especially when the total number of predicted OTUs can affect the results. Open-reference clustering is less discriminating and tends to grossly overestimate the number of OTUs. However, it seems to be a good alternative when looking for trends between closely related organisms, especially if low-abundance OTUs can be filtered out.
When applied appropriately, each of the different clustering methods analyzed here can facilitate the discovery of important trends in 16S rRNA library sequence data. The introduction of the distribution-based clustering method gives researchers an additional tool that allows distinct OTUs to be retained even if they differ at a single base pair in a background of high microdiversity or sequencing error.