INTRODUCTION
The human gut microbiome is a complex environment hosting a large number of different bacteria. A cost-effective method to determine the bacterial composition of, e.g., human fecal samples is to sequence amplicons targeting the 16S rRNA gene. Microbial compositions of diverse environments, which are influenced by different factors or conditions (e.g., sampling time point, targeted rRNA region, response to health or disease, sequencing strategy, machinery, depth, and read lengths), were also studied with this method (
1–7).
The 16S rRNA gene spans about 1,500 bp and is structured in highly conserved regions interspersed with nine variable regions (V-regions), V1 to V9 (
8,
9). The conserved regions can be used for primer binding and thus allow for capturing a greater number of different bacterial taxa, sometimes including or not including archaea, while the variable regions permit the discrimination of these taxa within different microbial environments (
10). However, differences between the conserved regions and, therefore, differences in primer annealing result in an unequal amplification of bacteria present in a sample (
11). Depending on the particular V-region that was targeted, differences in the sequencing results and taxonomic outcome occurred, which led to misinterpretation (
12,
13). Further, not every variable region has the same sensitivity, i.e., allowing separation of closely related taxa (
14). Concerning archaea, the applicability of certain primer pairs has been covered well in previous studies (
12,
15,
16).
Second-generation sequencers, e.g., Illumina’s MiSeq, enable sequencing of amplicons up to 600 bp with high accuracy. This length allows targeting about one to three adjacent variable regions of the 16S rRNA gene using “universal” primers for the conserved regions. In a subsequent PCR, sequencing adapters are added to the amplicons (
17). After a cleanup step, the amplicon libraries are sequenced. The resulting reads are used to analyze similarities and differences between samples with different microbial compositions (e.g., alpha- and beta-diversity) (
18). In contrast, full-length 16S rRNA gene sequencing is possible by using third-generation sequencers, for instance, Oxford Nanopore MinION (
19) and the PacBIOs Sequel (
20), which were introduced in 2009 and 2008, respectively. The greatest advantage is the long read length (up to 10,000 bp) and sequencing on a single-molecule level in a short time. These long reads enable an improved identification of bacterial taxa, as shown in several recent studies (
21–27). Nevertheless, significant drawbacks include the relatively high error rate (up to 15% per sequence) (
28,
29), limited applicability in high-throughput studies, higher general costs, and even less standardization of protocols and analysis pipelines. However, despite the widespread use of 16S rRNA gene sequencing, there is a need to better understand the differences between the targeted region and the data analysis pipeline chosen in amplicon sequencing of the 16S rRNA genes.
For short-amplicon sequencing, a literature survey showed that the regions V1-V2/V3 (
30,
31), V3-V4/V5 (
32–34), and V4 (
35,
36) are most commonly used. However, the taxonomic classification differs considerably when targeting different variable regions (
37), affecting attempts to perform cross-study comparison and leading to further biases in compositional analysis, where short-amplicon primers are not as universal as desired (
11,
38). Since the taxonomic resolution seems to differ for some phyla for different variable regions (
39), closely related bacterial species and genera might be indistinguishable (
40). Moreover, the choice of bioinformatic processing pipelines and analysis tools is known to influence the results (
41–44). Different 16S rRNA gene-specific taxonomic classification methods, such as Mothur (
45), Qiime (
46), Qiime2 (
47), DADA2 (
48), and others, were developed. During data processing, sequences are clustered into operational taxonomic units (OTUs) at a threshold of 97% sequence similarity. Sequence representatives, i.e., sequences with the least mismatches to other sequences in a cluster, are used for taxonomic assignment. Amplicon sequence variants (ASVs) or zero-radius OTUs (zOTUs) have been suggested as alternatives to OTUs (
48,
49), as they correct for sequencing errors by different denoising approaches. In contrast to OTUs, these clusters are supposed to contain reads originating only from the same bacterial species, enabling a cross-study comparison (
49,
50). In any case, after clustering, sequences are classified for taxonomic assignment using databases of known 16S rRNA gene sequences, e.g., GreenGenes (GG) (
51), the Ribosomal Database Project (RDP) (
52), Silva (
53), the genomic-based 16S rRNA Database (GRD) (
54), or The All-Species Living Tree (LTP) (
55). Not only different pipelines and reference databases but also settings of a given pipeline influence the results and are an often-overlooked bias in microbiome studies (
42,
56–58). Nevertheless, some biases occurring in 16S rRNA gene amplicon sequencing have already been addressed in the past. Well-studied biasing factors, for instance, include sampling and storage procedures (
59–63), DNA extraction methods (
64–68), choice of variable region and primers (
12,
36,
69–72), library preparation and sequencing strategies (
73–76), and sequence data processing, including denoising, taxonomic classification, and the use of distinct bioinformatic tools (
42,
56–58). Further, the use of negative controls and mock communities as internal standards to detect contamination or aberrancies in the sequencing results was proposed (
77–79).
In this study, we joined several of these separate issues to raise awareness that the combination of primer sequence choice, clustering methods, reference database, and analysis parameters must be considered thoroughly to avoid increased bias. Thus, we created a large benchmark data set of 16S rRNA gene amplicon sequences, targeting different V-regions of the 16S rRNA gene, and systematically tested different software tools with different sets of parameters for the analysis. We sequenced three mock communities of increasing complexity with known composition, along with complex human fecal samples for comparison.
DISCUSSION
For short-amplicon 16S rRNA gene sequencing, primers spanning more than one V-region are commonly used, which enhances precision in identifying bacteria compared to a single region. Some of the most frequently used primer pairs enclose V1-V3, V3-V4, and V3-V5, which were used in large population-based cohorts, e.g., the Human Microbiome Project and others (
1,
33,
34). Nevertheless, each different primer pair or V-region used will cause bias in the data. In addition, sampling and sample storage, sample processing (including DNA extraction and amplicon generation), sequencing analysis, and data processing introduce further bias. In the last 10 years, many of these factors were studied for a variety of ecosystems, e.g., the human gut (
31,
40,
59,
68,
74,
81–84), oral and skin microbiomes (
64,
85,
86), food-related ecosystems (
87,
88), and environmental microbiomes such as water, marine environments, and sludge (
16,
69,
72,
89–91). Nevertheless, the combination of different bias-causing factors was rarely studied. In this study, we analyzed the effects of choice of primer, reference databases, clustering method, and specific pipeline settings in combination on human stool samples and mock communities with increasing complexity using recent approaches. We wanted to highlight the contribution of each these factors to the precision of taxonomic assignment, providing the scientific community with up-to-date guidelines for experimental design and data analysis. Anticipating conclusions, each experimental setting (e.g., cohort and environment) needs to be tested up front for best performance using different experimental settings and strategies.
First, the effect of different primer pairs on the corresponding microbial profile was evaluated. Irrespective of the reference database, the primer pair 341F-785R (V3-V4) slightly outperformed the other combinations and is, therefore, a justified choice for human gut samples. This is also in accordance with Thijs et al. (
71), who suggested the primer pair 341F-785R to be a good match for soil and plant-associated bacterial microbiome studies, and Rausch et al. (
92), who recommended the use of the V3-V4 region over V1-V2. The sequences produced by using the primer pair 515F-944R (V4-V5) performed well when analyzing the microbiota profile of the Zymo mock community but showed poor performance on the more complex ZIEL-I and ZIEL-II mock communities, suggesting that the primer combination may not be suitable for complex microbial ecosystems at all. This highlights also the importance of including mock communities in routinely performed 16S rRNA gene analysis, as a theoretical sequence analysis by Yang et al. (
14) suggested the V4-V5 region to be a good match based on its robustness in representing the full-length 16S rRNA sequences and, therefore, theoretically seemed to be a good primer pair. However, it did not perform well when real samples were used.
Obviously, mock communities do not fully reflect the complexity of a microbial community as it is seen in, e.g., human stool samples. Therefore, we included 33 human fecal samples in our analysis as well. Here, phylum-level classification is robust across the use of different primer pairs targeting different V-regions for
Bacteroidetes (except 515F-944R),
Proteobacteria, and
Firmicutes. In contrast, the detection of
Actinobacteria,
Tenericutes,
Lentisphaerae, and
Verrucomicrobia varied across the use of different primer pairs, highlighting that the choice of primer should be considered carefully. Intraindividual comparison at genus level showed a high degree of variability across the different targeted regions. This was due to many unknown or unclassified taxa at genus level as well as a generally large number of different taxa. This highlights the need for ecosystem-specific reference databases (
93,
94) and new bioinformatic tools that can integrate data across V-regions by taking into account region-specific bias. Here, we notice a need for large-scale studies covering multiple V-regions, which would allow for training taxonomic classifiers that can dynamically account for any region-specific bias. This would possibly be obsolete by sequencing the full-length 16S rRNA gene, although sequencing would still be influenced by the primer choice, i.e., 27F and 1492R, for nearly full-length sequencing. Full-length 16S rRNA gene sequencing is possible by using third-generation sequencing strategies (
24,
26,
27) or by the generation of short reads that are later
de novo assembled to a synthetic full-length sequence (
95). Those methods seemingly offer taxonomic identification down to species or even strain level (
27). Both approaches are not yet well established for high-throughput sequencing and are not cost-efficient, reproducible, or easy in handling and thus need further investigation to be competitive. Further, long-read sequencing still suffers from comparably high error rates (
29,
96).
It is known that the use of different bioinformatic pipelines can have an impact on the determined microbiota composition (
40,
43,
65,
97). However, the influence of reference databases for taxonomic prediction was, to our knowledge, not intensively studied. In this study, we evaluated the performance of five different databases using three different mock communities. We tested the ability of each database to identify the correct taxonomy and assessed how well the known diversity of the mock samples could be captured by each database. Our finding illustrated that the Silva and RDP databases were the most accurate 16S rRNA gene databases, showing similar performances consistently superior to those of GRD, LTP, and GG in terms of true positives at the genus level. GG failed to classify
Escherichia/Shigella,
Listeria,
Acetatifactor,
Bacillus,
Clostridium, and
Pseudomonas, in line with the results of Park and Won (
98), who found GG to be subpar compared to Silva. GG was last updated in 2013, and any usage is highly questionable.
In addition to the above, we found that quality assessment for each particular database could be conducted only when using a variety of V-regions and a sufficient complex mock community. Low-complexity mock communities using common bacteria did not reveal database issues. Thus, low-complexity mock communities might be used as positive controls in existing pipelines for general quality monitoring, but they are not recommended for detecting fundamental issues when setting up a new study, pipeline, or laboratory. Further, concerning other body sites (or environments), specific mock communities of sufficient complexity should be used. Certainly, the addition of ubiquitous bacteria, like the skin commensal Cutibacterium acnes in humans and other such bacteria, should be considered.
A third factor influencing taxonomic assignment is constituted by the denoising and OTU clustering steps of data analysis. To investigate this aspect, we compared classical OTUs generated by ≥97% clustering Qiime1, ASVs generated by DADA2 denoising (
48), and zOTUs generated by the USEARCH denoising algorithm (
49,
99). The numbers of features identified by these clustering approaches were nearly identical across all three approaches for the tested mock community. ASV clustering performed well in the human data sets despite the increased complexity, supporting the results of previous studies (
42,
100), which suggests that ASVs are the current best choice, as they showed the highest accordance with the theoretical composition of the tested mock community. However, zOTUs performed very similarly and are more robust and user-friendly concerning the input.
Specific settings, e.g., the truncation length, influence the number of reads retained for further analysis steps, as we have demonstrated. Selecting a suitable truncation length is of importance, as too-short reads have short or missing overlaps that lead to problems during merging. Conversely, too-long reads can be difficult to merge, as they show lower sequence quality. The varying number of detected ASVs for different truncation lengths is linked to the trade-off between incorporating reads of lower quality and the sensitivity for detecting low-abundance genera. By systematically reducing the reverse read length, the number of rarely observed sequences increased, as sequencing errors decrease. This highlights an important role for this parameter in the reproducibility of analysis results. To assess this potential bias, we suggest using sufficiently complex mock communities of known composition to determine suitable truncation lengths. Further, it is important to report this parameter (as well as all others) with respect to reproducibility of analysis results.
In summary, our results across 3 mock communities and 33 human samples suggest using primers for the V3-V4 region, which show good overall performance for human gut samples. As a reference database, we recommend using either Silva or RDP. Even though only minor differences were observed between clustering methods, we currently recommend using ASVs or zOTUs, with negligible difference between the two. Regarding pipeline settings, we suggest that truncated length combinations should be tested for the primer pairs used in each study. For example, we would suggest for V4 reads truncated to 250 bp and 180 bp for forward and reverse, respectively. However, the last settings depend on the amplicon lengths of the V-regions. To guarantee comparable and reliable results, we recommend creating specific (i.e., reflecting the targeted microbial environment) and sufficiently complex mock community to test whether the study design and the analysis pipelines will be suitable for the bacterial community of interest or type of sample desired (
Fig. 7).