INTRODUCTION
The definition of species as the unit of biodiversity for
Bacteria and
Archaea has been a longstanding problem in microbiology, with important conceptual implications for the study of microbial ecology and diversity (see, e.g., references
1–3). Currently, the predominant view is that species should be named based on a consensus of phenotypic and genomic characteristics (
4). Indeed, for most named species, this consensus does exist, resulting in a classification scheme with adequate stability, operability, and predictability (
5). For example, named species with available sequenced genomes can be reliably demarcated by a genome-aggregate average nucleotide identity (ANI) of 95%, with an accuracy of >98% (
6–8). However, analysis of whole genomes for environmental and diversity studies is far less common than an analysis of individual marker genes, notably the 16S rRNA gene sequence. Indeed, analysis of 16S rRNA full-length or partial gene sequences has revolutionized the study of prokaryotic diversity during the past 2 decades.
A typical 16S rRNA gene analysis pipeline involves the recovery of 16S rRNA-encoding sequences from environmental samples, their subsequent clustering at 97% nucleotide identity, and finally, a count or comparison of the resulting operational taxonomic units (OTUs), which are used as a unit of diversity, approximating the species level (see, e.g., references
9 and
10). The 97% identity threshold was first proposed as a (purposefully conservative) lower boundary to subsequently screen isolates for assignment to species using higher-resolution methods, such as DNA-DNA hybridization (
11), and is the default in two of the most popular pipelines for 16S rRNA gene analyses: QIIME and mothur (
12,
13). Moreover, surveys and estimations of the extant prokaryotic species diversity on Earth are often based on OTUs defined by 97% 16S rRNA gene sequence identity (see, e.g., references
10,
14, and
15). However, the more recently established 98.5% 16S rRNA gene nucleotide identity threshold appears to more precisely reflect genomic and nomenclatural standards for the species level (
6,
11). Other thresholds have been proposed as well, such as 98.65% (
16) and 98.7% (
17). In any case, it is now well appreciated that the above-mentioned 16S rRNA thresholds, although they should not be equated with the genomic and phenotypic standards for species definition (
4), represent an important unit of microbially diverse populations that is close enough to the species level, i.e., they can serve as a reliable proxy for measuring microbially diverse populations. It has also been realized that 16S rRNA genes are too conserved to delineate closely related species; thus, the 97% identity level (or even 98.5 to 98.7%) may lump together distinct species (
18,
19). For example, identical 16S rRNA gene sequences can be found between pairs of phenotypically and genomically distinct species, including some from the genera
Campylobacter,
Xanthomonas,
Escherichia,
Mycobacterium,
Yersinia, and
Cycloclasticus. However, the simplicity of the approach, its broad applicability due to the availability of (nearly) universal PCR primers to amplify 16S rRNA gene sequences from most prokaryotes, and the existence of large reference databases to facilitate analysis have made 16S rRNA gene-based surveys the method of choice for diversity studies.
Genome-derived parameters, such as the 95% average nucleotide identity (ANI), have been shown to better encompass the named species based on isolates (
8,
20) and the natural sequence-discrete populations sampled by metagenomics (
21), as these parameters capture better than 16S rRNA gene sequence analysis the traditional methods and standards for demarcating species, e.g., DNA-DNA hybridization (
20). For instance, Kim and colleagues estimated that the precision of the 98.65% 16S rRNA gene threshold for species demarcation is 92.2% (
16), meaning that about 8% of the pairs of genomes showing a 16S rRNA gene identity of >98.65% show an ANI of <95%; thus, they should be actually assigned to different OTUs or species. These results, which echoed previous similar studies with a smaller collection of genomes (
6), indicated that although genome-derived methods offer higher resolution than 16S rRNA gene-based ones, the difference is probably not dramatic. However, it is likely that these results are misleading with respect to how many distinct species may exist in a habitat that are grouped under the same 16S rRNA OTU for several reasons.
First, the genomes sequenced during the previous decade aimed to cover phylogenetic diversity, as opposed to close relatives, with the possible exception of pathogenic taxa. Yet, several close relatives often cooccur within a natural habitat (
21), which could amplify the above-mentioned limitation of the 16S rRNA gene method. Second, environmental surveys do not typically include complete high-quality 16S rRNA gene sequences but instead error-prone short sequences of the V4 or V6 (or other) region, with the potential effects of both underestimating the number of OTUs due to insufficient resolution and inflating OTU counts due to sequence artifacts and errors. Finally, the previous studies were focused on whether or not organisms should be assigned to the same species (or OTU) and did not evaluate how many distinct species are represented by these organisms, which is more relevant for environmental diversity surveys. Accordingly, it still remains speculative as to how much 16S rRNA gene analysis underestimates the prokaryotic diversity sampled within natural habitats. Addressing this issue is important for the estimation of prokaryotic community diversity richness, a highly important topic for diversity cataloguing and conservation (cf. reference
15 and debate therein), as well as for assessing diversity shifts across spatial or temporal scales, which is typically based on comparative studies of alpha and beta diversity. Importantly, the distribution of features, such as gene and metabolic pathways underlying abundance profiles, may differ between species and higher taxonomic levels (
1,
14), further underscoring the importance of accurate and reproducible comparisons. Such comparisons require a clear understanding of the taxonomic richness, which may differ depending on the level of genetic or phylogenetic depth assessed (e.g., 97% 16S rRNA gene versus 95% ANI [
6]). How OTUs are defined is particularly relevant for the underlying evolutionary or ecologic assumptions during these alpha/beta diversity comparisons as well (see, e.g., references
1 and
3).
To provide quantitative estimates of the degree of underestimation of naturally occurring diversity by 16S rRNA and, thus, guidance on how to perform diversity surveys more accurately in the future, we evaluated the 16S rRNA gene identities and ANI values in a collection of 8,350 complete genomes to determine the optimal threshold of 16S rRNA gene identity corresponding to 95% ANI, and we compared the number of OTUs defined at the 97% or 98.5% 16S rRNA gene identity level to those defined at 95% ANI based on the same genome sequences.