INTRODUCTION
Rapid identification of the species of isolated bacteria is essential for surveillance for human and animal health and for choosing optimal treatment and control measures. Since the beginning of microbiology more than a century ago, this has to a large extent been based on morphology and biochemical testing. However, for more than 30 years, 16S rRNA sequence data have served as the backbone for the classification of prokaryotes (
1), and tremendous amounts of 16S rRNA sequences are available in public repositories (
2–4). However, due to the conserved nature of the 16S rRNA gene, the resolution is often too low to adequately resolve different species and sometimes is not even adequate for genus delineation (
5,
6). Furthermore, many prokaryotic genomes contain several copies of the 16S rRNA gene with substantial intergene variation (
7,
8). It is also considered problematic that this gene represents only a tiny fraction, roughly about 0.1% or less, of the coding part of a microbial genome (
9).
Second- and third-generation sequencing techniques have the potential to revolutionize the classification and characterization of prokaryotes and is now being used routinely in some clinical microbiology labs. However, so far no consensus on how to utilize the vast amount of information in whole-genome sequence (WGS) data has emerged (
10). Nevertheless, a number of different methods have been proposed. Roughly, they can be divided into those that require annotation of genes in the data and those that employ the nucleotide sequences directly (
9).
One of the first attempts to employ WGS data for taxonomic purposes was carried out in 1999 (
11). At the time, 13 completely sequenced genomes of unicellular organisms were available, and distance-based phylogeny was constructed on the basis of the presence and absence of suspected orthologous (direct common ancestry) gene pairs. Later, it was recognized that methods that take into account gene content can be greatly influenced by horizontal gene transfer (HGT), and alternative methods were developed that used homologous groups (gene family content) (
12) or protein domains (
13).
Functional protein domains also form the basis of a recent approach developed by our group (
14). Here, the protein domains are combined into functional profiles of which some are species specific and can thus be used for inferring taxonomy.
As an extension of 16S rRNA analysis, which focuses on a single locus, super multilocus sequence typing (SuperMLST) has been proposed (
15). It relies on the selection of a set of genes that are highly conserved and hence can be used with any organism. In a publication from 2012, Jolley et al. suggested that 53 genes encoding ribosomal proteins be used for bacterial classification in an approach called ribosomal MLST (rMLST) (
16). Not all 53 genes were found in all bacterial genomes, but due to the relatively high number of sampled loci, this is not considered problematic. The rMLST method forms the basis of a proposed reclassification of
Neisseria species (
17) and has also been used for analyzing human
Campylobacter isolates (
18).
It is also possible to employ the sequence data directly without preannotation of genes. For instance, this can be done using BLAST (
19). An alternative, faster approach would be to look at
k-mers (substrings of
k nucleotides in DNA sequence data) and use the number of cooccurring
k-mers in two bacterial genomes as a measure of evolutionary relatedness. Using the
k-mer-based approach, we have developed a method, KmerFinder, which examines all regions of the genomes, not only core genes (
20). Furthermore, a gene segment will score highly despite the transposition of a gene segment within the genome since only the flanking regions will be mismatched.
In the current study, we have trained five different methods for species identification on a common data set of complete prokaryotic genomes: (i) the SpeciesFinder method, which serves as the baseline as it is based solely upon the 16S rRNA gene; (ii Reads2Type which is a variant that searched for species-specific 50-mers, predominantly within the 16S rRNA gene, with the help of non-species-specific 50-mers to quickly narrow the search; (iii) rMLST, which predicts species by examining 53 ribosomal genes; (iv) TaxonomyFinder, which is based on species-specific functional protein domain profiles; and finally (v) KmerFinder, which predicts species by examining the number of overlapping 16-mers.
The publicly available databases contain ample amounts of WGS data from prokaryotes, enabling us to conduct a large-scale benchmark study of the proposed methods. Hence, the process of reaching a consensus on how the WGS data should optimally be used for prokaryotic taxonomy is initiated.
DISCUSSION
In the present study, we trained five different methods for prokaryotic species identification on a common data set and evaluated their performance on three data sets of draft genomes or short sequence reads.
The SpeciesFinder method is based on the 16S rRNA gene, which has served as the backbone of prokaryotic systematics since 1977 (
1). Accordingly, sequencing of the 16S rRNA gene is a well-established method for identification of prokaryotes and has, in all likelihood, been used for annotating some of the isolates in the training and evaluation sets. In the light of this potential advantage of the SpeciesFinder method over the other methods, it is noteworthy that it had the lowest performance on all evaluation sets. Previous studies, however, have also pointed to the many limitations of the 16S rRNA gene for taxonomic purposes (
5–9). Examples, which are also observed in this study, include its inadequacy for the delineation of species within the
Borrelia burgdorferi sensu lato complex and the
Mycobacterium tuberculosis complex (
35). Similarly,
in silico studies of the applicability of the 16S rRNA gene for the identification of medically important bacteria led to the authors concluding that although the method is useful for identification to the genus level, it is able to identify only 62% of anaerobic bacteria (
36) and less than 30% of aerobic bacteria (
37) confidently to the species level.
The performance of SpeciesFinder was surpassed only marginally by Reads2Type. This is not surprising since the two methods are conceptually very similar: SpeciesFinder utilizes the entire 16S rRNA gene of approximately 1,540 nucleotides, while for most species, Reads2Type searches for species-specific 50-mers in the same gene. In terms of its future usability, Reads2Type has, however, one advantage over the other methods: like most of the other methods it is available as a web server, but uniquely it does not require the read data to be uploaded to the server. Instead, a small 50-mer database is transferred to the user's computer, and all computations are performed there. As a result, bottleneck problems on the server are avoided, and the data transfer is minimized, which may be particularly advantageous for users with limited Internet access.
While SpeciesFinder and Reads2Type sample only one locus, the rMLST method samples up to 53 loci—all ribosomal genes located to the chromosome of the bacteria. Evaluating on the data set of SRA draft genomes, rMLST, TaxonomyFinder, and KmerFinder performed equally well. However, on the more diverse and difficult set of NCBI draft genomes, the rMLST method performed only marginally better than SpeciesFinder and significantly worse than TaxonomyFinder and KmerFinder. In particular, the rMLST method consistently made incorrect identifications of a number of closely related species, e.g.,
Y. pestis versus
Y. pseudotuberculosis (
38) and
M. tuberculosis versus
M. bovis (
39). Also, rMLST consistently predicted the human pathogen
B. anthracis to be
B. thuringiensis. The latter is used extensively as a biological pesticide and is generally not considered harmful for humans.
B. anthracis and
B. thuringiensis are both members of the
B. cereus group and genetically very similar, with most of the disease and host specificity being attributable to their plasmid content (
40,
41). It has even been suggested that all members of the
B. cereus group should be considered to be
B. cereus and only subsequently be differentiated by their plasmids (
42). Hence, in concordance with rMLST sampling only chromosomal, core genes, it is not surprising that the method fails to distinguish these isolates. A similar example is given by the rMLST method identifying all
E. coli isolates as
Shigella sonnei. Although
Shigella sp. isolates have been rewarded their own genus, the separation of the genus from
Escherichia spp. is mainly historical (
43–45). To be sure, some of the mistakes commonly made by rMLST as well as the other methods highlight taxonomic taxa that are intrinsically difficult to distinguish due to a suboptimal initial classification. Although
Shigella has for several years been considered a substrain of
E. coli, the practical implications of renaming it are considered insurmountable. It should also be noted that the rMLST method was not developed for usage with a fixed training set but, rather, with all known alleles. Accordingly, the performance of the method is expected to improve with increased size of the reference rMLST database, which is currently expanding rapidly (Keith Jolley, Department of Zoology, University of Oxford, United Kingdom, personal communication).
The TaxonomyFinder method was the second most accurate method on the set of NCBI draft genomes and performed in the top for the SRA drafts set. In contrast to the other methods, it does not work directly on the nucleotide sequence of the isolates but, rather, on the proteome, utilizing functional protein domain profiles for the species prediction. It was the slowest of the tested methods, but in return for the extra time, the user is rewarded with an annotated genome.
The KmerFinder method performs its predictions on the basis of cooccurring k-mers, regardless of their location in the chromosome. It had the overall highest accuracy, worked on complete or draft genomes as well as short reads, and was found to be very robust as well as fast. Furthermore, the KmerFinder method holds promise for future improvements as the implementation used for this study was very simple. Only the raw number of cooccurring k-mers between the query and reference genome was considered although a parallel analysis indicated that the performance could be improved even further if more sophisticated measures were used, also taking into account the total number of k-mers in the query and reference genome. KmerFinder took approximately 9 s per query genome, which makes it the fastest of the tested methods. To test the general applicability of sampling the entire genome and not preselected genes or sets of genes for the species prediction, we also implemented a whole-genome BLAST-based method. The method used hit aggregation of significant matches between the query genome and all genomes in the common training set. As the final prediction, the species for which the query genome had the most bases matched was selected. The performance of this whole-genome BLAST-based method was tested on the NCBIdrafts and SRAdrafts evaluation sets and found to be very similar to that of KmerFinder (see File S2 in the supplemental material). The method was, however, almost 20 times slower than KmerFinder, taking approximately 3 min per genome.
It has previously been noted that some of the isolates present in public databases and, hence, used in this study, are wrongly annotated (
17,
46,
47). Based on the current study, it is likely that at least the six isolates from the NCBI
drafts set that all methods identified as something other than the annotated species are wrongly annotated or, alternatively, most closely related to an isolate in the common training data that is wrongly annotated. In agreement with this, one of the isolates has indeed been reannotated since we initially downloaded the data. Of the remaining five isolates, two
B. cereus isolates were found to be most closely related to the
Bacillus weihenstephanensis strain KBAB4 of the common training set. This strain is the single representative of the species in the public database and not the type strain. Hence, there is no guarantee that the sequenced strain represents the named taxon (
48). The same is the case for the
Clostridium botulinum strain C Eklund, which is predicted to be a
Clostridium novyi based on its close resemblance to
C. novyi strain NT of the training set.
Clostridium novyi strain NT is the only representative of this species in the database and not the type strain. Obviously, all the evaluated methods are highly dependent on the size and the accuracy of the set of genomes that they are trained on. Accordingly, all methods have the potential to improve their performance in the future when more genomes become available and when the present mistakes in the public databases are corrected. Another way to ensure future improvement is to combine the individual predictions of the methods and let the final predicted species of a query genome be decided by a majority vote. We are currently planning to implement such a system.
In the current study, we included only species in the evaluation sets which were also present in the training set. We have hence not tested how the methods would perform when presented with a species not included in the training set. SpeciesFinder searches for the closest match in the query genome to a database of 16S rRNA genes. If the species of the query genome is not represented in the database, the closest match is likely to be of a closely related species, but the method will also test if the percent identity and coverage of the 16S rRNA gene are above 98% and mark the prediction as “failed” if the match is below this threshold. The rMLST method searches for closest matches in a database of 53 different ribosomal genes. In our implementation, the method will not provide an output if the percent identity and coverage of the matches are below a threshold of 95%, and hence it will be able to select only a closely related species for species that are not represented in the training set. Other implementations of the rMLST method, however, would not necessarily have this limitation. The TaxonomyFinder method uses species- or phylum-specific protein profiles and would hence identify the correct phylum if the species of the query genome was not in the training set. Along with the predicted species, the KmerFinder outputs the number of cooccurring k-mers that the selection was based on. A high number of k-mers indicates that the identification is probable, while low numbers of k-mers indicate that the predicted species is likely to be a related species and that the actual species is not in the training data. Further investigations would be necessary to identify a threshold for the number of k-mers to make this distinction.
While some taxonomists consider the goal of bacterial taxonomy to “mirror the order of nature and describe the evolutionary order back to the origin of life” (
6,
49), a more pragmatic and applied view is likely to be advantageous for epidemiological purposes, where most outbreaks last less than 6 months. The number of prokaryotic genomes in public databases is currently sufficiently high to replace theoretical views of which loci to sample for optimal species identification by actual testing of how different approaches perform. One locus (the 16S rRNA gene) was initially used for sequenced-based examination of relationships between bacteria, and when the approach was found to have limitations, more loci were added in MLST and multilocus sequence analysis (MLSA) (
50,
51). The addition of still more loci has been suggested for improving MLSA even further (
16,
35). This study suggests that an optimal approach should not be limited to a finite number of genes but, rather, look at the entire genome.
Conclusion.
The 16S rRNA gene has served prokaryotic taxonomy well for more than 30 years, but the emergence of second- and third-generation sequencing technologies enables the use of WGS data with the potential of higher resolution and more phylogenetically accurate classifications. Methods that sample the entire genome, not just core genes located to the chromosome, seem particularly well suited for taking up the baton.