INTRODUCTION
Since the advent of whole-genome sequencing, it has become clear that current bacterial taxonomy is not always consistent with bacterial evolutionary history. In particular, some official taxa are not monophyletic, and taxa of the same rank often differ in the diversity they represent in terms of sequence identity or shared gene content. The genus
Lactobacillus is a prime example of both problems (
1). First, it is larger than a typical bacterial family. Second, the genus is paraphyletic: the genera
Pediococcus,
Leuconostoc,
Weissella,
Oenococcus, and
Fructobacillus are also descendants of the most recent common ancestor of
Lactobacillus. Together, these six genera form a monophyletic taxon that is sometimes referred to as the
Lactobacillus genus complex (LGC) (
2). The LGC is an important source of bacteria with medical, food, and feed applications (
1) and is a key player in the human microbiome (
3,
4). Because of all this, LGC taxonomy has been the topic of much study and debate, especially during recent years (
1,
5–7).
In an attempt to systematically correct inconsistencies in the official taxonomy of all
Bacteria and
Archaea, Parks et al. (
8) recently constructed the Genome Taxonomy Database (GTDB). They built a phylogeny of a dereplicated subset of all sequenced bacterial and archaeal genomes and determined the diversity within each official taxon using a heuristic called relative evolutionary divergence (RED). With this information, they corrected taxa that were not monophyletic or that contained a diversity much above or below the average diversity of their rank. In their alternative, genome-based taxonomy, all six LGC genera were integrated into the family
Lactobacillaceae, and the genus
Lactobacillus was split into 15 smaller genera.
The approach of Parks et al. (
8) to split/merge only taxa that are outliers in terms of diversity within their rank is appealing for the taxonomic ranks from phylum to genus, because these are relatively arbitrary and have no meaningful cutoffs in terms of diversity. The species level, however, is different. First, while there is agreement that the ranks from phylum to genus are arbitrary, there exists discussion on the possible biological meanings of the species rank. For example, one recent study found that there exists a relatively large gap between within-species and between-species genome distances; in other words, that the “genome space” shows a discontinuity corresponding to the species level (
9). If this would prove to be indeed the case, it would mean that a prokaryotic species concept based on this discontinuity would hold real biological meaning. In another recent study, the hypothesis that taxa of the species rank show the property of exclusivity more than taxa of other ranks was explored (
10). A taxon is exclusive if all of its members are more related to each other than to anything outside of the taxon. The authors concluded that their data falsified this hypothesis: many taxa, from many different ranks, indeed show the property of exclusivity, but that this was not more often the case for taxa of the species rank. Thus, there is currently no agreement on whether a particular biological significance should be attached to the species rank. However, a second way in which the species rank stands out is more pragmatic. In contrast to the other taxonomic ranks (e.g., the genus rank), relatively fixed similarity cutoffs to separate species are generally accepted and largely seem to correspond to the historically defined species (
9,
11).
The cutoff that is commonly used to separate bacterial species based on their genome sequences, is 94 to 96% average nucleotide identity (ANI) (
11,
12). Despite this cutoff being accepted as the gold standard for bacterial species delimitation, it is seldom checked for type strain genomes of validly published species and subspecies. Type strain genomes of different species are sometimes more closely related than ∼95% ANI, while type genomes of subspecies of the same species are sometimes more distantly related than ∼95% ANI. For example, it has been shown that the type strains of
Lactobacillus casei and
Lactobacillus zeae are too closely related to consider them separate species (
13). This has led to the rejection of the
L. zeae name. A related problem is that newly sequenced genomes, of type strains or otherwise, are not systematically checked for similarity against type strains. In the NCBI assembly database, uploaders of new assemblies are free to choose which species name they assign to the genomes. This has resulted in many classification inconsistencies. For example, we have previously shown that many genomes annotated as
Lactobacillus casei in the NCBI database are in reality more closely related to the
Lactobacillus paracasei type strain (
14). Finally, it is possible that some sequenced genomes are so distantly related to all currently described species that they should be considered a new species.
In this work, we aimed to infer
de novo species within the LGC by downloading all genomes belonging to this taxon from the NCBI assembly database and clustering them based on pairwise genome similarities. Advantages of this genome-based,
de novo species taxonomy are as follows. (i) Future work on comparing LGC species based on genomes can be performed with a complete and nonredundant data set of exactly one representative genome per species. (ii) Comparative genomics studies of an individual species can proceed with a complete set of all genomes that are within species range of its type strain. Furthermore, we reconciled our
de novo species with the currently established LGC species by identifying the genomes of type strains and by comparing 16S rRNA gene sequences to those of type strains. On the basis of this reconciliation, we propose changes to the official taxonomy in the form of splits or mergers of species. As a cornerstone for genome quality control, pairwise genome comparisons, and phylogeny inference, we made use of single-copy core genes (SCGs) of the LGC. The classical approach to identify SCGs in a set of genomes is to first determine the complete pangenome of the genomes, but this process is very computationally expensive, especially for data sets spanning more than a single species. Therefore, we developed a novel approach that should be easily applicable to any (large) genome data set. Briefly, we first identify candidate SCGs in a small, random subset of genomes and then extract those candidates from all genomes in the data set using HMM profiles. Our pipeline for rapid SCG extraction was implemented in progenomics, a toolkit for prokaryotic comparative genomics, and is available at
https://github.com/SWittouck/progenomics.
DISCUSSION
We developed a novel approach to extract single-copy core genes (SCGs) from large genome data sets and applied it to calculate pairwise CNI as well as fastANI distances between all high-quality genomes within the LGC. We observed that for relatively closely related genomes (CNI of 80% to 100%), the fastANI similarities were lower than the CNI similarities. This could be explained by the fact that the CNI is based exclusively on conserved genes, and these genes can be expected to mutate more slowly. In contrast, fastANI is based on all homologous genome regions between two genomes and thus encompasses also recently gained, potentially fast-evolving genes. For more distantly related genomes (CNI < 80%), fastANIs were higher than CNIs. A possible explanation for this is that for these genome pairs, it becomes more difficult for an ANI calculation tool to detect the homologous regions that show lower sequence identity; the fastANI distance might thus be based only on the homologous regions with higher sequence identity because they are easier to detect. Thus, the CNI might be a better reflection of evolutionary distance because it is based on a fixed set of (prealigned) genes.
We observed a discontinuity in the density of pairwise CNI values, confirming observations made by others (e.g., reference
9). This could be taken as evidence that the species level holds real biological meaning. Alternatively, one could argue that the observed discontinuity is the result of genome sampling bias. Indeed, there have been many comparative genomics studies in recent years where the aim was to compare as many strains as possible from the same species (e.g., references
19 and
20), and this might have biased genome databases toward sets of closely related strains. However, the observation that only the CNI range of 88% to 96% shows full exclusivity of clusters is a stronger confirmation of the biological meaning of the species rank. This should be tested in more taxa to further explore whether the property of exclusivity is connected to the rank of species and to study the bacterial species question in general.
In discussing the biological significance of the species rank, we focused on species properties defined in terms of genome-genome distances. However, a number of bacterial species concepts in terms of the underlying, generative biological processes have also been proposed. In the species concept of Dykhuizen and Green (
21), bacterial species are groups of strains within which recombination is common, while recombination between species is comparatively rare. Another popular species concept is the ecotype concept of Cohan and Kane (
22), which states that bacterial species occupy a given ecological niche. Within a species, an innovative mutation is able to purge diversity through a vertical sweep, but this process cannot occur between different species. It is not yet clear at this point what the potential discontinuity and/or exclusivity of bacterial species would mean for the validity of the various bacterial species concepts that have been put forward.
Using a pragmatic, genome distance-based species concept, we clustered genomes of the LGC into de novo species using single-linkage clustering with a threshold of 94% CNI. By following this strategy, we implicitly defined a species as a set of strains for which it holds that each strain shows at least 94% core genome sequence identity to at least one other strain in the species and not to a single strain outside of the species. Using this species definition, we found that species membership in the LGC was almost transitive: all strains within a species were more than 94% similar to each other (there was only one exception to this). This has a number of interesting consequences. First, it means that, when using this species concept, species are very robust to strain/genome sampling; adding extra strains will not lead to mergers of species that were previously separated. Second, it means that for strain classification, we can calculate the similarity of the new strain to one (random) representative strain per species. When the strain is >94% similar for its core genes to one strain of a species, it will also be >94% similar to all other strains in the species. Similarly, when the strain is <94% similar to one strain of a species, it will also be <94% similar to all other strains in the species. Finally, transitivity means that clustering with single linkage or complete linkage will yield identical results. Therefore, transitivity is a very useful property to have for a given species definition and genome space. However, it should be noted that despite the fact that only one exception to transitivity was found in this work, more exceptions might be found when additional genome sequences become available in the future.
We identified situations where type strain genomes from more than one published species were found in the same
de novo species. We argue that in these situations, the published species names should be merged. In taxonomic terms, such a merger means that the species names are considered heterotypic synonyms, i.e., names associated with different type strains but referring to the same species. Because the 94% CNI cutoff value we used (corresponding to approximately 93% fastANI) was lower than what is commonly suggested (
11,
12), more species mergers than splits can be expected
a priori. This cutoff value was chosen because it corresponded to the discontinuity we observed in the density of pairwise CNIs. A consequence of this is that for some of the mergers suggested by our pipeline, the type strain genomes will still be relatively dissimilar. Whether a merger is still preferable in these “gray area” situations depends on the species concept one wishes to use and more specifically, on the importance one attributes to the observed discontinuity in genome space for taxonomy.
Four genome clusters contained type strain genomes from different species that were very closely related (CNI > 96%).
Lactobacillus micheneri and
Lactobacillus timberlakei belonged to a cluster with a minimum CNI of 96.4%. These species were described in the same recent publication (
23), so it is not straightforward which name should be considered the earlier synonym.
Weissella thailandensis (
24) and
Weissella jogaejeotgali (
25) were present in a cluster with a minimum CNI of 97.4%. The
W. jogaejeotgali paper mentioned that its type strain was closely related to
W. thailandensis, with 99.39% 16S identity. According to our genome-based comparison,
W. jogaejeotgali should be considered a later heterotypic synonym. The type strains of
Lactobacillus fructivorans (
26) and
Lactobacillus homohiochii (
27) were present in a cluster with a minimum CNI of 97.9%. It was already known that these type strains are very closely related (
28). In addition, it has been shown that other strains classified as
L. homohiochii clearly form a separate species from
L. fructivorans (
28). Thus, in this case, the
L. homohiochii type strain seems badly chosen. According to the International Code of Nomenclature of Prokaryotes (
29), this means that
L. homohiochii should be considered a later heterotypic synonym of
L. fructivorans and a new type strain should be proposed for the other strains currently classified as
L. homohiochii. Unfortunately, none of these other strains have been sequenced yet. Finally, a similar situation occurred for
Pediococcus acidilactici (
30) and
Pediococcus lolii (
31), which were in the same cluster with a minimum CNI of 98.1%. It has already been found by Wieme et al. that the
P. lolii type strain DSM 19927 is an
P. acidilactici strain (
32).
P. lolii should be considered a later synonym of
P. acidilactici and if other
P. lolii strains cluster separately, a new type strain and species name should be proposed for them.
Some other cases of potential species mergers were less clear-cut. Before the species
Lactobacillus amylotrophicus was introduced (
33), with type strain DSM 20534, its type strain had been classified to
Lactobacillus amylophilus, with type strain DSM 20533 (
34). We now found that these two type strain genomes were almost identical (CNI of ∼100%) at the level of their core gene sequences. However, the publication that introduced the species
L. amylotrophicus (
33) clearly shows that strain DSM 20534 is relatively distant from strain DSM 20533 based on a comparison of
pheS and
rpoA gene sequences. Thus, we suspect that a mistake was made and that both genomes are actually the same strain, while the other type strain has not yet been sequenced. It is difficult to guess which of the type strains was actually sequenced (twice), but luckily, strain DSM 20533 was also sequenced by another institute. This second DSM 20533 genome was also identical to the other two. Therefore, it is highly likely that all three genomes are DSM 20533, the
L. amylophilus type strain, and that the
L. amylotrophicus type strain has not been sequenced yet.
In a 2012 paper (
35), it was suggested that
Lactobacillus kimchii (
36) and
Lactobacillus bobalius (
37) be considered later heterotypic synonyms of
Lactobacillus paralimentarius (
38). In our results, the type strains of
L. kimchii and
L. bobalius were in the same cluster, with a minimum CNI of 94.4%, while the
L. paralimentarius type strain was present in a different cluster. The two clusters showed a maximum CNI value of only 93.9%. Thus, we can confirm that
L. bobalius can be considered a later heterotypic synonym of
L. kimchii, and we do not object to also considering both of them later heterotypic synonyms of
L. paralimentarius since their clusters were extremely closely related.
Leuconostoc gelidum (
39),
Leuconostoc inhae (
40), and
Leuconostoc gasicomitatum (
41) were present in the same cluster, with a minimum CNI of 94.6%. The type strains of
Leuc. gasicomitatum and
Leuc. inhae even shared 99.5% CNI. Thus,
Leuc. inhae should definitely be considered a later heterotypic synonym of
Leuc. gasicomitatum. To our knowledge, this has not been suggested before. These two type strains shared 94.8% and 94.6% CNI with the
Leuc. gelidum type strain, respectively, just on the border of being the same species.
In 2018, three
Lactobacillus gasseri (
42) strains were reclassified as
Lactobacillus paragasseri, sp. nov. (
43). We found both species in the same cluster, with a minimum CNI of 94.6%. Thus, this is a borderline case, and we do not object to
L. paragasseri being considered a separate species. In a similar case, a subspecies of
Leuconostoc mesenteroides (
44) was recently introduced as the new species
Leuconostoc suionicum (
45). We found the type genomes of these species in the same cluster with a minimum CNI of 95.2%. This is again a borderline case.
Lactobacillus casei (
46) and
Lactobacillus zeae (
47) were in the same cluster, with a minimum CNI of 95.4%. There has been much confusion surrounding the names
L. casei,
L. paracasei, and
L. zeae. This is mainly the case because in 1971, strain ATCC 393 was chosen as the type strain for
L. casei based on carbohydrate fermentation profiles (
48). In 1980, this type strain was included in the Approved Lists of Bacterial Names (
49). However, the strain later appeared to be relatively distant to the other historical
L. casei strains. Subsequent proposals to change this illogical type strain (e.g., reference
50) were rejected because they violated the International Code of Nomenclature of Bacteria (
13). Thus, strain ATCC 393 remained the
L. casei type strain, which had some confusing consequences. First, the name
L. zeae was rejected because its type strain was too closely related to strain ATCC 393. In addition, the name
L. paracasei, whose type strain was very closely related to the historical
L. casei strains, was kept alive. The consequence was that many of the strains that were historically classified to
L. casei should now be reclassified to
L. paracasei, and that all strains historically classified to
L. zeae should now be reclassified to
L. casei (
14). In this work, we confirmed that the
L. casei type strain and the (rejected)
L. zeae type strain are closely related, but not so closely that it would be absurd to consider them different species.
The species
Leuconostoc garlicum has one high-quality genome in the NCBI database, but it was reclassified to
Leuconostoc lactis (
51) by our pipeline. The name
Leuc. garlicum occurs in a few publications but has never been published, validly or otherwise. Therefore, we suggest that this species name should not be validated and that these strains should classified to the species
Leuc. lactis. Finally, we found that
Lactobacillus aviarius subsp.
aviarius and
Lactobacillus aviarius subsp.
araffinosus (
52) had a CNI of 91.1%, and therefore, we suggest that the latter be renamed
Lactobacillus araffinosus comb. nov. To our knowledge, it has never been suggested before that these should be separate species.
We labeled 13 genome clusters that contained no type genomes with a species name because their NCBI species label(s) matched their 16S rRNA hits. However, because the type strains of these species were not sequenced or their sequenced genomes were not of sufficient quality, we could not identify these de novo species with 100% certainty. Therefore, we propose that priority be given to (re)sequencing of these type strains. Further, we found that five species without NCBI species labels showed matches to type strains on the 16S rRNA level; these type strains should also be sequenced to be able to identify the de novo species with certainty. Finally, we discovered at least eight new species for which type strains should be chosen and deposited and species names should be picked.
To make it easy for people to classify their own (potential) LGC genomes against our
de novo species taxonomy, we wrote a small classification tool. It is available at
https://github.com/SWittouck/proclasp and includes a tutorial with an LGC reference database.
Conclusions.
We constructed a genome-based species taxonomy of the Lactobacillus genus complex by performing single-linkage clustering on 2,459 high-quality genomes with a 94% core nucleotide identity cutoff. We found that the resulting species were discontinuous, fully exclusive, and almost fully transitive. On the basis of a comparison of the de novo species with published species, we proposed nine mergers of two or more species and one split of a species. Further, we discovered at least eight yet-to-be-named species that have not been published before, validly or otherwise. Finally, we have shown that our newly developed method to extract single-copy core genes allows for taxonomic studies on thousands of genomes on a desktop computer and is therefore applicable to other genera or even larger taxa. We believe that correcting the published species in the direction of our de novo taxonomy will lead to more meaningful species that show a consistent diversity.