ABSTRACT

High-throughput genome sequencing technologies enable the investigation of complex genetic interactions, including the horizontal gene transfer of plasmids and bacteriophages. However, identifying these elements from assembled reads remains challenging due to genome sequence plasticity and the difficulty in assembling complete sequences. In this study, we developed a classifier, using random forest, to identify whether sequences originated from bacterial chromosomes, plasmids, or bacteriophages. The classifier was trained on a diverse collection of 23,211 chromosomal, plasmid, and bacteriophage sequences from hundreds of bacterial species. In order to adapt the classifier to incomplete sequences, each complete sequence was subsampled into 5,000 nucleotide fragments and further subdivided into k-mers. This three-class classifier succeeded in identifying chromosomes, plasmids, and bacteriophages using k-mer distributions of complete and partial genome sequences, including simulated metagenomic scaffolds with minimum performance of 0.939 area under the receiver operating characteristic curve (AUC). This classifier, implemented as SourceFinder, has been made available as an online web service to help the community with predicting the chromosomal, plasmid, and bacteriophage sources of assembled bacterial sequence data (https://cge.food.dtu.dk/services/SourceFinder/).
IMPORTANCE Extra-chromosomal genes encoding antimicrobial resistance, metal resistance, and virulence provide selective advantages for bacterial survival under stress conditions and pose serious threats to human and animal health. These accessory genes can impact the composition of microbiomes by providing selective advantages to their hosts. Accurately identifying extra-chromosomal elements in genome sequence data are critical for understanding gene dissemination trajectories and taking preventative measures. Therefore, in this study, we developed a random forest classifier for identifying the source of bacterial chromosomal, plasmid, and bacteriophage sequences.

INTRODUCTION

Horizontal gene transfer (HGT) is one of the main drivers of bacterial evolution (1). HGT generally occurs between closely related species through conjugation, transformation, and transduction (1). Conjugation typically refers to the transmission of plasmids through cell-to-cell contact mediated by pili. Transformation is the acquisition of naked DNA through the cell membrane, and transduction is the transfer of DNA fragments via bacteriophages (13). HGT plays a major role in bacterial adaptation, diversity, and the dissemination of extrachromosomal genes encoding antimicrobial resistance (AMR), virulence, and heavy metal resistance (3, 4). High-throughput sequencing technologies have enabled the large-scale sequencing of bacterial isolates and communities, making it possible to study complex transmission and dissemination patterns (5).
Metagenomic techniques offer powerful tools for surveying organisms in an environmental sample. These technologies have proven useful for understanding many aspects of bacterial communities from gut microbiome compositions to the global surveillance of AMR (6, 7). Metagenomic sequences can contain multiple species and sequence types. However, metagenomic assemblies from short reads (~100 to 300 bp) are usually fragmented due to presence of closely related strains, repeated sequences, and shared sequences between bacterial species (811). This fragmentation makes the identification of plasmid and bacteriophage DNA from assemblies challenging.
To date, a number of tools have been published for identifying plasmids and/or bacteriophages from whole-genome sequences (WGS) and metagenomic reads using alignment- or machine-learning-based approaches. Alignment-based approaches search for sequence matches against a reference database; the accuracy of these approaches is highly dependent on the availability and curation of matching sequences in the database (12). Alignment-based tools such as PlasmidFinder and VirSorter provide high prediction accuracy for WGS. However, these tools can perform poorly with contigs assembled from metagenomes because they are not designed for identifying incomplete sequences (4, 1215). On the other hand, machine-learning-based approaches can be designed for identifying extrachromosomal elements using short sequence-based matching, or by identifying sequence characteristics such as GC content, codon usage, and gene density (12, 14, 16, 17). Machine-learning models developed for both WGS and metagenomic reads use various algorithms such as support vector machines (18), logistic regressions (12), random forests (19), and artificial neural networks (14, 15, 2023). The majority of the approaches train the models with the encoded genome sequences (15, 2023) or the frequencies of short k-mers (12, 14). Mlplasmids, PlasFlow, VirFinder, MARVEL, and PPR-Meta are examples of tools for identifying plasmids and/or bacteriophages using machine-learning techniques (12, 14, 15, 18, 19).
PPR-Meta is a three-class classifier published in 2019 that can differentiate between chromosome, plasmid, and bacteriophage derived metagenomic sequences using a bi-path convolutional neural network (15). This tool consists of three classifiers, each trained with artificial metagenomic contigs that fall into a certain length range. PPR-Meta works with the shifting windows principle where it predicts the source of a query sequence by chopping it into nonoverlapping smaller pieces, which are predicted by the classifiers corresponding to their size range. PPR-Meta provides a final class probability by combining the predictions from these pieces.
In this study, we classified the source of bacterial sequences derived from chromosomes, plasmids, and bacteriophages using a random forest classifier. Our classifier detects genomic signatures from k-mer frequencies and can identify the source of complete or partial sequences, including simulated metagenomic scaffolds, with high accuracy. Compared with PPR-Meta, our models included in the SourceFinder tool provides more accurate predictions for the identification of the chromosome class and some cases of the plasmid class using tree-based classifiers, thanks to the larger and possibly more diverse training data set. Additionally, SourceFinder offers a simple online source identification through a user-friendly service for people without command line experience.

RESULTS

Bacterial chromosomal, plasmid, and bacteriophage sequence data.

In this work, we built classifiers to predict the chromosomal, plasmid, or bacteriophage sources for complete and partial sequences that are commonly found in assembled short read genomic and metagenomic data sets. To train the models, we started by building a training data set from sequences that were downloaded from the National Center for Biotechnology Information (NCBI) Reference Sequence (RefSeq) and PATRIC databases (24, 25). The sequences originated from a taxonomically diverse collection of bacterial hosts: 1,213 genera and 67 families for chromosomes, 502 genera and 220 families for plasmids, and 1,071 genera and 110 families for the bacteriophages (Fig. S1).
Because some taxonomic groups were overrepresented, e.g., sequences from the Enterobacteriaceae amount to nearly 20% of the available data, we chose to down sample the collection using a k-mer-based clustering approach. The number of plasmid and bacteriophage clusters converged around the 87% k-mer similarity threshold where the three sequence classes reached approximately the equal number of clusters. Therefore, we chose to use this convergence point at the 87% k-mer similarity threshold which reduced the total number of sequences by 49%, 29%, and 31% for chromosomes, plasmids, and bacteriophages, respectively (Fig. S2). Once the similar sequences were clustered, one sequence was kept per cluster (Table S1), and the remaining sequences were excluded from the full data set (Table S2). This resulted in 9,163 chromosome, 7,750 plasmid, and 8,877 bacteriophage sequences. This homology reduction preserved nearly all of the taxonomic diversity in the data set (Fig. S1). Before clustering, the average lengths of the chromosomes were approximately 5,000,000 bases (5 mb) while the plasmids and bacteriophages were 50,000 bases (50 kb); these average lengths remained the same after the clustering (Fig. S3).

K-mer length effect on whole-sequence-based predictions.

Random forest classifiers were trained using a single representative sequence from each cluster. The features were oligonucleotide k-mers that were generated by stepping by one nucleotide at a time over each chromosome, plasmid, and bacteriophage sequences (Fig. 1). The k-mer frequencies were calculated and collected into a matrix where the rows were the sequences and the columns were the normalized k-mer counts. K-mers of different lengths may provide different information to the models. Short k-mers are informative for the sequence structure, codon usage, and GC content, while longer k-mers may be valuable for unique patterns such as the presence or absence of genes or mutations. Therefore, the k-mer length was optimized to improve the model performance. Sequences were classified using k-mers from five to eight nucleotides. Overall, we found that the models built from whole sequences with different k-mer sizes did perform similarly. The highest performance was obtained with the 6-mers where the hold-out performance reached 0.992 ± 0.001 area under the curve (AUC) (0.928 ± 0.002 Matthews correlation coefficient [MCC]). Per sequence class, the model accuracy reached 0.979, 0.944, and 0.930 for chromosomes, plasmids, and bacteriophages, respectively (Table 1; Table S5).
FIG 1
FIG 1 A summary of k-mer matrix preparation from whole and partial sequences. Chromosomes, plasmids, and bacteriophages (a) were used in the complete or fragmented form (b). All the sequences were subsequenced into k-mers (c) to generate input matrixes. Each matrix was normalized (d) and merged (e) into one matrix which contains chromosomes, plasmids, and bacteriophages. The merged matrix was shuffled (f) and split into training, testing, and hold-out data sets (g).
TABLE 1
TABLE 1 An overview of the average model performances and standard deviations of the three-class classifiersa
  Test datasetHold-out dataset
Modelk-mer sizeAUC for test datasetMCC for test datasetAUC for hold-out datasetMCC for hold-out dataset
Trained and tested with the whole sequences60.992 ± 0.0010.927 ± 0.0030.992 ± 0.0010.928 ± 0.002
Trained with the whole sequences, tested with 5-kb fragments (×1b)50.878 ± 0.0010.416 ± 0.0130.876 ± 0.0010.419 ± 0.006
Trained and tested with the 5-kb fragments (×1b)50.914 ± 0.0020.662 ± 0.0070.911 ± 0.0000.655 ± 0.005
Trained and tested with the 5-kb fragments (×5c)50.933 ± 0.0010.700 ± 0.0050.933 ± 0.0000.705 ± 0.002
Trained and tested with the 5-kb fragments (×10d)50.940 ± 0.0020.711 ± 0.0050.939 ± 0.0000.713 ± 0.002
a
Each model differs in the training and test data and k-mer size as shown. Model performances were calculated using a 5-fold cross-validation, and the highest performances were reported for both test and hold-out data sets in AUC and MCC.
b
×1, each sequence was sampled only one time.
c
×5, each sequence was sampled randomly five times.
d
×10, each sequence was sampled randomly 10 times.

K-mer length and sampling effect on fragment-based predictions.

To evaluate the accuracy and generalizability of the models for classifying incomplete sequences (a likely scenario when analyzing metagenomic assemblies), the models that were trained with the whole sequences were tested against randomly selected fragments of chromosomes, plasmids, and bacteriophages that were 5 kb in length. K-mers from five to eight were tested for the fragment classification. The highest average prediction performances reached 0.876 ± 0.001 AUC (0.419 ± 0.006 MCC) with 5-mers. Per sequence class, the model accuracy dropped to 0.018, 0.839, and 0.893 for chromosomes, plasmids, and bacteriophages, respectively. Even though the overall model performance was high with the fragments, the robustness of the model decreased compared with the evaluation made with the whole sequences. In particular, the model failed to distinguish the chromosomes from plasmids and bacteriophages (Fig. 2; Table 1; Table S5). This could suggest that the original contig lengths of the chromosomes were influencing the prediction, and would result in a negative outcome for predicting against smaller sequence fragments.
FIG 2
FIG 2 Prediction performances of three-class unweighted classifiers. (a) The model was trained and tested using whole sequences and 5-mers as features, and (b) the model was trained with whole sequences and tested using 5-kb fragments and 5-mers as features. On the left side of each panel, the receiver operating characteristic (ROC) curves show the AUC performance for each class. On the right side of each panel, the class prediction performances are shown using a confusion matrix. CH, chromosome; PL, plasmid; PH, bacteriophage.
To improve the model generalizability on sequence fragments, the random forest models were regenerated by training and testing using randomly sampled 5-kb fragments from full-length chromosome, plasmid, and bacteriophage sequences with various k-mer sizes. When the model was trained and tested using a single random 5-kb fragment sampled from each sequence, the highest average model performance reached 0.911 ± 0.000 AUC (0.655 ± 0.005 MCC) with 5-mers (Table 1; Table S5). As the number of randomly sampled fragments increased from one to 10, the prediction performances increased to 0.939 ± 0.000 AUC (0.713 ± 0.002 MCC) with 5-mers. Per sequence class, the model accuracy reached 0.895, 0.652, and 0.844 for chromosomes, plasmids, and bacteriophages, respectively (Fig. 3; Table 1; Table S5). Although training with sequence fragments came with a notable improvement in the accuracy for the chromosome class, it also corresponded with a loss in accuracy for the plasmid class compared with the models trained from whole sequences.
FIG 3
FIG 3 Prediction performances of three-class classifiers that are trained and tested with 5-kb fragments that are subsequenced into 5-mers. Fragments were subsampled only (a) one time, (b) five times, and (c) 10 times. On the left side of each panel, the ROC curves show the AUC performance for each class. On the right side of each panel, the class prediction performances are shown using a confusion matrix. CH, chromosome; PL, plasmid; PH, bacteriophage.
In this analysis, there were two chromosomes, 911 plasmids, and 51 bacteriophage sequences that were shorter than 5 kb in the test and hold-out data sets. The three-class classifier trained with 5-mers identified the source of the <5-kb fragments with 1.0, 0.731, and 0.810 accuracy for chromosomes, plasmids, and bacteriophages, respectively. These accuracies for these shorter fragments match the overall accuracies obtained from the hold-out set (Fig. 3; Table 1; Table S5).

Prediction performances on different taxonomic levels for fragments.

The taxon representation in the training data set was unbalanced. While we observed high overall performances, there is a risk that the model is only predictive for the well represented taxa. In order to assess the generalizability of the best fragment-based model (defined as the 5-mer model built from 10 randomly subsampled, 5-kb fragments per element) for the underrepresented taxa, performances were inspected for each host taxon at the family level (Fig. S4; Table S6). In addition, Spearman’s correlation coefficient was calculated between the hold-out performances of every host taxon and the corresponding sequence counts in the training data. A correlation was detected in case of plasmids (Spearman’s correlation coefficient = 0.403, P value < 0.05) and bacteriophages (Spearman’s correlation coefficient = 0.480, P value < 0.05). However, correlation was not significant for chromosomes (Spearman’s correlation coefficient = −0.02, P value > 0.05). These results suggest that the host representations in the training data have an impact on the accuracy of the plasmid and bacteriophage identification. Additionally, this correlation could be the consequence of the overlap between the sequence structures of plasmids and bacteriophages, e.g., GC content; therefore, more examples of plasmids and bacteriophages are required for the correct sequence classification (Fig. S5).

False-positive bacteriophage predictions.

The presence of prophages, which are chromosomally integrated bacteriophage sequences, may cause chromosomes to be misclassified. Prophages might be a major challenge in the fragment-based model because the model relies on a small fraction of the sequences for the classification. Of the hold-out data set, 410 chromosome and 62 plasmid sequences were predicted to contain prophages using PhiSpy (26). When the whole-sequence-based model was tested with the prophage containing hold-out sequences, 1% of the chromosomes and 4.5% of the plasmids were predicted to be bacteriophages (Table S7). When the whole-sequence-based model was tested with the hold-out sequences lacking prophages, 2% of the chromosomes and 3% of the plasmids were predicted to be bacteriophages (Table S7). Overall, for the whole-sequence-based model, the presence of bacteriophage elements appears to have a negligible effect on the overall accuracy.
When the same experiment was repeated with using the model that was generated from 5-kb fragments (10 random subsequences, 5-mer features), 50 chromosome fragments and 90 plasmid fragments were predicted to contain at least 500 bp of prophage DNA. When the fragment-based model was tested with the prophage containing fragments, 20% of the chromosomal fragments and 6% of the plasmid fragments were predicted to be bacteriophages (Table S7). When the fragment-based model was tested on fragments lacking prophage elements, 5% of the chromosomal fragments and 8.1% of the plasmid fragments were classified as bacteriophages (Table S7). The presence of prophages was found to have a significant effect on the misclassification of the 5-kb chromosome fragments (P value < 0.05, Chi-square contingency test), but not for the 5-kb plasmid fragments (P value = 0.108, Chi-square contingency test). Overall, these results indicate that in the models built from short 5-kb sequences, some of the misclassification of chromosomal sequences is due to the presence of prophage elements.

Validation with the simulated metagenomic assemblies.

To test the fragment model’s (10 random subsequences, 5-mer features) generalizability to metagenomic scaffolds, simulated metagenomic reads were generated using BBMap from the complete chromosome, plasmid, and bacteriophage sequences. The simulated reads were assembled into scaffolds where the majority of the sequences were fragmented into the multiple contigs (N50 = 40,936 bp). A total of 27,634 scaffolds (corresponding to 29% of all scaffolds) were at least 1 kb. These were used for the validation of the fragment-based model. After removing chimeric sequences, the remaining 26,484 scaffolds (chromosome: 25,888, plasmid: 397, bacteriophage: 199) were tested by the fragment model which reached the accuracy of 0.914, 0.698, and 0.794 for chromosomes, plasmids, and bacteriophages, respectively (Fig. S6). These performances are comparable with the performances for the partial sequences. Overall, these results suggest that the fragment model can classify the metagenomic scaffolds minimum 1-kb length with at least 70% accuracy which corresponds to the plasmid identification performance.

Comparison to PPR-Meta.

We tested our three-class classifiers (whole-sequence-based and fragment-based) against the PPR-Meta tool. First, the PPR-Meta tool was tested with the whole sequences from the hold-out data set without fragmenting the sequences. The tool predicted the chromosome, plasmid, and bacteriophage sequences with 0.868, 0.819, and 0.986 accuracy, respectively (Table S8; Fig. S7). Compared with our whole model (5-mer features), PPR-Meta performed approximately 8.3% and 11.9% poorer with the chromosome and plasmid sequences, and 6.3% better with the bacteriophage sequences. Additionally, the PPR-Meta tool was tested using the hold-out data set that was fragmented into 5-kb fragments and sampled 10 times. The tool predicted chromosome, plasmid, and bacteriophage sequences with 0.780, 0.749, and 0.977 accuracy, respectively (Table S8; Fig. S7). Compared with our fragment model (10 random subsequences, 5-mer features), PPR-Meta performed approximately 12.8% poorer with the chromosome sequences, but 14.9% and 15.8% better with the plasmid and bacteriophage sequences, respectively. Finally, the source of the simulated metagenomic scaffolds was detected using the PPR-Meta tool. The chromosome, plasmid, and bacteriophage derived contigs were identified with 0.820, 0.877, and 0.935 accuracy, respectively (Table S8; Fig. S7). Likewise, PPR-Meta performed approximately 10.3% poorer with the chromosome sequences, but 25.6% and 17.8% better with the plasmid and bacteriophage sequences, respectively. These results suggest that PPR-Meta is highly precise for the bacteriophage identification, yet our whole and fragment models are more accurate at identifying the chromosome sourced sequences. In terms of total runtime, PPR-Meta’s runtime increased with the contig length (Table S8) in contrast to our tool, as more models were used for large contigs. Overall, our tool could be run efficiently and as accurately as PPR-Meta regardless of contig length.

SourceFinder.

The random forest classifier that was trained with 5-mers and 5-kb fragments that were subsampled 10 rounds from each element is implemented as a command line tool well as a web service called SourceFinder. The command line version of SourceFinder is available through Bitbucket (https://bitbucket.org/vlagri/sourcefinder/src/master/). The command line tool requires Anaconda packages (version: 4.4.0) and KMC (version: 3.0.0) and can run on a Linux machine with 1 CPU, 10 GB RAM memory (for 1,000 input elements), and 22 GB storage (for downloading trained random forest models). It makes predictions for each contig based on the average class probabilities of five different random forest models. The web service is available through the Center for Genomic Epidemiology (https://cge.food.dtu.dk/services/SourceFinder/). It has a user option for selecting the number of times a contig is sampled. The web service runs on backend hardware and thus, besides the web browser, has no other computing requirements. Runtimes are generally 1 minute per genome. An example job submission and output are shown for the online SourceFinder tool in Fig. 4.
FIG 4
FIG 4 SourceFinder as a web service. (a) The interface of SourceFinder was demonstrated. Job submission is possible after selecting subsampling rounds from one to 10 and uploading a query file in the FASTA format. (b) When the job is completed, SourceFinder provides a final class prediction per contig in the query file. Probabilities per class are available in the downloadable TSV file.

DISCUSSION

In this study, we developed random forest classifiers which could identify the source of whole sequences as chromosomes, plasmids, or bacteriophages based on their k-mer distributions. When the random forest models were trained and tested with whole sequences, the overall model performance with 5-mers reached 0.990 AUC (Table 1) for all three sequence classes. This model, however, lost robustness when it was tested with shorter sequences characteristic of assembled short read genomic and metagenomic data sets. Therefore, the random forest classifier was retrained with the 5-kb fragments. Despite fragment lengths corresponding to 0.1% of chromosomes and 10% of plasmids and bacteriophages, the fragment model (10 random subsequences, 5-mer features) performed accurately and overall reached 0.939 AUC (Table 1). Thus, this fragment model (10 random subsequences, 5-mer features), implemented as SourceFinder, is generalizable over a wide range of sequence hosts and to fragments that are shorter than 5 kb. SourceFinder is available as a command line program and as a web server hosted at the Center for Genomic Epidemiology (https://cge.food.dtu.dk/services/SourceFinder/).
We observed that the k-mer size had an effect on the fragment model performances where the short k-mers were better for distinguishing between the chromosomes, plasmids, and bacteriophages. This might be explained by the information type that the short and long k-mers provide to the models. The short k-mers inform the models regarding the overall sequence structure, codon usage, and GC content, whereas the long k-mers provide information to the classifiers regarding the unique patterns such as absence or presence of genes or mutations. A study by Bohlin et al. showed, based on the relative entropy calculations, that chromosome structures significantly differ from plasmid and bacteriophage structures; however, genomic islands that contain plasmid and bacteriophage genes resemble plasmid and bacteriophage structures (27). This relative similarity between the chromosomes and genomic islands might be due to the amelioration and adaptation of the external DNA toward the host (2729). Nevertheless, extra-chromosomal genes recently integrated in sequences through HGT are expected to resemble the donor more so than the host (29). Taken together, our models can identify sequence classes even from the partial sequences due to the significant structural differences between sequence classes that were captured by k-mer distributions.
The fragment model has the lowest success rate for the plasmid identification where 27% of the plasmid fragments were predicted as chromosomes, yet only 6% of the chromosome fragments were predicted as plasmids. This notable misclassification might be the result of plasmid patterns being disrupted by the fragmentation. Another reason could be recent integration of plasmid sequences into the chromosomes: if the chromosomal fragments in the training data set mainly composed of plasmids, it could cause biased learning and consequentially misclassifications. Finally, it is possible that the data set contains noise due to mislabeled chromosome sequences in the NCBI RefSeq data set.
Our whole-genome and fragment models outperformed the PPR-Meta tool with respect to predicting the chromosomal class. However, neither our whole nor fragment models outperformed PPR-Meta when predicting the bacteriophage class. While our whole-genome model did outperform PPR-Meta tool with respect to plasmid identification, the same could not be said about our fragment model. The preciseness of the PPR-Meta for the plasmid and bacteriophage identification could be due to the multiple machine-learning models designed for the various sequence sizes. Thus, designing separate models for the short sequences may also improve SourceFinder’s sensitivity; however, the multiple models could also slow down the turnaround time of the online tool. In addition, hyperparameter tuning using fragments rather than whole sequences could improve the accuracy of SourceFinder.
In conclusion, the developed three-class classifier identifies chromosome, plasmid, and bacteriophage derived sequences with overall 0.939 AUC using a simple machine-learning algorithm regardless of the completeness and host taxonomy of the sequences. Our fragment model is useful in situations where highly fragmented de novo assemblies need to be characterized. The SourceFinder tool which is composed of our fragment model outperforms the state-of-the-art tool in some classes and provides compatible performances for the rest. SourceFinder can also be run on our servers with an intuitive web browser interface.

MATERIALS AND METHODS

Data sets.

A set of 18,022 complete representative bacterial chromosomal sequences were downloaded from the National Center for Biotechnology Information (NCBI) Reference Sequence (RefSeq) database; 10,852 plasmid and 12,930 bacteriophage sequences were downloaded from the Pathosystems Resource Integration Center (PATRIC) databases in August 2020 (30, 31). The taxonomic information was extracted from the sequence headers. The extracted genus level information was translated to the family level using the NCBI Taxonomy information by the Python ete2 package (version:2.3.10) (32, 33). To remove the redundant sequences while retaining the diversity of the data, all sequences were clustered based on the k-mer similarities using the KMA tool (version: 1.1.0) for a given similarity threshold (34). KMA applies the Hobohm-1 algorithm to cluster the similar sequences, and it was performed per type of data with the following parameters: -k16, -NI, -Sparse and for the following query and template k-mer similarity thresholds: 81%, 84%, 87%, 90%, 93%, 95%, 96%, and 99%. For the machine-learning training, testing, and validation, one representative sequence was kept from each cluster for a given k-mer similarity threshold, and the other sequences were excluded.

K-mer frequencies.

In order to predict the three sequence classes (chromosome, plasmid, and bacteriophage), we generated multiclass machine-learning models. These models were trained on matrices containing k-mer frequencies from whole or fragmented sequences (Fig. 1). All of the input sequences were divided into nucleotide k-mers and the frequencies of the k-mers were counted using the KMC tool (version: 3.0.0) (35). KMC considers each k-mer and its complement, counting only the k-mer that is highest in a lexicographical sorting. KMC was performed for k-mer sizes 5, 6, 7, and 8 with the following parameters: “-fm” and “-ci1.” These optional parameters describe the type of input as a multi-fasta file and the minimum k-mer frequency of 1 that should be reported, respectively.

Creation of the input matrix and normalization.

For each k-mer length, frequencies were collected into a matrix where the rows represented the isolate inputs and the columns represented unique k-mers (Fig. 1). Each matrix contained all three classes of sequences: chromosome, plasmid, and bacteriophage. To eliminate the bias introduced by differences in sequence lengths, the k-mer frequencies were normalized by dividing the counts with the total number of used k-mers of the given sequence. Each normalized matrix was utilized for training, testing, and validation of the machine-learning model.

Model generation and cross validation.

Random forest was chosen for the machine learning due to its ease of interpretability, ability to handle multi-class problems, and robustness against overfitting (36). Models were generated using ensemble.RandomForestClassifier from the Python Scikit-learn library (version: 0.23.2) (37). To assess the model performances and overfitting, 10% of the input data were held out for validation, and the remaining was used for training and testing the models using a 5-fold cross-validation, which was applied through model_selection.GroupShuffleSplit from the Python Scikit-learn library (version: 0.23.2) (37).
Selected random forest parameters were tuned using the random grid search method through model_selection.GridSearchCV from the Scikit-learn library (Table S3) on the training and testing data (version: 0.23.2) (37). The grid search was completed only for the normalized whole sequences and all k-lengths, repeated 100 times. The optimal random forest parameters were: tree depth of 400, 800 trees, and data bootstrap as false. The remaining parameters were kept as default. Unless otherwise stated, these parameters optimal for the whole sequences were used for all random forest models described in the study. Each tuned random forest model was trained and tested five times and the ensemble of the trained models was applied to the hold-out data set for the final validation. Each classifier provides class probabilities that sum to 1.0. For each tested sequence, the category with the highest class probability was accepted as the final prediction.

Assessment of machine-learning models.

Machine-learning model performances were assessed using multiple performance metrics: the receiver operating characteristic (ROC) area under the curve (AUC), Matthews correlation coefficient (MCC), macro F-score, macro precision, macro recall, accuracy, and confusion matrix using Scikit-learn (version: 0.23.2) (37). AUC was calculated using the one versus rest option to handle the multi-class outputs and the performances were macro-averaged. Statistical tests, including Spearman’s correlation coefficient and the chi-square contingency test, were calculated using stats.spearmanr stats.chi2_contingency functions from SciPy (version: 1.2.2) (38). Additionally, the significance threshold for P value was set to 0.05.

Fragmentation of sequences.

In order to avoid potential biases due to sequence length, models were built from subsampled sequence fragments. Fragmentation was performed by randomly sampling complete sequences using the subsequence length of 5,000 bases (5 kb). When sequences were shorter than 5 kb, the entire sequence was selected. Sequences larger than 5 kb were assigned a random starting point between the first position and 5 kb before the last position. We also tested sampling multiple sequence fragments from a single sequence. In theory, multiple fragmentation could improve the model performances by increasing the coverage of the sequences in the training data. To assess the impact of multiple fragmentation on the model performances, sampling rounds of one, five, or 10 were tested. In other words, if the sampling round was five, five different 5-kb fragments were sampled from each given sequence. When a sequence was shorter than 5 kb, the entire sequence was used multiple times. Additionally, to test the robustness of the method, the fragmentation process was repeated five times to confirm the model performances with different sets of random fragments.
The trained models were named after the type of input that was used in model training. In the “whole”-sequence models, the model was trained and tested with the matrix composed of k-mer frequencies of whole sequences where each sequence was considered a row in the matrix. In the sequence “fragment” models, the model was trained and tested with the matrix composed of k-mer frequencies of fragments where each fragment was considered a row in the matrix. Furthermore, if the fragment sampling round was greater than one (five or 10), fragments originating from the same sequence were kept in the same training, testing, or hold-out data set to prevent possible overfitting.

Validation with the simulated metagenomic assemblies.

In order to assess the generalizability of the fragment model against simulated metagenomic reads, 100 sequences per class were randomly selected from the hold-out data set (Table S4). In total, metagenomic reads were simulated from 300 full-length sequences with random coverage (from 10× to 50×) to simulate coverage similar to metagenomic sequencing using the randomreads.sh tool in the BBMap tool suite (version: 38.90). The following parameters were used: the read length was set to 150 bp, minimum insertion size to 300 bp, maximum insertion size to 600 bp, and phred quality score to 35; also, the tool was used with the metagenome and paired end options (39). The simulated metagenomic reads were concatenated into a single file and assembled into metagenomic contigs using the spades.py tool from the SPAdes software (version: 3.15.2) with the –meta, −12, and –only-assembler options (40). The metagenomic scaffolds shorter than 1 kb were discarded. The original source of scaffolds was determined by realigning the assemblies to the 300 reference sequences using the nucmer function of the Mummer tool (version: 4.0.0beta2) with the default parameters (41). The metagenomic scaffolds that aligned to multiple sequence classes were discarded due to likely chimerism.

Detection of prophages.

In order to understand bacteriophage predictions occurring within chromosomal and plasmid sequences, prophages were identified from the chromosome and plasmid sequences using the PhiSpy tool (version: 4.2.17) (26). As required by the PhiSpy tool prior to prophage detection, sequences were annotated with genomic features in the GenBank format (.gbk) using Prokka (version: 1.14.5) (42) with the default parameters. Afterwards, the PhiSpy tool was executed with the annotated files by setting the minimum contig size parameter to 1 kb.

Comparison to PPR-Meta.

PPR-Meta uses a bi-path convolutional neural network and is the state-of-the-art three-class classifier for identifying chromosome, plasmid, and bacteriophage derived sequences at once. The PPR-Meta tool (version: 1.1) was tested with the hold-out data set, including 908 chromosome, 758 plasmid, and 913 bacteriophage sequences using the default parameters. The comparison was performed for the whole sequences, fragmented sequences of 10 rounds of sampling, and simulated metagenomic scaffolds. The tool performances were reported using the accuracy metric.

Data availability.

The Python 3.6.10 scripts used for this project are available on Bitbucket (https://bitbucket.org/vlagri/sourcefinder/src/master/) All genomes are available through the NCBI RefSeq database (https://ftp.ncbi.nlm.nih.gov/refseq/release/) and the PATRIC resource (https://www.patricbrc.org) using the genome identifiers in Tables S1. The fragment-based model is available as a web service at https://cge.food.dtu.dk/services/SourceFinder/.

ACKNOWLEDGMENTS

This work was funded by the Novo Nordisk Foundation (Grant NNF16OC0021856: Global Surveillance of Antimicrobial Resistance awarded to F.M.A. and O.L.). M.N. and J.D. were funded in part by the United States National Institute of Allergy and Infectious Diseases Bacterial and Viral Bioinformatics resource center award (Contract No. 75N93019C00076) to PI Rick Stevens. The funders had no role in study design, data collection and interpretation, or the decision to submit the work for publication.
We declare no conflicts of interest.

Supplemental Material

File (spectrum.02641-22-s0001.pdf)
File (spectrum.02641-22-s0002.xlsx)
ASM does not own the copyrights to Supplemental Material that may be linked to, or accessed through, an article. The authors have granted ASM a non-exclusive, world-wide license to publish the Supplemental Material files. Please contact the corresponding author directly for reuse.

REFERENCES

1.
Soucy SM, Huang J, Gogarten JP. 2015. Horizontal gene transfer: building the web of life. Nat Rev Genet 16:472–482.
2.
Bello-López JM, Cabrero-Martínez OA, Ibáñez-Cervantes G, Hernández-Cortez C, Pelcastre-Rodríguez LI, Gonzalez-Avila LU, Castro-Escarpulli G. 2019. Horizontal gene transfer and its association with antibiotic resistance in the genus Aeromonas spp. Microorganisms 7:363.
3.
Carattoli A. 2013. Plasmids and the spread of resistance. Int J Med Microbiol 303:298–304.
4.
Carattoli A, Zankari E, García-Fernández A, Voldby Larsen M, Lund O, Villa L, Møller Aarestrup F, Hasman H. 2014. In silico detection and typing of plasmids using PlasmidFinder and plasmid multilocus sequence typing. Antimicrob Agents Chemother 58:3895–3903.
5.
Orlek A, Stoesser N, Anjum MF, Doumith M, Ellington MJ, Peto T, Crook D, Woodford N, Walker AS, Phan H, Sheppard AE. 2017. Plasmid classification in an era of whole-genome sequencing: application in studies of antibiotic resistance epidemiology. Front Microbiol 8:182.
6.
Hendriksen RS, Munk P, Njage P, van Bunnik B, McNally L, Lukjancenko O, Röder T, Nieuwenhuijse D, Pedersen SK, Kjeldgaard J, Kaas RS, Clausen PTLC, Vogt JK, Leekitcharoenphon P, van de Schans MGM, Zuidema T, de Roda Husman AM, Rasmussen S, Petersen B, Bego A, Rees C, Cassar S, Coventry K, Collignon P, Allerberger F, Rahube TO, Oliveira G, Ivanov I, Vuthy Y, Sopheak T, Yost CK, Ke C, Zheng H, Baisheng L, Jiao X, Donado-Godoy P, Coulibaly KJ, Jergović M, Hrenovic J, Karpíšková R, Villacis JE, Legesse M, Eguale T, Heikinheimo A, Malania L, Nitsche A, Brinkmann A, Saba CKS, Kocsis B, Solymosi N, Global Sewage Surveillance project consortium., et al. 2019. Global monitoring of antimicrobial resistance based on metagenomics analyses of urban sewage. Nat Commun 10:1124.
7.
Nielsen RL, Helenius M, Garcia SL, Roager HM, Aytan-Aktug D, Hansen LBS, Lind MV, Vogt JK, Dalgaard MD, Bahl MI, Jensen CB, Muktupavela R, Warinner C, Aaskov V, Gøbel R, Kristensen M, Frøkiær H, Sparholt MH, Christensen AF, Vestergaard H, Hansen T, Kristiansen K, Brix S, Petersen TN, Lauritzen L, Licht TR, Pedersen O, Gupta R. 2020. Data integration for prediction of weight loss in randomized controlled dietary trials. Sci Rep 10:20103.
8.
Brown CL, Keenum IM, Dai D, Zhang L, Vikesland PJ, Pruden A. 2021. Critical evaluation of short, long, and hybrid assembly for contextual analysis of antibiotic resistance genes in complex environmental metagenomes. Sci Rep 11:3753.
9.
Deng Z, Delwart E. 2021. ContigExtender: a new approach to improving de novo sequence assembly for viral metagenomics data. BMC Bioinformatics 22:119.
10.
Arredondo-Alonso S, Willems RJ, van Schaik W, Schurch AC. 2017. On the (im)possibility of reconstructing plasmids from whole-genome short-read sequencing data. Microb Genom 3:e000128.
11.
Lapidus AL, Korobeynikov AI. 2021. Metagenomic data assembly - the way of decoding unknown microorganisms. Front Microbiol 12:613791.
12.
Ren J, Ahlgren NA, Lu YY, Fuhrman JA, Sun F. 2017. VirFinder: a novel k-mer based tool for identifying viral sequences from assembled metagenomic data. Microbiome 5:69.
13.
Roux S, Enault F, Hurwitz BL, Sullivan MB. 2015. VirSorter: mining viral signal from microbial genomic data. PeerJ 3:e985.
14.
Krawczyk PS, Lipinski L, Dziembowski A. 2018. PlasFlow: predicting plasmid sequences in metagenomic data using genome signatures. Nucleic Acids Res 46:e35.
15.
Fang Z, Tan J, Wu S, Li M, Xu C, Xie Z, Zhu H. 2019. PPR-Meta: a tool for identifying phages and plasmids from metagenomic fragments using deep learning. Gigascience 8.
16.
Andreopoulos WB, Geller AM, Lucke M, Balewski J, Clum A, Ivanova N, Levy A. 2021. Deeplasmid: deep learning accurately separates plasmids from bacterial chromosomes. bioRxiv.
17.
Aytan-Aktug D, Clausen PT, Szarvas J, Munk P, Otani S, Nguyen M, Davis JJ, Lund O, Aarestrup FM. 2021. PlasmidHostFinder: prediction of plasmid hosts using random forest. bioRxiv.
18.
Arredondo-Alonso S, Rogers MRC, Braat JC, Verschuuren TD, Top J, Corander J, Willems RJL, Schurch AC. 2018. mlplasmids: a user-friendly tool to predict plasmid- and chromosome-derived sequences for single species. Microb Genom 4.
19.
Amgarten D, Braga LPP, da Silva AM, Setubal JC. 2018. MARVEL, a tool for prediction of bacteriophage sequences in metagenomic bins. Front Genet 9:304.
20.
Auslander N, Gussow AB, Benler S, Wolf YI, Koonin EV. 2020. Seeker: alignment-free identification of bacteriophage genomes by deep learning. Nucleic Acids Res 48:e121.
21.
Fang Z, Zhou H. 2020. Identification of the conjugative and mobilizable plasmid fragments in the plasmidome using sequence signatures. Microb Genom 6.
22.
Liu F, Miao Y, Liu Y, Hou T. 2020. RNN-VirSeeker: a deep learning method for identification of short viral sequences from metagenomes. IEEE/ACM Trans Comput Biol Bioinform P.
23.
Ren J, Song K, Deng C, Ahlgren NA, Fuhrman JA, Li Y, Xie X, Poplin R, Sun F. 2020. Identifying viruses from metagenomic data using deep learning. Quant Biol 8:64–77.
24.
O'Leary NA, Wright MW, Brister JR, Ciufo S, Haddad D, McVeigh R, Rajput B, Robbertse B, Smith-White B, Ako-Adjei D, Astashyn A, Badretdin A, Bao Y, Blinkova O, Brover V, Chetvernin V, Choi J, Cox E, Ermolaeva O, Farrell CM, Goldfarb T, Gupta T, Haft D, Hatcher E, Hlavina W, Joardar VS, Kodali VK, Li W, Maglott D, Masterson P, McGarvey KM, Murphy MR, O'Neill K, Pujar S, Rangwala SH, Rausch D, Riddick LD, Schoch C, Shkeda A, Storz SS, Sun H, Thibaud-Nissen F, Tolstoy I, Tully RE, Vatsan AR, Wallin C, Webb D, Wu W, Landrum MJ, Kimchi A, et al. 2016. Reference sequence (RefSeq) database at NCBI: current status, taxonomic expansion, and functional annotation. Nucleic Acids Res 44:D733–D745.
25.
Antonopoulos DA, Assaf R, Aziz RK, Brettin T, Bun C, Conrad N, Davis JJ, Dietrich EM, Disz T, Gerdes S, Kenyon RW, Machi D, Mao C, Murphy-Olson DE, Nordberg EK, Olsen GJ, Olson R, Overbeek R, Parrello B, Pusch GD, Santerre J, Shukla M, Stevens RL, VanOeffelen M, Vonstein V, Warren AS, Wattam AR, Xia F, Yoo H. 2019. PATRIC as a unique resource for studying antimicrobial resistance. Brief Bioinform 20:1094–1102.
26.
Akhter S, Aziz RK, Edwards RA. 2012. PhiSpy: a novel algorithm for finding prophages in bacterial genomes that combines similarity- and composition-based strategies. Nucleic Acids Res 40:e126.
27.
Bohlin J, van Passel MW, Snipen L, Kristoffersen AB, Ussery D, Hardy SP. 2012. Relative entropy differences in bacterial chromosomes, plasmids, phages and genomic islands. BMC Genomics 13:66.
28.
Dobrindt U, Hochhut B, Hentschel U, Hacker J. 2004. Genomic islands in pathogenic and environmental microorganisms. Nat Rev Microbiol 2:414–424.
29.
Vernikos GS, Thomson NR, Parkhill J. 2007. Genetic flux over time in the Salmonella lineage. Genome Biol 8:R100.
30.
Davis JJ, Wattam AR, Aziz RK, Brettin T, Butler R, Butler RM, Chlenski P, Conrad N, Dickerman A, Dietrich EM, Gabbard JL, Gerdes S, Guard A, Kenyon RW, Machi D, Mao C, Murphy-Olson D, Nguyen M, Nordberg EK, Olsen GJ, Olson RD, Overbeek JC, Overbeek R, Parrello B, Pusch GD, Shukla M, Thomas C, VanOeffelen M, Vonstein V, Warren AS, Xia F, Xie D, Yoo H, Stevens R. 2020. The PATRIC Bioinformatics Resource Center: expanding data and analysis capabilities. Nucleic Acids Res 48:D606–D612.
31.
Sayers EW, Beck J, Brister JR, Bolton EE, Canese K, Comeau DC, Funk K, Ketter A, Kim S, Kimchi A, Kitts PA, Kuznetsov A, Lathrop S, Lu Z, McGarvey K, Madden TL, Murphy TD, O'Leary N, Phan L, Schneider VA, Thibaud-Nissen F, Trawick BW, Pruitt KD, Ostell J. 2020. Database resources of the National Center for Biotechnology Information. Nucleic Acids Res 48:D9–D16.
32.
Huerta-Cepas J, Dopazo J, Gabaldón T. 2010. ETE: a python environment for tree exploration. BMC Bioinformatics 11:24.
33.
Schoch CL, Ciufo S, Domrachev M, Hotton CL, Kannan S, Khovanskaya R, Leipe D, Mcveigh R, O’Neill K, Robbertse B, Sharma S, Soussov V, Sullivan JP, Sun L, Turner S, Karsch-Mizrachi I. 2020. NCBI Taxonomy: a comprehensive update on curation, resources and tools. Database (Oxford) 2020.
34.
Clausen PTLC, Aarestrup FM, Lund O. 2018. Rapid and precise alignment of raw reads against redundant databases with KMA. BMC Bioinformatics 19:307.
35.
Kokot M, Dlugosz M, Deorowicz S. 2017. KMC 3: counting and manipulating k-mer statistics. Bioinformatics 33:2759–2761.
36.
Breiman L. 2001. Random forests. Machine Learning 45:5–32.
37.
Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O, Blondel M, Louppe G, Prettenhofer P, Weiss R. 2012. Scikit-learn: machine learning in python. arXiv arXiv preprint arXiv:12010490.
38.
Virtanen P, Gommers R, Oliphant TE, Haberland M, Reddy T, Cournapeau D, Burovski E, Peterson P, Weckesser W, Bright J, van der Walt SJ, Brett M, Wilson J, Millman KJ, Mayorov N, Nelson ARJ, Jones E, Kern R, Larson E, Carey CJ, Polat İ, Feng Y, Moore EW, VanderPlas J, Laxalde D, Perktold J, Cimrman R, Henriksen I, Quintero EA, Harris CR, Archibald AM, Ribeiro AH, Pedregosa F, van Mulbregt P, Vijaykumar A, Bardelli AP, Rothberg A, Hilboll A, Kloeckner A, Scopatz A, Lee A, Rokem A, Woods CN, Fulton C, Masson C, Häggström C, Fitzgerald C, Nicholson DA, Hagen DR, Pasechnik DV, SciPy 1.0 Contributors. 2020. SciPy 1.0: fundamental algorithms for scientific computing in Python. Nat Methods 17:261–272.
39.
Bushnell B. 2014. BBMap: a fast, accurate, splice-aware aligner. https://www.osti.gov/biblio/1241166.
40.
Nurk S, Meleshko D, Korobeynikov A, Pevzner PA. 2017. metaSPAdes: a new versatile metagenomic assembler. Genome Res 27:824–834.
41.
Marcais G, Delcher AL, Phillippy AM, Coston R, Salzberg SL, Zimin A. 2018. MUMmer4: a fast and versatile genome alignment system. PLoS Comput Biol 14:e1005944.
42.
Seemann T. 2014. Prokka: rapid prokaryotic genome annotation. Bioinformatics 30:2068–2069.

Information & Contributors

Information

Published In

cover image Microbiology Spectrum
Microbiology Spectrum
Volume 10Number 621 December 2022
eLocator: e02641-22
Editor: Katharina Schaufler, University of Greifswald
PubMed: 36377945

History

Received: 11 July 2022
Accepted: 1 November 2022
Published online: 15 November 2022

Keywords

  1. assembly identification
  2. bacteriophage
  3. chromosome
  4. machine learning
  5. plasmid
  6. source identification

Contributors

Authors

National Food Institute, Technical University of Denmark, Kongens Lyngby, Denmark
Vladislav Grigorjev
National Food Institute, Technical University of Denmark, Kongens Lyngby, Denmark
National Food Institute, Technical University of Denmark, Kongens Lyngby, Denmark
National Food Institute, Technical University of Denmark, Kongens Lyngby, Denmark
National Food Institute, Technical University of Denmark, Kongens Lyngby, Denmark
Consortium for Advanced Science and Engineering, University of Chicago, Chicago, Illinois, USA
Data Science and Learning Division, Argonne National Laboratory, Argonne, Illinois, USA
Northwestern Argonne Institute for Science and Engineering, Evanston, Illinois, USA
Consortium for Advanced Science and Engineering, University of Chicago, Chicago, Illinois, USA
Data Science and Learning Division, Argonne National Laboratory, Argonne, Illinois, USA
Northwestern Argonne Institute for Science and Engineering, Evanston, Illinois, USA
National Food Institute, Technical University of Denmark, Kongens Lyngby, Denmark
National Food Institute, Technical University of Denmark, Kongens Lyngby, Denmark

Editor

Katharina Schaufler
Editor
University of Greifswald

Notes

Derya Aytan-Aktug and Vladislav Grigorjev contributed equally to this article. Author order was determined in order of decreasing seniority.
The authors declare no conflict of interest.

Metrics & Citations

Metrics

Note:

  • For recently published articles, the TOTAL download count will appear as zero until a new month starts.
  • There is a 3- to 4-day delay in article usage, so article usage will not appear immediately after publication.
  • Citation counts come from the Crossref Cited by service.

Citations

If you have the appropriate software installed, you can download article citation data to the citation manager of your choice. For an editable text file, please select Medlars format which will download as a .txt file. Simply select your manager software from the list below and click Download.

View Options

Figures and Media

Figures

Media

Tables

Share

Share

Share the article link

Share with email

Email a colleague

Share on social media

American Society for Microbiology ("ASM") is committed to maintaining your confidence and trust with respect to the information we collect from you on websites owned and operated by ASM ("ASM Web Sites") and other sources. This Privacy Policy sets forth the information we collect about you, how we use this information and the choices you have about how we use such information.
FIND OUT MORE about the privacy policy