Viral taxonomy classification.
vConTACT2 is the taxonomy classification tool implemented in MP. Briefly, it uses viral genomes as nodes of a monopartite network from which it derives viral clusters. Genomes are grouped based on protein content and similarity, and the resulting clusters are largely concordant with the International Committee on Taxonomy of Viruses (ICTV) viral taxa (
9).
vConTACT2 splits clusters into subclusters, where clusters are an approximation of phage (sub)families, and subclusters are approximations of phage genera. For each sequence in the network, vConTACT2 assigns a label representing the clustering “status.” A user can encounter several status descriptions, e.g., “Clustered” when the sequence clearly falls within a defined cluster and subcluster. More complex status classifications are also possible, e.g., “Overlap” when a sequence fits two or more clusters equally well, “Clustered/Singleton” when a subcluster consists of only one sequence, or “Outlier” when the sequence cannot be assigned to any cluster.
The network produced by vConTACT2 has viral sequences (reference genomes and vOTUs) as nodes, and its edges are weighted proportionally with the similarity between two nodes. This network can be inspected using Cytoscape (
13), and this helps the user to deduce the taxonomic context of each vOTU, including the most difficult cases. Unfortunately, repeating this manual inspection for each vOTU is very time-consuming. To automate the process, we developed a novel Python script named
graphanalyzer, which assigns an optimal taxonomy to every vOTU following the same process as a user would do manually.
For each vOTU to be classified, graphanalyzer searches the corresponding node in the similarity network and retrieves the list of nodes directly connected to it (its “neighbors”), which are also in the same cluster as the vOTU (if any). The list is then ordered by the weight of the edge connecting the node to the vOTU, and the taxonomy is inherited from the first reference genome found in this sorted list. If no reference genome is found, graphanalyzer retrieves a new sorted list with all nodes that are simply neighbors, regardless of clustering information. If a reference genome exists within this second list, the taxonomy information is inherited from the reference with the highest weight. If no reference genome is found, the vOTU remains unclassified.
In this process, the taxonomic classification for each vOTU can be more or less confident depending on the weight assigned by vConTACT2 and the clustering status. To be more conservative, we decided to stop the inherited taxonomy at the level of subfamily if the status was “Clustered/Singleton” or “Overlap” and at the level of family if the taxonomy was inherited from a genome that was not in the same cluster. Otherwise, the taxonomy was retained until the level of genus.
The reliability of the taxonomic classification can be assessed by looking at the column “Status” and “Weight” in the taxonomy table output. The “Weight” column is the score calculated by vConTACT2 as the negative logarithm of the hypergeometric P value multiplied by the total number of pairwise genome comparisons. The higher this score, the higher the similarity between two genomes and so the probability they are evolutionarily related. The “Status” column refers to the label assigned by vConTACT2 to each sequence in the network. In general, a “Status” labeled as “Clustered” should be interpreted as more reliable compared to other labels, such as “Outlier” or “Overlap”. The taxonomy table can be sorted according to these columns in order to rank the vOTUs based on the reliability of the taxonomic assignment.
For each vOTU,
graphanalyzer also produces an interactive subnetwork that can be explored to retrieve extra information without the need for Cytoscape (
Fig. 2). Further information on
graphanalyzer, including its installation and usage, a description of its main algorithm and its classification levels, and a detailed user guide to the interpretation of the taxonomy table and the interactive subnetworks it produces, is freely available in its public repository at
https://github.com/lazzarigioele/graphanalyzer.
MetaPhage report.
MetaPhage results are summarized in a rich, easy-to-read html report that can be opened in any web browser. The report is organized in different sections for each step of the workflow.
The Miner Comparison section shows the number of vOTUs identified by the different phage mining tools and allows a direct comparison of the performance of each software (
Fig. 3A). The Viral Contigs Distribution plot reports the length distribution on the identified viral contigs. The Summary Table section displays per-sample general statistics, such as vOTU total count, abundance, vOTU length (minimum, average, and maximum), and vOTU distribution in a searchable table. The Taxonomy Table section (
Fig. 3B) is an interactive and searchable per-vOTU table, displaying taxonomy information and host association information (retrieved using the information contained in the INPHARED database [
14]). The taxonomy is automatically assigned using the
graphanalyzer tool, and it stops at the genus level if available. For each vOTU, it is also possible to visualize the corresponding subnetwork generated by
graphanalyzer, together with its nucleotide consensus sequence, predicted protein sequences (FASTA format), and coordinates files (GFF3 format). The table allows the user to query for any combination of phage miners to restrict the search to only the vOTU of interest.
The vOTUs Distribution (
Fig. 3A) section contains links to the generated heatmap and violin plots, both interactive plots that give information about vOTU abundance (log count) and distribution across the samples.
The report also provides alpha and beta diversity metrics in the corresponding report sections to describe within-sample complexity and between-sample diversity. In particular, alpha diversity is calculated using different metrics, such as observed, Shannon, Abundance-based coverage estimators, Simpson, and Fisher, while beta diversity is calculated using Bray-Curtis and Jaccard indices.
Several important files that are generated during the analysis and are useful for downstream analysis are also directly linked within the report for easy and fast access, in the Relevant Files section (
Fig. 3C). In particular, the vOTU count table reports the abundance of each vOTU in each sample, the taxonomy table, multi FASTA files containing the vOTU nucleotide and protein sequences, and the coordinate of the predicted proteins on the corresponding contigs in gff3 format. Moreover, a phyloseq object is also provided with both raw and cumulative sum scaling (css) normalization counts. phyloseq is an R package to import, store, analyze, and graphically display complex OTU-clustered high-throughput phylogenetic sequencing data (
15). The package implements advanced/flexible graphic systems and leverages many of the tools available in R for ecology and phylogenetic analysis for downstream analysis.
The report also contains the output of the taxonomic classification performed using kraken2 (
16), providing both an interactive bar plot (
17) summarizing the results of the taxonomic classification on the whole data set and both the single Kraken2 and Krona pie chart output file for each sample (
Fig. 3A).
Finally, the last part of the report provides general statistics about read quality check/preprocessing and assembly metrics (calculated using the metaQuast tool [
18]).
Validation and testing.
MetaPhage performance was tested by analyzing a DNA virome derived from virus-like particles (VLP) and a shotgun metagenome data set from a published study (
19). The authors analyzed 20 healthy infant stool samples, collected at 0 to 4 days after birth and at 1 and 4 months of life, in order to investigate the origin of the viral population in the gut. The authors reported 2,552 viral contigs, 1,029 of which were classified as bacteriophages containing more than 10 open reading frames (ORFs).
Our MetaPhage analysis, run considering the minimum vOTU length of 3,000 as described in reference
19, found 5,284 vOTUs (2,507 of which contained more than 10 ORFs) in the DNA virome data set. Twenty-seven percent of the identified vOTUs were predicted by either VirFinder (
20), VIBRANT (
21), VirSorter2 (
22), or DeepVirFinder (
23), while 8.6% were predicted by all of the miners. VIBRANT was the miner that predicted the higher number of vOTU (4,078). In the shotgun metagenomics data set, we predicted 14,274 vOTUs (4,666 of which contained more than 10 ORFs). A total of 24.8% of the identified vOTUs were predicted only by DeepVirFinder and VIBRANT, while only 2.3% were predicted by all of the five miners. DeepVirFinder was the miner that predicted the overall higher number of vOTU (10,508). Details on the number of vOTUs predicted by the different miners on the two data sets can be found as at
https://doi.org/10.6084/m9.figshare.20449644. Both the data sets were classified using vConTACT2, and the taxonomy assignment was done by means of our
graphanalyzer tool as described in the Materials and Methods and Results sections. In the DNA data set, the taxonomy classification allowed us to identify 3 phyla (1,943 vOTUs), 3 classes (1,943 vOTUs), 4 orders (1,954 vOTUs), 9 families (1,894 vOTUs), and 32 genera (87 vOTUs). In the shotgun data set, we identified 3 phyla (2,376 vOTUs), 3 classes (2,376 vOTUs), 4 orders (2,376 vOTUs), 9 families (2,321 vOTUs), and 19 species (47 vOTUs).
Bacteriophage classification results, collapsed at the family taxonomic rank and reads per million (RPM) normalized, are shown in
Fig. 4A. These results are largely in concordance with those of the original authors, with differences attributed to an increased number of phage genome sequences available from public databases for taxonomic assignments and an increase in the number of phage families defined. Results are also confirmed looking at the alpha diversity (
Fig. 4B), which shows a lower richness and diversity at month 0 compared to month 1 and month 4. As phage taxonomy changes over time, MetaPhage will be able to easily adapt and incorporate new taxa using updated versions of the core databases.
Time consumption.
The validation testing was run for both the data sets on Cineca’s Galileo100 HPC cluster (
https://www.hpc.cineca.it/hardware/galileo100) using SLURM as a job scheduler and requesting a variable number of resourcing ranging from 4 to 48 cores and from 8 GB to 64 GB of RAM depending on the type of process.
The DNA virome data set analysis (60 samples) was completed in 23 h. The sum of each single process real time was around 60 h (without considering the jobs running in parallel). On average, per sample, DeepVirFinder processes ran in 6 min, Phigaro (
24) in 1 min, VIBRANT in 2 min, VirFinder in 1 min, and VirSorter2 in 23 min. DIAMOND (
25), vConTACT2, and
graphanalyzer together took 3 h 43 min to complete.
The shotgun metagenome data set (59 samples) was run in 37 h. The sum of each single process real time was slightly more than 171 h. On average, per sample, DeepVirFinder processes ran in 32 min, Phigaro in 5 min, VIBRANT in 10 min, VirFinder in 10 min, and VirSorter2 in 1 h and 32 min. DIAMOND, vConTACT2, and graphanalyzer together took 4 h and 46 min to complete.