Annotated and unannotated domains and proteins of orthornaviruses
To compile a comprehensive set of orthornavirus proteins, we used a recently published data set (hereafter environmental metatranscriptome RNA virus [EMRV] set) of predicted orthornavirus genomes spanning more than 370,000 viral contigs from nearly 500 operationally defined virus families of which 98 had been approved by the ICTV at the time of publication (2022) (
1). All proteins and domains annotated in this study were extracted, yielding a set of 647,383 protein sequences. Then, evolutionarily conserved but unannotated (putative) proteins and domains that are conserved in groups of viruses were identified (see
Fig. S1 in the supplemental material for a schematic and Materials and Methods for details). In brief, ORFs from start to stop were predicted in all six frames and matched to the published annotations (
1). ORFs without annotation and with only partial annotation were processed to retrieve the unannotated sequence stretches. In the case of polyproteins and multidomain proteins, unannotated and partially annotated proteins of at least 200 amino acids (aa), in which a continuous stretch of at least 60 aa was unannotated, hereinafter conserved unannotated domains (CUDs), were extracted. Unannotated ORFs between 60 and 199 aa (
n = 1,362,871) were not sliced further and kept for downstream analysis (hereinafter “ORFans”). To identify evolutionarily conserved sequences that are likely to be expressed and functional, ORFans and CUDs were clustered by sequence similarity (see Materials and Methods). All clusters that included proteins from different genomes which are represented by at least three leaves in the RdRp-based phylogenetic trees or at least one CUD were retained for further analysis, resulting in 6,117 clusters of ORFans representing 100,694 sequences and 13,085 CUDs representing 31,247 sequences (
Fig. S2A and B).
Next, we assessed which known protein domains were missed during profile-based annotation and might be present within the CUD and ORFan set. To this end, we downloaded a curated set of viral reference genomes from the ICTV (ICTV exemplars,
https://ictv.global/vmr, VMR_MSL38_v2) matching the 98 virus families recognized in the EMRV set and extracted the annotated domains and proteins (
n = 32,648) from the GenBank files (hereafter exemplar domains). Exemplar domains were clustered and compared against the same viral profile databases used for EMRV annotation ran using hhsearch (
25). Clusters with at least one highly confident hit (probability ≥95%) were harmonized and assigned accordingly. As a result, 64% of ICTV clusters were assigned confidently, spanning 83% of all exemplar domains. Across virus families, about 70% of all clusters associated with a given family could be confidently assigned, increasing to 80% if considering only clusters spanning domains on the RdRp-encoding segment, those recovered by Neri et al. (
1) and included in the EMRV data set (
Fig. S3A). The clusters without confident assignment were dominated by functionally uncharacterized domains (
Fig. S3B). Thus, some known viral proteins and domains are not identified with high probability by the used viral protein profiles and might be present among the CUDs and ORFans. These under-annotated domains were identified as part of the refinement of the orthornavirus proteome as described below.
The pan-proteome of Orthornavirae
Combining the annotated domains and proteins with CUDs and ORFans, we constructed a pan-proteome for each orthornavirus family. All domains and ORFs (annotated or not) were clustered by sequence similarity and domains were either labeled by their functional tag or as CUD or ORFan. Similar procedure was performed for the ICTV exemplar set, and the resulting virus family pan-proteomes were compared (
Fig. 1A). For the 98 virus families represented in both the ICTV exemplars and the EMRV set, comparable numbers of functions (profile assignments) per virus family were detected, confirming that robust domain annotation was obtained for the EMRV set. Functional assignment per virus family across all 498 families was lower because about 25% of the families included viruses for which only the RdRp was detected, that is, most likely, viruses with segmented genomes (
Fig. 1B).
A substantial majority of the domains annotated by a particular profile were found in a single virus family in both the EMRV set (85%) and the ICTV exemplar set (89%). Thus, these domains represent virus family- or even genus- or species-specific functions (
Fig. 1C and E). The RdRp was the only protein represented in all virus families. Very few other domains and functions were found to be broadly distributed across virus families. These widespread functions are different types of capsids (40% of families in the EMRV set vs 76% in the ICTV exemplar set), mRNA capping enzymes (35% vs 47%), helicases (33% vs 45%), and proteases (22% vs 30%) (
Fig. 1D and F). The consistently lower abundance of these domains across virus families within the EMRV set compared to the ICTV exemplars is probably due to the absence of non-RdRp-encoding segments of multipartite genomes in the EMRV set.
The structurome of Orthornavirae
To predict the structures and functions of the CUDs and ORFans, we generated structural models using AlphaFold2 (
17). Only a fraction of structures (about 34% of CUDs and 10% of ORFans) could be predicted with high confidence (mean plddt score ≥70), indicating the possibility of larger intrinsically disordered stretches (
Fig. S4A and 5A). To address this possibility, we predicted intrinsically disordered structures in CUDs and ORFans. Although we found no significant correlation with the plddt score of the structural models, proteins predicted to have disordered regions have mostly a plddt score below 70 (
Fig. S4B and 5B). To account for the uncertainty of correct folds among low plddt structures, we clustered all CUDs and ORFans independently with Foldseek (0.8 coverage) and kept only clusters in which at least one member had a mean plddt score of 70 or higher, resulting in 412 ORFan and 1,594 CUD representatives.
These representatives were compared to PDB using Dali (
26). Secondary structure prediction with Psique (
27) indicated an all-α-helical fold for about 69% of CUDs and nearly 76% of ORFans (
Fig. 2). Furthermore, for 53% of the CUDs and 35% of the ORFans, folds distinct from the simplest ones, such as a single α-helix or helix-turn-helix (HTH), were predicted. Those were considered as CUD and ORFan of interest (COI and OOI hereafter, respectively). About 37% and 13% of the ORFans and CUDs with a predicted simple fold were predicted to contain at least one transmembrane domain compared to about 8% and 9% of the more structured ORFans and CUD, respectively, indicating a substantial enrichment of small transmembrane proteins among ORFans with a simple fold. Most of the COIs and OOIs were confined to a single virus family (Fig. S6). Foldseek clusters spanning representatives of multiple families represent mainly capsid proteins (see below). Furthermore, COIs were checked with a “neighborhood” approach to determine whether a given COI was confidently annotated in a related genome from the same virus family, pointing to its function (
Fig. 3A and B). In brief, COIs were searched against the annotated domains using psi-blast to obtain a provisional annotation of the COI. Next, related proteins from neighboring genomes with or without COI were aligned and the respective annotations were mapped to the alignment. Whenever annotations from neighboring proteins overlapped with the COI, the annotation was compared with the psi-blast hit. If consistent, the COI was considered a “refinement” and was not investigated further. Whenever there was no overlap with an annotated domain, a conflicting or mixed result, or no psi-blast hit, the COI was kept for manual inspection of the Dali results. Of the 852 COIs with confidently predicted structures, 62 were from genomes not phylogenetically assigned, 553 represented “refinements,” many as part of the RdRp.
To incorporate the predicted structures of COIs and OOI into the context of known structures across all
Orthornavirae families, we constructed a structurome of
Orthornavirae by modeling one representative structure per functional tag per virus family from the pangenome (e.g., one representative RdRp structure per virus family, 2,421 annotated representatives in total). The ICTV exemplar domains lacking strong profile annotation were modeled too (3,000 representatives). This modeling resulted in 6,419 predicted structures which were pre-clustered with Foldseek (0.8 coverage, 4,022 clusters). The great majority of the structures, 88%, remained singletons (
Fig. S7A). In 27 cases, OOIs were found in a Foldseek cluster together with annotated domains and/or ICTV exemplar domains, and the same was found for 13 COIs. Then, 4,022 cluster representatives were analyzed by an all-vs-all Dali comparison, finalizing the
Orthornavirae structurome. Structures within the structurome were clustered based on the Dali all-vs-all z-scores in an iterative procedure in which individual structures were added to a cluster as long as the mean z-score to each other structure already present was above a given z-score. The threshold was set at a z-score of 7 or higher (“z7 clusters”) to avoid over-clustering. This procedure resulted in 333 z7 clusters of which 59% contained two structures, whereas the largest cluster consisted of 43 structures (
Fig. S7B). These 333 z7 clusters represented 1,201 (30%) of all representative structures in the Dali all-vs-all run. The structurome was visualized as a structure-similarity network in which each structure is a node connected to other nodes via edges weighted by the pairwise z-score.
Figure S8 shows a subset of the network in which only structures present in a z7 cluster are shown or which have at least one connection to a structure within a z7 cluster with a z-score of 7 or higher (about 36% of all structures in the Dali all-vs-all run). As expected, we detected z7 clusters of structures coming from well known, functionally characterized domains such as RdRp, helicases, single jelly-roll (SJR) capsid proteins, and proteases. Some functional labels were distributed across several z7 clusters. For example, RdRp structures that dominate the structurome (489 structures initially) contributed to 4 z7 clusters with one harboring about 88% of the representative RdRp structures (
Fig. S7B).
About 28% (
n = 92) of z7 clusters consisted of OOIs and COIs only, that is, represented unique shared folds. Notably, nearly all α-helical domains and proteins (annotated or not) were placed in eight larger z7 clusters (10 representatives or more) in which they are connected by z-scores of 7 or higher but these connections apparently reflect generic structural similarity rather than shared distinct folds (hubs labeled “Helical” in
Fig. S8).
OOIs and COIs were inspected semi-manually by considering Dali results, secondary structures, and structural relationships within the structurome. All-helical COIs and OOIs mostly did not show conclusive Dali hits, with moderate structural similarity between small α-helical modules detected across apparently unrelated proteins. Inspection of COIs and OOIs predicted to fold into globular structures either revealed a fold and function not yet reported for a given virus family (“new”), or a fold not yet reported for a given virus family for which no clear function could be assigned due to significant but too variable Dali hits (“unclear”), or indicated a refinement of the current profile-based annotation of a given virus family (“refined”). Altogether, we classified the predicted structures for OOIs and COIs, respectively, as follows: 8 and 7 new; 14 and 16 unclear; and 23 and 12 refined (
Fig. 3C and D).
Saturation of the rarefaction curve of distinct domains sampled randomly across
Orthornavirae genome clusters suggests that the substantial majority of widely distributed
Orthornavirae domains are already known (
Fig. 3E). Conversely, new viral domains are expected to be found in single virus families which is in line with the findings of this study. Having both the structure-based and profile-based annotations at hand, we aimed to identify the core set of unique domains per virus family shared by at least 50% of genomes (see Materials and Methods for details) and compare it to the overall distribution of domains per virus family. For most orthornavirus families, we identified a small core, often consisting of the RdRp alone, and a “shell” of domains with intermediate frequencies (
Fig. S19). This distribution of domain frequencies could reflect true variation across a virus family, for example, on the genus level but also the presence of incomplete genomes in the data set, in particular, those with multiple segments, and a lack of complete domain annotation.
Discovery of new domains in the orthornaviral structurome
“Unexpected” OOIs and COIs identified in a particular virus family potentially could be either truly novel, that is, not reported so far in any
Orthornavirae, or new to the given virus family. There were no unequivocal examples of the former case although we identified two OOIs from the putative family
f.0145 (o.0036, c.0025,
Kitrinoviricota) with a predicted β-barrel fold. Best but not consistent Dali hits were two bacterial β-barrel fold proteins (top z-scores ~ 5, PilZ domain and HCP3, a paralog of type VI secretion system effector, Hcp1, pfam PF05638, for the two OOIs, respectively) and no structural similarity was observed to known virus proteins in the
Orthornavirae structurome or, to our knowledge, any other viruses (
Fig. S9). Apart from these two β-barrel domains, we identified several other domains known to be encoded by
Orthornavirae members but here found in unexpected virus families.
Nucleic-acid-binding domains
Several COI and OOI products with different folds seem to be involved in nucleic acid binding. One of these nucleic acid-binding domains is the phytoreovirus core-P7 dsRNA-binding domain (P7-dsRBD) that is known to be encoded by members of
Sedoreoviridae (28) and here was found in proteins of various virus families as annotated by profile comparison:
Chrysoviridae (70/81 leaves covered),
Endornaviridae (30/222),
Megabirnaviridae (3/12),
f.0281.base-Megabirna (7/29),
f.0285 (31/94),
Flaviviridae (4/360), with z-scores of 8–11 to each other (see pangenome in the Supplementary Material on zenodo) for all virus families with this domain).
Besides a refined census of P7-dsRBDs, for example, in
Cystoviridae, we identified an OOI with a similar fold in
Picobirnaviridae (in genomes of 14/1376 leaves scattered across the tree;
Fig. 4). Structure comparison against PDB revealed prominent similarity to nucleotide monophosphate (NMP) kinases, such as adenylate kinase (z-score ~8). Thus, proteins with a core-P7-like RBD fold likely originated from cellular NMP kinases and might have spread horizontally among diverse viruses. Representative viral protein structures with P7-dsRBD from different virus families including
Sedoreoviridae,
Chrysoviridae, and
Picobirnaviridae were aligned with homologous cellular kinases using FoldMason (
29), and a phylogenetic tree was constructed using IQTree2 (
30). Most of the viral P7-dsRBDs formed a distinct clade separate from the cellular kinases (
Fig. 4C) which is compatible with functional divergence after exaptation and subsequent horizontal spread. Exceptions are the P7-dsRBDs of
Picobirnaviridae which clustered within the cellular kinase clades, suggestive of independent exaptation events (
Fig. 4C). While the Walker A motif is intact in viral core-P7-like dsRBD fold proteins, the Walker B motif is missing, indicating loss of kinase activity (
Fig. S10 ). The P7-dsRBD domains are found either as stand-alone proteins or are incorporated into viral polyproteins (
Fig. S11). This domain was identified in three families from the order
Ghabrivirales (
Chrysoviridae,
f.0296, and
f.0285) indicative of an old acquisition but, in contrast, seems to have been more recently acquired at least twice by
Picobirnaviridae members (
Fig. 4B and C). Other members of
Picobirnaviridae, predicted to infect bacterial hosts, have been shown to encode a putative lysozyme (completely unrelated to dsRBD or kinases) in the same genomic location (
1). Furthermore, other members of
Picobirnaviridae encode a capsid protein 5′ of the RdRp ORF, demonstrating the flexibility of
Picobirnaviridae to capture diverse ORFs 5′ of the RdRp ORF. The exact functions of the P7-dsRBD domains in different virus families remain unclear.
Notably, all experimentally characterized viruses with a P7-dsRBD have dsRNA genomes, suggesting that this domain is specifically involved in capsid-associated dsRNA transactions. To our knowledge, there are no experimental studies for viruses with ssRNA genomes, such as flaviviruses, on the role of the observed P7-dsRBD. Nevertheless, given that all orthornaviruses produce replicative dsRNA intermediates in the host cell, it seems likely that these domains are involved in transcription and/or RNA packaging as reported for the phytoreovirus P7 (
31), but additional or alternative roles, for example, in the suppression of the host RNAi system, cannot be ruled out.
An OOI containing a common RBD fold (
32) consisting of four β-strands and two α-helices (obviously, unrelated to the kinase-derived RBD discussed above) was identified in members of the
Hepeviridae family (
Fig. 5). This OOI is located 3′ of the non-structural polyprotein and capsid protein ORFs. It is unclear whether this OOI actually binds RNA because the top Dali hits are the RBD that have lost the RNA-binding capacity and are instead involved in protein-protein interactions, such as RBD2 of the
Arabidopsis protein HYL1 (
33).
Another nucleic acid-binding domain is a dsRBD that is found in many virus families such as
Nodaviridae,
Astroviridae, and
Reoviridae and functions as a viral suppressor of host RNAi defense (see virus family pangenomes). We additionally identified related dsRBD as a COI domain in
f.0092, basal to
Permutotetraviridae (
Fig. S12). This domain was detected in genomes represented by nearly all leaves of this family (16/17).
Yet another fold implicated in nucleic acid binding is a winged helix-turn-helix (wHTH) domain found in family
f.0008 basal of
Polycipiviridae, the only virus so far identified in
f.0008 is Lothians earthworm picorna-like virus 1 (
34) (
Fig. 6). The wHTH domain apparently was acquired recently by
f.0008 members as it was only found in a distinct, distal clade (
Fig. 6C). In addition to the wHTH domain, the same and additional
f.0008 family members were shown to encode an SJR-fold protein (
Fig. 6B through D). Given the genome architecture of
f.0008 members with an annotated larger capsid ORF 5′ of the newly identified SJR protein, and given that it clusters with other plant movement proteins (MP) in the structurome, it is highly likely that this is a 30K superfamily MP that evolved by duplication of an SJR capsid protein followed by neofunctionalization (
11). Thus, plants are likely the hosts for at least this clade of
f.0008 members.
A galactose-binding domain
Independently of the putative MP and wHTH, other members of the family
f.0008 were found to encode a galactose-binding domain (
Fig. S13; Dali z-score of 15.5 against a bacterial carbohydrate-binding domain). The general genome organization of
f.0008 members differs between those encoding the wHTH protein and MP, and those encoding the carbohydrate-binding domain. The former viruses encode a 5′ non-structural polyprotein followed by the capsid protein ORF whereas the latter ones encode a single polyprotein with the capsid protein domain at the N-terminus followed by the galactose-binding domain. The galactose-binding domain is specific for a long branch within the
f.0008 family and is likely to be involved in host-specific interactions (
35).
Endonuclease domains
With well over 4,000 members,
Marnaviridae is currently the largest family in the
Pisuviricota phylum (
1). The few characterized viruses of this family infect diverse marine protists (
36). We found that a number of marnaviruses encode an unexpected domain that is located at the C-terminus of the RdRp and shows significant structural similarity to NucS-like endonucleases (restriction endonuclease fold) (
Fig. 7; Dali z score 8.2). Endonucleases of different folds are also encoded by several groups of orthornaviruses, such as influenza viruses, where this enzyme of the PD-(D/E)XK nuclease superfamily cleaves host mRNAs, snatching the cap for viral RNA synthesis (
37), and nidoviruses, in which NendoU (nidoviral uridylate-specific endoribonuclease) (
38) is involved in viral replication and evasion of host innate immunity (
39,
40). The
Marnaviridae endonuclease domain (MED) represents only the catalytic C-terminal domain of NucS-like endonucleases (“DEK” motif) (
41) but lacks the N-terminal dimerization domain (e.g., reference
41). The catalytic residues are conserved in MED, suggesting it is an active endonuclease (
Fig. 7A and B). Typically, endonucleases with this fold cleave double-stranded or single-stranded DNA (
41–43) suggesting that MED could target host DNA in marnavirus-infected cells. It remains unclear whether MED is proteolytically cleaved off the
Marnaviridae polyprotein and functions as a distinct protein or remains fused to the RdRp domain. Contigs encoding MED are scattered across the RdRp tree of
Marnaviridae suggesting spread via HGT and/or multiple losses of the MED domain (
Fig. 7D).
In addition, we identified an exonuclease in the ~450 members-strong, distinct viral family
f.0181 in which the majority of viruses are likely hosted by protists that use alternative genetic codes (
1). This family could not be assigned to known orders or classes in the phylum
Kitrinoviricota. Profile annotation indicated the presence of the RdRp in these contigs but left a ~1,500 amino acid residues-long unannotated N-terminal region and a ~300 residues-long unannotated C-terminal region in the polyprotein. Dali search revealed a DEDD superfamily exonuclease domain located at the very N-terminus (
Fig. S14; z-score: 10.6, e.g., RNase AS, a polyadenylate-specific exoribonuclease of
Mycobacterium tuberculosis, which belongs to the DEDDh family within the DEDD superfamily). In the orthornavirus structurome, it showed structural similarities with coronavirus ExoN (z-score 8.6) which is involved in proofreading during RNA replication in some of these viruses possessing the largest known RNA genomes (
44,
45), and the C-terminal nuclease domain of the nucleocapsid protein of mammarenaviruses (z-score 7.1) implicated in immune invasion (
46–48). Given that the maximum genome size of
f.0181 members is only up to 7500 nt, it is unlikely that this exonuclease is involved in proofreading. Of note, the DEDDh catalytic site is modified to DEEEh in the
f.0185 exonuclease domain, a variation that, to our knowledge, has not been reported previously. Moreover, structurome comparison revealed a viral methyltransferase domain within the last third of the large N-terminal region (z-score 10–11). This domain is involved in virus RNA capping during replication, a function that is widespread in diverse members of this phylum (
49). The C-terminal region was identified as an SJR capsid protein by Dali search and neighborhood analysis. With the identification of these three functional domains, the
f.0181 genome maps became less inscrutable.
A hydrolase domain
The
Solemoviridae family (
Pisuviricota) includes four genera of plant viruses (
50) and roughly 2,000 newly discovered members found primarily in aquatic, soil, and invertebrate metatranscriptomes (
1). A fraction of solemoviruses encode an OOI with a typical α/β hydrolase fold (
Fig. 8). The top Dali hits for this OOI include different types of hydrolases (e.g., deacetlyases and cutinases with z-scores of 6–7); hence, no clear target can be predicted. A phylogenetic distribution suggests the acquisition of this OOI in a larger clade of unclassified
Solemoviridae, although it is not found across all genomes in this clade. Of note, we cannot detect related folds in other orthornaviruses indicating a unique acquisition of this fold or major fold-remodeling post-exaptation by
Solemoviridae members. Given that none of the known plant solemoviruses encodes this hydrolase domain and that many solemoviruses were identified in plant-less aquatic environments, it seems likely that the host range of this virus family will be extended to additional host phyla.
Uncharacterized α/β-domains
In several cases, high-quality models were obtained for a COI, but there were no significant functionally characterized hits in structure comparisons. A case in point is the C-terminal domain of the
Secoviridae (
Pisuviricota [
51]) polyprotein located immediately downstream of the RdRp (
Fig. S15). Dali search showed mainly uncharacterized proteins, the closest being the N-terminal α/β-domain of TolB (z-score of 6.8) without a known function. No significant hits (z-score ≥ 5) were found for this COI in the orthornavirus structurome. The COI was only found in members of one
Secoviridae genus,
Nepovirus. Nepoviruses infect plants and are transmitted by beetles, aphids, whiteflies, leafhoppers, or nematodes (
52). There are no indications that this C-terminal domain is cleaved from the RdRp, so further research is needed to decipher the role of this COI during nepovirus replication and host specificity.
Similarly, an OOI encoded between the matrix and the glycoprotein in
Rhabdoviridae (
Negarnaviricota [
53]) members was predicted to adopt an α/β fold with 4 β-strands and 2 α-helices but without conclusive Dali hits (
Fig. S16). Orthornavirus structurome comparison showed similarity to another hypothetical protein of
Rhabdoviridae (OA33_gp5 from NC_025382.1, a
Betapaprhavirus member; encoded between G an L; z-score: 6.1). The function of this OOI remains unclear as is the case for other additional smaller
Rhabdoviridae proteins that are encoded, for example, by members of the genus
Ephemerovirus (NC 028241;
Fig. S16C) (
54).
A papain-like protease
A viral OTU (vOTU) domain, a papain-like-fold thiol protease (
55), was identified in several families basal to
Deltaflexiviridae (
Tymovirales,
Kitrinoviricota) family of fungal viruses (
f.0208, f.0210, and
f.0212;
Fig. S17). Previously, a vOTU protease was identified in some members of
Tymovirales (
55), and subsequently, in
Deltaflexiviridae (
1), but not in the newly discovered basal provisional families. The present structure comparison confidently extended the spread of the vOTU protease to these families. The catalytic Cys-His dyad is conserved in all these viral proteins indicating that they are active proteases (
Fig. S17A). In plant and animal viruses, vOTU domains function as deubiquitinylases implicated in immune evasion (
56), and a similar role can be envisioned for the vOTUs of
Deltaflexiviridae and their basal relatives.
An SJR domain
In the same vein, we identified the SJR capsid protein in the virus family
f.0198 of the same order,
Tymovirales (
Fig. S18). Similar capsid proteins were also found for the families
f.0194, f.0218, and
f.0217. Based on their overall relationship to other members of
Tymovirales, identification of capsid proteins in these families should have been expected, but profile comparison failed to detect these proteins due to sequence divergence.
ORFan refinements
Apart from the discovery of new domains, ORFan refinements were mainly achieved for capsid proteins, for example, for nine distinct OOI representatives in Leviviricetes, previously unannotated capsid proteins homologous to the typical levivirus capsid proteins were identified across various families such as Steitz-, Atkins-, Blume-, Solspi-, Fiers-, and Duinviridae but also for related putative new families in which no capsid has been so far reported (e.g., f.0361.base-Solspi and f.0367.base-Atkins). Other refinements included Rhabdoviridae matrix protein, phytoreovirus core-P7 dsRNA-like-binding domain in Cystoviridae, methyltransferases, E proteins or RNAi suppressor proteins in Arteriviridae, Flaviviridae, and Dicistroviridae and Mymonaviridae, respectively.