INTRODUCTION
Betacoronavirus SARS-CoV-2 emerged in the human population in 2019, and it is the causal agent of the new pandemic disease COVID-19 (
1), with a death toll that is increasing at the time of this writing (
https://covid19.who.int/). Genetic variations in SARS-CoV-2 genomes (annotated in the GISAID [
https://www.gisaid.org/], PubMed [
https://www.ncbi.nlm.nih.gov/pmc/], and ENA data banks [
https://www.ebi.ac.uk/ena/browser/home], among others) affect nonstructural and structural protein-coding regions. Despite the short history of SARS-CoV-2 circulation, newly arising variants exhibiting different mutational patterns are being identified regularly. A distinction has been made between variants of interest (VOI), due to features with potential impact (such as transmissibility), and variants of concern (VOC), due to definite evidence of enhanced transmissibility (
https://www.who.int/en/activities/tracking-SARS-CoV-2-variants/). New SARS-CoV-2 variants are likely to become prominent as COVID-19 continues, despite natural or vaccine-induced immunity (
2–5). Likewise, the generation of viral escape mutants is a major concern as a potential limitation of immune and antiviral agent efficacy for SARS-CoV-2 (
6–10), as it has been established for other RNA viruses.
The first step in the diversification of viruses during their epidemic spread is the generation of variants within each infected host. This pattern of intrahost evolution results in the formation of mutant spectra that constitute reservoirs of genetic and phenotypic virus variants in the infected host (
11,
12). Studies with several RNA viruses have shown that viral intramutant spectrum complexity, estimated by the average number of mutations per genome, expressed by a series of diversity indices (Shannon entropy, maximum mutation frequency, Gini Simpson, nucleotide diversity, number of polymorphic sites, and number of haplotypes [
13,
14]), may have an impact on viral tropism, viral persistence, disease progression, and response to antiviral interventions (several cases have been described or reviewed in references
11 and
15–22). Evidence of quasispecies dynamics has been reported for SARS-CoV-2 (
23–29), as well as for other coronaviruses (
30–34). However, it is unclear how mutant spectrum complexity parameters of this emerging pathogen vary among different viral isolates and whether previously observed effects of mutant spectrum composition on RNA virus behavior apply also to SARS-CoV-2, particularly its connection with disease severity.
Two recent studies indicated that mutant spectrum complexity in SARS-CoV-2 from patients who developed severe disease is higher than that from patients with mild disease, analyzing either the spike (S)-coding regions (
35) or the entire genome with limited mutant spectrum resolution (
36). In the present study, we have examined mutant spectra of the nsp12 (polymerase)- and S-coding regions of SARS-CoV-2 present in 30 nasopharyngeal swab samples taken at the time of diagnosis of patients progressing toward disparate disease outcomes. Applying a 0.5% cutoff value for point mutation and deletion detection, using SeekDeep as bioinformatics platform, we found that virus from patients who developed mild disease exhibited a significantly higher mutant spectrum complexity than virus from patients who developed moderate or severe disease (exitus). The difference occurred in both the nsp12 (polymerase)- and S-coding regions. In contrast, no significant differences in the spectrum of minority deletions were observed among virus from the three patient categories (mild, moderate, or severe disease). Some amino acid substitutions found at low frequency in mutant spectra, including substitutions with low statistical acceptability and with potential functional effects, are nevertheless present in SARS-CoV-2 isolates recorded in data banks.
DISCUSSION
The UDS analysis of the nsp12 (polymerase)- and S-coding regions of 30 biological samples without cell culture passage confirmed the presence of complex SARS-CoV-2 mutant spectra in diagnostic nasopharyngeal samples of the virus (
23–28). Contrary to a previous conclusion with other patient cohorts (
35,
36), our quantifications show that in both the nsp12 (polymerase)- and the S-coding regions analyzed, there was a positive association between the number of point mutations and a mild disease manifestation in the corresponding patients. No such association was observed with the minority deletions that also populated the mutant spectra (
Fig. 2 and
3). There are several non-mutually exclusive mechanisms that may contribute to a larger average number of point mutations in samples from patients that developed mild disease than in those from patients with moderate or severe disease. One is that the major sites of replication of the virus may not be identical in the three groups of patients. Mutational input may be affected by a variety of host cell functions, including editing activities (
50), or as a consequence of the effects on polymerase fidelity of nonstructural viral proteins that participate in genome replication, as evidenced with other RNA viruses (
51–54). This possibility for SARS-CoV-2 is suggested by nonidentical preferred transition mutation types in the isolates, depending on the associated disease severity (
Table 1). A second influence may lie in a longer time of asymptomatic intrahost virus replication prior to the onset of mild symptoms and COVID-19 diagnosis. A prolonged replication time does not necessarily imply a larger viral load in the infected host, but it may entail an increase in the average number of variant genomes in the population. Another possibility is that bottleneck events, which may transiently reduce the number of mutations scored within mutant spectra, intervene with higher intensity in patients doomed to severe disease than in those developing mild disease. This may come about through the immune response that may partially suppress viral replication and that it is also part of the COVID-19 pathogenesis process (
55–57). Several possibilities may explain dissimilar conclusions with other studies; for example, (i) independent cohorts may have been infected by virus belonging to clades displaying nonidentical behavior, and (ii) there may have been methodological differences, such as in the criteria to classify patients according to COVID-19 symptoms, in the PCR-UDS resolution attained, or in the sample type taken for analysis (naso/oropharyngeal swabs versus nasopharyngeal aspirates), among others. The multiple factors that contribute to a mutant spectrum complexity beg for studies with other cohorts to try to clarify whether complexity of viral RNA in diagnostic samples responds to discernible virological parameters and whether UDS data might help predict disease evolution or response to treatment, as previously documented for hepatitis C (
58,
59).
We have focused the mutant spectrum analysis on two regions of the SARS-CoV-2 genome whose encoded proteins are likely subjected to widely different constraints. The nsp12 (polymerase) is involved in genome replication and transcription, and the S glycoprotein has a major role in virus attachment, fusion, and entry, as well as in defining the antigenic profile of the virus. A total of 41 different amino acid substitutions in nsp12 and 15 substitutions in S have been recorded in the 30 mutant spectra analyzed (Table S2 in
https://saco.csic.es/index.php/s/8GH5aJgritCjEx5). Normalization to the sequenced protein length gives an average frequency of nonsynonymous mutations of 8% for nsp12 and 6% for S in the mutant spectra. Three substitutions in S map in the receptor-binding domain (RBD). One of them, A475V (present at 26% frequency in virus from a patient who developed mild disease), reduced the sensitivity to several monoclonal antibodies (
60). S494P (dominant in virus from a patient who developed mild disease) was listed among the nine most frequent substitutions in a large-scale study of 506,768 SARS-CoV-2 isolates; it is considered a likely vaccine-escape substitution and is possibly also involved in increased transmissibility of some isolates of the alpha variant detected beginning September 2020 (
61–63) (
https://www.cdc.gov/).
Substitutions that are present at low frequency are predicted to have more drastic structural and functional effects, and some of them have been identified in the sequences compiled in the CoV-GLUE database (compare
Table 2 and Fig. S6 in
https://saco.csic.es/index.php/s/8GH5aJgritCjEx5).
It is likely that disruptive amino acid substitutions belong to defective or minimally replicating (very low fitness) genomes that either have a transient existence in the population or can be maintained at detectable levels by complementation (for example those with lesions incompatible with polymerization activity) (
64). Defective genomes need not represent a biological or evolutionary dead end. They can exert modulatory effects on the entire population (
65), and they also constitute a rich substrate for RNA recombination to rescue viable genomes that may become epidemiologically competent viruses.
Newly replicated genomes
in vivo may incorporate deletions as a result of limited processivity of the coronavirus replicase (
66,
67). Genomes with deletions may, on average, be subjected to stronger negative selection than genomes with point mutations, blurring differences in their frequency among samples from the three patient categories. This is likely to apply mainly to out-of-frame deletions that give rise to truncated proteins; for example, in the S-coding region, we have identified deletions (Δ)23,555 to 23,570, Δ23,555 to 23,582, and Δ23,561 to 23,582, which are located near the S1/S2 cleavage site and are expected to impair S function. Their maintenance to the point of reaching sufficient concentration to be detectable by UDS may reflect an efficiency of complementation of
trans-acting structural proteins higher than that of nonstructural proteins (
64). This may also explain the lower frequency of out-of-frame deletions in the nsp12 (polymerase)- than in the S-coding region. It has been proposed that defective S proteins generated around the S1/S2 cleavage site could potentially reduce the severity of the infection (
68).
All point mutations and deletions were found at frequencies below 30% in the corresponding mutant spectra. Several important biological and clinical features could influence the shape of SARS-CoV-2 mutant spectra. However, it should be considered that the large size of the coronavirus genome may limit the accumulation of mutations relative to that of less complex RNA genomes due to negative effects of mutations on fitness (
69). Not even the point mutation hot spots were found at frequencies above 1% in the quasispecies where they were present (compare
Fig. 2 and
7). This is compatible with hot spots reflecting sites where lesions are more tolerated within a generally constrained RNA genome. The fact that hot spots according to mutant spectra do not coincide with those defined by consensus sequences adds to other observations that indicate that residue conservation criteria at these two levels do not coincide (
70). The fact that the great majority of mutations in SARS-CoV-2 mutant spectra are present at low frequency may slow down the response of the virus to specific selective constraints such as inhibitors or neutralizing antibodies. Under this scenario, viral load may become more important to furnish genomes with mutations required to respond to the constraints (
71). Comparative measurements with different RNA viruses are needed to endorse these potential effects of mutant spectrum composition.
The higher percentage of transitions versus transversions, and of nonsynonymous versus synonymous mutations, is in agreement with previous reports of mutant spectrum and consensus sequence analyses of SARS-CoV-2 (
23,
24,
35,
68,
72). Some differences with previous studies have been observed in the preferred mutation types (Fig. S3 in
https://saco.csic.es/index.php/s/8GH5aJgritCjEx5). While in the mutant spectra of our cohort T to C was the most frequent point mutation, other studies reported C to T as the preferred mutation type (
72,
73). C to T was, however, the most frequent mutation in virus from the subset of exitus patients (Fig. S3 in
https://saco.csic.es/index.php/s/8GH5aJgritCjEx5), hinting at the possibility that in previous studies virus from patients with moderate and severe COVID-19 might have been overrepresented. The lack of dominance of C to U transitions in our samples is also reflected in the absence of depletion of amino acids A, H, Q, P, and T when considering all amino acid substitutions observed (
50,
74); the data of Fig. S3 in
https://saco.csic.es/index.php/s/8GH5aJgritCjEx5 show a net gain of 3 amino acids in the A, H, Q, P, and T subset. Another possible explanation for differences with previous studies could be that the previous studies focused on consensus sequences obtained from data bases covering the whole genome, whereas our results correspond to two specific genomic regions sequenced by UDS.
The six point mutations that altered the consensus sequence of the mutant spectra relative to that of the reference,
NC_045512.2 (identified as “Divergence” in the heat map of
Fig. 2 and in Table S2 in
https://saco.csic.es/index.php/s/8GH5aJgritCjEx5), allowed an estimate of the rate of accumulation of mutations in the SARS-CoV-2 consensus sequence. The time interval between our Madrid isolates (dated April 2020) and the reference Wuhan isolate (dated December 2019) was 4 months. Considering this time interval, the average rate of evolution calculated is (1.6 ± 0.6) × 10
−3 mutations per nucleotide and year (m/n/y), and it is only slightly higher than the average value from 10 previous studies: (1.2 ± 0.6) × 10
−3 m/n/y (range 9.9 × 10
−4 to 2.2 × 10
−3 m/n/y) (
73,
75–83). Higher evolutionary rates are frequently obtained as the time intervals between the virus isolations considered for the calculation become shorter (reviewed in reference
84). The values for SARS-CoV-2 are comparable to those reported for other RNA viruses, suggesting that constraints at the quasispecies level may not affect significantly evolutionary rates considered at the epidemiological level (
85). Our results hint at the possibility that SARS-CoV-2 evolving in patients exhibiting mild symptoms may contribute a majority of the variants that drive the high rates of evolution quantified at the epidemiological level.
ACKNOWLEDGMENTS
We thank all personnel in the Clinical Microbiology Department of the FJD for help with the sample and data collection. We thank all health care professionals who attended to COVID-19 patients and collected the clinical samples that were included in this study in a difficult moment of the COVID-19 epidemic in Spain. We thank José María Aguado and Octavio Carretero for their support to the whole project. We are indebted to Cristina Villaverde for her technical expertise and help with the samples. We thank J. Gregori and J. Quer for their contribution to the quasispecies analyses of HCV-infected samples.
This work was supported by Instituto de Salud Carlos III, Spanish Ministry of Science and Innovation (COVID-19 Research Call COV20/00181) and cofinanced by European Development Regional Fund “A way to achieve Europe.” The work was also supported by grants CSIC-COV19-014 from Consejo Superior de Investigaciones Científicas (CSIC), project 525/C/2021 from Fundació La Marató de TV3, PID2020-113888RB-I00 from Ministerio de Ciencia e Innovación, BFU2017-91384-EXP from Ministerio de Ciencia, Innovación y Universidades (MCIU), PI18/00210 and PI21/00139 from Instituto de Salud Carlos III, and S2018/BAA-4370 (PLATESA2 from Comunidad de Madrid/FEDER). This research work was also funded by the European Commission – NextGenerationEU (Regulation EU 2020/2094), through CSIC’s Global Health Platform (PTI Salud Global). C.P., M.C., and P.M. are supported by the Miguel Servet program of the Instituto de Salud Carlos III (CPII19/00001, CPII17/00006, and CP16/00116, respectively) cofinanced by the European Regional Development Fund (ERDF). CIBERehd (Centro de Investigación en Red de Enfermedades Hepáticas y Digestivas) is funded by Instituto de Salud Carlos III. Institutional grants from the Fundación Ramón Areces and Banco Santander to the CBMSO are also acknowledged. The team at CBMSO belongs to the Global Virus Network (GVN). B.M.-G. is supported by predoctoral contract PFIS FI19/00119 from Instituto de Salud Carlos III (Ministerio de Sanidad y Consumo) cofinanced by Fondo Social Europeo (FSE). R.L.-V. is supported by predoctoral contract PEJD-2019-PRE/BMD-16414 from Comunidad de Madrid. C.G.-C. is supported by predoctoral contract PRE2018-083422 from MCIU. B.S. was supported by a predoctoral research fellowship (Doctorados Industriales, DI-17-09134) from Spanish MINECO.