In a recent study posted to the bioRxiv* preprint server, researchers assessed the impact of mutagenesis on the severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) genome and the effects of mutations in the neutral regions and non-neutral regions of the genome.
Evolution is a function of factors such as selection and mutagenesis, the deconvolution of which can improve understanding of the rate of nucleotide (nt) substitutions (mutational spectrum). Assessments of mutational spectra can enable the characterization of mutational signatures and viral evolutionary changes. Ribonucleic acid (RNA) viruses evolve rapidly, and therefore, it is important to uncover the impact of mutations on codon, amino acid (aa), and nt compositions.
About the study
In the present study, researchers elucidated the mutational spectrum of the SARS-CoV-2 genome.
A total of 4,339,984 SARS-CoV-2 genomic sequences were obtained on 14 October 2021 from the GISAID (Global Initiative on Sharing All Influenza Data) database. Low-quality sequences (those containing uncertain nt or shorter than 29,001 nt) and duplicates were filtered out, and the remaining 1,139,387 sequences were aligned with the reference (Wuhan-Hu-1 strain) genome.
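The filtering step described above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the function name `filter_genomes`, the `(name, sequence)` input format, and the reading of "uncertain nt" as any character other than A, C, G, or T are all assumptions.

```python
def filter_genomes(records, min_length=29001):
    """Keep sequences that are long enough, unambiguous, and not duplicated.

    records: iterable of (name, sequence) pairs (hypothetical input format).
    min_length: sequences shorter than this are dropped (29,001 nt in the study).
    """
    allowed = set("ACGT")  # assumption: anything else counts as an uncertain nt
    seen = set()           # sequences already kept, used to drop exact duplicates
    kept = []
    for name, seq in records:
        seq = seq.upper()
        if len(seq) < min_length:
            continue  # too short
        if not set(seq) <= allowed:
            continue  # contains an ambiguous or unknown character
        if seq in seen:
            continue  # exact duplicate of a sequence already kept
        seen.add(seq)
        kept.append((name, seq))
    return kept
```

For real GISAID exports, the records would first have to be parsed from FASTA; that step is omitted here.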
The SARS-CoV-2 genomes were categorized as follows: (i) early genomes obtained between December 2019 and March 2020, (ii) intermediate genomes obtained in October 2020, and (iii) late genomes obtained between September 2021 and October 2021. A phylogenetic tree was built from the aligned SARS-CoV-2 genomic sequences (n=203,045) and pruned to 54,521 nodes, and the ancestral genomes at the nodes of the pruned tree were reconstructed.
All possible variants of single nt substitutions were counted and adjusted to the expected numbers of nt substitutions obtained from the reference Wuhan-Hu-1 sequence. Three mutational spectra of the SARS-CoV-2 genome were reconstructed, namely, ALL (representing all mutations), SYN (synonymous mutations), and SYN4F (synonymous mutations in four-fold degenerate regions). To assess the stability of the mutational spectrum, the spectra were reconstructed at nine different time points during the coronavirus disease 2019 (COVID-19) pandemic.
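One way to picture the counting and adjustment step is the sketch below. It assumes that "expected numbers" means the number of sites in the reference that carry the ancestral base, and that the result is rescaled so the 12 components sum to 1. The function name `mutational_spectrum`, the input format, and the RNA alphabet are illustrative choices, not taken from the preprint.

```python
from collections import Counter
from itertools import permutations

BASES = "ACGU"  # assumption: sequences are written in the RNA alphabet


def mutational_spectrum(mutations, reference):
    """Return a normalized 12-component spectrum.

    mutations: iterable of (ancestral_base, derived_base) pairs.
    reference: reference genome string used to count available sites.
    """
    observed = Counter(mutations)
    base_counts = Counter(reference)
    rates = {}
    for ref, alt in permutations(BASES, 2):  # the 12 possible substitutions
        expected = base_counts[ref]          # sites where ref > alt could occur
        rates[(ref, alt)] = observed[(ref, alt)] / expected if expected else 0.0
    total = sum(rates.values())
    # Rescale so that the 12 components sum to 1.
    return {key: (value / total if total else 0.0) for key, value in rates.items()}
```

Restricting `mutations` to synonymous changes, or to synonymous changes at four-fold degenerate sites, would give analogues of the SYN and SYN4F spectra under the same assumptions.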
The three-nt context that resulted in the 192-component mutational spectrum was analyzed and the expected 12-component neutral nt composition was obtained. Alterations in nt composition were analyzed for the entire SARS-CoV-2 genome and comparative assessments for Coronaviridae and other single-stranded positive RNA [(+)ssRNA] viruses were performed based on codon use and RNA-dependent RNA polymerase (RdRP)-specific aa compositions by principal component analysis (PCA).
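The figure of 192 components follows from combining the 12 possible single-nt substitutions with the 4 possible 5' neighbors and 4 possible 3' neighbors (4 × 4 × 12 = 192). The short sketch below enumerates them; the `A[C>U]G` label format and the function name are illustrative assumptions.

```python
from itertools import permutations, product

BASES = "ACGU"


def context_spectrum_keys():
    """List all substitution types in a three-nt context, e.g. 'A[C>U]G'."""
    keys = []
    for five_prime, three_prime in product(BASES, repeat=2):  # 16 neighbor pairs
        for ref, alt in permutations(BASES, 2):               # 12 substitutions
            keys.append(f"{five_prime}[{ref}>{alt}]{three_prime}")
    return keys
```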
Computer simulations were performed for quantitative estimation of the expected nt composition at the neutral sites. Based on the expected aa substitution trajectories, aa were classified as losers [alanine (Ala), proline (Pro), glycine (Gly), arginine (Arg), serine (Ser, AGX), threonine (Thr), glutamine (Gln), glutamic acid (Glu), histidine (His), and aspartic acid (Asp)], intermediate [cysteine (Cys), leucine (Leu, CUX), Ser, valine (Val), and tryptophan (Trp)], gainers [Leu (UUX), phenylalanine (Phe), methionine (Met), tyrosine (Tyr), and isoleucine (Ile)], and neutral [lysine (Lys) and asparagine (Asn)]. Changes in aa in the Omicron variant relative to the reference strain were analyzed.
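As a rough illustration of how an expected neutral nt composition can be derived from a 12-component spectrum, the sketch below iterates a simple deterministic model until base frequencies stop changing. It is not the authors' simulation: the update rule, the step size `dt`, the number of steps, and the function name are all assumptions, and the rates are treated as relative per-site substitution rates.

```python
def equilibrium_composition(rates, steps=20000, dt=0.01):
    """Approximate equilibrium base frequencies under a fixed spectrum.

    rates: dict mapping (ancestral_base, derived_base) to a relative rate.
    dt must be small enough that dt * (total rate out of any base) < 1.
    """
    bases = "ACGU"
    freq = {b: 0.25 for b in bases}  # start from a uniform composition
    for _ in range(steps):
        updated = {}
        for b in bases:
            # Flow into b from every other base.
            gain = sum(freq[x] * rates.get((x, b), 0.0) for x in bases if x != b)
            # Flow out of b to every other base.
            loss = freq[b] * sum(rates.get((b, y), 0.0) for y in bases if y != b)
            updated[b] = freq[b] + dt * (gain - loss)
        freq = updated
    return freq
```

With all rates equal, the composition stays uniform; raising only the C>U rate shifts the equilibrium toward U at the expense of C, which is the qualitative behavior the article describes for neutral sites.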
Results
The reconstructed SARS-CoV-2 mutational spectra were highly G>U and C>U biased, indicating that the SARS-CoV-2 genome is U-enriched in weakly constrained sites. Neutral nt sites were U-saturated, indicative of ancestral viral exposure to comparable mutational pressure in the past and that significant changes are not expected in the future since the genomic system has reached equilibrium.
However, the nonsynonymous mutations evolved gradually toward equilibrium by replacing CG-rich aa (losers) with U-rich aa (gainers), resulting in a deficit of loser aa and an excess of gainer aa, a pattern observed across all Coronaviridae viruses (except Omicron). In contrast, Hepacivirus C, Zika virus, and Bastrovirus BAS-1 showed an excess of loser aa. Synonymous mutations were proximal to the compositional genomic equilibrium.
The team proposed a butterfly effect, i.e., minor alterations in mutational spectra would, over the long term, lead to remarkable alterations in aa composition. They hypothesized that the butterfly effect would be most prominent among species with asymmetric mutational spectra, relaxed selection, and high mutation rates.
High-quality nt mutations (n=542,768) comprised nonsynonymous mutations (n=314,538), synonymous mutations (n=206,319), and synonymous mutations at four-fold degenerate sites (n=92,734). The mutational spectra of SARS-CoV-2 were asymmetrical and U-directed, with excess C>U mutations accounting for 37%, 25%, and 35% of ALL, SYN, and SYN4F mutations, respectively.
More NNU codons demonstrated mutational spectra asymmetry than NNC and NNG codons and more NNA codons than NNG codons. The ALL spectra increased in asymmetry and directionality during the pandemic. There were no significant trends in codon use changes between early genomes and late genomes during COVID-19.
The nt composition of mutated and neutral sites (especially SYN4F sites) was close to the expected 12-component-based nt equilibrium, and whole-genome nt compositional changes followed the mutation bias, indicating that mutagenesis is a strong determinant of SARS-CoV-2 evolution. SARS-CoV-2 mutations were virus-specific and not host-specific.
Overall, the study findings showed a butterfly effect, i.e., tuning of protein spaces via permissive aa trajectories by mutation bias which can be common across species with biased mutagenesis, and that species-specific mutational spectra and species-specific aa compositions are interlinked.
*Important notice
bioRxiv publishes preliminary scientific reports that are not peer-reviewed and, therefore, should not be regarded as conclusive, used to guide clinical practice/health-related behavior, or treated as established information.