Know thy enemy: Resolving the SARS-CoV-2 gene set and understanding COVID-19 variants using evolution
Can comparative genomics resolve the protein-coding gene set of SARS-CoV-2? Yes! We compared 44 Sarbecovirus genomes to determine which SARS-CoV-2 candidate genes encode functional proteins and which mutations are likely functional, so future efforts can focus on real genes and important mutations.
In February 2020, as the COVID-19 pandemic spread ferociously around the world, I wondered what I could do to help understand and fight that scourge. As a research scientist at MIT, my expertise is in comparative genomics, specifically in using evolutionary signals to distinguish protein-coding regions of genomes. RNA viruses have tiny genomes (100,000 times smaller than the human genome), so I presumed that all the protein-coding genes in SARS-CoV-2, the virus that causes COVID-19, would already be well known. I thought I could instead build on a project I had helped Rachel Sealfon with several years ago. When Rachel was a PhD student in our lab, she had developed FRESCo1, a software tool for detecting when a protein-coding region of a viral genome has significantly fewer synonymous substitutions than the rest of the gene, indicating that it serves some overlapping function such as folding into an RNA structure or encoding another protein in a different reading frame. I contacted Rachel, and we set out to see what FRESCo could tell us about the SARS-CoV-2 genome. My P.I., Manolis Kellis, was supportive and eager to help. Rachel’s PhD co-advisor, Pardis Sabeti, put us in touch with two expert virologists, Jeremy Luban and Robert Garry, to give us feedback on the project.
The first step was choosing an appropriate set of genomes for comparison. We found 44 complete genomes (SARS-CoV-2, SARS-CoV, and 42 bat viruses) in the Sarbecovirus subgenus that were just the right evolutionary distance for this kind of analysis. The foundation of my work is a software tool called PhyloCSF2, which uses codon substitution frequencies in a multi-species genome alignment to distinguish protein-coding regions, so as a get-my-feet-wet exercise I used PhyloCSF to create tracks for viewing evolutionary protein-coding potential all along the genome in UCSC’s SARS-CoV-2 Genome Browser3. (At our recommendation, the team at UCSC used the same 44 strains for other comparative genomics tracks.) As expected, my PhyloCSF tracks showed a strong protein-coding signal in the known genes that cover most of the genome. Then I zoomed in on the end of the genome and… wait, what’s this? There was no signal at all in ORF10! (Figure 1) So maybe the gene content of this virus wasn’t completely resolved after all...
Investigating further, I found that different sources supplied different gene annotations for the virus. NCBI included ORF10, whereas UniProt did not, and UniProt included two genes overlapping the nucleocapsid gene in a different reading frame, ORF9b and ORF9c, that were not included by NCBI. Various papers showed two different open reading frames (ORFs) overlapping ORF3a, confusingly naming both of them ORF3b. Many other conflicts existed, many other genes were proposed, the databases didn’t agree -- the genome annotation was a mess!
Our mission was now clear: To help fight this Death Star, we needed to create an accurate map with which to navigate its genome. Our arsenal of comparative genomics techniques was well-suited to resolve the ambiguities. My PhyloCSF tracks showed a strong protein-coding signal in all the main genes except ORF10. That included ORF8, which had been lost in SARS-CoV during the SARS pandemic and whose individual nucleotides have not been well conserved, but PhyloCSF left no ambiguity: ORF8 has been protein-coding throughout most of the subgenus. Because the PhyloCSF signal can be difficult to interpret for overlapping genes, Rachel used FRESCo to see if ORF9b, ORF9c, or either ORF3b showed synonymous constraint indicative of regions encoding proteins in two different reading frames. ORF9b did (Figure 1) but the others did not.
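Why does an overlapping gene leave a footprint of synonymous constraint? A substitution that is silent in one reading frame usually changes an amino acid in the other, so selection on the second protein suppresses changes that would otherwise be neutral. The toy Python sketch below illustrates only that idea. It is not FRESCo, which fits statistical codon models across a whole alignment; the function name `classify_substitutions` and the example sequences are made up for illustration, and the code assumes uppercase, ungapped DNA and the standard genetic code.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def classify_substitutions(ref, alt, frame):
    """Classify each differing position as synonymous or nonsynonymous.

    `ref` and `alt` are equal-length DNA strings; `frame` (0, 1, or 2) is the
    offset of the first complete codon. A codon is classified as a whole, so
    if it contains several differences they all receive the same label.
    Positions outside a complete codon are skipped.
    """
    if len(ref) != len(alt):
        raise ValueError("sequences must have the same length")
    labels = {}
    for start in range(frame, len(ref) - 2, 3):
        ref_codon, alt_codon = ref[start:start + 3], alt[start:start + 3]
        if ref_codon == alt_codon:
            continue
        same_aa = CODON_TABLE[ref_codon] == CODON_TABLE[alt_codon]
        kind = "synonymous" if same_aa else "nonsynonymous"
        for offset in range(3):
            if ref_codon[offset] != alt_codon[offset]:
                labels[start + offset] = kind
    return labels


# Hypothetical sequences: a single T->C change at position 2.
ref = "GCTGAAACC"
alt = "GCCGAAACC"
print(classify_substitutions(ref, alt, frame=0))  # GCT->GCC, Ala->Ala
print(classify_substitutions(ref, alt, frame=1))  # CTG->CCG, Leu->Pro
```

The same nucleotide change is silent when read in frame 0 but alters the protein when read in frame 1, which is the situation a region encoding two proteins finds itself in.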
Finally, we wondered if there were other SARS-CoV-2 genes yet to be discovered, so I used PhyloCSF to calculate the evolutionary protein-coding potential of every ORF in the whole genome. When I looked at the Sarbecovirus alignment of the best-ranked candidate using CodAlignView4, a tool we had developed previously that color-codes differences between genomes to highlight protein-coding signatures, I found a striking pattern: the start and stop codons were almost perfectly conserved, and a large fraction of the differences were synonymous ones, which preserve the amino acid sequence (Figure 2). When I compared to Rachel’s FRESCo output, I saw that this ORF coincided almost exactly with a region of ORF3a in which synonymous substitutions in the ORF3a reading frame were suppressed. These are exactly the signals that we expect for an overlapping protein-coding gene. Moreover, the ORF was near the start of ORF3a, where it could be translated by “leaky scanning”, whereby a ribosome misses the start codon of ORF3a and instead begins translating at a downstream AUG codon. We had discovered a novel protein-coding gene in SARS-CoV-2! We called it ORF3c. While we were conducting this investigation, three other groups had independently discovered the same gene, two using the signal of synonymous constraint5,6 and a third using ribosome profiling, a technique that determines which parts of an RNA molecule are translated into protein7.
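For readers curious what "every ORF in the whole genome" means in practice, here is a minimal Python sketch of enumerating candidate ORFs before scoring them. It makes several assumptions that may differ from the analysis described above: an ORF is taken to run from the first ATG after the previous stop codon to the next in-frame stop, only the three forward frames are scanned (the genome is positive-sense RNA, written here as DNA), and the name `find_orfs` and the `min_codons` cutoff are illustrative. Scoring each candidate with PhyloCSF is a separate step not shown here.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(genome, min_codons=10):
    """List ATG-to-stop ORFs in the three forward reading frames.

    Returns (start, end, frame) tuples using 0-based coordinates, with `end`
    exclusive and including the stop codon. For each stop codon only the
    longest ORF is reported, i.e. the one beginning at the first in-frame ATG.
    `min_codons` counts codons before the stop codon.
    """
    orfs = []
    for frame in range(3):
        start = None
        for pos in range(frame, len(genome) - 2, 3):
            codon = genome[pos:pos + 3]
            if codon == "ATG" and start is None:
                start = pos  # first start codon since the previous stop
            elif codon in STOP_CODONS:
                if start is not None and (pos - start) // 3 >= min_codons:
                    orfs.append((start, pos + 3, frame))
                start = None  # reset and look for the next ORF in this frame
    return orfs


# Hypothetical 18-nucleotide example with one short ORF in frame 2.
toy_genome = "CC" + "ATG" + "GCT" * 3 + "TAA" + "G"
print(find_orfs(toy_genome, min_codons=4))
```

Because only the first ATG is recorded, a candidate translated from a downstream AUG in the same frame would need a variant of this loop that reports every in-frame start.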
We ended up with a clean list of protein-coding genes in SARS-CoV-2 that are conserved in the Sarbecovirus subgenus (Figure 3). That still left open the possibility that some genes had arisen de novo in SARS-CoV-2 that are not present in other sarbecoviruses. To investigate this, we checked experimental datasets and incorporated evidence from mutations, such as a premature truncation of one ORF in many isolates, militating against it being protein-coding. Overall, we found no evidence of protein-coding function for any of the other candidates.
We next turned to classifying which of the thousands of SARS-CoV-2 mutations that have occurred during the COVID-19 pandemic are most likely to affect the virus's function, based on their evolutionary dynamics across the 44 related strains. In particular, we were able to classify likely-functional versus likely-neutral mutations in each of the “variants of concern”, which are variants that are thought to increase transmission or decrease the effectiveness of previous immunity. Comparing the broader view of evolution within the Sarbecovirus subgenus to the zoomed-in view of evolution within the SARS-CoV-2 virus led us to several surprising hypotheses. Specifically, we speculated that the mutation encoding D614G, which occurred early in the pandemic and quickly spread to become the dominant form of the virus, is likely to revert to the D form later in the pandemic, and we found evidence that ORF8 contributes to within-individual fitness but not person-to-person transmission.
While all this was happening, I found myself repeatedly annoyed by the use of the name ORF3b to refer to two completely different ORFs. “Someone ought to fix that”, I thought. Well, if you want something done… I contacted some of the relevant researchers to see if we could agree on standardized names. One of them, Andrew Firth, with whom I had previously collaborated, introduced us to coronavirus luminaries Alexander Gorbalenya and John Ziebuhr, who serve on the Coronaviridae Study Group of the International Committee on Taxonomy of Viruses. We agreed to reserve the name ORF3b for the partial homolog of SARS-CoV ORF3b and to use the name ORF3d for the other “ORF3b”. We also resolved conflicting usage of several other names, resulting in a paper bringing the community together9. Collaborating with this illustrious group of researchers was very educational for me, and much of what I learned fed back into improving both the content and presentation of the original 44-Sarbecovirus manuscript8.
Our paper is here: doi:10.1038/s41467-021-22905-7
1. Sealfon, R. S. et al. FRESCo: finding regions of excess synonymous constraint in diverse viruses. Genome Biol. 16, 38 (2015). doi:10.1186/s13059-015-0603-7
2. Lin, M. F., Jungreis, I. & Kellis, M. PhyloCSF: a comparative genomics method to distinguish protein coding and non-coding regions. Bioinformatics 27, i275–82 (2011). doi:10.1093/bioinformatics/btr209
3. Fernandes, J. D. et al. The UCSC SARS-CoV-2 Genome Browser. Nat. Genet. 52, 991–998 (2020). doi:10.1038/s41588-020-0700-8
4. Jungreis, I., Lin, M. F., Chan, C. S. & Kellis, M. CodAlignView: The Codon Alignment Viewer. https://data.broadinstitute.org/compbio1/cav.php (2016).
5. Cagliani, R., Forni, D., Clerici, M. & Sironi, M. Coding potential and sequence conservation of SARS-CoV-2 and related animal viruses. Infection, Genetics and Evolution 83, 104353 (2020). doi:10.1016/j.meegid.2020.104353
6. Firth, A. E. A putative new SARS-CoV protein, 3c, encoded in an ORF overlapping ORF3a. Journal of General Virology (2020). doi:10.1099/jgv.0.001469
7. Finkel, Y. et al. The coding capacity of SARS-CoV-2. Nature (2020). doi:10.1038/s41586-020-2739-1
8. Jungreis, I., Sealfon, R. & Kellis, M. SARS-CoV-2 gene content and COVID-19 mutation impact by comparing 44 Sarbecovirus genomes. Nat. Commun. (2021). doi:10.1038/s41467-021-22905-7
9. Jungreis, I. et al. Conflicting and ambiguous names of overlapping ORFs in SARS-CoV-2: A homology-based resolution. Virology (2021). doi:10.1016/j.virol.2021.02.013