"Any sufficiently advanced technology is indistinguishable from magic."
- Arthur C. Clarke
A new kid on the block
Disruptive technologies are defined by their ability to displace an established technology and/or create a completely new industry. A timely example is nanopore sequencing, an innovative new technology that has captured the imagination of many across diverse scientific communities. The breadth and depth of its potential applications are ever increasing and being able to contribute a small piece of work to this field has been an incredibly rewarding experience.
So what, you might ask, is RNA sequencing using nanopore arrays? Put simply, it is the ability to directly sequence polyadenylated RNAs from their 3’ poly(A) tail to their 5’ cap, and to do so without any requirement for amplification or recoding that might otherwise bias the resulting data (direct RNA sequencing). As the technology develops, additional information such as detection of base modifications and poly(A) tail length estimates are rapidly coming within reach. Oh and the sequencing unit (MinION) is highly portable and fits in the palm of a hand.
We applied this methodology to the study of the herpes simplex virus type 1 (HSV-1) transcriptome, reasoning that the intrinsic limitations of older experimental approaches may have obscured layers of complexity that would challenge existing dogma about virus-host biology. This complexity was rapidly evidenced by the sequencing of full length polyadenylated RNAs which provided us with the ability to simultaneously identify transcription initiation sites, RNA cleavage sites, novel and alternative transcript isoforms, and – unexpectedly – also identify a new class of viral fusion transcripts encoding detectable proteins of unknown function! All of this is comprehensively detailed in our new paper, published in Nature Communications, which you can read here: Direct RNA sequencing on nanopore arrays redefines the transcriptional complexity of a viral pathogen
Fig. 1: Comparison of direct RNA nanopore sequencing to Illumina sequencing. HSV-1 genome wide sliding window (100 nt) coverage plots of poly(A) RNA sequenced by nanopore (black) and Illumina (red) technologies. Nanopore reads represent a single mRNA, directly sequenced, while Illumina reads are derived from highly fragmented mRNA. Illumina data (red dotted line) were normalized (red solid line) to produce the same overall coverage as the nanopore data. The HSV-1 genome is annotated with all canonical open reading frames (ORFs) and coloured according to kinetic class (green - immediate early, yellow - early, red - late, grey - undefined). Multiple ORFs are grouped in polycistronic units and these are indicated by black hatched boxes. The y-axis represent absolute read depth counts. Inset windows (blue hatched boxes) exemplify the 3’ bias inherent to direct RNA-seq (due to sequence reads being generated 3’ -> 5’) that is less prevalent in Illumina data.
Coming on in leaps and bounds
My first ever nanopore direct RNA-Seq run was performed back in the heady days of spring 2017. I remember being ecstatic about getting a shade over 10,000 sequence reads that I could align to a viral reference genome and begin examining the complexity of its transcriptome. Fast forward to winter 2017 and I have moved continents (London -> New York) to join a new lab, switched viruses (VZV -> HSV-1), and increased my sequencing output to over 500,000 sequence reads per run. This increase in capacity changed everything by providing incredible sequencing depth across the HSV-1 transcriptome.
The next major challenge was to overcome the high error rate in nanopore sequencing by getting creative with error-correction. While many better and faster error-correction approaches will undoubtedly be developed in due course, the absence of any such tools during this study required a customised approach (Fig. 2) that took a long time to optimise.
The primary advantage of our error-correction approach was that we were able to rescue unmapped reads, improve splice site detection, and reduce soft-clipping of read alignments – the latter allowing us to rescue bases that would otherwise be discounted due to (erroneously) poor alignment and significantly extend the mapped portion of a sequence read in both 5’ and 3’ directions.
Fig 2: Error-correction and generation of pseudo-transcripts to overcome sequencing errors inherent to the nanopore method. Raw nanopore reads include numerous indel and substitution errors that hinder the identification of encoded ORFs and thereby impede annotation of the transcriptome. Illumina datasets generated from the same material allowed error-correction using proovread. Subsequently the transcript start / stop positions and internal splice positions were used to generate what we termed pseudo-transcripts free of indel and substitution errors that permit unambiguous ORF prediction. Changes in CIGAR string lengths for a representative read are shown for each step of correction. CIGAR strings denote the alignment of a sequence read to the reference genome with M (match), I (insertion in sequence read relative to reference), D (deletion in sequence read relative to reference), and N (skipped region, intron). Note how error correction reduces the number of I and D characters, instead extending the length of match (M) regions.
The complex nature of simple truths
Our data point to the HSV-1 transcriptome being far more complex than previously realised. Similar observations have also been made for related herpesviruses including cytomegalovirus, Epstein-Barr virus, and the Kaposi sarcoma herpesvirus. Excitingly, and frustratingly, these observations require the wider herpesvirus community to delve back in to the literature and potentially re-evaluate many of the studies aimed at understanding the functions of specific viral genes. Knocking out a gene and describing its phenotype remains a favoured approach for interrogating viral gene function but interpreting this data is complicated where multiple distinct transcripts span the knockout region and may be vulnerable to disruption. The question now is whether parallel leaps in genome editing technologies, such as CRISPR-Cas9, can allow us to refine in vitro and in vivo analyses by, for instance, targeting transcription initiation sites for disruption – eschewing the need for cruder gene deletions.
I can’t finish this without making a special mention of my co-authors and mentors. To be given the opportunity to join a new lab and immediately go off on my own tangent with a new unproven technology required a tremendous amount of faith and support and I cannot thank my two mentors, Ian Mohr and Angus Wilson, enough. As for the co-authors, their contributions were all critical whether enabling me to chase up the biological implications of our findings or helping me to resolve some remarkably complex artefacts in earlier sequencing runs. Good science always requires good collaborators and I have some of the very best.
Written by Daniel P. Depledge