A CRISPR-Cas9 puzzle revealed by machine learning
The paper in Nature Communications is here: https://go.nature.com/2rKKfAQ
During my postdoc in the laboratory of Luciano Marraffini at the Rockefeller University I had the chance to work on the early developments of CRISPR-Cas9 technologies and in particular their application to modify the genome of bacteria or control gene expression. The catalytically dead variant of Cas9 known as dCas9 can be programmed to bind almost any gene of interest, but it just sits on the DNA instead of introducing a break as Cas9 would. Binding of dCas9 to DNA is strong enough to block the RNA polymerase when in the proper orientation (the guide must bind to the coding strand of the target gene), and the ease with which it can be reprogrammed makes it a fantastic tool to study the effect of silencing genes.
When I started my group at the Institut Pasteur 4 years ago, one of my first goals was to set up a genome-wide screen that exploits the properties of dCas9 to investigate the function of genes in E. coli in a systematic way. A fantastic postdoc called Lun Cui joined the lab to lead this project and obtained the first results within a year. A pooled library of ~10^5 guide RNAs expressed from a plasmid was introduced into an E. coli strain carrying dCas9 under the control of an inducible promoter. The effect of each guide on the growth of E. coli was measured by monitoring the relative abundance of guides in the population while inducing dCas9 expression.
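To make the readout concrete, here is a minimal sketch of how the depletion of a guide could be scored from sequencing read counts taken before and after dCas9 induction. The function name, the pseudocount and the example counts are illustrative assumptions, not the analysis pipeline used in the paper.

```python
import math

def log2_fold_change(counts_before, counts_after, pseudocount=1.0):
    """Per-guide log2 ratio of relative abundance after vs. before induction.

    counts_before / counts_after: dicts mapping a guide identifier to its
    read count. The pseudocount (an assumption of this sketch) avoids
    taking the log of zero for guides that dropped out entirely.
    """
    total_before = sum(counts_before.values())
    total_after = sum(counts_after.values())
    scores = {}
    for guide, before in counts_before.items():
        after = counts_after.get(guide, 0)
        freq_before = (before + pseudocount) / total_before
        freq_after = (after + pseudocount) / total_after
        scores[guide] = math.log2(freq_after / freq_before)
    return scores

# Hypothetical read counts: "guide_b" is depleted after induction.
before = {"guide_a": 99, "guide_b": 99}
after = {"guide_a": 199, "guide_b": 0}
scores = log2_fold_change(before, after)
```

A strongly negative score means the guide was lost from the population, which is the signature expected for guides that block an essential gene.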
We were pleased to observe that guide RNAs that target the coding strand of essential genes were rapidly depleted from the library as expected. The analysis of these data can reveal a lot of interesting information about essential and near-essential genes, which we started to investigate and will publish in another upcoming study. While working on this, we made the intriguing observation that some guides strongly impaired the growth of E. coli while binding in the wrong orientation (i.e. to the template strand). We and others had previously shown that dCas9 binding in this orientation only leads to moderate repression (~10-60% reduction in expression). Some essential genes likely do not tolerate even a moderate reduction in their expression level. However, we were really puzzled to see that a few non-essential genes including lpoB, a gene involved in cell-wall synthesis, were targeted by guides on the template strand which produced a fitness defect. It was unlikely that the moderate repression of a non-essential gene would be toxic, but I still called an expert in cell-wall synthesis to ask him whether what we observed with lpoB made any sense; it did not.
It became clear that lpoB was not involved in this phenotype when we deleted it from the genome of E. coli and still observed the same toxicity. The mystery deepened when we realized that this was not an isolated phenomenon. As many as 7% of the guides targeting the template strand of genes were toxic. Our attention was drawn to lpoB only because, by chance, two of the three guides that target its template strand show this phenomenon. Our first hypothesis to explain the toxicity of these guides was that they have off-target binding sites that block the expression of essential genes. This turned out to be correct for some of them, but not for the guides targeting lpoB. As a matter of fact, the vast majority of these unexpectedly toxic guides have no obvious off-target site in the chromosome of E. coli that can explain their toxicity.
We thus turned to a machine learning approach to investigate whether these toxic guides shared some sequence features. A neural network model using only the guide sequence as an input was able to predict toxicity with reasonable accuracy (Pearson's r = 0.54). This approach helped us identify the role played by the last 5 nucleotides of the guide RNA. There are 1024 (4^5) possible sequences of 5 nucleotides, and quite surprisingly more than 100 of them make guide RNAs toxic to E. coli, with phenotypes ranging from small colony formation to the complete inhibition of growth. The 3' end of the guide sequence is also known as the seed sequence, and we thus termed this toxicity the "bad seed" effect.
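As a small illustration of the numbers above, the sketch below enumerates all 5-nucleotide sequences, extracts the seed from a guide written 5' to 3', and one-hot encodes a guide. One-hot encoding is a common way of feeding a sequence to a neural network, but it is an assumption here: the post does not say how the sequences were encoded, and the example guide is made up.

```python
from itertools import product

NUCLEOTIDES = "ACGT"

def all_kmers(k):
    """Every DNA sequence of length k, in lexicographic order."""
    return ["".join(p) for p in product(NUCLEOTIDES, repeat=k)]

def seed(guide, length=5):
    """Last `length` nucleotides of a guide written 5' to 3'."""
    return guide[-length:]

def one_hot(guide):
    """Encode each base as a 4-element indicator vector (A, C, G, T)."""
    return [[1 if base == n else 0 for n in NUCLEOTIDES] for base in guide]

seeds = all_kmers(5)                      # 4**5 = 1024 sequences
example_guide = "ACGTACGTACGTACGTACGT"    # hypothetical 20-nt guide
example_seed = seed(example_guide)
encoded = one_hot(example_guide)          # 20 rows of 4 values each
```

Grouping guides by their seed in this way is what makes it possible to ask whether a given 5-nucleotide ending is associated with a fitness defect, independently of the rest of the guide.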
Five nucleotides of identity to a target gene is too little to have a substantial effect on gene expression, and these toxic guides can have several hundred possible target positions in the chromosome of E. coli, making this phenomenon particularly puzzling and hard to investigate. Over the following two years we tried to solve this riddle with several approaches, some of them reported in the manuscript, but they have yielded little insight so far. One in particular was to simply select mutants able to survive the toxicity of some guides and sequence them to look for mutations that could point towards the mechanism. In order to avoid selecting mutants that simply inactivated dCas9, we screened for clones that did not show the toxicity phenotype while maintaining a strong on-target repression. We could find such clones relatively easily and were very excited to sequence their genomes, our imaginations already running wild with hypotheses about what we would find.
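The "several hundred positions" figure can be checked with a back-of-the-envelope estimate. The sketch below counts sites where a 5-nucleotide seed is immediately followed by an NGG motif on either strand, and computes the number expected by chance. It assumes the NGG PAM of Streptococcus pyogenes Cas9, a genome of roughly 4.6 Mb and uniform base composition; none of these details are stated in this post, and the test sequence is made up.

```python
import re

COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]

def count_seed_sites(genome, seed, pam="GG"):
    """Count sites on both strands where `seed` is followed by N + `pam`."""
    # A lookahead is used so that overlapping sites are all counted.
    pattern = re.compile(
        "(?=" + re.escape(seed) + "[ACGT]" + re.escape(pam) + ")"
    )
    forward = sum(1 for _ in pattern.finditer(genome))
    reverse = sum(1 for _ in pattern.finditer(reverse_complement(genome)))
    return forward + reverse

def expected_seed_sites(genome_length, seed_length=5, fixed_pam_bases=2):
    """Expected number of sites by chance, counting both strands."""
    return 2 * genome_length / 4 ** (seed_length + fixed_pam_bases)

toy_genome = "TTACGTATGGAA"               # hypothetical sequence
observed = count_seed_sites(toy_genome, "ACGTA")
expected = expected_seed_sites(4_600_000)  # roughly 560 sites
```

With seven fixed bases (five from the seed, two from the PAM), a given seed is expected to occur a few hundred times in a genome of this size, which is why pinpointing the relevant binding site or sites is so difficult.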
When the results came back, we were terribly disappointed to see that the only mutations that we found were in dCas9 itself, and most were actually frameshift mutations. These mutants were most likely able to express low levels of dCas9 through ribosomal slippage, explaining how they were selected in our screen. Further heroic efforts by Lun Cui to isolate interesting mutants, with smarter screen designs and painstaking colony screening, only revealed more intricate ways in which E. coli can reduce the expression level of a protein. In the end, these experiments did not bring us any closer to finding the toxicity mechanism. What they did tell us, however, was that lowering the concentration of dCas9 could alleviate the toxicity while maintaining strong on-target repression. We thus performed the screen again, this time with a lower expression of dCas9, and obtained beautiful results. The effect of guide RNAs became much more consistent, and the "bad seed" effect was greatly alleviated. Altogether this study provided many insights into the properties of dCas9 in E. coli, and helped us formulate design rules that you will find in the discussion of the paper. Still, the mystery of the "bad seed" effect remains, but we will not throw in the towel so easily, so stay tuned...