We discovered the crAssphage genome by serendipity, using nothing but bioinformatics analysis. We were reanalyzing some data published in Nature using our new tool, cross-assembly (crAss), and found what appeared to be a single contig representing an approximately 97 kb phage genome, which we called crAssphage. It was surprising because the original authors had overlooked it, and also because once we knew what we were looking for, we could find the genome in lots of samples, especially in human fecal metagenome samples. By the time we published the paper describing that work in Nature Communications, we had looked at all publicly available metagenomes (2,944 at that time, including 466 fecal metagenomes). We found crAssphage in 73% of the human fecal metagenomes that were publicly available.
The paper created a lot of interest, admittedly in part due to our arguably peculiar choice of name for the phage, which to our knowledge was the first to be named after a computer program. We talked about crAssphage at phage, bioinformatics, genomics, and microbiology meetings, and at other local and international scientific meetings across the world, and several questions came up repeatedly from the people we talked to. Is it real, or some kind of artifact? Where does it come from? And is there any evidence that it affects either the health or disease status of humans?
To answer these questions, we decided to call in the help of the awesome global phage community. We started a new project to investigate where crAssphage could be found, which turned out to be bigger and more collaborative than we could have imagined.
Faced with an obvious challenge, and a straightforward solution, in mid-2015 we designed a dozen pairs of primers to amplify crAssphage.
By that time, we had crAssphage sequences from 226 metagenomes, and we used those to identify conserved regions across the genome. Next, we needed some samples, so we turned to the South Bay International Wastewater Treatment Plant (WWTP). On our first visit to the WWTP, we were there to collect sewage influent, the raw material entering the sewage treatment plant, and we estimated that we needed between five and ten milliliters of influent. Fortunately, the plant manager had set aside 60 liters of influent for us that day. A little more than we needed!
We quickly found crAssphage sequences by PCR and sequenced them to confirm that we had found what we were looking for. Now we could optimize the PCR and perfect the technique. Alejandro Vega, an undergraduate student at SDSU, took the lead in developing the protocols that would soon be used worldwide.
Now that we had the techniques optimized, we decided to see if other scientists would be interested in our hunt for crAssphage. We emailed our friends and colleagues, phage email lists, and scientists we had not yet met, and we blogged and tweeted about our quest:
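The post does not say how the conserved regions were found, so the following is only a minimal sketch of one common approach, not the method we actually used: score each column of a multiple sequence alignment by how often its most common base occurs, then keep the windows where every column clears a threshold. The function names, the window size, the identity cutoff, and the toy sequences are all hypothetical.

```python
from collections import Counter


def column_conservation(aligned_seqs):
    """Fraction of sequences sharing the most common base at each alignment column."""
    length = len(aligned_seqs[0])
    if any(len(s) != length for s in aligned_seqs):
        raise ValueError("all aligned sequences must have the same length")
    scores = []
    for column in zip(*aligned_seqs):
        base, count = Counter(column).most_common(1)[0]
        # A column dominated by gaps is not a usable primer position.
        scores.append(0.0 if base == "-" else count / len(column))
    return scores


def conserved_windows(aligned_seqs, window=20, min_identity=0.95):
    """Return start offsets of windows in which every column meets min_identity."""
    scores = column_conservation(aligned_seqs)
    return [
        start
        for start in range(len(scores) - window + 1)
        if min(scores[start:start + window]) >= min_identity
    ]


# Toy alignment (hypothetical sequences, not crAssphage data).
toy = [
    "ACGTACGTAAGG",
    "ACGTACGTTAGG",
    "ACGTACGTCAGG",
]
print(conserved_windows(toy, window=4, min_identity=1.0))
```

Windows that pass a filter like this are only candidates. Real primer design also has to weigh melting temperature, GC content, and amplicon length, none of which this sketch considers.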
@BEDutilh and I are starting an global project to explore the phylogeography of #crassphage. Let us know if you want to join in!
— Rob Edwards (@linsalrob) October 17, 2015
We asked everyone to order the primers we had designed, collect some local samples, test them by PCR, and send us the DNA sequences. We sent the initial batch of emails out on Friday, October 16, 2015, and just three weeks later, on Friday, November 6th, 2015, Franklin Nóbrega in Stan Brouns' lab, then at Wageningen University, emailed us our first sequences.
We asked people to order their own primers because we didn’t want to send out primers and potentially share contamination. We did share a few sets of primers with people who could not afford to order them, and we sequenced a few PCR products upon request, but largely everyone graciously volunteered their money, time, and effort to support our somewhat crazy quest. We did offer them authorship on the paper, but we hadn’t written a word of it at that point, of course!
We were rapidly accruing data from our international team of volunteers, so we needed a way to share it back with the world and to add new data to our existing catalog as it came in. We used some ideas from computer science to overcome the technological challenges of constantly updating and adding data. First, we started a GitHub repository that allows us to keep track of all of the updates to our data. Since we started the repository on November 8th, 2015, we have made 794 updates to the data. For most of 2016 we were adding data, but through 2017 and 2018 we were analyzing the data and using the repository to keep track of those changes.
The repository has always been (and will remain) open access, and GitHub has tracked all the changes that we made over the last few years. You can download, explore, and analyze the data that we discuss in the paper.
Another key decision we made early on was to use a reproducible pipeline to analyze the data. We used a common computer science tool called make. You can now download all the data that we collected for this project and all the ways that we analyzed it, and, provided you have the tools installed, you can repeat any of the analyses we describe in the paper. Reproducible science!
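For readers who have not used make, its core idea is simple: a result is rebuilt only if it is missing or older than any of the files it depends on. The sketch below mimics that rule in Python purely as an illustration. It is not our pipeline, and the function names and file names are hypothetical.

```python
import os


def needs_rebuild(target, prerequisites):
    """Return True if target is missing or older than any prerequisite (make's rule)."""
    if not os.path.exists(target):
        return True
    target_mtime = os.path.getmtime(target)
    return any(os.path.getmtime(p) > target_mtime for p in prerequisites)


def build(target, prerequisites, recipe):
    """Run recipe only when the target is out of date; report whether it ran."""
    if needs_rebuild(target, prerequisites):
        recipe()
        return True
    return False
```

A call such as `build("tree.nwk", ["alignment.fasta"], make_tree)`, with a hypothetical `make_tree` function that writes `tree.nwk`, would rerun the analysis only after the alignment changes. That is what lets a pipeline be rerun cheaply every time new sequences arrive.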
As we said, the project grew and grew as more people became interested in joining and sharing their data. Three collaborations are especially worth mentioning. First, the European Union-funded COMPARE project delivered a wealth of information about crAssphage sequences from sewage sites around the world. Second, the Lifelines project provided an impressive population-based dataset that allowed us to investigate the association of crAssphage with human health and disease on a massive scale. Third, the National Science Foundation-funded project investigating the fecal microbiota of non-human primates provided the data that allowed us to discover ancient relatives of crAssphage in old-world and new-world monkeys, and apes. But most important by far were the many contributions to the project from laboratories around the world. Without their ongoing support, we would not have been able to pull this off and make the project such a success.
You can read the results in our paper, published in Nature Microbiology.