The paper in Nature is here: http://go.nature.com/2zdMsdx
Today we are thrilled to publish the first phase of the Earth Microbiome Project in Nature: "A communal catalogue reveals Earth's multiscale microbial diversity". The "we" in this statement represents a vast number of individuals whose collective efforts have led to this milestone. It is our hope that many more people will join in the next phases of the project. But today I want to look back on the contributions of so many who brought us to this point.
In July 2010, a group of 26 researchers convened for the Terabase Metagenomics Workshop in Snowbird, Utah. Meeting leader Rick Stevens posed a simple but bold question to the group: "What could you learn about microbial ecology if you had a trillion-base-pair sequencing run?" The researchers surmised they could run amplicon sequencing and metagenomes for 200,000 samples, and the vision of the Earth Microbiome Project (EMP) was born. The only scientists crazy enough to follow up on the idea were Jack Gilbert, Janet Jansson, and Rob Knight, who became the project's founders.
Jack, Janet, and Rob sent out the call to microbial ecologists: send us your proposals for samples to be profiled by 16S ribosomal RNA amplicon sequencing. Provided there was a valid study design with good metadata, the EMP labs would extract DNA and sequence the samples using a standard protocol. The samples started to pour in from hundreds of researchers in dozens of countries. From the Arctic Circle to Antarctica, the labs received water, soils, sediments, lots of swabs, and (of course) poop.
Fast forward to June 2012. Jack was speaking about the EMP at King Abdullah University of Science and Technology (KAUST) in Saudi Arabia, where I happened to be a postdoc at the time. I had dinner with Jack and was excited about the project, but I didn't think much more about it after that. In August of that year, I attended the ISME meeting in Copenhagen, and Rob gave a talk about the EMP. The project was audacious, but the community was clearly excited about it. Later that year, Rob also visited KAUST, and I had a chance to introduce myself. I expressed an interest in working in his lab, and the following September I started a postdoc in the Knight Lab.
When I arrived in Boulder, I had never worked with amplicon data, didn't know how to use QIIME, and didn't know Python—all critical parts of doing bioinformatics in the Knight Lab. The EMP was a successful project by then, generating data from thousands of samples for dozens of individual studies, but the vision of combining these data into a single analysis was not yet realized. However, the project started to gain momentum after the Knight Lab moved to the University of California San Diego, where we assembled a core analysis team, with help from additional researchers around the United States. Because of the sheer scale of the dataset, we limited our initial meta-analysis to the first 97 studies—this was still 27,751 samples and over 2 billion sequences!
We quickly realized that nearly every software tool we used had to be rewritten to handle the scale of the EMP dataset. This included the indispensable QIIME software (led by Greg Caporaso) and the associated online server and database Qiita (Antonio González, Jose Navas, Gail Ackermann, and others). Analysis of beta-diversity patterns required retooled versions of UniFrac (Daniel McDonald) and Emperor (Yoshiki Vázquez-Baeza). Additionally, an entirely new algorithm, Deblur (Amnon Amir and Daniel McDonald), was developed as an alternative to traditional OTU picking; it resolves exact sequences instead of clustering reads into OTUs, and it was central to the EMP meta-analysis. A search tool, Redbiom (Daniel McDonald), was developed so that researchers can query the EMP catalogue, searching by metadata values for particular samples or by sequences for their favorite microbes.
Simultaneously, the metadata from those tens of thousands of samples had to be wrangled to enable cross-study comparisons, a painstaking task. Luckily, I had been tapped to teach data analysis at the Scripps Institution of Oceanography, which forced me to finally master Python and the powerful data science package pandas. This gave me the tools to get a handle on the all-important metadata. In another coincidence, Jon Sanders and I had come up with an ontology (technically, a structured categorical variable) to classify the sample types of new samples coming into the project. We realized that this EMP Ontology (EMPO), which builds on the Environment Ontology, would also be a useful way to categorize existing EMP samples. We were continually amazed by how well it captured different measures of microbial diversity.
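To give a flavor of what that wrangling looks like, here is a minimal sketch in pandas. It is not the actual EMP code: the two toy studies, the column names (`raw_env`, `level_1` to `level_3`), and the category labels are made up for illustration and only loosely mimic the kind of hierarchy EMPO provides. See the paper for the real EMPO categories.

```python
import pandas as pd

# Hypothetical per-study metadata with inconsistent column names and labels.
study_a = pd.DataFrame({
    "sample_id": ["A1", "A2"],
    "env": ["Sea water", "soil "],
})
study_b = pd.DataFrame({
    "SampleID": ["B1", "B2"],
    "environment": ["SOIL", "animal gut"],
})

# Step 1: rename columns to one shared schema and stack the studies.
combined = pd.concat(
    [
        study_a.rename(columns={"env": "raw_env"}).assign(study="A"),
        study_b.rename(
            columns={"SampleID": "sample_id", "environment": "raw_env"}
        ).assign(study="B"),
    ],
    ignore_index=True,
)

# Step 2: normalize the free-text labels (whitespace and case).
combined["raw_env"] = combined["raw_env"].str.strip().str.lower()

# Step 3: map each label onto a three-level hierarchy.
# These labels are illustrative placeholders, not the published ontology.
HIERARCHY = {
    "sea water": ("Free-living", "Saline", "Water (saline)"),
    "soil": ("Free-living", "Non-saline", "Soil (non-saline)"),
    "animal gut": ("Host-associated", "Animal", "Animal distal gut"),
}
level_cols = ["level_1", "level_2", "level_3"]
lookup = pd.DataFrame.from_dict(HIERARCHY, orient="index", columns=level_cols)

# A left join keeps every sample; unmapped labels would show up as NaN,
# which makes them easy to find and fix by hand.
combined = combined.join(lookup, on="raw_env")

# Step 4: store the levels as categorical variables and summarize.
for col in level_cols:
    combined[col] = combined[col].astype("category")

counts = combined["level_1"].value_counts()
print(combined[["sample_id", "study"] + level_cols])
print(counts)
```

The point of the join (rather than a chain of if/else rules) is that the hierarchy lives in one small table that can be reviewed and extended as new sample types come in.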
We are only just scratching the surface of the "Earth microbiome", which of course is really a collection of countless microbiomes. However, with the framework introduced here, we are starting to get a handle on the factors driving the composition of microbial communities in different environments. We can now flip the question around and ask, "Where in the world is my favorite microbe found?" The EMP Trading Cards introduced in our paper (examples shown above) give a teaser of what this future might look like.
As we continue to add new samples from old and new studies alike, and expand our analyses to metagenomics and metabolomics, one thing is certain: the EMP will continue to be a collaborative and communal effort that everyone can take part in. We are excited to see where it leads!