Thanks to genome-resolved metagenomics, it has become possible to determine the lifestyles of uncultivated microbes and their roles in microbial communities from different ecosystems. This involves extracting and sequencing DNA directly from environmental samples without a culturing step. Assembling the DNA sequences and binning the genome fragments into individual genomes is the final step in building a picture of the microbial community. The resolution is limited by the incompleteness of individual genomes, a problem that restricts conclusions about the metabolic capabilities of each organism and about the interactions among microbes.
The challenge is to reconstruct as many high-quality genomes as possible. Many different binning approaches exist, and each outperforms the competing binning tools in its own publication. But which of these strategies is best for reconstructing large numbers of high-quality genomes from metagenomes?
The answer to this question is not trivial. Five years ago the JGI supported an Emerging Technologies Opportunity Program (ETOP) project to tackle the problem. In a collaborative effort between the labs of Susannah Tringe and Jill Banfield, we tested and benchmarked many binning tools and approaches and applied them to metagenomic data from ecosystems of different levels of complexity, including high-complexity environments like soil. We found that the performance of automated binning methods varied considerably between ecosystems, between samples from the same ecosystem, and even in the ability to recover different genomes from the same sample. Additionally, many predicted bins were of low quality in terms of completeness and contamination. Manual binning approaches such as emergent self-organizing maps (ESOMs) can reconstruct high-quality genomes in some cases, but they are time-consuming and not feasible to apply to a large number of assemblies of high-complexity samples.
So the answer to the question of which binning approach performs best in all cases was: none.
So why not just stick to whichever tool performs best on each individual sample? Even if we took, for each sample, the binning prediction with the highest number of high-quality genomes, we would still be missing out: in many cases, tools that did not perform as well overall still predicted genomes that were not recovered by the best-performing tool for that sample. Besides, multiple tools will predict slightly different versions of the same genome, and deciding which version is best is not trivial.
To get the best bins for each sample, we came up with a dereplication, aggregation and scoring tool (DAS Tool), which identifies and selects high-quality bins from the pool of candidates produced by multiple predictions, in a non-redundant way. Using this strategy we were able to considerably increase the number of high-quality genomes recovered from publicly available data from human gut, oil seeps, and soil, compared to the results of any single prediction.
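As a rough illustration of the idea of selecting non-redundant bins from several predictions, here is a minimal Python sketch. It is not the DAS Tool implementation: the score (marker-gene completeness minus redundancy), the `min_score` cutoff, the greedy loop, and all names and toy data are assumptions made for this example.

```python
# Illustrative sketch only; not the DAS Tool implementation.
# The scoring function, the threshold, and the toy data are assumptions.

def score(contigs, contig_markers, n_markers):
    """Toy quality score: marker-gene completeness minus redundancy."""
    hits = [m for c in contigs for m in contig_markers.get(c, [])]
    unique = len(set(hits))
    completeness = unique / n_markers
    redundancy = (len(hits) - unique) / n_markers
    return completeness - redundancy


def select_bins(predictions, contig_markers, n_markers, min_score=0.5):
    """Greedily pick non-redundant bins from pooled predictions."""
    candidates = {name: set(contigs) for name, contigs in predictions.items()}
    selected = {}
    while candidates:
        # Highest-scoring remaining candidate (ties broken by name order).
        best = max(sorted(candidates),
                   key=lambda n: score(candidates[n], contig_markers, n_markers))
        if score(candidates[best], contig_markers, n_markers) < min_score:
            break
        chosen = candidates.pop(best)
        selected[best] = chosen
        # Remove the chosen contigs from every other candidate so that no
        # contig ends up in two selected bins; drop candidates left empty.
        for name in list(candidates):
            candidates[name] -= chosen
            if not candidates[name]:
                del candidates[name]
    return selected


# Hypothetical marker genes found on each contig (4 marker genes in total).
contig_markers = {
    "c1": ["m1", "m2"], "c2": ["m3"], "c3": ["m4"],
    "c4": ["m1"], "c5": ["m2", "m3"], "c6": ["m4"],
}

# Bins predicted by two hypothetical tools, A and B, pooled together.
predictions = {
    "A1": {"c1", "c2", "c3"},  # tool A: complete version of genome X
    "A2": {"c4"},              # tool A: fragment of genome Y
    "B1": {"c1", "c2"},        # tool B: partial version of genome X
    "B2": {"c4", "c5", "c6"},  # tool B: complete version of genome Y
}

selected = select_bins(predictions, contig_markers, n_markers=4)
for name in sorted(selected):
    print(name, sorted(selected[name]))
```

In this toy input each hypothetical tool recovers one genome completely and the other only partially, and the selection keeps the complete version of each, so the final set draws on both predictions.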
The Angelo Coast Reserve protects thousands of acres of the upper watershed of the South Fork of the Eel River in Mendocino County (Image on the left; Credit: Akos Kokai via Flickr, CC BY 2.0). The site where soil samples were taken for this study is shown on the right (Credit: Allison Sharrar).
While this research was ongoing, our colleague Alex Probst was investigating the microbial community of a high-CO2 cold-water geyser in Utah. Applying DAS Tool to 27 different metagenomic samples resulted in the reconstruction of a total of 2,216 genomes, including 104 different phylum-level lineages, which shed light on the different pathways used for CO2 fixation (Probst et al., 2017). This showed us that our approach scales up and can keep pace with falling sequencing costs, which are leading to higher sampling frequencies and greater sequencing depth.
Finally, our answer to the question about the best binning tool is: Apply a diversity of tools and select the best genomes using DAS Tool.
Crystal Geyser is a CO2-driven, cold-water geyser located in the Paradox Basin in Utah. (Credit: Cathy Ryan)
Christian M. K. Sieber, Alexander J. Probst, Allison Sharrar, Brian C. Thomas, Matthias Hess, Susannah G. Tringe & Jillian F. Banfield (2018). Recovery of genomes from metagenomes via a dereplication, aggregation and scoring strategy. Nature Microbiology. https://doi.org/10.1038/s41564-018-0171-1.