We are proud to present a new phylogenetic tree of more than 10,000 bacterial and archaeal genomes, based on the protein sequences of a global set of 381 gene families. Our original goal was to provide an accurate and comprehensive reference phylogeny to improve microbiome research, which benefits heavily from (or relies on) hierarchical representations, i.e., trees, of the relationships among the numerous microbes in a community. We therefore designed a computational workflow to evenly sample complete and draft genome data from NCBI, such that the included biodiversity is maximized, and we used the most robust yet balanced combination of algorithm and data volume we could for tree building, reaching the limit of the servers at the San Diego Supercomputer Center (SDSC).
As we started to examine the product, we instantly noticed something unusual and were excited by it. Our tree appears to be a “whole”, with lineages gradually and continuously branching off from the center, leaving no apparent gaps that segregate clades into clusters (Fig. 1A). This impression is distinct from multiple previous works of the kind in the past dozen years (e.g., Ciccarelli et al. 2006, Mukherjee et al. 2017, and Castelle and Banfield 2018), in which the domains Bacteria and Archaea are connected by a branch so long that it visually bisects the entire dendrogram.
We thought about what these previous works have in common that differs from ours. It is the breadth of “marker” genes used for phylogenetic reconstruction. For decades, evolutionary biologists have usually adopted a small number of genes, many of them involved in the central dogma, to build the tree of life. The ribosomal genes are a widely adopted example (Ramulu et al. 2014). Admittedly, they carry multiple advantages: being universal, (usually) conserved, and (almost) free from horizontal gene transfer (HGT). What’s more, being functionally and spatially linked, they can be easily extracted from a metagenome and safely concatenated for tree building.
While using “core” genes is certainly a workable strategy, it is not ideal: they constitute less than one percent of the entire genome, and it’s hard to presume that they fully represent the evolution of the rest. Instead, we adopted the phylogenomics strategy, which allows one to take full advantage of “whole (meta)genome sequencing”, and is arguably more in line with the mission of modeling genome evolution. We sampled marker genes solely by sequence alignability, and did not discriminate by structure, function, or any a priori knowledge implying that certain genetic information needs to be treated specially in certain lineages.
The major challenge to this strategy is the prevalence of HGTs in the microbial world (Puigbò et al. 2009), which jumble up the “tree” structure across gene families. To untangle the discrepant evolutionary relationships among the global gene set (rather than handpicking a few genes and assuming they are HGT-free), we used a gene tree summary method, implemented in the software tool ASTRAL. This method estimates the evolutionary history of each gene independently, and then reconciles the different gene histories into a single “species tree”. It can be more robust when gene and species evolutionary histories diverge (Davidson et al. 2015). Furthermore, it quantifies the topological discordance between individual gene trees and the species tree, which reflects the intensity of non-vertical evolution in each gene family and hence allows us to assess its impact on any biological conclusion we attempted to draw.
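To make "topological discordance" concrete, here is a minimal Python sketch that compares a gene tree with a species tree. Please note the assumptions: it uses the Robinson-Foulds distance (the number of splits found in one tree but not the other) as a simple stand-in, which is not the quartet-based scoring ASTRAL itself uses; the nested-tuple tree format, the function names, and the five-taxon toy trees are all hypothetical and only for illustration.

```python
def leaves(tree):
    """Return the set of tip names under a nested-tuple tree."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaves(child) for child in tree))


def bipartitions(tree):
    """Collect the non-trivial splits of a tree, ignoring where it is rooted.

    Each split is stored as the side that does NOT contain an arbitrary
    anchor taxon, so the same split is always represented the same way.
    """
    all_taxa = leaves(tree)
    anchor = min(all_taxa)
    splits = set()

    def visit(node):
        if isinstance(node, str):
            return frozenset([node])
        clade = frozenset().union(*(visit(child) for child in node))
        side = all_taxa - clade if anchor in clade else clade
        # keep only informative splits (at least two taxa on each side)
        if 1 < len(side) < len(all_taxa) - 1:
            splits.add(side)
        return clade

    visit(tree)
    return splits


def rf_distance(tree_a, tree_b):
    """Robinson-Foulds distance: splits present in exactly one of the trees."""
    return len(bipartitions(tree_a) ^ bipartitions(tree_b))


# Hypothetical toy trees (not data from this work)
species = ((("A", "B"), "C"), ("D", "E"))
gene_same = (("A", "B"), ("C", ("D", "E")))    # same topology, different rooting
gene_hgt = ((("A", "D"), "C"), ("B", "E"))     # B and D swapped, as a transfer might do

print(rf_distance(species, gene_same))  # 0
print(rf_distance(species, gene_hgt))   # 4
```

A gene family whose tree scores high against the species tree under a measure like this would be a candidate for strong non-vertical evolution.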
The resulting species tree was compared with trees built by the conventional gene alignment concatenation method, and with trees built using alternative taxon, locus, and site sampling schemes and evolutionary models. In all tests, the branch between Bacteria and Archaea is consistently short when the global marker genes are used, in contrast to trees built from a concatenation of 30 ribosomal proteins. More specifically, the branch connecting Bacteria and Archaea is one order of magnitude shorter, and the evolutionary distance between extant bacterial and archaeal species is around one third of that in the ribosomal protein tree (Fig. 1).
We further measured the Bacteria-Archaea distance reflected by individual gene trees, which we had already computed during the tree summary workflow. For the vast majority of genes, this distance falls within a small and narrow range. However, for a few outliers, many of which are “core” genes such as rpoC, tuf and fusA1 (in addition to ribosomal proteins), this distance is several to ten times longer (Fig. 2).
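As an illustration of what a per-gene Bacteria-Archaea distance could look like, the sketch below averages the tip-to-tip path lengths between the two domains in a single tree. This is an assumption made for illustration: the post does not spell out the exact metric, and the child-to-parent dictionary, the function names, and the branch lengths here are all hypothetical.

```python
from itertools import product
from statistics import mean


def path_to_root(node, parent):
    """Map the node and each of its ancestors to its distance from the node."""
    dists = {node: 0.0}
    total = 0.0
    while node in parent:
        up, length = parent[node]
        total += length
        dists[up] = total
        node = up
    return dists


def tip_distance(a, b, parent):
    """Path length between two tips, measured through their common ancestor."""
    up_a = path_to_root(a, parent)
    up_b = path_to_root(b, parent)
    # with non-negative branch lengths, the lowest common ancestor
    # gives the smallest summed distance among shared ancestors
    return min(up_a[n] + up_b[n] for n in up_a.keys() & up_b.keys())


def between_group_distance(group_a, group_b, parent):
    """Mean tip-to-tip distance over all pairs drawn from the two groups."""
    return mean(tip_distance(a, b, parent) for a, b in product(group_a, group_b))


# Hypothetical toy gene tree: child -> (parent, branch length)
parent = {
    "bac": ("root", 0.2), "arc": ("root", 0.3),
    "B1": ("bac", 0.1), "B2": ("bac", 0.2),
    "A1": ("arc", 0.1), "A2": ("arc", 0.3),
}
bacteria = ["B1", "B2"]
archaea = ["A1", "A2"]

print(round(between_group_distance(bacteria, archaea, parent), 2))
```

Running the same function over every gene tree would give a distribution of distances in which outlier genes stand out.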
The correlation between a gene tree’s concordance with the species tree and the Bacteria-Archaea distance it reflects is weakly positive. We sequentially removed discordant gene trees from the dataset. Doing so resulted in a slightly increased Bacteria-Archaea distance (e.g., keeping only the most concordant one sixth of the genes yielded a ~25% increase), but the result is still far from the same metric calculated using ribosomal proteins (~300%). Therefore, the Bacteria-Archaea proximity we observed is not an artifact of the diversity of gene evolution patterns.
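The filtering step described above can be sketched as follows. This is only a rough illustration under stated assumptions: the function name, the gene names, and the discordance scores are made up, and in the actual analysis the retained gene trees would be passed back to the tree summary method rather than simply listed.

```python
def most_concordant(discordance, fraction):
    """Return the names of the `fraction` of genes with the lowest discordance.

    `discordance` maps a gene name to a score where lower means the gene
    tree agrees better with the species tree.
    """
    keep = max(1, round(len(discordance) * fraction))
    ranked = sorted(discordance, key=discordance.get)
    return ranked[:keep]


# Hypothetical discordance scores (not results from this work)
discordance = {
    "geneA": 0.10, "geneB": 0.45, "geneC": 0.20,
    "geneD": 0.80, "geneE": 0.30, "geneF": 0.60,
}

print(most_concordant(discordance, 1 / 2))  # ['geneA', 'geneC', 'geneE']
print(most_concordant(discordance, 1 / 6))  # ['geneA']
```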
It is now safe to conclude that the two domains Bacteria and Archaea are indeed close to each other in evolution, as long as one chooses to trust the information from the large majority of the microbial genome. In contrast, relying on a few core genes is not only a technical compromise, but also leads to a biased view of the tree of life at the domain level.
Our work calls for revisiting the most basic evolutionary relationships of life. Yet it is just the beginning, and multiple fundamental questions remain to be answered. We will explore broader biodiversity (in particular, eukaryotes), a better-optimized marker gene set, and bioinformatics methods with further improved accuracy and scalability, in order to expand the understanding of domain-level relationships.
And finally, don’t forget that the original mission of this work was to deliver a reference phylogeny to the community. It is there, together with the genome pool, curated taxonomy (using both the NCBI and GTDB systems), cool and interactive rendering, and lots of protocols and code for using this resource in microbiome research. All of it is available at: https://biocore.github.io/wol/.
Ciccarelli FD, Doerks T, von Mering C, Creevey CJ, Snel B, Bork P. Toward automatic reconstruction of a highly resolved tree of life. Science. 2006. 311(5765):1283-7.
Mukherjee S, Seshadri R, Varghese NJ, Eloe-Fadrosh EA, Meier-Kolthoff JP, et al. 1,003 reference genomes of bacterial and archaeal isolates expand coverage of the tree of life. Nat Biotechnol. 2017. 35(7):676-83.
Castelle CJ, Banfield JF. Major New Microbial Groups Expand Diversity and Alter our Understanding of the Tree of Life. Cell. 2018. 172(6):1181-97.
Ramulu HG, Groussin M, Talla E, Planel R, Daubin V, Brochier-Armanet C. Ribosomal proteins: toward a next generation standard for prokaryotic systematics? Mol Phylogenet Evol. 2014. 75:103-17.
Puigbò P, Wolf YI, Koonin EV. Search for a 'Tree of Life' in the thicket of the phylogenetic forest. J Biol. 2009. 8(6):59.
Davidson R, Vachaspati P, Mirarab S, Warnow T. Phylogenomic species tree estimation in the presence of incomplete lineage sorting and horizontal gene transfer. BMC Genomics. 2015. 16(Suppl 10):S1.