Standardising archaeal taxonomy

It is no secret among microbiologists that the classification of the archaeal domain is heavily skewed. We propose to resolve this uneven classification by leveraging the vast number of public genome sequences. The result is a unified archaeal taxonomy based on evolutionary relationships.

Whenever I present on archaeal taxonomy, I ask the audience: “There is one archaeal phylum that might be better described as a gigantic superphylum. Does anybody want to guess which one it is?” This is mainly an exercise to get the audience engaged, and a bit like preaching to the choir, because the answer is always a loud and clear “Euryarchaeota!” Indeed, most colleagues are aware that more than two-thirds of the archaeal genomic diversity is currently grouped into this single phylum. To understand how we arrived at this huge skew in archaeal taxonomy, it’s worthwhile to take a time machine and travel back to an era when archaeal organisms were considered to be just another strange kind of bacterium.

How it all began – the discovery of a new domain

The scientific history of Archaea starts with the landmark paper by Carl Woese and his postdoc George Fox in 1977 [1] that introduced this new domain based on 16S rRNA sequences. Their discovery was made possible by a new sequencing technique known as RNA fingerprinting, co-developed by biochemist Fred Sanger [2]. Over the course of several years, Woese and Fox managed to sequence the rRNA of over 100 organisms and to study their evolutionary history by inferring phylogenetic rRNA trees. They were surprised to find that a group known as methanogenic “bacteria” was quite distinct from all other bacteria in their dataset. Based on this remarkable sequence divergence and a previous discovery that methanogens do not contain peptidoglycan in their cell envelope, Woese and Fox proposed a new domain of life, the Archaebacteria. Years later they updated the name to Archaea and, based on an extended rRNA phylogeny, subdivided the domain into two phyla (kingdoms): the Euryarchaeota, which occupied a relatively broad spectrum of niches and showed varied patterns of metabolism, and the Crenarchaeota, which comprised a relatively tight cluster of extremely thermophilic archaea.

Carl Woese at the lightboard in 1976 tracing rRNA sequences. Photo by Ken Luehrsen/ library of the University of Illinois (educational use permitted; http://rightsstatements.org/vocab/InC-EDU/1.0/)

Expansion woes

This initial characterisation of the two archaeal phyla, the Euryarchaeota representing a wide metabolic repertoire, and the Crenarchaeota constituting a tight group of extreme thermophiles, established the scientific perception of both lineages for decades to come. Many new lineages related to Euryarchaeota were described over the following years and were lumped into this diverse phylum, causing it to grow into a behemoth that now encompasses over two-thirds of the entire archaeal genomic diversity, according to NCBI taxonomy [3].

Crenarchaeota were long viewed as a group of thermophiles, and the discovery of mesophilic relatives eventually led to the proposal of a new phylum, arguing that mesophilic Crenarchaeota are different from hyperthermophilic ones. The name Thaumarchaeota, after the Greek "thaumas", meaning wonder, was proposed for this new lineage [4]. What followed was a further splitting of the Crenarchaeota with the proposal of many new candidate phyla including Aigarchaeota [5], Bathyarchaeota [6], and Verstraetearchaeota [7]. Subsequently, these novel lineages were recombined into the Thaumarchaeota-Aigarchaeota-Crenarchaeota-Korarchaeota (TACK) superphylum [8], which now contains over 10 proposed phyla [9]. New archaeal lineages, again mostly phyla, were also described outside the Euryarchaeota and TACK lineages and grouped into the two superphyla, DPANN [10] and Asgard archaea [11].

This wealth of novel archaeal lineages resulted in an uneven classification, partially due to inconsistent and unregulated nomenclature, and created several additional taxonomic challenges. Firstly, many new taxa were only proposed at the highest rank, i.e. phylum, resulting in an incomplete classification of lower ranks. Secondly, some archaeal lineages are polyphyletic and intermixed with each other in the current NCBI taxonomy. In essence, the taxonomic expansion of the archaeal domain has resulted in widespread uneven and incomplete classifications.

The proposed solution

We started with the assumption that we could resolve this uneven classification in the Archaea by normalising all ranks across the entire domain. To do this, we needed an objective measure to apply to all archaeal taxa to achieve a “united classification”. This idea is not new and has been explored in the past, e.g. in the proposal for rational taxonomic boundaries of higher taxa based on 16S rRNA gene sequence identities [12]. However, our approach is conceptually different. Rather than applying a static cutoff, as proposed for 16S rRNA gene sequences, we embraced the concept that each archaeal lineage evolves at a distinct evolutionary rate. To capture these rate differences, we used evolutionary trees of genome sequences. These genomic trees, commonly inferred from a concatenated set of marker proteins, not only reveal evolutionary relationships among taxa but also model their divergence in the form of different branch lengths. Our approach uses these trees as input and translates the branch lengths into relative values, which we call the relative evolutionary divergence (RED). In brief, it sets the root to 0 and the tips (the leaves of the tree) to 1, and then linearly interpolates the values for the intermediate nodes of each lineage [3,13].
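To make the interpolation concrete, here is a minimal sketch in Python. It assumes one common way of writing the interpolation, which is not spelled out in this post: a node's RED is p + (d / u) × (1 − p), where p is the RED of its parent, d is the branch length to the parent, and u is the average distance from the parent to the tips below the node. The Node class, the function names and the toy tree are hypothetical and only serve as illustration. This is not the GTDB implementation.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Node:
    """Hypothetical minimal tree node; names are assumed to be unique."""
    name: str
    length: float = 0.0  # branch length to the parent (unused for the root)
    children: List["Node"] = field(default_factory=list)


def tip_distances(node: Node) -> List[float]:
    """Distances from `node` to every tip below it (a tip returns [0.0])."""
    if not node.children:
        return [0.0]
    distances: List[float] = []
    for child in node.children:
        distances.extend(child.length + d for d in tip_distances(child))
    return distances


def relative_evolutionary_divergence(root: Node) -> Dict[str, float]:
    """Assign RED top-down: root = 0, tips = 1, internal nodes interpolated.

    Assumed formula: RED(n) = p + (d / u) * (1 - p). Branch lengths leading
    to internal nodes are assumed to be positive.
    """
    red = {root.name: 0.0}
    stack = [root]
    while stack:
        parent = stack.pop()
        p = red[parent.name]
        for child in parent.children:
            if not child.children:
                red[child.name] = 1.0  # tips are fixed at 1
            else:
                below = tip_distances(child)
                # u: mean distance from the parent to the tips below child
                u = child.length + sum(below) / len(below)
                red[child.name] = p + (child.length / u) * (1.0 - p)
            stack.append(child)
    return red


# Toy tree: the root has an internal child "A" (tips a1, a2) and a tip "b".
toy = Node("root", 0.0, [
    Node("A", 1.0, [Node("a1", 1.0), Node("a2", 3.0)]),
    Node("b", 2.0),
])
red = relative_evolutionary_divergence(toy)
```

In this toy tree the tips below A sit on average 3.0 branch-length units from the root, and A itself sits 1.0 unit from the root, so A receives a RED of one third. Because u is computed per lineage, a fast-evolving clade with long branches is rescaled onto the same 0 to 1 axis as a slow-evolving one.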

This approach, we argue, better reflects the evolutionary history of archaeal and bacterial lineages than a static threshold. We first explored the RED approach in the bacterial domain, as part of the Genome Taxonomy Database (GTDB) [13], and found support for this conclusion. For example, the fast-evolving lineage Mycoplasma [14], which is currently classified as a genus, is so diverged that it would represent two phyla based on the proposed 16S rRNA identity threshold of 75% [12]. However, the radiation of bacterial phyla has been estimated to have occurred 2-3 Gyr ago [15], whereas most Mycoplasma members are dependent on vertebrate hosts, which emerged only ~500 Myr ago [16,17]. These dates suggest that Mycoplasma emerged long after the primary diversification of bacterial phyla and is therefore unlikely to represent one or even two phyla. The RED-based rank normalisation, which accounts for variable speeds of evolution, assigns Mycoplasma to a single order in the Firmicutes, likely representing a more realistic evolutionary scenario [13].

An important consideration for the RED approach, which relies on phylogenomic trees as input, is the method used to infer such a tree. We tested this extensively, varying the selection of marker genes, the inference method, the treatment of compositional bias and fast-evolving sites, the number of taxa in the tree, the rooting of the tree, and so on. As expected, we found that the resulting phylogenomic trees differ in topology, with the marker gene set emerging as the most important differentiator. However, we were only interested in the robustness of the taxonomy, which is not the same as the overall tree topology, since we only used about two-thirds of the internal nodes for taxonomic classification. That is, apart from a very small number of exceptions, we only assigned names to internal nodes with high bootstrap support (>90%) to provide taxonomic stability. We found that the taxonomic assignments were remarkably stable under all the variables mentioned above. Having established a stable backbone, we next tackled the uneven and incomplete archaeal classifications.

The RED approach takes phylogenomic trees as input, so we evaluated many different trees. Shown are the first author, Chris Rinke, several phylogenomic trees, a dwarf mulberry and a mandarin tree.

Applying the RED approach to the archaeal domain was an iterative process. We used the NCBI taxonomy as a starting point to eventually derive the curated archaeal GTDB taxonomy, available at https://gtdb.ecogenomic.org. In doing so, we first resolved polyphyletic groups by retaining the name for the lineage containing the nomenclature type. Then, we followed the International Code of Nomenclature of Prokaryotes (ICNP) and recent proposals to expand the code, i.e. to include the rank of phylum and allow genome sequences as type material. Overall, about 10% of NCBI-defined taxa above the rank of species were identified as polyphyletic lineages. For example, the NCBI phyla Nanoarchaeota, Aenigmarchaeota and Woesearchaeota were intermingled with each other and with other unclassified archaeal genomes.

Next, we adjusted the ranks based on the RED values with the goal of preserving as many of the existing names as possible, i.e. names that were effectively published, validly published, or proposed as Candidatus taxa. However, the behemoth Euryarchaeota could not be retained as a phylum and had to be split into five phylum-level lineages: the Methanobacteriota, Halobacteriota, Thermoplasmatota, Hadarchaeota and Hydrothermarchaeota. The TACK and Asgard superphyla, on the other hand, were both assigned to the rank of phylum, and all subordinate taxa, in particular classes, were adjusted accordingly. This required that all members of the Thaumarchaeota were contained in the only validly described class in this lineage, the Nitrososphaeria, and likewise, members of the Crenarchaeota were placed in the class Thermoproteia (replacement name for Thermoprotei). These modifications also meant that the names Euryarchaeota and Crenarchaeota are not part of the archaeal GTDB taxonomy. We initially discussed grandfathering these names in, but strong feedback from the scientific community, and the fact that both names will be illegitimate when the rank of phylum is included in the ICNP, resulted in their rejection. In contrast to these major changes required at the phylum level, taxa at the ranks from species to order were more stable, with only ~12% name changes on average.

Rank-normalized archaeal GTDB taxonomy using relative evolutionary divergence (RED). Species-dereplicated genome tree inferred from 122 concatenated archaeal proteins. The tree is decorated with the archaeal GTDB taxonomy R04-RS89 and contoured with the RED interval assigned to each taxonomic rank. The adjacent ranks overlap in some instances, as this permits existing taxon names to be placed on well-supported interior nodes. RED values used for rank normalisation are averaged over multiple plausible rootings.

We are aware that introducing changes into the archaeal classification and nomenclature can cause unrest among microbiologists who prefer literature continuity over rank normalisation. However, we argue that now is the time to fix the system, before an ever-increasing avalanche of environmental genomes is integrated into this skewed classification. The proposed archaeal taxonomy for the Genome Taxonomy Database (GTDB) is up to this task. It is robust against a range of standard phylogenetic variables, effectively resolves polyphyletic groups and normalises ranks across the entire archaeal domain. We envision that this project will gather increasing community support, and we actively seek engagement through our web forum (https://forum.gtdb.ecogenomic.org) and encourage you, dear reader of this blog, to explore the archaeal GTDB taxonomy at https://gtdb.ecogenomic.org. Users can also classify their own genomes against this taxonomy using GTDB-Tk (Chaumeil et al. 2019).

Happy classifying!

Please read our full paper at: https://doi.org/10.1038/s41564-021-00918-8

 References

  1. C. R. Woese, G. E. Fox, Phylogenetic structure of the prokaryotic domain: the primary kingdoms. Proc Natl Acad Sci U S A. 74, 5088–5090 (1977).
  2. G. G. Brownlee, F. Sanger, B. G. Barrell, Nucleotide Sequence of 5S-ribosomal RNA from Escherichia coli. Nature. 215, 735–736 (1967).
  3. C. Rinke, M. Chuvochina, A. J. Mussig, P.-A. Chaumeil, D. W. Waite, W. B. Whitman, D. H. Parks, P. Hugenholtz, Nature Microbiology, doi:10.1038/s41564-021-00918-8.
  4. C. Brochier-Armanet, B. Boussau, S. Gribaldo, P. Forterre, Mesophilic crenarchaeota: proposal for a third archaeal phylum, the Thaumarchaeota. Nat Rev Micro. 6, 245–252 (2008).
  5. T. Nunoura, Y. Takaki, J. Kakuta, S. Nishi, J. Sugahara, H. Kazama, G.-J. Chee, M. Hattori, A. Kanai, H. Atomi, K. Takai, H. Takami, Insights into the evolution of Archaea and eukaryotic protein modifier systems revealed by the genome of a novel archaeal group. Nucleic Acids Res. 39, 3204–3223 (2011).
  6. J. Meng, J. Xu, D. Qin, Y. He, X. Xiao, F. Wang, Genetic and functional properties of uncultivated MCG archaea assessed by metagenome and gene expression analyses. ISME J. 8, 650–659 (2014).
  7. I. Vanwonterghem, P. N. Evans, D. H. Parks, P. D. Jensen, B. J. Woodcroft, P. Hugenholtz, G. W. Tyson, Methylotrophic methanogenesis discovered in the archaeal phylum Verstraetearchaeota. Nature Microbiology. 1, 16170 (2016).
  8. L. Guy, T. J. G. Ettema, The archaeal ‘TACK’ superphylum and the origin of eukaryotes. Trends in Microbiology. 19, 580–587 (2011).
  9. B. J. Baker, V. De Anda, K. W. Seitz, N. Dombrowski, A. E. Santoro, K. G. Lloyd, Diversity, ecology and evolution of Archaea. Nature Microbiology. 5, 887–900 (2020).
  10. C. Rinke, P. Schwientek, A. Sczyrba, N. N. Ivanova, I. J. Anderson, J.-F. Cheng, A. Darling, S. Malfatti, B. K. Swan, E. A. Gies, Insights into the phylogeny and coding potential of microbial dark matter. Nature. 499, 431–437 (2013).
  11. K. Zaremba-Niedzwiedzka, E. F. Caceres, J. H. Saw, D. Bäckström, L. Juzokaite, E. Vancaester, K. W. Seitz, K. Anantharaman, P. Starnawski, K. U. Kjeldsen, M. B. Stott, T. Nunoura, J. F. Banfield, A. Schramm, B. J. Baker, A. Spang, T. J. G. Ettema, Asgard archaea illuminate the origin of eukaryotic cellular complexity. Nature. 541, 353–358 (2017).
  12. P. Yarza, P. Yilmaz, E. Pruesse, F. O. Glöckner, W. Ludwig, K.-H. Schleifer, W. B. Whitman, J. Euzéby, R. Amann, R. Rosselló-Móra, Uniting the classification of cultured and uncultured bacteria and archaea using 16S rRNA gene sequences. Nat Rev Micro. 12, 635–645 (2014).
  13. D. H. Parks, M. Chuvochina, D. W. Waite, C. Rinke, A. Skarshewski, P.-A. Chaumeil, P. Hugenholtz, A standardized bacterial taxonomy based on genome phylogeny substantially revises the tree of life. Nature Biotechnology. 36, 996–1004 (2018).
  14. C. Citti, A. Blanchard, Mycoplasmas and their host: emerging and re-emerging minimal pathogens. Trends in Microbiology. 21, 196–203 (2013).
  15. J. Marin, F. U. Battistuzzi, A. C. Brown, S. B. Hedges, The Timetree of Prokaryotes: New Insights into Their Evolution and Speciation. Molecular Biology and Evolution. 34, 437–446 (2017).
  16. S. Kumar, G. Stecher, M. Suleski, S. B. Hedges, TimeTree: A Resource for Timelines, Timetrees, and Divergence Times. Molecular Biology and Evolution. 34, 1812–1819 (2017).
  17. D.-G. Shu, H.-L. Luo, S. Conway Morris, X.-L. Zhang, S.-X. Hu, L. Chen, J. Han, M. Zhu, Y. Li, L.-Z. Chen, Lower Cambrian vertebrates from south China. Nature. 402, 42–46 (1999).

 

Chris Rinke

ARC Future Fellow, The University of Queensland