The exploration of microbial communities from environmental samples around the world has been a thriving field of microbiology for many years. After the introduction of high-throughput sequencing, a wealth of studies used 16S rDNA amplicon sequencing to investigate the taxonomic profiles of all kinds of environments. An alternative to profiling via universal marker genes is metagenomic whole-genome shotgun sequencing, which offers additional ways to examine the microbial diversity of a sample, for example its metabolic functions and pathways or the prevalence of viruses.
Another important application of metagenomics is the “mining” of specific environments for novel enzymes. In recent years, I have been a member of the Hotzyme consortium, which aims to discover novel thermostable enzymes, specifically hydrolases, from extreme-temperature environments. One objective of the project is to apply these enzymes in laboratory techniques and industrial processes.
During the project, we collected samples from various hot springs around the world, which were then subjected to whole-genome shotgun sequencing on either Illumina or Roche/454 platforms.
Besides looking for hydrolases, we were also interested in the microbial and viral profiles of the hot springs. Thermophiles and their viruses do not get as much attention as other microbes, partly because it is difficult to obtain laboratory cultures of species that have adapted to specific environmental niches. As a consequence, these types of metagenome projects frequently face a large number of unknown sequences that have no match among the sequences collected in GenBank. For our metagenomes, we assembled the Illumina reads into contigs, which were then assigned to taxa using MEGAN based on sequence alignment to the non-redundant protein database "nr". Taxon abundances could then be estimated by mapping the reads back to the contigs and counting reads per taxon. While this approach worked for us, it is not straightforward and was also quite demanding in terms of computation time.
Starting about two years ago, several new programs for taxonomic profiling of metagenomic samples were published. They offer a significant computational speedup over traditional sequence alignment by comparing k-mers (nucleotide sequences of length k) between the metagenomes and a reference database of microbial genomes. This approach allows millions of reads to be classified in less than an hour, which is necessary for keeping up with the increasing availability and throughput of metagenomic sequencing in the coming years.
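To make the k-mer idea concrete, here is a minimal Python sketch. It is not the implementation of any published classifier: the toy genomes, the taxon names, the function names and the "most shared k-mers wins" rule are all assumptions chosen for illustration, and real tools use far more compact index structures.

```python
from collections import Counter


def kmers(seq, k):
    """Yield every substring of length k from seq."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def build_index(genomes, k):
    """Map each k-mer to the set of taxa whose genome contains it."""
    index = {}
    for taxon, genome in genomes.items():
        for kmer in kmers(genome, k):
            index.setdefault(kmer, set()).add(taxon)
    return index


def classify(read, index, k):
    """Return the taxon sharing the most k-mers with the read, or None."""
    hits = Counter()
    for kmer in kmers(read, k):
        for taxon in index.get(kmer, ()):
            hits[taxon] += 1
    if not hits:
        return None  # no identical k-mer found: the read stays unclassified
    return hits.most_common(1)[0][0]


# Hypothetical toy "genomes", made up for this example.
genomes = {
    "taxon_A": "ACGTACGTTAGCCGATAGCTAGGATCC",
    "taxon_B": "TTGACCGGTTAACCGGTTTTGACAGTA",
}
index = build_index(genomes, k=8)

exact_read = "CGTTAGCCGATAG"    # identical to a stretch of taxon_A
mutated_read = "CGTTAGCAGATAG"  # same stretch with one substitution in the middle

print(classify(exact_read, index, k=8))
print(classify(mutated_read, index, k=8))
```

In this toy setup the first read is assigned to taxon_A, while the second read stays unclassified: a single substitution in the middle of a 13-base read changes every one of its 8-mers, so the lookup finds nothing.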
While fast and straightforward, the use of genomic k-mers has some shortcomings that limit its applicability in some situations. For example, the data structures that store the k-mers for fast lookup require a lot of RAM, so the programs typically restrict their reference database to a set of completely assembled reference genomes. However, most bacterial genomes are only available as draft genomes comprising a set of contigs rather than a full chromosome. Another issue is the rather strict requirement of finding at least one identical k-mer between the read and the genome database in order to assign the read to a species. This problem is exacerbated by the limited availability of sequenced genomes in many clades, so the phylogenetic distance between the sequences in the database and the sequences in the metagenomic samples is often quite high. Thus, the assignment of reads to taxa is still a challenging problem, especially in samples from extreme environments like hot springs. It is therefore crucial to compare sequences at the protein level rather than at the nucleotide level in order to bridge larger evolutionary distances, and to use a reference database that contains as many microbial and viral proteins as possible.
We therefore had the idea of building an easy-to-use taxonomic classifier that is as fast as k-mer-based programs but more sensitive because it uses a protein database. Instead of using fixed-length k-mers, we opted for finding maximum exact matches between reads and database using the Burrows-Wheeler transform as the index structure, which is especially suited for this kind of search in a large database. Our new program "Kaiju" requires only a comparatively small amount of RAM, which allows it to use the entire microbial subset of the nr protein database, while its runtime is similar to or faster than that of the fastest k-mer classifiers. Kaiju can also compensate for substitutions in the amino acid sequences during the database search, which allows it to find even more divergent protein sequences. Kaiju takes high-throughput sequencing reads as input, for example from FASTQ files, and assigns each read to a taxon, either at species level or at a higher level in the taxonomic tree in case of ambiguous database matches.
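The following Python sketch illustrates the general idea of protein-level classification with exact matches. It is not Kaiju's code: the Burrows-Wheeler index is replaced here by a naive longest-common-substring scan, substitutions are not handled, and the protein sequences, taxon names, `min_len` threshold and function names are assumptions made up for this example.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def translate(dna):
    """Translate a DNA string codon by codon; unknown codons become 'X'."""
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X")
                   for i in range(0, len(dna) - 2, 3))


def fragments(read):
    """Translate all six reading frames and split them at stop codons."""
    reverse = read.translate(str.maketrans("ACGT", "TGCA"))[::-1]
    for strand in (read, reverse):
        for frame in range(3):
            for piece in translate(strand[frame:]).split("*"):
                if piece:
                    yield piece


def longest_common_substring(a, b):
    """Length of the longest substring shared by a and b (naive DP)."""
    best, prev = 0, [0] * (len(b) + 1)
    for ch in a:
        cur = [0] * (len(b) + 1)
        for j, other in enumerate(b, start=1):
            if ch == other:
                cur[j] = prev[j - 1] + 1
                best = max(best, cur[j])
        prev = cur
    return best


def lineage(taxon, parent):
    """Path from a taxon up to the root of the taxonomy."""
    path = [taxon]
    while taxon in parent:
        taxon = parent[taxon]
        path.append(taxon)
    return path


def lowest_common_ancestor(taxa, parent):
    """Deepest node that lies on the lineage of every given taxon."""
    paths = [lineage(t, parent) for t in taxa]
    common = set(paths[0]).intersection(*paths[1:])
    return next(node for node in paths[0] if node in common)


def classify(read, proteins, parent, min_len=4):
    """Assign a read to the taxon (or LCA of taxa) with the longest exact match."""
    best_len, best_taxa = 0, set()
    for fragment in fragments(read):
        for taxon, protein in proteins:
            n = longest_common_substring(fragment, protein)
            if n > best_len:
                best_len, best_taxa = n, {taxon}
            elif n == best_len and n > 0:
                best_taxa.add(taxon)
    if best_len < min_len:  # arbitrary toy threshold against spurious hits
        return None
    return lowest_common_ancestor(sorted(best_taxa), parent)


# Hypothetical toy database and taxonomy.
proteins = [("species_1", "MKVLAAGIVGLE"), ("species_2", "MKVLSAGIVGLE")]
parent = {"species_1": "genus_X", "species_2": "genus_X", "genus_X": "root"}

shared_read = "GCTGGTATTGTTGGTCTG"  # encodes AGIVGL, present in both proteins
unique_read = "AAAGTTCTGGCTGCAGGT"  # encodes KVLAAG, present in species_1 only

print(classify(shared_read, proteins, parent))
print(classify(unique_read, proteins, parent))
```

With this toy data, the first read matches both species equally well and is therefore placed at genus_X, whereas the second read has its longest match in species_1 only and is assigned at species level. The naive scan is only workable for a handful of sequences; searching a database the size of nr is what makes an index structure such as the Burrows-Wheeler transform necessary.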
To test the benefit of using a protein database over a genomic database, we devised a large genome exclusion study on 882 genomes from those genera that have only a few sequenced genomes available. This study showed that Kaiju has a much higher sensitivity than a classification based on k-mer comparison, while maintaining a similar precision. Additionally, we saw that Kaiju could classify many more reads than the k-mer-based approach in ten randomly selected metagenomes from a wide range of microbial habitats. Even though Kaiju limits itself to those genomic regions that contain protein-coding genes, which cover ca. 80 to 90 percent of a typical bacterial or archaeal genome, the sequencing reads often overlap with open reading frames, especially when using long or paired-end reads, so that the chance of a database match increases. When we used Kaiju on the Hotzyme metagenomes, we obtained classification results similar to those from our earlier workflow, with a much simpler workflow and significantly less computation time.
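For readers less familiar with the two measures, the sketch below shows one common way to compute them for a set of simulated reads with known origin. The definitions used here (sensitivity as correct assignments over all reads, precision as correct assignments over classified reads) are an assumption for this example, and the read names, taxon names and function name are made up; the paper describes the exact definitions used in the study.

```python
def sensitivity_and_precision(predicted, truth):
    """Compare predicted taxa with the true taxa of simulated reads.

    predicted maps read name -> taxon, or None for unclassified reads.
    truth maps read name -> true taxon.
    """
    total = len(truth)
    classified = [r for r in truth if predicted.get(r) is not None]
    correct = sum(1 for r in classified if predicted[r] == truth[r])
    sensitivity = correct / total if total else 0.0
    precision = correct / len(classified) if classified else 0.0
    return sensitivity, precision


# Hypothetical example: five reads that all originate from genus_X.
truth = {f"read_{i}": "genus_X" for i in range(1, 6)}
predicted = {
    "read_1": "genus_X",
    "read_2": "genus_X",
    "read_3": "genus_X",
    "read_4": "genus_Y",  # classified, but wrong
    "read_5": None,       # unclassified
}

sensitivity, precision = sensitivity_and_precision(predicted, truth)
print(f"sensitivity={sensitivity:.2f} precision={precision:.2f}")
```

Under these definitions an unclassified read lowers the sensitivity but not the precision, which is why a classifier that finds more database matches can gain sensitivity while its precision stays about the same.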
We hope that Kaiju will be helpful to other researchers for analysing their samples from microbial habitats around the world. The program can be downloaded for a local installation or used through a web server, which also offers a colourful visualisation of the metagenomes as a little reward for the user. The circles denote taxa in the sample and are coloured by their respective phylum. The example figure below depicts a metagenome from a hot spring in Yellowstone National Park, which mostly comprises Crenarchaeota and some thermophilic archaeal viruses.
Read the paper: Fast and sensitive taxonomic classification with Kaiju.