The human gut virome consists primarily of bacteriophages, which may both play a crucial role in regulating and shaping the microbial communities of the gut and facilitate horizontal gene transfer and microbial evolution. With 90% of viral sequencing reads sharing little to no homology with sequences in reference databases, the make-up of these viral communities also represents one of the biggest gaps in our understanding of the human microbiome. As the hosts of most of these viruses are also unknown, the virome research community relies heavily on sequencing and computational approaches.
When we developed our virome analysis pipeline, we realised that to make sense of this unknown majority we would need to move towards in silico methods and away from database-dependent approaches. Our first port of call was the assembly step, a crucial stage in all reference-independent pipelines, at which short sequence reads are used to recreate the genome sequences of community members. Looking at previous studies, we found that no single assembly method was used across all virome studies, and that there had been no extensive assembly comparison dedicated to the virome. That gap led us to this study.
Metagenomic assembly, or reconstructing the genome sequences of community members, is a common but challenging computational task because of the complexity of microbial communities and the large amounts of sequencing data required to represent them in a meaningful way. Unfortunately for virome scientists, viromes are harder still to assemble, perhaps even by orders of magnitude. The ability of the assembler to overcome these challenges is of great importance to a virome analysis pipeline, which is essentially built around this crucial step.
By testing 16 assembly approaches on a combination of 4 different virome datasets, including both synthetic and human viromes, we observed significant variation in assemblers’ ability to overcome assembly challenges. Most assemblers failed to properly reconstruct phage genomes that we knew to be there, which was a worrying outcome. In most cases the assemblers recovered only small proportions of the genomes, and the resulting assemblies were short and fragmented. These findings have serious implications for virome analysis pipelines: not only does the choice of assembly program directly affect which members of a viral community can be recovered, but certain viral genomes appear to challenge all current assembly approaches. We observed that extremes in abundance were responsible for some aspects of poor assembly, as was the proportion of genomic repeat regions in each community member. However, these challenges did not explain the full variation in poor genome recovery. This highlights a continued need to improve and develop virome analysis approaches, and it is an important consideration when setting downstream analysis parameters and drawing final conclusions.
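To make "short and fragmented" and "small proportions of the genomes being recovered" concrete, here is a minimal Python sketch of two commonly used assembly summary measures: N50 and the fraction of a known reference genome covered by contigs. It is an illustration only, not the evaluation code used in the study. The function names `n50` and `genome_fraction` and the input format (contig lengths, plus half-open alignment coordinates on a reference) are assumptions made for this example.

```python
from typing import Iterable, List, Tuple


def n50(contig_lengths: Iterable[int]) -> int:
    """Return the N50: the length L such that contigs of length >= L
    together contain at least half of all assembled bases."""
    lengths = sorted(contig_lengths, reverse=True)
    total = sum(lengths)
    running = 0
    for length in lengths:
        running += length
        if 2 * running >= total:
            return length
    return 0  # empty assembly


def genome_fraction(genome_length: int, aligned: List[Tuple[int, int]]) -> float:
    """Fraction of a reference genome covered by at least one contig.

    `aligned` holds half-open (start, end) reference coordinates of contig
    alignments (hypothetical input format). Overlapping or adjacent
    intervals are merged so that each reference base is counted once.
    """
    covered = 0
    current_start, current_end = None, None
    for start, end in sorted(aligned):
        if current_end is None or start > current_end:
            # Close the previous merged interval, if any, and start a new one.
            if current_end is not None:
                covered += current_end - current_start
            current_start, current_end = start, end
        else:
            # Interval overlaps or touches the current one: extend it.
            current_end = max(current_end, end)
    if current_end is not None:
        covered += current_end - current_start
    return covered / genome_length
```

For example, contigs of 100, 200, 300 and 400 bases give an N50 of 300, and alignments covering positions 0 to 200 and 500 to 600 of a 1,000-base genome give a genome fraction of 0.3. A low N50 together with a low genome fraction is the pattern described above: many short pieces that cover only part of a genome known to be present.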