The relative simplicity of the central dogma--DNA to RNA to protein--belies the actual difficulty in gaining protein functional insight from DNA sequence alone.
As a microbiologist, I always loved the relative ease of genome-scale screens for quickly discovering genes of interest, but then dreaded the next steps of trying to figure out what these candidates actually did. How many times has the glow of a successful screen faded into a fruitless bioinformatics search for function--e.g., BLAST telling you that your super interesting ORF encodes a member of a family with a conserved 'DUF' (domain of unknown function)?
So, while genomic sequencing has been revolutionary in uncovering similar coding regions, it takes much more follow-up work to generate the structural information that can provide insight into how proteins interact, regulate enzymatic activity and control substrate specificity. Such resource-intensive work for each new protein makes triage and hypothesis generation difficult, and makes accurate (and easy) structural prediction from sequence an attractive holy grail.
However, current structural prediction programs tend to model new sequences onto existing structural scaffolds. So what do you do when protein prediction searches find nothing in the database that looks similar? For example, in the Pfam database, >5,000 families still have no representative structure on which to model (15,000 families have at least 1 representative structure).
The answer seems to be going back to the DNA sequence, or rather a lot of DNA sequences. It has been well appreciated that protein interaction pairs co-evolve (i.e., despite evolutionary changes in amino acids, protein interactions are selected for and can lead to compensatory mutations when altered). And recent work has shown that existing DNA databases can provide enough diverse sequences for some protein families that de novo structural prediction programs can discriminate meaningful co-evolutionary changes from sequence drift and lineage effects. Most of the unsolved protein families, however, don't have enough representation in genomic databases to enable stable detection of enough co-evolving residue pairs for structural prediction.
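To make the co-evolution idea concrete, here is a minimal sketch that scores how strongly two columns of a sequence alignment vary together, using plain mutual information. This is a simple stand-in statistic, not the method used in the paper: the function name and the toy alignment are made up for illustration, and raw mutual information does not correct for the sequence drift and lineage effects mentioned above, which is part of why real methods need so many diverse sequences.

```python
import math
from collections import Counter


def column_mutual_information(alignment, i, j):
    """Mutual information (in bits) between columns i and j of an alignment.

    `alignment` is a list of equal-length strings, one per sequence.
    High values mean the residues at the two positions tend to change together.
    """
    n = len(alignment)
    freq_i = Counter(seq[i] for seq in alignment)
    freq_j = Counter(seq[j] for seq in alignment)
    freq_pair = Counter((seq[i], seq[j]) for seq in alignment)

    mi = 0.0
    for (a, b), count in freq_pair.items():
        p_ab = count / n
        p_a = freq_i[a] / n
        p_b = freq_j[b] / n
        mi += p_ab * math.log2(p_ab / (p_a * p_b))
    return mi


# Hypothetical toy alignment: positions 1 and 3 always change together
# (K pairs with E, R pairs with D), while positions 0 and 1 vary independently.
toy_alignment = [
    "AKLE",
    "AKLE",
    "ARLD",
    "ARLD",
    "GKVE",
    "GRVD",
]

coupled = column_mutual_information(toy_alignment, 1, 3)
uncoupled = column_mutual_information(toy_alignment, 0, 1)
print(f"positions 1 and 3: {coupled:.3f} bits")
print(f"positions 0 and 1: {uncoupled:.3f} bits")
```

With only six sequences the signal here is artificially clean. In a real family, a handful of closely related sequences can produce the same pattern by shared ancestry alone, so the score only becomes trustworthy once the alignment is both large and diverse.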
In work published in last week's Science (see the related perspective by Johannes Soding), David Baker's lab and collaborators show that we've overlooked a vast new resource for solving this problem: the richness of the metagenomic databases generated by microbial diversity sequencing projects. The authors found that the Rosetta prediction program had good de novo prediction accuracy for 27 protein families compared to their known structures, as long as there were enough sequences around (e.g., best accuracy was found when Nf >64, where Nf is the number of sequence clusters at 80% identity divided by the square root of the protein length, a measure that correlated well with accuracy).
Unfortunately, only 8% of protein families in current DNA databases had enough sequence diversity to actually satisfy the Nf criterion. But by turning to metagenomic databases, they could increase protein family memberships by 100-fold in many cases, and bring 25% of protein families into accurate modeling territory (33% with an Nf >32 cutoff, where fold-level, if not protein-level, accuracy appeared good).
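The Nf measure is simple enough to work through by hand. The sketch below computes it from the definition given above (clusters at 80% identity divided by the square root of protein length) and sorts a family into one of the two cutoffs discussed. The function names, tier labels and example numbers are all hypothetical, chosen only to illustrate the arithmetic.

```python
import math


def nf(num_clusters_80, protein_length):
    """Nf: sequence clusters at 80% identity over sqrt(protein length)."""
    return num_clusters_80 / math.sqrt(protein_length)


def modeling_tier(nf_value):
    """Map an Nf value onto the two cutoffs quoted in the text.

    The tier names are illustrative labels, not terms from the paper.
    """
    if nf_value > 64:
        return "accurate"
    if nf_value > 32:
        return "fold-level"
    return "insufficient"


# Hypothetical 100-residue domain with different amounts of sequence diversity.
for clusters in (100, 400, 700):
    value = nf(clusters, 100)
    print(f"{clusters} clusters -> Nf = {value:.0f} ({modeling_tier(value)})")
```

Because the denominator grows with the square root of length, longer proteins need more distinct sequences to reach the same Nf, which is one way to see why a 100-fold boost in family membership from metagenomes moves so many families over the line.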
This led to structural predictions for 614 protein domains, representing 487,306 UniRef100 and 3.8 million metagenomic sequences. Of these, 137 domains were entirely new, with no similarity to previously seen structural folds. Overall, this work has added testable structural models for 12% of all known protein families, and given the rate of new diversity studies coming online, this number should only keep increasing.
However, there remain limitations to the current approach: first, most survey studies are currently microbial, so eukaryotic protein families remain vastly underrepresented in the database; second, prediction approaches still require downstream validation efforts to visualize protein conformations that better take the native cellular context into account (e.g., interacting proteins, presence of ligands, enzyme activation state, membrane insertion).
But while there is much more to be done in this space, the work shows that the future is bright for these computational approaches and that there is much to be gained by envisaging metagenomic datasets not just as repositories of diversity but as evolutionary tools with which we can ask questions about different aspects of cellular function. It will be exciting to see how much further we can push the computation; for example, since conserved interactions also exist between proteins, can such approaches provide de novo structural insight into larger complexes?