Some microbes are bigger than others, some can fix nitrogen gas, some can take up nutrients quickly. These characteristics, which inevitably influence the success that a microbe has in its environment, are collectively referred to as "traits". Can whole proteomes analogously have distinctive “traits”? And if so, what are the “traits” of protein expression and how might they differ across organisms? In this recent paper we defined a proteomic trait as a characteristic of an organism at the proteome level that includes both the abundance and identity of a protein (or group of proteins), and is connected to organismal fitness or performance. To characterize and compare proteomic traits across diverse marine microbes, we used metaproteomics (i.e., from "metaproteomes to proteomes").
I was hardly thinking along these lines 4.5 years ago when I first started this work. At that time, I was mainly focused on getting the protein extracted. In the first few months of my PhD, I lugged hundreds of litres of water from the ocean shores near Dalhousie University, inoculated this seawater with old cultures of Phaeodactylum tricornutum (thank you, Dr. Joerg Behnke!), and watched them grow. I would later use these huge volumes of water to generate ~80 filters with the same protein amount and composition to test out a variety of protein extraction approaches. But getting the proteins extracted was only the beginning of this journey.
One of the most fun and challenging parts of this paper for me was the amount of data. There was so. Much. Data. (I’m tempted to call it ‘big data’, but I won’t go that far.) The sheer amount of data meant that a huge amount of effort had to go into organizing the folders well from the outset. Where did all this data come from? These samples had a trifecta: metagenomics, metatranscriptomics, and metaproteomics, all from the same water! Combining these techniques gave us an opportunity to ask different methodological questions. (We strongly considered separating this into two papers: one just about methods and one about the scientific findings.) Some of the methods questions have been bothering me since I started my PhD: what happens when a database used for metaproteomics doesn’t match well to the samples? How does this influence identification and quantification of peptides? With this dataset, we were able to systematically tackle these questions.
The short and sweet of it is: yes, database choice matters for both identifying and inferring peptide quantities. We provided some simple methods for checking if these issues are influencing your conclusions (see the paper for the details!).
The methods questions were both frustrating and fun. It was frustrating to have so many unknowns, and so many things to double, triple, and quadruple check, before we could say anything about the biology. Bioinformatics in general was frustrating at times because much of the time is devoted to installing software. The fun parts were making a plot using a new method and knocking on Erin’s door: “Want to see something cool?” It’s those small, spontaneous interactions that I’ve missed most during the past year and a half.
One of the first questions we asked was simple: which types of proteins are in the water? It seems to me the simple questions end up being the hardest. It was a sea of “hypothetical proteins”. I found it fascinating that there are so many proteins whose function we have absolutely no clue about. Astonishingly, the most abundant protein group we observed (!) had no associated annotations, and searching NCBI yielded nothing either.
With this dataset, we were also able to look at the proteomic composition of the eukaryotic phytoplankton (mostly because we had lots of metatranscriptomic data!). By grouping proteins into “coarse-grained” groups, we could compare different taxa group by group. We recently published a paper on a “coarse-grained” proteomic model of a diatom (McCain et al 2021, Science Advances), so I decided early on that there would be a set of targeted questions for these metaproteomic data that extend from this modelling approach. The simple question was: what proportion of the proteome comes from ribosomal or photosynthetic proteins? Are different taxa similar or different in these proteomic traits? As you may have read from the title, they were pretty different. I’m excited to see how metaproteomics can be used to quantify other proteomic traits across microbes, particularly those that are difficult to culture.
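To make the "proportion of the proteome" idea concrete, here is a minimal Python sketch. It is not the code from the paper: the function name, the group labels, and the abundance numbers are all made up for illustration, and it assumes each protein (or peptide) has already been assigned to one taxon and one coarse-grained group.

```python
from collections import defaultdict


def coarse_grained_fractions(records):
    """Return {taxon: {group: fraction of that taxon's summed abundance}}.

    `records` is an iterable of (taxon, group, abundance) tuples.
    Hypothetical input format, chosen only for this illustration.
    """
    totals = defaultdict(float)
    by_group = defaultdict(lambda: defaultdict(float))
    for taxon, group, abundance in records:
        totals[taxon] += abundance
        by_group[taxon][group] += abundance
    return {
        taxon: {g: a / totals[taxon] for g, a in groups.items()}
        for taxon, groups in by_group.items()
        if totals[taxon] > 0  # skip taxa with no measured abundance
    }


# Invented abundances, for illustration only (not data from the paper).
example = [
    ("diatom", "ribosomal", 20.0),
    ("diatom", "photosynthetic", 30.0),
    ("diatom", "other", 50.0),
    ("dinoflagellate", "ribosomal", 10.0),
    ("dinoflagellate", "photosynthetic", 10.0),
    ("dinoflagellate", "other", 80.0),
]
fractions = coarse_grained_fractions(example)
for taxon, groups in fractions.items():
    print(taxon, groups)
```

Because each fraction is normalized within a taxon, taxa with very different total abundances in the sample can still be compared on the same scale.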
Imagine an organism living in a perfectly constant environment. There would be little advantage to regulating protein expression under different conditions. We used this logic to quantify another proteomic trait: the environment-independent proteomic fraction, a proxy for the cost of regulating protein expression, or regulatory cost. We classified all peptides corresponding to a microbial taxon into two categories, environment-independent and environment-dependent (based on the coefficient of variation of peptide abundance). What we found was surprisingly consistent with previous observations about SAR11, dinoflagellates, and other marine microbes. For example, other researchers have suggested SAR11 appears to have little regulatory investment, and our proxy for regulatory cost pointed to the same conclusion. Overall, this quantitative metric can be used to compare the various allocation strategies microbes in the ocean employ.
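The classification step can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the procedure from the paper: the cutoff of 0.5, the choice to count peptides rather than weight them by abundance, the function names, and the peptide labels and values are all hypothetical.

```python
from statistics import mean, pstdev


def coefficient_of_variation(values):
    """Population standard deviation divided by the mean."""
    m = mean(values)
    if m == 0:
        return float("inf")  # undefined CV; treat as maximally variable
    return pstdev(values) / m


def environment_independent_fraction(peptide_abundances, cv_threshold=0.5):
    """Fraction of a taxon's peptides whose CV across samples is at or
    below `cv_threshold` (an assumed cutoff, not taken from the paper).

    `peptide_abundances` maps a peptide label to its abundances
    across samples for one taxon.
    """
    if not peptide_abundances:
        raise ValueError("no peptides supplied")
    independent = sorted(
        peptide
        for peptide, values in peptide_abundances.items()
        if coefficient_of_variation(values) <= cv_threshold
    )
    return len(independent) / len(peptide_abundances), independent


# Invented abundances across four samples, for illustration only.
example_taxon = {
    "PEPTIDE_A": [10.0, 10.0, 10.0, 10.0],  # constant across samples
    "PEPTIDE_B": [1.0, 20.0, 2.0, 30.0],    # strongly variable
    "PEPTIDE_C": [5.0, 6.0, 5.0, 6.0],      # nearly constant
}
fraction, independent_peptides = environment_independent_fraction(example_taxon)
print(fraction, independent_peptides)
```

In this toy example two of the three peptides fall under the cutoff, so the environment-independent fraction is two thirds. A taxon with a higher fraction would, by the logic above, be investing less in regulation.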
As with all large datasets, there is a serious risk of venturing into the “garden of forking paths”. More explicitly, if I looked at the dataset in hundreds of ways, I’d inevitably find something to tell a story about. I’ve tried very hard to avoid this by starting off with a few specific scientific questions. I also relied on previously published proteomic data to compare our estimates of the different proteomic traits. I can’t wait for more metaproteomic data to be generated, to get better estimates of these different proteomic traits, and to see if our initial hypotheses hold up.
Picture a new class of biogeochemical models where we explicitly represent gene expression in response to different environmental variables, which can be directly informed by rich data sources like metaproteomics. Microbes differ based on their gene repertoires and several authors have leveraged this to develop “gene-centric” biogeochemical models (e.g., Reed et al 2014, PNAS; Coles et al 2017, Science) which recapitulate geochemical trends. But what we suggested in our paper is that while microbes differ based on gene repertoires, there is a whole other dimension: different proteomic allocation strategies. Imagine the amount of diversity across microbes that exists just because of the amount of a protein expressed. Before we can build such models, we must first discover this diversity, and we’ve made a step in that direction. We’ve gone “from metaproteomes to proteomes”, but the eventual goal is to go “back again”. If we can represent these proteomic traits mathematically, can we put them back together again: a model of a metaproteome?