The term "big data" has been used so much that it has almost lost all meaning. However, I think we can agree that all of the world's microbial genomes, deposited in a handful of databases, represent a pretty big dataset.
The question is, how do you search it? The short answer is that you can't, or at least you couldn't. Yes, tools such as BLAST let you search sequence databases, but a recent study in Nature Biotechnology had this to say about current methods:
"At first sight, BLAST and its successor algorithms seem to enable these searches, performing alignment of query sequences against large databases. There are two reasons why these tools do not suffice. First, they would not scale to databases the size of the European Nucleotide Archive or beyond. Second, they require assembled genomes as input; if applied to raw sequence data they would only find matches completely contained within a single read. However, assembly is fundamentally lossy when the input data contain multiple strains, and the highly heterogeneous historical data in the ENA would result in very variable assemblies, particularly of plasmids."
So the existing algorithms weren't cutting it, and that was a problem. Time and money had been spent, experiments had been done, and genomes had been sequenced, yet we had no practical way to search the resulting data. The answers to crucial biological questions were buried.
Researchers based in Oxford drew on techniques from web search to produce a new data structure for genomes called the BItsliced Genomic Signature Index (BIGSI). The hope is that this will solve the scaling problem with BLAST and let any researcher search the vast amounts of deposited data. This has huge implications for infection surveillance, and also for finding rare events and mutations that can only be detected in huge datasets.
You can try it out here:
Exponentially increasing amounts of unprocessed bacterial and viral genomic sequence data are stored in the global archives. The ability to query these data for sequence search terms would facilitate both basic research and applications such as real-time genomic epidemiology and surveillance. However, this is not possible with current methods. To solve this problem, we combine knowledge of microbial population genomics with computational methods devised for web search to produce a searchable data structure named BItsliced Genomic Signature Index (BIGSI). We indexed the entire global corpus of 447,833 bacterial and viral whole-genome sequence datasets using four orders of magnitude less storage than previous methods. We applied our BIGSI search function to rapidly find resistance genes MCR-1, MCR-2, and MCR-3, determine the host-range of 2,827 plasmids, and quantify antibiotic resistance in archived datasets. Our index can grow incrementally as new (unprocessed or assembled) sequence datasets are deposited and can scale to millions of datasets.
Ultrafast search of all deposited bacterial and viral genomic data
Phelim Bradley, Henk C. den Bakker, Eduardo P. C. Rocha, Gil McVean & Zamin Iqbal
Nature Biotechnology volume 37, pages 152–159 (2019)
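To get a feel for what a "bitsliced signature index" might look like, here is a toy Python sketch. It is not the authors' implementation, and nothing above spells out BIGSI's internals. The sketch assumes the general idea is one Bloom filter of k-mers per dataset, stored as the columns of a bit matrix, so that a query only has to read a few rows. The class name, the parameters (`num_bits`, `num_hashes`, `k`) and the SHA-256 based hashing are all illustrative choices of mine, and reverse complements are ignored for brevity.

```python
import hashlib


class ToyBitslicedIndex:
    """Toy illustration only: one Bloom filter per dataset, stored column-wise."""

    def __init__(self, num_bits=1024, num_hashes=3, k=5):
        # All three parameters are arbitrary values picked for this sketch.
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.k = k
        self.names = []
        # rows[i] is an integer used as a bit vector: bit j is set when
        # dataset j has bit i set in its Bloom filter (one "bit slice" per row).
        self.rows = [0] * num_bits

    def _kmers(self, seq):
        seq = seq.upper()
        return {seq[i:i + self.k] for i in range(len(seq) - self.k + 1)}

    def _positions(self, kmer):
        # Derive num_hashes row positions from seeded SHA-256 digests.
        positions = []
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{kmer}".encode()).digest()
            positions.append(int.from_bytes(digest[:8], "big") % self.num_bits)
        return positions

    def add(self, name, seq):
        # Adding a dataset appends a new column; existing columns are untouched.
        column = len(self.names)
        self.names.append(name)
        for kmer in self._kmers(seq):
            for pos in self._positions(kmer):
                self.rows[pos] |= 1 << column

    def search(self, query):
        kmers = self._kmers(query)
        if not kmers:
            raise ValueError(f"query must be at least {self.k} bases long")
        # Start with every dataset as a candidate, then AND in each row the
        # query touches. Survivors may contain all query k-mers; Bloom filters
        # allow false positives but never false negatives.
        hits = (1 << len(self.names)) - 1
        for kmer in kmers:
            for pos in self._positions(kmer):
                hits &= self.rows[pos]
        return [name for j, name in enumerate(self.names) if (hits >> j) & 1]


index = ToyBitslicedIndex()
index.add("sample_a", "ACGTACGTTGCA")
index.add("sample_b", "TTTTGGGGCCCC")
print(index.search("GTACGTT"))
```

In this sketch, adding a dataset only appends a column, which is one plausible reading of how an index could "grow incrementally" as the abstract puts it. A search touches only the rows its k-mers hash to, no matter how many datasets are indexed, and any hit would still need to be confirmed against the real data because of false positives.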