Friday, June 07, 2024

Google for DNA indexes 10% of world’s known genetic sequences

Good news! This is only the beginning! Let the new discoveries flow!

"A tool that functions like a Google for DNA has demonstrated its promise for making all of the world’s biological sequence data cheaply and easily searchable, according to the Swiss team that developed it. In a proof of principle study, the researchers say they successfully indexed 10% of the world’s known DNA, RNA, and protein sequences—and the same method could be used to do the rest. ..."

From the abstract:
"The amount of biological sequencing data available in public repositories is growing exponentially, forming an invaluable biomedical research resource. Yet, making it full-text searchable and easily accessible to researchers in life and data science is an unsolved problem. In this work, we take advantage of recently developed, very efficient data structures and algorithms for representing sequence sets. We make Petabases of DNA sequences across all clades of life, including viruses, bacteria, fungi, plants, animals, and humans, fully searchable. Our indexes are freely available to the research community. This highly compressed representation of the input sequences (up to 5800 fold) fits on a single consumer hard drive (≈100 USD), making this valuable resource cost-effective to use and easily transportable. We present the underlying methodological framework, called MetaGraph, that allows us to scalably index very large sets of DNA or protein sequences using annotated De Bruijn graphs. We demonstrate the feasibility of indexing the full extent of existing sequencing data and present new approaches for efficient and cost-effective full-text search at an on-demand cost of $0.10 per queried Mbp. We explore several practical use cases to mine existing archives for interesting associations and demonstrate the utility of our indexes for integrative analyses."
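
For readers wondering what an "annotated De Bruijn graph" index looks like in principle, here is a toy sketch in Python. It is not MetaGraph's implementation or API, and it does none of the compression the abstract describes. It only illustrates the general idea under simple assumptions: every sequence is cut into overlapping k-mers (the graph nodes, with edges implied by k-1 overlaps), and each k-mer is annotated with the labels of the samples that contain it. The names `kmers`, `build_index`, and `query`, the sample labels, and the choice of k = 4 are all hypothetical.

```python
from collections import defaultdict


def kmers(seq, k):
    """Yield every overlapping substring of length k in seq."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def build_index(samples, k):
    """Map each k-mer (a graph node) to the set of sample labels containing it.

    Edges of the de Bruijn graph are implicit: two k-mers are connected
    when the last k-1 characters of one equal the first k-1 of the other.
    """
    index = defaultdict(set)
    for label, seq in samples.items():
        for kmer in kmers(seq, k):
            index[kmer].add(label)
    return index


def query(index, seq, k, min_fraction=1.0):
    """Return {label: fraction of query k-mers found} for labels that
    reach at least min_fraction."""
    query_kmers = list(kmers(seq, k))
    if not query_kmers:
        return {}
    counts = defaultdict(int)
    for kmer in query_kmers:
        for label in index.get(kmer, ()):
            counts[label] += 1
    total = len(query_kmers)
    return {
        label: count / total
        for label, count in counts.items()
        if count / total >= min_fraction
    }


if __name__ == "__main__":
    # Hypothetical toy data, not real sequencing reads.
    samples = {"sampleA": "ACGTACGTGA", "sampleB": "TTGACGTACC"}
    index = build_index(samples, k=4)
    print(query(index, "ACGTAC", k=4))  # query shared by both samples
    print(query(index, "CGTGA", k=4))   # query found only in sampleA
```

A real index at petabase scale depends on succinct graph and annotation representations rather than Python dictionaries, which is where the compression reported in the abstract comes from.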

‘Google for DNA’ indexes 10% of world’s known genetic sequences | Science | AAAS
Achievement demonstrates feasibility of making all of life’s code easily searchable, researchers say



