Thursday, February 20, 2025

Generative AI tool Evo 2 marks a milestone in biology by predicting form and function of proteins in the DNA of all domains of life except viruses

Amazing stuff!

"Imagine being able to speed up evolution – hypothetically – to learn which genes might have a harmful or beneficial effect on human health. Imagine, further, being able to rapidly generate new genetic sequences that could help cure disease or solve environmental challenges. Now, scientists have developed a generative AI tool that can predict the form and function of proteins coded in the DNA of all domains of life, identify molecules that could be useful for bioengineering and medicine, and allow labs to run dozens of other standard experiments with a virtual query – in minutes or hours instead of years (or millennia). ...

Evo 2 was trained on a dataset that includes all known living species, including humans, plants, bacteria, amoebas, and even a few extinct species. ...

In this way, Evo 2 is able to generate – to write – new genetic code that has never existed before. With Evo 2, you can enter a sequence of up to 1 million nucleotides. ...

Evo 2, on the other hand, also includes the known genomes of 15,000 or so plants and animals – the eukaryotes – which includes humans. Our dataset has now expanded from about 300 billion nucleotides to almost 9 trillion with Evo 2. In terms of safety, we have left out the genomes of viruses to prevent Evo 2 from being used to create new or more dangerous diseases. ...

If you want to design a new gene, you prompt the model with the beginning of a gene sequence of base pairs, and Evo 2 will autocomplete the gene. ...

With Evo 2, we can be more direct and steer toward mutations that have useful functions. Evo 2 also includes machine learning models that will tell you if the sequence exists in nature and predict how this new sequence will function in real life. ...

The model is actually very good at distinguishing which mutations are just random, harmless variations and which cause disease. ..."
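
To make the "autocomplete" idea in the quote concrete, here is a minimal sketch of prompting a sequence model with the start of a gene and letting it extend the sequence one nucleotide at a time. It does not use Evo 2 or its API. The model is a toy k-mer frequency table, and the names `fit_kmer_model`, `autocomplete`, and `toy_training` are hypothetical, chosen only for illustration.

```python
from collections import Counter, defaultdict


def fit_kmer_model(training_seq, k=3):
    """Count which nucleotide follows each k-mer in the training sequence."""
    counts = defaultdict(Counter)
    for i in range(len(training_seq) - k):
        counts[training_seq[i:i + k]][training_seq[i + k]] += 1
    return counts


def autocomplete(prompt, counts, n_new, k=3):
    """Greedily extend the prompt by up to n_new nucleotides."""
    seq = prompt
    for _ in range(n_new):
        followers = counts.get(seq[-k:])
        if not followers:
            break  # context never seen in training: stop early
        # Pick the most frequent follower; ties go to the alphabetically first base.
        seq += max(sorted(followers), key=lambda nt: followers[nt])
    return seq


# Toy training data (an assumption for illustration, not a real genome).
toy_training = "ATGGCCATGGCCATGGCC"
model = fit_kmer_model(toy_training)
completed = autocomplete("ATG", model, n_new=6)
print(completed)  # prints ATGGCCATG
```

A real genomic language model replaces the frequency table with a neural network and a far longer context, but the prompt-then-extend loop has the same shape.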

From the abstract:
"All of life encodes information with DNA. While tools for sequencing, synthesis, and editing of genomic code have transformed biological research, intelligently composing new biological systems would also require a deep understanding of the immense complexity encoded by genomes. We introduce Evo 2, a biological foundation model trained on 9.3 trillion DNA base pairs from a highly curated genomic atlas spanning all domains of life. We train Evo 2 with 7B and 40B parameters to have an unprecedented 1 million token context window with single-nucleotide resolution. Evo 2 learns from DNA sequence alone to accurately predict the functional impacts of genetic variation—from noncoding pathogenic mutations to clinically significant BRCA1 variants—without task-specific fine tuning.
Applying mechanistic interpretability analyses, we reveal that Evo 2 autonomously learns a breadth of biological features, including exon–intron boundaries, transcription factor binding sites, protein structural elements, and prophage genomic regions.
Beyond its predictive capabilities, Evo 2 generates mitochondrial, prokaryotic, and eukaryotic sequences at genome scale with greater naturalness and coherence than previous methods.
Guiding Evo 2 via inference-time search enables controllable generation of epigenomic structure, for which we demonstrate the first inference-time scaling results in biology. We make Evo 2 fully open, including model parameters, training code, inference code, and the OpenGenome2 dataset, to accelerate the exploration and design of biological complexity."
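
One common way a sequence model can score a mutation without task-specific training is to compare how likely the model finds the mutated sequence against the reference sequence. The quoted abstract does not spell out Evo 2's exact procedure, so the sketch below only illustrates that general idea. It uses a toy smoothed k-mer model, and the names `fit_counts`, `log_likelihood`, and `variant_score` are hypothetical.

```python
import math
from collections import Counter, defaultdict

ALPHABET = "ACGT"


def fit_counts(seq, k=2):
    """Count which nucleotide follows each k-mer in the training sequence."""
    counts = defaultdict(Counter)
    for i in range(len(seq) - k):
        counts[seq[i:i + k]][seq[i + k]] += 1
    return counts


def log_likelihood(seq, counts, k=2):
    """Sum of log-probabilities of each base given the k bases before it."""
    total = 0.0
    for i in range(k, len(seq)):
        ctx = counts.get(seq[i - k:i], Counter())
        # Add-one smoothing so unseen transitions keep a nonzero probability.
        p = (ctx[seq[i]] + 1) / (sum(ctx.values()) + len(ALPHABET))
        total += math.log(p)
    return total


def variant_score(reference, position, alt, counts, k=2):
    """Log-likelihood of the mutated sequence minus that of the reference."""
    mutated = reference[:position] + alt + reference[position + 1:]
    return log_likelihood(mutated, counts, k) - log_likelihood(reference, counts, k)


# Toy data (assumptions for illustration only).
toy_counts = fit_counts("ATGGCCATGGCCATGGCC")
reference = "ATGGCCATGG"
score = variant_score(reference, position=4, alt="T", counts=toy_counts)
print(score < 0)  # prints True
```

A negative score means the toy model finds the mutated sequence less plausible than the reference. Turning such a score into a call of "harmless" or "disease-causing" would still need calibration against labeled variants.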

Generative AI tool marks a milestone in biology | Stanford Report: "Trained on a dataset that includes all known living species – and a few extinct ones – Evo 2 can predict the form and function of proteins in the DNA of all domains of life and run experiments in a fraction of the time it would take a traditional lab."

AI can now model and design the genetic code for all domains of life with Evo 2: "Arc Institute develops the largest AI model for biology to date in collaboration with NVIDIA, bringing together Stanford University, UC Berkeley, and UC San Francisco researchers"




