Common Sense: A new computational technique could make it easier to engineer useful proteins

Wednesday, April 03, 2024

A new computational technique could make it easier to engineer useful proteins

Good news! This is only the beginning! Smoothing helps, but not glossing over! 😊

Interestingly, it appears this research did not rely on very large artificial neural networks.

One of the authors, Tommi Jaakkola, is a well-known ML researcher!

"... “Protein design is a hard problem because the mapping from DNA sequence to protein structure and function is really complex. There might be a great protein 10 changes away in the sequence, but each intermediate change might correspond to a totally nonfunctional protein. ...

They began by training a type of model known as a convolutional neural network (CNN) on experimental data consisting of GFP [green fluorescent protein] sequences and their brightness — the feature that they wanted to optimize.

The model was able to create a “fitness landscape” — a three-dimensional map that depicts the fitness of a given protein and how much it differs from the original sequence — based on a relatively small amount of experimental data (from about 1,000 variants of GFP).

These landscapes contain peaks that represent fitter proteins and valleys that represent less fit proteins. Predicting the path that a protein needs to follow to reach the peaks of fitness can be difficult, because often a protein will need to undergo a mutation that makes it less fit before it reaches a nearby peak of higher fitness. To overcome this problem, the researchers used an existing computational technique to “smooth” the fitness landscape.

Once these small bumps in the landscape were smoothed, the researchers retrained the CNN model and found that it was able to reach greater fitness peaks more easily. The model was able to predict optimized GFP sequences that had as many as seven different amino acids from the protein sequence they started with, and the best of these proteins were estimated to be about 2.5 times fitter than the original. ...

The researchers also showed that this approach worked well in identifying new sequences for the viral capsid of adeno-associated virus (AAV), a viral vector that is commonly used to deliver DNA. In that case, they optimized the capsid for its ability to package a DNA payload. ..."

From the abstract:

"The ability to engineer novel proteins with higher fitness for a desired property would be revolutionary for biotechnology and medicine. Modeling the combinatorially large space of sequences is infeasible; prior methods often constrain optimization to a small mutational radius, but this drastically limits the design space. Instead of heuristics, we propose smoothing the fitness landscape to facilitate protein optimization. First, we formulate protein fitness as a graph signal then use Tikunov [Tikhonov] regularization to smooth the fitness landscape. We find optimizing in this smoothed landscape leads to improved performance across multiple methods in the GFP and AAV benchmarks. Second, we achieve state-of-the-art results utilizing discrete energy-based models and MCMC in the smoothed landscape. Our method, called Gibbs sampling with Graph-based Smoothing (GGS), demonstrates a unique ability to achieve 2.5 fold fitness improvement (with in-silico evaluation) over its training set. GGS demonstrates potential to optimize proteins in the limited data regime. ..."
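To get a feel for what smoothing a fitness landscape as a graph signal can mean, here is a minimal sketch of Tikhonov regularization on a tiny sequence graph. It is my own toy illustration, not the authors' code: the sequences, the fitness values, the rule of connecting sequences that differ by one mutation, and the names `graph_laplacian`, `tikhonov_smooth` and `gamma` are all assumptions. The smoothed signal minimizes ||f - y||^2 + gamma * f^T L f, which has the closed form f = (I + gamma * L)^(-1) y.

```python
import itertools

import numpy as np


def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))


def graph_laplacian(seqs, max_dist=1):
    """Unnormalized Laplacian L = D - A of a graph that links sequences
    within `max_dist` mutations of each other (an assumed edge rule)."""
    n = len(seqs)
    adjacency = np.zeros((n, n))
    for i, j in itertools.combinations(range(n), 2):
        if hamming(seqs[i], seqs[j]) <= max_dist:
            adjacency[i, j] = adjacency[j, i] = 1.0
    degree = np.diag(adjacency.sum(axis=1))
    return degree - adjacency


def tikhonov_smooth(y, laplacian, gamma=1.0):
    """Solve (I + gamma * L) f = y, the minimizer of
    ||f - y||^2 + gamma * f^T L f."""
    n = len(y)
    return np.linalg.solve(np.eye(n) + gamma * laplacian, y)


# Toy data (made up): a chain of single mutations with a bumpy fitness signal.
seqs = ["AAAA", "AAAT", "AATT", "ATTT", "TTTT"]
fitness = np.array([1.0, 0.2, 1.5, 0.3, 3.0])

L = graph_laplacian(seqs)
smoothed = tikhonov_smooth(fitness, L, gamma=1.0)
print(np.round(smoothed, 3))
```

A larger `gamma` flattens the bumps more aggressively, while `gamma=0` returns the original values unchanged. How the paper actually builds its graph and picks the smoothing strength is not covered by this sketch.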

A new computational technique could make it easier to engineer useful proteins | MIT News | Massachusetts Institute of Technology

Improving Protein Optimization with Smoothed Fitness Landscapes (open access)

Figure 1: Overview. (A) Protein optimization is challenging due to a noisy fitness landscape where the starting dataset (unblurred) is a fraction of the landscape with the highest fitness sequences hidden (blurred). (B) We develop Graph-based Smoothing (GS) to estimate a smoothed fitness landscape from the starting data. (C) A model is trained on the smoothed fitness landscape to infer the rest of the landscape. (D) Gradients from the model are used in Gibbs With Gradients (GWG) where on each step a new mutation is proposed. (E) The goal of sampling is for each trajectory to gradually head towards higher fitness.
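Panels (D) and (E) describe sampling trajectories in which each step proposes one new mutation. The sketch below illustrates only that general idea, using a plain Metropolis sampler with uniformly random single-mutation proposals. It is not Gibbs With Gradients, which uses model gradients to choose proposals. The `toy_fitness` function, its target sequence, and all parameter values are hypothetical stand-ins for a trained model.

```python
import math
import random

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids


def toy_fitness(seq, target="MKVLA"):
    """Hypothetical stand-in for a trained fitness model:
    the fraction of positions that match an arbitrary target."""
    return sum(a == b for a, b in zip(seq, target)) / len(target)


def metropolis_mutations(start, fitness_fn, steps=500, temperature=0.1, seed=0):
    """Random-walk Metropolis over single-residue mutations.

    Each step proposes one mutation at a uniformly random position.
    Uphill (or neutral) moves are always accepted; downhill moves are
    accepted with probability exp(delta / temperature).
    Returns the best sequence seen and its fitness.
    """
    rng = random.Random(seed)
    current, current_fit = start, fitness_fn(start)
    best, best_fit = current, current_fit
    for _ in range(steps):
        pos = rng.randrange(len(current))
        proposal = current[:pos] + rng.choice(ALPHABET) + current[pos + 1:]
        proposal_fit = fitness_fn(proposal)
        delta = proposal_fit - current_fit
        if delta >= 0 or rng.random() < math.exp(delta / temperature):
            current, current_fit = proposal, proposal_fit
        if current_fit > best_fit:
            best, best_fit = current, current_fit
    return best, best_fit


best, best_fit = metropolis_mutations("AAAAA", toy_fitness)
print(best, best_fit)
```

Accepting an occasional downhill move is what lets a trajectory cross a valley toward a higher peak. On a smoother landscape fewer such detours should be needed, which is presumably why the smoothing step helps.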
