Common Sense: genotype

Tuesday, January 06, 2026

Machine learning reveals hidden dimensions of functional similarity in proteins

Recommended!

Caveat: I did not read the entire article, which is long.

"Large language models trained on biological sequences, rather than natural language, are transforming biology, from predicting human genetic disease (1, 2) to the design of new-to-nature proteins (3–5). In this issue of PNAS, Cao et al. (6) extend these applications to detect the molecular underpinnings of phenotypic convergence by decoding patterns invisible to traditional sequence analysis approaches ...

Protein Language Models: Decoding Molecular Convergence

Protein language models (PLMs)—deep neural networks trained on millions of protein sequences—have emerged as powerful approaches for capturing the complex relationships between protein sequence, structure, and function (15–18). These models, adapted from natural language processing architectures, like transformers, learn to represent proteins as high-dimensional embeddings, numerical vectors that encode information about structural propensities, functional annotations, mutational effects, and evolutionary constraints.

PLMs are trained through self-supervised learning on large databases of protein sequences without explicit structural or functional labels. By learning to predict hidden amino acids (masked learning) or the next residue in a sequence (autoregressive), these models develop a representation of the “grammar” of proteins—which combinations of amino acids are permissible, which residues tend to co-occur, and which patterns correlate with specific structural and functional properties (3, 19–21). ...

They show that ACEP [Adaptive Convergence by Embedding of Protein] successfully identifies embedding-level convergence across the three test cases, demonstrating that PLMs have learned to recognize functional similarities. ..."
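To make the "predict hidden amino acids" idea in the quoted passage concrete, here is a minimal sketch of how one masked training example might be built. It is only an illustration, not the code of any particular PLM: the function names, the `#` mask symbol, the 15% mask fraction, and the example sequence are all my own assumptions.

```python
import random

MASK = "#"  # hypothetical mask symbol; real models use a dedicated mask token


def make_masked_example(sequence, mask_fraction=0.15, seed=0):
    """Hide a random subset of residues.

    Returns the masked sequence and a dict mapping each hidden
    position to the residue the model would be asked to predict.
    """
    rng = random.Random(seed)
    n_mask = max(1, round(len(sequence) * mask_fraction))
    positions = sorted(rng.sample(range(len(sequence)), n_mask))
    residues = list(sequence)
    targets = {}
    for pos in positions:
        targets[pos] = residues[pos]  # remember the true residue
        residues[pos] = MASK          # hide it from the model input
    return "".join(residues), targets


def restore(masked, targets):
    """Put the hidden residues back (what a perfect model would output)."""
    residues = list(masked)
    for pos, residue in targets.items():
        residues[pos] = residue
    return "".join(residues)


# Arbitrary example string of amino acid letters, not a real protein of interest.
example = "MKTAYIAKQRQISFVKSHFSRQ"
masked, targets = make_masked_example(example)
print(masked)
print(targets)
```

During training, a model would see only the masked string and be scored on how well it recovers the hidden residues. No structural or functional labels are involved, which is what "self-supervised" means in the quote.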

From the significance and abstract:

"Significance

In biology, repeated emergence of the same functional trait in evolution is important as it provides opportunity to decode the relations between genome or protein sequences to specific functions. Such functional convergence has been largely linked to sequence convergence at the level of single sites, because conventional methods cannot measure similarity of high-order features of sequences. This study reveals that the recent protein language models can extract embeddings from protein sequences reflecting high-order features, and develops statistical tests to evaluate the adaptive convergence of such features. The findings emphasize an underrated sequence basis for functional trait convergence in evolution, provide corresponding detection framework, and demonstrate potential power of deep learning in investigating the complex sequence–function mapping in evolutionary biology.

Abstract

Convergent evolution, or convergence, refers to repeated, independent emergences of the same trait in two or more lineages of species during evolution, often indicating functional adaptation to specific environmental factors.

Many computational methods have been proposed to investigate the genetic basis for organismal functional convergence, as an important way to decode the complex sequence–function map of proteins. These methods mostly focus on the convergence of amino acid states at the level of individual sites in functionally related proteins.

However, even without site-level sequence similarity, protein function similarity may also stem from convergence of high-order protein features, which cannot be captured by the conventional methods.

To fill this gap, we first derived numerical embeddings from protein sequences by pretrained protein language models (PLM).

In four previously reported cases, we found that functionally convergent proteins have similar embeddings despite no site-level convergence, indicating that PLM embeddings can reflect convergence of high-order protein features.

We then designed a pipeline to detect Adaptive Convergence by Embedding of Protein (ACEP). ACEP tests were significant on known and additional candidate genes with putative adaptive convergence like echolocation and crassulacean acid metabolism.

Genome-wide application showed that the ACEP framework can effectively enrich such candidates. Relations between convergences of PLM embeddings and specific protein physicochemical features were further examined.

In conclusion, PLM embeddings can indicate adaptive convergence of high-order protein features beyond site identities, demonstrating the power of deep learning tools for investigating the complex mapping between molecular sequences and functions."
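The abstract's central claim is that two proteins can look alike at the embedding level while sharing no identical sites. The toy sketch below shows how that can happen in principle. It is emphatically not the ACEP pipeline: I use a crude 2-mer count vector as a stand-in for a PLM embedding, and the two short sequences and all function names are made up for illustration.

```python
import math
from collections import Counter


def site_identity(a, b):
    """Fraction of aligned positions with the same residue (equal lengths assumed)."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def kmer_vector(seq, k=2):
    """Count overlapping k-mers; a crude stand-in for a learned embedding."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[key] * v[key] for key in u.keys() & v.keys())
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


# Two made-up sequences: the second is the first shifted by one residue.
seq_a = "ACDEACDE"
seq_b = "CDEACDEA"

identity = site_identity(seq_a, seq_b)
similarity = cosine(kmer_vector(seq_a), kmer_vector(seq_b))
print(f"site identity: {identity:.2f}, vector similarity: {similarity:.2f}")
```

Position by position the two strings never match, yet their 2-mer vectors are nearly parallel. A real PLM embedding captures far richer features than k-mer counts, and the paper adds statistical tests on top, but the contrast between site-level and vector-level similarity is the same kind of contrast the abstract describes.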

Machine learning reveals hidden dimensions of functional similarity in proteins | PNAS

Language models reveal a complex sequence basis for adaptive convergent evolution of protein functions (no public access)

Fig. 1 Detecting molecular convergence using protein language model embeddings.