Wednesday, February 15, 2023

IBM: An AI foundation model that learns the grammar of molecules

Very impressive! Google is not the only company rapidly advancing machine learning for chemistry! The prospects of this research are great, and this is only the beginning!

"... Introducing MoLFormer-XL, the latest addition to the MoLFormer family of foundation models for molecular discovery. MoLFormer-XL has been pretrained on 1.1 billion molecules represented as machine-readable strings of text. From these simple and accessible chemical representations, it turns out that a transformer can extract enough information to infer a molecule’s form and function. ...
We found that MoLFormer-XL could predict a molecule’s physical properties, like its solubility, its biophysical properties, like its anti-viral activity, and its physiological properties, like its ability to cross the blood-brain barrier. It could even predict quantum properties, like a molecule’s bandgap energies, an indicator of how well it converts sunlight to energy. ...
Many molecular models today rely on graph neural network architectures that predict molecular behavior from a molecule’s 2D or 3D structure. But graph models often require extensive simulations or experiments, or use complex mechanisms, to capture atomic interactions within a molecule. Most graph models, as a result, are limited to datasets of about 100,000 molecules, sharply limiting their ability to make broad predictions. ...
we trained MoLFormer-XL to focus on the interactions between atoms represented in each SMILES string through a new and improved type of rotary embedding. Instead of having the model encode the absolute position of each character in each string, we had it encode the character’s relative position. This additional molecular context seems to have primed the model to learn structural details that make learning downstream tasks much easier. ...
To pack more computation into each GPU, we chose an efficient linear time attention mechanism and sorted our SMILES strings by length before feeding them to the model. Together, both techniques raised our per-GPU processing capacity from 50 molecules to 1,600 molecules, allowing us to get away with 16 GPUs instead of 1,000. By eliminating hundreds of unnecessary GPUs, we consumed 61 times less energy and still had a trained model in five days. ..."
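To make the "relative position" idea concrete: below is a minimal NumPy sketch of rotary position embeddings (RoPE, Su et al. 2021), the general mechanism the IBM post is referring to. The half-split pairing convention, the tiny dimensions, and the character-level tokenization of aspirin's SMILES string are my own illustrative choices, not IBM's actual implementation.

```python
import numpy as np

def rotary_embedding(x, base=10000.0):
    """Apply rotary position embeddings (RoPE) to a sequence of vectors.

    x: array of shape (seq_len, dim), dim even. Each channel pair is rotated
    by an angle proportional to the token's position, so dot products between
    rotated queries and keys depend only on the *relative* offset between
    characters, not on their absolute position in the string.
    """
    seq_len, dim = x.shape
    half = dim // 2
    inv_freq = base ** (-np.arange(half) / half)        # per-pair frequencies
    angles = np.outer(np.arange(seq_len), inv_freq)     # (seq_len, half)
    cos, sin = np.cos(angles), np.sin(angles)

    x1, x2 = x[:, :half], x[:, half:]
    # 2-D rotation applied channel-pair-wise.
    return np.concatenate([x1 * cos - x2 * sin,
                           x1 * sin + x2 * cos], axis=-1)

# Toy illustration on a character-tokenized SMILES string (aspirin).
tokens = list("CC(=O)Oc1ccccc1C(=O)O")
dim = 8
rng = np.random.default_rng(0)
q = rng.normal(size=(len(tokens), dim))   # stand-in query vectors
k = rng.normal(size=(len(tokens), dim))   # stand-in key vectors

q_rot, k_rot = rotary_embedding(q), rotary_embedding(k)
scores = q_rot @ k_rot.T                  # scores now encode relative positions
print(scores.shape)                       # (21, 21)
```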
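And since the post also credits a "linear time attention mechanism", here is one common formulation of kernelized linear attention (in the spirit of Katharopoulos et al., 2020). It shows why the cost drops from quadratic to linear in sequence length; it is a generic sketch, not necessarily the exact variant MoLFormer-XL uses.

```python
import numpy as np

def linear_attention(Q, K, V, eps=1e-6):
    """Kernelized linear attention.

    Softmax attention builds an n-by-n score matrix, costing O(n^2) in the
    sequence length n. Replacing the softmax with a positive feature map phi
    lets us regroup the computation as phi(Q) @ (phi(K).T @ V), which never
    materializes the n-by-n matrix and runs in O(n).
    """
    phi = lambda x: np.where(x > 0, x + 1.0, np.exp(x))  # elu(x) + 1
    Qp, Kp = phi(Q), phi(K)              # (n, d)
    kv = Kp.T @ V                        # (d, d_v), shared by every query
    z = Qp @ Kp.sum(axis=0)              # (n,) normalizing terms
    return (Qp @ kv) / (z[:, None] + eps)

rng = np.random.default_rng(0)
n, d = 21, 8                             # e.g. 21 characters of a SMILES string
Q, K, V = (rng.normal(size=(n, d)) for _ in range(3))
print(linear_attention(Q, K, V).shape)   # (21, 8)
```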
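The other throughput trick, sorting SMILES strings by length before batching, is even simpler. The toy helper and molecules below are just an illustration of why it saves padded (wasted) positions; none of this is IBM's code.

```python
def length_sorted_batches(smiles_list, batch_size):
    """Yield batches of SMILES strings grouped by similar length.

    Each batch is padded to its longest member, so batching strings of
    similar length wastes far fewer padded positions per GPU.
    """
    ordered = sorted(smiles_list, key=len)
    for i in range(0, len(ordered), batch_size):
        yield ordered[i:i + batch_size]

smiles = [
    "CCO",                            # ethanol
    "O=C=O",                          # carbon dioxide
    "c1ccccc1",                       # benzene
    "CC(=O)Oc1ccccc1C(=O)O",          # aspirin
    "CN1C=NC2=C1C(=O)N(C)C(=O)N2C",   # caffeine
    "N#N",                            # nitrogen
]

def padded_positions(batches):
    # Total character slots after padding each batch to its longest string.
    return sum(len(b) * max(len(s) for s in b) for b in batches)

naive = [smiles[i:i + 2] for i in range(0, len(smiles), 2)]
bucketed = list(length_sorted_batches(smiles, batch_size=2))
print(padded_positions(naive), padded_positions(bucketed))  # bucketed pads less
```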

From the abstract:
"Models based on machine learning can enable accurate and fast molecular property predictions, which is of interest in drug discovery and material design. Various supervised machine learning models have demonstrated promising performance, but the vast chemical space and the limited availability of property labels make supervised learning challenging. Recently, unsupervised transformer-based language models pretrained on a large unlabelled corpus have produced state-of-the-art results in many downstream natural language processing tasks. Inspired by this development, we present molecular embeddings obtained by training an efficient transformer encoder model, MOLFORMER, which uses rotary positional embeddings. This model employs a linear attention mechanism, coupled with highly distributed training, on SMILES sequences of 1.1 billion unlabelled molecules from the PubChem and ZINC datasets. We show that the learned molecular representation outperforms existing baselines, including supervised and self-supervised graph neural networks and language models, on several downstream tasks from ten benchmark datasets. They perform competitively on two others. Further analyses, specifically through the lens of attention, demonstrate that MOLFORMER trained on chemical SMILES indeed learns the spatial relationships between atoms within a molecule. These results provide encouraging evidence that large-scale molecular language models can capture sufficient chemical and structural information to predict various distinct molecular properties, including quantum-chemical properties."

An AI foundation model that learns the grammar of molecules | IBM Research Blog
Meet MoLFormer-XL, a pretrained AI model that infers the structure of molecules from simple representations, making it faster and easier to screen molecules for new applications or create them from scratch.
