Material Informatics x Sparse Modeling Vol.1: RDKit and Lasso

Hello everyone. This is Unseo, a data scientist here at HACARUS – today I will introduce some of our latest work in bringing our technology to bear in the field of Material Informatics.

HACARUS provides solutions that use sparse modeling to answer the “why?” behind a result, in addition to the “prediction” that conventional machine learning already delivers.
Analysis based on sparse modeling is also well suited to drug discovery and material development.

Today, I would like to introduce an example of sparse modeling application to a simulated problem in materials informatics.

Before we begin

What is Materials Informatics (MI)?

Materials informatics (hereinafter referred to as MI) is a field that realizes efficient material development using computer science.

Material development takes a lot of time and effort – to make a substance with the desired properties, we need to repeat actions such as the following:
1. Predict necessary materials and methods using theory and empirical rules.
2. Make an experimental plan.
3. Actually create the material.
4. Test if the desired properties are obtained.
5. If it doesn’t work, start over.

However, thanks to recent improvements in computer performance, it has become possible to predict physical properties and functions with high accuracy by applying machine learning to simulations and large amounts of data.
This has led to the establishment of fields such as MI and chemoinformatics, with MI focusing particularly on material development.

Furthermore, MI based on machine learning has been gaining momentum in recent years.
Until recently, the mainstream approach was to run quantum chemistry simulations on supercomputers. However, simulating every one of the huge number of candidate substances requires an unrealistic amount of computation, even for supercomputers.

This is why more and more teams are building machine learning models from past experimental data and using them to narrow down promising candidate substances, so that high-quality candidates can be found in a short time.

MI x Sparse Modeling

The application of sparse modeling to MI makes a lot of sense.
With “conventional” machine learning, it is usually difficult to understand why a model made a given prediction, even when its predictive performance is good.
With sparse modeling, on the other hand, we can extract the essential elements from a small amount of data and estimate which factors are important for the prediction.

For detailed information about sparse modeling and Lasso, please see the article (in Japanese) “Why was sparse modeling born? Introducing the representative algorithm ‘LASSO’”.

When applied to material development, sparse modeling helps us understand the factors that contribute to the desired properties, which makes it easier to set guidelines for the next substance to try.
In this way, sparse modeling helps experiments run more efficiently.

In the next section, we will predict the solubility of various molecules in water using Lasso regression, a typical example of sparse modeling.
Then we will look at a structural analysis of the factors that are important for the prediction.

Lasso Prediction of Water Solubility

For this experiment, I referred to Chapter 9 of the book “Practical Materials Informatics” [1].
The book covers everything from programming (how to use Python and the computational chemistry library RDKit) to machine learning, Bayesian optimization, and problem solving in real MI projects.
The code used in the book is published on its support page, so you can download it and run it in your own environment.

Data

We used the open dataset called ESOL [2].
It records 1128 molecular structures (as SMILES strings) together with logP (the octanol-water partition coefficient), which serves as an index of water solubility.
From this data, we will create features using only the chemical structure (SMILES) and build a Lasso model that predicts logP.

Partial data of ESOL


Features

Using RDKit, which is also used in the book [1] introduced earlier, we created a feature called the Morgan fingerprint from each chemical structure. A Morgan fingerprint expresses the substructures of a molecule as a binary vector.

For example, if the bit corresponding to the hydroxy group is the third bit, the vector of a substance that has a hydroxy group can be written as (0, 0, 1, …). This article uses a 4096-bit Morgan fingerprint, so up to 4096 substructures can be represented.
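As a rough illustration of the paragraph above, here is a minimal sketch of turning one SMILES string into a 4096-bit Morgan fingerprint with RDKit. The helper name `morgan_bits` and the radius of 2 are assumptions for illustration; the article does not state which radius was used, and newer RDKit releases also offer a generator-based API that could be used instead.

```python
import numpy as np
from rdkit import Chem
from rdkit.Chem import AllChem


def morgan_bits(smiles, radius=2, n_bits=4096):
    """Return a 0/1 NumPy vector for the Morgan fingerprint of one SMILES.

    `radius=2` is an assumed value; the article only specifies 4096 bits.
    """
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError(f"could not parse SMILES: {smiles!r}")
    fp = AllChem.GetMorganFingerprintAsBitVect(mol, radius, nBits=n_bits)
    vec = np.zeros(n_bits, dtype=np.uint8)
    # Set a 1 at every bit position that is switched on in the fingerprint.
    vec[list(fp.GetOnBits())] = 1
    return vec


# Ethanol contains a hydroxy group, so a handful of bits are switched on.
ethanol = morgan_bits("CCO")
print(ethanol.shape, int(ethanol.sum()))
```

Stacking these vectors row by row for every molecule would give the feature matrix that a regression model can consume.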

(The figure is taken from reference [3], p. 4.)

Training

We split the data so that the trained model can be evaluated fairly later. 80% of the original data was used as training data to train a Lasso regression model that predicts logP from the Morgan fingerprint of each molecule.

In other words, the model predicts a property from the structure of the molecule alone. A simple model like Lasso is attractive because training is fast. In fact, on my MacBook (CPU: Core M, memory: 8 GB), a single training run completes in 1.26 seconds.
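The 80/20 split and the Lasso fit described above can be sketched with scikit-learn as follows. This is not the article's actual code: the data here is synthetic (random binary "fingerprints" with a known sparse relationship), and `alpha=0.01` and the random seeds are assumed values, since the article does not report its hyperparameters.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the fingerprint matrix: 400 "molecules", 256 bits.
rng = np.random.default_rng(0)
n_samples, n_bits = 400, 256
X = rng.integers(0, 2, size=(n_samples, n_bits)).astype(float)

# Only the first five bits truly influence the target (toy values).
true_coef = np.zeros(n_bits)
true_coef[:5] = [1.5, -2.0, 0.8, -1.2, 0.5]
y = X @ true_coef + rng.normal(scale=0.1, size=n_samples)

# Hold out 20% of the data for evaluation, as in the article.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# alpha controls the strength of the L1 penalty (assumed value).
model = Lasso(alpha=0.01)
model.fit(X_train, y_train)

print(X_train.shape, X_test.shape)
print("non-zero coefficients:", int(np.count_nonzero(model.coef_)))
```

With real data, `X` would hold the Morgan fingerprints and `y` the measured target values; the held-out `X_test` and `y_test` are then used only for evaluation.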

Prediction results

When the trained model predicted logP for the remaining 20% of the data, which was not used for training, the R2 score was 0.723.

The R2 score is one of the evaluation metrics for regression models; the closer it is to 1, the better the performance. As a rule of thumb, a score below 0.6 is said to indicate weak predictive power, and a score above 0.9 is a reason to suspect overfitting, so I consider this result reasonably accurate.
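For readers unfamiliar with the metric, the sketch below computes the R2 score from its usual definition (one minus the ratio of the residual sum of squares to the total sum of squares) and checks it against scikit-learn's `r2_score`. The numbers are toy values chosen for easy arithmetic, not results from the article.

```python
import numpy as np
from sklearn.metrics import r2_score


def r2_by_hand(y_true, y_pred):
    """R2 = 1 - SS_res / SS_tot, the standard coefficient of determination."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)         # error left after prediction
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # error of predicting the mean
    return 1.0 - ss_res / ss_tot


# Toy measured and predicted values (illustrative only).
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.1, 1.9, 3.2, 3.8]

manual = r2_by_hand(y_true, y_pred)
library = r2_score(y_true, y_pred)
print(round(manual, 4), round(library, 4))
```

For these toy values the residual sum of squares is 0.10 and the total sum of squares is 5.0, so both computations give about 0.98. A model that always predicts the mean of the measured values scores 0.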

The following figure is a scatter plot with the measured logP on the horizontal axis and the predicted value on the vertical axis. The closer each point is to the green line, the better the prediction, and we can see that the predictions are reasonably good even for unseen data (Test).

Lasso's predictive performance



Lasso estimation of important structures

Next, we will move on to the main subject, analysis of structures important for prediction.
But before that, let me explain why Sparse Modeling can estimate important factors for prediction.

Why Lasso can explain predictions

One of the major characteristics of Lasso regression is sparsity. To make this easier to understand, let’s compare it with Ridge regression, a linear regression model similar to Lasso.

The following figures plot the magnitudes of the regression coefficients of Ridge regression and Lasso regression side by side. (Drawn with 1024 bits for clarity.)

We can see that the Lasso coefficients contain many more zeros than the Ridge coefficients. This property is called sparsity.

Each coefficient corresponds to one bit of the feature vector, that is, to one substructure of the molecule. In the Lasso regression, most substructures are not used for prediction at all, and only a few are involved.

By narrowing the important elements down to a small number, we can identify the important elements that several compounds have in common.
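The difference in sparsity can be reproduced on synthetic data with a short sketch like the one below. The data, the `alpha` values, and the seed are all assumptions for illustration; the point is only that the L1 penalty of Lasso drives many coefficients to exactly zero, whereas the L2 penalty of Ridge merely shrinks them.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

# Synthetic binary features where only five of 200 bits matter (toy values).
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(300, 200)).astype(float)
true_coef = np.zeros(200)
true_coef[:5] = [2.0, -1.5, 1.0, -1.0, 0.5]
y = X @ true_coef + rng.normal(scale=0.1, size=300)

# Fit both linear models on the same data (alpha values are assumed).
lasso = Lasso(alpha=0.02).fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)

# Count coefficients that are exactly zero in each model.
lasso_zeros = int(np.sum(lasso.coef_ == 0))
ridge_zeros = int(np.sum(ridge.coef_ == 0))
print("exact zeros, Lasso:", lasso_zeros)
print("exact zeros, Ridge:", ridge_zeros)
```

In a setup like this, most Lasso coefficients come out exactly zero while Ridge leaves essentially every coefficient non-zero, which mirrors the two coefficient plots.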

Ridge's coefficient


Lasso's coefficient


Next, let’s look at the top 10% of the Lasso regression coefficients, ranked by absolute value (larger absolute value = more important).

In the following figure, the horizontal axis shows the bit number (= partial structure), and the vertical axis shows the value of the Lasso regression coefficient corresponding to the bit.
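One way to pick out those coefficients is sketched below. The helper `top_fraction_by_abs` is hypothetical, and it assumes that "top 10%" is taken among the non-zero coefficients; the article does not say whether the percentage refers to all bits or only the non-zero ones. The coefficient values are toy numbers.

```python
import numpy as np


def top_fraction_by_abs(coef, fraction=0.1):
    """Return bit indices of the largest-|coef| entries among non-zero ones.

    Assumption: the fraction is applied to the non-zero coefficients only.
    """
    coef = np.asarray(coef, dtype=float)
    nonzero = np.flatnonzero(coef)
    if nonzero.size == 0:
        return nonzero
    k = max(1, int(round(fraction * nonzero.size)))
    # Sort non-zero coefficients by descending absolute value.
    order = np.argsort(-np.abs(coef[nonzero]), kind="stable")
    return nonzero[order[:k]]


# Toy coefficient vector: 30 bits, 10 of them non-zero.
coef = np.zeros(30)
coef[[2, 5, 9, 14, 20, 21, 22, 23, 24, 25]] = [
    0.3, -1.2, 0.05, 0.8, -0.1, 0.2, -0.4, 0.6, -0.7, 0.15
]

print(top_fraction_by_abs(coef))                # top 10% of 10 -> 1 bit
print(top_fraction_by_abs(coef, fraction=0.3))  # top 30% of 10 -> 3 bits
```

The returned indices are bit numbers, so they can be looked up directly in the fingerprint to find the corresponding substructures.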

Top 10% coefficient contributing to prediction


We can see that bits 2231 and 455, on the right side of the figure, have large negative coefficients.
Because a smaller logP means higher water solubility, molecules that contain the substructures of bits 2231 and 455 are more likely to be soluble in water.

Now you might be wondering what kind of structure these bits actually refer to.
RDKit allows you to visualize the substructures that correspond to given bits.

Visualization of partial structure

The figure below shows substances taken from data unseen by the model, each containing a substructure whose Lasso regression coefficient is in the top 10%, together with drawings of those substructures.

It’s a little hard to see, but the bit number of each substructure and its coefficient are listed below the substructure.
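The link between a bit number and a substructure can be sketched with RDKit's `bitInfo` mechanism, which records the central atom and radius of each environment that set a bit. The helper `bit_fragments` below is hypothetical and returns each substructure as a SMILES fragment rather than a drawing; the radius of 2 is again an assumed value, and this is not necessarily how the article's figure was produced.

```python
from rdkit import Chem
from rdkit.Chem import AllChem


def bit_fragments(smiles, radius=2, n_bits=4096):
    """Map each on-bit of a Morgan fingerprint to a fragment SMILES."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError(f"could not parse SMILES: {smiles!r}")
    bit_info = {}  # filled by RDKit: bit -> ((center_atom, radius), ...)
    AllChem.GetMorganFingerprintAsBitVect(
        mol, radius, nBits=n_bits, bitInfo=bit_info
    )
    fragments = {}
    for bit, environments in bit_info.items():
        center, env_radius = environments[0]  # first occurrence is enough
        bonds = []
        if env_radius > 0:
            # Bond indices within env_radius bonds of the central atom.
            bonds = list(
                Chem.FindAtomEnvironmentOfRadiusN(mol, env_radius, center)
            )
        if not bonds:
            # Radius-0 environment: the fragment is just the central atom.
            fragments[bit] = Chem.MolFragmentToSmiles(mol, atomsToUse=[center])
            continue
        atoms = set()
        for bond_idx in bonds:
            bond = mol.GetBondWithIdx(bond_idx)
            atoms.update((bond.GetBeginAtomIdx(), bond.GetEndAtomIdx()))
        fragments[bit] = Chem.MolFragmentToSmiles(
            mol, atomsToUse=sorted(atoms), bondsToUse=bonds, rootedAtAtom=center
        )
    return fragments


# List the substructure behind every on-bit of ethanol.
for bit, fragment in sorted(bit_fragments("CCO").items()):
    print(bit, fragment)
```

Combining this mapping with the selected bit numbers and their Lasso coefficients gives a table of substructures and weights. RDKit's `Draw` module also provides helpers such as `Draw.DrawMorganBit` for rendering a bit as an image, which may be closer to what is shown in the figure.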

Among the substructures of the substance 5-(3-Methyl-2-butenyl)-5-ethylbarbital, those that contribute strongly to logP have bit numbers 807 and 950, with coefficients of 0.58 and 0.13, respectively.
Bit number 807, which is common to all three substances, appears in many different substances and has a relatively large coefficient, so it is considered a fairly important factor.

In addition, although few substances contain the structure of bit 116 found in Acephate, its coefficient is larger than that of bit 807, so its effect on logP is greater.

In Conclusion

We were able to visually understand the structures that are important for water solubility, by predicting the water solubility of various substances by Lasso regression and selecting and visualizing the partial structures that are important for prediction.

By using sparse modeling to extract the substructures that contribute to the desired properties from the data, it becomes possible to know what kind of substance we should make next. If we know that, we may be able to run the cycle of experiment and analysis faster and more efficiently than before.

It’s true that you might be able to make better predictions using Deep Learning or other complex models instead of Lasso. However, it is very difficult for humans to understand the reasoning behind the decisions of such a complicated model.

As introduced in this article, a simple and highly interpretable model like Lasso is key to providing human-friendly results, which makes it easier to conduct the validation process by experts.

Lastly, an analysis using machine learning like this one may contain major chemical mistakes as well as surprisingly good results.
We data scientists cannot judge its validity on our own, but it can provide analyses that help streamline experiments.

The field of MI is a fusion of computer science and material development. It is important for material development experts and us analysts to deepen our mutual understanding as we move projects forward.


If you have technology and data for material development, but don’t know how to use the data, HACARUS may be able to help.
For more information on using our expertise for your development projects, please contact us here.


References

[1] [FUNATSU Kimito (船津 公人), SHIBAYAMA Shojiro (柴山 翔二郎), “Practical Materials Informatics” (実践 マテリアルズインフォマティクス), Kindai Kagaku Sha Co., Ltd. (2020)](https://www.kindaikagaku.co.jp/information/kd0615.html)

[2] Delaney, John S. “ESOL: estimating aqueous solubility directly from molecular structure.” Journal of chemical information and computer sciences 44.3 (2004): 1000-1005.

[3] [Gregory Landrum, “fingerprints in the RDKit” RDKit UGM 2012, London (2012)](https://www.rdkit.org/UGM/2012/Landrum_RDKit_UGM.Fingerprints.Final.pptx.pdf)
