A Machine Learning Method for Material Property Prediction: Example Polymer Compatibility

Zhilong Liang,1 Zhiwei Li,2 Shuo Zhou,2,3 Yiwen Sun,1,3 Jinying Yuan,2,* and Changshui Zhang 1,4,**
1 Institute for Artificial Intelligence, Tsinghua University (THUAI),
State Key Lab of Intelligent Technologies and Systems,
Beijing National Research Center for Information Science and Technology (BNRist),
Department of Automation, Tsinghua University, Beijing, P. R. China
2 Key Lab of Organic Optoelectronics and Molecular Engineering of Ministry of Education,
Department of Chemistry, Tsinghua University, Beijing, P. R. China
3 These authors contributed equally
4 Lead Contact
*Correspondence: [email protected]
**Correspondence: [email protected]
SUMMARY
Predicting material properties is a key problem because of its significance for material design and screening. We present a new and general machine learning method for material property prediction. As a representative example, polymer compatibility is chosen to demonstrate the effectiveness of our method. Specifically, we mine data from the related literature to build a dedicated database and make predictions based on the basic molecular structures of the blended polymers and, as auxiliary input, the blending composition. Our model obtains at least 75% accuracy on a dataset consisting of thousands of entries. We demonstrate that the relationship between structure and properties can be learned and simulated by a machine learning method.
Keywords: Machine Learning, Material Property Prediction, Polymer Compatibility.

INTRODUCTION
Predicting material properties is highly significant for discovering new materials. The traditional empirical trial-and-error approach is generally time- and labor-intensive, considering the nearly infinite space of material structures and the expertise required [1]. Density functional theory (DFT)-based methods and molecular dynamics simulation provide a new paradigm for studying the properties of materials. DFT has been successfully applied to predicting complex system behavior at the atomic scale, and molecular dynamics simulation helps to predict material structure and thermodynamic properties [2,3]. Nevertheless, numerical algorithms and computing resources become the main bottleneck when it comes to macroscale and many-body systems.

Machine learning (ML), which is data-driven, has achieved great progress on various tasks, such as image recognition [4,5] and natural language processing [6,7]. Recently, scientists have started to apply machine learning to materials research and property prediction. For biological macromolecules, ML achieves high accuracy in protein structure prediction with the AlphaFold2 network [8]. In the field of inorganic materials, ML assists the prediction of the properties of low-dimensional materials [9], ceramics [10], steel fatigue strength [11], alloy miscibility [12], battery electrolyte materials [13] and photovoltaic materials [14]. Researchers also use machine learning to study the properties of metal-organic frameworks (MOFs) and have established several dedicated databases [15,16,17].

Beyond the applications mentioned above, machine learning is especially helpful for research on polymer materials. Polymer materials are widely used in electrical applications, medicine, and all aspects of production, engineering and daily life because of their excellent properties. The properties of polymer materials are closely related to multidimensional factors in polymer synthesis and processing, which makes accurate prediction complicated and challenging [18,19,20,21]. Researchers have made active and effective attempts to apply ML to exploring polymer syntheses and polymer materials [22]. Machine learning has been applied, with good prediction accuracy, to copolymer synthesis and defectivity [23,24,25], mechanical properties of polymer composites [26], liquid crystal behavior of copolyethers [27], thermal conductivity [28], dielectric properties [29], glass transition, melting, and degradation temperatures, and quantum physical and chemical properties [30,31,32,33]. Muramatsu et al. [34] used ML to investigate the relationship between the phase separation structure of a polymer blend and its Young’s modulus, and built a predictive framework that uses two-dimensional images of the polymer blend as the descriptor.

Herein, we apply a new and general ML method to material property prediction. As the first of a series of studies on polymer structure and properties, we focus on polymer blend compatibility, since polymer blends can integrate the advantages of various polymers and play an important role in industry. The core of our method includes: (1) building a framework to mine data from the literature and construct a dedicated dataset; (2) designing a specialized prediction model, the Half Dense Difference Network. Notably, the whole procedure can be readily generalized to other material fields. In the following, we give a brief introduction to polymer compatibility and its prediction methods.

Polymer compatibility is a key physical quantity that influences the properties of a polymer blend. It refers to the total miscibility of homopolymers and of random copolymers with each other on a molecular scale [35]. Poor compatibility severely limits the utility of a polymer blend, so scientists generally prefer polymer combinations with good compatibility and search for them using a range of chemical theories.

This preprint research paper has not been peer reviewed. Electronic copy available at: https://ssrn.com/abstract=4048382
[Figure 1: flow diagram. Polymer A and Polymer B, each with its composition, pass through the Molecular Information Extraction Module and the Representation Model; the Predictive Model then classifies the blend as Compatible or Incompatible.]
Figure 1. Whole Information Flow of Prediction Method
For a binary polymer blend, we collect two kinds of fundamental information: the repeating units and the composition. We transform this information into vector representations and pass them to the Predictive Model, which classifies the blend as compatible or incompatible.

Several theories have been developed to predict polymer blend compatibility. One of the most fundamental models of polymer blends was developed by Flory and Huggins; it describes the thermodynamics of mixing polymers [36]:
$$ \frac{\Delta G_M}{RT} = n_1 \ln\phi_1 + n_2 \ln\phi_2 + n_1 \phi_2 \chi_{12}, \tag{1} $$
where n_i is the number of moles of component i, φ_i is the volume fraction of component i, χ12 is the interaction parameter, R is the gas constant and T is the absolute temperature.
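
To make Equation (1) concrete, the following minimal Python sketch evaluates ΔG_M/(RT) for a binary blend. It assumes φ2 = 1 − φ1 (two components only); the function name and the numerical inputs are illustrative assumptions and are not taken from the paper.

```python
import math


def flory_huggins_dg_over_rt(n1, n2, phi1, chi12):
    """Evaluate Eq. (1): dG_M / (R T) for a binary blend.

    n1, n2 : moles of components 1 and 2
    phi1   : volume fraction of component 1 (assumption: phi2 = 1 - phi1)
    chi12  : Flory-Huggins interaction parameter
    """
    if not 0.0 < phi1 < 1.0:
        raise ValueError("phi1 must lie strictly between 0 and 1")
    phi2 = 1.0 - phi1
    entropy_term = n1 * math.log(phi1) + n2 * math.log(phi2)
    interaction_term = n1 * phi2 * chi12
    return entropy_term + interaction_term


# Illustrative inputs only (not data from the paper).
athermal = flory_huggins_dg_over_rt(1.0, 1.0, 0.5, chi12=0.0)
unfavourable = flory_huggins_dg_over_rt(1.0, 1.0, 0.5, chi12=4.0)
```

With χ12 = 0 only the negative logarithmic terms remain, so the result is negative; a sufficiently large positive χ12 makes the sum positive, which in this model corresponds to unfavourable mixing.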

According to this equation, the Flory–Huggins (F-H) interaction parameter χ is regarded as a characteristic factor for the compatibility of a polymer blend [37]. Methods to estimate χ fall into two categories: experimental methods [38,39,40] and numerical simulation methods [41,42,43]. In contrast to these estimation methods, an approximate prediction method based on Hildebrand Solubility Parameters (HSP) is given as follows [44]:
$$ \chi_{12} = \frac{V}{RT}\,(\delta_1 - \delta_2)^2, \tag{2} $$
where V is the actual volume of a polymer segment and δ_i is the Hildebrand solubility parameter of component i. Since χ12 is the interaction parameter of F-H theory, the more similar the δ parameters are, the more likely the polymer blend is to be compatible [45,46]. Askadskii attempted to predict polymer compatibility using the HSP criterion [47] and to predict compatibility directly from the chemical structure of the repeating units of the polymers [48]. Schneier classified 28 specific polymer blends with 96.43% accuracy [49] using a modified HSP calculation and a critical value of 0.010 cal/mol. Although the generalizability is not fully satisfactory, this attempt to predict compatibility from repeating units greatly inspired us.
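
Equation (2) can be evaluated directly. The sketch below is a minimal illustration, assuming V in cm^3/mol, δ in (cal/cm^3)^0.5, T in K and R = 1.987 cal/(mol K); the function name and all input values are hypothetical and do not come from the paper or from Schneier's data.

```python
R_CAL = 1.987  # gas constant in cal/(mol K); unit choice is an assumption


def chi_from_hildebrand(segment_volume, delta1, delta2, temperature):
    """Eq. (2): chi12 = V / (R T) * (delta1 - delta2)**2.

    Assumed units: V in cm^3/mol, delta in (cal/cm^3)**0.5, T in K.
    """
    if temperature <= 0:
        raise ValueError("temperature must be positive (kelvin)")
    return segment_volume / (R_CAL * temperature) * (delta1 - delta2) ** 2


# Hypothetical solubility parameters, for illustration only.
chi_close = chi_from_hildebrand(100.0, 9.1, 9.5, 298.15)
chi_far = chi_from_hildebrand(100.0, 9.1, 11.1, 298.15)
```

Because the difference is squared, the estimate is symmetric in the two polymers and can never be negative, which is one reason this criterion struggles with blends stabilized by specific attractive interactions.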

Besides theoretical research, many researchers focus on compatibility experiments for specific polymer combinations. These data are reported in thousands of articles. We use this information, scattered across individual articles, to construct a dedicated database. Based on it, we design a predictive ML model and obtain good results. The whole process is outlined in Figure 1. In summary, our main contributions are as follows:
• We design a scheme to mine data from the literature and use it to build a dedicated polymer compatibility database.
• We present our model, the Half Dense Difference Network (HDDN), and show that polymer compatibility can be predicted from only the repeating-unit structure and the composition by a machine learning method.
• We show that it is possible to verify and discover chemical and materials science principles with the help of a machine learning model.

RESULTS

Data Collection and Information Extraction

Although many researchers have conducted compatibility studies and published their work, no dedicated database has been constructed for polymer compatibility. We collect data by the following means.
• Database extraction: The PoLyInfo database is developed by the National Institute for Materials Science (NIMS) [50]. It contains a large amount of polymer blend information and blend morphology information. Some entries carry clear compatibility information that can be inferred from the morphology description, such as miscible, compatible or incompatible. We collect these entries and tag them as compatible or incompatible according to the morphology description. We discard cases where the blend is partially compatible or the description is ambiguous.
• Text data mining: We search for and download papers related to the keywords ’Compatibility’ and ’Polymer’ from Google Scholar and the Tsinghua University Library, 47K articles in total. In these articles, some sentences contain clear compatibility information, for example, "Results of physical properties measurements reveal the blends of SR and FKM are technologically compatible". We design a filter to automatically extract these sentences from the articles and build the literature part of the database from them.
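
The paper does not describe the filter's rules, so the following is only a hypothetical sketch of what a keyword-based sentence filter of this kind could look like. The patterns, function names and the sample text (apart from the quoted SR/FKM sentence) are assumptions made for illustration.

```python
import re

# Hypothetical keyword patterns; the actual filter rules are not published.
AMBIGUOUS = re.compile(r"\b(partially|partly)\s+(compatible|miscible)\b", re.IGNORECASE)
INCOMPATIBLE = re.compile(r"\b(incompatible|immiscible)\b", re.IGNORECASE)
COMPATIBLE = re.compile(r"\b(compatible|miscible)\b", re.IGNORECASE)


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def label_sentence(sentence):
    """Return 'compatible', 'incompatible', or None when unclear."""
    if AMBIGUOUS.search(sentence):
        return None  # partial compatibility is discarded, as in the text
    negative = bool(INCOMPATIBLE.search(sentence))
    positive = bool(COMPATIBLE.search(sentence))
    if negative == positive:
        return None  # no keyword, or contradictory keywords
    return "incompatible" if negative else "compatible"


def mine_sentences(text):
    """Keep only sentences that carry an unambiguous compatibility label."""
    labelled = []
    for sentence in split_sentences(text):
        label = label_sentence(sentence)
        if label is not None:
            labelled.append((sentence, label))
    return labelled


sample = (
    "Results of physical properties measurements reveal the blends of SR and FKM "
    "are technologically compatible. "
    "Polymer X and polymer Y are immiscible. "  # made-up sentence
    "The blend is partially miscible. "         # made-up sentence
    "The samples were dried overnight."         # made-up sentence
)
hits = mine_sentences(sample)
```

The word-boundary anchors matter here: without them, "compatible" would also match inside "incompatible" and every negative sentence would look contradictory. A real pipeline would additionally need to identify which two polymers the sentence refers to, which this sketch does not attempt.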

All entries collected above are transformed from text into a standard vector form consisting of the polymers’ repeating-unit structures, composition information and a compatibility label. Our final dataset contains 1.4K reliable entries. We divide the data into a training set, a validation set and a test set in two ways. (1) Random Division: we randomly choose 64% of the data for training, 16% for validation and 20% for testing. (2) Balanced Division: we make sure that no polymer blend combination in the test set appears in the training set or validation set, so that the model must learn the underlying rule rather than memorize combinations. At the same time, we balance the class proportions of each subset by copying incompatible samples. The validation set is used to choose proper hyperparameters and the test set is used to verify generalizability. Details of the division are listed in Table 1, which gives the size of each subset with the incompatible rate in parentheses.
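
The two ingredients of the Balanced Division (no shared polymer pair across subsets, then class balancing by copying incompatible samples) can be sketched as follows. This is a minimal illustration under assumptions: the dictionary keys, function names, seed handling and the toy entries are hypothetical and are not the authors' implementation.

```python
import random
from collections import defaultdict


def pair_key(entry):
    """Order-independent identifier of a blend (A/B is the same as B/A)."""
    return frozenset((entry["polymer_a"], entry["polymer_b"]))


def split_by_pair(entries, test_fraction=0.2, seed=0):
    """Split so that no polymer pair appears on both sides."""
    groups = defaultdict(list)
    for entry in entries:
        groups[pair_key(entry)].append(entry)
    keys = sorted(groups, key=lambda k: sorted(k))  # deterministic order
    random.Random(seed).shuffle(keys)
    n_test = max(1, round(test_fraction * len(keys)))
    test = [e for k in keys[:n_test] for e in groups[k]]
    train = [e for k in keys[n_test:] for e in groups[k]]
    return train, test


def balance_by_copying(entries, seed=0):
    """Copy incompatible samples until both classes have equal size."""
    rng = random.Random(seed)
    incompatible = [e for e in entries if e["label"] == "incompatible"]
    compatible = [e for e in entries if e["label"] == "compatible"]
    n_extra = max(0, len(compatible) - len(incompatible))
    extra = [rng.choice(incompatible) for _ in range(n_extra)] if incompatible else []
    return list(entries) + extra


# Toy entries: 10 hypothetical pairs, each at three compositions.
demo = [
    {"polymer_a": f"P{i}", "polymer_b": f"P{i + 1}", "composition": c,
     "label": "incompatible" if i % 3 == 0 else "compatible"}
    for i in range(10) for c in (0.25, 0.5, 0.75)
]
train, test = split_by_pair(demo)
train_balanced = balance_by_copying(train)
```

Grouping by pair before splitting is what prevents a blend seen at one composition in training from reappearing at another composition in testing. Note that copying is applied within a subset after the split, so duplicated samples never cross the train/test boundary.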

Table 1. Statistics of Dataset

Division   Size   Training-set    Valid-set      Test-set
Random     1390   889 (38.5%)     223 (41.7%)    278 (40.0%)
Balanced   1766   1179 (50.0%)    281 (50.0%)    306 (50.0%)

Competing Results

To date, no similar ML model has been published for polymer compatibility prediction. We therefore apply several fundamental and widely used ML models to this problem and compare our model against them on the datasets generated in both ways. For fairness, we keep the numbers of layers and nodes roughly similar.

• Hildebrand Solubility Parameters (HSP): HSP considers segment volume, polymer density and structural features, and reaches over 95% accuracy on Schneier’s database. When the calculated ∆Hm value is not larger than 0.010, the blend is predicted to be compatible, and otherwise incompatible. We apply this procedure to our dataset, using HSP values from the Polymer Database (see footnote 1), and refer to it as HSP in the following.
• Multi-Layer Perceptron (MLP): MLP is a fundamental neural network model. By changing the number of layers, the number of nodes and the nonlinear function, an MLP can be made to fit a function and address predictive problems. This baseline replaces our Features Dense Module and Decision Module with an MLP.
• Concatenated-Difference-Net (CDN): this model uses a different layer-connection method from our Difference Module. CDN concatenates the values of different layers, which increases the dimension, while our model adds features together and keeps the dimension unchanged.
• Dense-Net (DN): this is a very popular model in computer vision tasks. Compared with our model, DN also contains dense connections in the Difference and Decision Module. We compare our model with it to examine whether our design, which follows polymer structure and the compatibility mechanism, is useful.

Table 2. Results on Test-set with Competing Models

Random Division Results
Model   MSE     Accuracy (%)   Precision (%)   Recall (%)   Specificity (%)   F1 score (%)
HSP     - (a)   46.91          32.88           39.34        51.49             35.82
MLP     5.09    89.98          91.16           83.00        94.61             86.85
CDN     5.55    90.65          89.56           86.71        93.26             88.10
DN      5.60    89.12          86.50           86.36        90.94             86.36
HDDN    7.25    90.89          89.75           87.24        93.31             88.43

Balanced Division Results
Model   MSE     Accuracy (%)   Precision (%)   Recall (%)   Specificity (%)   F1 score (%)
HSP     - (a)   68.93          70.59           66.67        71.26             65.57
MLP     7.21    63.56          67.17           51.67        75.00             57.36
CDN     6.48    65.97          65.02           67.67        64.34             65.87
DN      10.74   62.42          65.36           45.42        75.77             51.08
HDDN    6.14    75.75          76.99           74.13        77.31             74.80

a The HSP method predicts from a fixed criterion, so it has no training process and no MSE result.

Results on the Random Division and Balanced Division test sets are presented in Table 2. Since the incompatible rate is 40% in the Random Division test set, a model is effective only when its classification accuracy is well above 60%, the accuracy of always predicting "compatible". Our HDDN model reaches 90.89% accuracy and the highest recall (87.24%) and F1 score (88.43%). MLP, CDN and DN all reach high accuracy close to that of HDDN, while the HSP method obtains accuracy ranging from 31.58% to 46.91%, much lower than Schneier’s 96.43%. This is because the HSP method is weak in dealing with polar systems, where dipole-dipole interaction or hydrogen bonding further increases compatibility, leading to poor precision. For example, poly(p-hydroxystyrene) is compatible with poly(ethylene oxide), poly(methyl methacrylate), poly(lactic acid), poly(methyl vinyl ether) and poly(vinyl acetate) owing to strong hydrogen bonding, yet these blends are falsely predicted. For all machine learning models, specificity is clearly higher than recall, which indicates that the models perform better on compatible samples. This can be explained by the imbalance of the randomly divided dataset: compatible samples make up the larger share (60%), so the models pay more attention to this part of the data.

1 www.polymerdatabase.com

[Figure 2: four panels, A to D. Panels C and D plot Accuracy (%) against Super Parameter (500, 400, 300, 200, 100), one for the Random Division and one for the Balanced Division.]

Figure 2. Results of HDDN
(A) Training loss of HDDN as the epoch increases.
(B) Test-set accuracy of HDDN as the epoch increases.
(C and D) Average accuracies under different hyperparameters (the Super Parameter axis) are close to each other, which indicates that our model is insensitive to hyperparameters.

As presented above, the test set of the Balanced Division contains 50% incompatible and 50% compatible samples, so a random or otherwise ineffective model reaches about 50% accuracy. After training and testing five times, our model reaches 75.75% average accuracy on the test set, clearly higher than the other models (HSP: 68.93%; MLP: 63.56%; CDN: 65.97%; DN: 62.42%). Moreover, our model obtains higher precision (76.99%), recall (74.13%), specificity (77.31%) and F1 score (74.80%) than all other models. The results show that our architecture performs better not only on the Random Division but also on the stricter Balanced Division, and suggest that it captures regularities underlying polymer blend compatibility. Our model is also insensitive to hyperparameters, as shown in Figure 2.

We also find that all trained models score lower than on the Random Division, and the strict division is probably the main reason. In the Random Division, the training set contains the same polymer blend combinations as the test set. Although the composition ratio may differ, the influence of composition is clearly smaller than that of the structure representation, so a model can easily handle such test samples by drawing on what it has seen in the training set. In the Balanced Division, however, the training set and test set are strictly separated, so imitation is hard for lack of direct experience. This also explains why a model obtains lower accuracy when it is trained on a strictly divided dataset.

Ablation Results
We are interested in whether each module we design is effective, so we conduct ablation experiments to show the necessity of each module. The ablation experiment was presented by Ren [51], and its principle is similar to that of controlled-variable experiments in biology and chemistry. We remove some modules from our model or replace them with other connections. In this way, we examine whether our modules work well together and whether each contributes to the predictive task. The details of HDDN are presented in Figure 3, and we name the incomplete models HDDN-noc, HDDN-nodense, HDDN-nodiff and HDDN-noabs.

[Figure 3: architecture diagram of the Predictive Model. Feature A and Feature B each pass through an MLP (Feature Extraction Module) and the Features Dense Module. In the Difference & Decision Module, the absolute difference of the two features is concatenated with the composition of A and B, and an MLP classifier classifies the blend as Compatible or Incompatible.]
Figure 3. Structure of HDDN
Feature Extraction: reduces the dimension of the sparse input and extracts features. Feature Dense Connection: connects all hidden layers, so that the output of the last layer integrates different levels of molecular structure information. Difference and Decision: takes the difference between the two polymer structure vectors and uses an MLP classifier to make the compatibility decision.

• HDDN-noc: we drop the composition input to observe the performance without composition information.
• HDDN-nodense: we replace the dense connection with an MLP while keeping the numbers of layers and nodes the same, in order to investigate the effect of the dense connection.
• HDDN-nodiff: we remove the difference module and concatenate the two polymer vector representations as input, so the 4096-D vector passes through the dense connection and goes directly into the decision module.
• HDDN-noabs: we remove the absolute value operation and use the difference directly, in order to find out whether the absolute value operation is useful.
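
The HDDN-nodiff and HDDN-noabs variants differ from the full model in how the two polymer feature vectors are combined. The short sketch below, written with plain Python lists rather than the authors' PyTorch code, illustrates why the absolute difference is independent of the order of the two polymers while the signed difference and the concatenation are not. The function names and feature values are hypothetical.

```python
def abs_difference(a, b):
    """Element-wise |a - b|, the combination used by the full model."""
    return [abs(x - y) for x, y in zip(a, b)]


def signed_difference(a, b):
    """Element-wise a - b, as in the HDDN-noabs variant."""
    return [x - y for x, y in zip(a, b)]


def concatenate(a, b):
    """Concatenation of the two vectors, as in the HDDN-nodiff variant."""
    return list(a) + list(b)


# Hypothetical feature vectors of two polymers.
feat_a = [0.5, 1.0, 0.0]
feat_b = [0.25, 1.0, 2.0]

symmetric = abs_difference(feat_a, feat_b) == abs_difference(feat_b, feat_a)
signed_symmetric = signed_difference(feat_a, feat_b) == signed_difference(feat_b, feat_a)
concat_symmetric = concatenate(feat_a, feat_b) == concatenate(feat_b, feat_a)
```

Since a blend A/B is physically the same as B/A, an order-independent combination means the classifier does not have to learn that symmetry from data.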

Table 3. Results on Test-set in Ablation Experiments

Random Division Results
Model          MSE    Accuracy (%)   Precision (%)   Recall (%)   Specificity (%)   F1 score (%)
HDDN-noc       8.26   88.89          87.34           84.68        91.69             85.86
HDDN-nodense   6.68   90.43          90.07           85.59        93.64             87.71
HDDN-nodiff    4.48   89.53          91.41           81.42        94.91             86.12
HDDN-noabs     5.42   88.99          89.87           81.64        93.86             85.54
HDDN           7.25   90.89          89.75           87.24        93.31             88.43

Balanced Division Results
Model          MSE    Accuracy (%)   Precision (%)   Recall (%)   Specificity (%)   F1 score (%)
HDDN-noc       6.26   70.51          67.30           78.33        62.98             72.09
HDDN-nodense   6.37   70.10          67.30           78.33        62.98             71.07
HDDN-nodiff    5.81   62.30          66.72           46.25        77.72             54.53
HDDN-noabs     4.60   60.09          57.26           75.83        44.95             65.03
HDDN           6.14   75.75          76.99           74.13        77.31             74.80

Results of the ablation experiments are listed in Table 3. On the Random Division dataset, HDDN reaches the highest accuracy (90.89%), while all alternatives also perform well (gaps of at most 2 percentage points). As for the balance between compatible and incompatible samples, all alternatives except HDDN-noc obtain higher precision and specificity than HDDN, and all alternatives obtain lower recall and F1 score than HDDN. A possible reason is that a defect in the architecture makes the model pay more attention to the samples themselves than to the underlying mechanism. HDDN obtains the highest F1 score, which indicates that it balances precision and recall best.

On the Balanced Division dataset, the complete HDDN model reaches the highest accuracy (75.75%), and all alternatives obtain clearly lower accuracy. The F1 score of HDDN is also clearly higher than those of the alternatives. In detail, the two-thread polymer representation and the difference module are the most important parts of the whole architecture (HDDN-nodiff and HDDN-noabs obtain the lowest accuracy). This is because the two-thread operation guarantees invariance to the positions of the two polymers, so the order in which the two representations are combined does not matter. At the same time, our difference method is related to chemical theory and proves to be effective. The result for HDDN-noc shows that the composition ratio is also important to our model, which is consistent with the fact that compatibility cannot be predicted without composition. Without the dense connection or the absolute value module, HDDN loses some representation ability and cannot learn the true mechanism well through training. All the analyses above also hold for the F1 score. From the results and analyses of the ablation experiments, we infer that each module in our model is useful and necessary, and that the modules work together effectively in solving this classification problem.

[Figure 4: five panels. (A) Compatibility Result versus Index of PMMA/PVC Samples, with Prediction and Real Label. (B) Prediction Difference and Interaction Parameter versus PMMA Volume Fraction, with Prediction, Criterion, Curve Fitting and Literature Data. (C) Schematic of the Normal and Lacking PEO/PVPh blends (-C-O-C- removed from PEO, -O-H removed from PVPh), their vector representations with the dimension for -COC- highlighted, the Predictive Model and SHAP. (D) Prediction Difference versus PEO Volume Fraction for Normal and Lacking. (E) SHAP value distribution of -COC- for Normal and Lacking.]
Figure 4. Case Study of PMMA/PVC and Interpretability of HDDN with PEO/PVPh
(A) Our predictions for the PMMA/PVC blend samples and their real labels. Different samples have different compositions and share the same structure vector.
(B) The blue dotted line shows the relationship between composition and our prediction difference for the PMMA/PVC polymer blend. The red solid line shows the relationship between composition and the Flory-Huggins interaction parameter. The horizontal line is our prediction criterion.
(C) Structures of the Normal PEO/PVPh polymer blend and the Lacking one. The dimension for -COC- is highlighted. Both Normal and Lacking pass through the predictive model and the SHAP model.
(D) The relationship between composition and our prediction difference for the Normal and Lacking PEO/PVPh polymer blends.
(E) SHAP value distribution of the -COC- dimension for Normal PEO/PVPh and Lacking PEO/PVPh.

Confidence Tests

As mentioned above, we judge that a model works if it achieves higher accuracy than random classification. We now discuss the minimum accuracy that we can confidently attribute to the model. We assume that there is an objective accuracy rate behind each model architecture, and we focus on our confidence in the accuracy of the HDDN model.

Under this hypothesis, each prediction is a Bernoulli trial, and we define the probability of success as θ. In the Balanced Division, we test our model on 306 samples five times, which corresponds to a binomial experiment. The number of correct classifications X follows B(n, θ), where n = 306 × 5 = 1530:
$$ X \sim B(n,\theta), \qquad P(X = x) = \binom{n}{x}\,\theta^{x}(1-\theta)^{n-x}, \tag{3} $$
where the observed value of X is X_0. We state the hypotheses as follows:
$$ H_0: \theta = \theta_0 \quad \text{vs.} \quad H_1: \theta > \theta_0. \tag{4} $$
For the hypothesis test, we compute the P-value:
$$ P\text{-value} = \sup_{\theta=\theta_0} P_\theta(X \ge X_0) = \sup_{\theta=\theta_0} \sum_{i=X_0}^{n} \binom{n}{i}\,\theta^{i}(1-\theta)^{n-i}. \tag{5} $$
In our experiments, we reach 75.75% average accuracy, so we substitute X_0 = 1530 × 75.75% ≈ 1159 into the formula. The calculation shows that when θ_0 equals 73.07% and 73.87%, the P-values equal 0.01 and 0.05, respectively. That is, we can state that our model accuracy is higher than 73.07% at the 0.01 significance level and higher than 73.87% at the 0.05 significance level, both of which are conservative choices. This reasoning supports our claim that the model learns how the structure-property mechanism works through training.
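
The P-value in Equation (5) is an upper-tail binomial probability and can be checked numerically. The following sketch uses only the Python standard library and sums the probability mass in log space to avoid overflow for n = 1530; the function names are illustrative assumptions, and the tail is taken as P(X ≥ X_0), matching the lower summation limit i = X_0.

```python
import math


def log_binom_pmf(i, n, theta):
    """log P(X = i) for X ~ B(n, theta), computed via log-gamma."""
    log_coeff = math.lgamma(n + 1) - math.lgamma(i + 1) - math.lgamma(n - i + 1)
    return log_coeff + i * math.log(theta) + (n - i) * math.log(1.0 - theta)


def upper_tail_p_value(x0, n, theta0):
    """P(X >= x0) under H0: theta = theta0, i.e. the sum in Eq. (5)."""
    if not 0.0 < theta0 < 1.0:
        raise ValueError("theta0 must lie strictly between 0 and 1")
    return math.fsum(math.exp(log_binom_pmf(i, n, theta0)) for i in range(x0, n + 1))


n = 306 * 5                # five test runs on 306 samples
x0 = round(n * 0.7575)     # observed number of correct classifications
p_at_7307 = upper_tail_p_value(x0, n, 0.7307)
p_at_7387 = upper_tail_p_value(x0, n, 0.7387)
```

Evaluating the two thresholds quoted in the text should give values close to 0.01 and 0.05; the P-value grows as θ_0 approaches the observed accuracy, which is why the lower threshold corresponds to the stricter significance level.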

Case Study and Interpretability

We now turn to specific cases to examine how our method performs on polymer blend compatibility.

The poly(methyl methacrylate) (PMMA) / poly(vinyl chloride) (PVC) blend is widely used as a polymer electrolyte. The poor mechanical flexibility of PMMA film limits its use in energy storage devices, while blending in PVC can improve the mechanical and electrical properties of the polymer electrolyte, including its conductivity [52,53]. Conversely, PMMA is also used as a processing aid in PVC production: it helps the plasticization of PVC, which aids both the processing of PVC and the formation of a blend material.

We address the compatibility prediction of PMMA/PVC with our ML method. We use the model trained on the Balanced Division and all 46 PMMA/PVC entries in our database. These entries share the same polymer representation but differ in composition. Our model misclassifies only one sample and is correct on all the others, which supports the effectiveness of our method. As mentioned before, the Flory-Huggins parameter of the polymers and the intermolecular interactions between different repeat units are important factors that determine the mixing behavior. Existing research shows that as the PMMA volume fraction increases, χ12 rises from negative values to zero, which indicates that compatibility decreases [54]. With our ML method, the prediction difference also rises as the PMMA volume fraction increases and gradually exceeds the criterion value (Figure 4B), which shows that the trend of our model is consistent with the experimental results. The deviation between the two curves in the range of 0% to 60% may be due to the inadequacy of our data and the imperfection of our model. On the whole, our model is effective for polymer compatibility prediction.

Having shown that our method is effective for polymer compatibility prediction, we go further and investigate whether the model can be interpreted with chemical knowledge. By design, every dimension of the representation stands for a specific chemical structure, and the correspondence can be obtained with the RDKit Python package. Among several approaches to interpreting model predictions, we use SHAP (SHapley Additive exPlanations) [55] to estimate the influence weight of each dimension.

We choose the PEO/PVPh (poly(ethylene oxide)/poly(p-hydroxystyrene)) blend and apply the interpretation method to it. The process and result are shown in Figure 4C. According to chemical analysis, the compatibility of the PEO/PVPh system is increased by the hydrogen bonding of the hydroxyl groups [56]. To verify the consistency between our model and chemical knowledge, we construct a special Lacking PEO/PVPh by removing -COC- from PEO and -OH from PVPh. In addition, we select the dimension that encodes the presence of the ether bond (-COC-) and compute its SHAP values. Our model gives a much higher prediction difference for Lacking PEO/PVPh than for the Normal one, which means that the Lacking blend is predicted to be clearly more incompatible than the Normal blend. As for the SHAP values of -COC-, the values for the Normal blend are distributed more toward the negative side. It can be inferred that, after regular sampling, the presence of -COC- tends to have a negative influence on the prediction, making the model regard the blend as more compatible. These observations indicate that our model can be interpreted with chemical knowledge.
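
SHAP attributes a prediction to input dimensions using Shapley values. For intuition only, the sketch below computes exact Shapley values by enumerating feature subsets for a tiny hand-written scoring function. This is not the SHAP library and not the HDDN model: the toy function, its coefficients and the feature meanings are made-up assumptions, chosen so that a lower score stands for "more compatible".

```python
from itertools import combinations
from math import factorial


def shapley_values(f, x, baseline):
    """Exact Shapley values of f at x relative to a baseline input.

    Features outside a coalition are set to their baseline value.
    Cost grows as 2**n, so this only suits very small n.
    """
    n = len(x)

    def value(coalition):
        z = [x[i] if i in coalition else baseline[i] for i in range(n)]
        return f(z)

    phis = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        phi = 0.0
        for k in range(n):
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for subset in combinations(others, k):
                s = set(subset)
                phi += weight * (value(s | {i}) - value(s))
        phis.append(phi)
    return phis


def toy_score(z):
    """Made-up score: z[0] = ether bond present, z[1] = other group, z[2] = hydroxyl present."""
    return 2.0 - 1.5 * z[0] + 0.5 * z[1] - 1.0 * z[0] * z[2]


x = [1, 1, 1]          # all three structural features present
baseline = [0, 0, 0]   # all three absent
phis = shapley_values(toy_score, x, baseline)
```

In this toy setting the ether-bond feature receives a negative attribution, meaning its presence lowers the score, and the attributions sum to the difference between the score at x and at the baseline. The SHAP library approximates the same quantities for models where full enumeration is infeasible.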

DISCUSSION

In this work, we present a general ML method for material property prediction and choose polymer compatibility as our focus. To address this task, we establish a dataset based on literature mining and the NIMS database. We show that our model achieves good classification results and performs better than the chemistry-based method and the competing ML models. Through ablation experiments, we explain why our architecture works and quantify the contribution of each module: each module in HDDN serves a specific purpose and contributes effectively to the prediction task. Furthermore, we conduct a confidence test and a case study to demonstrate the reliability of our model. The interpretability investigation indicates that our model can be interpreted with chemical knowledge, and further details of the model may be investigated as a supplement to existing chemical knowledge.

We also obtain some other interesting findings. For compatibility, our method pays more attention to polymer structure and less to composition. The HDDN model successfully handles the compatibility prediction of binary blends, but it cannot be directly applied to blends of three or more components or to copolymers (such as ABS, acrylonitrile butadiene styrene), which requires further extension. In the future, we will try to optimize our model structure, though this may be challenging. We will also apply this general method to other significant scientific problems, such as polymer self-assembly and biodegradable materials.

EXPERIMENTAL PROCEDURES
Resource Availability

Lead Contact
Further information and requests for resources and reagents should be directed to and will be fulfilled by the Lead Contact, Changshui Zhang ([email protected]).

Materials Availability
This study did not generate new, unique reagents.

Data and Code Availability
The data and main code to reproduce the results of this study are available at the following GitHub page:
https://github.com/Zhilong-Liang/Polymer-Compatibility

Machine-learning Experiment Settings
All experiments are implemented in Python using the PyTorch toolbox [57], and computation is accelerated on an NVIDIA GeForce RTX 2080 Ti GPU. We set the maximum number of training epochs to 1000 and the mini-batch size to 20. The initial learning rate is set to 10^-4 for the Balanced Division and 5 × 10^-5 for the Random Division. The loss-related parameter λ, which determines the label of incompatible samples, is set to 10. A complete training run of 1000 iterations takes about 10 min. The Adam optimizer is used to optimize the loss function, for which we choose MSE (Mean Square Error) during the training phase.

Since we are performing a classification task, we also care about the accuracy of the model. In addition, we report
several other metrics:

\begin{align}
\mathrm{Accuracy} &= \frac{TP + TN}{P + N}, \tag{6a} \\
\mathrm{Precision} &= \frac{TP}{TP + FP}, \tag{6b} \\
\mathrm{Recall} &= \frac{TP}{TP + FN}, \tag{6c} \\
\mathrm{Specificity} &= \frac{TN}{TN + FP}, \tag{6d} \\
F_1\ \mathrm{score} &= 2 \times \frac{\mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}, \tag{6e}
\end{align}

where $P$ means incompatible, $N$ means compatible, $T$ means true, and $F$ means false; for example, $TP$ counts the
incompatible samples that are predicted correctly, and $P + N$ is the total number of samples. Precision refers to the
proportion of the "incompatible" predictions that are actually correct. Recall refers to the proportion of incompatible
samples with correct prediction among all incompatible samples; it is numerically equal to Sensitivity, and both indicate
the ability to predict incompatible cases. Specificity measures the ability to predict compatible cases. The $F_1$ score
jointly considers precision and recall: if either of them is too small, the value of $F_1$ becomes small as well.
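The metrics above can be computed directly from the four confusion-matrix counts. The sketch below treats "incompatible" as the positive class, as in the text; the function name `classification_metrics` and the example counts are made up for illustration and are not results from the paper.

```python
def classification_metrics(tp: int, tn: int, fp: int, fn: int) -> dict:
    """Compute Eqs. (6a)-(6e); positive = incompatible, negative = compatible."""
    total = tp + tn + fp + fn                      # equals P + N
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0  # also called sensitivity
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {
        "accuracy": (tp + tn) / total,
        "precision": precision,
        "recall": recall,
        "specificity": specificity,
        "f1": f1,
    }


# Hypothetical counts, chosen only to exercise the formulas.
m = classification_metrics(tp=40, tn=35, fp=15, fn=10)
```

Returning 0.0 when a denominator is empty is a convention chosen for this sketch; other conventions (for example, raising an error) are equally defensible.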

SUPPLEMENTAL INFORMATION
Document S1 Supplemental Machine Learning Method.

ACKNOWLEDGMENTS
The authors are grateful for support from the National Institute for Materials Science. We would like to thank Shuojin Wang
et al. for assistance in processing data. We would also like to thank Haodi Liu for polishing the language and Weisheng
Pan for advice on the interpretability study.

AUTHOR CONTRIBUTIONS
Z. Liang, C. Zhang, and J. Yuan conceived the main research idea. Z. Liang carried out the method design and machine-learning
modeling and wrote the manuscript. Z. Li and S. Zhou took part in the literature summary and conducted the HSP method
experiments. Y. Sun took part in the experiment design. C. Zhang and J. Yuan supervised the project and revised the manuscript.
All authors discussed the results and commented on the manuscript.

DECLARATION OF INTERESTS
The authors declare no competing interests.

References

1. Agrawal, A., and Choudhary, A. (2016). Perspective: Materials informatics and big data: Realization of the “fourth paradigm”
of science in materials science. APL Mater. 4, 053208.
2. Theodorou, D.N. (2004). Understanding and predicting structure–property relations in polymeric materials through molecular
simulations. Mol. Phys. 102, 147–166.


3. Geerlings, P., De Proft, F., and Langenaeker, W. (2003). Conceptual density functional theory. Chem. Rev. 103, 1793–1874.
4. Huang, G., Liu, Z., Van Der Maaten, L., and Weinberger, K.Q. (2017). Densely connected convolutional networks. In 2017
IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (IEEE), pp. 2261–2269.
5. He, K., Zhang, X., Ren, S., and Sun, J. (2016). Deep residual learning for image recognition. In 2016 IEEE Conference on
Computer Vision and Pattern Recognition (CVPR) (IEEE), pp. 770–778.

6. Bahdanau, D., Cho, K., and Bengio, Y. (2014). Neural machine translation by jointly learning to align and translate. arXiv,
arXiv:1409.0473, https://arxiv.org/abs/1409.0473.

7. Zeng, Z., Yao, Y., Liu, Z., and Sun, M. (2022). A deep-learning system bridging molecule structure and biomedical text with
comprehension comparable to human professionals. Nat. Commun. 13, 862.
8. Jumper, J., Evans, R., Pritzel, A., Green, T., Figurnov, M., Ronneberger, O., Tunyasuvunakool, K., Bates, R., Žídek, A.,
Potapenko, A., et al. (2021). Highly accurate protein structure prediction with AlphaFold. Nature 596, 583–589.
9. Yin, H., Sun, Z., Wang, Z., Tang, D., Pang, C.H., Yu, X., Barnard, A.S., Zhao, H., and Yin, Z. (2021). The data-intensive
scientific revolution occurring where two-dimensional materials meet machine learning. Cell Rep. Phys. Sci. 2, 100482.
10. Tang, Y., Zhang, D., Liu, R., and Li, D. (2021). Designing high-entropy ceramics via incorporation of the bond-mechanical
behavior correlation with the machine-learning methodology. Cell Rep. Phys. Sci. 2, 100640.
11. Agrawal, A., Deshpande, P.D., Cecen, A., Basavarsu, G.P., Choudhary, A.N., and Kalidindi, S.R. (2014). Exploration of data
science techniques to predict fatigue strength of steel from composition and processing parameters. Integr. Mater. Manuf.
Innov. 3, 90–108.
12. Zhang, R.F., Kong, X.F., Wang, H.T., Zhang, S.H., Legut, D., Sheng, S.H., Srinivasan, S., Rajan, K., and Germann, T.C.
(2017). An informatics guided classification of miscible and immiscible binary alloy systems. Sci. Rep. 7, 9577.
13. Dave, A., Mitchell, J., Kandasamy, K., Wang, H., Burke, S., Paria, B., Póczos, B., Whitacre, J., and Viswanathan, V. (2020).
Autonomous discovery of battery electrolytes with robotic experimentation and machine learning. Cell Rep. Phys. Sci. 1, 100264.
14. Feng, H.J., Wu, K., and Deng, Z.Y. (2020). Predicting inorganic photovoltaic materials with efficiencies >26% via structure-
relevant machine learning and density functional calculations. Cell Rep. Phys. Sci. 1, 100179.
15. Chen, P., Tang, Z., Zeng, Z., Hu, X., Xiao, L., Liu, Y., Qian, X., Deng, C., Huang, R., Zhang, J., et al. (2020). Machine-Learning-
Guided Morphology Engineering of Nanoscale Metal-Organic Frameworks. Matter 2, 1651-1666.
16. Moghadam, P.Z., Rogge, S.M.J., Li, A., Chow, C.-M., Wieme, J., Moharrami, N., Aragones-Anglada, M., Conduit, G.,
Gomez-Gualdron, D.A., Van Speybroeck, V., and Fairen-Jimenez, D. (2019). Structure-Mechanical Stability Relations of
Metal-Organic Frameworks via Machine Learning. Matter 1, 219-234.
17. Rosen, A.S., Iyer, S.M., Ray, D., Yao, Z., Aspuru-Guzik, A., Gagliardi, L., Notestein, J.M., and Snurr, R.Q. (2021). Machine
learning the quantum-chemical properties of metal-organic frameworks for accelerated materials discovery. Matter 4, 1578-1597.
18. Paunović, N., Bao, Y., Coulter, F. B., Masania, K., Geks, A. K., Klein, K., et al. (2021). Digital light 3D printing of customized
bioresorbable airway stents with elastomeric properties. Science Advances 7(6), eabe9499.
19. Zhao, C., Ma, Z., and Zhu, X. X. (2019). Rational design of thermoresponsive polymers in aqueous solutions: A
thermodynamics map. Progress in Polymer Science 90, 269-291.
20. Audus, D. J., and De Pablo, J. J. (2017). Polymer informatics: opportunities and challenges. ACS Macro Lett 6: 1078–1082.
21. Zhou, T., Song, Z., and Sundmacher, K. (2019). Big data creates new opportunities for materials research: a review on
methods and applications of machine learning for materials design. Engineering 5(6), 1017-1026.
22. Cencer, M.M., Moore, J.S. and Assary, R.S. (2022), Machine learning for polymeric materials: an introduction. Polym Int.
23. Gu, Y., Lin, P., Zhou, C.,and Chen, M. (2021). Machine learning-assisted systematical polymerization planning: case studies
on reversible-deactivation radical polymerization. Science China Chemistry 64(6), 1039-1046.
24. Mohammadi, Y., Saeb, M.R., Penlidis, A., Jabbari, E., J Stadler, F., Zinck, P., and Matyjaszewski, K. (2019). Intelligent
machine learning: tailor-making macromolecules. Polymers 11, 579.


25. Tu, K.H., Huang, H., Lee, S., Lee, W., Sun, Z., Alexander‐Katz, A., and Ross, C.A. (2020). Machine Learning Predictions of
Block Copolymer Self‐Assembly. Advanced Materials 32, 2005713.
26. Zhang, Z., and Friedrich, K. (2003). Artificial neural networks applied to polymer composites: a review. Composites Science
and technology 63, 2029-2044.
27. Leon, F., Curteanu, S., Lisa, C., and Hurduc, N. (2007). Machine learning methods used to predict the liquid-crystalline
behavior of some copolyethers. Mol. Cryst. Liq. Cryst. 469, 1–22.
28. Wu, S., Kondo, Y., Kakimoto, M., Yang, B., Yamada, H., Kuwajima, I., Lambard, G., Hongo, K., Xu, Y., Shiomi, J., et al. (2019).
Machine-learning-assisted discovery of polymers with high thermal conductivity using a molecular design algorithm. NPJ
Comput. Mater. 5, 66.
29. Liang, J., Xu, S., Hu, L., Zhao, Y., and Zhu, X. (2021). Machine-learning-assisted low dielectric constant polymer discovery.
Mater. Chem. Front. 5, 3823–3829.


30. Kim, C., Chandrasekaran, A., Huan, T.D., Das, D., and Ramprasad, R. (2018). Polymer genome: a data-powered polymer
informatics platform for property predictions. The Journal of Physical Chemistry C 122, 17575-17585.
31. Ma, R., Liu, Z., Zhang, Q., Liu, Z., and Luo, T. (2019). Evaluating polymer representations via quantifying structure–property
relationships. Journal of chemical information and modeling 59,3110-3119.
32. Ma, R., and Luo, T. (2020). PI1M: a benchmark database for polymer informatics. Journal of Chemical Information and
Modeling 60, 4684-4690.


33. Kuenneth, C., Schertzer, W., and Ramprasad, R. (2021). Copolymer Informatics with Multitask Deep Neural Networks.
Macromolecules 54, 5957-5961.
34. Hiraide, K., Hirayama, K., Endo, K., and Muramatsu, M. (2021). Application of deep learning to inverse design of phase
separation structure in polymer alloy. Computational Materials Science 190, 110278
35. Krause, S. (1978). Polymer–polymer compatibility. In Polymer Blends (Elsevier), pp. 15–113.
36. Flory, P.J. (1953). Principles of Polymer Chemistry (Cornell university press).
37. De Gennes, P.G. (1979). Scaling Concepts in Polymer Physics (Cornell university press).
38. Sanchez, I.C. (1989). Relationships between polymer interaction parameters. Polymer 30, 471–475.
39. Weeks, N.E., Karasz, F.E., and MacKnight, W.J. (1977). Enthalpy of mixing of poly(2,6-dimethyl phenylene oxide) and
polystyrene. J. Appl. Phys. 48, 4068–4071.
40. Graessley, W.W., Krishnamoorti, R., Balsara, N.P., Fetters, L.J., Lohse, D.J., Schulz, D.N., and Sissano, J.A. (1994).
Deuteration effects and solubility parameter ordering in blends of saturated hydrocarbon polymers. Macromolecules 27, 2574–2579.
41. Heine, D.R., Grest, G.S., and Curro, J.G. (2005). Structure of polymer melts and blends: Comparison of integral equation
theory and computer simulations. In Advanced Computer Simulation Advances in Polymer Science., C. Dr. Holm and K. Prof.
Dr. Kremer, eds. (Springer Berlin Heidelberg), pp. 209–252.
42. Fan, C.F., Olafson, B.D., Blanco, M., and Hsu, S.L. (1992). Application of molecular simulation to derive phase diagrams of
binary mixtures. Macromolecules 25, 3667–3676.
43. Accelrys, R. (2008). 4.; Accelrys Software. Inc., San Diego, CA.
44. Burke, J. (1984). Solubility Parameters: Theory and Application.
https://cool.culturalheritage.org/coolaic/sg/bpg/annual/v03/bp03-04.html.
45. Hughes, L.J., and Britt, G.E. (1961). Compatibility studies on polyacrylate and polymethacrylate systems. J. Appl. Polym. Sci.
5, 337–348.
46. Larsen, M. (2009). Hansen solubility parameters and SWCNT composites. In Proceedings of the 17th International Conference
on Composite Materials, ICCM-17, Edinburgh.
47. Askadskiĭ, A.A. (2003). Computational materials science of polymers (Cambridge Int Science Publishing).
48. Askadskii, A.A., Matseevich, T.A., Popova, M.N., and Kondrashchenko, V.I. (2015). Prediction of the compatibility of polymers
and analysis of the microphase compositions and some properties of blends. Polym. Sci. Ser. A 57, 186–199.

49. Schneier, B. (1973). Polymer compatibility. J. Appl. Polym. Sci. 17, 3175–3185.
50. Otsuka, S., Kuwajima, I., Hosoya, J., Xu, Y., and Yamazaki, M. (2011). PoLyInfo: Polymer database for polymeric materials
design. In 2011 International Conference on Emerging Intelligent Data and Web Technologies (IEEE), pp. 22–29.
51. Ren, S., He, K., Girshick, R., and Sun, J. (2015). Faster r-cnn: Towards real-time object detection with region proposal
networks. Advances in neural information processing systems 28.
52. Rhoo, H.J., Kim, H.T., Park, J.K., and Hwang, T.S. (1997). Ionic conduction in plasticized blend polymer electrolytes.
Electrochim. Acta 42, 1571–1579.
53. Ramesh, S., Leen, K.H., Kumutha, K., and Arof, A.K. (2007). FTIR studies of PVC/PMMA blend based polymer electrolytes.
Spectrochim. Acta A Mol. Biomol. Spectrosc. 66, 1237–1242.

54. Fekete, E., Földes, E., and Pukánszky, B. (2005). Effect of molecular interactions on the miscibility and structure of polymer
blends. Eur. Polym. J. 41, 727–736.
55. Lundberg, S.M., and Lee, S.I. (2017). A unified approach to interpreting model predictions. Advances in neural information
processing systems 30.
56. Pomposo, J.A., Cortazar, M., and Calahorra, E. (1994). Hydrogen bonding in polymer systems involving poly(p-vinylphenol).
2. Ternary blends with poly(ethyl methacrylate) and poly(methyl methacrylate). Macromolecules 27, 252–259.
57. Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., and Antiga, L. (2019).
Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems 32.
A Machine Learning Method for Material Property and Function
Prediction: Example Polymer Compatibility

Methodology
Compatibility prediction can be studied from several perspectives, such as the parameter estimation mentioned above
and direct compatibility classification. We find that the compatibility-related literature usually describes blends as
either compatible or incompatible, so we use ML to implement compatibility prediction as a classification task. For this
purpose, we need to prepare a database from which to train a good predictive model. Below, we introduce the modules in
our workflow: Data Collection and Information Extraction, Molecular Representation, and Predictive Model. It is worth
noting that these modules are largely general and can be transferred to other problems with only minor adjustments.

1 Data Collection and Information Extraction Module


Although many researchers have conducted compatibility studies and published their work, no database has been constructed
specifically for polymer compatibility. To train our predictive model, we collect data by the following means.

Database extraction The PoLyInfo database is developed by the National Institute for Materials Science (NIMS) [1]. It
contains a large amount of polymer blend information and blend morphology information. Some entries carry clear
compatibility information that can be inferred from the morphology description, such as miscible, compatible, or
incompatible. We collect these entries and tag them as compatible or incompatible according to the morphology description.
We discard cases where the blend is partially compatible or the description is ambiguous.

Text Data Mining We search for and download papers related to the keywords 'Compatibility' and 'Polymer' from Google
Scholar¹ and the Tsinghua University Library², which amount to 47K articles. Within these articles, some sentences
contain clear compatibility information, for example, "Results of physical properties measurements reveal the blends
of SR and FKM are technologically compatible"³. We design a filter, the Information Extraction Module, to export these
sentences from the articles automatically. Details are presented below and shown in Figure S1.
As mentioned above, although we have collected thousands of compatibility-related articles, it is still a great challenge
to extract the specific sentences that contain a clear compatibility description. Since these data are natural-language
text, we process them automatically with ML methods for language. Natural Language Processing (NLP) is one such AI
technology and has proven powerful. We apply it to our literature texts and extract all sentences containing the
information we need.
All sentences in the literature can be divided into compatibility-related and compatibility-unrelated ones. We design
an Information Extraction Module to perform this classification.
¹ https://scholar.google.com
² https://lib.tsinghua.edu.cn/en
³ https://doi.org/10.1177/0095244309345409

A PREPRINT - FEBRUARY 13, 2022

[Figure S1 diagram. Stages: Sentences, Key word-root Search, GPT3 Featurizer, MLP Classifier. Outputs: Related (Database) and no-Related.]

Figure S1: Information Extraction Module. All sentences are screened through this module and
compatibility-related ones are collected to construct Database WholeSen.

Information Extraction Module First, we search for all sentences containing the keyword root 'compati' as our potential
corpus, randomly choose 2,000 sentences to establish the database RawSen, and label them manually according to their
meanings. Second, we transform all sentences into vectors with the GPT3 featurizer. GPT3 (Generative Pre-trained
Transformer 3) is a generative NLP model with strong semantic representation ability [2]. It characterizes sentences as
long vectors based on their meanings, and the distance between vectors reflects their semantic difference. Third, we pass
these vectors to a Multi-Layer Perceptron (MLP) to train a binary-classification screening network. An MLP is composed of
linear and nonlinear functions and in theory has the ability to fit a function. The whole computation is
described below:

\begin{align}
h &= W_0 x + b_0, \tag{1a} \\
x_1 &= \mathrm{ReLU}(h) = \max(0, h), \tag{1b} \\
y &= W_1 x_1 + b_1, \tag{1c}
\end{align}

where $x$ represents the input vector and $y$ represents the output label, whose value can be 0 (for a no-related sample)
or 1 (for a related sample). The overall error of this network is evaluated by the cross-entropy loss:

\begin{equation}
L_{\mathrm{IEM}} = \mathrm{CrossEntropy}(y, p) = -\frac{1}{N} \sum_i \left[ y_i \log(p_i) + (1 - y_i) \log(1 - p_i) \right], \tag{2}
\end{equation}

where $p_i$ represents the probability of sample $i$ being compatibility-related and $N$ is the number of samples.
Finally, we pass all sentences in our corpus through the binary-classification screening network and obtain the database
WholeSen from the literature.
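To make the screening network concrete, the forward pass of Eqs. (1a)-(1c) and the cross-entropy loss of Eq. (2) can be written in a few lines of plain Python. This is a minimal sketch, not the authors' code: the helper names, the tiny hand-picked weights, and the use of a sigmoid to turn the scalar output $y$ into a probability $p$ are assumptions made for illustration (the text does not state how $p$ is obtained).

```python
import math


def linear(W, b, x):
    """Affine map W x + b for a list-of-rows matrix W."""
    return [sum(w * xj for w, xj in zip(row, x)) + bi for row, bi in zip(W, b)]


def relu(v):
    """Element-wise max(0, h), as in Eq. (1b)."""
    return [max(0.0, a) for a in v]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def related_probability(x, W0, b0, W1, b1):
    """Eqs. (1a)-(1c), followed by an assumed sigmoid to obtain p."""
    h = linear(W0, b0, x)
    x1 = relu(h)
    y = linear(W1, b1, x1)[0]  # single output unit
    return sigmoid(y)


def cross_entropy(labels, probs, eps=1e-12):
    """Mean binary cross-entropy, Eq. (2); probabilities are clamped for safety."""
    total = 0.0
    for y, p in zip(labels, probs):
        p = min(max(p, eps), 1.0 - eps)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(labels)


# Toy 2-D "sentence vector" and hand-picked weights (illustrative only).
W0, b0 = [[1.0, -1.0], [0.5, 0.5]], [0.0, 0.0]
W1, b1 = [[1.0, -1.0]], [0.0]
p = related_probability([2.0, 1.0], W0, b0, W1, b1)
loss = cross_entropy([1, 0], [0.5, 0.5])
```

In the paper's setting the input would be a GPT3 sentence embedding rather than a 2-D toy vector, and the weights would be learned by minimizing the loss over the labeled RawSen sentences.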

2 Molecular Representation Model

To enable prediction, we must transform molecular structures into a suitable vector representation. We have already
transformed texts into vectors successfully, but molecular structure is far more complicated than simple text. We divide
the representation process into two steps.
First, we represent the structure of each polymer repeating unit as a character string according to a specific rule
that is strongly related to spatial information. For the character string, we use SMILES (Simplified Molecular
Input Line Entry System) [3, 4, 5]. Although InChI was introduced by IUPAC as a standard for formula representation,
SMILES is still generally considered more human-readable than InChI [6].

Second, we transform these strings into vectors according to chemical structures at different scales. Several methods
are based on SMILES, such as RDKit Descriptors [7], MACCS Keys FingerPrint [8], PubChem FingerPrint [9],
and Circular FingerPrint [10]. Among these methods, RDKit Descriptors are high-dimensional features that indicate the
physical and chemical properties of molecules, while fingerprints encode molecular structure and indicate, for example,
whether a phenyl group is present. We use Circular FingerPrint in our polymer compatibility network. The whole process
is shown in Figure S2.
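The idea of turning a SMILES string into a fixed-length bit vector can be illustrated with a deliberately simplified stand-in. The sketch below is not a circular fingerprint: a real circular fingerprint hashes atom-centered neighborhoods of the molecular graph (cheminformatics toolkits such as RDKit provide this), whereas this toy version only hashes character substrings of the SMILES text and folds them into `n_bits` positions. The function name, the substring length limit, and the example repeating-unit string (a polystyrene-like unit with `*` marking attachment points) are assumptions for illustration.

```python
import zlib


def hashed_fingerprint(smiles: str, n_bits: int = 2048, max_len: int = 3) -> list:
    """Toy folded bit vector built from SMILES substrings of length 1..max_len.

    This only mimics the hash-and-fold step of fingerprinting; it does not
    parse the molecular graph and is not chemically meaningful.
    """
    bits = [0] * n_bits
    for size in range(1, max_len + 1):
        for start in range(len(smiles) - size + 1):
            fragment = smiles[start:start + size]
            # crc32 is deterministic across runs, unlike Python's built-in hash().
            index = zlib.crc32(fragment.encode("utf-8")) % n_bits
            bits[index] = 1
    return bits


# Illustrative repeating-unit SMILES; '*' marks the chain attachment points.
unit = "*CC(*)c1ccccc1"
fp = hashed_fingerprint(unit)
```

The vector length of 2048 matches the dimensionality quoted for Circular FingerPrint in the text, which is why the downstream network needs a dimension-reduction stage.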

3 Predictive Model

Our model consists mainly of three modules: the Feature Extraction Module, the Features Dense Module, and the Difference
and Decision Module. The whole process is shown in Figure S3.

[Figure S2 diagram. Stages: Polymer Formula, Repeating Units Graph, Repeating Units SMILES, Circular FingerPrint, Feature Module.]

Figure S2: Molecular Representation Model. Polymers are processed through these stages and finally transformed
to vector representations.

[Figure S3 diagram. Blocks: Feature A, Feature B, MLP, Absolute Difference, Composition A&B, Concatenate, MLP Classifier, Classify (Compatible / Incompatible). Module labels: Feature Extraction Module, Features Dense Module, Difference & Decision Module.]

Figure S3: Predictive Model. The whole architecture includes three main modules, and the composition is concatenated
after the Features Dense Module. After the Decision Module, the model outputs the final classification prediction.

Feature Extraction Module Since molecular representations are always high-dimensional (for example, 2048-D
for Circular FingerPrint) whereas chemical theories focus on just a few factors (for example, 4-D for HSP),
we assume that many parameters in the molecular representation are likely to be redundant. We use an MLP with linear
connection layers and Sigmoid layers to reduce the feature dimension. Each node in the next hidden layer is a function of
all nodes in the current layer and represents a new feature composed of the former factors. In this way, we can extract
the proper features from the initial input.

Features Dense Module We assume that compatibility is influenced not only by specific parameters such as HSP and
components but also by basic structural features, because the total interaction depends on the atoms, chains, and
functional groups of the polymers. Therefore, our final polymer compatibility prediction depends on features
at different depth levels inside the network. In addition, existing research shows that connections between different
hidden layers and network shortcuts improve representation ability and help avoid over-fitting [11, 12]. We therefore
construct dense shortcuts that connect different layers so that our model can learn the right rules. The Features
Dense Module integrates features from different depths, and these features participate together in the final prediction.
The algorithm is as follows:

\begin{align}
h_1 &= W_0 x + b_0, \tag{3a} \\
x_1 &= \mathrm{Sigmoid}(h_1) \cdot \lambda, \tag{3b} \\
h_2 &= W_1 (x + x_1) + b_1, \tag{3c} \\
x_2 &= \mathrm{Sigmoid}(h_2) \cdot \lambda, \tag{3d} \\
h_3 &= W_2 (x + x_1 + x_2) + b_2, \tag{3e} \\
x_3 &= \mathrm{Sigmoid}(h_3) \cdot \lambda, \tag{3f} \\
\mathrm{Sigmoid}(x) &= \frac{1}{1 + e^{-x}}, \tag{3g}
\end{align}

where $x$ is the feature vector produced by the Feature Extraction Module, $W_i$ is a weight matrix, $b_i$ is a bias
vector, and $\lambda$ is a parameter to control the training loss ($\lambda$ is set to 10 in our experiments).
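The recurrence in Eqs. (3a)-(3f) is easier to follow as a loop in which each layer receives the running sum of the input and all earlier layer outputs. The following plain-Python sketch illustrates that reading of the equations; the function names and toy weights are hypothetical, every layer is assumed to preserve the feature dimension so that the sums are well defined, and returning all three layer outputs is a choice made here because the text does not specify how they are combined afterwards.

```python
import math


def linear(W, b, x):
    """Affine map W x + b for a list-of-rows matrix W."""
    return [sum(w * xj for w, xj in zip(row, x)) + bi for row, bi in zip(W, b)]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def dense_features(x, weights, biases, lam=10.0):
    """Apply Eqs. (3a)-(3f): layer k sees x + x_1 + ... + x_{k-1}."""
    skip_sum = list(x)           # running sum of the input and earlier outputs
    outputs = []
    for W, b in zip(weights, biases):
        h = linear(W, b, skip_sum)
        x_k = [lam * sigmoid(a) for a in h]   # Sigmoid(h) * lambda
        outputs.append(x_k)
        skip_sum = [s + a for s, a in zip(skip_sum, x_k)]
    return outputs


# Toy 2-D example with identity weights and zero biases (illustrative only).
identity = [[1.0, 0.0], [0.0, 1.0]]
outs = dense_features([0.0, 0.0], [identity] * 3, [[0.0, 0.0]] * 3)
```

With a zero input, the first layer outputs $\lambda \cdot \mathrm{Sigmoid}(0) = 5$ in each component, and the second layer then sees that value through the shortcut, which shows how earlier features feed every later layer.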

Difference and Decision Module According to Flory-Huggins (F-H) theory and HSP theory, if polymer molecules have similar
structures and HSP parameters, the F-H interaction parameter $\chi$ will be small and $\Delta G_M$ will be negative, so
the blend can be compatible. Therefore, we use the difference between the two vectors obtained from the Features Dense
Module as the input of the Decision Module. The sign of the difference does not matter, so we take its absolute value and
pass the absolute difference vector to the following layers.
At the same time, we note that composition clearly influences polymer blend compatibility. Blends are commonly more
compatible at a 10%-90% composition than at a 50%-50% composition. We concatenate the difference vector with the
composition information and feed the whole vector into a 3-layer MLP, which gives the final prediction. For our labels,
0 corresponds to compatible and $\lambda$ (set to 10 in experiments) corresponds to incompatible. Therefore, a final
output closer to 0 means that our model predicts "compatible", while an output closer to $\lambda$ means that it
predicts "incompatible". Although our problem is a classification problem, we use the MSE (mean squared error) loss to
fine-tune our network.
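The input construction and label convention described above can be sketched as follows. This is an illustrative outline only: the function names and toy numbers are hypothetical, the 3-layer MLP itself is omitted, and assigning an output exactly halfway between 0 and $\lambda$ to "compatible" is an arbitrary tie-breaking choice made for this sketch.

```python
def decision_input(feat_a, feat_b, composition):
    """Absolute feature difference of the two polymers, then the composition."""
    diff = [abs(a - b) for a, b in zip(feat_a, feat_b)]
    return diff + list(composition)


def encode_label(is_incompatible: bool, lam: float = 10.0) -> float:
    """Regression target: 0 for compatible, lambda for incompatible."""
    return lam if is_incompatible else 0.0


def decode_output(output: float, lam: float = 10.0) -> str:
    """Map a scalar network output to the nearer of the two targets."""
    return "incompatible" if abs(output - lam) < abs(output) else "compatible"


def mse_loss(outputs, targets) -> float:
    """Mean squared error between network outputs and encoded labels."""
    return sum((o - t) ** 2 for o, t in zip(outputs, targets)) / len(outputs)


# Toy values: 2-D features for polymers A and B and a 30%/70% composition.
vec = decision_input([1.0, 4.0], [3.0, 1.0], [0.3, 0.7])
targets = [encode_label(False), encode_label(True)]
loss = mse_loss([2.0, 7.5], targets)
```

Because the absolute difference is symmetric, swapping polymers A and B leaves the difference part of the input unchanged, which matches the statement that the sign of the difference does not matter.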
References
[1] Shingo Otsuka, Isao Kuwajima, Junko Hosoya, Yibin Xu, and Masayoshi Yamazaki. Polyinfo: Polymer database
for polymeric materials design. In 2011 International Conference on Emerging Intelligent Data and Web
Technologies, pages 22–29, 2011.
[2] Tom B Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan,
Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. arXiv
preprint arXiv:2005.14165, 2020.
[3] David Weininger. Smiles, a chemical language and information system. 1. Introduction to, 1970.
[4] David Weininger. Smiles, a chemical language and information system. 1. introduction to methodology and
encoding rules. Journal of Chemical Information and Computer Sciences, 28(1):31–36, 1988.
[5] David Weininger. Smiles. 3. depict. graphical depiction of chemical structures. Journal of Chemical Information
and Computer Sciences, 30(3):237–243, 1990.
[6] Stephen R Heller, Alan McNaught, Igor Pletnev, Stephen Stein, and Dmitrii Tchekhovskoi. Inchi, the iupac
international chemical identifier. Journal of Cheminformatics, 7(1):1–34, 2015.
[7] Greg Landrum. Rdkit documentation. Release, 1(1-79):4, 2013.


[8] Joseph L Durant, Burton A Leland, Douglas R Henry, and James G Nourse. Reoptimization of mdl keys for use in
drug discovery. Journal of Chemical Information and Computer Sciences, 42(6):1273–1280, 2002.
[9] Xiang-Qun Sean Xie. Exploiting pubchem for virtual screening. Expert Opinion on Drug Discovery, 5(12):1205–
1220, 2010.
[10] Robert C Glen, Andreas Bender, Catrin H Arnby, Lars Carlsson, Scott Boyer, and James Smith. Circular
fingerprints: flexible molecular descriptors with applications from physical chemistry to adme. IDrugs, 9(3):199,
2006.
[11] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In
Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
[12] Gao Huang, Zhuang Liu, Laurens Van Der Maaten, and Kilian Q Weinberger. Densely connected convolutional
networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 4700–4708,
2017.
