ML codes
Why? Access to data associated with materials in electronic form enables engineers,
scientists and students to explore this data, display it graphically, find trends and develop
models.
What? In this tutorial, we will learn how to query, organize and plot data from the databases
associated with the Python libraries Pymatgen and Mendeleev.
How to use this? This tutorial uses Python; some familiarity with programming would be
beneficial but is not required. Run each code cell in order by pressing "Shift + Enter". Feel
free to modify the code, or change the queries, to familiarize yourself with the workings of the
code.
Suggested modifications and exercises are included in blue.
Outline:
1. Query from Pymatgen
2. Processing and Organizing Data
3. Plotting
4. Query from Mendeleev
Get started: Click "Shift-Enter" on the code cells to run!
# These lines import the libraries and then define an array with the elements to be used below
import pymatgen as pymat
import mendeleev as mendel
import pandas as pd
elements = ['H', 'He', 'Li', 'Be', 'B', 'C', 'N', 'O', 'F', 'Ne', 'Na', 'Mg',
'Al', 'Si', 'P', 'S', 'Cl', 'Ar', 'K', 'Ca', 'Sc', 'Ti', 'V', 'Cr',
'Mn', 'Fe', 'Co', 'Ni', 'Cu', 'Zn', 'Ga', 'Ge', 'As', 'Se', 'Br',
'Kr', 'Rb', 'Sr', 'Y', 'Zr', 'Nb', 'Mo', 'Tc', 'Ru', 'Rh', 'Pd', 'Ag',
'Cd', 'In', 'Sn', 'Sb', 'Te', 'I', 'Xe', 'Cs', 'Ba', 'Hf', 'Ta', 'W',
'Re', 'Os', 'Ir', 'Pt', 'Au', 'Hg', 'Tl', 'Pb', 'Bi', 'La', 'Ce', 'Pr',
'Nd', 'Pm', 'Sm', 'Eu', 'Gd', 'Tb', 'Dy', 'Ho', 'Er', 'Tm', 'Yb', 'Lu',
'Ac', 'Th', 'Pa', 'U', 'Np', 'Pu']
Fe_data["atomic_number"] = mendel.element("Fe").atomic_number
Fe_data["coefficient_of_linear_thermal_expansion"] =
pymat.Element("Fe").coefficient_of_linear_thermal_expansion
Fe_data["youngs_modulus"] = pymat.Element("Fe").youngs_modulus
Fe_data["specific_heat"] = mendel.element("Fe").specific_heat
Another way we can organize data is in lists, which can be very helpful if we want to create plots with
our data. Following the examples above, we will now query two specific properties for all elements
to get lists of values indexed according to the positions of the elements in the "elements" list in the
first cell of the tutorial.
sample = elements.copy()
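As a minimal sketch of such a query (the two properties chosen here, and the list names, are assumptions; any Pymatgen attribute works the same way), the following cell builds two lists indexed in the same order as "sample":
# Sketch: query two properties for every element in "sample"; the resulting lists are
# indexed in the same order as the "elements"/"sample" lists defined above
melting_point_values = [pymat.Element(item).melting_point for item in sample]
youngs_modulus_values = [pymat.Element(item).youngs_modulus for item in sample]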
# We will use the following arrays to group elements by their crystal structure at RT; all elements that are gases or liquids at RT have been removed
fcc_elements = ["Ag", "Al", "Au", "Cu", "Ir", "Ni", "Pb", "Pd", "Pt", "Rh", "Sr", "Th", "Yb"]
bcc_elements = ["Ba", "Cr", "Cs", "Eu", "Fe", "K", "Li", "Mn", "Mo", "Na", "Nb", "P", "Rb", "Ta",
"V", "W" ]
hcp_elements = ["Be", "Ca", "Cd", "Co", "Dy", "Er", "Gd", "Hf", "Ho", "Lu", "Mg", "Os", "Re", "Ru",
"Sc", "Tb", "Tc","Ti", "Tl", "Tm", "Y", "Zn", "Zr"]
# Others (Solids): "B", "Sb", "Sm", "Bi" and "As" are Rhombohedral; "C" , "Ce" and "Sn" are
Allotropic; "Si" and "Ge" are Face-centered diamond-cubic; "Pu" is Monoclinic;
# "S", "I", "U", "Np" and "Ga" are Orthorhombic; "Se" and "Te" Hexagonal; "In" and "Pa"
are Tetragonal; "la", "Pr", "Nd", "Pm" are Double hexagonal close-packed;
Finally, the most efficient way to visualize the dataset we just created is to use
the Pandas library to display it. This library will take the list of lists and show it in a nice,
user-friendly table with the properties as the column headers.
For this exercise, we will work with the data extracted for elements with the FCC crystal
structure.
First, we will create a list of lists using a for-loop and the values we can query from the
Pymatgen library. We can then use our array of queried properties as the column names.
# Properties to query from Pymatgen (the exact list here is an assumption; adjust as needed)
querable_pymatgen = ["atomic_mass", "atomic_radius", "melting_point", "youngs_modulus", "poissons_ratio"]

all_values = [] # Final list of lists, one inner list per element
for item in fcc_elements:
    element_values = [] # Values for the current element
    element_object = pymat.Element(item)
    for i in querable_pymatgen:
        element_values.append(getattr(element_object, i))
    all_values.append(element_values) # All lists are appended to another list, creating a list of lists

# Pandas Dataframe
df = pd.DataFrame(all_values, columns=querable_pymatgen)
display(df)
Pandas allows for easier manipulation of the data than the structures we discussed before,
both dictionaries and lists of lists. We can make modifications to this dataframe in each of the
following cells, to showcase the flexibility the Pandas library offers.
To make this dataframe look better, for example, we can start by using the list of elements
as the row index instead of numbered rows.
df.index = fcc_elements
display(df)
We can then use simple Pandas binary operations to only show elements that satisfy a certain
condition.
The first cell will display a version of the dataframe filtered to elements that have an atomic
mass greater than or equal to 150 u (Pandas operator .ge).
The second cell will display a version of the dataframe filtered to elements with a Poisson's
ratio of exactly 0.26 (Pandas operator .eq).
There are standard operators for greater or equal (.ge), less or equal (.le), equal (.eq) and not
equal (.ne). A list of such operations can be found here. However, we can also create our own
custom binary conditions.
The third cell will display a version with a custom binary condition: the elements shown
have a Young's modulus less than 120 GPa and a Poisson's ratio greater than 0.25.
df_big_atoms = df[df.atomic_mass.ge(150)]
display(df_big_atoms)
df_poisson = df[df.poissons_ratio.eq(0.26)]
display(df_poisson)
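A sketch of the third, custom condition described above (the column names follow the Pymatgen attribute names used as dataframe headers):
# Custom binary condition: Young's modulus below 120 GPa and Poisson's ratio above 0.25
df_custom = df[(df.youngs_modulus.lt(120)) & (df.poissons_ratio.gt(0.25))]
display(df_custom)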
3. Plotting
Finally, we are going to plot the values for the properties in the lists we just created. For this
tutorial we will make two scatter plots:
• Young's Modulus vs Melting Temperature
• Coefficient of Linear Thermal Expansion vs Melting Temperature
We will be using a Python library called Plotly to create these plots. This library allows you
to create plots that are really interactive and highly customizable.
Simple Plot
In this first cell we will import the library components we will use and create a simple plot.
import plotly # This is the library import
import plotly.graph_objs as go # This is the graphical object (think "plt" in Matplotlib, if you have used that before)
from plotly.offline import iplot # These lines are necessary to run Plotly in Jupyter notebooks, but not in a dedicated environment
plotly.offline.init_notebook_mode(connected=True)
# The layout gives Plotly the instructions on the background grids, tiles in the plot,
# axes names, axes ticks, legends, labels, colors on the figure and general formatting.
# The trace contains a type of plot (In this case, Scatter, but it can be "Bars, Lines, Pie Charts", etc.),
# the data we want to visualize and the way ("Mode") we want to represent it.
# To plot, we create a figure and implement our components in the following way:
data = [trace] # We could include more than just one trace here
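A minimal sketch of a complete cell along these lines (variable names are assumptions; "melting_point_values" and "youngs_modulus_values" are the lists queried earlier):
# Sketch: a scatter trace of Young's modulus vs melting temperature, a minimal layout, and the figure
trace = go.Scatter(x=melting_point_values, y=youngs_modulus_values, mode='markers')
layout = go.Layout(xaxis=dict(title='Melting Temperature (K)'),
                   yaxis=dict(title="Young's Modulus (GPa)"))
data = [trace]
figure = go.Figure(data=data, layout=layout)
iplot(figure)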
Now that we know how to make a basic plot, we can start adding more details to end up with
something that looks a little bit better. All modifications are explained in the comments, but
you can also find that information here https://plotly.com/python/axes/.
Before we start our new plot, wouldn't it look better if we could visualize the points with the
elements' names and color them according to their crystal structures?
# Here we are creating a function that takes a value x (which will be the symbol of the element)
# and returns a color depending on what its crystal structure is in our arrays from the beginning.
# That is because we want to color data according to the crystal structure; therefore, we will have to pass this info to the plot.
def SetColor_CrystalStr(x):
    if x in fcc_elements:
        return "red" # These are standard CSS colors, but you can also use hexadecimal colors ("#009900") or RGB ("rgb(0, 128, 0)")
    elif x in bcc_elements:
        return "blue"
    elif x in hcp_elements:
        return "yellow"
    else:
        return "lightgray"
# We will then create a list that passes all element symbols through this function. For that we will use the Python function "map".
# Map takes each element of a list and evaluates it in a function.
colors = list(map(SetColor_CrystalStr, sample)) # Sketch: map over the "sample" list defined earlier
# You can see what this list of generated colors looks like by uncommenting this line
#print(colors)
# Trace
• Exercise 4. Do you find correlations between the properties plotted? If so, what are the
underlying reasons for them?
• Exercise 5. Select a different pair of properties and create a similar plot. You can insert new
cells below from the top menu (Insert -> Cell Below) and copy and paste the code to create
new plots.
4. Query from Mendeleev
Another database we can query in a similar way is Mendeleev. Mendeleev is a library with an
API (application programming interface) dedicated to providing access to element
properties in the periodic table. Just like Pymatgen, Mendeleev uses an object and
attributes to handle a query. Mendeleev uses the element class (note that it is all lowercase).
Making a query in Mendeleev can be done either by using the chemical symbol, the same way
Pymatgen does, or by providing the atomic number of the element. Similarly, you can get a
property by using it as an attribute of the object. Again, not all properties that you can query
are listed here, but you can find them here. Note that Mendeleev does not provide units
when returning values, but you can find them in the previous link too.
In this example we will query the thermal conductivity for the elements in the list "sample".
With a little bit of programming experience in Python you can again use the commented code
to query all the properties listed for the "sample" elements.
# You can get the same results using either of these two lists (numbers correspond to the element's atomic number)
sample = ['Fe', 'Co', 'Ni', 'Cu', 'Zn']
#sample = [26,27,28,29,30]
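A minimal sketch of the query described above, using the thermal_conductivity attribute (Mendeleev lists its values in W/(m K)):
# Sketch: thermal conductivity for each element in "sample"
thermal_conductivities = [mendel.element(item).thermal_conductivity for item in sample]
print(thermal_conductivities)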
What? In this tutorial, we will learn how to use the linear regression model from the Scikit-
learn Python library to develop simple models for correlations between properties across
materials. Data will be obtained from Pymatgen and Mendeleev.
How to use this? This tutorial uses Python; some familiarity with programming would be
beneficial but is not required. Run each code cell in order by pressing "Shift + Enter". Feel
free to modify the code, or change the queries, to familiarize yourself with the workings of the
code.
Outline:
1. Getting data
2. Processing and Organizing Data
3. Creating the Model
4. Plotting
1. Getting Data
Using the queries from the MSEML Query_Viz tutorial, we will create lists containing
different properties of elements. In this first snippet of code we will import all relevant
libraries, the elements that will be turned into cases and the properties that will serve as the
attributes for the cases. The elements listed were chosen because they represent the three
most common crystallographic structures, and also querying them for properties yields a
dataset with no unknown values.
import numpy as np
import pymatgen as pymat
import mendeleev as mendel
from sklearn import datasets, linear_model
from sklearn.metrics import mean_squared_error, r2_score
from random import shuffle
import matplotlib.pyplot as plt
fcc_elements = ["Ag", "Al", "Au", "Cu", "Ir", "Ni", "Pb", "Pd", "Pt", "Rh",
"Th", "Yb"]
bcc_elements = ["Ba", "Cr", "Eu", "Fe", "Li", "Mn", "Mo", "Na", "Nb", "Ta",
"V", "W" ]
hcp_elements = ["Be", "Ca", "Cd", "Co", "Dy", "Er", "Gd", "Hf", "Ho", "Lu",
"Mg", "Re",
"Ru", "Sc", "Tb", "Ti", "Tl", "Tm", "Y", "Zn", "Zr"]
others = ["Sb", "Sm", "Bi", "Ce", "Sn", "Si"]
# Others (Solids): "Sb", "Sm", Bi" and "As" are Rhombohedral; "C" , "Ce"
and "Sn" are Allotropic;
# "Si" and "Ge" are Face-centered diamond-cubic;
data_youngs_modulus = []
data_lattice_constant = []
data_melting_point = []
data_specific_heat = []
data_atomic_mass = []
data_CTE = []
# Sketch of the query loop; the exact attribute sources (Pymatgen vs. Mendeleev) are assumptions
for item in fcc_elements + bcc_elements + hcp_elements + others:
    data_youngs_modulus.append(pymat.Element(item).youngs_modulus)
    data_lattice_constant.append(mendel.element(item).lattice_constant)
    data_melting_point.append(pymat.Element(item).melting_point)
    data_specific_heat.append(mendel.element(item).specific_heat)
    data_atomic_mass.append(pymat.Element(item).atomic_mass)
    data_CTE.append(pymat.Element(item).coefficient_of_linear_thermal_expansion)
print(data_youngs_modulus)
#print(data_lattice_constant)
#print(data_melting_point)
#print(data_specific_heat)
#print(data_atomic_mass)
#print(data_CTE)
2. Processing and Organizing Data
Simple linear regression is one of the easiest models to implement, and it lets you see how
one property is related to another explanatory variable. As with all machine learning models,
we will use a training set and a testing set.
SETS
Each of the lists we just created contains values for the properties of the elements. Analyzing
the code, you can infer that the indexes in these arrays represent a specific element. For
example:
For element "Ag" with index 2, the properties would be located at:
data_youngs_modulus[2]
data_lattice_constant[2]
data_melting_point[2]
data_specific_heat[2]
data_atomic_mass[2]
data_CTE[2]
TRAINING AND TESTING SETS
With all this data, we will select 45 elements to become the training set and 6 elements to be
the testing set.
The training group will be used to train, or develop, the model; the second group will be used
for testing. This is done to check for and avoid overfitting. While overfitting is unlikely in a
linear model, it can happen in more complex models, like neural networks.
NORMALIZATION
Machine Learning models usually require data to be processed and normalized. In this linear
model we will process the data to be in the format required by the sklearn model, but we will
not normalize as we are interested in visualizing trends in our raw data, instead of precise
predictions.
# The reshape function in the next lines turns each of the horizontal lists [x, y, z] into a
# vertical NumPy array [[x]
#                       [y]
#                       [z]]
# This step is required to work with the Sklearn linear model
melt_train = data_melting_point[:45]   # Melting point is split into training and testing values, like the other properties below
melt_test = data_melting_point[-6:]
melt_train = np.array(melt_train).reshape(-1,1)
melt_test = np.array(melt_test).reshape(-1,1)
young_train = data_youngs_modulus[:45]
young_test = data_youngs_modulus[-6:]
young_train = np.array(young_train).reshape(-1,1)
young_test = np.array(young_test).reshape(-1,1)
lattice_train = data_lattice_constant[:45]
lattice_test = data_lattice_constant[-6:]
lattice_train = np.array(lattice_train).reshape(-1,1)
lattice_test = np.array(lattice_test).reshape(-1,1)
specheat_train = data_specific_heat[:45]
specheat_test = data_specific_heat[-6:]
specheat_train = np.array(specheat_train).reshape(-1,1)
specheat_test = np.array(specheat_test).reshape(-1,1)
mass_train = data_atomic_mass[:45]
mass_test = data_atomic_mass[-6:]
mass_train = np.array(mass_train).reshape(-1,1)
mass_test = np.array(mass_test).reshape(-1,1)
Sklearn uses an "Ordinary least squares linear regression" for its Linear model.
Implementation is described here. We will run it will all default values.
The regression function in the next cell creates a model. It does so by declaring a linear
model, and then fitting it to the training set. We then make predictions feeding the model the
X-axis testing set.
We can then use the trained model (represented as an equation) to look at the variance score
and understand how well our model does.
# Sketch of the regression function described above (reconstructed; the function name and print-outs are illustrative)
def regression(x_train, y_train, x_test, y_test):
    model = linear_model.LinearRegression() # Declare the linear model
    model.fit(x_train, y_train)             # Fit it to the training set
    predictions = model.predict(x_test)     # Make predictions for the X-axis testing set
    print("Mean squared error: %.2f" % mean_squared_error(y_test, predictions))
    print("Variance score: %.2f" % r2_score(y_test, predictions))
    return predictions
Exercise 1. Research and provide a definition of the mean squared error and the variance score
in terms of the linear model and the data. You can find these definitions here and here.
To visualize the trends that the model fit to our data, we will plot the two properties. We will
use Plotly to create this plot.
• The first cell defines a function that organizes data and generates the plot
• The following cells train models and plot
plotly.offline.init_notebook_mode(connected=True)
# The reshape functions in the next lines turn each of the
# vertical NumPy arrays [[x]
#                        [y]
#                        [z]]
# into Python lists [x, y, z]
# This step is required to create plots with Plotly like we did in the previous tutorial
x_train = x_train.reshape(1,-1).tolist()[0]
x_test = x_test.reshape(1,-1).tolist()[0]
y_train = y_train.reshape(1,-1).tolist()[0]
y_test = y_test.reshape(1,-1).tolist()[0]
predictions = predictions.reshape(1,-1).tolist()[0]
full_x_list = x_train + x_test
• Exercise 2. For the three cases above, what type of correlation do you find between
properties? Explain the origin of these approximate correlations.
• Exercise 3. Plot two other properties of your choice.
• Exercise 4. Note the mean squared error in the various models. What error would you make,
on average, if you used the melting temperature to predict Young's modulus?
Using neural networks to predict and classify crystal
structures of elements
Why? Neural networks are widely used for image classification, learning structures and
substructures within the data to identify patterns. Such neural-network-based classifiers can
likewise identify patterns and correlations within materials data.
What? In this tutorial we will learn how to use neural networks to create a classification
model that estimates the ground-state crystal structure of elements.
This is the third tutorial, following the MSEML Query_Viz, and was based on the
classification tutorial in the TensorFlow Tutorials.
How to use this? This tutorial uses Python; some familiarity with programming would be
beneficial but is not required. Run each code cell in order by pressing "Shift + Enter". Feel
free to modify the code, or change the queries, to familiarize yourself with the workings of the
code.
Outline:
1. Getting a dataset
2. Processing and Organizing Data
3. Creating the Model
4. Plotting
1. Getting a dataset
Datasets containing properties for the elements in the periodic table are available online;
however, it would be thematic to create our own, using the tools from the first tutorial on
MSEML Query_Viz. In this section we will query both Pymatgen and Mendeleev to get a
complete set of properties per element. We will use this data to create the cases from which
the model will train and test.
In this first snippet of code we will import all relevant libraries, the elements that will be
turned into cases, and the properties that will serve as the attributes for the cases. We will get
47 entries (a small dataset), but this should still give us a somewhat accurate prediction. It is
important to note that more entries would move the prediction closer to the real value, and so
would more attributes.
The elements listed were chosen because querying them for these properties yields a dataset
with no unknown values, and because they represent the three most common crystallographic
structures.
import tensorflow as tf
from tensorflow import keras
from keras import initializers
from keras.layers import Dense
from keras.models import Sequential
import random                  # used below to shuffle the element list
import numpy as np             # used for array handling
import pandas as pd            # used to display the dataset
import pymatgen as pymat       # used for the property queries
import mendeleev as mendel     # used for the property queries
%matplotlib inline
import matplotlib.pyplot as plt
import sys
fcc_elements = ["Ag", "Al", "Au", "Cu", "Ir", "Ni", "Pb", "Pd", "Pt", "Rh",
"Th", "Yb"]
bcc_elements = ["Ba", "Ca", "Cr", "Cs", "Eu", "Fe", "Li", "Mn", "Mo", "Na",
"Nb", "Rb", "Ta", "V", "W" ]
hcp_elements = ["Be", "Cd", "Co", "Dy", "Er", "Gd", "Hf", "Ho", "Lu", "Mg",
"Re",
"Ru", "Sc", "Tb", "Ti", "Tl", "Tm", "Y", "Zn", "Zr"]
elements = fcc_elements + bcc_elements + hcp_elements # Combine all 47 labeled elements
random.Random(1).shuffle(elements) # Shuffle with a fixed seed so results are reproducible
As before, we will use the database queries to populate lists which can be displayed by the
Pandas library in a user-friendly table with the properties as the column headers.
if (item in fcc_elements):
    all_labels.append([1, 0, 0]) # The crystal structure labels are assigned here
elif (item in bcc_elements):
    all_labels.append([0, 1, 0]) # The crystal structure labels are assigned here
elif (item in hcp_elements):
    all_labels.append([0, 0, 1]) # The crystal structure labels are assigned here
# Pandas Dataframe
df = pd.DataFrame(all_values, columns=querable_values)
# We will patch some of the values that are not available in the datasets.
df.head(n=10)
[Dataframe output: the first ten rows of the dataset, with columns atomic_number, atomic_volume, boiling_point, en_ghosh, evaporation_heat, heat_of_formation, lattice_constant, melting_point, specific_heat, atomic_mass, atomic_radius, electrical_resistivity, molar_volume, bulk_modulus, youngs_modulus, average_ionic_radius, density_of_solid, and coefficient_of_linear_thermal_expansion.]
We now organize the data into training and testing sets as before, and this time we also normalize it.
SETS
We have 47 elements for which the crystal structure is known and we will use 40 of these as
a training set and the remaining 7 as testing set.
NORMALIZATION
We will again use the Standard Score Normalization, which subtracts the mean of the feature
and divides by its standard deviation:
$\frac{X - \mu}{\sigma}$
While our model might converge without feature normalization, the resultant model would be
difficult to train and would be dependent on the choice of units used in the input.
# SETS
# Training Set
train_values = all_values[:40, :]
train_labels = all_labels[:40, :]
# Testing Set
test_values = all_values[-7:, :]
test_labels = all_labels[-7:, :]
# NORMALIZATION
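A minimal sketch of the standard-score normalization (using the training-set statistics for both sets, which is a common choice and an assumption here):
# Subtract the training-set mean of each feature and divide by its standard deviation
mean = np.mean(train_values, axis=0)
std = np.std(train_values, axis=0)
train_values = (train_values - mean) / std
test_values = (test_values - mean) / std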
For this classification, we will use a simple sequential neural network with one densely
connected hidden layer. The optimizer used will be RMSPropOptimizer (Root Mean Square
Propagation).
The key difference between the regression model and the classification model is our metric to
measure network performance. While we used mean squared error (between the true outputs
and the network's predicted output) for the regression task, we use categorical crossentropy
(click here to learn more about it), using classification accuracy as a metric where higher
accuracy implies a better network.
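The model below references a weight initializer and an optimizer; a sketch of how they might be defined (the specific choices and values are assumptions) is:
from keras import optimizers # imported here for the optimizer sketch

kernel_init = initializers.RandomNormal(seed=0) # Reproducible random weight initialization
optimizer = optimizers.RMSprop(lr=0.001)        # Root Mean Square Propagation optimizer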
model = Sequential()
model.add(Dense(16, activation='relu', input_shape=(train_values.shape[1],), kernel_initializer=kernel_init))
#model.add(Dense(16, activation='relu', kernel_initializer=kernel_init))
model.add(Dense(3, activation=tf.nn.softmax)) # Output Layer
# This line matches the optimizer to the model and states which metrics will evaluate the model's accuracy
model.compile(loss='categorical_crossentropy', optimizer=optimizer, metrics=['accuracy'])
model.summary()
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
dense_1 (Dense) (None, 16) 304
_________________________________________________________________
dense_2 (Dense) (None, 3) 51
=================================================================
Total params: 355
Trainable params: 355
Non-trainable params: 0
_________________________________________________________________
TRAINING
This model is trained for 350 epochs, and we record the training accuracy in the history
object. This way, by plotting "history" we can see the evolution of the "learning" of the
model, that is, the increase in classification accuracy. Models in Keras are fitted to the
training set using the fit method.
One Epoch occurs when you pass the entire dataset through the model. One Batch contains a
subset of the dataset that can be fed to the model at the same time. A more detailed
explanation of these concepts can be found in this blog. As we have a really small dataset
compared to the ones that are usually considered to be modeled by these neural networks, we
are feeding all entries at the same time, so our batch is the entire dataset, and an epoch occurs
when the batch is processed.
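A sketch of the fit call described above (the validation split is an assumption; it is what produces the validation accuracy curve plotted below):
# Train for 350 epochs, feeding the whole training set as one batch and holding back
# a fraction of it for validation; the history records accuracy per epoch
history = model.fit(train_values, train_labels, epochs=350,
                    batch_size=train_values.shape[0],
                    validation_split=0.15, verbose=0)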
plt.figure()
plt.xlabel('Epoch')
plt.ylabel('Accuracy')
plt.plot(history.epoch, np.array(history.history['acc']), label='Training Accuracy')
plt.plot(history.epoch, np.array(history.history['val_acc']), label='Validation Accuracy')
plt.legend()
plt.show()
TESTING
Models in Keras are tested using the method evaluate. This method returns the classification
accuracy on the training and the testing sets.
# Evaluate on both sets, as described above
loss, acc = model.evaluate(train_values, train_labels, verbose=0)
print("Training set accuracy: {:.3f}".format(acc))
loss, acc = model.evaluate(test_values, test_labels, verbose=0)
print("Testing set accuracy: {:.3f}".format(acc))
The last step is to make predictions for cases not in the training set, which are obtained with
the method predict. In the following cells we compare the true crystal structures of the
elements with the predictions generated by the machine learning model.
train_predictions = model.predict(train_values)
test_predictions = model.predict(test_values)
all_predictions = np.concatenate((train_predictions, test_predictions)) # Combined in the same order as all_labels
predicted_labels = []
true_labels = []
for i in range(all_predictions.shape[0]):
if (np.argmax(all_predictions[i]) == 0):
predicted_labels.append("FCC")
if (np.argmax(all_labels[i]) == 0):
true_labels.append("FCC")
if (np.argmax(all_predictions[i]) == 1):
predicted_labels.append("BCC")
if (np.argmax(all_labels[i]) == 1):
true_labels.append("BCC")
if (np.argmax(all_predictions[i]) == 2):
predicted_labels.append("HCP")
if (np.argmax(all_labels[i]) == 2):
true_labels.append("HCP")
plot_df
Atomic number True crystal structure Predicted crystal structure
0 27 HCP HCP
1 69 HCP HCP
2 39 HCP HCP
3 75 HCP HCP
4 28 FCC FCC
5 67 HCP HCP
6 79 FCC FCC
7 21 HCP HCP
8 45 FCC FCC
9 74 BCC BCC
10 64 HCP HCP
11 65 HCP HCP
12 72 HCP HCP
13 70 FCC FCC
14 55 BCC BCC
15 30 HCP HCP
16 56 BCC BCC
17 25 BCC BCC
18 26 BCC BCC
19 42 BCC BCC
20 11 BCC BCC
21 71 HCP HCP
22 90 FCC FCC
23 29 FCC FCC
24 3 BCC BCC
25 81 HCP HCP
26 23 BCC BCC
27 37 BCC BCC
28 40 HCP HCP
29 24 BCC BCC
30 41 BCC BCC
31 47 FCC FCC
32 4 HCP HCP
33 44 HCP HCP
34 13 FCC FCC
35 22 HCP HCP
36 82 FCC FCC
37 20 BCC BCC
38 73 BCC BCC
39 66 HCP HCP
40 48 HCP HCP
41 68 HCP HCP
42 46 FCC FCC
43 63 BCC HCP
44 77 FCC FCC
45 12 HCP HCP
46 78 FCC FCC
# --------------------------------------------------------------
# This block will be used to sort the elements by their atomic number
# --------------------------------------------------------------
import plotly as py
import plotly.graph_objs as go
from plotly.subplots import make_subplots
from plotly.offline import iplot
py.offline.init_notebook_mode(connected=True)
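The bar charts below use per-class probability lists; a sketch of how they could be built from the predictions obtained earlier (the list names are assumptions) is:
# Per-class probabilities from the softmax output, in the same order as "elements";
# "all_predictions" is the combined train+test prediction array from the earlier cell
FCC_prediction = [p[0] for p in all_predictions]
BCC_prediction = [p[1] for p in all_predictions]
HCP_prediction = [p[2] for p in all_predictions]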
# ---------
fig = make_subplots(rows=3, cols=1) # Sketch: one panel each for FCC, BCC and HCP elements
fig.append_trace(go.Bar(x=[_ for _ in elements if _ in fcc_elements],
y=[FCC_prediction[_] for _ in range(len(FCC_prediction)) if elements[_] in
fcc_elements], name='FCC', marker=dict(color='green'), showlegend=False,
textposition='inside', textfont={"size":24},
text=['*' if _ in elements[-7:] else None for _ in
[_ for _ in elements if _ in fcc_elements]]), row=1, col=1)
fig.append_trace(go.Bar(x=[_ for _ in elements if _ in fcc_elements],
y=[BCC_prediction[_] for _ in range(len(BCC_prediction)) if elements[_] in
fcc_elements], name='BCC', marker=dict(color='red'), showlegend=False),
row=1, col=1)
fig.append_trace(go.Bar(x=[_ for _ in elements if _ in fcc_elements],
y=[HCP_prediction[_] for _ in range(len(HCP_prediction)) if elements[_] in
fcc_elements], name='HCP', marker=dict(color='red'), showlegend=False),
row=1, col=1)
# ---------
# ---------
# ---------
fig.append_trace(go.Bar(x=[_ for _ in elements if _ in hcp_elements],
y=[FCC_prediction[_] for _ in range(len(FCC_prediction)) if elements[_] in
hcp_elements], name='FCC', marker=dict(color='red'), showlegend=False),
row=3, col=1)
fig.append_trace(go.Bar(x=[_ for _ in elements if _ in hcp_elements],
y=[BCC_prediction[_] for _ in range(len(BCC_prediction)) if elements[_] in
hcp_elements], name='BCC', marker=dict(color='red'), showlegend=False),
row=3, col=1)
fig.append_trace(go.Bar(x=[_ for _ in elements if _ in hcp_elements],
y=[HCP_prediction[_] for _ in range(len(HCP_prediction)) if elements[_] in
hcp_elements], name='HCP', marker=dict(color='green'), showlegend=False,
textposition='inside', textfont={"size":24},
text=['*' if _ in elements[-7:] else None for _ in
[_ for _ in elements if _ in hcp_elements]]), row=3, col=1)
# ---------
fig.update_xaxes(title=go.layout.xaxis.Title(text="FCC Elements",
font=dict(size=18)),showgrid=True, tickfont=dict(size=18), row=1, col=1)
fig.update_xaxes(title=go.layout.xaxis.Title(text="BCC Elements",
font=dict(size=18)),showgrid=True, tickfont=dict(size=18), row=2, col=1)
fig.update_xaxes(title=go.layout.xaxis.Title(text="HCP Elements",
font=dict(size=18)),showgrid=True, tickfont=dict(size=18), row=3, col=1)
fig.update_yaxes(title=go.layout.yaxis.Title(text="Probability",
font=dict(size=18)),showgrid=True, tickfont=dict(size=18),range=[0, 1.2],
row=1, col=1)
fig.update_yaxes(title=go.layout.yaxis.Title(text="Probability",
font=dict(size=18)),showgrid=True, tickfont=dict(size=18),range=[0, 1.2],
row=2, col=1)
fig.update_yaxes(title=go.layout.yaxis.Title(text="Probability",
font=dict(size=18)),showgrid=True, tickfont=dict(size=18),range=[0, 1.2],
row=3, col=1)
fig.show()
What? In this tutorial we will learn how to use neural networks from the Keras library to
create a regression model to estimate Young's modulus.
You can find another example of neural network regression using Keras in the TensorFlow
Tutorials nanoHUB tool.
How to use this? This tutorial uses Python; some familiarity with programming would be
beneficial but is not required. Run each code cell in order by pressing "Shift + Enter". Feel
free to modify the code, or change the queries, to familiarize yourself with the workings of the
code.
Outline:
1. Getting data
2. Processing and Organizing Data
3. Creating the Model
4. Plotting
1. Getting a dataset
We will repeat the process of obtaining a dataset used in the previous tutorial and the
explanations are repeated for convenience.
Datasets containing properties for the elements in the periodic table are available online;
however, it would be thematic to create our own, using the tools from the first tutorial on
MSEML Query_Viz. In this section we will query both Pymatgen and Mendeleev to get a
complete set of properties per element. We will use this data to create the cases from which
the model will train and test.
In this first snippet of code we will import all relevant libraries, the elements that will be
turned into cases, and the properties that will serve as the attributes for the cases. We will get
49 entries (a small dataset), but this should still give us a somewhat accurate prediction. We
will also include some values to "patch" some unknown values in the dataset. It is important
to note that more entries would move the prediction closer to the real value, and so would
more attributes.
The elements listed were chosen because querying them for these properties yields a dataset
with few unknown values, and because they represent the three most common
crystallographic structures.
import tensorflow as tf
import keras
from keras import initializers
from keras.layers import Dense
from keras.models import Sequential
from keras import optimizers
import numpy as np             # used for array handling
import pandas as pd            # used to display the dataset
import pymatgen as pymat       # used for the property queries
import mendeleev as mendel     # used for the property queries
import sys
import os
sys.path.insert(0, '../src/')
%matplotlib inline
import matplotlib.pyplot as plt
fcc_elements = ["Ag", "Al", "Au", "Cu", "Ir", "Ni", "Pb", "Pd", "Pt", "Rh",
"Th", "Yb"]
bcc_elements = ["Ba", "Cr", "Cs", "Eu", "Fe", "Li", "Mn", "Mo", "Na", "Nb",
"Rb", "Ta", "V", "W" ]
hcp_elements = ["Be", "Ca", "Cd", "Co", "Dy", "Er", "Gd", "Hf", "Ho", "Lu",
"Mg", "Re",
"Ru", "Sc", "Tb", "Ti", "Tl", "Tm", "Y", "Zn", "Zr"]
others = ["Si", "Ge"] # "Si" and "Ge" are Face-centered diamond-cubic;
After setting these values, we will proceed with our queries. Depending on the database
(either Pymatgen or Mendeleev) where the property can be found, this code fills up a list with
the properties of each of the elements. To visualize how the dataset we just created looks, we
will use the Pandas library to display it. This library will take the list of lists and show it in a
nice, user-friendly table with the properties as the column headers.
# Pandas Dataframe
df = pd.DataFrame(all_values, columns=querable_values)
# We will patch some of the values that are not available in the datasets.
# The labels (values for Young's modulus) are stored separately for clarity (we drop the column later)
df.to_csv(os.path.expanduser('~/mseml_data.csv'), index=False, compression=None) # This line saves the data we collected into a .csv file in your home directory
all_labels = df['youngs_modulus'].tolist()
df = df.drop(['youngs_modulus'], axis=1)
df.head(n=10) # With this line you can see the first ten entries of our database
• Exercise 1. Use the Pandas dataframe created above to plot Young's modulus vs melting
temperature. Note that this recreates the plot from previous tutorials, but uses the
Pandas framework to slice and access the data.
Most machine learning models are trained on a subset of all the available data, called the
"training set", and the models are tested on the remainder of the available data, called the
"testing set". Model performance has often been found to be enhanced when the inputs are
normalized.
SETS
With the dataset we just created, we have 49 entries for our model. We will train with 44
cases and test on the remaining 5 elements to estimate Young's Modulus.
NORMALIZATION
Each one of these input data features has different units and is represented on scales with
distinct orders of magnitude. Datasets that contain inputs like this need to be normalized, so
that quantities with large values do not overwhelm the neural network, forcing it to tune its
weights to account for the different scales of the input data. In this tutorial, we will use the
Standard Score Normalization, which subtracts the mean of the feature and divides by its
standard deviation:
$\frac{X - \mu}{\sigma}$
While our model might converge without feature normalization, the resultant model would be
difficult to train and would be dependent on the choice of units used in the input.
# We will rewrite the arrays with the patches we made on the dataset by turning the dataframe back into arrays
all_values = df.values               # Sketch of the conversion: dataframe -> NumPy array
all_labels = np.array(all_labels)
# SETS
# Uncomment the line below to shuffle the dataset (we do not do this here to ensure consistent results for every run)
#order = np.argsort(np.random.random(all_labels.shape)) # This numpy argsort returns the indexes that would be used to shuffle a list
order = np.arange(49)
all_values = all_values[order]
all_labels = all_labels[order]
# Training Set
train_labels = all_labels[:44]
train_values = all_values[:44]
# Testing Set
test_labels = all_labels[-5:]
test_values = all_values[-5:]
# This line is used for labels in the plots at the end of the tutorial - Testing Set
labeled_elements = [elements[x] for x in order[-5:]]
elements = [elements[x] for x in order]
# NORMALIZATION
For this regression, we will use a simple sequential neural network with one densely
connected hidden layer. The optimizer used will be RMSPropOptimizer (Root Mean Square
Propagation).
A cool tool developed by Tensorflow to visualize how a neural network learns, and play
around with its parameters, can be found here NN Tools.
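The model below references initializers and an optimizer; a sketch of how they might be defined (the specific choices and values are assumptions) is:
kernel_init = initializers.RandomNormal(seed=0) # Reproducible random weight initialization
bias_init = initializers.Zeros()                # Biases start at zero
optimizer = optimizers.RMSprop(lr=0.001)        # Root Mean Square Propagation optimizer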
model = Sequential()
model.add(Dense(32, activation='relu', input_shape=(train_values.shape[1],), kernel_initializer=kernel_init, bias_initializer=bias_init))
model.add(Dense(64, activation='relu', kernel_initializer=kernel_init, bias_initializer=bias_init))
#model.add(Dense(128, activation='relu', kernel_initializer=kernel_init, bias_initializer=bias_init))
model.add(Dense(1, kernel_initializer=kernel_init, bias_initializer=bias_init))
# This line matches the optimizer to the model and states which metrics will evaluate the model's accuracy
model.compile(loss='mae', optimizer=optimizer, metrics=['mae'])
model.summary()
TRAINING
This model is trained for 2000 epochs, and we record the training error in the history
object.
One Epoch occurs when you pass the entire dataset through the model. One Batch contains a
subset of the dataset that can be fed to the model at the same time. A more detailed
explanation of these concepts can be found in this blog. As we have a really small dataset
compared to the ones that are usually considered to be modeled by these neural networks, we
are feeding all entries at the same time, so our batch is the entire dataset, and an epoch occurs
when the batch is processed.
This way, by plotting "history" we can see the evolution of the "learning" of the model, that is
the decrease of the Mean Absolute Error. Models in Keras are fitted to the training set using
the fit method.
The blue curve that will come up from the History object represents how the model is
learning on the training data, and the orange curve represents the validation loss, which can
be thought of as how the model performs on data that it was not trained on. This validation
loss will start going up again when we begin to overfit the data.
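A sketch of the fit call (the validation split is an assumption; it produces the validation curve described above):
# Train for 2000 epochs, feeding the whole training set as one batch and holding back
# a fraction of it for validation; the history records the Mean Absolute Error per epoch
history = model.fit(train_values, train_labels, epochs=2000,
                    batch_size=train_values.shape[0],
                    validation_split=0.1, verbose=0)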
plt.figure()
plt.xlabel('Epoch')
plt.ylabel('Mean Abs Error')
plt.plot(history.epoch, np.array(history.history['mean_absolute_error']), label='Loss on training set')
plt.plot(history.epoch, np.array(history.history['val_mean_absolute_error']), label='Validation loss')
plt.legend()
plt.show()
SAVING A MODEL
Compiled and trained models in Keras can be saved and distributed in .h5 files using the
model.save() method. Running the cell below will save the current model we trained, both
weights and architecture to your home directory.
model.save(os.path.expanduser('~/model.h5'))
TESTING
Models in Keras are tested using the method evaluate. This method returns the testing loss of
the model and the metrics we specified when creating it, which in our case it's the Mean
Absolute Error. For the original model in this tutorial you should get a value of 29.59 GPa
for the Mean Absolute Error. This value would decrease with more training data, more
attributes/features, or a different optimizer. In the case of a model that overfits, you can
expect values to start increasing.
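A minimal sketch of the evaluation call:
# Evaluate the trained model on the testing set; the metric is the Mean Absolute Error in GPa
loss, mae = model.evaluate(test_values, test_labels, verbose=0)
print("Testing set Mean Absolute Error: {:.2f} GPa".format(mae))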
The last step in a regression model is to make predictions for values not in the training set,
which are determined by the method predict. In the following cell we print the elements in
the testing set, the real values for their Young's moduli and the predictions generated by our
machine learned model.
test_predictions = model.predict(test_values).flatten()
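A sketch of the comparison print-out described above ("labeled_elements" is the list of testing-set symbols defined earlier):
# Print each testing-set element with its true and predicted Young's modulus
for symbol, true_value, prediction in zip(labeled_elements, test_labels, test_predictions):
    print("{}: true = {:.1f} GPa, predicted = {:.1f} GPa".format(symbol, true_value, prediction))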
The easiest way to see if the model did a good job estimating the Young's Modulus for the
Elements is through a plot comparing Real Values with their Predictions. We will use Plotly
to create a plot like that. We covered how to plot in Plotly in the first tutorial of this tool. For
values in this plot, the line (x = y) indicates a perfect match and would be the desirable result
for the points. As you analyze the plot, you can hover on the points to see the data we
obtained in the cell above.
import plotly
import plotly.graph_objs as go
from plotly.offline import iplot
plotly.offline.init_notebook_mode(connected=True)
• Exercise 2. Compare these results for the Young's Modulus with the ones you got
from the Linear Regression Tutorial. How are they different? How can you explain
this difference?
• Exercise 3 (Advanced). Uncomment a line in the cell [4] to use three hidden layers in
the neural network and monitor the training. Is this model better than the one with two
hidden layers?