
Querying, Organizing and Visualizing Materials Data

Why? Access to materials data in electronic form enables engineers, scientists and students to explore this data, display it graphically, find trends and develop models.
What? In this tutorial, we will learn how to query, organize and plot data from the databases associated with the Python libraries Pymatgen and Mendeleev.
How to use this? This tutorial uses Python; some familiarity with programming would be beneficial but is not required. Run each code cell in order by pressing "Shift + Enter". Feel free to modify the code, or change queries, to familiarize yourself with the workings of the code.
Suggested modifications and exercises are included in blue.
Outline:
1. Query from Pymatgen
2. Processing and Organizing Data
3. Plotting
4. Query from Mendeleev
Get started: Click "Shift-Enter" on the code cells to run!

# These lines import both libraries and then define an array with the elements to be used below
import pymatgen as pymat
import mendeleev as mendel
import pandas as pd

elements = ['H', 'He', 'Li', 'Be', 'B', 'C', 'N', 'O', 'F', 'Ne', 'Na', 'Mg',
            'Al', 'Si', 'P', 'S', 'Cl', 'Ar', 'K', 'Ca', 'Sc', 'Ti', 'V', 'Cr',
            'Mn', 'Fe', 'Co', 'Ni', 'Cu', 'Zn', 'Ga', 'Ge', 'As', 'Se', 'Br',
            'Kr', 'Rb', 'Sr', 'Y', 'Zr', 'Nb', 'Mo', 'Tc', 'Ru', 'Rh', 'Pd', 'Ag',
            'Cd', 'In', 'Sn', 'Sb', 'Te', 'I', 'Xe', 'Cs', 'Ba', 'Hf', 'Ta', 'W',
            'Re', 'Os', 'Ir', 'Pt', 'Au', 'Hg', 'Tl', 'Pb', 'Bi', 'La', 'Ce', 'Pr',
            'Nd', 'Pm', 'Sm', 'Eu', 'Gd', 'Tb', 'Dy', 'Ho', 'Er', 'Tm', 'Yb', 'Lu',
            'Ac', 'Th', 'Pa', 'U', 'Np', 'Pu']

1. Query from Pymatgen


Pymatgen is an open-source Python library for materials analysis. It is a powerful and popular resource that can be used to access data in two repositories: the Materials Project and the Crystallography Open Database. Pymatgen makes querying these resources and obtaining data from its internal database easy. We will start by querying the database within the library using the Element class.
Making a query in Pymatgen requires the chemical symbol of the element; the symbols are all listed in the cell above. From there, each property is accessible as an attribute of that Element object. For a list of all the available properties, click here to learn more about the Element class.
In this example we will query the Young's modulus for the elements in the list "sample". You will be able to see the values with the corresponding units for this quantity. You can use the commented code to query all the properties listed for the "sample" elements.

querable_pymatgen = ["atomic_mass", "poissons_ratio", "atomic_radius",
                     "electrical_resistivity", "molar_volume", "thermal_conductivity",
                     "bulk_modulus", "youngs_modulus", "brinell_hardness",
                     "average_ionic_radius", "melting_point", "rigidity_modulus",
                     "density_of_solid", "coefficient_of_linear_thermal_expansion"]
sample = ['Fe', 'Co', 'Ni', 'Cu', 'Zn']

for item in sample:
    element_object = pymat.Element(item)
    print(item, element_object.youngs_modulus)  # You can change "youngs_modulus" to any of the properties in the querable_pymatgen list

#for item in sample:
#    for i in querable_pymatgen:
#        element_object = pymat.Element(item)
#        print(item, i, getattr(element_object, i))

• Exercise 1. Modify the query above to extract Brinell hardness.


• Exercise 2. Uncomment the lines above to see all the properties of the selected
elements.
Remember: "Shift-Enter" to re-run the cell.
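One possible solution sketch for Exercise 1 (there are other ways; "brinell_hardness" is one of the attribute names in the querable_pymatgen list above):

# Exercise 1 sketch: query Brinell hardness instead of Young's modulus
for item in sample:
    print(item, pymat.Element(item).brinell_hardness)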
2. Processing and Organizing Data
After going through the basics of a query, we will now learn how to organize data in Python lists and dictionaries.
Entries in a dictionary have a key (in our case, the name of a property) and a value associated with it. Dictionaries can be useful to store a collection of data values for a particular element. In this example, we will create one to store some of the properties of iron, using queries from both of the libraries we discussed. Note that the specific heat is obtained from Mendeleev, which is another database for accessing properties of elements.

Fe_data = {}  # Initializing a dictionary

# Each of the following lines makes a single entry

Fe_data["atomic_number"] = mendel.element("Fe").atomic_number
Fe_data["coefficient_of_linear_thermal_expansion"] = pymat.Element("Fe").coefficient_of_linear_thermal_expansion
Fe_data["youngs_modulus"] = pymat.Element("Fe").youngs_modulus
Fe_data["specific_heat"] = mendel.element("Fe").specific_heat

# Print the entire entry for Fe
print(Fe_data)

# Print a specific attribute
print(Fe_data["specific_heat"])

# This line deletes an entry
# del Fe_data["atomic_number"]

Another way we can organize data is in lists, which can be very helpful if we want to create plots with our data. Following the examples above, we will now query three specific properties for all elements, obtaining lists of values whose indexes correspond to the positions of the elements in the "elements" list in the first cell of the tutorial.
sample = elements.copy()

CTE = []                  # In this list we will store the Coefficients of Thermal Expansion
youngs_modulus = []       # In this list we will store the Young's Moduli
melting_temperature = []  # In this list we will store the Melting Temperatures

for item in sample:
    CTE.append(pymat.Element(item).coefficient_of_linear_thermal_expansion)
    youngs_modulus.append(pymat.Element(item).youngs_modulus)
    melting_temperature.append(pymat.Element(item).melting_point)

# You can visualize the lists by uncommenting these print statements
#print(CTE)
#print(youngs_modulus)
#print(melting_temperature)

# We will use the following arrays to group elements by their crystal structure at RT;
# all elements that are gases or liquids at RT have been removed

fcc_elements = ["Ag", "Al", "Au", "Cu", "Ir", "Ni", "Pb", "Pd", "Pt", "Rh", "Sr", "Th", "Yb"]
bcc_elements = ["Ba", "Cr", "Cs", "Eu", "Fe", "K", "Li", "Mn", "Mo", "Na", "Nb", "P", "Rb", "Ta",
                "V", "W"]
hcp_elements = ["Be", "Ca", "Cd", "Co", "Dy", "Er", "Gd", "Hf", "Ho", "Lu", "Mg", "Os", "Re", "Ru",
                "Sc", "Tb", "Tc", "Ti", "Tl", "Tm", "Y", "Zn", "Zr"]

# Others (Solids): "B", "Sb", "Sm", "Bi" and "As" are Rhombohedral; "C", "Ce" and "Sn" are Allotropic;
# "Si" and "Ge" are Face-centered diamond-cubic; "Pu" is Monoclinic;
# "S", "I", "U", "Np" and "Ga" are Orthorhombic; "Se" and "Te" are Hexagonal;
# "In" and "Pa" are Tetragonal; "La", "Pr", "Nd" and "Pm" are Double hexagonal close-packed

Finally, the most efficient way to visualize the dataset we just created is to use the Pandas library to display it. This library will take the list of lists and show it in a nice, user-friendly table with the properties as the column headers.
For this exercise, we will work with the data extracted for elements with the FCC crystal structure.
First, we will create a list of lists using a for-loop and the values we can query from the Pymatgen library. We can specify the names for each column from the array of properties we queried.

all_values = []  # Values for Attributes

for item in fcc_elements:
    element_values = []
    element_object = pymat.Element(item)
    for i in querable_pymatgen:
        element_values.append(getattr(element_object, i))
    all_values.append(element_values)  # All lists are appended to another list, creating a list of lists

# Pandas Dataframe
df = pd.DataFrame(all_values, columns=querable_pymatgen)
display(df)

Pandas allows for easier manipulation of the data than the structures we discussed before, both dictionaries and lists of lists. We will modify this dataframe in each of the following cells to showcase the flexibility the Pandas library offers.
For example, to make this dataframe look better, we can start by using the list of elements instead of numbered rows.

df.index = fcc_elements
display(df)

We can then use simple Pandas binary operations to show only the elements that satisfy a certain condition.
The first cell will display a version of the dataframe filtered to elements that have an atomic mass greater than or equal to 150 u (Pandas operator .ge).
The second cell will display a version of the dataframe filtered to elements with a Poisson's ratio of exactly 0.26 (Pandas operator .eq).
There are standard operators for greater or equal (.ge), less or equal (.le), equal (.eq) and not equal (.ne). A list of such operations can be found here. However, we can also create our own custom binary conditions.
The third cell will display a version with a custom binary condition: the elements shown have a Young's modulus less than 120 GPa and a Poisson's ratio greater than 0.25.

df_big_atoms = df[df.atomic_mass.ge(150)]
display(df_big_atoms)

df_poisson = df[df.poissons_ratio.eq(0.26)]
display(df_poisson)

df_condition = df[(df['youngs_modulus'] < 120) & (df["poissons_ratio"] > 0.25)]
display(df_condition)

3. Plotting
Finally, we are going to plot the values for the properties in the lists we just created. For this
tutorial we will make two scatter plots:
• Young's Modulus vs Melting Temperature
• Coefficient of Linear Thermal Expansion vs Melting Temperature
We will be using a Python library called Plotly to create these plots. This library allows you to create interactive, highly customizable plots.
Simple Plot
In this first cell we will import the library components we will use and create a simple plot.
import plotly                     # This is the library import
import plotly.graph_objs as go    # This is the graphical object (think "plt" in Matplotlib if you have used that before)

from plotly.offline import iplot  # These lines are necessary to run Plotly in Jupyter Notebooks, but not in a dedicated environment
plotly.offline.init_notebook_mode(connected=True)

# To create a plot, you need a layout and a trace

# The layout gives Plotly the instructions on the background grids, titles in the plot,
# axes names, axes ticks, legends, labels, colors on the figure and general formatting.

layout = go.Layout(title="Young's Modulus vs Melting Temperature",
                   xaxis=dict(title='Melting Temperature (K)'),
                   yaxis=dict(title="Young's Modulus (GPa)"))

# The trace contains a type of plot (in this case Scatter, but it can be Bars, Lines, Pie Charts, etc.),
# the data we want to visualize and the way ("mode") we want to represent it.

trace = go.Scatter(x=melting_temperature, y=youngs_modulus, mode='markers')

# To plot, we create a figure and implement our components in the following way:

data = [trace]  # We could include more than just one trace here

fig = go.Figure(data, layout=layout)
iplot(fig)

Now that we know how to make a basic plot, we can start adding more details to end up with something that looks a little better. All modifications are explained in the comments, but you can also find that information at https://plotly.com/python/axes/.
Before we start our new plot, wouldn't it look better if we could label the points with the elements' names and color them according to their crystal structures?

# Here we create a function that takes a value x (which will be the symbol of the element)
# and returns a color depending on what its crystal structure is in our arrays from the beginning.
# We do this because we want to color the data according to the crystal structure,
# so we have to pass this information to the plot.

def SetColor_CrystalStr(x):
    if x in fcc_elements:
        return "red"  # These are standard CSS colors, but you can also use hexadecimal colors ("#009900") or RGB ("rgb(0, 128, 0)")
    elif x in bcc_elements:
        return "blue"
    elif x in hcp_elements:
        return "yellow"
    else:
        return "lightgray"

# We then create a list that passes all element symbols through this function, using the Python function "map".
# Map takes each element of a list and evaluates it in a function.

colors = list(map(SetColor_CrystalStr, sample))

# You can see what this list of generated colors looks like by uncommenting this line
#print(colors)

layout0 = go.Layout(hovermode='closest',   # Hovermode establishes the way the labels that appear when you hover are arranged
                    width=600, height=600, # Establishing a square plot: width = height
                    showlegend=True,
                    xaxis=dict(title=go.layout.xaxis.Title(text='Melting Temperature (K)', font=dict(size=24)),
                               zeroline=False, gridwidth=1, tickfont=dict(size=18)),  # Axis title. Removing the x-axis mark. Adding a grid
                    yaxis=dict(title=go.layout.yaxis.Title(text="Young's Modulus (GPa)", font=dict(size=24)),
                               zeroline=False, gridwidth=1, tickfont=dict(size=18)),  # Axis title. Removing the y-axis mark. Adding a grid
                    legend=dict(font=dict(size=24)))  # Adding a legend

# Trace

trace0 = go.Scatter(x=melting_temperature, y=youngs_modulus, mode='markers',
                    marker=dict(size=14, line=dict(width=1), color=colors),  # We add a size, a border and our custom colors to the markers
                    text=sample,  # This attribute labels each point with the element symbol, which shares its index with the properties
                    showlegend=False)

# Empty traces for the legend

legend_plot_FCC = go.Scatter(x=[None], y=[None], mode='markers',
                             marker=dict(size=14, line=dict(width=1), color='red'), name='FCC')
legend_plot_BCC = go.Scatter(x=[None], y=[None], mode='markers',
                             marker=dict(size=14, line=dict(width=1), color='blue'), name='BCC')
legend_plot_HCP = go.Scatter(x=[None], y=[None], mode='markers',
                             marker=dict(size=14, line=dict(width=1), color='yellow'), name='HCP')

data = [trace0, legend_plot_FCC, legend_plot_BCC, legend_plot_HCP]

fig = go.Figure(data, layout=layout0)
iplot(fig)
Exercise 3. a) Find the three metals with the highest Young's moduli. b) What are the Young's moduli of Al, Fe and Pb?

layout0 = go.Layout(hovermode='closest',   # Hovermode establishes the way the labels that appear when you hover are arranged
                    width=600, height=600, # Establishing a square plot: width = height
                    showlegend=True,
                    xaxis=dict(title=go.layout.xaxis.Title(text='Melting Temperature (K)', font=dict(size=24)),
                               zeroline=False, gridwidth=1, tickfont=dict(size=18)),  # Axis title. Removing the x-axis mark. Adding a grid
                    yaxis=dict(title=go.layout.yaxis.Title(text='Coefficient of Linear Thermal Expansion (K<sup>-1</sup>)', font=dict(size=24)),
                               zeroline=False, gridwidth=1, tickfont=dict(size=18)),  # Axis title. Removing the y-axis mark. Adding a grid
                    legend=dict(font=dict(size=24)))  # Adding a legend

# Trace

trace0 = go.Scatter(x=melting_temperature, y=CTE, mode='markers',
                    marker=dict(size=14, line=dict(width=1), color=colors),  # We add a size, a border and our custom colors to the markers
                    text=sample,  # This attribute labels each point with the element symbol, which shares its index with the properties
                    showlegend=False)

# Empty traces for the legend

legend_plot_FCC = go.Scatter(x=[None], y=[None], mode='markers',
                             marker=dict(size=14, line=dict(width=1), color='red'), name='FCC')
legend_plot_BCC = go.Scatter(x=[None], y=[None], mode='markers',
                             marker=dict(size=14, line=dict(width=1), color='blue'), name='BCC')
legend_plot_HCP = go.Scatter(x=[None], y=[None], mode='markers',
                             marker=dict(size=14, line=dict(width=1), color='yellow'), name='HCP')

data = [trace0, legend_plot_FCC, legend_plot_BCC, legend_plot_HCP]

fig = go.Figure(data, layout=layout0)
iplot(fig)

• Exercise 4. Do you find correlations between the properties plotted? If so, what are the underlying reasons for them?
• Exercise 5. Select a different pair of properties and create a similar plot. You can insert new cells below from the top menu (Insert -> Cell Below) and copy and paste the code to create new plots.
4. Query from Mendeleev
Another database we can query in a similar way is Mendeleev. Mendeleev is a Python library dedicated to providing API (application programming interface) access to properties of the elements in the periodic table. Just like Pymatgen, Mendeleev uses an object and its attributes to handle a query; Mendeleev uses the element class (note that it is all lowercase).
A query in Mendeleev can be made either with the chemical symbol, the same way Pymatgen does it, or with the atomic number of the element. Similarly, you can get a property by using it as an attribute of the object. Not all the properties that you can query are listed in this tutorial, but you can find them here. Note that Mendeleev does not provide units when returning values, but you can find them at the same link.
In this example we will query the thermal conductivity for the elements in the list "sample". With a little bit of programming experience in Python you can again use the commented code to query all the properties listed for the "sample" elements.

querable_mendeleev = ["atomic_number", "atomic_volume", "boiling_point", "electron_affinity",
                      "en_allen", "en_pauling", "econf", "evaporation_heat", "fusion_heat",
                      "heat_of_formation", "lattice_constant", "melting_point", "specific_heat",
                      "thermal_conductivity"]

# You can get the same results using either of these two lists (the numbers are the elements' atomic numbers)
sample = ['Fe', 'Co', 'Ni', 'Cu', 'Zn']
#sample = [26, 27, 28, 29, 30]

for item in sample:
    element_object = mendel.element(item)
    print(item, element_object.thermal_conductivity)  # You can put any of the properties in the querable_mendeleev list

#for item in sample:
#    for i in querable_mendeleev:
#        element_object = mendel.element(item)
#        print(item, i, getattr(element_object, i))

Using a linear regression model to predict material properties

Why? Creating regression models that represent correlations between properties associated with materials enables engineers, scientists and students to find interesting trends in datasets and develop simple surrogate models.

What? In this tutorial, we will learn how to use the linear regression model from the Scikit-learn Python library to develop simple models for correlations between properties across materials. Data will be obtained from Pymatgen and Mendeleev.

How to use this? This tutorial uses Python; some familiarity with programming would be beneficial but is not required. Run each code cell in order by pressing "Shift + Enter". Feel free to modify the code, or change queries, to familiarize yourself with the workings of the code.

Suggested modifications and exercises are included in blue.

Outline:
1. Getting data
2. Processing and Organizing Data
3. Creating the Model
4. Plotting

Get started: Click "Shift-Enter" on the code cells to run!

1. Getting Data

Using the queries from the MSEML Query_Viz tutorial, we will create lists containing different properties of the elements. In this first snippet of code we will import all relevant libraries and define the elements that will be turned into cases and the properties that will serve as the attributes for those cases. The elements listed were chosen because they represent the three most common crystallographic structures, and because querying them for these properties yields a dataset with no unknown values.

import numpy as np
import pymatgen as pymat
import mendeleev as mendel
from sklearn import datasets, linear_model
from sklearn.metrics import mean_squared_error, r2_score
from random import shuffle
import matplotlib.pyplot as plt

fcc_elements = ["Ag", "Al", "Au", "Cu", "Ir", "Ni", "Pb", "Pd", "Pt", "Rh", "Th", "Yb"]
bcc_elements = ["Ba", "Cr", "Eu", "Fe", "Li", "Mn", "Mo", "Na", "Nb", "Ta", "V", "W"]
hcp_elements = ["Be", "Ca", "Cd", "Co", "Dy", "Er", "Gd", "Hf", "Ho", "Lu", "Mg", "Re",
                "Ru", "Sc", "Tb", "Ti", "Tl", "Tm", "Y", "Zn", "Zr"]
others = ["Sb", "Sm", "Bi", "Ce", "Sn", "Si"]
# Others (Solids): "Sb", "Sm", "Bi" and "As" are Rhombohedral; "C", "Ce" and "Sn" are Allotropic;
# "Si" and "Ge" are Face-centered diamond-cubic

elements = fcc_elements + bcc_elements + hcp_elements + others

# This function randomly rearranges the elements so that all groups are represented
# in both the training and the testing set
shuffle(elements)

data_youngs_modulus = []
data_lattice_constant = []
data_melting_point = []
data_specific_heat = []
data_atomic_mass = []
data_CTE = []

for item in elements:
    data_youngs_modulus.append(pymat.Element(item).youngs_modulus)
    data_lattice_constant.append(mendel.element(item).lattice_constant)
    data_melting_point.append(mendel.element(item).melting_point)
    data_specific_heat.append(mendel.element(item).specific_heat)
    data_atomic_mass.append(pymat.Element(item).atomic_mass)
    data_CTE.append(pymat.Element(item).coefficient_of_linear_thermal_expansion)

# You can see how these lists look by printing them.
print(data_youngs_modulus)
#print(data_lattice_constant)
#print(data_melting_point)
#print(data_specific_heat)
#print(data_atomic_mass)
#print(data_CTE)
2. Processing and Organizing Data

Simple linear regression is one of the easiest models to implement, and it lets you see how one property is related to an explanatory variable. As with all machine learning models, we will use a training set and a testing set.

SETS

Each of the lists we just created contains the values of one property for all the elements. Analyzing the code, you can infer that a given index in these arrays corresponds to a specific element. For example, if element "Ag" ended up at index 2 after shuffling, its properties would be located at:

data_youngs_modulus[2]
data_lattice_constant[2]
data_melting_point[2]
data_specific_heat[2]
data_atomic_mass[2]
data_CTE[2]
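As a quick check, here is a minimal sketch (assuming the lists from the previous cell are in memory) that looks up one element's position after shuffling and prints all of its stored properties:

# Find the index of "Ag" in the shuffled list and print its properties
idx = elements.index("Ag")
print(elements[idx],
      data_youngs_modulus[idx], data_lattice_constant[idx],
      data_melting_point[idx], data_specific_heat[idx],
      data_atomic_mass[idx], data_CTE[idx])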
TRAINING AND TESTING SETS

With all this data, we will select 45 elements to form the training set and 6 elements to form the testing set.

The training group will be used to train or develop the model; the second group will be used for testing. This is done to avoid, or at least check for, overfitting. While overfitting is not likely in a linear model, it can happen in more complex models, like neural networks.

NORMALIZATION

Machine learning models usually require data to be processed and normalized. In this linear model we will process the data into the format required by the sklearn model, but we will not normalize it, as we are interested in visualizing trends in our raw data rather than in precise predictions.

# Here we will organize the data in a format sklearn accepts

# We will develop linear models that relate Young's modulus and melting temperature,
# CTE and melting temperature, and lattice constant and melting temperature

# These are the sets for Melting Point

melt_train = data_melting_point[:45]  # This takes the first 45 entries to be the Training Set
melt_test = data_melting_point[-6:]   # This takes the last 6 entries to be the Testing Set

# The reshape function in the next two lines turns each horizontal list [x, y, z] into a
# vertical NumPy array [[x]
#                       [y]
#                       [z]]
# This step is required to work with the sklearn linear model

melt_train = np.array(melt_train).reshape(-1, 1)
melt_test = np.array(melt_test).reshape(-1, 1)

# Each dataset is divided into training and testing data

# These are the sets for Young's Modulus

young_train = data_youngs_modulus[:45]
young_test = data_youngs_modulus[-6:]
young_train = np.array(young_train).reshape(-1, 1)
young_test = np.array(young_test).reshape(-1, 1)

# These are the sets for Lattice Constants

lattice_train = data_lattice_constant[:45]
lattice_test = data_lattice_constant[-6:]
lattice_train = np.array(lattice_train).reshape(-1, 1)
lattice_test = np.array(lattice_test).reshape(-1, 1)

# These are the sets for Specific Heat

specheat_train = data_specific_heat[:45]
specheat_test = data_specific_heat[-6:]
specheat_train = np.array(specheat_train).reshape(-1, 1)
specheat_test = np.array(specheat_test).reshape(-1, 1)

# These are the sets for Atomic Mass

mass_train = data_atomic_mass[:45]
mass_test = data_atomic_mass[-6:]
mass_train = np.array(mass_train).reshape(-1, 1)
mass_test = np.array(mass_test).reshape(-1, 1)

# These are the sets for CTE

coefTE_train = data_CTE[:45]
coefTE_test = data_CTE[-6:]
coefTE_train = np.array(coefTE_train).reshape(-1, 1)
coefTE_test = np.array(coefTE_test).reshape(-1, 1)
3. Function to define the model, train it and make predictions

Sklearn uses an "ordinary least squares" linear regression for its linear model; the implementation is described here. We will run it with all default values.

The regression function in the next cell creates a model. It does so by declaring a linear model and then fitting it to the training set. We then make predictions by feeding the model the x-axis testing set.

We can then use the trained model (represented as an equation) to look at the variance score and understand how well our model does.

# This function defines a model, trains it, and uses it to predict
# It also outputs the linear model and information about its accuracy

def regression(x_train, x_test, y_train, y_test):

    # Define the model and train it
    model = linear_model.LinearRegression()
    model.fit(x_train, y_train)

    # Join train + test data
    full_x = np.concatenate((x_train, x_test), axis=0)
    full_y = np.concatenate((y_train, y_test), axis=0)

    # Use the model to predict the entire set of data
    predictions = model.predict(full_x)  # Make predictions for all values

    # Print the model, the mean squared error and the variance score
    print("Linear Equation: %.4e X + (%.4e)" % (model.coef_, model.intercept_))
    print("Mean squared error: %.4e" % mean_squared_error(full_y, predictions))
    print("Variance score: %.4f" % r2_score(full_y, predictions))

    return predictions

Exercise 1. Research and provide definitions of the mean squared error and the variance score in terms of the linear model and data. You can find these definitions here and here.
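For reference, the standard definitions, with $y_i$ the true values, $\hat{y}_i$ the model predictions and $\bar{y}$ the mean of the true values, are:

$$\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n} (y_i - \hat{y}_i)^2, \qquad R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}$$

These match what mean_squared_error and r2_score compute in Scikit-learn.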

4. Function for plotting data

To visualize the trends the model fitted to our data, we will plot the two properties against each other, using Plotly.
• The first cell defines a function that organizes the data and generates the plot
• The following cells train the models and plot the results

import plotly                     # This is the library import
import plotly.graph_objs as go    # This is the graphical object (think "plt" in Matplotlib if you have used that before)
from plotly.offline import iplot  # These lines are necessary to run Plotly in Jupyter Notebooks, but not in a dedicated environment

plotly.offline.init_notebook_mode(connected=True)

def plot(x_train, x_test, y_train, y_test, x_label, y_label, predictions):

    # The reshape functions in the next lines turn each vertical NumPy array [[x]
    #                                                                         [y]
    #                                                                         [z]]
    # back into a Python list [x, y, z]
    # This step is required to create plots with Plotly like we did in the previous tutorial

    x_train = x_train.reshape(1, -1).tolist()[0]
    x_test = x_test.reshape(1, -1).tolist()[0]
    y_train = y_train.reshape(1, -1).tolist()[0]
    y_test = y_test.reshape(1, -1).tolist()[0]
    predictions = predictions.reshape(1, -1).tolist()[0]
    full_x_list = x_train + x_test

    # Now we get back to what we know. Remember: to plot in Plotly, we need a layout and at least one trace

    layout0 = go.Layout(hovermode='closest',   # Hovermode establishes the way the hover labels are arranged
                        width=800, height=600, showlegend=True,
                        xaxis=dict(title=go.layout.xaxis.Title(text=x_label, font=dict(size=24)),
                                   zeroline=False, gridwidth=1, tickfont=dict(size=18)),  # Axis title. Removing the x-axis mark. Adding a grid
                        yaxis=dict(title=go.layout.yaxis.Title(text=y_label, font=dict(size=24)),
                                   zeroline=False, gridwidth=1, tickfont=dict(size=18)),  # Axis title. Removing the y-axis mark. Adding a grid
                        legend=dict(font=dict(size=24)))  # Adding a legend

    training = go.Scatter(x=x_train, y=y_train, mode='markers',
                          marker=dict(size=10, color='green'), name="Training Data")
    # This trace contains the values for the data in the training set

    actual = go.Scatter(x=x_test, y=y_test, mode='markers',
                        marker=dict(size=10, color='red'), name="Testing Data")
    # This trace contains the values for the data in the testing set

    prediction = go.Scatter(x=full_x_list, y=predictions, mode='lines',
                            line=dict(color="blue", width=1.5), name="Model")
    # This trace is the line the model fitted to the data

    data = [training, actual, prediction]

    fig = go.Figure(data, layout=layout0)
    iplot(fig)
5. Train models and plot results

predictions = regression(melt_train, melt_test, young_train, young_test)
# This line calls the regression model implemented in the function

plot(melt_train, melt_test, young_train, young_test,
     "Melting Temperature (K)", "Young's Modulus (GPa)", predictions)
# This line plots the results from that model

predictions = regression(melt_train, melt_test, lattice_train, lattice_test)

plot(melt_train, melt_test, lattice_train, lattice_test,
     "Melting Temperature (K)", "Lattice Constant (Å)", predictions)

predictions = regression(melt_train, melt_test, coefTE_train, coefTE_test)

plot(melt_train, melt_test, coefTE_train, coefTE_test,
     "Melting Temperature (K)", "Coefficient of Linear Thermal Expansion (K<sup>-1</sup>)", predictions)

• Exercise 2. For the three cases above, what type of correlation do you find between the properties? Explain the origin of these approximate correlations.
• Exercise 3. Plot two other properties of your choice.
• Exercise 4. Note the mean squared error in the various models. What error would you make, on average, if you used the melting temperature to predict Young's modulus?
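A hint for Exercise 4: the mean squared error carries squared units, so the typical prediction error is its square root,

$$\mathrm{RMSE} = \sqrt{\mathrm{MSE}}$$

which is in GPa when predicting Young's modulus from the melting temperature.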
Using neural networks to predict and classify crystal structures of elements
Why? Neural networks are widely used for image classification, learning structures and substructures within the data to identify patterns. Neural-network-based classifiers can likewise identify patterns and correlations within materials data.

What? In this tutorial we will learn how to use neural networks to create a classification model that estimates the ground-state crystal structure of an element.
This is the third tutorial, following MSEML Query_Viz, and is based on the classification tutorial in the TensorFlow Tutorials.

How to use this? This tutorial uses Python; some familiarity with programming would be beneficial but is not required. Run each code cell in order by pressing "Shift + Enter". Feel free to modify the code, or change queries, to familiarize yourself with the workings of the code.

Suggested modifications and exercises are included in blue.

Outline:

1. Getting a dataset
2. Processing and Organizing Data
3. Creating the Model
4. Plotting

Get started: Click "Shift-Enter" on the code cells to run!

1. Getting a dataset

Datasets containing properties for the elements in the periodic table are available online; however, it is thematic to create our own, using the tools from the first tutorial, MSEML Query_Viz. In this section we will query both Pymatgen and Mendeleev to get a complete set of properties per element. We will use this data to create the cases on which the model will train and test.

In this first snippet of code we will import all relevant libraries, the elements that will be turned into cases and the properties that will serve as the attributes for those cases. We will get 47 entries, which is a small dataset, but it should give us a somewhat accurate prediction. It is important to note that more entries would move the prediction closer to the real value, and so would more attributes.

The elements listed were chosen because querying them for these properties yields a dataset with no unknown values, and because they represent the three most common crystallographic structures.
import tensorflow as tf
from tensorflow import keras
from keras import initializers
from keras.layers import Dense
from keras.models import Sequential

import pymatgen as pymat
import mendeleev as mendel
import pandas as pd
import numpy as np
import random

%matplotlib inline
import matplotlib.pyplot as plt
import sys

fcc_elements = ["Ag", "Al", "Au", "Cu", "Ir", "Ni", "Pb", "Pd", "Pt", "Rh", "Th", "Yb"]
bcc_elements = ["Ba", "Ca", "Cr", "Cs", "Eu", "Fe", "Li", "Mn", "Mo", "Na", "Nb", "Rb", "Ta", "V", "W"]
hcp_elements = ["Be", "Cd", "Co", "Dy", "Er", "Gd", "Hf", "Ho", "Lu", "Mg", "Re",
                "Ru", "Sc", "Tb", "Ti", "Tl", "Tm", "Y", "Zn", "Zr"]

elements = fcc_elements + bcc_elements + hcp_elements

random.Random(1).shuffle(elements)

querable_mendeleev = ["atomic_number", "atomic_volume", "boiling_point",
                      "en_ghosh", "evaporation_heat", "heat_of_formation",
                      "lattice_constant", "melting_point", "specific_heat"]
querable_pymatgen = ["atomic_mass", "atomic_radius", "electrical_resistivity",
                     "molar_volume", "bulk_modulus", "youngs_modulus",
                     "average_ionic_radius", "density_of_solid",
                     "coefficient_of_linear_thermal_expansion"]
querable_values = querable_mendeleev + querable_pymatgen
Using TensorFlow backend.

As before, we will use the database queries to populate lists which can be displayed by the
Pandas library in a user-friendly table with the properties as the column headers.

all_values = []  # Values for Attributes
all_labels = []  # Crystal structure labels (one-hot: index 0 = fcc, 1 = bcc, 2 = hcp)

for item in elements:
    element_values = []

    # This section queries Mendeleev
    element_object = mendel.element(item)
    for i in querable_mendeleev:
        element_values.append(getattr(element_object, i))

    # This section queries Pymatgen
    element_object = pymat.Element(item)
    for i in querable_pymatgen:
        element_values.append(getattr(element_object, i))

    all_values.append(element_values)  # All lists are appended to another list, creating a list of lists

    # The crystal structure labels are assigned here as one-hot vectors
    if (item in fcc_elements):
        all_labels.append([1, 0, 0])
    elif (item in bcc_elements):
        all_labels.append([0, 1, 0])
    elif (item in hcp_elements):
        all_labels.append([0, 0, 1])

# Pandas Dataframe
df = pd.DataFrame(all_values, columns=querable_values)

# We will patch some of the values that are not available in the datasets.

# Value for the CTE of Cesium
index_Cs = df.index[df['atomic_number'] == 55]
df.iloc[index_Cs, df.columns.get_loc("coefficient_of_linear_thermal_expansion")] = 0.000097
# Value from: David R. Lide (ed), CRC Handbook of Chemistry and Physics, 84th Edition. CRC Press. Boca Raton, Florida, 2003

# Value for the CTE of Rubidium
index_Rb = df.index[df['atomic_number'] == 37]
df.iloc[index_Rb, df.columns.get_loc("coefficient_of_linear_thermal_expansion")] = 0.000090
# Value from: https://www.azom.com/article.aspx?ArticleID=1834

# Value for the Evaporation Heat of Ruthenium
index_Ru = df.index[df['atomic_number'] == 44]
df.iloc[index_Ru, df.columns.get_loc("evaporation_heat")] = 595  # kJ/mol
# Value from: https://www.webelements.com/ruthenium/thermochemistry.html

# Value for the Bulk Modulus of Zirconium
index_Zr = df.index[df['atomic_number'] == 40]
df.iloc[index_Zr, df.columns.get_loc("bulk_modulus")] = 94  # GPa
# Value from: https://materialsproject.org/materials/mp-131/

df.head(n=10)
[Output: the first ten rows of the dataframe, one element per row, with the columns atomic_number, atomic_volume, boiling_point, en_ghosh, evaporation_heat, heat_of_formation, lattice_constant, melting_point, specific_heat, atomic_mass, atomic_radius, electrical_resistivity, molar_volume, bulk_modulus, youngs_modulus, average_ionic_radius, density_of_solid and coefficient_of_linear_thermal_expansion.]
2. Processing and Organizing Data

We again organize the data into training and testing sets and normalize it, as before.

SETS

We have 47 elements for which the crystal structure is known; we will use 40 of these as the training set and the remaining 7 as the testing set.

NORMALIZATION

We will again use the standard score normalization, which subtracts the mean of the feature and divides by its standard deviation:

$$z = \frac{X - \mu}{\sigma}$$

While our model might converge without feature normalization, the resulting model would be difficult to train and would depend on the choice of units used in the input.

# SETS

all_values = [list(df.iloc[x]) for x in range(len(all_values))]

# The list of lists is turned into a NumPy array to facilitate the calculations
# in the steps that follow (normalization).
all_values = np.array(all_values, dtype=float)
print("Shape of Values:", all_values.shape)
all_labels = np.array(all_labels, dtype=int)
print("Shape of Labels:", all_labels.shape)

# Training Set
train_values = all_values[:40, :]
train_labels = all_labels[:40, :]

# Testing Set
test_values = all_values[-7:, :]
test_labels = all_labels[-7:, :]

# NORMALIZATION

mean = np.nanmean(train_values, axis=0)  # mean
std = np.nanstd(train_values, axis=0)    # standard deviation

train_values = (train_values - mean) / std  # input scaling
test_values = (test_values - mean) / std    # input scaling

print(train_values[0])  # print a sample entry from the training set
#print(train_labels[0])
Shape of Values: (47, 18)
Shape of Labels: (47, 3)
[-0.80084167 -0.75983551 -0.00894162 -0.40732945 0.15599373 0.16654528
-1.09549525 0.09455406 0.02631292 -0.82400017 -0.80570946 -0.67799461
-0.75661221 0.70972845 0.6516648 -0.77257498 0.11409173 -0.3075323 ]
3. Creating the Model

For this classification, we will use a simple sequential neural network with one densely connected hidden layer. The optimizer will be RMSPropOptimizer (Root Mean Square Propagation).

To learn more about Root Mean Square Propagation, click here.

The key difference between the regression model and the classification model is the metric used to measure network performance. While we used the mean squared error (between the true outputs and the network's predicted outputs) for the regression task, here we use categorical crossentropy (click here to learn more about it), with classification accuracy as a metric, where higher accuracy implies a better network.
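As a rough illustration of this loss (a minimal NumPy sketch for intuition, not part of the tutorial's code), categorical crossentropy compares one-hot labels against predicted class probabilities:

import numpy as np

def categorical_crossentropy(y_true, y_pred, eps=1e-12):
    # y_true: one-hot labels; y_pred: predicted class probabilities
    y_pred = np.clip(y_pred, eps, 1.0)  # avoid log(0)
    return -np.mean(np.sum(y_true * np.log(y_pred), axis=1))

y_true = np.array([[1, 0, 0]])  # true class: FCC
print(categorical_crossentropy(y_true, np.array([[0.9, 0.05, 0.05]])))  # ~0.105: confident and correct, low loss
print(categorical_crossentropy(y_true, np.array([[0.2, 0.4, 0.4]])))    # ~1.609: spread out and wrong, high loss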

# DEFINITION OF THE MODEL

# The weights of our neural network are initialized randomly;
# using a seed allows for reproducibility
kernel_init = initializers.RandomNormal(seed=14)

model = Sequential()
model.add(Dense(16, activation='relu', input_shape=(train_values.shape[1],), kernel_initializer=kernel_init))
#model.add(Dense(16, activation='relu', kernel_initializer=kernel_init))
model.add(Dense(3, activation=tf.nn.softmax))  # Output Layer

# DEFINITION OF THE OPTIMIZER

optimizer = tf.train.RMSPropOptimizer(0.002)  # Root Mean Square Propagation

# This line matches the optimizer to the model and states which metrics
# will evaluate the model's accuracy
model.compile(loss='categorical_crossentropy', optimizer=optimizer, metrics=['accuracy'])
model.summary()
WARNING:tensorflow:From /apps/share64/debian7/anaconda/anaconda-
6/lib/python3.7/site-
packages/tensorflow/python/framework/op_def_library.py:263: colocate_with
(from tensorflow.python.framework.ops) is deprecated and will be removed in
a future version.
Instructions for updating:
Colocations handled automatically by placer.
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
dense_1 (Dense) (None, 16) 304
_________________________________________________________________
dense_2 (Dense) (None, 3) 51
=================================================================
Total params: 355
Trainable params: 355
Non-trainable params: 0
_________________________________________________________________
TRAINING

This model is trained for 300 epochs, and we record the training accuracy in the history object. By plotting "history" we can see the evolution of the "learning" of the model, that is, the increase of the classification accuracy. Models in Keras are fitted to the training set using the fit method.

One epoch occurs when the entire dataset has passed through the model once. One batch contains the subset of the dataset that is fed to the model at the same time. A more detailed explanation of these concepts can be found in this blog. As we have a really small dataset compared to the ones usually modeled with neural networks, we feed all entries at the same time, so our batch is the entire dataset, and an epoch occurs each time the batch is processed.

class PrintEpNum(keras.callbacks.Callback):  # This is a callback for the epoch counter
    def on_epoch_end(self, epoch, logs):
        sys.stdout.flush()
        sys.stdout.write("Current Epoch: " + str(epoch+1) + '\r')  # Updates the current epoch number

EPOCHS = 300  # Number of EPOCHS

# HISTORY object, which contains how the model learned
history = model.fit(train_values, train_labels,
                    batch_size=train_values.shape[0],
                    epochs=EPOCHS, validation_split=0.1, verbose=False,
                    callbacks=[PrintEpNum()])

# PLOTTING HISTORY USING MATPLOTLIB
plt.figure()
plt.xlabel('Epoch')
plt.ylabel('Accuracy')
plt.plot(history.epoch, np.array(history.history['acc']), label='Training Accuracy')
plt.plot(history.epoch, np.array(history.history['val_acc']), label='Validation Accuracy')
plt.legend()
plt.show()
WARNING:tensorflow:From /apps/share64/debian7/anaconda/anaconda-
6/lib/python3.7/site-packages/tensorflow/python/ops/math_ops.py:3066:
to_int32 (from tensorflow.python.ops.math_ops) is deprecated and will be
removed in a future version.
Instructions for updating:
Use tf.cast instead.
Current Epoch: 300

TESTING

Models in Keras are tested using the evaluate method, which returns the loss and the classification accuracy for the set it is given. Here we call it on both the training and the testing sets.
loss, acc = model.evaluate(train_values, train_labels, verbose=0)

print("Training Set Accuracy: %f" %(acc))

loss, acc = model.evaluate(test_values, test_labels, verbose=0)

print("Testing Set Accuracy: %f" %(acc))


Training Set Accuracy: 1.000000
Testing Set Accuracy: 0.857143
MAKING PREDICTIONS

The last step in a classification model is to make predictions for cases not in the training set, which is done with the predict method. In the following cell we print, for every element, the true crystal structure and the prediction generated by the machine learning model.

train_predictions = model.predict(train_values)
test_predictions = model.predict(test_values)

all_labels = np.vstack((train_labels, test_labels))
all_predictions = np.vstack((train_predictions, test_predictions))

predicted_labels = []
true_labels = []

for i in range(all_predictions.shape[0]):
    if (np.argmax(all_predictions[i]) == 0):
        predicted_labels.append("FCC")
    if (np.argmax(all_labels[i]) == 0):
        true_labels.append("FCC")
    if (np.argmax(all_predictions[i]) == 1):
        predicted_labels.append("BCC")
    if (np.argmax(all_labels[i]) == 1):
        true_labels.append("BCC")
    if (np.argmax(all_predictions[i]) == 2):
        predicted_labels.append("HCP")
    if (np.argmax(all_labels[i]) == 2):
        true_labels.append("HCP")

predicted_labels = np.array(predicted_labels).reshape((-1, 1))
true_labels = np.array(true_labels).reshape((-1, 1))
headings = ["Atomic number", "True crystal structure", "Predicted crystal structure"]

atomic_number_array = np.array(df.iloc[:, 0]).reshape((-1, 1))

plot_table = np.concatenate((atomic_number_array, true_labels, predicted_labels), axis=1)
plot_df = pd.DataFrame(plot_table, columns=headings)

plot_df
    Atomic number  True crystal structure  Predicted crystal structure
0   27             HCP                     HCP
1   69             HCP                     HCP
2   39             HCP                     HCP
3   75             HCP                     HCP
4   28             FCC                     FCC
5   67             HCP                     HCP
6   79             FCC                     FCC
7   21             HCP                     HCP
8   45             FCC                     FCC
9   74             BCC                     BCC
10  64             HCP                     HCP
11  65             HCP                     HCP
12  72             HCP                     HCP
13  70             FCC                     FCC
14  55             BCC                     BCC
15  30             HCP                     HCP
16  56             BCC                     BCC
17  25             BCC                     BCC
18  26             BCC                     BCC
19  42             BCC                     BCC
20  11             BCC                     BCC
21  71             HCP                     HCP
22  90             FCC                     FCC
23  29             FCC                     FCC
24  3              BCC                     BCC
25  81             HCP                     HCP
26  23             BCC                     BCC
27  37             BCC                     BCC
28  40             HCP                     HCP
29  24             BCC                     BCC
30  41             BCC                     BCC
31  47             FCC                     FCC
32  4              HCP                     HCP
33  44             HCP                     HCP
34  13             FCC                     FCC
35  22             HCP                     HCP
36  82             FCC                     FCC
37  20             BCC                     BCC
38  73             BCC                     BCC
39  66             HCP                     HCP
40  48             HCP                     HCP
41  68             HCP                     HCP
42  46             FCC                     FCC
43  63             BCC                     HCP
44  77             FCC                     FCC
45  12             HCP                     HCP
46  78             FCC                     FCC

crystal_structures = ["FCC", "BCC", "HCP"]

FCC_prediction = []
BCC_prediction = []
HCP_prediction = []

for item in range(len(all_predictions)):
    FCC_prediction.append(all_predictions[item].tolist()[0])
    BCC_prediction.append(all_predictions[item].tolist()[1])
    HCP_prediction.append(all_predictions[item].tolist()[2])

# --------------------------------------------------------------
# This block can be used to sort the elements by their atomic number

atomic_number = list(df.iloc[:, 0])  # From the Pandas dataset
order = np.argsort(atomic_number)    # Sorting indexes

# Sorting the lists by the indexes
# elements = [elements[x] for x in order]
# FCC_prediction = [FCC_prediction[x] for x in order]
# BCC_prediction = [BCC_prediction[x] for x in order]
# HCP_prediction = [HCP_prediction[x] for x in order]
# --------------------------------------------------------------

import plotly as py
import plotly.graph_objs as go
from plotly.subplots import make_subplots
from plotly.offline import iplot

py.offline.init_notebook_mode(connected=True)

fig = make_subplots(rows=3, cols=1, vertical_spacing=0.2)

# Row 1: predicted probabilities for the FCC elements (testing-set elements are marked with *)
fig.append_trace(go.Bar(x=[_ for _ in elements if _ in fcc_elements],
                        y=[FCC_prediction[_] for _ in range(len(FCC_prediction)) if elements[_] in fcc_elements],
                        name='FCC', marker=dict(color='green'), showlegend=False,
                        textposition='inside', textfont={"size": 24},
                        text=['*' if _ in elements[-7:] else None for _ in [_ for _ in elements if _ in fcc_elements]]),
                 row=1, col=1)
fig.append_trace(go.Bar(x=[_ for _ in elements if _ in fcc_elements],
                        y=[BCC_prediction[_] for _ in range(len(BCC_prediction)) if elements[_] in fcc_elements],
                        name='BCC', marker=dict(color='red'), showlegend=False), row=1, col=1)
fig.append_trace(go.Bar(x=[_ for _ in elements if _ in fcc_elements],
                        y=[HCP_prediction[_] for _ in range(len(HCP_prediction)) if elements[_] in fcc_elements],
                        name='HCP', marker=dict(color='red'), showlegend=False), row=1, col=1)

# Row 2: predicted probabilities for the BCC elements
fig.append_trace(go.Bar(x=[_ for _ in elements if _ in bcc_elements],
                        y=[FCC_prediction[_] for _ in range(len(FCC_prediction)) if elements[_] in bcc_elements],
                        name='FCC', marker=dict(color='red'), showlegend=False), row=2, col=1)
fig.append_trace(go.Bar(x=[_ for _ in elements if _ in bcc_elements],
                        y=[BCC_prediction[_] for _ in range(len(BCC_prediction)) if elements[_] in bcc_elements],
                        name='BCC', marker=dict(color='green'), showlegend=False,
                        textposition='outside', textfont={"size": 24},
                        text=['*' if _ in elements[-7:] else None for _ in [_ for _ in elements if _ in bcc_elements]]),
                 row=2, col=1)
fig.append_trace(go.Bar(x=[_ for _ in elements if _ in bcc_elements],
                        y=[HCP_prediction[_] for _ in range(len(HCP_prediction)) if elements[_] in bcc_elements],
                        name='HCP', marker=dict(color='red'), showlegend=False), row=2, col=1)

# Row 3: predicted probabilities for the HCP elements
fig.append_trace(go.Bar(x=[_ for _ in elements if _ in hcp_elements],
                        y=[FCC_prediction[_] for _ in range(len(FCC_prediction)) if elements[_] in hcp_elements],
                        name='FCC', marker=dict(color='red'), showlegend=False), row=3, col=1)
fig.append_trace(go.Bar(x=[_ for _ in elements if _ in hcp_elements],
                        y=[BCC_prediction[_] for _ in range(len(BCC_prediction)) if elements[_] in hcp_elements],
                        name='BCC', marker=dict(color='red'), showlegend=False), row=3, col=1)
fig.append_trace(go.Bar(x=[_ for _ in elements if _ in hcp_elements],
                        y=[HCP_prediction[_] for _ in range(len(HCP_prediction)) if elements[_] in hcp_elements],
                        name='HCP', marker=dict(color='green'), showlegend=False,
                        textposition='inside', textfont={"size": 24},
                        text=['*' if _ in elements[-7:] else None for _ in [_ for _ in elements if _ in hcp_elements]]),
                 row=3, col=1)

fig.update_xaxes(title=go.layout.xaxis.Title(text="FCC Elements", font=dict(size=18)), showgrid=True, tickfont=dict(size=18), row=1, col=1)
fig.update_xaxes(title=go.layout.xaxis.Title(text="BCC Elements", font=dict(size=18)), showgrid=True, tickfont=dict(size=18), row=2, col=1)
fig.update_xaxes(title=go.layout.xaxis.Title(text="HCP Elements", font=dict(size=18)), showgrid=True, tickfont=dict(size=18), row=3, col=1)

fig.update_yaxes(title=go.layout.yaxis.Title(text="Probability", font=dict(size=18)), showgrid=True, tickfont=dict(size=18), range=[0, 1.2], row=1, col=1)
fig.update_yaxes(title=go.layout.yaxis.Title(text="Probability", font=dict(size=18)), showgrid=True, tickfont=dict(size=18), range=[0, 1.2], row=2, col=1)
fig.update_yaxes(title=go.layout.yaxis.Title(text="Probability", font=dict(size=18)), showgrid=True, tickfont=dict(size=18), range=[0, 1.2], row=3, col=1)

fig.update_layout(height=700, width=1200, barmode='group', bargap=0.3)

fig.show()

Using neural networks to estimate Young's Modulus for elements
Why? Creating regression models from attributes can help scientists, engineers and students get an approximate value for unknown or unmeasured properties of materials.

What? In this tutorial we will learn how to use neural networks from the Keras library to create a regression model to estimate Young's modulus.

You can find another example of neural network regression using Keras in the TensorFlow Tutorials nanoHUB tool.

How to use this? This tutorial uses Python; some familiarity with programming would be beneficial but is not required. Run each code cell in order by pressing "Shift + Enter". Feel free to modify the code, or change queries, to familiarize yourself with the workings of the code.

Suggested modifications and exercises are included in blue.

Outline:

1. Getting data
2. Processing and Organizing Data
3. Creating the Model
4. Plotting

Get started: Click "Shift-Enter" on the code cells to run!

1. Getting a dataset

We will repeat the process of obtaining a dataset used in the previous tutorial; the explanations are repeated for convenience.

Datasets containing properties for the elements in the periodic table are available online; however, it is thematic to create our own, using the tools from the first tutorial, MSEML Query_Viz. In this section we will query both Pymatgen and Mendeleev to get a complete set of properties per element. We will use this data to create the cases on which the model will train and test.

In this first snippet of code we will import all relevant libraries, the elements that will be turned into cases and the properties that will serve as the attributes for those cases. We will get 49 entries, which is a small dataset, but it should give us a somewhat accurate prediction. We will also include some values to "patch" unknown values in the dataset. It is important to note that more entries would move the prediction closer to the real value, and so would more attributes.

The elements listed were chosen because querying them for these properties yields a dataset with few unknown values, and because they represent the three most common crystallographic structures.

import tensorflow as tf
import keras
from keras import initializers
from keras.layers import Dense
from keras.models import Sequential
from keras import optimizers

import pymatgen as pymat
import mendeleev as mendel
import pandas as pd
import numpy as np

import sys
import os
sys.path.insert(0, '../src/')

%matplotlib inline
import matplotlib.pyplot as plt

fcc_elements = ["Ag", "Al", "Au", "Cu", "Ir", "Ni", "Pb", "Pd", "Pt", "Rh", "Th", "Yb"]
bcc_elements = ["Ba", "Cr", "Cs", "Eu", "Fe", "Li", "Mn", "Mo", "Na", "Nb", "Rb", "Ta", "V", "W"]
hcp_elements = ["Be", "Ca", "Cd", "Co", "Dy", "Er", "Gd", "Hf", "Ho", "Lu", "Mg", "Re",
                "Ru", "Sc", "Tb", "Ti", "Tl", "Tm", "Y", "Zn", "Zr"]
others = ["Si", "Ge"]  # "Si" and "Ge" are Face-centered diamond-cubic

elements = fcc_elements + others + bcc_elements + hcp_elements

querable_mendeleev = ["atomic_number", "atomic_volume", "boiling_point",
                      "en_ghosh", "evaporation_heat", "heat_of_formation",
                      "lattice_constant", "specific_heat"]
querable_pymatgen = ["atomic_mass", "atomic_radius", "electrical_resistivity",
                     "molar_volume", "bulk_modulus", "youngs_modulus",
                     "average_ionic_radius", "density_of_solid",
                     "coefficient_of_linear_thermal_expansion"]
querable_values = querable_mendeleev + querable_pymatgen
Using TensorFlow backend.

After setting these values, we will proceed with our queries. Depending on the database (either Pymatgen or Mendeleev) where each property can be found, this code fills a list with the properties of each of the elements. To visualize the dataset we just created, we will use the Pandas library to display it. This library takes the list of lists and shows it in a nice, user-friendly table with the properties as the column headers.

all_values = []  # Values for Attributes
all_labels = []  # Values for Young's Modulus (property to be estimated)

for item in elements:
    element_values = []

    # This section queries Mendeleev
    element_object = mendel.element(item)
    for i in querable_mendeleev:
        element_values.append(getattr(element_object, i))

    # This section queries Pymatgen
    element_object = pymat.Element(item)
    for i in querable_pymatgen:
        element_values.append(getattr(element_object, i))

    all_values.append(element_values)  # All lists are appended to another list, creating a list of lists

# Pandas Dataframe
df = pd.DataFrame(all_values, columns=querable_values)

# We will patch some of the values that are not available in the datasets.

# Value for the CTE of Cesium
index_Cs = df.index[df['atomic_number'] == 55]
df.iloc[index_Cs, df.columns.get_loc("coefficient_of_linear_thermal_expansion")] = 0.000097
# Value from: David R. Lide (ed), CRC Handbook of Chemistry and Physics, 84th Edition. CRC Press. Boca Raton, Florida, 2003

# Value for the CTE of Rubidium
index_Rb = df.index[df['atomic_number'] == 37]
df.iloc[index_Rb, df.columns.get_loc("coefficient_of_linear_thermal_expansion")] = 0.000090
# Value from: https://www.azom.com/article.aspx?ArticleID=1834

# Value for the Evaporation Heat of Ruthenium
index_Ru = df.index[df['atomic_number'] == 44]
df.iloc[index_Ru, df.columns.get_loc("evaporation_heat")] = 595  # kJ/mol
# Value from: https://www.webelements.com/ruthenium/thermochemistry.html

# Value for the Bulk Modulus of Zirconium
index_Zr = df.index[df['atomic_number'] == 40]
df.iloc[index_Zr, df.columns.get_loc("bulk_modulus")] = 94  # GPa
# Value from: https://materialsproject.org/materials/mp-131/

# Value for the Bulk Modulus of Germanium
index_Ge = df.index[df['atomic_number'] == 32]
df.iloc[index_Ge, df.columns.get_loc("bulk_modulus")] = 77.2  # GPa
# Value from: https://www.crystran.co.uk/optical-materials/germanium-ge

# Value for the Young's Modulus of Germanium
index_Ge = df.index[df['atomic_number'] == 32]
df.iloc[index_Ge, df.columns.get_loc("youngs_modulus")] = 102.7  # GPa
# Value from: https://www.crystran.co.uk/optical-materials/germanium-ge

# The labels (values for Young's modulus) are stored separately for clarity (we drop the column later)

df.to_csv(os.path.expanduser('~/mseml_data.csv'), index=False, compression=None)  # This line saves the data we collected into a .csv file in your home directory

all_labels = df['youngs_modulus'].tolist()
df = df.drop(['youngs_modulus'], axis=1)

df.head(n=10)  # With this line you can see the first ten entries of our database

• Exercise 1. Use the Pandas dataframe created above to plot Young's modulus vs. melting temperature. Note that this recreates the plot from previous tutorials, but uses the Pandas framework to slice and access the data. One possible starting point is sketched below.
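As a hint, here is a minimal sketch of one possible approach. It assumes Mendeleev's melting_point attribute (in K), which is not among the properties queried above and so is fetched separately; at this point all_labels still holds the Young's moduli in the same order as elements.

# Sketch for Exercise 1 (one possible approach, not the only solution)
df_plot = df.copy()  # work on a copy so the original dataframe is untouched
df_plot['melting_point'] = [mendel.element(item).melting_point for item in elements]  # queried separately
df_plot['youngs_modulus'] = all_labels  # the labels we stored before dropping the column
df_plot.plot.scatter(x='melting_point', y='youngs_modulus')
plt.xlabel('Melting temperature (K)')
plt.ylabel("Young's modulus (GPa)")
plt.show()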

2. Processing and Organizing Data

Most machine learning models are trained on a subset of the available data, called the "training set", and tested on the remainder, called the "testing set". Model performance often improves when the inputs are normalized.

SETS

With the dataset we just created, we have 49 entries for our model. We will train with 44
cases and test on the remaining 5 elements to estimate Young's Modulus.
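For reference, the same kind of split can also be produced with scikit-learn (a sketch, assuming scikit-learn is installed; the tutorial below instead slices the arrays deterministically so that every run uses the same split):

# Sketch: a random 44/5 split of the patched dataframe and the stored labels
from sklearn.model_selection import train_test_split
# random_state pins the shuffle so the split is reproducible
X_train, X_test, y_train, y_test = train_test_split(df.values, all_labels, test_size=5, random_state=0)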

NORMALIZATION

Each of these input features has different units and is represented on scales with distinct orders of magnitude. Datasets with inputs like this need to be normalized, so that quantities with large values do not overwhelm the neural network and force it to tune its weights to account for the different scales of the input data. In this tutorial, we will use Standard Score Normalization, which subtracts the mean of the feature and divides by its standard deviation:

$z = \frac{X - \mu}{\sigma}$

While our model might converge without feature normalization, the resulting model would be more difficult to train and would depend on the choice of units used in the input.
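As a quick illustration of the formula on made-up numbers (a standalone sketch, separate from the tutorial's dataset):

# Standard score normalization on a toy array with two features on very different scales
import numpy as np
toy = np.array([[1.0, 200.0], [2.0, 400.0], [3.0, 600.0]])
toy_scaled = (toy - np.mean(toy, axis=0)) / np.std(toy, axis=0)
print(toy_scaled)  # each column now has mean 0 and standard deviation 1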

# We will rewrite the arrays with the patches we made on the dataset by turning the dataframe back into a list of lists
all_values = [list(df.iloc[x]) for x in range(len(all_values))]

# SETS

# The list of lists is turned into a Numpy array to facilitate the calculations in the steps that follow (normalization)
all_values = np.array(all_values, dtype=float)
print("Shape of Values:", all_values.shape)
all_labels = np.array(all_labels, dtype=float)
print("Shape of Labels:", all_labels.shape)

# Uncomment the line below to shuffle the dataset (we do not do this here to ensure consistent results for every run)
#order = np.argsort(np.random.random(all_labels.shape))  # This numpy argsort returns the indexes that would shuffle a list
order = np.arange(49)
all_values = all_values[order]
all_labels = all_labels[order]

# Training set
train_labels = all_labels[:44]
train_values = all_values[:44]

# Testing set
test_labels = all_labels[-5:]
test_values = all_values[-5:]

# These lines are used for labels in the plots at the end of the tutorial - testing set
labeled_elements = [elements[x] for x in order[-5:]]
elements = [elements[x] for x in order]

# NORMALIZATION

mean = np.mean(train_values, axis=0)  # mean
std = np.std(train_values, axis=0)  # standard deviation

train_values = (train_values - mean) / std  # input scaling
test_values = (test_values - mean) / std  # input scaling

print(train_values[0])  # print a sample entry from the training set
print(test_values[0])  # print a sample entry from the testing set
print(order)
3. Creating the Model

For this regression, we will use a simple sequential neural network with two densely connected hidden layers. The optimizer will be RMSprop (Root Mean Square Propagation).

To learn more about Root Mean Square Propagation, see the Keras documentation for the RMSprop optimizer.

A nice tool from TensorFlow for visualizing how a neural network learns, and for playing with its parameters, is the TensorFlow Playground (https://playground.tensorflow.org).

# DEFINITION OF THE MODEL

# The weights of our neural network are initialized randomly; using a seed allows for reproducibility
kernel_init = initializers.RandomNormal(seed=0)
bias_init = initializers.Zeros()

# In a sequential model, the first layer must specify the input shape the model will expect;
# in this case the value is train_values.shape[1], which is the number of attributes (properties)
# and equals 16 (the 17 queried properties minus the Young's modulus column we dropped).

model = Sequential()
model.add(Dense(32, activation='relu', input_shape=(train_values.shape[1],),
                kernel_initializer=kernel_init, bias_initializer=bias_init))
model.add(Dense(64, activation='relu', kernel_initializer=kernel_init, bias_initializer=bias_init))
#model.add(Dense(128, activation='relu', kernel_initializer=kernel_init, bias_initializer=bias_init))
model.add(Dense(1, kernel_initializer=kernel_init, bias_initializer=bias_init))

# DEFINITION OF THE OPTIMIZER

optimizer = optimizers.RMSprop(0.002)  # Root Mean Square Propagation, learning rate 0.002

# This line matches the optimizer to the model and states which metric will evaluate the model's accuracy
model.compile(loss='mae', optimizer=optimizer, metrics=['mae'])
model.summary()
TRAINING

This model is trained for 2000 epochs, and we record the training accuracy in the history object.

One epoch occurs when the entire dataset has passed through the model once. One batch is a subset of the dataset that is fed to the model at the same time. Since our dataset is very small compared to the ones these neural networks usually model, we feed all entries at once: our batch is the entire dataset, and an epoch occurs every time the batch is processed.

This way, by plotting "history" we can see the evolution of the model's "learning", that is, the decrease of the Mean Absolute Error. Models in Keras are fitted to the training set using the fit method.

The blue curve produced from the History object represents how the model is learning on the training data, and the orange curve represents the validation loss, which can be thought of as how the model evaluates data it was not trained on. This validation loss starts going up again once we begin to overfit the data.

# EPOCH REAL-TIME COUNTER CLASS

class PrintEpNum(keras.callbacks.Callback):  # This callback implements the epoch counter
    def on_epoch_end(self, epoch, logs):
        sys.stdout.flush()
        # Updates the current epoch number and training loss on a single line
        sys.stdout.write("Current Epoch: " + str(epoch+1) +
                         " Training Loss: " + "%4f" % logs.get('loss') + ' \r')

EPOCHS = 2000  # Number of epochs

# HISTORY object which records how the model learned
# Training values (properties), training labels (known Young's moduli)
history = model.fit(train_values, train_labels,
                    batch_size=train_values.shape[0],
                    epochs=EPOCHS, verbose=False, shuffle=False,
                    validation_split=0.1, callbacks=[PrintEpNum()])

# PLOTTING HISTORY USING MATPLOTLIB

plt.figure()
plt.xlabel('Epoch')
plt.ylabel('Mean Abs Error')
plt.plot(history.epoch, np.array(history.history['mean_absolute_error']),
         label='Loss on training set')
plt.plot(history.epoch, np.array(history.history['val_mean_absolute_error']),
         label='Validation loss')
plt.legend()
plt.show()
SAVING A MODEL

Compiled and trained models in Keras can be saved and distributed as .h5 files using the model.save() method. Running the cell below saves the model we just trained, both weights and architecture, to your home directory.
model.save(os.path.expanduser('~/model.h5'))
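To restore a saved model later, Keras provides load_model; a minimal sketch, assuming the file written by the cell above and the same Keras version:

from keras.models import load_model
# Reloads both the architecture and the trained weights from the .h5 file
restored_model = load_model(os.path.expanduser('~/model.h5'))
restored_model.summary()  # should match the summary of the model we trained above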
TESTING

Models in Keras are tested using the evaluate method. This method returns the testing loss of the model along with the metrics we specified when creating it, which in our case is the Mean Absolute Error. For the original model in this tutorial you should get a value of 29.59 GPa for the Mean Absolute Error. This value would decrease with more training data, more attributes/features, or a different optimizer. In the case of a model that overfits, you can expect the value to start increasing.

[loss, mae] = model.evaluate(test_values, test_labels, verbose=0)

print("Testing Set Mean Absolute Error: {:2.2f} GPa".format(mae))


MAKING PREDICTIONS

The last step in a regression model is to make predictions for values not in the training set, which are obtained with the predict method. In the following cell we print the elements in the testing set, the real values of their Young's moduli, and the predictions generated by our machine-learned model.

test_predictions = model.predict(test_values).flatten()

print("Elements in Test Set: ", labeled_elements)
print("Real Values", list(test_labels))
print("Predictions", list(test_predictions))

# This line joins the training and testing values together so we can evaluate all of them
values = np.concatenate((train_values, test_values), axis=0)
predictions = model.predict(values).flatten()
4. Plotting

The easiest way to see whether the model did a good job estimating the Young's modulus of the elements is a plot comparing the real values with their predictions. We will use Plotly to create such a plot; we covered plotting with Plotly in the first tutorial of this series. In this plot, the line x = y indicates a perfect match and is the desirable result for the points. As you analyze the plot, you can hover over the points to see the data we obtained in the cell above.
import plotly
import plotly.graph_objs as go
from plotly.offline import iplot

plotly.offline.init_notebook_mode(connected=True)

# Hovermode establishes the way the labels that appear when you hover are arranged
layout0 = go.Layout(title=go.layout.Title(text="Neural Network Model - Young's Modulus", font=dict(size=28)),
                    hovermode='closest', width=1000, height=600, showlegend=True,
                    # Axis titles; removing the zero line and adding a grid
                    xaxis=dict(title=go.layout.xaxis.Title(text='Real Values (GPa)', font=dict(size=24)),
                               zeroline=False, gridwidth=1, tickfont=dict(size=18)),
                    yaxis=dict(title=go.layout.yaxis.Title(text='Prediction (GPa)', font=dict(size=24)),
                               zeroline=False, gridwidth=1, tickfont=dict(size=18)),
                    legend=dict(font=dict(size=24)))  # Adding a legend

trace0 = go.Scatter(x=all_labels, y=predictions, mode='markers',
                    marker=dict(size=12, color='blue'), text=elements,
                    name="Young's Modulus (Training)")
trace1 = go.Scatter(x=test_labels, y=test_predictions, mode='markers',
                    marker=dict(size=12, color='red'), text=labeled_elements,
                    name="Young's Modulus (Testing)")
# This trace is the line x = y, which would indicate that the prediction equals the real value
trace2 = go.Scatter(x=[0, 600], y=[0, 600], mode='lines', name="Match")

data = [trace0, trace1, trace2]

fig = go.Figure(data, layout=layout0)
iplot(fig)

• Exercise 2. Compare these results for the Young's modulus with the ones you got from the Linear Regression Tutorial. How are they different? How can you explain this difference?

• Exercise 3 (Advanced). Uncomment the commented line in the model-definition cell to use three hidden layers in the neural network and monitor the training, as sketched below. Is this model better than the one with two hidden layers?
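For reference, a sketch of what the three-hidden-layer variant of Exercise 3 would look like (the same definition as above with the commented layer restored; it reuses kernel_init, bias_init, and the training arrays from the earlier cells):

# Sketch for Exercise 3: the model with the third hidden layer enabled
model3 = Sequential()
model3.add(Dense(32, activation='relu', input_shape=(train_values.shape[1],),
                 kernel_initializer=kernel_init, bias_initializer=bias_init))
model3.add(Dense(64, activation='relu', kernel_initializer=kernel_init, bias_initializer=bias_init))
model3.add(Dense(128, activation='relu', kernel_initializer=kernel_init, bias_initializer=bias_init))  # previously commented out
model3.add(Dense(1, kernel_initializer=kernel_init, bias_initializer=bias_init))
model3.compile(loss='mae', optimizer=optimizers.RMSprop(0.002), metrics=['mae'])
model3.summary()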
