Machine Learning For Materials Science

N. M. Anoop Krishnan
Hariprasad Kodamana
Ravinder Bhattoo

Machine Learning for Materials Discovery
Numerical Recipes and Practical Applications
Machine Intelligence for Materials Science
Series Editor
N. M. Anoop Krishnan, Department of Civil Engineering, Yardi School of Artificial
Intelligence (Joint Appt.), Indian Institute of Technology Delhi, New Delhi, India
This book series is dedicated to showcasing the latest research and developments at
the intersection of materials science and engineering, computational intelligence, and
data sciences. The series covers a wide range of topics that explore the application
of artificial intelligence (AI), machine learning (ML), deep learning (DL), reinforce-
ment learning (RL), and data science approaches to solve complex problems across
the materials research domain.
Topical areas covered in the series include but are not limited to:
• AI and ML for accelerated materials discovery, design, and optimization
• Materials informatics
• Materials genomics
• Data-driven multi-scale materials modeling and simulation
• Physics-informed machine learning for materials
• High-throughput materials synthesis and characterization
• Cognitive computing for materials research
The series also welcomes manuscript submissions exploring the application of AI,
ML, and data science techniques to the following areas:
• Materials processing optimization
• Materials degradation and failure
• Additive manufacturing and 3D printing
• Image analysis and signal processing
Each book in the series is written by experts in the field and provides a valuable
resource for understanding the current state of the field and the direction in which
it is headed. Books in this series are aimed at researchers, engineers, and academics
in the field of materials science and engineering, as well as anyone interested in the
impact of AI on the field.
N. M. Anoop Krishnan · Hariprasad Kodamana ·
Ravinder Bhattoo
Machine Learning
for Materials Discovery
Numerical Recipes and Practical Applications
N. M. Anoop Krishnan
Department of Civil Engineering, Yardi School of Artificial Intelligence (Joint Appt.)
Indian Institute of Technology Delhi
New Delhi, India

Hariprasad Kodamana
Department of Chemical Engineering, Yardi School of Artificial Intelligence (Joint Appt.)
Indian Institute of Technology Delhi
New Delhi, India

Ravinder Bhattoo
Indian Institute of Technology Delhi
New Delhi, India
This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of
the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation,
broadcasting, reproduction on microfilms or in any other physical way, and transmission or information
storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology
now known or hereafter developed.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication
does not imply, even in the absence of a specific statement, that such names are exempt from the relevant
protective laws and regulations and therefore free for general use.
The publisher, the authors, and the editors are safe to assume that the advice and information in this book
are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or
the editors give a warranty, expressed or implied, with respect to the material contained herein or for any
errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional
claims in published maps and institutional affiliations.
This Springer imprint is published by the registered company Springer Nature Switzerland AG
The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland
Foreword
strategies among a broad set of tools. I believe this book can give an impetus for the
adoption of machine learning in materials science curricula across many universities.
I hope you will enjoy reading this excellent book as much as I did.
Markus J. Buehler
Massachusetts Institute of Technology (MIT)
Cambridge, USA
The original version of the book has been revised. The ESM information for chapters 1, 2, 4, 5, 6 has been updated. A correction to this book can be found at https://doi.org/10.1007/978-3-031-44622-1_16
Preface
The last decade in materials science has seen a major wave of change due to the advent
of machine learning and artificial intelligence. While there have been significant
advances in machine learning for the materials domain, a vast majority of students, researchers, and professionals working on materials still lack access to the theoretical background of machine learning. This can be attributed partially to the intricate mathematical treatments commonly followed in many machine learning textbooks and to the use of general examples that lack relevance to materials science-related applications.
This textbook aims to bridge this gap by providing an overview of machine
learning in materials modeling and discovery. The textbook is well-suited for a diverse
audience, including undergraduates, graduates, and industry professionals. The book is structured to be foundational and can be used as a textbook covering both basic and advanced techniques while giving hands-on examples using Python code.
The book is structured into three parts. Part I gives an introduction to the evolu-
tion of machine learning in the materials domain. Part II focuses on building the
foundations of machine learning, with various tailor-made examples accompanied
by corresponding code implementations. In Part III, emphasis is given to several practical applications of machine learning in the materials domain.
Although several use cases from the literature are covered, the book also integrates
examples from the authors' research whenever possible. This deliberate choice is motivated by the accessibility of the data and first-hand details of the codes, which might not be readily available in the literature. We believe such a treatment provides comprehensive information about practical implementation while striking a balance with the theoretical exposition.
The field of machine learning is growing at an exponential pace, and it is impossible to cover all the state-of-the-art methods. This book is by no means exhaustive.
Rather, this book is an attempt to capture the essence of the basics of machine learning
and make the readers aware of the foundations so that they can either delve into the
deeper aspects of machine learning or focus on the applications to the materials
domain using existing approaches to solve an impactful problem in the domain.
We hope you enjoy the book and find it useful for your journey in materials
discovery.
Acknowledgements

There are a lot of people who have contributed both actively and passively to the
development of this book. First, we would like to thank our editor, Dr. Zachary
Evenson, for initiating the idea of the book and encouraging us to complete it. It is
indeed his motivation and support that resulted in the book. Thanks to Mohd Zaki
for helping with the images and suggestions on graphics. Thanks are also due to
Indrajeet Mandal, who painstakingly collected the copyrights for all the images used
in the work. Special thanks to the research scholars of the M3RG at IIT Delhi for
their comments, feedback, and proofreading that helped significantly improve the
book. The authors also thank the support from IIT Delhi and specifically the Yardi
School of Artificial Intelligence, Department of Civil Engineering, and Department
of Chemical Engineering. The continuous support of the authors' families in completing the book cannot be emphasized enough. Boundless thanks to them for supporting us through this endeavor, through thick and thin, COVID, and many other uncertainties and challenges, and for making this happen.
Contents
Part I Introduction
1 Introduction
  1.1 Materials Discovery
  1.2 Physics- and Data-Driven Modeling
  1.3 Introduction to Machine Learning
  1.4 Machine Learning for Materials Discovery
    1.4.1 Property Prediction
    1.4.2 Materials Discovery
    1.4.3 Image Processing
    1.4.4 Understanding the Physics
    1.4.5 Automated Knowledge Extraction
    1.4.6 Accelerating Materials Modeling
  1.5 Outline of the Book
  References

Index
Acronyms
RL Reinforcement learning
RNN Recurrent neural network
SARSA State-action-reward-state-action
SEM Scanning electron microscope
SMOTE Synthetic minority oversampling technique
SVM Support vector machine
SVR Support vector regression
t-SNE t-distributed stochastic neighbor embedding
XGBoost Extreme gradient boosting
Part I
Introduction
Chapter 1
Introduction
Abstract Materials form the basis of human civilization. With the advance of com-
putational algorithms, computational power, and cloud-based services, materials
innovation is accelerating at a pace never witnessed by humankind. In this chapter,
we briefly introduce materials discovery approaches using AI and ML that have enabled breakthroughs in our understanding of materials. We list some publicly available databases on materials and some of the applications where AI and ML have been used to design and discover novel materials. The chapter concludes with a brief outline of the book.
The progress of human civilization has been closely related to the discovery and usage
of new materials. Materials have shaped how we interact with the world, from the
stone to the silicon age. This is exemplified by the fact that the different ages of human
history have been named after the prominent materials used in those eras, from the stone age to the silicon age.
However, the importance of materials discovery was formally accepted only in the 1950s with the proposition of materials as a separate engineering domain. During World War II and the ensuing Cold War, countries realized that materials were the bottleneck in advancing military, space, and medical technologies. Thus, materials science emerged as the first discipline formed out of the fusion of and collaboration among multiple disciplines from basic sciences and engineering, focusing on understanding material response leading to materials discovery. While the early focus of materials science remained in metallurgy, it soon expanded to other domains such as ceramics, polymers, and later to composites, nano-materials, and bio-materials.

Fig. 1.1 Flow chart of traditional materials discovery based on what-if scenarios. Intuition and expert knowledge are used to cleverly pose the what-if questions that can potentially lead to the discovery of novel materials
The earlier approaches for materials discovery relied on trial-and-error approaches
driven either by physics or strong intuition developed through years of experience.
In such cases, the idea of what-if scenarios was used for discovering materials with
tailored properties, as shown in Fig. 1.1. This approach would start from a “what-if”
question on one or more aspects of the materials tetrahedron: processing, structure, property, and performance. A set of candidate solutions would be proposed
based on the available knowledge and intuitions. These solutions would be tested
using experimental synthesis and characterizations. If a candidate solution meets the
expected performance, the new material is manufactured, verified, validated, certified, and deployed in the industry. If none of the candidate solutions meets
the expected performance, the iteration is continued until a desired candidate is
discovered. As a trivial example, consider the following. Carbon can improve the
hardness and strength of steel–what if we increase the carbon content of steel? Experi-
mental studies reveal that the increase in carbon content improves steel’s hardness and
strength. However, higher carbon content makes steel brittle and less weldable! Thus,
although the new candidate meets the expected performance in terms of strength, it
induces some undesirable side-effects on other properties. Hence, the candidate may
not be accepted. Thus, the what-if scenarios required detailed and time-consuming
experimental characterization and analysis of new materials, significantly increas-
ing the cost and time required for materials discovery. In these cases, the typical timescale associated with the discovery of a new material was 20–30 years from the
initial research to its first use.
The invention of computers and in-silico approaches came as a breakthrough in
materials discovery in the second half of the twentieth century. Monte Carlo (MC)
algorithms and molecular dynamics (MD) simulations, both proposed in the 1950s,
became valuable tools for understanding materials response under different scenar-
ios. These approaches reduced the number of actual experiments to be carried out,
accelerating materials discovery. At the same time, slowly but steadily, researchers
also started realizing the importance of compiling and documenting the materials data generated by experiments and simulations. The first attempts to this end were the Cambridge Structural Database (CSD) and the Calculation of Phase Diagrams (CALPHAD) framework, both around the 1970s. These databases enabled the development of the quantitative structure–property relationship (QSPR) approach in materials.
approaches primarily relied on correlations and simple linear or polynomial regres-
sions that allowed the discovery of patterns from the available data, which ultimately
provided insights into materials response.
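As a minimal illustration of the QSPR idea, consider fitting a polynomial structure–property relationship to composition–property data. The numbers below are synthetic and purely illustrative, not taken from any of the databases mentioned here.

```python
import numpy as np

# Hypothetical data: carbon content of a steel (wt%) vs. hardness (HV).
# A real QSPR study would draw on curated experimental databases.
carbon = np.array([0.1, 0.2, 0.4, 0.6, 0.8, 1.0])
hardness = np.array([120.0, 150.0, 200.0, 240.0, 270.0, 290.0])

# Fit a simple quadratic structure-property relationship.
qspr = np.poly1d(np.polyfit(carbon, hardness, deg=2))

# Use the fitted relationship to estimate an untested composition.
print(round(float(qspr(0.5)), 1))
```

Simple correlations of this kind are exactly what early QSPR work relied on before more flexible ML models became practical.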
List of Publicly Available Materials Databases
1. CSD: Cambridge Structural Database
2. CALPHAD
3. Granta Design
4. Pauling File
5. ICSD: Inorganic Crystal Structure Database
6. ESP: Electronic Structure Project
7. AFLOW: Automatic-Flow for Materials Discovery
8. MatNavi
9. AIST: National Institute of Advanced Industrial Science and Technology
Databases
10. COD: Crystallography Open Database
11. MatDL: Materials Digital Library
12. The Materials Project
13. CMR: Computational Materials Repository
14. SpringerMaterials
15. OpenKIM
Fig. 1.2 Material database timeline and geographical region of origin. Reprinted with permission
from [1]
Models are simplified replicas of real-world scenarios with attention to the features
or phenomena of interest. For example, a ball-and-stick model of atoms aims to
show the relative atomic positions for a given lattice, while completely ignoring
the dynamics, electronic structure, and other details of an atomic system. Figure 1.3
shows the ball-and-stick model for benzene, with the chemical formula C₆H₆. Note that the black balls represent the carbon atoms while the white ones represent hydrogen atoms. Further, the alternating single and double bonds are represented beautifully
by single and double sticks connecting the carbon atoms. Such models can be very
useful for giving a quick understanding of complex molecular structures and are
hence used commonly for teaching purposes.
While a ball-and-stick model is a physical model, phenomena are typically
expressed through mathematical models. Traditional models in materials and engi-
neering disciplines have relied on mathematical equations derived based on physical
theories or laws. This approach has been widely accepted for centuries and has stood
the test of time. Some of the widely used mathematical models in materials science
include laws of thermodynamics, Fick’s laws, Avrami equation, Arrhenius equa-
tion, Gibbs–Thomson equation, Bragg’s law, and Hooke’s law. Thus, the physical
models are derived based on existing theories and can be explained using reason-
ing to understand the phenomenon. However, the physical models have traditionally
been limited to simple systems. The extremely complex and non-linear behavior of advanced materials has remained elusive to physical models as well as in-silico models. Understanding the response of these materials requires high-fidelity, high-throughput experiments and simulations, which are highly prohibitive in terms of cost and manpower.
An alternate approach that has emerged recently is the data-driven approach. Here,
the data is used to first identify the model and then fit the parameters of the model.
Data-driven models are not based on physical theories and hence are occasionally
termed "black-box" models. It is interesting to note that although data-driven models such as machine learning were first proposed at the same time as MC and MD simulations in the 1950s, they have found widespread applications in materials engineering only over the past two decades. The reluctance to accept data-driven models, despite their fast, accurate, and efficient ability to learn patterns from data, could be attributed to their black-box nature. In other words, data-driven models cannot be explained using known physics; they can only be tested for unknown
scenarios. However, the advances in machine learning coupled with the availability
of large-scale data on materials have shown the potential of data-driven approaches
for materials discovery. In addition, the development of explainable machine learning algorithms has allowed domain experts to interpret black-box models. This allows the interpretation of the features "learned" by the model, thereby giving insights into the inner workings
of the models. Overall, data-driven approaches have shown significant potential to
accelerate materials discovery and reduce the discovery-to-deployment period from 20 years to 10 years or even less.
Machine learning (ML) refers to the branch of study that focuses on developing algorithms that "learn" the hidden patterns in data. In contrast to physics-based models, ML uses the data for both model development and model training. Further, it improves the model in a recursive fashion using a predictor-corrector approach without being explicitly programmed for the specific task. As such, large amounts of data are required for ML models to learn the patterns reasonably well: the more the data, the better the ML model. ML has already been widely used in our day-to-day
life for several applications such as face recognition, email spam detection, personal
assistants, automated chat-bots, and fraud detection. To achieve these tasks, ML uses
different classes of algorithms as detailed below.
Algorithms in ML can be broadly classified into supervised, unsupervised, and reinforcement learning. Supervised learning refers to algorithms that learn a function mapping a set of inputs to outputs. Examples of this approach include predicting the Young's modulus or density of an alloy based on its composition and processing, or classifying a set of materials into conductors and insulators. Note that in the first task the output, Young's modulus, can take continuous values as a function of the composition and processing; this is known as regression. In the second task, the output can only be conductor or insulator, making it a classification task. Classification problems can be multi-class as well, having more than two classes, for example, conductor, insulator, superconductor, and semiconductor. The crucial aspect of supervised learning is the availability of a labeled dataset on which the model can be trained. The accuracy of the model depends highly on the accuracy of the dataset, among other factors. Some commonly used supervised models are linear and polynomial regression, logistic regression, decision trees, random forest (RF), XGBoost, support vector regression (SVR), neural networks (NN), and Gaussian process regression (GPR).
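A bare-bones regression example of the kind described above can be written with NumPy alone. The composition–modulus data here are generated synthetically, so the numbers and the assumed linear form are illustrative only.

```python
import numpy as np

# Synthetic supervised-learning dataset: two composition features per sample
# (e.g., mole fractions of two components) and a continuous label, the
# Young's modulus in GPa. The "true" coefficients are an assumption.
rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.0, size=(50, 2))             # features
w_true, b_true = np.array([60.0, 30.0]), 10.0
y = X @ w_true + b_true + rng.normal(0.0, 1.0, 50)  # noisy labels

# Ordinary least squares: append a ones column for the intercept and solve.
A = np.hstack([X, np.ones((len(X), 1))])
w_fit, *_ = np.linalg.lstsq(A, y, rcond=None)

print(w_fit.round(1))  # close to [60, 30, 10]
```

The same train-then-predict structure carries over to the more powerful supervised models listed above; only the model family changes.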
In unsupervised learning, the algorithm tries to find patterns in the features of the data. In this case, no labeled training set is used. Some of the main approaches in unsupervised learning include clustering and anomaly detection. Clustering refers to the automated grouping of materials based on their similarity with respect to the features provided. Clustering may be used to remove outliers in the data or to identify subgroups in the data. Some unsupervised models include k-means, DBSCAN, OPTICS (inspired by DBSCAN), t-SNE, and principal component analysis (PCA).
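A minimal k-means loop, written from scratch in NumPy, shows the clustering idea on synthetic two-dimensional feature vectors (one could think of them as, say, scaled density and hardness); the data and the two-cluster structure are assumptions for illustration.

```python
import numpy as np

# Two synthetic groups of "materials" in a 2D feature space.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal([0.0, 0.0], 0.3, size=(30, 2)),
               rng.normal([3.0, 3.0], 0.3, size=(30, 2))])

# Minimal k-means: assign points to the nearest center, then recompute
# centers as cluster means; repeat until (approximate) convergence.
centers = X[[0, 30]].astype(float)  # one seed point from each region
for _ in range(20):
    labels = np.argmin(np.linalg.norm(X[:, None, :] - centers, axis=2), axis=1)
    centers = np.array([X[labels == i].mean(axis=0) for i in range(2)])

print(centers.round(1))  # one center near (0, 0), the other near (3, 3)
```

No labels are used anywhere in the loop; the grouping emerges purely from the features, which is the defining trait of unsupervised learning.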
Fig. 1.4 Predicted values of a density, b Young's modulus, c Vicker's hardness, and d shear modulus of oxide glasses with respect to the experimental values. The R² values of training, validation, and test are shown. The inset shows the histogram of error in the prediction along with the 95% confidence interval
respect to the experimental values for all the properties. In addition, the 95% confi-
dence interval of the error histogram shown in the inset confirms that the predictions
indeed exhibit a very low error in comparison to the range of values considered.
Similar approaches have been widely used for the prediction of properties of several
materials including ceramics, metal alloys, metallic glasses, 2D materials, polymers,
and even proteins.
While property prediction allows one to explore the properties of hitherto unknown compositions, it does not necessarily provide a direct recipe for new materials. Materials discovery is a more challenging problem, having constraints on multiple properties and components. For instance, a desired alloy for automotive applications should be
light-weight, hard, strong, tough, ductile, and easily weldable. Many of these prop-
erties are conflicting. Effectively, this problem translates to solving the inverse of
property prediction. Here, we need to predict the candidate composition and process-
ing parameters corresponding to a target property. To this end, surrogate-model-based optimization approaches can be used, wherein the surrogate model is developed using supervised ML. Once the surrogate model is developed for composition–property relationships, metaheuristic algorithms such as ant colony optimization, particle swarm optimization, and genetic algorithms, or Bayesian optimization, can be used to identify the family of compositions that satisfies the compositional and property constraints. The list of predicted compositions can be tested experimentally for
validation. This approach significantly reduces the total number of experiments to
be carried out, thereby accelerating materials discovery significantly.
Fig. 1.5 Optimization flow chart for the discovery of glass compositions with Young's modulus greater than 30 GPa, liquidus temperature less than 1500 K, and SiO₂ content between 70 and 90 mol%. This approach has been implemented for glass discovery in PyGGi Zen
Figure 1.5 shows the flow chart for the discovery of glass compositions with Young's modulus greater than 30 GPa, liquidus temperature less than 1500 K, and SiO₂ content between 70 and 90 mol%. Here, the objective is to discover glasses with Young's modulus greater than 30 GPa. The constraints are applied on both property (liquidus < 1500 K) and composition (70 mol% < SiO₂ < 90 mol%). Additional
constraints on other properties or components can also be applied. Now, the model
for composition–property relationships is obtained using ML. Optimization algorithms such as gradient descent, particle swarm, ant colony, and genetic algorithms are applied to this model to discover new glass compositions satisfying both the objectives and the constraints. Finally, the product is experimentally validated. Similar approaches have been employed in several materials discovery packages such as
The Materials Project or PyGGi Zen.
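The loop in Fig. 1.5 can be caricatured in a few lines: fit a surrogate composition–property model, then search only those compositions satisfying the constraints. The SiO₂–modulus data below are invented, and a simple grid search stands in for the metaheuristics named above.

```python
import numpy as np

# Invented training data: SiO2 mole fraction vs. Young's modulus (GPa).
sio2 = np.array([0.60, 0.65, 0.70, 0.75, 0.80, 0.85, 0.90])
modulus = np.array([55.0, 50.0, 46.0, 42.0, 38.0, 33.0, 28.0])

# Step 1: surrogate composition-property model (here a quadratic fit).
surrogate = np.poly1d(np.polyfit(sio2, modulus, deg=2))

# Step 2: search the constrained composition window (0.70 <= SiO2 <= 0.90)
# for candidates meeting the property objective (modulus > 30 GPa).
grid = np.linspace(0.70, 0.90, 201)
feasible = grid[surrogate(grid) > 30.0]

print(round(feasible.min(), 3), round(feasible.max(), 3))
```

In practice the surrogate would be a trained ML model and the grid search would be replaced by Bayesian optimization or a metaheuristic, but the structure of the loop is the same.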
Images hold key information about materials and form a crucial part of the materials literature [13, 17–19]. For instance, the scanning electron microscope (SEM) image
of a microstructure of a material provides detailed information about the grain struc-
ture, orientation, phase, and texture of the material, which in turn contributes to
its mechanical properties [20–22]. Quantifying this information requires a domain expert and is an extremely time-intensive task. ML has been successfully used to address this challenge: it has proved able to automatically capture grain-level information from SEM images. In addition, SEM images can be used to predict the properties of materials. These models can then be used to obtain materials
with tailored microstructures having desired properties as well.
Figure 1.6 shows the prediction of crystal structures, that is, the Bravais lattice
and space group, based on the electron backscatter diffraction (EBSD) patterns [23].
EBSD patterns are directly given as an input to a CNN, which consists of alternating
convolution and pooling layers. The goal of the convolution layer is to extract the
features from the images to form feature maps, which are then downsampled by the
pooling layers. Finally, a feedforward NN is placed at the last layer to perform a
classification task, which takes the learned and downsampled features as inputs and
predicts the crystal structure. Thus, the ML model allows the prediction of crystal
structure directly from the EBSD images.
Fig. 1.6 Applying CNN to predict the Bravais lattice or space group of a crystal structure from the
electron backscatter diffraction patterns. Reprinted from [23] with permission
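The convolution-then-pooling stage described above can be demystified with a NumPy-only forward pass on a toy 6×6 "image"; the filter weights are arbitrary stand-ins for what a trained CNN would learn.

```python
import numpy as np

image = np.arange(36, dtype=float).reshape(6, 6)   # toy 6x6 input pattern
kernel = np.array([[1.0, 0.0],
                   [0.0, -1.0]])                   # arbitrary 2x2 filter

# Convolution layer: slide the filter over the image to build a feature map.
h = image.shape[0] - kernel.shape[0] + 1
w = image.shape[1] - kernel.shape[1] + 1
feature_map = np.array([[np.sum(image[i:i + 2, j:j + 2] * kernel)
                         for j in range(w)] for i in range(h)])

# Pooling layer: 2x2 max pooling (stride 2) downsamples the feature map.
cropped = feature_map[:4, :4]                      # even size for pooling
pooled = cropped.reshape(2, 2, 2, 2).max(axis=(1, 3))

print(feature_map.shape, pooled.shape)  # (5, 5) (2, 2)
```

In a real EBSD classifier, several such stages would be stacked and followed by a feedforward layer that maps the pooled features to Bravais-lattice or space-group labels.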
century [30]. Recently, ML has shown the potential to provide insights into the dynamics governing glass transition. Specifically, ML methods have been able to predict the structural control of glass dynamics, which allowed extrapolation of glass relaxation behavior to large timescales. Similarly, ML has also been used to understand the preferred direction of crack propagation (see Fig. 1.7). These studies show that ML has the potential to provide deep insights into the physics of material behavior, which may hold the key to solving some of the open problems in materials science.
While there have been several databases on material properties, most of the information on materials lies buried as unstructured text in the literature. ML allows automated extraction of this knowledge from the literature, which can then be used to predict new materials. This approach has been demonstrated to discover novel thermoelectric materials. It also allows one to uncover correlations between materials that were otherwise not obvious even to domain experts. Altogether, ML can
aid the extraction of knowledge from text, which can further be used for knowledge
dissemination in an accelerated fashion.
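A toy sketch of literature-based extraction: count which material names co-occur with a property keyword across abstracts. The three "abstracts" are invented one-liners, and real pipelines use named entity recognition and word embeddings rather than raw co-occurrence counts.

```python
from collections import Counter
from itertools import combinations

# Invented mini-corpus standing in for mined abstracts.
abstracts = [
    "Bi2Te3 is a promising thermoelectric material with low thermal conductivity",
    "PbTe shows a high thermoelectric figure of merit",
    "SiO2 glass exhibits high optical transparency",
]

# Count within-abstract word-pair co-occurrences.
pairs = Counter()
for text in abstracts:
    words = sorted(set(text.lower().split()))
    pairs.update(combinations(words, 2))

# Words that co-occur with "thermoelectric" become candidate terms.
hits = {a if b == "thermoelectric" else b
        for (a, b) in pairs if "thermoelectric" in (a, b)}
print("bi2te3" in hits, "pbte" in hits, "sio2" in hits)  # True True False
```

Even this crude statistic links the thermoelectric compounds to the property term while leaving the silica glass out, hinting at how large-scale text mining can surface non-obvious material–property associations.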
Fig. 1.7 Prediction of fracture path based on the atomistic simulation trajectory, trained using an LSTM and CNN. Reprinted from [31] with permission
Fig. 1.8 Workflow for named entity recognition. The key steps are as follows: 1 documents are collected and added to our corpus, 2 the text is preprocessed (tokenized and cleaned), 3 for training data, a small subset of documents are labeled (SPL = symmetry/phase label, MAT = material, APL = application), 4 the labeled documents are combined with word embeddings (Word2vec) generated from unlabeled text to train a neural network for named entity recognition, and finally 5 entities are extracted from our text corpus. Reprinted with permission from [32]
Materials modeling is another area where ML has been disruptive. Materials modeling spans multiple scales, ranging from electronic and atomic length-scales to continuum scales. Atomistic simulations model the interactions between atoms using empirical forcefields, whose accuracy governs the accuracy of the simulations. While first-principles simulations can provide accurate results, they are constrained in terms of the number of atoms due to their prohibitive computational cost. To address this challenge, ML-based interatomic forcefields have been developed which provide the accuracy of first-principles simulations at a significantly lower computational cost. Such generic potentials may thus enable the simulation of almost any element in the periodic table at a reasonable computational cost. Figure 1.9
shows a GNN framework which enables accurate predictions of interatomic force
from the particle positions and velocities [39]. In this approach, each atomic configuration is represented as a directed graph wherein the influence of a neighboring atom j on an atom i, or node v_i, is represented by a directed edge e_ij along the direction u_ij.
Nodes and edges are embedded with latent vectors in the embedding stage.

Fig. 1.9 A graph neural network framework that directly predicts the force on each atom from the atomic structure. The trained model can be used as a surrogate direct-to-force architecture in molecular simulations. Reprinted with permission from [39]

Initially, the node and edge embeddings contain the atom type and interatomic distance information, respectively. The embeddings are then iteratively updated during the message passing stage. The final updated edge embeddings are used for predicting the interatomic force magnitudes. The force on a central atom is calculated by summing the force contributions of its neighboring atoms, each obtained by multiplying the predicted force magnitude by the respective unit vector. The predicted forces are finally used for updating the atomic positions in MD.
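The force-assembly step at the end of that pipeline is easy to state concretely: given a scalar magnitude predicted for each directed edge, the force on an atom is the sum of magnitude times unit vector over its neighbors. In the NumPy sketch below, the magnitudes are dummy numbers standing in for a trained GNN's edge outputs.

```python
import numpy as np

# Three atoms and a dummy symmetric matrix of predicted force magnitudes;
# magnitude[i, j] plays the role of the GNN's output for the edge j -> i.
positions = np.array([[0.0, 0.0, 0.0],
                      [1.0, 0.0, 0.0],
                      [0.0, 1.0, 0.0]])
magnitude = np.array([[0.0, 2.0, 1.0],
                      [2.0, 0.0, 0.5],
                      [1.0, 0.5, 0.0]])

forces = np.zeros_like(positions)
for i in range(len(positions)):
    for j in range(len(positions)):
        if i == j:
            continue
        r = positions[i] - positions[j]     # vector from atom j to atom i
        u = r / np.linalg.norm(r)           # unit vector u_ij
        forces[i] += magnitude[i, j] * u    # magnitude times unit vector

print(forces.sum(axis=0))  # net force ~ [0, 0, 0]
```

With symmetric magnitudes the pairwise contributions cancel, so the net force vanishes, mirroring momentum conservation in an MD step.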
Another approach used for accelerating simulations at higher length scales is physics-based ML for materials simulation. Here, the ML model is allowed to learn the equations governing material response while being constrained by physical laws such as energy, mass, and momentum conservation. Thus, these physics-based ML models allow generic material simulation while also obeying the laws of physics.
Some examples of these simulations include the Hamiltonian neural network [40],
physics-informed neural network [41], and Lagrangian neural network [42].
rithms are also given. This section aims to allow the readers to understand the
inner workings of ML algorithms and empower them to use these algorithms to
solve their own problems of interest.
3. Part III: The third part focuses on the applications of ML to solve several chal-
lenges in the materials domain. This section aims to give insights into the problems
that have been tackled by AI and ML. Further, through these examples, we also aim
to inspire the readers to identify problems in their own domains which can be solved
using ML.
The area of ML for materials, being a very active and dynamic one, is evolving at
a very high pace. The ideas, methods, and problems discussed in this book are by no
means exhaustive. Further, these discussions should not be considered a detailed
review of the applications of ML in the materials domain. Rather, the discussions in this
book are simply illustrative in nature, exemplifying the applications of several ML
methods in the context of the materials domain. Moreover, we hope that these discussions
inspire the readers to improve upon the state of the art of ML for materials.
References
1. L. Himanen, A. Geurts, A.S. Foster, P. Rinke, Data-driven materials science: status, challenges,
and perspectives. Adv. Sci. 6(21), 1900808 (2019). https://doi.org/10.1002/advs.201900808.
eprint: https://onlinelibrary.wiley.com/doi/pdf/10.1002/advs.201900808. https://onlinelibrary.
wiley.com/doi/abs/10.1002/advs.201900808
2. J. Li, K. Lim, H. Yang, Z. Ren, S. Raghavan, P.-Y. Chen, T. Buonassisi, X. Wang,
AI applications through the whole life cycle of material discovery. Matter 3(2), 393–
432 (2020). ISSN: 2590-2385. https://doi.org/10.1016/j.matt.2020.06.011. https://www.
sciencedirect.com/science/article/pii/S2590238520303015
3. J.E. Gubernatis, T. Lookman, Machine learning in materials design and discovery: exam-
ples from the present and suggestions for the future. Phys. Rev. Mater. 2(12), 120301
(2018). https://doi.org/10.1103/PhysRevMaterials.2.120301. https://link.aps.org/doi/10.1103/
PhysRevMaterials.2.120301. Accessed 19 Feb 2019
4. Y. Liu, T. Zhao, W. Ju, S. Shi, Materials discovery and design using machine learning. J. Mate-
riomics 3(3), 159–177 (2017). High-throughput Experimental and Modeling Research toward
Advanced Batteries, ISSN: 2352-8478. https://doi.org/10.1016/j.jmat.2017.08.002. https://
www.sciencedirect.com/science/article/pii/S2352847817300515
5. P. Raccuglia, K.C. Elbert, P.D. Adler, C. Falk, M.B. Wenny, A. Mollo, M. Zeller, S.A. Friedler, J.
Schrier, A.J. Norquist, Machine-learning-assisted materials discovery using failed experiments.
Nature 533(7601), 73–76 (2016)
6. Q. Zhou, P. Tang, S. Liu, J. Pan, Q. Yan, S.-C. Zhang, Learning atoms for materials discovery.
Proc. Natl. Acad. Sci. 115(28), E6411–E6417 (2018)
7. A. Fluegel, Statistical regression modelling of glass properties - a tutorial. Glass Technol. - Eur.
J. Glass Sci. Technol. Part A 50(1), 25–46 (2009)
8. Q. Ling, H. Zijun, L. Dan, Multifunctional cellular materials based on 2D nanoma-
terials: prospects and challenges. Adv. Mater. 30(4), 1704850 (2018). ISSN: 1521-4095.
https://doi.org/10.1002/adma.201704850. https://onlinelibrary.wiley.com/doi/abs/10.
1002/adma.201704850
9. D.R. Cassar, A.C.P.L.F. de Carvalho, E.D. Zanotto, Predicting glass transition temper-
atures using neural networks. Acta Materialia 159, 249–256 (2018). ISSN: 1359-6454.
https://doi.org/10.1016/j.actamat.2018.08.022. http://www.sciencedirect.com/science/article/
pii/S1359645418306542. Accessed 02 Oct 2019
10. T. Oey, S. Jones, J.W. Bullard, G. Sant, Machine learning can predict setting behavior and
strength evolution of hydrating cement systems. J. Amer. Ceramic Soc. 103(1), 480–490
(2020). eprint: https://ceramics.onlinelibrary.wiley.com/doi/pdf/10.1111/jace.16706. ISSN:
1551-2916. https://doi.org/10.1111/jace.16706. https://ceramics.onlinelibrary.wiley.com/doi/
abs/10.1111/jace.16706. Accessed 27 Feb 2021
11. A. Yamanaka, R. Kamijyo, K. Koenuma, I. Watanabe, T. Kuwabara, Deep neural network
approach to estimate biaxial stress-strain curves of sheet metals. Mater. Design 195, 108970
(2020). ISSN: 0264-1275. https://doi.org/10.1016/j.matdes.2020.108970
12. R. Kondo, S. Yamakawa, Y. Masuoka, S. Tajima, R. Asahi, Microstructure recognition using
convolutional neural networks for prediction of ionic conductivity in ceramics. Acta Materialia
141, 29–38 (2017)
13. J. Ling, M. Hutchinson, E. Antono, B. DeCost, E.A. Holm, B. Meredig, Building data-
driven models with microstructural images: generalization and interpretability. Mater. Discov.
10, 19–28 (2017). ISSN: 2352-9245. https://doi.org/10.1016/j.md.2018.03.002. https://www.
sciencedirect.com/science/article/pii/S235292451730042X. Accessed 27 Feb 2021
14. F. Ren, L. Ward, T. Williams, K.J. Laws, C. Wolverton, J. Hattrick-Simpers, A. Mehta,
Accelerated discovery of metallic glasses through iteration of machine learning and high-
throughput experiments. Sci. Adv. 4(4), eaaq1566 (2018). ISSN: 2375-2548. https://doi.org/10.
1126/sciadv.aaq1566. https://advances.sciencemag.org/content/4/4/eaaq1566. Accessed 30
July 2019
15. M. Zaki, Jayadeva, and N.A. Krishnan, Extracting processing and testing parameters from mate-
rials science literature for improved property prediction of glasses. Chem. Eng. Proc. - Process
Intensif. 108607 (2021). ISSN: 0255-2701. https://doi.org/10.1016/j.cep.2021.108607. https://
www.sciencedirect.com/science/article/pii/S0255270121003020
16. R. Ravinder, K.H. Sridhara, S. Bishnoi, H. Singh Grover, M. Bauchy, Jayadeva, H. Kodamana,
N.M.A. Krishnan, Deep learning aided rational design of oxide glasses. Mater. Horizons (2020).
Publisher: Royal Society of Chemistry. https://doi.org/10.1039/D0MH00162G. https://pubs.
rsc.org/en/content/articlelanding/2020/mh/d0mh00162g. Accessed 10 May 2020
17. V. Venugopal, S.R. Broderick, K. Rajan, A picture is worth a thousand words: apply-
ing natural language processing tools for creating a quantum materials database map.
MRS Commun. 9(4), 1134–1141 (2019). Publisher: Cambridge University Press. ISSN:
2159-6859, 2159-6867. https://doi.org/10.1557/mrc.2019.136. https://www.cambridge.org/
core/journals/mrs-communications/article/picture-is-worth-a-thousand-words-applying-
natural-language-processing-tools-for-creating-a-quantum-materials-database-map/
8956AFA3C1D282BAF0A85DA36AB0F6B2. Accessed 19 Oct 2020
18. X. Li, Z. Liu, S. Cui, C. Luo, C. Li, Z. Zhuang, Predicting the effective mechanical property of
heterogeneous materials by image based modeling and deep learning. Comput. Methods Appl.
Mech. Eng. 347, 735–753 (2019)
19. J. Bernal, K. Kushibar, D.S. Asfaw, S. Valverde, A. Oliver, R. Marti, X. Llado, Deep convo-
lutional neural networks for brain image analysis on magnetic resonance imaging: a review.
Artif. Intell. Med. 95, 64–81 (2019)
20. K. Kim, Z. Lee, W. Regan, C. Kisielowski, M.F. Crommie, A. Zettl, Grain boundary mapping
in polycrystalline graphene. ACS Nano 5(3), 2142–2146 (2011). ISSN: 1936-0851. https://doi.
org/10.1021/nn1033423. https://doi.org/10.1021/nn1033423. Accessed 07 April 2019
21. A. Shekhawat, R.O. Ritchie, Toughness and strength of nanocrystalline graphene. Nat. Com-
mun. 7, 10546 (2016). ISSN: 2041-1723. https://doi.org/10.1038/ncomms10546. https://www.
nature.com/articles/ncomms10546. Accessed 07 April 2019
22. H.I. Rasool, C. Ophus, W.S. Klug, A. Zettl, J.K. Gimzewski, Measurement of the intrinsic
strength of crystalline and polycrystalline graphene. Nat. Commun. 4, 2811 (2013). ISSN: 2041-
1723. https://doi.org/10.1038/ncomms3811. https://www.nature.com/articles/ncomms3811.
Accessed 07 April 2019
39. C.W. Park, M. Kornbluth, J. Vandermause, C. Wolverton, B. Kozinsky, J.P. Mailoa, Accu-
rate and scalable graph neural network force field and molecular dynamics with direct force
architecture. npj Comput. Mater. 7(1), 1–9 (2021)
40. S. Greydanus, M. Dzamba, J. Yosinski, Hamiltonian neural networks. Adv. Neural Inf. Proc.
Syst. 32, 15379–15389 (2019)
41. G.E. Karniadakis, I.G. Kevrekidis, L. Lu, P. Perdikaris, S. Wang, L. Yang, Physics-informed
machine learning. Nat. Rev. Phys. 3(6), 422–440 (2021)
42. M. Cranmer, S. Greydanus, S. Hoyer, P. Battaglia, D. Spergel, S. Ho, Lagrangian neural net-
works (2020). arXiv:2003.04630
Part II
Basics of Machine Learning
This part covers the basics of machine learning methods used for materials discovery.
In Chap. 2, we focus on dataset visualization and preprocessing. Following
this, a brief introduction to various ML approaches such as supervised, unsuper-
vised, and reinforcement learning is provided in Chap. 3. Chapters 4 and 5 discuss
in detail the supervised algorithms for regression, with the former focusing
on parametric methods and the latter on non-parametric methods. Chapter 6 deals
with classification and clustering algorithms. Chapter 7 focuses on model refinement
using hyperparameter optimization. Chapter 8 provides an overview of advanced ML
and deep learning algorithms such as variational auto-encoders, generative adver-
sarial networks, graph neural networks, and reinforcement learning. Finally, Chap. 9
focuses on the interpretability of black-box ML algorithms.
Chapter 2
Data Visualization and Preprocessing
Abstract ML methods, being purely data driven, rely on the availability of high-quality
datasets. However, in reality, datasets may have inconsistencies and errors,
and may even be incomplete. Further, the choice of an appropriate ML algorithm for
a given dataset depends highly on the nature, size, and distribution of the dataset. In
this chapter, we discuss different approaches to visualize data, such as histograms,
scatter plots, heat maps, and tree maps. Further, several measures that quantify the
data, including central and higher-order measures, are discussed. Next, we discuss
several commonly used outlier detection algorithms that enable “data cleaning”.
Finally, we discuss data-imputation algorithms such as SMOTE and ADASYN for
imputing data in imbalanced datasets.
2.1 Introduction
A bar graph is a means to visualize categorical data with rectangular bars, the height
of each bar being proportional to the value it represents. The bars can be plotted vertically
or horizontally. Figure 2.1 shows the bar graph of a dataset of glasses containing
sodium silicate, that is, (Na2O)_x·(SiO2)_(1−x), and calcium aluminosilicate glasses,
that is, (CaO)_y·(Al2O3)_z·(SiO2)_(1−y−z), where x, y, and z represent the mole fractions of
the respective oxides in the glasses and can take any value in the range [0, 1]. The
bar plot shows the number of glasses in which each oxide has a non-zero value. We
observe that the number of glasses having Na2O is the least, while the maximum number
of glasses have SiO2 present in them. Code snippet 2.1 shows the Python code to reproduce the
results.
A heat map is another useful data visualization tool. In this, the input features are
represented in two dimensions (x and y), and the variation of the output property
"""
"""
# i m p o r t m a t p l o l i b for p l o t t i n g data
import matplotlib
m a t p l o t l i b . use ( ' agg ' )
i m p o r t m a t p l o t l i b . p y p l o t as plt
from m a t p l o t l i b . p y p l o t i m p o r t s a v e f i g
# i m p o r t n u m p y and p a n d a s
i m p o r t numpy as np
i m p o r t p a n d a s as pd
# load s a m p l e d a t a s e t
data = pd . r e a d _ c s v ( " data / e l a s t i c _ m o d u l u s . csv " )
# print column names
print ( data . c o l u m n s )
n u m _ N a 2 O = sum ( data [ ' Na2O ' ] > 0 )
n u m _ C a O = sum ( data [ ' CaO ' ] > 0 )
n u m _ A l 2 O 3 = sum ( data [ ' Al2O3 ' ] > 0 )
n u m _ S i O 2 = sum ( data [ ' SiO2 ' ] > 0 )
y = [ num_Na2O , num_CaO , num_Al2O3 , n u m _ S i O 2 ]
x = [ " $ N a _ 2 O $ " , " $CaO$ " , " $ A l _ 2 O _ 3 $ " , " $ S i O _ 2 $ " ]
# bar plot using m a t p l o t l i b
plt . bar (x , y , fc = " none " , ec = " k " , hatch = " // " )
plt . y l a b e l ( " N u m b e r " )
plt . l e g e n d ()
plt . gca () . s e t _ a s p e c t ( ' auto ' )
s a v e f i g ( " s a m p l e b a r p l o t . png " )
print ( " End of o u t p u t " )
Output:
Index(['Al2O3', 'CaO', 'SiO2', 'Na2O', 'Young's modulus (GPa)'], dtype='object')
End of output
is represented using a coloring scheme. The intensity of the color in the x-y space
represents the variation in the property values. Thus, a heat map facilitates the visu-
alization of three-dimensional data in two dimensions. An advantage of a heat map
is that it can provide a quick visual summary of the data sets. Heat maps are also used
to visualize the correlations between two variables (namely, correlation heat maps).
Such heat maps provide a quick way to understand correlations among variables in
a visual manner.
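As an illustrative sketch of such a correlation heat map (using synthetic columns rather than a materials dataset; the column names and output file name are hypothetical):

```python
import matplotlib
matplotlib.use('agg')
import matplotlib.pyplot as plt
import numpy as np

# synthetic data: x2 is strongly correlated with x1, x3 is independent
rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = 2.0 * x1 + rng.normal(scale=0.5, size=200)
x3 = rng.normal(size=200)
table = np.column_stack([x1, x2, x3])

# correlation matrix: cell (i, j) holds the Pearson correlation of columns i and j
corr = np.corrcoef(table, rowvar=False)

# heat map of the correlation matrix
plt.imshow(corr, cmap='coolwarm', vmin=-1, vmax=1)
plt.colorbar(label='Pearson correlation')
plt.xticks(range(3), ['x1', 'x2', 'x3'])
plt.yticks(range(3), ['x1', 'x2', 'x3'])
plt.savefig('samplecorrheatmap.png')
```

The diagonal is always 1 (each variable is perfectly correlated with itself), and the bright off-diagonal cell between x1 and x2 exposes their linear relationship at a glance.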
Figure 2.2 shows the ternary diagram for Young’s modulus of calcium aluminosili-
cate glasses [1]. The squares represent the experimental values, while the background
represents the predictions based on an ML model. The coloring scheme on the right
shows the range of values for Young’s modulus. The heat map thus clearly shows the
trends in Young’s modulus values with respect to composition. In addition, it allows
a direct comparison of the model predictions represented by the heat map with the
experimental values represented by the squares.
Fig. 2.3 Treemap of elements in the periodic table. Note that the elements are grouped into canonical
categories such as alkali, alkaline earth, noble gases, non-metals, lanthanides, metalloids, actinides,
and transition metals
A scatter plot uses simple markers (and not continuous lines) to represent the values of
two or three different variables. Accordingly, the resulting plot will be two- or three-
dimensional, respectively. Scatter plots are particularly useful and are among the
most widely used plots for analyzing relationships between variables. A scatter plot
can also be used to unearth hidden patterns in the data, for example, when the data
points appear clustered together. Scatter plots also enable one to identify any
unexpected gaps or outliers present in the data.
Figure 2.4 shows the scatter plot of Young’s modulus of sodium silicate glasses,
(Na2O)_x·(SiO2)_(1−x), with respect to the silica percentage in the glass. See Code
Snippet 2.2 to reproduce the results and the plot. From the scatter plot, we observe
that Young’s modulus values of the glass compositions lie scattered. Further, for
similar compositions having a SiO2 percentage of 65%, 70%, or 75%, the values of
Young’s modulus exhibit large variations. This suggests the presence of outliers in
the data. For instance, if the compositions with Young’s modulus values less than
55 GPa are discarded (five data points), Young’s modulus exhibits an increasing trend
in an average sense with respect to the silica content. Thus, scatter plots provide a
clear visualization of the trend in the data while also providing insights into the
outliers.
"""
"""
# i m p o r t m a t p l o l i b for p l o t t i n g data
import matplotlib
m a t p l o t l i b . use ( ' agg ' )
i m p o r t m a t p l o t l i b . p y p l o t as plt
from m a t p l o t l i b . p y p l o t i m p o r t s a v e f i g
# i m p o r t n u m p y and p a n d a s
i m p o r t numpy as np
i m p o r t p a n d a s as pd
# load s a m p l e d a t a s e t
data = pd . r e a d _ c s v ( " data / NS_ym . csv " )
# print column names
print ( data . c o l u m n s )
x = data [ ' SiO2 ' ]
y = data [ " Y o u n g ' s m o d u l u s ( GPa ) " ]
# s c a t t e r plot using m a t p l o t l i b
plt . s c a t t e r ( x , y , m a r k e r = " o " , fc = " none " , ec = " k " )
plt . y l a b e l ( " Y o u n g ' s m o d u l u s ( GPa ) " )
plt . x l a b e l ( r " $ S i O _ 2 $ ( mol % ) " )
plt . l e g e n d ()
s a v e f i g ( " s a m p l e s c a t t e r p l o t . png " )
print ( " End of o u t p u t " )
Output:
Index(['Na2O', 'SiO2', 'Young's modulus (GPa)'], dtype='object')
End of output
2.2.5 Histogram
"""
"""
# i m p o r t m a t p l o l i b for p l o t t i n g data
import matplotlib
m a t p l o t l i b . use ( ' agg ' )
i m p o r t m a t p l o t l i b . p y p l o t as plt
from m a t p l o t l i b . p y p l o t i m p o r t s a v e f i g
# i m p o r t n u m p y and p a n d a s
i m p o r t numpy as np
i m p o r t p a n d a s as pd
# load s a m p l e d a t a s e t
data = pd . r e a d _ c s v ( " data / e l a s t i c _ m o d u l u s . csv " )
# print column names
print ( data . c o l u m n s )
# hist plot using m a t p l o t l i b
plt . hist ( data [ " Y o u n g ' s m o d u l u s ( GPa ) " ] , bins = 16 , range = ( 42 . 5 , 122 . 5 ) ,
fc = " none " , ec = " k " , hatch = " // " )
plt . y l a b e l ( " F r e q u e n c y " )
plt . x l a b e l ( " Y o u n g ' s m o d u l u s ( GPa ) " )
plt . l e g e n d ()
plt . gca () . s e t _ a s p e c t ( ' auto ' )
s a v e f i g ( " s a m p l e h i s t p l o t . png " )
print ( " End of o u t p u t " )
Output:
Index(['Al2O3', 'CaO', 'SiO2', 'Na2O', 'Young's modulus (GPa)'], dtype='object')
End of output
"""
"""
# i m p o r t m a t p l o l i b for p l o t t i n g data
import matplotlib
m a t p l o t l i b . use ( ' agg ' )
i m p o r t m a t p l o t l i b . p y p l o t as plt
from m a t p l o t l i b . p y p l o t i m p o r t s a v e f i g
# i m p o r t n u m p y and p a n d a s
i m p o r t numpy as np
i m p o r t p a n d a s as pd
# load s a m p l e d a t a s e t
data = pd . r e a d _ c s v ( " data / e l a s t i c _ m o d u l u s . csv " )
# print column names
print ( data . c o l u m n s )
# create sample dataset
y = [ np . r a n d o m . n o r m a l ( loc = 2 . 0 , scale = 1 . 0 ) for i in range ( 1000 ) ]
y + = [ np . r a n d o m . n o r m a l ( loc = 5 .0 , scale = 1 . 0 ) for i in range ( 1000 ) ]
y = np . array ( y )
# d e n s i t y plot using m a t p l o t l i b
r a n g e _ = ( y . min () , y . max () )
binsize = 0.4
bins = int (( r a n g e _ [ 1 ] - r a n g e _ [ 0 ] ) / b i n s i z e )
plt . hist (y , bins = bins , range = range_ , fc = " none " , ec = " k " , hatch = " // " ,
d e n s i t y = True , alpha = 0 . 1 )
binsize = 0.4
bins = int (( r a n g e _ [ 1 ] - r a n g e _ [ 0 ] ) / b i n s i z e )
values , b i n _ e d g e s = np . h i s t o g r a m (y , bins = bins , range = range_ , d e n s i t y =
True )
x = ( bin_edges [:-1]+ bin_edges [1:])/2
plt . plot ( x , v a l u e s )
binsize = 0.2
bins = int (( r a n g e _ [ 1 ] - r a n g e _ [ 0 ] ) / b i n s i z e )
values , b i n _ e d g e s = np . h i s t o g r a m (y , bins = bins , range = range_ , d e n s i t y =
True )
x = ( bin_edges [:-1]+ bin_edges [1:])/2
plt . plot ( x , v a l u e s )
binsize = 1.2
bins = int (( r a n g e _ [ 1 ] - r a n g e _ [ 0 ] ) / b i n s i z e )
values , b i n _ e d g e s = np . h i s t o g r a m (y , bins = bins , range = range_ , d e n s i t y =
True )
x = ( bin_edges [:-1]+ bin_edges [1:])/2
plt . plot ( x , v a l u e s )
plt . y l a b e l ( " P r o b a b i l i t y d e n s i t y f u n c t i o n " )
plt . x l a b e l ( " Y o u n g ' s m o d u l u s ( GPa ) " )
plt . l e g e n d ()
s a v e f i g ( " s a m p l e d e n s i t y p l o t . png " )
print ( " End of o u t p u t " )
Output:
Index(['Al2O3', 'CaO', 'SiO2', 'Na2O', 'Young's modulus (GPa)'], dtype='object')
End of output
Data visualization provides a qualitative picture of the nature of the data.
However, a quantitative representation requires the extraction of statistics from the
data. These methods of extracting key information from the data are outlined next.
A population is the set of all possible data of the characteristics under investigation.
This population may be finite or infinite in size. However, due to various limitations,
a population may not be fully accessible to anyone. For example, it is impossible
to directly measure Young’s modulus of each grain and phase of a steel sample
with a highly heterogeneous microstructure. To address this issue, all the statistical
measures are defined on a sample, that is, the part or the subset of the population
which is fully accessible. Thus, to quantify Young’s modulus of steel, 100 or 200
measurements may be made at randomly selected or uniformly distributed grid points.
These points are considered to represent the entire microstructure of steel, and the
statistical properties are then extracted from this dataset. For further discussion on
statistical methods, readers are directed to References [2–6].
2.3.1.1 Mean
Mean is the most common central measure used to represent a dataset that is dis-
tributed in a continuous fashion. The sample mean is defined as:
$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i \tag{2.1}$$
where n is the size of the sample and x_i is the ith sample point. Note that x̄ may
not represent the region where the data is most densely distributed, especially if the
distribution is asymmetric. In fact, the mean can even lie in a region where the data
is sparse. Further, as the mean is the weighted sum of all the data points, outliers
present in the data can significantly affect the location of the mean.
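A quick numerical illustration of this sensitivity, using five hypothetical Young's modulus readings, one of which is an outlier:

```python
import numpy as np

# five hypothetical measurements of Young's modulus (GPa); 91 is an outlier
x = np.array([70.0, 71.0, 73.0, 69.0, 91.0])

mean_all = x.mean()         # pulled towards the outlier
mean_clean = x[:4].mean()   # mean of the four consistent readings

print(mean_all, mean_clean)
```

A single outlying reading shifts the mean by about 4 GPa, even though the other four readings agree with each other to within 4 GPa.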
2.3.1.2 Median
2.3.1.3 Mode
The mode is the value that occurs with the highest frequency. If each x_i is unique,
the mode is not defined. Typically, the mode is the best central measure
when dealing with categorical or discrete data.
Consider the dataset on Young’s modulus of the sodium silicate and calcium
aluminosilicate glasses presented in the earlier section. Figure 2.7 shows the mean,
median, and mode for the dataset along with the underlying histogram of the data.
Code Snippet 2.5 can be used to reproduce the results and the plot. Here, we observe
that the mean corresponds to 80 GPa, a value around which there are very few data
points in the raw dataset. This could be attributed to the bimodal nature of the data,
wherein the mean is not a very meaningful central measure. Mode corresponds to
60 GPa, which is the mean Young’s modulus of the dataset consisting of sodium sili-
cate glasses only. Thus, no information about the calcium aluminosilicate glasses in
the dataset is included in the mode. The median, having a value of 85 GPa, represents
the histogram bin having the maximum number of calcium aluminosilicate glasses.
Thus, the median represents a reasonable central measure of the dataset in this case.
It is interesting to note that, for the present dataset, each of the three central measures
36 2 Data Visualization and Preprocessing
corresponds to different regions in the dataset—the mean in the sparse region, the
mode in the sodium silicate region, and the median in the calcium aluminosilicate
region.
Measures of spread summarize how scattered the data is and how each point in
the dataset is distributed with respect to the central measure considered.
Some of the commonly used measures of spread are the range, percentile, and variance.
It is also useful to estimate the uncertainty associated with some experimental mea-
surements. For example, the spread in the values of Young’s modulus from multiple
measurements in a homogeneous material provides insights into the accuracy of the
measurements. This is typically represented using error bars which incorporate the
measures of variability along with the central measure of the data.
2.3.2.1 Range
The range is the simplest measure of the spread in the dataset. The range is defined as
the difference between the largest and smallest observations in a sample set. Although
it is easy to compute, the range is highly sensitive to outliers. For example, for a dataset
given by (70, 71, 73, 69, 91) GPa—representing five measurements of Young’s
modulus of silica glass—the range is 91 − 69 = 22 GPa. From
the dataset, it can be observed that all the values, other than 91 GPa, are distributed
closely, and hence 91 GPa is clearly an outlier. Excluding the outlier 91 GPa, the range
"""
"""
# i m p o r t m a t p l o l i b for p l o t t i n g data
import matplotlib
m a t p l o t l i b . use ( ' agg ' )
i m p o r t m a t p l o t l i b . p y p l o t as plt
from m a t p l o t l i b . p y p l o t i m p o r t s a v e f i g
# i m p o r t n u m p y and p a n d a s
i m p o r t numpy as np
i m p o r t p a n d a s as pd
# load s a m p l e d a t a s e t
data = pd . r e a d _ c s v ( " data / e l a s t i c _ m o d u l u s . csv " )
# print column names
print ( data . c o l u m n s )
y = data [ " Y o u n g ' s m o d u l u s ( GPa ) " ]
mean_ = y . mean ()
m e d i a n _ = y . m e d i a n ()
mode_ = y . mode () [ 0 ]
# hist plot using m a t p l o t l i b
plt . hist (y , bins = 16 , range = ( 42 .5 , 122 . 5 ) , fc = " none " , ec = " k " , hatch = " //
" , alpha = 0 . 5 )
plt . v l i n e s ( mean_ , 0 . 0 , 25 . 0 , lw =3 , color = " k " , label = " Mean " )
plt . v l i n e s ( median_ , 0 .0 , 25 . 0 , lw =3 , ls = " : " , color = " k " , label = " M e d i a n "
)
plt . v l i n e s ( mode_ , 0 . 0 , 25 . 0 , lw =3 , ls = " -- " , color = " k " , label = " Mode " )
plt . y l a b e l ( " F r e q u e n c y " )
plt . x l a b e l ( " Y o u n g ' s m o d u l u s ( GPa ) " )
plt . l e g e n d ()
s a v e f i g ( " s a m p l e h i s t m m m p l o t . png " )
print ( " End of o u t p u t " )
Output:
Index(['Al2O3', 'CaO', 'SiO2', 'Na2O', 'Young's modulus (GPa)'], dtype='object')
End of output
is 73 − 69 = 4 GPa. As such, the range is not a good measure of the variability of
data that may contain outliers with large deviations in their values.
2.3.2.2 Percentile

The pth percentile is the threshold value such that p% of the observations are at or below
it. It is the (k + 1)th largest sample point if np/100 is not an integer, where k is the largest
integer less than np/100. The first quartile (Q1), the second quartile (Q2), and
the third quartile (Q3) are the 25th, 50th, and 75th percentiles, located at positions
(n+1)/4, (n+1)/2, and 3(n+1)/4, respectively. The second quartile is also known as
the median (M). A box plot is a convenient graphic representing the range, median,
and quartiles. In a box plot, a box is drawn from Q1 to Q3, Q2 (the median) is drawn
as a vertical line within the box, and whiskers are drawn either up to the outermost
points or up to 1.5 × (Q3 − Q1) beyond the quartiles; the length of the whiskers
represents the range.
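Quartiles can be computed directly with NumPy. Note that np.percentile interpolates linearly between sample points by default, which can differ slightly from the positional (n+1)/4-style rule above for small samples:

```python
import numpy as np

# 101 evenly spaced sample points, so the quartiles fall exactly on data values
y = np.arange(101, dtype=float)

q1, q2, q3 = np.percentile(y, [25, 50, 75])
print(q1, q2, q3)

# the second quartile is the median
print(q2 == np.median(y))
```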
2.3.2.3 Variance
The variance (S²) and its square root, the standard deviation (S), are measures of
the variability of the data around the mean. They give a picture of the distribution
of the data around the mean value. If a dataset is highly dispersed, the points tend to
spread farther from the mean, leading to high values of variance and standard
deviation, and vice versa. The standard deviation of a normal distribution enables us
to calculate confidence intervals. In a normal distribution, about 68% of the values
lie within one standard deviation on either side of the mean, about 95% of the
values are within two standard deviations of the mean, and about 99.7% of the values are
within three standard deviations of the mean. The sample variance is computed
by
$$S^2 = \sum_{i=1}^{n}\frac{(x_i - \bar{x})^2}{n-1} = \sum_{i=1}^{n}\frac{\left(x_i - \frac{1}{n}\sum_{j=1}^{n} x_j\right)^2}{n-1} \tag{2.2}$$
It is to be noted that for population data, the notations used for the mean, variance, and
standard deviation are μ, σ², and σ, respectively.
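Equation (2.2), with its n − 1 denominator, corresponds to NumPy's np.var with ddof=1; a quick check on toy data:

```python
import numpy as np

# toy sample of Young's modulus values (GPa)
x = np.array([68.0, 70.0, 72.0, 74.0, 76.0])

n = x.size
s2 = ((x - x.mean()) ** 2).sum() / (n - 1)  # sample variance, Eq. (2.2)
s = np.sqrt(s2)                             # sample standard deviation

print(s2, np.var(x, ddof=1))  # the two agree
```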
Figure 2.8 shows the range and percentiles of a normalized dataset of Young’s
modulus values. The normalization is performed by subtracting the mean of the
distribution from each data point and dividing the value by the standard deviation
of the distribution. As such, the values of Young’s modulus are distributed between
−2.5 and 2.5. Code Snippet 2.6 can be used to reproduce the results and the plot.
"""
"""
# i m p o r t m a t p l o l i b for p l o t t i n g data
import matplotlib
m a t p l o t l i b . use ( ' agg ' )
i m p o r t m a t p l o t l i b . p y p l o t as plt
from m a t p l o t l i b . p y p l o t i m p o r t s a v e f i g
# i m p o r t n u m p y and p a n d a s
i m p o r t numpy as np
i m p o r t p a n d a s as pd
# load s a m p l e d a t a s e t
data = pd . r e a d _ c s v ( " data / e l a s t i c _ m o d u l u s . csv " )
# print column names
print ( data . c o l u m n s )
y = np . random . randn ( 1000 ) # data [" Young 's m o d u l u s ( GPa ) "]
q25 , q50 , q75 = np . q u a n t i l e ( y , [ 0 . 25 , 0 .5 , 0 . 75 ] )
# hist plot using m a t p l o t l i b
fig , ax1 = plt . s u b p l o t s ( 1 , 1 , s h a r e x = True )
r a n g e _ = ( -3 , 3 )
bins = 100
ax1 . hist (y , bins = bins , range = range_ , ec = " k " , fc = " none " , hatch = " // " ,
alpha = 0 .5 , d e n s i t y = True )
ax1 . b o x p l o t (y , p o s i t i o n s = [ 1 . 1 ] , vert = False )
ax1 . v l i n e s ( q25 , 0 . 25 , 1 . 02 , ls = " : " )
ax1 . h l i n e s ( 0 . 25 , - 3 , q25 , ls = " : " )
ax1 . v l i n e s ( q50 , 0 . 5 , 1 . 02 , ls = " : " )
ax1 . h l i n e s ( 0 . 50 , - 3 , q50 , ls = " : " )
ax1 . v l i n e s ( q75 , 0 . 75 , 1 . 02 , ls = " : " )
ax1 . h l i n e s ( 0 . 75 , - 3 , q75 , ls = " : " )
ax1 . hist (y , bins = 10 * bins , range = range_ , fc = " none " , hatch = " " , alpha = 0 .
5 , d e n s i t y = True , h i s t t y p e = " step " ,
c u m u l a t i v e = True , lw = 1 . 5 )
plt . sca ( ax1 )
plt . y t i c k s ( [ 0 . 25 , 0 . 50 , 0 . 75 , 1 . 00 ] , [ 0 . 25 , " 0 . 50 " , 0 . 75 , " 1 . 00 " ] )
plt . ylim (0 , 1 . 3 )
s a v e f i g ( " s a m p l e q u a r t i l e s p l o t . png " )
print ( " End of o u t p u t " )
Output:
Index(['Al2O3', 'CaO', 'SiO2', 'Na2O', 'Young's modulus (GPa)'], dtype='object')
Two higher-order measures of data representation are skewness and kurtosis. While
skewness is a measure of the distortion (asymmetry) of the data, kurtosis is a measure
of the heavy-tailed nature of the data relative to a normal distribution.
2.3.3.1 Skewness
Skewness, G1, measures the degree of distortion of the data from the normal distribu-
tion. A symmetrical distribution has a skewness of 0. Skewness is calculated
by
$$G_1 = \frac{\sqrt{n(n-1)}}{n-2}\sum_{i=1}^{n}\frac{(x_i - \bar{x})^3}{nS^3} \tag{2.3}$$
If −0.5 ≤ G1 ≤ 0.5, the data are fairly symmetrical. If G1 < −0.5, the distribution is
called negatively skewed, while if G1 > 0.5, it is called positively skewed.
2.3.3.2 Kurtosis
Kurtosis, κ, describes the extreme values in one tail versus the other, and
is therefore a measure of the outliers present in the distribution. A Gaussian distribution
has a kurtosis value of three. Hence, the excess kurtosis is typically defined to compare
the kurtosis of the dataset under consideration with that of a Gaussian distribution,
as follows:
$$\kappa = \sum_{i=1}^{n}\frac{(x_i - \bar{x})^4}{nS^4} - 3 \tag{2.4}$$
A high kurtosis value indicates that the data has heavy tails or outliers, and
vice versa.
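Equations (2.3) and (2.4) translate directly into NumPy; the sketch below applies them to a small symmetric sample, for which the skewness is exactly zero and the excess kurtosis is negative (lighter tails than a Gaussian):

```python
import numpy as np

def skewness(x):
    # sample skewness, Eq. (2.3)
    n = x.size
    s = x.std(ddof=1)
    return np.sqrt(n * (n - 1)) / (n - 2) * ((x - x.mean()) ** 3).sum() / (n * s ** 3)

def excess_kurtosis(x):
    # excess kurtosis, Eq. (2.4): fourth central moment scaled by S^4, minus 3
    n = x.size
    s = x.std(ddof=1)
    return ((x - x.mean()) ** 4).sum() / (n * s ** 4) - 3

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
print(skewness(x), excess_kurtosis(x))
```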
Figure 2.9 shows a comparison of the performance of twelve different outlier detec-
tion algorithms on a benchmark dataset available in PyOD. The errors associated with
each method are given in the figure within parentheses. The results clearly show that
Fig. 2.9 The comparison of the performance of twelve different outlier detection algorithms based
on benchmark dataset available in PyOD
a trial-and-error-based approach using different methods is required for the robust
identification of outliers. The code to reproduce the results can be obtained from
https://github.com/yzhao062/pyod/blob/master/notebooks/benchmark.py. Here, we
will discuss some of the commonly used outlier detection methodologies that are
model agnostic, that is, those that do not depend on the ML model.
In this method, we first calculate the mean and standard deviation of the data. A data
point is identified as an outlier if it is away from the mean by a pre-specified threshold
in terms of the standard deviation. That is, if a data point $x_i$ has a $Z$-score
satisfying
$$ \frac{|x_i-\bar{x}|}{S} > k, \quad \text{where } k = 1, 2, \text{ or } 3 \quad (2.5) $$

then it is detected as an outlier. In other words, a data point is an outlier if it lies
beyond $\bar{x} \pm kS$. For a normally distributed dataset, the intervals $\bar{x} \pm 1S$,
$\bar{x} \pm 2S$, and $\bar{x} \pm 3S$ cover 68.27%, 95.45%, and 99.73% of the data,
respectively. However, this method can fail to detect outliers if $S$ is large.
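The Z-score rule of Eq. (2.5) can be sketched in a few lines of NumPy; the data below are synthetic, with one implanted outlier for illustration:

```python
import numpy as np

def zscore_outliers(x, k=3):
    """Flag points whose |x - mean| exceeds k sample standard deviations, cf. Eq. (2.5)."""
    x = np.asarray(x, dtype=float)
    z = np.abs(x - x.mean()) / x.std(ddof=1)
    return z > k

rng = np.random.default_rng(1)
# 200 well-behaved synthetic points plus one implanted outlier at 95.0.
data = np.append(rng.normal(loc=70.0, scale=2.0, size=200), 95.0)
mask = zscore_outliers(data, k=3)
print("flagged values:", data[mask])
```

Note that, as stated above, a single extreme point in a small sample can inflate $S$ enough that the method misses it; the median-based statistics below are more robust in that case.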
The median is a central measure of the data that is less susceptible to outliers. The
median absolute deviation (MAD) is the median of the absolute differences between
each point and the median $M$:

$$ MAD = \operatorname{median}\big(|x_i - M|\big) \quad (2.6) $$

The corresponding modified $Z$-score is

$$ Z_M = \frac{0.6745\,(x_i - M)}{MAD} \quad (2.7) $$

As a rule of thumb, if $|Z_M|$ is greater than 3, the point is flagged as an outlier.
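A minimal implementation of the modified Z-score of Eq. (2.7), again on small synthetic data with one implanted outlier:

```python
import numpy as np

def mad_outliers(x, threshold=3.0):
    """Modified Z-score rule, cf. Eq. (2.7): |0.6745 (x_i - M) / MAD| > threshold."""
    x = np.asarray(x, dtype=float)
    m = np.median(x)
    mad = np.median(np.abs(x - m))  # Eq. (2.6)
    z_m = 0.6745 * (x - m) / mad
    return np.abs(z_m) > threshold

data = np.array([70.1, 69.8, 70.4, 70.0, 69.9, 70.2, 95.0])  # 95.0 is the outlier
mask = mad_outliers(data)
```

Unlike the mean-based rule, the implanted point cannot inflate the median or the MAD, so it is flagged reliably even in this seven-point sample.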
The interquartile range (IQR) is a measure of spread analogous to the range and is
computed by subtracting the first quartile from the third quartile:

$$ IQR = Q_3 - Q_1 \quad (2.8) $$

The IQR can be used to detect outliers as follows: a data point $x_i$ is flagged as an
outlier if it lies below $Q_1 - 1.5\,IQR$ or above $Q_3 + 1.5\,IQR$. Detecting outliers
that fall within this range requires the use of more sophisticated outlier detection
algorithms mentioned earlier or an outlier ensemble combining multiple of these
algorithms.
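The IQR fences can be sketched as follows; the 1.5 multiplier is the conventional choice and can be tuned:

```python
import numpy as np

def iqr_outliers(x, factor=1.5):
    """Tukey-style fences: flag points below Q1 - factor*IQR or above Q3 + factor*IQR."""
    x = np.asarray(x, dtype=float)
    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1  # Eq. (2.8)
    return (x < q1 - factor * iqr) | (x > q3 + factor * iqr)

data = np.array([70.1, 69.8, 70.4, 70.0, 69.9, 70.2, 95.0])  # 95.0 is the outlier
mask = iqr_outliers(data)
```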
If the dataset is small or imbalanced, training the desired model can be difficult.
This is because very little information can be extracted from small datasets, and
data-driven modeling generally depends on sufficiently large datasets for information
extraction. To address this issue, data augmentation may be performed, which
generates artificial data points and adds them to the dataset. The
main idea of data augmentation techniques is to learn the statistical features and
underlying distribution of the data. Some typical approaches for imputing missing
data are based on the mean and median of the data. For instance, the missing value
for each feature is replaced with the mean or the median of non-missing values of the
respective features. Note that this method does not consider the correlation between
features and does not account for the uncertainty in the imputation. Another approach
is to use kNN for data imputation. Here, first, kNN is used to identify the clusters
based on the original data points. Then a new point is assigned based on how closely
it resembles the points in the selected cluster. For instance, the mean of the input
features of a given cluster can be used to generate a new data point. In addition, the
mean of selected points in the cluster can also be used to generate new points. The
advantage of this approach is that the new point generated will also lie within the
cluster and hence will not disturb the distribution of the data. This idea has been
improved to develop more sophisticated algorithms allowing the data imputation in
targeted regions such as SMOTE, as outlined below.
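The mean-based and kNN-based imputation strategies described above are both available in scikit-learn. The sketch below applies them to a small, hypothetical composition table with missing entries marked as `np.nan`:

```python
import numpy as np
from sklearn.impute import SimpleImputer, KNNImputer

# Hypothetical composition table (columns could be oxide fractions, wt.%)
# with missing entries represented by np.nan.
X = np.array([
    [72.0, 10.0, np.nan],
    [70.0, 12.0, 14.0],
    [71.0, np.nan, 15.0],
    [55.0, 25.0, 18.0],
])

# Column-wise mean imputation: ignores correlations between features.
mean_imputed = SimpleImputer(strategy="mean").fit_transform(X)

# kNN imputation: fills a missing value from the nearest rows,
# measured on the features that are observed.
knn_imputed = KNNImputer(n_neighbors=2).fit_transform(X)
```

The kNN variant keeps imputed points close to their cluster, as discussed above, at the cost of a neighbor search over the dataset.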
There are several ML algorithms used for data imputation that directly take into
account the distribution of the data. These approaches may be especially necessary if
there is a significant data imbalance. For instance, suppose that, from a dataset of
images of concrete, we aim to identify the images with and without fracture. The
number of images with fracture may be significantly smaller than those without, say
in the ratio 1:99. In such cases, one easy approach is undersampling, wherein we
remove images from the larger class to make the data balanced. However, this
approach leads to a loss of information and suboptimal use of the dataset. An
alternate approach is to use oversampling algorithms. A popular one is the synthetic
minority oversampling technique (SMOTE). SMOTE oversamples, that is, artificially
synthesizes, data from the smaller class until the data is sufficiently balanced.
Figure 2.10 shows the SMOTE oversampling approach used for imbalanced data. In
SMOTE, the n nearest neighbors in the minority class are identified for each sample
in that class. Then, a line is drawn connecting two such points in the minority class,
and new data are generated by randomly identifying points on this line. Note that the
new point can be the midpoint or any other point along the line connecting the two
points. Because an infinite number of points can be identified along a line, this
approach allows the generation of new points until the
Fig. 2.10 Oversampling for minority data using SMOTE. Reproduced from [7]
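The core SMOTE step, interpolating between a minority sample and one of its nearest minority-class neighbors, can be sketched in plain NumPy. This is a simplified illustration on hypothetical 2-D points, not the full algorithm; in practice, a library implementation such as the `SMOTE` class in imbalanced-learn is typically used:

```python
import numpy as np

def smote_sample(minority, n_new, k=3, rng=None):
    """Simplified SMOTE sketch: each synthetic point lies on the line between
    a random minority point and one of its k nearest minority neighbours."""
    rng = np.random.default_rng(rng)
    minority = np.asarray(minority, dtype=float)
    new_points = []
    for _ in range(n_new):
        i = rng.integers(len(minority))
        # distances from point i to every other minority point
        d = np.linalg.norm(minority - minority[i], axis=1)
        neighbours = np.argsort(d)[1:k + 1]      # skip the point itself
        j = rng.choice(neighbours)
        lam = rng.random()                       # random position on the connecting line
        new_points.append(minority[i] + lam * (minority[j] - minority[i]))
    return np.array(new_points)

# Hypothetical minority-class samples in a 2-D feature space.
minority = np.array([[1.0, 1.0], [1.2, 0.9], [0.9, 1.1], [1.1, 1.2]])
synthetic = smote_sample(minority, n_new=10, rng=0)
```

Because every synthetic point is a convex combination of two minority points, the augmented data stay inside the minority cluster, as described above.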
2.6 Summary
solution, demanding careful evaluation and thorough analysis tailored to each dataset.
Notably, comprehensive outlier detection involving multiple methods should be
conducted after meticulous data visualization. Extensive evidence supports the
notion that improved outlier detection significantly enhances the performance of
machine learning algorithms. Consequently, the focus is shifting toward attaining
high-quality data rather than solely developing advanced machine learning algo-
rithms for improved outcomes.
References
Abstract Machine learning algorithms can be broadly divided into three categories
depending on the nature of the “learning” process, namely, supervised, unsupervised,
and reinforcement learning. In this chapter, we introduce these different categories
with a focus on the nature of the tasks for which these algorithms are useful. Specif-
ically, we focus on supervised and unsupervised learning algorithms. The supervised
algorithms may further be classified as parametric and non-parametric algorithms
depending on the mathematical model used to fit the data. These algorithms may
be used for several downstream tasks such as classification, regression, or cluster-
ing. Finally, we discuss the idea of overfitting and underfitting in machine learning
algorithms.
There is really nothing you must be. And there is nothing you must do. There is really
nothing you must have. And there is nothing you must know. There is really nothing you
must become. However, it helps to understand that fire burns, and when it rains, the earth
gets wet.
The concept of learning from data is deeply rooted in human history, predating
the term “machine learning,” coined in the mid-twentieth century. In fact, learning
from data is a fundamental process deeply ingrained in human cognition. It involves
extracting knowledge and understanding from information gathered through obser-
vation and experience. This ability allows us to recognize patterns, make predic-
tions, and adapt our behavior accordingly. The concept of learning from data has
been leveraged in various domains to develop computational methods that emulate
this human learning process. Consequently, machine learning refers to a broad class
of algorithms that focus on extracting patterns from data, enabling the inference of
meaningful insights. These algorithms continuously self-correct and improve through
experience, leveraging the provided data. Consequently, they are commonly referred
to as data-driven methods. By adopting an approach akin to human learning, machine
learning algorithms emulate the ability to acquire knowledge from data. For example,
a domain expert can effortlessly determine the number of grains in the microstructure
Machine learning encompasses a diverse set of algorithms and approaches that enable
computers to learn from data, recognize patterns, and make predictions or decisions.
Figure 3.1 shows the machine learning framework and some of the popular algorithms
in each of the categories. By categorizing machine learning algorithms into unsuper-
vised learning, supervised learning, and reinforcement learning, we can effectively
leverage these approaches for materials discovery and research. Note that the large
number of available ML methods can make it overwhelming for a researcher to
choose an appropriate one. While there are no strong mathematical guidelines,
there are several experience-based rules of thumb one can follow while selecting the
model. To this end, the guideline provided by scikit-learn is shown in Fig. 3.2.
Note that this is simply a guideline and should not be treated as a strict rule.
Unsupervised learning algorithms are particularly valuable when dealing with unla-
beled data, where the objective is to uncover hidden patterns or structures within
the dataset. In the context of materials science, unsupervised learning algorithms
Fig. 3.1 The machine learning paradigm includes supervised, unsupervised, and reinforcement
learning algorithms. Some of the algorithms belonging to each class are included
Fig. 3.2 Workflow for the choice of ML algorithm based on the dataset size and model complexity,
based on the scikit-learn tutorial
can assist in data exploration, clustering, and anomaly detection. Some notable algo-
rithms include:
• Clustering: Clustering algorithms, such as k-means clustering, hierarchical cluster-
ing, and density-based clustering, group similar data points together based on their
feature similarities. This allows for the identification of distinct clusters within the
data, enabling insights into materials classifications or compositions.
50 3 Introduction to Machine Learning
Machine learning models can be broadly categorized into two types: parametric mod-
els and non-parametric models. These models are designed to learn from data and
make predictions or classifications based on the patterns and information present in
the dataset. Understanding the differences between parametric and non-parametric
models is essential in grasping the fundamentals of machine learning and their respec-
tive applications, particularly in the field of materials science.
Parametric models make strong assumptions about the underlying distribution of the
data. These models have a fixed number of parameters that determine their behavior
and are independent of the size of the dataset. In materials science, parametric models
can be applied to understand the relationship between material properties and specific
features.
For example, consider the case of predicting the mechanical strength of a material
based on its composition. A parametric model like linear regression can assume a lin-
ear relationship between the elemental composition and the mechanical strength. The
model’s parameters, such as the coefficients associated with each element, represent
the influence of the composition on the mechanical strength. Once the parameters are
estimated from a training dataset containing material compositions and correspond-
ing mechanical strengths, the model can make predictions on new compositions by
calculating the weighted sum of the elemental contributions.
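This workflow can be sketched with scikit-learn on synthetic data, where the strength is generated, by assumption, as a linear function of two hypothetical composition fractions; the fitted coefficients then recover the assumed influence of each element:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data: strength assumed linear in two hypothetical composition fractions.
rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.0, size=(50, 2))            # e.g. fractions of two elements
true_coef = np.array([120.0, 40.0])                # assumed elemental contributions
y = X @ true_coef + 30.0 + rng.normal(0, 1.0, 50)  # strength with measurement noise

model = LinearRegression().fit(X, y)
# model.coef_ estimates the per-element contributions; model.intercept_ the baseline.
prediction = model.predict([[0.5, 0.5]])           # weighted sum of contributions
```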
Parametric models offer computational efficiency and interpretability, which can
be valuable in materials science. However, they may struggle to capture complex,
nonlinear relationships between the features and the target variable if the underlying
assumptions do not hold.
Non-parametric models, on the other hand, make fewer assumptions about the under-
lying distribution of the data. These models have a flexible structure that can adapt
to the complexity of the dataset. In materials science, non-parametric models can
be used to capture intricate relationships between material properties and multiple
features.
For instance, let’s consider the task of predicting the bandgap energy of a material
based on its crystal structure and elemental composition. A non-parametric model
like the k-nearest neighbors (KNN) algorithm does not impose any specific form of
the relationship. Instead, it identifies the k closest neighbors in the training dataset
with similar crystal structures and compositions and predicts the bandgap energy
based on their average values.
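This idea can be sketched with scikit-learn's `KNeighborsRegressor`, using a synthetic nonlinear target as a stand-in for real bandgap data:

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# Synthetic data: a nonlinear target with no assumed functional form.
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(200, 2))       # e.g. two hypothetical structural descriptors
y = np.sin(3 * X[:, 0]) + X[:, 1] ** 2     # stand-in for a bandgap-like property

knn = KNeighborsRegressor(n_neighbors=5).fit(X, y)
pred = knn.predict([[0.5, 0.5]])           # average of the 5 nearest training targets
```

No parametric form is fitted: the prediction is simply the mean target of the five most similar training points, so accuracy depends on having dense enough training data near the query.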
Non-parametric models excel in capturing complex patterns and relationships in
materials data, even when the underlying distribution is unknown. However, they can
be computationally more intensive and require larger training datasets to generalize
well.
When the underlying relationship is relatively simple or well understood, a parametric
model like linear regression can provide efficient and interpretable results. For
instance, if previous research indicates a linear correlation between the concentration
of a dopant element and the electrical conductivity of a material, a parametric model
can capture this relationship effectively.
On the other hand, when dealing with complex material systems or when the
underlying relationship is not well understood, non-parametric models like KNN or
decision trees may be more suitable. These models can adapt to the intricacies of the
dataset and handle nonlinear relationships between material properties and features.
It is important to note that the choice of model in materials science is a data-
driven and iterative process. Researchers often experiment with different models
and evaluate their performance using metrics such as mean squared error or accu-
racy. Techniques like cross-validation can also be employed to assess how well a
model generalizes to unseen data. Further details of model training, validation, and
hyperparametric optimizations are discussed in detail later.
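For example, k-fold cross-validation can be run in a few lines with scikit-learn; the data here are synthetic, and mean squared error is used as the metric:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic regression data for illustration.
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(100, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(0, 0.1, 100)

# 5-fold cross-validation: each fold is held out once for evaluation.
scores = cross_val_score(LinearRegression(), X, y, cv=5,
                         scoring="neg_mean_squared_error")
mean_mse = -scores.mean()  # scikit-learn reports negated MSE
```

A small mean held-out error across folds suggests the model generalizes rather than merely memorizing the training data.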
In summary, parametric models make strong assumptions about the data distribu-
tion and have a fixed number of parameters, while non-parametric models are more
flexible and can adapt to complex relationships. The selection of a model in mate-
rials science depends on the characteristics of the dataset and the specific problem
at hand. Understanding the differences between these two types of models enables
researchers to make informed decisions and develop effective machine-learning solu-
tions for materials discovery and property prediction.
Classification and regression models are two major classes of supervised ML algo-
rithms. These models are designed to analyze data and make predictions or classifi-
cations based on the patterns and information present in the dataset.
Classification models are used when the target variable or outcome is categorical
or discrete. These models aim to assign input data points into predefined classes
or categories based on the patterns and characteristics present in the dataset. For
example, classification models can be employed to classify materials based on certain
properties or behaviors. Some major algorithms used for classification are as follows.
• Logistic regression takes a set of independent variables or features related to the
material, such as composition, structural characteristics, or spectroscopic data. It
produces a probability value between 0 and 1, representing the likelihood of the
input belonging to a particular class. For example, logistic regression can be used to
predict whether a material is a conductor or an insulator based on its composition
and electronic structure. The model uses a loss function called logistic loss or
cross-entropy loss to measure the discrepancy between the predicted probabilities
and the true class labels.
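A minimal sketch with scikit-learn, using a synthetic binary label (a stand-in for the conductor/insulator example) defined by a threshold on one feature:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic data: class 1 (say, "conductor") whenever the first feature
# is positive; the second feature is uninformative.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] > 0).astype(int)

clf = LogisticRegression().fit(X, y)  # minimizes the cross-entropy loss
proba = clf.predict_proba([[2.0, 0.0]])[0, 1]  # probability of class 1
label = clf.predict([[2.0, 0.0]])[0]
```

The output is a probability between 0 and 1; a point far from the decision boundary receives a probability close to 1 for its class.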
• Decision trees take a set of input features related to the material, such as compo-
sition, crystal structure, or elemental properties. The output of a decision tree is a
predicted class label or category for a given set of input features. Decision trees
employ various algorithms to optimize the structure of the tree, such as the Gini
impurity or information gain, which determines the splits in the tree that maximize
the separation between different classes. Decision trees can be utilized to classify
materials into different crystal structures based on their elemental composition
and lattice parameters.
• Random forest is an ensemble learning algorithm that combines multiple decision
trees. It takes the same input features as decision trees, consisting of material-
related properties. The output of random forest is the majority vote or average pre-
diction of a set of decision trees within the ensemble, resulting in a final predicted
class label. Random forest combines the individual decision trees by minimizing
the overall classification error or using the entropy-based criterion.
• Support vector machines (SVMs) are powerful classification algorithms that take a
set of input features related to the material, such as structural parameters, composition,
or material descriptors. An SVM aims to find the optimal hyperplane that maximally
separates the classes by minimizing the hinge loss or, equivalently, maximizing the
margin between the classes. SVMs can be applied to classify materials based on their
mechanical properties, such as distinguishing between ductile and brittle materials.
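A linear SVM on two well-separated synthetic clusters (stand-ins for hypothetical "brittle" and "ductile" classes) can be sketched as:

```python
import numpy as np
from sklearn.svm import SVC

# Synthetic two-class data: well-separated Gaussian clusters.
rng = np.random.default_rng(0)
X0 = rng.normal(loc=[-2, -2], scale=0.5, size=(50, 2))  # class 0, e.g. "brittle"
X1 = rng.normal(loc=[2, 2], scale=0.5, size=(50, 2))    # class 1, e.g. "ductile"
X = np.vstack([X0, X1])
y = np.array([0] * 50 + [1] * 50)

svm = SVC(kernel="linear").fit(X, y)  # maximum-margin separating hyperplane
acc = svm.score(X, y)                 # training accuracy on well-separated data
```

Nonlinear boundaries can be obtained the same way by swapping the kernel, e.g. `kernel="rbf"`.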
3.3.2 Regression
Regression models, in contrast to classification models, are used when the target
variable is continuous or numerical. These models aim to predict a value or estimate
a relationship between the input features and the output variable. In materials sci-
ence, regression models can be utilized to predict material properties or performance
metrics.
• Linear regression is a fundamental regression algorithm that takes a set of input
features, such as material composition, processing conditions, or structural proper-
ties. The output of linear regression is a continuous numerical value representing
the predicted property or performance metric. Linear regression minimizes the
sum of squared errors or mean squared errors between the predicted values and
the actual target values.
• Support Vector Regression (SVR) is an extension of SVM for regression tasks. It
takes a set of input features related to the material, such as composition, crystal
3.4 Clustering
material properties, such as strength and ductility, by exploring the design space
and discovering Pareto-optimal solutions.
Reinforcement learning in materials science offers opportunities to optimize var-
ious aspects, such as material synthesis, characterization, design, and control. By
using reinforcement learning algorithms, researchers can discover optimal strate-
gies, policies, or conditions for material development, leading to improved materials
with desired properties, enhanced performance, and reduced costs. Some of the appli-
cations of RL for materials discovery are outlined below.
• Materials Discovery: RL can aid in the discovery of new materials with desired
properties by optimizing the selection of chemical compositions, crystal structures,
or material configurations. For example, RL algorithms can be used to guide the
exploration of the vast compositional and structural space to discover new high-
performance materials. By learning from the feedback on material properties,
RL agents can intelligently navigate through the search space to find promising
candidates. For example, RL can optimize the discovery of novel catalysts by
suggesting compositions and configurations that enhance catalytic activity and
selectivity. By interacting with the environment (chemical reactions) and receiving
rewards based on reaction efficiency or product quality, RL agents can learn to
propose optimal catalyst designs.
• Atomic Structure Optimization: RL can optimize the arrangement of atoms within
a material to minimize energy or maximize desired properties. It can explore the
vast configuration space efficiently and converge to stable or optimized atomic
structures. RL agents can learn from the feedback on energy calculations or prop-
erty evaluations to guide the search toward more favorable atomic arrangements.
For example, RL can optimize the structure of a molecule or a material to achieve
specific properties, such as band gaps or binding energies. By iteratively adjusting
the atomic positions and receiving rewards or penalties based on the calculated
properties, RL agents can learn to converge towards optimized structures.
• Process Optimization: RL can optimize various processes involved in materials
synthesis, such as controlling reaction conditions, optimizing deposition parame-
ters, or fine-tuning manufacturing processes. By learning from the rewards asso-
ciated with process efficiency, product quality, or cost reduction, RL agents can
identify optimal process settings and parameter combinations. For example, RL
can optimize the growth of thin films by controlling deposition parameters such
as temperature, pressure, and precursor flow rates. By exploring the parameter
space and receiving feedback on film quality or desired properties, RL agents can
discover optimal deposition conditions.
• Planning for Automated Materials Synthesis: RL can facilitate the planning and
decision-making in automated materials synthesis systems. RL agents can learn to
optimize the selection of precursor materials, reaction pathways, synthesis condi-
tions, or fabrication steps to achieve desired material properties, performance, or
functionality. For example, RL can optimize the synthesis of complex materials,
such as perovskite compounds, by guiding the selection of precursor materials and
reaction conditions. By receiving rewards based on desired properties or crystal
quality, RL agents can learn to plan the synthesis process and make informed
decisions for efficient and effective material production.
In summary, RL offers promising avenues for materials science applications, includ-
ing materials discovery, atomic structure optimization, process optimization, and
planning for automated materials synthesis. However, RL remains one of the under-
explored areas in machine learning for materials. The adoption of RL algorithms
in materials research necessitates addressing domain-specific considerations, such
as the design of appropriate reward functions tailored to materials properties and
performance. In addition, further research is required to develop specialized RL
algorithms that can handle the intricacies of atomic and molecular systems, account
for the long-range interactions and quantum effects that dictate material behavior,
and effectively optimize the vast combinatorial space of material configurations and
processing parameters.
Similarly, integrating RL with experimental workflows is crucial to bridge the gap
between simulation and real-world materials synthesis and characterization. Devel-
oping RL-driven autonomous systems for intelligent decision-making in materials
synthesis, process optimization, and adaptive experimentation holds great promise
for accelerating materials discovery and optimization. By actively exploring and
advancing RL methodologies specifically tailored for materials science, we can
unlock new opportunities for rational materials design, enable the discovery of novel
materials with tailored properties, optimize complex manufacturing processes, and
ultimately revolutionize the field of materials science and engineering.
3.6 Summary
This chapter introduced the machine learning paradigm and its applications in materi-
als science. We explored unsupervised learning algorithms for clustering and dimen-
sionality reduction, such as k-means, hierarchical clustering, DBSCAN, PCA, and
t-SNE. Supervised learning algorithms for classification and regression tasks in
materials science were also discussed, including linear and logistic regression,
decision trees, random forests, support vector machines, and neural networks.
Additionally, we highlighted
the potential of reinforcement learning (RL) in materials research, though it remains
relatively under-explored. RL algorithms like Q-Learning, DQN, and PPO offer
opportunities for materials discovery, atomic structure optimization, process opti-
mization, and planning in automated materials synthesis. We also discussed para-
metric and non-parametric models and the primary differences among them. The
remaining chapters of this part of the book will discuss these algorithms in detail.
References
4.1 Introduction
Regression analysis is one of the most widely used approaches to predict or fore-
cast the values of a dependent variable as a function of selected independent
variables. Regression analysis finds several applications, such as to predict the
composition–property relationships in materials, or to predict the composition–
processing–structure relationships. For instance, regression analysis can be used to
compute the slope of the stress–strain curve of a material, which represents Young's
modulus of the material. As discussed in Chap. 3, in ML, regression analysis can
be performed using parametric and non-parametric methods. Parametric methods
include traditional approaches such as ordinary least squares (OLS) and its
regularized versions, weighted regression, and LAR, to name a few.
Parametric approaches assume a priori the functional form for the data to
be fitted. For instance, in the framework of linearized elasticity, we assume that the
stress–strain curve of a material follows a straight line passing through the origin, the
slope of which represents Young's modulus. Thus, prior to regression, we assume
the mathematical form of the curve to be $y = mx$, or $\sigma = E\varepsilon$, where $\sigma$ and $\varepsilon$
represent the stress and strain for a uniaxial loading condition. Young's modulus $E$
of the material is the free parameter to be fitted and can be obtained by
regression analysis. Once $E$ is known for a material, the stress associated with any
strain can be computed from the equation $\sigma = E\varepsilon$, provided the strain is within the
elastic limit.
Thus, the basic steps involved in the parametric methods for regression can be
outlined as follows.
1. Identify the dependent and independent variables from the available data. All
possible relevant independent variables should be considered while developing the
model. The selection of these independent parameters can be based on intuition,
the physics of the problem, or expert knowledge.
2. Identify the functional form that fits the data best. These forms can be simple linear
or polynomial forms such as $y = a_0 + a_1 x$ or $y = a_0 + a_1 x + a_2 x^2 + \cdots + a_n x^n$.
In some cases, it may take more complex forms, such as a power law or
exponential, based on the physics of the governing problem. In case the func-
tional form is not known a priori, as in the case of many composition-property
relationships, multiple functional forms can be evaluated on the data, and the best
performing functional form can be selected later on.
3. Identify the free parameters that are to be obtained by regression analysis. In the
earlier example of linear regression, the free parameters are $a_0$ and $a_1$, while in
the case of the polynomial, the free parameters range from $a_0$ to $a_n$.
4. Define a cost function such as the root mean squared errors between the predicted
value and the actual (or measured value), where the predicted value is obtained
from the model considered.
5. Apply analytical or numerical methods to identify the values of free parameters
that minimize the cost function. These values can then be used in the functional
form considered to predict values for unknown cases.
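The five steps above can be sketched end to end for the $\sigma = E\varepsilon$ example, using synthetic stress-strain data and `scipy.optimize.curve_fit` as the least-squares minimizer of step 5:

```python
import numpy as np
from scipy.optimize import curve_fit

# Steps 1-2: strain is the independent variable, stress the dependent one,
# and the assumed functional form is sigma = E * eps (linear elasticity).
rng = np.random.default_rng(0)
strain = np.linspace(0, 0.002, 20)                 # within the elastic limit
E_true = 70e3                                      # MPa; hypothetical alloy value
stress = E_true * strain + rng.normal(0, 0.5, 20)  # synthetic noisy measurements

def model(eps, E):       # Step 3: E is the single free parameter
    return E * eps

# Steps 4-5: curve_fit minimizes the sum of squared residuals numerically.
(E_fit,), _ = curve_fit(model, strain, stress)
```

The fitted `E_fit` can then be used to predict the stress for any strain in the elastic regime via `model(eps, E_fit)`.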
The main advantage of the parametric methods is their interpretability. Owing to
their straightforward functional form, the free parameters provide direct insight into
the weight of each of the independent parameters in governing the output value.
Also, since the functional form is chosen a priori, this approach allows a rational
choice for the mathematical model based on the known physics of the problem, that
is, linear, non-linear, static, or dynamic. Due to these reasons, parametric approaches
have been widely used for more than a century in aiding materials discovery.
In this chapter, we will first focus on the closed-form solution for the gener-
alized form of linear regression. Then we will discuss some iterative approaches,
such as the gradient descent optimizer for solving the regression. Following this,
other approaches inspired by linear regression, such as locally weighted regression,
stepwise regression, and LAR, are discussed. Finally, the chapter concludes with
a discussion on the application of logistic regression to classify data into multiple
labels.
4.2 Closed-Form Solution of Regression
First, we focus on the closed-form solution of the generalized form of the linear regres-
sion problem $y = a_0 + a_1 x$. In the generalized form, the output $y$ can be a function
of multiple input variables $x_i$, their powers $x_i^j$, or any combination thereof. We aim
to derive the analytical solution to this generalized problem, which can then be used
to obtain any particular solutions, for example, simple linear regression.

Consider $m$ samples of training points containing $n$ input variables $x_1, x_2, \ldots, x_n$,
where each $x_i \in \mathbb{R}$, and the corresponding labelled output variable $y \in \mathbb{R}$.
We are interested in developing a linear regression model of the following form:

$$ \hat{y} = \sum_{j=1}^{n} \theta_j\, h(x_j) \quad (4.1) $$

The model presented in Eq. (4.1) encompasses the classes of both linear regression
models as well as regression models that are linear in parameters. The $\hat{\cdot}$ notation is
added to $y$ to identify it as the predicted output $\hat{y}$, which may differ from the true
output value $y$ obtained from experimental measurement or physics-based simulations.
For instance, the function $h(x)$ can be evaluated as $x$ itself for linear regression
problems or as the corresponding nonlinear polynomial functions (such as $x^2$, $x^3$, or
$x^n$), resulting in polynomial regression. Our aim is to develop a model that maximally
represents the available data. With this in view, we consider all the training examples,
and the resulting dataset is represented in tabular form as shown in Table 4.1.
In Table 4.1, $x^{(i,j)}$ represents the $i$th training example for the $j$th variable, and
$y^{(i)}$ represents the corresponding output. For calculating $\theta := [\theta_1, \ldots, \theta_n]^T$,
the following least squares cost function is minimized:

$$ \theta = \min_{\theta} J(\theta) := \frac{1}{2} \sum_{i=1}^{m} \big(\hat{y}^{(i)} - y^{(i)}\big)^2 \quad (4.2) $$
To solve Eq. (4.2), we first transform the dataset in Table 4.1 to maintain a compact
notation scheme. Let the $i$th row of inputs be represented as

$$ h[x(i)] := \big( h(x^{(i,1)}),\; h(x^{(i,2)}),\; \ldots,\; h(x^{(i,j)}),\; \ldots,\; h(x^{(i,n)}) \big)^T \quad (4.3) $$

Then, the input datasets from all the $m$ training samples can be compactly represented
as

$$ h[X] := \begin{pmatrix} h[x(1)]^T \\ h[x(2)]^T \\ \vdots \\ h[x(i)]^T \\ \vdots \\ h[x(m)]^T \end{pmatrix} \quad (4.4) $$

Similarly, the outputs of all $m$ training samples are stacked as

$$ Y := \big( y^{(1)},\; y^{(2)},\; \ldots,\; y^{(m)} \big)^T \quad (4.5) $$
Combining the compact notations in Eqs. (4.3), (4.4), and (4.5), the OLS objective
function in Eq. (4.2) can be rewritten as

$$ \theta = \min_{\theta} J(\theta) := \frac{1}{2}\, \big(h[X]\theta - Y\big)^T \big(h[X]\theta - Y\big) \quad (4.6) $$
Expanding $J(\theta)$ and setting its gradient with respect to $\theta$ to zero gives

$$ \nabla_\theta J(\theta) = \frac{1}{2}\, \nabla_\theta \big( \theta^T h[X]^T h[X]\theta - \theta^T h[X]^T Y - Y^T h[X]\theta + Y^T Y \big) = 0 \quad (4.7) $$

$$ = \frac{1}{2}\, \big( h[X]^T h[X]\theta + h[X]^T h[X]\theta - 2\, h[X]^T Y \big) = 0 \quad (4.8) $$

$$ \implies h[X]^T h[X]\theta = h[X]^T Y \quad (4.9) $$

$$ \implies \theta = \big(h[X]^T h[X]\big)^{-1} h[X]^T Y \quad (4.10) $$
The final value of $\theta$ obtained in Eq. (4.10) represents the optimal values of the free
parameters, obtained by minimizing the error between the values predicted by the
mathematical model and the observed (or experimental) values of the output variable.
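Eq. (4.10) translates directly into NumPy. The sketch below generates synthetic data with known parameters, takes $h(x) = x$ plus an added bias column, and recovers the parameters via the normal equation:

```python
import numpy as np

# Synthetic design matrix H = h[X]: a bias column plus two random features.
rng = np.random.default_rng(0)
m = 100
H = np.column_stack([np.ones(m), rng.uniform(-1, 1, size=(m, 2))])
theta_true = np.array([1.0, 2.0, -0.5])
Y = H @ theta_true + rng.normal(0, 0.01, m)   # outputs with small noise

# Eq. (4.10): theta = (H^T H)^{-1} H^T Y
theta = np.linalg.inv(H.T @ H) @ (H.T @ Y)
```

In practice, `np.linalg.lstsq(H, Y, rcond=None)` is the numerically safer route, as it avoids forming and inverting $H^T H$ explicitly, which connects to the rank-deficiency and instability issues discussed next.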
Even though the closed-form expression for $\theta$ provided in Eq. (4.10) is quite convenient, there are several issues associated with it, as outlined below.
1. Existence of an inverse: If the matrix $h[X]^T h[X]$ is rank deficient, the inverse $(h[X]^T h[X])^{-1}$ will not exist. This may occur if any two rows or columns of $h[X]$ are linearly dependent on each other, which is quite likely in a large dataset.
2. Numerical instabilities: If there are a large number of entries in $h[X]$, this may result in numerical instabilities while calculating the inverse $(h[X]^T h[X])^{-1}$. Numerical instabilities while computing the inverse can also occur if there are huge (order of magnitude) variations in the dataset, which is quite likely in a realistic dataset.
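The closed-form solution of Eq. (4.10), together with numerically safer alternatives for (near) rank-deficient cases, can be sketched as follows. The synthetic design matrix and the choice of NumPy routines are illustrative assumptions, not from the text:

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic, well-conditioned design matrix h[X] (assumption for this sketch)
hX = rng.normal(size=(50, 3))
theta_true = np.array([1.0, -2.0, 0.5])
Y = hX @ theta_true

# Closed-form OLS as in Eq. (4.10): theta = (h[X]^T h[X])^{-1} h[X]^T Y
theta_inv = np.linalg.inv(hX.T @ hX) @ hX.T @ Y

# Numerically safer alternatives when h[X]^T h[X] is (near) rank deficient
theta_pinv = np.linalg.pinv(hX) @ Y
theta_lstsq, *_ = np.linalg.lstsq(hX, Y, rcond=None)
```

In practice, `lstsq` or the pseudo-inverse is preferred over an explicit matrix inverse precisely because of the rank-deficiency and conditioning issues listed above.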
In short, while dealing with a large amount of data where one expects variables to have correlations with each other, the closed-form expression given by Eq. (4.10), though seemingly simpler for calculations, may yield undesirable results. To address these issues, iterative approaches are commonly used to obtain a solution closest to the minimum, as discussed in the next section.
4.3 Iterative Approaches for Regression

While closed-form solutions are simpler and desirable for computing the optimal solution, in practical cases, they often fail, as mentioned earlier. In such cases, one needs to opt for iterative approaches for solving the least squares objective function given by Eq. (4.2). The canonical algorithm for iteratively solving Eq. (4.2) is the gradient descent algorithm, also known as the steepest descent algorithm.
Figure 4.1 shows the gradient descent algorithm applied to a simple parabolic function to find its minimum. In a gradient descent algorithm, one starts with an initial guess for the free parameters $\theta$ and repeatedly updates the guess value of $\theta$ such that $J(\theta)$ in Eq. (4.2) is minimized. The direction for the iterative search will be along the rate of change of $J(\theta)$, represented by $\frac{\partial J(\theta)}{\partial \theta_j}$, to reach the minimum in the shortest possible steps. The parameter update stalls when the gradient of $J(\theta)$ is zero or close to zero, indicating that a minimum is achieved. The gradient descent update for the $j$th parameter, $\theta_j$, is given by
$$\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta_j), \quad j = 1, \ldots, n \tag{4.11}$$
The negative sign in the update rule in Eq. (4.11) indicates that the search proceeds along the direction of the negative gradient, that is, towards the minimum. The parameter $\alpha \in (0, 1)$ in Eq. (4.11) is called the learning rate or the forgetting factor. A higher value of $\alpha$ results in long search steps, and a lower value of $\alpha$ in short search steps. The value of $\alpha$ has to be tuned properly to achieve an optimal trade-off between accurately reaching the minimum and fast convergence of the algorithm. Code Snippet 4.1 can be used to reproduce the results and the plots in Fig. 4.1.
Note: Authors recommend the readers to run the code with different values of
. α to understand the role of learning rate in converging towards the minimum.
In this subsection, we derive the gradient descent approach for computing the parameters of the regression model given by Eq. (4.1). This method is also known as the least mean square (LMS) update or the Widrow-Hoff learning rule. Assume we have
""" G r a d i e n t D e s c e n t
"""
# i m p o r t m a t p l o l i b for p l o t t i n g data
import matplotlib
m a t p l o t l i b . use ( ' agg ')
i m p o r t m a t p l o t l i b . p y p l o t as plt
from m a t p l o t l i b . p y p l o t i m p o r t s a v e f i g
# import numpy
i m p o r t n u m p y as np
# P a r a b o l a with its m i n i m u m at x = 2 . 0
def f ( x ) :
r e t u r n 0 . 5 * ( x - 2 ) ** 2
# D e f i n e f u n c t i o n to c a l c u l a t e g r a d i e n t
def g r a d i e n t _ f ( x ) :
return x-2
# I n i t i a l p o s i t i o n at x = 6 . 0
x0 = 6 . 0
# define step size
stepsize = 0.1
# update position
def x_new ( x0 , stepsize , g r a d i e n t ) :
r e t u r n x0 - s t e p s i z e * g r a d i e n t
# I t e r a t e this for 100 steps
x _ v a l u e s = [ x0 ]
f _ v a l u e s = [ f ( x0 ) ]
x_ = x0
for i in r a n g e ( 100 ) :
g r a d i e n t = g r a d i e n t _ f ( x_ )
x_ = x_new ( x_ , stepsize , g r a d i e n t )
x _ v a l u e s . a p p e n d ( x_ )
f _ v a l u e s . a p p e n d ( f ( x_ ) )
# Print values
p r i n t ( ' I n i t i a l x : { :. 3f } , f ( x ) = { :. 3f } '. f o r m a t ( x0 , f ( x0 ) ) )
p r i n t ( ' F i n a l x : { :. 3f } , f ( x ) = { :. 3f } '. f o r m a t ( x_ , f ( x_ ) ) )
# S c a t t e r plot using m a t p l o t l i b
xs = np . a r a n g e ( - 3 , 7 , 0 . 01 )
plt . plot ( xs , [ f ( i ) for i in xs ] , c = 'k ' , l a b e l = 'f ( x ) ')
plt . s c a t t e r ( x_values , f_values , c = 'r ' , l a b e l = ' ' , s = 80 , lw = 0 .5 , ec = 'k ')
plt . xlim ( [ -3 , 7 ] )
plt . x l a b e l ( " x " )
plt . y l a b e l ( " f ( x ) " )
plt . text ( x0 , f ( x0 ) , " x0 " , ha = " r i g h t " )
plt . l e g e n d ()
s a v e f i g ( " g r a d i e n t _ d e s c e n t . png " )
p r i n t ( " End of o u t p u t " )
Output:
Initial x: 6.000, f(x) = 8.000
Final x: 2.000, f(x) = 0.000
End of output
only one training example $(x, y)$. In this case, the gradient term $\frac{\partial}{\partial \theta_j} J(\theta_j)$ for the regression model Eq. (4.1) is computed as
$$\frac{\partial}{\partial \theta_j} J(\theta) = \frac{\partial}{\partial \theta_j} \frac{1}{2} \left( h_{\theta}(x) - y \right)^2, \quad j = 1, \ldots, n \tag{4.12}$$
$$= 2 \times \frac{1}{2} \left( h_{\theta}(x) - y \right) \frac{\partial}{\partial \theta_j} \left( h_{\theta}(x) - y \right) \tag{4.13}$$
$$= \left( h_{\theta}(x) - y \right) \frac{\partial}{\partial \theta_j} \left( \sum_{i=1}^{n} \theta_i x_i - y \right) \tag{4.14}$$
$$= \left( h_{\theta}(x) - y \right) x_j \tag{4.15}$$
Using Eq. (4.15), the gradient descent update for the $i$th training example is given as
$$\theta_j := \theta_j + \alpha \left( y^{(i)} - h_{\theta}(x^{(i)}) \right) x_j^{(i)} \tag{4.16}$$
However, the problem with Eq. (4.16) is that the parameter update is performed considering only one training example. If there is more than one training example, which is usually the case, we need to generalize this method. To this end, one approach is to update the parameter $\theta$ considering the contributions from all the training examples. This approach is called the batch LMS update, as indicated in Algorithm 1.
In batch LMS, the magnitude of the update is proportional to the error, given by $(y^{(i)} - h_{\theta}(x^{(i)}))$. This means that a more significant change to the parameters will be made when $h_{\theta}(x^{(i)})$ deviates more from $y^{(i)}$. Similarly, for a training example on which the prediction nearly matches the actual value of $y^{(i)}$, the parameter change is minimal. Note that LMS can be susceptible to local minima and may converge to the nearest local minimum instead of finding the globally optimal solution. However, this is not a concern in the case of linear regression, as its loss function has a single global minimum. As such, the gradient descent approach always converges to the global solution in linear regression.
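A minimal batch LMS sketch on synthetic noiseless data follows; the dataset, learning rate, and iteration count are illustrative assumptions of this sketch, not values from the text:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 2))       # m = 100 examples, n = 2 features
theta_true = np.array([2.0, -1.0])
y = X @ theta_true                  # noiseless targets for this sketch

theta = np.zeros(2)
alpha = 0.05                        # learning rate (illustrative value)
for _ in range(500):
    # Batch LMS: accumulate the error over all m examples, then update once
    grad = X.T @ (X @ theta - y) / len(X)
    theta -= alpha * grad
```

Every pass over the data performs a single parameter update using all $m$ examples, which is exactly what makes the batch variant expensive for large training sets.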
Figure 4.3 shows the example of a linear regression on a set of data points represented by open circles. The equation of a straight line passing through the origin is $\hat{y} = h_{\theta}(x) = mx$, where $x$ and $y$ correspond to the independent and dependent variables, respectively, and $m$ is the slope, which is the free parameter to be fitted. First, we start with an initial value of $m$, $m_0 = 0$. This corresponds to a line parallel to the x-axis. The error or loss is then calculated as
$$\text{loss} = \sum_{i=1}^{n} \frac{1}{2} \left( y^{(i)} - h_{\theta}(x^{(i)}) \right)^2 = \sum_{i=1}^{n} \frac{1}{2} \left( y^{(i)} - 0 \times x^{(i)} \right)^2 = \sum_{i=1}^{n} \frac{1}{2} \left( y^{(i)} \right)^2 \tag{4.17}$$
In this case, the loss comes out to be a large value of about 350 k (see Fig. 4.3). The value of the slope $m$ is then updated to a positive value $m_1$ along the direction of decreasing loss as provided by the gradient. Consequently, we observe that the error decreases. The procedure is continued until the loss converges to the minimum. Note that the step size, as marked by the change in the value of $m$, also decreases with decreasing loss. The final fitted line represents the optimal model with minimum error regressed through the points considered. Code Snippet 4.2 can be used to reproduce the results and plots.
Note: The authors recommend the readers run the code for different initial values
of .m and learning rate .lr to understand the effects of initial values and learning
rate in converging towards the global minimum.
In batch LMS, the algorithm scans through every datapoint present in the training set at every single step. Thus, the entire training set is considered before an update is performed on the parameters, leading to a computationally expensive operation if the training dataset is large. To address this issue, a stochastic version of the LMS is used. In stochastic LMS, each time a training datapoint is encountered, the parameters are updated according to that single training datapoint only. Therefore, stochastic LMS can start the updates right away from the first training datapoint itself. Thus, stochastic LMS enables faster convergence to the optimal $\theta$ values in comparison to batch LMS. However, it should be noted that stochastic LMS suffers from the disadvantage that it may never converge to the actual minimum. Because the update in each step of stochastic LMS is based on a single training datapoint instead of the entire training sample, the parameter values will be such that $J(\theta)$ oscillates around the minimum without necessarily converging to the exact value. Nevertheless, when the training set is large, stochastic LMS is often preferred over batch LMS, considering the trade-off between the accuracy of the solution and the computational cost.
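The stochastic update can be sketched as follows; the synthetic noiseless data, learning rate, and random shuffling of the visiting order are assumptions of this sketch:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 2))
theta_true = np.array([1.5, -0.5])
y = X @ theta_true                  # noiseless targets for this sketch

theta = np.zeros(2)
alpha = 0.01                        # learning rate (illustrative value)
for epoch in range(20):
    for i in rng.permutation(len(X)):
        # Widrow-Hoff rule: update from this single datapoint only
        theta = theta + alpha * (y[i] - X[i] @ theta) * X[i]
```

Each datapoint triggers an immediate update, so useful progress is made long before the full dataset has been seen even once.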
""" L i n e a r R e g r e s s i o n
"""
# i m p o r t m a t p l o l i b for p l o t t i n g data
import matplotlib
m a t p l o t l i b . use ( ' agg ')
i m p o r t m a t p l o t l i b . p y p l o t as plt
from m a t p l o t l i b . p y p l o t i m p o r t s a v e f i g
i m p o r t m a t p l o t l i b . p a t c h e s as p a t c h e s
# i m p o r t n u m p y and s k l e a r n
i m p o r t n u m p y as np
i m p o r t p a n d a s as pd
# Create sample dataset
X = np . a r a n g e ( - 10 , 11 , 1 )
X_ = np . a r a n g e ( - 20 , 21 , 1 )
y = 30 * X + 10 * np . r a n d o m . r a n d n ( len ( X ) )
def l i n e a r _ m o d e l ( x , m ) :
return m*x
def loss (x , y , m ) :
r e t u r n np . sum ( 0 . 5 * ( l i n e a r _ m o d e l ( x , m ) - y ) ** 2 )
def l o s s _ g r a d i e n t ( x , y , m ) :
temp = l i n e a r _ m o d e l ( x , m ) - y
g_m = np . sum ( x * temp )
r e t u r n g_m
m = 0
lr = 0 . 0002
l o s s _ = [ loss (X , y , m ) ]
m_ = [ m ]
fig , axs = plt . s u b p l o t s ( 1 , 2 )
plt . sca ( axs [ 0 ] )
plt . plot ( X_ , l i n e a r _ m o d e l ( X_ , m ) , c = " k " )
for i in r a n g e ( 50 ) :
g_m = l o s s _ g r a d i e n t ( X , y , m )
m = m - lr * g_m
plt . plot ( X_ , l i n e a r _ m o d e l ( X_ , m ) , c = " k " , a l p h a = 0 . 2 * ( 1 - ( i + 1 ) / 50 ) )
l o s s _ + = [ loss (X , y , m ) ]
m_ + = [ m ]
plt . s c a t t e r ( X , y , s = 60 , ec = " k " , fc = " none " )
plt . x l a b e l ( " $x$ " )
plt . y l a b e l ( " $y = mx$ " )
plt . grid ( ' on ' )
plt . xlim ( - 20 , 20 )
plt . ylim ( - 300 , 300 )
plt . sca ( axs [ 1 ] )
plt . s c a t t e r ( m_ , loss_ , s = 60 , ec = " k " , fc = " none " )
plt . y t i c k s ( [ 100000 , 200000 , 3 0 0 0 0 0 ] , [ " 100k " ," 200k " ," 300k " ] )
plt . x l a b e l ( " $m$ " )
plt . y l a b e l ( " $ l o s s ( m ) $ " )
s a v e f i g ( " l i n e a r _ r e g r e s s i o n . png " )
p r i n t ( " End of o u t p u t " )
Output:
End of output
Code snippet 4.2: Linear regression using gradient descent. (a) A linear equation $y = mx$ is fitted through the training data points represented by open circles. The light lines represent the models corresponding to different values of $m$, namely, $m_0, m_1, m_2, \ldots, m_n$, obtained during each update step
Assume that the output and the inputs are related linearly as
$$y^{(i)} = \theta^T x^{(i)} + \epsilon^{(i)} \tag{4.18}$$
where $\epsilon^{(i)}$ is an error term that captures either unmodeled effects or random noise. The $\epsilon^{(i)}$ are independently and identically distributed (IID) according to a Gaussian distribution with mean 0 and standard deviation $\sigma$, that is, $\epsilon^{(i)} \sim \mathcal{N}(0, \sigma^2)$. Hence, the probability distribution of $\epsilon^{(i)}$ can be given as
$$p(\epsilon^{(i)}) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left( -\frac{(\epsilon^{(i)})^2}{2\sigma^2} \right) \tag{4.19}$$
From Eq. (4.18), $\epsilon^{(i)}$ can be written as $y^{(i)} - \theta^T x^{(i)}$. Incorporating this in Eq. (4.19), the probability of a $y^{(i)}$ given $x^{(i)}$ can be written in terms of the parameter $\theta$ as
$$p(y^{(i)} \mid x^{(i)}; \theta) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left( -\frac{(y^{(i)} - \theta^T x^{(i)})^2}{2\sigma^2} \right) \tag{4.20}$$
Extending this expression to $X$, which contains all the $x^{(i)}$'s, the distribution of the $Y$ comprising all the $y^{(i)}$ is given by $p(Y \mid X; \theta)$. The explicit representation of these distributions as a function of $\theta$ is expressed in terms of the likelihood function $L(\theta)$, given by
$$L(\theta) = L(\theta; X, Y) = \prod_{i=1}^{m} p(y^{(i)} \mid x^{(i)}; \theta) \tag{4.21}$$
Considering the IID assumption on the $\epsilon^{(i)}$, $L(\theta)$ can be written as
$$= \prod_{i=1}^{m} \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left( -\frac{(y^{(i)} - \theta^T x^{(i)})^2}{2\sigma^2} \right) \tag{4.22}$$
Similar to minimizing the loss function in OLS regression, to obtain the optimal values of $\theta$, here we apply maximum likelihood estimation. In the maximum likelihood estimation of $\theta$, we obtain the values of $\theta$ that maximize $L(\theta)$. In practice, this approach is challenging due to the presence of the exponential in the $L(\theta)$ term. To address this issue, the logarithm of $L(\theta)$ is considered, which reduces the expression to a more straightforward form. In addition, the logarithm, being a monotonically increasing function, will not alter the maximal points. Thus, we calculate the log-likelihood $\ell(\theta) = \ln(L(\theta))$ for simplicity and tractability as below.
$$\ell(\theta) = \ln \left( \prod_{i=1}^{m} \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left( -\frac{(y^{(i)} - \theta^T x^{(i)})^2}{2\sigma^2} \right) \right) \tag{4.23}$$
$$= \underbrace{m \ln \frac{1}{\sqrt{2\pi}\,\sigma}}_{\text{constant}} - \underbrace{\frac{1}{\sigma^2}}_{\text{scaling}} \cdot \frac{1}{2} \sum_{i=1}^{m} \left( y^{(i)} - \theta^T x^{(i)} \right)^2 \tag{4.24}$$
Hence, maximizing $\ell(\theta)$ is equivalent to minimizing $\frac{1}{2} \sum_{i=1}^{m} (y^{(i)} - \theta^T x^{(i)})^2$, the least squares objective. In other words, under the probabilistic assumptions on the data, OLS regression corresponds to finding the maximum likelihood estimate of $\theta$.
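This equivalence can be checked numerically: maximizing the log-likelihood of Eq. (4.24) amounts to minimizing the squared-error term, whose minimizer over a grid of candidate parameters coincides with the least-squares estimate. The one-parameter model, dataset, and grid below are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.normal(size=(100, 1))
y = 3.0 * x[:, 0] + 0.1 * rng.normal(size=100)

# OLS estimate of the single coefficient
theta_ols, *_ = np.linalg.lstsq(x, y, rcond=None)

# The squared-error term of Eq. (4.24), evaluated on a grid of thetas;
# its minimizer is the maximum likelihood estimate
thetas = np.linspace(2.5, 3.5, 2001)
sq_err = [0.5 * np.sum((y - t * x[:, 0]) ** 2) for t in thetas]
theta_mle = thetas[int(np.argmin(sq_err))]
```

Up to the grid resolution, `theta_mle` agrees with the OLS solution, as the derivation above predicts.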
4.4 Locally Weighted Linear Regression (LWR)

Although linear regression fits a straight line, with minor modifications, OLS can be used for nonlinear data as well. This approach is based on the assumption that a nonlinear function can be approximated as a combination of several piece-wise linear functions. This approach is known as the linearisation of nonlinear functions. LWR is an approach inspired by the linearisation of nonlinear functions, having multiple instance-based models for a particular dataset. In linear regression, we fit $\theta$ to minimize the loss function $\sum_i (y^{(i)} - \theta^T x^{(i)})^2$ during training and calculate the output as $\theta^T x$. In contrast, LWR is an online algorithm. Here, during the training process, we fit $\theta$ to minimize $\sum_i w^{(i)} (y^{(i)} - \theta^T x^{(i)})^2$, and the prediction of the output is calculated as $\theta^T x$, where the $w^{(i)}$ are a set of non-negative ($\geq 0$) weights. Thus, when $w^{(i)}$ takes a large value, the penalization of the loss $(y^{(i)} - \theta^T x^{(i)})^2$ for that particular training datapoint is high, and when $w^{(i)}$ is small, the penalization of the loss $(y^{(i)} - \theta^T x^{(i)})^2$ for that particular datapoint is small. A standard choice of the weight at a particular query point $x$ is
$$w^{(i)} = \exp\left( -\frac{(x^{(i)} - x)^2}{2\tau^2} \right)$$
This results in $w^{(i)} \approx 1$ when $|x^{(i)} - x|$ is small and $w^{(i)} \approx 0$ when $|x^{(i)} - x|$ is large. Here, $\tau$ is the bandwidth parameter that decides how fast the weight of a training datapoint $x^{(i)}$ reduces with its distance from the query point $x$.
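The behaviour of these weights can be illustrated directly; the query point and $\tau$ values below are illustrative choices, not from the text:

```python
import numpy as np

def w(x_i, x_query, tau):
    # Gaussian weight: ~1 for points near the query, ~0 for distant points
    return np.exp(-((x_i - x_query) ** 2) / (2.0 * tau ** 2))

x_query = 0.0
w_near = w(0.1, x_query, tau=1.0)      # a nearby training point
w_far = w(5.0, x_query, tau=1.0)       # a distant training point
w_far_wide = w(5.0, x_query, tau=3.0)  # a larger bandwidth keeps more influence
```

A nearby point receives a weight close to 1, a distant point a weight close to 0, and increasing the bandwidth $\tau$ lets distant points retain more influence on the local fit.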
Figure 4.4 shows the LWR approach on a nonlinearly distributed dataset $(x, y)$. In this particular example, the dataset is generated as a combination of a linear function and a sinusoidal function ($x + x\sin(x/5)$) with some noise added to it. We observe that the LWR is able to fit the nonlinear function very well, albeit considering a linearised form. Code Snippet 4.3 can be used to reproduce the results and the plots. The default values of the slope $m$, learning rate $lr$, and the bandwidth parameter $\tau$ ($t$ in the code) are 1, 0.001, and 1.2, respectively.
Note: Authors recommend the readers to run the code with different values of
. m, .lr , and .t to understand the effects of initial values, learning rate, and varying
bandwidth on the convergence, and the final fit obtained.
4.5 Best Subset Selection for Regression

Although OLS estimates provide reasonable predictions in many cases, there are two reasons why we are often not satisfied with the OLS estimates.
1. Prediction accuracy: The OLS estimates often have low bias but large variance. Prediction accuracy can sometimes be improved by shrinking or setting some coefficients to zero. By doing so, we sacrifice some bias to reduce the variance of the predicted values. This approach may improve the overall prediction accuracy.
2. Interpretability: With a large number of predictors, we often would like to determine a smaller subset of independent variables that exhibit the strongest effects. Reducing the model to a lower number of input features may thus provide improved interpretability. In other words, to get the "big picture", we are often willing to sacrifice some of the small details.
To achieve this, we need to judiciously select a subset of features or independent variables from all the available ones without compromising significantly on the prediction accuracy.
""" L o c a l l y W e i g h t e d L i n e a r R e g r e s s i o n ( LWR )
"""
# i m p o r t m a t p l o l i b for p l o t t i n g data
import matplotlib
m a t p l o t l i b . use ( ' agg ')
i m p o r t m a t p l o t l i b . p y p l o t as plt
from m a t p l o t l i b . p y p l o t i m p o r t s a v e f i g
i m p o r t m a t p l o t l i b . p a t c h e s as p a t c h e s
# i m p o r t n u m p y and s k l e a r n
i m p o r t n u m p y as np
i m p o r t p a n d a s as pd
np . r a n d o m . seed ( 2020 )
# Create sample dataset
X = np . a r a n g e ( - 20 , 21 , 1 )
y_ = X + X * np . sin ( X / 5 )
y = y_ + 2 * np . r a n d o m . r a n d n ( len ( X ) )
def w e i g h t ( t ) :
r e t u r n l a m b d a x , x_i : np . exp ( - ( x_i - x ) ** 2 / 2 . 0 / t ** 2 )
def w e i g h t e d _ l i n e a r _ m o d e l _ (x , X , y , w_func , m = 1 . 0 ) :
w = w_func (x , X)
l o s s _ g = l a m b d a m : np . sum ( - 2 . 0 * w * ( y - m * X ) * X )
m_ = 1 . 0 * m
lr = 0 . 001
for i in r a n g e ( 1000 ) :
m_ = m_ - lr * l o s s _ g ( m_ )
r e t u r n m_ * x
def w e i g h t e d _ l i n e a r _ m o d e l ( x , * args ) :
r e t u r n [ w e i g h t e d _ l i n e a r _ m o d e l _ ( i , * args ) for i in x ]
y _ p r e d = w e i g h t e d _ l i n e a r _ m o d e l (X , X , y , w e i g h t ( 1 . 2 ) )
fig , axs = plt . s u b p l o t s ( 1 , 1 )
plt . s c a t t e r ( X , y , s = 60 , ec = " k " , fc = " none " , l a b e l = " Data p o i n t s " )
plt . s c a t t e r ( X , y_pred , s = 60 , ec = " r " , l a b e l = " LWR " )
plt . x l a b e l ( " $x$ " )
plt . y l a b e l ( " $y$ " )
plt . l e g e n d ()
s a v e f i g ( " l w _ r e g r e s s i o n . png " )
p r i n t ( " End of o u t p u t " )
Output:
End of output
Stepwise regression is the simplest and most intuitive approach for best subset selection. In this approach, we first consider the linear regression model given by $\hat{y} = \theta_0 + \sum_{i=1}^{n} \theta_i x_i + \epsilon$. Then, the best subset regression based on the stepwise approach takes the form $\hat{y} = \beta_0 + \sum_{i=1}^{k} \beta_i x_i + \epsilon$, where $k < n$. The best subset of $k$ features is selected such that the chosen subset of $x_i$, $i \in \{1, 2, \ldots, n\}$, of size $k$ gives the smallest residual sum of squares in comparison to any other subset of $x_i$'s having size $k$. To achieve this, two approaches, namely, forward and backward stepwise regressions, are employed.
In the forward stepwise regression approach, the algorithm starts with fitting the intercept $\beta_0$ first. Then, it evaluates the model performance with a single predictor and the intercept by adding one among all the available features at a time. It then identifies the feature that provides the maximum training score and adds it to the model. This procedure is continued until the model becomes one with $k$ features. Thus, forward stepwise regression sequentially adds into the model the predictors $x_i$ (independent features), one by one, that most improve the fit. Accordingly, the equation for $\hat{y}$ evolves as $\beta_0$, $\beta_0 + \beta_1 x_1$, $\beta_0 + \beta_1 x_1 + \beta_2 x_2$, $\ldots$, $\beta_0 + \sum_{i=1}^{k} \beta_i x_i$. Note that at each step, the feature that provides the best training score among all the remaining features is added in this process. In contrast to the forward approach, the backward stepwise regression approach starts with the full model $\hat{y} = \theta_0 + \sum_{i=1}^{n} \theta_i x_i + \epsilon$. Then, it identifies the feature that has the least impact on the model by removing each of the input features one by one and evaluating the performance of the updated model. Among all the features, the one having the least impact is deleted, and the procedure is continued. Thus, backward stepwise regression sequentially deletes the predictors that have the least impact on the model until it reduces to a model with $k$ features.
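A minimal forward stepwise selection sketch follows; the synthetic dataset and the residual-sum-of-squares criterion are assumptions of this sketch, and the intercept is omitted for brevity:

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(200, 5))
# Only the first two features matter in this synthetic example
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 0.01 * rng.normal(size=200)

def rss(features):
    # Least-squares fit on the chosen columns; return residual sum of squares
    A = X[:, features]
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    r = y - A @ beta
    return float(r @ r)

selected, remaining = [], list(range(5))
for _ in range(2):                  # grow the model to k = 2 features
    best = min(remaining, key=lambda j: rss(selected + [j]))
    selected.append(best)
    remaining.remove(best)
```

At each step the feature giving the lowest RSS (equivalently, the best training score) is added, exactly mirroring the greedy growth described above.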
Forward stagewise regression is a related approach. It starts with an intercept-only model and computes the residual with respect to the current model predictions. Then, the predictor or feature most correlated with the residual is identified by taking the inner product of the input features with the residual. Once the feature is identified, a linear regression is performed considering only the selected feature to identify the coefficient corresponding to this feature. The updated model is then used to compute the residual. This process is continued until none of the variables have a correlation with the residuals. Note that, unlike forward stepwise regression, only the coefficient associated with the single feature most correlated with the residual is adjusted in one update step. Coefficients corresponding to the other variables are not adjusted when this term is added to the model. As a result, forward stagewise regression can take significantly more steps in comparison to stepwise regression and hence is deemed to be slow in fitting. However, it has been found that stagewise regression can be very effective in solving high-dimensional problems.
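A small-increment variant of forward stagewise regression can be sketched as follows. The fixed step size and iteration count are assumptions of this sketch; the version described in the text instead fits the selected coefficient fully at each step:

```python
import numpy as np

rng = np.random.default_rng(5)
X = rng.normal(size=(300, 4))
y = 2.0 * X[:, 0] + 1.0 * X[:, 2] + 0.01 * rng.normal(size=300)

beta = np.zeros(4)
eps = 0.01                            # small per-update increment (assumption)
for _ in range(2000):
    r = y - X @ beta                  # residual of the current model
    corr = X.T @ r                    # inner products of features with residual
    j = int(np.argmax(np.abs(corr)))  # feature most correlated with residual
    beta[j] += eps * np.sign(corr[j])  # nudge only that coefficient
```

Only one coefficient moves per step, which is why stagewise fitting takes many more iterations than stepwise regression.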
Figure 4.2 shows the LAR performed on a dataset with 11 features. Specifically, Fig. 4.2a and b show the values taken by the weights corresponding to each of the input variables in each iteration and the corresponding training R². We observe that at the initial point (0th iteration), all the weights have a value of 0. With every iteration, the weight associated with a different variable gains a non-zero value. Correspondingly, we also observe an increase in the training score. In the 11th iteration, we observe that all the features have non-zero weights and the best training score. However, it is interesting to note that the values of the weights associated with the last three features are significantly low. Further, the increase in the training score with the inclusion of these features is also marginal. These results suggest that the last three features do not contribute significantly to the model predictions. On the other hand, the first five features contribute significantly to the output of the model. Thus, LAR is a highly interpretable approach to developing a regression model. Code Snippet 4.4 can be used to reproduce the results and the plot. Note that the data used is the density of multicomponent oxide glasses, with the glass composition as the input and density as the output.
Note: Authors recommend the readers to run the code with different values of
iterations, number of features, and different random states to understand the
effects on the final fit obtained.
4.6 Logistic Regression for Classification

All the regression models discussed thus far consider continuous variables for both the input and output of the model. In many cases, while the input variables may be continuous, the output variable may be discrete or even binary. This is typically referred to as a classification problem. In this final section, we discuss logistic regression, which is a parametric algorithm for classification.
Logistic regression is a probabilistic model that uses a logistic function to model a binary dependent variable. Mathematically, a binary logistic model has a dependent variable with two possible values, which is represented by an indicator variable, where the two values are labeled "0" and "1". Outputs with more than two values are modeled by multinomial logistic regression.
Consider a binary classification problem in which $y$ can take only two values, 0 and 1. Since $y$ is discrete-valued, if linear regression is used to predict $y$ as a function of $x$, it will give poor predictions. To address this challenge, we invoke the logistic function $g(z) = \frac{1}{1 + e^{-z}}$, also known as the sigmoid function. Using the sigmoid function, the logistic regression model is given as
$$h_{\theta}(x) = g(\theta^T x) = \frac{1}{1 + e^{-\theta^T x}} \tag{4.25}$$
Note that $g(z) \to 1$ when $z \to \infty$, and $g(z) \to 0$ when $z \to -\infty$. Thus, the value of $g(z)$ is bounded between 0 and 1. Further, $g(z)$ is a smooth, continuous, and differentiable function. With respect to differentiability, $g(z)$ exhibits an interesting property: $g'(z) = g(z)(1 - g(z))$. This property can be verified as shown below.
""" L e a s t A n g l e R e g r e s s i o n
"""
# i m p o r t m a t p l o l i b for p l o t t i n g data
import matplotlib
m a t p l o t l i b . use ( ' agg ')
i m p o r t m a t p l o t l i b . p y p l o t as plt
from m a t p l o t l i b . p y p l o t i m p o r t s a v e f i g
i m p o r t m a t p l o t l i b . p a t c h e s as p a t c h e s
# i m p o r t n u m p y and s k l e a r n
i m p o r t n u m p y as np
i m p o r t p a n d a s as pd
from s k l e a r n . m e t r i c s i m p o r t r 2 _ s c o r e
from s k l e a r n i m p o r t l i n e a r _ m o d e l
# Load sa m p l e d a t a s e t
data = pd . r e a d _ c s v ( " data / f u l l _ d e n . csv " ) . s o r t _ v a l u e s ( by = " Na2O " )
p r i n t ( data . c o l u m n s )
X = data [ data . c o l u m n s [ : - 1 ] ] . v a l u e s
y = data . v a l u e s [ : ,- 1 : ]
p r i n t ( " N u m b e r of f e a t u r e s : " , X . s h a p e [ 1 ] )
# Regression model
regr = l i n e a r _ m o d e l . Lars ( n _ n o n z e r o _ c o e f s = X . s h a p e [ 1 ] , r a n d o m _ s t a t e = 20 )
regr . fit (X , y )
scores = []
for w in regr . c o e f _ p a t h _ [ : , 1 : ] . T :
mask = w ! = 0 . 0
l i n _ r e g r = l i n e a r _ m o d e l . L i n e a r R e g r e s s i o n ()
l i n _ r e g r . fit ( X [ : , mask ] , y )
s c o r e s + = [ l i n _ r e g r . s c o r e ( X [ : , mask ] , y ) ]
# plor W e i g h t s vs I t e r a t i o n
fig , axs = plt . s u b p l o t s ( 1 , 2 )
plt . sca ( axs [ 0 ] )
m a r k e r s = [ " o " , " <" , " s " , " ^ " , " D " , " + " , " >" , " * " , " x " , " P " , " X " ]
for i , w in e n u m e r a t e ( regr . c o e f _ p a t h _ ) :
w_ = w
w_ = [ 0 . 0 ] + list ( w [ w ! = 0 . 0 ] )
x_ = r a n g e ( 12 - len ( w_ ) , 12 )
plt . plot ( x_ , w_ , " -- { } k " . f o r m a t ( m a r k e r s [ i ] ) , ms = 8 , mfc = " none " ,
l a b e l = " W$_ { { { } } } $ " . f o r m a t ( i + 1 ) )
plt . x t i c k s ( r a n g e ( 0 , 12 ) )
plt . xlim ( [ -1 , 16 ] )
plt . plot ( r a n g e (0 , 12 ) , [ 0 ] * 12 , " k " )
plt . x l a b e l ( " I t e r a t i o n " )
plt . y l a b e l ( " C o e f f i c i e n t v a l u e " )
plt . l e g e n d ()
plt . sca ( axs [ 1 ] )
plt . plot ( r a n g e (1 , len ( s c o r e s ) + 1 ) , scores , " -- ok " , ms = 8 , mfc = " none " )
plt . x t i c k s ( r a n g e ( 1 , 12 ) )
plt . y l a b e l ( " R$ ^ 2$ " )
plt . x l a b e l ( " N u m b e r of f e a t u r e " )
s a v e f i g ( " lars . png " )
p r i n t ( " End of o u t p u t " )
Output:
Index([’Al2O3’, ’B2O3’, ’CaO’, ’Fe2O3’, ’FeO’, ’MgO’, ’Na2O’, ’P2O5’, ’TeO2’,
’TiO2’, ’ZrO2’, ’Density’],
dtype=’object’)
Number of features: 11
End of output
$$g'(z) = \frac{d}{dz} \frac{1}{1 + e^{-z}} = -\frac{1}{(1 + e^{-z})^2} \left( -e^{-z} \right) = \frac{1}{1 + e^{-z}} \cdot \frac{e^{-z}}{1 + e^{-z}} = \frac{1}{1 + e^{-z}} \left( 1 - \frac{1}{1 + e^{-z}} \right) = g(z)(1 - g(z)) \tag{4.26}$$
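The identity in Eq. (4.26) can also be checked numerically against a central-difference derivative; the grid and finite-difference step below are illustrative choices:

```python
import numpy as np

def g(z):
    # The sigmoid (logistic) function
    return 1.0 / (1.0 + np.exp(-z))

z = np.linspace(-5.0, 5.0, 101)
dz = 1e-6
g_num = (g(z + dz) - g(z - dz)) / (2.0 * dz)  # central-difference derivative
g_ident = g(z) * (1.0 - g(z))                 # the claimed identity
```

The two arrays agree to numerical precision over the whole grid, confirming $g'(z) = g(z)(1 - g(z))$.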
For a given dataset, the model parameters of logistic regression, namely $\theta$, are computed as follows. Assume that $P(y = 1 \mid x; \theta) = h_{\theta}(x)$ and $P(y = 0 \mid x; \theta) = 1 - h_{\theta}(x)$. The likelihood of the parameters is then
$$L(\theta) = P(Y \mid X; \theta) = \prod_{i=1}^{m} p(y^{(i)} \mid x^{(i)}; \theta) = \prod_{i=1}^{m} \left( h_{\theta}(x^{(i)}) \right)^{y^{(i)}} \left( 1 - h_{\theta}(x^{(i)}) \right)^{1 - y^{(i)}} \tag{4.29}$$
Taking the logarithm, the log-likelihood is
$$\ell(\theta) = \ln(L(\theta)) = \sum_{i=1}^{m} y^{(i)} \ln h_{\theta}(x^{(i)}) + (1 - y^{(i)}) \ln \left( 1 - h_{\theta}(x^{(i)}) \right) \tag{4.30}$$
Finally, by employing the gradient ascent algorithm (as we are maximizing the likelihood), the gradient for a single training example $(x, y)$ is
$$\frac{\partial}{\partial \theta_j} \ell(\theta) = \left( y \frac{1}{g(\theta^T x)} - (1 - y) \frac{1}{1 - g(\theta^T x)} \right) \frac{\partial}{\partial \theta_j} g(\theta^T x)$$
$$= \left( y \frac{1}{g(\theta^T x)} - (1 - y) \frac{1}{1 - g(\theta^T x)} \right) g(\theta^T x) \left( 1 - g(\theta^T x) \right) \frac{\partial}{\partial \theta_j} (\theta^T x)$$
$$= \left( y \left( 1 - g(\theta^T x) \right) - (1 - y) g(\theta^T x) \right) x_j$$
$$= \left( y - h_{\theta}(x) \right) x_j \tag{4.31}$$
This results in the parameter update expression $\theta_j := \theta_j + \alpha \left( y^{(i)} - h_{\theta}(x^{(i)}) \right) x_j^{(i)}$, which can be applied in the batch or stochastic LMS framework.
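A minimal sketch of fitting a one-parameter logistic model by batch gradient ascent on the gradient of Eq. (4.31) follows; the synthetic two-class data, learning rate, and omission of an intercept are assumptions of this sketch:

```python
import numpy as np

rng = np.random.default_rng(6)
# Two overlapping 1-D classes centered at -2 and +2 (synthetic data)
x = np.concatenate([rng.normal(-2.0, 1.0, 100), rng.normal(2.0, 1.0, 100)])
y = np.concatenate([np.zeros(100), np.ones(100)])

theta = 0.0
alpha = 0.1                          # learning rate (illustrative value)
for _ in range(200):
    h = 1.0 / (1.0 + np.exp(-theta * x))
    # Batch gradient ascent on the log-likelihood: theta += alpha * mean((y - h) x)
    theta += alpha * np.sum((y - h) * x) / len(x)

pred = (1.0 / (1.0 + np.exp(-theta * x)) > 0.5).astype(float)
accuracy = float(np.mean(pred == y))
```

Because the log-likelihood of logistic regression is concave, this ascent converges to the maximum likelihood estimate, and most points end up on the correct side of the decision boundary.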
Note: Authors recommend the readers to run the code with different values of
hyperparameters such as .C to analyze the effect on the final fit obtained.
4.7 Summary
""" L o g i s t i c c l a s s i f i c a t i o n
Code s o u r c e : Gael V a r o q u a u x
M o d i f i e d for d o c u m e n t a t i o n by J a q u e s G r o b l e r
L i c e n s e : BSD 3 c l a u s e
"""
# i m p o r t m a t p l o l i b for p l o t t i n g data
import matplotlib
m a t p l o t l i b . use ( ' agg ')
i m p o r t m a t p l o t l i b . p y p l o t as plt
from m a t p l o t l i b . p y p l o t i m p o r t s a v e f i g
i m p o r t m a t p l o t l i b . p a t c h e s as p a t c h e s
i m p o r t n u m p y as np
i m p o r t m a t p l o t l i b . p y p l o t as plt
from s k l e a r n . l i n e a r _ m o d e l i m p o r t L o g i s t i c R e g r e s s i o n
from s k l e a r n i m p o r t d a t a s e t s
# i m p o r t s o m e data to play with
iris = d a t a s e t s . l o a d _ i r i s ()
X = iris . data [ : , : 2 ] # we only take the first two f e a t u r e s .
Y = iris . t a r g e t
# C r e a t e an i n s t a n c e of L o g i s t i c R e g r e s s i o n C l a s s i f i e r and fit the data
.
l o g r e g = L o g i s t i c R e g r e s s i o n ( C = 1e5 )
l o g r e g . fit ( X , Y )
# Plot the d e c i s i o n b o u n d a r y . For that , we will assign a color to each
# point in the mesh [ x_min , x _ m a x ] x [ y_min , y _ m a x ].
x_min , x _ m a x = X [ : , 0 ] . min () - . 5 , X [ : , 0 ] . max () + . 5
y_min , y _ m a x = X [ : , 1 ] . min () - . 5 , X [ : , 1 ] . max () + . 5
h = . 02 # step size in the mesh
xx , yy = np . m e s h g r i d ( np . a r a n g e ( x_min , x_max , h ) , np . a r a n g e ( y_min , y_max
, h))
Z = l o g r e g . p r e d i c t ( np . c_ [ xx . r a v e l () , yy . r a v e l () ] )
# Put the result into a color plot
Z = Z . r e s h a p e ( xx . s h a p e )
plt . f i g u r e (1 , f i g s i z e = ( 4 , 3 ) )
plt . p c o l o r m e s h ( xx , yy , Z , cmap = plt . cm . Paired , s h a d i n g = ' auto ')
# Plot also the t r a i n i n g p o i n t s
plt . s c a t t e r ( X [ : , 0 ] , X [ : , 1 ] , c = Y , e d g e c o l o r s = 'k ' , cmap = plt . cm . P a i r e d )
plt . x l a b e l ( ' S e p a l l e n g t h ')
plt . y l a b e l ( ' S e p a l w i d t h ' )
plt . xlim ( xx . min () , xx . max () )
plt . ylim ( yy . min () , yy . max () )
plt . x t i c k s (() )
plt . y t i c k s (() )
s a v e f i g ( " l o g i s t i c _ c l a s s i f i c a t i o n . png " )
p r i n t ( " End of o u t p u t " )
Output:
End of output
Chapter 5
Non-parametric Methods for Regression
5.1 Introduction
Predicting the value of an output variable requires accurate knowledge of the function relating the independent variables to the output variable. Parametric approaches such as linear, polynomial, or logistic regression make a priori assumptions about the nature of the data before learning the function. For instance, for a linear elastic material, it is assumed that the stress is proportional to the strain, which leaves only one free parameter to be fitted. The knowledge of this functional form may be based on the underlying physics, expert knowledge, or intuition. Thus, the number of parameters in a parametric model is fixed; during the learning process, the values of these parameters are learned.
On the other hand, non-parametric algorithms do not make any assumptions about the underlying nature of the data. Further, they place no restriction on the number of parameters used for the fitting. Rather, the functional form is learned directly from the data.

5.2 Tree-Based Approaches

Tree-based methods partition the feature space into a set of rectangular regions and then fit a simple model, most often a constant, in each of these regions. For instance, consider a regression problem with a continuous output variable $y$ and input features $x_1, \ldots, x_n$. Here, we split the feature space generated by the input features into regions $R_1, \ldots, R_M$. These regions are chosen in such a fashion that the response is a constant $c_i$ in each region $R_i$. The accuracy of the model depends on how these regions are identified and how the final prediction is made. Accordingly, there are several tree-based approaches with minor modifications, namely, regression trees, random forests, gradient-boosted trees, XGBoost, and AdaBoost. We discuss some of these algorithms in detail below. The prediction of a regression tree takes the form

$$f(x^{(j)}) = \sum_{i=1}^{M} c_i\, I(x^{(j)} \in R_i) \tag{5.1}$$
where $I$ stands for an indicator function, which evaluates to one if the condition $(x^{(j)} \in R_i)$ is met (that is, $I(x^{(j)} \in R_i) = 1$ if $x^{(j)} \in R_i$, otherwise $0$), $x^{(j)}$ stands for a given value of the predictors for which the prediction is made, $f(x^{(j)})$ stands for the prediction, and $M$ corresponds to the number of regions into which the regression tree splits the feature space. Thus, in a regression tree model, we have to evaluate two sets of parameters, namely, $c_i$ and $R_i$. For evaluating $c_i$, one can use the least squares principle such that $\sum_{j=1}^{N} (y^{(j)} - f(x^{(j)}))^2$ is minimized over the $N$ samples. It can be shown that this results in

$$\hat{c}_i = \operatorname{average}\left(y^{(j)} \mid x^{(j)} \in R_i\right) \tag{5.2}$$
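As a sanity check on Eqs. (5.1) and (5.2), the piecewise-constant prediction can be sketched in a few lines of plain Python; the data values and split points below are hypothetical, chosen only for illustration.

```python
# Hypothetical toy data: 1D inputs and responses
x = [2.0, 4.0, 6.0, 11.0, 13.0, 18.0]
y = [1.0, 1.2, 1.1, 2.0, 2.2, 3.0]

# Two fixed split points partition the input axis into three regions
splits = [8.0, 15.0]

def region_index(xi):
    """Return the index i of the region R_i containing xi."""
    for i, s in enumerate(splits):
        if xi < s:
            return i
    return len(splits)

# c_hat[i] = average of y over the samples falling in region R_i (Eq. 5.2)
regions = [region_index(xi) for xi in x]
c_hat = [sum(yj for yj, r in zip(y, regions) if r == i) /
         max(1, sum(1 for r in regions if r == i))
         for i in range(len(splits) + 1)]

def predict(xi):
    """f(x) = sum_i c_i I(x in R_i), Eq. (5.1): the constant of x's region."""
    return c_hat[region_index(xi)]

print(c_hat)         # region averages
print(predict(5.0))  # falls in the first region
```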
Now, the following greedy approach can be used to generate the nodes. Consider a case where a node is created by splitting on a variable $x_i$ at a splitting point $j$, such that

$$R_1(i, j) = \{x \mid x_i < j\}, \qquad R_2(i, j) = \{x \mid x_i \geq j\} \tag{5.3}$$

so that the best binary partition is achieved. Thus, any value of $x_i < j$ belongs to the region $R_1$ and any value of $x_i \geq j$ belongs to the region $R_2$. Hence, we minimize the following objective function:

$$\min_{i, j}\left[\min_{c_1} \sum_{x \in R_1(i, j)} (y - c_1)^2 + \min_{c_2} \sum_{x \in R_2(i, j)} (y - c_2)^2\right] \tag{5.4}$$

It is clear that, for any suitable choice of $i$ and $j$, the inner minimization problems yield the following optimal solutions:

$$\hat{c}_1 = \operatorname{average}(y \mid x \in R_1(i, j)), \qquad \hat{c}_2 = \operatorname{average}(y \mid x \in R_2(i, j)) \tag{5.5}$$
In other words, $\hat{c}_1$ and $\hat{c}_2$ correspond to the average values of the output variable $y$ obtained by considering all the input features $x$ belonging to the regions $R_1$ and $R_2$, respectively. Note that for each splitting variable $x_i$, the split point $j$ is determined by scanning through all of the inputs and finding the optimal pair $(i, j)$. Having found the best split, we partition the data into the two resulting regions and repeat the splitting process on each of the two regions; this process continues until the full region is explored. It should be noted that the tree size is a tuning parameter governing the model's complexity, and the optimal tree size should be suitably chosen based on the data. To achieve that, one can choose to split tree nodes only if the decrease in the sum of squares due to the split exceeds a manually set threshold.
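The exhaustive scan for the best split described above can be sketched as follows; the toy data are hypothetical, and for each candidate split the inner minima are simply the region averages.

```python
def best_split(x, y):
    """Scan all candidate split points of a single feature and return the
    one minimizing the total squared error of Eq. (5.4)."""
    best_s, best_sse = None, float("inf")
    for s in sorted(set(x))[1:]:                 # candidate split points
        left = [yj for xj, yj in zip(x, y) if xj < s]
        right = [yj for xj, yj in zip(x, y) if xj >= s]
        # inner minimization: the optimal constants are the region averages
        c1 = sum(left) / len(left)
        c2 = sum(right) / len(right)
        sse = (sum((yj - c1) ** 2 for yj in left) +
               sum((yj - c2) ** 2 for yj in right))
        if sse < best_sse:
            best_s, best_sse = s, sse
    return best_s, best_sse

# Hypothetical data with two clusters; the split should land between them
x = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
y = [1.0, 1.1, 0.9, 5.0, 5.1, 4.9]
print(best_split(x, y))
```

For multiple features, the same scan is simply repeated over each feature to find the best pair $(i, j)$.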
Figure 5.1 shows an example of a regression tree used to predict the density of binary sodium silicate glasses with the chemical composition $(\mathrm{Na_2O})_x \cdot (\mathrm{SiO_2})_{1-x}$. Figure 5.1a shows the results for tree regression with a tree depth of two, which results in four regions. The node splitting was performed based on the feature Na$_2$O, giving regions with mol% of Na$_2$O $< 15\%$ ($R_1$), 15–25% ($R_2$), 25–35% ($R_3$), and $> 35\%$ ($R_4$). We observe that in these four regions $R_1, R_2, R_3, R_4$, the density values given by $\hat{c}_1, \hat{c}_2, \hat{c}_3, \hat{c}_4$ are 2.31, 2.38, 2.46, and 2.53 g/cm$^3$, respectively. It may be noted that a tree depth of two underfits the data. Figure 5.1b, c, and d shows tree regression with tree depths of three (eight regions), four (16 regions), and five (32 regions). A tree depth of four seems to provide an optimal fit, while a tree depth of five exhibits signs of overfitting, represented by sudden spikes with notable differences in magnitude. Note that the training:test ratio for the data set is 70:30. That is, only 70% of the data is provided to the algorithm to develop the model. The remaining 30% test data can be used to evaluate the performance of the model. Code Snippet 5.1 can be used to reproduce the results and the plot.
Note: The authors recommend that readers run the code with different values of the random state for the train–test split (random_state) and the test set size (test_size) to understand the effect on the final fit obtained.
"""
"""
# i m p o r t m a t p l o l i b for p l o t t i n g data
import matplotlib
m a t p l o t l i b . use ( ' agg ')
i m p o r t m a t p l o t l i b . p y p l o t as plt
from m a t p l o t l i b . p y p l o t i m p o r t s a v e f i g
# i m p o r t n u m p y and s k l e a r n
i m p o r t n u m p y as np
i m p o r t p a n d a s as pd
from s k l e a r n . m e t r i c s i m p o r t r 2 _ s c o r e
from s k l e a r n . m o d e l _ s e l e c t i o n i m p o r t t r a i n _ t e s t _ s p l i t
from s k l e a r n . tree i m p o r t D e c i s i o n T r e e R e g r e s s o r
# Load sa m p l e d a t a s e t
data = pd . r e a d _ c s v ( " data / N S _ d e n . csv " ) . s o r t _ v a l u e s ( by = " Na2O " )
p r i n t ( data . c o l u m n s )
X = data [ " Na2O " ] . v a l u e s . r e s h a p e ( - 1 , 1 )
y = data [ " D e n s i t y ( g / cm3 ) " ] . v a l u e s . r e s h a p e ( -1 , 1 )
# Split dataset
X_train , X_test , y_train , y _ t e s t = t r a i n _ t e s t _ s p l i t ( X , y , t e s t _ s i z e = 0 . 3
, r a n d o m _ s t a t e = 2020 )
o r d e r = np . a r g s o r t ( X _ t e s t [ : , 0 ] )
fig , axs_ = plt . s u b p l o t s ( 2 , 2 )
axs = axs_
for ind , td in e n u m e r a t e ( [ 2 ,3 ,4 , 5 ] ) :
# Fit r e g r e s s i o n m o d e l
regr = D e c i s i o n T r e e R e g r e s s o r ( m a x _ d e p t h = td )
regr . fit ( X_train , y _ t r a i n )
# Predict
y _ p r e d _ t e s t = regr . p r e d i c t ( X _ t e s t )
# Score
p r i n t ( " Test R2 ( Tree depth = { } ) : " . f o r m a t ( td ) , r 2 _ s c o r e ( y_test ,
y_pred_test ))
plt . sca ( axs [ ind ] )
plt . plot ( X _ t e s t [ order , 0 ] , y _ p r e d _ t e s t [ o r d e r ] , c = 'r ' , l a b e l = ' Tree
d e p t h { } '. f o r m a t ( td ) )
plt . s c a t t e r ( X_test , y_test , l a b e l = ' Test data p o i n t s ' , ec = " k " , fc = "
none " )
plt . l e g e n d ()
plt . x l a b e l ( " $ N a _ 2 O $ ( mol % ) " )
plt . y l a b e l ( " D e n s i t y ( g cm$ ^ { -3 } $ ) " )
s a v e f i g ( " t r e e _ r e g r e s s i o n . png " )
p r i n t ( " End of o u t p u t " )
Output:
Code snippet 5.1: Tree-based regression. Models trained for varying tree depths of 2, 3, 4, and 5 are shown
One major problem with regression trees is that they tend to overfit their training sets if the trees are grown very deep. One approach to address this issue is to develop multiple regression trees over which an average can be computed to obtain the function value for a given predictor. This can be achieved using the random forest algorithm. Random forests are a way of averaging multiple deep decision trees, trained on different parts of the same training set, with the goal of reducing the variance. Suppose we fit a tree to the training data $Z = \{(x_1, y_1), \ldots, (x_N, y_N)\}$, obtaining the prediction $f(x)$ at input $x$. Similarly, consider several bootstrap subsets of $Z$ given by $Z^b,\ b = 1, 2, \ldots, B$. Now, fit a tree to each of these subsets separately, obtaining the predictions $f^b(x)$ at input $x$. The mean of these predictions is taken as the estimate of the output value in the bagging approach. In other words, the bagging estimate of the prediction over a collection of $B$ bootstrap samples gives an improved prediction over a single model, thereby reducing its variance. In this case, for each bootstrap sample $Z^b,\ b = 1, 2, \ldots, B$, a regression tree is developed, giving the prediction $f^b(x)$. Thus, the bagging estimate $\hat{f}_{\mathrm{bag}}$ is defined by

$$\hat{f}_{\mathrm{bag}}(x) = \frac{1}{B} \sum_{b=1}^{B} f^{b}(x) \tag{5.6}$$

A random forest additionally decorrelates the trees by considering only a random subset of the features as split candidates at each node; denoting the $b$th tree by $T_b$, the random forest prediction is

$$\hat{f}_{\mathrm{rf}}(x) = \frac{1}{B} \sum_{b=1}^{B} T_b(x) \tag{5.7}$$

It should be noted that an RF with only one tree is equivalent to a regression tree; in other words, a regression tree is a special case of an RF with only one tree.
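The bagging estimate of Eq. (5.6) is just an average over predictors fitted to bootstrap samples. A minimal sketch in plain Python, with hypothetical data and depth-one "stumps" standing in for full regression trees:

```python
import random

random.seed(0)

# Hypothetical training data Z = [(x, y), ...]
Z = [(1.0, 1.0), (2.0, 1.2), (3.0, 0.9), (10.0, 5.0), (11.0, 5.1), (12.0, 4.9)]

def fit_stump(sample):
    """Fit a depth-one 'tree': split at the sample's median x and predict
    the mean response on each side."""
    xs = sorted(xv for xv, _ in sample)
    s = xs[len(xs) // 2]
    left = [yv for xv, yv in sample if xv < s]
    right = [yv for xv, yv in sample if xv >= s]
    c1 = sum(left) / len(left) if left else sum(right) / len(right)
    c2 = sum(right) / len(right)
    return lambda xq: c1 if xq < s else c2

# B bootstrap samples of Z, one stump fitted per sample
B = 25
stumps = [fit_stump([random.choice(Z) for _ in Z]) for _ in range(B)]

def f_bag(xq):
    """Bagging estimate of Eq. (5.6): the average of the B predictions."""
    return sum(f(xq) for f in stumps) / B

print(f_bag(2.0), f_bag(11.0))
```

The averaged predictions track the two clusters of the toy data even though each individual stump is crude.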
Figure 5.2 shows an example of RF regression with a varying number of estimators for predicting the density of calcium aluminosilicate glasses. The number of estimators refers to the number of trees considered in the RF model. For each tree, the tree depth is another variable that can be optimized. Here, the tree depth is kept constant at two for all the trees considered. Thus, for each tree, there are four regions ($R_1$, $R_2$, $R_3$, and $R_4$). Figure 5.2a shows the prediction for the test data using one tree having a tree depth of two. In this case, the four regions can be very clearly observed as the four step values in the figure. Figure 5.2b, c, and d shows RF predictions with 2, 5, and 10 trees, each having a tree depth of two. Despite having only four regions per tree, we observe that the prediction improves with an increasing number of estimators. This can be attributed to the bootstrapping and bagging over an increasing number of trees, which gives a better estimate of the output variable. Code Snippet 5.2 can be used to reproduce the results and plots.
Note: The authors recommend that readers run the code with different values of the test set size (test_size), the random state for the train–test split (random_state), the number of trees in the RF (n_estimators), and the tree depth (max_depth).
"""
"""
# i m p o r t m a t p l o l i b for p l o t t i n g data
import matplotlib
m a t p l o t l i b . use ( ' agg ')
i m p o r t m a t p l o t l i b . p y p l o t as plt
from m a t p l o t l i b . p y p l o t i m p o r t s a v e f i g
# i m p o r t n u m p y and s k l e a r n
i m p o r t n u m p y as np
i m p o r t p a n d a s as pd
from s k l e a r n . m e t r i c s i m p o r t r 2 _ s c o r e
from s k l e a r n . m o d e l _ s e l e c t i o n i m p o r t t r a i n _ t e s t _ s p l i t
from s k l e a r n . e n s e m b l e i m p o r t R a n d o m F o r e s t R e g r e s s o r
# Load sa m p l e d a t a s e t
data = pd . r e a d _ c s v ( " data / C A S _ d e n . csv " ) . s o r t _ v a l u e s ( by = " CaO " )
p r i n t ( data . c o l u m n s )
X = data [ [ " CaO " , " A l 2 O 3 " , " SiO2 " ] ] . v a l u e s
y = data [ " D e n s i t y ( g / cm3 ) " ] . v a l u e s . r e s h a p e ( -1 , 1 )
# Split dataset
X_train , X_test , y_train , y _ t e s t = t r a i n _ t e s t _ s p l i t ( X , y , t e s t _ s i z e = 0 . 3
, r a n d o m _ s t a t e = 2020 )
fig , axs_ = plt . s u b p l o t s ( 2 , 2 )
axs = axs_ . f l a t t e n ()
for ind , n in e n u m e r a t e ( [ 1 , 2 ,5 , 10 ] ) :
# Fit r e g r e s s i o n m o d e l
regr = R a n d o m F o r e s t R e g r e s s o r ( n _ e s t i m a t o r s =n , m a x _ d e p t h =2 ,
r a n d o m _ s t a t e = 20 )
regr . fit ( X_train , y _ t r a i n )
# Predict
y _ p r e d _ t e s t = regr . p r e d i c t ( X _ t e s t )
# Score
p r i n t ( " Test R2 ( n u m b e r of e s t i m a t o r s = { } ) : " . f o r m a t ( n ) , r 2 _ s c o r e (
y_test , y _ p r e d _ t e s t ) )
plt . sca ( axs [ ind ] )
i = 0
o r d e r = np . a r g s o r t ( X _ t e s t [ : , i ] )
plt . plot ( X _ t e s t [ order , i ] , y _ p r e d _ t e s t [ o r d e r ] , c = 'r ' , l a b e l = ' N u m b e r
of e s t i m a t o r s { } '. f o r m a t ( n ) )
plt . s c a t t e r ( X _ t e s t [ : ,i ] , y_test , l a b e l = ' Test data p o i n t s ' , ec = " k " ,
fc = " none " )
plt . l e g e n d ()
plt . x l a b e l ( " $ C a O $ ( mol % ) " )
plt . y l a b e l ( " D e n s i t y ( g cm$ ^ { -3 } $ ) " )
s a v e f i g ( " r a n d o m _ f o r e s t _ r e g r e s s i o n . png " )
p r i n t ( " End of o u t p u t " )
Output:
Code snippet 5.2: Random forest regression with the number of estimators varying over 1, 2, 5, and 10
Note: The authors recommend that readers run the code with different values of the test set size (test_size), the random state for the train–test split (random_state), the number of trees in XGBoost (n_estimators), and the tree depth (max_depth).

5.3 Multi-layer Perceptron
A perceptron is a two-class classifier using a generalized linear model structure, and a multi-layer perceptron (MLP) is a network of perceptrons. In a perceptron, the input vector $x$ is first transformed using a fixed nonlinear transformation to give a feature vector $\phi(x)$, and this is then used to construct a generalized linear model of the form

$$y = f(w^T \phi(x)) \tag{5.8}$$
""" X G B o o s t r e g r e s s i o n
"""
# i m p o r t m a t p l o l i b for p l o t t i n g data
import matplotlib
m a t p l o t l i b . use ( ' agg ')
i m p o r t m a t p l o t l i b . p y p l o t as plt
from m a t p l o t l i b . p y p l o t i m p o r t s a v e f i g
# import
i m p o r t n u m p y as np
i m p o r t p a n d a s as pd
from x g b o o s t i m p o r t X G B R e g r e s s o r
from s k l e a r n . m o d e l _ s e l e c t i o n i m p o r t t r a i n _ t e s t _ s p l i t
from s k l e a r n . m e t r i c s i m p o r t m e a n _ s q u a r e d _ e r r o r , r 2 _ s c o r e
# Load sa m p l e d a t a s e t
data = pd . r e a d _ c s v ( " data / N S _ d e n . csv " ) . s o r t _ v a l u e s ( by = " Na2O " )
p r i n t ( data . c o l u m n s )
X = data [ data . c o l u m n s [ : - 1 ] ] . v a l u e s
y = data [ " D e n s i t y ( g / cm3 ) " ] . v a l u e s . r e s h a p e ( -1 , 1 )
# Split dataset
X_train , X_test , y_train , y _ t e s t = t r a i n _ t e s t _ s p l i t ( X , y , t e s t _ s i z e = 0 . 3
, r a n d o m _ s t a t e = 2020 )
o r d e r = np . a r g s o r t ( X _ t e s t [ : , 0 ] )
fig , axs_ = plt . s u b p l o t s ( 2 , 2 )
axs = axs_
for ind , td in e n u m e r a t e ( [ ( 5 , 1 ) ,( 5 , 5 ) ,( 5 , 10 ) ,(5 , 50 ) ] ) :
# Fit r e g r e s s i o n m o d e l
regr = X G B R e g r e s s o r ( m a x _ d e p t h = td [ 0 ] , n _ e s t i m a t o r s = td [ 1 ] )
regr . fit ( X_train , y _ t r a i n )
# Predict
y _ p r e d _ t e s t = regr . p r e d i c t ( X _ t e s t )
# Score
p r i n t ( " Test R2 ( Tree depth = { } ) : " . f o r m a t ( td ) , r 2 _ s c o r e ( y_test ,
y_pred_test ))
plt . sca ( axs [ ind ] )
plt . plot ( X _ t e s t [ order , 0 ] , y _ p r e d _ t e s t [ o r d e r ] , c = 'r ' , l a b e l = ' Tree
d e p t h { } , E s t i m a t o r s { } '. f o r m a t
( * td ) )
plt . s c a t t e r ( X _ t e s t [ : , 0 ] , y_test , l a b e l = ' Test data p o i n t s ' , ec = " k " ,
fc = " none " )
plt . l e g e n d ()
plt . x l a b e l ( " $ N a _ 2 O $ ( mol % ) " )
plt . y l a b e l ( " D e n s i t y ( g cm$ ^ { -3 } $ ) " )
s a v e f i g ( " x g b _ r e g r e s s i o n . png " )
p r i n t ( " End of o u t p u t " )
Output:
If no features are used, then $\phi(x) = x$. The nonlinear activation function $f(\cdot)$ is given by a step function of the form:

$$f(a) = \begin{cases} +1, & \text{if } a \geq 0 \\ -1, & \text{if } a < 0 \end{cases} \tag{5.9}$$
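A perceptron with this step activation can be trained with the mistake-driven stochastic update $w \leftarrow w + \eta\, \phi(x_n) t_n$ discussed below; a minimal sketch in plain Python, with hypothetical linearly separable data, $\phi(x) = x$, and an appended bias feature:

```python
# Hypothetical linearly separable data: inputs (x1, x2), targets t in {-1, +1}
X = [(2.0, 1.0), (3.0, 2.0), (-1.0, -2.0), (-2.0, -1.0)]
T = [1, 1, -1, -1]

w = [0.0, 0.0, 0.0]   # weights for (x1, x2, bias)
eta = 0.1             # learning rate

def activation(x, w):
    """w^T phi(x) with phi(x) = (x1, x2, 1)."""
    return w[0] * x[0] + w[1] * x[1] + w[2]

for epoch in range(100):
    mistakes = 0
    for x, t in zip(X, T):
        if activation(x, w) * t <= 0:      # misclassified point
            # stochastic update: w <- w + eta * phi(x) * t
            w[0] += eta * x[0] * t
            w[1] += eta * x[1] * t
            w[2] += eta * t
            mistakes += 1
    if mistakes == 0:                       # every point now correct
        break

print(w, all(activation(x, w) * t > 0 for x, t in zip(X, T)))
```

For separable data such as this, the perceptron convergence theorem guarantees the loop terminates with all points correctly classified.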
For correct classification, $w^T \phi(x_n) t_n > 0$, while for incorrect classification, $w^T \phi(x_n) t_n < 0$. The perceptron criterion therefore minimizes

$$E_P(w) = -\sum_{n \in \mathcal{M}} w^T \phi(x_n) t_n \tag{5.10}$$

where $\mathcal{M}$ is the set of misclassified points. The weights are updated by stochastic gradient descent as

$$w^{(\tau+1)} = w^{(\tau)} - \eta \nabla E_P(w) = w^{(\tau)} + \eta\, \phi(x_n) t_n \tag{5.11}$$

where $\eta$ is the learning rate and $\tau$ indexes the updates; the updates are repeated until the desired accuracy is reached. The perceptron can only be expected to handle problems that are linearly separable. To tackle more complicated (nonlinear) situations, we can arrange perceptrons in multiple layers; the additional layers between the input and output layers are called 'hidden' layers. Such models are also popularly called feed-forward neural nets or feed-forward networks. The model comprises multiple layers of logistic regression models (with continuous nonlinearities) rather than multiple perceptrons (with discontinuous nonlinearities). The output is structured in terms of basis functions as

$$y(x, w) = f\left(\sum_{j=1}^{M} w_j \phi_j(x)\right) \tag{5.12}$$
where $f(\cdot)$ is a nonlinear activation function in the case of classification and the identity in the case of regression, and $\phi(x) = x$ if the variables are considered directly. In an MLP framework, the output of the first hidden layer is calculated as

$$z_j = h(a_j) = h\left(\sum_{i=1}^{D} w_{ji}^{(1)} x_i\right) \tag{5.13}$$

and the activations of the output layer are

$$a_k = \sum_{j=1}^{M} w_{kj}^{(2)} z_j \tag{5.14}$$

The outputs are obtained by applying an output function $\sigma(\cdot)$ to these activations,

$$y_k = \sigma(a_k) \tag{5.15}$$

so that the final network output $y_k$ is evaluated as

$$y_k(x, w) = \sigma\left(\sum_{j=1}^{M} w_{kj}^{(2)}\, h\left(\sum_{i=1}^{D} w_{ji}^{(1)} x_i\right)\right) \tag{5.16}$$
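The nested computation of Eq. (5.16) is just two weighted sums with a nonlinearity in between. A minimal forward-pass sketch in plain Python; the weight values are hypothetical, chosen only for illustration:

```python
import math

def h(a):
    """Hidden-layer activation: tanh."""
    return math.tanh(a)

def sigma(a):
    """Output activation: identity, as used for regression."""
    return a

# Hypothetical weights: D = 2 inputs, M = 3 hidden units, K = 1 output
W1 = [[0.5, -0.2], [0.1, 0.4], [-0.3, 0.8]]   # W1[j][i] = w_ji^(1)
W2 = [[0.7, -0.5, 0.2]]                        # W2[k][j] = w_kj^(2)

def forward(x):
    # z_j = h(sum_i w_ji^(1) x_i), Eq. (5.13)
    z = [h(sum(wj[i] * x[i] for i in range(len(x)))) for wj in W1]
    # y_k = sigma(sum_j w_kj^(2) z_j), Eqs. (5.14)-(5.16)
    return [sigma(sum(wk[j] * z[j] for j in range(len(z)))) for wk in W2]

print(forward([1.0, 2.0]))
```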
A network with multiple hidden layers is generally termed a deep network. The credit assignment path (CAP) is the chain of transformations from input to output in a network; for a feed-forward neural network, the depth of the CAP is the number of hidden layers plus one. A deep network has CAP depth $> 2$, while networks of CAP depth 2 have already been shown to be universal approximators. The output activation function in an MLP is most often the logistic function

$$\sigma(a) = \frac{1}{1 + \exp(-a)} \tag{5.17}$$
Common choices of the hidden-layer activation function $h(a)$ are

$$h(a) = \begin{cases} \text{logistic } ([0, 1]): & \dfrac{1}{1 + \exp(-a)} \\[6pt] \tanh\ ([-1, 1]): & \dfrac{e^{a} - e^{-a}}{e^{a} + e^{-a}} \\[6pt] \text{ReLU}: & \begin{cases} 0, & \text{if } a \leq 0 \\ a, & \text{if } a > 0 \end{cases} \end{cases} \tag{5.18}$$

For training, the error over the $N$ training samples decomposes as

$$E(w) = \sum_{n=1}^{N} E_n(w) \tag{5.19}$$

$$E_n(w) = \frac{1}{2} \sum_{k} (y_{nk} - t_{nk})^2 \tag{5.20}$$
where $y_{nk}$ is the $k$th output for the $n$th data vector and $t_{nk}$ is the corresponding target of $y_{nk}$ (the subscript $n$ is omitted below for convenience). In a feed-forward network (forward propagation), the output of a general hidden layer is

$$z_j = h(a_j) = h\left(\sum_{i=1}^{M} w_{ji} z_i\right) \tag{5.21}$$

with $z_i = x_i$ if there is only one hidden layer. The derivatives of $E_n$ with respect to the weights factor through the activations:

$$\frac{\partial E_n}{\partial w_{ji}} = \frac{\partial E_n}{\partial a_j} \frac{\partial a_j}{\partial w_{ji}} = \delta_j z_i \tag{5.22}$$

$$\frac{\partial E_n}{\partial w_{kj}} = \frac{\partial E_n}{\partial a_k} \frac{\partial a_k}{\partial w_{kj}} = \delta_k z_j \tag{5.23}$$
Thus, we only need to calculate the values of $\delta_j$ and $\delta_k$. For weights in the output layer, we compute

$$\delta_k = \frac{\partial E_n}{\partial a_k} \tag{5.24}$$

As $y_k = a_k$ and $E_n(w) = \frac{1}{2} \sum_{k=1}^{K} (y_k - t_k)^2$, this gives

$$\delta_k = y_k - t_k \tag{5.25}$$

For the hidden layer, we need to sum over all output nodes:

$$\delta_j = \frac{\partial E_n}{\partial a_j} = \sum_{k=1}^{K} \frac{\partial E_n}{\partial a_k} \frac{\partial a_k}{\partial a_j} \tag{5.26}$$

As $a_k = \sum_{j} w_{kj} h(a_j)$,

$$\delta_j = h'(a_j) \sum_{k=1}^{K} \delta_k w_{kj} \tag{5.27}$$
The following steps are repeated until the desired training criterion is met. Apply an input vector $x_n$ to the network and forward propagate through the network to find the activations of all the hidden and output units. The final weight updates are then performed as:

• Output layer weight update using stochastic LMS:

$$w_{kj} := w_{kj} - \alpha (y_k - t_k) z_j, \quad k = 1, \ldots, K \tag{5.28}$$

• Hidden layer weight update using stochastic LMS:

$$w_{ij} := w_{ij} - \alpha\, h'(a_j) \left(\sum_{k=1}^{K} (y_k - t_k) w_{kj}\right) z_i, \quad i = 1, \ldots, D \tag{5.29}$$
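The backpropagated gradients above are easy to verify against a finite-difference estimate for a single training pair; a minimal sketch in plain Python with hypothetical weights (tanh hidden units, one linear output):

```python
import math

# One hidden layer: D = 2 inputs, M = 2 tanh hidden units, K = 1 linear output
W1 = [[0.3, -0.1], [0.2, 0.5]]      # W1[j][i] = w_ji
W2 = [[0.4, -0.6]]                   # W2[k][j] = w_kj
x, t = [1.0, -1.0], 0.5

def loss(W1, W2):
    """E_n = 0.5 * (y - t)^2 for the single pair (x, t)."""
    z = [math.tanh(sum(W1[j][i] * x[i] for i in range(2))) for j in range(2)]
    y = sum(W2[0][j] * z[j] for j in range(2))
    return 0.5 * (y - t) ** 2

# Forward pass
a = [sum(W1[j][i] * x[i] for i in range(2)) for j in range(2)]
z = [math.tanh(aj) for aj in a]
y = sum(W2[0][j] * z[j] for j in range(2))

# Backward pass: delta_k = y - t for a linear output unit, and
# delta_j = h'(a_j) * sum_k delta_k * w_kj, with h'(a) = 1 - tanh(a)^2
delta_k = y - t
delta_j = [(1 - z[j] ** 2) * delta_k * W2[0][j] for j in range(2)]
analytic = delta_j[0] * x[0]    # dE_n/dw_00 = delta_j * z_i (z_i = x_i here)

# Finite-difference check on the same weight
eps = 1e-6
W1p = [row[:] for row in W1]
W1p[0][0] += eps
numeric = (loss(W1p, W2) - loss(W1, W2)) / eps

print(analytic, numeric)
```

The two numbers agree to the accuracy of the finite-difference step, which is the standard sanity check for a hand-written backpropagation.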
Code Snippet 5.4 shows the code for predicting the density of calcium aluminosilicate glasses using MLPs. Figure 5.4 shows the performance of the MLP with a varying number of neurons. With an increasing number of neurons, the model gradually fits the data better.
"""
"""
# i m p o r t m a t p l o l i b for p l o t t i n g data
import matplotlib
m a t p l o t l i b . use ( ' agg ')
i m p o r t m a t p l o t l i b . p y p l o t as plt
from m a t p l o t l i b . p y p l o t i m p o r t s a v e f i g
# i m p o r t n u m p y and s k l e a r n
i m p o r t n u m p y as np
i m p o r t p a n d a s as pd
from s k l e a r n . m e t r i c s i m p o r t r 2 _ s c o r e
from s k l e a r n . m o d e l _ s e l e c t i o n i m p o r t t r a i n _ t e s t _ s p l i t
from s k l e a r n . n e u r a l _ n e t w o r k i m p o r t M L P R e g r e s s o r
# Load sa m p l e d a t a s e t
data = pd . r e a d _ c s v ( " data / C A S _ d e n . csv " ) . s o r t _ v a l u e s ( by = " CaO " )
p r i n t ( data . c o l u m n s )
X = data [ [ " CaO " , " A l 2 O 3 " , " SiO2 " ] ] . v a l u e s
y = data [ " D e n s i t y ( g / cm3 ) " ] . v a l u e s . r e s h a p e ( -1 , 1 )
# Split dataset
X_train , X_test , y_train , y _ t e s t = t r a i n _ t e s t _ s p l i t ( X , y , t e s t _ s i z e = 0 . 3
, r a n d o m _ s t a t e = 2020 )
fig , axs_ = plt . s u b p l o t s ( 2 , 2 )
axs = axs_ . f l a t t e n ()
for ind , n in e n u m e r a t e ( [ 1 , 2 ,5 , 10 ] ) :
# Fit r e g r e s s i o n m o d e l
regr = M L P R e g r e s s o r ( h i d d e n _ l a y e r _ s i z e s = [n ,n , n ] , r a n d o m _ s t a t e = 20 )
regr . fit ( X_train , y _ t r a i n )
# Predict
y _ p r e d _ t e s t = regr . p r e d i c t ( X _ t e s t )
# Score
p r i n t ( " Test R2 ( Nu m b e r of n e u r o n s = { } ) : " . f o r m a t ( n ) , r 2 _ s c o r e (
y_test , y _ p r e d _ t e s t ) )
plt . sca ( axs [ ind ] )
i = 0
o r d e r = np . a r g s o r t ( X _ t e s t [ : , i ] )
plt . plot ( X _ t e s t [ order , i ] , y _ p r e d _ t e s t [ o r d e r ] , c = 'r ' , l a b e l = ' N u m b e r
of n e u r o n s = { } '. f o r m a t ( n ) )
plt . s c a t t e r ( X _ t e s t [ : ,i ] , y_test , l a b e l = ' Test data p o i n t s ' , ec = " k " ,
fc = " none " )
plt . l e g e n d ()
plt . x l a b e l ( " $ C a O $ ( mol % ) " )
plt . y l a b e l ( " D e n s i t y ( g cm$ ^ { -3 } $ ) " )
s a v e f i g ( " n n _ r e g r e s s i o n . png " )
p r i n t ( " End of o u t p u t " )
Output:
Code snippet 5.4: MLP-based regression to predict the density with varying number
of hidden layer units, namely, 1, 2, 5, and 10
Note: The authors recommend that readers run the code with different values of the test set size (test_size), the random state for the train–test split (random_state), the number of neurons or hidden layer units, and the learning rate.
5.4 Support Vector Regression

Support vector machines construct a linear decision function of the form $y(x) = w^T \phi(x) + b$, where $\phi(x)$ denotes a fixed feature-space transformation (used only if needed), and we have made the bias parameter $b$ explicit. The training data set comprises $N$ input vectors $\{x_1, \ldots, x_N\}$, with corresponding target values $\{t_1, \ldots, t_N\}$, where $t_n \in \{-1, 1\}$. For all data points $n = 1, \ldots, N$, we need to maximize the separation of only the correctly classified points, for which $t_n y(x_n) > 0$:

$$\frac{t_n y(x_n)}{\|w\|} = \frac{t_n (w^T x_n + b)}{\|w\|} \tag{5.34}$$
We also need to minimize the number of points within the margin. In short, we need to find the parameters $w$ and $b$ by

$$\arg\max_{w, b} \left\{ \frac{1}{\|w\|} \min_{n} \left[ t_n (w^T x_n + b) \right] \right\} \tag{5.35}$$
Setting the margin of the closest point to unity, this is equivalent to the canonical problem

$$\min_{w, b} \frac{1}{2} \|w\|^2 \tag{5.36}$$

$$\text{subject to } t_n (w^T x_n + b) \geq 1, \quad n = 1, \ldots, N \tag{5.37}$$
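For a fixed candidate pair $(w, b)$, the constraints of Eq. (5.37) and the margin $1/\|w\|$ are straightforward to evaluate; a small sketch in plain Python with hypothetical separable data:

```python
import math

# Hypothetical separable data with targets in {-1, +1}
X = [(2.0, 2.0), (3.0, 3.0), (-2.0, -2.0), (-3.0, -3.0)]
T = [1, 1, -1, -1]

def satisfies_constraints(w, b):
    """Check t_n (w^T x_n + b) >= 1 for all n."""
    return all(t * (w[0] * x[0] + w[1] * x[1] + b) >= 1
               for x, t in zip(X, T))

w, b = (0.25, 0.25), 0.0                          # a candidate hyperplane
margin = 1.0 / math.sqrt(w[0] ** 2 + w[1] ** 2)   # geometric margin 1/||w||
print(satisfies_constraints(w, b), margin)
```

Scaling $w$ and $b$ up always keeps the constraints satisfied but shrinks $1/\|w\|$, which is why the objective minimizes $\|w\|^2$ subject to the constraints.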
Duality theory shows how we can construct an alternative problem from the functions and data that define the original optimization problem. This alternative "dual" problem is related to the original problem (which is sometimes referred to in this context as the "primal" for purposes of contrast). In most cases, the dual problem is computationally easier to solve than the original problem, or it can be used to easily obtain a lower bound on the optimal value of the primal objective; if we obtain a lower bound, we can maximize it to approach the optimal value. The dual optimization problem here is

$$\max_{\lambda}\ \sum_{n=1}^{N} \lambda_n - \frac{1}{2} \sum_{n=1}^{N} \sum_{m=1}^{N} \lambda_n \lambda_m t_n t_m x_n^T x_m, \quad \text{subject to } \lambda_n \geq 0,\ \sum_{n=1}^{N} \lambda_n t_n = 0$$

Data points with $\lambda_n = 0$ make no contribution to $y$; the rest of the data points are called support vectors, as they satisfy $t_n y(x_n) = 1$ and they do contribute to $y$. Incidentally, $t_n y(x_n) = 1$ implies that they are active constraints and lie on the maximum margin hyperplane. For calculating $b$, we use the following relation: any support vector $x_n$ satisfies
$$t_n y(x_n) = 1 \implies t_n \left\{ w^T x_n + b \right\} = 1 \implies t_n \left\{ \sum_{m \in S} \lambda_m t_m x_m^T x_n + b \right\} = 1$$

Multiplying both sides by $t_n$ and using $t_n^2 = 1$,

$$\sum_{m \in S} \lambda_m t_m x_m^T x_n + b = t_n \implies b = t_n - \sum_{m \in S} \lambda_m t_m x_m^T x_n$$

where $S$ is the index set of the support vectors. Averaging these equations over all support vectors gives

$$b = \frac{1}{N_S} \sum_{n \in S} \left( t_n - \sum_{m \in S} \lambda_m t_m x_m^T x_n \right) \tag{5.43}$$
In the case of linearly non-separable classes, some points fall inside the class separation band, and the above optimization procedure is not valid: it is not possible to draw a hyperplane with a class separation band around it that contains no data points. We now modify the approach so that data points are allowed to be on the "wrong side" of the margin boundary, but with a penalty that increases with the distance from that boundary. To do so, we introduce slack variables with $\xi_n = 0$ for data points that are on or inside the correct margin boundary and $\xi_n = |t_n - y(x_n)|$ for other points. This results in three possibilities:

1. Vectors that fall outside the margin and are correctly classified: $t_n (w^T x_n + b) \geq 1 \implies \xi_n = 0$
2. Vectors that fall inside the margin but are correctly classified: $0 \leq t_n (w^T x_n + b) \leq 1 \implies 0 \leq \xi_n \leq 1$
3. Vectors that are misclassified: $t_n (w^T x_n + b) \leq 0 \implies \xi_n > 1$
This results in the following modified optimization problem:

$$\min_{w, b, \xi} \frac{1}{2} \|w\|^2 + C \sum_{n=1}^{N} \xi_n \tag{5.44}$$

$$\text{subject to } t_n (w^T x_n + b) \geq 1 - \xi_n, \quad \xi_n \geq 0, \quad n = 1, \ldots, N \tag{5.45}$$

Here, $C$ is a positive constant that controls the trade-off and relaxes the hard classification constraint to some extent. If a data point has $\xi_n = 0$, it is correctly classified on or outside the margin; if $0 \leq \xi_n \leq 1$, it lies inside the margin but on the correct side of the decision boundary; and if $\xi_n > 1$, it is misclassified (undesirable). Since the objective also minimizes $\sum_n \xi_n$, each slack variable takes the smallest feasible value. Here, the Lagrangian is computed as

$$L(w, b, \xi, \lambda, \mu) = \frac{1}{2} \|w\|^2 + C \sum_{n=1}^{N} \xi_n - \sum_{n=1}^{N} \lambda_n \left\{ t_n (w^T x_n + b) - 1 + \xi_n \right\} - \sum_{n=1}^{N} \mu_n \xi_n \tag{5.46}$$
where $\mu_n, \lambda_n$, $n = 1, \ldots, N$, are Lagrange multipliers. Setting the derivatives of $L$ to zero gives

$$\frac{\partial L}{\partial w} = 0 \implies w = \sum_{n=1}^{N} \lambda_n t_n x_n, \qquad \frac{\partial L}{\partial b} = 0 \implies \sum_{n=1}^{N} \lambda_n t_n = 0, \qquad \frac{\partial L}{\partial \xi_n} = 0 \implies \lambda_n = C - \mu_n,\ \text{i.e.},\ C = \lambda_n + \mu_n$$

The dual objective $q(\lambda)$ is evaluated as
$$q(\lambda) = \frac{1}{2} \sum_{n=1}^{N} \sum_{m=1}^{N} \lambda_n \lambda_m t_n t_m x_n^T x_m + \sum_{n=1}^{N} (\lambda_n + \mu_n) \xi_n - \underbrace{\sum_{n=1}^{N} \lambda_n t_n b}_{=0} - \sum_{n=1}^{N} \sum_{m=1}^{N} \lambda_n \lambda_m t_n t_m x_n^T x_m + \sum_{n=1}^{N} \lambda_n - \sum_{n=1}^{N} \lambda_n \xi_n - \sum_{n=1}^{N} \mu_n \xi_n$$

$$= \sum_{n=1}^{N} \lambda_n - \frac{1}{2} \sum_{n=1}^{N} \sum_{m=1}^{N} \lambda_n \lambda_m t_n t_m x_n^T x_m \tag{5.47}$$

subject to

$$0 \leq \lambda_n \leq C, \qquad \sum_{n=1}^{N} \lambda_n t_n = 0 \tag{5.48}$$
In order to classify new data points, we evaluate the sign of $y(x) = w^T x + b$. For calculating $w$, we solve the dual problem (5.47)–(5.48) to get $\lambda$ and use

$$w = \sum_{n=1}^{N} \lambda_n t_n x_n \tag{5.49}$$

and $y$ is calculated as

$$y = \sum_{n=1}^{N} \lambda_n t_n x_n^T x + b \tag{5.50}$$
So far, the classifier is linear in the input features. By defining a kernel function using an inner product between the data points, we can implicitly perform a nonlinear transformation from the input space to a higher-dimensional space, where the problem may be linearly separable. For second-order polynomials, consider the mapping $\phi : (x_1, x_2) \mapsto (z_1, z_2, z_3) := (x_1^2, \sqrt{2}\, x_1 x_2, x_2^2)$. Then

$$\phi(x_1, x_2)^T \phi(x_1', x_2') = (x_1^2, \sqrt{2}\, x_1 x_2, x_2^2)^T (x_1'^2, \sqrt{2}\, x_1' x_2', x_2'^2) \tag{5.52}$$

$$= x_1^2 x_1'^2 + 2 x_1 x_1' x_2 x_2' + x_2^2 x_2'^2 \tag{5.53}$$

$$= (x_1 x_1' + x_2 x_2')^2 \tag{5.54}$$

$$= \left\{ (x_1, x_2)^T (x_1', x_2') \right\}^2 = k(x, x') \tag{5.55}$$
The term $k(x, x')$ is called the inner product kernel. Similarly, for a $d$th-order polynomial,

$$k(x, x') = \left\{ x^T x' \right\}^d \tag{5.56}$$

The net result is that we can compute the value of the inner product without explicitly carrying out the mapping involving monomials. Let $\phi : X \mapsto H$ indicate a nonlinear transformation from the input space $X$ to the feature space $H$; then the separating hyperplane is
$$y = \sum_{n=1}^{N} \lambda_n t_n k(x, x_n) + b \tag{5.58}$$

with weights

$$w_\phi = \sum_{n=1}^{N} \lambda_n t_n \phi(x_n) \tag{5.59}$$

and bias

$$b = \frac{1}{N_S} \sum_{n \in S} \left( t_n - \sum_{m \in S} \lambda_m t_m k(x_m, x_n) \right) \tag{5.60}$$
Here, the term $k(x_m, x_n) = \phi(x_m)^T \phi(x_n)$ is called the inner product kernel. $K(x_i, x_j)$ may be very inexpensive to calculate owing to the inner product evaluation, even though $\phi(x)$ itself may be expensive to calculate. The inner product of orthogonal functions is zero, and that of collinear unit functions is one. If $\phi(x)$ and $\phi(x')$ are far apart, say nearly orthogonal to each other, then $K(x, x') = \phi(x)^T \phi(x')$ will be small; otherwise, it will be large. We can thus think of $K(x, x')$ as a measure of how similar $\phi(x)$ and $\phi(x')$ are, or of how similar $x$ and $x'$ are. Kernels are also known as covariance functions, and some popular kernels are:

1. Polynomial kernel: $k(x, x') = (x^T x' + c)^d$, of degree $d$
2. Gaussian kernel: $k(x, x') = \exp\left(-\|x - x'\|^2 / \sigma^2\right)$
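Both points are easy to check numerically: the polynomial identity (5.52)–(5.55) and the decay of the Gaussian kernel with distance. A small sketch in plain Python, with hypothetical input values:

```python
import math

def phi(x1, x2):
    """Explicit second-order monomial map (x1^2, sqrt(2) x1 x2, x2^2)."""
    return (x1 * x1, math.sqrt(2) * x1 * x2, x2 * x2)

def poly_kernel(x, xp):
    """k(x, x') = (x^T x')^2, computed without mapping to feature space."""
    return (x[0] * xp[0] + x[1] * xp[1]) ** 2

x, xp = (1.5, -0.5), (2.0, 1.0)
lhs = sum(a * b for a, b in zip(phi(*x), phi(*xp)))   # phi(x)^T phi(x')
rhs = poly_kernel(x, xp)
print(lhs, rhs)        # identical up to rounding

def gaussian_kernel(x, xp, sigma=1.0):
    """k(x, x') = exp(-||x - x'||^2 / sigma^2): a similarity measure."""
    sq = sum((a - b) ** 2 for a, b in zip(x, xp))
    return math.exp(-sq / sigma ** 2)

near = gaussian_kernel((0.0, 0.0), (0.1, 0.0))
far = gaussian_kernel((0.0, 0.0), (3.0, 0.0))
print(near, far)       # similar points -> ~1, distant points -> ~0
```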
"""
"""
# i m p o r t m a t p l o l i b for p l o t t i n g data
import matplotlib
m a t p l o t l i b . use ( ' agg ')
i m p o r t m a t p l o t l i b . p y p l o t as plt
from m a t p l o t l i b . p y p l o t i m p o r t s a v e f i g
# i m p o r t n u m p y and s k l e a r n
i m p o r t n u m p y as np
i m p o r t p a n d a s as pd
from s k l e a r n . m e t r i c s i m p o r t r 2 _ s c o r e
from s k l e a r n . m o d e l _ s e l e c t i o n i m p o r t t r a i n _ t e s t _ s p l i t
from s k l e a r n . svm i m p o r t SVR
# Load sa m p l e d a t a s e t
data = pd . r e a d _ c s v ( " data / N S _ d e n . csv " ) . s o r t _ v a l u e s ( by = " Na2O " )
p r i n t ( data . c o l u m n s )
X = data [ " Na2O " ] . v a l u e s . r e s h a p e ( - 1 , 1 )
y = data [ " D e n s i t y ( g / cm3 ) " ] . v a l u e s . r e s h a p e ( -1 , 1 )
# Split dataset
X_train , X_test , y_train , y _ t e s t = t r a i n _ t e s t _ s p l i t ( X , y , t e s t _ s i z e = 0 . 3
, r a n d o m _ s t a t e = 2020 )
o r d e r = np . a r g s o r t ( X _ t e s t [ : , 0 ] )
fig , axs_ = plt . s u b p l o t s ( 1 , 2 )
axs = axs_ . f l a t t e n ()
for ind , k in e n u m e r a t e ( [ " l i n e a r " , " rbf " ] ) :
# Fit r e g r e s s i o n m o d e l
regr = SVR ( C = 0 . 1 , e p s i l o n = 0 . 001 , k e r n e l =k , d e g r e e = 4 )
regr . fit ( X_train , y _ t r a i n )
# Predict
y _ p r e d _ t e s t = regr . p r e d i c t ( X _ t e s t )
# Score
p r i n t ( " Test R2 : " , r 2 _ s c o r e ( y_test , y _ p r e d _ t e s t ) )
plt . sca ( axs [ ind ] )
plt . plot ( X _ t e s t [ order , 0 ] , y _ p r e d _ t e s t [ o r d e r ] , c = 'r ' , l a b e l = '{ }
k e r n e l '. f o r m a t ( k ) )
plt . s c a t t e r ( X_test , y_test , l a b e l = ' Test data p o i n t s ' , ec = " k " , fc = "
none " )
plt . l e g e n d ()
plt . x l a b e l ( " $ N a _ 2 O $ ( mol % ) " )
plt . y l a b e l ( " D e n s i t y ( g cm$ ^ { -3 } $ ) " )
s a v e f i g ( " s v _ r e g r e s s i o n . png " )
p r i n t ( " End of o u t p u t " )
Output:
Code snippet 5.5: Support Vector regression for predicting density using linear
kernel and RBF kernel
Code Snippet 5.5 can be used to perform SVR for predicting the density of sodium silicate glasses. Figure 5.5 shows the performance of the SVR models with linear and RBF kernels for predicting the density.
Note: The authors recommend that readers run the code with different values of the test set size (test_size), the random state for the train–test split (random_state), linear and non-linear kernels, and varying hyperparameters appropriate to the non-linear kernels.
5.5 Gaussian Process Regression

Gaussian processes (GPs) are non-parametric models that are capable of modeling datasets in a fully probabilistic fashion. The main advantages of GP models are: (i) their ability to model complex data sets; and (ii) their ability to estimate the uncertainty associated with predictions through posterior variance computations. A GP is a collection of random variables, any finite set of which follows a joint Gaussian distribution. As a result, the GPR modeling framework ascribes a distribution over a given set of input ($x$) and output ($y$) datasets. A mean function $m(x)$ and a covariance function $k(x, x')$, the two degrees of freedom needed to characterize a GP fully, are as shown below:

$$f(x) \sim \mathcal{GP}(m(x), k(x, x')), \qquad m(x) = \mathbb{E}[f(x)]$$

While the mean function $m(x)$ computes the expected value of the output for a given input, the covariance function captures the extent of correlation between the function outputs for a given pair of inputs:

$$k(x, x') = \mathbb{E}\left[(f(x) - m(x))(f(x') - m(x'))\right]$$
In the GP literature, k(x, x') is also termed the kernel function of the GP. A widely
used rationale for the selection of the kernel function is that the correlation between
any two points decreases as the distance between them increases. Some popular
kernels in the GP literature are:
1. Exponential kernel: k(x, x') = exp(−|x − x'|/l)
2. Squared exponential kernel: k(x, x') = σ_f² exp[−(1/2)((x − x')/l)²]
where l is termed the length-scale parameter and σ_f² the signal variance parameter.
In a GPR model, these hyperparameters can be tuned to model datasets with arbitrary
correlation. Also, the function f ∼ GP(m(x), k(x, x')) is often mean-centered to
reduce the computational complexity. Suppose we have a set of test inputs X* for
which we are interested in computing the output predictions. This warrants sampling
a set f* := [f(x_1*), ..., f(x_n*)], such that f* ∼ N(0, K(X*, X*)),
with the mean and covariance as
m(x) = 0

K(X*, X*) =
⎡ k(x_1*, x_1*)  ···  k(x_1*, x_n*) ⎤
⎢       ⋮         ⋱        ⋮        ⎥
⎣ k(x_n*, x_1*)  ···  k(x_n*, x_n*) ⎦
The above equations are employed to make new predictions using the GPR.
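To make this construction concrete, the following sketch assembles K(X*, X*) with the squared exponential kernel and draws one prior sample f* ∼ N(0, K(X*, X*)). The length-scale and signal-variance values are illustrative, not taken from the text:

```python
import numpy as np

def sq_exp_kernel(X1, X2, length_scale=1.0, signal_var=1.0):
    # k(x, x') = sigma_f^2 * exp(-0.5 * ((x - x') / l)^2)
    d = X1.reshape(-1, 1) - X2.reshape(1, -1)
    return signal_var * np.exp(-0.5 * (d / length_scale) ** 2)

X_star = np.linspace(0.0, 5.0, 50)          # test inputs X*
K = sq_exp_kernel(X_star, X_star)           # covariance matrix K(X*, X*)
rng = np.random.default_rng(0)
# f* ~ N(0, K); a small jitter on the diagonal keeps the factorization stable
f_star = rng.multivariate_normal(np.zeros(len(X_star)),
                                 K + 1e-8 * np.eye(len(X_star)))
print(K.shape, f_star.shape)
```

Each draw of f_star is one function sampled from the GP prior; repeating the draw with different seeds illustrates the spread of functions that the kernel encodes.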
Figure 5.6 shows the performance of GPR for predicting the density of sodium
silicate glasses with the RBF and Matern kernels. Note that GPR can also be used
to obtain the standard deviation of the predictions, which is not demonstrated in the
figure. Code Snippet 5.6 can be used to reproduce the results.
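In scikit-learn, that standard deviation is returned by `regr.predict(X_test, return_std=True)`. The underlying computation, the posterior covariance K(X*, X*) − K(X, X*)ᵀ[K(X, X) + σ_n²I]⁻¹K(X, X*), can be sketched in plain NumPy; the kernel, noise level, and toy data below are illustrative:

```python
import numpy as np

def rbf(a, b, length_scale=1.0):
    d = a.reshape(-1, 1) - b.reshape(1, -1)
    return np.exp(-0.5 * (d / length_scale) ** 2)

# toy training data (illustrative only)
X_tr = np.array([0.0, 1.0, 2.0, 3.0])
y_tr = np.sin(X_tr)
X_te = np.array([1.0, 10.0])                 # one seen point, one far-away point
noise = 1e-6                                 # sigma_n^2

K = rbf(X_tr, X_tr) + noise * np.eye(len(X_tr))   # K(X, X) + sigma_n^2 I
K_s = rbf(X_tr, X_te)                             # K(X, X*)
K_ss = rbf(X_te, X_te)                            # K(X*, X*)
mean = K_s.T @ np.linalg.solve(K, y_tr)           # posterior mean
cov = K_ss - K_s.T @ np.linalg.solve(K, K_s)      # posterior covariance
std = np.sqrt(np.clip(np.diag(cov), 0.0, None))   # posterior standard deviation
print(mean, std)
```

The standard deviation collapses near training points and reverts toward the prior far from them, which is what the shaded uncertainty bands of a GPR plot show.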
Note: The authors recommend that readers run the code with different values of the
test set size (test_size), the random state for the train-test split (random_state),
different kernels, and varying hyperparameters appropriate to the non-linear kernels.
5.6 Summary
""" GP r e g r e s s i o n
"""
# i m p o r t m a t p l o l i b for p l o t t i n g data
import matplotlib
m a t p l o t l i b . use ( ' agg ')
i m p o r t m a t p l o t l i b . p y p l o t as plt
from m a t p l o t l i b . p y p l o t i m p o r t s a v e f i g
# i m p o r t n u m p y and s k l e a r n
i m p o r t n u m p y as np
i m p o r t p a n d a s as pd
from s k l e a r n . m e t r i c s i m p o r t r 2 _ s c o r e
from s k l e a r n . m o d e l _ s e l e c t i o n i m p o r t t r a i n _ t e s t _ s p l i t
from s k l e a r n . g a u s s i a n _ p r o c e s s i m p o r t G a u s s i a n P r o c e s s R e g r e s s o r
from s k l e a r n . g a u s s i a n _ p r o c e s s . k e r n e l s i m p o r t RBF , Matern ,
RationalQuadratic , ExpSineSquared ,
DotProduct , ConstantKernel
# Load sa m p l e d a t a s e t
data = pd . r e a d _ c s v ( " data / N S _ d e n . csv " ) . s o r t _ v a l u e s ( by = " Na2O " )
p r i n t ( data . c o l u m n s )
X = data [ " Na2O " ] . v a l u e s . r e s h a p e ( - 1 , 1 )
y = data [ " D e n s i t y ( g / cm3 ) " ] . v a l u e s . r e s h a p e ( -1 , 1 )
# Split dataset
X_train , X_test , y_train , y _ t e s t = t r a i n _ t e s t _ s p l i t ( X , y , t e s t _ s i z e = 0 . 3
, r a n d o m _ s t a t e = 2020 )
o r d e r = np . a r g s o r t ( X _ t e s t [ : , 0 ] )
fig , axs_ = plt . s u b p l o t s ( 1 , 2 )
axs = axs_ . f l a t t e n ()
n a m e s = [ " RBF " , " M a t e r n " ]
for ind , k in e n u m e r a t e ( [ RBF () , M a t e r n () ] ) :
regr = G a u s s i a n P r o c e s s R e g r e s s o r ( k e r n e l =k , r a n d o m _ s t a t e = 0 )
regr . fit ( X_train , y _ t r a i n )
# Predict
y _ p r e d _ t e s t = regr . p r e d i c t ( X _ t e s t )
# Score
p r i n t ( " Test R2 ( k e r n e l = { } ) : " . f o r m a t ( n a m e s [ ind ] ) , r 2 _ s c o r e ( y_test
, y_pred_test ))
plt . sca ( axs [ ind ] )
plt . plot ( X _ t e s t [ order , 0 ] , y _ p r e d _ t e s t [ o r d e r ] , c = 'r ' , l a b e l = '{ }
k e r n e l '. f o r m a t ( n a m e s [ ind ] ) )
plt . s c a t t e r ( X_test , y_test , l a b e l = ' Test data p o i n t s ' , ec = " k " , fc = "
none " )
plt . l e g e n d ()
plt . x l a b e l ( " $ N a _ 2 O $ ( mol % ) " )
plt . y l a b e l ( " D e n s i t y ( g cm$ ^ { -3 } $ ) " )
s a v e f i g ( " g p _ r e g r e s s i o n . png " )
p r i n t ( " End of o u t p u t " )
Output:
Code snippet 5.6: Gaussian process regression for predicting the density of sodium
silicate glasses with RBF kernel and Matern kernel
Chapter 6
Dimensionality Reduction and Clustering
this space using a D-dimensional unit vector u_1 so that u_1^T u_1 = 1. Note that we
are only interested in the direction defined by u_1, not in the magnitude of u_1 itself.
The mean of the projected data is u_1^T x̄, where x̄ is the sample mean given by
x̄ = (1/N) Σ_{n=1}^{N} x_n        (6.1)

The variance of the projected data is

(1/N) Σ_{n=1}^{N} {u_1^T x_n − u_1^T x̄}² = u_1^T S u_1        (6.2)

where S is the data covariance matrix

S = (1/N) Σ_{n=1}^{N} (x_n − x̄)(x_n − x̄)^T        (6.3)

Maximizing the projected variance u_1^T S u_1 subject to the normalization
constraint u_1^T u_1 = 1 leads, with a Lagrange multiplier λ_1, to

L = u_1^T S u_1 + λ_1 (1 − u_1^T u_1)        (6.4)

∂L/∂u_1 = 0  ⟹  S u_1 = λ_1 u_1        (6.5)

u_1^T S u_1 = λ_1        (6.6)
That is, the variance is maximized when we set u_1 equal to the eigenvector having
the largest eigenvalue λ_1. This eigenvector is known as the first principal component.
We can define additional principal components incrementally by choosing each new
direction to be the one that maximizes the projected variance, until the desired
variance is achieved. The optimal linear projection for which the variance of the
projected data is maximized is then defined by the M eigenvectors u_1, ..., u_M of the
data covariance matrix S corresponding to the M largest eigenvalues λ_1, ..., λ_M.
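The eigendecomposition route described above can be sketched directly in NumPy; the data here are synthetic and purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
# correlated synthetic data: 200 samples, 3 features
A = np.array([[1.0, 0.5, 0.0],
              [0.0, 1.0, 0.3],
              [0.0, 0.0, 0.2]])
X = rng.normal(size=(200, 3)) @ A
Xc = X - X.mean(axis=0)                  # center the data
S = (Xc.T @ Xc) / len(Xc)                # covariance matrix S
eigvals, eigvecs = np.linalg.eigh(S)     # eigh returns ascending eigenvalues
order = np.argsort(eigvals)[::-1]        # sort descending
eigvals, eigvecs = eigvals[order], eigvecs[:, order]
u1 = eigvecs[:, 0]                       # first principal component u_1
explained = eigvals / eigvals.sum()      # variance ratio per component
print(explained)
```

Keeping the M leading columns of eigvecs and projecting with Xc @ eigvecs[:, :M] reproduces, up to sign, what sklearn.decomposition.PCA computes internally.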
Figure 6.1 shows the variance captured with an increasing number of features; note
that with 10 features, 100% of the variance is achieved. The corresponding R² values,
obtained using the principal components as input features, are also shown. Code
Snippet 6.1 shows the code to reproduce the results of the principal component
analysis.
""" PCA
"""
# i m p o r t m a t p l o l i b for p l o t t i n g data
import matplotlib
m a t p l o t l i b . use ( ' agg ')
i m p o r t m a t p l o t l i b . p y p l o t as plt
from m a t p l o t l i b . p y p l o t i m p o r t s a v e f i g
# i m p o r t n u m p y and s k l e a r n
i m p o r t n u m p y as np
i m p o r t p a n d a s as pd
from s k l e a r n . d e c o m p o s i t i o n i m p o r t PCA
from s k l e a r n . m e t r i c s i m p o r t r 2 _ s c o r e
from s k l e a r n . l i n e a r _ m o d e l i m p o r t L i n e a r R e g r e s s i o n
from s k l e a r n . p r e p r o c e s s i n g i m p o r t S t a n d a r d S c a l e r
s c a l e r = S t a n d a r d S c a l e r ()
# Load sa m p l e d a t a s e t
data = pd . r e a d _ c s v ( " data / f u l l _ d e n . csv " ) . s o r t _ v a l u e s ( by = " Na2O " )
p r i n t ( data . c o l u m n s )
X = data [ data . c o l u m n s [ : - 1 ] ] . v a l u e s
y = data . v a l u e s [ : ,- 1 : ]
p r i n t ( " N u m b e r of f e a t u r e s : " , X . s h a p e [ 1 ] )
# Scale data using s t a n d a r d s c a l e r
s c a l e r . fit ( X )
X_ = s c a l e r . t r a n s f o r m ( X )
ns = [ 1 , 2 , 3 ,4 ,5 , 6 , 7 , 8 , 9 , 10 , 11 ]
var = [ ]
Xs_pca = []
R2 = [ ]
for n _ c o m p o n e n t s in ns :
pca = PCA ( n _ c o m p o n e n t s = n _ c o m p o n e n t s )
X s _ p c a + = [ pca . f i t _ t r a n s f o r m ( X_ ) ]
var + = [ sum ( pca . e x p l a i n e d _ v a r i a n c e _ r a t i o _ ) * 100 ]
regr = L i n e a r R e g r e s s i o n () . fit ( X s _ p c a [ - 1 ] , y )
R2 + = [ r 2 _ s c o r e ( y , regr . p r e d i c t ( X s _ p c a [ - 1 ] ) ) ]
fig , axs = plt . s u b p l o t s ( 1 , 2 )
plt . sca ( axs [ 0 ] )
plt . plot ( ns , var , " -- k " )
plt . s c a t t e r ( ns , var , s = 60 , c = " k " , fc = " none " , ec = " k " )
plt . x t i c k s ( [2 ,4 , 6 , 8 , 10 ] )
plt . x l a b e l ( " N u m b e r of f e a t u r e s " )
plt . y l a b e l ( " V a r i a n c e ( % ) " )
plt . sca ( axs [ 1 ] )
plt . plot ( ns , R2 , " -- k " )
plt . s c a t t e r ( ns , R2 , s = 60 , c = " k " , fc = " none " , ec = " k " )
plt . x t i c k s ( [2 ,4 , 6 , 8 , 10 ] )
plt . x l a b e l ( " N u m b e r of f e a t u r e s " )
plt . y l a b e l ( " R$ ^ 2$ " )
s a v e f i g ( " p c a e x a . png " )
p r i n t ( " End of o u t p u t " )
Output:
Index(['Al2O3', 'B2O3', 'CaO', 'Fe2O3', 'FeO', 'MgO', 'Na2O', 'P2O5', 'TeO2',
'TiO2', 'ZrO2', 'Density'],
dtype='object')
Number of features: 11
End of output
Note: The authors recommend that readers run the code with different values of the
test set size (test_size), the random state for the train-test split (random_state),
and the hyperparameters associated with PCA.
variable x. Our goal is to partition the dataset into some number K of clusters,
where we shall suppose for the moment that the value of K is given. Intuitively,
we might think of a cluster as comprising a group of data points whose inter-point
distances are small compared with the distances to points outside the cluster. Let
μ_k, where k = 1, ..., K, be a prototype (representing the centre of a cluster)
associated with the kth cluster. For each data point x_n, we introduce a corresponding
set of binary indicator variables r_nk ∈ {0, 1}, where k = 1, ..., K, describing to
which of the K clusters the data point x_n is assigned. If data point x_n is assigned to
cluster k, then r_nk = 1 and r_nj = 0 for j ≠ k. This is known as the 1-of-K coding
scheme.
We minimize the distortion measure, given by:

J = Σ_{n=1}^{N} Σ_{k=1}^{K} r_nk ||x_n − μ_k||²        (6.7)
which represents the sum of the squares of the distances of each data point to its
assigned prototype μ_k. Our goal is to find values for the r_nk and the μ_k that
minimize J. We can perform the optimization through an iterative procedure in which
each iteration involves two successive steps, corresponding to successive
optimizations with respect to the r_nk and the μ_k. First, we choose some initial
values for the μ_k. Then we minimize J with respect to the r_nk, keeping the μ_k
fixed. In the second phase, we minimize J with respect to the μ_k, keeping the r_nk
fixed. This two-stage optimization is repeated until convergence. Consider first the
determination of the r_nk. Because J in (6.7) is a linear function of r_nk, this
optimization can be performed easily to give a closed-form solution. The terms
involving different n are independent, so we can optimize for each n separately by
choosing r_nk to be 1 for whichever value of k gives the minimum value of
||x_n − μ_k||². That is,
r_nk = 1 if k = arg min_j ||x_n − μ_j||², and r_nk = 0 otherwise        (6.8)
Now consider the optimization of the μ_k with the r_nk held fixed. The objective
function J is a quadratic function of μ_k, and it can be minimized by setting its
derivative with respect to μ_k to zero, giving
2 Σ_{n=1}^{N} r_nk (x_n − μ_k) = 0        (6.9)

⟹  μ_k = Σ_{n=1}^{N} r_nk x_n / Σ_{n=1}^{N} r_nk        (6.10)
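The two alternating updates above — assign each point to its nearest prototype (Eq. 6.8), then recompute each prototype as the mean of its assigned points (Eq. 6.10) — can be sketched as follows on synthetic two-blob data (K = 2; all values illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
# two well-separated synthetic blobs
X = np.vstack([rng.normal(0.0, 0.5, (50, 2)),
               rng.normal(3.0, 0.5, (50, 2))])
K = 2
mu = np.vstack([X[0], X[-1]])            # one initial prototype from each region

for _ in range(20):
    # assignment step: r_nk = 1 for the nearest prototype (Eq. 6.8)
    d = np.linalg.norm(X[:, None, :] - mu[None, :, :], axis=2)
    labels = d.argmin(axis=1)
    # update step: mu_k = mean of points assigned to cluster k (Eq. 6.10)
    new_mu = np.array([X[labels == k].mean(axis=0) for k in range(K)])
    if np.allclose(new_mu, mu):          # converged: prototypes stopped moving
        break
    mu = new_mu
print(mu)
```

On well-separated data such as this, the loop typically converges in a handful of iterations; in practice one would also guard against empty clusters and restart from multiple initializations.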
Note: The authors recommend that readers run the code with different values of the
hyperparameters associated with k-means and t-SNE.
Output:
Index(['Al2O3', 'B2O3', 'CaO', 'Fe2O3', 'FeO', 'MgO', 'Na2O', 'P2O5', 'TeO2',
'TiO2', 'ZrO2', 'Density'],
dtype='object')
Number of features: 11
End of output
Code snippet 6.2: K-means clustering. Number of clusters and visualisation using
t-SNE
p(x_i | θ) = Σ_{k=1}^{K} π_k p_k(x_i | θ)        (6.11)
Equation (6.11) is a convex combination of the p_k(x_i | θ)'s, since we are taking a
weighted sum where the mixing weights π_k satisfy 0 ≤ π_k ≤ 1 and
Σ_{k=1}^{K} π_k = 1. However, the structure of the model (6.11) does not indicate to
which cluster each data point belongs. To achieve that, we additionally introduce
latent (hidden) variables to capture the cluster attribution. Let x_i be the visible or
observable variables and z_i a latent variable capturing its cluster attribution. The
GMM then has the following form:
p(x_i | θ) = Σ_{k=1}^{K} π_k N(x_i | μ_k, Σ_k)        (6.12)

p(z_k = 1) = π_k        (6.13)

(i) 0 ≤ π_k ≤ 1,  (ii) Σ_{k=1}^{K} π_k = 1        (6.14)

p(z_1, ..., z_K) = p(z) = π_1^{z_1} π_2^{z_2} ··· π_K^{z_K} = Π_{k=1}^{K} π_k^{z_k}        (6.15)

p(x | z) = Π_{k=1}^{K} N(x | μ_k, Σ_k)^{z_k}        (6.17)

where p(x) is computed by marginalization (P(A) = Σ_n p(A | B_n) p(B_n)):

p(x) = Σ_{z_1,...,z_K} p(z) p(x | z) = Σ_{z_1,...,z_K} Π_{k=1}^{K} π_k^{z_k} Π_{k=1}^{K} N(x | μ_k, Σ_k)^{z_k}        (6.18)

p(x) = Σ_{k=1}^{K} π_k N(x | μ_k, Σ_k)        (6.19)

γ(z_k) = p(z_k = 1 | x) = p(x | z_k = 1) p(z_k = 1) / p(x)        (6.20)

= p(x | z_k = 1) p(z_k = 1) / Σ_{j=1}^{K} p(z_j = 1) p(x | z_j = 1)        (6.21)
(By Bayes' theorem: p(A | B) = p(B | A) p(A) / p(B), with p(B) = Σ_n p(B | A_n) p(A_n))

γ(z_k) = π_k N(x | μ_k, Σ_k) / Σ_{j=1}^{K} π_j N(x | μ_j, Σ_j)        (6.22)
Maximizing the log likelihood function for a GMM turns out to be a more complex
problem than for the case of a single Gaussian. The difficulty arises from the presence
of the summation over k inside the logarithm, so that the logarithm no longer acts
directly on the Gaussian. If we set the derivatives of the log likelihood to zero, we no
longer obtain a closed-form solution. A solution to this is an iterative MLE approach
such as expectation maximization (EM). EM is an elegant and powerful method for
finding maximum likelihood solutions for models with latent variables (Dempster
et al., 1977; McLachlan and Krishnan, 1997). The EM algorithm is used to find
(local) maximum likelihood parameters of a statistical model in cases where the
equations cannot be solved directly. It proceeds from the observation that the two
sets of equations can be solved numerically:
1. pick arbitrary values for one of the two sets of unknowns and use them to
estimate the second set;
2. then use these new values to find a better estimate of the first set;
3. then keep alternating between the two until the resulting values both converge to
fixed points.
We denote the set of all observed data by X, in which the nth row represents x_n^T,
and similarly we denote the set of all latent variables by Z, with a corresponding row
z_n^T. The log likelihood then takes the form:
ln p(X | θ) = ln { Σ_Z p(X, Z | θ) }        (6.25)
Equation (6.25) can be modified by replacing the sum over Z with an integral for
continuous latent variables. A key observation is that the summation over the latent
variables appears inside the logarithm, as
Σ_{i=1}^{N} ln { Σ_{k=1}^{K} π_k N(x | μ_k, Σ_k) }. The presence of the sum inside
the logarithm leads to complicated expressions for the maximum likelihood solution
(e.g., log(x_1 + ··· + x_n) ≠ log(x_1) + ··· + log(x_n)).
Now suppose that, for each observation in X, we were told the corresponding
value of the latent variable Z. We would then call {X, Z} the complete data set and
refer to the actual observed data X as incomplete. The likelihood function for the
complete data set simply takes the form ln p(X, Z | θ). In practice, we are not given
the complete data set {X, Z} but only the incomplete data X; our knowledge of the
values of the latent variables in Z is given only by the posterior distribution
p(Z | X, θ). Since we cannot use the complete-data log likelihood, we consider
instead its expected value under the posterior distribution of the latent variables,
evaluated using θ^old; this corresponds to the E step of the EM algorithm. In the
subsequent M step, we maximize this expectation to obtain θ^new. This is performed
iteratively until convergence. Given p(X, Z | θ) over observed variables X and latent
variables Z, governed by parameters θ, the goal is arg max_θ p(X | θ). The EM
algorithm steps are illustrated below:
1. Initialize the parameters θ^old.
2. E step: evaluate the posterior of the latent variables, p(Z | X, θ^old).
3. M step: maximize the expected complete-data log likelihood to obtain θ^new.
4. Check for convergence of the log likelihood or the parameter values. If the
convergence criterion is not satisfied, then set

θ^old ← θ^new        (6.27)

and return to step 2.
For the GMM, the complete-data likelihood is

p(X, Z | π, μ, Σ) = Π_{n=1}^{N} Π_{k=1}^{K} π_k^{z_nk} N(x_n | μ_k, Σ_k)^{z_nk}        (6.28)

Then the log likelihood for the complete data set {X, Z} (one of the terms of the Q
function), considering all the data points, becomes:

ln p(X, Z | π, μ, Σ) = Σ_{n=1}^{N} Σ_{k=1}^{K} z_nk { ln π_k + ln N(x_n | μ_k, Σ_k) }        (6.29)
Now compare (6.29) with the incomplete-data log likelihood
Σ_{i=1}^{N} ln { Σ_{k=1}^{K} π_k N(x | μ_k, Σ_k) }: there is no logarithm inside the
summation. The explicit introduction of the hidden variables z_nk has removed the
logarithm from inside the summation, but the difficulty is that we need to estimate
them for all data points.
Q(θ, θ^old) = Σ_Z p(Z | X, θ^old) ln p(X, Z | π, μ, Σ)        (6.30)

= Σ_{n=1}^{N} Σ_{k=1}^{K} p(z_nk | x_n, θ^old) z_nk { ln π_k + ln N(x_n | μ_k, Σ_k) }        (6.31)

= Σ_{n=1}^{N} Σ_{k=1}^{K} γ(z_nk) z_nk { ln π_k + ln N(x_n | μ_k, Σ_k) }        (6.32)

= Σ_{n=1}^{N} Σ_{k=1}^{K} γ(z_nk) { ln π_k + ln N(x_n | μ_k, Σ_k) }        (6.33)

γ(z_nk) = π_k N(x_n | μ_k, Σ_k) / Σ_{j=1}^{K} π_j N(x_n | μ_j, Σ_j)        (6.34)
Having obtained a fixed γ(z_nk), we maximize
Σ_{n=1}^{N} Σ_{k=1}^{K} γ(z_nk) { ln π_k + ln N(x_n | μ_k, Σ_k) }. Since π_k does
not depend on μ_k and Σ_k, we can eliminate that term when calculating μ_k and
Σ_k, i.e., we maximize Σ_{n=1}^{N} Σ_{k=1}^{K} γ(z_nk) ln N(x_n | μ_k, Σ_k):
Σ_{n=1}^{N} Σ_{k=1}^{K} γ(z_nk) ln N(x_n | μ_k, Σ_k)        (6.35)

In the one-dimensional case this becomes

Σ_{n=1}^{N} γ(z_nk) ln { (1/√(2π Σ_k)) exp( −(x_n − μ_k)² / (2Σ_k) ) }        (6.36)

Σ_{n=1}^{N} Σ_{k=1}^{K} γ(z_nk) { −ln(√(2π Σ_k)) − (x_n − μ_k)² / (2Σ_k) }        (6.37)

Setting the derivative with respect to Σ_k to zero gives

−(1/2)(1/Σ_k) Σ_{n=1}^{N} γ(z_nk) − Σ_{n=1}^{N} γ(z_nk) (−1) (x_n − μ_k)² / (2Σ_k²) = 0        (6.43)

⟹  Σ_k = Σ_{n=1}^{N} γ(z_nk) (x_n − μ_k)² / Σ_{n=1}^{N} γ(z_nk)        (6.44)
The π_k calculation should include the constraint Σ_{k=1}^{K} π_k = 1. Also, we can
eliminate the terms corresponding to μ_k and Σ_k, as they are independent of π_k.
The Lagrangian is computed as:
L = Σ_{n=1}^{N} Σ_{k=1}^{K} γ(z_nk) ln π_k + λ ( 1 − Σ_{k=1}^{K} π_k )        (6.45)

∂L/∂π_k = Σ_{n=1}^{N} γ(z_nk) / π_k − λ = 0        (6.46)

⟹  λ = Σ_{n=1}^{N} γ(z_nk) / π_k  ⟹  π_k = Σ_{n=1}^{N} γ(z_nk) / λ        (6.47)
Since Σ_{k=1}^{K} π_k = 1, the following holds:

Σ_{k=1}^{K} π_k = (1/λ) Σ_{k=1}^{K} Σ_{n=1}^{N} γ(z_nk) = N/λ = 1  ⟹  π_k = Σ_{n=1}^{N} γ(z_nk) / N        (6.48)
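Putting the E step (the responsibilities γ(z_nk)) and the M step (the μ_k, Σ_k, and π_k updates derived above) together gives a compact EM loop. The sketch below uses a synthetic one-dimensional two-component mixture; all numbers are illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)
# synthetic 1-D mixture: 300 points from N(-2, 0.5^2), 200 from N(2, 0.8^2)
x = np.concatenate([rng.normal(-2.0, 0.5, 300), rng.normal(2.0, 0.8, 200)])

K = 2
pi = np.full(K, 1.0 / K)                 # mixing weights pi_k
mu = np.array([-1.0, 1.0])               # initial means
var = np.ones(K)                         # initial variances Sigma_k

def gauss(x, m, v):
    return np.exp(-0.5 * (x - m) ** 2 / v) / np.sqrt(2.0 * np.pi * v)

for _ in range(100):
    # E step: responsibilities gamma(z_nk), Eq. (6.34)
    num = pi * gauss(x[:, None], mu, var)          # shape (N, K)
    gamma = num / num.sum(axis=1, keepdims=True)
    # M step: updates for mu_k, Sigma_k (Eq. 6.44), and pi_k (Eq. 6.48)
    Nk = gamma.sum(axis=0)
    mu = (gamma * x[:, None]).sum(axis=0) / Nk
    var = (gamma * (x[:, None] - mu) ** 2).sum(axis=0) / Nk
    pi = Nk / len(x)
print(mu, var, pi)
```

On well-separated components such as these, the loop recovers the generating means, variances, and mixing proportions; in practice one would monitor the log likelihood for convergence rather than run a fixed number of iterations.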
where σ_i is the variance of the Gaussian distribution that is centered on the data
point x_i. The resulting distributions give a representation of local similarities among
the data points. These distributions are influenced by the perplexity parameter, which
controls the distribution variance.
In the step above, we modeled a Gaussian distribution on each of the data points;
doing the same in the lower-dimensional space would give the standard stochastic
neighbour embedding (SNE) technique. In this algorithm, however, a Student
t-distribution with one degree of freedom, also known as the Cauchy distribution, is
used. This distribution distinguishes the method from other local techniques for
multi-dimensional mapping, in which the area covered by nearby data points does not
leave space in the two-dimensional map to accommodate distant data points: the area
of the two-dimensional map available to accommodate moderately distant data points
is not nearly large enough compared with the area available for nearby data points.
The heavy-tailed distribution alleviates this crowding while preserving local
distances. As in the first step, this distribution is again centered on data point x_i, the
density of every other data point x_j is calculated, and the distribution is normalized,
yielding q_{j|i}.
This Student t-distribution with a single degree of freedom is helpful because the
function (1 + ||y_i − y_j||²)^{-1} closely follows an inverse square law for large
pairwise distances in the lower-dimensional space. Hence, for large distances, the
mapped distances barely change, in contrast to the Gaussian distribution; far-apart
clusters thus interact on the same order as close individual points. Computationally,
this is also far less costly than the exponential function.
We have calculated the probability distributions in the higher-dimensional space
using a Gaussian distribution and in the lower-dimensional space using a Student
t-distribution with one degree of freedom. The next objective is to minimize the
difference between these distributions so that the data points in the two maps are as
similar as possible; this makes the distribution in the lower-dimensional space mirror
that of the higher-dimensional space. The Kullback–Leibler divergence measures
how faithfully q_{j|i} models p_{j|i}, and is thus a good choice for
the cost function, which can be written mathematically as in the following equation
C = Σ_i KL(P_i || Q_i) = Σ_i Σ_j p_ij log( p_ij / q_ij )        (6.51)

p_ij = ( p_{i|j} + p_{j|i} ) / 2        (6.52)
This cost function measures the information lost when the probability distribution
q_ij is used to approximate the distribution p_ij. The gradient of the cost function
causes points that are dissimilar, yet represented by small pairwise distances in the
low-dimensional space, to move away from each other, so that dissimilar points end
up far apart in the lower-dimensional map.
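Given the affinities, this cost can be evaluated directly. The sketch below builds Student-t affinities q_ij from an arbitrary 2-D embedding and computes the KL cost against a randomly generated, purely illustrative symmetric p_ij:

```python
import numpy as np

rng = np.random.default_rng(0)
Y = rng.normal(size=(6, 2))              # a toy 2-D embedding
P = rng.random((6, 6))
np.fill_diagonal(P, 0.0)
P = P + P.T
P /= P.sum()                             # symmetric p_ij summing to 1

# Student-t (one degree of freedom) affinities q_ij
d2 = ((Y[:, None, :] - Y[None, :, :]) ** 2).sum(axis=-1)
W = 1.0 / (1.0 + d2)
np.fill_diagonal(W, 0.0)
Q = W / W.sum()

# KL cost C = sum_ij p_ij * log(p_ij / q_ij), Eq. (6.51)
mask = P > 0
C = np.sum(P[mask] * np.log(P[mask] / Q[mask]))
print(C)
```

t-SNE minimizes C by gradient descent on the coordinates Y; here C is only evaluated once for a fixed embedding.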
Figure 6.3 shows the t-SNE embedding in two dimensions for sodium silicate and
calcium aluminosilicate glasses. Code Snippet 6.3 can be used to perform the t-SNE
embedding.
Note: The authors recommend that readers run the code with different values of the
hyperparameters associated with t-SNE.
""" TSNE
"""
# i m p o r t m a t p l o l i b for p l o t t i n g data
import matplotlib
from s h a d o w . font i m p o r t s e t _ f o n t _ f a m i l y , s e t _ f o n t _ s i z e
m a t p l o t l i b . use ( ' agg ')
i m p o r t m a t p l o t l i b . p y p l o t as plt
from m a t p l o t l i b . p y p l o t i m p o r t s a v e f i g
# i m p o r t n u m p y and s k l e a r n
i m p o r t n u m p y as np
i m p o r t p a n d a s as pd
from s k l e a r n . m a n i f o l d i m p o r t TSNE
# Load sa m p l e d a t a s e t
data = pd . r e a d _ c s v ( " data / e l a s t i c _ m o d u l u s . csv " ) . s o r t _ v a l u e s ( by = " Na2O " )
p r i n t ( data . c o l u m n s )
X = 1 * data
# e n c o d e d a t a in d i f f e r e n t c l a s s e s
for ind , i in e n u m e r a t e ( [ 1 , 2 ,4 , 8 ] ) :
X [ f " c { ind } " ] = i * ( X . v a l u e s [ : , ind ] > 0 . 0 )
X [ " code " ] = X [ " c0 " ] + X [ " c1 " ] + X [ " c2 " ] + X [ " c3 " ]
# this will yield 7 ( CaO - Al2O3 - SiO2 ) and 12 ( Na2O - SiO2 )
u n i q u e s = np . sort ( np . u n i q u e ( X [ " code " ] ) )
tsne = TSNE ( n _ c o m p o n e n t s =2 , v e r b o s e = 0 , p e r p l e x i t y = 25 , n _ i t e r = 300 )
t s n e _ r e s u l t s = tsne . f i t _ t r a n s f o r m ( data . v a l u e s )
m a s k 1 = X [ " code " ] . v a l u e s = = u n i q u e s [ 0 ]
plt . s c a t t e r ( t s n e _ r e s u l t s [ mask1 , 0 ] , t s n e _ r e s u l t s [ mask1 , 1 ] , s = 60 , ec = " k " ,
c o l o r = " r " , l a b e l = " CaO - Al2O3 - SiO2 " )
m a s k 2 = X [ " code " ] . v a l u e s = = u n i q u e s [ 1 ]
plt . s c a t t e r ( t s n e _ r e s u l t s [ mask2 , 0 ] , t s n e _ r e s u l t s [ mask2 , 1 ] , s = 60 , ec = " k " ,
c o l o r = " b " , l a b e l = " Na2O - SiO2 " )
plt . l e g e n d ()
plt . x l a b e l ( " tsne - dimention - 1 " )
plt . y l a b e l ( " tsne - dimention - 2 " )
s a v e f i g ( " tsne . png " )
p r i n t ( " End of o u t p u t " )
Code snippet 6.3: t-SNE embedding in two dimensions for a dataset comprising
sodium silicate and calcium aluminosilicate glasses
6.6 Summary
deeper understanding of the underlying structures and relationships within the data.
It is important to note that unsupervised machine learning methods are not stand-
alone solutions but rather tools that complement domain knowledge and human
expertise. The interpretation of results requires careful consideration and valida-
tion through experimentation. In conclusion, unsupervised machine learning tech-
niques offer a powerful framework for data analysis and exploration. They enable
researchers to uncover hidden patterns, simplify complex datasets, and gain valuable
insights. Whether in materials science or other domains, the integration of unsuper-
vised machine learning methods enhances our understanding and drives advance-
ments in various fields of research and knowledge discovery.
Chapter 7
Model Refinement
7.1 Introduction
Model refinement is a critical step in machine learning that aims to improve the per-
formance and generalization ability of predictive models. While initial model training
may provide a starting point, further fine-tuning and optimization are often neces-
sary to enhance the model’s accuracy, robustness, and interpretability. This chapter
focuses on two key aspects of model refinement: the use of regularizers, including
Lasso, Ridge, and Elastic Net, and hyperparameter optimization techniques.
Regularization techniques play a vital role in mitigating overfitting and improving
the generalization of machine learning models. Overfitting occurs when a model
becomes overly complex, capturing noise or idiosyncrasies in the training data that
do not generalize well to unseen data. Regularizers address this issue by introducing
© Springer Nature Switzerland AG 2024 131
N. M. A. Krishnan et al., Machine Learning for Materials Discovery,
Machine Intelligence for Materials Science,
https://doi.org/10.1007/978-3-031-44622-1_7
One key point to be ensured while training an ML model is circumventing
overfitting. Overfitting happens mainly when the model parameters are fitted to noisy
data: instead of learning the true signal, the model follows the noise, leading to
incorrect predictions. It also inflates the number of parameters in the ML model.
Regularization is an approach that regularizes, or shrinks, the coefficient estimates
towards zero by imposing constraints on the range of the model parameters. This
essentially avoids learning unnecessarily complex models that fit the noise, thereby
preventing the risk of overfitting.
A model with a larger number of variables is also difficult to interpret physically. For
instance, suppose we are modelling a property of a product (y) as a function of
composition 1 (x1), composition 2 (x2), composition 3 (x3), composition 4 (x4), and
composition 5 (x5), i.e., y = β0 + β1x1 + β2x2 + β3x3 + β4x4 + β5x5. If all the
coefficients β0, ..., β5 are active, we have no clue which variable dominates the
property. If only one or two variables remain active after modeling, say x1 and x2,
interpretation becomes far easier. Ridge regression estimates the coefficients as
β̂^ridge = arg min_β { Σ_{i=1}^{n} ( y_i − β_0 − Σ_{j=1}^{p} β_j x_ij )² + λ Σ_{j=1}^{p} β_j² }        (7.1)
Here λ ≥ 0 is a complexity parameter that controls the amount of shrinkage: the
larger the value of λ, the greater the shrinkage. The coefficients are shrunk toward
zero (and toward each other). When there are many correlated variables in a linear
regression model, their coefficients can become poorly determined and exhibit high
variance: a wildly large positive coefficient on one variable can be cancelled by a
similarly large negative coefficient on a correlated variable. Imposing a size
constraint on the coefficients alleviates this problem. An equivalent formulation,
which makes the size constraint on the parameters explicit, is:
β̂^ridge = arg min_β Σ_{i=1}^{n} ( y_i − β_0 − Σ_{j=1}^{p} β_j x_ij )²        (7.2)

subject to Σ_{j=1}^{p} β_j² ≤ t        (7.3)
Notice that the intercept β_0 has been left out of the penalty term, to avoid making
the procedure depend on the origin chosen for y. The ridge solution adds a positive
constant to the diagonal of X^T X before inversion. This makes the problem
non-singular even if X^T X is not of full rank, which was the main motivation for
ridge regression.
Code Snippet 7.1 can be used to perform ridge regression.
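The diagonal-constant solution described above can also be sketched in closed form, β̂ = (XᵀX + λI)⁻¹Xᵀy, on synthetic centered data (the coefficients and noise level below are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))            # centered design matrix (no intercept)
beta_true = np.array([1.0, -2.0, 0.5])
y = X @ beta_true + 0.01 * rng.normal(size=100)

lam = 0.1
# ridge solution: (X^T X + lambda * I)^{-1} X^T y
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)
print(beta_ridge)
```

For small λ the estimate is close to ordinary least squares; increasing λ shrinks all coefficients toward zero, which is what linear_model.Ridge(alpha=lam) with fit_intercept=False computes.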
Note: The authors recommend that readers run the code with different values of the
hyperparameters associated with ridge regression.
""" R i d g e R e g r e s s i o n
"""
# i m p o r t m a t p l o l i b for p l o t t i n g data
import matplotlib
m a t p l o t l i b . use ( ' agg ')
i m p o r t m a t p l o t l i b . p y p l o t as plt
from m a t p l o t l i b . p y p l o t i m p o r t s a v e f i g
# i m p o r t n u m p y and s k l e a r n
i m p o r t p a n d a s as pd
# Load sa m p l e d a t a s e t
data = pd . r e a d _ c s v ( " data / f u l l _ d e n . csv " ) . s o r t _ v a l u e s ( by = " Na2O " )
p r i n t ( data . c o l u m n s )
X = data [ data . c o l u m n s [ : - 1 ] ] . v a l u e s
y = data . v a l u e s [ : , - 1 : ]
from s k l e a r n i m p o r t l i n e a r _ m o d e l
clf = l i n e a r _ m o d e l . R i d g e ( a l p h a = 0 . 1 )
clf . fit ( X , y )
p r i n t ( " C o e f f i c i e n t : " , clf . c o e f _ )
p r i n t ( " I n t e r c e p t : " , clf . i n t e r c e p t _ )
p r i n t ( " End of o u t p u t " )
Output:
LASSO regression is a shrinkage method like ridge, with subtle but important
differences. It uses the following penalization for estimating the model parameters:
\hat{\beta}^{\mathrm{LASSO}} = \arg\min_{\beta} \sum_{i=1}^{n} \left( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \right)^2 \quad (7.5)

subject to

\sum_{j=1}^{p} |\beta_j| \le t \quad (7.6)
Equivalently, in the Lagrangian form,

\hat{\beta}^{\mathrm{LASSO}} = \arg\min_{\beta} \left\{ \sum_{i=1}^{n} \left( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \right)^2 + \lambda \sum_{j=1}^{p} |\beta_j| \right\} \quad (7.7)
Code Snippet 7.2 can be used to perform LASSO regression with regularization.
Note: We recommend that readers run the code with different values of the
hyperparameters associated with LASSO regression.
""" L a s s o R e g r e s s i o n
"""
# i m p o r t m a t p l o l i b for p l o t t i n g data
import matplotlib
m a t p l o t l i b . use ( ' agg ')
i m p o r t m a t p l o t l i b . p y p l o t as plt
from m a t p l o t l i b . p y p l o t i m p o r t s a v e f i g
# i m p o r t n u m p y and s k l e a r n
i m p o r t p a n d a s as pd
# Load sa m p l e d a t a s e t
data = pd . r e a d _ c s v ( " data / f u l l _ d e n . csv " ) . s o r t _ v a l u e s ( by = " Na2O " )
p r i n t ( data . c o l u m n s )
X = data [ data . c o l u m n s [ : - 1 ] ] . v a l u e s
y = data . v a l u e s [ : , - 1 : ]
from s k l e a r n i m p o r t l i n e a r _ m o d e l
clf = l i n e a r _ m o d e l . L a s s o ( a l p h a = 0 . 1 )
clf . fit ( X , y )
p r i n t ( " C o e f f i c i e n t : " , clf . c o e f _ )
p r i n t ( " I n t e r c e p t : " , clf . i n t e r c e p t _ )
p r i n t ( " End of o u t p u t " )
Output:
Here, the penalty is of the $L_1$ form, and the shrinkage is much stronger than in ridge
regression. This constraint makes the solutions non-linear in the $y_i$, and there
is no closed-form expression as in ridge regression. Computing the LASSO solution is
a quadratic programming problem. The generalized form of LASSO regression is
\tilde{\beta} = \arg\min_{\beta} \left\{ \sum_{i=1}^{n} \left( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \right)^2 + \lambda \sum_{j=1}^{p} |\beta_j|^q \right\} \quad (7.8)
Here, the value $q = 0$ corresponds to variable subset selection, as the penalty simply
counts the number of nonzero parameters; $q = 1$ corresponds to the
LASSO, and $q = 2$ corresponds to ridge regression. We can also optimize $q$ through
hyperparameter optimization to obtain the best shrinkage.
In ridge and LASSO regression, the residual sum of squares has elliptical contours,
centered at the full least-squares estimate. The ridge constraint produces a
disk-shaped region $x_1^2 + x_2^2 \le t$, whereas the LASSO constraint
$|x_1| + |x_2| \le t$ produces a diamond-shaped region. In constrained optimization, the solution
lies at the active constraints (on the boundary of the constraint surface). Unlike
the disk, the diamond has corners; if the solution occurs at a corner, one
parameter $\beta_j$ becomes exactly zero. Thus, the shrinkage in LASSO is greater than in ridge.
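This contrast between the two penalties can be observed numerically. The sketch below uses an invented toy dataset (not from the text) in which only two of five features matter, and counts how many coefficients each method sets to zero:

```python
# A minimal sketch contrasting the shrinkage behaviour of ridge and
# LASSO: LASSO drives irrelevant coefficients exactly to zero, ridge
# only shrinks them towards zero.
import numpy as np
from sklearn.linear_model import Ridge, Lasso

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 5))
# Only the first two features carry signal in this toy problem
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=100)

ridge = Ridge(alpha=1.0).fit(X, y)
lasso = Lasso(alpha=0.5).fit(X, y)
print("ridge zero coefficients:", np.sum(np.isclose(ridge.coef_, 0.0)))
print("lasso zero coefficients:", np.sum(np.isclose(lasso.coef_, 0.0)))
```

Running this, LASSO typically zeros out the noise features entirely, while ridge leaves them small but nonzero, illustrating the corner-of-the-diamond argument above.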
The elastic net is a regularized regression method that linearly combines the $L_1$ and
$L_2$ penalties of the LASSO and ridge methods. Elastic-net regression has the following
form:
\hat{\beta}^{\mathrm{elastic}} = \arg\min_{\beta} \left\{ \sum_{i=1}^{n} \left( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \right)^2 + \lambda_1 \sum_{j=1}^{p} |\beta_j| + \lambda_2 \sum_{j=1}^{p} \beta_j^2 \right\} \quad (7.9)
Often, $\lambda_1$ and $\lambda_2$ are related as $\lambda_2 = 1 - \lambda_1$. Hence, elastic-net regression
includes both LASSO and ridge regression as special cases: when $\lambda_1 = \lambda$ and $\lambda_2 = 0$,
elastic-net regression behaves as LASSO regression, while when $\lambda_2 = \lambda$ and $\lambda_1 = 0$,
it becomes ridge regression.
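In scikit-learn's `ElasticNet`, the two penalty weights are folded into the parameters `alpha` (the total penalty strength) and `l1_ratio` (the fraction assigned to the $L_1$ term); note that scikit-learn additionally scales the squared-error term by $1/(2n)$ and the $L_2$ term by $1/2$. A sketch of the mapping, with illustrative values:

```python
# Mapping (lambda1, lambda2) onto scikit-learn's ElasticNet parameters:
# alpha = lambda1 + lambda2, l1_ratio = lambda1 / (lambda1 + lambda2).
from sklearn.linear_model import ElasticNet

lam1, lam2 = 0.05, 0.05
alpha = lam1 + lam2
l1_ratio = lam1 / (lam1 + lam2)  # 1.0 recovers LASSO, 0.0 recovers ridge
model = ElasticNet(alpha=alpha, l1_ratio=l1_ratio)
print(model)
```

With `l1_ratio=0.5` as here, the two penalties are weighted equally.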
Code Snippet 7.3 can be used to perform elastic net regression with regularization.
Note: We recommend that readers run the code with different values of the
hyperparameters associated with elastic net regression.
""" E l a s t i c Net R e g r e s s i o n
"""
# i m p o r t m a t p l o l i b for p l o t t i n g data
import matplotlib
m a t p l o t l i b . use ( ' agg ')
i m p o r t m a t p l o t l i b . p y p l o t as plt
from m a t p l o t l i b . p y p l o t i m p o r t s a v e f i g
# i m p o r t n u m p y and s k l e a r n
i m p o r t p a n d a s as pd
# Load sa m p l e d a t a s e t
data = pd . r e a d _ c s v ( " data / f u l l _ d e n . csv " ) . s o r t _ v a l u e s ( by = " Na2O " )
p r i n t ( data . c o l u m n s )
X = data [ data . c o l u m n s [ : - 1 ] ] . v a l u e s
y = data . v a l u e s [ : , - 1 : ]
from s k l e a r n i m p o r t l i n e a r _ m o d e l
clf = l i n e a r _ m o d e l . E l a s t i c N e t ( a l p h a = 0 . 1 , l 1 _ r a t i o = 0 . 5 )
clf . fit ( X , y )
p r i n t ( " C o e f f i c i e n t : " , clf . c o e f _ )
p r i n t ( " I n t e r c e p t : " , clf . i n t e r c e p t _ )
p r i n t ( " End of o u t p u t " )
Output:
Code snippet 7.4: XGBoost with grid search and cross validation
the model’s performance. It helps mitigate the impact of dataset variability and
randomness.
• Efficient Use of Data: In traditional train-test splits, a portion of the data is set aside
for testing, which reduces the amount of data available for training. K-fold cross-
validation ensures that all data points are used for both training and validation,
maximizing the utilization of the available data.
• Model Selection and Hyperparameter Tuning: K-fold cross-validation is com-
monly used for model selection and hyperparameter tuning. It allows for comparing
different models or hyperparameter settings based on their average performance
across multiple folds, enabling more informed decisions. These aspects are dis-
cussed in detail in the next section.
• Identifying Overfitting: By evaluating the model on multiple validation sets,
k-fold cross-validation helps detect overfitting. If a model performs significantly
better on the training set compared to the validation sets, it indicates that the model
may be overfitting the training data.
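The points above can be sketched with scikit-learn's cross-validation utilities. The dataset and model below are illustrative (a synthetic regression problem, not one from the text):

```python
# A minimal sketch of 5-fold cross-validation: the model is fitted on
# four folds and scored on the held-out fold, rotating through all folds.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_val_score

X, y = make_regression(n_samples=100, n_features=5, noise=0.1, random_state=0)
kf = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(Ridge(alpha=0.1), X, y, cv=kf, scoring="r2")
print("fold R2 scores:", scores)
print("mean R2:", scores.mean())
```

The spread of the fold scores gives an estimate of the variability of the model's performance, and a large gap between training and validation scores would signal overfitting.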
7.4 Hyperparametric Optimization

During model development, different models are tested and hyperparameters are
tuned to obtain more reliable predictions. Hyperparameters are settings that control
a machine learning algorithm's behavior. They differ from model parameters in that
hyperparameters are set before training and supplied to the model, while parameters
are learned during the training phase. Thus, hyperparameters are configuration
settings that are not learned from the data but are set by the user before model
training. Examples of hyperparameters include the learning rate, regularization
strength, number of layers in a neural network, or the choice of a kernel in a
support vector machine. Hyperparameter optimization techniques aim to automatically
search for the combination of hyperparameters that maximizes the model's
performance. The performance of ML models is highly dependent on the choice of
hyperparameters. For example, a typical soft-margin SVM classifier equipped with an
RBF kernel has at least two hyperparameters that need to be tuned for good
performance on unseen data: the regularization constant and a kernel hyperparameter.
There are various approaches to hyperparameter optimization, namely, (i) random
search, (ii) grid search, and (iii) Bayesian optimization.
metric is defined to perform the optimization. Afterward, this criterion can be applied
via cross-validation on the training set or on a validation set. A poorly chosen search
space for the hyperparameters may prevent grid search from finding a good configuration.
The grid search approach also suffers heavily from the curse of dimensionality,
although, since the hyperparameter settings it evaluates are typically independent of
each other, it is easy to parallelize.
Code Snippet 7.5 can be used to perform grid search for hyperparametric
optimization of XGBoost models. Similarly, Code Snippet 7.6 can be used to perform
random search for hyperparametric optimization of XGBoost models. Note that the
same code can be applied to any other model as well.
Note: We recommend that readers run the code with different regression
models for hyperparametric optimization using grid search and random search.
""" X G B o o s t r e g r e s s i o n with h y p e r p a r a m e t e r g r i d s e a r c h
"""
# i m p o r t m a t p l o l i b for p l o t t i n g data
import matplotlib
m a t p l o t l i b . use ( ' agg ')
i m p o r t m a t p l o t l i b . p y p l o t as plt
from m a t p l o t l i b . p y p l o t i m p o r t s a v e f i g
# import
i m p o r t n u m p y as np
i m p o r t p a n d a s as pd
from x g b o o s t i m p o r t X G B R e g r e s s o r
from s k l e a r n . m o d e l _ s e l e c t i o n i m p o r t t r a i n _ t e s t _ s p l i t
from s k l e a r n . m e t r i c s i m p o r t m e a n _ s q u a r e d _ e r r o r , r 2 _ s c o r e
import optuna
# Load sa m p l e d a t a s e t
data = pd . r e a d _ c s v ( " data / N S _ d e n . csv " ) . s o r t _ v a l u e s ( by = " Na2O " )
p r i n t ( data . c o l u m n s )
X = data [ data . c o l u m n s [ : - 1 ] ] . v a l u e s
y = data [ " D e n s i t y ( g / cm3 ) " ] . v a l u e s . r e s h a p e ( -1 , 1 )
# Split dataset
X_train , X_test , y_train , y _ t e s t = t r a i n _ t e s t _ s p l i t ( X , y , t e s t _ s i z e = 0 . 3
, r a n d o m _ s t a t e = 2020 )
o r d e r = np . a r g s o r t ( X _ t e s t [ : , 0 ] )
def o b j e c t i v e ( t r i a l ) :
d = trial . suggest_int ( ' depth ', 2 , 8)
n = t r i a l . s u g g e s t _ i n t ( ' e s t i m a t o r s ' , 2 , 512 )
# Fit r e g r e s s i o n m o d e l
regr = X G B R e g r e s s o r ( m a x _ d e p t h = d , n _ e s t i m a t o r s = n )
regr . fit ( X_train , y _ t r a i n )
y _ p r e d _ t e s t = regr . p r e d i c t ( X _ t e s t )
r e t u r n m e a n _ s q u a r e d _ e r r o r ( y_test , y _ p r e d _ t e s t )
search_space = {
' d e p t h ': [2 , 4 , 6 , 8 ] ,
' e s t i m a t o r s ' : [ 2 , 16 , 128 , 512 ]
}
study = optuna . create_study ( sampler = optuna . samplers . GridSampler (
search_space ))
s t u d y . o p t i m i z e ( objective , n _ t r i a l s = 4 * 4 )
p r i n t ( " best p a r a m s : " , s t u d y . b e s t _ p a r a m s ) # Get best p a r a m e t e r s for
the o b j e c t i v e f u n c t i o n .
p r i n t ( " best value : " , s t u d y . b e s t _ v a l u e ) # Get best o b j e c t i v e v a l u e .
p r i n t ( " End of o u t p u t " )
Output:
""" X G B o o s t r e g r e s s i o n with h y p e r p a r a m e t e r r a n d o m s e a r c h
"""
# i m p o r t m a t p l o l i b for p l o t t i n g data
import matplotlib
m a t p l o t l i b . use ( ' agg ')
i m p o r t m a t p l o t l i b . p y p l o t as plt
from m a t p l o t l i b . p y p l o t i m p o r t s a v e f i g
# import
i m p o r t n u m p y as np
i m p o r t p a n d a s as pd
from x g b o o s t i m p o r t X G B R e g r e s s o r
from s k l e a r n . m o d e l _ s e l e c t i o n i m p o r t t r a i n _ t e s t _ s p l i t
from s k l e a r n . m e t r i c s i m p o r t m e a n _ s q u a r e d _ e r r o r , r 2 _ s c o r e
import optuna
# Load sa m p l e d a t a s e t
data = pd . r e a d _ c s v ( " data / N S _ d e n . csv " ) . s o r t _ v a l u e s ( by = " Na2O " )
p r i n t ( data . c o l u m n s )
X = data [ data . c o l u m n s [ : - 1 ] ] . v a l u e s
y = data [ " D e n s i t y ( g / cm3 ) " ] . v a l u e s . r e s h a p e ( -1 , 1 )
# Split dataset
X_train , X_test , y_train , y _ t e s t = t r a i n _ t e s t _ s p l i t ( X , y , t e s t _ s i z e = 0 . 3
, r a n d o m _ s t a t e = 2020 )
o r d e r = np . a r g s o r t ( X _ t e s t [ : , 0 ] )
def o b j e c t i v e ( t r i a l ) :
d = trial . suggest_int ( ' depth ', 2 , 8)
n = t r i a l . s u g g e s t _ i n t ( ' e s t i m a t o r s ' , 2 , 512 )
# Fit r e g r e s s i o n m o d e l
regr = X G B R e g r e s s o r ( m a x _ d e p t h = d , n _ e s t i m a t o r s = n )
regr . fit ( X_train , y _ t r a i n )
y _ p r e d _ t e s t = regr . p r e d i c t ( X _ t e s t )
r e t u r n m e a n _ s q u a r e d _ e r r o r ( y_test , y _ p r e d _ t e s t )
s t u d y = o p t u n a . c r e a t e _ s t u d y ( s a m p l e r = o p t u n a . s a m p l e r s . R a n d o m S a m p l e r ( seed =
2020 ) )
s t u d y . o p t i m i z e ( objective , n _ t r i a l s = 3 * 3 )
p r i n t ( " best p a r a m s : " , s t u d y . b e s t _ p a r a m s ) # Get best p a r a m e t e r s for
the o b j e c t i v e f u n c t i o n .
p r i n t ( " best value : " , s t u d y . b e s t _ v a l u e ) # Get best o b j e c t i v e v a l u e .
p r i n t ( " End of o u t p u t " )
Output:
7.5 Summary
References
1. C.M. Bishop, Pattern Recognition and Machine Learning (Information Science and Statistics)
(Springer, Berlin, Heidelberg, 2006). isbn: 0387310738
2. K.P. Murphy, Machine Learning: A Probabilistic Perspective (The MIT Press, 2012). isbn:
0262018020
3. T.M. Mitchell, Machine Learning, 1st ed. (McGraw-Hill, Inc., USA, 1997). isbn: 0070428077
4. R.O. Duda, P.E. Hart et al., Pattern Classification (Wiley, 2006)
5. J. Friedman, T. Hastie, R. Tibshirani, The Elements of Statistical Learning, Springer Series
in Statistics, vol. 1 (Springer, New York, 2001)
6. B. Schölkopf, A.J. Smola, F. Bach et al., Learning with Kernels: Support Vector Machines,
Regularization, Optimization, and Beyond (MIT Press, 2002)
Chapter 8
Deep Learning
8.1 Introduction
terns and relationships directly from the data, leading to improved performance and
adaptability across various domains.
Compared to classical machine learning algorithms, deep learning models offer
several distinct advantages. One key advantage lies in their ability to handle large
and complex datasets. Deep learning models excel at scaling with data due to their
capacity to leverage parallel computing architectures and exploit vast amounts of
training examples. This scalability empowers deep learning models to process mas-
sive datasets efficiently, enabling them to capture intricate patterns that may be chal-
lenging for classical algorithms.
Furthermore, deep learning models can perform end-to-end learning, eliminating
the need for manual feature engineering. While classical machine learning requires
domain expertise to carefully design and select relevant features, deep learning mod-
els can learn hierarchical representations directly from raw data. This end-to-end
learning approach simplifies the development process and makes deep learning mod-
els more adaptable to diverse problem domains.
Throughout this chapter, we will delve into advanced deep learning models,
including Convolutional Neural Networks (CNNs) for computer vision, Long Short-
Term Memory networks (LSTMs) for sequential data analysis, Generative Adversar-
ial Networks (GANs) for generative modelling, Graph Neural Networks (GNNs) for
graph-structured data, Variational Autoencoders (VAEs) for generative modelling
and representation learning, and Reinforcement Learning (RL) for sequential deci-
sion making. By exploring these models in detail, we aim to provide readers with
a comprehensive understanding of the underlying principles, architectures, training
methodologies, and practical applications of deep learning, enabling them to leverage
the power of these advanced techniques in their own machine learning endeavours.
It should be noted that the chapter is not exhaustive and is only prescriptive in
nature. We only aim to provide a brief overview of different widely used architectural
frameworks and their potential applications. Readers are encouraged to read the
original research papers and textbooks on each of these subject matters for the in-
depth understanding of the mathematical operations, and theoretical guarantees, if
any, for each of the approaches discussed [1–3].
8.2 Convolutional Neural Networks

The Convolutional Neural Network (CNN) is a deep learning architecture that can
process input images, highlight different aspects of an image, and distinguish one
image from another. A CNN can successfully capture an image's spatial and temporal
dependencies by applying relevant filters or kernels through the convolution
operation (Fig. 8.1).

Fig. 8.1 CNN schematic (adapted from [4] with permission)

Like any neural network architecture, a CNN also consists of an input layer,
hidden layers, and an output layer. In a CNN, the hidden layers include layers that
perform convolutions. The convolution operation is a component-wise inner product of
two matrices, treated as though they were vectors. In a CNN, convolution filters or
kernels perform the convolution operation on the input data frame. The output is then
passed through an activation function, commonly ReLU for CNNs. A feature map is
generated as the convolution kernel convolves along the input matrix of the layer.
This is followed by other layers such as pooling layers, fully connected layers, and
normalization layers.
The convolution filter moves from left to right across an image with a fixed stride value
until the full width is parsed. These operations help extract key high-level features
from the image. Traditionally, the first convolution layer extracts low-level
features such as edges, color, and gradient orientation. With added layers, the
architecture can learn high-level features as well.
Like the convolution layer, the pooling layer is responsible for reducing the spatial
size of the convolved feature. This reduces the computational power required to
process the data through dimensionality reduction. In addition, it helps extract dominant
features that are invariant to rotation and translation, thus maintaining effective
training of the model. There are two types of pooling: max pooling and average
pooling. Max pooling returns the maximum value from the portion of the image covered
by the kernel. Average pooling, on the other hand, returns the average of all values
from the covered image portion.
This procedure is repeated in each layer until all the relevant features are extracted.
The flattened output is fed into a feed-forward neural network, and backpropagation
is applied in each training iteration to achieve the desired classification.
The convolutional layers are the core building blocks of CNNs and perform the
most important operations. The main steps of a CNN are as follows.
1. Convolution Operation: The convolutional layer applies filters or kernels to the
input image to extract local patterns and features. Each filter is a small matrix of
weights that is convolved with the input image. The convolution operation can be
defined as follows:
C(i, j) = \sum_{m} \sum_{n} I(i + m, j + n) \cdot K(m, n) \quad (8.1)

where $C(i, j)$ is the output feature map at position $(i, j)$, $I$ is the input image,
and $K$ is the filter.
2. Non-linear Activation: After the convolution operation, a non-linear activation
function is applied element-wise to introduce non-linearity into the network. The
most commonly used activation function in CNNs is the Rectified Linear Unit
(ReLU), defined as $f(x) = \max(0, x)$, where $x$ is the input.
3. Pooling: The pooling layer reduces the spatial dimensions of the feature maps,
effectively downsampling the information. It helps to reduce the computational
complexity and make the network more robust to small variations in the input.
The most commonly used pooling operation is max pooling, which extracts the
maximum value from each local region.
4. Fully Connected Layers: The fully connected layers are typically placed after the
convolutional and pooling layers. These layers connect every neuron from the
previous layer to the next layer, allowing the network to learn complex patterns
and make predictions. The output of the fully connected layers is usually fed into a
softmax layer for classification tasks, which produces the probability distribution
over different classes.
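Steps 1–3 above can be written out directly in NumPy. The sketch below is a literal transcription of Eq. (8.1) for a valid (no-padding, stride-1) convolution, followed by ReLU and 2×2 max pooling; the input arrays and function names are illustrative:

```python
# Eq. (8.1) as a nested sum, plus the ReLU and max-pooling steps.
import numpy as np

def conv2d(I, K):
    H, W = I.shape
    kh, kw = K.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            # C(i, j) = sum_m sum_n I(i+m, j+n) * K(m, n)
            out[i, j] = np.sum(I[i:i + kh, j:j + kw] * K)
    return out

def relu(x):
    return np.maximum(0, x)

def max_pool2d(x, size=2):
    H, W = x.shape
    # Group the array into size-by-size blocks and take the block maxima
    return x[:H - H % size, :W - W % size].reshape(
        H // size, size, W // size, size).max(axis=(1, 3))

I = np.arange(16, dtype=float).reshape(4, 4)
K = np.ones((2, 2))
fmap = conv2d(I, K)
print(fmap)                      # 3x3 feature map of 2x2 window sums
print(max_pool2d(relu(fmap)))    # downsampled feature map
```

Real CNN libraries implement the same operations with many filters per layer and highly optimized kernels, but the arithmetic per output element is exactly this.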
The overall architecture of a CNN typically consists of multiple convolutional
layers, interleaved with pooling layers, followed by one or more fully connected
layers. The layers are trained end-to-end using backpropagation and optimization
algorithms, such as stochastic gradient descent (SGD), to update the weights and
minimize the loss function. CNNs are commonly used for image classifications,
image segmentations, and predicting the global properties such as ionic conductivity
based on the microstructure of a material. Accordingly, CNNs are extremely use-
ful in materials domain to analyze images and identify the crystal structural, grain
boundaries or global properties.
8.3 Long Short-Term Memory Networks

Recurrent neural networks (RNNs) are a class of neural networks that allow previous
outputs to be used as inputs while maintaining hidden states. In RNNs, the outputs are
influenced not just by the weights applied to the inputs, as in a regular neural network,
but also by a "hidden" state vector representing the context based on prior inputs and
outputs. So, the same input could produce a different output depending on previous inputs
in the series. RNNs are recurrent due to the repeated application of the same
transformations to a series of inputs, producing a series of output vectors. In RNNs, in
addition to generating the output, the hidden state itself is updated based on the input.
Long short-term memory (LSTM) is an RNN architecture that has a cell, an input gate, an
output gate, and a forget gate [5]. The cell remembers values over arbitrary time
intervals, and the three gates regulate the flow of information into and out of the cell.
LSTM networks are well suited to modeling sequential and time-series data since they can
accommodate delayed-effect data points in the datasets. The role of each unit of the LSTM
is illustrated below (Fig. 8.2). The primary components of an LSTM are (1) the cell state
and (2) the gates. The cell state provides a continuous information flow to all components
in the cell. As the cell state information runs through an LSTM cell, it is either added
to or removed via gates. The gates are distinct neural network components that decide
which information is allowed on the cell state.
The input gate is used to update the cell state information. First, we pass the
previous hidden state and the current input to the sigmoid function. The hidden
state and current input are also fed into the tanh function. Afterward, both of these
outputs are multiplied together. These operations decide the extent of information to
be stored in the cell state. Next, the updated cell state is calculated. Initially, the cell
state is multiplied by the forget vector; this drops values in the cell state
wherever they are multiplied by values near zero. Then the output from the input gate is added
to update the cell state, resulting in the new cell state. The output gate decides what
the next hidden state from the current cell will be. First, the previous hidden state and the
current input are passed to the sigmoid function. Then the updated cell state is
fed into the tanh function. These two outputs are then multiplied together to decide
which information should be carried in the hidden state. The new cell state and new
hidden state are then passed on to the next cell. To summarize, the forget gate decides what
is relevant to keep from previous steps, the input gate decides what information is
pertinent to add from the current step, and the output gate determines what the next
hidden state should be.
Mathematically, the operations associated with each of these cells in LSTM can
be written as follows. These equations illustrate how an LSTM cell processes input
sequences and updates its memory cell state and hidden state.
1. Input Gate: The input gate determines how much new information should be
stored in the memory cell. It takes as input the current input vector, denoted as $x_t$,
and the previous hidden state, denoted as $h_{t-1}$. The input gate activation, denoted
as $i_t$, is computed as follows:

i_t = \sigma(W_{xi} x_t + W_{hi} h_{t-1} + b_i) \quad (8.2)

where $W_{xi}$, $W_{hi}$, and $b_i$ are the weight matrices and bias term associated with the
input gate.
2. Forget Gate: The forget gate determines how much information from the previous
memory cell state, denoted as $c_{t-1}$, should be forgotten. It takes the same inputs as
the input gate and computes the forget gate activation, denoted as $f_t$, as follows:

f_t = \sigma(W_{xf} x_t + W_{hf} h_{t-1} + b_f) \quad (8.3)
3. Update Memory Cell: The candidate update to the memory cell state, denoted as $\tilde{c}_t$, is
computed by applying the hyperbolic tangent activation function to the current
input and previous hidden state:

\tilde{c}_t = \tanh(W_{xc} x_t + W_{hc} h_{t-1} + b_c) \quad (8.4)
4. Memory Cell: The memory cell state, denoted as $c_t$, is updated by combining the
information from the input gate and the forget gate:

c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t \quad (8.5)

5. Output Gate: The output gate determines how much of the memory cell state is
exposed through the hidden state. It takes the same inputs as the other gates and
computes the output gate activation, denoted as $o_t$:

o_t = \sigma(W_{xo} x_t + W_{ho} h_{t-1} + b_o) \quad (8.6)
6. Hidden State: The hidden state is computed by applying the hyperbolic tangent
activation function to the updated memory cell state and multiplying it by the
output gate activation:

h_t = o_t \odot \tanh(c_t) \quad (8.7)
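One step of these gate equations can be transcribed directly into NumPy. The weight matrices below are randomly initialized placeholders, and the dimensions are chosen purely for illustration:

```python
# One LSTM cell step following the gate equations above; the weights
# here stand in for parameters that would normally be learned.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, W, b):
    i_t = sigmoid(W["xi"] @ x_t + W["hi"] @ h_prev + b["i"])      # input gate
    f_t = sigmoid(W["xf"] @ x_t + W["hf"] @ h_prev + b["f"])      # forget gate
    c_tilde = np.tanh(W["xc"] @ x_t + W["hc"] @ h_prev + b["c"])  # candidate
    c_t = f_t * c_prev + i_t * c_tilde                            # memory cell
    o_t = sigmoid(W["xo"] @ x_t + W["ho"] @ h_prev + b["o"])      # output gate
    h_t = o_t * np.tanh(c_t)                                      # hidden state
    return h_t, c_t

rng = np.random.default_rng(0)
nx, nh = 3, 4  # input and hidden sizes (illustrative)
W = {k: rng.normal(scale=0.1, size=(nh, nx if k[0] == "x" else nh))
     for k in ["xi", "hi", "xf", "hf", "xc", "hc", "xo", "ho"]}
b = {k: np.zeros(nh) for k in ["i", "f", "c", "o"]}
h, c = lstm_step(rng.normal(size=nx), np.zeros(nh), np.zeros(nh), W, b)
print(h.shape, c.shape)
```

Iterating `lstm_step` over a sequence, feeding each step's `h` and `c` into the next, reproduces how an LSTM processes time-series data.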
8.4 Generative Adversarial Networks

Machine learning models can be broadly classified into generative models and
discriminative models. The fundamental difference between them is that discriminative
models learn the (hard or soft) boundary between classes, while generative models learn
the distribution of the individual classes. A generative model is one that can generate
data, as it models both the features and the class, and it has the potential to
automatically learn the natural features of a dataset, whether categories,
dimensions, or something else entirely.

Let us consider a dataset with two variables $x$ and $y$. If we model the joint
distribution of both, $p(x, y)$, we can use this probability distribution to generate data
points, and the result is a generative model. In contrast, if we model $p(y|x)$, the
conditional probability of the target $y$ given the observable $x$, the model only
captures the (hard or soft) class boundary.
Generative adversarial networks, or GANs, are an approach to generative modeling
using deep learning methods that employs a discriminator to aid the generative
modeling step. GANs consist of two main components: a generator and a discriminator
(Fig. 8.3). The generator learns to produce synthetic data that resembles real
data, while the discriminator learns to distinguish between real and generated data.
This adversarial setup allows the generator to improve its ability to generate realistic
data by competing against the discriminator. Thus, GANs employ a new format for
training a generative model by posing the problem as a supervised learning problem
with two sub-models: the generator model, which we train to generate new
examples, and the discriminator model, which tries to classify examples as either real
(from the domain) or fake (generated). The two models are trained together in a
zero-sum game format, contesting each other, until the discriminator model is deceived
about half the time, meaning the generator model generates plausible examples.
The generator takes as input a random noise vector, denoted as $z$, and maps it to a
generated sample, denoted as $G(z)$. The goal of the generator is to generate samples
that are indistinguishable from real samples. On the other hand, the discriminator
takes as input a sample, either real ($x$) or generated ($G(z)$), and outputs a probability
($D(x)$ or $D(G(z))$) representing its belief about whether the sample is real or generated.
The training objective of GANs can be defined using the minimax game between
the generator and the discriminator:

\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{\mathrm{data}}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))]
8.5 Graph Neural Networks

Graph Neural Networks (GNNs) are a class of neural networks specifically designed
to process and analyze graph-structured data. They are particularly effective for
tasks involving relational data, such as social networks, molecular structures, citation
networks, and knowledge graphs. GNNs operate by propagating information through
the nodes and edges of a graph, allowing each node to aggregate and update its
representation based on the information from its neighbors (Fig. 8.4).
The key idea behind GNNs is to leverage the connectivity and relational structure
of graph data to enable effective information propagation and learning. By iteratively
updating node representations based on the information from their neighbors, GNNs
can capture complex dependencies and patterns in graph-structured data.
A graph is a data structure consisting of two elements: nodes (or vertices) and the
edges connecting them. The nodes of a graph can be homogeneous, with all nodes
having a similar structure, or heterogeneous, with nodes having different types of
structure. The edges of the graph can also be weighted to represent the importance
of each edge. Graph embedding is the mapping of a graph into a set of vectors that
capture the graph topology, node-to-node relationships, and other relevant information
about the graph. Each node has a unique set of embeddings, which determines its
identity in the graph.
In a typical GNN, the input consists of a graph with a set of nodes and edges.
Each node in the graph is associated with a feature vector representing its attributes,
and each edge contains information about the relationship between two connected
nodes. The goal of the GNN is to learn a function that maps these input features to
a desired output, such as node-level classification or prediction, or graph-level
prediction. Molecular structures can naturally be modeled as graphs, with atoms
representing the nodes and bonds representing the edges. GNNs can then be
used for several tasks, such as: (i) predicting the dynamics of individual atoms through
node-level displacement predictions, (ii) predicting an overall graph property,
such as the toxicity of a drug molecule, or (iii) serving as an interatomic potential by
predicting the potential energy of a node or edge as a function of the local structure.
The propagation step in GNNs is typically performed iteratively, allowing each
node to update its representation based on the features of its neighbors. This process
can be summarized as follows:
1. Initialization: Each node in the graph is assigned an initial feature vector, denoted
as $h_v^{(0)}$, where $v$ represents the node index.
2. Message Passing: During message passing, information is exchanged between
connected nodes. Each node aggregates the feature vectors of its neighbors and
combines them with its own feature vector to generate a new representation. This
aggregation step can be defined using a function, typically a neural network,
that takes the neighboring node features and produces a message for each edge.
For example, the message passed from node $u$ to node $v$ can be computed as
$m_{uv} = M(h_u^{(t)}, h_v^{(t)}, e_{uv})$, where $M$ is a function that combines the node features
$h_u^{(t)}$ and $h_v^{(t)}$, along with the edge feature $e_{uv}$.
3. Aggregation: After computing the messages, each node aggregates the received
messages to update its own representation. This aggregation step can be defined
as $h_v^{(t+1)} = U(h_v^{(t)}, \{m_{uv}\})$, where $U$ is a function that combines the node features
$h_v^{(t)}$ and the received messages $\{m_{uv}\}$.
4. Iteration: Steps 2 and 3 are repeated for a fixed number of iterations, allowing
the information to propagate through the graph. After each iteration, the repre-
sentations of the nodes are refined based on the updated information from their
neighbors.
5. Readout: Once the iterations are completed, the final node representations can
be used for various downstream tasks, such as node classification, graph classifi-
cation, or link prediction.
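One round of the message-passing and aggregation steps above can be sketched in NumPy. The graph, the mean aggregation for $M$, and the averaging combine function for $U$ are all simple illustrative choices, not a specific published architecture:

```python
# One round of message passing on a small undirected graph:
# messages = mean of neighbour features, update = average of own
# features and the aggregated messages.
import numpy as np

# Adjacency matrix of a 4-node graph
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
H = np.eye(4)  # initial one-hot node features h_v^(0)

deg = A.sum(axis=1, keepdims=True)   # node degrees
messages = A @ H / deg               # mean of neighbour features (M)
H_next = 0.5 * (H + messages)        # combine with own features (U)
print(H_next.round(2))
```

Repeating this update lets information from a node's k-hop neighborhood reach it after k rounds, which is how GNNs capture relational structure; learned architectures replace the fixed averaging here with parameterized neural networks.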
Two prominent GNN architectures are (1) graph convolutional networks (GCNs)
and (2) graph encoder networks (GENs). In GCNs, as in a CNN, a spatially moving filter
over the nodes extracts key features from the embeddings, which are further used
in a CNN framework. In GENs, an encoding layer downsamples the embedding
inputs by passing them through convolutional filters to provide a compact feature
representation, and a decoder upsamples the representation provided by the
encoder and reconstructs the input from it. Both of these architectures
and their variants can be used to learn embeddings and to predict embeddings
for unseen nodes.
8.6 Variational Autoencoders

Variational autoencoders (VAEs) are a class of generative neural network architectures
belonging to the family of Bayesian graphical models and variational Bayesian methods.
Unlike GANs, VAEs are explicit generative models, wherein the specifics of the
probabilistic distributions are incorporated into the network architecture. This further
aids in sampling from the output distribution of the network (Fig. 8.5).

There are two integral components of a VAE: (i) an encoder network and
(ii) a decoder network. The encoder neural network converts the input data $x$
into a latent representation $z$, conditioned on a suitable
attribute $a$. Hence, by performing high-level inference, the encoder compresses the
data, learning the lower-dimensional latent space distribution. The decoder
neural network takes the latent space input and attributes and regenerates the output
probability distribution. However, during this exercise, some amount of information
is irrecoverably lost. During the training of a VAE, this error is backpropagated
through the entire network to improve the reconstruction of the original inputs. Note
that the bottleneck layer of a VAE represents an information bottleneck, reducing
the dimensionality of the features. Thus, the bottleneck layer of the encoder
represents the input data in an extremely low-dimensional space. The performance of
the decoder can be used to evaluate how much of the variance in the information is
captured by the bottleneck layer. In this sense, VAEs can also be used as a
dimensionality reduction technique.
The variational autoencoder can be represented as a graphical model, where the joint probability can be expressed as $p(x, z) = p(x|z) p(z)$. This enables latent variables and data points to be sampled from the distributions $p(z)$ and $p(x|z)$, respectively. For inference, we need to compute the posterior distribution $p(z|x) = p(x|z) p(z)/p(x)$. To make this computation tractable, VAEs assume that samples of $z$ can be drawn from a simple approximate posterior distribution $q(z|x)$ that is similar to $p(z|x)$. VAEs use the Kullback-Leibler divergence to measure how far $q(z|x)$ is from $p(z|x)$. This is achieved through a loss function called the Evidence Lower BOund (ELBO) defined for the VAE. The ELBO is a lower bound on the evidence, and maximizing it increases the likelihood of observing the data under the assumed distribution.
Mathematically, the operations in VAEs can be defined as follows. Consider the input data as $x$ and the latent variable as $z$. The encoder network parameterizes the conditional distribution $q(z|x)$, which approximates the true posterior distribution $p(z|x)$. The encoder produces two vectors, the mean $\mu$ and the log-variance $\log(\sigma^2)$:

$\mu, \log(\sigma^2) = \mathrm{Encoder}(x)$   (8.9)

$\epsilon \sim \mathcal{N}(0, I)$   (8.10)

$z = \mu + \epsilon \cdot \sigma$   (8.11)

$\tilde{x} = \mathrm{Decoder}(z)$   (8.12)

$\mathcal{L}_{\mathrm{KL}} = -\frac{1}{2} \sum \left( 1 + \log(\sigma^2) - \mu^2 - \sigma^2 \right)$   (8.14)
The overall objective function for training a VAE is the sum of the reconstruction
loss and the KL divergence:
$\mathcal{L} = \mathcal{L}_{\mathrm{recon}} + \mathcal{L}_{\mathrm{KL}}$   (8.15)
During training, VAEs optimize this objective function using stochastic gradient
descent or a similar optimization algorithm. Once trained, the decoder network can
generate new samples by sampling latent variables from the prior distribution and
decoding them into output data samples. VAEs are widely used in materials science for synthetic data generation and also for dimensionality reduction. Once a VAE is trained, the encoder is used to reduce the dimensionality of the input features, which can then be used for downstream tasks such as classification or regression.
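The forward pass in Eqs. (8.9)-(8.12) and the KL loss of Eq. (8.14) can be sketched numerically. In this sketch the encoder is a hypothetical linear map rather than a trained network, purely to show the reparameterization trick; a practical implementation would use an autodiff framework such as PyTorch.

```python
# A numpy sketch of the VAE sampling path and KL loss (illustrative only).
import numpy as np

def encoder(x, W_mu, W_logvar):
    # Produces mu and log(sigma^2)                          (Eq. 8.9)
    return W_mu @ x, W_logvar @ x

def reparameterize(mu, logvar, rng):
    eps = rng.standard_normal(mu.shape)   # eps ~ N(0, I)   (Eq. 8.10)
    return mu + eps * np.exp(0.5 * logvar)  # z = mu + eps*sigma (Eq. 8.11)

def kl_loss(mu, logvar):
    # L_KL = -1/2 * sum(1 + log sigma^2 - mu^2 - sigma^2)   (Eq. 8.14)
    return -0.5 * np.sum(1 + logvar - mu**2 - np.exp(logvar))

rng = np.random.default_rng(0)
x = np.array([1.0, 2.0])
W_mu, W_logvar = np.eye(2), np.zeros((2, 2))  # hypothetical "encoder" weights
mu, logvar = encoder(x, W_mu, W_logvar)
z = reparameterize(mu, logvar, rng)
# With mu = x and log(sigma^2) = 0 (sigma = 1): L_KL = 0.5 * sum(mu^2) = 2.5
```

The reparameterization trick keeps the sampling step differentiable with respect to $\mu$ and $\sigma$, which is what allows the error to be backpropagated through the entire network.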
taken, the agent receives a reward from the environment and transitions to a new
state. This interaction between the agent and the environment takes place at discrete
time steps, $t = 0, 1, 2, 3, \ldots$. Specifically, at each time step $t$, the agent takes action $a_t$ on receiving state $s_t$ from the environment. In the next time step, the agent receives a scalar reward $r_t$ (for the action $a_t$ taken from state $s_t$) and finds itself in a new state $s_{t+1}$ (Fig. 8.6).
The agent uses the policy $\pi$ to interact with the environment to generate a sequence of states, actions, and rewards. The goal of the agent is to find a sequence of control actions to maximize the expected discounted return of rewards as given below:

$\arg\max_{A} \; \mathbb{E}\left[ R_t := \sum_{k=0}^{\infty} \gamma^k r_{t+k} \right]$   (8.17)
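The discounted return $R_t$ in Eq. (8.17) can be computed directly for a finite reward sequence; this small sketch (with hypothetical reward values) shows how the discount factor $\gamma$ weighs future rewards less:

```python
# Discounted return R_t = sum_k gamma^k * r_{t+k} for a finite episode.
def discounted_return(rewards, gamma):
    return sum(gamma ** k * r for k, r in enumerate(rewards))

rewards = [1.0, 1.0, 1.0]        # r_t, r_{t+1}, r_{t+2}
R = discounted_return(rewards, gamma=0.5)
# R = 1 + 0.5 + 0.25 = 1.75
```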
8.8 Summary
In this chapter, we have explored advanced deep learning techniques and their broad
applications in machine learning. Our focus was on providing a prescriptive overview
of these techniques rather than delving into detailed explanations of the algorithms.
We highlighted the transformative impact of deep learning compared to classical
approaches, showcasing the shift from manual feature engineering to end-to-end
learning. Deep learning models, such as CNNs, LSTMs, GANs, GNNs, VAEs, and
RL, have revolutionized various fields beyond materials science. While we did not
provide extensive technical explanations, we emphasized the wide-ranging applica-
tions of these advanced deep learning techniques. From computer vision and nat-
ural language processing to generative modeling and sequential decision making,
deep learning has enabled breakthroughs in areas such as image analysis, language
understanding, creative generation, network analysis, representation learning, and
autonomous agents. Altogether, the chapter aimed to provide readers with a broader
outlook on the capabilities and potential applications of advanced deep learning
techniques. By understanding the core concepts and recognizing the diverse domains
where these techniques have been successfully applied, readers can explore and adapt
these approaches to their specific problem domains.
References
1. I. Goodfellow, Y. Bengio, A. Courville, Deep Learning (The MIT Press, 2016). isbn:
0262035618
2. R.S. Sutton, A.G. Barto, Reinforcement Learning: An Introduction (MIT Press, 2018)
3. K.P. Murphy, Machine Learning: A Probabilistic Perspective (The MIT Press, 2012). isbn:
0262018020
4. J. Bernal, K. Kushibar, D.S. Asfaw, S. Valverde, A. Oliver, R. Marti, X. Llado, Deep convo-
lutional neural networks for brain image analysis on magnetic resonance imaging: a review.
Artif. Intell. Med. 95, 64–81 (2019)
5. F.A. Gers, J. Schmidhuber, F. Cummins, Learning to forget: continual prediction with LSTM.
Neural Comput. 12(10), 2451–2471 (2000)
6. T. Fischer, C. Krauss, Deep learning with long short-term memory networks for financial market
predictions. Eur. J. Oper. Res. 270(2), 654–669 (2018)
7. A. Aggarwal, M. Mittal, G. Battineni, Generative adversarial network: an overview of theory and applications. Int. J. Inf. Manag. Data Insights 100004 (2021)
8. J.-Y. Kim, S.-B. Cho, A systematic analysis and guidelines of graph neural networks for practical applications. Expert Syst. Appl. 184, 115466 (2021)
9. D.P. Kingma, M. Welling, An Introduction to Variational Autoencoders (2019). arXiv:1906.02691
10. R. Nian, J. Liu, B. Huang, A review on reinforcement learning: introduction and applications in industrial process control. Comput. Chem. Eng. 139, 106886 (2020)
Chapter 9
Interpretable Machine Learning
Abstract ML algorithms, and deep learning models more so, are notorious for their
black-box nature providing little or no insights into the nature of the learned func-
tion. To address this challenge, increasing emphasis is being placed on developing
interpretable ML models or enabling interpretability for the existing models. Inter-
pretable machine learning addresses the challenge of understanding complex black-
box models, enabling transparency and insight into their decision-making processes.
In this chapter, we explore interpretable machine learning techniques, focusing on
two prominent methods: SHapley Additive exPlanations (SHAP) and integrated gra-
dients. The SHAP framework, based on cooperative game theory, is examined as
a method to attribute feature contributions to model outputs. SHAP values provide
a mathematically grounded approach to understanding the significance and impact
of individual features. The chapter also explores integrated gradients, which quan-
tify feature importance by integrating gradients along a reference-to-input path. This
technique offers insights into how changes in feature values affect model predictions.
We also discuss symbolic regression as a tool to abstract out symbolic laws from
the data. Finally, we discuss a few additional interpretability algorithms to unpack
the black-box ML models. Altogether, we discuss how interpretable algorithms can
provide insights into the feature-to-label map learned by the DL models.
9.1 Introduction
ML approaches are notorious for the black-box nature of the models. With increas-
ing complexity of the models, interpretability and explainability decreases. Simpler
models such as linear and polynomial regression are explainable due to the paramet-
ric nature of the equations. However, more complex models such as random forest,
and neural networks are less interpretable. Support vector machines can be considered an intermediate case: the simpler linear version provides some insight into the model, whereas the nonlinear kernel versions are more complex with less interpretability. Overall, complex ML models provide little or no insight into the nature of the problem or the input-output relationships. This raises questions about the applicability of these models for extrapolating beyond the training domain and for practical
© Springer Nature Switzerland AG 2024 159
N. M. A. Krishnan et al., Machine Learning for Materials Discovery,
Machine Intelligence for Materials Science,
https://doi.org/10.1007/978-3-031-44622-1_9
situations, particularly in domains where model decisions impact human lives and
crucial decision-making processes. Further, domain experts might find it challenging
to appreciate the model as the features learnt by the model cannot be clearly under-
stood. By making machine learning models more transparent and explainable, we
can build trust, understand biases, detect errors, and ensure ethical considerations
are met.
To address these challenges, several algorithms have been developed recently that
can be used to explain the features learnt by the ML models. While some of these
algorithms are model-specific, others are model-agnostic. For example, the feature
explainer in tree algorithms identifies the frequency of usage of the input features to
create a branch in a tree. This frequency is then used as a surrogate to explain the
feature importance learnt by the model. In contrast, algorithms such as Shapley additive explanations (SHAP) are model-agnostic and can be applied to any ML model.
In this chapter, we will focus on the interpretability of ML models. First, we will
discuss the SHAP framework, a versatile tool for model interpretation based on coop-
erative game theory. SHAP values provide a unified and mathematically grounded
approach to attribute the contribution of each feature to the model’s output, offer-
ing insights into the importance and impact of individual features. Next, the chapter
explores integrated gradients, a technique inspired by the field of interpretability
in deep learning. Integrated gradients provide an attribution method that quantifies
the importance of each feature by integrating gradients along a path from a refer-
ence point to the input point. This approach enables the understanding of feature
importance and how changes in feature values influence model predictions.
We will also discuss the theoretical foundations, practical implementation, and
application examples of both SHAP and integrated gradients. We highlight their
strengths, limitations, and considerations for different types of machine learning
models and datasets. By employing these interpretable machine learning techniques,
practitioners and researchers can gain a deeper understanding of how complex models
arrive at their predictions. This chapter aims to equip readers with the knowledge
and tools necessary to unravel the inner workings of black-box models, fostering
transparency, trust, and responsible deployment of machine learning in real-world
applications. For further discussion on interpretable machine learning, readers are
directed to References [1, 2].
[5], Shapley regression values [6], Shapley sampling values and quantitative input
influence [7], among others.
The SHapley Additive exPlanations (SHAP) method uses the "Shapley value" to understand the influence of a feature value on a prediction.

Fig. 9.4 Force plot showing the contribution of each component for a prediction

SHAP values provide a unified game-theoretic approach to calculate the feature importance of an ML model. SHAP measures a feature's importance by quantifying the prediction error while perturbing a given feature value: if the prediction error is large, the feature is important; otherwise, the feature is less important. SHAP is an additive feature importance method that produces a unique solution while adhering to desirable properties, namely local accuracy, missingness, and consistency. A SHAP value is defined as a feature's average marginal contribution across all potential feature combinations (Figs. 9.1, 9.2, 9.3 and 9.4).
Suppose a regression model has the prediction function $f'(x_1, \ldots, x_n)$, where $x_1, \ldots, x_n$ are the features. By means of SHAP, the contribution of the $j$th feature is computed as follows:

$\phi_j(f') = f'(x_j) - E(f'(X_j))$   (9.1)
9.2 Shapley Additive Explanations 163
where the upper-case letter stands for the feature random variable. $E(f'(X_j))$ is the mean effect estimate for the $j$th feature, and its contribution is computed as the difference between the feature prediction and the average estimate. If we add up every feature contribution at once, we get the following:

$\sum_{j=1}^{N} \phi_j(f') = \sum_{j=1}^{N} \left( f'(x_j) - E(f'(X_j)) \right)$   (9.2)
Let $f'_x(S)$ denote the estimate for the feature values in the set $S$, marginalized over the features not in the set $S$:

$f'_x(S) = \int f'(x_1, \ldots, x_N) \, dP_{x \notin S} - E_X(f'(X))$   (9.3)
A feature value's contribution to the prediction is weighted and summed over all possible feature value combinations to determine its Shapley value. It is determined by averaging each feature's marginal contributions over all viable coalitions of features. This leads to an estimate of the Shapley value $\phi_j(f_x)$ as follows:

$\phi_j(f_x) = \sum_{S \subseteq \{1, \ldots, N\} \setminus \{j\}} \dfrac{|S|! \, (N - |S| - 1)!}{N!} \left( f_x(S \cup \{j\}) - f_x(S) \right)$   (9.4)
where $S$ denotes a subset of features and $N$ denotes the total number of input features. A negative Shapley value indicates that the feature instance has a negative impact on the target value, whereas a positive value indicates the opposite.
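The coalition average over marginal contributions can be computed exactly by brute force for a small model. The sketch below (our own illustrative code, not a library API) replaces features absent from a coalition with baseline values; the toy linear model is hypothetical:

```python
# Brute-force (exact) Shapley values by enumerating all coalitions.
from itertools import combinations
from math import factorial

def shapley_values(f, x, baseline):
    """Features absent from coalition S are set to their baseline values."""
    n = len(x)
    def f_S(S):
        z = [x[i] if i in S else baseline[i] for i in range(n)]
        return f(z)
    phi = []
    for j in range(n):
        others = [i for i in range(n) if i != j]
        total = 0.0
        for r in range(n):
            for S in combinations(others, r):
                # Coalition weight |S|!(n - |S| - 1)! / n!
                w = factorial(len(S)) * factorial(n - len(S) - 1) / factorial(n)
                total += w * (f_S(set(S) | {j}) - f_S(set(S)))
        phi.append(total)
    return phi

# Toy linear model: for linear models phi_j = w_j * (x_j - baseline_j).
f = lambda z: 2.0 * z[0] + 3.0 * z[1] + 1.0
phi = shapley_values(f, x=[1.0, 2.0], baseline=[0.0, 0.0])
# phi = [2.0, 6.0]; the values sum to f(x) - f(baseline) (local accuracy)
```

The exponential cost of this enumeration is exactly why approximations such as KernelSHAP (discussed next) are used in practice.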
This explanation method uses an additive feature attribution, which is a linear combination of the feature contributions:

$g(z') = \phi_0 + \sum_{j=1}^{N} \phi_j z'_j$   (9.5)

where $z' \in \{0, 1\}^N$ is the coalition vector, $N$ is the number of features, and $\phi_j$ is the attribution of the $j$th feature [8]. In this study, we have interpreted the best-performing GPR models by utilizing the "KernelSHAP" framework. It is appropriate for nonlinear models such as GPR and interprets the feature importance by evaluating the Shapley values.
The “KernelSHAP” is an extension of SHAP wherein the contributions of each
feature value to the estimate for a data instance are calculated for nonlinear models
such as GPR. The KernelSHAP algorithm involves five major steps, as indicated below:

1. Sample coalitions $z'_k \in \{0, 1\}^M$, $k = 1, \ldots, K$, where 1 means the feature is present in the coalition and 0 means the feature is absent.
2. Get the prediction for each $z'_k$: $f'(h_x(z'_k))$, where $h_x : \{0, 1\}^M \mapsto \mathbb{R}^P$.
3. Estimate the weight for each $z'_k$ using the SHAP kernel $\pi_x(z')$ [8]:

$\pi_x(z') = \dfrac{M - 1}{\binom{M}{|z'|} \, |z'| \, (M - |z'|)}$   (9.6)

where $M$ indicates the maximum coalition size and $|z'|$ stands for the number of features present for a given instance.
4. Fit a weighted linear model by minimizing the loss $L = \sum_{z'} \left[ f'(h_x(z')) - g(z') \right]^2 \pi_x(z')$.
5. Return the Shapley values $\phi_k$, the coefficients of the linear model.
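The five steps above can be sketched in a minimal form that enumerates all coalitions instead of sampling them (step 1) and enforces the SHAP constraints by substitution; this is our own illustrative implementation under those simplifying assumptions, not the robust one in the `shap` library:

```python
# A minimal KernelSHAP-style sketch (exact enumeration of coalitions).
import itertools
import math
import numpy as np

def kernel_shap(f, x, baseline):
    """Absent features are filled with baseline values (the mapping h_x)."""
    M = len(x)
    rows, preds, weights = [], [], []
    for z in itertools.product([0, 1], repeat=M):       # step 1: coalitions
        s = sum(z)
        if s == 0 or s == M:
            continue  # infinite-weight endpoints handled as constraints below
        point = [x[i] if z[i] else baseline[i] for i in range(M)]
        rows.append(z)
        preds.append(f(point))                           # step 2: predictions
        weights.append((M - 1) /                         # step 3: SHAP kernel
                       (math.comb(M, s) * s * (M - s)))
    Z = np.array(rows, float)
    W = np.diag(weights)
    fx = f(x) - f(baseline)
    y = np.array(preds) - f(baseline)
    # Step 4: weighted least squares with sum(phi) = f(x) - f(baseline),
    # imposed by eliminating the last coefficient.
    Zr = Z[:, :-1] - Z[:, [-1]]
    yr = y - Z[:, -1] * fx
    phi_rest = np.linalg.solve(Zr.T @ W @ Zr, Zr.T @ W @ yr)
    return np.append(phi_rest, fx - phi_rest.sum())      # step 5: Shapley values

# Toy linear model; for linear models phi_j = w_j * (x_j - baseline_j).
phi = kernel_shap(lambda v: v[0] + 2.0 * v[1] + 3.0 * v[2],
                  x=[1.0, 1.0, 1.0], baseline=[0.0, 0.0, 0.0])
# phi ≈ [1.0, 2.0, 3.0]
```

With full enumeration, this weighted regression recovers the exact Shapley values; the practical value of KernelSHAP lies in obtaining good estimates from a sampled subset of coalitions.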
After normalization, the final Shapley values are presented as the average of the absolute Shapley values for each feature across the data, $I_k = \frac{1}{n} \sum_{i=1}^{n} |\phi_k^{(i)}|$. A high value of $I_k$ means the feature is crucial for predicting the target variables, and vice versa.
The SHAP values are visualized using a violin plot and a river flow plot. The violin plot represents the contribution of a given feature towards the different output values as a function of the feature value; thus, the violin plot is colored according to the feature value. The river flow plot shows specific paths to the final prediction, where the intermediate points represent the contributions of the different input components. These paths are created by nudging the prediction from the expected value in a particular direction, representing a specific glass component's contribution. Thus, the river flow plot is colored according to the final output value.
Further, the correlation between several input components in a model can also be studied using the SHAP interaction values. To this extent, we analyze the error in the output prediction while perturbing two input components simultaneously. If the magnitude of the error in the output while perturbing a single input component is the same for different values of the second input component, the two input components are not correlated; otherwise, they are correlated. The degree of this correlation can be quantified from the SHAP interaction values as follows.
The Shapley interaction values from classic game theory are calculated as

$\mathrm{SHAP}_{ij}(f) = \sum_{S \subseteq \{1, 2, \ldots, p\} \setminus \{i, j\}} \dfrac{(|S| - 1)! \, (p - |S| - 2)!}{(p - 1)!} \times \left[ f(S \cup \{i, j\}) - f(S \cup \{i\}) - f(S \cup \{j\}) + f(S) \right]$   (9.7)

This equation represents the SHAP interaction values between pairs of features,
denoted by .i and . j, in a given function or model . f . The equation involves sum-
ming over all possible subsets . S of features excluding .i and . j. The terms inside
the summation compute the difference between model predictions when including
and excluding the features .i and . j in different combinations. The resulting SHAP
interaction values provide insights into the joint effect of features on the model predictions. For a model with $M$ features, an $M \times M$ matrix per instance is obtained. To visualize the interaction values for a specific property, a heatmap on an $M \times M$ square grid is plotted, where the color of each cell represents the normalized interaction value. Note that the interaction values are averaged (mean of absolute values) over the whole dataset to produce a single $M \times M$ matrix for each property. The stronger the interaction value, the stronger the coupling between the two variables for a given property.
9.3 Integrated Gradients

Formally, integrated gradients defines the importance value for the $i$th feature as follows:

$\phi_i^{IG}(f, x, x') = (x_i - x'_i) \times \int_{\alpha=0}^{1} \dfrac{\partial f(x' + \alpha(x - x'))}{\partial x_i} \, d\alpha$   (9.8)

where $x$ is the current input, $f$ is the model function, and $x'$ is some baseline input that is meant to represent the absence of feature input. The subscript $i$ denotes indexing into the $i$th feature (Fig. 9.5).
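The path integral in Eq. (9.8) can be approximated numerically with a Riemann sum. This sketch uses finite-difference gradients on a toy differentiable function purely for illustration; a real use would obtain the gradients from an autodiff framework such as PyTorch or from a library like Captum:

```python
# Numerical integrated gradients via a midpoint Riemann sum.
import numpy as np

def integrated_gradients(f, x, x_baseline, steps=200):
    """Accumulate gradients along the straight line from baseline to input."""
    x, x_baseline = np.asarray(x, float), np.asarray(x_baseline, float)
    alphas = (np.arange(steps) + 0.5) / steps        # midpoint rule
    total = np.zeros_like(x)
    eps = 1e-5
    for a in alphas:
        point = x_baseline + a * (x - x_baseline)
        grad = np.zeros_like(x)
        for i in range(x.size):                      # central differences
            dp, dm = point.copy(), point.copy()
            dp[i] += eps
            dm[i] -= eps
            grad[i] = (f(dp) - f(dm)) / (2 * eps)
        total += grad
    return (x - x_baseline) * total / steps

# Toy model: f(x) = x0^2 + 3*x1. Completeness: the attributions should
# sum to f(x) - f(baseline).
f = lambda z: z[0] ** 2 + 3.0 * z[1]
attr = integrated_gradients(f, x=[2.0, 1.0], x_baseline=[0.0, 0.0])
# attr ≈ [4.0, 3.0]; sum ≈ f(x) - f(x') = 7.0
```

The completeness property (attributions summing to the difference in model outputs) is a useful sanity check for any integrated-gradients implementation.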
As the formula above states, integrated gradients gets importance scores by accu-
mulating gradients on images interpolated between the baseline value and the current
input. But why would doing this make sense? Recall that the gradient of a function
represents the direction of maximum increase. The gradient is telling us which pixels
have the steepest local slope with respect to the output. For this reason, the gradient
of a network at the input was one of the earliest saliency methods.
Unfortunately, there are many problems with using gradients to interpret deep
neural networks. One specific issue is that neural networks are prone to a problem
known as saturation: the gradients of input features may have small magnitudes
around a sample even if the network depends heavily on those features. This can
happen if the network function flattens after those features reach a certain magnitude.
Intuitively, shifting the pixels in an image by a small amount typically does not
change what the network sees in the image. We can illustrate saturation by plotting
the network output at all images between the baseline .x ' and the current image. The
figure below displays that the network output for the correct class increases initially,
but then quickly flattens.
Fig. 9.5 Comparing integrated gradients with gradients at the image. Left-to-right: original input
image, label and softmax score for the highest scoring class, visualization of integrated gradients,
visualization of gradients*image. Notice that the visualizations obtained from integrated gradients
are better at reflecting distinctive features of the image
The historical process of discovering the functional forms that relate abstract quanti-
ties, such as energy and force, to observable quantities, such as positions and veloci-
ties, was indeed a laborious and time-consuming endeavor. Scientists and researchers
dedicated significant efforts over decades and even centuries to conduct experiments
and make observations to uncover these relationships. One notable example is the
equation governing kinetic energy, which Emilie Du Chatelet identified through
meticulous study and analysis.
The discovery of the equation for kinetic energy, expressed as $\frac{1}{2} m \dot{x}^2$, was a groundbreaking achievement. Prior to Du Chatelet's work, there was a misconception that kinetic energy was linearly proportional to velocity ($\dot{x}$). Du Chatelet's findings corrected this misunderstanding and provided a more accurate understanding of the relationship between kinetic energy, mass ($m$), and the square of velocity ($\dot{x}^2$). Her contribution not only corrected an existing misconception but also paved the way for further advancements in the field of physics.
It is important to note that these discoveries were primarily driven by intuition,
empirical understanding, and occasional derivation from first principles. Formal
methods, as we know them today, were not employed during that time. Scientists
relied on their deep knowledge of the subject matter, careful observations, and insight-
ful interpretations to uncover the underlying mathematical relationships between
abstract and observable quantities.
The process of discovering equations directly from observations has been a fun-
damental method through which humans have developed their understanding of the
universe. While the historical approach lacked formalism, it highlights the impor-
tance of empirical exploration and the ability to derive insights from experimental
data. Today, with the advent of modern computational techniques and machine learn-
ing algorithms, we have the opportunity to employ formal methods, such as symbolic
regression, to automatically discover mathematical equations directly from data. This
formalization of the process allows for a more systematic and efficient exploration of
mathematical relationships, complementing the intuitive and empirical approaches
of the past.
With the advent of computational and data-driven modeling, these problems can be
solved in a more formal fashion employing combinatorial optimization along with the
symbolic functions and operations through an approach called symbolic regression.
Symbolic regression is a computational technique that aims to discover mathematical
equations directly from data without any prior knowledge of the underlying functional
form. It leverages the principles of evolutionary algorithms and genetic programming
to search through a vast space of mathematical expressions and identify the most
suitable equation that accurately represents the relationship between variables.
In symbolic regression, the goal is to find an equation of the form

$y = f(x_1, x_2, \ldots, x_n)$   (9.9)

where $y$ represents the dependent variable, $x_1, x_2, \ldots, x_n$ are the independent variables, and $f$ is the unknown mathematical function to be discovered. The objective is to find the functional form of $f$ that best fits the given data.
The search for the equation begins by generating an initial population of candidate
equations. These candidate equations consist of a combination of mathematical oper-
ators (e.g., addition, subtraction, multiplication, division) and mathematical functions
(e.g., logarithm, exponential, trigonometric functions). The initial population is typ-
ically generated randomly or based on prior knowledge.
To evaluate the fitness of each candidate equation, a fitness function is defined. The
fitness function quantifies how well the equation fits the observed data. This fitness
evaluation is often based on a measure of the difference between the predicted values
from the equation and the actual observed values.
Genetic programming techniques are then employed to evolve and refine the pop-
ulation of candidate equations over successive generations. This involves applying
genetic operators such as mutation and crossover to create new candidate equations
by modifying or combining existing ones. Mutation introduces random changes to the
equations, while crossover combines elements from two parent equations to generate
offspring equations.
168 9 Interpretable Machine Learning
The evolution process follows the principles of natural selection, where candidate
equations with higher fitness are more likely to be selected for reproduction, and thus
have a higher chance of propagating their genetic material to subsequent generations.
Over multiple generations, the population evolves towards equations that provide a
better fit to the data, allowing the discovery of the underlying functional form.
The evolution process continues until a stopping criterion is met, such as reaching
a maximum number of generations or achieving a desired level of fitness. At the end
of the process, the best-performing equation, as determined by the fitness function,
is selected as the final result.
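As a toy illustration of the generate-evaluate-select loop described above, the sketch below performs a pure random search over small expression trees. It omits the crossover and mutation operators of full genetic programming, and all names are our own illustrative choices, not the API of any symbolic-regression package:

```python
# Random-search symbolic regression over a tiny expression grammar.
import random

OPS = {'+': lambda a, b: a + b,
       '-': lambda a, b: a - b,
       '*': lambda a, b: a * b}

def random_expr(depth=2):
    """Build a random expression tree over the variable x and small constants."""
    if depth == 0 or random.random() < 0.3:
        return random.choice(['x', random.randint(1, 3)])
    op = random.choice(list(OPS))
    return (op, random_expr(depth - 1), random_expr(depth - 1))

def evaluate(expr, x):
    if expr == 'x':
        return x
    if isinstance(expr, int):
        return float(expr)
    op, left, right = expr
    return OPS[op](evaluate(left, x), evaluate(right, x))

def fitness(expr, data):
    # Sum of squared errors: lower is better.
    return sum((evaluate(expr, x) - y) ** 2 for x, y in data)

random.seed(0)
data = [(x, x * x + 2.0) for x in range(-3, 4)]   # hidden target: y = x^2 + 2
best = min((random_expr(3) for _ in range(5000)),
           key=lambda e: fitness(e, data))
# With enough samples the search typically recovers an expression
# equivalent to x*x + 2; real GP tools evolve the population instead.
```

Genetic-programming packages such as gplearn or PySR replace the blind sampling here with selection, crossover, and mutation over generations, which is vastly more sample-efficient.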
Symbolic regression has been successfully applied in various domains, including physics, engineering, finance, and biology. In the materials domain, it is sparingly used. However, it could be an extremely useful approach for discovering empirical rules from data and for finding symbolic expressions that represent complex data in an interpretable fashion. There are several open-source packages that can be used for symbolic regression; some of these are listed below.
1. PySR [9]: https://github.com/MilesCranmer/PySR.
2. Eureqa [10]: Open-source code not available.
3. GPLearn: https://github.com/trevorstephens/gplearn.
4. AI Feynman: https://github.com/SJ001/AI-Feynman.
5. Operon: https://github.com/heal-research/operon.
6. DSO: https://github.com/brendenpetersen/deep-symbolic-optimization.
7. PySINDy: https://github.com/dynamicslab/pysindy.
8. EQL: https://github.com/martius-lab/EQL.
9. SR-Transformer: https://github.com/martius-lab/EQL.
10. GP-GOMEA: https://github.com/marcovirgolin/GP-GOMEA.
11. Symbolic Physics Learner [11]: Open-source code not available.
9.5 Other Interpretability Algorithms

While SHAP and integrated gradients are widely used model-agnostic algorithms, interpretable ML is a fast-growing field, and several model-specific algorithms are available, with many more being continuously developed. Here, we briefly outline some of these interpretability algorithms.
• Decision Trees: Decision trees are widely used for their inherent interpretability.
These models consist of a hierarchical structure of decision nodes and leaf nodes,
allowing easy visualization and understanding of the decision-making process.
• Rule-based Models: Rule-based models, such as decision rules and association
rules, provide explicit rules that dictate model predictions. These models are highly
interpretable as they directly map input features to decision rules.
• Partial Dependence Plots: Partial dependence plots reveal the relationship
between a selected feature and the model’s output by systematically varying that
feature while keeping others constant. These plots provide insights into how indi-
vidual features impact predictions. The partial dependence function is defined as
$\mathrm{PD}(x_j) = \dfrac{1}{N} \sum_{i=1}^{N} f(x_{-j}^{(i)}, x_j)$   (9.10)

where $x_j$ represents the selected feature of interest, $N$ is the number of instances, $x_{-j}$ represents all features except $x_j$, and $f$ represents the model's prediction function. The partial dependence function $\mathrm{PD}(x_j)$ quantifies the average predicted outcome of $f$ across all instances when varying the feature $x_j$ while keeping the other features constant.
• Local Interpretable Model-Agnostic Explanations (LIME): LIME explains
complex models locally by generating interpretable explanations for specific
instances. It approximates the behavior of the black-box model around a par-
ticular data point using a locally interpretable model. The explanation is obtained
by solving the following optimization problem:

$\xi(x) = \arg\min_{g \in G} \; L(f, g, \pi_x) + \Omega(g)$   (9.11)

where $f$ is the black-box model, $g$ is the locally interpretable model drawn from a class of interpretable models $G$, $\pi_x$ is the proximity measure between the instance of interest and the perturbed instances, and $\Omega$ is the complexity penalty.
• Generalized Additive Models (GAM): GAMs extend traditional linear models
by incorporating nonlinearities using smooth functions. In a GAM, each feature $x_i$ is associated with a smooth function $f_i$ that captures its non-linear relationship with the response variable as follows:
$y = \beta_0 + \sum_{i=1}^{N} f_i(x_i)$   (9.12)
where $y$ represents the predicted outcome or response variable, $\beta_0$ is the intercept term, $N$ is the number of features, and $f_i(x_i)$ represents the smooth non-linear function applied to each feature $x_i$. These smooth functions can take various forms, such as splines or kernel functions, and are often estimated using techniques like penalized regression. The final prediction is obtained by summing the contributions from each feature's smooth function, along with the intercept term. Each smooth function $f_i$ can have its own set of parameters, allowing flexibility in modeling the relationship between each feature and the response. The objective of GAM is to estimate the smooth functions $f_i$ that minimize the discrepancy between the observed responses and the predictions. This is typically achieved through optimization methods, such as maximum likelihood estimation or penalized regression techniques. GAMs allow for a flexible representation of the relationship between features and the target variable, enabling interpretability. The smooth functions
allow for capturing non-linear patterns in the data, making GAM a powerful tool
for understanding the dependencies between features and the response variable.
• Contrastive Explanations Method (CEM): CEM generates contrastive explanations by identifying the minimal changes required in the input features to change the model's prediction. In other words, CEM provides explanations for individual predictions by contrasting them with alternative outcomes. Schematically, CEM minimizes a loss $L(f, g, x, y, \delta)$ over the perturbation $\delta$, where $x$ represents the instance to be explained, $y$ is the true label or target value associated with $x$, $f$ is the black-box model being explained, $g$ is the interpretable model used for generating contrastive explanations, and $\delta$ is the contrastive perturbation applied to $x$ to create an alternative instance. The goal of CEM is to find the optimal perturbation $\delta$ that minimizes the loss function $L$ and produces a contrastive explanation for the prediction made by the black-box model $f$. By exploring perturbed instances and comparing their predictions with the original instance, CEM provides insights into the key features and factors influencing the decision made by the model. CEM is a powerful technique for generating contrastive explanations in various domains, such as image classification, natural language processing, and recommender systems. This approach helps understand model behavior by highlighting critical features.
• Feature Importance Techniques: Various feature importance techniques, such as
permutation importance, mean decrease impurity, and coefficient weights, quantify
the importance of each feature in the model’s decision process.
Note that detailed mathematical explanations of each of these algorithms are
beyond the scope of this book. The aim of this discussion is to give a broad overview
of several available techniques for model interpretability.
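Still, the simplest of these techniques, permutation importance, can be sketched in a few lines; the synthetic data below (where only the first feature carries signal) is invented for illustration:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 3))
y = 3.0 * X[:, 0] + rng.normal(0, 0.1, 300)    # only feature 0 matters

model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)
# Shuffling a feature column and measuring the score drop quantifies
# how much the model relies on that feature
result = permutation_importance(model, X, y, n_repeats=10, random_state=0)
top = int(np.argmax(result.importances_mean))
```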
9.6 Conclusion
References
10.1 Introduction
In the field of materials science, the ability to accurately predict and understand
the properties of new materials is of paramount importance. Traditional approaches
to property prediction often rely on time-consuming and costly experimental tech-
niques. Further, the properties of a material are a complex function of its composition,
structure, processing conditions, and testing conditions. However, with the advent of
machine learning, there has been a paradigm shift towards data-driven methods that
leverage the power of computational models and algorithms to expedite the discovery
and characterization of novel materials.
This chapter explores the application of machine learning techniques for property
prediction of materials. By leveraging large datasets, advanced algorithms, and com-
putational models, machine learning offers the potential to revolutionize the way we
© Springer Nature Switzerland AG 2024 175
N. M. A. Krishnan et al., Machine Learning for Materials Discovery,
Machine Intelligence for Materials Science,
https://doi.org/10.1007/978-3-031-44622-1_10
176 10 Property Prediction
Fig. 10.1 Basic steps involved in the development of an ML model for predicting the property of
a material: data collection, data pre-processing, feature engineering (e.g., periodic-table-based
descriptors), machine learning model development, and model interpretation
understand and design materials with desired properties. The objective is to develop
predictive models that can accurately estimate various material properties, such as
mechanical strength, electrical conductivity, thermal conductivity, optical proper-
ties, and more. Thus, property prediction is one of the most common applications for
which ML is widely used. The major steps involved in the property prediction
problem are as follows (see Fig. 10.1):
1. Dataset preparation, including data collection and processing,
2. Feature engineering,
3. Model development,
4. Hyperparametric optimization,
5. Model testing and deployment.
Note that depending on the nature of the problem, there could be additional steps
involved. However, these steps describe the bare minimum that needs to be carried
out for reasonable model development for property prediction. Further, the model
can also be interpreted using interpretable algorithms such as SHAP. This allows
the domain experts to critically analyze the features learned by the model and then
evaluate whether they are sensible from a physical perspective. The interpretation of
ML models is explained in detail in a dedicated chapter. These steps mentioned
above are explained in detail below.
10.2 Dataset Preparation
Fig. 10.2 Features for materials modeling. Reprinted with permission from [1]
2. Physics-driven features: Here, the features are obtained either based on the
physics of the system or directly from the periodic table. These features can range
from periodic-table-based descriptors, such as atomic number, atomic mass, ionic
radius, and electronic orbitals, to microstructural features, such as grain size
and grain orientation for alloys, or chain length and correlation length for polymers.
If the property is at higher length scales, such as the mesoscale or microscale, then
the relevant parameters should be selected. Further, these features are highly
dependent on the materials of interest as well. The selection of these features is a
strong function of the length and timescales associated with the property or phenomenon
of interest. For example, in the case of composite materials, relevant features include
the weight or volume percent of the matrix and inclusions, thickness- and
geometry-dependent features, and the orientation of the inclusions if the inclusions
are fibres or sheets. Note that the final selection of the features will be contingent
upon the model performance.
3. Topology-driven features: In some cases, the specific structure of an atomic
or mesoscale system may play a crucial role in controlling the properties. This
could be the case for crystalline systems, organic molecules, polymers,
perovskites, or metal–organic frameworks. In such cases, it might be useful
to consider the topology-dependent features of the atoms. Note that the topological
features could be represented using simple coordination numbers, bond angles,
and related measures.
10.3 Feature Engineering
Once the features are identified, they should be checked for correlation. Note that
correlation does not imply causation. However, in a representative dataset, correla-
tion analysis can be used to identify how one input variable changes with respect
to another. If two input variables are highly correlated, then when one
increases, the other also increases, and vice-versa. These input vari-
ables may essentially contain similar information, and having both of them might be
redundant. Thus, correlation analysis can be used to trim down the feature space
by removing highly correlated input features. It should be noted that in ML models,
unlike classical regression models such as linear regression, the input variables
need not be independent. ML models can handle even highly dependent input
variable spaces. Nevertheless, reducing the feature space using dimensionality
reduction may allow the development of a simpler ML model with a higher
degree of interpretability.
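The correlation-based pruning described above can be sketched with pandas; the feature names and the correlation threshold of 0.95 below are invented for illustration:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({"atomic_mass": rng.normal(50.0, 10.0, 200)})
df["atomic_number"] = 0.45 * df["atomic_mass"] + rng.normal(0, 0.1, 200)  # near-duplicate
df["ionic_radius"] = rng.normal(1.0, 0.2, 200)                            # independent

corr = df.corr().abs()
# Scan the upper triangle and drop one member of each highly correlated pair
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [c for c in upper.columns if (upper[c] > 0.95).any()]
pruned = df.drop(columns=to_drop)
```

Using only the upper triangle ensures each correlated pair is counted once, so exactly one feature of the pair is removed.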
In addition, dimensionality reduction techniques, such as principal component
analysis, may be employed to project the input features onto a lower-dimensional
domain. Note, however, that the new features obtained after dimensionality reduction
may not be easily interpretable: each new feature is a weighted combination of multiple
input features and hence may not be amenable to a physical explanation. Nevertheless, it has
been observed that dimensionality reduction can occasionally improve the perfor-
mance of ML models.
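A minimal PCA sketch with scikit-learn, on synthetic data that genuinely lives on a two-dimensional subspace (all sizes invented):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 2))
# Six correlated features generated from two hidden factors plus tiny noise
X = latent @ rng.normal(size=(2, 6)) + rng.normal(0, 0.01, size=(200, 6))

pca = PCA(n_components=2)
Z = pca.fit_transform(X)                         # 6 features -> 2 components
explained = float(pca.explained_variance_ratio_.sum())
```

Because the data is essentially rank-2, two components capture almost all of the variance; on real materials data the explained-variance ratio guides how many components to keep.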
Most of the approaches mentioned above involve feature engineering based on
physical intuition or an understanding of the characteristics of the system. Thus,
feature engineering is highly subjective and relies heavily on the domain knowledge
of the user. The rise of deep learning comes with the promise that feature engineering
may no longer be required. In deep learning, the raw dataset can be provided to
the neural network, which in turn identifies the relevant features during the learning
process. For instance, using pixel images of crystals, a convolutional neural network
can automatically identify the features that enable the identification of the crystal struc-
ture. Similarly, a graph neural network can use the topology of the graph structure to
identify the node and edge embeddings that maximize the performance of the model.
In such cases, the burden of feature engineering is significantly reduced. However, it
might still be worth checking whether feature engineering can enhance the performance
of the model.
10.4 Model Development
Once the processed dataset with the input features and output labels or values is
ready, we use it to develop the ML models. Before model training,
the dataset should be split into training, validation, and test sets. While the exact split
can be based on the dataset size, common practices involve ratios such as 60:20:20,
70:15:15, or 80:10:10 for the training, validation, and test sets, respectively. The
validation set is used to develop the optimal model, that is, one without underfitting or
overfitting, and hence is effectively part of the training data. The test set, also known as
the hold-out set, is kept unseen by the model and is only used at the end to evaluate the
model performance. Typically, the train:validation:test split is performed randomly.
However, this approach assumes that the dataset is balanced, with the
training and test data following a similar distribution for both input features and
output labels or values. When this is not the case, additional care should
be taken to ensure that the training set is representative of the entire domain
of the input features. For instance, if one of the input features is present in only
100 of 1000 data points, care should be taken to ensure that data points with the
given feature are present in the training data. A randomized 80:20 split of training and
test data does not automatically ensure this. Further, if the feature is not present in
the training set, then the model cannot be expected to learn the weights associated with
it, and the feature, and hence the corresponding data entries, become redundant. At
the same time, while ensuring that the training data is indeed representative of the
dataset, care should be taken to avoid data leakage. Data leakage refers to the use
of information from outside the training set (that is, from the test set) for developing the
final model. As such, it is advisable to create the training and test datasets initially
and keep them constant throughout.
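A fixed 60:20:20 split can be created once and reused, for example with scikit-learn; pinning `random_state` keeps the split constant across experiments, which helps avoid leakage:

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(1000).reshape(-1, 1)
y = np.arange(1000)

# First carve out the held-out test set, then split the remainder into
# train and validation (0.25 of the remaining 80% gives 20% overall)
X_tmp, X_test, y_tmp, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X_tmp, y_tmp, test_size=0.25, random_state=42)
```

For imbalanced labels or rare features, `train_test_split` also accepts a `stratify` argument so that the rare cases appear in all three sets.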
The model is trained on the training set, and the performance is evaluated on the
validation set to check for overfitting or underfitting. Thus, the optimal model is identified
with the help of the training and validation sets: the performance on both sets should
be comparable. Further details are discussed in the next section on hyperparametric
optimization. The choice of the ML model is completely up to the user, although
some rules of thumb can be used to identify an appropriate one. The first among these
is Occam's razor, that is, to use the simplest model among the possible
ones. For instance, if linear regression can do the job, then there is no need to use
neural networks. Figure 10.3 shows a flowchart that can be used as a guide to select
models based on the dataset size and requirements.
While developing an ML model, it is always prudent to start with simple linear
regression or logistic regression for a regression or classification problem, respectively.
If the model is not accurate enough, higher-order polynomials can be evaluated.
Fig. 10.3 ML model selection guideline based on the size of the dataset
Note that using higher-order polynomials always runs the risk of overfitting the
model, which can be handled using proper hyperparametric optimization. In some
cases, the model may improve using regularized linear regressions such as lasso,
ridge, or elastic net. If these classical algorithms are not accurate enough (a quality
that has to be judged by the user), other models may be used. At this point, the size
of the dataset plays a major role. If the size of the dataset is small (another quality
judged by the user), for instance, less than a thousand, then algorithms such as sup-
port vector machines, classification and regression trees (CART), random forests, or
gradient boosted decision trees may be preferable. Note that CART approaches
with modifications in the regularizer, the loss function, or the way a new branch is created
have resulted in several algorithms with minor differences. Other approaches
combine multiple CART models (and sometimes other algorithms)
into ensemble models. Out of these, one algorithm that requires
special mention is extreme gradient boosted decision trees (XGBoost), which seems
to work particularly well for both classification and regression tasks. XGBoost
incorporates several features, including multiple in-built regularizers that prevent
overfitting. An interesting aspect of XGBoost is its ability to interpret the relative
importance of the input features based on the number of times they have been used
to create a branch or a leaf. This aspect makes XGBoost an interpretable machine
learning model in comparison to other algorithms that are more black-box in nature.
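The tree-based feature-importance idea can be sketched with scikit-learn's gradient boosting as a stand-in for XGBoost (the synthetic data, where feature 2 dominates, is invented for illustration):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 4))
y = 5.0 * X[:, 2] + 0.5 * X[:, 0] + rng.normal(0, 0.1, 400)

gbt = GradientBoostingRegressor(n_estimators=200, random_state=0).fit(X, y)
# Impurity-based importances, normalized to sum to 1
top_feature = int(np.argmax(gbt.feature_importances_))
```

XGBoost's `XGBRegressor` exposes the analogous `feature_importances_` attribute (with several importance types, such as split counts and gain).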
If the dataset is large (for instance, greater than 1000), then neural network
approaches may work best. Depending on the size of the dataset and the complexity
of the input–output relationship, it could be a simple multilayer perceptron (MLP)
with a single hidden layer or a more complex deep network with multiple hidden lay-
ers, each having several hidden layer units. All the approaches mentioned thus far are
deterministic in nature: for a given set of input features, there is only one output that
can be obtained. If probabilistic modeling is of interest, then Gaussian process
regression (GPR) should be used. GPR allows one to obtain the best estimate for a
given set of input features along with the standard deviation of the prediction. How-
ever, due to the matrix inversion involved in GPR, the training procedure becomes
extremely expensive, even prohibitive, for large numbers of data points. In
such cases, scalable Gaussian processes may be used, for example, kernel interpola-
tion for scalable structured GPs (KISS-GP). Altogether, model development should
rely on choosing the simplest model that delivers acceptable performance.
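A minimal GPR sketch with scikit-learn, showing the mean prediction together with its standard deviation (the 1-D toy data is invented):

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

rng = np.random.default_rng(0)
X = rng.uniform(0, 5, size=(40, 1))
y = np.sin(X).ravel() + rng.normal(0, 0.05, 40)

# RBF kernel for the smooth signal + WhiteKernel for observation noise
gpr = GaussianProcessRegressor(kernel=RBF() + WhiteKernel(), random_state=0).fit(X, y)
mean, std = gpr.predict(np.array([[2.5]]), return_std=True)
```

The returned `std` grows away from the training data, which is exactly the uncertainty estimate that deterministic models lack.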
Hyperparametric optimization is one of the most crucial steps involved in the training
of a machine learning model. The fundamental difference between the parameters of
a model and its hyperparameters is that the former are updated during the training
process, while the latter are set a priori and kept constant while training
is in progress. As such, the performance of an ML model is highly dependent
on the hyperparameters chosen.
There are several aspects involved in hyperparametric optimization. The first step
is the identification of the hyperparameters associated with a selected model. The
usual hyperparameters associated with ML models are the training epochs, loss func-
tion, learning rate, regularizer and the associated weights, if any, and batch size.
Depending on the ML model used, there could be additional hyperparameters, such as
the number of hidden layers and hidden layer units for an MLP, the number of branches
and tree depth for CART-based approaches, or the kernel functions associated with
support vector and Gaussian process regressions. Similarly, in the case of MLPs,
dropout can be used to ensure that the weights of the hidden layer units are optimal. In
dropout, the percentage dropout per training epoch is another hyperparameter, which
can vary from 0.1 to 0.3 (as a rule of thumb). Due to the large number of options avail-
able, several algorithms mentioned earlier (see Chap. 7), including grid search,
random search, and Bayesian approaches, can be employed for identifying the optimized
hyperparameters.
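Grid search can be sketched with scikit-learn's `GridSearchCV`; the model, parameter grid, and synthetic data below are illustrative choices:

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import Ridge

X, y = make_regression(n_samples=200, n_features=5, noise=5.0, random_state=0)

# Try each regularization strength with 5-fold cross validation
grid = GridSearchCV(Ridge(), param_grid={"alpha": [0.01, 0.1, 1.0, 10.0]}, cv=5)
grid.fit(X, y)
best_alpha = grid.best_params_["alpha"]
```

`RandomizedSearchCV` has the same interface for random search, which scales better when many hyperparameters are tuned jointly.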
Another approach typically employed during hyperparametric optimization
is k-fold cross validation (discussed in Chap. 7). In this approach, the training
dataset (including the validation set) is divided into k folds, where k can be any
number ranging from 5 to 50 (as a rule of thumb). Among the k folds, k − 1 are used
for training and the remaining one for validation, and the process is repeated so that
each fold serves as the validation set exactly once. For example, in 10-fold
cross validation, one fold is taken as the validation set and nine folds as the training
set, leaving ten possible options for the validation set. The training
should be conducted on all ten options, and the validation scores should
be comparable across the ten sets. This ensures that the developed model is optimal
and has reasonable generalizing capability.
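The 10-fold procedure above can be sketched with scikit-learn's `cross_val_score` (synthetic data for illustration):

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import Ridge

X, y = make_regression(n_samples=300, n_features=5, noise=1.0, random_state=0)

# One R² score per fold; comparable scores across folds indicate
# a model that generalizes consistently
scores = cross_val_score(Ridge(alpha=1.0), X, y, cv=10, scoring="r2")
spread = scores.max() - scores.min()
```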
Finally, the model is evaluated by comparing the performance on the test set. The
values predicted by the model are compared with the actual values of the test set
that is unseen by the model. Several measures are commonly used to
evaluate the test set, such as the root mean squared error (RMSE) or the MSE, the mean
absolute error, the mean absolute percentage error, or the coefficient of determination,
also known as the R² value (Fig. 10.4).
10.6 Physics-Informed ML for Property Prediction
The approaches mentioned thus far simply use a labeled dataset (that is, with input
features and output properties) and train the ML model on it. They do not take into
account any physical constraints, either derived from domain knowledge or from first
principles, associated with the property of interest. Indeed, feature engineering
may be performed to include as many relevant features as possible based on the
physics or chemistry of the problem. However, no bias or constraint is provided
in the training process itself to respect known physical constraints. Including
such constraints may sometimes significantly improve the predictive capability of
the model. Here, we focus on how such constraints can be incorporated into ML
models for property prediction. This approach is termed physics-informed or
symbolic-reasoning-informed ML.
In traditional ML models, the information flows as follows. First, a
model is initialized with random weights. Then, the output values corresponding to the input
features in the training dataset are computed with this randomly initialized model.
The error between the predicted and actual output is computed using an error metric
such as the MSE or a similar loss function. Using this loss function as the objective,
optimization is carried out to update the weights so as to minimize this error. Thus, the
weights learned in this process are those that minimize the data
loss (that is, the error between the predicted and actual property). If additional information in
log10 η(T, η∞, Tg, m) = log10(η∞) + [12 − log10(η∞)] (Tg/T)
× exp{[m/(12 − log10(η∞)) − 1][(Tg/T) − 1]}   (10.1)
and periodic-table-based descriptors as input. Further, these predicted parameters
are substituted, along with the temperature, into the MYEGA equation to predict the
viscosity of a glass composition at a given temperature. The loss function is then
defined as the MSE between the predicted and actual viscosity.
There are several advantages of this gray-box approach in contrast with the
traditional data-driven approach, as listed below.
1. Interpretability: The model predicts the parameters governing the viscosity
rather than the viscosity itself. Thus, the model can also infer meaningful
quantities such as Tg and m without directly training on them. These parameters
predicted by the MLP provide insights into the behavior of the glass composition
and can be verified independently of the viscosity.
2. Generalizability: The model strictly follows the MYEGA equation and hence
can be meaningfully extrapolated to unseen temperatures for a given glass com-
position. Further, the model can be used to predict the viscosity of unseen glass
compositions beyond those in the training dataset in a meaningful fashion. The
performance can be further evaluated as additional parameters such as Tg and m
are predicted.
It was also demonstrated that the proposed approach can provide improved pre-
dictions of viscosity compared to purely data-driven approaches, thanks to the additional
inductive bias provided by the MYEGA equation.
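The gray-box idea can be sketched as a differentiable MYEGA "head" (Eq. 10.1) applied to parameters that, in the full model, would come from an MLP; the parameter values below are arbitrary illustrative numbers:

```python
import numpy as np

def myega_log10_viscosity(T, log_eta_inf, Tg, m):
    """MYEGA equation (Eq. 10.1): log10 viscosity from the glass transition
    temperature Tg, fragility m, and high-temperature limit log10(eta_inf)."""
    A = 12.0 - log_eta_inf
    return log_eta_inf + A * (Tg / T) * np.exp((m / A - 1.0) * (Tg / T - 1.0))

# In the gray-box model, an MLP predicts (log_eta_inf, Tg, m) from
# composition descriptors; the MSE on the resulting viscosity is the loss.
pred = myega_log10_viscosity(T=np.array([800.0, 1200.0]),
                             log_eta_inf=-3.0, Tg=800.0, m=35.0)
```

By construction, the MYEGA form gives log10 η = 12 at T = Tg, which is one of the consistency checks such a head inherits for free.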
SRIMP: Symbolic-Reasoning Informed Prediction of Hardness
The hardness of glass, a crucial property, is measured using instrumented indentation
experiments. However, it is important to note that the obtained hardness values are
not solely determined by the intrinsic properties of the glass. They are also influ-
enced by various factors, including the loading procedure, indenter geometry, and
environmental conditions. One significant phenomenon that affects glass hardness is
the indentation size effect (ISE). The ISE refers to the observed behavior where the
hardness of glass monotonically decreases and saturates as the applied load increases.
This behavior poses challenges in comparing hardness values obtained under different
loading conditions. The underlying cause of the ISE is the stress concentration
generated by sharp contact loading, leading to localized structural changes in the
glass network and resulting in permanent deformation.
To enable meaningful comparisons of hardness values obtained at varying loads,
it is essential to predict load-independent hardness values. To this end, a recent
work [5] proposed combining Bernhardt's law with machine learning to develop
a symbolic-reasoning-informed machine learning procedure (SRIMP) for predicting
glass hardness. The hardness H from indentation experiments is given by

H = 2P sin(θ/2) / L_D²   (10.2)

where L_D is the diagonal length of the indent after unloading, θ is the tip angle of
the indenter, and P is the applied load. According to Bernhardt's model, the ISE
is given by

P / L_D = a₁ + a₂ L_D   (10.3)

Combining these two equations by eliminating L_D and solving the resulting quadratic
equation, we obtain

H = C₁/(2P) + C₂ + √(C₁² + 4C₁C₂P) / (2P)   (10.4)

where C₁ = 2a₁² sin(θ/2) and C₂ = 2a₂ sin(θ/2). Similarly, combining the two
equations and eliminating P, we get

H = 2a₂ sin(θ/2) + 2a₁ sin(θ/2)/L_D = H∞ + α_ISE/L_D   (10.5)
Fig. 10.6 Predicting Hardness with symbolic reasoning informed machine learning. Reprinted
with permission from [5]
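Equation (10.3) makes the load-independent hardness H∞ recoverable from (P, L_D) data by a simple linear fit of P/L_D versus L_D; the tip angle and Bernhardt coefficients below are made-up values for a noiseless illustration:

```python
import numpy as np

theta = np.deg2rad(136.0)                 # Vickers-like tip angle (assumed)
a1_true, a2_true = 1.2, 3.0               # assumed Bernhardt coefficients
L_D = np.linspace(5.0, 50.0, 20)          # indent diagonals (arbitrary units)
P = a1_true * L_D + a2_true * L_D ** 2    # Bernhardt's law, Eq. (10.3)

# Linear fit of P/L_D versus L_D: slope = a2, intercept = a1
a2_fit, a1_fit = np.polyfit(L_D, P / L_D, 1)
# Load-independent hardness H_inf = 2 a2 sin(theta/2), from Eq. (10.5)
H_inf = 2.0 * a2_fit * np.sin(theta / 2.0)
```

With real indentation data the fit is noisy, which is where an ML model predicting (a₁, a₂) from composition, as in SRIMP, becomes useful.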
10.7 Summary
This chapter discusses the major approaches employed to predict the properties of
materials based on structured data. Further, it provides a comprehensive overview of
the key steps involved in developing machine learning models for structured data,
along with an introduction to the concept of physics-informed machine learning
for predicting material properties. Some of the key challenges inherent in predicting
material properties, and the need for advanced computational methods to address them,
are briefly discussed. The chapter also delves into the essential stages of con-
structing machine learning models for structured data, including data preprocessing,
feature engineering and selection, and model training and evaluation. Note that high-
quality data and effective feature representation are two important aspects for ensuring
accurate and robust predictions. Furthermore, the chapter introduces the concept
of physics-informed machine learning, which integrates domain-specific knowledge
and fundamental physical principles into machine learning models. This integra-
tion not only improves prediction accuracy but also ensures that the models adhere
to the governing laws and principles underlying material behavior. Throughout the
chapter, illustrative examples and case studies demonstrate the practical application
of machine learning in predicting material properties across a wide range of materials.
Some interesting additional reading includes Refs. [6–13]. We hope these examples
showcase the strengths of machine learning models relative to traditional methods and
highlight their potential to expedite material discovery and development.
References
11.1 Introduction
Fig. 11.1 Materials discovery flowchart. Reprinted with permission from [1]
ML models for property prediction, once trained and validated, are useful for
exploring a larger domain of the input features. The model can be used to explore
new compositions, structures, and processing conditions, depending on the input
features. However, most of the input features for property prediction are continuous
variables and can take a wide range of values. Thus, exploring all the possibilities
for the discovery of a new material is challenging.
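A basic surrogate-model search can be sketched as follows: train a cheap model on labeled data, then scan a dense grid of candidate compositions with it. The two-component "composition" and the property peak at (0.3, 0.7) are invented for illustration:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
# Hypothetical training data: two composition fractions -> property,
# with an (assumed) optimum at (0.3, 0.7)
X_train = rng.uniform(0, 1, size=(500, 2))
y_train = -((X_train[:, 0] - 0.3) ** 2 + (X_train[:, 1] - 0.7) ** 2)

surrogate = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_train, y_train)

# Exhaustive scan of a candidate grid using the cheap surrogate
g = np.linspace(0, 1, 21)
candidates = np.array(np.meshgrid(g, g)).reshape(2, -1).T
best = candidates[np.argmax(surrogate.predict(candidates))]
```

For higher-dimensional composition spaces an exhaustive grid becomes infeasible, which motivates the optimization and generative approaches discussed in this chapter.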
11.2 ML Surrogate Model Based Optimization
The materials selection chart was first proposed by Mike Ashby as a unique way to
visualize the variations in multiple, seemingly independent properties of materi-
als. The materials selection chart enables the identification and selection of materials
with target properties, or even cost. Figure 11.5 shows a traditional materials selection
chart. Although the traditional materials selection chart is two-dimensional in nature,
the chart can be made multi-dimensional, with more features added as additional
axes. Although these charts were developed for a large class of materials, such selection
charts can also be developed for specific material families, for instance, silicate glasses or
aluminum alloys. In such cases, populating these charts requires detailed experimen-
tal characterization of each property of interest for a given composition or
structure. This can be extremely challenging, both economically and time-wise, due
to the large number of sample preparations and experiments that need to be carried
out.
An alternate approach is to train ML models for the properties, which can then be
used to develop the materials selection charts. Indeed, the training of ML models
also requires large amounts of data. However, it is not mandatory that all the properties
be available for a given composition. For instance, independent ML models
Fig. 11.5 Strength versus Density. Chart created using CES EduPack 2019, ANSYS Granta ©2020
Granta Design
Fig. 11.6 Glass selection chart for elastic modulus and thermal expansion coefficient
can be developed for each of the properties, while ensuring that the input features for
all these models are the same and consistent. The datasets may individually be different.
Once the ML models are developed, multiple properties can be predicted for any
given composition or structure within the input feature space. This can then be used
to develop a selection chart for a given material class, for instance, a glass selection chart.
Figure 11.6 shows a glass selection chart developed using ML models for Young's
modulus and thermal expansion coefficient (TEC). Although experimental data
may not be available for both Young's modulus and TEC for a given glass system,
the ML models trained on the dataset can be used to predict the TEC of glasses for
which Young's modulus is available, and vice-versa. The ML models thus allow the
imputation of missing data and the prediction of properties for new compositions, based
on which the glass selection charts can be developed. Additional analysis of these
charts can also be used to develop composite materials. The contour lines in
Fig. 11.6 correspond to compositions with a constant value of Eα, where E is the
Young's modulus and α is the TEC. Note that Eα corresponds to the thermal stress
developed in a constrained material subjected to a unit (1 °C) change in temperature. Thus,
materials having constant values of Eα can be used to develop composite materials
that exhibit zero thermal stress mismatch, while having significantly different modulus
values (Fig. 11.7).
11.4 Generative Models 197
One of the first works to use GANs to generate a large number of materials was by
Dan et al. [5] by training a GAN on a large number of datasets from databases such
as OQMD, ICSD, and Materials project to develop MatGAN. Figure 11.8 shows the
architecture of MatGAN. In this work, the authors trained a GAN model to generate
new materials. The model achieved a high level of novelty, generating materials that
have not been seen before, with a novelty percentage of 92.53% when producing 2
million samples. Furthermore, the generated samples exhibited a high percentage of
chemical validity, with 84.5% of the generated materials being chemically valid in
terms of charge neutrality and electronegativity balance. Notably, the GAN model
does not explicitly enforce chemical rules but demonstrates its ability to implicitly
learn and adhere to the underlying composition rules for forming compounds.
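MatGAN itself is a deep model trained on large databases; as a purely illustrative stand-in (all numbers invented), the adversarial update loop can be sketched for a one-feature distribution in NumPy:

```python
import numpy as np

rng = np.random.default_rng(0)
real = rng.normal(4.0, 0.5, size=(1000, 1))   # "real" single-feature samples

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Generator x = w_g*z + b_g; discriminator D(x) = sigmoid(w_d*x + b_d)
w_g, b_g, w_d, b_d, lr = 1.0, 0.0, 0.1, 0.0, 0.01
for _ in range(2000):
    z = rng.normal(size=(64, 1))
    fake = w_g * z + b_g
    x_real = real[rng.integers(0, len(real), 64)]
    # Discriminator step: ascend log D(real) + log(1 - D(fake))
    d_real = sigmoid(w_d * x_real + b_d)
    d_fake = sigmoid(w_d * fake + b_d)
    w_d += lr * (np.mean((1 - d_real) * x_real) - np.mean(d_fake * fake))
    b_d += lr * (np.mean(1 - d_real) - np.mean(d_fake))
    # Generator step: ascend log D(fake)
    d_fake = sigmoid(w_d * (w_g * z + b_g) + b_d)
    w_g += lr * np.mean((1 - d_fake) * w_d * z)
    b_g += lr * np.mean((1 - d_fake) * w_d)

samples = w_g * rng.normal(size=(500, 1)) + b_g   # draw from the generator
```

Real materials GANs replace both linear maps with deep networks and the single feature with a full composition or structure representation, but the alternating update scheme is the same.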
Interestingly, some families of materials were not generated successfully by
the GANs. This was attributed to the limited amount of data for these materials,
which was insufficient to learn the composition rules required to generate such samples.
To evaluate these materials, the authors employed an autoencoder architecture. An
autoencoder consists of an encoder–decoder structure with a bottleneck layer in the middle.
The aim of such a bottleneck layer is to learn a representation with significantly
reduced dimension that can retain the maximum information of the original data.
Thus, the autoencoder is an unsupervised approach towards dimensionality reduc-
tion of an input representation. The hypothesis was that any structure that cannot be
generated by an autoencoder can also not be generated by a GAN. Figure 11.9 shows
the VAE architecture employed to evaluate the GAN. Note that one of the major
challenges associated with ML is to identify or develop a metric capturing the diffi-
culty associated with a learning task. Using such surrogate-model-based approaches
is a useful strategy to identify why a model fails for certain domains of materials.
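The encoder–bottleneck–decoder idea can be illustrated with a linear autoencoder trained by gradient descent on synthetic rank-2 data (all sizes and rates are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)
latent_true = rng.normal(size=(500, 2))
X = latent_true @ rng.normal(size=(2, 8))    # 8-D data lying on a 2-D manifold

# Encoder W_e (8 -> 2 bottleneck) and decoder W_d (2 -> 8),
# trained on the reconstruction MSE
W_e = rng.normal(scale=0.1, size=(8, 2))
W_d = rng.normal(scale=0.1, size=(2, 8))
lr, n = 0.01, len(X)

def recon_error():
    return float(np.mean((X @ W_e @ W_d - X) ** 2))

err_before = recon_error()
for _ in range(500):
    Z = X @ W_e                   # encode to the 2-D bottleneck
    R = Z @ W_d - X               # reconstruction residual
    grad_Wd = Z.T @ R / n
    grad_We = X.T @ (R @ W_d.T) / n
    W_d -= lr * grad_Wd
    W_e -= lr * grad_We
err_after = recon_error()
```

A VAE additionally makes the bottleneck probabilistic (mean and variance per latent dimension), which is what enables sampling and property optimization in the latent space.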
Figure 11.10 shows the two-dimensional representation of the crystal structures
based on the training and test sets, along with the newly generated structures. The two axes
correspond to the two dimensions after t-SNE-based dimension reduction. The ICSD
materials occupy only a tiny portion of the chemical space of inorganic materials.
Figure 11.10a, b, and c shows training samples (green dots) and leave-out validation
samples (red dots) from ICSD, together with 50,000 and 200,000 generated samples (blue dots),
Fig. 11.9 Variational autoencoder for materials discovery. Reprinted with permission from [5]
Fig. 11.10 Inorganic materials space composed of existing ICSD materials and hypothetical mate-
rials generated by GAN-ICSD. Reprinted with permission from [5]
respectively. The GAN approach is thus able to explore a significantly larger domain
of crystals while training on a smaller domain (Fig. 11.11).
Note that the approach of GANs is not limited to crystal structures. With the advent
of additive manufacturing and 3D printing, architectured materials with topology
optimized for different loading conditions are desirable. Several efforts in
this direction also attempt to employ GANs [6].
In addition to generation tasks, inverse design requires control over the genera-
tive process to prioritize desired qualities. Variational Autoencoders (VAEs) enable
explicit optimization of properties by operating on a continuous representation. On
the other hand, Generative Adversarial Networks (GANs) and Recurrent Neural
Networks (RNNs) achieve property optimization by biasing the generation process
using techniques like Reinforcement Learning (RL), where generative behaviors are
rewarded or penalized.
VAEs provide control over data generation through latent variables. An Autoen-
coder (AE) model consists of an encoding and a decoding network. The encoder
Fig. 11.11 Examples of GAN-generated architectured materials with Ẽ_mean (Ω ≤ 5%) achieving
more than 94% of E_HS. Reprinted with permission from [6]
processes. Thus, the utilization of VAEs and their latent space provides a framework
for property optimization and exploration in generative modeling, facilitating the
design of molecules with desired characteristics.
Fig. 11.13 Reinforcement learning for a specific task. Reprinted with permission from [8]
Policy-based methods, such as policy-gradient reinforcement learning algorithms, directly optimize the policy function that maps states to actions. By iteratively adjusting the policy parameters based on gradient information, these methods have been employed to search for optimal material configurations
or optimize material synthesis pathways. Actor-critic models combine elements of
both value-based and policy-based methods. They consist of an actor that selects
actions based on a policy and a critic that estimates the value function. By leveraging this combination, actor-critic models have demonstrated improved performance
in materials discovery tasks. Some of the example applications of RL for material
discovery are outlined here [4, 7, 8] (Figs. 11.14 and 11.15).
11.5 Reinforcement Learning for Optimizing Atomic Structures
One of the most challenging problems in the field of materials is to optimize the structure of a bulk system based on its atomic configuration. This problem arises in identifying a stable structure from a random initial configuration, optimizing structures with defects, modeling grain boundaries, optimizing liquid or solid bulk structures, obtaining stable glassy or disordered structures, and even in drug discovery. To address this problem, one approach is to allow the system to learn policies that discover lower minimum-energy structures through reinforcement learning (RL).
Recently, it has been demonstrated that RL combined with graph neural networks can be used to discover stable structures through a framework named StriderNet [10]. Here, we briefly discuss this framework as an example of how RL can be used to address a challenging problem in material discovery.
In this work, the authors formulate the discovery of stable structures as an RL problem as follows. Let $\Omega_c(x_1, x_2, \ldots, x_N)$ be a configuration of an $N$-atom system with energy
Fig. 11.14 Visualization of the molecular design process. The RL agent (depicted by a robot arm)
sequentially places atoms onto a canvas. By working directly in Cartesian coordinates, the agent
learns to build structures from a very general class of molecules. Learning is guided by a reward that
encodes fundamental physical properties. Bonds are only for illustration. Reprinted with permission
from [9]
Fig. 11.15 Neural-network policies trained by evolutionary reinforcement learning can enact effi-
cient time- and configuration-dependent protocols for molecular self-assembly. A neural network
periodically controls certain parameters of a system, and evolutionary learning applied to the weights
of a neural network (indicated as colored nodes) results in networks able to promote the self-assembly of desired structures. The protocols that give rise to these structures are then encoded in the
weights of a self-assembly kinetic yield net. Reprinted with permission from [7]
204 11 Material Discovery
Fig. 11.16 Optimization of a bulk binary LJ liquid system using StriderNet [10]
Fig. 11.17 StriderNet architecture for optimizing atomic structures. The atomic structure is
transformed into a graph, which is passed to a policy network that predicts node displacement, and
reward is computed. Finally, the policy parameters are updated based on the discounted reward [10]
$U_{\Omega_c}$ sampled from the energy landscape $U^{Nd}$ of the system. Starting from $\Omega_c$, our goal is to obtain the configuration $\Omega_{\min}$ exhibiting the minimum energy $U_{\Omega_{\min}}$ by displacing the atoms. To this end, we aim to learn a policy $\pi$ that displaces the atoms so that the system moves toward lower-energy configurations while allowing it to overcome local energy barriers [10].
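The tension between descending in energy and occasionally accepting uphill moves can be illustrated with a toy example. The sketch below is not the learned policy of StriderNet; a simple simulated-annealing acceptance rule stands in for the policy $\pi$ on a tilted double-well landscape, showing how accepting occasional uphill displacements lets the system cross the barrier between minima (the energy function, step sizes, and temperatures are all illustrative).

```python
import math
import random

def energy(x):
    """Tilted double well: local minimum near x = +1, global minimum near x = -1."""
    return (x * x - 1.0) ** 2 + 0.2 * x

def anneal(x, steps=5000, t_start=1.0, t_end=0.01, seed=0):
    """Propose random displacements; accept uphill moves with Metropolis
    probability exp(-dU/T) so the walker can escape local minima."""
    rng = random.Random(seed)
    best_x, best_u = x, energy(x)
    for k in range(steps):
        t = t_start * (t_end / t_start) ** (k / steps)  # geometric cooling
        x_new = x + rng.uniform(-0.3, 0.3)
        du = energy(x_new) - energy(x)
        if du < 0 or rng.random() < math.exp(-du / t):
            x = x_new
        if energy(x) < best_u:
            best_x, best_u = x, energy(x)
    return best_x, best_u

x_opt, u_opt = anneal(1.0)   # start in the higher (local) minimum
print(x_opt, u_opt)          # best_u < 0 only once the walker has crossed into the left well
```

A purely greedy rule (accept only if `du < 0`) would stay trapped near the starting minimum; the temperature schedule is what trades exploration for exploitation over time.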
Figure 11.16 shows the atomic structure of a binary Kob-Andersen Lennard-Jones liquid with two particle types in the ratio 80:20. Optimizing the structure starting from a random configuration results in more stable structures with overall lower energies. Consequently, this also reduces the energy of each atom in the system and of its local environment. The StriderNet approach takes inspiration from this observation. It employs an RL approach to learn what displacements should be made to each atom so that the overall energy reduces. While doing this, care should be taken to balance exploration against exploitation. Specifically, always moving along the direction that decreases energy will trap the system in a local minimum, preventing it from relaxing further. However,
exploring for too long could lead the system toward higher and higher energy states, making it more challenging to discover the minima. An optimal policy would allow enough exploration for the system to overcome local energy barriers. However, as soon as it reaches a well with extremely low energy values, it should exploit this fact and descend to the minimum.
To address this problem, StriderNet employs a graph reinforcement learning
approach. Figure 11.17 describes the architecture of StriderNet. To achieve permutation invariance and inductivity, the atomic configuration is represented as a
graph with nodes representing the atoms and edges representing the bonds between
them. Note that a realistic cutoff is selected to create the graph based on the atomic
structure. Further, the graph is dynamic in nature as the atoms move. Subsequently,
a message-passing GNN is used to embed graphs into a feature space. The message-
passing architecture of the GNN ensures both permutation invariance and inductivity.
The policy network, in turn, predicts the displacements of each of the atoms, based on which the rewards are computed. Finally, the policy $\pi$ is learned by maximizing the discounted rewards. Note that the parameters of $\pi$ are learned using a set of training graphs exhibiting diverse energies that are sampled from the energy landscape $E^{Nd}$ of an atomic system with $N$ atoms in $d$ dimensions. Thus, the initial structure, although arbitrary and possibly unstable, is realistic and physically feasible. Then, given a new structure, the parameters of the learned policy network $\pi$ are adapted to the new structure while optimizing the new graph structure.
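The graph construction step described above can be sketched directly: nodes are atoms, and an edge connects every pair of atoms closer than a cutoff, so the representation is permutation invariant by construction (the cutoff value, coordinates, and helper names below are illustrative, not those of [10]).

```python
import math

def build_graph(positions, cutoff):
    """Turn an atomic configuration into a graph: one node per atom,
    an undirected edge for every pair within the cutoff distance."""
    n = len(positions)
    edges = []
    for i in range(n):
        for j in range(i + 1, n):
            d = math.dist(positions[i], positions[j])
            if d <= cutoff:
                edges.append((i, j, d))  # keep the distance as an edge feature
    return {"num_nodes": n, "edges": edges}

# Four atoms in 2D; with cutoff 1.5 only nearby atoms are bonded.
atoms = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (3.0, 3.0)]
g = build_graph(atoms, cutoff=1.5)
print(g["edges"])  # the isolated atom at (3, 3) has no edges
```

Because the atoms move during optimization, such a graph would be rebuilt (or its edges updated) as positions change, which is what makes the graph dynamic.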
The approach was tested on three systems, namely, a binary LJ system, Si atoms modeled using a Stillinger-Weber potential, and a coarse-grained model of cement hydrate (C–S–H). It was demonstrated that StriderNet outperforms other optimization algorithms, including gradient descent and the fast inertial relaxation engine (FIRE), and reaches much more stable configurations. Thus, the RL-based approach can be a strong candidate to bridge the timescale gap between simulations and experiments, allowing one to discover stable structures corresponding to even millions of years of aging.
11.6 Summary
ML approaches can accelerate the discovery process by guiding the selection of candidate materials and optimizing atomic structures, with potential applications in the area of drug discovery.
Looking ahead, the future of materials discovery using ML holds immense potential. Continued advancements in ML algorithms, computational power, and data availability will further enhance the capabilities of ML-based approaches. Integration with experimental techniques, such as high-throughput synthesis and characterization, will enable rapid validation and feedback loops, accelerating the discovery process. Additionally, the development of physics-informed ML models will enable the incorporation of fundamental scientific principles into the design process, improving the reliability and interpretability of ML-based predictions. The combination of ML with other emerging technologies, such as quantum computing and advanced imaging techniques, will open up new frontiers for materials discovery. Overall, the continued exploration and integration of ML techniques in materials discovery will drive innovation, enabling the development of advanced materials with enhanced performance and functionality.
References
11. M.-P.V. Christiansen, H.L. Mortensen, S.A. Meldgaard, B. Hammer, Gaussian representation
for image recognition and reinforcement learning of atomistic structure. J. Chem. Phys. 153(4),
044107 (2020)
12. D.E. Rumelhart, G.E. Hinton, R.J. Williams, Learning representations by back-propagating
errors. Nature 323(6088), 533–536 (1986)
13. S.A. Meldgaard, H.L. Mortensen, M.S. Jørgensen, B. Hammer, Structure prediction of surface
reconstructions by deep reinforcement learning. J. Phys. Condens. Matter 32(40), 404005
(2020)
14. R. Bhattoo, S. Bishnoi, M. Zaki, N.A. Krishnan, Understanding the compositional control on
electrical, mechanical, optical, and physical properties of inorganic glasses with interpretable
machine learning. Acta Mater. 242, 118439 (2023)
15. R. Ravinder, K.H. Sridhara, S. Bishnoi, H. Singh Grover, M. Bauchy, Jayadeva, H. Kodamana,
N.M.A. Krishnan, Deep learning aided rational design of oxide glasses. Mater. Horiz. (2020).
Royal Society of Chemistry. https://doi.org/10.1039/D0MH00162G. [Online]. https://pubs.
rsc.org/en/content/articlelanding/2020/mh/d0mh00162g. Accessed 10 May 2020
16. S. Singla, S. Mannan, M. Zaki, N.A. Krishnan, Accelerated design of chalcogenide glasses
through interpretable machine learning for composition-property relationships. J. Phys. Mater.
6(2), 024003 (2023)
17. S. Bishnoi, R. Ravinder, H. Singh Grover, H. Kodamana, N.M. Anoop Krishnan, Scalable Gaussian processes for predicting the optical, physical, thermal, and mechanical properties of inorganic glasses with large datasets. Mater. Adv. (2021). Royal Society of
Chemistry. https://doi.org/10.1039/D0MA00764A. [Online]. https://pubs.rsc.org/en/content/
articlelanding/2021/ma/d0ma00764a. Accessed 10 Jan 2021
18. T.C. Le, D.A. Winkler, Discovery and optimization of materials using evolutionary approaches.
Chem. Rev. 116(10), 6107–6132 (2016)
Chapter 12
Interpretable ML for Materials
12.1 Introduction
The black-box nature of ML models does not allow a domain expert to gain insights
into the features learned by the model. However, there are several approaches, both
model-specific and model-agnostic, that can be used to interpret ML models. Some
of these models have been explained in Chap. 9. Here, we discuss the application
of these approaches to interpret ML models. Specifically, we will discuss how to
interpret
• composition–property models
• interdependence of input features
• physics associated with atomic motion.
Note that these are a few selected problems, which are aimed at giving insights
into how the ML models can be explained. There are several interesting approaches
applied to interpreting images, such as microstructure, failure patterns, and crystal
structure. These are dealt with separately in Chap. 14.
The SHAP values are visualized using a violin plot and a river flow plot. Both
these plots allow one to analyze the SHAP value from different angles: (i) one from
that of a given component, that is, whether a given component affects the output
positively or negatively, and (ii) the other from that of a given composition, that is, in
a given composition which components increase and which decrease the output from
the mean value. Note that, for a given composition, the sum of the SHAP values of all
the components present and the mean output value should give the actual property
value of that composition. Thus, the SHAP values of the components in a given
composition are additive in nature toward the final output value.
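The additivity property can be checked directly. The sketch below computes exact Shapley values by enumerating feature subsets, using a baseline-substitution value function (a common simplification; the toy model `f`, inputs, and baseline are illustrative): the values sum exactly to the difference between the model's prediction and its baseline output.

```python
from itertools import combinations
from math import factorial

def shapley_values(f, x, baseline):
    """Exact Shapley values for prediction f(x). The value of a coalition S
    is f evaluated with features in S taken from x, the rest from baseline."""
    n = len(x)
    def v(S):
        z = [x[i] if i in S else baseline[i] for i in range(n)]
        return f(z)
    phi = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        total = 0.0
        for size in range(n):
            for S in combinations(others, size):
                # Shapley weight |S|! (n - |S| - 1)! / n!
                w = factorial(size) * factorial(n - size - 1) / factorial(n)
                total += w * (v(set(S) | {i}) - v(set(S)))
        phi.append(total)
    return phi

# Toy "model" with an interaction term between features 1 and 2.
f = lambda z: 2 * z[0] + z[1] * z[2] + 0.5
x, base = [1.0, 2.0, 3.0], [0.0, 0.0, 0.0]
phi = shapley_values(f, x, base)
print(phi, sum(phi), f(x) - f(base))  # the SHAP values are additive
```

Subset enumeration scales as $2^n$ and is only feasible for a handful of features; practical SHAP libraries use model-specific approximations, but the additivity (efficiency) property illustrated here is preserved.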
The violin plot, also known as the beeswarm plot, represents the contribution of a given feature toward the output value. Note that the color of the points represents the value of the feature and the x-axis represents the corresponding SHAP value. Thus, the violin plot is colored according to the feature value. The violin plot allows one to identify which features influence the outcome positively and which negatively. It also shows the features having maximum and minimum influence in a quantitative fashion. Further, the violin plot allows one to identify any features that exhibit mixed effects. Mixed effects are those where a non-monotonic variation in SHAP value is observed with a monotonic variation in the feature value. This means that the behavior of the given feature is not independent but depends on other features (elements or compounds) present in the material. This could further be understood in terms of the SHAP interaction values discussed in the next section.
Another way to visualize the SHAP values is through the river flow plot. For a given material or composition, the river flow plot shows the contribution of each component toward the final property value. Thus, a given line represents one material, and the intermediate points represent the contributions of the different input components toward the property value. In this case, the y-axis value corresponding to each component represents the sum of the SHAP values of all the components up to that component, added to the mean. Thus, the value corresponding to the last component represents the total property value for a given composition. These paths are created by nudging the prediction from the expected value in a particular direction representing that specific glass component's contribution. Note that the river flow plot is colored according to the final output value. Altogether, the SHAP beeswarm and river flow plots provide an approach to analyze the contribution of the individual components toward a given property value. Experimentalists can hence use them to design materials with targeted property values by increasing or decreasing the components in a material based on their SHAP values for the given property.
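A river flow line is simply the running sum of a composition's SHAP values on top of the mean prediction. The sketch below (with purely illustrative component names and numbers) shows that the last point of the path recovers the predicted property value.

```python
def river_flow_path(mean_prediction, shap_values):
    """Cumulative path for one composition: each point adds the next
    component's SHAP value; the final point equals the model prediction."""
    path, running = [], mean_prediction
    for s in shap_values:
        running += s
        path.append(running)
    return path

# Hypothetical glass composition with per-component SHAP values (in GPa).
mean_E = 75.0
shap = {"SiO2": 4.2, "Al2O3": 1.1, "Na2O": -2.3, "CaO": 0.6}
path = river_flow_path(mean_E, list(shap.values()))
print(path)  # last entry = 75.0 + 4.2 + 1.1 - 2.3 + 0.6 = 78.6
```

Plotting one such path per composition, colored by the final value, reproduces the river flow visualization described above.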
Understanding the interdependence of the input features is all the more important in materials modeling. For instance,
in oxide glasses, the coordination of boron can be three or four depending on the
presence of a charge-compensating sodium atom in the near vicinity. If sodium is
absent, boron takes a trigonal planar structure (BO$_3$), while in the presence of a sodium atom, boron takes a tetrahedral structure (BO$_4^-$). Such interdependence of the input
features may not be captured by simple correlation analysis. It should also be noted that the interdependence of features may be specific to each property. For instance, the impact of the dependence between input components might differ from property to property, depending on the factors governing the property, whether electronic, structural, thermodynamic, or physical. SHAP dependence and interaction analysis can be used to elucidate such interdependence of the input features (Figs. 12.2 and 12.3).
Thus, the correlation between several input components in a model can also be studied using the SHAP interaction values. To obtain the SHAP interaction values, the variation in the output (predicted) value while perturbing two input components simultaneously is analyzed. If the magnitude of variation in the output while perturbing a single input component is the same for different values of the second input component, the two input components are not correlated; otherwise, they are correlated. The degree of this correlation can also be quantified from the SHAP interaction values.
It is worth noting that the SHAP interaction values for the features are obtained for each property separately. Thus, SHAP interaction values should not be confused with correlation functions. Indeed, the SHAP interaction values could differ for each property, even for the same set of input features and dataset. The interaction values could thus provide insights into which features are related for a given property, providing a direction for investigation. The exact nature of the coupling could be further investigated employing simulations or experiments.
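The perturbation idea behind interaction values can be made concrete with a mixed second difference: if perturbing feature `a` changes the output by the same amount regardless of `b`, the cross term vanishes; otherwise the two features interact. The toy functions below are illustrative, not a SHAP implementation.

```python
def mixed_difference(f, a, b, h=1e-3):
    """Second-order cross difference f(a+h,b+h) - f(a+h,b) - f(a,b+h) + f(a,b).
    Zero (to numerical precision) when a and b contribute independently."""
    return f(a + h, b + h) - f(a + h, b) - f(a, b + h) + f(a, b)

additive = lambda a, b: 2 * a + 3 * b              # no interaction
coupled  = lambda a, b: 2 * a + 3 * b + 4 * a * b  # interacting features

print(mixed_difference(additive, 1.0, 1.0))  # ~0: independent contributions
print(mixed_difference(coupled, 1.0, 1.0))   # ~4*h*h: features interact
```

For the coupled model the cross difference equals $4h^2$ exactly, i.e., the mixed partial derivative $\partial^2 f / \partial a \, \partial b = 4$ times $h^2$; SHAP interaction values capture this same off-diagonal effect in a game-theoretic form.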
12.4 Decoding the Physics of Atomic Motion
The previous sections focused on the use of interpretable ML for decoding input-output relationships. In addition to property prediction, ML can be used to gain insights into the materials' response itself. This can be achieved by learning the patterns associated with atomic motion obtained through in silico experiments. Some of the open questions in materials research include understanding the structure-dynamics relationship in disordered systems. Disordered systems include inorganic glasses, colloidal gels, metallic glasses, and even granular systems. These systems exhibit interesting behaviors such as ductile deformation, glass transition, and jamming, the physics of which remains elusive to date. Indeed, the plasticity mechanisms in crystalline systems are better understood by analyzing dislocations and their propagation. However, the structural origin of ductile behavior in disordered systems is an active area of research.
One of the first attempts to understand the physics of the structure-dynamics relationship in disordered systems employing supervised ML approaches was through a quantity named softness, which attempts to characterize the local structure and its relation with dynamics [1]. In this work, instead of trying to intuit the relationship between structure and dynamics, an ML approach using large amounts of data from either molecular dynamics simulations or experiments was employed [1–6].
This approach was then used to address a variety of interesting problems, including the relationship between structure and relaxation in out-of-equilibrium systems, the role of defects in governing fracture behavior, and universal signatures of structure-property relationships in disordered solids [1–6].
The idea of softness revolves around the use of a support vector machine (SVM) to classify atoms based on their structural features and displacements. Using an SVM with a linear kernel provides interpretability in terms of the support vector, that is,
Fig. 12.4 Parametrizing the local structure in a disordered solid through supervised ML. Reprinted
with permission from [1]
the plane that classifies the data points based on their features and labels. The basic steps involved in the computation of softness are as follows.
1. Identify two populations of atoms or clusters of atoms: (i) those about to experience rearrangements, and (ii) those that are stable.
2. Quantify the degrees of freedom of each of these atoms using structural features such as bond and angle distribution functions or orientational orders.
3. Learn the function (the support vector, represented by the plane in this case) that optimally separates (classifies) the rearranging population from the stable population.
4. Compute the distance of each atom from the separating plane, which is defined as the softness $S_i$. The farther an atom is from the plane, the more stable (or unstable) it is (see Fig. 12.4).
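Steps 3 and 4 can be sketched with a minimal linear classifier: after fitting a separating plane $w \cdot x + b = 0$, the softness of a particle is its signed distance from that plane. The tiny hand-made feature vectors and the perceptron below stand in for the structural descriptors and linear-kernel SVM of [1].

```python
import math

def fit_perceptron(X, y, epochs=100):
    """Find a separating plane w.x + b = 0 for linearly separable data."""
    w, b = [0.0] * len(X[0]), 0.0
    for _ in range(epochs):
        mistakes = 0
        for x, label in zip(X, y):
            if label * (sum(wi * xi for wi, xi in zip(w, x)) + b) <= 0:
                w = [wi + label * xi for wi, xi in zip(w, x)]
                b += label
                mistakes += 1
        if mistakes == 0:  # converged: all points correctly classified
            break
    return w, b

def softness(x, w, b):
    """Signed distance to the plane: positive = rearranging (soft) side."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return (sum(wi * xi for wi, xi in zip(w, x)) + b) / norm

# Toy structural features: rearranging (+1) vs stable (-1) particles.
X = [(2, 2), (3, 2), (2, 3), (-2, -2), (-3, -2), (-2, -3)]
y = [1, 1, 1, -1, -1, -1]
w, b = fit_perceptron(X, y)
print(softness((2.5, 2.5), w, b), softness((-2.5, -2.5), w, b))
```

An SVM additionally maximizes the margin of the plane, but the interpretation is the same: the sign of the distance labels a particle soft or hard, and its magnitude quantifies how deep in either population the particle sits.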
There are several ways to perform each of these tasks, and the authors have demonstrated that the results remain qualitatively unaffected by the choice of method [1]. Here, we briefly describe some of the common choices for the tasks listed above.
The two common metrics used to identify the populations susceptible (and not susceptible) to motion are $D^2_{\min}$ [7] and $p_{hop}$ [8, 9]. Note that $p_{hop}$ is defined with reference to two time intervals $A = [t - 4000\delta t, t]$ and $B = [t, t + 4000\delta t]$, which are large enough to ensure that the system has undergone notable rearrangement, as
$$p_{hop}(i, t) = \sqrt{\langle (x_i - \langle x_i \rangle_B)^2 \rangle_A \, \langle (x_i - \langle x_i \rangle_A)^2 \rangle_B} \qquad (12.1)$$
where $\langle \cdot \rangle$ represents the average over the given time interval, and $\delta t$ represents the timestep of the simulation. The intuitive meaning of $p_{hop}$ is as follows. If there are no rearrangements in the intervals $A$ and $B$, then $p_{hop}$ reduces to the variance of the particle position over time. In case there are rearrangements in the intervals, then $p_{hop}$ is proportional to the square of the distance the particle moves between the two intervals as the system moves from one minimum to the other (see Fig. 12.5). Using the distribution of $p_{hop}$ shown in Fig. 12.5c, two populations of rearranging and non-rearranging particles can be constructed using a threshold value.
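Equation (12.1) can be implemented directly from a stored trajectory. The sketch below uses a small window `w` in place of $4000\delta t$ to keep the example short: a hop between two cages produces a large $p_{hop}$, while a quiet trajectory gives zero.

```python
def p_hop(traj, t, w):
    """p_hop(i, t) per Eq. (12.1): sqrt(<(x-<x>_B)^2>_A * <(x-<x>_A)^2>_B)
    with intervals A = [t-w, t] and B = [t, t+w]."""
    A = traj[t - w:t]
    B = traj[t:t + w]
    mean_A = sum(A) / len(A)
    mean_B = sum(B) / len(B)
    var_AB = sum((x - mean_B) ** 2 for x in A) / len(A)
    var_BA = sum((x - mean_A) ** 2 for x in B) / len(B)
    return (var_AB * var_BA) ** 0.5

quiet = [0.0] * 20             # no rearrangement
hop = [0.0] * 10 + [1.0] * 10  # particle jumps between two cages at t = 10
print(p_hop(quiet, 10, 10), p_hop(hop, 10, 10))  # 0.0 and 1.0
```

Applying this to every particle and time, and thresholding the resulting distribution as in Fig. 12.5c, yields the rearranging and non-rearranging populations used for training.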
Fig. 12.5 Quantifying rearrangements. a The distance in the inherent structure positions of a particle as a function of time. Hopping events can be noticed here. b The $p_{hop}$ indicator function of this trajectory. c The distribution of $p_{hop}$. A clear cross-over to an exponential distribution is observed at a well-defined value of $p_{hop}$. Reprinted with permission from [1]
Once the populations that are mobile are identified, the next step is to obtain features that quantify the local structure. To this end, a function that counts the number of particles around a central atom at a distance $r \pm \sigma$, similar to the pair distribution function, $G_i^X(r, \sigma)$, is defined (see Fig. 12.4a) as
$$G_i^X(r, \sigma) = \frac{1}{\sqrt{2\pi}} \sum_{j \in X} e^{-\frac{(R_{ij} - r)^2}{2\sigma^2}} \qquad (12.2)$$
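Equation (12.2) translates into a few lines of code: each neighbor $j$ of species $X$ contributes a Gaussian bump centered at its distance $R_{ij}$, so $G_i^X(r, \sigma)$ softly counts neighbors near radius $r$. The prefactor follows the equation as printed; the neighbor distances below are illustrative.

```python
import math

def g_radial(distances, r, sigma):
    """G_i^X(r, sigma) of Eq. (12.2): soft count of species-X neighbors
    near radius r, summing a Gaussian contribution per neighbor distance."""
    return sum(
        math.exp(-((d - r) ** 2) / (2 * sigma ** 2)) for d in distances
    ) / math.sqrt(2 * math.pi)

# Distances from one central atom to its species-X neighbors.
dists = [1.0, 1.05, 2.0]
print(g_radial(dists, r=1.0, sigma=0.1))  # large: two neighbors near the first shell
print(g_radial(dists, r=2.0, sigma=0.1))  # smaller: one neighbor near the second shell
```

Evaluating this function on a grid of $r$ values (and analogously the angular features) produces the feature vector fed to the SVM for each atom.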
In disordered solids, it is hypothesized that flow defects, which are effective in scattering sound waves, are associated with localized particle rearrangements. However, it is extremely challenging to directly identify and correlate flow defects based on the local atomic structure alone. To this end, softness was used, owing to its unique ability to identify the structure-dynamics relation [5]. In this work [5], the authors studied two systems: a two-dimensional experimental granular pillar under compression and a Lennard-Jones (LJ) glass in two and three dimensions ($d = 2, 3$) above and below its glass transition. Figure 12.6 shows the two systems studied.
These systems were trained using the SVM based on the features $G_i^X(r, \sigma)$ and $\Psi_i^X(r, \xi, \lambda, \zeta)$, where the former represents the radial features, while the latter represents the orientational features. The label used to train the SVM was based on the probability of atomic motion computed using $D^2_{\min}$. Figure 12.7 shows the probability $P(D^2_{\min})$ that a particle with given structural features was identified as soft with respect to the observed value of $D^2_{\min}$. Overall, these results suggest that the SVM is able to identify the soft particles, which are more susceptible to plastic flow as represented by their increased mobility.
Once the stable and unstable particles were identified, the associated structural features, which were used as the input for the SVM, were investigated to analyze the correlation between the structure and mobility of the particles. Figure 12.8 shows the distribution of the radial and orientational features associated with soft and hard particles. Interestingly, it can be observed that the radial feature, represented by $G_A^B$, is unable to distinguish clearly between the hard and soft particles. However, the orientational feature, represented by $\Psi_B^{AB}$, is able to clearly differentiate the hard and soft particles based on their local environment. (Note that the notation for $G$ and
Fig. 12.6 Snapshot configurations of the two systems studied. Particles are colored gray to red according to their $D^2_{\min}$ value. Particles identified as soft by the SVM are outlined in black. a A snapshot of the pillar system. Compression occurs in the direction indicated. b A snapshot of the $d = 2$ sheared, thermal Lennard-Jones system. Reprinted with permission from [5]
$\Psi$ used in [5] is slightly different from that followed in the present book. For the sake of consistency with the figure, the notation as per [5] is used in this paragraph.)
In summary, the flow defects in disordered systems can be identified clearly using SVM-based ML techniques. Moreover, the structural features associated with the local environment of mobile and immobile particles can also be understood, thanks to the interpretable nature of SVMs. Similar approaches employing softness have been widely used for studying several other problems, including the structural relaxation of glasses and structure-dynamics correlations in self-organizing systems and supercooled liquids, to name a few.
12.5 Summary
In this chapter, the use of interpretable ML approaches to gain insights into the response of materials was discussed. Specifically, the use of SHAP to analyze black-box ML models can provide insights into the role of individual components in a material. This information can be used for tailoring the design of materials with targeted properties. Further, SHAP interaction values can be used to understand the interdependence of the input features for a given property. Finally, a novel approach to interpret the structure-dynamics relationship in materials through a metric, namely
Fig. 12.8 Distribution of structural features of stable and unstable particles. a Distribution of $G_A^B(i; r_{AB}^{peak})$ for soft (red, dark) and hard (blue, medium dark; green, light) particles; $r_{AB}^{peak}$ corresponds to the first peak of the partial pair distribution functions $g_{AB}$ or $g_{BA}$. b Distribution of $\Psi_B^{AB}(i; 2.07\sigma_{AA}, 1, 2)$, proportional to the density of neighbors with small bond angles near a particle $i$, for soft (red, dark) and hard (blue, medium dark; green, light) particles. The inset shows examples of configurations with corresponding radial and bond orientation properties, where dark (light) gray neighbors are of species $A$ ($B$). Reprinted with permission from [5]
“softness” was discussed. These interpretable approaches provide insights into the nature of the functions learned by the ML model and thus allow one to gain insights into the underlying physics associated with the material behavior.
References
1. S.S. Schoenholz, Combining machine learning and physics to understand glassy systems.
J. Phys. Conf. Ser. 1036, 012021 (2018). https://doi.org/10.1088/1742-6596/1036/1/012021.
[Online]
2. S.S. Schoenholz, E.D. Cubuk, E. Kaxiras, A.J. Liu, Relationship between local structure and
relaxation in out-of-equilibrium glassy systems. Proc. Natl. Acad. Sci. 114(2), 263–267 (2017)
3. E.D. Cubuk, S.S. Schoenholz, E. Kaxiras, A.J. Liu, Structural properties of defects in glassy
liquids. J. Phys. Chem. B 120(26), 6139–6146 (2016)
4. S.S. Schoenholz, E.D. Cubuk, D.M. Sussman, E. Kaxiras, A.J. Liu, A structural approach to
relaxation in glassy liquids. Nat. Phys. 12(5), 469–471 (2016)
5. E.D. Cubuk, S.S. Schoenholz, J.M. Rieser, B.D. Malone, J. Rottler, D.J. Durian, E. Kaxiras, A.J.
Liu, Identifying structural flow defects in disordered solids using machine-learning methods.
Phys. Rev. Lett. 114(10), 108001 (2015)
6. E.D. Cubuk, R. Ivancic, S.S. Schoenholz, D. Strickland, A. Basu, Z. Davidson, J. Fontaine, J.L.
Hor, Y.-R. Huang, Y. Jiang et al., Structure-property relationships from universal signatures of
plasticity in disordered solids. Science 358(6366), 1033–1037 (2017)
7. M.L. Falk, J.S. Langer, Dynamics of viscoplastic deformation in amorphous solids. Phys. Rev.
E 57(6), 7192–7205 (1998). https://doi.org/10.1103/PhysRevE.57.7192. [Online]. https://link.
aps.org/doi/10.1103/PhysRevE.57.7192
8. A. Smessaert, J. Rottler, Distribution of local relaxation events in an aging three-dimensional
glass: Spatiotemporal correlation and dynamical heterogeneity. Phys. Rev. E 88(2), 022314
(2013). https://doi.org/10.1103/PhysRevE.88.022314. [Online]. https://link.aps.org/doi/10.
1103/PhysRevE.88.022314
13.1 Introduction
Materials are made of atoms, which, in turn, are made of electrons, protons, and other subatomic particles. Atomic motion is at the origin of all the microscopic and macroscopic responses of materials. Materials' response is controlled by phenomena occurring at different length and time scales. For instance, atomic motion occurs on the scale of femtoseconds, dislocation motion occurs over a few picoseconds, fracture propagation occurs over nanoseconds, and creep or fatigue occurs over a period of days or years. Associated with each of these time and length scales, there are several simulation techniques designed to capture the physics at that particular scale while ignoring the details that are not relevant. Some of these techniques, along with the associated length scales, are shown in Fig. 13.1. The smallest length scale, quantum, is traditionally modeled using first-principles simulations or density functional theory. While these approaches can take into account the electronic motion and accurately predict the atomic structure, they are limited to a few tens of atoms and picoseconds due to the prohibitive computational cost.
Atomistic simulations can model larger systems of a few million atoms and up to a few nanoseconds of simulation time by ignoring the electronic structure details. The atomic interactions in these simulations are modeled by empirical interatomic potentials, which are fitted against first-principles simulations or experimental data. To understand phenomena occurring at larger length scales, coarse-grained simulations can be used. In these simulations, a cluster of atoms is combined to form a “bead”, and the effective interactions between the beads are fitted based on all-atom simulations.
Beyond coarse-grained simulations, continuum-based approaches such as phase field simulations, finite element simulations, or particle-based simulations, including smoothed particle hydrodynamics and peridynamics, are commonly used. For these simulations, the constitutive model of the material is given as the input, and the material response under different loading conditions is studied.
In all these simulations, the force-displacement or the stress-strain response, also known as the constitutive relationship, forms the basic input that governs the response of materials. These relationships are either assumed based on domain knowledge or fitted against experiments or first-principles simulations. ML presents itself as an ideal candidate to learn these interactions, which can then be used for materials simulations. Several physics-informed approaches have been developed for modeling materials at different length scales. In this chapter, we will discuss these approaches.
13.2 Machine Learning Interatomic Potentials for Atomistic Modeling
Molecular dynamics (MD) simulations model atoms as point particles that interact with each other based on a force field, also known as an interatomic potential. The interatomic potentials are fitted against first-principles simulations or experiments for certain properties of interest and are then validated against independent measurements. These interatomic potentials, like other potentials such as gravitational or electric ones, depend on the positions, types, and charges of the atoms. The iterative dynamics followed in MD simulations is given in Fig. 13.2. Based on the atomic configuration, the potential energy of the system can be obtained from the interatomic potential. Once the potential energy is obtained, the force on each atom can be computed as the negative derivative of the energy with respect to the positions $(F_i = -\nabla U(r_i))$. Once the forces are obtained, the accelerations are computed using Newton's laws. Finally, the updated velocities and positions of the atoms can be obtained using numerical integration algorithms such as Verlet or velocity Verlet. Thus, the accuracy of MD simulations is dependent on the reliability of the interatomic potentials used for the simulation. Similarly, the unavailability of interatomic potentials for a large number of elements and compounds limits the ability to simulate these systems using MD (Fig. 13.3).
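The iterative loop described above can be sketched for a 1D harmonic "bond" ($U = \tfrac{1}{2}kx^2$, so $F = -\nabla U = -kx$): compute the force from the potential, update the position and velocity with velocity Verlet, and check that the total energy stays essentially constant. The spring constant, mass, and timestep below are illustrative.

```python
def force(x, k=1.0):
    return -k * x  # F = -dU/dx for U = 0.5 * k * x^2

def velocity_verlet(x, v, dt, steps, m=1.0):
    """Integrate Newton's equations with the velocity Verlet algorithm."""
    f = force(x)
    for _ in range(steps):
        x += v * dt + 0.5 * (f / m) * dt * dt   # position update
        f_new = force(x)                        # force at the new position
        v += 0.5 * (f + f_new) / m * dt         # velocity update (averaged force)
        f = f_new
    return x, v

def total_energy(x, v, k=1.0, m=1.0):
    return 0.5 * k * x * x + 0.5 * m * v * v

x0, v0 = 1.0, 0.0
e0 = total_energy(x0, v0)
x1, v1 = velocity_verlet(x0, v0, dt=0.01, steps=10_000)
print(e0, total_energy(x1, v1))  # energies agree closely: the scheme conserves energy well
```

In a real MD code the `force` routine is exactly where the (machine-learned or empirical) interatomic potential enters; the integrator itself is agnostic to how the forces were obtained.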
Fig. 13.3 ML potentials as a potential solution to the trade-off between cost and accuracy of
conventional atomistic simulations. Potential future developments include hybrid machine learning/molecular mechanics (ML/MM) methods, more efficient representations to decrease simulation
times and more accurate training data (proposed by an active learning algorithm) to improve the
model accuracy beyond density functional theory. Potential future applications are shown in the
blue box (only approximately positioned according to their system size and accuracy requirements).
They include the simulation of enzymes and biomolecules such as ribosomes, the quantitative sim-
ulation of chemical reactions and reaction networks as well as the atomistic simulation of complex
reactive materials as found in, for example, Li-ion batteries. The inset on the top left shows the
energy landscape of a protein folding simulation, which is a prototypical example of a classical
force field calculation. No covalent bonds are formed or broken during the simulation. The inset
on the right shows an excited state dynamics simulation of a S1 to S0 transition, which requires
ab initio methods to compute the excited state properties. The ‘Coarse graining’, ‘Reactive force
fields’, and ‘QM/MM’ boxes are faded out as these methods are not discussed in depth. Figure
adapted with permission from: Marat Yusupov, Roland Beckmann and Anthony Schuller (biology
image); [1], American Chemical Society (chemistry image); [2], Elsevier (materials image); [3],
PNAS (left inset); [4], AIP (right inset). Reprinted with permission from [5]
Fig. 13.5 Fitting procedure for machine learning interatomic potentials. Reprinted with permission
from [8]
mass, and angle of the molecular axis relative to the surface normal. However, these
approaches were restricted to low-dimensional potential energy surfaces due to the
lack of a descriptor that could unambiguously, succinctly, and universally represent
the local environment of an atom. The first ML-based potential that overcame these
limitations was proposed in 2007 [7]. Following this, a variety of ML potentials, each
using either a different descriptor for the atomic environment or a different regressor
to learn the potential energy function (detailed below), have been developed. The
timeline of some of the major ML potentials is shown in Fig. 13.4.
The broad approach employed in the development of an ML-based potential is
shown in Fig. 13.5. These steps can be broadly outlined as follows.
1. Development of a training set. The first step in training an ML
potential is the development of the training set. The training set should contain at least
the following information: (i) the positions of all the atoms in a configuration,
which form the input, and (ii) the potential energy of the total system, which forms
the output against which the potential is trained. In addition to these, sometimes
226 13 Machine Learned Material Simulation
the velocities of the atoms may also be used as input parameters. Similarly, the force
per atom and the potential energy per atom are additional outputs against which
the potential may be trained, depending on the approach used to train the
ML potential. Note that all of these quantities can be obtained directly from density
functional theory or other first-principles simulation approaches.
2. Representation of the local structure. Once the training set is developed, the
next important aspect is the development of a representation for the local structure
around an atom. It should be noted that the contributions to the potential energy
of a system come mainly from the first neighbors, followed by the second and third
neighbors. Thus, it is reasonable to use a cutoff when computing the representation
of the local structure around an atom. This cutoff can be similar to, lower than,
or even higher than the cutoff of an empirical potential. However, even with a cutoff, the number of
atoms within the selected region interacting with the central atom can vary
from one configuration to another. Thus, a general fixed-size representation that
can be given as an input to an ML model is a requirement when using
classical ML models. This is because the number of input features in a classical
ML model such as a neural network, support vector machine, or tree-based approach
remains constant. An alternative to this approach is to use a graph neural network,
which can handle a varying number of neighbors. This approach is discussed in
detail in a later section. The representations that are commonly used are developed
based on radial and angular functions or order parameters. A detailed study
of the sensitivity (which represents the accuracy with which the descriptor is able
to capture a local environment uniquely and distinctly from a slightly different
one) and dimensionality (which represents the complexity and consequently the
computational cost of the resulting ML model) of these descriptors used to represent
the local atomic environment can be found in [8]. The list of descriptors
presented in this work is shown in Fig. 13.6. Based on this work [8], some of
the essential properties of descriptors and representations for encoding materials
and molecules are as follows.
Fig. 13.6 Classification of local atomic representations based on their method of construction (horizontal axis) and when they were first proposed (vertical axis). QSAR and SISSO do not indicate representations but instead indicate the representations that are constructed or selected using these methods (see the text). The superscripts a, b, c correspond to representations that are classified under multiple methods: direct and connectivity, histogram and mapping functions, and connectivity and mapping functions. Reprinted with permission from [9]
A library that generates several atomic descriptors directly from the atomic
structure, namely DScribe [9], has been developed and made publicly available
(see https://github.com/singroup/dscribe). This continuously growing package
has incorporated several of these descriptors, including the Coulomb matrix [10],
Sine matrix [11], Ewald sum matrix [11], Atom-centered Symmetry Functions
(ACSF) [12], Smooth Overlap of Atomic Positions (SOAP) [13], Many-body
Tensor Representation (MBTR) [14], and Local Many-body Tensor Representation
(LMBTR) [14].
3. Regressor for learning the potential energy function. The third aspect of developing
an ML potential involves the choice of regressor for training
the potential. Options include kernel ridge regression, neural networks, random forests
or decision trees, and Gaussian process regression. Each of these approaches
has its own pros and cons. Specifically, Gaussian process regression enables
the computation of the standard deviation associated with every prediction. This
allows one to assess the reliability of the predicted potential energy for
each new configuration. Further, identifying the regions having high
standard deviation can enable an active learning approach for training the interatomic
potential, thereby reducing the number of data points required to train
the ML potential. The most common regressor used in training ML
potentials is the feed-forward neural network with two or more hidden layers.
Among the potentials mentioned in Fig. 13.4, the SNAP, AGNI, and MTP potentials
employ linear regression or a regularized version of it. The GAP potential employs its
namesake, Gaussian process regression. All other potentials use neural networks, shallow
or deep.
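Steps 2 and 3 above can be illustrated together in a few dozen lines: a Behler-Parrinello-flavored radial symmetry-function descriptor gives every atom a fixed-size representation regardless of its neighbor count, and a linear regressor (the choice the text attributes to SNAP-like potentials) maps summed descriptors plus atom count to total energies. Everything here, the η values, cutoff, 1D chains, and target weights, is a hypothetical toy rather than data from the chapter.

```python
import math

def cutoff(r, rc=3.0):
    """Smooth cutoff function: decays to zero (with zero slope) at rc."""
    return 0.5 * (math.cos(math.pi * r / rc) + 1.0) if r < rc else 0.0

def descriptor(positions, i, etas=(0.5, 1.0, 2.0), rc=3.0):
    """Fixed-size radial symmetry-function descriptor of atom i:
    one Gaussian-weighted neighbor sum per eta, independent of how
    many neighbors lie inside the cutoff."""
    g = [0.0] * len(etas)
    for j, xj in enumerate(positions):
        if j != i:
            r = abs(xj - positions[i])       # 1D chain for simplicity
            for k, eta in enumerate(etas):
                g[k] += math.exp(-eta * r * r) * cutoff(r, rc)
    return g

def config_features(positions):
    """Summed atomic descriptors plus atom count: with a linear model,
    total energy = w . sum_i g_i + b * N."""
    phi = [0.0, 0.0, 0.0]
    for i in range(len(positions)):
        for k, gk in enumerate(descriptor(positions, i)):
            phi[k] += gk
    return phi + [float(len(positions))]

def lstsq(X, y):
    """Least squares via the normal equations and Gauss-Jordan elimination."""
    n = len(X[0])
    A = [[sum(row[i] * row[j] for row in X) for j in range(n)] for i in range(n)]
    b = [sum(row[i] * yi for row, yi in zip(X, y)) for i in range(n)]
    M = [A[i] + [b[i]] for i in range(n)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))   # partial pivot
        M[c], M[p] = M[p], M[c]
        for r in range(n):
            if r != c:
                f = M[r][c] / M[c][c]
                for k in range(c, n + 1):
                    M[r][k] -= f * M[c][k]
    return [M[i][n] / M[i][i] for i in range(n)]

# Hypothetical ground truth: energies exactly linear in the features,
# so the fit should recover the generating weights.
true_w = [1.0, -0.5, 0.2, 0.3]
configs = [[0.0, 1.0], [0.0, 1.5], [0.0, 1.0, 2.0],
           [0.0, 1.2, 2.4], [0.0, 0.8, 1.6, 2.4], [0.0, 1.1, 2.3]]
X = [config_features(c) for c in configs]
y = [sum(w * f for w, f in zip(true_w, feats)) for feats in X]
w_fit = lstsq(X, y)
```

With real DFT energies the same pipeline would yield a least-squares approximation instead of an exact recovery of the weights.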
As this is a very active area of research, several software packages are available for
the development of ML potentials. Many of these packages are linked to widely used
simulation packages such as LAMMPS and VASP. Some of these packages are
listed below.
• Aenet (Atomic Energy Network, see: http://ann.atomistic.net) and N2P2 (Neural
Network Potential Package, see: https://compphysvienna.github.io/n2p2/) are two
software packages for training NN potentials of the Behler-Parrinello type [7].
N2P2 also has an interface that allows running MD simulations with the LAMMPS
package. MAISE (Module for ab initio structure evolution [15], see:
http://maise.binghamton.edu/wiki/home.html) is another package that enables
automated generation of NN (Behler-Parrinello-type) potentials for global structure
optimization, wherein the DFT database is automatically generated employing
an evolutionary-algorithm-based sampling procedure.
• ASE (Atomistic Simulation Environment [16], see: https://wiki.fysik.dtu.dk/ase/)
offers a set of Python tools for pre-processing, running, and post-processing atomistic
simulations. It enables DFT database generation and ML potential testing through
an environment that is linked to MD simulation packages such as LAMMPS and
to DFT packages such as VASP and Quantum Espresso.
Fig. 13.7 Fitting procedure for machine learning interatomic potentials along with some of the
commonly used applications. Reprinted with permission from [18]
then used to compute the energy of an individual atom, which, when summed over all the
atoms, provides the total energy of the system. The mean squared error between the
predicted total energy and the energy obtained from the DFT simulations is used to
train the neural network. Note that standard regularization techniques can also be
included in the loss function for improved training of the potential while avoiding
overfitting. PINN potentials thus try to combine the best of both ML potentials
and traditional potentials. On the one hand, unlike in a traditional potential, the parameters
of a PINN potential are not constant; rather, they change dynamically
depending on the local environment of the atomic structure, and this dependence is
learned from first-principles simulation data. On the other hand, unlike a purely data-driven ML
potential, the functional form of a PINN potential is physically informed and has a high
inductive bias, leading to improved performance over purely data-driven ML potentials.
Thus, PINN potentials have been shown to provide superior performance over
pure ML-based potentials, especially in conditions far from equilibrium. Figure 13.7
shows some example properties calculated with the PINN potentials for Al (a-d) [19]
and Ta (e-h). These include the phonon dispersion curves, linear
thermal expansion with respect to temperature, the solid-liquid interface
tension computed by the capillary fluctuation method, and crack nucleation
and growth on a grain boundary. Predictions are compared with experimental
data wherever available. Further, MD simulations of surfaces in body-centered cubic
Ta on the (110) and (112) planes, as well as the Nye tensor plot of the core
structure and the Peierls barrier of the screw dislocation in Ta predicted by the PINN
potential (lines), are shown.
In addition, PINN potentials can be extremely useful in simulations involving interactions
at very short range, well below the equilibrium bond lengths, for example, in the
case of radiation damage or shock simulations. This is because the strongly repulsive
interactions at short range may not be learned properly by purely data-driven
ML potentials due to the sparsity of data in this region: such configurations are
extremely rare and may not be represented in a dataset of reasonable size, especially
one built from first-principles simulations. Despite the lack of data in this region,
PINN potentials can successfully capture the exponential repulsion at short
range, thanks to the physics-informed functional form of the potential (see Fig. 13.8).
Overall, the PINN potential holds the promise of tackling a large variety of problems
by combining DFT-level accuracy with MD-level computational efficiency, all the
while respecting the underlying physics of the problem. Thus, PINN potentials
can be reliably extrapolated to regions where the model has not been trained.
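The idea can be made concrete with a minimal sketch, under stated assumptions: the pair energy below combines a Morse-like bonding term with a Born-Mayer-style exponential repulsion. In an actual PINN potential the parameters would be environment-dependent outputs of a neural network rather than the constants used here; the point is that the physical functional form itself guarantees strong repulsion at short range even where training data is absent.

```python
import math

def pinn_pair_energy(r, D=1.0, a=1.5, r0=1.0, A=50.0, lam=0.2):
    """Hypothetical physically informed pair energy: a Morse-like bonding
    term plus a Born-Mayer screened repulsion A*exp(-r/lam). In a PINN
    potential the parameters (D, a, r0, ...) would be local-environment-
    dependent outputs of a neural network rather than constants."""
    morse = D * (1.0 - math.exp(-a * (r - r0))) ** 2 - D
    repulsion = A * math.exp(-r / lam)
    return morse + repulsion

# The functional form rises steeply as r -> 0 even though no short-range
# data informed the parameters, and is attractive near the bond length r0.
```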
13.3 Physics-Informed Neural Networks for Continuum Simulations
When simulating systems at the macroscopic scale, the details at the atomic scale and
microscale may not be relevant. Instead, a homogenized model representing the averaged
information from the lower length scales may be capable of simulating the
system with the expected accuracy. For instance, while simulating the deformation of a
steel truss structure (like the Eiffel Tower!), the microstructural details or atomic-level
dislocation motions may not be relevant as long as the Young's modulus and the yield
strain are maintained at the macroscopic level. Thus, continuum models ignore the
atomic and microscopic details and rely on fundamental laws and mathematical
models that are capable of capturing the essential details at the macroscopic level. It
should be noted that as we move toward smaller length scales, the continuum rules break down
and atomistic motion starts to become prominent and governing. Although there
are some rules of thumb on the length scales associated with each of these theories, there are
several reformulations, such as non-local theories and multiscale models, that
allow one to adapt the theories over a wider range of length scales and systems.
Modeling of a continuum system relies on a few fundamental relationships. These
include the balance of linear momentum:

ρ ẍ = ∇ · σ + b     (13.1)

where ρ represents the density, x represents the displacement (so that ẍ is the acceleration), σ represents the second-order
stress tensor, and b the body force. Further, it is assumed that the boundary
condition is fixed on both ends of the bar. There are several assumptions in this
model, many of which may not hold under different circumstances, leading to
an accumulation of error in the system.
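In the static, uniaxial 1D limit (ẍ = 0, σ = E du/dx with constant Young's modulus E), Eq. 13.1 reduces to E u″ + b = 0, which a few lines of finite differences can solve for a bar with fixed ends. All numerical values below are illustrative.

```python
def solve_bar(n=50, L=1.0, E=100.0, b=1.0):
    """Finite-difference solution of E*u'' + b = 0 on (0, L) with
    u(0) = u(L) = 0 (fixed ends): the static 1D limit of Eq. 13.1."""
    h = L / n
    m = n - 1                          # number of interior nodes
    # Discretization: (u[i-1] - 2u[i] + u[i+1]) / h^2 = -b/E
    rhs = [-b / E * h * h] * m
    diag = [-2.0] * m
    off = 1.0
    # Thomas algorithm: forward sweep, then back substitution.
    for i in range(1, m):
        w = off / diag[i - 1]
        diag[i] -= w * off
        rhs[i] -= w * rhs[i - 1]
    u = [0.0] * m
    u[-1] = rhs[-1] / diag[-1]
    for i in range(m - 2, -1, -1):
        u[i] = (rhs[i] - off * u[i + 1]) / diag[i]
    return [0.0] + u + [0.0]           # re-attach the fixed boundary values

u = solve_bar()
```

Since the exact solution u(x) = (b/2E) x(L − x) is quadratic, the central-difference scheme reproduces it to machine precision; the midpoint displacement equals bL²/(8E) = 0.00125 here.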
When a large amount of data is available, such approaches can be replaced by purely
data-driven approaches in which the constitutive relationship itself is learned directly
using ML. Note that this approach is similar to the development of interatomic
Fig. 13.9 Different types of biases for embedding physics in ML models. Reprinted from [23] with permission
potentials using ML, albeit for a continuum system. Further, instead of first-principles
simulations, the data needed for training the ML model has to be generated from a
large number of experimental observations. While such approaches have been used
to study atomistic models, they are not commonly used in traditional
continuum modeling. Instead, the governing laws are directly learned from the data
by applying symbolic regression [20, 21] or even by combining symbolic regression
with deep learning [22]. However, for most practical applications, the available data
will be neither very large nor very small. In such cases, a physics-informed neural
network (PINN) can significantly enhance the performance of the simulations with
limited assumptions about the model (Fig. 13.9).
PINNs involve biasing an ML model so that it learns the underlying physics of the
problem, leading to solutions that are physically consistent. These biasing modes can be broadly
classified into: (i) observational, (ii) inductive, and (iii) learning bias [23], as detailed
below.
1. Observational bias can be introduced by including a large amount of observa-
tional data. Sometimes, getting a large amount of data can be challenging and
expensive. In such cases, available data can be augmented while respecting the
physical structure of the data. For instance, a microstructural image of a crystal
can be augmented by applying operations such as rotation, reflection, or even
zooming in to selected areas and trimming the remaining regions. All of these
Fig. 13.10 A PINN with modified loss function for solving the viscous Burgers’ equation.
Reprinted from [23] with permission
augmented images will still correspond to the original image. Observational bias
is one of the most commonly applied modes used in training ML models.
2. Inductive bias refers to the use of a specific architecture of an ML model,
for example an NN, which implicitly embeds prior knowledge about the structure
of the data or the predictive task. Examples of such specialized architectures
include convolutional neural networks (designed for learning patterns
from images), graph neural networks (which incorporate the specific topology
of data such as molecules and structures in the form of a graph), and recurrent
neural networks (which can learn from data in the form of a series). Note
that these architectures can be further modified to respect additional physics-based
features such as symmetries and translations. In such cases, data augmentation
by symmetry operations becomes redundant, thereby reducing the observational bias
and increasing the inductive bias. Introducing inductive bias by converting a
molecular structure to a graph has been a very effective way to predict material
properties and discover new entities.
3. Learning bias is the third approach, in which, instead of modifying the architecture
of the ML model, constraints in the form of physical laws are imposed in
the loss function. Figure 13.10 shows a PINN algorithm for solving the viscous
Burgers' equation for fluid flow, given by

∂u/∂t + u ∂u/∂x = ν ∂²u/∂x²     (13.2)
where u(x, t) represents the speed of the fluid at the spatial and temporal coordinates
x and t, respectively, and ν represents the kinematic viscosity or the
diffusion coefficient. Equation 13.2 can be rewritten as

∂u/∂t + u ∂u/∂x − ν ∂²u/∂x² = 0     (13.3)
Thus, in a PINN, this additional term can be included as a physics loss in addition
to the data loss, and the model is then trained on a loss function
containing both the physics and data losses. The learned weights will thus be trained
to respect both. However, once trained, the model has no restrictions during
inference, as the physics-based constraint was employed only in the loss function
during training.
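The physics-loss idea can be sketched without any deep-learning library: evaluate the PDE residual of a candidate solution u(x, t) at collocation points and average its square. Finite differences stand in for the automatic differentiation a real PINN would use, and the two candidate functions below are hypothetical (u = x/(1 + t) happens to satisfy Burgers' equation exactly for any ν).

```python
def burgers_residual(u, x, t, nu=0.1, h=1e-4):
    """Residual u_t + u*u_x - nu*u_xx of the viscous Burgers' equation,
    estimated with central finite differences (a stand-in for the
    automatic differentiation a real PINN would use)."""
    u_t = (u(x, t + h) - u(x, t - h)) / (2 * h)
    u_x = (u(x + h, t) - u(x - h, t)) / (2 * h)
    u_xx = (u(x + h, t) - 2 * u(x, t) + u(x - h, t)) / (h * h)
    return u_t + u(x, t) * u_x - nu * u_xx

def physics_loss(u, pts, nu=0.1):
    """Mean squared PDE residual over collocation points; a PINN adds
    this term to the ordinary data (mean squared error) loss."""
    return sum(burgers_residual(u, x, t, nu) ** 2 for x, t in pts) / len(pts)

pts = [(0.1 * i, 0.1 * j) for i in range(1, 5) for j in range(1, 5)]
exact = lambda x, t: x / (1.0 + t)   # satisfies the PDE for any nu
wrong = lambda x, t: x * t           # violates the PDE
```

In training, this physics loss would be added to the data loss (the mean squared mismatch with measurements) and both would be minimized jointly.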
PINN algorithms have now been widely used in different domains to solve problems
in solid and fluid mechanics. However, it should be noted that
PINN approaches satisfy the governing laws or additional physics-based
biases only in a weak fashion. This is because the training of an ML model with a
physics loss will leave a finite loss irrespective of how well trained the model is. The
propagation of this finite loss during the inference phase will be a governing factor in
deciding the quality of a simulation carried out using PINN algorithms. An alternative
approach that addresses this issue is to enforce the physical laws, such
as conservation laws, in a strong fashion in the architecture itself. This approach is
discussed next.

13.4 Graph Neural Networks

A major limitation of MLPs used for learning the dynamics of a system is that the
MLPs are transductive in nature, that is, they work only for the systems they are
trained on. For instance, an MLP-based PINN trained on a 5-spring system (that
is, 5 balls connected by 5 springs) can be used to infer the dynamics of that
system only, and not of an arbitrary n-spring system. This significantly limits the application
of such approaches to simple systems, since for each system a training trajectory
needs to be generated and a model needs to be trained. Further, such approaches are
not suitable for training interatomic potentials from DFT trajectories or for use in
continuum simulations, as the trained model cannot be applied to any system other than
the one on which it was trained. It has been shown that the transductivity of MLPs can be
addressed by incorporating an additional inductive bias that accounts
for the topology of the system, using a graph neural network
(GNN). GNNs, once trained, have the capability to generalize to arbitrary system sizes.
Most earlier studies on GNNs for dynamical systems use a purely data-driven
approach, where the GNNs are used to learn the updated positions and velocities from
data on trajectories. To address these challenges, several physics-enforced GNNs
have been proposed, such as the Hamiltonian (HGNN) and Lagrangian (LGNN) graph
neural networks and graph neural ODEs. These physics-enforced GNN architectures
are discussed in detail later in this section. In addition, since GNNs are trained
at the node and edge level, they can potentially learn more efficiently from the same
number of data points in comparison to their fully connected counterparts. Further,
since the learning of the function happens at the node and edge level in a GNN, there
are no limitations on the system size on which the trained GNN can be used. Note that
graph-based architectures make GNNs directly amenable to molecular or atomic
systems with atoms and bonds, where the nodes represent the atoms and the edges
represent the bonds. Thus, GNNs are widely used for modeling atomic systems.
However, it should be noted that GNNs are not limited to such discrete systems;
GNNs have also been used to model continuum systems, rigid-body systems, and
articulated bodies.
This equation is essentially equivalent to Newton's second law of motion and is
applied in solving several problems, including classical molecular dynamics simulations.
Lagrangian Dynamics
The standard form of Lagrange's equation for a system with holonomic constraints
is given by

d/dt (∇_ẋ L) − ∇_x L = 0     (13.6)

where the Lagrangian is L(x, ẋ, t) = T(x, ẋ, t) − V(x, t), with T(x, ẋ, t) and V(x, t)
representing the total kinetic energy of the system and the potential function from
which the generalized forces can be derived, respectively. Accordingly, the dynamics of
the system can be represented using the Euler-Lagrange (EL) equations as
Note that the second-order differential equation obtained based on Lagrangian
mechanics is equivalent to the first-order differential equation obtained based on
D'Alembert's principle.
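For a single time-independent coordinate, the EL equation can be solved for the acceleration as ẍ = (∂L/∂x − ẋ ∂²L/∂x∂ẋ)/(∂²L/∂ẋ²), which is essentially what a Lagrangian neural network does with a learned L. The sketch below uses finite differences and a hypothetical harmonic-oscillator Lagrangian in place of autograd and a learned model.

```python
def acceleration(L, x, v, h=1e-4):
    """Acceleration from the Euler-Lagrange equation for one coordinate:
    xdd = (dL/dx - v * d2L/dxdv) / (d2L/dv2), all derivatives taken by
    central finite differences (a learned Lagrangian would use autograd)."""
    dL_dx = (L(x + h, v) - L(x - h, v)) / (2 * h)
    d2L_dv2 = (L(x, v + h) - 2 * L(x, v) + L(x, v - h)) / (h * h)
    d2L_dxdv = (L(x + h, v + h) - L(x + h, v - h)
                - L(x - h, v + h) + L(x - h, v - h)) / (4 * h * h)
    return (dL_dx - v * d2L_dxdv) / d2L_dv2

# Hypothetical Lagrangian of a harmonic oscillator: L = T - V.
m, k = 2.0, 8.0
L_ho = lambda x, v: 0.5 * m * v * v - 0.5 * k * x * x
# acceleration(L_ho, x, v) approximates the analytic result -(k/m) * x.
```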
Hamiltonian Dynamics
where p_x = ∇_ẋ L = Mẋ represents the momentum of the system in Cartesian coordinates
and H(x, p_x) = ẋᵀ p_x − L(x, ẋ) = T(ẋ) + V(x) represents the Hamiltonian
of the system. The equations can be simplified by defining Z = [x; p_x] and
J = [0, I; −I, 0]; then the Hamiltonian equations can be written as

∇_Z H + J Ż = 0, or Ż = J ∇_Z H     (13.10)

since J⁻¹ = −J. Note that these coupled first-order differential equations are
equivalent to the Lagrangian equation (Eq. 13.7).
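Equation 13.10 can be integrated directly once ∇_Z H is available; in HGNN that gradient comes from automatic differentiation of the learned Hamiltonian, while the sketch below substitutes a hypothetical harmonic-oscillator H and finite-difference gradients, advanced with a symplectic (semi-implicit) Euler step.

```python
def grad(f, z, i, h=1e-6):
    """Central finite-difference partial derivative of f with respect to z[i]."""
    zp, zm = list(z), list(z)
    zp[i] += h
    zm[i] -= h
    return (f(zp) - f(zm)) / (2 * h)

def hamiltonian_step(H, x, p, dt):
    """One semi-implicit (symplectic) Euler step of Zdot = J grad_Z H,
    i.e. xdot = dH/dp and pdot = -dH/dx, with numerical gradients
    standing in for the autograd of a learned Hamiltonian."""
    x = x + dt * grad(H, [x, p], 1)   # position update uses dH/dp
    p = p - dt * grad(H, [x, p], 0)   # momentum update uses dH/dx at new x
    return x, p

# Hypothetical learned Hamiltonian: harmonic oscillator (m = k = 1).
H = lambda z: 0.5 * z[1] ** 2 + 0.5 * z[0] ** 2

x, p = 1.0, 0.0
for _ in range(1000):
    x, p = hamiltonian_step(H, x, p, dt=0.01)
# After t = 10 the trajectory remains close to the unit-energy circle.
```

The physics is enforced strongly here: whatever H is plugged in, the update preserves the symplectic structure, which is the motivation for Hamiltonian-based GNNs.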
architecture from previous works is the presence of global and local features: local
features participate in message passing and contribute to quantities that depend on
the topology, while global features do not take part in message passing. In HGNN, the
position x and velocity ẋ are employed as global features for a node, while d and t_p are
used as local features.
An l-layer message-passing GNN, which takes as input an embedding of the node and
edge features created by MLPs, is used as the graph architecture. The local
features participate in message passing to create updated embeddings for both
the nodes and the edges. The final representations of the nodes and edges, z_i and z_ij,
respectively, are passed through MLPs to obtain the Hamiltonian of the system.
The Hamiltonian of the system is predicted as the sum of T and V in the HGNN.
Typically, the potential energy of a system exhibits a significant dependence on the
topology of its underlying structure. To effectively capture this information,
multiple layers of message passing among interacting particles (nodes) are employed
in HGNN. During the l-th layer of message passing, the node embeddings are iteratively
updated according to the following expression:

h_i^(l+1) = squareplus( MLP( h_i^l + Σ_{j∈N_i} W_V^l · (h_j^l || h_ij^l) ) )     (13.11)
Similar to W_V^l, W_E^l is a layer-specific learnable weight matrix specific to the edge
set. The message passing is performed over L layers, where L is a hyper-parameter.
The final node and edge representations in the L-th layer are denoted as z_i = h_i^L and
z_ij = h_ij^L, respectively.
The total potential energy of an n-body system is represented as V = Σ_i v_i + Σ_ij v_ij.
Here, v_i denotes the energy associated with the position of particle i, while
v_ij represents the energy arising from the interaction between particles i and j.
For instance, v_i corresponds to the potential energy of a bob in a double pendulum,
considering its position within a gravitational field. On the other hand, v_ij
signifies the energy associated with the expansion and contraction of a spring connecting
two particles. In the proposed framework, the prediction for v_i is given
by v_i = squareplus(MLP_{v_i}(h_i^0 || x_i)). Similarly, the prediction for the pair-wise
interaction energy v_ij is determined by v_ij = squareplus(MLP_{v_ij}(z_ij)).
Finally, the Hamiltonian H of the system obtained from HGNN is substituted into
Eq. 13.10 to obtain the accelerations and velocities of the particles. These values are
then used to evolve the trajectory of the system.
Fig. 13.12 Evaluation of HGNN on the pendulum, spring, binary LJ, and gravitational systems. a Predicted and b actual phase space (that is, x1-position vs. x2-velocity), predicted with respect to actual c kinetic energy, d potential energy, and e forces in the 1 (blue square) and 2 (red triangle) directions of the 5-pendulum system. f Predicted and g actual phase space (that is, 1-position, x1, vs 2-velocity, ẋ2), predicted with respect to actual h kinetic energy, i potential energy, and j forces in the 1 (blue square) and 2 (red triangle) directions of the 5-spring system. k Predicted and l actual positions (that is, x1- and x2-positions), predicted with respect to actual m kinetic energy, n pair-wise potential energy V_ij for the (0-0), (0-1), and (1-1) interactions, and o forces in the 1 (blue square), 2 (red triangle), and 3 (green circle) directions of the 75-particle LJ system. p Predicted and q actual positions (that is, x1- and x2-positions), predicted with respect to actual r kinetic energy, s potential energy, and t forces in the 1 (blue square) and 2 (red triangle) directions of the gravitational system
Figure 13.12 shows the results of HGNN on four systems, namely, the n-pendulum,
n-spring, gravitational, and binary Kob-Andersen Lennard-Jones systems.
HGNN learns the dynamics directly from the trajectory, in excellent agreement with
the ground-truth trajectory. Further, the forces and energies predicted by the HGNN
are also in excellent agreement with the ground truth. This suggests that an HGNN
trained purely on the trajectory can indeed learn the exact forces and energies of each
of the particles in the system without explicit training on them. Thus, HGNN can be
used to learn the interactions between the particles of a system directly from its trajectory.
This can be evaluated further by analyzing the functions learned by the MLPs corresponding
to the nodes and edges. Figure 13.13 shows the potential and kinetic
energy functions learned by the HGNN. The functions learned by the
HGNN exhibit an exact match with the actual functions. Further, symbolic regression
(SR) can be used to discover the functional forms based on the data points interpreted
from these functions. Table 13.1 shows the functions obtained from SR based on the data
points shown in Fig. 13.13. For most of the functions, the discovered functions exhibit
a close match with the original function. The best equation is determined based on
the score function, which balances complexity and loss. The equation with
the maximum score represents the one with optimal complexity and correspondingly
low loss values. Note that the equation obtained from SR also depends on the
hyperparameters and on the number of epochs for which the SR is run. By
increasing the number of epochs, better equations could be discovered. Altogether,
the physics-enforced GNN-based framework can be used to learn the dynamics directly from the
Fig. 13.13 Interpreting the learned functions in HGNN. a Potential energy of the pendulum system with respect to the x2-position of the bobs. b Kinetic energy of the particles with respect to the velocity for the pendulum bobs. c Potential energy with respect to the pair-wise particle distance for the spring system. d The pair-wise potential energy of the binary LJ system for the 0-0, 0-1, and 1-1 types of particles. The results from HGNN are shown with markers, while the original functions are shown as dotted lines
Table 13.1 Original equation and the best equation discovered by symbolic regression based on the score for different functions. The loss represents the mean squared error between the data points from HGNN and the predicted equations

Functions         Original equation                    Discovered equation                   Loss        Score
Kinetic energy    T_i = 0.5 m |ẋ_i|²                   T_i = 0.500 m |ẋ_i|²                  7.96×10⁻¹⁰  22.7
Harmonic spring   V_ij = 0.5 (r_ij − 1)²               V_ij = 0.499 (r_ij − 1.00)²           1.13×10⁻⁹   3.15
Binary LJ (0-0)   V_ij = 2.0/r_ij¹² − 2.0/r_ij⁶        V_ij = 1.90/r_ij¹² − 1.95/r_ij⁶       0.00159     2.62
Binary LJ (0-1)   V_ij = 0.275/r_ij¹² − 0.786/r_ij⁶    V_ij = 2.33/r_ij⁹ − 2.91/r_ij⁸        3.47×10⁻⁵   5.98
Binary LJ (1-1)   V_ij = 0.216/r_ij¹² − 0.464/r_ij⁶    V_ij = 0.215/r_ij¹² − 0.464/r_ij⁶     1.16×10⁻⁵   5.41
trajectory and further to interpret the learned dynamics and the abstract quantities
governing it. Finally, it can also be used to scale to large system sizes, thus making
the framework an ideal candidate for learning the dynamics from ab initio data, which
can then be used in classical molecular dynamics simulations.
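The symbolic-regression step can be mimicked for one candidate expression: propose the form V(r) = a r⁻¹² + b r⁻⁶ and fit its coefficients by least squares, scoring the result by its loss. The "data" below is generated from the (0-0) LJ function in Table 13.1 rather than from an actual HGNN, so the fit recovers a = 2, b = −2; a real SR run searches over many such candidate forms and trades loss off against complexity.

```python
def fit_lj(rs, vs):
    """Least-squares fit of the candidate form V(r) = a*r**-12 + b*r**-6,
    one expression a symbolic-regression search would propose. The form
    is linear in (a, b), so the 2x2 normal equations suffice."""
    f1 = [r ** -12 for r in rs]
    f2 = [r ** -6 for r in rs]
    s11 = sum(x * x for x in f1)
    s12 = sum(x * y for x, y in zip(f1, f2))
    s22 = sum(y * y for y in f2)
    t1 = sum(x * v for x, v in zip(f1, vs))
    t2 = sum(y * v for y, v in zip(f2, vs))
    det = s11 * s22 - s12 * s12
    a = (t1 * s22 - t2 * s12) / det          # Cramer's rule
    b = (s11 * t2 - s12 * t1) / det
    loss = sum((a * x + b * y - v) ** 2
               for x, y, v in zip(f1, f2, vs)) / len(rs)
    return a, b, loss

# Hypothetical interpreted data points from the (0-0) LJ pair function.
rs = [0.95 + 0.05 * i for i in range(10)]
vs = [2.0 * r ** -12 - 2.0 * r ** -6 for r in rs]
a, b, loss = fit_lj(rs, vs)
```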
13.5 Summary
approach bridges the gap between data-driven machine learning and physics-based
modeling, enabling the simulation of large-scale systems with improved accuracy
and reduced computational cost. Finally, physics-enforced graph neural networks
have shown great potential in capturing the underlying dynamics of materials systems
represented as graphs. By incorporating physical principles and constraints
into graph neural networks, these models can effectively learn the interactions while
ensuring that the governing laws are strictly satisfied. Thanks to their interpretability and
generalizability, they hold promise for learning quantum-accuracy simulations at much
higher length scales.
Collectively, these machine learning approaches for materials simulations offer
novel and powerful tools for advancing our understanding of materials behavior,
accelerating materials discovery, and enabling the design of materials with tailored
properties. By combining data-driven machine learning techniques with the underly-
ing physics and chemistry of materials, materials scientists can unlock new insights,
overcome computational challenges, and drive innovation in materials science and
engineering. Looking ahead, future research should focus on further refining and
expanding these machine learning techniques, exploring their applicability to new
materials systems, and addressing challenges such as interpretability, robustness,
and scalability. Additionally, interdisciplinary collaborations between materials sci-
entists, data scientists, and computational physicists will be crucial in pushing the
boundaries of machine learning for materials simulations and realizing its full poten-
tial in revolutionizing materials research and development.
References
1. G.N. Simm, M. Reiher, Error-controlled exploration of chemical reaction networks with Gaussian processes. J. Chem. Theory Comput. 14(10), 5238–5248 (2018)
2. S.J. An, J. Li, C. Daniel, D. Mohanty, S. Nagpure, D.L. Wood III., The state of understanding
of the lithium-ion-battery graphite solid electrolyte interphase (SEI) and its relationship to
formation cycling. Carbon 105, 52–76 (2016)
3. G. Reddy, Z. Liu, D. Thirumalai, Denaturant-dependent folding of GFP. Proc. Natl. Acad. Sci.
109(44), 17832–17838 (2012)
4. Y. Shu, B.G. Levine, Communication: non-radiative recombination via conical intersection at
a semiconductor defect. J. Chem. Phys. 139(8), 081102 (2013)
5. P. Friederich, F. Häse, J. Proppe, A. Aspuru-Guzik, Machine-learned potentials for next-
generation matter simulations. Nat. Mater. 20(6), 750–761 (2021)
6. T.B. Blank, S.D. Brown, A.W. Calhoun, D.J. Doren, Neural network models of potential energy
surfaces. J. Chem. Phys. 103(10), 4129–4137 (1995)
7. J. Behler, M. Parrinello, Generalized neural-network representation of high-dimensional
potential-energy surfaces. Phys. Rev. Lett. 98, 146401 (2007). https://doi.org/10.1103/
PhysRevLett.98.146401
8. B. Onat, C. Ortner, J.R. Kermode, Sensitivity and dimensionality of atomic environment rep-
resentations used for machine learning interatomic potentials. J. Chem. Phys. 153(14), 144106
(2020). https://doi.org/10.1063/5.0016005
9. L. Himanen, M.O. Jäger, E.V. Morooka, F. Federici Canova, Y.S. Ranawat, D.Z. Gao, P. Rinke,
A.S. Foster, Dscribe: library of descriptors for machine learning in materials science. Comput.
Chapter 14
Image-Based Predictions
Abstract This chapter explores the application of machine learning (ML) algo-
rithms to image-based data for a comprehensive understanding of materials. The
focus is on various aspects, including the investigation of structure-property rela-
tionships, prediction of ionic conductivity, accelerated property prediction through
the combination of finite element analysis and image-based modeling, and the use
of molecular dynamics and image-based modeling to predict crack propagation in
atomic systems. Additionally, the chapter discusses the use of neural operators to effi-
ciently learn stress and strain fields from limited ground truth data. The integration
of ML algorithms with image-based data has shown promising results in advancing
materials science, enabling deeper insights into material behavior and accelerating
property prediction. Future directions involve the development of more advanced
neural operator frameworks, integration with quantum mechanics, exploration of
complex material systems, and incorporation of experimental techniques. Overall,
the application of ML algorithms to image-based data offers exciting opportuni-
ties for materials design and optimization, paving the way for the discovery of novel
materials with tailored properties and improved performance in various applications.
14.1 Introduction
Here, we will discuss several ML techniques used to analyse images for
understanding and predicting material structure, properties, and responses. Although
the various steps involved in this process are similar to those in traditional ML-based
property prediction, as outlined in Chap. 10, the algorithms used in this case vary
significantly. Specifically, images represent unstructured data with a large number
of dimensions to process. For instance, a 64 × 64 pixel image has 4096 pixels, each
representing a unique input feature. This is extremely large for a traditional MLP
to handle. Further, the order in which the pixels are organized does convey
meaningful structural information. To address this challenge, convolutional neural
networks (CNNs) are widely used, which significantly reduce the input dimension
through operations such as convolution and pooling (Fig. 14.2).
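To make the dimensionality reduction concrete, here is a minimal NumPy sketch (not from the book; the kernel and layer sizes are illustrative assumptions) of how a single convolution followed by 2 × 2 max pooling shrinks a 64 × 64 input:

```python
import numpy as np

def conv2d(image, kernel):
    """Valid 2-D convolution (no padding), stride 1."""
    kh, kw = kernel.shape
    h, w = image.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def max_pool(image, size=2):
    """Non-overlapping max pooling."""
    h, w = image.shape
    h, w = h - h % size, w - w % size
    return image[:h, :w].reshape(h // size, size, w // size, size).max(axis=(1, 3))

rng = np.random.default_rng(0)
img = rng.random((64, 64))               # a 64 x 64 "micrograph" (4096 inputs)
feat = max_pool(conv2d(img, rng.random((3, 3))))
print(img.size, feat.shape, feat.size)   # 4096 inputs reduced to a (31, 31) map
```

A real CNN stacks several such convolution/pooling layers with learned kernels and nonlinearities, but the dimension-shrinking mechanism is the same.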
The broad approach used in materials for image-based property prediction is
outlined in Fig. 14.1. This involves preprocessing the images to make the dataset
consistent, for instance, converting them to greyscale, resizing them to the same pixel
dimensions, etc. Following this, the architecture of the CNN is finalized and the model
is trained using the training set. Hyperparameter optimization may then be carried out
using the validation set or by k-fold cross-validation. Finally, the performance of
the model is evaluated using the test set. In this chapter, we will discuss several ML
algorithms that exploit image-based data for an improved understanding of materials.
These include understanding structure–property linkages, predicting ionic
conductivity, combining finite element analysis with image-based modeling
for accelerated property prediction, and combining molecular dynamics with image-based
modeling for predicting crack propagation in atomic systems. Finally, we also
discuss neural operators that learn the stress and strain fields efficiently from sparse
ground truth data.
Fig. 14.1 Computer vision based approach for predicting material properties. Reprinted with
permission from [1]
Fig. 14.2 Applications of computer vision toward various downstream tasks in materials domain
14.2 Structure–Property Prediction Using CNN
where n and p denote the two states between which the two-point correlation is
computed, J[·] denotes the Fourier transformation with respect to position r, and
[·]* denotes the complex conjugate. Once the two-point correlation functions are obtained,
dimensionality reduction techniques such as PCA can be employed to reduce the
size of the input dimension. Specifically, the top n principal components can be
selected, based on their variance, and used as input features. These features can then
be used in OLS or other regression techniques to predict the target property y. These
techniques suffer from the curse of dimensionality; reducing the dimensionality using
PCA or other approaches results in a loss of information. Moreover, the feature
engineering in this approach is still manual and depends on the skill of the domain expert.
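As a sketch of the pipeline just described, the following NumPy code computes the FFT-based two-point autocorrelation of synthetic binary microstructures and projects the resulting high-dimensional statistics onto a few principal components via SVD. The random data and component count are illustrative assumptions:

```python
import numpy as np

def two_point_correlation(m):
    """Two-point autocorrelation of a binary microstructure via FFT
    (periodic boundaries assumed)."""
    F = np.fft.fftn(m)
    return np.real(np.fft.ifftn(F * np.conj(F))) / m.size

def pca_features(X, n_components=3):
    """Project the rows of X onto their top principal components via SVD."""
    Xc = X - X.mean(axis=0)
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:n_components].T

rng = np.random.default_rng(1)
micrographs = rng.random((10, 32, 32)) < 0.4     # 10 synthetic binary microstructures
S2 = np.array([two_point_correlation(m).ravel() for m in micrographs])
Z = pca_features(S2, n_components=3)             # 1024-D statistics -> 3 PCA inputs
print(Z.shape)  # (10, 3)
```

A useful sanity check: the autocorrelation at zero lag equals the volume fraction of the phase, which is exactly what the FFT formulation should reproduce.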
These issues can be addressed using the CNN approach as illustrated in Fig. 14.3c.
In this case, no a priori feature engineering is required—representative features are
extracted from the raw images by the CNN while passing through the convolutional
layers. These features are then used to predict the properties through a fully connected
layer. In other words, CNN predicts the property y directly from the raw images
in an end-to-end fashion without any intermediate intervention in terms of feature
engineering. However, it should be noted that the features learnt by the CNN may
not be easily interpretable by humans. To this end, post hoc approaches such as
integrated gradients, gradient SHAP, or other interpretability algorithms may be used
[2, 6–8].
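As an illustration of one such post hoc method, the snippet below sketches integrated gradients for a toy differentiable model with a known analytic gradient. The model, input, and baseline are invented for illustration; in practice the same recipe attributes a trained CNN's prediction to input pixels:

```python
import numpy as np

def integrated_gradients(f, grad_f, x, baseline, steps=200):
    """Approximate integrated gradients: (x - baseline) times the average
    gradient along the straight path from baseline to x (midpoint rule)."""
    alphas = (np.arange(steps) + 0.5) / steps
    grads = np.array([grad_f(baseline + a * (x - baseline)) for a in alphas])
    return (x - baseline) * grads.mean(axis=0)

# toy differentiable "model": f(x) = sum(w * x^2), with analytic gradient
w = np.array([1.0, 2.0, 3.0])
f = lambda x: np.sum(w * x ** 2)
grad_f = lambda x: 2 * w * x

x, x0 = np.array([1.0, -1.0, 0.5]), np.zeros(3)
attr = integrated_gradients(f, grad_f, x, x0)
# completeness axiom: attributions sum to f(x) - f(baseline)
print(np.isclose(attr.sum(), f(x) - f(x0), atol=1e-3))  # True
```

The completeness check is the defining property of integrated gradients [6]: the per-feature attributions must account exactly for the change in model output relative to the baseline.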
Exploiting CNNs to predict properties directly from images can be extremely
useful for materials researchers. A CNN properly trained for a property based on
microstructure can then predict that property for new microstructures directly from
the image. Kondo et al. [2] employed this approach to predict the ionic conductivity
of yttria-stabilized zirconia (YSZ) based on its microstructure. The first column of
Fig. 14.4 shows the microstructures of different YSZ samples obtained by varying
the sintering temperature (1400 °C, 1440 °C, and 1480 °C) and sintering time
(1, 5, 10, and 30 hours). The y value represents the ionic conductivity of these samples
in mS/cm. By training with these images as inputs (after applying image augmentation
techniques such as rotation, cropping, and flipping), the CNN was able to predict
the ionic conductivity on an unseen dataset reasonably well. It is worth noting that only
seven original images were used to train the CNN, which consists of 70,000 parameters
that are learned during the training.
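A minimal sketch of the kind of augmentation mentioned above, using rotations and flips. The eight variants are the standard dihedral symmetries of a square image, an assumption for illustration rather than the exact augmentation used in [2]:

```python
import numpy as np

def augment(image):
    """Generate flipped and rotated variants of one micrograph, a simple way
    to stretch a very small training set."""
    variants = []
    for k in range(4):                      # 0, 90, 180, 270 degree rotations
        r = np.rot90(image, k)
        variants.extend([r, np.fliplr(r)])
    return variants

img = np.arange(16.0).reshape(4, 4)
augs = augment(img)
print(len(augs))  # 8 variants per original image
```

Combined with random cropping, such symmetry operations multiply the effective dataset size, which is how a 70,000-parameter model can be trained from only seven original images.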
In order to further understand the features learnt by the CNN, a feature visualization
method similar to CAM [9] and Grad-CAM [10] was employed. Specifically,
this approach identifies the feature maps, which are then used to define masking maps.
The masking maps hide irrelevant features that have little or no role in governing
the output property, the ionic conductivity. Figure 14.4 shows the regions
governing low and high ionic conductivity as learnt by the CNN. Specifically, the
blue and red masks represent the regions ignored by the CNN when predicting low
and high ionic conductivity, respectively. Thus, low ionic conductivity YSZ samples are
characterized by increased voids, while high ionic conductivity YSZ samples contain fewer
crystal defects. This observation is consistent with the experimental results, wherein the
ionic conductivity decreases with decreasing sintered density [11]. Altogether, CNNs can
be used to capture structure–property relationships in a reasonable fashion from very few
data points.
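The masking idea can be sketched as follows. The feature maps and weights below are random stand-ins, and the 0.5 threshold is an arbitrary assumption, but the weighted-sum-then-threshold structure mirrors the CAM-style visualization described above:

```python
import numpy as np

def activation_map(feature_maps, weights):
    """Class-activation-style map: weighted sum of the last convolutional
    layer's feature maps (the CAM/Grad-CAM idea), normalized to [0, 1]."""
    cam = np.tensordot(weights, feature_maps, axes=1)   # -> (H, W)
    cam -= cam.min()
    return cam / (cam.max() + 1e-12)

def masking_map(cam, threshold=0.5):
    """Mask marking regions the model deems irrelevant to the prediction."""
    return cam < threshold

rng = np.random.default_rng(3)
fmaps = rng.random((8, 16, 16))     # 8 feature maps from the last conv layer
w = rng.random(8)                   # output-layer weights for the property
mask = masking_map(activation_map(fmaps, w))
print(mask.shape, mask.dtype)       # (16, 16) bool
```

In a real CAM computation the weights come from the trained output layer (or, for Grad-CAM, from averaged gradients), so the highlighted regions reflect what the network actually used.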
Fig. 14.4 The first columns are the input images, and the second and third columns are the masked
maps for low and high ionic conductivity, respectively. The blue and red regions are locations that
are ignored by the CNN for low and high ionic conductivity, respectively. Reprinted with permission
from [2]
Fig. 14.5 Visualizations of three microstructures that show clearly contrasting architectural features
(top row), and their spatial statistics (bottom row). The overall trends in the structures are reflected
in the isosurface contours of 2-point statistics. Reprinted with permission from [1]
For the combined model, a polynomial fit using a concatenated feature set including the
CNN features and the 2-point correlation features was used (a total of 256 + 132,651 =
132,907 features). This unified model exhibited significantly improved performance,
with an error reduction of almost 50% in comparison to the 3-D CNN approach. It is
interesting to note that
both the 2-point statistic and CNN had unique information which complemented
each other in the combined model to yield an improved performance in comparison
to each of the individual models. Overall, the work demonstrated that 3-D CNN can
be effectively used to predict the elastic properties of composite materials based on
the 3-D microstructure, which can then be used to extrapolate for new microstructures
[1].
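The combined-model idea reduces to concatenating the two feature sets and fitting a single regression. The sketch below uses synthetic features and ordinary least squares in place of the polynomial fit, with dimensions scaled down from the 132,907 features of the original study:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 400
cnn_feat = rng.random((n, 256))          # stand-in for 3-D CNN features
stat_feat = rng.random((n, 40))          # stand-in for 2-point statistic features
X = np.hstack([cnn_feat, stat_feat])     # concatenated (unified) feature set
y = X @ rng.random(X.shape[1]) + 0.01 * rng.standard_normal(n)  # synthetic property

X1 = np.hstack([X, np.ones((n, 1))])     # add an intercept column
coef, *_ = np.linalg.lstsq(X1, y, rcond=None)
pred = X1 @ coef
print(X.shape)  # (400, 296)
```

The point of the concatenation is that each feature set carries information the other lacks, so the joint fit can outperform either set alone.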
Several other works have followed similar approaches, either with minor
variations or for different kinds of materials such as steel, ballistic composites, and
porous materials, and have demonstrated that image-based approaches can be used
effectively to predict composite properties [12–17]. The approach has been further extended
to predict plastic properties [18] as well as the entire stress versus strain curve of
composite materials [19, 20]. Note that the CNN model can also potentially be
extended to develop tailored microstructures with targeted elastic properties through
topology optimization [21], which can be realized through a 3D printing
or additive manufacturing approach [14, 22].
Fig. 14.6 Materials discovery flowchart. Reprinted with permission from [23]
Fig. 14.7 Convolutional neural network for predicting the modulus of shale. Reprinted with
permission from [23]
14.4 Combining Molecular Dynamics and CNN for Crack Prediction
The understanding of fracture, a critical process in assessing the integrity and sustainability
of engineering materials, can be greatly enhanced through advanced machine
learning techniques. Molecular simulation is a powerful tool that allows atomic-level
information to be captured during crack propagation. Traditional approaches to studying
brittle fracture in solids have relied on continuum mechanics modeling methods, such
as the extended finite element method (XFEM), phase field modeling, and cohesive
zone modeling (CZM), among others. These methods aim to estimate fragmentation
patterns and fracture dynamics. However, the dynamic propagation of cracks in brit-
tle materials involves atomistic bond breaking, which necessitates in-depth analysis
using atomistic-level modeling. Unfortunately, incorporating atomistic details into
continuum mechanics models is challenging due to the assumption of a continuum
and the lack of explicit information about chemical bond behavior. While atomistic
models offer sophistication and predictability, they are computationally expensive
and not conducive to rapid material performance predictions. This limitation ham-
pers their effective use in material optimization, particularly when the atomic scale
serves as the fundamental design parameter (Figs. 14.8 and 14.9).
Combining molecular simulation with a physics-based data-driven multiscale
model can be a powerful tool to predict fracture processes [24] in a computation-
ally efficient fashion. By employing atomistic modeling and an innovative image-
processing technique, the researchers compiled a comprehensive training dataset
comprising fracture patterns and toughness values for various crystal orientations.
The predictive power of the machine-learning model was extensively evaluated,
demonstrating excellent agreement not only in computed fracture patterns but also
in fracture toughness values, under both mode I and mode II loading conditions.
Additionally, the model's capability to predict fracture patterns was examined
in bicrystalline materials and materials with gradients of microstructural crystal ori-
entation, further confirming its outstanding predictive performance. These results
Fig. 14.8 Combining molecular dynamics with LSTM-CNN for predicting crack propagation in
materials. Reprinted with permission from [24]
Fig. 14.9 Several unseen test cases to evaluate the ML model. Here the crack images of overall
fracture were substituted for the dataset for machine learning. a Prediction of overall fracture
of small-difference bicrystal material. b Prediction of overall fracture of big-difference bicrystal
material. c Prediction of overall fracture of gradient crystal material (such a system could be subject
to optimization, e.g., to maximize fracture toughness or crack path tortuosity). Reprinted with
permission from [24]
highlight the significant potential of the developed model, offering promising appli-
cations in material design and development.
It is worth noting that the data generated from MD simulations are in the form
of atomic positions, and hence converting these into continuous images requires
additional data processing. The work presented several approaches for data representation
through a processing method for analyzing MD simulation results. Here, the
discrete atomic information was embedded into image-based data structures. Some
of the advantages of the approach are outlined here. Firstly, a dataset with labels
in matrix form could be automatically constructed from MD simulation results,
reducing the manual efforts required in common supervised learning approaches.
Secondly, the dataset could intuitively incorporate information on the temporal and
spatial behavior of cracking for ML models. Moreover, the approach had the poten-
tial to be well integrated with other simulation methods, such as particle methods,
phase field modeling, CZM, or XFEM, given sufficient geometric information. This
opens up several avenues for future research, integrating multiparadigm modeling
into the neural network framework proposed in this study. Moreover, the training set
could even incorporate both experimental and simulation-based data, enabling the
development of predictive models from a rich and diverse set of raw data.
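One simple way to embed discrete atomic information into an image-based data structure, as described above, is to bin atom coordinates onto a pixel grid. The 2-D box and resolution below are illustrative assumptions, not the exact representation used in [24]:

```python
import numpy as np

def atoms_to_image(positions, box=(1.0, 1.0), pixels=32):
    """Embed discrete atomic positions from an MD snapshot into a 2-D
    density image, one simple route to image-shaped training data."""
    img, _, _ = np.histogram2d(
        positions[:, 0], positions[:, 1],
        bins=pixels, range=[[0, box[0]], [0, box[1]]])
    return img

rng = np.random.default_rng(6)
atoms = rng.random((500, 2))            # 500 atoms in a unit box
img = atoms_to_image(atoms)
print(img.shape, int(img.sum()))        # (32, 32) image preserving all 500 atoms
```

Applied frame by frame, such a mapping automatically produces labeled, matrix-form datasets that carry the temporal and spatial behavior of the crack.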
The primary objective of this study was to not only adopt a scalable machine-
learning model to bypass complex, computationally intensive simulations but also
to predict dynamic fracture paths for different crystalline structures and boundary
conditions. The model’s capability to predict crack patterns under different loading
conditions was demonstrated, offering a general framework for diverse fracture sce-
narios. The results exhibited a good agreement with the trend of crack length in the
distribution of crystalline orientations, indicating the potential for exploring more
complex systems. Future studies could improve the model’s performance by incor-
porating additional data from MD simulation results with complicated geometric
conditions. This approach not only aligns with general machine learning methods
but also confirms the creation of a generalizable and feasible way to represent data
from MD modeling for AI applications.
It is worth noting that the predictive method solely relied on the geometry and
position of the initial crack to make predictions, providing a highly efficient pro-
cess for modeling this complex physical phenomenon and introducing a new mate-
rial design approach at the nanoscale. The machine-learning algorithm for fracture
mechanics may offer new opportunities for designing engineering materials, such
as high-performance composites, and understanding how these materials respond
to various crack propagation scenarios. Some of the results in the bicrystal cases
exhibited deviations between the MD results and the machine-learning model. This
discrepancy can likely be attributed to the model not learning the behavior from
MD simulations of bicrystals. Nonetheless, these cases were included to explore the
model’s ability to make adequate predictions for variations in angles or bicrystal
interfaces, and the results were promising, demonstrating the predictive power of
the method to extrapolate beyond the cases included in the training set. Future work
could build upon this by implementing an autonomous retraining (or transfer learn-
ing) approach to expand the training set, if necessary, within a multiscale modeling
setup.
Based on the findings of this study, it may be concluded that the AI-based approach
for predicting fracture patterns and toughness opens the door to generative
methods, enabling the reverse engineering of mechanical properties. In future work,
the reported method could be further extended by incorporating adversarial training.
14.5 Fourier Neural Operator for Stress-Strain Prediction
Several studies have focused on predicting the stresses and strains of materials
directly from their microstructure. In such cases, CNNs become the natural
architectural choice due to their ability to capture local and global patterns while preserving
translation invariance. Indeed, other approaches based on RNNs and generative
models such as cGANs have also been applied to predict the stress-strain evolution
in materials. However, these approaches have several limitations when applied to the
problem of learning stress and strain fields. The major limitation of these models is
their inability to generalize, thereby failing to make predictions for input settings
unseen by the model, including different loading conditions, boundary conditions,
or different microstructures, to name a few. Additionally, such pixel-to-pixel
learning-based methods are incapable of resolving higher resolution inputs unseen
during model training. These challenges can be addressed by employing an
operator-based learning approach.
Fourier Neural Operators (FNOs) are a class of neural network architectures that
combine the mathematical framework of Fourier analysis with neural networks to
address operator learning tasks. In an FNO, the goal is to learn operators that map
inputs to outputs in a data-driven manner, leveraging the expressive power of neural
networks and the efficiency of Fourier analysis. The architecture of an FNO is shown
in Fig. 14.10. Let us consider an operator L defined on a domain Ω that maps an input
function u to an output function v. The encoder first transforms the input into the
Fourier domain:

û(ξ) = F[u](ξ)    (14.2)

Fig. 14.10 Architecture of the Fourier neural operator. Reprinted with permission from [25]
where û(ξ) represents the Fourier coefficients of u at frequency ξ and F[·] denotes
the Fourier transform. The decoder, on the other hand, takes the Fourier coefficients
û(ξ) and maps them to the output function v. This mapping is performed using a
neural network that learns the underlying operator L. Mathematically, the decoder
can be represented as:

v = G[û](ξ)    (14.3)
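A single Fourier layer of this encoder-decoder idea can be sketched in 1-D with NumPy: transform, apply learned weights to a truncated set of low frequencies, and transform back. The mode count and random weights are illustrative assumptions; a real FNO stacks several such layers with pointwise nonlinearities and lifting/projection networks:

```python
import numpy as np

def spectral_layer(u, weights, modes=8):
    """One Fourier layer: FFT -> multiply the lowest `modes` frequencies by
    learned complex weights -> inverse FFT (1-D sketch of an FNO layer)."""
    u_hat = np.fft.rfft(u)                      # encoder: Fourier coefficients
    out_hat = np.zeros_like(u_hat)
    out_hat[:modes] = weights * u_hat[:modes]   # learned multiplication in frequency space
    return np.fft.irfft(out_hat, n=len(u))      # decoder: back to real space

rng = np.random.default_rng(5)
x = np.linspace(0, 2 * np.pi, 128, endpoint=False)
u = np.sin(3 * x) + 0.5 * np.cos(5 * x)
W = rng.standard_normal(8) + 1j * rng.standard_normal(8)
v = spectral_layer(u, W)
print(v.shape)  # (128,)
```

Because the learned weights act on frequencies rather than pixels, the same layer can be evaluated on a finer grid without retraining, which is what lets an FNO handle resolutions unseen during training.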
Fig. 14.11 Employing FNO to predict the stress and strain fields. Reprinted with permission
from [25]
14.6 Summary
Fig. 14.12 The model trained on chessboard geometry of soft and stiff units is tested against
arbitrary non-chequered geometries with varying fractions of soft/stiff units. Direct comparison
of strain in y direction predicted by ML model versus FEM shown for three typical examples.
Reprinted with permission from [25]
References
1. A. Cecen, H. Dai, Y.C. Yabansu, S.R. Kalidindi, L. Song, Material structure-property linkages
using three-dimensional convolutional neural networks. Acta Mater. 146, 76–84 (2018)
2. R. Kondo, S. Yamakawa, Y. Masuoka, S. Tajima, R. Asahi, Microstructure recognition using
convolutional neural networks for prediction of ionic conductivity in ceramics. Acta Mater.
141, 29–38 (2017)
3. Y. Jiao, F. Stillinger, S. Torquato, Modeling heterogeneous materials via two-point correlation
functions. II. Algorithmic details and applications. Phys. Rev. E 77(3), 031135 (2008)
4. Y. Jiao, F. Stillinger, S. Torquato, Modeling heterogeneous materials via two-point correlation
functions: basic principles. Phys. Rev. E 76(3), 031110 (2007)
5. Y. Jiao, F. Stillinger, S. Torquato, A superior descriptor of random textures and its predictive
capacity. Proc. Natl. Acad. Sci. 106(42), 17634–17639 (2009)
6. M. Sundararajan, A. Taly, Q. Yan, Axiomatic attribution for deep networks, in International
Conference on Machine Learning (PMLR, 2017), pp. 3319–3328
7. G. Erion, J.D. Janizek, P. Sturmfels, S.M. Lundberg, S.-I. Lee, Improving performance of deep
learning models with axiomatic attribution priors and expected gradients. Nat. Mach. Intell.
1–12 (2021)
8. S.M. Lundberg, S.-I. Lee, A unified approach to interpreting model predictions, in Proceedings
of the 31st International Conference on Neural Information Processing Systems (2017), pp.
4768–4777
9. B. Zhou, A. Khosla, A. Lapedriza, A. Oliva, A. Torralba, Learning deep features for discrim-
inative localization, in Proceedings of the IEEE Conference on Computer Vision and Pattern
Recognition (2016), pp. 2921–2929
10. R.R. Selvaraju, M. Cogswell, A. Das, R. Vedantam, D. Parikh, D. Batra, Grad-CAM: visual
explanations from deep networks via gradient-based localization, in Proceedings of the IEEE
International Conference on Computer Vision (2017), pp. 618–626
11. X. Chen, K. Khor, S. Chan, L. Yu, Influence of microstructure on the ionic conductivity of
yttria-stabilized zirconia electrolyte. Mater. Sci. Eng. A 335(1–2), 246–252 (2002)
12. X. Lei, X. Wu, Z. Zhang, K. Xiao, Y. Wang, C. Huang, A machine learning model for predicting
the ballistic impact resistance of unidirectional fiber-reinforced composite plate. Sci. Rep.
11(1) (2021). https://doi.org/10.1038/s41598-021-85963-3
13. G.X. Gu, C.-T. Chen, M.J. Buehler, De novo composite design based on machine learning
algorithm. Extrem. Mech. Lett. 18, 19–28 (2018)
14. G.X. Gu, C.-T. Chen, D.J. Richmond, M.J. Buehler, Bioinspired hierarchical composite design
using machine learning: simulation, additive manufacturing, and experiment. Mater. Horiz.
5(5), 939–945 (2018)
15. C.-T. Chen, G.X. Gu, Machine learning for composite materials. MRS Commun. 9(2), 556–566
(2019)
16. J. Zhang, Y. Li, T. Zhao, Q. Zhang, L. Zuo, K. Zhang, Machine-learning based design of digital
materials for elastic wave control. Extrem. Mech. Lett. 48, 101372 (2021). ISSN: 2352-4316.
https://doi.org/10.1016/j.eml.2021.101372
17. O. Keles, Y. He, B. Sirkeci-Mergen, Prediction of elastic stresses in porous materials using
fully convolutional networks. Scr. Mater. 197, (2021). https://doi.org/10.1016/j.scriptamat.
2021.113805
18. D. Abueidda, S. Koric, N. Sobh, H. Sehitoglu, Deep learning for plasticity and thermo-
viscoplasticity. Int. J. Plast. 136 (2021). https://doi.org/10.1016/j.ijplas.2020.102852
19. C. Yang, Y. Kim, S. Ryu, G.X. Gu, Prediction of composite microstructure stress-strain curves
using convolutional neural networks. Mater. Des. 189, 108509 (2020)
20. A. Yamanaka, R. Kamijyo, K. Koenuma, I. Watanabe, T. Kuwabara, Deep neural network
approach to estimate biaxial stress-strain curves of sheet metals. Mater. Des. 195, 108970
(2020). ISSN: 0264-1275. https://doi.org/10.1016/j.matdes.2020.108970
21. H.T. Kollmann, D.W. Abueidda, S. Koric, E. Guleryuz, N.A. Sobh, Deep learning for topology
optimization of 2D metamaterials. Mater. Des. 196, 109098 (2020)
22. Z. Jin, Z. Zhang, K. Demir, G.X. Gu, Machine learning for advanced additive manufacturing.
Matter 3(5), 1541–1556 (2020)
23. X. Li, Z. Liu, S. Cui, C. Luo, C. Li, Z. Zhuang, Predicting the effective mechanical property of
heterogeneous materials by image based modeling and deep learning. Comput. Methods Appl.
Mech. Eng. 347, 735–753 (2019)
24. Y.-C. Hsu, C.-H. Yu, M.J. Buehler, Using deep learning to predict fracture patterns in crystalline
solids. Matter 3(1), 197–211 (2020)
25. M.M. Rashid, T. Pittie, S. Chakraborty, N.A. Krishnan, Learning the stress-strain fields in
digital composites using Fourier neural operator. iScience 25(11) (2022)
Chapter 15
Natural Language Processing
15.1 Introduction
In the materials literature, information is spread over multiple entities such as tables,
text, and images. Thus, information extraction from the research literature requires a
multi-pronged approach that extracts information from all these different entities,
combines them in a meaningful manner (for instance, a knowledge graph), and presents it
to the user in an easily accessible form. To this end, NLP presents a powerful tool.
Although the field of NLP has been around for more than 60 years, applications
of NLP to materials are not more than a decade old. One of the early approaches in
NLP for materials science involves using rule-based methods. These methods rely on
predefined linguistic rules and patterns to extract relevant information from text. For
instance, in materials science, specific rules can be defined to identify and extract
mentions of materials, properties, synthesis methods, or experimental techniques
from scientific articles. While rule-based approaches can be effective for specific
tasks, they often require manual effort in crafting and maintaining the rules, which
can limit their scalability and adaptability to different contexts. In this context, three
major approaches that have resulted in several seminal studies are briefly discussed
below, namely Word2Vec, BERT, and ChatGPT.
Word2Vec is a popular algorithm in the field of NLP that learns distributed rep-
resentations (word embeddings) of words based on their contextual usage. These
word embeddings capture semantic and syntactic relationships between words,
allowing algorithms to understand and infer meaning from the text. In materials sci-
ence, Word2Vec models have been used to explore relationships between materials,
properties, and synthesis methods. By leveraging the learned word embeddings,
researchers can perform tasks such as similarity analysis, document classification,
and information retrieval in a more meaningful way.
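As a toy illustration of similarity analysis with word embeddings, consider the cosine similarity between vectors. The four-dimensional vectors below are hand-made stand-ins, not learned Word2Vec embeddings (which are typically hundreds of dimensions and trained from context):

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# hand-made 4-D "word embeddings" for illustration only
emb = {
    'LiCoO2':  np.array([0.9, 0.8, 0.1, 0.0]),
    'LiFePO4': np.array([0.8, 0.9, 0.2, 0.1]),
    'cathode': np.array([0.7, 0.7, 0.0, 0.2]),
    'polymer': np.array([0.0, 0.1, 0.9, 0.8]),
}
# two cathode materials should sit closer together than either does to 'polymer'
print(cosine(emb['LiCoO2'], emb['LiFePO4']) > cosine(emb['LiCoO2'], emb['polymer']))  # True
```

With real learned embeddings, the same arithmetic supports analogy queries and the discovery-by-similarity applications described above.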
15.2 Materials-Domain Language Model
Most of the earlier works on NLP in materials relied on rule-based approaches
such as ChemDataExtractor, which has also been used in predicting phase diagrams
and in generating battery and superconducting materials databases. A seminal work
following this was the use of word vectors to convert semantic queries into vector
algebra, which was further extended to the prediction of thermoelectrics. Following
this, several works focused on the extraction of synthesis recipes, testing protocols,
and processing conditions using rule-based approaches. This was followed by the
development of Matscholar, a comprehensive materials science search and discovery
engine that is able to automatically identify materials, properties, characterization
methods, phase descriptors, synthesis methods, and applications from a given text
through a custom-built named entity recognition (NER) system. A major departure
from these approaches was the development of materials science language models
such as MatBERT and MatSciBERT.
Figure 15.2 shows the overall workflow employed in the training of MatSciBERT.
While existing LMs like BERT and SciBERT have been trained on large datasets,
Fig. 15.2 MatSciBERT training and evaluation workflow. Reprinted with permission from [13]
they do not include materials-related text. To address this gap, the authors collected
research papers from the materials science domain, specifically in the categories of
inorganic glasses and ceramics, metallic glasses, cement and concrete, and alloys.
They queried the Crossref metadata database and obtained a list of articles, then
downloaded papers from the Elsevier Science Direct database. A custom XML parser
was used to extract text from the downloaded articles, including full sections when
available and abstracts otherwise. For specific material categories like concrete and
alloys, relevant papers were identified through manual annotation and the use of SciB-
ERT classifiers. The resulting dataset, called the Material Science Corpus (MSC),
was divided into training and validation sets, with 85% used for LM training and
15% for validation. Using this dataset, MatSciBERT was pretrained employing the
RoBERTa approach for 10 days on 2 NVIDIA V100 32 GB GPUs. Once the LM is
pretrained, it can be fine-tuned for several downstream tasks such as named entity
recognition, relation classification, and question answering, to name a few. One of the
major challenges when testing the LM on downstream tasks is the availability of
high-quality datasets that are manually labeled. Developing such datasets can be extremely
useful for evaluating the performance of LMs on materials-domain-specific tasks.
Results suggest that the LM trained on domain-specific text outperforms the
existing LMs such as BERT and SciBERT. This evaluation was carried out on three
tasks, namely, named entity recognition, abstract classification, and relation classifi-
cation. In all three tasks, it was observed that the domain-specific LM outperformed
the generic LMs. Further, some of the applications of the materials domain specific
LMs are discussed below.
1. Document classification: The increasing number of published manuscripts on
materials-related topics presents a challenge in effectively identifying relevant
papers. Traditional approaches like TF-IDF and Word2Vec, coupled with classifiers,
have been employed for this task.
Fig. 15.3 Named entities recognised by MatSciBERT. The manual labels identified for each
caption are also included. Reprinted with permission from [13]
testing and environmental conditions on material properties, along with the rel-
evant parameters. Properties such as hardness or fracture toughness, which are
highly sensitive to sample preparation protocols, testing conditions, and equip-
ment used, can benefit from this analysis. MatSciBERT enables the extraction
of information regarding synthesis and testing conditions that may be otherwise
difficult to uncover within the text.
Extracting material compositions from tables is a challenging natural language processing
(NLP) task: it involves identifying materials, their constituents, and their relative
percentages. Developing a model for this task requires addressing several
challenges, some of which are described below [14].
Thus, the task of automated extraction of compositions from tables can be formulated
as follows. Given a table T, along with its caption and the complete text of the
publication in which the table appears, extract the compositions expressed in the table.
The desired output of the extraction process is a set of tuples
Fig. 15.5 Different types of tables in the DisCoMaT framework. Reprinted with permission
from [14]
K id
.(id, ckid , pkid , u id
k )k=1 , where .id represents the material ID as defined by researchers
in the field of materials science. The material ID is used to reference the composition
in text and other tables. Each tuple consists of a constituent element or compound
id id
.ck present in the material, the total number of constituents . K in the material, the
percentage contribution . pk > 0 of .ck in the composition, and the unit .u id
id id id
k of . pk
(either mole% or weight%). For example, the desired output tuples corresponding to
ID A1 from Fig. 15.4a are (A1, MoO.3 , 5, mol%), (A1, Fe.2 O.3 , 38, mol%), and (A1,
P.2 O.5 , 57, mol%).
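The tuple output described above can be written out directly in Python; the `CompositionTuple` name is a hypothetical convenience, with the values taken from the ID A1 example.

```python
from collections import namedtuple

# The output tuples (id, c_k, p_k, u_k) described above, populated
# with the ID A1 example from Fig. 15.4a.
CompositionTuple = namedtuple(
    "CompositionTuple", ["material_id", "constituent", "percentage", "unit"])

a1 = [
    CompositionTuple("A1", "MoO3", 5, "mol%"),
    CompositionTuple("A1", "Fe2O3", 38, "mol%"),
    CompositionTuple("A1", "P2O5", 57, "mol%"),
]

# A well-formed composition has positive contributions that sum to 100
assert all(t.percentage > 0 for t in a1)
assert sum(t.percentage for t in a1) == 100
```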
To address this task, the materials tables are divided into non-composition (NC), single-cell composition (SCC), multi-cell composition with complete information (MCC-CI), and multi-cell composition with partial information (MCC-PI) tables (see Fig. 15.5). Further, each row or column in the dataset is annotated with one of four labels: ID, composition, constituent, and other. While the training data is created using distant supervision, the dev and test sets are hand-annotated. To extract information from the tables, the authors proposed a framework named DiSCoMaT (distantly supervised compositions extraction from materials tables). The DiSCoMaT architecture is shown in Fig. 15.6. The first task is to determine whether the table $T$ is a single-cell composition (SCC) table, which can be identified based on the presence of multiple numbers and compounds in single cells. DiSCoMaT utilizes a graph neural network (GNN) based SCC predictor to classify the table $T$ as an SCC table or not. If it is classified as an SCC table, the system employs a rule-based composition parser to extract the compositions. For tables that are not classified as SCC tables, the system employs a second GNN (referred to as GNN$_2$) to label the rows and columns of the table $T$ as compositions, material IDs, constituents, or others. If no constituents or composition
15.4 Future Directions
With the advent of transformer models such as ChatGPT and GPT-4, the NLP field has been revolutionized, and many tasks which were thought to be extremely challenging have become possible with a prompt. However, these language models are still black-box in nature, and knowing their limitations is crucial for the reliable and secure development of tools that can be used for practical applications. Accordingly, there has been a lot of emphasis on data-centric AI, a paradigm shift from model-centric AI. The development of datasets that probe and aid in understanding the limitations of such language models is imperative.
and perform joint analysis to uncover hidden patterns, correlations, and knowledge. Moreover, multimodal NLP can be used to bridge the gap between different types of data sources in materials science. For instance, by combining text-based literature with experimental data and images, researchers can enhance the interpretation and validation of experimental results, enabling more reliable and efficient material analysis.
Another open area in NLP is the development of a materials-domain knowledge graph combining information from tables, text, and images; such a graph holds great potential for advancing materials science research. A knowledge graph is a structured representation of knowledge that captures entities, their properties, and the relationships between them. By integrating data from diverse modalities, such as tables, text, and images, we can construct a comprehensive knowledge graph that encompasses a wide range of materials-related information.
Tables often contain valuable data on material compositions, properties, and experimental results. By extracting information from tables, we can populate the knowledge graph with structured data, enabling efficient querying and analysis. This data can include material compositions, synthesis methods, characterization techniques, and various material properties.
Textual information, such as research articles, provides rich descriptions, explanations, and context surrounding materials science. Natural language processing techniques can be employed to extract relevant information from the text, such as material synthesis procedures, properties, and relationships. This extracted information can then be linked to entities in the knowledge graph, enriching its content and facilitating semantic connections between different data points.
Images play a crucial role in materials science, providing visual representations
of material structures, morphologies, and properties. Advanced image processing
and computer vision techniques can be applied to extract features and metadata from
images, which can be integrated into the knowledge graph. This allows researchers to
explore the relationships between material structures, properties, and performance,
leveraging both visual and textual information.
By combining information from tables, text, and images, the materials domain knowledge graph becomes a powerful tool for researchers. It enables holistic exploration and analysis of materials data, facilitating discovery of new materials, identification of structure-property relationships, prediction of material behavior, and optimization of synthesis and processing methods. The knowledge graph also supports data-driven approaches, enabling researchers to navigate and leverage vast amounts of interconnected materials information.
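As a minimal sketch of the idea, a knowledge graph can be stored as subject-relation-object triples and queried by pattern matching. All entity and relation names below are illustrative placeholders, not a real materials ontology.

```python
# A materials knowledge graph stored as (subject, relation, object) triples.
# Entities and relations below are illustrative placeholders.
triples = [
    ("SiO2",   "has_property",     "glass_former"),
    ("SiO2",   "appears_in_table", "Table3"),
    ("Na2O",   "has_property",     "network_modifier"),
    ("GlassA", "contains",         "SiO2"),
    ("GlassA", "contains",         "Na2O"),
    ("GlassA", "described_in",     "paper_123"),
]

def query(triples, subject=None, relation=None, obj=None):
    """Return all triples matching the given (possibly partial) pattern."""
    return [t for t in triples
            if (subject is None or t[0] == subject)
            and (relation is None or t[1] == relation)
            and (obj is None or t[2] == obj)]

# All constituents of GlassA, linking table- and text-derived facts
constituents = [o for _, _, o in query(triples, subject="GlassA",
                                       relation="contains")]
print(constituents)  # ['SiO2', 'Na2O']
```

Production knowledge graphs would use an RDF triple store or graph database with typed entities, but the same pattern-matching query model applies.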
15.5 Summary
This chapter explores the application of natural language processing (NLP) techniques in the field of materials science. It discusses the challenges associated with information extraction from both text and tables in materials science literature. The chapter highlights the importance of extracting material compositions, synthesis methods, and characterization techniques from text, as well as the significance of extracting material compositions from tables.
The chapter introduces the concept of materials-domain language models, with a
focus on MatSciBERT. These language models are specifically trained on materials
science literature and enable improved topic classification, relation classification, and
question answering. They capture the contextual meaning of materials-related terms
and enhance the understanding of materials science texts. Additionally, the chapter
presents a pipeline approach for information extraction from tables using graph
neural networks (GNNs), namely, DiSCoMaT. The pipeline includes steps such as
identifying different types of tables, labeling rows and columns, and distinguishing
between complete and partial information tables. It discusses the challenges involved
in table analysis and highlights the potential of NLP techniques in addressing these
challenges.
Altogether, the chapter emphasizes the value of NLP techniques in materials science research. The use of materials-domain language models, such as MatSciBERT, improves the extraction of relevant information from materials science texts and enhances various downstream tasks. The pipeline approach for table analysis demonstrates the potential of GNNs in handling table structures and extracting material compositions effectively.
Looking ahead, further advancements in NLP techniques hold great promise for materials science research. The development of more refined materials-domain language models can enhance the understanding of complex materials-related concepts and facilitate knowledge discovery. Additionally, the application of NLP techniques to other materials science tasks, such as materials property prediction or materials design, opens up new avenues for research and innovation. The integration of NLP with other computational methods, such as computer vision, leading to multimodal NLP, can result in more comprehensive and efficient analysis of materials science literature.
References
Correction to:
N. M. A. Krishnan et al., Machine Learning for Materials
Discovery, Machine Intelligence for Materials
Science, https://doi.org/10.1007/978-3-031-44622-1
The original version of the book was inadvertently published without the electronic supplementary material (ESM) in Chapters 1, 2, 4, 5, and 6. This has now been corrected, and the affected chapters and the book have been updated with these changes.
© The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Switzerland AG 2024
N. M. A. Krishnan et al., Machine Learning for Materials Discovery, Machine
Intelligence for Materials Science, https://doi.org/10.1007/978-3-031-44622-1_16
Index
Imbalanced data, 44, 45
In-silico, 5, 7
Interatomic, 16, 153, 181, 221–225, 228, 229, 231, 232, 235, 242
Interquartile range, 43

Range, 10, 26, 28, 31, 36–38, 43, 44, 62, 121, 132, 145, 180, 189, 192, 193, 230, 232, 242, 261, 273
Recurrent, 50, 148, 150, 197, 199, 234
Recurrent Neural Network (RNN), 50, 148, 197, 199, 234
Regression, backward stepwise, 76
Regression, batch LMS, 68, 70, 81
Regression, forward stepwise, 76, 77
Regression, gradient descent, 66
Regression, LAR, 77
Regression, linear, 63
Regression, linearisation, 73
Regression, LWR, 73
Regression, random forest, 90
Regression, stepwise, 62, 75–77, 81
Regression, stochastic LMS, 70
Regression, subset selection, 74
Regression tree, 85–87, 90, 93, 183
Reinforcement learning, 9, 10, 47–49, 51, 57–59, 145, 146, 156, 191, 192, 197, 199, 201–203, 205

S
Sample, 31, 34–38, 44, 63, 70, 72, 90, 114, 151, 154, 155, 163, 165, 195, 197, 268
Scatter plot, 25, 26, 29, 45
SHapley Additive exPlanations (SHAP), 159–165, 168, 176, 209–213, 217, 249
Skewness, 26, 39, 40
SMOTE, 25, 44, 45
Softness, 209, 213–218
Standard deviation, 38, 42, 72, 109, 110, 184, 228
Steepest descent, 65
Stochastic LMS, 68, 70, 81, 99
Sturges' rule, 31
Summarizing statistics, 34
Supervised learning, 9, 48, 50, 59, 113, 151, 256
Support vector, 9, 54, 59, 85, 86, 101, 103, 105, 107, 108, 110, 139, 159, 183, 184, 209, 213–215, 226
Symbols, list of, xix

T
Tree maps, 25, 26, 28, 45
Trial-and-error, 4, 191, 201

U
Undersampling, 44
Unsupervised learning, 9, 47, 48, 55, 59, 113, 114

V
Variance, 26, 36, 38, 56, 74, 90, 109, 114, 115, 126, 133, 154, 214, 247, 248
Variational Autoencoder (VAE), 41, 154–156, 197–200

W
What-if scenarios, 4, 5
Widrow-Hoff learning rule, 66

X
XGboost, 9, 86, 93–95, 138, 140–142, 183