
Machine Intelligence for Materials Science

N. M. Anoop Krishnan
Hariprasad Kodamana
Ravinder Bhattoo

Machine Learning for Materials Discovery
Numerical Recipes and Practical Applications
Machine Intelligence for Materials Science

Series Editor
N. M. Anoop Krishnan, Department of Civil Engineering, Yardi School of Artificial
Intelligence (Joint Appt.), Indian Institute of Technology Delhi, New Delhi, India
This book series is dedicated to showcasing the latest research and developments at
the intersection of materials science and engineering, computational intelligence, and
data sciences. The series covers a wide range of topics that explore the application
of artificial intelligence (AI), machine learning (ML), deep learning (DL), reinforce-
ment learning (RL), and data science approaches to solve complex problems across
the materials research domain.
Topical areas covered in the series include but are not limited to:
• AI and ML for accelerated materials discovery, design, and optimization
• Materials informatics
• Materials genomics
• Data-driven multi-scale materials modeling and simulation
• Physics-informed machine learning for materials
• High-throughput materials synthesis and characterization
• Cognitive computing for materials research
The series also welcomes manuscript submissions exploring the application of AI,
ML, and data science techniques to the following areas:
• Materials processing optimization
• Materials degradation and failure
• Additive manufacturing and 3D printing
• Image analysis and signal processing
Each book in the series is written by experts in the field and provides a valuable
resource for understanding the current state of the field and the direction in which
it is headed. Books in this series are aimed at researchers, engineers, and academics
in the field of materials science and engineering, as well as anyone interested in the
impact of AI on the field.
N. M. Anoop Krishnan · Hariprasad Kodamana ·
Ravinder Bhattoo

Machine Learning
for Materials Discovery
Numerical Recipes and Practical Applications
N. M. Anoop Krishnan
Department of Civil Engineering, Yardi School of Artificial Intelligence (Joint Appt.)
Indian Institute of Technology Delhi
New Delhi, India

Hariprasad Kodamana
Department of Chemical Engineering, Yardi School of Artificial Intelligence (Joint Appt.)
Indian Institute of Technology Delhi
New Delhi, India

Ravinder Bhattoo
Indian Institute of Technology Delhi
New Delhi, India

ISSN 2948-1813 ISSN 2948-1821 (electronic)


Machine Intelligence for Materials Science
ISBN 978-3-031-44621-4 ISBN 978-3-031-44622-1 (eBook)
https://doi.org/10.1007/978-3-031-44622-1

© Springer Nature Switzerland AG 2024, corrected publication 2024

This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of
the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation,
broadcasting, reproduction on microfilms or in any other physical way, and transmission or information
storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology
now known or hereafter developed.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication
does not imply, even in the absence of a specific statement, that such names are exempt from the relevant
protective laws and regulations and therefore free for general use.
The publisher, the authors, and the editors are safe to assume that the advice and information in this book
are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or
the editors give a warranty, expressed or implied, with respect to the material contained herein or for any
errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional
claims in published maps and institutional affiliations.

This Springer imprint is published by the registered company Springer Nature Switzerland AG
The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland

Paper in this product is recyclable.


To My Parents, Parents-in-law, Arathy, and
Anant—NMAK
To My Parents, Susha, Avyay, and
Adwait—HK

To My Parents, Pushpa, and Priyanka—RB


Foreword

Historically, science and engineering domains periodically experience revolutionary ideas that completely change the way we think about the domain. A few decades ago, it was the introduction of nanoengineered materials and the emergence of predictive multiscale modeling. Today we are on the cusp of a similar revolution. Advances in machine learning and artificial intelligence are now enabling tasks that would have seemed impossible just a short while ago, or would have taken decades to
achieve. Impacts of these new tools include the discovery of novel drugs, ultrahigh
strength alloys, automated design of composites, efficient quantum accuracy simu-
lations, bioinspired design, and knowledge transfer across domains, just to name a
few. The new book by Prof. Krishnan, Prof. Kodamana, and Dr. Bhattoo provides
an excellent introduction to the emerging field of machine learning for materials
discovery. This book bridges a gap and acts as an enabler for the adoption of machine
learning by material scientists, engineers, and students.
The book offers an excellent pedagogical approach towards the use of machine
learning for materials discovery. The book is written in a lucid fashion and is accessible to an audience ranging from undergraduate students to scientists. The book does not
assume any prior knowledge in the domain of machine learning, and is self-sufficient.
The second part of the book covers the basics of machine learning theory including
supervised and unsupervised strategies with examples from the materials domain.
An excellent feature of the book is that theory on machine learning is followed by
codes that allow instructors, students, and practitioners to try the approaches in a
hands-on fashion. The third section discusses a wide range of applications giving
an overview of different avenues where machine learning can be used for materials
discovery.
While the research aspect of the topic is interesting, it is equally, if not more, important to train the next generation of materials scientists to be skilled in machine learning and artificial intelligence, especially to be able to critically discern the best modeling strategies among a broad set of tools. I believe this book can give an impetus for the
adoption of machine learning in materials science curricula across many universities.
I hope you will enjoy reading this excellent book as much as I did.

Markus J. Buehler
Massachusetts Institute of Technology (MIT)
Cambridge, USA

The original version of the book has been revised. The ESM information for chapters 1, 2, 4, 5, 6 has been updated. A correction to this book can be found at https://doi.org/10.1007/978-3-031-44622-1_16
Preface

The last decade in materials science has seen a major wave of change due to the advent
of machine learning and artificial intelligence. While there have been significant
advances in machine learning for the materials domain, a vast majority of students,
researchers, and professionals working on materials still do not have access to the
theoretical backgrounds of machine learning. This can be attributed partially to the
intricate mathematical treatments commonly followed in many machine learning
textbooks and the use of general examples that lack relevance to materials science-
related applications.
This textbook aims to bridge this gap by providing an overview of machine
learning in materials modeling and discovery. The textbook is well-suited for a diverse
audience, including undergraduates, graduates, and industry professionals. The book
is also structured as foundational and can be used as a textbook covering the basics
and advanced techniques while giving hands-on examples using Python codes.
The book is structured into three parts. Part I gives an introduction to the evolu-
tion of machine learning in the materials domain. Part II focuses on building the
foundations of machine learning, with various tailor-made examples accompanied
by corresponding code implementations. In Part III, emphasis is given to several
practical applications related to machine learning in the materials domain.
Although several use cases from the literature are covered, the book also integrates
examples from the authors’ research whenever possible. This deliberate choice is
motivated by accessible data and first-hand details of available codes that might not
readily exist in the literature. We believe such a treatment facilitates comprehen-
sive information about practical implementation while striking a balance with the
theoretical exposition.
The field of machine learning is growing at an exponential pace, and it is impos-
sible to cover all the state-of-the-art methods. This book is by no means exhaustive.
Rather, this book is an attempt to capture the essence of the basics of machine learning
and make the readers aware of the foundations so that they can either delve into the
deeper aspects of machine learning or focus on the applications to the materials
domain using existing approaches to solve an impactful problem in the domain.


We hope you enjoy the book and find it useful for your journey in materials
discovery.

New Delhi, India
August 2023

N. M. Anoop Krishnan
Hariprasad Kodamana
Ravinder Bhattoo
Acknowledgements

There are a lot of people who have contributed both actively and passively to the
development of this book. First, we would like to thank our editor, Dr. Zachary
Evenson, for initiating the idea of the book and encouraging us to complete it. It is
indeed his motivation and support that resulted in the book. Thanks to Mohd Zaki
for helping with the images and suggestions on graphics. Thanks are also due to
Indrajeet Mandal, who painstakingly collected the copyrights for all the images used
in the work. Special thanks to the research scholars of the M3RG at IIT Delhi for
their comments, feedback, and proofreading that helped significantly improve the
book. The authors also thank the support from IIT Delhi and specifically the Yardi
School of Artificial Intelligence, Department of Civil Engineering, and Department
of Chemical Engineering. The role played by the authors’ family in the form of
continuous support to complete the book cannot be emphasized enough. Boundless
thanks to them for supporting us through this endeavor through thick and thin, COVID
and many other uncertainties and challenges, and making this happen.

Contents

Part I Introduction
1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.1 Materials Discovery . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.2 Physics- and Data-Driven Modeling . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.3 Introduction to Machine Learning . . . . . . . . . . . . . . . . . . . . . . . . . . 9
1.4 Machine Learning for Materials Discovery . . . . . . . . . . . . . . . . . . . 10
1.4.1 Property Prediction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
1.4.2 Materials Discovery . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
1.4.3 Image Processing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
1.4.4 Understanding the Physics . . . . . . . . . . . . . . . . . . . . . . . . . 13
1.4.5 Automated Knowledge Extraction . . . . . . . . . . . . . . . . . . . 14
1.4.6 Accelerating Materials Modeling . . . . . . . . . . . . . . . . . . . . 15
1.5 Outline of the Book . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18

Part II Basics of Machine Learning


2 Data Visualization and Preprocessing . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
2.2 Data Visualization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
2.2.1 Bar Graph . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
2.2.2 Heat Map . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
2.2.3 Tree Map . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
2.2.4 Scatter Plots . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
2.2.5 Histogram . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
2.2.6 Density Plots . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
2.3 Extracting Statistics from Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
2.3.1 Central Measures of Data . . . . . . . . . . . . . . . . . . . . . . . . . . 34
2.3.2 Measures of Variability . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
2.3.3 Higher Order Measures . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39


2.4 Outlier Detection and Data Imputing . . . . . . . . . . . . . . . . . . . . . . . . 40


2.4.1 Outlier Detection Based on Standard Deviation . . . . . . . 42
2.4.2 Outlier Detection Based on Using Median
Absolute Deviation (MAD) Approach . . . . . . . . . . . . . . . 43
2.4.3 Outlier Detection Using Interquartile Approach . . . . . . . 43
2.5 Data Augmentation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
2.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
3 Introduction to Machine Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
3.1 Machine Learning Paradigm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
3.1.1 Unsupervised Learning Algorithms . . . . . . . . . . . . . . . . . . 48
3.1.2 Supervised Learning Algorithms . . . . . . . . . . . . . . . . . . . . 50
3.1.3 Reinforcement Learning Algorithms . . . . . . . . . . . . . . . . . 51
3.2 Parametric and Non-parametric Models . . . . . . . . . . . . . . . . . . . . . . 51
3.2.1 Parametric Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
3.2.2 Non-parametric Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
3.2.3 Choosing Between Parametric and Non-parametric
Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
3.3 Classification and Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
3.3.1 Classification Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
3.3.2 Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
3.4 Clustering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
3.5 Reinforcement Learning: Model-Free and Policy Grad . . . . . . . . . 57
3.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
4 Parametric Methods for Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
4.2 Closed Form Solution of Regression . . . . . . . . . . . . . . . . . . . . . . . . 63
4.3 Iterative Approaches for Regression . . . . . . . . . . . . . . . . . . . . . . . . . 65
4.3.1 Gradient Descent Optimizer . . . . . . . . . . . . . . . . . . . . . . . . 65
4.3.2 Gradient Descent Approach for Linear Regression . . . . . 66
4.3.3 Least Squares: A Probabilistic Interpretation . . . . . . . . . . 72
4.4 Locally Weighted Linear Regression (LWR) . . . . . . . . . . . . . . . . . 73
4.5 Best Subset Selection for Regression . . . . . . . . . . . . . . . . . . . . . . . . 74
4.5.1 Stepwise Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
4.5.2 Stagewise Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
4.5.3 Least Angle Regression (LAR) . . . . . . . . . . . . . . . . . . . . . 77
4.6 Logistic Regression for Classification . . . . . . . . . . . . . . . . . . . . . . . 78
4.7 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83

5 Non-parametric Methods for Regression . . . . . . . . . . . . . . . . . . . . . . . . 85


5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
5.2 Tree-Based Approaches . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
5.2.1 Regression Tree . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
5.2.2 Random Forest Regression . . . . . . . . . . . . . . . . . . . . . . . . . 90
5.2.3 Gradient Boosted Trees . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
5.3 Multi-layer Perceptron . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
5.4 Support Vector Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
5.4.1 Linear Separable Case . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
5.4.2 Linear Non-separable Case . . . . . . . . . . . . . . . . . . . . . . . . . 104
5.4.3 Kernel SVR . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105
5.5 Gaussian Process Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108
5.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112
6 Dimensionality Reduction and Clustering . . . . . . . . . . . . . . . . . . . . . . . 113
6.1 An Introduction to Unsupervised ML . . . . . . . . . . . . . . . . . . . . . . 113
6.2 Principal Component Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114
6.3 k Means Clustering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 117
6.4 Gaussian Mixture Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 118
6.5 t-Distributed Stochastic Neighbor Embedding . . . . . . . . . . . . . . . . 126
6.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 128
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 129
7 Model Refinement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 131
7.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 131
7.2 Regularization for Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 132
7.2.1 Ridge Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 133
7.2.2 Least Absolute Shrinkage and Selection Operator
(LASSO) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 134
7.2.3 Elastic-Net Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . 136
7.3 Cross-Validation for Model Generalizability . . . . . . . . . . . . . . . . . . 137
7.4 Hyperparametric Optimization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 139
7.4.1 Grid Search . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 139
7.4.2 Random Search . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 140
7.4.3 Bayesian Optimization . . . . . . . . . . . . . . . . . . . . . . . . . . . . 140
7.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 142
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 143
8 Deep Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 145
8.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 145
8.2 Convolutional Neural Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . 146
8.3 Long-short Term Memory Networks . . . . . . . . . . . . . . . . . . . . . . . . 148
8.4 Generative Adversarial Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . 150
8.5 Graph Neural Networks (GNN) . . . . . . . . . . . . . . . . . . . . . . . . . . . . 152
8.6 Variational Auto Encoders (VAE) . . . . . . . . . . . . . . . . . . . . . . . . . . . 154

8.7 Reinforcement Learning (RL) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 156


8.8 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 158
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 158
9 Interpretable Machine Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 159
9.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 159
9.2 Shapley Additive Explanations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 160
9.3 Integrated Gradients . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 165
9.4 Symbolic Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 166
9.5 Other Interpretability Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . 168
9.6 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 170
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 171

Part III Machine Learning for Materials Modeling and Discovery


10 Property Prediction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 175
10.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 175
10.2 Dataset Preparation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 177
10.3 Feature Engineering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 179
10.4 Model Development . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 182
10.5 Hyperparametric Optimization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 184
10.6 Physics-Informed ML for Property Prediction . . . . . . . . . . . . . . . . 185
10.7 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 189
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 190
11 Material Discovery . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 191
11.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 191
11.2 ML Surrogate Model Based Optimization . . . . . . . . . . . . . . . . . . . . 192
11.3 Material Selection Chart . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 195
11.4 Generative Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 197
11.5 Reinforcement Learning for Optimizing Atomic Structures . . . . . 201
11.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 205
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 206
12 Interpretable ML for Materials . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 209
12.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 209
12.2 Composition–Property Relationships . . . . . . . . . . . . . . . . . . . . . . . . 210
12.3 Interaction of Input Features . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 211
12.4 Decoding the Physics of Atomic Motion . . . . . . . . . . . . . . . . . . . . . 213
12.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 217
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 218
13 Machine Learned Material Simulation . . . . . . . . . . . . . . . . . . . . . . . . . . 221
13.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 221
13.2 Machine Learning Interatomic Potentials for Atomistic
Modeling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 223

13.3 Physics-Informed Neural Networks for Continuum


Simulations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 231
13.4 Graph Neural Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 235
13.4.1 Physics-Enforced GNNs . . . . . . . . . . . . . . . . . . . . . . . . . . . 236
13.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 242
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 243
14 Image-Based Predictions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 245
14.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 245
14.2 Structure–Property Prediction Using CNN . . . . . . . . . . . . . . . . . . . 247
14.2.1 Predicting the Ionic Conductivity . . . . . . . . . . . . . . . . . . . 249
14.2.2 Predicting the Effective Elastic Properties
of Composites . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 251
14.3 Combining CNN with Finite Element Modeling . . . . . . . . . . . . . . 252
14.4 Combining Molecular Dynamics and CNN for Crack
Prediction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 253
14.5 Fourier Neural Operator for Stress-Strain Prediction . . . . . . . . . . . 257
14.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 259
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 261
15 Natural Language Processing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 263
15.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 263
15.2 Materials-Domain Language Model . . . . . . . . . . . . . . . . . . . . . . . . . 265
15.3 Extracting Material Composition from Tables . . . . . . . . . . . . . . . . 268
15.4 Future Directions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 271
15.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 274
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 274
Correction to: Machine Learning for Materials Discovery . . . . . . . . . . . . . C1
N. M. Anoop Krishnan, Hariprasad Kodamana, and Ravinder Bhattoo

Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 277
Acronyms

ADASYN Adaptive synthetic sampling technique


AFM Atomic force microscope
AI Artificial intelligence
CNN Convolutional neural network
CALPHAD Calculation of phase diagrams
CSD Cambridge structural database
DBSCAN Density-based spatial clustering of applications with noise
EBSD Electron backscatter diffraction
FEM Finite element methods
GAN Generative adversarial networks
GCN Graph convolutional network
GNN Graph neural network
GPR Gaussian process regression
GRU Gated recurrent units
IID Independent and identically distributed
LAR Least angle regression
LARS Least angle regression and shrinkage
LMS Least mean square
LSTM Long short-term memory
LWR Locally weighted linear regression
MD Molecular dynamics
MC Monte-Carlo
ML Machine learning
MLP Multi-layer perceptron
NLP Natural language processing
NN Neural network
OLS Ordinary least square
OPTICS Ordering points to identify cluster structure
PCA Principal component analysis
QSPR Quantitative structure–property relationships
RF Random forest


RL Reinforcement learning
RNN Recurrent neural network
SARSA State-action-reward-state-action
SEM Scanning electron microscope
SMOTE Synthetic minority oversampling technique
SVM Support vector machine
SVR Support vector regression
t-SNE t-distributed stochastic neighbor embedding
XGBoost Extreme gradient boosting
Part I
Introduction
Chapter 1
Introduction

Abstract Materials form the basis of human civilization. With the advance of com-
putational algorithms, computational power, and cloud-based services, materials
innovation is accelerating at a pace never witnessed by humankind. In this chapter,
we briefly introduce the materials discovery approaches using AI and ML that have enabled some breakthroughs in our understanding of materials. We list some publicly available databases on materials and some of the applications where AI and ML have been used to design and discover novel materials. The chapter concludes with a brief
outline of the book.

1.1 Materials Discovery

The progress of human civilization has been closely related to the discovery and usage
of new materials. Materials have shaped how we interact with the world, from the
stone to the silicon age. This is exemplified by the fact that the different ages of human
history have been named after the prominent materials used in those eras–the stone age, the bronze age, and the iron age. However, the importance of materials discovery was formally accepted only in the 1950s with the proposition of materials as a separate engineering domain. During World War II and the ensuing Cold War, countries realized that materials were the bottleneck in advancing military, space, and medical technologies. Thus, materials
science emerged as the first discipline formed out of the fusion and collaborations of
multiple disciplines from basic sciences and engineering, focusing on understanding
material response leading to materials discovery. While the early focus of materials
science remained in metallurgy, it was soon expanded to other domains such as
ceramics, polymers, and later to composites, nano-materials, and bio-materials.

Supplementary Information The online version contains supplementary material available at https://doi.org/10.1007/978-3-031-44622-1_1.

Fig. 1.1 Flow chart of traditional materials discovery based on what-if scenarios. Intuition and
expert knowledge are used to cleverly pose the what-if questions that can potentially lead to the
discovery of novel materials

The earlier approaches for materials discovery relied on trial-and-error approaches
driven either by physics or strong intuition developed through years of experience.
In such cases, the idea of what-if scenarios was used for discovering materials with
tailored properties, as shown in Fig. 1.1. This approach would start from a “what-if”
question on one or more aspects of the materials tetrahedron: processing, structure, property, and performance. A set of candidate solutions would be proposed
based on the available knowledge and intuitions. These solutions would be tested
using experimental synthesis and characterizations. If a candidate solution meets the
expected performance, the new material is manufactured, verified, validated, certified, and deployed in the industry. If any of the candidate solutions do not meet
the expected performance, the iteration is continued until a desired candidate is
discovered. As a trivial example, consider the following. Carbon can improve the
hardness and strength of steel–what if we increase the carbon content of steel? Experi-
mental studies reveal that the increase in carbon content improves steel’s hardness and
strength. However, higher carbon content makes steel brittle and less weldable! Thus,
although the new candidate meets the expected performance in terms of strength, it
induces some undesirable side-effects on other properties. Hence, the candidate may
not be accepted. Thus, the what-if scenarios required detailed and time-consuming
experimental characterization and analysis of new materials, significantly increas-
ing the cost and time required for materials discovery. In these cases, the typical
timescale associated with the discovery of a new material was 20–30 years from the
initial research to its first use.
The invention of computers and in-silico approaches came as a breakthrough in
materials discovery in the second half of the twentieth century. Monte Carlo (MC)
algorithms and molecular dynamics (MD) simulations, both proposed in the 1950s,
became valuable tools for understanding materials response under different scenar-
ios. These approaches reduced the number of actual experiments to be carried out,
accelerating materials discovery. At the same time, slowly but steadily, researchers
also started realizing the importance of compiling and documenting the materials
data generated by the experiments and simulations. The first attempts to this extent
were the Cambridge Structural Database (CSD) and Calculation of Phase Diagrams
(CALPHAD) around the 1970s. These databases enabled the development of a quan-
titative structure-property relationships (QSPR) approach in materials. The QSPR
approaches primarily relied on correlations and simple linear or polynomial regres-
sions that allowed the discovery of patterns from the available data, which ultimately
provided insights into materials response.
List of Publicly Available Materials Databases
1. CSD: Cambridge Structural Database
2. CALPHAD
3. Granta Design
4. Pauling File
5. ICSD: Inorganic Crystal Structure Database
6. ESP: Electronic Structure Project
7. AFLOW: Automatic-Flow for Materials Discovery
8. MatNavi
9. AIST: National Institute of Advanced Industrial Science and Technology
Databases
10. COD: Crystallography Open Database
11. MatDL: Materials Digital Library
12. The Materials Project
13. CMR: Computational Materials Repository
14. Springer Material
15. OpenKIM
16. NREL CID: NREL Center for Inverse Design
17. MGI: Materials Genome Initiative
18. MatWeb
19. MATDAT
20. CEPDB: The Clean Energy Project Database
21. CMD: Computational Materials Network
22. Catalysis Hub
23. OQMD: Open Quantum Materials Database
24. Open Material Databases
25. NREL MatDB
26. Citrine Informatics
27. Exabyte.io
28. NOMAD: Novel Materials Discovery Laboratory
29. Marvel
30. Thermoelectrics Design Lab
31. MaX: Materials Design at the Exascale
32. CritCat
33. Khazana
34. Material Data Facility
35. MICCOM: Midwest Integrated Center for Computational Materials
36. MPDS: Materials Platform for Data Science
37. CMI2: Center for Materials Research by Information Integration
38. HTEM: High Throughput Experimental Materials Database
39. JARVIS: Joint Automated Repository for Various Integrated Simulations
40. OMDB: Organic Materials Database
41. aNANt
42. Atom Work Adv
43. FAIR Data Infrastructure
44. Materiae
45. Materials Zone
46. MolDis
47. QCArchive: The Quantum Chemistry Archive
48. PyGGi: Python for Glass Genomics
The next breakthrough in materials discovery could be attributed to the inter-
net revolution, which democratized the access to data for everyone. This period
also saw a surge in the development of experimental and computational databases
which started serving as information repositories and cook-books for material synthesis.
Figure 1.2 shows the databases that are available and their geographical distribution.
The availability of these databases also inspired the automated search for correlations
in composition–structure–processing–property relationships of the materials. Thus,
the stage was set for the use of machine learning for materials discovery with all the
relevant ingredients in place, namely,
1. availability of large amounts of data,
2. computational power to process and “learn” the data, and
3. extremely non-linear composition–structure–property relationships along with a poor understanding of the physics governing these relationships in materials.

Fig. 1.2 Material database timeline and geographical region of origin. Reprinted with permission from [1]

1.2 Physics- and Data-Driven Modeling

Models are simplified replicas of real-world scenarios with attention to the features
or phenomena of interest. For example, a ball-and-stick model of atoms aims to
show the relative atomic positions for a given lattice, while completely ignoring
the dynamics, electronic structure, and other details of an atomic system. Figure 1.3
shows the ball and stick model for benzene with the chemical formula C₆H₆. Note that
the black balls represent the carbon atom while the white ones represent hydrogen
atoms. Further, the alternating single and double bonds are represented beautifully
by single and double sticks connecting the carbon atoms. Such models can be very
useful for giving a quick understanding of complex molecular structures and are
hence used commonly for teaching purposes.
While a ball-and-stick model is a physical model, phenomena are typically
expressed through mathematical models. Traditional models in materials and engi-
neering disciplines have relied on mathematical equations derived based on physical
theories or laws. This approach has been widely accepted for centuries and has stood
the test of time. Some of the widely used mathematical models in materials science
include laws of thermodynamics, Fick’s laws, Avrami equation, Arrhenius equa-
tion, Gibbs–Thomson equation, Bragg’s law, and Hooke’s law.

Fig. 1.3 Ball and stick model of benzene (C₆H₆)

Thus, the physical models are derived based on existing theories and can be explained using reason-
ing to understand the phenomenon. However, the physical models have traditionally
been limited to simple systems. The extremely complex and non-linear nature of
advanced materials has remained elusive to physical models as well as in-silico models. Understanding the response of these materials requires high-fidelity, high-
throughput experiments and simulations, which are highly prohibitive in terms of
cost and manpower.
An alternate approach that has emerged recently is the data-driven approach. Here,
the data is used to first identify the model and then fit the parameters of the model.
Data-driven models are not based on physical theories and hence are occasionally
termed “black-box” models. It is interesting to note that although data-driven models such as machine learning were first proposed at the same time as MC and MD simulations in the 1950s, they have started finding widespread applications in materials engineering only in the past two decades. The reluctance to accept data-driven
models, despite their fast, accurate, and efficient ability to learn patterns from data,
could be attributed to their black-box nature. In other words, the data-driven mod-
els cannot be explained using known physics, they can only be tested for unknown
scenarios. However, the advances in machine learning coupled with the availability
of large-scale data on materials have shown the potential of data-driven approaches
for materials discovery. In addition, the development of explainable machine learn-
1.3 Introduction to Machine Learning 9

ing algorithms, which allows the interpretation of black-box models, has allowed
domain experts to interpret the black-box models. This allows the interpretation of
the features “learned” by the model, thereby, giving insights into the inner workings
of the models. Overall, data-driven approaches have shown significant potential to
accelerate materials discovery and reduce the discovery-to-deployment period from
20 years to 10 years or even less.

1.3 Introduction to Machine Learning

Machine learning (ML) refers to the branch of study which focuses on developing
algorithms that “learn” the hidden patterns in the data. In contrast to physics-based
models, ML uses the data for both model development and model training. Further,
it improves the model in a recursive fashion using a predictor-corrector approach
without being explicitly programmed to do the specific task. As such, large amounts
of data is required for the ML models to learn the patterns reasonably–the more the
data, the better the ML model is. ML has already been widely used in our day-to-day
life for several applications such as face recognition, email spam detection, personal
assistants, automated chat-bots, and fraud detection. To achieve these tasks, ML uses
different classes of algorithms as detailed below.
Algorithms in ML can be broadly classified into supervised, unsupervised, and
reinforcement learning. Supervised learning refers to algorithms that learn the function that maps a set of input–output data. Examples of this approach include predicting the Young’s modulus or density of an alloy based on its composition and processing, or classifying a set of materials into conductors or insulators. It may be noticed that in the first task, the output (Young’s modulus) can take continuous values as a function of the composition and processing and hence is known as regression, whereas in the second task, the output can either be conductor or insulator, and hence it is a classification task. Note that classification problems can be multi-class as well, having more than two classes, for example, conductor, insulator, superconductor, and semiconductor. The
crucial aspect in supervised learning is the availability of a labeled dataset on which
the model can be trained. The accuracy of the model depends highly on the accuracy
of the dataset among other factors. Some commonly used supervised models are
linear and polynomial regressions, logistic regression, decision trees, random forest
(RF), XGBoost, support vector regression (SVR), neural network (NN), and Gaussian process
regression (GPR).
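As a minimal illustration of these two supervised settings, the sketch below trains a regression model and a classification model on a small synthetic dataset; the composition features, property values, and class labels are randomly generated placeholders rather than real materials data.

```python
# A minimal sketch of supervised learning on synthetic "materials" data:
# composition-like features are mapped to a continuous property (regression)
# and to a binary label (classification).
import numpy as np
from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score, accuracy_score

rng = np.random.default_rng(0)
X = rng.uniform(0, 100, size=(500, 4))                         # toy composition features
y_reg = 0.5 * X[:, 0] + 0.2 * X[:, 1] + rng.normal(0, 5, 500)  # e.g., Young's modulus
y_clf = (X[:, 2] > 50).astype(int)                             # e.g., conductor vs insulator

X_tr, X_te, yr_tr, yr_te, yc_tr, yc_te = train_test_split(
    X, y_reg, y_clf, test_size=0.2, random_state=0)

reg = RandomForestRegressor(random_state=0).fit(X_tr, yr_tr)   # regression model
clf = RandomForestClassifier(random_state=0).fit(X_tr, yc_tr)  # classification model

print("Regression R2:", r2_score(yr_te, reg.predict(X_te)))
print("Classification accuracy:", accuracy_score(yc_te, clf.predict(X_te)))
```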
In unsupervised learning, the algorithm tries to find out patterns from the features
of the data. In this case, there is no labeled training set that is used. Some of the
main approaches in unsupervised learning include clustering and anomaly detection.
Clustering refers the automated grouping of materials based on their similarity to
each other based on the features provided. Clustering may be used to remove an
outlier in the data, or to identify subgroups in the data. Some of the unsupervised
models include k-means, DBSCAN, OPTICS (inspired from DBSCAN), t-SNE, and
principal component analysis (PCA).
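The sketch below illustrates a typical unsupervised workflow on similar placeholder data: the features are standardized, reduced to two principal components with PCA, and then grouped with k-means; the two synthetic families of points stand in for real subgroups in a materials dataset.

```python
# A minimal sketch of unsupervised learning: standardization, PCA for
# dimensionality reduction, and k-means clustering on toy feature vectors.
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
# Two toy "families" of materials in a 6-dimensional feature space
X = np.vstack([rng.normal(0, 1, (100, 6)), rng.normal(4, 1, (100, 6))])

X_scaled = StandardScaler().fit_transform(X)          # standardize the features
X_2d = PCA(n_components=2).fit_transform(X_scaled)    # project to 2 components
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X_2d)

print("Cluster sizes:", np.bincount(labels))          # sizes of the two groups
```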
Reinforcement learning, although it holds great potential, is relatively less explored in materials discovery. Reinforcement learning relies on a carrot-and-stick policy
where an agent is trained to take actions that maximize the cumulative reward. Thus,
reinforcement learning tries to combine the existing knowledge and exploration in
a judicious fashion to maximize the reward. Reinforcement learning can be used
to identify optimal process parameters for material synthesis and characterization,
and also to explore novel materials with superior properties such as room temper-
ature superconductors or ultra-stable glasses. Some of the reinforcement learning
algorithms include Q-learning, and State-action-reward-state-action (SARSA). In
this book, we will be primarily focusing only on supervised and unsupervised algo-
rithms. These algorithms are discussed in detail in Part II. Reinforcement learning
is briefly outlined in Sect. 8.7.

1.4 Machine Learning for Materials Discovery

Machine learning has found applications in accelerating the discovery of a variety of materials, as well as in gaining deep insights into the material response [2–6]. Here,
we briefly review some of the applications where ML has successfully solved some
of the open problems or has outperformed classical approaches. These applications
are discussed in detail in Part III.

1.4.1 Property Prediction

One of the most common applications of ML is property prediction. This is a major problem for almost all materials such as alloys, ceramics, glasses, poly-
mers, and nanomaterials as the property of a material can be a complex, non-convex
function of composition, structure, and processing [3, 7–14]. For some properties
such as hardness, it can also be a function of the testing method and testing param-
eters [15]. To predict material properties, first a clean dataset of input features and
output property of interest needs to be prepared. Note that the input features can be
simple chemical composition, or more complex features such as the periodic table
based descriptors. The input features can also be a combination of multiple features
engineered using additional unsupervised ML techniques. Once a clean dataset is
prepared, supervised ML is used to train models that can predict the property of
interest.
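A schematic version of this workflow is sketched below; the file name glass_data.csv and the column youngs_modulus are hypothetical placeholders for a cleaned composition–property dataset, and gradient boosting stands in for any of the supervised models listed earlier.

```python
# A schematic sketch of a property-prediction workflow on a hypothetical
# cleaned dataset of compositions (input features) and a measured property.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import r2_score, mean_absolute_error

df = pd.read_csv("glass_data.csv")            # hypothetical cleaned dataset
X = df.drop(columns=["youngs_modulus"])       # composition features (e.g., mol%)
y = df["youngs_modulus"]                      # property of interest (e.g., GPa)

# Hold out a test set; in practice, a validation set or cross-validation
# would also be used for hyperparameter tuning.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

model = GradientBoostingRegressor(random_state=42).fit(X_train, y_train)
y_pred = model.predict(X_test)

print("Test R2 :", r2_score(y_test, y_pred))
print("Test MAE:", mean_absolute_error(y_test, y_pred))
```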
Figure 1.4 shows the predicted values of density, Young’s modulus, Vicker’s hard-
ness, and shear modulus of oxide glasses with respect to the experimental values [16].
The dataset consists of 50,000 oxide glasses with multiple components. We observe
that the predicted values for this large dataset exhibit a good agreement with respect to the experimental values for all the properties. In addition, the 95% confidence interval of the error histogram shown in the inset confirms that the predictions indeed exhibit a very low error in comparison to the range of values considered. Similar approaches have been widely used for the prediction of properties of several materials including ceramics, metal alloys, metallic glasses, 2D materials, polymers, and even proteins.

Fig. 1.4 Predicted values of a density, b Young’s modulus, c Vicker’s hardness, and d shear modulus of oxide glasses with respect to the experimental values. The R² values of training, validation, and test are shown. The inset shows the histogram of error in the prediction along with the 95% confidence interval

1.4.2 Materials Discovery

While property prediction allows one to explore the properties of hitherto unknown compositions, it does not necessarily provide a direct recipe for new materials. Materials discovery is a more challenging problem, having constraints on multiple properties and components. For instance, a desired alloy for automotive applications should be
light-weight, hard, strong, tough, ductile, and easily weldable. Many of these prop-
erties are conflicting. Effectively, this problem translates to solving the inverse of
property prediction. Here, we need to predict the candidate composition and process-
ing parameters corresponding to a target property. To this end, surrogate-model-based optimization approaches can be used, wherein the surrogate model is developed using supervised ML. Once the surrogate model is developed for composition–property relationships, metaheuristic algorithms such as ant colony optimization, particle swarm optimization, or genetic algorithms, or Bayesian optimization, can be
used to identify the family of compositions that satisfies the compositional and prop-
erty constraints. The list of predicted compositions can be tested experimentally for
validation. This approach significantly reduces the total number of experiments to
be carried out, thereby accelerating materials discovery significantly.
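The sketch below illustrates this idea of querying a trained surrogate inside an optimizer; the surrogate (reusing the hypothetical model from the previous sketch), the number of components, and the constraint penalties are illustrative assumptions, and differential evolution stands in for the metaheuristics mentioned above.

```python
# A minimal sketch of surrogate-model-based inverse design: a trained ML
# surrogate is queried inside a global optimizer to find a composition that
# maximizes the predicted property under simple constraints.
import numpy as np
from scipy.optimize import differential_evolution

N_COMPONENTS = 4                        # hypothetical number of components
bounds = [(0.0, 100.0)] * N_COMPONENTS  # mol% bounds for each component

def objective(x):
    x = np.asarray(x)
    penalty = 10.0 * abs(x.sum() - 100.0)           # components sum to 100 mol%
    penalty += 10.0 * max(0.0, 70.0 - x[0])         # first component >= 70 mol%
    penalty += 10.0 * max(0.0, x[0] - 90.0)         # first component <= 90 mol%
    predicted = model.predict(x.reshape(1, -1))[0]  # surrogate prediction
    return -predicted + penalty                     # maximize -> minimize the negative

result = differential_evolution(objective, bounds, seed=0, maxiter=200)
print("Candidate composition (mol%):", np.round(result.x, 2))
print("Predicted property:", model.predict(result.x.reshape(1, -1))[0])
```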

Fig. 1.5 Optimization flow chart for the discovery of glass compositions with Young’s modulus greater than 30 GPa, liquidus less than 1500 K, and SiO₂ mol% between 70 and 90%. This approach has been implemented for glass discovery in PyGGi Zen
Figure 1.5 shows the flow chart for the discovery of glass compositions with Young’s modulus greater than 30 GPa, a liquidus less than 1500 K, and SiO₂ with a mol% between 70 and 90%. Here the objective is to discover glasses with Young’s modulus greater than 30 GPa. The constraints are applied on both property (liquidus < 1500 K) and composition (70% < SiO₂ < 90%). Additional constraints on other properties or components can also be applied. Now, the model for composition–property relationships is obtained using ML. Optimization algorithms such as gradient descent, particle swarm, ant colony, and genetic algorithms are applied to this model to discover new glass compositions satisfying both the objectives and the constraints. Finally, the product is experimentally validated. Similar approaches have been employed in several materials discovery packages such as The Materials Project or PyGGi Zen.

1.4.3 Image Processing

Images hold key information about materials and form a crucial part of the materials
literature [13, 17–19]. For instance, the scanning electron microscope (SEM) image
of a microstructure of a material provides detailed information about the grain struc-
ture, orientation, phase, and texture of the material, which in turn contributes to
its mechanical properties [20–22]. Quantifying this information requires a domain
expert and is an extremely time intensive task. ML has been successfully used to
address this challenge. ML has proved to be able to automatically capture grain-level
information from the SEM images. In addition, these SEM images can be used to
predict the properties of materials. These models can then be used to obtain materials
with tailored microstructures having desired properties as well.
Figure 1.6 shows the prediction of crystal structures, that is, the Bravais lattice
and space group, based on the electron backscatter diffraction (EBSD) patterns [23].
EBSD patterns are directly given as an input to a CNN, which consists of alternating
convolution and pooling layers. The goal of the convolution layer is to extract the
features from the images to form feature maps, which are then downsampled by the
pooling layers. Finally, a feedforward NN is placed at the last layer to perform a
classification task, which takes the learned and downsampled features as inputs and
predicts the crystal structure. Thus, the ML model allows the prediction of crystal
structure directly from the EBSD images.
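The sketch below shows a minimal CNN of this kind in PyTorch, with alternating convolution and pooling layers followed by a feedforward classifier; the layer sizes, input resolution, and number of classes are illustrative assumptions and do not correspond to the architecture used in [23].

```python
# A minimal CNN sketch that maps a diffraction-pattern-like image to one of
# several crystal-structure classes: convolution/pooling layers extract and
# downsample feature maps, and a feedforward head performs classification.
import torch
import torch.nn as nn

class CrystalCNN(nn.Module):
    def __init__(self, n_classes=14):                 # e.g., 14 Bravais lattices
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),                           # downsample feature maps
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(32 * 16 * 16, 128), nn.ReLU(),
            nn.Linear(128, n_classes),                 # class logits
        )

    def forward(self, x):
        return self.classifier(self.features(x))

model = CrystalCNN()
patterns = torch.randn(8, 1, 64, 64)                   # a batch of toy 64x64 patterns
print(model(patterns).shape)                           # torch.Size([8, 14])
```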

Fig. 1.6 Applying CNN to predict the Bravais lattice or space group of a crystal structure from the electron backscatter diffraction patterns. Reprinted from [23] with permission

1.4.4 Understanding the Physics

Understanding the physics governing the response of materials to different types of stimuli, such as crystallization [24, 25], phase separation [25–27], and fracture [26–29], still remains an open problem. This is further exemplified by the fact that the glass transition has been hailed as one of the greatest unsolved challenges of the twenty-first century [30]. Recently, ML has shown the potential to provide insights into the dynamics governing the glass transition. Specifically, ML methods have been able to predict the structural control of glass dynamics, which allowed extrapolation of glass relaxation behavior to large timescales. Similarly, ML has also been used to understand the preferred direction of crack propagation (see Fig. 1.7). These studies show that ML has the potential to provide deep insights into the physics of material behavior, which may hold the key to solving some of the open materials problems.

1.4.5 Automated Knowledge Extraction

While there have been several databases on material properties, most of the information on materials lies buried as unstructured data in the form of text in the literature. ML allows automated extraction of this knowledge from the literature, which can then be used to predict new materials. This approach has been demonstrated to discover novel thermoelectric materials. It also allows one to uncover correlations between materials that were otherwise not obvious even to domain experts. Altogether, ML can aid the extraction of knowledge from text, which can further be used for knowledge dissemination in an accelerated fashion.
Fig. 1.7 Prediction of fracture path based on the atomistic simulation trajectory trained using a
LSTM and CNN. Reprinted from [31] with permission

Recently, several natural language processing (NLP) based techniques have been widely used to extract information from the scientific literature and images
[32–37]. Some of these include the chemdataextractor [33], imagedataextractor [35],
and other automated databases developed using NLP [38]. Figure 1.8 shows the key
steps in the extraction of named entities from the scientific literature which can then
be used for mining large scale information. This information can in turn be used to
discover novel applications for existing materials based on property similarity and
novel materials for targeted application based on materials similarity.
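As a toy illustration of the word-embedding step in such pipelines, the sketch below trains Word2vec vectors on a few hypothetical tokenized sentences and queries similarity in the embedding space; real pipelines train on millions of abstracts and add a neural named-entity-recognition model on top.

```python
# A toy sketch of training word embeddings on a tiny hypothetical corpus and
# querying term similarity; the sentences below are illustrative placeholders.
from gensim.models import Word2Vec

corpus = [
    ["LiCoO2", "is", "a", "cathode", "material", "for", "batteries"],
    ["LiFePO4", "is", "studied", "as", "a", "cathode", "material"],
    ["SiO2", "is", "a", "glass", "former"],
]
model = Word2Vec(sentences=corpus, vector_size=32, window=3, min_count=1, seed=0)
print(model.wv.most_similar("LiCoO2", topn=2))   # nearest terms in embedding space
```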
Fig. 1.8 Workflow for named entity recognition. The key steps are as follows: 1 documents are
collected and added to our corpus, 2 the text is preprocessed (tokenized and cleaned), 3 for training
data, a small subset of documents are labeled (SPL = symmetry/phase label, MAT = material, APL = application), 4 the labeled documents are combined with word embeddings (Word2vec)
generated from unlabeled text to train a neural network for named entity recognition, and finally 5
entities are extracted from our text corpus. Reprinted with permission from [32]

1.4.6 Accelerating Materials Modeling

Materials modeling is another area where ML has been disruptive. Specifically, materials modeling can be carried out at multiple scales, ranging from electronic and atomic length scales to continuum scales. Atomistic simulations model the interactions between atoms using empirical forcefields, the accuracy of which governs the accuracy of the simulations. While first-principles simulations can provide accurate results, they are constrained in terms of the number of atoms due to their prohibitive computational cost. To address this challenge, ML-based interatomic forcefields have been developed that provide the accuracy of first-principles simulations at a significantly lower computational cost. These generic potentials may thus enable the simulation of almost any element in the periodic table at a reasonable computational cost. Figure 1.9 shows a GNN framework that enables accurate predictions of the interatomic forces from the particle positions and velocities [39]. In this approach, each atomic configuration is represented as a directed graph wherein the influence of a neighboring atom $j$ on an atom $i$, or node $v_i$, is represented by a directed edge $e_{ij}$ along the direction $u_{ij}$. Nodes and edges are embedded with latent vectors in the embedding stage. Initially,

Fig. 1.9 A graph neural network framework that predicts the force on each atom directly from the atomic structure. The trained model can be used as a surrogate with a direct-to-force architecture in molecular simulations. Reprinted with permission from [39]

the node and edge embeddings respectively contain the atom type and interatomic distance information. The embeddings are then iteratively updated during the message passing stage. The final updated edge embeddings are used for predicting the interatomic force magnitudes. The force on the center atom $j$ is calculated by summing the force contributions of neighboring atoms, which are obtained by multiplying the force magnitude with the respective unit vector. The predicted forces are finally used for updating the atomic positions in MD.
Another approach used for accelerating simulations at higher length scales is physics-based ML for materials simulation. Here, the ML model is allowed to learn the equations governing the material response while being constrained by physical laws such as energy, mass, and momentum conservation. Thus, these physics-based ML models allow generic material simulation while also obeying the laws of physics. Some examples of these approaches include the Hamiltonian neural network [40], physics-informed neural network [41], and Lagrangian neural network [42].

1.5 Outline of the Book

This book is organized into three major parts.


1. Part I: The first part gives a broad introduction to the role of ML for materials discovery. The aim of this section is to give a bird's-eye view of the role of AI and ML in materials modeling, discovery, and simulation.
2. Part II: The second part focuses on the basics of ML algorithms. This section provides the mathematical details of different ML algorithms. Further, the code snippets that allow the readers to implement and reproduce each of these algorithms are also given. This section aims to allow the readers to understand the inner workings of ML algorithms and empower them to use these algorithms to solve their own problems of interest.
3. Part III: The third part focuses on the applications of ML to solve several challenges in the materials domain. This section aims to give insights into the problems that have been tackled by AI and ML. Further, through these examples, we also aim to inspire the readers to identify problems in their own domain which can be solved using ML.
The area of ML for materials, being a very active and dynamic one, is evolving at a very high pace. The ideas, methods, and problems discussed in this book are by no means exhaustive. Further, these discussions should not be considered a detailed review of the applications of ML in the materials domain. Rather, the discussions in this book are simply illustrative in nature and exemplify the applications of several ML methods in the context of the materials domain. Moreover, we hope that these discussions inspire the readers to improve upon the state of the art of ML for materials.

References

1. L. Himanen, A. Geurts, A.S. Foster, P. Rinke, Data-driven materials science: status, challenges, and perspectives. Adv. Sci. 6(21), 1900808 (2019). https://doi.org/10.1002/advs.201900808
2. J. Li, K. Lim, H. Yang, Z. Ren, S. Raghavan, P.-Y. Chen, T. Buonassisi, X. Wang, AI applications through the whole life cycle of material discovery. Matter 3(2), 393–432 (2020). https://doi.org/10.1016/j.matt.2020.06.011
3. J.E. Gubernatis, T. Lookman, Machine learning in materials design and discovery: examples from the present and suggestions for the future. Phys. Rev. Mater. 2(12), 120301 (2018). https://doi.org/10.1103/PhysRevMaterials.2.120301
4. Y. Liu, T. Zhao, W. Ju, S. Shi, Materials discovery and design using machine learning. J. Materiomics 3(3), 159–177 (2017). https://doi.org/10.1016/j.jmat.2017.08.002
5. P. Raccuglia, K.C. Elbert, P.D. Adler, C. Falk, M.B. Wenny, A. Mollo, M. Zeller, S.A. Friedler, J. Schrier, A.J. Norquist, Machine-learning-assisted materials discovery using failed experiments. Nature 533(7601), 73–76 (2016)
6. Q. Zhou, P. Tang, S. Liu, J. Pan, Q. Yan, S.-C. Zhang, Learning atoms for materials discovery. Proc. Natl. Acad. Sci. 115(28), E6411–E6417 (2018)
7. A. Fluegel, Statistical regression modelling of glass properties - a tutorial. Glass Technol. - Eur. J. Glass Sci. Technol. Part A 50(1), 25–46 (2009)
8. Q. Ling, H. Zijun, L. Dan, Multifunctional cellular materials based on 2D nanomaterials: prospects and challenges. Adv. Mater. 30(4), 1704850 (2018). https://doi.org/10.1002/adma.201704850
9. D.R. Cassar, A.C.P.L.F. de Carvalho, E.D. Zanotto, Predicting glass transition temperatures using neural networks. Acta Materialia 159, 249–256 (2018). https://doi.org/10.1016/j.actamat.2018.08.022
10. T. Oey, S. Jones, J.W. Bullard, G. Sant, Machine learning can predict setting behavior and strength evolution of hydrating cement systems. J. Amer. Ceramic Soc. 103(1), 480–490 (2020). https://doi.org/10.1111/jace.16706
11. A. Yamanaka, R. Kamijyo, K. Koenuma, I. Watanabe, T. Kuwabara, Deep neural network approach to estimate biaxial stress-strain curves of sheet metals. Mater. Design 195, 108970 (2020). https://doi.org/10.1016/j.matdes.2020.108970
12. R. Kondo, S. Yamakawa, Y. Masuoka, S. Tajima, R. Asahi, Microstructure recognition using convolutional neural networks for prediction of ionic conductivity in ceramics. Acta Materialia 141, 29–38 (2017)
13. J. Ling, M. Hutchinson, E. Antono, B. DeCost, E.A. Holm, B. Meredig, Building data-driven models with microstructural images: generalization and interpretability. Mater. Discov. 10, 19–28 (2017). https://doi.org/10.1016/j.md.2018.03.002
14. F. Ren, L. Ward, T. Williams, K.J. Laws, C. Wolverton, J. Hattrick-Simpers, A. Mehta, Accelerated discovery of metallic glasses through iteration of machine learning and high-throughput experiments. Sci. Adv. 4(4), eaaq1566 (2018). https://doi.org/10.1126/sciadv.aaq1566
15. M. Zaki, Jayadeva, N.A. Krishnan, Extracting processing and testing parameters from materials science literature for improved property prediction of glasses. Chem. Eng. Proc. - Process Intensif. 108607 (2021). https://doi.org/10.1016/j.cep.2021.108607
16. R. Ravinder, K.H. Sridhara, S. Bishnoi, H. Singh Grover, M. Bauchy, Jayadeva, H. Kodamana, N.M.A. Krishnan, Deep learning aided rational design of oxide glasses. Mater. Horizons (2020). https://doi.org/10.1039/D0MH00162G
17. V. Venugopal, S.R. Broderick, K. Rajan, A picture is worth a thousand words: applying natural language processing tools for creating a quantum materials database map. MRS Commun. 9(4), 1134–1141 (2019). https://doi.org/10.1557/mrc.2019.136
18. X. Li, Z. Liu, S. Cui, C. Luo, C. Li, Z. Zhuang, Predicting the effective mechanical property of heterogeneous materials by image based modeling and deep learning. Comput. Methods Appl. Mech. Eng. 347, 735–753 (2019)
19. J. Bernal, K. Kushibar, D.S. Asfaw, S. Valverde, A. Oliver, R. Marti, X. Llado, Deep convolutional neural networks for brain image analysis on magnetic resonance imaging: a review. Artif. Intell. Med. 95, 64–81 (2019)
20. K. Kim, Z. Lee, W. Regan, C. Kisielowski, M.F. Crommie, A. Zettl, Grain boundary mapping in polycrystalline graphene. ACS Nano 5(3), 2142–2146 (2011). https://doi.org/10.1021/nn1033423
21. A. Shekhawat, R.O. Ritchie, Toughness and strength of nanocrystalline graphene. Nat. Commun. 7, 10546 (2016). https://doi.org/10.1038/ncomms10546
22. H.I. Rasool, C. Ophus, W.S. Klug, A. Zettl, J.K. Gimzewski, Measurement of the intrinsic strength of crystalline and polycrystalline graphene. Nat. Commun. 4, 2811 (2013). https://doi.org/10.1038/ncomms3811
23. K. Kaufmann, C. Zhu, A.S. Rosengarten, D. Maryanovsky, T.J. Harrington, E. Marin, K.S. Vecchio, Crystal symmetry determination in electron diffraction using machine learning. Science 367(6477), 564–568 (2020). https://doi.org/10.1126/science.aay3062
24. V.M. Fokin, E.D. Zanotto, N.S. Yuritsyn, J.W.P. Schmelzer, Homogeneous crystal nucleation in silicate glasses: a 40 years perspective. J. Non-Crystall. Solids 352(26–27), 2681–2714 (2006). https://doi.org/10.1016/j.jnoncrysol.2006.02.074
25. E.D. Zanotto, Glass crystallization research - A 36-year retrospective. Part I, fundamental studies. Int. J. Appl. Glass Sci. 4(2), 105–116 (2013). https://doi.org/10.1111/ijag.12022
26. C.J. Simmons, S.W. Freiman, Effects of phase separation on crack growth in borosilicate glass. J. Non-Cryst. Solids, XIIth Int. Congress Glass 38, 503–508 (1980). https://doi.org/10.1016/0022-3093(80)90469-X
27. L. Tang, N.M.A. Krishnan, J. Berjikian, J. Rivera, M.M. Smedskjaer, J.C. Mauro, W. Zhou, M. Bauchy, Effect of nanoscale phase separation on the fracture behavior of glasses: toward tough, yet transparent glasses. Phys. Rev. Mat. 2(11) (2018). https://doi.org/10.1103/PhysRevMaterials.2.113602
28. M.J. Buehler, F.F. Abraham, H. Gao, Hyperelasticity governs dynamic fracture at a critical length scale. Nature 426(6963), 141–146 (2003). https://doi.org/10.1038/nature02096
29. E. Sharon, S.P. Gross, J. Fineberg, Energy dissipation in dynamic fracture. Phys. Rev. Lett. 76(12), 2117–2120 (1996)
30. D.L. Anderson, Through the glass lightly. Science 267(5204), 1618–1618 (1995)
31. Y.-C. Hsu, C.-H. Yu, M.J. Buehler, Using deep learning to predict fracture patterns in crystalline solids. Matter 3(1), 197–211 (2020)
32. L. Weston, V. Tshitoyan, J. Dagdelen, O. Kononova, A. Trewartha, K.A. Persson, G. Ceder, A. Jain, Named entity recognition and normalization applied to large-scale information extraction from the materials science literature. J. Chem. Inf. Model. 59(9), 3692–3702 (2019)
33. M.C. Swain, J.M. Cole, ChemDataExtractor: a toolkit for automated extraction of chemical information from the scientific literature. J. Chem. Inf. Model. 56(10), 1894–1904 (2016). https://doi.org/10.1021/acs.jcim.6b00207
34. A.C. Vaucher, F. Zipoli, J. Geluykens, V.H. Nair, P. Schwaller, T. Laino, Automated extraction of chemical synthesis actions from experimental procedures. Nat. Commun. 11(1), 3601 (2020). https://doi.org/10.1038/s41467-020-17266-6
35. K.T. Mukaddem, E.J. Beard, B. Yildirim, J.M. Cole, ImageDataExtractor: a tool to extract and quantify data from microscopy images. J. Chem. Inf. Model. 60(5), 2492–2509 (2019)
36. V. Tshitoyan, J. Dagdelen, L. Weston, A. Dunn, Z. Rong, O. Kononova, K.A. Persson, G. Ceder, A. Jain, Unsupervised word embeddings capture latent knowledge from materials science literature. Nature 571(7763), 95–98 (2019)
37. V. Venugopal, S. Sahoo, M. Zaki, M. Agarwal, N.N. Gosvami, N.M.A. Krishnan, Looking through glass: knowledge discovery from materials science literature using natural language processing. Patterns 2(7), 100290 (2021). https://doi.org/10.1016/j.patter.2021.100290
38. S. Huang, J.M. Cole, A database of battery materials auto-generated using ChemDataExtractor. Sci. Data 7(1), 1–13 (2020)
39. C.W. Park, M. Kornbluth, J. Vandermause, C. Wolverton, B. Kozinsky, J.P. Mailoa, Accurate and scalable graph neural network force field and molecular dynamics with direct force architecture. npj Comput. Mater. 7(1), 1–9 (2021)
40. S. Greydanus, M. Dzamba, J. Yosinski, Hamiltonian neural networks. Adv. Neural Inf. Proc. Syst. 32, 15379–15389 (2019)
41. G.E. Karniadakis, I.G. Kevrekidis, L. Lu, P. Perdikaris, S. Wang, L. Yang, Physics-informed machine learning. Nat. Rev. Phys. 3(6), 422–440 (2021)
42. M. Cranmer, S. Greydanus, S. Hoyer, P. Battaglia, D. Spergel, S. Ho, Lagrangian neural networks (2020). arXiv:2003.04630
Part II
Basics of Machine Learning

This part covers the basics of machine learning methods used for materials discovery. In Chap. 2, we focus on dataset visualization and preprocessing. Following this, a brief introduction to various ML approaches such as supervised, unsupervised, and reinforcement learning is provided in Chap. 3. Chapters 4 and 5 discuss in detail the supervised algorithms for regression, with the former focusing on parametric methods and the latter on non-parametric methods. Chapter 6 deals with classification and clustering algorithms. Chapter 7 focuses on model refinement using hyperparameter optimization. Chapter 8 provides an overview of advanced ML algorithms and deep learning, such as variational auto-encoders, generative adversarial networks, graph neural networks, and reinforcement learning. Finally, Chap. 9 focuses on the interpretability of black-box ML algorithms.
Chapter 2
Data Visualization and Preprocessing

Abstract ML methods, being purely data driven, rely on the availability of high-quality datasets. However, in reality, datasets may have inconsistencies and errors, and may even be incomplete. Further, the choice of an appropriate ML algorithm for a given dataset depends highly on the nature, size, and distribution of the dataset. In this chapter, we discuss different approaches to visualize data, such as histograms, scatter plots, heat maps, and tree maps. Further, several measures that quantify the data, including central and higher-order measures, are discussed. Next, we discuss several commonly used outlier detection algorithms that enable "data cleaning". Finally, we discuss data-imputation algorithms such as SMOTE and ADASYN for imputing data in imbalanced datasets.

2.1 Introduction

Being a data-driven approach, the reliability of ML models is highly dependent on the quality of the data used to train the model. An ideal (or preferred) dataset used for training ML models should exhibit the following properties.
• Consistent: Data should be generated using the same experimental/simulation
protocol with all the conditions being exactly the same.
• Representative: The dataset should be able to represent the entire domain of interest
reasonably. For example, to classify elements as metals and non-metals, both
metals and non-metals should be present in the dataset.
• Balanced: Data from all the classes should be equally represented in the dataset.
For example, to classify elements as metals and non-metals, both metals and non-
metals should be present in approximately equal numbers in the training data.
• Accurate: The dataset should have accurate values to the appropriate significant
digits to avoid “garbage in, garbage out (GIGO)”.

Supplementary Information The online version contains supplementary material available at https://doi.org/10.1007/978-3-031-44622-1_2.


Unfortunately, many datasets do not fulfill one or more of the above-mentioned


criteria. Thus, it is extremely important to visualize and preprocess the data before
using it to develop ML models. Such visualizations provide deep insights into the
nature of the data and help one plan forward with additional data collection or clean-
ing. Data visualization also enables one to choose the appropriate ML algorithm for
the study. It should be noted that the choice of ML algorithm, although completely up
to the user, is highly dictated by the nature of the data. A mantra to be followed while
choosing the ML models is that the “simplest model that can provide predictions
(with reasonable accuracy) is always the best model”. This aspect of choosing the
ML algorithm will be discussed in detail later.
The rest of the chapter is arranged as follows. First, we will focus on different plots
used to visualize data: bar graphs, heat maps, tree maps, scatter plots, histograms, and
density plots. Then, we will focus on extracting the statistics from the data, including
the central measures (mean, median, and mode) and measures of variability (range,
variance, skewness, and kurtosis). Following this, we will briefly discuss some outlier
detection algorithms and other data augmentation strategies. The present chapter
focuses only on structured data that follows a particular format and is stored in a
machine-readable format, such as comma-separated values. The dataset visualization
of unstructured and semi-structured data is typically performed after data processing
and converting them to a structured data format. Hence, the techniques discussed
in this chapter are applicable for all data formats provided relevant preprocessing is
performed, if required.

2.2 Data Visualization

2.2.1 Bar Graph

A bar graph is a means to visualize categorical data with rectangular bars whose heights are proportional to the values they represent. The bars can be plotted vertically or horizontally. Figure 2.1 shows the bar graph of a dataset of glasses containing sodium silicate, that is, (Na$_2$O)$_x \cdot$(SiO$_2$)$_{(1-x)}$, and calcium aluminosilicate glasses, that is, (CaO)$_y \cdot$(Al$_2$O$_3$)$_z \cdot$(SiO$_2$)$_{(1-y-z)}$, where $x$, $y$, and $z$ represent the mole % of the respective oxides in the glasses and can take any value in the range [0, 1]. The bar plot shows the number of glasses in which each oxide has a non-zero value. We observe that the number of glasses containing Na$_2$O is the least, while the maximum number of glasses have SiO$_2$ present in them. Code snippet 2.1 shows the Python code to reproduce the results.

2.2.2 Heat Map

A heat map is another useful data visualization tool. In this, the input features are
represented in two dimensions (x and y), and the variation of the output property

Fig. 2.1 Number of glasses with components Na$_2$O, CaO, Al$_2$O$_3$, and SiO$_2$

"""
"""
# i m p o r t m a t p l o l i b for p l o t t i n g data
import matplotlib
m a t p l o t l i b . use ( ' agg ' )
i m p o r t m a t p l o t l i b . p y p l o t as plt
from m a t p l o t l i b . p y p l o t i m p o r t s a v e f i g
# i m p o r t n u m p y and p a n d a s
i m p o r t numpy as np
i m p o r t p a n d a s as pd
# load s a m p l e d a t a s e t
data = pd . r e a d _ c s v ( " data / e l a s t i c _ m o d u l u s . csv " )
# print column names
print ( data . c o l u m n s )
n u m _ N a 2 O = sum ( data [ ' Na2O ' ] > 0 )
n u m _ C a O = sum ( data [ ' CaO ' ] > 0 )
n u m _ A l 2 O 3 = sum ( data [ ' Al2O3 ' ] > 0 )
n u m _ S i O 2 = sum ( data [ ' SiO2 ' ] > 0 )
y = [ num_Na2O , num_CaO , num_Al2O3 , n u m _ S i O 2 ]
x = [ " $ N a _ 2 O $ " , " $CaO$ " , " $ A l _ 2 O _ 3 $ " , " $ S i O _ 2 $ " ]
# bar plot using m a t p l o t l i b
plt . bar (x , y , fc = " none " , ec = " k " , hatch = " // " )
plt . y l a b e l ( " N u m b e r " )
plt . l e g e n d ()
plt . gca () . s e t _ a s p e c t ( ' auto ' )
s a v e f i g ( " s a m p l e b a r p l o t . png " )
print ( " End of o u t p u t " )

Output:
Index(['Al2O3', 'CaO', 'SiO2', 'Na2O', 'Young's modulus (GPa)'], dtype='object')
End of output

See Fig. 2.1

Code snippet 2.1: Bar plot



Fig. 2.2 Ternary diagram for Young's modulus of calcium aluminosilicate glasses. The squares represent experimental values. The underlying heat map represents the predictions based on ML models. Details of the work can be found in [1]

is represented using a coloring scheme. The intensity of the color in the x-y space
represents the variation in the property values. Thus, a heat map facilitates the visu-
alization of three-dimensional data in two dimensions. An advantage of a heat map
is that it can provide a quick visual summary of the data sets. Heat maps are also used
to visualize the correlations between two variables (namely, correlation heat maps).
Such heat maps provide a quick way to understand correlations among variables in
a visual manner.
Figure 2.2 shows the ternary diagram for Young’s modulus of calcium aluminosili-
cate glasses [1]. The squares represent the experimental values, while the background
represents the predictions based on an ML model. The coloring scheme on the right
shows the range of values for Young’s modulus. The heat map thus clearly shows the
trends in Young’s modulus values with respect to composition. In addition, it allows
a direct comparison of the model predictions represented by the heat map with the
experimental values represented by the squares.
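As an illustration of a correlation heat map, the following minimal sketch computes the pairwise correlations between the columns of the elastic-modulus dataset used in the other code snippets of this chapter and renders them with matplotlib's imshow; the file path and columns are assumptions carried over from those snippets, not the data behind Fig. 2.2.

import matplotlib
matplotlib.use('agg')
import matplotlib.pyplot as plt
import pandas as pd

# load the same sample dataset used in the other snippets (an assumption)
data = pd.read_csv("data/elastic_modulus.csv")
corr = data.corr()  # pairwise Pearson correlations between all columns

fig, ax = plt.subplots()
im = ax.imshow(corr, cmap="coolwarm", vmin=-1, vmax=1)
# label both axes with the column names
ax.set_xticks(range(len(corr.columns)))
ax.set_yticks(range(len(corr.columns)))
ax.set_xticklabels(corr.columns, rotation=45, ha="right")
ax.set_yticklabels(corr.columns)
fig.colorbar(im, ax=ax, label="Correlation coefficient")
fig.tight_layout()
fig.savefig("sample_correlation_heatmap.png")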

2.2.3 Tree Map

Treemaps are generally used to visualize hierarchical data by a series of nested


rectangles whose area is proportional to the corresponding data value. For example,
elements can be divided into alkali, alkaline earth, pnictogens, chalcogens, halogens,
and noble gases (Fig. 2.3). A tree map can be used to reflect two main properties of
the data set: (i) relative or qualitative values of the desired properties by means of the
chart area, for example, what is the percentage of alkaline earth metals in the total
periodic table, (ii) quantitative values of the desired property which can be expressed
as the color of the individual rectangles with a color legend, for example, exactly
how many alkaline earth metals are present in the periodic table.
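A treemap such as the one in Fig. 2.3 can be sketched, for instance, with the third-party squarify package (an assumption; it is not part of matplotlib), using illustrative element-group counts rather than the exact values behind the figure.

import matplotlib
matplotlib.use('agg')
import matplotlib.pyplot as plt
import squarify  # third-party package: pip install squarify (assumption)

# illustrative (hypothetical) counts of elements per group, not exact values
groups = ["Transition metals", "Lanthanides", "Actinides", "Alkali",
          "Alkaline earth", "Metalloids", "Halogens", "Noble gases"]
counts = [38, 15, 15, 6, 6, 7, 6, 7]

# each rectangle's area is proportional to the corresponding count
squarify.plot(sizes=counts, label=groups, alpha=0.8, ec="k")
plt.axis("off")
plt.savefig("sample_treemap.png")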

Fig. 2.3 Treemap of elements in the periodic table. Note that the elements are grouped into canonical
categories such as alkali, alkaline earth, noble gases, non-metals, lanthanides, metalloids, actinides,
and transition metals

2.2.4 Scatter Plots

A scatter plot uses simple markers (and not continuous lines) to represent values of two or three different variables. Accordingly, the resulting plot will be two- or three-dimensional, respectively. Scatter plots are particularly useful and are among the most widely used plots for analyzing relationships between variables. A scatter plot can also be used to unearth hidden patterns in the data, for example, when data points appear clustered together in the plot. Scatter plots also enable one to identify any unexpected gaps or outliers present in the data.
Figure 2.4 shows the scatter plot of Young's modulus of sodium silicate glasses, (Na$_2$O)$_x \cdot$(SiO$_2$)$_{(1-x)}$, with respect to the silica percentage in the glass. See Code Snippet 2.2 to reproduce the results and the plot. From the scatter plot, we observe that Young's modulus values of the glass compositions lie scattered. Further, for similar compositions having SiO$_2$ percentages of 65%, 70%, or 75%, the values of Young's modulus exhibit huge variations. This suggests the presence of outliers in the data. For instance, if the compositions with Young's modulus values less than 55 GPa are discarded (five data points), Young's modulus exhibits an increasing trend in an average sense with respect to the silica content. Thus, scatter plots provide a

Fig. 2.4 Young's modulus (GPa) with respect to SiO$_2$ (mol %)

clear visualization of the trend in the data while also providing insights into the
outliers.

"""
"""
# i m p o r t m a t p l o l i b for p l o t t i n g data
import matplotlib
m a t p l o t l i b . use ( ' agg ' )
i m p o r t m a t p l o t l i b . p y p l o t as plt
from m a t p l o t l i b . p y p l o t i m p o r t s a v e f i g
# i m p o r t n u m p y and p a n d a s
i m p o r t numpy as np
i m p o r t p a n d a s as pd
# load s a m p l e d a t a s e t
data = pd . r e a d _ c s v ( " data / NS_ym . csv " )
# print column names
print ( data . c o l u m n s )
x = data [ ' SiO2 ' ]
y = data [ " Y o u n g ' s m o d u l u s ( GPa ) " ]
# s c a t t e r plot using m a t p l o t l i b
plt . s c a t t e r ( x , y , m a r k e r = " o " , fc = " none " , ec = " k " )
plt . y l a b e l ( " Y o u n g ' s m o d u l u s ( GPa ) " )
plt . x l a b e l ( r " $ S i O _ 2 $ ( mol % ) " )
plt . l e g e n d ()
s a v e f i g ( " s a m p l e s c a t t e r p l o t . png " )
print ( " End of o u t p u t " )

Output:
Index(['Na2O', 'SiO2', 'Young's modulus (GPa)'], dtype='object')
End of output

See Fig. 2.4

Code snippet 2.2: Scatter plot



2.2.5 Histogram

A histogram is an efficient way of representing the distribution of a property. For


example, the distribution of Young’s modulus of steel or the strength of concrete
blocks can be visualized using histograms. The first step of making a histogram
is to divide the entire range of values into a series of intervals (or bins) and then
calculate the frequency in each interval. Although the intervals need not be of equal
size, they typically are. Then a rectangle is erected over an interval height or area
proportional to the frequency. A histogram may also be normalized to display the
relative frequencies of the properties of the data sets. It should be noted that the
distribution represented by the histogram is highly dependent on the bin size–too
small an interval may lead to noisy representation, and too big an interval might lead
to the loss of information. While there are many approaches to choosing the bin size
of a histogram, two commonly used approaches are the square root rule and Sturges' rule. The square root rule suggests the number of bins as $\sqrt{n}$, whereas Sturges' rule suggests $1 + \log_2(n)$ or, equivalently, $1 + 3.322 \log_{10}(n)$ bins, where $n$ is the size of the sample.
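As a quick check of these rules, the short sketch below computes the suggested bin counts for the elastic-modulus dataset used in the other snippets (an assumption); note that numpy can also apply such rules directly through the bins argument of np.histogram.

import numpy as np
import pandas as pd

# sample size n taken from the elastic-modulus dataset (an assumption)
data = pd.read_csv("data/elastic_modulus.csv")
n = len(data)

bins_sqrt = int(np.ceil(np.sqrt(n)))         # square root rule
bins_sturges = int(np.ceil(1 + np.log2(n)))  # Sturges' rule
print(f"n = {n}: sqrt rule -> {bins_sqrt} bins, Sturges -> {bins_sturges} bins")

# numpy can apply such rules directly via a string argument
counts, edges = np.histogram(data["Young's modulus (GPa)"], bins="sturges")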
Figure 2.5 shows the histogram of Young’s modulus sodium silicate and calcium
aluminosilicate glasses. Code Snippet 2.3 can be used to reproduce the results and
the plot. In the histogram, we clearly observe two distinct distributions, one between
40 and 70 GPa and the other between 80 and 120 GPa. A closer observation of the
original data reveals that the first peak corresponds to Young’s modulus of sodium
silicate glasses and the second one to that of the calcium aluminosilicate glasses.
Note that the same data has been used in the earlier figures as well. Thus, a bi-modal
distribution typically suggests the existence of two different sources from which the
data has been obtained, each of which follows a different distribution. Histogram can
also be used to identify the distribution of the data, outliers present, and any sparse
regions within the data distribution.

Fig. 2.5 Distribution of Young's modulus (GPa)

"""
"""
# i m p o r t m a t p l o l i b for p l o t t i n g data
import matplotlib
m a t p l o t l i b . use ( ' agg ' )
i m p o r t m a t p l o t l i b . p y p l o t as plt
from m a t p l o t l i b . p y p l o t i m p o r t s a v e f i g
# i m p o r t n u m p y and p a n d a s
i m p o r t numpy as np
i m p o r t p a n d a s as pd
# load s a m p l e d a t a s e t
data = pd . r e a d _ c s v ( " data / e l a s t i c _ m o d u l u s . csv " )
# print column names
print ( data . c o l u m n s )
# hist plot using m a t p l o t l i b
plt . hist ( data [ " Y o u n g ' s m o d u l u s ( GPa ) " ] , bins = 16 , range = ( 42 . 5 , 122 . 5 ) ,
fc = " none " , ec = " k " , hatch = " // " )
plt . y l a b e l ( " F r e q u e n c y " )
plt . x l a b e l ( " Y o u n g ' s m o d u l u s ( GPa ) " )
plt . l e g e n d ()
plt . gca () . s e t _ a s p e c t ( ' auto ' )
s a v e f i g ( " s a m p l e h i s t p l o t . png " )
print ( " End of o u t p u t " )

Output:
Index(['Al2O3', 'CaO', 'SiO2', 'Na2O', 'Young's modulus (GPa)'], dtype='object')
End of output

See Fig. 2.5

Code snippet 2.3: Histogram plot

2.2.6 Density Plots

A density plot enables visualization of the distribution of data over a continuous


interval. In other words, a kernel smoothing function applied to histograms provides
a smooth distribution of the variable, namely, density plots. The peaks of the density
plots are helpful in identifying where values are concentrated over the interval. An
advantage of density plots over histograms is their unique ability to determine the
approximate distribution shape. Note that similar to the bin size in the histogram,
bandwidth is a free parameter in the kernel smoothing, which can have a strong
influence on the resulting estimate.
Figure 2.6 shows the density plots of Young’s modulus with different bandwidths.
The raw dataset is represented by the underlying histogram. Code Snippet 2.4 can be
used to reproduce the results and the plot. We observe that the trend in the density
plot can be “noisy” or “smooth” depending on the bandwidth chosen for the kernel
smoothing. To obtain a realistic trend, one should do a parametric study on the
bandwidth and choose an appropriate one to visualize the data using the density
plots.

"""
"""
# i m p o r t m a t p l o l i b for p l o t t i n g data
import matplotlib
m a t p l o t l i b . use ( ' agg ' )
i m p o r t m a t p l o t l i b . p y p l o t as plt
from m a t p l o t l i b . p y p l o t i m p o r t s a v e f i g
# i m p o r t n u m p y and p a n d a s
i m p o r t numpy as np
i m p o r t p a n d a s as pd
# load s a m p l e d a t a s e t
data = pd . r e a d _ c s v ( " data / e l a s t i c _ m o d u l u s . csv " )
# print column names
print ( data . c o l u m n s )
# create sample dataset
y = [ np . r a n d o m . n o r m a l ( loc = 2 . 0 , scale = 1 . 0 ) for i in range ( 1000 ) ]
y + = [ np . r a n d o m . n o r m a l ( loc = 5 .0 , scale = 1 . 0 ) for i in range ( 1000 ) ]
y = np . array ( y )
# d e n s i t y plot using m a t p l o t l i b
r a n g e _ = ( y . min () , y . max () )
binsize = 0.4
bins = int (( r a n g e _ [ 1 ] - r a n g e _ [ 0 ] ) / b i n s i z e )
plt . hist (y , bins = bins , range = range_ , fc = " none " , ec = " k " , hatch = " // " ,
d e n s i t y = True , alpha = 0 . 1 )
binsize = 0.4
bins = int (( r a n g e _ [ 1 ] - r a n g e _ [ 0 ] ) / b i n s i z e )
values , b i n _ e d g e s = np . h i s t o g r a m (y , bins = bins , range = range_ , d e n s i t y =
True )
x = ( bin_edges [:-1]+ bin_edges [1:])/2
plt . plot ( x , v a l u e s )
binsize = 0.2
bins = int (( r a n g e _ [ 1 ] - r a n g e _ [ 0 ] ) / b i n s i z e )
values , b i n _ e d g e s = np . h i s t o g r a m (y , bins = bins , range = range_ , d e n s i t y =
True )
x = ( bin_edges [:-1]+ bin_edges [1:])/2
plt . plot ( x , v a l u e s )
binsize = 1.2
bins = int (( r a n g e _ [ 1 ] - r a n g e _ [ 0 ] ) / b i n s i z e )
values , b i n _ e d g e s = np . h i s t o g r a m (y , bins = bins , range = range_ , d e n s i t y =
True )
x = ( bin_edges [:-1]+ bin_edges [1:])/2
plt . plot ( x , v a l u e s )
plt . y l a b e l ( " P r o b a b i l i t y d e n s i t y f u n c t i o n " )
plt . x l a b e l ( " Y o u n g ' s m o d u l u s ( GPa ) " )
plt . l e g e n d ()
s a v e f i g ( " s a m p l e d e n s i t y p l o t . png " )
print ( " End of o u t p u t " )

Output:
Index(['Al2O3', 'CaO', 'SiO2', 'Na2O', 'Young's modulus (GPa)'], dtype='object')
End of output

See Fig. 2.6

Code snippet 2.4: Density plot
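As a complement to the histogram-based density estimates in Code snippet 2.4, the following minimal sketch applies a Gaussian kernel density estimate (scipy.stats.gaussian_kde) to the same synthetic bimodal sample with three illustrative bandwidths; the bandwidth values are assumptions chosen only to show the noisy-versus-smooth behavior discussed above.

import matplotlib
matplotlib.use('agg')
import matplotlib.pyplot as plt
import numpy as np
from scipy.stats import gaussian_kde

# same synthetic bimodal sample as in Code snippet 2.4
y = np.concatenate([np.random.normal(2.0, 1.0, 1000),
                    np.random.normal(5.0, 1.0, 1000)])
x = np.linspace(y.min(), y.max(), 400)

# underlying histogram for reference
plt.hist(y, bins=30, density=True, fc="none", ec="k", alpha=0.3)
# kernel density estimates with three illustrative bandwidths
for bw in [0.05, 0.2, 1.0]:
    kde = gaussian_kde(y, bw_method=bw)
    plt.plot(x, kde(x), label=f"bandwidth = {bw}")
plt.xlabel("Young's modulus (GPa)")
plt.ylabel("Probability density function")
plt.legend()
plt.savefig("sample_kde_bandwidth.png")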



Fig. 2.6 Distribution of Young's modulus (GPa)

2.3 Extracting Statistics from Data

The nature of the data can be understood qualitatively from data visualization.
However, a quantitative representation requires the extraction of statistics from the
data. These methods of extracting key information from the data are outlined next.
A population is the set of all possible data of the characteristics under investigation.
This population may be finite or infinite in size. However, due to various limitations,
a population may not be fully accessible to anyone. For example, it is impossible
to directly measure Young’s modulus of each grain and phase of a steel sample
with a highly heterogeneous microstructure. To address this issue, all the statistical
measures are defined on a sample, that is, the part or the subset of the population
which is fully accessible. Thus, to quantify Young’s modulus of steel, 100 or 200
measurements may be made at randomly selected or uniformly distributed grid points.
These points are considered to represent the entire microstructure of steel, and the
statistical properties are then extracted from this dataset. For further discussion on
statistical methods, readers are directed to References [2–6].

2.3.1 Central Measures of Data

A measure of the central tendency of a dataset is a single characteristic (number)


that describes the data most appropriately. They are also called the summarizing
statistics of the dataset. The mean (often called average) is the most widely used
central measure of data, along with the median and the mode. They are defined
generally in connection with a population or a sample. For example, the mean is
considered the representative value of Young’s modulus from the distribution of
Young’s modulus of steel.

2.3.1.1 Mean

Mean is the most common central measure used to represent a dataset that is distributed in a continuous fashion. The sample mean is defined as:

$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$    (2.1)

where $n$ is the size of the sample and $x_i$ is the $i$th sample point. Note that $\bar{x}$ may not represent the region where the data is most densely distributed, especially if the distribution is asymmetric. In fact, the mean can even lie in a region where the data is sparse or absent. Further, as the mean is the weighted sum of all the data points, outliers present in the data can significantly affect the location of the mean.

2.3.1.2 Median

For $n$ observations, the sample median $M$ is (i) the $\frac{n+1}{2}$th largest observation for odd values of $n$, or (ii) the mean of the $\frac{n}{2}$th and $(\frac{n}{2}+1)$th largest observations for even values of $n$. Hence, the median is the value that splits the dataset in half. Due to this property, outliers present in the data may not affect the location of the median significantly. In other words, the median is robust against outliers, whereas the mean is sensitive to outliers.
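A small numerical illustration of this contrast, using the five hypothetical Young's modulus measurements (70, 71, 73, 69, 91) GPa that also appear in the range example later in this chapter:

import numpy as np

values = np.array([70.0, 71.0, 73.0, 69.0])    # closely distributed values
with_outlier = np.append(values, 91.0)         # add one outlying measurement

print(np.mean(values), np.median(values))              # 70.75 70.5
print(np.mean(with_outlier), np.median(with_outlier))  # 74.8 71.0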

2.3.1.3 Mode

Mode is the value that occurs with the greatest frequency. If each $x_i$ is unique, the mode is not a meaningful representative of the data. Typically, the mode is the best central measure while dealing with categorical or discrete data.
Consider the dataset on Young’s modulus of the sodium silicate and calcium
aluminosilicate glasses presented in the earlier section. Figure 2.7 shows the mean,
median, and mode for the dataset along with the underlying histogram of the data.
Code Snippet 2.5 can be used to reproduce the results and the plot. Here, we observe
that the mean corresponds to 80 GPa, a value around which there are very few data
points in the raw dataset. This could be attributed to the bimodal nature of the data,
wherein the mean is not a very meaningful central measure. Mode corresponds to
60 GPa, which is the mean Young’s modulus of the dataset consisting of sodium sili-
cate glasses only. Thus, no information about the calcium aluminosilicate glasses in
the dataset is included in the mode. The median, having a value of 85 GPa, represents
the histogram bin having the maximum number of calcium aluminosilicate glasses.
Thus, the median represents a reasonable central measure of the dataset in this case.
It is interesting to note that, for the present dataset, each of the three central measures

Fig. 2.7 Distribution of Young's modulus (GPa)

corresponds to different regions in the dataset—the mean in the sparse region, the
mode in the sodium silicate region, and the median in the calcium aluminosilicate
region.

2.3.2 Measures of Variability

Measures of spread summarize how scattered the data is and how each of the points in the dataset is distributed with respect to the central measure considered. Some of the commonly used measures of spread are range, percentile, and variance. They are also useful for estimating the uncertainty associated with experimental measurements. For example, the spread in the values of Young's modulus from multiple measurements on a homogeneous material provides insights into the accuracy of the measurements. This is typically represented using error bars, which incorporate the measures of variability along with the central measure of the data.

2.3.2.1 Range

The range is the simplest measure of the spread in the dataset. It is defined as the difference between the largest and smallest observations in a sample set. Although it is easy to compute, the range is highly sensitive to outliers. For example, for a dataset given by (70, 71, 73, 69, 91) GPa, representing five measurements of Young's modulus of silica glass, the range is given by $91 - 69 = 22$ GPa. From the dataset, it can be observed that all the values other than 91 GPa are distributed closely, and hence 91 GPa is clearly an outlier. Excluding the outlier 91 GPa, the range

"""
"""
# i m p o r t m a t p l o l i b for p l o t t i n g data
import matplotlib
m a t p l o t l i b . use ( ' agg ' )
i m p o r t m a t p l o t l i b . p y p l o t as plt
from m a t p l o t l i b . p y p l o t i m p o r t s a v e f i g
# i m p o r t n u m p y and p a n d a s
i m p o r t numpy as np
i m p o r t p a n d a s as pd
# load s a m p l e d a t a s e t
data = pd . r e a d _ c s v ( " data / e l a s t i c _ m o d u l u s . csv " )
# print column names
print ( data . c o l u m n s )
y = data [ " Y o u n g ' s m o d u l u s ( GPa ) " ]
mean_ = y . mean ()
m e d i a n _ = y . m e d i a n ()
mode_ = y . mode () [ 0 ]
# hist plot using m a t p l o t l i b
plt . hist (y , bins = 16 , range = ( 42 .5 , 122 . 5 ) , fc = " none " , ec = " k " , hatch = " //
" , alpha = 0 . 5 )
plt . v l i n e s ( mean_ , 0 . 0 , 25 . 0 , lw =3 , color = " k " , label = " Mean " )
plt . v l i n e s ( median_ , 0 .0 , 25 . 0 , lw =3 , ls = " : " , color = " k " , label = " M e d i a n "
)
plt . v l i n e s ( mode_ , 0 . 0 , 25 . 0 , lw =3 , ls = " -- " , color = " k " , label = " Mode " )
plt . y l a b e l ( " F r e q u e n c y " )
plt . x l a b e l ( " Y o u n g ' s m o d u l u s ( GPa ) " )
plt . l e g e n d ()
s a v e f i g ( " s a m p l e h i s t m m m p l o t . png " )
print ( " End of o u t p u t " )

Output:
Index(['Al2O3', 'CaO', 'SiO2', 'Na2O', 'Young's modulus (GPa)'], dtype='object')
End of output

See Fig. 2.7

Code snippet 2.5: Histogram with mean, median, and mode

is $73 - 69 = 4$ GPa. As such, the range is not a good measure of variability for data that may contain outliers with large variations in their values.

2.3.2.2 Percentile and Quartiles

The $p$th percentile is the threshold such that $p\%$ of the observations are at or below this value. It is the $(k+1)$th largest sample point if $np/100$ is not an integer, where $k$ is the largest integer less than $np/100$. The first quartile $(Q_1)$, the second quartile $(Q_2)$, and the third quartile $(Q_3)$ are the 25th, 50th, and 75th percentiles, located at positions $\frac{n+1}{4}$, $\frac{n+1}{2}$, and $\frac{3(n+1)}{4}$, respectively. The second quartile is also known as the median ($M$). A box plot is a convenient graphic representing the range, median, and quartiles. In a box plot, a box is drawn from $Q_1$ to $Q_3$, $Q_2$ (the median) is drawn as a vertical line in the box, and outer lines are drawn either up to the outermost points or to $1.5 \times (Q_3 - Q_1)$, and the length of these lines represents the range.

2.3.2.3 Variance

The variance ($S^2$) and its square root, the standard deviation ($S$), are measures of the variability of the dataset around the mean. They give a picture of the distribution of the data around the mean value. If a dataset is highly dispersed, the points tend to spread farther away from the mean, leading to a high value of the variance and standard deviation, and vice versa. The standard deviation of a normal distribution enables us to calculate confidence intervals. In a normal distribution, about 68% of the values lie within one standard deviation on either side of the mean, about 95% of the values are within two standard deviations of the mean, and about 99.7% of the values are within three standard deviations of the mean. The sample variance is computed by

$S^2 = \sum_{i=1}^{n} \frac{(x_i - \bar{x})^2}{n-1} = \sum_{i=1}^{n} \frac{\left(x_i - \frac{\sum_{i=1}^{n} x_i}{n}\right)^2}{n-1}$    (2.2)

It is to be noted that for population data, the notations used for the mean, variance, and standard deviation are $\mu$, $\sigma^2$, and $\sigma$, respectively.
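A minimal sketch of Eq. (2.2) in code, computing the sample variance and standard deviation of the Young's modulus column from the dataset used in the earlier snippets (an assumption); the ddof=1 argument selects the $n-1$ denominator.

import numpy as np
import pandas as pd

data = pd.read_csv("data/elastic_modulus.csv")
y = data["Young's modulus (GPa)"].to_numpy()

S2 = np.var(y, ddof=1)  # sample variance, (n - 1) in the denominator
S = np.std(y, ddof=1)   # sample standard deviation
print(f"mean = {y.mean():.2f} GPa, variance = {S2:.2f}, std = {S:.2f} GPa")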
Figure 2.8 shows the range and percentiles of a normalized dataset of Young's modulus values. The normalization is performed by subtracting the mean of the distribution from each data point and dividing the value by the standard deviation of the distribution. As such, the values of Young's modulus are distributed between $-2.5$ and 2.5. Code Snippet 2.6 can be used to reproduce the results and the plot.

Fig. 2.8 Distribution of Young's modulus (GPa)

"""
"""
# i m p o r t m a t p l o l i b for p l o t t i n g data
import matplotlib
m a t p l o t l i b . use ( ' agg ' )
i m p o r t m a t p l o t l i b . p y p l o t as plt
from m a t p l o t l i b . p y p l o t i m p o r t s a v e f i g
# i m p o r t n u m p y and p a n d a s
i m p o r t numpy as np
i m p o r t p a n d a s as pd
# load s a m p l e d a t a s e t
data = pd . r e a d _ c s v ( " data / e l a s t i c _ m o d u l u s . csv " )
# print column names
print ( data . c o l u m n s )
y = np . random . randn ( 1000 ) # data [" Young 's m o d u l u s ( GPa ) "]
q25 , q50 , q75 = np . q u a n t i l e ( y , [ 0 . 25 , 0 .5 , 0 . 75 ] )
# hist plot using m a t p l o t l i b
fig , ax1 = plt . s u b p l o t s ( 1 , 1 , s h a r e x = True )
r a n g e _ = ( -3 , 3 )
bins = 100
ax1 . hist (y , bins = bins , range = range_ , ec = " k " , fc = " none " , hatch = " // " ,
alpha = 0 .5 , d e n s i t y = True )
ax1 . b o x p l o t (y , p o s i t i o n s = [ 1 . 1 ] , vert = False )
ax1 . v l i n e s ( q25 , 0 . 25 , 1 . 02 , ls = " : " )
ax1 . h l i n e s ( 0 . 25 , - 3 , q25 , ls = " : " )
ax1 . v l i n e s ( q50 , 0 . 5 , 1 . 02 , ls = " : " )
ax1 . h l i n e s ( 0 . 50 , - 3 , q50 , ls = " : " )
ax1 . v l i n e s ( q75 , 0 . 75 , 1 . 02 , ls = " : " )
ax1 . h l i n e s ( 0 . 75 , - 3 , q75 , ls = " : " )
ax1 . hist (y , bins = 10 * bins , range = range_ , fc = " none " , hatch = " " , alpha = 0 .
5 , d e n s i t y = True , h i s t t y p e = " step " ,
c u m u l a t i v e = True , lw = 1 . 5 )
plt . sca ( ax1 )
plt . y t i c k s ( [ 0 . 25 , 0 . 50 , 0 . 75 , 1 . 00 ] , [ 0 . 25 , " 0 . 50 " , 0 . 75 , " 1 . 00 " ] )
plt . ylim (0 , 1 . 3 )
s a v e f i g ( " s a m p l e q u a r t i l e s p l o t . png " )
print ( " End of o u t p u t " )

Output:
Index(['Al2O3', 'CaO', 'SiO2', 'Na2O', 'Young's modulus (GPa)'], dtype='object')

See Fig. 2.8

Code snippet 2.6: Quartiles and box plot

2.3.3 Higher Order Measures

Two higher-order measures of data representation are skewness and kurtosis. While skewness is a measure of the distortion (asymmetry) of the data, kurtosis is a measure of the heavy-tailed nature of the data relative to a normal distribution.

2.3.3.1 Skewness

Skewness, $G_1$, measures the degree of distortion of the data from the normal distribution. A symmetrical distribution will have a skewness of 0. Skewness is calculated by

$G_1 = \frac{\sqrt{n(n-1)}}{n-2} \sum_{i=1}^{n} \frac{(x_i - \bar{x})^3}{n S^3}$    (2.3)

If $-0.5 \le G_1 \le 0.5$, the data are fairly symmetrical. If $G_1 < -0.5$, the distribution is called negatively skewed, while if $G_1 > 0.5$, it is called positively skewed.

2.3.3.2 Kurtosis

Kurtosis, $\kappa$, is used to describe the extreme values in one tail versus the other and is therefore a measure of the outliers present in the distribution. The Gaussian distribution has a kurtosis value of three. Hence, the excess kurtosis is typically defined to compare the kurtosis of the dataset under consideration with that of the Gaussian distribution, as follows:

$\kappa = \sum_{i=1}^{n} \frac{(x_i - \bar{x})^4}{n S^4} - 3$    (2.4)

A high kurtosis value in a dataset indicates that the data has heavy tails or outliers, and vice versa.
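Skewness and excess kurtosis can be computed, for example, with scipy.stats; the sketch below assumes the same elastic-modulus dataset as the earlier snippets, and the bias=False option applies sample corrections analogous to Eqs. (2.3) and (2.4).

import pandas as pd
from scipy.stats import skew, kurtosis

data = pd.read_csv("data/elastic_modulus.csv")
y = data["Young's modulus (GPa)"]

print("skewness G1:", skew(y, bias=False))
# fisher=True (the default) returns the excess kurtosis, i.e., Gaussian = 0
print("excess kurtosis:", kurtosis(y, fisher=True, bias=False))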

2.4 Outlier Detection and Data Imputing

The development of reliable ML models is highly dependent on the data quality.


Outliers may arise in the data due to various factors such as instrument errors, human
error, measurement conditions, system behavior, or even natural variation in the
data. Detection of outliers is important not only to develop high-fidelity models
but also to identify several anomalous behaviors—for example, some materials or
compositions exhibit a sudden increase/decrease in the property. Outlier detection
presents a major challenge to obtaining clean data in materials science. Several outlier
detection methodologies are available in the literature depending on the nature of data
and training algorithms used, including open-source packages such as PyOD. PyOD
is a Python-based package that includes more than 30 outlier detection algorithms
and ensemble-based methods which use a selected combination of the individual
algorithms following a designed acceptance/rejection criteria.
The outlier detection algorithms can be broadly divided into linear models,
proximity-based models, probabilistic models, ensemble models, and NN-based

models. Most of these algorithms are unsupervised in nature, although they require a threshold for limiting the number of outliers. One exception to this is the extreme boosting-based outlier detection (XGBOD), which is a supervised algorithm for outlier detection. Some of the commonly used outlier detection algorithms (many of which are available in the PyOD package) are listed below; a minimal usage sketch of PyOD follows the list.
1. Linear models
• PCA
• Minimum covariance determinant (MCD)
• One-class SVM (OCSVM)
• Deviation-based outlier detection (LMDD)
2. Proximity-based models
• Local outlier factor (LOF)
• Connectivity-based outlier factor (COF)
• Clustering-based local outlier factor (CBLOF)
• Local correlation integral (LOCI)
• Histogram-based outlier (HBOS)
• k nearest neighbor (kNN)
• Average kNN (AvgkNN)
• Median kNN (MedkNN)
• Subspace outlier detection (SOD)
• Rotation-based outlier detection (ROD)
3. Probabilistic models
• Angle-based outlier detection (ABOD)
• Fast angle-based outlier detection (FastABOD)
• Copula-based outlier detection (COPOD)
• Median absolute deviation (MAD)
• Stochastic outlier selection (SOS)
4. Neural networks
• Autoencoder
• Variational autoencoder (VAE)
• Beta-variational autoencoder (Beta-VAE)
• Single objective generative adversarial active learning (SO_GAAL)
• Multi-objective generative adversarial active learning (MO_GAAL)
• Deep one-class classification (DeepSVDD).
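A minimal PyOD usage sketch with the kNN detector is given below; the contamination value (the assumed fraction of outliers), the choice of detector, and the use of the elastic-modulus dataset are illustrative assumptions, not recommendations.

import pandas as pd
from pyod.models.knn import KNN

data = pd.read_csv("data/elastic_modulus.csv")
X = data.to_numpy()

# contamination is the assumed fraction of outliers (illustrative value)
clf = KNN(contamination=0.05)
clf.fit(X)

labels = clf.labels_            # 0 = inlier, 1 = outlier
scores = clf.decision_scores_   # raw outlier scores (higher = more anomalous)
print("number of flagged outliers:", labels.sum())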

Figure 2.9 shows the comparison of the performance of twelve different outlier detection algorithms on a benchmark dataset available in PyOD. The errors associated with each method are given in the figure within parentheses. The results clearly show that

Fig. 2.9 The comparison of the performance of twelve different outlier detection algorithms based
on benchmark dataset available in PyOD

a trial-and-error approach using different methods is required for the robust identification of outliers. The code to reproduce the results can be obtained from https://github.com/yzhao062/pyod/blob/master/notebooks/benchmark.py. Here, we will discuss some of the commonly used outlier detection methodologies that are model agnostic, that is, those that do not depend on the ML model.

2.4.1 Outlier Detection Based on Standard Deviation

In this method, we first calculate the mean and standard deviation of the data. A data point is identified as an outlier if it is away from the mean by a pre-specified threshold expressed in terms of the standard deviation. That is, if a data point $x_i$ has a $Z$-score satisfying

$\frac{|x_i - \bar{x}|}{S} > k, \quad \text{where } k = 1, 2, \text{ or } 3$    (2.5)

then it is detected as an outlier. In other words, a data point is an outlier if it lies beyond $\bar{x} \pm kS$. For a normally distributed dataset, $1S$, $2S$, and $3S$ cover 68.27%, 95.45%, and 99.73% of the dataset, respectively. However, this method can fail to detect outliers if $S$ is large.
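A minimal numpy sketch of this rule (Eq. 2.5), applied to the Young's modulus column of the dataset used in the earlier snippets (an assumption), with $k = 3$:

import numpy as np
import pandas as pd

data = pd.read_csv("data/elastic_modulus.csv")
y = data["Young's modulus (GPa)"].to_numpy()

k = 3
z = np.abs(y - y.mean()) / y.std(ddof=1)  # |x_i - mean| / S
outliers = y[z > k]
print(f"{len(outliers)} points lie more than {k} standard deviations from the mean")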

2.4.2 Outlier Detection Based on the Median Absolute Deviation (MAD) Approach

The median is a central measure of the data that is less susceptible to outliers. The median absolute deviation (MAD) is calculated as the median of the absolute differences between each point and the median:

$MAD = \mathrm{median}(|x_i - M|), \quad i = 1, \ldots, n$    (2.6)

Then the modified $Z_M$-score is calculated using the MAD value as:

$Z_M = \frac{0.6745\,(x_i - M)}{MAD}$    (2.7)

As a rule of thumb, if $|Z_M|$ is greater than 3, the point is flagged as an outlier.
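A minimal numpy sketch of the MAD-based rule (Eqs. 2.6 and 2.7) on the same assumed dataset:

import numpy as np
import pandas as pd

data = pd.read_csv("data/elastic_modulus.csv")
y = data["Young's modulus (GPa)"].to_numpy()

M = np.median(y)
MAD = np.median(np.abs(y - M))   # median absolute deviation, Eq. (2.6)
Z_M = 0.6745 * (y - M) / MAD     # modified Z-score, Eq. (2.7)
outliers = y[np.abs(Z_M) > 3]    # rule-of-thumb threshold
print(f"{len(outliers)} points flagged by the MAD rule")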

2.4.3 Outlier Detection Using Interquartile Approach

The interquartile range (IQR) is calculated in a manner similar to the range, but using the quartiles. IQR is computed by subtracting the first quartile from the third quartile:

$IQR = Q_3 - Q_1$    (2.8)

The IQR can be used to detect outliers as follows: any data point $x_i$ satisfying

$x_i > Q_3 + 1.5\,IQR \quad \text{or} \quad x_i < Q_1 - 1.5\,IQR$    (2.9)

is a potential outlier.
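A minimal numpy sketch of the IQR rule (Eqs. 2.8 and 2.9) on the same assumed dataset:

import numpy as np
import pandas as pd

data = pd.read_csv("data/elastic_modulus.csv")
y = data["Young's modulus (GPa)"].to_numpy()

Q1, Q3 = np.quantile(y, [0.25, 0.75])
IQR = Q3 - Q1
mask = (y > Q3 + 1.5 * IQR) | (y < Q1 - 1.5 * IQR)
print(f"{mask.sum()} potential outliers outside "
      f"[{Q1 - 1.5 * IQR:.1f}, {Q3 + 1.5 * IQR:.1f}] GPa")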


Note that these approaches are helpful in identifying an outlier at the extremes of
the overall data or selected bins. However, many of the outliers may be present within
the range of the data as well. This is especially common in materials databases where
a particular family of compositions may have outliers that will be very well within
the overall range of the data. Identifying these outliers which are within the data

range requires the use of more sophisticated outlier detection algorithms mentioned
earlier or an outlier ensemble combining multiple of these algorithms.

2.5 Data Augmentation

If the data set size is small or imbalanced, there would be difficulty in training a
desired model. This is because very little information can be extracted from small
data sets, and data-driven modeling generally depends on sufficiently large data
sets for information extraction. To address this issue, data augmentation may be performed, which generates artificial data points and adds them to the dataset. The
main idea of data augmentation techniques is to learn the statistical features and
underlying distribution of the data. Some typical approaches for imputing missing
data are based on the mean and median of the data. For instance, the missing value
for each feature is replaced with the mean or the median of non-missing values of the
respective features. Note that this method does not consider the correlation between
features and does not account for the uncertainty in the imputation. Another approach
is to use kNN for data imputation. Here, first, kNN is used to identify the clusters
based on the original data points. Then a new point is assigned based on how closely
it resembles the points in the selected cluster. For instance, the mean of the input
features of a given cluster can be used to generate a new data point. In addition, the
mean of selected points in the cluster can also be used to generate new points. The
advantage of this approach is that the new point generated will also lie within the
cluster and hence will not disturb the distribution of the data. This idea has been
improved to develop more sophisticated algorithms allowing the data imputation in
targeted regions such as SMOTE, as outlined below.
There are several ML algorithms used for data imputation which directly take into
account the distribution of the data. These approaches may especially be necessary if
there is a significant data imbalance. For instance, from a dataset of images on con-
crete, we aim to identify the images with fracture and without fracture. The number
of images with fracture may be significantly smaller than those without fracture, say
1:99, respectively. In such cases, one easy approach is to use undersampling, wherein
we remove the images in the larger class to make the data balance. However, this
approach leads to a loss of information and suboptimal use of the dataset. An alter-
nate approach is to use oversampling algorithms. A popular approach used to address
oversampling is the synthetic minority oversampling technique (SMOTE). SMOTE
oversamples the minority class by artificially synthesizing data points until the data
is sufficiently balanced. Figure 2.10 shows the SMOTE oversampling approach used
for imbalanced data. In SMOTE, the n-nearest neighbors in the minority class for
each sample in the class are identified. Then, a line is drawn connecting the two
points in the minority classes. New data is generated by randomly identifying points
on this line. Note that the new point can be the midpoint or any other point along the
line connecting two points. Due to the possibility of identifying an infinite number
of points along a line, this approach allows the generation of new points until the

Fig. 2.10 Oversampling for minority data using SMOTE. Reproduced from [7]

data is balanced. An improved version of SMOTE is the adaptive synthetic sampling
method (ADASYN). ADASYN follows the same interpolation idea as SMOTE but adds
random noise to the synthesized points and generates more of them for minority
samples that are harder to learn. Thus, the new points need not lie exactly on the
line connecting two minority data points, making them more realistic.
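
In practice, both SMOTE and ADASYN are available in the imbalanced-learn package, which
is assumed to be installed in the sketch below; the synthetic 1:9 dataset is generated
only to demonstrate the workflow.

from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE, ADASYN

# Synthetic imbalanced dataset with roughly a 1:9 minority-to-majority ratio
X, y = make_classification(n_samples=1000, n_features=10, weights=[0.9, 0.1],
                           random_state=0)
print("Original class counts:", Counter(y))

# SMOTE: interpolate new minority points along lines joining nearest minority neighbours
X_sm, y_sm = SMOTE(random_state=0).fit_resample(X, y)
print("After SMOTE:", Counter(y_sm))

# ADASYN: adaptively generate more synthetic points for harder-to-learn minority samples
X_ad, y_ad = ADASYN(random_state=0).fit_resample(X, y)
print("After ADASYN:", Counter(y_ad))

Note that ADASYN may not return perfectly equal class counts, since the number of
synthetic points per minority sample is decided adaptively.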

2.6 Summary

In conclusion, this chapter has explored various data visualization techniques,


including bar graphs, heat maps, ternary plots, tree maps, scatter plots, and histograms.
Additionally, an examination of data statistics was conducted, encompassing cen-
tral measures such as mean, median, and mode, as well as higher-order measures.
The chapter further delved into outlier detection methods for data cleansing, as well
as imputation and augmentation techniques for handling missing and imbalanced data.
However, it is essential to
acknowledge that the effectiveness of these approaches is contingent upon the spe-
cific dataset under consideration. Data cleaning and outlier detection lack a universal
solution, demanding careful evaluation and thorough analysis tailored to each dataset.
Notably, comprehensive outlier detection involving multiple methods should be con-
ducted after careful data visualization. Extensive evidence supports the
notion that improved outlier detection significantly enhances the performance of
machine learning algorithms. Consequently, the focus is shifting towards attaining
high-quality data rather than solely developing advanced machine learning algo-
rithms for improved outcomes.

References

1. R. Ravinder, K.H. Sridhara, S. Bishnoi, H. Singh Grover, M. Bauchy, Jayadeva, H. Kodamana,
N.M.A. Krishnan, Deep learning aided rational design of oxide glasses. Mater. Horizons (2020).
Publisher: Royal Society of Chemistry. https://doi.org/10.1039/D0MH00162G.
https://pubs.rsc.org/en/content/articlelanding/2020/mh/d0mh00162g. Accessed 10 Aug 2020
2. F. Hu, H. Li, A novel boundary oversampling algorithm based on neighborhood rough set model:
Nrsboundary-smote. Math. Problems Eng. 2013 (2013)
3. R.I. Levin, Statistics for Management (Pearson Education India, 2011)
4. R. Peck, C. Olsen, J.L. Devore, Introduction to Statistics and Data Analysis (Cengage Learning,
2015)
5. J. L. Devore, Probability and Statistics for Engineering and the Sciences (Cengage Learning,
2011)
6. D.C. Montgomery, G.C. Runger, Applied Statistics and Probability for Engineers (Wiley, 2010)
7. S.M. Ross, Introductory Statistics (Academic, 2017)
Chapter 3
Introduction to Machine Learning

Abstract Machine learning algorithms can be broadly divided into three categories
depending on the nature of the “learning” process, namely, supervised, unsupervised,
and reinforcement learning. In this chapter, we introduce these different categories
with the focus on the nature of the tasks for which these algorithms are useful. Specif-
ically, we focus on supervised and unsupervised learning algorithms. The supervised
algorithms may further be classified as parametric and non-parametric algorithms
depending on the mathematical model used to fit the data. These algorithms may
be used for several downstream tasks such as classification, regression, or cluster-
ing. Finally, we discuss the idea of overfitting and underfitting in machine learning
algorithms.

There is really nothing you must be. And there is nothing you must do. There is really
nothing you must have. And there is nothing you must know. There is really nothing you
must become. However, it helps to understand that fire burns, and when it rains, the earth
gets wet.

—Japanese Zen scroll.

The concept of learning from data is deeply rooted in human history, predating
the term “machine learning,” coined in the mid-twentieth century. In fact, learning
from data is a fundamental process deeply ingrained in human cognition. It involves
extracting knowledge and understanding from information gathered through obser-
vation and experience. This ability allows us to recognize patterns, make predic-
tions, and adapt our behavior accordingly. The concept of learning from data has
been leveraged in various domains to develop computational methods that emulate
this human learning process. Consequently, machine learning refers to a broad class
of algorithms that focus on extracting patterns from data, enabling the inference of
meaningful insights. These algorithms continuously self-correct and improve through
experience, leveraging the provided data. Hence, they are commonly referred
to as data-driven methods. By adopting an approach akin to human learning, machine
learning algorithms emulate the ability to acquire knowledge from data. For example,
a domain expert can effortlessly determine the number of grains in the microstructure


of a polycrystalline material or identify various phases in a composite material based


solely on an optical image. Similarly, machine learning algorithms utilize available
data to discern patterns and enhance their performance with increased experience,
often in the form of additional data. Thus, machine learning can be considered a
subset of artificial intelligence (AI).
Artificial intelligence focuses on enabling machines or systems to perform actions
comparable to human capabilities based on stimuli or situations encountered. This
process involves teaching machines to learn and implement functions that map
sequences to actions using various approaches. This represents a paradigm shift from
traditional physics-driven modeling, where computers were explicitly instructed on
what to do. Instead, the emphasis now lies on allowing systems to “learn” how to
perform actions based on available data. Early AI algorithms primarily relied on
rule-based systems and did not incorporate machine learning techniques. However,
the advent of deep learning has significantly accelerated the utilization of machine
learning algorithms within the realm of artificial intelligence. Consequently, machine
learning serves as a crucial subset of AI, specifically dedicated to developing algo-
rithms and systems capable of detecting and comprehending patterns in data, thereby
extending their applicability to previously unexplored domains and scenarios.
In this chapter, we will discuss briefly the machine learning approaches for mate-
rials modeling. Further, we will discuss the major algorithms and their application
to different areas of materials modeling, discovery, and synthesis.

3.1 Machine Learning Paradigm

Machine learning encompasses a diverse set of algorithms and approaches that enable
computers to learn from data, recognize patterns, and make predictions or decisions.
Figure 3.1 shows the machine learning framework and some of the popular algorithms
in each of the categories. By categorizing machine learning algorithms into unsuper-
vised learning, supervised learning, and reinforcement learning, we can effectively
leverage these approaches for materials discovery and research. Note that the large
number of available ML methods can make it overwhelming for a researcher to
choose an appropriate one. While there are no rigorous mathematical guidelines,
there are several experience-based rules of thumb one can follow while selecting a
model. To this end, the guideline provided by scikit-learn is shown in Fig. 3.2.
Note that this is only a heuristic and should not be treated as a strict prescription.

3.1.1 Unsupervised Learning Algorithms

Unsupervised learning algorithms are particularly valuable when dealing with unla-
beled data, where the objective is to uncover hidden patterns or structures within
the dataset. In the context of materials science, unsupervised learning algorithms

Fig. 3.1 The machine learning paradigm includes supervised, unsupervised, and reinforcement
learning algorithms. Some of the algorithms belonging to each class are included

Fig. 3.2 Workflow for the choice of ML algorithm based on dataset size and model complexity,
adapted from the scikit-learn tutorial

can assist in data exploration, clustering, and anomaly detection. Some notable algo-
rithms include:
• Clustering: Clustering algorithms, such as k-means clustering, hierarchical cluster-
ing, and density-based clustering, group similar data points together based on their
feature similarities. This allows for the identification of distinct clusters within the
data, enabling insights into materials classifications or compositions.
• Dimensionality Reduction: Dimensionality reduction techniques, such as Princi-


pal Component Analysis (PCA) and t-SNE, reduce the dimensionality of high-
dimensional datasets while preserving the essential information. By visualizing
the reduced data, researchers can gain insights into the relationships between dif-
ferent variables or identify important features for further analysis.
• Anomaly Detection: Anomaly detection algorithms, including statistical methods
like the Mahalanobis distance and machine learning approaches like Isolation
Forest and Autoencoders, help identify rare or abnormal data points. These algo-
rithms are valuable for detecting outliers or identifying unusual material properties
or behavior within datasets, as illustrated in the sketch below.
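
As a small illustration of unsupervised anomaly detection, the hedged sketch below fits
an Isolation Forest from scikit-learn to a toy two-feature dataset; the data and the
contamination value are assumptions chosen only for demonstration.

import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)

# Toy data: a dense cluster of "normal" samples plus a few far-away points
X_normal = rng.normal(loc=[1.0, 1.0], scale=0.1, size=(200, 2))
X_far = rng.uniform(low=-4.0, high=4.0, size=(5, 2))
X = np.vstack([X_normal, X_far])

# contamination is the assumed fraction of outliers in the data
iso = IsolationForest(contamination=0.03, random_state=0).fit(X)
labels = iso.predict(X)      # +1 for inliers, -1 for detected anomalies

print("Number of detected anomalies:", int((labels == -1).sum()))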

3.1.2 Supervised Learning Algorithms

Supervised learning algorithms play a crucial role in materials modeling, as they


leverage labeled data to build predictive models. By training these models on known
inputs and corresponding outputs, they can make accurate predictions or classifica-
tions on new, unseen data. Some commonly used supervised learning algorithms in
materials science are given below.

• Linear Regression: Linear regression models establish linear relationships between


input features and continuous target variables. These models can be employed to
predict various material properties, such as the relationship between temperature
and thermal conductivity.
• Support Vector Machines (SVM): SVM algorithms find an optimal hyperplane that
separates different classes within the data. They have been successfully applied in
materials science for tasks such as classifying materials into different phases or
predicting the outcome of chemical reactions.
• Decision Trees: Decision tree algorithms construct tree-like models based on
sequential decision rules derived from the input features. These models are inter-
pretable and can be used for tasks such as predicting the mechanical properties of
materials based on their composition and processing parameters.
• Neural Networks: Neural networks, including deep learning architectures like Con-
volutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs),
excel in learning complex representations from data. In materials science, neu-
ral networks have been employed for tasks such as materials classification based
on microstructural images or predicting material performance from experimental
data.

3.1.3 Reinforcement Learning Algorithms

Reinforcement learning algorithms are well-suited for sequential decision-making


tasks in materials science. These algorithms learn through interaction with an envi-
ronment, receiving rewards or penalties based on their actions. Reinforcement learn-
ing has the potential to optimize materials design processes and discover new mate-
rials with desired properties. Some of the noteworthy reinforcement learning algo-
rithms are given below.

• Q-Learning: Q-learning is a model-free reinforcement learning algorithm that


learns an optimal policy by iteratively updating the Q-values, which represent
the expected cumulative rewards for taking specific actions in given states. Q-
learning has been applied to optimize materials processing parameters or discover
novel material compositions.
• Deep Q-Networks (DQN): DQN combines Q-learning with deep neural networks,
enabling the handling of high-dimensional state and action spaces. DQN has been
successful in optimizing complex materials systems, such as developing new mate-
rials with improved performance characteristics.
• Policy Gradient Methods: Policy gradient methods directly learn the optimal policy
by iteratively adjusting the parameters of a policy function. These methods have
been employed in materials science for tasks such as optimizing experimental
conditions or designing new materials with desired properties.
By comprehending the different machine learning paradigms and algorithms avail-
able, materials scientists can effectively leverage the power of data-driven approaches
to accelerate materials discovery, optimize materials’ properties, and unlock new
avenues for innovation in materials science and engineering.

3.2 Parametric and Non-parametric Models

Machine learning models can be broadly categorized into two types: parametric mod-
els and non-parametric models. These models are designed to learn from data and
make predictions or classifications based on the patterns and information present in
the dataset. Understanding the differences between parametric and non-parametric
models is essential in grasping the fundamentals of machine learning and their respec-
tive applications, particularly in the field of materials science.

3.2.1 Parametric Models

Parametric models make strong assumptions about the underlying distribution of the
data. These models have a fixed number of parameters that determine their behavior
and are independent of the size of the dataset. In materials science, parametric models
can be applied to understand the relationship between material properties and specific
features.
For example, consider the case of predicting the mechanical strength of a material
based on its composition. A parametric model like linear regression can assume a lin-
ear relationship between the elemental composition and the mechanical strength. The
model’s parameters, such as the coefficients associated with each element, represent
the influence of the composition on the mechanical strength. Once the parameters are
estimated from a training dataset containing material compositions and correspond-
ing mechanical strengths, the model can make predictions on new compositions by
calculating the weighted sum of the elemental contributions.
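
The sketch below illustrates this idea on entirely synthetic composition-strength data,
assuming scikit-learn is available: a linear model is fitted, and the learned coefficients
play the role of the fixed parameters quantifying each element's influence. All numbers
are invented for illustration.

import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)

# Synthetic data: columns are fractions of two alloying elements (illustrative only)
X = rng.uniform(0.0, 0.3, size=(50, 2))
# Assume a "true" linear rule plus noise: strength = 200 + 400*x1 + 150*x2
y = 200 + 400 * X[:, 0] + 150 * X[:, 1] + rng.normal(0, 5, size=50)

model = LinearRegression().fit(X, y)
print("Intercept:", round(model.intercept_, 1))
print("Coefficients (influence of each element):", np.round(model.coef_, 1))

# The fixed set of fitted parameters is then reused for any new composition
print("Predicted strength for [0.10, 0.05]:", round(model.predict([[0.10, 0.05]])[0], 1))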
Parametric models offer computational efficiency and interpretability, which can
be valuable in materials science. However, they may struggle to capture complex,
nonlinear relationships between the features and the target variable if the underlying
assumptions do not hold.

3.2.2 Non-parametric Models

Non-parametric models, on the other hand, make fewer assumptions about the under-
lying distribution of the data. These models have a flexible structure that can adapt
to the complexity of the dataset. In materials science, non-parametric models can
be used to capture intricate relationships between material properties and multiple
features.
For instance, let’s consider the task of predicting the bandgap energy of a material
based on its crystal structure and elemental composition. A non-parametric model
like the k-nearest neighbors (KNN) algorithm does not impose any specific form of
the relationship. Instead, it identifies the k closest neighbors in the training dataset
with similar crystal structures and compositions and predicts the bandgap energy
based on their average values.
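
A corresponding non-parametric sketch, again assuming scikit-learn and synthetic data,
predicts a target property as the average over the k nearest neighbours in feature space;
note that the model retains the whole training set rather than a fixed set of coefficients.

import numpy as np
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(2)

# Toy descriptors (two features) and a nonlinear target with a little noise
X = rng.uniform(-1, 1, size=(200, 2))
y = np.sin(3 * X[:, 0]) + X[:, 1] ** 2 + rng.normal(0, 0.05, size=200)

# k = 5 neighbours; the prediction is the mean target value of the closest training points
knn = KNeighborsRegressor(n_neighbors=5).fit(X, y)
print("Prediction at (0.2, -0.4):", round(knn.predict([[0.2, -0.4]])[0], 3))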
Non-parametric models excel in capturing complex patterns and relationships in
materials data, even when the underlying distribution is unknown. However, they can
be computationally more intensive and require larger training datasets to generalize
well.

3.2.3 Choosing Between Parametric and Non-parametric


Models

Choosing between parametric and non-parametric models in materials science


depends on various factors. If there is prior knowledge or evidence suggesting a
simple relationship between the features and the target property, a parametric model
like linear regression can provide efficient and interpretable results. For instance, if
previous research indicates a linear correlation between the concentration of a dopant
element and the electrical conductivity of a material, a parametric model can capture
this relationship effectively.
On the other hand, when dealing with complex material systems or when the
underlying relationship is not well understood, non-parametric models like KNN or
decision trees may be more suitable. These models can adapt to the intricacies of the
dataset and handle nonlinear relationships between material properties and features.
It is important to note that the choice of model in materials science is a data-
driven and iterative process. Researchers often experiment with different models
and evaluate their performance using metrics such as mean squared error or accu-
racy. Techniques like cross-validation can also be employed to assess how well a
model generalizes to unseen data. Further details of model training, validation, and
hyperparametric optimizations are discussed in detail later.
In summary, parametric models make strong assumptions about the data distribu-
tion and have a fixed number of parameters, while non-parametric models are more
flexible and can adapt to complex relationships. The selection of a model in mate-
rials science depends on the characteristics of the dataset and the specific problem
at hand. Understanding the differences between these two types of models enables
researchers to make informed decisions and develop effective machine-learning solu-
tions for materials discovery and property prediction.

3.3 Classification and Regression

Classification and regression models are two major classes of supervised ML algo-
rithms. These models are designed to analyze data and make predictions or classifi-
cations based on the patterns and information present in the dataset.

3.3.1 Classification Models

Classification models are used when the target variable or outcome is categorical
or discrete. These models aim to assign input data points into predefined classes
or categories based on the patterns and characteristics present in the dataset. For
example, classification models can be employed to classify materials based on certain
properties or behaviors. Some major algorithms used for classification are as follows.
• Logistic regression takes a set of independent variables or features related to the
material, such as composition, structural characteristics, or spectroscopic data. It
produces a probability value between 0 and 1, representing the likelihood of the
input belonging to a particular class. For example, logistic regression can be used to
predict whether a material is a conductor or an insulator based on its composition
and electronic structure. The model uses a loss function called logistic loss or
cross-entropy loss to measure the discrepancy between the predicted probabilities
and the true class labels.
• Decision trees take a set of input features related to the material, such as compo-
sition, crystal structure, or elemental properties. The output of a decision tree is a
predicted class label or category for a given set of input features. Decision trees
employ various algorithms to optimize the structure of the tree, such as the Gini
impurity or information gain, which determines the splits in the tree that maximize
the separation between different classes. Decision trees can be utilized to classify
materials into different crystal structures based on their elemental composition
and lattice parameters.
• Random forest is an ensemble learning algorithm that combines multiple decision
trees. It takes the same input features as decision trees, consisting of material-
related properties. The output of random forest is the majority vote or average pre-
diction of a set of decision trees within the ensemble, resulting in a final predicted
class label. Random forest combines the individual decision trees by minimizing
the overall classification error or using the entropy-based criterion.
• Support Vector Machines (SVMs) are powerful classification algorithms that take a set of
input features related to the material, such as structural parameters, composition,
or material descriptors. SVM aims to find the optimal hyperplane that maximally
separates the classes by minimizing the hinge loss or maximizing the margin
between the classes. SVM can be applied to classify materials based on their
mechanical properties, such as distinguishing between ductile and brittle materials,
as illustrated in the sketch below.
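
A minimal classification sketch using a support vector classifier on synthetic, labelled
data (say, ductile encoded as 1 and brittle as 0) is given below; the descriptors, labels,
and kernel settings are assumptions made purely to show the workflow.

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

rng = np.random.default_rng(3)

# Two illustrative descriptors per material and a made-up ductile(1)/brittle(0) label
X = rng.normal(size=(300, 2))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

clf = SVC(kernel="rbf", C=1.0).fit(X_tr, y_tr)   # RBF kernel with a default-like C
print("Test accuracy:", round(clf.score(X_te, y_te), 3))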

3.3.2 Regression

Regression models, in contrast to classification models, are used when the target
variable is continuous or numerical. These models aim to predict a value or estimate
a relationship between the input features and the output variable. In materials science,
regression models can be utilized to predict material properties or performance
metrics. Some commonly used regression algorithms are given below.
• Linear regression is a fundamental regression algorithm that takes a set of input
features, such as material composition, processing conditions, or structural proper-
ties. The output of linear regression is a continuous numerical value representing
the predicted property or performance metric. Linear regression minimizes the
sum of squared errors or mean squared errors between the predicted values and
the actual target values.
• Support Vector Regression (SVR) is an extension of SVM for regression tasks. It
takes a set of input features related to the material, such as composition, crystal
structure, or material descriptors. The output of SVR is a continuous numerical


value representing the predicted property or performance metric. SVR aims to find
an optimal hyperplane that minimizes the deviation or error between the predicted
and actual values. It uses loss functions such as epsilon-insensitive loss or squared
epsilon-insensitive loss.
• Random forest regression applies the ensemble learning technique of random forest
to regression tasks. It takes the same input features as random forest classification,
such as material-related properties. The output of random forest regression is the
average prediction of a set of decision trees within the ensemble, resulting in a
continuous numerical value. Random forest regression minimizes the overall error
or discrepancy between the predicted and actual values using metrics like mean
squared error or mean absolute error.
• Neural networks are a versatile class of algorithms that can be used for classifica-
tion and regression tasks. They take a set of input features, such as composition,
structure, or spectroscopic data, related to the material. The output of a neural
network can be a single continuous numerical value representing the predicted
property or a vector of probabilities for multiple categories. Neural networks use
various loss functions depending on the task, such as mean squared error for regres-
sion or cross-entropy loss for classification. Neural networks have been applied to
predict material properties like band gaps or predict the synthesis conditions for
targeted materials.
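
A comparable regression sketch with a random forest on synthetic descriptor-property
data is shown below; the data, the number of trees, and the use of mean absolute error
are illustrative choices rather than recommendations.

import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(4)

# Five toy descriptors and a noisy nonlinear "property" to be regressed
X = rng.uniform(size=(400, 5))
y = 2.0 * X[:, 0] + np.sin(4 * X[:, 1]) + 0.5 * X[:, 2] ** 2 + rng.normal(0, 0.1, 400)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

reg = RandomForestRegressor(n_estimators=200, random_state=0).fit(X_tr, y_tr)
print("Test MAE:", round(mean_absolute_error(y_te, reg.predict(X_te)), 3))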
In materials science, classification models can assist in identifying material classes
or phases, while regression models can help predict material properties or perfor-
mance metrics. These models provide valuable tools for materials discovery, design,
and optimization by leveraging the power of machine learning algorithms.

3.4 Clustering

Clustering is a fundamental technique in machine learning that aims to group simi-


lar data points together based on their intrinsic characteristics. It is an unsupervised
learning method, meaning that it does not require labeled data or predefined classes.
Clustering algorithms enable us to discover patterns, structures, and relationships
within data, making it a valuable tool in materials science for exploring material
compositions, properties, and behaviors. Now, we briefly discuss some of the clus-
tering algorithms.
• K-Means Clustering: K-Means is a popular and widely used clustering algorithm.
It aims to partition the data into K clusters, where K is a user-defined parameter.
Each cluster is represented by its centroid, which is the mean of all the data points
assigned to that cluster. K-means clustering can be employed to group materials
based on their composition, structural properties, or other descriptors. For example,
K-means can be used to cluster materials with similar crystal structures or cluster
alloys based on their elemental compositions.

• Hierarchical Clustering: Hierarchical clustering builds a hierarchy of clusters, also


known as a dendrogram, by recursively merging or splitting clusters based on
a distance metric. It can be agglomerative (bottom-up) or divisive (top-down).
Hierarchical clustering can be used to explore relationships between materials at
different levels of granularity. For instance, it can identify groups of materials with
similar properties and reveal hierarchical relationships between different material
classes.
• Density-Based Spatial Clustering of Applications with Noise (DBSCAN):
DBSCAN groups data points based on their density. It defines clusters as areas of
high density separated by regions of low density. Points that fall in low-density
regions are considered noise or outliers. DBSCAN can be used to identify regions
of high-density materials in a multidimensional space. For example, it can be
applied to cluster materials based on their mechanical properties to identify regions
of similar strength or ductility.
• Gaussian Mixture Models (GMM): GMM assumes that the data points are gener-
ated from a mixture of Gaussian distributions. It models the data as a weighted sum
of Gaussian distributions, where each distribution represents a cluster. GMM can
be utilized to model and cluster materials with complex distributions. For exam-
ple, it can be used to cluster materials based on their thermal conductivity, where
different clusters may represent materials with distinct heat transport mechanisms.
• Self-Organizing Maps (SOM): SOM, also known as Kohonen maps, is an arti-
ficial neural network-based clustering algorithm. It organizes the data in a low-
dimensional grid or map, preserving the topological properties of the input space.
SOM can be employed to visualize and cluster high-dimensional data such as
spectroscopic or imaging data. For instance, it can be used to identify groups
of materials with similar spectral fingerprints or map materials based on their
microstructural features.
• t-Distributed Stochastic Neighbor Embedding (t-SNE): t-SNE is a dimensionality
reduction technique commonly used for visualization of high-dimensional data in
a lower-dimensional space, typically 2D or 3D. It preserves the local structure of
the data while aiming to separate dissimilar data points. t-SNE can be applied to
visualize complex datasets such as spectroscopic or imaging data. For example,
it can help identify clusters or patterns within a dataset based on the similarities
between materials, revealing relationships between material properties or struc-
tures.
• Principal Component Analysis (PCA): PCA is a dimensionality reduction tech-
nique that transforms a high-dimensional dataset into a lower-dimensional space
by finding the principal components, which are orthogonal directions capturing
the maximum variance in the data. PCA can be used to analyze and reduce the
dimensionality of datasets with many variables, such as compositional or spectral
data. It helps identify the most significant variables or features contributing to the
variance in the dataset, facilitating data exploration and pattern recognition.
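
To make the workflow concrete, the sketch below clusters a synthetic "materials" feature
matrix with k-means after reducing it to two principal components; the three hypothetical
families, the number of clusters, and the number of components are assumed values used
only for illustration.

import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

rng = np.random.default_rng(5)

# Synthetic 10-dimensional features for three hypothetical material families
centers = rng.normal(scale=5.0, size=(3, 10))
X = np.vstack([c + rng.normal(scale=0.5, size=(50, 10)) for c in centers])

# Reduce to two principal components, then cluster in the reduced space
X_2d = PCA(n_components=2).fit_transform(X)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_2d)

print("Cluster sizes:", np.bincount(labels))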

In the materials domain, clustering algorithms enable the discovery of material


groups, similarities, and underlying structures within complex datasets. They facil-
itate the identification of materials with similar properties, compositions, or behav-


iors, allowing researchers to gain insights into material relationships, discover novel
materials, and guide the design and optimization of materials for specific applica-
tions. They also allow dimensionality reduction, data distillation, and fingerprinting
of features based on large materials data.

3.5 Reinforcement Learning: Model-Free and Policy Gradient Methods

Reinforcement learning is a branch of machine learning that focuses on enabling an


agent to learn and make decisions through interactions with an environment. It is
a dynamic and iterative process where the agent learns to maximize a cumulative
reward signal by taking appropriate actions in different states. Reinforcement learn-
ing is inspired by how humans and animals learn through trial and error and feedback
from the environment. Below, we discuss some of the RL algorithms for materials
discovery.
• Q-Learning: Q-Learning is a model-free reinforcement learning algorithm that
learns an optimal policy by estimating the values of state-action pairs. It uses a Q-
value function that represents the expected cumulative reward for taking a specific
action in a given state. Q-Learning iteratively updates the Q-values based on the
observed rewards and transitions in the environment. Q-Learning can be applied to
optimize materials synthesis processes. For example, it can be used to determine
the optimal conditions and parameters for a chemical reaction to maximize the
desired material properties or minimize production costs (a minimal sketch of the
tabular Q-value update is given after this list).
• Deep Q-Networks (DQN): DQN is an extension of Q-Learning that utilizes deep
neural networks to approximate the Q-value function. It combines the power of
deep learning with reinforcement learning, allowing for more complex and high-
dimensional state representations. DQN can be employed to optimize materials
characterization or experimental design. It can determine the optimal sequence of
experiments or measurements to gather the most informative data, reducing the
time and cost required to obtain accurate material properties.
• Policy Gradient Methods: Policy gradient methods directly learn a policy function
that maps states to actions without explicitly estimating the value function. These
methods optimize the policy by gradient ascent, iteratively improving the policy
based on the observed rewards. Policy gradient methods can be used to design
materials with specific properties or functionalities. For example, they can optimize
the choice of material compositions or processing parameters to maximize energy
storage in batteries or enhance catalytic activity.
• Proximal Policy Optimization (PPO): PPO is a policy optimization algorithm that
aims to strike a balance between exploration and exploitation. It updates the policy
in small steps, ensuring that the policy changes are within a certain threshold
to maintain stability during learning. PPO can be applied to optimize materials
design in multi-objective scenarios. It can find a trade-off between conflicting
material properties, such as strength and ductility, by exploring the design space
and discovering Pareto-optimal solutions.
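
The following is a deliberately tiny, hypothetical sketch of the tabular Q-learning update
referred to above: an agent adjusts a discretized process "setting" in unit steps and is
rewarded only for reaching an assumed optimal setting. The environment, reward, and all
hyperparameters are invented to show the update rule, not a realistic materials workflow.

import numpy as np

rng = np.random.default_rng(0)

n_states, n_actions = 10, 2          # discrete settings; actions: 0 = decrease, 1 = increase
target = 7                           # assumed optimal setting (hypothetical)
alpha, gamma, eps = 0.1, 0.9, 0.2    # learning rate, discount factor, exploration rate
Q = np.zeros((n_states, n_actions))

for episode in range(500):
    s = rng.integers(n_states)                    # random initial setting
    for step in range(20):
        # epsilon-greedy action selection
        a = rng.integers(n_actions) if rng.random() < eps else int(np.argmax(Q[s]))
        s_next = min(max(s + (1 if a == 1 else -1), 0), n_states - 1)
        r = 1.0 if s_next == target else 0.0      # reward only at the target setting
        # Q-learning update: Q(s,a) <- Q(s,a) + alpha*[r + gamma*max_a' Q(s',a') - Q(s,a)]
        Q[s, a] += alpha * (r + gamma * np.max(Q[s_next]) - Q[s, a])
        s = s_next
        if r > 0:
            break

print("Greedy action per state (1 means 'increase'):", np.argmax(Q, axis=1))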
Reinforcement learning in materials science offers opportunities to optimize var-
ious aspects, such as material synthesis, characterization, design, and control. By
using reinforcement learning algorithms, researchers can discover optimal strate-
gies, policies, or conditions for material development, leading to improved materials
with desired properties, enhanced performance, and reduced costs. Some of the appli-
cations of RL for materials discovery are outlined below.
• Materials Discovery: RL can aid in the discovery of new materials with desired
properties by optimizing the selection of chemical compositions, crystal structures,
or material configurations. For example, RL algorithms can be used to guide the
exploration of the vast compositional and structural space to discover new high-
performance materials. By learning from the feedback on material properties,
RL agents can intelligently navigate through the search space to find promising
candidates. For example, RL can optimize the discovery of novel catalysts by
suggesting compositions and configurations that enhance catalytic activity and
selectivity. By interacting with the environment (chemical reactions) and receiving
rewards based on reaction efficiency or product quality, RL agents can learn to
propose optimal catalyst designs.
• Atomic Structure Optimization: RL can optimize the arrangement of atoms within
a material to minimize energy or maximize desired properties. It can explore the
vast configuration space efficiently and converge to stable or optimized atomic
structures. RL agents can learn from the feedback on energy calculations or prop-
erty evaluations to guide the search toward more favorable atomic arrangements.
For example, RL can optimize the structure of a molecule or a material to achieve
specific properties, such as band gaps or binding energies. By iteratively adjusting
the atomic positions and receiving rewards or penalties based on the calculated
properties, RL agents can learn to converge towards optimized structures.
• Process Optimization: RL can optimize various processes involved in materials
synthesis, such as controlling reaction conditions, optimizing deposition parame-
ters, or fine-tuning manufacturing processes. By learning from the rewards asso-
ciated with process efficiency, product quality, or cost reduction, RL agents can
identify optimal process settings and parameter combinations. For example, RL
can optimize the growth of thin films by controlling deposition parameters such
as temperature, pressure, and precursor flow rates. By exploring the parameter
space and receiving feedback on film quality or desired properties, RL agents can
discover optimal deposition conditions.
• Planning for Automated Materials Synthesis: RL can facilitate the planning and
decision-making in automated materials synthesis systems. RL agents can learn to
optimize the selection of precursor materials, reaction pathways, synthesis condi-
tions, or fabrication steps to achieve desired material properties, performance, or
functionality. For example, RL can optimize the synthesis of complex materials,
such as perovskite compounds, by guiding the selection of precursor materials and
reaction conditions. By receiving rewards based on desired properties or crystal
quality, RL agents can learn to plan the synthesis process and make informed
decisions for efficient and effective material production.
In summary, RL offers promising avenues for materials science applications, includ-
ing materials discovery, atomic structure optimization, process optimization, and
planning for automated materials synthesis. However, RL remains one of the under-
explored areas in machine learning for materials. The adoption of RL algorithms
in materials research necessitates addressing domain-specific considerations, such
as the design of appropriate reward functions tailored to materials properties and
performance. In addition, further research is required to develop specialized RL
algorithms that can handle the intricacies of atomic and molecular systems, account
for the long-range interactions and quantum effects that dictate material behavior,
and effectively optimize the vast combinatorial space of material configurations and
processing parameters.
Similarly, integrating RL with experimental workflows is crucial to bridge the gap
between simulation and real-world materials synthesis and characterization. Devel-
oping RL-driven autonomous systems for intelligent decision-making in materials
synthesis, process optimization, and adaptive experimentation holds great promise
for accelerating materials discovery and optimization. By actively exploring and
advancing RL methodologies specifically tailored for materials science, we can
unlock new opportunities for rational materials design, enable the discovery of novel
materials with tailored properties, optimize complex manufacturing processes, and
ultimately revolutionize the field of materials science and engineering.

3.6 Summary

This chapter introduced the machine learning paradigm and its applications in materi-
als science. We explored unsupervised learning algorithms for clustering and dimen-
sionality reduction, such as k-means, hierarchical clustering, DBSCAN, PCA, and
t-SNE. Supervised learning algorithms for classification and regression tasks in
materials science, including linear and logistic regression, decision trees, random
forests, support vector machines, and neural networks, were also discussed in
detail. Additionally, we highlighted
the potential of reinforcement learning (RL) in materials research, though it remains
relatively under-explored. RL algorithms like Q-Learning, DQN, and PPO offer
opportunities for materials discovery, atomic structure optimization, process opti-
mization, and planning in automated materials synthesis. We also discussed para-
metric and non-parametric models and the primary differences among them. The
remaining chapters of this part of the book will discuss these algorithms in detail.
Chapter 4
Parametric Methods for Regression

Abstract Regression analysis is a statistical approach used to determine the
relationship of a dependent (or output) variable with respect to a few independent
variables. In this chapter, we focus on parametric approaches to regression. We
discuss the mathematical model for regression and how the parameters associated
with each variable are identified through the minimization of a cost function.
Classical parametric approaches include simple linear regression, also known as
ordinary least squares (OLS) regression. Following this, other approaches obtained by
modifying OLS, such as weighted linear regression, stagewise regression, and least
angle regression, are discussed. Finally, logistic regression, a parametric classification
algorithm, is also discussed.

4.1 Introduction

Regression analysis is one of the most widely used approaches to predict or fore-
cast the values of a dependent variable as a function of selected independent
variables. Regression analysis finds several applications, such as predicting the
composition–property relationships in materials or the composition–processing–
structure relationships. For instance, regression analysis can be used to compute
the slope of the elastic region of the stress–strain curve of a material, which
represents Young’s modulus of the material. As discussed in Chap. 3, in ML,
regression analysis can be performed using parametric and non-parametric methods.
Parametric methods include traditional approaches such as OLS and its regularized
versions, weighted regression, and LAR, to name a few.
Parametric approaches assume, a priori, the functional form of the data to
be fitted. For instance, in the framework of linearized elasticity, we assume that the
stress–strain curve of a material follows a straight line passing through the origin, the
slope of which represents Young’s modulus. Thus, prior to regression, we assume


that the mathematical form of the curve is y = mx, or σ = Eε, where σ and ε
represent the stress and strain for a uniaxial loading condition. Young’s modulus E
of the material is the free parameter to be fitted. Thus, E can be obtained through
regression analysis. Once E is known for a material, the stress associated with any
strain can be computed from the equation σ = Eε, provided the strain is within the
elastic limit.
Thus, the basic steps involved in the parametric methods for regression can be
outlined as follows.
1. Identify the dependent and independent variables from the available data. All
possible relevant independent variables should be considered while developing the
model. The selection of these independent parameters can be based on intuition,
the physics of the problem, or expert knowledge.
2. Identify the functional form that fits the data best. These forms can be simple
linear or polynomial forms such as y = a_0 + a_1 x or y = a_0 + a_1 x + a_2 x^2 + · · · +
a_n x^n. In some cases, the model may take more complex forms, such as a power law
or an exponential, based on the physics of the governing problem. If the functional
form is not known a priori, as in the case of many composition–property
relationships, multiple functional forms can be evaluated on the data, and the
best-performing one can be selected.
3. Identify the free parameters that are to be obtained by regression analysis. In the
earlier example of linear regression, the free parameters are a_0 and a_1, while in
the case of a polynomial, the free parameters range from a_0 to a_n.
4. Define a cost function such as the root mean squared errors between the predicted
value and the actual (or measured value), where the predicted value is obtained
from the model considered.
5. Apply analytical or numerical methods to identify the values of free parameters
that minimize the cost function. These values can then be used in the functional
form considered to predict values for unknown cases.
The main advantage of the parametric methods is their interpretability. Owing to
their straightforward functional form, the free parameters provide direct insight into
the weight of each of the independent parameters in governing the output value.
Also, since the functional form is chosen a priori, this approach allows a rational
choice for the mathematical model based on the known physics of the problem, that
is, linear, non-linear, static, or dynamic. Due to these reasons, parametric approaches
have been widely used for more than a century in aiding materials discovery.
In this chapter, we will first focus on the closed-form solution for the gener-
alized form of linear regression. Then we will discuss some iterative approaches,
such as the gradient descent optimizer for solving the regression. Following this,
other approaches inspired by linear regression, such as locally weighted regression,
stepwise regression, and LAR, are discussed. Finally, the chapter concludes with
a discussion on the application of logistic regression to classify data into multiple
labels.

4.2 Closed Form Solution of Regression

First, we focus on the closed-form solution of the generalized form of a linear regres-
sion problem y = a_0 + a_1 x. In the generalized form, the output y can be a function
of multiple input variables x_i, their powers x_i^j, or any combination thereof. We aim
to derive the analytical solution to this generalized problem, which can then be used
to obtain any particular solutions, for example, simple linear regression.
Consider m training samples containing n input variables x_1, x_2, . . . , x_n, where
each x_i ∈ R, and the corresponding labelled output variable y ∈ R. We are interested
in developing a linear regression model of the following form

ŷ := h_θ(x) = θ_1 h(x_1) + θ_2 h(x_2) + · · · + θ_n h(x_n)                    (4.1)

The model presented in Eq. (4.1) encompasses both linear regression models and
regression models that are linear in parameters. The hat notation is added to y to
identify the predicted output ŷ, which may differ from the true output value y
obtained from experimental measurements or physics-based simulations. For
instance, the function h(x) can be taken as x itself for linear regression problems or
as nonlinear polynomial functions (such as x^2, x^3, or x^n), resulting in polynomial
regression. Our aim is to develop a model that maximally represents the available
data. With this in view, we consider all the training examples, and the resulting
dataset is represented in tabular form as shown in Table 4.1.
and the resulting data set is represented in tabular form as shown in Table 4.1.
In Table 4.1, x^(i,j) represents the ith training example of the jth variable, and
y^(i) represents the corresponding output. For calculating θ := [θ_1, . . . , θ_n]^T, a
straightforward approach is to minimize the error (traditionally, the least squared
error) between the model outputs ŷ^(i) = h_θ(x^(i)) and the corresponding true
output values y^(i) for all i data points available in the m training samples. This
results in the following least squares objective function:

Table 4.1 Data matrix for regression modeling

Data index | y        | h(x_1)       | h(x_2)       | … | h(x_j)       | … | h(x_{n−1})     | h(x_n)
1          | y^(1)    | h(x^(1,1))   | h(x^(1,2))   | … | h(x^(1,j))   | … | h(x^(1,n−1))   | h(x^(1,n))
2          | y^(2)    | h(x^(2,1))   | h(x^(2,2))   | … | h(x^(2,j))   | … | h(x^(2,n−1))   | h(x^(2,n))
⋮          | ⋮        | ⋮            | ⋮            |   | ⋮            |   | ⋮              | ⋮
i          | y^(i)    | h(x^(i,1))   | h(x^(i,2))   | … | h(x^(i,j))   | … | h(x^(i,n−1))   | h(x^(i,n))
⋮          | ⋮        | ⋮            | ⋮            |   | ⋮            |   | ⋮              | ⋮
m−1        | y^(m−1)  | h(x^(m−1,1)) | h(x^(m−1,2)) | … | h(x^(m−1,j)) | … | h(x^(m−1,n−1)) | h(x^(m−1,n))
m          | y^(m)    | h(x^(m,1))   | h(x^(m,2))   | … | h(x^(m,j))   | … | h(x^(m,n−1))   | h(x^(m,n))

θ = min_θ J(θ),   J(θ) := (1/2) Σ_{i=1}^{m} ( ŷ^(i) − y^(i) )^2                    (4.2)

To solve Eq. (4.2), we first transform the dataset in Table 4.1 into a compact notation.
Let the ith row of inputs in Table 4.1 be represented as

h[x(i)] := ( h(x^(i,1)), h(x^(i,2)), . . . , h(x^(i,j)), . . . , h(x^(i,n−1)), h(x^(i,n)) )^T                    (4.3)

Then, the input data from all the m training samples can be compactly represented by
stacking the individual rows as

h[X] := [ h[x(1)], h[x(2)], . . . , h[x(i)], . . . , h[x(m−1)], h[x(m)] ]^T                    (4.4)

so that the ith row of h[X] is h[x(i)]^T.

Similarly, the corresponding output samples for training can be represented as

Y := ( y^(1), y^(2), . . . , y^(i), . . . , y^(m−1), y^(m) )^T                    (4.5)

Combining these compact notations in Eqs. (4.3), (4.4), and (4.5), the OLS objective
function in Eq. (4.2) can be rewritten as

θ = min_θ J(θ),   J(θ) := (1/2) (h[X]θ − Y)^T (h[X]θ − Y)                    (4.6)

Here, θ is computed by applying the first-order optimality conditions, that is, by
equating the gradient of Eq. (4.6) to zero and performing the steps shown below.

∇_θ J(θ) = (1/2) ∇_θ ( θ^T h[X]^T h[X] θ − θ^T h[X]^T Y − Y^T h[X] θ + Y^T Y ) = 0        (4.7)
         = (1/2) ( h[X]^T h[X] θ + h[X]^T h[X] θ − 2 h[X]^T Y ) = 0                       (4.8)
  =⇒  h[X]^T h[X] θ = h[X]^T Y                                                            (4.9)
  =⇒  θ = (h[X]^T h[X])^(−1) h[X]^T Y                                                     (4.10)

The value of θ obtained in Eq. (4.10) represents the optimal values of the free
parameters, obtained by minimizing the error between the values predicted by the
mathematical model and the observed (or experimental) values of the output
variable.
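
The closed-form solution of Eq. (4.10) takes only a few lines of NumPy. The sketch below
fits a model that is linear in its parameters, with h(x) = (1, x, x^2), to synthetic noisy
data; the quadratic example and the noise level are assumptions for illustration.

import numpy as np

rng = np.random.default_rng(0)

# Synthetic data generated from a quadratic relationship with noise
x = rng.uniform(-2, 2, size=50)
y = 1.5 + 0.8 * x - 0.5 * x**2 + rng.normal(0, 0.1, size=50)

# Design matrix h[X] with columns 1, x, x^2 (a model linear in its parameters)
H = np.column_stack([np.ones_like(x), x, x**2])

# Closed-form solution of Eq. (4.10): theta = (H^T H)^{-1} H^T y
theta = np.linalg.solve(H.T @ H, H.T @ y)
print("Fitted parameters:", np.round(theta, 3))

# np.linalg.lstsq is usually preferred in practice for better numerical stability
theta_lstsq, *_ = np.linalg.lstsq(H, y, rcond=None)
print("lstsq parameters:  ", np.round(theta_lstsq, 3))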

4.3 Iterative Approaches for Regression

Even though the closed-form expression to calculate θ provided in Eq. (4.10) is quite
convenient, there are several issues associated with it, as outlined below.
1. Existence of an inverse: If the matrix (h[X]^T h[X]) is rank deficient, the inverse
(h[X]^T h[X])^(−1) will not exist. This may occur if any two rows or columns of
h[X] are linearly dependent on each other, which is quite likely in a large dataset.
2. Numerical instabilities: If there are a large number of entries in h[X], calculating
the inverse (h[X]^T h[X])^(−1) may be numerically unstable. Numerical
instabilities while computing the inverse can also occur if there are huge (order-
of-magnitude) variations in the dataset, which is quite likely in a realistic dataset.
In short, when dealing with a large amount of data in which the variables are expected
to be correlated with each other, the closed-form expression given by Eq. (4.10),
though seemingly simple to compute, may yield undesirable results. To address these
issues, iterative approaches are commonly used to obtain a solution close to the
minimum, as discussed in the next section.

4.3.1 Gradient Descent Optimizer

While closed-form solutions are simpler and desirable for computing the optimal
solution, in practical cases they often fail, as mentioned earlier. In such cases, one
needs to opt for iterative approaches for solving the least squares objective function
given by Eq. (4.2). The canonical algorithm for iteratively solving Eq. (4.2)
is the gradient descent algorithm, also known as the steepest descent algorithm.
Figure 4.1 shows the gradient descent algorithm applied to a simple
parabolic function to reach the minimum. In a gradient descent algorithm, one
starts with an initial guess for the free parameters θ and repeatedly updates the guess

Fig. 4.1 Gradient descent

value of θ such that J(θ) in Eq. (4.2) is minimized. The direction of the iterative
search is along the rate of change of J(θ), represented by ∂J(θ)/∂θ_j, so as to reach
the minimum in the fewest possible steps. The parameter update stalls when the
gradient of J(θ) is zero or close to zero, indicating that a minimum has been reached.
The gradient descent update for the jth parameter, θ_j, is given by

θ_j := θ_j − α ∂J(θ)/∂θ_j ,   j = 1, . . . , n                    (4.11)

The negative sign in the update rule in Eq. (4.11) indicates that the search proceeds
along the negative gradient, that is, in the direction of decreasing J(θ). The parameter
α ∈ (0, 1) in Eq. (4.11) is called the learning rate or the forgetting factor. A higher
value of α results in longer search steps, and a lower value of α in shorter search steps.
The value of α has to be tuned properly to achieve an optimal trade-off between
reaching the minimum and faster convergence of the algorithm. Code Snippet 4.1 can
be used to reproduce the results and the plots in Fig. 4.1.

Note: The authors recommend that readers run the code with different values of
α to understand the role of the learning rate in converging towards the minimum.

4.3.2 Gradient Descent Approach for Linear Regression

In this subsection, we derive the gradient descent approach for computing the param-
eters of the regression model given by Eq. (4.1). This method is also known as the
least mean square (LMS) update or the Widrow-Hoff learning rule. Assume we have

""" G r a d i e n t D e s c e n t
"""
# i m p o r t m a t p l o l i b for p l o t t i n g data
import matplotlib
m a t p l o t l i b . use ( ' agg ')
i m p o r t m a t p l o t l i b . p y p l o t as plt
from m a t p l o t l i b . p y p l o t i m p o r t s a v e f i g
# import numpy
i m p o r t n u m p y as np
# P a r a b o l a with its m i n i m u m at x = 2 . 0
def f ( x ) :
r e t u r n 0 . 5 * ( x - 2 ) ** 2
# D e f i n e f u n c t i o n to c a l c u l a t e g r a d i e n t
def g r a d i e n t _ f ( x ) :
return x-2
# I n i t i a l p o s i t i o n at x = 6 . 0
x0 = 6 . 0
# define step size
stepsize = 0.1
# update position
def x_new ( x0 , stepsize , g r a d i e n t ) :
r e t u r n x0 - s t e p s i z e * g r a d i e n t
# I t e r a t e this for 100 steps
x _ v a l u e s = [ x0 ]
f _ v a l u e s = [ f ( x0 ) ]
x_ = x0
for i in r a n g e ( 100 ) :
g r a d i e n t = g r a d i e n t _ f ( x_ )
x_ = x_new ( x_ , stepsize , g r a d i e n t )
x _ v a l u e s . a p p e n d ( x_ )
f _ v a l u e s . a p p e n d ( f ( x_ ) )
# Print values
p r i n t ( ' I n i t i a l x : { :. 3f } , f ( x ) = { :. 3f } '. f o r m a t ( x0 , f ( x0 ) ) )
p r i n t ( ' F i n a l x : { :. 3f } , f ( x ) = { :. 3f } '. f o r m a t ( x_ , f ( x_ ) ) )
# S c a t t e r plot using m a t p l o t l i b
xs = np . a r a n g e ( - 3 , 7 , 0 . 01 )
plt . plot ( xs , [ f ( i ) for i in xs ] , c = 'k ' , l a b e l = 'f ( x ) ')
plt . s c a t t e r ( x_values , f_values , c = 'r ' , l a b e l = ' ' , s = 80 , lw = 0 .5 , ec = 'k ')
plt . xlim ( [ -3 , 7 ] )
plt . x l a b e l ( " x " )
plt . y l a b e l ( " f ( x ) " )
plt . text ( x0 , f ( x0 ) , " x0 " , ha = " r i g h t " )
plt . l e g e n d ()
s a v e f i g ( " g r a d i e n t _ d e s c e n t . png " )
p r i n t ( " End of o u t p u t " )

Output:

Initial x: 6.000, f(x) = 8.000


Final x: 2.000, f(x) = 0.000
End of output

See Fig. 4.1

Code snippet 4.1: Gradient descent plot



only one training example .(x, y). In this case, the gradient term . ∂θ∂ j J (θ j ) for the
regression model Eq. (4.1) is computed as

∂ ∂ 1
. J (θ ) = (h θ (x) − y)2 , j = 1, . . . , n (4.12)
∂θ j ∂θ j 2
1 ∂
. = 2× (h θ (x) − y) (h θ (x) − y) (4.13)
2 ∂θ j
∂ ∑
n
. = (h θ (x) − y) ( θi xi − y) (4.14)
∂θ j i=1
. = (h θ (x) − y)x j (4.15)

Using Eq. (4.15), the gradient descent update for the .ith training example is given as

θ := θ j + α(y (i) − h θ (x (i) ))x (i)


. j j , j = 1, . . . , n (4.16)

However, the problem with Eq. (4.16) is that the parameter update is performed
considering only one training example. If there is more than one training example,
which is usually the case, we need to generalize this method. To this end, one
approach is to update the parameters θ considering the contributions from all the
training examples. This approach is called the batch LMS update, as indicated in
Algorithm 1.

Algorithm 1: Batch LMS

Data: m training examples
Result: θ_j, j = 1, . . . , n
initialize θ_j;
while not converged do
    θ_j := θ_j + α Σ_{i=1}^{m} (y^(i) − h_θ(x^(i))) x_j^(i),   j = 1, . . . , n

Algorithm 2: Stochastic LMS

Data: m training examples
Result: θ_j, j = 1, . . . , n
initialize θ_j;
while (optionally) until the desired minimum is reached do
    for i = 1, . . . , m do
        θ_j := θ_j + α (y^(i) − h_θ(x^(i))) x_j^(i),   j = 1, . . . , n
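
A minimal NumPy sketch of the two update schemes for a simple two-parameter linear model
on synthetic data is given below. The learning rate and number of passes are arbitrary
illustrative choices, and the batch gradient is averaged over the m examples (rather than
summed as in Algorithm 1) purely to keep the step size stable.

import numpy as np

rng = np.random.default_rng(0)

# Synthetic data with "true" parameters theta = (2.0, -1.0)
m = 100
X = rng.normal(size=(m, 2))
y = X @ np.array([2.0, -1.0]) + rng.normal(0, 0.05, size=m)
alpha = 0.05

# Batch LMS: each update uses the error accumulated over all m training examples
theta_b = np.zeros(2)
for epoch in range(200):
    theta_b += alpha * X.T @ (y - X @ theta_b) / m

# Stochastic LMS: the parameters are updated after every single training example
theta_s = np.zeros(2)
for epoch in range(20):
    for i in range(m):
        theta_s += alpha * (y[i] - X[i] @ theta_s) * X[i]

print("Batch LMS estimate:     ", np.round(theta_b, 3))
print("Stochastic LMS estimate:", np.round(theta_s, 3))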

In batch LMS, the magnitude of the update is proportional to the error given by
(y^(i) − h_θ(x^(i))). This means that a more significant change to the parameters is
made when h_θ(x^(i)) deviates more from y^(i). Conversely, for a training example on
which the prediction nearly matches the actual value y^(i), the parameter change
is minimal. Note that LMS can be susceptible to local minima and may converge
4.3 Iterative Approaches for Regression 69

Fig. 4.2 Linear regression

Fig. 4.3 Locally weighted


linear regression

to the nearest local minima instead of finding the global optimal solution. However,
this is not a concern in the case of linear regression as it has only one global solution.
As such, the gradient descent approach always converges to the global solution in
linear regression.
Figure 4.2 shows an example of linear regression on a set of data points represented by open circles. The equation of a straight line passing through the origin
is $\hat{y} = h_\theta(x) = mx$, where $x$ and $y$ correspond to the independent and dependent variables, respectively, and $m$ is the slope, which is the free parameter to be fitted. First,
we start with an initial value of $m$, $m_0 = 0$. This corresponds to a line parallel to the
x-axis. The error or loss is then calculated as


$$\text{loss} = \sum_{i=1}^{n} \left(y^{(i)} - h_\theta(x^{(i)})\right)^2 = \sum_{i=1}^{n} \left(y^{(i)} - 0 \times x^{(i)}\right)^2 = \sum_{i=1}^{n} \left(y^{(i)}\right)^2 \qquad (4.17)$$

In this case, the loss comes out to be a large number, about 350 k (see Fig. 4.2). The value
of the slope .m is then updated to a positive value .m 1 according to the direction of
decreasing loss as provided by the gradient. Consequently, we observe that the error
is decreased. The procedure is continued until the loss converges to the minimum.
Note that the stepsize, as marked by the change in the value of .m, also decreases
with decreasing loss. The final fitted line represents the optimal model with mini-
mum error regressed through the points considered. Code Snippet 4.2 can be used to
reproduce the results and plots.

Note: The authors recommend that readers run the code for different initial values
of $m$ and learning rate $lr$ to understand the effects of the initial values and the learning
rate on converging towards the global minimum.

In batch LMS, the algorithm scans through every datapoint present in the training
set on every single step. Thus, the entire training set is considered before an update
is performed on the parameters, leading to a computationally expensive operation if
the training dataset is large. To address this issue, a stochastic version of the LMS
is used. In stochastic LMS, each time a training datapoint is encountered, the
parameters are updated according to that single training datapoint only. Therefore,
stochastic LMS can start the update right away from the first training datapoint itself.
Thus, stochastic LMS enables faster convergence to optimal $\theta$ values in comparison
to batch LMS. However, it should be noted that stochastic LMS suffers from
the disadvantage that it may never converge to the actual minimum. This is because
the update in each step of stochastic LMS is based on a single training datapoint
instead of the entire training sample, so the parameter values will be such that
$J(\theta)$ oscillates around the minimum without necessarily converging to the exact
value. Nevertheless, when the training set is large, stochastic LMS is often preferred
over batch LMS considering the trade-off between the accuracy of the solution and
the computational cost.
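
To make the contrast between Algorithms 1 and 2 concrete, the following minimal sketch (not one of the numbered code snippets of this chapter) fits a line to a small synthetic dataset using both update rules; the synthetic data, learning rate, and variable names are our own illustrative choices.

""" Batch vs stochastic LMS updates (illustrative sketch) """
import numpy as np
np.random.seed(0)
# Synthetic data: y = 3*x + noise, with a bias column of ones
m = 50
x = np.random.uniform(-1, 1, m)
X = np.column_stack([np.ones(m), x])              # shape (m, 2)
y = 3.0 * x + 0.1 * np.random.randn(m)
alpha = 0.01
# Batch LMS (Algorithm 1): each update uses all m training examples
theta_batch = np.zeros(2)
for step in range(100):
    residual = y - X @ theta_batch                # (y^(i) - h_theta(x^(i))) for all i
    theta_batch = theta_batch + alpha * (X.T @ residual)
# Stochastic LMS (Algorithm 2): one update per training example
theta_sgd = np.zeros(2)
for epoch in range(100):
    for i in range(m):
        residual_i = y[i] - X[i] @ theta_sgd
        theta_sgd = theta_sgd + alpha * residual_i * X[i]
print("Batch LMS theta     :", theta_batch)
print("Stochastic LMS theta:", theta_sgd)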

""" L i n e a r R e g r e s s i o n
"""
# i m p o r t m a t p l o l i b for p l o t t i n g data
import matplotlib
m a t p l o t l i b . use ( ' agg ')
i m p o r t m a t p l o t l i b . p y p l o t as plt
from m a t p l o t l i b . p y p l o t i m p o r t s a v e f i g
i m p o r t m a t p l o t l i b . p a t c h e s as p a t c h e s
# i m p o r t n u m p y and s k l e a r n
i m p o r t n u m p y as np
i m p o r t p a n d a s as pd
# Create sample dataset
X = np . a r a n g e ( - 10 , 11 , 1 )
X_ = np . a r a n g e ( - 20 , 21 , 1 )
y = 30 * X + 10 * np . r a n d o m . r a n d n ( len ( X ) )
def l i n e a r _ m o d e l ( x , m ) :
return m*x
def loss (x , y , m ) :
r e t u r n np . sum ( 0 . 5 * ( l i n e a r _ m o d e l ( x , m ) - y ) ** 2 )
def l o s s _ g r a d i e n t ( x , y , m ) :
temp = l i n e a r _ m o d e l ( x , m ) - y
g_m = np . sum ( x * temp )
r e t u r n g_m
m = 0
lr = 0 . 0002
l o s s _ = [ loss (X , y , m ) ]
m_ = [ m ]
fig , axs = plt . s u b p l o t s ( 1 , 2 )
plt . sca ( axs [ 0 ] )
plt . plot ( X_ , l i n e a r _ m o d e l ( X_ , m ) , c = " k " )
for i in r a n g e ( 50 ) :
g_m = l o s s _ g r a d i e n t ( X , y , m )
m = m - lr * g_m
plt . plot ( X_ , l i n e a r _ m o d e l ( X_ , m ) , c = " k " , a l p h a = 0 . 2 * ( 1 - ( i + 1 ) / 50 ) )
l o s s _ + = [ loss (X , y , m ) ]
m_ + = [ m ]
plt . s c a t t e r ( X , y , s = 60 , ec = " k " , fc = " none " )
plt . x l a b e l ( " $x$ " )
plt . y l a b e l ( " $y = mx$ " )
plt . grid ( ' on ' )
plt . xlim ( - 20 , 20 )
plt . ylim ( - 300 , 300 )
plt . sca ( axs [ 1 ] )
plt . s c a t t e r ( m_ , loss_ , s = 60 , ec = " k " , fc = " none " )
plt . y t i c k s ( [ 100000 , 200000 , 3 0 0 0 0 0 ] , [ " 100k " ," 200k " ," 300k " ] )
plt . x l a b e l ( " $m$ " )
plt . y l a b e l ( " $ l o s s ( m ) $ " )
s a v e f i g ( " l i n e a r _ r e g r e s s i o n . png " )
p r i n t ( " End of o u t p u t " )

Output:

End of output

See Fig. 4.2

Code snippet 4.2: Linear regression using gradient descent. (a) A linear equation
$y = mx$ is fitted through the training data points represented by open circles. The
light lines represent the models corresponding to different values of $m$, namely,
$m_0, m_1, m_2, \ldots, m_n$, obtained during each update step

4.3.3 Least Squares: A Probabilistic Interpretation

In practical conditions, the experimental measurements can be affected by several


factors including measurement accuracy, environmental conditions, human errors,
sample impurity, or imperfections, to name a few. These factors are extremely dif-
ficult to model and can affect the measured values of the output. To account for
such variations, a probabilistic approach can be incorporated in the OLS regression,
wherein the effects of all the auxiliary variables which are not considered in the
equation are modeled using a random noise.
Assume that the outputs and the inputs are related via

$$y^{(i)} = \theta^T x^{(i)} + \epsilon^{(i)} \qquad (4.18)$$

where $\epsilon^{(i)}$ is an error term that captures either unmodeled effects or random noise, and the $\epsilon^{(i)}$ are
independently and identically distributed (IID) according to a Gaussian distribution
with mean 0 and standard deviation $\sigma$, that is, $\epsilon^{(i)} \sim \mathcal{N}(0, \sigma^2)$. Hence, the probability
distribution of $\epsilon^{(i)}$ can be given as

$$p(\epsilon^{(i)}) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{(\epsilon^{(i)})^2}{2\sigma^2}\right) \qquad (4.19)$$

From Eq. 4.18, $\epsilon^{(i)}$ can be written as $y^{(i)} - \theta^T x^{(i)}$. Incorporating this in Eq. 4.19, the
probability of $y^{(i)}$ given $x^{(i)}$ can be written in terms of the parameter $\theta$ as

$$p(y^{(i)} \mid x^{(i)}; \theta) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{(y^{(i)} - \theta^T x^{(i)})^2}{2\sigma^2}\right) \qquad (4.20)$$

Extending this expression to $X$, which contains all the $x^{(i)}$'s, and $\theta$, the distribution
of $Y$ comprising all the $y^{(i)}$'s is given by $p(Y \mid X; \theta)$. The explicit representation of
these distributions as a function of $\theta$ is expressed in terms of the likelihood function
$L(\theta)$, given by

$$L(\theta) = L(\theta; X, Y) = \prod_{i=1}^{m} p(y^{(i)} \mid x^{(i)}; \theta) \qquad (4.21)$$

Considering the IID assumption on the $\epsilon^{(i)}$, $L(\theta)$ can be written as

$$L(\theta) = \prod_{i=1}^{m} \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{(y^{(i)} - \theta^T x^{(i)})^2}{2\sigma^2}\right) \qquad (4.22)$$

Similar to minimizing the loss function in OLS regression, to obtain the optimal
values of $\theta$, we apply maximum likelihood estimation. In the maximum
likelihood estimation of $\theta$, we obtain the values of $\theta$ that maximize $L(\theta)$. In

practice, this approach is challenging due to the presence of the exponential in the
$L(\theta)$ term. To address this issue, the logarithm of $L(\theta)$ is considered, which reduces
the expression to a more straightforward form. In addition, the logarithm, being
a monotonically increasing function, will not alter the maximal points. Thus, we
calculate the loglikelihood $\ell(\theta) = \ln(L(\theta))$ for simplicity and tractability as below.

$$\ell(\theta) = \ln \prod_{i=1}^{m} \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{(y^{(i)} - \theta^T x^{(i)})^2}{2\sigma^2}\right) \qquad (4.23)$$

$$= \underbrace{m \ln \frac{1}{\sqrt{2\pi}\,\sigma}}_{\text{constant}} - \underbrace{\frac{1}{\sigma^2}}_{\text{scaling}} \cdot \frac{1}{2}\sum_{i=1}^{m} \left(y^{(i)} - \theta^T x^{(i)}\right)^2 \qquad (4.24)$$

Hence, maximizing $\ell(\theta)$ is equivalent to minimizing $\frac{1}{2}\sum_{i=1}^{m} (y^{(i)} - \theta^T x^{(i)})^2$, the least
squares objective. In other words, under the probabilistic assumptions on the data,
OLS regression corresponds to finding the maximum likelihood estimate of $\theta$.
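
This equivalence can also be checked numerically. The short sketch below (an illustration of ours, not one of the numbered code snippets) evaluates the loglikelihood of Eq. (4.24) and the least squares objective over a grid of candidate $\theta$ values for a one-parameter model $y = \theta x$ with Gaussian noise, and confirms that both are optimized at the same $\theta$; the data and the noise level $\sigma$ are assumed for illustration.

""" Log-likelihood vs least-squares objective (illustrative sketch) """
import numpy as np
np.random.seed(1)
sigma = 0.5
x = np.linspace(-2, 2, 40)
y = 1.7 * x + sigma * np.random.randn(len(x))   # true theta = 1.7
thetas = np.linspace(0, 3, 301)
def log_likelihood(theta):
    # l(theta) = m*ln(1/(sqrt(2*pi)*sigma)) - (1/(2*sigma**2)) * sum (y - theta*x)^2
    residuals = y - theta * x
    return (len(x) * np.log(1.0 / (np.sqrt(2 * np.pi) * sigma))
            - np.sum(residuals ** 2) / (2 * sigma ** 2))
def least_squares(theta):
    return 0.5 * np.sum((y - theta * x) ** 2)
ll = np.array([log_likelihood(t) for t in thetas])
ls = np.array([least_squares(t) for t in thetas])
print("theta maximizing log-likelihood:", thetas[np.argmax(ll)])
print("theta minimizing least squares :", thetas[np.argmin(ls)])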

4.4 Locally Weighted Linear Regression (LWR)

Although linear regression fits a straight line, with minor modifications, OLS can
be used for nonlinear data as well. This approach is based on the assumption that
a nonlinear function can be approximated as a combination of several piece-wise
linear functions. This approach is known as the linearisation of nonlinear functions.
LWR is an approach inspired by the linearisation of nonlinear functions, having
multiple instance-based models for a particular dataset. In linear regression, we fit $\theta$
to minimize the loss function $\sum_i (y^{(i)} - \theta^T x^{(i)})^2$ during training and calculate the output
as $\theta^T x$. In contrast, LWR is an online algorithm. Here, during the training process,
we fit $\theta$ to minimize $\sum_i w^{(i)} (y^{(i)} - \theta^T x^{(i)})^2$, and the prediction is calculated
as $\theta^T x$, where the $w^{(i)}$ are non-negative ($\geq 0$) weights. Thus, when $w^{(i)}$
takes a large value, the penalization of the loss $(y^{(i)} - \theta^T x^{(i)})^2$ for that particular training
datapoint is high, and when $w^{(i)}$ is small, the penalization of the loss $(y^{(i)} - \theta^T x^{(i)})^2$ for
that particular datapoint is small. A standard choice of the weight is $w^{(i)} = \exp\left(-\frac{(x^{(i)} - x)^2}{2\tau^2}\right)$
at a particular query point $x$. As a result, when $|x^{(i)} - x|$ is small, $w^{(i)} \approx 1$, and when
$|x^{(i)} - x|$ is large, $w^{(i)} \approx 0$. Here, $\tau$ is the bandwidth parameter that decides how
the weight of a training datapoint $x^{(i)}$ reduces with its distance from the query point
$x$.
Figure 4.3 shows the LWR approach on a nonlinearly distributed dataset $(x, y)$.
In this particular example, the dataset is generated as a combination of a linear
function and a sinusoidal function ($x + x\sin(x/5)$) with some noise added to it. We
observe that the LWR is able to fit the nonlinear function very well, albeit considering
a linearised form. Code Snippet 4.3 can be used to reproduce the results and the plots.

Fig. 4.4 Least angle regression

The default values of the slope $m$, learning rate $lr$, and the bandwidth parameter $\tau$ ($t$
in the code) are 1, 0.001, and 1.2, respectively.

Note: The authors recommend that readers run the code with different values of
$m$, $lr$, and $t$ to understand the effects of initial values, learning rate, and varying
bandwidth on the convergence and the final fit obtained.

4.5 Best Subset Selection for Regression

Although OLS estimates provide reasonable predictions in many cases, there are two
reasons why we are often not satisfied with the OLS estimates.
1. Prediction accuracy: The OLS estimates often have low bias but large variance.
Prediction accuracy can sometimes be improved by shrinking or setting some
coefficients to zero. By doing so we sacrifice the bias to reduce the variance of
the predicted values. This approach may improve the overall prediction accuracy.
2. Interpretability: With a large number of predictors, we often would like to
determine a smaller subset of independent variables that exhibit the strongest
effects. Reducing the model to a lower number of input features may thus provide
improved interpretability. In other words, to get the “big picture”, we are often
willing to sacrifice some of the small details.
To achieve this, we need to judiciously select a subset of features or independent
variables from all the available ones without compromising significantly on the

""" L o c a l l y W e i g h t e d L i n e a r R e g r e s s i o n ( LWR )
"""
# i m p o r t m a t p l o l i b for p l o t t i n g data
import matplotlib
m a t p l o t l i b . use ( ' agg ')
i m p o r t m a t p l o t l i b . p y p l o t as plt
from m a t p l o t l i b . p y p l o t i m p o r t s a v e f i g
i m p o r t m a t p l o t l i b . p a t c h e s as p a t c h e s
# i m p o r t n u m p y and s k l e a r n
i m p o r t n u m p y as np
i m p o r t p a n d a s as pd
np . r a n d o m . seed ( 2020 )
# Create sample dataset
X = np . a r a n g e ( - 20 , 21 , 1 )
y_ = X + X * np . sin ( X / 5 )
y = y_ + 2 * np . r a n d o m . r a n d n ( len ( X ) )
def w e i g h t ( t ) :
r e t u r n l a m b d a x , x_i : np . exp ( - ( x_i - x ) ** 2 / 2 . 0 / t ** 2 )
def w e i g h t e d _ l i n e a r _ m o d e l _ (x , X , y , w_func , m = 1 . 0 ) :
w = w_func (x , X)
l o s s _ g = l a m b d a m : np . sum ( - 2 . 0 * w * ( y - m * X ) * X )
m_ = 1 . 0 * m
lr = 0 . 001
for i in r a n g e ( 1000 ) :
m_ = m_ - lr * l o s s _ g ( m_ )
r e t u r n m_ * x
def w e i g h t e d _ l i n e a r _ m o d e l ( x , * args ) :
r e t u r n [ w e i g h t e d _ l i n e a r _ m o d e l _ ( i , * args ) for i in x ]
y _ p r e d = w e i g h t e d _ l i n e a r _ m o d e l (X , X , y , w e i g h t ( 1 . 2 ) )
fig , axs = plt . s u b p l o t s ( 1 , 1 )
plt . s c a t t e r ( X , y , s = 60 , ec = " k " , fc = " none " , l a b e l = " Data p o i n t s " )
plt . s c a t t e r ( X , y_pred , s = 60 , ec = " r " , l a b e l = " LWR " )
plt . x l a b e l ( " $x$ " )
plt . y l a b e l ( " $y$ " )
plt . l e g e n d ()
s a v e f i g ( " l w _ r e g r e s s i o n . png " )
p r i n t ( " End of o u t p u t " )

Output:
End of output

See Fig. 4.3

Code snippet 4.3: Locally weighted regression

model performance. This approach is known as best subset selection for regression.
Here, we enlist various approaches for best subset selection for regression, namely,
(i) step-wise, (ii) stage-wise, and (iii) least angle regression (LAR) methods.

4.5.1 Stepwise Regression

Stepwise regression is the simplest and most intuitive approach for best subset
selection. In this approach, we first consider the linear regression model given by

$\hat{y} = \theta_0 + \sum_{i=1}^{n} \theta_i x_i + \epsilon$. Then, the best subset regression based on the stepwise
approach takes the form $\hat{y} = \beta_0 + \sum_{i=1}^{k} \beta_i x_i + \epsilon$, where $k < n$. The best subset of
$k$ features is selected such that, for each $i \in \{1, 2, \ldots, n\}$, the selected subset of $x_i$'s
of size $k$ gives the smallest residual sum of squares in comparison to any other
subset of $x_i$'s of size $k$. To achieve this, two approaches, namely forward and
backward stepwise regression, are employed.
In the forward stepwise regression approach, the algorithm starts by fitting the
intercept $\beta_0$ first. Then, it evaluates the model performance with a single predictor
and the intercept by adding only one among all the available features, one by one.
It then identifies the descriptor that provides the maximum training score and adds it
to the model. This approach is continued until the model becomes one with $k$ features.
Thus, forward stepwise regression sequentially adds into the model the predictors
$x_i$ (independent features), one by one, that most improve the fit. Accordingly, the equation
for $\hat{y}$ evolves as $\beta_0$, $\beta_0 + \beta_1 x_1$, $\beta_0 + \beta_1 x_1 + \beta_2 x_2$, $\ldots$, $\beta_0 + \sum_{i=1}^{k} \beta_i x_i$. Note that
at each step, the feature that provides the best training score among all the remaining
features is added in this process. In contrast to the forward approach, the backward
stepwise regression approach starts with the full model $\hat{y} = \theta_0 + \sum_{i=1}^{n} \theta_i x_i + \epsilon$.
Then, it identifies the feature that has the least impact on the model by removing
each of the input features one by one and evaluating the performance of the updated
model. Among all the features, the one having the least impact is deleted, and the
procedure is continued. Thus, backward stepwise regression sequentially deletes the
predictors that have the least impact on the model until it reduces to a model with $k$
features.
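
As an illustration of the forward variant, the following minimal sketch (not one of the numbered code snippets) greedily adds, one at a time, the feature that most improves the training $R^2$ of a scikit-learn LinearRegression model on a synthetic dataset; the dataset, the subset size $k$, and the variable names are our own assumptions.

""" Forward stepwise feature selection (illustrative sketch) """
import numpy as np
from sklearn.linear_model import LinearRegression
np.random.seed(0)
n_samples, n_features, k = 200, 6, 3
X = np.random.randn(n_samples, n_features)
# Only the first three features actually influence y
y = 4.0 * X[:, 0] - 2.0 * X[:, 1] + 1.0 * X[:, 2] + 0.5 * np.random.randn(n_samples)
selected, remaining = [], list(range(n_features))
while len(selected) < k:
    best_score, best_j = -np.inf, None
    for j in remaining:
        cols = selected + [j]
        # training score of the model with the current subset plus feature j
        score = LinearRegression().fit(X[:, cols], y).score(X[:, cols], y)
        if score > best_score:
            best_score, best_j = score, j
    selected.append(best_j)
    remaining.remove(best_j)
    print("Added feature", best_j, "training R2 =", round(best_score, 4))
print("Selected subset:", selected)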

4.5.2 Stagewise Regression

Stagewise regression is an approach inspired by stepwise regression, with minor
modifications. It starts similarly to forward stepwise regression by setting the intercept
of the model equal to the mean value of the output $y$ and the coefficients of all
the predictors to zero. Following this, the algorithm computes the residual as
$R = y - \hat{y}$, where $y$ corresponds to the actual observations and $\hat{y}$ corresponds to the
model predictions. Then, the predictor or feature most correlated with the residual is
identified by taking the inner product of the input features with the residual. Once
the feature is identified, the algorithm performs linear regression considering only the selected
feature to identify the coefficient corresponding to this feature. The updated model
is then used to compute the residual. This process is continued until none of the
variables have a correlation with the residuals. Note that, unlike forward stepwise
regression, only the coefficient associated with the single feature most correlated with
the residual is adjusted in one update step. Coefficients corresponding to the other
variables are not adjusted when this term is added to the model. As a result, forward
stagewise regression can take significantly more steps in comparison to stepwise
regression and hence is deemed to be slow in fitting. However, it has been found that
stagewise regression can be very effective in solving high-dimensional problems.
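
The following minimal sketch (our own illustration, not one of the numbered code snippets) implements the stagewise procedure described above on a synthetic dataset: at each step, the predictor most correlated with the current residual is found via inner products, its coefficient is updated by a one-variable regression on the residual, and the residual is recomputed; the data, number of steps, and stopping tolerance are assumed for illustration.

""" Forward stagewise regression (illustrative sketch) """
import numpy as np
np.random.seed(0)
n, p = 200, 5
X = np.random.randn(n, p)
X = (X - X.mean(axis=0)) / X.std(axis=0)          # standardize the predictors
y = 3.0 * X[:, 0] - 1.5 * X[:, 2] + 0.5 * np.random.randn(n)
beta = np.zeros(p)
intercept = y.mean()                              # intercept set to the mean of y
residual = y - intercept
for step in range(100):
    correlations = X.T @ residual                 # inner products with the residual
    j = int(np.argmax(np.abs(correlations)))      # most correlated predictor
    if np.abs(correlations[j]) < 1e-6:            # stop when nothing correlates
        break
    # one-variable least squares of the residual on feature j only
    delta = (X[:, j] @ residual) / (X[:, j] @ X[:, j])
    beta[j] += delta                              # adjust only this coefficient
    residual -= delta * X[:, j]
print("Stagewise coefficients:", np.round(beta, 3))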

4.5.3 Least Angle Regression (LAR)

LAR, inspired by stagewise regression, is a "democratic" version of forward stepwise
or stagewise regression. LAR is similar to forward stagewise regression in
that it identifies the feature most correlated with the residual in the first step. Following
this, instead of performing OLS regression with this variable fully, the
coefficient is moved slightly in the direction of reducing the error/residual. This step
is continued until one more variable exhibits correlation with the residual. Once a
second or more variables show a correlation with the residual, the coefficients of
both these variables are adjusted together while performing the OLS regression. The
process is continued until all the variables are included in the model. It should be
noted that the correlation, as represented by the inner product between the variables
and the residual, will be maximum for those variables having the least angle with
the residual. Hence, the algorithm is named the “least angle” regression. The basic
algorithm of LAR is given in Algorithm 3.

Algorithm 3: LAR algorithm

Data: m training datapoints
Result: θ_j, j = 1, ..., n
initialization: θ_0 = mean(y) and θ_j = 0 ∀ j ∈ (1, n);
while not all θ_j ≠ 0, j ∈ (1, n) do
    Compute the residual, r = y − ŷ
    Find the predictors x_i most correlated with r
    if only one predictor x_i is correlated with r then
        Move the coefficient of x_i, (θ_i), toward the optimal value in a stepwise manner;
    else
        Move all the coefficients of the x_i's (θ_i's) toward the optimal value together in a stepwise manner;

Figure 4.4 shows the LAR performed on a dataset with 11 features. Specifically,
Fig. 4.4a and b show the values taken by the weights corresponding to each of the
input variables in each iteration and the corresponding training R$^2$. We observe that
at the initial point (0th iteration), all the weights have a value of 0. With every
iteration, the weight associated with a different variable gains a non-zero value. Correspondingly,
we also observe an increase in the training score. In the 11th iteration,
we observe that all the features have non-zero weights and the best training score.
However, it is interesting to note that the values of the weights associated with the
last three features are significantly low. Further, the increase in the training score
with the inclusion of these features is also marginal. These results suggest that the

last three features do not contribute significantly to the model predictions. On the
other hand, the first five features contribute significantly to the output of the model.
Thus, LAR is a highly interpretable approach to developing a regression model. Code
Snippet 4.4 can be used to reproduce the results and the plot. Note that the data used
is the density of multicomponent oxide glasses with the glass composition as the
input and density as the output.

Note: The authors recommend that readers run the code with different values of
the number of iterations, the number of features, and different random states to understand
the effects on the final fit obtained.

4.6 Logistic Regression for Classification

All the regression models discussed thus far consider continuous variables for both
the input and output of the model. In many cases, while the input variables may
be continuous, the output variable may be discrete or even binary. This is typically
referred to as a classification problem. In this final section, we discuss logistic
regression, which is a parametric algorithm for classification.
Logistic regression is a probabilistic model that uses a logistic function to model a
binary dependent variable. Mathematically, a binary logistic model has a dependent
variable with two possible values, which is represented by an indicator variable,
where the two values are labeled “0” and “1”. Outputs with more than two values
are modeled by multinomial logistic regression.
Consider a binary classification problem in which $y$ can take only two values,
0 and 1. Since $y$ is discrete-valued, if linear regression is used to predict $y$ as a
function of $x$, it will give poor predictions. To address this challenge, we invoke the
logistic function $g(z) = \frac{1}{1 + e^{-z}}$, also known as the sigmoid function. Using the sigmoid
function, the logistic regression model is given as

$$h_\theta(x) = g(\theta^T x) = \frac{1}{1 + e^{-\theta^T x}} \qquad (4.25)$$

Note that $g(z) \to 1$ when $z \to \infty$, and $g(z) \to 0$ when $z \to -\infty$. Thus the value
of $g(z)$ is bounded between 0 and 1. Further, $g(z)$ is a smooth, continuous, and
differentiable function. With respect to differentiability, $g(z)$ exhibits an interesting
property, $g'(z) = g(z)(1 - g(z))$. This property can be verified as shown below.

""" L e a s t A n g l e R e g r e s s i o n
"""
# i m p o r t m a t p l o l i b for p l o t t i n g data
import matplotlib
m a t p l o t l i b . use ( ' agg ')
i m p o r t m a t p l o t l i b . p y p l o t as plt
from m a t p l o t l i b . p y p l o t i m p o r t s a v e f i g
i m p o r t m a t p l o t l i b . p a t c h e s as p a t c h e s
# i m p o r t n u m p y and s k l e a r n
i m p o r t n u m p y as np
i m p o r t p a n d a s as pd
from s k l e a r n . m e t r i c s i m p o r t r 2 _ s c o r e
from s k l e a r n i m p o r t l i n e a r _ m o d e l
# Load sa m p l e d a t a s e t
data = pd . r e a d _ c s v ( " data / f u l l _ d e n . csv " ) . s o r t _ v a l u e s ( by = " Na2O " )
p r i n t ( data . c o l u m n s )
X = data [ data . c o l u m n s [ : - 1 ] ] . v a l u e s
y = data . v a l u e s [ : ,- 1 : ]
p r i n t ( " N u m b e r of f e a t u r e s : " , X . s h a p e [ 1 ] )
# Regression model
regr = l i n e a r _ m o d e l . Lars ( n _ n o n z e r o _ c o e f s = X . s h a p e [ 1 ] , r a n d o m _ s t a t e = 20 )
regr . fit (X , y )
scores = []
for w in regr . c o e f _ p a t h _ [ : , 1 : ] . T :
mask = w ! = 0 . 0
l i n _ r e g r = l i n e a r _ m o d e l . L i n e a r R e g r e s s i o n ()
l i n _ r e g r . fit ( X [ : , mask ] , y )
s c o r e s + = [ l i n _ r e g r . s c o r e ( X [ : , mask ] , y ) ]
# plor W e i g h t s vs I t e r a t i o n
fig , axs = plt . s u b p l o t s ( 1 , 2 )
plt . sca ( axs [ 0 ] )
m a r k e r s = [ " o " , " <" , " s " , " ^ " , " D " , " + " , " >" , " * " , " x " , " P " , " X " ]
for i , w in e n u m e r a t e ( regr . c o e f _ p a t h _ ) :
w_ = w
w_ = [ 0 . 0 ] + list ( w [ w ! = 0 . 0 ] )
x_ = r a n g e ( 12 - len ( w_ ) , 12 )
plt . plot ( x_ , w_ , " -- { } k " . f o r m a t ( m a r k e r s [ i ] ) , ms = 8 , mfc = " none " ,
l a b e l = " W$_ { { { } } } $ " . f o r m a t ( i + 1 ) )
plt . x t i c k s ( r a n g e ( 0 , 12 ) )
plt . xlim ( [ -1 , 16 ] )
plt . plot ( r a n g e (0 , 12 ) , [ 0 ] * 12 , " k " )
plt . x l a b e l ( " I t e r a t i o n " )
plt . y l a b e l ( " C o e f f i c i e n t v a l u e " )
plt . l e g e n d ()
plt . sca ( axs [ 1 ] )
plt . plot ( r a n g e (1 , len ( s c o r e s ) + 1 ) , scores , " -- ok " , ms = 8 , mfc = " none " )
plt . x t i c k s ( r a n g e ( 1 , 12 ) )
plt . y l a b e l ( " R$ ^ 2$ " )
plt . x l a b e l ( " N u m b e r of f e a t u r e " )
s a v e f i g ( " lars . png " )
p r i n t ( " End of o u t p u t " )

Output:
Index([’Al2O3’, ’B2O3’, ’CaO’, ’Fe2O3’, ’FeO’, ’MgO’, ’Na2O’, ’P2O5’, ’TeO2’,
’TiO2’, ’ZrO2’, ’Density’],
dtype=’object’)
Number of features: 11
End of output

See Fig. 4.4

Code snippet 4.4: Least angle regression



$$g'(z) = \frac{d}{dz}\,\frac{1}{1 + e^{-z}} = -\frac{1}{(1 + e^{-z})^2}\,(-e^{-z})$$

$$= \frac{1}{1 + e^{-z}} \cdot \frac{e^{-z}}{1 + e^{-z}} = \frac{1}{1 + e^{-z}}\left(1 - \frac{1}{1 + e^{-z}}\right) \qquad (4.26)$$

$$= g(z)\left(1 - g(z)\right)$$

For a given dataset, the model parameters of logistic regression, namely $\theta$, are
computed as follows. Assume that

$$p(y = 1 \mid x; \theta) = h_\theta(x)$$
$$p(y = 0 \mid x; \theta) = 1 - h_\theta(x) \qquad (4.27)$$

Combining these, $p(y \mid x; \theta)$ can be written as

$$p(y \mid x; \theta) = \left(h_\theta(x)\right)^{y} \left(1 - h_\theta(x)\right)^{(1-y)} \qquad (4.28)$$

The parameters are estimated in the maximum likelihood estimation framework, as
in the case of OLS in the probabilistic framework. Invoking Eq. 4.22, if there
are $m$ independent training examples, the likelihood function $L(\theta)$ is computed as
follows.

$$L(\theta) = P(Y \mid X; \theta) = \prod_{i=1}^{m} p(y^{(i)} \mid x^{(i)}; \theta) = \prod_{i=1}^{m} \left(h_\theta(x^{(i)})\right)^{y^{(i)}} \left(1 - h_\theta(x^{(i)})\right)^{(1 - y^{(i)})} \qquad (4.29)$$

Consequently, the loglikelihood is computed as

$$\ell(\theta) = \ln(L(\theta)) = \sum_{i=1}^{m} y^{(i)} \ln h_\theta(x^{(i)}) + \left(1 - y^{(i)}\right) \ln\left(1 - h_\theta(x^{(i)})\right) \qquad (4.30)$$

Finally, by employing the gradient ascent algorithm (as we are maximizing the likelihood),
the gradient for a single training example $(x, y)$ is

$$\frac{\partial}{\partial \theta_j} \ell(\theta) = \left(y \frac{1}{g(\theta^T x)} - (1 - y)\frac{1}{1 - g(\theta^T x)}\right) \frac{\partial}{\partial \theta_j}\, g(\theta^T x)$$

$$= \left(y \frac{1}{g(\theta^T x)} - (1 - y)\frac{1}{1 - g(\theta^T x)}\right) g(\theta^T x)\left(1 - g(\theta^T x)\right) \frac{\partial}{\partial \theta_j}(\theta^T x)$$

$$= \left(y\left(1 - g(\theta^T x)\right) - (1 - y)\,g(\theta^T x)\right) x_j$$

$$= \left(y - h_\theta(x)\right) x_j \qquad (4.31)$$

Fig. 4.5 Logistic classification

This would result in the following parameter update expression, which can be solved
in the batch or stochastic LMS framework.

$$\theta_j := \theta_j + \alpha\left(y^{(i)} - h_\theta(x^{(i)})\right)x_j^{(i)} \qquad (4.32)$$
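
While Code Snippet 4.5 below uses the scikit-learn implementation, the update of Eq. (4.32) can also be coded directly. The following minimal sketch (our own illustration, not a numbered code snippet) trains a logistic regression classifier by batch gradient ascent on the loglikelihood for a synthetic two-class dataset; the data, learning rate, and variable names are assumptions made for illustration.

""" Logistic regression via gradient ascent (illustrative sketch of Eq. 4.32) """
import numpy as np
np.random.seed(0)
# Two Gaussian clouds as a binary classification toy dataset
m = 100
X0 = np.random.randn(m, 2) + np.array([-2.0, -2.0])
X1 = np.random.randn(m, 2) + np.array([2.0, 2.0])
X = np.vstack([X0, X1])
X = np.column_stack([np.ones(len(X)), X])            # prepend a bias term
y = np.concatenate([np.zeros(m), np.ones(m)])
def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))
theta = np.zeros(X.shape[1])
alpha = 0.1
for epoch in range(200):
    h = sigmoid(X @ theta)                           # h_theta(x) for all examples
    theta = theta + alpha * X.T @ (y - h) / len(y)   # batch gradient ascent on l(theta)
pred = (sigmoid(X @ theta) >= 0.5).astype(int)
print("theta:", np.round(theta, 3))
print("training accuracy:", (pred == y).mean())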

Figure 4.5 shows an example of a two-class classification problem solved
using the logistic regression approach. Code Snippet 4.5 can be used to reproduce
the results and the plot.

Note: The authors recommend that readers run the code with different values of
hyperparameters such as $C$ to analyze the effect on the final fit obtained.

4.7 Summary

In this chapter, we focused on the parametric approaches toward regression. We


discussed the analytical closed-form solution for the linear regression problem. The
iterative solutions for OLS regression employing gradient descent algorithm, both the
batch LMS and stochastic LMS versions, were also discussed. Further, we reframed
the least square regression in a probabilistic framework employing the notion of
maximizing the loglikelihood, which was shown to be equivalent to minimizing
the loss function in OLS. We also discussed linearization approaches for non-linear
functions such as LWR. Following this, we discussed the selection of the best subset
for regression using stepwise regression, stagewise regression, and LAR. Finally, the
applicability of logistic regression for solving a binary classification problem was
also discussed. In summary, this chapter focused on the classical parametric methods
used to solve ML regression and classification problems.

""" L o g i s t i c c l a s s i f i c a t i o n
Code s o u r c e : Gael V a r o q u a u x
M o d i f i e d for d o c u m e n t a t i o n by J a q u e s G r o b l e r
L i c e n s e : BSD 3 c l a u s e
"""
# i m p o r t m a t p l o l i b for p l o t t i n g data
import matplotlib
m a t p l o t l i b . use ( ' agg ')
i m p o r t m a t p l o t l i b . p y p l o t as plt
from m a t p l o t l i b . p y p l o t i m p o r t s a v e f i g
i m p o r t m a t p l o t l i b . p a t c h e s as p a t c h e s
i m p o r t n u m p y as np
i m p o r t m a t p l o t l i b . p y p l o t as plt
from s k l e a r n . l i n e a r _ m o d e l i m p o r t L o g i s t i c R e g r e s s i o n
from s k l e a r n i m p o r t d a t a s e t s
# i m p o r t s o m e data to play with
iris = d a t a s e t s . l o a d _ i r i s ()
X = iris . data [ : , : 2 ] # we only take the first two f e a t u r e s .
Y = iris . t a r g e t
# C r e a t e an i n s t a n c e of L o g i s t i c R e g r e s s i o n C l a s s i f i e r and fit the data
.
l o g r e g = L o g i s t i c R e g r e s s i o n ( C = 1e5 )
l o g r e g . fit ( X , Y )
# Plot the d e c i s i o n b o u n d a r y . For that , we will assign a color to each
# point in the mesh [ x_min , x _ m a x ] x [ y_min , y _ m a x ].
x_min , x _ m a x = X [ : , 0 ] . min () - . 5 , X [ : , 0 ] . max () + . 5
y_min , y _ m a x = X [ : , 1 ] . min () - . 5 , X [ : , 1 ] . max () + . 5
h = . 02 # step size in the mesh
xx , yy = np . m e s h g r i d ( np . a r a n g e ( x_min , x_max , h ) , np . a r a n g e ( y_min , y_max
, h))
Z = l o g r e g . p r e d i c t ( np . c_ [ xx . r a v e l () , yy . r a v e l () ] )
# Put the result into a color plot
Z = Z . r e s h a p e ( xx . s h a p e )
plt . f i g u r e (1 , f i g s i z e = ( 4 , 3 ) )
plt . p c o l o r m e s h ( xx , yy , Z , cmap = plt . cm . Paired , s h a d i n g = ' auto ')
# Plot also the t r a i n i n g p o i n t s
plt . s c a t t e r ( X [ : , 0 ] , X [ : , 1 ] , c = Y , e d g e c o l o r s = 'k ' , cmap = plt . cm . P a i r e d )
plt . x l a b e l ( ' S e p a l l e n g t h ')
plt . y l a b e l ( ' S e p a l w i d t h ' )
plt . xlim ( xx . min () , xx . max () )
plt . ylim ( yy . min () , yy . max () )
plt . x t i c k s (() )
plt . y t i c k s (() )
s a v e f i g ( " l o g i s t i c _ c l a s s i f i c a t i o n . png " )
p r i n t ( " End of o u t p u t " )

Output:
End of output

See Fig. 4.5

Code snippet 4.5: Logistic classification


Chapter 5
Non-parametric Methods for Regression

Abstract Parametric approaches for regression, discussed in the previous chapter,
require a priori knowledge of the functional form to be fitted. This inherently
assumes that the nature of the dataset, represented by the distribution and the respective
central and higher order measures, is known. In cases where the nature of the
data is not known, non-parametric methods are highly useful. Non-parametric methods
are not limited by the distribution of the data and hence can be used for any
dataset. In this chapter, we focus on non-parametric methods for regression. First,
we discuss tree-based approaches, such as regression tree, random forest, and
gradient boosted trees. Then we discuss the multi-layer perceptron, popularly
known as the neural network, and support vector regression. Finally, we discuss
a non-parametric probabilistic approach for regression, namely, Gaussian process
regression.

5.1 Introduction

Predicting the value of an output variable requires accurate knowledge of the func-
tion relating the independent variables with that of the output variable. Parametric
approaches such as linear, polynomial, or logistic regression makes a priori assump-
tions about the nature of the data before the process of learning the function. For
instance, for a linear elastic material, it is assumed that the stress is proportional to
strain, which leaves only one free parameter to be fitted. The knowledge of this func-
tional form may be based on the underlying physics, expert knowledge, or intuition.
Thus, the number of parameters in a parametric model is fixed. During the learning
process, the values of these parameters are learned.
On the other hand, non-parametric algorithms do not make any assumptions
about the underlying nature of the data. Further, they do not have any restrictions on
the number of parameters to be used for the fitting. Rather, the functional form is

Supplementary Information The online version contains supplementary material available at


https://doi.org/10.1007/978-3-031-44622-1_5.


learned based on the data supplied to the model. As a consequence, non-parametric


approaches offer the following advantages over parametric approaches.
• Flexibility: Since the number of parameters in a non-parametric approach is not
fixed, the algorithm can learn the patterns from any dataset irrespective of its
complexity. As such, the non-parametric approaches offer high flexibility to fit
any functional form based on the data provided.
• Performance: Thanks to the flexibility of non-parametric models, they provide
improved performance in learning the dataset, especially in the case of more complex
ones.
• No expert knowledge required: No prior knowledge regarding the distribution
of data or the process is required while choosing a non-parametric model. Due
to the flexibility of the model to fit any functional form, expert knowledge or
intuition-based guesses can be avoided.
In this chapter, we discuss some of the commonly used non-parametric approaches
for regression problems. It should be noted that most of the non-parametric mod-
els used for regression can also be used for solving classification problems with
minor modifications. First, we discuss the tree-based approaches, following which
we discuss multilayer perceptron and support vector regression. Finally, we discuss
a non-parametric probabilistic method namely, Gaussian process regression before
concluding the chapter.

5.2 Tree-Based Approaches

Tree-based methods partition the feature space into a set of regions and
then fit a simple model, mostly a constant, in each of these regions. For instance,
consider a regression problem with a continuous output variable $y$ and input features
$x_1, \ldots, x_n$. Here, we split the feature space generated by the input features into
regions $R_1, \ldots, R_m$. These regions are split in such a fashion that the response
takes a constant value $c_i$ in each region. The accuracy of the model depends on
how these regions are identified and how the final prediction is made. Accordingly,
there are several tree-based approaches with minor modifications, namely, regression
trees, random forest, gradient boosted trees, XGBoost, and AdaBoost. We discuss
some of these algorithms in detail below.

5.2.1 Regression Tree

A standard regression tree model can be written as


$$f(x^{(j)}) = \sum_{i=1}^{M} c_i\, I\left(x^{(j)} \in R_i\right) \qquad (5.1)$$

where $I$ stands for an indicator function, which evaluates to one if the condition
$(x^{(j)} \in R_i)$ is met (that is, $I(x^{(j)} \in R_i) = 1$ if $x^{(j)} \in R_i$, otherwise $I(x^{(j)} \in R_i) = 0$), $x^{(j)}$
stands for a given value of the predictors for which the predictions are made, $f(x^{(j)})$
stands for the prediction, and $i$ indexes the regions split in the
regression tree. Thus, in a regression tree model, we have to evaluate two sets of parameters,
namely, $c_i$ and $R_i$. For evaluating $c_i$, one can use least squares principles such that
$\sum_{j=1}^{N} (y^{(j)} - f(x^{(j)}))^2$ is minimized over the $N$ samples. It can be shown that this would
result in

$$\hat{c}_i = \text{average}\left(y^{(j)} \mid x^{(j)} \in R_i\right) \qquad (5.2)$$

Now, the following approach can be used to generate the nodes based on a greedy
approach. Consider a case where a node is created for splitting a variable $x_i$ at a
splitting point $j$, such that

$$R_1(i, j) = \{x \mid x_i < j\} \quad \text{and} \quad R_2(i, j) = \{x \mid x_i \geq j\} \qquad (5.3)$$

so that the best binary partition is achieved. Thus, any value of $x_i < j$ would belong
to the region $R_1$ and any value of $x_i \geq j$ would belong to region $R_2$. Hence, we
would minimize the following objective function, given by

$$\min_{i,\, j}\left[\min_{c_1} \sum_{x \in R_1(i,j)} (y - c_1)^2 + \min_{c_2} \sum_{x \in R_2(i,j)} (y - c_2)^2\right] \qquad (5.4)$$

It is clear that for any suitable choice of $i$ and $j$, the inner minimization problem yields
the following optimal solutions

$$\hat{c}_1 = \text{average}\left(y \mid x \in R_1(i, j)\right), \quad \text{and} \quad \hat{c}_2 = \text{average}\left(y \mid x \in R_2(i, j)\right) \qquad (5.5)$$

In other words, $\hat{c}_1$ and $\hat{c}_2$ correspond to the average values of the output variable $y$
obtained by considering all the datapoints $x$ belonging to the regions $R_1$
and $R_2$, respectively. Note that for each splitting variable $x_i$, the split point $j$
is determined by scanning through all of the inputs and finding the
optimal pair $(i, j)$. Having found the best split, we partition the data into the two
resulting regions and repeat the splitting process on each of the two regions. This
process is repeated until the full domain is explored. It should be noted that the
tree size is a tuning parameter governing the model's complexity, and the optimal
tree size should be suitably chosen based on the data. To achieve that, one can choose
to split tree nodes only if the decrease in the sum-of-squares due to the split exceeds
a manually set threshold.
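
The greedy search for a single split can be sketched in a few lines. The following illustration (ours, not one of the numbered code snippets) scans all candidate split points of a one-dimensional synthetic dataset, evaluates the objective of Eq. (5.4) with the region averages of Eq. (5.5), and reports the best split; the data and function names are assumed for illustration.

""" Greedy search for a single regression-tree split (illustrative sketch) """
import numpy as np
np.random.seed(0)
x = np.sort(np.random.uniform(0, 40, 200))       # single feature, e.g. mol% of a modifier
y = np.where(x < 15, 2.3, 2.5) + 0.02 * np.random.randn(len(x))
def split_cost(x, y, s):
    """Sum of squared errors when splitting at x < s vs x >= s (Eq. 5.4)."""
    left, right = y[x < s], y[x >= s]
    cost = 0.0
    for part in (left, right):
        if len(part) > 0:
            cost += np.sum((part - part.mean()) ** 2)   # c_hat = region average (Eq. 5.5)
    return cost
candidates = np.unique(x)[1:]                    # candidate split points
costs = [split_cost(x, y, s) for s in candidates]
best = candidates[int(np.argmin(costs))]
print("Best split point:", round(best, 2))
print("Region means:", y[x < best].mean().round(3), y[x >= best].mean().round(3))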
Fig. 5.1 Tree regression

Figure 5.1 shows an example of a regression tree to predict the density of binary
sodium silicate glasses with the chemical composition (Na$_2$O)$_x$·(SiO$_2$)$_{1-x}$.
Fig. 5.1a shows the results for tree regression with a tree depth of two, which results
in four regions (two per node). The node splitting was performed based on the feature
Na$_2$O, giving regions with mol% of Na$_2$O $<$ 15% ($R_1$), 15-25% ($R_2$), 25-35% ($R_3$),
and $>$ 35% ($R_4$). We observe that in these four regions $R_1, R_2, R_3, R_4$, the density
values given by $\hat{c}_1, \hat{c}_2, \hat{c}_3, \hat{c}_4$ are 2.31, 2.38, 2.46, and 2.53 g/cm$^3$, respectively.
It may be noted that a tree depth of two underfits the data. Figure 5.1b, c, and
d show the tree regression with tree depths of three (eight regions), four (16 regions),
and five (32 regions). A tree depth of four seems to provide an optimal fit, while a tree
depth of five seems to exhibit trends of overfitting, represented by sudden spikes with notable
differences in magnitude. Note that the training:test ratio for the dataset is 70:30.
That is, only 70% of the data is provided to the algorithm to develop the model. The
remaining 30% test data can be used to evaluate the performance of the model. Code
Snippet 5.1 can be used to reproduce the results and the plot.

Note: The authors recommend that readers run the code with different values
of the random state for the train-test split (random_state) and the test set size
(test_size) to understand the effect on the final fit obtained.

"""
"""
# i m p o r t m a t p l o l i b for p l o t t i n g data
import matplotlib
m a t p l o t l i b . use ( ' agg ')
i m p o r t m a t p l o t l i b . p y p l o t as plt
from m a t p l o t l i b . p y p l o t i m p o r t s a v e f i g
# i m p o r t n u m p y and s k l e a r n
i m p o r t n u m p y as np
i m p o r t p a n d a s as pd
from s k l e a r n . m e t r i c s i m p o r t r 2 _ s c o r e
from s k l e a r n . m o d e l _ s e l e c t i o n i m p o r t t r a i n _ t e s t _ s p l i t
from s k l e a r n . tree i m p o r t D e c i s i o n T r e e R e g r e s s o r
# Load sa m p l e d a t a s e t
data = pd . r e a d _ c s v ( " data / N S _ d e n . csv " ) . s o r t _ v a l u e s ( by = " Na2O " )
p r i n t ( data . c o l u m n s )
X = data [ " Na2O " ] . v a l u e s . r e s h a p e ( - 1 , 1 )
y = data [ " D e n s i t y ( g / cm3 ) " ] . v a l u e s . r e s h a p e ( -1 , 1 )
# Split dataset
X_train , X_test , y_train , y _ t e s t = t r a i n _ t e s t _ s p l i t ( X , y , t e s t _ s i z e = 0 . 3
, r a n d o m _ s t a t e = 2020 )
o r d e r = np . a r g s o r t ( X _ t e s t [ : , 0 ] )
fig , axs_ = plt . s u b p l o t s ( 2 , 2 )
axs = axs_
for ind , td in e n u m e r a t e ( [ 2 ,3 ,4 , 5 ] ) :
# Fit r e g r e s s i o n m o d e l
regr = D e c i s i o n T r e e R e g r e s s o r ( m a x _ d e p t h = td )
regr . fit ( X_train , y _ t r a i n )
# Predict
y _ p r e d _ t e s t = regr . p r e d i c t ( X _ t e s t )
# Score
p r i n t ( " Test R2 ( Tree depth = { } ) : " . f o r m a t ( td ) , r 2 _ s c o r e ( y_test ,
y_pred_test ))
plt . sca ( axs [ ind ] )
plt . plot ( X _ t e s t [ order , 0 ] , y _ p r e d _ t e s t [ o r d e r ] , c = 'r ' , l a b e l = ' Tree
d e p t h { } '. f o r m a t ( td ) )
plt . s c a t t e r ( X_test , y_test , l a b e l = ' Test data p o i n t s ' , ec = " k " , fc = "
none " )
plt . l e g e n d ()
plt . x l a b e l ( " $ N a _ 2 O $ ( mol % ) " )
plt . y l a b e l ( " D e n s i t y ( g cm$ ^ { -3 } $ ) " )
s a v e f i g ( " t r e e _ r e g r e s s i o n . png " )
p r i n t ( " End of o u t p u t " )

Output:

Index([’Na2O’, ’SiO2’, ’Density (g/cm3)’], dtype=’object’)

See Fig. 5.1

Code snippet 5.1: Tree-based regression. Model trained for varying tree-depths of
2, 3, 4, and 5 are shown

5.2.2 Random Forest Regression

One major problem with regression trees is that they tend to overfit their training sets
if the trees are grown very deep. To address this issue, one approach is to
develop multiple regression trees over which an average can be computed to obtain
the function value for a given predictor. This can be achieved using the random
forest algorithm. Random forests are a way of averaging multiple deep decision
trees, trained on different parts of the same training set, with the goal of reducing
the variance. Suppose we fit a tree to the training data $Z = \{(x_1, y_1), \ldots, (x_N, y_N)\}$,
obtaining the prediction $f(x)$ at input $x$. Similarly, consider several subsets of $Z$ given
by $Z^b,\ b = 1, 2, \ldots, B$. Now, fit a tree for each of these subsets separately, obtaining the
predictions $f^b(x)$ at input $x$. Then, the mean value of these predictions $f^b(x)$ is given
as the estimate of the output value in the bagging approach. In other words, the bagging
estimate of the prediction over a collection of $B$ bootstrap samples gives an improved
prediction over a single model, thereby reducing its variance. In this case, for each
bootstrap sample $Z^b,\ b = 1, 2, \ldots, B$, a subset of $Z$, a regression tree is developed,
giving the prediction $f^b(x)$. Thus, the bagging estimate $\hat{f}_b$ is defined by

$$\hat{f}_b(x) = \frac{1}{B} \sum_{b=1}^{B} f^b(x) \qquad (5.6)$$

Random forests are a modification of the bagging approach that builds a large
collection of de-correlated trees and then averages them. The procedure for building
an RF regression model is given below.
1. For all randomly bootstrapped samples $b = 1, \ldots, B$:
   a. Draw a bootstrap sample $Z^b$ of size $N_b$ from the training data.
   b. Grow a random-forest tree $T_b$ on the bootstrapped data by selecting $m$ variables
      at random from the $p$ variables.
2. Output the ensemble of trees $T_1, \ldots, T_B$.
3. To make a prediction at a data point $x$, use the following expression:

$$\hat{f}_{rf}^{B}(x) = \frac{1}{B} \sum_{b=1}^{B} T_b(x) \qquad (5.7)$$

It should be noted that an RF with only one tree is equivalent to a regression tree. In other
words, the regression tree is a special case of RF with only one tree.
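
The bagging average of Eq. (5.7) can be sketched directly. The following illustration (ours, not a numbered code snippet) grows B shallow scikit-learn decision trees on bootstrap samples of a synthetic dataset and averages their predictions; it shows only the bagging step, without the random selection of m variables at each split that distinguishes a full random forest. The data and variable names are assumptions made for illustration.

""" Bagging an ensemble of regression trees (illustrative sketch of Eq. 5.7) """
import numpy as np
from sklearn.tree import DecisionTreeRegressor
np.random.seed(0)
x = np.random.uniform(0, 40, 300).reshape(-1, 1)
y = 2.2 + 0.01 * x[:, 0] + 0.03 * np.random.randn(len(x))
B = 10                                           # number of bootstrap samples
trees = []
for b in range(B):
    idx = np.random.choice(len(x), size=len(x), replace=True)   # bootstrap sample Z^b
    tree = DecisionTreeRegressor(max_depth=2).fit(x[idx], y[idx])
    trees.append(tree)
x_query = np.array([[10.0], [25.0]])
# Ensemble prediction: average of the B individual tree predictions
f_hat = np.mean([t.predict(x_query) for t in trees], axis=0)
print("Bagged predictions:", np.round(f_hat, 3))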
Fig. 5.2 Random forest regression

Figure 5.2 shows an example of RF regression with a varying number of estimators
for predicting the density of calcium aluminosilicate glasses. The number of
estimators refers to the number of trees considered in the RF model. For each tree,
the tree depth is another variable that can be optimized. Here, the tree depth is kept
constant at two for all the trees considered. Thus, for each tree, there are four regions
($R_1$, $R_2$, $R_3$, and $R_4$). Figure 5.2a shows the prediction for the test data using one tree
having a tree depth of two. In this case, the four regions can be very clearly observed
as the four step values in the figure. Figure 5.2b, c, and d show RF predictions with
2, 5, and 10 trees, with each tree having a tree depth of 2. Despite
having only four regions, we observe that the prediction improves with an increasing
number of estimators. This could be attributed to the bootstrapping and bagging over
an increasing number of trees, which gives a better estimate of the output variable.
Code Snippet 5.2 can be used to reproduce the results and plots.

Note: The authors recommend that readers run the code with different values of the test
set size (test_size), the random state for the train-test partition (random_state),
the number of trees in the RF (n_estimators), and the tree depth (max_depth).

"""
"""
# i m p o r t m a t p l o l i b for p l o t t i n g data
import matplotlib
m a t p l o t l i b . use ( ' agg ')
i m p o r t m a t p l o t l i b . p y p l o t as plt
from m a t p l o t l i b . p y p l o t i m p o r t s a v e f i g
# i m p o r t n u m p y and s k l e a r n
i m p o r t n u m p y as np
i m p o r t p a n d a s as pd
from s k l e a r n . m e t r i c s i m p o r t r 2 _ s c o r e
from s k l e a r n . m o d e l _ s e l e c t i o n i m p o r t t r a i n _ t e s t _ s p l i t
from s k l e a r n . e n s e m b l e i m p o r t R a n d o m F o r e s t R e g r e s s o r
# Load sa m p l e d a t a s e t
data = pd . r e a d _ c s v ( " data / C A S _ d e n . csv " ) . s o r t _ v a l u e s ( by = " CaO " )
p r i n t ( data . c o l u m n s )
X = data [ [ " CaO " , " A l 2 O 3 " , " SiO2 " ] ] . v a l u e s
y = data [ " D e n s i t y ( g / cm3 ) " ] . v a l u e s . r e s h a p e ( -1 , 1 )
# Split dataset
X_train , X_test , y_train , y _ t e s t = t r a i n _ t e s t _ s p l i t ( X , y , t e s t _ s i z e = 0 . 3
, r a n d o m _ s t a t e = 2020 )
fig , axs_ = plt . s u b p l o t s ( 2 , 2 )
axs = axs_ . f l a t t e n ()
for ind , n in e n u m e r a t e ( [ 1 , 2 ,5 , 10 ] ) :
# Fit r e g r e s s i o n m o d e l
regr = R a n d o m F o r e s t R e g r e s s o r ( n _ e s t i m a t o r s =n , m a x _ d e p t h =2 ,
r a n d o m _ s t a t e = 20 )
regr . fit ( X_train , y _ t r a i n )
# Predict
y _ p r e d _ t e s t = regr . p r e d i c t ( X _ t e s t )
# Score
p r i n t ( " Test R2 ( n u m b e r of e s t i m a t o r s = { } ) : " . f o r m a t ( n ) , r 2 _ s c o r e (
y_test , y _ p r e d _ t e s t ) )
plt . sca ( axs [ ind ] )
i = 0
o r d e r = np . a r g s o r t ( X _ t e s t [ : , i ] )
plt . plot ( X _ t e s t [ order , i ] , y _ p r e d _ t e s t [ o r d e r ] , c = 'r ' , l a b e l = ' N u m b e r
of e s t i m a t o r s { } '. f o r m a t ( n ) )
plt . s c a t t e r ( X _ t e s t [ : ,i ] , y_test , l a b e l = ' Test data p o i n t s ' , ec = " k " ,
fc = " none " )
plt . l e g e n d ()
plt . x l a b e l ( " $ C a O $ ( mol % ) " )
plt . y l a b e l ( " D e n s i t y ( g cm$ ^ { -3 } $ ) " )
s a v e f i g ( " r a n d o m _ f o r e s t _ r e g r e s s i o n . png " )
p r i n t ( " End of o u t p u t " )

Output:

Index([’Al2O3’, ’CaO’, ’SiO2’, ’Density (g/cm3)’], dtype=’object’)

See Fig. 5.2

Code snippet 5.2: Random forest regression with the number of estimators varying
from 1, 2, 5, 10

5.2.3 Gradient Boosted Trees

The fundamental assumption of a regression model is that the prediction residuals are
white noise, distributed normally around zero. The idea behind the gradient
boosting algorithm is to repetitively leverage the pattern in the residuals until there
is no significant improvement. To achieve this, gradient boosting algorithms
iteratively fit a model on the residuals of the data. The basic steps involved in
building a gradient boosted tree model are given below.
1. Fit a regression tree model $\hat{f}(x)$ on the training dataset $(x^i, y^i)$.
2. Calculate the error residual as the difference between the actual target value ($y^i$) and
the predicted target value ($\hat{y}^i$), that is, $r^i = y^i - \hat{y}^i$, for each training data point $i$.
3. Fit a new model $f_r(x)$ with the error residuals as the target variable and the same
input variables $x^i$ as used in the regression tree fitted in Step 1.
4. Add the predicted residuals to the previous predictions.
5. Fit another model on the residuals that are still left, and repeat steps 2 to 5 until the
sum of residuals becomes (nearly) constant.
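
A minimal from-scratch sketch of these steps (our own illustration, not a numbered code snippet) is given below; it repeatedly fits shallow scikit-learn regression trees to the current residuals of a synthetic dataset and accumulates the corrections with a learning rate, which is the essence of gradient boosting with a squared-error loss. The data, learning rate, and number of boosting rounds are assumed for illustration.

""" Boosting shallow trees on residuals (illustrative sketch of steps 1-5) """
import numpy as np
from sklearn.tree import DecisionTreeRegressor
np.random.seed(0)
x = np.random.uniform(-3, 3, 300).reshape(-1, 1)
y = np.sin(x[:, 0]) + 0.1 * np.random.randn(len(x))
learning_rate = 0.3
prediction = np.full(len(y), y.mean())           # start from the mean prediction
trees = []
for step in range(50):
    residual = y - prediction                    # step 2: current residuals
    tree = DecisionTreeRegressor(max_depth=2).fit(x, residual)   # step 3: fit on residuals
    prediction += learning_rate * tree.predict(x)                # step 4: add the correction
    trees.append(tree)
print("Final training RMSE:", round(np.sqrt(np.mean((y - prediction) ** 2)), 4))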
Two major classes of algorithms that perform gradient boosting to develop gradient
boosted decision trees are AdaBoost and XGBoost. The AdaBoost algorithm
treats poorly performing decision trees as weak learners and penalizes incorrectly
predicted samples by assigning a larger weight to them after each prediction round. On
the other hand, the XGBoost algorithm additionally uses several regularization parameters
to reduce overfitting. Code Snippet 5.3 shows the code for performing XGBoost
regression on the dataset with varying tree depths and numbers of estimators. Figure 5.3
shows the performance of XGBoost models for predicting the density of sodium
silicate glasses as a function of Na$_2$O composition.
See Code Snippet 5.3

Note: The authors recommend that readers run the code with different values
of the test set size (test_size), the random state for the train-test partition
(random_state), the number of trees in XGBoost (n_estimators), and the
tree depth (max_depth).

5.3 Multi-layer Perceptron

The perceptron is a two-class classifier using a generalized linear model structure, and the multi-layer
perceptron (MLP) is a network of perceptrons. In the perceptron, the input vector $x$
is first transformed using a fixed nonlinear transformation to give a feature vector
$\phi(x)$, and this is then used to construct a generalized linear model of the form

$$y = f\left(w^T \phi(x)\right) \qquad (5.8)$$

""" X G B o o s t r e g r e s s i o n
"""
# i m p o r t m a t p l o l i b for p l o t t i n g data
import matplotlib
m a t p l o t l i b . use ( ' agg ')
i m p o r t m a t p l o t l i b . p y p l o t as plt
from m a t p l o t l i b . p y p l o t i m p o r t s a v e f i g
# import
i m p o r t n u m p y as np
i m p o r t p a n d a s as pd
from x g b o o s t i m p o r t X G B R e g r e s s o r
from s k l e a r n . m o d e l _ s e l e c t i o n i m p o r t t r a i n _ t e s t _ s p l i t
from s k l e a r n . m e t r i c s i m p o r t m e a n _ s q u a r e d _ e r r o r , r 2 _ s c o r e
# Load sa m p l e d a t a s e t
data = pd . r e a d _ c s v ( " data / N S _ d e n . csv " ) . s o r t _ v a l u e s ( by = " Na2O " )
p r i n t ( data . c o l u m n s )
X = data [ data . c o l u m n s [ : - 1 ] ] . v a l u e s
y = data [ " D e n s i t y ( g / cm3 ) " ] . v a l u e s . r e s h a p e ( -1 , 1 )
# Split dataset
X_train , X_test , y_train , y _ t e s t = t r a i n _ t e s t _ s p l i t ( X , y , t e s t _ s i z e = 0 . 3
, r a n d o m _ s t a t e = 2020 )
o r d e r = np . a r g s o r t ( X _ t e s t [ : , 0 ] )
fig , axs_ = plt . s u b p l o t s ( 2 , 2 )
axs = axs_
for ind , td in e n u m e r a t e ( [ ( 5 , 1 ) ,( 5 , 5 ) ,( 5 , 10 ) ,(5 , 50 ) ] ) :
# Fit r e g r e s s i o n m o d e l
regr = X G B R e g r e s s o r ( m a x _ d e p t h = td [ 0 ] , n _ e s t i m a t o r s = td [ 1 ] )
regr . fit ( X_train , y _ t r a i n )
# Predict
y _ p r e d _ t e s t = regr . p r e d i c t ( X _ t e s t )
# Score
p r i n t ( " Test R2 ( Tree depth = { } ) : " . f o r m a t ( td ) , r 2 _ s c o r e ( y_test ,
y_pred_test ))
plt . sca ( axs [ ind ] )
plt . plot ( X _ t e s t [ order , 0 ] , y _ p r e d _ t e s t [ o r d e r ] , c = 'r ' , l a b e l = ' Tree
d e p t h { } , E s t i m a t o r s { } '. f o r m a t
( * td ) )
plt . s c a t t e r ( X _ t e s t [ : , 0 ] , y_test , l a b e l = ' Test data p o i n t s ' , ec = " k " ,
fc = " none " )
plt . l e g e n d ()
plt . x l a b e l ( " $ N a _ 2 O $ ( mol % ) " )
plt . y l a b e l ( " D e n s i t y ( g cm$ ^ { -3 } $ ) " )
s a v e f i g ( " x g b _ r e g r e s s i o n . png " )
p r i n t ( " End of o u t p u t " )

Output:

Index([’Na2O’, ’SiO2’, ’Density (g/cm3)’], dtype=’object’)


Test R2 (Tree depth = (5, 1)): -200.079819523798
Test R2 (Tree depth = (5, 5)): -11.340051395396541
Test R2 (Tree depth = (5, 10)): 0.21244833047268075
Test R2 (Tree depth = (5, 50)): 0.5583701728131119
End of output

See Fig. 5.3

Code snippet 5.3: XGBoost regression



Fig. 5.3 XGBoost regression

If no features are used, then $\phi(x)$ becomes $\phi(x) = x$. The nonlinear activation function
$f(\cdot)$ is given by a step function of the form:

$$f(a) = \begin{cases} +1, & \text{if } a \geq 0 \\ -1, & \text{if } a < 0 \end{cases} \qquad (5.9)$$

In earlier two-class classification problems, we focused on a target coding
scheme in which $y \in \{0, 1\}$; however, for the perceptron (5.9) it is more convenient to use target
values $t_n = +1$ for class $C_1$ and $t_n = -1$ for class $C_2$. We need to
compute $w$ by minimizing the "perceptron criterion". For hard classification, class $C_1$
requires $w^T \phi(x_n) > 0$ with $t_n = +1$, and class $C_2$ requires $w^T \phi(x_n) < 0$ with $t_n = -1$. Thus,
for a correct classification $w^T \phi(x_n) t_n > 0$ always, while for an incorrect classification
$w^T \phi(x_n) t_n < 0$. So, in perceptron training, we minimize

$$E_p(w) = -\sum_{n \in \mathcal{M}} w^T \phi(x_n)\, t_n \qquad (5.10)$$

where $\mathcal{M}$ is the set of misclassified points. The weights are updated by stochastic gradient
descent as

$$w := w - \alpha \nabla_w E_p(w) = w + \alpha\, \phi(x_n)\, t_n \qquad (5.11)$$

cycling through the misclassified points until the desired
accuracy is reached. The perceptron can only be expected to handle problems that are
linearly separable. To tackle more complicated (nonlinear) situations, we can arrange
a larger set of perceptrons in multiple layers. The additional layers between the input and output
layers are called 'hidden' layers. Such models are also popularly called feed-forward neural
nets or feed-forward networks. The model comprises multiple layers of logistic
regression models (with continuous nonlinearities) rather than multiple perceptrons
(with discontinuous nonlinearities). The structure in terms of basis functions is

$$y(x, w) = f\left(\sum_{j=1}^{M} w_j\, \phi_j(x)\right) \qquad (5.12)$$

where $f(\cdot)$ is a nonlinear activation function in the case of classification and is the
identity in the case of regression, and $\phi(x) = x$ if the variables are considered directly.
In the MLP framework, the output of the first hidden layer is calculated as

$$z_j = h(a_j) = h\left(\sum_{i=1}^{D} w_{ji}^{(1)} x_i\right) \qquad (5.13)$$

where $x_1, \ldots, x_D$ are the input variables, $j = 1, \ldots, M$ indexes the hidden layer units,
$w_{ji}^{(1)}$ are the weights, $a_j$ are the hidden layer activations, and $h(\cdot)$ is the hidden layer
activation function. The output unit activation $a_k$ is computed as

$$a_k = \sum_{j=1}^{M} w_{kj}^{(2)} z_j \qquad (5.14)$$

where $k = 1, \ldots, K$, $K$ is the total number of outputs, and $w_{kj}^{(2)}$ are the output weights.
The outputs are computed as

$$y_k = \begin{cases} a_k & \text{if regression} \\ \sigma(a_k) & \text{if classification} \end{cases} \qquad (5.15)$$

where $\sigma(\cdot)$ is the output function. The final network output $y_k$ is evaluated as

$$y_k(x, w) = \sigma\left(\sum_{j=1}^{M} w_{kj}^{(2)}\, h\left(\sum_{i=1}^{D} w_{ji}^{(1)} x_i\right)\right) \qquad (5.16)$$

An MLP with multiple hidden layers is generally termed a deep network. The Credit
Assignment Path (CAP) is the chain of transformations from input to output
in a network. For example, for a feedforward neural network, the depth of the CAP is the
number of hidden layers plus one. For a deep network, the CAP depth is $> 2$, and a network with
CAP depth 2 has already been shown to be a universal approximator. The output activation function
in an MLP is mostly the logistic function

$$\sigma(a) = \frac{1}{1 + \exp(-a)} \qquad (5.17)$$

For hidden layer activation function:



⎪ Logistic([0, 1]) − 1+ex1p(−a)


⎨tanh − ([−1, 1]) − ea −e−a
.h(a) = P ea +e−a
(5.18)

⎪ 0 i f a ≤ 0

⎩ ReLU − a i f a > 0
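The three activations in Eq. (5.18) can be written directly in NumPy, as in the short illustrative sketch below.

import numpy as np

def logistic(a):
    # Logistic (sigmoid) activation, output in [0, 1]
    return 1.0 / (1.0 + np.exp(-a))

def tanh(a):
    # Hyperbolic tangent activation, output in [-1, 1]
    return np.tanh(a)

def relu(a):
    # Rectified linear unit: 0 for a <= 0, a otherwise
    return np.maximum(0.0, a)

a = np.linspace(-3, 3, 7)
print(logistic(a), tanh(a), relu(a), sep="\n")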

Training of an MLP is done by the backpropagation approach, which is shorthand for "the backward propagation of errors", since an error is computed at the output and distributed backwards through the network's layers. Hence, the multilayer perceptron architecture is sometimes called a backpropagation network. Backpropagation involves two simple steps:
1. In the first stage, the derivatives of the error function with respect to the weights
must be evaluated
2. In the second stage, the derivatives are then used to compute the adjustments to
be made to the weights.
Here, the notation k indicates the output layer and j indicates the hidden layers. The loss or error function is computed as:

E(w) = ∑_{n=1}^{N} E_n(w)    (5.19)

Consider the nth data vector; then, for all outputs y_1, ..., y_K:

E_n(w) = (1/2) ∑_k (y_nk − t_nk)^2    (5.20)

Here y_nk is the kth output for the nth data vector and t_nk is the corresponding target of y_nk (the subscript n is omitted below for convenience). In a feed-forward network (forward propagation), the output of a general hidden layer is (z_i = x_i if there is only one hidden layer)

z_j = h(a_j) = h( ∑_{i=1}^{M} w_ji z_i )    (5.21)

The gradient calculation for a hidden layer is given as:

∂E_n/∂w_ji = (∂E_n/∂a_j)(∂a_j/∂w_ji) = δ_j z_i    (5.22)

The gradient calculation for the output layer is given as:

∂E_n/∂w_kj = (∂E_n/∂a_k)(∂a_k/∂w_kj) = δ_k z_j    (5.23)

Thus, we only need to calculate the values of δ_j and δ_k. For weights in the output layer, we compute

δ_k = ∂E_n/∂a_k    (5.24)

Since y_k = a_k and E_n(w) = (1/2) ∑_{k=1}^{K} (y_k − t_k)^2,

δ_k = (a_k − t_k) = (y_k − t_k)    (5.25)

For the hidden layer, we need to sum over all output nodes:

δ_j = ∂E_n/∂a_j = ∑_{k=1}^{K} (∂E_n/∂a_k)(∂a_k/∂a_j)    (5.26)

Since a_k = ∑_j w_kj h(a_j),

δ_j = h'(a_j) ∑_{k=1}^{K} δ_k w_kj    (5.27)

Fig. 5.4 Neural network regression

Until the desired training criterion is met, repeat the following steps: apply an input vector x_n to the network and forward propagate through the network to find the activations of all the hidden and output units. The final weight updates are then performed as:

• Output layer weight update using stochastic LMS (using Eqs. (5.23) and (5.25)):

w_kj := w_kj − α (y_k − t_k) z_j,  k = 1, ..., K,  j = 1, ..., M    (5.28)

• Hidden layer weight update using stochastic LMS (using Eqs. (5.22) and (5.27)):

w_ji := w_ji − α h'(a_j) ( ∑_{k=1}^{K} (y_k − t_k) w_kj ) z_i,  i = 1, ..., D    (5.29)

where z_i = x_i if there is only one hidden layer.
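As an illustration of the update rules (5.28)–(5.29), the following is a minimal NumPy sketch of one stochastic backpropagation step for a single-hidden-layer regression MLP with tanh hidden units and an identity output; all names and values are illustrative assumptions.

import numpy as np

def backprop_step(x, t, W1, W2, alpha=0.01):
    """One stochastic gradient step for a one-hidden-layer MLP (identity output)."""
    # Forward pass
    a_hidden = W1 @ x              # a_j
    z = np.tanh(a_hidden)          # z_j = h(a_j)
    y = W2 @ z                     # y_k = a_k (identity output, regression)
    # Backward pass
    delta_k = y - t                            # Eq. (5.25)
    delta_j = (1.0 - z**2) * (W2.T @ delta_k)  # Eq. (5.27), h'(a) = 1 - tanh^2(a)
    # Weight updates, Eqs. (5.28)-(5.29)
    W2 -= alpha * np.outer(delta_k, z)
    W1 -= alpha * np.outer(delta_j, x)
    return W1, W2

rng = np.random.default_rng(0)
W1, W2 = rng.normal(size=(4, 2)), rng.normal(size=(1, 4))
W1, W2 = backprop_step(np.array([0.1, 0.9]), np.array([2.5]), W1, W2)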

Code Snippet 5.4 shows the code for predicting the density of calcium silicate
glasses using MLPs. Figure 5.4 shows the performance of MLP with varying number
of neurons. With increasing number of neurons the model gradually starts to fit the
data in a better fashion.

"""
"""
# import matplotlib for plotting data
import matplotlib
matplotlib.use('agg')
import matplotlib.pyplot as plt
from matplotlib.pyplot import savefig
# import numpy, pandas, and sklearn
import numpy as np
import pandas as pd
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor
# Load sample dataset
data = pd.read_csv("data/CAS_den.csv").sort_values(by="CaO")
print(data.columns)
X = data[["CaO", "Al2O3", "SiO2"]].values
y = data["Density (g/cm3)"].values.reshape(-1, 1)
# Split dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=2020)
fig, axs_ = plt.subplots(2, 2)
axs = axs_.flatten()
for ind, n in enumerate([1, 2, 5, 10]):
    # Fit regression model
    regr = MLPRegressor(hidden_layer_sizes=[n, n, n], random_state=20)
    regr.fit(X_train, y_train)
    # Predict
    y_pred_test = regr.predict(X_test)
    # Score
    print("Test R2 (Number of neurons = {}):".format(n), r2_score(y_test, y_pred_test))
    plt.sca(axs[ind])
    i = 0
    order = np.argsort(X_test[:, i])
    plt.plot(X_test[order, i], y_pred_test[order], c='r', label='Number of neurons = {}'.format(n))
    plt.scatter(X_test[:, i], y_test, label='Test data points', ec="k", fc="none")
    plt.legend()
    plt.xlabel("$CaO$ (mol%)")
    plt.ylabel("Density (g cm$^{-3}$)")
savefig("nn_regression.png")
print("End of output")

Output:

Index([’Al2O3’, ’CaO’, ’SiO2’, ’Density (g/cm3)’], dtype=’object’)

See Fig. 5.4

Code snippet 5.4: MLP-based regression to predict the density with varying number
of hidden layer units, namely, 1, 2, 5, and 10

Note: Authors recommend the readers to run the code with different values of test
set size (.test_size), random state for train test partition (.random_state),
number of neurons or hidden layer units, and learning rates.

5.4 Support Vector Regression

5.4.1 Linear Separable Case

Support Vector Regression (SVR) is an optimization-based non-parametric regression approach. Consider a linear classification boundary:

. y(x) = w T φ(x) + b (5.30)

.φ(x) denotes a fixed feature-space transformation (used only if needed), and we have
made the bias parameter .b explicit. The training data set comprises . N input vectors
.{x 1 , . . . , x N }, with corresponding target values .{t1 , . . . , t N } where .tn ∈ {−1, 1}, and

new data points .x are classified according to the sign of . y(x).


We shall assume for the moment that the training dataset is linearly separable
in feature space. That is, there exists at least one choice of .w and .b that satisfies
Eq. 5.30 such that . y(xn ) > 0 for points having .tn = +1 and . y(xn ) < 0 for points
having .tn = −1. Thus, .tn y(xn ) > 0 for all training data points.
Let r be the perpendicular distance of x from the decision surface:

x = x_⊥ + r (w / ||w||_2)  =⇒  w^T x = w^T x_⊥ + r (w^T w / ||w||_2)    (5.31)

Since y(x) = w^T x + b, we have y(x_⊥) = w^T x_⊥ + b = 0  =⇒  w^T x_⊥ = −b    (5.32)

y(x) = w^T x_⊥ + r (w^T w / ||w||_2) + b = r ||w||_2  =⇒  r = y(x) / ||w||_2    (5.33)

For all data points n = 1, ..., N, we need to maximize the separation of the correctly classified points, for which t_n y(x_n) > 0:

t_n y(x_n) / ||w||_2 = t_n (w^T x_n + b) / ||w||_2    (5.34)

We also need to minimize the number of points within the margin. In short, we need to find the parameters w and b by solving

arg max_{w,b} { (1 / ||w||) min_n [ t_n (w^T x_n + b) ] }    (5.35)

Equation (5.35) is a complex optimization problem to solve due to the simultaneous maximization and minimization, so we need to take an alternative route. From Eq. (5.34), we can see that rescaling w and b does not affect the distance; that is, if w → κw and b → κb, the right-hand side of Eq. (5.34) remains the same. We can therefore choose a κ such that t_n(w^T x_n + b) = 1 for the points closest to the hyperplane. For all correctly classified points, t_n(w^T x_n + b) ≥ 1, n = 1, ..., N; also, maximizing 1/||w|| is equivalent to minimizing ||w||. Hence, the reformulated problem is:

min_{w,b} (1/2) ||w||^2    (5.36)

subject to  t_n(w^T x_n + b) ≥ 1,  n = 1, ..., N    (5.37)

Duality theory shows how we can construct an alternative problem from the func-
tions and data that define the original optimization problem. This alternative “dual”
problem is related to the original problem (which is sometimes referred to in this
context as the “primal” for purposes of contrast). In most cases, the dual problem is
easier to solve computationally than the original problem or can be used to obtain
easily a lower bound on the optimal value of the objective for the primal problem.
If we obtain a lower bound, we can maximize the lower bound to obtain the optimal
value. Consider a general constrained optimization problem:

min_x f(x)  subject to  c_i(x) ≥ 0,  i = 1, ..., n    (5.38)

Rewriting (5.38) in vector form:

min_x f(x)  subject to  c(x) ≥ 0    (5.39)

where c(x) := [c_1(x), ..., c_n(x)]^T. The Lagrangian function is written as:

L(x, λ) = f(x) − λ^T c(x)    (5.40)

where λ ≥ 0 is the Lagrange multiplier vector. The dual objective function (a lower bound on the optimal value of (5.39)) is computed as

q(λ) := min_x L(x, λ)    (5.41)

The dual problem of (5.39) (maximizing the lower bound) is given as

max_λ q(λ)  subject to  λ ≥ 0    (5.42)

In light of this, the SVM optimization problem (5.36)–(5.37) can be solved through its dual. The first-order necessary conditions for optimality in nonlinearly constrained optimization, the Karush–Kuhn–Tucker (KKT) conditions, are used for solving the optimization problem. The KKT conditions state that if x* is a (local) optimal solution of (5.39), then there is a Lagrange multiplier vector λ* such that the following are satisfied:
1. ∇_x L(x*, λ*) = 0, where L(x, λ) = f(x) − λ^T c(x)
2. c_i(x*) = 0 for equality constraints
3. c_i(x*) ≥ 0 for inequality constraints
4. λ_i* ≥ 0 for inequality constraints
5. λ_i* c_i(x*) = 0 for all constraints

Condition 5 (the complementarity condition) implies that either constraint i is active (holds with equality) or λ_i* = 0 (inactive), or possibly both. In order to classify new data points, we evaluate the sign of y(x) = w^T x + b. For calculating w, we solve the dual problem and obtain λ, with w = ∑_{n=1}^{N} λ_n t_n x_n and y = ∑_{n=1}^{N} λ_n t_n x_n^T x + b. For calculating b, we need the KKT conditions: (i) λ_n ≥ 0; (ii) t_n y(x_n) − 1 ≥ 0; (iii) λ_n {t_n y(x_n) − 1} = 0. If λ_n = 0, the point makes no contribution to y; the remaining data points are called support vectors, as they satisfy t_n y(x_n) = 1 and are the only points that contribute to y. Incidentally, t_n y(x_n) = 1 implies they are active constraints and lie on the maximum-margin hyperplane. For calculating b, we use the following relation. Any support vector x_n satisfies:

t_n y(x_n) = 1
t_n (w^T x_n + b) = 1
t_n ( ∑_{m∈S} λ_m t_m x_m^T x_n + b ) = 1

Multiplying both sides by t_n and using t_n^2 = 1,

∑_{m∈S} λ_m t_m x_m^T x_n + b = t_n  =⇒  b = t_n − ∑_{m∈S} λ_m t_m x_m^T x_n

where S is the index set of the support vectors. Averaging these equations over all support vectors gives

b = (1/N_S) ∑_{n∈S} ( t_n − ∑_{m∈S} λ_m t_m x_m^T x_n )    (5.43)

where . N S is the total number of support vectors.
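As an illustrative check of Eq. (5.43), the sketch below fits a (nearly) hard-margin linear SVM with scikit-learn on synthetic data, recovers w from the dual coefficients, and averages t_n − w^T x_n over the support vectors to obtain b; the synthetic dataset and the large value of C are assumptions made purely for illustration.

import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Two well-separated synthetic clusters with labels in {-1, +1}
X, y = make_blobs(n_samples=60, centers=2, cluster_std=0.8, random_state=0)
t = np.where(y == 0, -1, 1)

# A large C approximates the hard-margin (linearly separable) formulation
clf = SVC(kernel="linear", C=1e6).fit(X, t)

# dual_coef_ stores lambda_n * t_n for the support vectors
lam_t = clf.dual_coef_.ravel()
sv = clf.support_vectors_

# w = sum_n lambda_n t_n x_n (support vectors only; other lambda_n are zero)
w = lam_t @ sv

# b averaged over the support vectors, Eq. (5.43)
t_sv = t[clf.support_]
b = np.mean(t_sv - sv @ w)

print("w:", w, "b:", b, "sklearn intercept:", clf.intercept_[0])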



5.4.2 Linear Non-separable Case

In the case of linearly non-separable classes, some points fall inside the class sepa-
ration band. For situations where the samples are linearly non-separable, the above
optimization procedure is not valid. It is not possible to draw a hyperplane that will
have a class separation band around it with no data points inside this band. We now
modify this approach so that data points are allowed to be on the “wrong side” of
the margin boundary but with a penalty that increases with the distance from that
boundary. As a result, we need to introduce slack variables .ξn = 0 for data points that
are on or inside the correct margin boundary and .ξn = |tn − y(xn )| for other points.
That would result in three potential possibilities:
1. Vectors that fall outside the margin and are correctly classified: .tn (w T xn + b) ≥
1 =⇒ ξn = 0
2. Vectors that fall inside the margin but are correctly classified:.0 ≤ tn (w T xn + b) ≤
1 =⇒ 0 ≤ ξn ≤ 1
3. Vectors that are misclassified:.tn (w T xn + b) ≤ 0 =⇒ ξn > 1
This would result in the following modified optimization problem:

min_{w,b,ξ} (1/2) ||w||^2 + C ∑_{n=1}^{N} ξ_n    (5.44)

subject to  t_n (w^T x_n + b) ≥ 1 − ξ_n,  ξ_n ≥ 0,  n = 1, ..., N    (5.45)

Here C is a positive constant that controls the trade-off, relaxing the hard classification constraint to some extent. Data points with ξ_n = 0 are correctly classified and lie on or outside the margin boundary, those with 0 < ξ_n ≤ 1 lie inside the margin but on the correct side of the decision boundary, and those with ξ_n > 1 are misclassified (undesirable). Since the objective also minimizes ξ_n, each slack variable takes the smallest value compatible with the constraints. Here, the Lagrangian is computed as:

L(w, b, λ, μ) = (1/2) ||w||^2 + C ∑_{n=1}^{N} ξ_n − ∑_{n=1}^{N} λ_n { t_n (w^T x_n + b) − 1 + ξ_n } − ∑_{n=1}^{N} μ_n ξ_n    (5.46)

where μ_n, λ_n, n = 1, ..., N, are Lagrange multipliers. Setting the derivatives to zero gives ∂L/∂w = 0 ⇒ w = ∑_{n=1}^{N} λ_n t_n x_n, ∂L/∂b = 0 ⇒ ∑_{n=1}^{N} λ_n t_n = 0, and ∂L/∂ξ_n = 0 ⇒ λ_n = C − μ_n, i.e., C = λ_n + μ_n. The dual objective q(λ) is evaluated as:
q(λ) = (1/2) ∑_{n=1}^{N} ∑_{m=1}^{N} λ_n λ_m t_n t_m x_n^T x_m + ∑_{n=1}^{N} (λ_n + μ_n) ξ_n − b ∑_{n=1}^{N} λ_n t_n
       − ∑_{n=1}^{N} ∑_{m=1}^{N} λ_n λ_m t_n t_m x_n^T x_m + ∑_{n=1}^{N} λ_n − ∑_{n=1}^{N} λ_n ξ_n − ∑_{n=1}^{N} μ_n ξ_n
     = ∑_{n=1}^{N} λ_n − (1/2) ∑_{n=1}^{N} ∑_{m=1}^{N} λ_n λ_m t_n t_m x_n^T x_m    (5.47)

(the term b ∑_n λ_n t_n vanishes because ∑_n λ_n t_n = 0, and the ξ_n terms cancel since C = λ_n + μ_n). Considering λ_n ≥ 0 and λ_n + μ_n = C with μ_n ≥ 0, the constraints become

0 ≤ λ_n ≤ C,  ∑_{n=1}^{N} λ_n t_n = 0    (5.48)

In order to classify new data points, we evaluate the sign of y(x) = w^T x + b. For calculating w, we solve the dual problem (5.47)–(5.48), obtain λ, and use

w = ∑_{n=1}^{N} λ_n t_n x_n    (5.49)

and y is calculated as

y = ∑_{n=1}^{N} λ_n t_n x_n^T x + b    (5.50)

By the complementary KKT conditions, ξ_n μ_n = 0, so ξ_n = 0 for support vectors with λ_n < C. Hence, as in the separable case, for such support vectors,

t_n ( ∑_{m∈S} λ_m t_m x_m^T x_n + b ) = 1  ∴  b = (1/N_S) ∑_{n∈S} ( t_n − ∑_{m∈S} λ_m t_m x_m^T x_n )    (5.51)

where . N S is the total number of support vectors.

5.4.3 Kernel SVR

Sometimes it is very difficult or rather impossible to find a decision boundary when


we are using only the true attributes, i.e.,.x, due to the non-linearity or non-separability
of these features. In such cases, other possible options use .φ(x) as extracted features
directly from the attributes or inflate the dimension of the data by adding additional

features. Defining a kernel function as an inner product between the data facilitates a nonlinear transformation from the input space to a higher-dimensional space, where the problem may be linearly separable. For second-order polynomials, consider the mapping φ: (x_1, x_2) ↦ (z_1, z_2, z_3) := (x_1^2, √2 x_1 x_2, x_2^2). Then

φ(x_1, x_2)^T φ(x_1', x_2') = (x_1^2, √2 x_1 x_2, x_2^2)^T (x_1'^2, √2 x_1' x_2', x_2'^2)    (5.52)
  = x_1^2 x_1'^2 + 2 x_1 x_1' x_2 x_2' + x_2^2 x_2'^2    (5.53)
  = (x_1 x_1' + x_2 x_2')^2    (5.54)
  = { (x_1, x_2)^T (x_1', x_2') }^2 = k(x, x')    (5.55)

The term k(x, x') is called the inner product kernel. Similarly, for a dth-order polynomial:

k(x, x') = φ(x)^T φ(x') = (x^T x')^d    (5.56)

The net result is that we can compute the value of the inner product without explicitly carrying out a mapping involving monomials. Let φ: X ↦ H indicate a nonlinear transformation from the input space X to the feature space H; then the separating hyperplane is

w_φ^T φ(x) + b = 0    (5.57)

The corresponding output is

y = ∑_{n=1}^{N} λ_n t_n k(x, x_n) + b    (5.58)

with weights

w_φ = ∑_{n=1}^{N} λ_n t_n φ(x_n)    (5.59)

b = (1/N_S) ∑_{n∈S} ( t_n − ∑_{m∈S} λ_m t_m k(x_m, x_n) )    (5.60)

Here, the term k(x_m, x_n) = φ(x_m)^T φ(x_n) is called the inner product kernel. K(x_i, x_j) may be very inexpensive to calculate through the inner product evaluation, even though φ(x) itself may be expensive to compute. The inner product of orthogonal functions is zero, and that of collinear (normalized) functions is one. If φ(x) and φ(x') are far apart, say nearly orthogonal to each other, then K(x, x') = φ(x)^T φ(x') will be small; otherwise it will be large. We can thus think of K(x, x') as a measure of how similar φ(x) and φ(x') are, or of how similar x and x' are. Kernels are also known as covariance functions, and some popular kernels are:
1. Polynomial kernel of degree d: k(x, x') = (x^T x' + c)^d
2. Gaussian (RBF) kernel: k(x, x') = exp(−||x − x'||^2 / σ^2)

Both kernels are illustrated in the short sketch below.
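The following is a small, self-contained NumPy sketch of these two kernels written as Gram-matrix functions; the parameter values are arbitrary illustrative choices.

import numpy as np

def polynomial_kernel(X1, X2, c=1.0, d=3):
    # k(x, x') = (x^T x' + c)^d, evaluated for all pairs of rows of X1 and X2
    return (X1 @ X2.T + c) ** d

def gaussian_kernel(X1, X2, sigma=1.0):
    # k(x, x') = exp(-||x - x'||^2 / sigma^2)
    sq_dists = np.sum(X1**2, axis=1)[:, None] + np.sum(X2**2, axis=1)[None, :] - 2 * X1 @ X2.T
    return np.exp(-sq_dists / sigma**2)

X = np.random.default_rng(0).normal(size=(5, 2))
print(polynomial_kernel(X, X).shape, gaussian_kernel(X, X).shape)  # (5, 5) (5, 5)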

"""
"""
# import matplotlib for plotting data
import matplotlib
matplotlib.use('agg')
import matplotlib.pyplot as plt
from matplotlib.pyplot import savefig
# import numpy, pandas, and sklearn
import numpy as np
import pandas as pd
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVR
# Load sample dataset
data = pd.read_csv("data/NS_den.csv").sort_values(by="Na2O")
print(data.columns)
X = data["Na2O"].values.reshape(-1, 1)
y = data["Density (g/cm3)"].values.reshape(-1, 1)
# Split dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=2020)
order = np.argsort(X_test[:, 0])
fig, axs_ = plt.subplots(1, 2)
axs = axs_.flatten()
for ind, k in enumerate(["linear", "rbf"]):
    # Fit regression model
    regr = SVR(C=0.1, epsilon=0.001, kernel=k, degree=4)
    regr.fit(X_train, y_train)
    # Predict
    y_pred_test = regr.predict(X_test)
    # Score
    print("Test R2:", r2_score(y_test, y_pred_test))
    plt.sca(axs[ind])
    plt.plot(X_test[order, 0], y_pred_test[order], c='r', label='{} kernel'.format(k))
    plt.scatter(X_test, y_test, label='Test data points', ec="k", fc="none")
    plt.legend()
    plt.xlabel("$Na_2O$ (mol%)")
    plt.ylabel("Density (g cm$^{-3}$)")
savefig("sv_regression.png")
print("End of output")

Output:

Index([’Na2O’, ’SiO2’, ’Density (g/cm3)’], dtype=’object’)

See Fig. 5.5

Code snippet 5.5: Support Vector regression for predicting density using linear
kernel and RBF kernel

Fig. 5.5 Support vector regression

Code Snippet 5.5 can be used to perform SVR for predicting the density of sodium
silicate glasses. Figure 5.5 shows the performance of the SVR models with linear and
RBF kernels for predicting the density.

Note: Authors recommend the readers to run the code with different values of test
set size (.test_size), random state for train test partition (.random_state),
linear and non-linear kernels, and also with varying hyperparameters according
to the non-linear kernels.

5.5 Gaussian Process Regression

Gaussian processes (GPs) are nonparametric models that are capable of modeling datasets in a fully probabilistic fashion. The main advantages of GP models are: (i) their ability to model complex data sets; and (ii) their ability to estimate the uncertainty associated with predictions through posterior variance computations. A GP is a collection of random variables, any finite set of which follows a joint Gaussian distribution. As a result, the GPR modeling framework ascribes a distribution over a given set of input (x) and output (y) datasets. A mean function m(x) and a covariance function k(x, x'), the two degrees of freedom needed to characterize a GPR fully, are as shown below:

y = f(x) + ε,  where ε ∼ N(0, σ_ε^2),  f ∼ GP(m(x), k(x, x'))    (5.61)

While the mean function m(x) computes the expected value of the output for a given input, the covariance function captures the extent of correlation between function outputs for a given pair of inputs as

k(x, x') = E[( f(x) − m(x) )( f(x') − m(x') )]    (5.62)

In the GP literature, k(x, x') is also termed the kernel function of the GP. A widely used rationale for the selection of the kernel function is that the correlation between any two points decreases with an increase in the distance between them. Some popular kernels in the GP literature are:
1. Exponential kernel: k(x, x') = exp(−|x − x'| / l)
2. Squared exponential kernel: k(x, x') = σ_f^2 exp[ −(1/2) ((x − x') / l)^2 ]
where l is termed the length-scale parameter and σ_f^2 the signal variance parameter. In a GPR model, these hyper-parameters can be tuned to model datasets that have an arbitrary correlation. Also, the function f ∼ GP(m(x), k(x, x')) is often mean-centered to relax the computational complexity. Suppose we have a set of test inputs X* for which we are interested in computing the output predictions. This warrants sampling a set f* := [f(x_1*), ..., f(x_n*)] such that f* ∼ N(0, K(X*, X*)), with the mean and covariance

m(x) = 0

K(X*, X*) = [ k(x_1*, x_1*) ... k(x_1*, x_n*) ;  ... ;  k(x_n*, x_1*) ... k(x_n*, x_n*) ]

By the definition of a GP, the new and the previous outputs follow a joint Gaussian distribution:

[ y ; f* ] ∼ N( 0, [ K(X, X) + σ_ε^2 I , K(X, X*) ; K(X*, X) , K(X*, X*) ] )

where K(X, X) is the covariance matrix between all observed inputs, K(X*, X*) is the covariance matrix between the newly introduced inputs, K(X*, X) is the covariance matrix between the new inputs and the observed inputs, K(X, X*) is the covariance matrix between the observed points and the new inputs, and I is the identity matrix. Now, applying the rules for Gaussian conditionals, p(f*|y) can be shown to follow a normal distribution with:

mean(f*) = K(X*, X) (K(X, X) + σ_ε^2 I)^{-1} y    (5.63)

covariance(f*) = K(X*, X*) − K(X*, X) (K(X, X) + σ_ε^2 I)^{-1} K(X, X*)    (5.64)

The above equations are employed to make new predictions using the GPR.
Figure 5.6 shows the performance of GPR for predicting the density of sodium
silicate glasses with RBF kernel and Matern kernel. Note that GPR can also be
used to obtain the standard deviation which is not demonstrated in the figure. Code
Snippet 5.6 can be used to reproduce the results.
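For readers who prefer a from-scratch view, the following is a minimal NumPy sketch of the predictive equations (5.63)–(5.64) with a squared exponential kernel; the kernel hyperparameters, the noise level, and the toy dataset are illustrative assumptions.

import numpy as np

def sq_exp_kernel(X1, X2, length=1.0, sigma_f=1.0):
    # Squared exponential kernel, k(x, x') = sigma_f^2 exp(-0.5 ((x - x')/l)^2)
    d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(axis=2)
    return sigma_f**2 * np.exp(-0.5 * d2 / length**2)

def gpr_predict(X_train, y_train, X_test, noise=1e-2):
    K = sq_exp_kernel(X_train, X_train) + noise * np.eye(len(X_train))
    K_s = sq_exp_kernel(X_test, X_train)              # K(X*, X)
    K_ss = sq_exp_kernel(X_test, X_test)              # K(X*, X*)
    K_inv = np.linalg.inv(K)
    mean = K_s @ K_inv @ y_train                      # Eq. (5.63)
    cov = K_ss - K_s @ K_inv @ K_s.T                  # Eq. (5.64)
    std = np.sqrt(np.clip(np.diag(cov), 0.0, None))   # predictive standard deviation
    return mean, std

# Toy 1-D example (assumed data, for illustration only)
X_tr = np.linspace(0, 10, 15).reshape(-1, 1)
y_tr = np.sin(X_tr).ravel()
X_te = np.linspace(0, 10, 50).reshape(-1, 1)
mu, std = gpr_predict(X_tr, y_tr, X_te)
print(mu[:3], std[:3])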

Fig. 5.6 Gaussian process regression

Note: Authors recommend the readers to run the code with different values of test
set size (.test_size), random state for train test partition (.random_state),
different kernels, and also with varying hyperparameters according to the non-
linear kernels.

5.6 Summary

In this chapter, we discussed several non-parametric approaches to regression. We


started with simple decision trees, followed by random forest algorithm. We then
discussed multilayer perceptron, also known as the feedforward NNs. Following
this, we discussed the idea of support vector regression, both in the linear case and
the kernel SVR. It should be noted that the linear case of an SVR can be considered
as a parametric algorithm as the functional form and the number of parameters to be
fitted are fixed. However, in the case of kernel SVR neither the functional form nor the
parameters are fixed and is dependent on the chosen kernel function. Hence, kernel
SVR can be considered as a non-parametric approach. Finally, we discussed a non-
parametric approach namely, Gaussian process regression which performs regression
in a probabilistic framework. Advantage of GPR over other non-parametric method is
that the estimate corresponding to a predictor is associated with a standard deviation
as well. Thus, the reliability of a prediction can be estimated inherently using GPR.

""" GP r e g r e s s i o n
"""
# i m p o r t m a t p l o l i b for p l o t t i n g data
import matplotlib
m a t p l o t l i b . use ( ' agg ')
i m p o r t m a t p l o t l i b . p y p l o t as plt
from m a t p l o t l i b . p y p l o t i m p o r t s a v e f i g
# i m p o r t n u m p y and s k l e a r n
i m p o r t n u m p y as np
i m p o r t p a n d a s as pd
from s k l e a r n . m e t r i c s i m p o r t r 2 _ s c o r e
from s k l e a r n . m o d e l _ s e l e c t i o n i m p o r t t r a i n _ t e s t _ s p l i t
from s k l e a r n . g a u s s i a n _ p r o c e s s i m p o r t G a u s s i a n P r o c e s s R e g r e s s o r
from s k l e a r n . g a u s s i a n _ p r o c e s s . k e r n e l s i m p o r t RBF , Matern ,
RationalQuadratic , ExpSineSquared ,
DotProduct , ConstantKernel
# Load sa m p l e d a t a s e t
data = pd . r e a d _ c s v ( " data / N S _ d e n . csv " ) . s o r t _ v a l u e s ( by = " Na2O " )
p r i n t ( data . c o l u m n s )
X = data [ " Na2O " ] . v a l u e s . r e s h a p e ( - 1 , 1 )
y = data [ " D e n s i t y ( g / cm3 ) " ] . v a l u e s . r e s h a p e ( -1 , 1 )
# Split dataset
X_train , X_test , y_train , y _ t e s t = t r a i n _ t e s t _ s p l i t ( X , y , t e s t _ s i z e = 0 . 3
, r a n d o m _ s t a t e = 2020 )
o r d e r = np . a r g s o r t ( X _ t e s t [ : , 0 ] )
fig , axs_ = plt . s u b p l o t s ( 1 , 2 )
axs = axs_ . f l a t t e n ()
n a m e s = [ " RBF " , " M a t e r n " ]
for ind , k in e n u m e r a t e ( [ RBF () , M a t e r n () ] ) :
regr = G a u s s i a n P r o c e s s R e g r e s s o r ( k e r n e l =k , r a n d o m _ s t a t e = 0 )
regr . fit ( X_train , y _ t r a i n )
# Predict
y _ p r e d _ t e s t = regr . p r e d i c t ( X _ t e s t )
# Score
p r i n t ( " Test R2 ( k e r n e l = { } ) : " . f o r m a t ( n a m e s [ ind ] ) , r 2 _ s c o r e ( y_test
, y_pred_test ))
plt . sca ( axs [ ind ] )
plt . plot ( X _ t e s t [ order , 0 ] , y _ p r e d _ t e s t [ o r d e r ] , c = 'r ' , l a b e l = '{ }
k e r n e l '. f o r m a t ( n a m e s [ ind ] ) )
plt . s c a t t e r ( X_test , y_test , l a b e l = ' Test data p o i n t s ' , ec = " k " , fc = "
none " )
plt . l e g e n d ()
plt . x l a b e l ( " $ N a _ 2 O $ ( mol % ) " )
plt . y l a b e l ( " D e n s i t y ( g cm$ ^ { -3 } $ ) " )
s a v e f i g ( " g p _ r e g r e s s i o n . png " )
p r i n t ( " End of o u t p u t " )

Output:

Index([’Na2O’, ’SiO2’, ’Density (g/cm3)’], dtype=’object’)

See Fig. 5.6

Code snippet 5.6: Gaussian process regression for predicting the density of sodium
silicate glasses with RBF kernel and Matern kernel

References

1. C.M. Bishop, Pattern Recognition and Machine Learning (Information Science and Statistics)
(Springer, Berlin, Heidelberg, 2006). ISBN: 0387310738
2. K.P. Murphy, Machine Learning: A Probabilistic Perspective (The MIT Press, 2012). ISBN:
0262018020
3. T.M. Mitchell, Machine Learning, 1st edn. (McGraw-Hill, Inc., USA, 1997). ISBN:
0070428077
4. R.O. Duda, P.E. Hart, et al., Pattern Classification (Wiley, 2006)
5. J. Friedman, T. Hastie, R. Tibshirani, et al., The Elements of Statistical learning. Springer Series
in Statistics New York, vol. 1 (2001)
6. C.K. Williams, C.E. Rasmussen, Gaussian Processes for Machine Learning, vol. 2 (MIT Press
Cambridge, MA, 2006)
7. M. Bramer, Principles of Data Mining, 2nd edn. (Springer Publishing Company, Incorporated,
2013). ISBN: 1447148835
8. A. Blum, J. Hopcroft, R. Kannan, Foundations of Data Science (Cambridge University Press,
2020)
9. C.C. Aggarwal, Neural Networks and Deep Learning: A Textbook, 1st edn. (Springer Publishing
Company, Incorporated, 2018). ISBN: 3319944622
10. S. Russell, P. Norvig, Artificial Intelligence: A Modern Approach, 3rd edn. (Prentice Hall Press,
USA, 2009). ISBN: 0136042597
11. B. Schölkopf, A.J. Smola, F. Bach, et al., Learning with Kernels: Support Vector Machines,
Regularization, Optimization, and Beyond (MIT Press, 2002)
12. L. Rokach, O.Z. Maimon, Data Mining with Decision Trees: Theory and Applications, vol. 69
(World Scientific, 2007)
Chapter 6
Dimensionality Reduction and Clustering

Abstract Supervised learning approaches discussed thus far, classification and


regression, rely on learning a mapping between the input features and the output
labels based on ground truth data. This approach inherently assumes a label associ-
ated with each datapoint, which needs to be learnt. However, in many situations there
might not exist a label associated with the data, while there might be several features.
The goal in this case would be to group “similar” datapoints together or to identify
those minimal features that represent the data in a meaningful fashion. Such problems
can be handled by unsupervised ML algorithms such as clustering. In this chapter,
we will discuss various clustering algorithms. We will discuss how dimensionality
reduction can be achieved by unsupervised approaches such as principal component
analysis. We will also discuss algorithms such as k-means, Gaussian mixture model,
and t-SNE.

6.1 An Introduction to Unsupervised ML

Unsupervised machine learning is a branch of artificial intelligence that aims to


discover patterns, relationships, and structures in data without the use of predefined
labels or target variables. Unlike supervised learning, where models are trained on
labeled data to make predictions or classifications, unsupervised learning methods
explore the inherent structure within the data itself. These methods play crucial roles
in exploratory data analysis, dimensionality reduction, and identifying underlying
patterns and clusters within datasets.
This chapter focuses on two fundamental techniques in unsupervised machine
learning: dimensionality reduction and clustering. Dimensionality reduction is use-
ful to select the features that maximally represent the data, which can then be used for
other supervised ML tasks. Since the features are selected in a data-driven manner,
the feature selection remains unbiased. Principal Component Analysis (PCA) is a

Supplementary Information The online version contains supplementary material available at
https://doi.org/10.1007/978-3-031-44622-1_6.


popular technique for reducing the dimensionality of high-dimensional datasets while


preserving the most important information. By transforming the original features into
a new set of uncorrelated variables called principal components, PCA helps to capture
the most significant variations in the data. It provides a valuable tool for visualiz-
ing and understanding complex datasets by projecting them onto lower-dimensional
spaces. Clustering algorithms are another essential aspect of unsupervised learning.
They aim to group similar data points together based on their inherent similarities
or distances in the feature space. These approaches analyze the similarity between
the features to group similar datapoints together in a cluster. This allows one to
extrapolate properties of materials, for instance, materials in the same cluster should
behave similarly.
Among various clustering algorithms, the chapter covers the widely used.k-means
algorithm, which partitions the data into a pre-defined number of clusters. It also
delves into Gaussian mixture models (GMMs), which model data as a combina-
tion of Gaussian distributions, allowing for more flexible cluster shapes. Lastly, the
chapter introduces t-SNE (t-Distributed Stochastic Neighbor Embedding), a nonlin-
ear dimensionality reduction technique primarily used for visualization. Unlike PCA,
which focuses on preserving the global structure, t-SNE emphasizes the local struc-
ture and aims to project high-dimensional data into a lower-dimensional space while
preserving local similarities between data points. It is especially effective at revealing
clusters and patterns that may be challenging to identify using other methods.
Throughout this chapter, we will explore these unsupervised learning techniques’
principles, applications, strengths, and limitations. We will discuss their implemen-
tation, provide examples, and highlight considerations for selecting the most appro-
priate approach based on the specific characteristics and goals of the data analysis
task at hand. By understanding and applying these techniques, material scientists can
gain valuable insights into their datasets’ underlying structure and patterns, facili-
tating further analysis, decision-making, and knowledge discovery. For an in-depth
discussion on these methodologies, readers are directed to References [1–8].

6.2 Principal Component Analysis

Principal Component Analysis (PCA) is a widely used method for dimensionality


reduction, lossy data compression, and feature extraction. PCA can be defined as
the orthogonal projection of the data onto a lower dimensional subspace (known as
principal subspace) where the variance is maximized (Hotelling, 1933). Alternatively,
it can be defined as the linear projection that minimizes the average projection cost
defined as the mean squared distance between the data and their projections (Pearson,
1901).
Consider a dataset {x_1, ..., x_n, ..., x_N}, where x_n is a Euclidean variable with dimension D. Our goal is to project the data onto a space having dimensionality M < D. For the moment, let us assume that M is given and, to begin with, consider the projection onto a one-dimensional space (M = 1). We can define the direction of this space using a D-dimensional unit vector u_1 so that u_1^T u_1 = 1. Note that we are only interested in the direction defined by u_1, not in the magnitude of u_1 itself. The mean of the projected data is u_1^T x̄, where x̄ is the sample mean given by

x̄ = (1/N) ∑_{n=1}^{N} x_n    (6.1)

The variance of the projected data is

(1/N) ∑_{n=1}^{N} { u_1^T x_n − u_1^T x̄ }^2 = u_1^T S u_1    (6.2)

The data covariance matrix is computed as

S = (1/N) ∑_{n=1}^{N} (x_n − x̄)(x_n − x̄)^T    (6.3)

We now maximize the projected variance u_1^T S u_1 with respect to u_1, subject to u_1^T u_1 = 1. Introducing a Lagrange multiplier λ_1,

L = u_1^T S u_1 + λ_1 (1 − u_1^T u_1)    (6.4)

∂L/∂u_1 = 0  =⇒  S u_1 = λ_1 u_1    (6.5)

That is, u_1 must be an eigenvector of S. If we left-multiply (6.5) by u_1^T and use the relation u_1^T u_1 = 1, the following is obtained:

u_1^T S u_1 = λ_1    (6.6)

That is, the variance is maximized when we set u_1 equal to the eigenvector having the largest eigenvalue λ_1. This eigenvector is known as the first principal component. We can define additional principal components in an incremental fashion by choosing each new direction to be the one that maximizes the projected variance, until the desired variance is achieved. The optimal linear projection for which the variance of the projected data is maximized is thus defined by the M eigenvectors u_1, ..., u_M of the data covariance matrix S corresponding to the M largest eigenvalues λ_1, ..., λ_M.
Figure 6.1 shows the explained variance with an increasing number of principal components. It may be noted that with 10 components a variance of 100% is achieved. The corresponding R^2 values of a linear regression that uses the principal components as input features are also shown. Code Snippet 6.1 shows the code to reproduce the results of the principal component analysis.
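The eigen-decomposition route of Eqs. (6.3)–(6.6) can also be written compactly in NumPy, as in the illustrative sketch below (random data assumed).

import numpy as np

def pca_components(X, M):
    """Return the top-M principal components (eigenvectors of the covariance matrix)."""
    Xc = X - X.mean(axis=0)                  # centre the data, Eq. (6.1)
    S = (Xc.T @ Xc) / X.shape[0]             # covariance matrix, Eq. (6.3)
    eigvals, eigvecs = np.linalg.eigh(S)     # symmetric eigen-decomposition
    idx = np.argsort(eigvals)[::-1][:M]      # M largest eigenvalues, Eqs. (6.5)-(6.6)
    return eigvecs[:, idx], eigvals[idx]

X = np.random.default_rng(0).normal(size=(200, 5))
U, lam = pca_components(X, M=2)
Z = (X - X.mean(axis=0)) @ U                 # projected (score) data
print(U.shape, lam, Z.shape)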

""" PCA
"""
# i m p o r t m a t p l o l i b for p l o t t i n g data
import matplotlib
m a t p l o t l i b . use ( ' agg ')
i m p o r t m a t p l o t l i b . p y p l o t as plt
from m a t p l o t l i b . p y p l o t i m p o r t s a v e f i g
# i m p o r t n u m p y and s k l e a r n
i m p o r t n u m p y as np
i m p o r t p a n d a s as pd
from s k l e a r n . d e c o m p o s i t i o n i m p o r t PCA
from s k l e a r n . m e t r i c s i m p o r t r 2 _ s c o r e
from s k l e a r n . l i n e a r _ m o d e l i m p o r t L i n e a r R e g r e s s i o n
from s k l e a r n . p r e p r o c e s s i n g i m p o r t S t a n d a r d S c a l e r
s c a l e r = S t a n d a r d S c a l e r ()
# Load sa m p l e d a t a s e t
data = pd . r e a d _ c s v ( " data / f u l l _ d e n . csv " ) . s o r t _ v a l u e s ( by = " Na2O " )
p r i n t ( data . c o l u m n s )
X = data [ data . c o l u m n s [ : - 1 ] ] . v a l u e s
y = data . v a l u e s [ : ,- 1 : ]
p r i n t ( " N u m b e r of f e a t u r e s : " , X . s h a p e [ 1 ] )
# Scale data using s t a n d a r d s c a l e r
s c a l e r . fit ( X )
X_ = s c a l e r . t r a n s f o r m ( X )
ns = [ 1 , 2 , 3 ,4 ,5 , 6 , 7 , 8 , 9 , 10 , 11 ]
var = [ ]
Xs_pca = []
R2 = [ ]
for n _ c o m p o n e n t s in ns :
pca = PCA ( n _ c o m p o n e n t s = n _ c o m p o n e n t s )
X s _ p c a + = [ pca . f i t _ t r a n s f o r m ( X_ ) ]
var + = [ sum ( pca . e x p l a i n e d _ v a r i a n c e _ r a t i o _ ) * 100 ]
regr = L i n e a r R e g r e s s i o n () . fit ( X s _ p c a [ - 1 ] , y )
R2 + = [ r 2 _ s c o r e ( y , regr . p r e d i c t ( X s _ p c a [ - 1 ] ) ) ]
fig , axs = plt . s u b p l o t s ( 1 , 2 )
plt . sca ( axs [ 0 ] )
plt . plot ( ns , var , " -- k " )
plt . s c a t t e r ( ns , var , s = 60 , c = " k " , fc = " none " , ec = " k " )
plt . x t i c k s ( [2 ,4 , 6 , 8 , 10 ] )
plt . x l a b e l ( " N u m b e r of f e a t u r e s " )
plt . y l a b e l ( " V a r i a n c e ( % ) " )
plt . sca ( axs [ 1 ] )
plt . plot ( ns , R2 , " -- k " )
plt . s c a t t e r ( ns , R2 , s = 60 , c = " k " , fc = " none " , ec = " k " )
plt . x t i c k s ( [2 ,4 , 6 , 8 , 10 ] )
plt . x l a b e l ( " N u m b e r of f e a t u r e s " )
plt . y l a b e l ( " R$ ^ 2$ " )
s a v e f i g ( " p c a e x a . png " )
p r i n t ( " End of o u t p u t " )

Output:
Index(['Al2O3', 'B2O3', 'CaO', 'Fe2O3', 'FeO', 'MgO', 'Na2O', 'P2O5', 'TeO2',
'TiO2', 'ZrO2', 'Density'],
dtype='object')
Number of features: 11
End of output

See Fig. 6.1

Code snippet 6.1: Principal component analysis



Fig. 6.1 Principal component analysis

Note: Authors recommend the readers to run the code with different values of test
set size (.test_size), random state for train test partition (.random_state)
and also with varying hyperparameters associated with PCA.

6.3 k-Means Clustering

In .k means clustering, we begin by considering the problem of identifying groups,


or clusters, of data points in a multidimensional space. Suppose we have a data set
.{x 1 , . . . , x N } consisting of . N observations of a random . D-dimensional Euclidean

variable .x. Our goal is to partition the data set into some number . K of clusters,
where we shall suppose for the moment that the value of . K is given. Intuitively,
we might think of a cluster as comprising a group of data points whose inter-point
distances are small compared with the distances to points outside of the cluster. Let
.μk , where .k = 1, . . . , K , be a prototype (representing the centres of the clusters)

associated with the .kth cluster. For each data point .xn , we introduce a corresponding
set of binary indicator variables .rnk ∈ {0, 1}, where .k = 1, . . . , K describing which
of the . K clusters the data point .xn is assigned. If data point .xn is assigned to cluster
k, then r_nk = 1 and r_nj = 0 for j ≠ k. This is known as the 1-of-K coding scheme. We then minimize the distortion measure given by

J = ∑_{n=1}^{N} ∑_{k=1}^{K} r_nk ||x_n − μ_k||^2    (6.7)

which represents the sum of the squares of the distances of each data point to its
assigned vector .μk . Our goal is to find values for the .rnk and the .μk so as to minimize
. J . We can perform the optimization through an iterative procedure in which each
iteration involves two successive steps corresponding to successive optimizations
with respect to the .rnk and the .μk . First we choose some initial values for the .μk .
Then we minimize. J with respect to the.rnk , keeping the.μk fixed. In the second phase
we minimize . J with respect to the .μk keeping .rnk fixed. This two-stage optimization
is then repeated until convergence. Consider first the determination of the r_nk. Because J in (6.7) is a linear function of r_nk, this optimization can be performed easily to give a closed-form solution. The terms involving different n are independent, and so we can optimize for each n separately by choosing r_nk to be 1 for whichever value of k gives the minimum value of ||x_n − μ_k||^2. That is,

r_nk = 1 if k = arg min_j ||x_n − μ_j||^2, and 0 otherwise    (6.8)

Now consider the optimization of the μ_k with the r_nk held fixed. The objective function J is a quadratic function of μ_k, and it can be minimized by setting its derivative with respect to μ_k to zero, giving

2 ∑_{n=1}^{N} r_nk (x_n − μ_k) = 0    (6.9)

=⇒  μ_k = ∑_{n=1}^{N} r_nk x_n / ∑_{n=1}^{N} r_nk    (6.10)

The denominator in this expression is equal to the number of points assigned to cluster k, and so this result has a simple interpretation: set μ_k equal to the mean of all of the data points x_n assigned to cluster k. For this reason, the procedure is known as the k-means algorithm.
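A minimal NumPy sketch of this two-step iteration, assignment via Eq. (6.8) followed by the mean update of Eq. (6.10), is given below; the initialization and stopping rule are simple illustrative choices, and empty clusters are not handled.

import numpy as np

def kmeans(X, K, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    mu = X[rng.choice(len(X), size=K, replace=False)]   # initialise centres
    for _ in range(n_iter):
        # Assignment step: r_nk = 1 for the nearest centre, Eq. (6.8)
        d2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # Update step: each centre becomes the mean of its cluster, Eq. (6.10)
        new_mu = np.array([X[labels == k].mean(axis=0) for k in range(K)])
        if np.allclose(new_mu, mu):
            break
        mu = new_mu
    return labels, mu

X = np.vstack([np.random.default_rng(1).normal(loc=c, size=(50, 2)) for c in (0, 5, 10)])
labels, centres = kmeans(X, K=3)
print(centres)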
Figure 6.2 shows the SSE as a function of the number of clusters, as used in the elbow method for selecting the number of clusters. Further, a t-SNE embedding is used to visualize the clusters. Code Snippet 6.2 can be used to perform k-means clustering for a varying number of clusters, following which a t-SNE embedding is performed to visualize the clusters in two dimensions.

Note: Authors recommend the readers to run the code with different values of
varying hyperparameters associated with k-means and t-SNE.

""" K - means and TSNE


"""
# i m p o r t m a t p l o l i b for p l o t t i n g data
import matplotlib
m a t p l o t l i b . use ( ' agg ')
i m p o r t m a t p l o t l i b . p y p l o t as plt
from m a t p l o t l i b . p y p l o t i m p o r t s a v e f i g
# i m p o r t n u m p y and s k l e a r n
i m p o r t n u m p y as np
i m p o r t p a n d a s as pd
from s k l e a r n . c l u s t e r i m p o r t K M e a n s
from s k l e a r n . m a n i f o l d i m p o r t TSNE
# Load sa m p l e d a t a s e t
data = pd . r e a d _ c s v ( " data / f u l l _ d e n . csv " ) . s o r t _ v a l u e s ( by = " Na2O " )
p r i n t ( data . c o l u m n s )
X = data [ data . c o l u m n s [ : - 1 ] ] . v a l u e s
y = data . v a l u e s [ : , - 1 : ]
p r i n t ( " N u m b e r of f e a t u r e s : " , X . s h a p e [ 1 ] )
sse = [ ]
ks = list ( r a n g e ( 1 , 11 ) )
for k in ks :
k m e a n s = K M e a n s ( n _ c l u s t e r s =k , m a x _ i t e r = 1000 ) . fit ( X )
sse + = [ k m e a n s . i n e r t i a _ ]
l a b e l s = K M e a n s ( n _ c l u s t e r s =3 , m a x _ i t e r = 1000 ) . fit ( X ) . l a b e l s _
X _ e m b e d d e d = TSNE ( n _ c o m p o n e n t s =2 , e a r l y _ e x a g g e r a t i o n = 12 , p e r p l e x i t y = 100
). fit_transform (X)
fig , axs = plt . s u b p l o t s ( 1 , 2 )
plt . sca ( axs [ 0 ] )
plt . plot ( ks , sse , " -- k " )
plt . s c a t t e r ( ks , sse , s = 60 , c = " k " , fc = " none " , ec = " k " )
plt . x t i c k s ( [2 ,4 , 6 , 8 , 10 ] )
plt . x l a b e l ( " N u m b e r of c l u s t e r s " )
plt . y l a b e l ( " SSE " )
i m p o r t m a t p l o t l i b . t i c k e r as m t i c k
plt . gca () . y a x i s . s e t _ m a j o r _ f o r m a t t e r ( m t i c k . F o r m a t S t r F o r m a t t e r ( '% . 0e ') )
plt . sca ( axs [ 1 ] )
colors = [" brown " , " darkblue " , " orange "]
for i in np . u n i q u e ( l a b e l s ) :
mask = l a b e l s = = i
plt . s c a t t e r ( X _ e m b e d d e d [ mask , 0 ] , X _ e m b e d d e d [ mask , 1 ] , s = 60 , c = c o l o r s [
i] , label = " Cluster {}". format
(i+1))
plt . x l a b e l ( " X1 " )
plt . y l a b e l ( " X2 " )
plt . l e g e n d ()
s a v e f i g ( " k m e a n s _ t s n e . png " )
p r i n t ( " End of o u t p u t " )

Output:
Index(['Al2O3', 'B2O3', 'CaO', 'Fe2O3', 'FeO', 'MgO', 'Na2O', 'P2O5', 'TeO2',
'TiO2', 'ZrO2', 'Density'],
dtype='object')
Number of features: 11
End of output

See Fig. 6.2

Code snippet 6.2: K-means clustering. Number of clusters and visualisation using
t-SNE

Fig. 6.2 K-means clustering

6.4 Gaussian Mixture Model

The general structure of a mixture model with K base distributions is given as

p(x_i|θ) = ∑_{k=1}^{K} π_k p_k(x_i|θ)    (6.11)

Equation (6.11) is a convex combination of the p_k(x_i|θ)'s, since we are taking a weighted sum where the mixing weights π_k satisfy 0 ≤ π_k ≤ 1 and ∑_{k=1}^{K} π_k = 1. However, the structure of the model (6.11) does not indicate the cluster to which each data point belongs. To achieve that, we additionally introduce latent or hidden variables to capture the cluster attribution. Let x_i be the visible or observable variables and z_i be a latent variable capturing the cluster attribution. The GMM then has the following form:

p(x_i|θ) = ∑_{k=1}^{K} π_k N(x_i | μ_k, Σ_k)    (6.12)

Let us introduce a K-dimensional binary random variable z having a 1-of-K representation, in which a particular element z_k is equal to 1 and all other elements are equal to 0. The values of z_k therefore satisfy z_k ∈ {0, 1} and ∑_k z_k = 1, and we see that there are K possible states for the vector z according to which element is nonzero. We shall define the joint distribution p(x, z) in terms of a marginal distribution p(z) and a conditional distribution p(x|z), so that p(x, z) = p(x|z) p(z). The marginal distribution over z is specified in terms of the mixing coefficients π_k, such that

. p(z k = 1) = πk (6.13)

The parameters π_k satisfy:

(i) 0 ≤ π_k ≤ 1,  (ii) ∑_{k=1}^{K} π_k = 1    (6.14)

Equation (6.13) defines a multinoulli (categorical, generalized Bernoulli) distribution. The parameters specifying the probabilities of each possible outcome are constrained only by the fact that each must be in the range 0–1 and all must sum to 1. Using the 1-of-K representation of z, Eq. (6.13) can be rewritten as:

p(z_1, ..., z_K) = p(z) = π_1^{z_1} π_2^{z_2} ... π_K^{z_K} = ∏_{k=1}^{K} π_k^{z_k}    (6.15)

Similarly, the conditional distribution of x given z_k = 1 is a Gaussian:

p(x | z_k = 1) = N(x | μ_k, Σ_k)    (6.16)

Hence, p(x | z_1, ..., z_K) = p(x|z):

p(x|z) = ∏_{k=1}^{K} N(x | μ_k, Σ_k)^{z_k}    (6.17)

Then p(x) is computed by marginalization (P(A) = ∑_n p(A|B_n) P(B_n)):

p(x) = ∑_z p(z) p(x|z) = ∑_{z_1,...,z_K} ∏_{k=1}^{K} π_k^{z_k} N(x | μ_k, Σ_k)^{z_k}    (6.18)

Since only one k out of K is active (z_k = 1 for that component),

p(x) = ∑_{k=1}^{K} π_k N(x | μ_k, Σ_k)    (6.19)

The responsibility of a component k, γ(z_k), in a GMM is:

γ(z_k) = p(z_k = 1 | x) = p(x | z_k = 1) p(z_k = 1) / p(x)    (6.20)
       = p(x | z_k = 1) p(z_k = 1) / ∑_{j=1}^{K} p(z_j = 1) p(x | z_j = 1)    (6.21)


(By Bayes theorem: . p(A|B) = p(B|A) p(A)
p(B)
, P(B) = n p(B|An )An ))

πk N (x|μk , ∑k )
γ (z k ) = ∑ K
. (6.22)
k=1 pi k N (x|μk , ∑k )

The quantity .γ (z k ) is the corresponding posterior probability once we have observed


x. .γ (z k ) can also be viewed as the responsibility that component .k takes for ‘explain-
.
ing’ the observation .x. Suppose we have a data set of observations .{x1 , . . . , x N }, and
we wish to model this data using a mixture of Gaussians. Let data set be . N × D
matrix . X in which the .nth row is given by .xnT and the corresponding latent variables
will be denoted by an . N × K matrix . Z with rows .z nT . The log likelihood function is
represented as:
{ K }

N ∑
.ln p(X |π, μ, ∑) = ln πk N (x|μk , ∑k ) (6.23)
i=1 k=1
{ }

N ∑
K
. = ln πk N (x|μk , ∑k ) (6.24)
i=1 k=1

Maximizing the log likelihood function (6.15) for a GMM turns out to be a more
complex problem than for the case of a single Gaussian. The difficulty arises from the
presence of the summation over .k that appears inside the logarithm in (6.15), so that
the logarithm function no longer acts directly on the Gaussian. If we set the derivatives
of the log likelihood to zero, we will no longer obtain a closed form solution. A
solution to this the iterative MLE approach such as Expectation Maximization (EM).
EM is an elegant and powerful method for finding maximum likelihood solutions
for models with latent variables is called the expectation-maximization algorithm,
or EM algorithm (Dempster et al., 1977; McLachlan and Krishnan, 1997). The EM
algorithm is used to find (local) maximum likelihood parameters of a statistical
model in cases where the equations cannot be solved directly. The EM algorithm
proceeds from the observation that there is a way to solve these two sets of equations
numerically:
1. one can simply pick arbitrary values for one of the two sets of unknowns, use
them to estimate the second set
2. then use these new values to find a better estimate of the first set
3. then keep alternating between the two until the resulting values both converge to
fixed points.
We denote the set of all observed data by . X , in which the .nth row represents .xnT , and
similarly we denote the set of all latent variables by . Z , with a corresponding row .z nT .
The structure of loglikelihood becomes:
{ }

ln p(X |θ ) = ln
. p(X, Z |θ ) (6.25)
Z

Equation (6.16) can be modified by replacing the sum over . Z with an integral for
continuous latent variables. A key observation is that {the summation over the}latent
∑N ∑K
variables appears inside the logarithm as . i=1 ln k=1 πk N (x|μk , ∑k ) . The
presence of the sum prevents the logarithm makes complicated expressions for the
maximum likelihood solution (Eg..log(x1 + · · · + xn ) /= log(x1 )) + · · · + log(xn )).
Now suppose that, for each observation in . X , we were told the corresponding
value of the latent variable . Z . Then we can call .{X, Z } the complete data set, and
we refer to the actual observed data . X as incomplete. The likelihood function for the
complete data set simply takes the form .ln p(X, Z |θ ). In practice, we are not given
the complete data set .{X, Z }, but only the incomplete data . X , State of knowledge
of the values of the latent variables in . Z is given only by the posterior distribution
. p(Z |X, θ ). As we cannot use the complete-data log likelihood, we consider instead its
expected value under the posterior distribution of the latent variable, using.θ old which
corresponds E step of the EM algorithm. In the subsequent M step, we maximize
this expectation to obtain .θ new . This is iteratively performed until convergence Given
. p(X, Z |θ ) over observed variables . X and latent variables . Z , governed by parameters

.θ , the goal is: .arg max p(X |θ ). The EM algorithm steps are illustrated below:
θ

1. Choose an initial setting for the parameters .θ old∑


. p(Z |X, θ ), . Q(θ, θ old ) := Z p(Z |X, θ old ) ln p(X, Z |θ )
old
2. E step: evaluate∑
(Recall . E(x) = x x p(x))
3. M step: evaluate .θ new as:

.θ new = arg max Q(θ, θ old ) (6.26)


θ

4. Check for convergence of the log likelihood/the parameter values. If the conver-
gence criterion is not satisfied, then:

θ old ← θ new
. (6.27)

and return to step 2.


The EM algorithm can also be applied when the unobserved variables correspond
to missing values in the data set. The distribution of the observed values is obtained
by taking the joint distribution of all the variables and then marginalizing the missing
ones. We now consider the application of this latent variable view of EM to the specific
case of a Gaussian mixture model.
{∑ Our goal is to maximize
} the log likelihood function
∑N K
.ln p(X |π, μ, ∑) = i=1 ln k=1 πk N (x|μk , ∑k ) , which is computed using the
observed data set. X . Then maximizing the likelihood for the complete data set.{X, Z }
becomes considering all the data points:


N ∏
K
. p(X, Z |π, μ, ∑) = πkznk N (xn |μk , ∑k )znk (6.28)
n=1 k=1

Then the log likelihood for the complete data set .{X, Z } (one of the terms of the . Q
function) becomes considering all the data points:


N ∑
K
.ln p(X, Z |π, μ, ∑) = z nk {ln πk + ln N (xn |μk , ∑k )} (6.29)
n=1 k=1

∑N {∑ }
K
Now let us compare (6.21) with (6.15) (. i=1 ln π
k=1 k N (x|μ k , ∑ )
k ), no log-
arithm inside summation. So explicit introduction of the hidden variables (.z nk )
removed the logarithm inside summation, but the difficulty is that we need to estimate
then for all data points.

. Q(θ, θ old ) = p(Z |X, θ old )ln p(X, Z |π, μ, ∑) (6.30)
Z


N ∑
K
. = p(z nk |xn , θ old )z nk {ln πk + ln N (xn |μk , ∑k )} (6.31)
n=1 k=1

we are interested the case only when .z nk = 1, as other is trivial case


N ∑
K
. = γ (z nk )z nk {ln πk + ln N (xn |μk , ∑k )} (6.32)
n=1 k=1

Recall: .γ (z k ) = p(z k = 1|x)


N ∑
K
. = γ (z nk ) {ln πk + ln N (xn |μk , ∑k )} (6.33)
n=1 k=1

We need to maximize (6.24) for obtaining the parameters (M-step).


In the E-step we evaluate the responsibilities using the current parameter values

πk N (xn |μk , ∑k )
γ (z nk ) = ∑ K
. (6.34)
k=1 πk N (x n |μk , ∑k )

∑N ∑K
Having got a fixed .γ (z nk ) maximize . n=1 k=1 γ (z nk ) {ln πk + ln N (x n |μk , ∑k )}
(6.24). .πk do not depend on .μk and .∑k , we can eliminate that term for calculating
∑N ∑K
.μk and .∑k , i.e, maximize . n=1 k=1 γ (z nk )ln N (x n |μk , ∑k )


N ∑
K
. γ (z nk )ln N (xn |μk , ∑k ) (6.35)
n=1 k=1


N { ( )}
1 (xn − μk )2
. γ (z nk )ln √ ex p − (6.36)
n=1
2π ∑k 2∑k
∑ ∑K { ∑ }
N
√ N
(xn − μ)2
. γ (z nk ) − ln( 2π ∑k ) − (6.37)
i=1 k=1 i=1
2∑k

Taking derivative w.r.t .μk and equating to zero


{ }
∂ ∑
N
(xn2 − 2xn μk + μ2k )2
. γ (z nk ) − =0 (6.38)
∂μk i=1 2∑k

N ∑
N
. = γ (z nk )xn = μk γ (z nk ) (6.39)
i=1 i=1
∑N
γ (z nk )xn
. =⇒ μk = ∑i=1
N
(6.40)
i=1 γ (z nk )

Taking derivative w.r.t .∑k and equating to zero


{ √ ∑
N ∑
N }
∂ (xn − μk )2
. − ln( 2π ∑k ) γ (z nk ) − γ (z nk ) =0 (6.41)
∂∑k i=1 i=1
2∑k
{ ∑N ∑N }
∂ 1 (xn − μk )2
. − ln(∑k ) γ (z nk ) − γ (z nk ) =0 (6.42)
∂∑k 2 i=1 i=1
2∑k

1 1 ∑ ∑
N N
(xn − μk )2
. − γ (z nk ) − γ (z nk ) (−1) = 0 (6.43)
2 ∑k i=1 i=1
2∑k2
∑N
γ (z nk )(xn − μk )2
. =⇒ ∑k = i=1∑ N (6.44)
i=1 γ (z nk )

∑K
π calculation should include the constraint . i=1
. k πk = 1. Also, we can eliminate the
terms corresponding to .μk and .∑k , as they are independent of .πk . The Lagrangian is
computed as:
( )

N ∑
K ∑
K
. L= γ (z nk ) ln πk + λ 1 − πk (6.45)
n=1 k=1 k=1
∑N
∂L n=1 γ (z nk )
. = −λ=0 (6.46)
∂πk πk
∑N ∑N
γ (z nk ) γ (z nk )
. =⇒ λ = n=1
=⇒ πk = n=1
(6.47)
πk λ
∑K
As . k=1 πk = 1, the following will hold
∑N

K
1 ∑∑
K N
1 γ (z nk )
. πk = γ (z nk ) = N = 1 =⇒ πk = n=1
. (6.48)
k=1
λ k=1 n=1 λ N

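To summarize the EM updates derived above, the following is a minimal NumPy sketch of EM for a one-dimensional GMM, using the responsibilities of Eq. (6.34) in the E-step and the updates of Eqs. (6.40), (6.44), and (6.48) in the M-step; the toy data, initialization, and fixed iteration count are illustrative assumptions.

import numpy as np

def gmm_em_1d(x, K=2, n_iter=100, seed=0):
    """EM for a one-dimensional Gaussian mixture model."""
    rng = np.random.default_rng(seed)
    mu = rng.choice(x, size=K, replace=False)     # initial means
    var = np.full(K, x.var())                     # initial variances
    pi = np.full(K, 1.0 / K)                      # initial mixing weights
    for _ in range(n_iter):
        # E-step: responsibilities gamma(z_nk), Eq. (6.34)
        dens = pi * np.exp(-0.5 * (x[:, None] - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)
        gamma = dens / dens.sum(axis=1, keepdims=True)
        # M-step
        Nk = gamma.sum(axis=0)
        mu = (gamma * x[:, None]).sum(axis=0) / Nk                  # Eq. (6.40)
        var = (gamma * (x[:, None] - mu) ** 2).sum(axis=0) / Nk     # Eq. (6.44)
        pi = Nk / len(x)                                            # Eq. (6.48)
    return pi, mu, var

# Toy data: a mixture of two 1-D Gaussians (assumed data, for illustration only)
rng = np.random.default_rng(1)
x = np.concatenate([rng.normal(-2.0, 1.0, 300), rng.normal(3.0, 0.5, 200)])
print(gmm_em_1d(x, K=2))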
6.5 t-Distributed Stochastic Neighbor Embedding

Higher-dimensional data are present in various real-world problems in different


domains, and getting a sense of that data is a complex problem. t-distributed Stochas-
tic Neighbor Embedding (t-SNE) is a machine learning algorithm that visualizes
higher-dimensional data in a two-dimensional or three-dimensional space. This tech-
nique models higher dimensional data such that Higher-dimensional objects are mod-
eled such that objects with similar features have a higher probability of outputting
as nearby points, whereas objects with different features have a higher probability
of being outputted as far-away points. Unlike Principal Component Analysis which
is concerned with preserving large pairwise distances to maximize variance, t-SNE
preserves small pairwise distances and local similarities. This algorithm consists of
two steps:
• Step 1: Pairs of similar objects are assigned higher probabilities, and dissimilar
objects are assigned lower probabilities with a probability function in the higher
dimensional space by building a probability distribution function.
• Step 2: A similar probability distribution function is defined in the lower dimen-
sional space. The cost function is defined as Kullback–Leibler divergence (KL
divergence) between the two distributions, which is then minimized.
The t-distributed Stochastic Neighbour Embedding algorithm is an improvement
on the original Stochastic Neighbour Embedding algorithm by using a student t-
distributed, which is also known as Cauchy Distribution to perform a similarity
measurement in lower dimensional data space, which enables it to preserve local
distances and improve on optimization.
To represent similarities in the higher-dimensional space, the algorithm converts the higher-dimensional Euclidean distances into conditional probabilities. The similarity of datapoint x_i to datapoint x_j is the conditional probability p_{j|i} that x_i would pick x_j as its neighbor if neighbors were picked in proportion to their probability density under a Gaussian centered at x_i. To calculate this conditional probability, a Gaussian distribution is centered over each data point x_i. Using this Gaussian distribution, we find the density of all other points j, and the distribution is then normalized over all these points. Mathematically, it can be calculated with the following equation:

p_{j|i} = \frac{\exp\left(-\|x_i - x_j\|^2 / 2\sigma_i^2\right)}{\sum_{k \neq i} \exp\left(-\|x_i - x_k\|^2 / 2\sigma_i^2\right)}     (6.49)

where σ_i² is the variance of the Gaussian distribution centered on the datapoint x_i. The resulting distributions give a representation of the local similarities of these data points. These distributions are influenced by the perplexity parameter, which controls the distribution variance.
In the above step, we modeled a Gaussian distribution around each of the data points; doing the same in the lower-dimensional space would give the standard Stochastic Neighbour Embedding technique. In this algorithm, however, a Student t-distribution with one degree of freedom (the Cauchy distribution) is used instead. This distinguishes the method from other local techniques for multi-dimensional mapping, which suffer from the crowding problem: the area of the two-dimensional map available to accommodate moderately distant data points is not nearly large enough compared with the area available to accommodate nearby data points. The heavy tails of the Student t-distribution allow moderately distant points to be modeled farther apart in the map, thus preserving local distances. As in the first step, this distribution is again centered on data point y_i, the density of all other data points y_j is calculated, and the distribution is normalized. Mathematically, this distribution q_{j|i} can be calculated with the help of the following equation:

q_{j|i} = \frac{\left(1 + \|y_i - y_j\|^2\right)^{-1}}{\sum_{k \neq i} \left(1 + \|y_i - y_k\|^2\right)^{-1}}     (6.50)

This student t-distribution with a single degree of freedom is helpful as the function
(1 + ||yi − y j ||2 )−1 closely resembles inverse square law for large pairwise distances
.

in the lower-dimensional space. Hence, for large distances, the mapping distances do
not change mainly as compared to the Gaussian distribution. Thus, far-apart clusters
have an interrelation of the same order as close individual points. Computationally,
this is far less costly than the exponential function.
We have calculated the probability distributions in the higher-dimensional space using a Gaussian distribution and in the lower-dimensional space using a Student t-distribution with one degree of freedom. The next objective is to minimize the difference between these distributions such that the data points in the two map structures are as similar as possible. This will make the distribution in the lower-dimensional space mirror the higher-dimensional data space. The Kullback–Leibler divergence measures the faithfulness with which q_{j|i} models p_{j|i}. This quantity is thus a good choice for
the cost function, which can be written mathematically as in the following equation:

C = \sum_i KL(P_i \| Q_i) = \sum_i \sum_j p_{ij} \log \frac{p_{ij}}{q_{ij}}     (6.51)

where the values of p_{ij} are obtained as symmetrized conditional probabilities, calculated mathematically as

p_{ij} = \frac{p_{i|j} + p_{j|i}}{2}     (6.52)
This cost function is a measure of the information gained by substituting the probability distribution q_{ij} with the probability distribution p_{ij}, or the information lost when the distribution q_{ij} is used to model p_{ij}. The gradient of this cost function causes dissimilar points that are represented by small pairwise distances in the low-dimensional space to move away from each other, so that dissimilar points end up far apart in the lower-dimensional space.
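As a compact illustrative sketch (our addition, not one of the book's numbered snippets), the quantities in Eqs. (6.49)–(6.52) and the KL cost of Eq. (6.51) can be computed in NumPy as follows; the tiny random dataset, the fixed σ, and the additional normalization of the symmetrized p_ij over all pairs (as in the original t-SNE formulation) are assumptions for demonstration.

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 4))      # high-dimensional points
Y = rng.normal(size=(6, 2))      # candidate low-dimensional embedding
sigma = 1.0

def pairwise_sq_dists(Z):
    diff = Z[:, None, :] - Z[None, :, :]
    return (diff ** 2).sum(-1)

# Conditional probabilities p_{j|i} with a Gaussian kernel (Eq. 6.49)
D = pairwise_sq_dists(X)
P_cond = np.exp(-D / (2 * sigma ** 2))
np.fill_diagonal(P_cond, 0.0)
P_cond /= P_cond.sum(axis=1, keepdims=True)
# Symmetrized p_ij (Eq. 6.52), here further normalized over all pairs
P = (P_cond + P_cond.T) / (2 * X.shape[0])

# Low-dimensional affinities with the Student t (Cauchy) kernel (Eq. 6.50)
Q = 1.0 / (1.0 + pairwise_sq_dists(Y))
np.fill_diagonal(Q, 0.0)
Q /= Q.sum()

# KL-divergence cost (Eq. 6.51), summed over the off-diagonal pairs
mask = P > 0
C = np.sum(P[mask] * np.log(P[mask] / Q[mask]))
print("KL cost:", C)

In t-SNE proper, this cost is minimized with gradient descent over the embedding coordinates Y, which is what the library call in Code Snippet 6.3 does internally.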
Figure 6.3 shows the t-SNE embedding in two dimensions for sodium silicate and calcium aluminosilicate glasses. Code Snippet 6.3 can be used to perform the t-SNE embedding.

Note: The authors recommend that readers run the code with different values of the hyperparameters associated with t-SNE.

Fig. 6.3 t-SNE embedding in two dimensions for a dataset comprising sodium silicate and calcium aluminosilicate glasses

""" TSNE
"""
# i m p o r t m a t p l o l i b for p l o t t i n g data
import matplotlib
from s h a d o w . font i m p o r t s e t _ f o n t _ f a m i l y , s e t _ f o n t _ s i z e
m a t p l o t l i b . use ( ' agg ')
i m p o r t m a t p l o t l i b . p y p l o t as plt
from m a t p l o t l i b . p y p l o t i m p o r t s a v e f i g
# i m p o r t n u m p y and s k l e a r n
i m p o r t n u m p y as np
i m p o r t p a n d a s as pd
from s k l e a r n . m a n i f o l d i m p o r t TSNE
# Load sa m p l e d a t a s e t
data = pd . r e a d _ c s v ( " data / e l a s t i c _ m o d u l u s . csv " ) . s o r t _ v a l u e s ( by = " Na2O " )
p r i n t ( data . c o l u m n s )
X = 1 * data
# e n c o d e d a t a in d i f f e r e n t c l a s s e s
for ind , i in e n u m e r a t e ( [ 1 , 2 ,4 , 8 ] ) :
X [ f " c { ind } " ] = i * ( X . v a l u e s [ : , ind ] > 0 . 0 )
X [ " code " ] = X [ " c0 " ] + X [ " c1 " ] + X [ " c2 " ] + X [ " c3 " ]
# this will yield 7 ( CaO - Al2O3 - SiO2 ) and 12 ( Na2O - SiO2 )
u n i q u e s = np . sort ( np . u n i q u e ( X [ " code " ] ) )
tsne = TSNE ( n _ c o m p o n e n t s =2 , v e r b o s e = 0 , p e r p l e x i t y = 25 , n _ i t e r = 300 )
t s n e _ r e s u l t s = tsne . f i t _ t r a n s f o r m ( data . v a l u e s )
m a s k 1 = X [ " code " ] . v a l u e s = = u n i q u e s [ 0 ]
plt . s c a t t e r ( t s n e _ r e s u l t s [ mask1 , 0 ] , t s n e _ r e s u l t s [ mask1 , 1 ] , s = 60 , ec = " k " ,
c o l o r = " r " , l a b e l = " CaO - Al2O3 - SiO2 " )
m a s k 2 = X [ " code " ] . v a l u e s = = u n i q u e s [ 1 ]
plt . s c a t t e r ( t s n e _ r e s u l t s [ mask2 , 0 ] , t s n e _ r e s u l t s [ mask2 , 1 ] , s = 60 , ec = " k " ,
c o l o r = " b " , l a b e l = " Na2O - SiO2 " )
plt . l e g e n d ()
plt . x l a b e l ( " tsne - dimention - 1 " )
plt . y l a b e l ( " tsne - dimention - 2 " )
s a v e f i g ( " tsne . png " )
p r i n t ( " End of o u t p u t " )

Output: See Fig. 6.3

Code snippet 6.3: t-SNE embedding in two dimensions for a dataset comprising
sodium silicate and calcium aluminosilicate glasses

6.6 Summary

Unsupervised machine learning techniques, including PCA, k-means clustering,


Gaussian mixture models (GMMs), and t-SNE, have wide-ranging applications in
various domains, including materials discovery. Throughout this chapter, we have
explored the principles and use cases of these methods. Principal Component Anal-
ysis (PCA) allows for dimensionality reduction in complex datasets while retaining
important information. Clustering algorithms like k-means and GMMs enable the
identification of distinct patterns or groups within the data. t-SNE provides valuable
visualization tools to uncover intricate relationships and structures. By leveraging
unsupervised machine learning, researchers can gain valuable insights and make
sense of large and complex datasets in a variety of fields. These techniques aid
in exploratory data analysis, pattern recognition, and visualization, allowing for a
deeper understanding of the underlying structures and relationships within the data.
It is important to note that unsupervised machine learning methods are not stand-
alone solutions but rather tools that complement domain knowledge and human
expertise. The interpretation of results requires careful consideration and valida-
tion through experimentation. In conclusion, unsupervised machine learning tech-
niques offer a powerful framework for data analysis and exploration. They enable
researchers to uncover hidden patterns, simplify complex datasets, and gain valuable
insights. Whether in materials science or other domains, the integration of unsuper-
vised machine learning methods enhances our understanding and drives advance-
ments in various fields of research and knowledge discovery.

Chapter 7
Model Refinement

Abstract Model refinement is a critical process in machine learning that aims


to enhance the performance and generalization of predictive models. This chapter
explores two fundamental aspects of model refinement: the use of regularizers and
hyperparameter optimization techniques. Regularizers, including Lasso, Ridge, and
Elastic Net, play a vital role in mitigating overfitting and improving model gen-
eralization. Lasso introduces sparsity by applying an L.1 penalty, enabling feature
selection. Ridge regression utilizes an L.2 penalty to shrink coefficient values, reduc-
ing their impact. The Elastic Net regularizer combines L.1 and L.2 penalties, striking
a balance between feature selection and coefficient shrinkage. Hyperparameter opti-
mization techniques are essential for fine-tuning models. Hyperparameters, such as
learning rates and regularization strengths, significantly impact model performance.
Techniques like grid search, random search, Bayesian optimization, and evolution-
ary algorithms efficiently explore the hyperparameter space to identify optimal con-
figurations. By leveraging regularizers and hyperparameter optimization, machine
learning practitioners can refine models, balance complexity and generalization, and
improve predictive accuracy. This chapter delves into the principles, implementa-
tion, and practical considerations of regularizers and hyperparameter optimization,
highlighting their impact on model performance and interpretability.

7.1 Introduction

Model refinement is a critical step in machine learning that aims to improve the per-
formance and generalization ability of predictive models. While initial model training
may provide a starting point, further fine-tuning and optimization are often neces-
sary to enhance the model’s accuracy, robustness, and interpretability. This chapter
focuses on two key aspects of model refinement: the use of regularizers, including
Lasso, Ridge, and Elastic Net, and hyperparameter optimization techniques.
Regularization techniques play a vital role in mitigating overfitting and improving
the generalization of machine learning models. Overfitting occurs when a model
becomes overly complex, capturing noise or idiosyncrasies in the training data that
do not generalize well to unseen data. Regularizers address this issue by introducing
additional constraints or penalties to the model’s objective function during training,


encouraging simpler and more generalizable solutions. One widely used regularizer
is Lasso (Least Absolute Shrinkage and Selection Operator), which applies an L.1
penalty to the model’s coefficients. This penalty promotes sparsity, driving some
coefficients to become exactly zero and effectively performing feature selection.
Ridge regression, on the other hand, employs an L.2 penalty that shrinks the coefficient
values, effectively reducing their impact. The Elastic Net regularizer combines L.1 and
L.2 penalties, offering a trade-off between feature selection and coefficient shrinkage.
In addition to regularizers, hyperparameter optimization techniques are crucial
for fine-tuning models and finding the optimal set of hyperparameters that yield
the best performance. Hyperparameters are parameters that are not learned directly
from the data but are set prior to model training, such as learning rate, regulariza-
tion strength, or the number of hidden layers in a neural network. The selection of
appropriate hyperparameters significantly impacts the model’s performance and gen-
eralization capabilities. Hyperparameter optimization methods, such as grid search,
random search, and more advanced techniques like Bayesian optimization or evolu-
tionary algorithms, systematically explore the hyperparameter space to identify the
optimal configuration. These techniques help automate and streamline the process
of hyperparameter selection, saving valuable time and computational resources.
In this chapter, we will delve into the principles, implementation, and practical
considerations of regularizers, including Lasso, Ridge, and Elastic Net, and hyperpa-
rameter optimization techniques. We will discuss their impact on model performance,
their ability to mitigate overfitting, and their role in achieving better generalization
and interpretability in machine learning models. For an in-depth discussion on the
theoretical background of these methods and further reading, readers are directed to
References [1–6].

7.2 Regularization for Regression

One key point to be ensured while training an ML model is circumventing overfitting. Overfitting happens mainly due to fitting the model parameters to noise in the data. As a result, instead of learning the right signal, the model follows the noise, leading to incorrect predictions. This also increases the number of parameters in the ML model. Regularization is an approach that regularizes or shrinks the coefficient estimates towards zero by imposing constraints on the range of the model parameters. This essentially avoids learning unnecessarily complex models that fit the noise, thereby reducing the risk of overfitting.
A model with a larger number of variables is difficult to interpret physically. For instance, let us say we are modelling a property of a product (y) as a function of composition 1 (x_1), composition 2 (x_2), composition 3 (x_3), composition 4 (x_4), and composition 5 (x_5), i.e., y = β_0 + β_1 x_1 + β_2 x_2 + β_3 x_3 + β_4 x_4 + β_5 x_5. If all the coefficients β_0, ..., β_5 are active, we have no clue which variable dominates in controlling the property. If, instead, only one or two variables remain active after modeling, say x_1 and x_3, we conclude that composition 1 and composition 3 play a key role in controlling the property, y = β_1 x_1 + β_3 x_3. That is, a lower number of variables improves interpretability and decision making.
One wants to choose a model that both accurately captures the regularities in its training data and generalizes well to unseen data. A less complex model tends to underfit, while a more complex model tends to overfit. The first case results in a large bias error, and the second leads to a large variance error. Regularization controls these two extremes and produces sparse and interpretable models.

7.2.1 Ridge Regression

Ridge regression shrinks the regression coefficients by imposing a penalty on their size while minimizing the residual sum of squares:

\hat{\beta}^{ridge} = \arg\min_{\beta} \left\{ \sum_{i=1}^{n} \left( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \right)^2 + \lambda \sum_{j=1}^{p} \beta_j^2 \right\}     (7.1)

Here λ ≥ 0 is a complexity parameter that controls the amount of shrinkage; the larger the value of λ, the greater the amount of shrinkage. The coefficients are shrunk toward zero (and toward each other). When there are many correlated variables in a linear regression model, their coefficients can become poorly determined and exhibit high variance. A wildly large positive coefficient on one variable can be cancelled by a similarly large negative coefficient on its correlated variable. By imposing a size constraint on the coefficients, this problem is alleviated. An equivalent formulation makes the size constraint on the parameters explicit:

\hat{\beta}^{ridge} = \arg\min_{\beta} \sum_{i=1}^{n} \left( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \right)^2     (7.2)

subject to

\sum_{j=1}^{p} \beta_j^2 \le t     (7.3)

The ridge solution can be written in closed form as

\hat{\beta}^{ridge} = (X^T X + \lambda I)^{-1} X^T y     (7.4)

Notice that the intercept β_0 has been left out of the penalty term to avoid making the procedure depend on the origin chosen for y. The solution adds a positive constant to the diagonal of X^T X before inversion. This makes the problem non-singular, even if X^T X is not of full rank, and was the main motivation for ridge regression.
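As a quick illustrative sketch (our addition, not one of the book's numbered snippets), the closed-form solution of Eq. (7.4) can be computed directly in NumPy; the synthetic data generated without an intercept (so that β_0 can be ignored) are an assumption for demonstration.

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + 0.1 * rng.normal(size=50)

lam = 0.1
# Solve (X^T X + lambda I) beta = X^T y, i.e., Eq. (7.4), without explicit inversion
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)
print(beta_ridge)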
Code Snippet 7.1 can be used to perform ridge regression.

Note: The authors recommend that readers run the code with different values of the hyperparameters associated with ridge regression.

""" R i d g e R e g r e s s i o n
"""
# i m p o r t m a t p l o l i b for p l o t t i n g data
import matplotlib
m a t p l o t l i b . use ( ' agg ')
i m p o r t m a t p l o t l i b . p y p l o t as plt
from m a t p l o t l i b . p y p l o t i m p o r t s a v e f i g
# i m p o r t n u m p y and s k l e a r n
i m p o r t p a n d a s as pd
# Load sa m p l e d a t a s e t
data = pd . r e a d _ c s v ( " data / f u l l _ d e n . csv " ) . s o r t _ v a l u e s ( by = " Na2O " )
p r i n t ( data . c o l u m n s )
X = data [ data . c o l u m n s [ : - 1 ] ] . v a l u e s
y = data . v a l u e s [ : , - 1 : ]
from s k l e a r n i m p o r t l i n e a r _ m o d e l
clf = l i n e a r _ m o d e l . R i d g e ( a l p h a = 0 . 1 )
clf . fit ( X , y )
p r i n t ( " C o e f f i c i e n t : " , clf . c o e f _ )
p r i n t ( " I n t e r c e p t : " , clf . i n t e r c e p t _ )
p r i n t ( " End of o u t p u t " )

Output:

Index([’Al2O3’, ’B2O3’, ’CaO’, ’Fe2O3’, ’FeO’, ’MgO’, ’Na2O’, ’P2O5’, ’TeO2’,


’TiO2’, ’ZrO2’, ’Density’],
dtype=’object’)
Coefficient: [[-0.00812629 -0.01362728 -0.00246626 0.00691105 0.01848752 -0.00421779
-0.00747339 -0.01062724 0.0174123 -0.00012834 0.00385573]]
Intercept: [3.47251353]
End of output

Code snippet 7.1: Ridge regression

7.2.2 Least Absolute Shrinkage and Selection Operator (LASSO)

The LASSO regression is a shrinkage method like the ridge, with subtle but important
differences. LASSO regression uses the following penalization for estimating the
model parameters:
\hat{\beta}^{LASSO} = \arg\min_{\beta} \sum_{i=1}^{n} \left( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \right)^2     (7.5)

subject to

\sum_{j=1}^{p} |\beta_j| \le t     (7.6)

The equivalent Lagrangian form is

\hat{\beta}^{LASSO} = \arg\min_{\beta} \left\{ \sum_{i=1}^{n} \left( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \right)^2 + \lambda \sum_{j=1}^{p} |\beta_j| \right\}     (7.7)

Code Snippet 7.2 can be used to perform LASSO regression.

Note: The authors recommend that readers run the code with different values of the hyperparameters associated with LASSO regression.

""" L a s s o R e g r e s s i o n
"""
# i m p o r t m a t p l o l i b for p l o t t i n g data
import matplotlib
m a t p l o t l i b . use ( ' agg ')
i m p o r t m a t p l o t l i b . p y p l o t as plt
from m a t p l o t l i b . p y p l o t i m p o r t s a v e f i g
# i m p o r t n u m p y and s k l e a r n
i m p o r t p a n d a s as pd
# Load sa m p l e d a t a s e t
data = pd . r e a d _ c s v ( " data / f u l l _ d e n . csv " ) . s o r t _ v a l u e s ( by = " Na2O " )
p r i n t ( data . c o l u m n s )
X = data [ data . c o l u m n s [ : - 1 ] ] . v a l u e s
y = data . v a l u e s [ : , - 1 : ]
from s k l e a r n i m p o r t l i n e a r _ m o d e l
clf = l i n e a r _ m o d e l . L a s s o ( a l p h a = 0 . 1 )
clf . fit ( X , y )
p r i n t ( " C o e f f i c i e n t : " , clf . c o e f _ )
p r i n t ( " I n t e r c e p t : " , clf . i n t e r c e p t _ )
p r i n t ( " End of o u t p u t " )

Output:

Index([’Al2O3’, ’B2O3’, ’CaO’, ’Fe2O3’, ’FeO’, ’MgO’, ’Na2O’, ’P2O5’, ’TeO2’,


’TiO2’, ’ZrO2’, ’Density’],
dtype=’object’)
Coefficient: [-0.00408565 -0.01062827 0. 0.00881923 0.01883976 -0.
-0.00413582 -0.00749122 0.02032784 0.00161859 0. ]
Intercept: [3.16802183]
End of output

Code snippet 7.2: Lasso regression



Here, the penalty is of the L_1 form, and the shrinkage is much stronger than in ridge regression. This constraint makes the solutions non-linear in the y_i, and there is no closed-form expression as in ridge regression. Computing the LASSO solution is a quadratic programming problem. The generalized form of LASSO regression is

\tilde{\beta} = \arg\min_{\beta} \left\{ \sum_{i=1}^{n} \left( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \right)^2 + \lambda \sum_{j=1}^{p} |\beta_j|^q \right\}     (7.8)

Here, the value q = 0 corresponds to variable subset selection, as the penalty simply counts the number of nonzero parameters, q = 1 corresponds to the LASSO, and q = 2 corresponds to ridge regression. We can also optimize q using hyperparameter optimization to obtain the best shrinkage.
In ridge and LASSO regression, the residual sum of squares has elliptical contours centered at the full least-squares estimate. The constraint for ridge regression results in a disk-shaped region β_1² + β_2² ≤ t, whereas the constraint for LASSO regression, |β_1| + |β_2| ≤ t, results in a diamond-shaped region. In constrained optimization, the solution lies at the active constraints (on the boundary of the constraint surface). Unlike the disk, the diamond has corners; if the solution occurs at a corner, then one parameter β_j is set exactly to zero. Thus, the shrinkage in LASSO is stronger than in ridge regression.

7.2.3 Elastic-Net Regression

The elastic net is a regularized regression method that linearly combines the L_1 and L_2 penalties of the LASSO and ridge methods. Elastic-net regression has the following form:

\hat{\beta}^{elastic} = \arg\min_{\beta} \left\{ \sum_{i=1}^{n} \left( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \right)^2 + \lambda_1 \sum_{j=1}^{p} |\beta_j| + \lambda_2 \sum_{j=1}^{p} \beta_j^2 \right\}     (7.9)

Often, λ_1 and λ_2 are related by λ_2 = 1 − λ_1. Hence, the elastic net regression form includes both LASSO and ridge regression as special cases: when λ_1 = λ and λ_2 = 0, elastic-net regression reduces to LASSO regression, while when λ_2 = λ and λ_1 = 0, it reduces to ridge regression.
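For a quick, hedged illustration (our addition, not one of the book's numbered snippets): in scikit-learn's ElasticNet, the mixing between the two penalties is controlled by the l1_ratio parameter, and setting l1_ratio = 1.0 corresponds to the pure L_1 limit of Eq. (7.9), so the fitted coefficients should (nearly) coincide with those of Lasso at the same alpha. The synthetic data below are assumptions for demonstration.

import numpy as np
from sklearn.linear_model import ElasticNet, Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = X @ np.array([1.5, 0.0, -2.0, 0.0, 0.5]) + 0.1 * rng.normal(size=100)

# l1_ratio = 1.0 corresponds to a pure L1 penalty (the LASSO limit of Eq. 7.9)
print(ElasticNet(alpha=0.1, l1_ratio=1.0).fit(X, y).coef_)
print(Lasso(alpha=0.1).fit(X, y).coef_)  # should (nearly) coincide with the line above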
Code Snippet 7.3 can be used to perform elastic net regression.

Note: The authors recommend that readers run the code with different values of the hyperparameters associated with elastic net regression.

""" E l a s t i c Net R e g r e s s i o n
"""
# i m p o r t m a t p l o l i b for p l o t t i n g data
import matplotlib
m a t p l o t l i b . use ( ' agg ')
i m p o r t m a t p l o t l i b . p y p l o t as plt
from m a t p l o t l i b . p y p l o t i m p o r t s a v e f i g
# i m p o r t n u m p y and s k l e a r n
i m p o r t p a n d a s as pd
# Load sa m p l e d a t a s e t
data = pd . r e a d _ c s v ( " data / f u l l _ d e n . csv " ) . s o r t _ v a l u e s ( by = " Na2O " )
p r i n t ( data . c o l u m n s )
X = data [ data . c o l u m n s [ : - 1 ] ] . v a l u e s
y = data . v a l u e s [ : , - 1 : ]
from s k l e a r n i m p o r t l i n e a r _ m o d e l
clf = l i n e a r _ m o d e l . E l a s t i c N e t ( a l p h a = 0 . 1 , l 1 _ r a t i o = 0 . 5 )
clf . fit ( X , y )
p r i n t ( " C o e f f i c i e n t : " , clf . c o e f _ )
p r i n t ( " I n t e r c e p t : " , clf . i n t e r c e p t _ )
p r i n t ( " End of o u t p u t " )

Output:

Index([’Al2O3’, ’B2O3’, ’CaO’, ’Fe2O3’, ’FeO’, ’MgO’, ’Na2O’, ’P2O5’, ’TeO2’,


’TiO2’, ’ZrO2’, ’Density’],
dtype=’object’)
Coefficient: [-0.00488952 -0.01100025 -0. 0.00905504 0.01979262 -0.00066453
-0.00460861 -0.00793621 0.02001977 0.00191433 0.00081675]
Intercept: [3.205022]
End of output

Code snippet 7.3: Elastic net regression

7.3 Cross-Validation for Model Generalizability

A trained ML model does not necessarily generalize to unseen datasets. K-fold cross-validation is a popular technique for estimating the performance of a model and evaluating its generalization ability. It provides a more reliable assessment by repeatedly splitting the dataset into multiple subsets, or "folds", to train and validate the model. This section provides a brief explanation of k-fold cross-validation, its benefits, and an example of its implementation.
See Code Snippet 7.4.
In k-fold cross-validation, the dataset is divided into k equal-sized folds. The
model is trained and evaluated k times, with each fold serving as the validation set
once while the remaining k-1 folds are used for training. The performance metric,
such as accuracy or mean squared error, is calculated for each iteration. Finally, the
average performance across all folds is considered as the overall performance of the
model.
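As a brief illustration of this procedure (our addition, not one of the book's numbered snippets), scikit-learn's cross_val_score performs the k-fold loop directly; here a ridge model and the full_den.csv dataset used in Code Snippets 7.1–7.3 are assumed.

import pandas as pd
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

data = pd.read_csv("data/full_den.csv").sort_values(by="Na2O")
X = data[data.columns[:-1]].values
y = data.values[:, -1]

# cv=5 splits the data into 5 folds; each fold serves once as the validation set
scores = cross_val_score(Ridge(alpha=0.1), X, y, cv=5,
                         scoring="neg_mean_squared_error")
print("MSE per fold:", -scores)
print("mean MSE:", -scores.mean())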
The main advantages of k-fold cross-validation are as follows

• Robust Performance Estimate: By repeatedly evaluating the model on different


subsets of the data, k-fold cross-validation provides a more reliable estimate of

""" X G B o o s t r e g r e s s i o n with h y p e r p a r a m e t e r g r i d s e a r c h and cross -


validation
"""
# i m p o r t m a t p l o l i b for p l o t t i n g data
import matplotlib
m a t p l o t l i b . use ( ' agg ')
i m p o r t m a t p l o t l i b . p y p l o t as plt
from m a t p l o t l i b . p y p l o t i m p o r t s a v e f i g
# import
i m p o r t n u m p y as np
i m p o r t p a n d a s as pd
from x g b o o s t i m p o r t X G B R e g r e s s o r
from s k l e a r n . m o d e l _ s e l e c t i o n i m p o r t t r a i n _ t e s t _ s p l i t
from s k l e a r n . m e t r i c s i m p o r t m e a n _ s q u a r e d _ e r r o r , r 2 _ s c o r e
import optuna
from s k l e a r n . m o d e l _ s e l e c t i o n i m p o r t c r o s s _ v a l _ s c o r e
# Load sa m p l e d a t a s e t
data = pd . r e a d _ c s v ( " data / N S _ d e n . csv " ) . s o r t _ v a l u e s ( by = " Na2O " )
p r i n t ( data . c o l u m n s )
X = data [ data . c o l u m n s [ : - 1 ] ] . v a l u e s
y = data [ " D e n s i t y ( g / cm3 ) " ] . v a l u e s . r e s h a p e ( -1 , 1 )
# Split dataset
X_train , X_test , y_train , y _ t e s t = t r a i n _ t e s t _ s p l i t ( X , y , t e s t _ s i z e = 0 . 3
, r a n d o m _ s t a t e = 2020 )
o r d e r = np . a r g s o r t ( X _ t e s t [ : , 0 ] )
def o b j e c t i v e ( t r i a l ) :
d = trial . suggest_int ( ' depth ', 2 , 8)
n = t r i a l . s u g g e s t _ i n t ( ' e s t i m a t o r s ' , 2 , 512 )
# Fit r e g r e s s i o n m o d e l
regr = X G B R e g r e s s o r ( m a x _ d e p t h = d , n _ e s t i m a t o r s = n )
# cross - v a l i d a t i o n
s c o r e s = c r o s s _ v a l _ s c o r e ( regr , X_train , y_train , cv = 5 , s c o r i n g = '
neg_mean_squared_error ')
r e t u r n - s c o r e s . mean ()
search_space = {
' d e p t h ': [2 , 4 , 6 , 8 ] ,
' e s t i m a t o r s ' : [ 2 , 16 , 128 , 512 ]
}
study = optuna . create_study ( sampler = optuna . samplers . GridSampler (
search_space ))
s t u d y . o p t i m i z e ( objective , n _ t r i a l s = 4 * 4 )
p r i n t ( " best p a r a m s : " , s t u d y . b e s t _ p a r a m s ) # Get best p a r a m e t e r s for
the o b j e c t i v e f u n c t i o n .
p r i n t ( " best value : " , s t u d y . b e s t _ v a l u e ) # Get best o b j e c t i v e v a l u e .
p r i n t ( " End of o u t p u t " )

Output:

Index([’Na2O’, ’SiO2’, ’Density (g/cm3)’], dtype=’object’)


best params: {’depth’: 2, ’estimators’: 16}
best value: 0.001421600999609488
End of output

Code snippet 7.4: XGBoost with grid search and cross validation

the model’s performance. It helps mitigate the impact of dataset variability and
randomness.
• Efficient Use of Data: In traditional train-test splits, a portion of the data is set aside
for testing, which reduces the amount of data available for training. K-fold cross-
validation ensures that all data points are used for both training and validation,
maximizing the utilization of the available data.
• Model Selection and Hyperparameter Tuning: K-fold cross-validation is com-
monly used for model selection and hyperparameter tuning. It allows for comparing
different models or hyperparameter settings based on their average performance
across multiple folds, enabling more informed decisions. These aspects are dis-
cussed in detail in the next section.
• Identifying Overfitting: By evaluating the model on multiple validation sets,
k-fold cross-validation helps detect overfitting. If a model performs significantly
better on the training set compared to the validation sets, it indicates that the model
may be overfitting the training data.

7.4 Hyperparametric Optimization

During model development, different models are tested, and hyperparameters are tuned to obtain more reliable predictions. Hyperparameters are parameters that control a machine learning algorithm's behavior. They differ from model parameters in that hyperparameters are set before training and supplied to the model, while parameters are learned during the training phase. Thus, hyperparameters are configuration settings that are not learned from the data but set by the user before model training. Examples of hyperparameters include the learning rate, regularization strength, number of layers in a neural network, or the choice of a kernel in a support vector machine. Hyperparameter optimization techniques aim to automatically search for the combination of hyperparameters that maximizes the model's performance. The performance of ML models is highly dependent on the choice of hyperparameters. For example, a typical soft-margin SVM classifier equipped with an RBF kernel has at least two hyperparameters that need to be tuned for good performance on unseen data: the regularization constant and a kernel hyperparameter. There are various approaches to hyperparameter optimization, namely, (i) random search, (ii) grid search, and (iii) Bayesian optimization.

7.4.1 Grid Search

The most straightforward hyperparameter optimization approach is grid search, an exhaustive search through a pre-specified subset of a learning algorithm's hyperparameter space. Once the hyperparameter search space is decided, a suitable performance metric is defined to perform the optimization. This criterion can then be evaluated via cross-validation on the training set or on a separate validation set. A poorly chosen hyperparameter space may prevent the grid search from converging to a good configuration. The grid search approach suffers heavily from the curse of dimensionality, since the number of combinations grows exponentially with the number of hyperparameters, although the evaluations are easy to parallelize because the hyperparameter settings are typically independent of each other.
Code Snippet 7.5 can be used to perform grid search for hyperparameter optimization of XGBoost models. Similarly, Code Snippet 7.6 can be used to perform random search for hyperparameter optimization of XGBoost models. Note that the same code can be applied to any other model as well.

Note: The authors recommend that readers run the code with different regression models for hyperparameter optimization using grid search and random search.

7.4.2 Random Search

Random search replaces the exhaustive enumeration of all combinations by selecting them randomly. It can outperform grid search, especially when only a small number of hyperparameters affect the learning algorithm's final performance; in this case, the optimization problem is said to have a low intrinsic dimensionality. Here also, a suitable performance metric needs to be defined to perform the optimization.

7.4.3 Bayesian Optimization

Bayesian optimization is a surrogate-model-based global optimization strategy employed for black-box functions that do not possess explicit functional forms. When applied in the hyperparameter optimization framework, Bayesian optimization builds a probabilistic surrogate model of the function mapping from hyperparameter values to the objective evaluated on a validation set, by fixing a prior and evaluating the posterior. There are several methods used to define the prior/posterior distribution over the objective function. The two most common methods employed in this context are Gaussian processes and tree-structured Parzen estimators, which are used to iteratively find a regime that maximizes the expected improvement. Bayesian optimization tries to balance exploration and exploitation in the search and has been shown to obtain better results in fewer evaluations compared to grid search and random search. There are several packages that support Bayesian optimization for choosing the optimal hyperparameters, including Optuna (https://github.com/optuna/optuna) and Hyperopt (https://github.com/hyperopt/hyperopt). Readers are encouraged to explore these packages.
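As a brief, hedged sketch (our addition, not one of the book's numbered snippets), Bayesian-style optimization can be invoked in Optuna by using its tree-structured Parzen estimator sampler (TPESampler); the toy quadratic objective below is an assumption standing in for the model-training objectives of Code Snippets 7.5 and 7.6.

import optuna

def objective(trial):
    # toy quadratic objective; in practice this would train and score a model,
    # as in Code Snippets 7.5 and 7.6
    x = trial.suggest_float("x", -10.0, 10.0)
    return (x - 2.0) ** 2

study = optuna.create_study(sampler=optuna.samplers.TPESampler(seed=2020))
study.optimize(objective, n_trials=30)
print("best params:", study.best_params)
print("best value:", study.best_value)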

""" X G B o o s t r e g r e s s i o n with h y p e r p a r a m e t e r g r i d s e a r c h
"""
# i m p o r t m a t p l o l i b for p l o t t i n g data
import matplotlib
m a t p l o t l i b . use ( ' agg ')
i m p o r t m a t p l o t l i b . p y p l o t as plt
from m a t p l o t l i b . p y p l o t i m p o r t s a v e f i g
# import
i m p o r t n u m p y as np
i m p o r t p a n d a s as pd
from x g b o o s t i m p o r t X G B R e g r e s s o r
from s k l e a r n . m o d e l _ s e l e c t i o n i m p o r t t r a i n _ t e s t _ s p l i t
from s k l e a r n . m e t r i c s i m p o r t m e a n _ s q u a r e d _ e r r o r , r 2 _ s c o r e
import optuna
# Load sa m p l e d a t a s e t
data = pd . r e a d _ c s v ( " data / N S _ d e n . csv " ) . s o r t _ v a l u e s ( by = " Na2O " )
p r i n t ( data . c o l u m n s )
X = data [ data . c o l u m n s [ : - 1 ] ] . v a l u e s
y = data [ " D e n s i t y ( g / cm3 ) " ] . v a l u e s . r e s h a p e ( -1 , 1 )
# Split dataset
X_train , X_test , y_train , y _ t e s t = t r a i n _ t e s t _ s p l i t ( X , y , t e s t _ s i z e = 0 . 3
, r a n d o m _ s t a t e = 2020 )
o r d e r = np . a r g s o r t ( X _ t e s t [ : , 0 ] )
def o b j e c t i v e ( t r i a l ) :
d = trial . suggest_int ( ' depth ', 2 , 8)
n = t r i a l . s u g g e s t _ i n t ( ' e s t i m a t o r s ' , 2 , 512 )
# Fit r e g r e s s i o n m o d e l
regr = X G B R e g r e s s o r ( m a x _ d e p t h = d , n _ e s t i m a t o r s = n )
regr . fit ( X_train , y _ t r a i n )
y _ p r e d _ t e s t = regr . p r e d i c t ( X _ t e s t )
r e t u r n m e a n _ s q u a r e d _ e r r o r ( y_test , y _ p r e d _ t e s t )
search_space = {
' d e p t h ': [2 , 4 , 6 , 8 ] ,
' e s t i m a t o r s ' : [ 2 , 16 , 128 , 512 ]
}
study = optuna . create_study ( sampler = optuna . samplers . GridSampler (
search_space ))
s t u d y . o p t i m i z e ( objective , n _ t r i a l s = 4 * 4 )
p r i n t ( " best p a r a m s : " , s t u d y . b e s t _ p a r a m s ) # Get best p a r a m e t e r s for
the o b j e c t i v e f u n c t i o n .
p r i n t ( " best value : " , s t u d y . b e s t _ v a l u e ) # Get best o b j e c t i v e v a l u e .
p r i n t ( " End of o u t p u t " )

Output:

Index([’Na2O’, ’SiO2’, ’Density (g/cm3)’], dtype=’object’)


best params: {’depth’: 2, ’estimators’: 16}
best value: 0.0038265304143910175
End of output

Code snippet 7.5: XGBoost with grid search



""" X G B o o s t r e g r e s s i o n with h y p e r p a r a m e t e r r a n d o m s e a r c h
"""
# i m p o r t m a t p l o l i b for p l o t t i n g data
import matplotlib
m a t p l o t l i b . use ( ' agg ')
i m p o r t m a t p l o t l i b . p y p l o t as plt
from m a t p l o t l i b . p y p l o t i m p o r t s a v e f i g
# import
i m p o r t n u m p y as np
i m p o r t p a n d a s as pd
from x g b o o s t i m p o r t X G B R e g r e s s o r
from s k l e a r n . m o d e l _ s e l e c t i o n i m p o r t t r a i n _ t e s t _ s p l i t
from s k l e a r n . m e t r i c s i m p o r t m e a n _ s q u a r e d _ e r r o r , r 2 _ s c o r e
import optuna
# Load sa m p l e d a t a s e t
data = pd . r e a d _ c s v ( " data / N S _ d e n . csv " ) . s o r t _ v a l u e s ( by = " Na2O " )
p r i n t ( data . c o l u m n s )
X = data [ data . c o l u m n s [ : - 1 ] ] . v a l u e s
y = data [ " D e n s i t y ( g / cm3 ) " ] . v a l u e s . r e s h a p e ( -1 , 1 )
# Split dataset
X_train , X_test , y_train , y _ t e s t = t r a i n _ t e s t _ s p l i t ( X , y , t e s t _ s i z e = 0 . 3
, r a n d o m _ s t a t e = 2020 )
o r d e r = np . a r g s o r t ( X _ t e s t [ : , 0 ] )
def o b j e c t i v e ( t r i a l ) :
d = trial . suggest_int ( ' depth ', 2 , 8)
n = t r i a l . s u g g e s t _ i n t ( ' e s t i m a t o r s ' , 2 , 512 )
# Fit r e g r e s s i o n m o d e l
regr = X G B R e g r e s s o r ( m a x _ d e p t h = d , n _ e s t i m a t o r s = n )
regr . fit ( X_train , y _ t r a i n )
y _ p r e d _ t e s t = regr . p r e d i c t ( X _ t e s t )
r e t u r n m e a n _ s q u a r e d _ e r r o r ( y_test , y _ p r e d _ t e s t )
s t u d y = o p t u n a . c r e a t e _ s t u d y ( s a m p l e r = o p t u n a . s a m p l e r s . R a n d o m S a m p l e r ( seed =
2020 ) )
s t u d y . o p t i m i z e ( objective , n _ t r i a l s = 3 * 3 )
p r i n t ( " best p a r a m s : " , s t u d y . b e s t _ p a r a m s ) # Get best p a r a m e t e r s for
the o b j e c t i v e f u n c t i o n .
p r i n t ( " best value : " , s t u d y . b e s t _ v a l u e ) # Get best o b j e c t i v e v a l u e .
p r i n t ( " End of o u t p u t " )

Output:

Index([’Na2O’, ’SiO2’, ’Density (g/cm3)’], dtype=’object’)


best params: {’depth’: 3, ’estimators’: 177}
best value: 0.004141484735300094
End of output

Code snippet 7.6: XGBoost with random search

7.5 Summary

In conclusion, this chapter has provided a comprehensive exploration of model refine-


ment techniques in machine learning, focusing specifically on regularizers and hyper-
parameter optimization. Regularizers, such as Lasso, Ridge, and Elastic Net, offer
effective strategies for combating overfitting and improving the generalization capa-
bilities of predictive models. By introducing penalties and constraints, regularizers
promote feature selection and coefficient shrinkage, and strike a balance between complexity and simplicity. On the other hand, hyperparameter optimization techniques


play a crucial role in fine-tuning models by identifying optimal configurations of
key parameters. The use of techniques like grid search, random search, Bayesian
optimization, and evolutionary algorithms enables efficient exploration of the hyper-
parameter space, leading to improved model performance. By harnessing the power
of regularizers and hyperparameter optimization, machine learning practitioners can
refine models, achieve a better trade-off between complexity and generalization,
and ultimately enhance the predictive accuracy and interpretability of their models.
The principles, implementation details, and practical considerations discussed in this
chapter offer valuable insights and guidance for researchers and practitioners in the
field of machine learning. By embracing these model refinement techniques, we can
advance the state-of-the-art in predictive modeling and unlock new possibilities in
various domains and applications.

References

1. C.M. Bishop, Pattern Recognition and Machine Learning (Information Science and Statistics) (Springer, Berlin, Heidelberg, 2006). ISBN: 0387310738
2. K.P. Murphy, Machine Learning: A Probabilistic Perspective (The MIT Press, 2012). ISBN: 0262018020
3. T.M. Mitchell, Machine Learning, 1st edn. (McGraw-Hill, Inc., USA, 1997). ISBN: 0070428077
4. R.O. Duda, P.E. Hart, et al., Pattern Classification (Wiley, 2006)
5. J. Friedman, T. Hastie, R. Tibshirani, et al., The Elements of Statistical Learning, 10, vol. 1. Springer Series in Statistics, New York (2001)
6. B. Schölkopf, A.J. Smola, F. Bach, et al., Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond (MIT Press, 2002)
Chapter 8
Deep Learning

Abstract In this chapter, we discuss advanced deep learning algorithms focusing


on their application and impact in the field of machine learning. We discuss the trans-
formative impact of deep learning compared to classical approaches, which heavily
rely on handcrafted features and hyperparameter tuning. To this extent, the chapter
explores a range of advanced deep learning models, including Convolutional Neural
Networks (CNNs) for materials image analysis, Long Short-Term Memory networks
(LSTMs) for sequential materials data, Generative Adversarial Networks (GANs)
for generating new material structures, Graph Neural Networks (GNNs) for analyz-
ing materials graphs, Variational Autoencoders (VAEs) for materials representation
learning, and Reinforcement Learning (RL) which has been widely used in materials
domain. Each model is presented with a detailed explanation of its underlying prin-
ciples, architectures, and training methodologies. By exploring these advanced deep
learning techniques, researchers and practitioners in the field of materials can gain
valuable insights into leveraging deep learning models to accelerate the exploration
of novel materials and optimize material properties.

8.1 Introduction

Machine learning has traditionally relied on classical approaches, which involve


manual feature engineering and careful hyperparameter tuning to develop robust and
generalizable models. However, recent advancements in deep learning have revolu-
tionized the field by enabling models to automatically learn intricate representations
directly from raw data. In this chapter, we delve into the world of deep learning and
explore its fundamental principles, applications, and how it differs from classical
machine learning approaches.
Deep learning, characterized by the use of neural networks with multiple layers,
has emerged as a paradigm shift in machine learning. Unlike classical approaches
that require human experts to handcraft features, deep learning models can automat-
ically extract meaningful representations from raw data. This capability eliminates
the need for domain-specific knowledge and allows the models to learn intricate pat-

terns and relationships directly from the data, leading to improved performance and
adaptability across various domains.
Compared to classical machine learning algorithms, deep learning models offer
several distinct advantages. One key advantage lies in their ability to handle large
and complex datasets. Deep learning models excel at scaling with data due to their
capacity to leverage parallel computing architectures and exploit vast amounts of
training examples. This scalability empowers deep learning models to process mas-
sive datasets efficiently, enabling them to capture intricate patterns that may be chal-
lenging for classical algorithms.
Furthermore, deep learning models can perform end-to-end learning, eliminating
the need for manual feature engineering. While classical machine learning requires
domain expertise to carefully design and select relevant features, deep learning mod-
els can learn hierarchical representations directly from raw data. This end-to-end
learning approach simplifies the development process and makes deep learning mod-
els more adaptable to diverse problem domains.
Throughout this chapter, we will delve into advanced deep learning models,
including Convolutional Neural Networks (CNNs) for computer vision, Long Short-
Term Memory networks (LSTMs) for sequential data analysis, Generative Adversar-
ial Networks (GANs) for generative modelling, Graph Neural Networks (GNNs) for
graph-structured data, Variational Autoencoders (VAEs) for generative modelling
and representation learning, and Reinforcement Learning (RL) for sequential deci-
sion making. By exploring these models in detail, we aim to provide readers with
a comprehensive understanding of the underlying principles, architectures, training
methodologies, and practical applications of deep learning, enabling them to leverage
the power of these advanced techniques in their own machine learning endeavours.
It should be noted that the chapter is not exhaustive and is only prescriptive in
nature. We only aim to provide a brief overview of different widely used architectural
frameworks and their potential applications. Readers are encouraged to read the
original research papers and textbooks on each of these subject matters for the in-
depth understanding of the mathematical operations, and theoretical guarantees, if
any, for each of the approaches discussed [1–3].

8.2 Convolutional Neural Networks

The Convolutional Neural Network (CNN) is a deep learning framework that can process input images, highlight different aspects of an image, and distinguish one image from another. A CNN can successfully capture the spatial and temporal dependencies in an image by applying relevant filters or kernels through the convolution operation (Fig. 8.1).
Fig. 8.1 CNN schematic (adapted from [4] with permission)

Like any neural network architecture, a CNN also consists of an input layer, hidden layers, and an output layer. In a CNN, the hidden layers include layers that perform convolutions. The convolution operation is a component-wise inner product of two matrices treated as though they were vectors. In a CNN, convolution filters or kernels are used to perform the convolution operation on the input data frame. The output is then passed through an activation function, which is commonly ReLU for CNNs. A feature map is generated as the convolution kernel convolves along the input matrix of the layer. This is followed by other layers such as pooling layers, fully connected layers, and normalization layers.
The convolution filter moves from left to right across the image with a fixed stride value until the full width is parsed. This operation helps extract key features from the image. Traditionally, the first convolution layer extracts low-level features such as edges, color, and gradient orientation. With added layers, the architecture can learn high-level features as well.
Like the convolution layer, the pooling layer is responsible for reducing the spatial size of the convolved features. This reduces the computational power required to process the data through dimensionality reduction. In addition, it helps extract dominant features that are invariant to rotation and translation, thus supporting effective training of the model. There are two types of pooling: max pooling and average pooling. Max pooling returns the maximum value from the portion of the image covered by the kernel, whereas average pooling returns the average of all values in that portion. This procedure is repeated in each layer until all the relevant features are extracted. The flattened output is fed into a feed-forward neural network, and backpropagation is applied in each training iteration to achieve the desired classification.
The convolutional layers are the core building blocks of CNNs and perform the
most important operations. The main steps of a CNN are as follows.
1. Convolution Operation: The convolutional layer applies filters or kernels to the input image to extract local patterns and features. Each filter is a small matrix of weights that is convolved with the input image. The convolution operation can be defined as follows:

   C(i, j) = \sum_{m} \sum_{n} I(i + m, j + n) \cdot K(m, n)     (8.1)

   where C(i, j) is the output feature map at position (i, j), I is the input image, and K is the filter.
2. Non-linear Activation: After the convolution operation, a non-linear activation function is applied element-wise to introduce non-linearity into the network. The most commonly used activation function in CNNs is the Rectified Linear Unit (ReLU), defined as f(x) = max(0, x), where x is the input.
3. Pooling: The pooling layer reduces the spatial dimensions of the feature maps,
effectively downsampling the information. It helps to reduce the computational
complexity and make the network more robust to small variations in the input.
The most commonly used pooling operation is max pooling, which extracts the
maximum value from each local region.
4. Fully Connected Layers: The fully connected layers are typically placed after the
convolutional and pooling layers. These layers connect every neuron from the
previous layer to the next layer, allowing the network to learn complex patterns
and make predictions. The output of the fully connected layers is usually fed into a
softmax layer for classification tasks, which produces the probability distribution
over different classes.
The overall architecture of a CNN typically consists of multiple convolutional
layers, interleaved with pooling layers, followed by one or more fully connected
layers. The layers are trained end-to-end using backpropagation and optimization
algorithms, such as stochastic gradient descent (SGD), to update the weights and
minimize the loss function. CNNs are commonly used for image classification, image segmentation, and predicting global properties, such as ionic conductivity, based on the microstructure of a material. Accordingly, CNNs are extremely useful in the materials domain for analyzing images and identifying crystal structures, grain boundaries, or global properties.
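As a minimal illustrative sketch (our addition, not one of the book's numbered snippets), the layer types described above can be assembled into a small CNN using PyTorch; the input size (1 x 28 x 28), channel counts, and number of classes are assumptions chosen purely for demonstration.

import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Conv2d(1, 8, kernel_size=3, padding=1),   # convolution (Eq. 8.1, one output map per filter)
    nn.ReLU(),                                   # non-linear activation
    nn.MaxPool2d(2),                             # max pooling
    nn.Conv2d(8, 16, kernel_size=3, padding=1),  # deeper layer learning higher-level features
    nn.ReLU(),
    nn.MaxPool2d(2),
    nn.Flatten(),                                # flatten the feature maps
    nn.Linear(16 * 7 * 7, 10),                   # fully connected layer -> 10 classes
)

x = torch.randn(4, 1, 28, 28)                    # a batch of 4 single-channel images
logits = model(x)
print(logits.shape)                              # torch.Size([4, 10])
# For classification, the logits are typically passed to a softmax / cross-entropy loss.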

8.3 Long-short Term Memory Networks

Recurrent neural networks (RNNs) are a class of neural networks that allow previous outputs to be used as inputs while maintaining hidden states. In an RNN, the output is influenced not just by the weights applied to the inputs, as in a regular NN, but also by a "hidden" state vector representing the context based on prior input(s)/output(s). So, the same input could produce a different output depending on previous inputs in the series. RNNs are recurrent due to the repeated application of the same transformation to a series of inputs, producing a series of output vectors. In addition to generating the output, an RNN also updates its hidden state based on the input. Long short-term memory (LSTM) is an RNN architecture comprising a cell, an input gate, an output gate, and a forget gate [5]. The cell remembers values over arbitrary time intervals, and the three gates regulate the flow of information into and out of the cell. LSTM networks are well suited to modeling sequential and time-series data since they can accommodate delayed effects between data points in the datasets. The role of each unit of an LSTM is illustrated below (Fig. 8.2).

Fig. 8.2 LSTM schematic (adapted from [6])

The primary components of an LSTM are (1) the cell state and (2) the gates. The cell state provides a continuous flow of information through all components of the cell. As this information runs through an LSTM cell, it is either added to or removed from the cell state via gates. The gates are separate neural network components that decide which information is allowed on the cell state.
The input gate is used to update the state information of the cell. First, we pass the previous hidden state and the current input to the sigmoid function. The hidden state and current input are also fed into the tanh function. Afterward, both of these outputs are multiplied together. These operations decide the extent of information to be stored in the cell state. Next, the updated cell state is calculated. Initially, the cell state is multiplied by the forget vector; values in the cell state are dropped if they are multiplied by values near zero. Then the output from the input gate is added to update the cell state, resulting in the new cell state. The output gate decides what the next hidden state from the current cell will be. First, the previous hidden state and the current input are passed to the sigmoid function. Then the updated cell state is fed into the tanh function. These two outputs are multiplied together to decide which information should be in the hidden state. The new cell state and the new hidden state are then passed to the next time step. To summarize, the forget gate decides what is relevant to keep from previous steps, the input gate decides what information is pertinent to add from the current step, and the output gate determines what the next hidden state should be.
Mathematically, the operations associated with each of these cells in LSTM can
be written as follows. These equations illustrate how an LSTM cell processes input
sequences and updates its memory cell state and hidden state.
1. Input Gate: The input gate determines how much new information should be stored in the memory cell. It takes as input the current input vector, denoted as x_t, and the previous hidden state, denoted as h_{t-1}. The input gate activation, denoted as i_t, is computed as follows:

   i_t = \sigma(W_{xi} x_t + W_{hi} h_{t-1} + b_i)     (8.2)

   where W_{xi}, W_{hi}, and b_i are the weight matrices and bias term associated with the input gate.
2. Forget Gate: The forget gate determines how much information from the previous memory cell state, denoted as c_{t-1}, should be forgotten. It takes the same inputs as the input gate and computes the forget gate activation, denoted as f_t, as follows:

   f_t = \sigma(W_{xf} x_t + W_{hf} h_{t-1} + b_f)     (8.3)

3. Update Memory Cell: The update to the memory cell state, denoted as \tilde{c}_t, is computed by applying the hyperbolic tangent activation function to the current input and previous hidden state:

   \tilde{c}_t = \tanh(W_{xc} x_t + W_{hc} h_{t-1} + b_c)     (8.4)

4. Memory Cell: The memory cell state, denoted as c_t, is updated by combining the information from the input gate and the forget gate:

   c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t     (8.5)

   where \odot denotes element-wise multiplication.
5. Output Gate: The output gate determines how much of the memory cell state should be exposed as the current hidden state, denoted as h_t. It takes the same inputs as the previous gates and computes the output gate activation, denoted as o_t, as follows:

   o_t = \sigma(W_{xo} x_t + W_{ho} h_{t-1} + b_o)     (8.6)

6. Hidden State: The hidden state is computed by applying the hyperbolic tangent activation function to the updated memory cell state and multiplying it by the output gate activation:

   h_t = o_t \odot \tanh(c_t)     (8.7)
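As a minimal illustrative sketch (our addition, not one of the book's numbered snippets), a single LSTM cell step implementing Eqs. (8.2)–(8.7) can be written in NumPy as follows; the weight shapes and random initialization are assumptions made purely for demonstration.

import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """W is a dict of weight matrices, b a dict of bias vectors."""
    i_t = sigmoid(W["xi"] @ x_t + W["hi"] @ h_prev + b["i"])      # Eq. (8.2)
    f_t = sigmoid(W["xf"] @ x_t + W["hf"] @ h_prev + b["f"])      # Eq. (8.3)
    c_tilde = np.tanh(W["xc"] @ x_t + W["hc"] @ h_prev + b["c"])  # Eq. (8.4)
    c_t = f_t * c_prev + i_t * c_tilde                            # Eq. (8.5)
    o_t = sigmoid(W["xo"] @ x_t + W["ho"] @ h_prev + b["o"])      # Eq. (8.6)
    h_t = o_t * np.tanh(c_t)                                      # Eq. (8.7)
    return h_t, c_t

# Example usage with input size 3 and hidden size 4
rng = np.random.default_rng(0)
n_in, n_h = 3, 4
W = {k: rng.normal(size=(n_h, n_in if k.startswith("x") else n_h))
     for k in ["xi", "hi", "xf", "hf", "xc", "hc", "xo", "ho"]}
b = {k: np.zeros(n_h) for k in ["i", "f", "c", "o"]}
h, c = lstm_step(rng.normal(size=n_in), np.zeros(n_h), np.zeros(n_h), W, b)
print(h, c)

Applying lstm_step repeatedly over a sequence, carrying (h, c) forward at each step, reproduces the recurrent behavior described above.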

In the materials domain, LSTMs can be used to capture various types of time-dependent data, such as the diffusion of gases, fracture propagation, or the dynamics of continuum or atomistic systems. Indeed, LSTM architectures can be combined with other architectures such as CNNs to exploit the time-series nature of image data.
Gated Recurrent Units (GRUs) are a newer generation of RNNs that get rid of the cell state and use the hidden state to transfer information. A GRU therefore has two gates, a reset gate and an update gate. The update gate acts similarly to the forget and input gates of an LSTM and decides what information to throw away and what new information to add. The reset gate is used to determine how much past information to forget. A smaller number of gates results in fewer tensor operations; therefore, GRUs are computationally less intensive than LSTMs.

8.4 Generative Adversarial Networks

Machine learning models can be broadly classified into generative models and discriminative models. The fundamental difference between them is that discriminative models learn the (hard or soft) boundary between classes, while generative models learn the distribution of the individual classes. A generative model is one that can generate data, as it models both the features and the class, and it has the potential to automatically learn the natural features of a data set, whether categories or dimensions or something else entirely.
Let us consider a data set with two variables x and y. If we model the joint distribution of both, p(x, y), we can use this probability distribution to generate data points, and the result is a generative model. In contrast, if we model p(y|x), the conditional probability of the target y given the observable x, the model only captures which class a point belongs to and is therefore discriminative.

Fig. 8.3 GAN schematic (adapted with permission from [7])
Generative adversarial networks, or GANs are an approach to generative model-
ing using deep learning methods that employs a discriminator to aid in the generative
modeling step. GANs consist of two main components: a generator and a discrim-
inator (Fig. 8.3). The generator learns to produce synthetic data that resembles real
data, while the discriminator learns to distinguish between real and generated data.
This adversarial setup allows the generator to improve its ability to generate realistic
data by competing against the discriminator. Thus, GANs employ a new format of
training a generative model by posing the problem as a supervised learning problem
with two sub-models: the generator model, which we train to generate new examples,
and the discriminator model, which tries to classify examples as either real (from the
domain) or fake (generated). The two models are trained together in a zero-sum game
format, contesting each other, until the discriminator model is deceived about half
the time, meaning the generator model generates plausible examples.
The generator takes as input a random noise vector, denoted as .z, and maps it to a
generated sample, denoted as .G(z). The goal of the generator is to generate samples
that are indistinguishable from real samples. On the other hand, the discriminator
takes as input a sample, either real (.x) or generated (.G(z)), and outputs a probability
(. D(x) or . D(G(z))) representing its belief on whether the sample is real or generated.
The training objective of GANs can be defined using the minimax game between
the generator and the discriminator:

min_G max_D V(D, G) = E_{x∼p_data(x)}[log D(x)] + E_{z∼p_z(z)}[log(1 − D(G(z)))]    (8.8)
Here, . pdata (x) represents the distribution of real data, and . pz (z) represents the dis-
tribution of the random noise vector .z. The objective function consists of two terms:
the first term encourages the discriminator to maximize the probability of correctly
classifying real samples as real, and the second term encourages the discriminator to
maximize the probability of correctly classifying generated samples as generated.
During the training phase, the weights of the discriminator and the generator are
updated sequentially, not simultaneously. During discriminator training, the dis-
criminator classifies both real data and fake data from the generator, and the
discriminator loss penalizes it for misclassifying a real instance as fake or a fake
instance as real. Afterward, the discriminator updates its weights through
backpropagation of the discriminator loss through the discriminator network.
The generator is trained with the following procedure. After sampling random
noise, the generator output is computed. The discriminator then classifies the generator
output as “real” or “fake”, and a loss is calculated from this classification. This loss
is backpropagated through both the discriminator and the generator to obtain gradients,
while only the generator weights are updated. This completes one iteration of
generator training. After the training process, the discriminator model is discarded,
as we are interested only in the generator.
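
The alternating update described above can be illustrated with a minimal PyTorch sketch; the toy data, the network sizes, and the optimizer settings are illustrative assumptions rather than a recommended configuration.

import torch
import torch.nn as nn

latent_dim, data_dim = 16, 8
G = nn.Sequential(nn.Linear(latent_dim, 32), nn.ReLU(), nn.Linear(32, data_dim))
D = nn.Sequential(nn.Linear(data_dim, 32), nn.ReLU(), nn.Linear(32, 1), nn.Sigmoid())
opt_G = torch.optim.Adam(G.parameters(), lr=1e-3)
opt_D = torch.optim.Adam(D.parameters(), lr=1e-3)
bce = nn.BCELoss()

def train_step(real_batch):
    b = real_batch.size(0)
    ones, zeros = torch.ones(b, 1), torch.zeros(b, 1)

    # 1) Discriminator update: real samples labeled 1, generated samples labeled 0
    z = torch.randn(b, latent_dim)
    fake = G(z).detach()                  # detach so only D is updated in this step
    loss_D = bce(D(real_batch), ones) + bce(D(fake), zeros)
    opt_D.zero_grad(); loss_D.backward(); opt_D.step()

    # 2) Generator update: try to make D classify generated samples as real;
    #    gradients flow through D, but only the generator optimizer steps
    z = torch.randn(b, latent_dim)
    loss_G = bce(D(G(z)), ones)
    opt_G.zero_grad(); loss_G.backward(); opt_G.step()
    return loss_D.item(), loss_G.item()

# Toy usage with Gaussian "real" data
for _ in range(100):
    train_step(torch.randn(64, data_dim))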

8.5 Graph Neural Networks (GNN)

Graph Neural Networks (GNNs) are a class of neural networks specifically designed
to process and analyze graph-structured data. They are particularly effective for
tasks involving relational data, such as social networks, molecular structures, citation
networks, and knowledge graphs. GNNs operate by propagating information through
the nodes and edges of a graph, allowing each node to aggregate and update its
representation based on the information from its neighbors (Fig. 8.4).
The key idea behind GNNs is to leverage the connectivity and relational structure
of graph data to enable effective information propagation and learning. By iteratively

Fig. 8.4 GNN schematic (adapted with permission from [8])



updating node representations based on the information from their neighbors, GNNs
can capture complex dependencies and patterns in graph-structured data.
A graph is a data structure consisting of two elements: nodes (or vertices) and
edges connecting them. The nodes of a graph can be homogeneous, with all nodes
having a similar structure, or heterogeneous, with nodes having different types of structure.
The edges of the graph can also be weighted to represent the importance of each edge.
Graph embedding is mapping a graph into a set of vectors that capture graph topology,
node-to-node relationships, and other relevant information about graphs. Each node
will have a unique set of embeddings for itself which determines its identity in the
graph.
In a typical GNN, the input consists of a graph with a set of nodes and edges.
Each node in the graph is associated with a feature vector representing its attributes,
and each edge contains information about the relationship between two connected
nodes. The goal of the GNN is to learn a function that maps these input features to
a desired output, such as node classification, node level predictions, or graph-level
prediction. Molecular structures can naturally be modeled as a graph with the atoms
representing the nodes and the bonds representing the edges. Then, GNN can be
used for several tasks such as: (i) predicting the dynamics of individual atoms by
predicting the node-level displacement, (ii) predicting the overall graph property
such as toxicity of a drug molecule, or (iii) as an interatomic potential by predicting
the potential energy of a node/edge as a function of the local structure.
The propagation step in GNNs is typically performed iteratively, allowing each
node to update its representation based on the features of its neighbors. This process
can be summarized as follows:
1. Initialization: Each node in the graph is assigned an initial feature vector, denoted
as h_v^{(0)}, where v represents the node index.
2. Message Passing: During message passing, information is exchanged between
connected nodes. Each node aggregates the feature vectors of its neighbors and
combines them with its own feature vector to generate a new representation. This
aggregation step can be defined using a function, typically a neural network,
that takes the neighboring node features and produces a message for each edge.
For example, the message passed from node u to node v can be computed as
m_{uv} = M(h_u^{(t)}, h_v^{(t)}, e_{uv}), where M is a function that combines the node features
h_u^{(t)} and h_v^{(t)}, along with the edge feature e_{uv}.
3. Aggregation: After computing the messages, each node aggregates the received
messages to update its own representation. This aggregation step can be defined
as h_v^{(t+1)} = U(h_v^{(t)}, {m_{uv}}), where U is a function that combines the node features
h_v^{(t)} and the received messages {m_{uv}}.

4. Iteration: Steps 2 and 3 are repeated for a fixed number of iterations, allowing
the information to propagate through the graph. After each iteration, the repre-
sentations of the nodes are refined based on the updated information from their
neighbors.

5. Readout: Once the iterations are completed, the final node representations can
be used for various downstream tasks, such as node classification, graph classifi-
cation, or link prediction.
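
The message-passing scheme outlined above can be illustrated with a minimal NumPy sketch; the toy graph, the sum aggregation, and the random projection matrices (here called W_msg and W_upd) are illustrative assumptions, whereas a real GNN would learn these weights from data.

import numpy as np

rng = np.random.default_rng(1)
n_nodes, d = 4, 8
edges = [(0, 1), (1, 2), (2, 3), (3, 0)]            # e.g., atoms connected by bonds
h = rng.normal(size=(n_nodes, d))                   # initial node features h_v^(0)
W_msg = rng.normal(size=(d, d))
W_upd = rng.normal(size=(2 * d, d))

def message_passing_step(h):
    messages = np.zeros_like(h)
    for u, v in edges:                              # undirected graph: pass messages both ways
        messages[v] += np.tanh(h[u] @ W_msg)        # message m_uv = M(h_u)
        messages[u] += np.tanh(h[v] @ W_msg)
    # update: h_v^(t+1) = U(h_v^(t), aggregated messages)
    return np.tanh(np.concatenate([h, messages], axis=1) @ W_upd)

for _ in range(3):                                  # three rounds of propagation
    h = message_passing_step(h)
graph_embedding = h.mean(axis=0)                    # readout for a graph-level prediction
print(graph_embedding.shape)                        # (8,)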
Two prominent GNN architectures are (1) graph convolutional networks (GCNs)
and (2) graph encoder networks (GENs). In GCNs, as in a CNN, a spatially moving filter
over the nodes extracts key features from the embeddings, which are further used
in a CNN framework. In GENs, an encoding layer downsamples the embedding
inputs by passing them through convolutional filters to provide a compact feature
representation, and a decoder upsamples the representation provided by the
encoder to reconstruct the input. Both of these architectures
and their variants can be used to learn embeddings and to predict embeddings
for unseen nodes.

8.6 Variational Auto Encoders (VAE)

Variational auto encoders (VAEs) are a class of generative neural network architectures
belonging to the family of Bayesian graphical models and variational Bayesian methods.
Unlike GANs, VAEs are explicit generative models wherein the specifics of the
probabilistic distributions are incorporated in the network architecture. This further
aids in sampling from the output distribution of the network (Fig. 8.5).
There are two integral components of a VAE: (i) an encoder network, and
(ii) a decoder network. The encoder neural network is employed to convert the input data x
into a latent representation z, conditioned on a suitable
attribute a. Hence, by performing high-level inference, the encoder compresses the
data, learning the lower-dimensional latent space distribution. The decoder
neural network takes the latent space input and attributes and regenerates the output
probability distribution. However, during this process, some amount of informa-
tion is irrecoverably lost. During the training of VAEs, this error is backpropagated
through the entire network to improve the reconstruction of the original inputs. Note
that the bottleneck layer of a VAE represents an information bottleneck obtained by reducing
the dimensionality of the features. Thus, the bottleneck layer of the encoder rep-
resents the input data in an extremely low-dimensional space. The performance of
the decoder can be used to evaluate the variance of the information that is captured
by the bottleneck layer. In this sense, VAEs can also be used as a dimensionality
reduction technique.
The variational autoencoder can be represented as a graphical model, where
the joint probability can be expressed as p(x, z) = p(x|z)p(z). This enables latent
variables and data points to be sampled from the distributions
p(z) and p(x|z), respectively. For inference, we need to compute
what is known as the posterior distribution p(z|x) = p(x|z)p(z)/p(x). To make this
computation tractable, VAEs assume that samples of z can be drawn from a simple
approximate posterior distribution q(z|x) that is close to p(z|x). VAEs use the Kullback-Leibler

Fig. 8.5 VAE schematic (adapted with permission from [9])

divergence to measure how far q(·) is from the true posterior p(z|x). This is achieved
through a loss function called the Evidence Lower BOund (ELBO) defined for the VAE.
The ELBO is a lower bound on the evidence, and its maximization results in
increasing the likelihood of observing the data under the assumed distribution.
Mathematically, the operations in VAEs can be defined as follows. Consider the
input data as .x and the latent variable as .z. The encoder network parameterizes the
conditional distribution .q(z|x), which approximates the true posterior distribution
p(z|x). The encoder produces two vectors, the mean μ and the log-variance log(σ²),
that parameterize the approximate posterior distribution:

μ, log(σ²) = Encoder(x)    (8.9)

To generate a sample from the latent space, a reparameterization trick is introduced.
We sample an auxiliary variable ϵ from a standard Gaussian distribution:

ϵ ∼ N(0, I)    (8.10)

Then a latent variable z is sampled from the approximate posterior by applying a
transformation using the mean and log-variance:

z = μ + ϵ · σ    (8.11)

The decoder network parameterizes the conditional distribution p(x|z), which
generates the reconstructed output x̃ given a latent variable z:

x̃ = Decoder(z)    (8.12)

The reconstruction loss, often measured as the negative log-likelihood, is used to
measure the similarity between the original input x and the reconstructed output x̃.
This encourages the model to learn meaningful representations in the latent space.
The reconstruction loss is typically defined as:

L_recon = −[x log(x̃) + (1 − x) log(1 − x̃)]    (8.13)

To regularize the latent space and encourage it to follow a prior distribution,
typically a standard Gaussian distribution, a regularization term called the Kullback-
Leibler (KL) divergence is used. The KL divergence measures the difference between
the approximate posterior q(z|x) and the prior distribution p(z). The KL divergence
is given by:

L_KL = −(1/2) Σ (1 + log(σ²) − μ² − σ²)    (8.14)
The overall objective function for training a VAE is the sum of the reconstruction
loss and the KL divergence:

L = L_recon + L_KL    (8.15)

During training, VAEs optimize this objective function using stochastic gradient
descent or a similar optimization algorithm. Once trained, the decoder network can
generate new samples by sampling latent variables from the prior distribution and
decoding them into output data samples. VAEs are widely used in materials for
synthetic data generation and also dimensionality reduction. Once the VAE is
trained, the encoder is used to reduce the dimensionality of the input features, which
can then be used for downstream tasks such as classification or regression.
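
A minimal PyTorch sketch of Eqs. (8.9)–(8.15) is given below; the layer sizes are illustrative assumptions, and the binary cross-entropy reconstruction loss assumes input features scaled to [0, 1].

import torch
import torch.nn as nn

d_in, d_latent = 20, 2
encoder = nn.Sequential(nn.Linear(d_in, 32), nn.ReLU(), nn.Linear(32, 2 * d_latent))
decoder = nn.Sequential(nn.Linear(d_latent, 32), nn.ReLU(), nn.Linear(32, d_in), nn.Sigmoid())

def vae_loss(x):
    stats = encoder(x)
    mu, log_var = stats[:, :d_latent], stats[:, d_latent:]          # Eq. (8.9)
    eps = torch.randn_like(mu)                                      # Eq. (8.10)
    z = mu + eps * torch.exp(0.5 * log_var)                         # Eq. (8.11), sigma = exp(log_var / 2)
    x_rec = decoder(z)                                              # Eq. (8.12)
    recon = nn.functional.binary_cross_entropy(x_rec, x, reduction="sum")  # Eq. (8.13)
    kl = -0.5 * torch.sum(1 + log_var - mu.pow(2) - log_var.exp())  # Eq. (8.14)
    return recon + kl                                               # Eq. (8.15)

opt = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()), lr=1e-3)
x = torch.rand(64, d_in)              # toy batch with features in [0, 1]
for _ in range(100):
    loss = vae_loss(x)
    opt.zero_grad(); loss.backward(); opt.step()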

8.7 Reinforcement Learning (RL)

RL is a machine learning paradigm wherein the agent is rewarded through feedback
from the environment based on the desired behavior [2]. The RL framework consists of three
main components, namely the agent, the environment, and the reward. The decision
maker is called the agent, which receives the state from the environment and takes the
actions. Here, the state refers to the current situation of the agent. For every action

Fig. 8.6 RL schematic (adapted with permission from [10])

taken, the agent receives a reward from the environment and transitions to a new
state. This interaction between the agent and the environment takes place at discrete
time steps, .t = 0, 1, 2, 3..... Specifically, at each time step t, the agent takes action .at
on receiving state .st from the environment. In the next time step, the agent receives
a scalar reward .rt (for the action .at taken from state .st ) and finds itself in a new state
.st+1 (Fig. 8.6).
The agent uses the policy.π to interact with the environment to generate a sequence
of states, actions, and rewards.

H = (s_1, a_1, r_1, s_2, a_2, r_2, . . . , s_T, a_T, r_T)    (8.16)

The goal of the agent is to find a sequence of control actions to maximize the expected
discounted return of rewards as given below:


arg max_A E[R_t := Σ_{k=0}^{∞} γ^k r_{t+k}]    (8.17)

where γ is the discount rate.


There are three approaches to solving the RL problems i.e., value-based, policy-
based, and model-based RL. Value-based RL involves learning the optimal policy
by learning the value function that maps each state/ state-action pair to a value (V(s),
Q(s, a)). Policy-based methods directly optimize the policy .π without using a value
function. This is useful when the action space is continuous or stochastic. Actor-
Critic algorithms learn approximations of both policy and value functions. The critic
measures how good the action taken is (value-based) by estimating the value function.
The “Actor” updates the policy distribution in the direction suggested by the Critic
(such as with policy gradients). Model-based RL involves creating a model of the
behavior of the environment and using that model to find the optimal policy.
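
As an illustration of value-based RL, the following is a minimal sketch of tabular Q-learning on a hypothetical one-dimensional chain environment; the environment, rewards, and hyperparameters are illustrative assumptions.

import numpy as np

n_states, n_actions = 5, 2            # actions: 0 = move left, 1 = move right
Q = np.zeros((n_states, n_actions))   # value function Q(s, a)
alpha, gamma, eps = 0.1, 0.9, 0.1     # learning rate, discount rate, exploration rate
rng = np.random.default_rng(0)

def step(s, a):
    # Move along the chain; reaching the right end gives a reward of 1
    s_next = min(s + 1, n_states - 1) if a == 1 else max(s - 1, 0)
    reward = 1.0 if s_next == n_states - 1 else 0.0
    done = s_next == n_states - 1
    return s_next, reward, done

for episode in range(500):
    s, done = 0, False
    while not done:
        # epsilon-greedy action selection from the current value estimates
        a = rng.integers(n_actions) if rng.random() < eps else int(np.argmax(Q[s]))
        s_next, r, done = step(s, a)
        # Q-learning update: move Q(s, a) toward r + gamma * max_a' Q(s', a')
        Q[s, a] += alpha * (r + gamma * np.max(Q[s_next]) - Q[s, a])
        s = s_next

print(np.argmax(Q, axis=1))   # greedy policy learned from the value function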

8.8 Summary

In this chapter, we have explored advanced deep learning techniques and their broad
applications in machine learning. Our focus was on providing a prescriptive overview
of these techniques rather than delving into detailed explanations of the algorithms.
We highlighted the transformative impact of deep learning compared to classical
approaches, showcasing the shift from manual feature engineering to end-to-end
learning. Deep learning models, such as CNNs, LSTMs, GANs, GNNs, VAEs, and
RL, have revolutionized various fields beyond materials science. While we did not
provide extensive technical explanations, we emphasized the wide-ranging applica-
tions of these advanced deep learning techniques. From computer vision and nat-
ural language processing to generative modeling and sequential decision making,
deep learning has enabled breakthroughs in areas such as image analysis, language
understanding, creative generation, network analysis, representation learning, and
autonomous agents. Altogether, the chapter aimed to provide readers with a broader
outlook on the capabilities and potential applications of advanced deep learning
techniques. By understanding the core concepts and recognizing the diverse domains
where these techniques have been successfully applied, readers can explore and adapt
these approaches to their specific problem domains.

References

1. I. Goodfellow, Y. Bengio, A. Courville, Deep Learning (The MIT Press, 2016). isbn:
0262035618
2. R.S. Sutton, A.G. Barto, Reinforcement Learning: An Introduction (MIT Press, 2018)
3. K.P. Murphy, Machine Learning: A Probabilistic Perspective (The MIT Press, 2012). isbn:
0262018020
4. J. Bernal, K. Kushibar, D.S. Asfaw, S. Valverde, A. Oliver, R. Marti, X. Llado, Deep convo-
lutional neural networks for brain image analysis on magnetic resonance imaging: a review.
Artif. Intell. Med. 95, 64–81 (2019)
5. F.A. Gers, J. Schmidhuber, F. Cummins, Learning to forget: continual prediction with LSTM.
Neural Comput. 12(10), 2451–2471 (2000)
6. T. Fischer, C. Krauss, Deep learning with long short-term memory networks for financial market
predictions. Eur. J. Oper. Res. 270(2), 654–669 (2018)
7. A. Aggarwal, M. Mittal, G. Battineni, Generative adversarial network: an overview of theory
and applications. Int. J. Inf. Manag. Data Insights 100 004 (2021)
8. J.-Y. Kim, S.-B. Cho, A systematic analysis and guidelines of graph neural networks for prac-
tical applications. Expert Syst. Appl. 184, 115 466 (2021)
9. D.P. Kingma, M. Welling, An Introduction to Variational Autoencoders (2019).
arXiv:1906.02691
10. R. Nian, J. Liu, B. Huang, A review on reinforcement learning: introduction and applications
in industrial process control. Comput. Chem. Eng. 139, 106 886 (2020)
Chapter 9
Interpretable Machine Learning

Abstract ML algorithms, and deep learning models more so, are notorious for their
black-box nature providing little or no insights into the nature of the learned func-
tion. To address this challenge, increasing emphasis is being placed on developing
interpretable ML models or enabling interpretability for the existing models. Inter-
pretable machine learning addresses the challenge of understanding complex black-
box models, enabling transparency and insight into their decision-making processes.
In this chapter, we explore interpretable machine learning techniques, focusing on
two prominent methods: SHapley Additive exPlanations (SHAP) and integrated gra-
dients. The SHAP framework, based on cooperative game theory, is examined as
a method to attribute feature contributions to model outputs. SHAP values provide
a mathematically grounded approach to understanding the significance and impact
of individual features. The chapter also explores integrated gradients, which quan-
tify feature importance by integrating gradients along a reference-to-input path. This
technique offers insights into how changes in feature values affect model predictions.
We also discuss symbolic regression as a tool to abstract out symbolic laws from
the data. Finally, we discuss a few additional interpretability algorithms to unpack
the black-box ML models. Altogether, we discuss how interpretable algorithms can
provide insights into the feature-to-label map learned by the DL models.

9.1 Introduction

ML approaches are notorious for the black-box nature of the models. With increas-
ing complexity of the models, interpretability and explainability decreases. Simpler
models such as linear and polynomial regression are explainable due to the paramet-
ric nature of the equations. However, more complex models such as random forest,
and neural networks are less interpretable. Support vector machines can be considered
an intermediate model, wherein the simpler linear version provides some insights
into the model, whereas the non-linear kernel versions are more complex with less
interpretability. Overall, complex ML models provide little or no insight into the
nature of the problem or the input–output relationships. This questions the applicabil-
ity of these models to extrapolate beyond the training domain and to be applied in practical

situations, particularly in domains where model decisions impact human lives and
crucial decision-making processes. Further, domain experts might find it challenging
to appreciate the model as the features learnt by the model cannot be clearly under-
stood. By making machine learning models more transparent and explainable, we
can build trust, understand biases, detect errors, and ensure ethical considerations
are met.
To address these challenges, several algorithms have been developed recently that
can be used to explain the features learnt by the ML models. While some of these
algorithms are model-specific, others are model-agnostic. For example, the feature
explainer in tree algorithms identifies the frequency of usage of the input features to
create a branch in a tree. This frequency is then used as a surrogate to explain the
feature importance learnt by the model. On the contrary, algorithms such as Shapley
additive explanations (SHAP) are model agnostic and can be applied on any ML
model.
In this chapter, we will focus on the interpretability of ML models. First, we will
discuss the SHAP framework, a versatile tool for model interpretation based on coop-
erative game theory. SHAP values provide a unified and mathematically grounded
approach to attribute the contribution of each feature to the model’s output, offer-
ing insights into the importance and impact of individual features. Next, the chapter
explores integrated gradients, a technique inspired by the field of interpretability
in deep learning. Integrated gradients provide an attribution method that quantifies
the importance of each feature by integrating gradients along a path from a refer-
ence point to the input point. This approach enables the understanding of feature
importance and how changes in feature values influence model predictions.
We will also discuss the theoretical foundations, practical implementation, and
application examples of both SHAP and integrated gradients. We highlight their
strengths, limitations, and considerations for different types of machine learning
models and datasets. By employing these interpretable machine learning techniques,
practitioners and researchers can gain a deeper understanding of how complex models
arrive at their predictions. This chapter aims to equip readers with the knowledge
and tools necessary to unravel the inner workings of black-box models, fostering
transparency, trust, and responsible deployment of machine learning in real-world
applications. For further discussion on interpretable machine learning, readers are
directed to References [1, 2].

9.2 Shapley Additive Explanations

ML models frequently ignore the physical significance of the input features for
the predictions. If the relationship between the inputs and the outputs could
be established, the model would become more universally applicable and simpler to
interpret. To this extent, various techniques have recently been proposed in
the literature, such as LIME [3], DeepLIFT [4], Layer-Wise Relevance Propagation

Fig. 9.1 Interaction plot for Young’s modulus

Fig. 9.2 Mean of absolute SHAP values

[5], Shapley regression values [6], Shapley sampling values and quantitative input
influence [7], among others.
The SHapley Additive exPlanation (SHAP) method uses the “Shapley value” to
understand the influence of a feature value on a prediction.

Fig. 9.3 Distribution of shap values for each component

Fig. 9.4 Force plot showing contribution of each component for a prediction

SHAP values provide a unified game-theoretic approach to calculate the feature
importance of an ML model. SHAP measures a feature’s importance by quantifying
the prediction error while perturbing a given feature value. If the prediction error
is large, the feature is important; otherwise, the feature is less important. It is an
additive feature importance method which produces a unique solution while adhering
to desirable properties, namely local accuracy, missingness, and consistency. The SHAP
value of a feature is defined as its average marginal contribution across all potential feature
combinations (Figs. 9.1, 9.2, 9.3 and 9.4).
Suppose a regression model has the prediction function f'(x_1, . . . , x_n), where
x_1, . . . , x_n are the features. By means of SHAP, the contribution of the jth feature is
computed as follows:

φ_j(f') = f'(x_j) − E(f'(X_j))    (9.1)

where the upper-case letter stands for the feature random variable. E(f'(X_j)) is
the mean effect estimate for the jth feature, and the contribution is computed as the
difference between the feature prediction and the average estimate. If we add up
every feature contribution at once, we get the following:

Σ_{j=1}^{N} φ_j(f') = Σ_{j=1}^{N} (f'(x_j) − f'(E(X_j)))    (9.2)

Let f'_x(S) denote the estimated prediction for the feature values in the set S,
marginalized over the features not in the set S:

f'_x(S) = ∫ f'(x_1, . . . , x_N) dP_{x∉S} − E_X(f'(X))    (9.3)

A feature value’s contribution to the prediction is weighted and added over all pos-
sible feature value combinations to determine its Shapley value. It is determined
by averaging out each feature’s marginal contributions to all viable coalitions of
features. This leads to an estimation of the Shapley value φ_j(f_x) as follows:

φ_j(f_x) = Σ_{S⊆N\{j}} [|S|!(N − |S| − 1)!/N!] [f_x(S ∪ {j}) − f_x(S)]    (9.4)

where S denotes a subset of features and N denotes the total number of input
features. A negative Shapley value indicates that the feature instance has a
negative impact on the target value, whereas a positive value indicates the opposite.
This explanation method uses additive feature contributions, based on a linear
combination over the features as follows.

g(z') = φ_0 + Σ_{j=1}^{N} φ_j z'_j    (9.5)
where z' ∈ {0, 1}^N is the coalition vector, N is the number of features, and φ_j is the
attribution of the jth feature [8]. In this study, we have interpreted the best-performing GPR models
by utilizing the “KernelSHAP” framework. It is appropriate for nonlinear models such
as GPR and interprets the feature importance by evaluating the Shapley values.
The “KernelSHAP” is an extension of SHAP wherein the contributions of each
feature value to the estimate for a data instance are calculated for nonlinear models
such as GPR. The “KernelSHAP” algorithm involves five major steps as indicated below:
1. Sample coalitions z'_k ∈ {0, 1}^M, where k = {0, 1, . . . , K}; 1 means the coalition has
the feature and 0 means the feature is absent.
2. Get the prediction for each z'_k: f'(h_x(z'_k)), where h_x : {0, 1}^M → R^P.
3. Estimate the weight for each z'_k using the SHAP kernel π_x(z') [8]:

π_x(z') = (M − 1) / [ C(M, |z'|) · |z'| · (M − |z'|) ]    (9.6)

where M indicates the maximum coalition size, C(M, |z'|) is the binomial coefficient, and |z'|
stands for the number of features present in the given coalition.
4. Fit a weighted linear model by minimizing the loss L = Σ_{z'} π_x(z') [f'(h_x(z')) − g(z')]².
5. Return the Shapley values φ_k, the linear model’s coefficients.
After normalization, the final Shapley importance is presented as the average of the
absolute Shapley values for each feature across the data, I_k = (1/n) Σ_{i=1}^{n} |φ_k^{(i)}|.
A high value of I_k means the feature is crucial for predicting the target variable,
and vice versa.
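
As an illustration, the following sketch uses the open-source shap package with a KernelExplainer on a synthetic dataset and a random-forest surrogate model; the data, the model choice, and the background sample size are illustrative assumptions.

import numpy as np
import shap
from sklearn.ensemble import RandomForestRegressor
from sklearn.datasets import make_regression

# Train a surrogate model on a synthetic composition-property dataset
X, y = make_regression(n_samples=200, n_features=5, noise=0.1, random_state=0)
model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)

# KernelSHAP: a background sample summarizes the data distribution, and
# Shapley values are estimated for a set of instances to be explained
background = shap.sample(X, 50)
explainer = shap.KernelExplainer(model.predict, background)
shap_values = explainer.shap_values(X[:20])

# Mean absolute SHAP value per feature, i.e., the importance I_k defined above
print(np.abs(shap_values).mean(axis=0))

# Violin/summary plot of the SHAP values (discussed below)
shap.summary_plot(shap_values, X[:20])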
The SHAP values are visualized using a violin plot and a river flow plot. The violin
plot represents the contribution of a given feature towards the different output values
as a function of the feature value. Thus, the violin plot is colored according to the fea-
ture value. The river plot shows some specific paths for the final prediction, where the
intermediate points represent the contribution of different input components. These
paths are created by nudging the prediction from expected value towards a particular
direction representing that specific glass component’s particular contribution. Thus,
the river flow plot is colored according to the final output value.
Further, the correlation between several input components in a model can also be
studied using the SHAP interaction values. To this extent, we analyze the error in
the output prediction while perturbing two input components simultaneously. If the
magnitude of the error in the output while perturbing a single input component is the
same for different values of the second input component, this suggests that the two
input components are not correlated; otherwise, they are correlated. The degree of
this correlation can also be quantified from the SHAP interaction values as follows.
The Shapley interaction values from classic game theory are calculated as

SHAP_ij(f) = Σ_{S⊆{1,2,...,p}\{i,j}} [(|S| − 1)!(p − |S| − 2)!/(p − 1)!] × [f(S ∪ {i, j}) − f(S ∪ {i}) − f(S ∪ {j}) + f(S)]    (9.7)

where

SHAP_ij(f) : SHAP interaction value between features i and j in function f


S : Subset of features
p : Total number of features
f (S) : Model prediction when using only the features in subset S
f (S ∪ {i}) : Model prediction when including feature i with features in subset S
f (S ∪ { j}) : Model prediction when including feature j with features in subset S
f (S ∪ {i, j}) : Model prediction when including both features i and j with features in subset S

This equation represents the SHAP interaction values between pairs of features,
denoted by .i and . j, in a given function or model . f . The equation involves sum-
ming over all possible subsets . S of features excluding .i and . j. The terms inside
the summation compute the difference between model predictions when including
and excluding the features .i and . j in different combinations. The resulting SHAP
interaction values provide insights into the joint effect of features on the model pre-
dictions. For a model with M features, an M × M matrix per instance is obtained. To
visualize interaction values for a specific property, a heatmap of the M × M square grid
is plotted, where the color of each grid cell represents the normalized interaction value. Note
that interaction values are averaged (mean of absolute values) over the whole dataset
to produce a single M × M matrix for each property. The stronger the interaction
value, the stronger the coupling between two variables for a given property.

9.3 Integrated Gradients

Formally, integrated gradients defines the importance value for the ith feature
as follows:

φ_i^{IG}(f, x, x') = (x_i − x'_i) × ∫_{α=0}^{1} [∂f(x' + α(x − x'))/∂x_i] dα    (9.8)

where x is the current input, f is the model function, and x' is some baseline input
that is meant to represent the absence of feature input. The subscript i is used to
denote the indexing into the ith feature (Fig. 9.5).
As the formula above states, integrated gradients gets importance scores by accu-
mulating gradients on images interpolated between the baseline value and the current
input. But why would doing this make sense? Recall that the gradient of a function
represents the direction of maximum increase. The gradient is telling us which pixels
have the steepest local slope with respect to the output. For this reason, the gradient
of a network at the input was one of the earliest saliency methods.
Unfortunately, there are many problems with using gradients to interpret deep
neural networks. One specific issue is that neural networks are prone to a problem
known as saturation: the gradients of input features may have small magnitudes
around a sample even if the network depends heavily on those features. This can
happen if the network function flattens after those features reach a certain magnitude.
Intuitively, shifting the pixels in an image by a small amount typically does not
change what the network sees in the image. We can illustrate saturation by plotting
the network output at all images between the baseline .x ' and the current image. The
figure below displays that the network output for the correct class increases initially,
but then quickly flattens.
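
A minimal sketch of Eq. (9.8), approximated with a Riemann sum along the baseline-to-input path, is given below; the toy model and the use of a numerical gradient are illustrative assumptions (in practice, the gradients are obtained by automatic differentiation of the network).

import numpy as np

def numerical_grad(f, x, i, h=1e-5):
    # central-difference estimate of the partial derivative of f with respect to x_i
    e = np.zeros_like(x); e[i] = h
    return (f(x + e) - f(x - e)) / (2 * h)

def integrated_gradients(f, x, x_baseline, n_steps=50):
    # Attribute f(x) to each feature by accumulating gradients along the
    # straight-line path from the baseline x' to the input x (Eq. (9.8))
    attributions = np.zeros_like(x)
    for i in range(len(x)):
        grads = [numerical_grad(f, x_baseline + alpha * (x - x_baseline), i)
                 for alpha in np.linspace(0.0, 1.0, n_steps)]
        attributions[i] = (x[i] - x_baseline[i]) * np.mean(grads)
    return attributions

# Toy model that saturates in feature 0 and depends weakly on feature 1
f = lambda x: np.tanh(3.0 * x[0]) + 0.1 * x[1]
x, baseline = np.array([1.0, 1.0]), np.zeros(2)
print(integrated_gradients(f, x, baseline))   # feature 0 receives most of the attribution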

Fig. 9.5 Comparing integrated gradients with gradients at the image. Left-to-right: original input
image, label and softmax score for the highest scoring class, visualization of integrated gradients,
visualization of gradients*image. Notice that the visualizations obtained from integrated gradients
are better at reflecting distinctive features of the image

9.4 Symbolic Regression

The historical process of discovering the functional forms that relate abstract quanti-
ties, such as energy and force, to observable quantities, such as positions and veloci-
ties, was indeed a laborious and time-consuming endeavor. Scientists and researchers
dedicated significant efforts over decades and even centuries to conduct experiments
and make observations to uncover these relationships. One notable example is the
equation governing kinetic energy, which Emilie Du Chatelet identified through
meticulous study and analysis.
The discovery of the equation for kinetic energy, expressed as (1/2)mẋ², was a ground-
breaking achievement. Prior to Du Chatelet’s work, there was a misconception that
kinetic energy was linearly proportional to velocity (ẋ). Du Chatelet’s findings cor-
rected this misunderstanding and provided a more accurate understanding of the
relationship between kinetic energy, mass (m), and the square of velocity (ẋ²). Her
contribution not only corrected an existing misconception but also paved the way for
further advancements in the field of physics.
It is important to note that these discoveries were primarily driven by intuition,
empirical understanding, and occasional derivation from first principles. Formal

methods, as we know them today, were not employed during that time. Scientists
relied on their deep knowledge of the subject matter, careful observations, and insight-
ful interpretations to uncover the underlying mathematical relationships between
abstract and observable quantities.
The process of discovering equations directly from observations has been a fun-
damental method through which humans have developed their understanding of the
universe. While the historical approach lacked formalism, it highlights the impor-
tance of empirical exploration and the ability to derive insights from experimental
data. Today, with the advent of modern computational techniques and machine learn-
ing algorithms, we have the opportunity to employ formal methods, such as symbolic
regression, to automatically discover mathematical equations directly from data. This
formalization of the process allows for a more systematic and efficient exploration of
mathematical relationships, complementing the intuitive and empirical approaches
of the past.
With the advent of computational and data-driven modeling, these problems can be
solved in a more formal fashion employing combinatorial optimization along with the
symbolic functions and operations through an approach called symbolic regression.
Symbolic regression is a computational technique that aims to discover mathematical
equations directly from data without any prior knowledge of the underlying functional
form. It leverages the principles of evolutionary algorithms and genetic programming
to search through a vast space of mathematical expressions and identify the most
suitable equation that accurately represents the relationship between variables.
In symbolic regression, the goal is to find an equation of the form:

. y = f (x1 , x2 , ..., xn ) (9.9)

where . y represents the dependent variable, .x1 , x2 , ..., xn are the independent vari-
ables, and . f is the unknown mathematical function to be discovered. The objective
is to find the functional form of . f that best fits the given data.
The search for the equation begins by generating an initial population of candidate
equations. These candidate equations consist of a combination of mathematical oper-
ators (e.g., addition, subtraction, multiplication, division) and mathematical functions
(e.g., logarithm, exponential, trigonometric functions). The initial population is typ-
ically generated randomly or based on prior knowledge.
To evaluate the fitness of each candidate equation, a fitness function is defined. The
fitness function quantifies how well the equation fits the observed data. This fitness
evaluation is often based on a measure of the difference between the predicted values
from the equation and the actual observed values.
Genetic programming techniques are then employed to evolve and refine the pop-
ulation of candidate equations over successive generations. This involves applying
genetic operators such as mutation and crossover to create new candidate equations
by modifying or combining existing ones. Mutation introduces random changes to the
equations, while crossover combines elements from two parent equations to generate
offspring equations.

The evolution process follows the principles of natural selection, where candidate
equations with higher fitness are more likely to be selected for reproduction, and thus
have a higher chance of propagating their genetic material to subsequent generations.
Over multiple generations, the population evolves towards equations that provide a
better fit to the data, allowing the discovery of the underlying functional form.
The evolution process continues until a stopping criterion is met, such as reaching
a maximum number of generations or achieving a desired level of fitness. At the end
of the process, the best-performing equation, as determined by the fitness function,
is selected as the final result.
Symbolic regression has been successfully applied in various domains, including
physics, engineering, finance, and biology. In the materials domain, it is sparingly used.
However, it could be an extremely useful approach for discovering empirical rules
based on the data and for discovering symbolic expressions that represent complex data in an
interpretable fashion. There are several open-source packages that can be used for
symbolic regression; some of these are listed below.
1. PySR [9]: https://github.com/MilesCranmer/PySR.
2. Eureqa [10]: Open-source code not available.
3. GPLearn: https://github.com/trevorstephens/gplearn.
4. AI Feynman: https://github.com/heal-research/operon.
5. Operon: https://github.com/heal-research/operon.
6. DSO: https://github.com/brendenpetersen/deep-symbolic-optimization.
7. PySINDy: https://github.com/dynamicslab/pysindy.
8. EQL: https://github.com/martius-lab/EQL.
9. SR-Transformer: https://github.com/martius-lab/EQL.
10. GP-GOMEA: https://github.com/marcovirgolin/GP-GOMEA.
11. Symbolic Physics Learner [11]: Open-source code not available.
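
As an illustration, the following sketch uses the gplearn package (item 3 above) to rediscover a known functional form from synthetic data; the hyperparameters and the toy target, y = 0.5·m·v², are illustrative assumptions.

import numpy as np
from gplearn.genetic import SymbolicRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0.5, 5.0, size=(500, 2))          # columns: mass m and velocity v
y = 0.5 * X[:, 0] * X[:, 1] ** 2                  # "kinetic energy" data

est = SymbolicRegressor(
    population_size=2000,                         # initial population of candidate equations
    generations=20,                               # number of evolutionary generations
    function_set=("add", "sub", "mul", "div"),    # building blocks of the expressions
    parsimony_coefficient=0.001,                  # penalize overly long expressions
    random_state=0,
)
est.fit(X, y)
print(est._program)   # best evolved expression, ideally equivalent to 0.5*X0*X1*X1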

9.5 Other Interpretability Algorithms

While SHAP and integrated gradients are widely used model agnostic algorithms,
interpretable ML is a fast growing field and there are several model-specific algo-
rithms that are available and much more are being continuously developed. Here, we
briefly outline some of such interpretability algorithms.

• Decision Trees: Decision trees are widely used for their inherent interpretability.
These models consist of a hierarchical structure of decision nodes and leaf nodes,
allowing easy visualization and understanding of the decision-making process.
• Rule-based Models: Rule-based models, such as decision rules and association
rules, provide explicit rules that dictate model predictions. These models are highly
interpretable as they directly map input features to decision rules.
• Partial Dependence Plots: Partial dependence plots reveal the relationship
between a selected feature and the model’s output by systematically varying that

feature while keeping others constant. These plots provide insights into how indi-
vidual features impact predictions. The partial dependence function is defined as

PD(x_j) = (1/N) Σ_{i=1}^{N} f(x_{−j}^{(i)}, x_j)    (9.10)

where x_j represents the selected feature of interest, N is the number of instances,
x_{−j} represents all features except x_j, and f represents the model’s prediction
function. The partial dependence plot, denoted as PD(x_j), quantifies the average
predicted outcome f across all instances when varying the feature x_j while keeping
the other features constant.
• Local Interpretable Model-Agnostic Explanations (LIME): LIME explains
complex models locally by generating interpretable explanations for specific
instances. It approximates the behavior of the black-box model around a par-
ticular data point using a locally interpretable model. The explanation is obtained
by solving the following optimization problem:

arg min_{g∈G} L(f, g, π_x) + Ω(g)    (9.11)

where f is the black-box model, g is the locally interpretable model, π_x is the
proximity measure between the instance of interest and perturbed instances, and
Ω is the complexity penalty.
• Generalized Additive Models (GAM): GAMs extend traditional linear models
by incorporating nonlinearities using smooth functions. In GAM, each feature .xi
is associated with a smooth function . f i that captures its non-linear relationship
with the response variable as follows.


y = β_0 + Σ_{i=1}^{N} f_i(x_i)    (9.12)

where . y represents the predicted outcome or response variable, .β0 is the intercept
term, . N is the number of features, and . f i (xi ) represents the smooth non-linear
functions applied to each feature .xi . These smooth functions can take various
forms, such as splines or kernel functions, and are often estimated using tech-
niques like penalized regression. The final prediction is obtained by summing the
contributions from each feature’s smooth function, along with the intercept term.
Each smooth function . f i can have its own set of parameters, allowing flexibility in
modeling the relationship between each feature and the response. The objective of
GAM is to estimate the smooth functions. f i that minimize the discrepancy between
the observed responses and the predictions. This is typically achieved through opti-
mization methods, such as maximum likelihood estimation or penalized regression
techniques. They allow for a flexible representation of the relationship between
features and the target variable, enabling interpretability. The smooth functions

allow for capturing non-linear patterns in the data, making GAM a powerful tool
for understanding the dependencies between features and the response variable.
• Contrastive Explanations Method (CEM): CEM generates contrastive explana-
tions by identifying the minimal changes required in the input features to change
the model’s prediction. In other words, CEM provides explanations for individ-
ual predictions by contrasting them with alternative outcomes. The mathematical
equation for CEM is as follows:

CEM(x, y, f, g) = arg min_δ L(x, y, f, g, δ)    (9.13)

where .x represents the instance to be explained, . y is the true label or target value
associated with .x, . f is the black-box model being explained, .g is the interpretable
model used for generating contrastive explanations, and .δ is the contrastive per-
turbation applied to .x to create an alternative instance. The goal of CEM is to
find the optimal perturbation .δ that minimizes the loss function . L and produces
a contrastive explanation for the prediction made by the black-box model . f . By
exploring perturbed instances and comparing their predictions with the original
instance, CEM provides insights into the key features and factors influencing
the decision made by the model. CEM is a powerful technique for generating
contrastive explanations in various domains, such as image classification, natural
language processing, and recommender systems. This approach helps understand
model behavior by highlighting critical features.
• Feature Importance Techniques: Various feature importance techniques, such as
permutation importance, mean decrease impurity, and coefficient weights, quantify
the importance of each feature in the model’s decision process.

Note that the detailed mathematical explanations of each of these algorithms are
beyond the scope of the book. The aim of the discussion is to give a broad overview
of the several available techniques for model interpretability.
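
As a concrete example of the feature importance techniques mentioned in the last bullet, the following sketch computes permutation importance with scikit-learn; the synthetic dataset and the gradient-boosting model are illustrative assumptions.

from sklearn.ensemble import GradientBoostingRegressor
from sklearn.inspection import permutation_importance
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=300, n_features=6, n_informative=3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = GradientBoostingRegressor(random_state=0).fit(X_train, y_train)

# Shuffle each feature on held-out data and record the drop in the model score;
# a large drop means the model relies heavily on that feature
result = permutation_importance(model, X_test, y_test, n_repeats=20, random_state=0)
for i, (mean, std) in enumerate(zip(result.importances_mean, result.importances_std)):
    print(f"feature {i}: {mean:.3f} +/- {std:.3f}")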

9.6 Conclusion

In conclusion, interpretable machine learning has emerged as a critical field in bridging
the gap between complex models and human understanding. By providing expla-
nations and insights into the decision-making process of machine learning models,
interpretable techniques offer valuable benefits in various domains. The ability to
interpret and explain models not only promotes transparency but also builds trust
and confidence among users and stakeholders. Interpretability allows us to validate
and verify the model’s behavior, detect biases or errors, and address ethical con-
cerns. Furthermore, it enables domain experts to collaborate effectively with machine
learning practitioners, leading to better model development and deployment. This is
especially important in applied domains such as materials, physics, or engineering,
where the governing physics is partially known or understood.

Looking ahead, interpretable machine learning is expected to continue advanc-


ing. Researchers are exploring new approaches and methods to further enhance
interpretability, including feature importance attribution, rule extraction, and model
simplification. Efforts are also being made to develop standardized evaluation met-
rics and benchmarks to assess the quality of interpretability techniques. Moreover,
interpretable machine learning holds immense potential in domains where decision-
making has high stakes, such as healthcare, finance, and autonomous systems. By
providing interpretable and actionable insights, these techniques can aid in critical
decision support, risk assessment, and safety assurance. As the field progresses, it
is crucial to strike a balance between model complexity and interpretability. The
challenge lies in developing models that are both highly accurate and understand-
able. Researchers and practitioners need to navigate the trade-offs between perfor-
mance and interpretability, considering the specific requirements and constraints of
each application. Considering the rapid pace of advancements in this field, several
ground-breaking advances addressing these challenges can be expected soon.

References

1. C. Molnar, Interpretable Machine Learning (2020). Lulu.com


2. F. Doshi-Velez, B. Kim, Towards a rigorous science of interpretable machine learning (2017).
arXiv:1702.08608
3. M.T. Ribeiro, S. Singh, C. Guestrin, “Why should i trust you?” Explaining the predictions of any
classifier, in Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge
Discovery and Data Mining, pp. 1135–1144 (2016)
4. A. Shrikumar, P. Greenside, A. Kundaje, Learning important features through propagating
activation differences, in International Conference on Machine Learning PMLR, vol. 70, pp.
3145–3153 (2017)
5. S. Bach, A. Binder, G. Montavon, F. Klauschen, K.-R. Müller, W. Samek, On pixel-wise
explanations for non-linear classifier decisions by layer-wise relevance propagation. PloS One
10(7), e0130140 (2015)
6. S. Lipovetsky, M. Conklin, Analysis of regression in game theory approach. Appl. Stoch.
Models Bus. Ind. 17(4), 319–330 (2001)
7. E. Štrumbelj, I. Kononenko, Explaining prediction models and individual predictions with
feature contributions. Knowl. Inf. Syst. 41(3), 647–665 (2014)
8. S.M. Lundberg, S.-I. Lee, A unified approach to interpreting model predictions, in Proceedings
of the 31st International Conference on Neural Information Processing Systems, pp. 4768–4777
(2017)
9. M. Cranmer, Interpretable machine learning for science with pysr and symbolicregression.jl
(2023). arXiv: 2305.01582 [astro-ph.IM]
10. M. Schmidt, H. Lipson, Distilling free-form natural laws from experimental data. Science
324(5923), 81–85 (2009)
11. F. Sun, Y. Liu, J.-X. Wang, H. Sun, Symbolic physics learner: discovering governing equations
via monte carlo tree search, in The Eleventh International Conference on Learning Represen-
tations (2022)
Part III
Machine Learning for Materials Modeling
and Discovery

In this part, applications of various ML algorithms for materials modeling and


discovery are discussed in detail. Chapter 10 focuses on the most used applica-
tion of ML, that is, property prediction. Chapter 11 focuses on several approaches for
materials discovery including ML-based surrogate models for optimization, material
selection charts, generative models, and reinforcement learning. Chapter 12 discusses
interpretable algorithms to decode the learned input–output relationships by the ML
models. Chapter 13 discussed how ML can be used to accelerate simulations at the
atomic and continuum scales. Chapter 14 discusses image-based algorithms such as
CNNs and neural operators for predicting the properties based on images. Finally,
Chap. 15 discussed the use of natural language processing for information extraction
from materials literature. Altogether, this part covers a broad range of applications
of ML for accelerating the discovery of materials.
Chapter 10
Property Prediction

Abstract Machine learning has revolutionized the field of materials science by


enabling accurate and efficient prediction of material properties. This chapter presents
a comprehensive overview of the key steps involved in developing machine learning
models for structured data in materials science, along with an introduction to physics-
informed machine learning. This chapter highlights the challenges associated with
predicting material properties and emphasizes the importance of robust computa-
tional methods. It discusses the crucial stages of constructing machine learning
models, including data preprocessing, feature engineering and selection, and model
training and evaluation. Furthermore, the chapter introduces how domain-specific
knowledge and fundamental physical principles can be infused with machine learn-
ing models for property prediction, an approach known as physics-informed machine
learning. This integration enhances prediction accuracy and ensures adherence to
underlying material behavior laws and principles. Altogether, this chapter describes
one of the most commonly used applications of ML, that is, to predict the proper-
ties of materials as a function of composition based on structured data. Additional
approaches towards property prediction based on microstructure images, and other
semi-structured or unstructured data are discussed in detail in the following chapters.

10.1 Introduction

In the field of materials science, the ability to accurately predict and understand
the properties of new materials is of paramount importance. Traditional approaches
to property prediction often rely on time-consuming and costly experimental tech-
niques. Further, the properties of a material is a complex function of composition,
structure, processing conditions, and testing conditions. However, with the advent of
machine learning, there has been a paradigm shift towards data-driven methods that
leverage the power of computational models and algorithms to expedite the discovery
and characterization of novel materials.
This chapter explores the application of machine learning techniques for property
prediction of materials. By leveraging large datasets, advanced algorithms, and com-
putational models, machine learning offers the potential to revolutionize the way we


Fig. 10.1 Basic steps involved in the development of an ML model for predicting the property of
a material

understand and design materials with desired properties. The objective is to develop
predictive models that can accurately estimate various material properties, such as
mechanical strength, electrical conductivity, thermal conductivity, optical proper-
ties, and more. Thus, property prediction is one of the most common applications for
which ML is being widely used. The major steps involved in the property prediction
problem are (see Fig. 10.1):
1. Dataset preparation including data collection, and processing
2. Feature engineering,
3. Model development,
4. Hyperparametric optimization,
5. Model testing and deployment.
Note that depending on the nature of the problem, there could be additional steps
involved. However, these steps explain the bare minimum that need to be carried
out for a reasonable model development for property prediction. Further, the model
can also be interpreted using interpretable algorithms such as SHAP. This allows
the domain experts to critically analyze the features learned by the model and then
evaluate whether they are sensible from a physical perspective. The interpretation of
the ML models are explained in detail in the later chapters. These steps mentioned
above are explained in detail below.

10.2 Dataset Preparation

The performance of an ML model is highly contingent upon the availability of the large-
scale data that is used to prepare the model. As such, dataset preparation forms a
crucial stage of model development in ML. Datasets can be generated using exper-
iments, or using computer simulations (synthetic datasets). As generating one’s own
dataset is quite expensive, most works rely on available public domain datasets. Some
of the publicly available databases are listed below. Many of these databases
have dedicated application program interfaces (APIs) that allow easy extraction of
data directly from the database. An alternate approach is to extract datasets from
the published literature. This is extremely tedious and human-intensive as a domain
expert needs to go through the published literature one-by-one and extract the data
manually to make a structured, table-like database. Thankfully, ML comes to our res-
cue in this regard as well. Many journals allow text-mining and data-mining through
their APIs. Several studies have employed automated data extraction from these
journals using semantic parsing and specific keyword search to obtain very specific
datasets on composition–property relationships. Some examples to this extent include
the dissolution data of glasses mined from the literature using a table extraction
protocol.
Another critical aspect of the dataset preparation is the inclusion of all possible
input and output parameters as much as possible. It should be noted that the prepared
dataset should be as exhaustive as possible in terms of the input parameters. While
irrelevant input parameters can be removed later, it might be extremely challenging
to include a new input parameter at a later stage. As such, at least a tentative idea of
the nature of the model’s input parameters, whether it will be material composition,
engineered features, or physics-based descriptors, should be decided a priori. In
addition, any other details that are relevant for the property prediction, including the
material synthesis parameters, processing conditions, and testing conditions, may
also be included in the initial dataset. Overall, the development of a successful ML
model will depend totally on the availability of an exhaustive and extensive database
consisting of as much information as possible in a structured fashion.
List of publicly available databases
1. CSD: Cambridge Structural Database
2. CALPHAD
3. Granta Design
4. Pauling File
5. ICSD: Inorganic Crystal Structure Database
6. ESP: Electronic Structure Project
7. AFLOW: Automatic-Flow for Materials Discovery
8. MatNavi
9. AIST: National Institute of Advanced Industrial Science and Technology
Databases
10. COD: Crystallography Open Database
11. MatDL: Materials Digital Library
12. The Materials Project
13. CMR: Computational Materials Repository
14. Springer Material
15. OpenKIM
16. NREL CID: NREL Center for Inverse Design
17. MGI: Materials Genome Initiative
18. MatWeb
19. MATDAT
20. CEPDB: The Clean Energy Project Database
21. CMD: Computational Materials Network
22. Catalysis Hub
23. OQMD: Open Quantum Materials Database
24. Open Material Databases
25. NREL MatDB
26. Citrine Informatics
27. Exabyte.io
28. NOMAD: Novel Materials Discovery Laboratory
29. Marvel
30. Thermoelectrics Design Lab
31. MaX: Materials Design at the Exascale
32. CritCat
33. Khazana
34. Material Data Facility
35. MICCOM: Midwest Integrated Center for Computational Materials
36. MPDS: Materials Platform for Data Science
37. CMI2: Center for Materials Research by Information Integration
38. HTEM: High Throughput Experimental Materials Database
39. JARVIS: Joint Automated Repository for Various Integrated Simulations
40. OMDB: Organic Materials Database
41. aNANt
42. Atom Work Adv
43. FAIR Data Infrastructure
44. Materiae
45. Materials Zone
46. MolDis
47. QCArchive: The Quantum Chemistry Archive.
Once the raw dataset is prepared, the dataset needs to be processed before being
used for model development. The basic steps of dataset preparation include the
removal of duplicate, incomplete, and inconsistent entries. While duplicate and
incomplete entries may be self-evident, inconsistent entries refer to values that are not
consistent with our understanding of the dataset. For instance, a value of –10 K for
temperature, or a total mole percent of 110% for an alloy composition, should be
discarded as an inconsistent entry. Another important aspect of dataset preparation is
to identify the outliers in the dataset. Identifying outliers is extremely challenging, especially for a highly non-linear dataset, as the data itself might exhibit extreme and
sudden variations. Thus, a data point identified as an outlier might actually be a real
data point. There is no one-size-fits-all solution to outlier detection. There are several
algorithms for outlier detection, as explained in Part II. Further, there are several
open-source packages such as PyOD and XGBOD that enable outlier detection. Different
approaches should be attempted, and the efficiency of these approaches should be
evaluated by human experts before finalizing the best one. The performance of an
outlier detection algorithm should be evaluated in terms of both accuracy and precision;
that is, the algorithm should identify as many true outliers as possible while flagging
as few genuine data points as outliers as possible.
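A minimal sketch of such an outlier-detection step is given below, using scikit-learn's IsolationForest (PyOD exposes a similar fit/predict interface). The contamination value is an illustrative assumption and the flagged points should still be reviewed by a domain expert.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# X: array of shape (n_samples, n_features) holding the raw dataset
detector = IsolationForest(contamination=0.02, random_state=0)  # ~2% assumed outliers
labels = detector.fit_predict(X)        # +1 for inliers, -1 for flagged outliers

X_clean = X[labels == 1]
outlier_candidates = X[labels == -1]    # to be inspected by a domain expert
```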
Missing data should also be handled with care. This is because any
imputation technique used to fill the missing data may add further noise to the dataset
instead of adding information. However, careful analysis of the nature of the data and the
missing points may provide insights on whether imputation can be carried out for
a specific problem of interest. In general, for composition–property datasets
of materials, it is often better to ignore missing and incomplete data than
to apply imputation techniques, unless imputation is strictly required. When it is required, the
generated data should be analyzed carefully by a domain expert to ensure that the
synthetic data is indeed representative of the true data from the domain. A common
method applied to this end is to use simulated data rather than purely synthetic data
based on data-driven methods. Simulation-based data generation is widely adopted
employing physics-based methods such as the finite element method.
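A minimal sketch of the basic cleaning steps discussed above (duplicates, inconsistent entries, and missing values), assuming a pandas DataFrame with hypothetical column names, is shown below.

```python
import pandas as pd

df = pd.read_csv("dataset.csv")            # hypothetical structured dataset

# Remove exact duplicates
df = df.drop_duplicates()

# Discard physically inconsistent entries (column names are illustrative)
df = df[df["temperature_K"] >= 0]
mole_cols = [c for c in df.columns if c.endswith("_mol_percent")]
df = df[df[mole_cols].sum(axis=1).between(99.0, 101.0)]

# Ignore incomplete rows rather than imputing, as recommended in the text
df = df.dropna()
```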

10.3 Feature Engineering

Feature engineering is a crucial aspect of model building, wherein we modify or augment the input features using domain knowledge. The aim of feature engineering
is to enable the ML model to learn the input–output function in a more efficient and
accurate fashion. The features used to predict properties can be broadly classified
into four categories (Fig. 10.2).

1. Composition-dependent features: In this case, the chemical composition of the material is given as the input along with additional information such as process and testing parameters. For example, the input features for a glass composition of 0.2(Na$_2$O)·0.8(SiO$_2$) will be: (i) 0.2 and 0.8 if the input features are the mole fractions of Na$_2$O and SiO$_2$, or (ii) 0.4/3, 0.8/3, and 1.8/3 if the input features are the mole fractions of the atoms (Na, Si, and O) in the material (see the sketch after this list). Similar approaches could be taken for an alloy or a ceramic. Note that microstructural features or processing conditions such as the heating rate, cooling rate, annealing temperature, or annealing time can also be given as input features. In such cases, the dataset should consistently have the same features for all the entries.

Fig. 10.2 Features for materials modeling. Reprinted with permission from [1]

2. Physics-driven features: Here, the features are obtained either based on the physics of the system or directly from the periodic table. These features can range from periodic-table-based descriptors such as atomic number, atomic mass, ionic radius, and electronic orbitals, to microstructural features such as the grain size and grain orientation for alloys, or the chain length and correlation length for polymers. If the property is at higher length scales, such as the mesoscale or microscale, then the relevant parameters should be selected. Further, these features are highly dependent on the materials of interest as well. The selection of these features is a strong function of the length and time scales associated with the property or phenomenon of interest. For example, in the case of composite materials, relevant features include the weight or volume percent of the matrix and inclusions, thickness- and geometry-dependent features, and the orientation of the inclusions if the inclusions are fibres or sheets. Note that the final selection of the features will be contingent upon the model performance.
3. Topology-driven features: In some cases, the specific structure of an atomic or mesoscale system may play a crucial role in controlling the properties. This is often the case for crystalline systems, organic molecules, polymers, perovskites, or metal-organic frameworks. In such cases, it might be useful to consider the topology-dependent features of the atoms. Note that the topological features could be represented using simple coordination numbers, bond angles,
or more sophisticated input features such as two-point correlation functions or orientation order parameters. In addition, the graphical structure, with the nodes
representing the atoms and edges representing the bonds could also be used to
generate complex topology-driven input features. This approach has potential for
polymeric systems, organic molecules, and metal-organic frameworks.
4. Miscellaneous features: In addition to the features mentioned above, some works attempt to use other features. For instance, these include derived features such as the parameters of interatomic potentials, or experimental features such as the melting point or boiling point of the individual elements present in the material. Such features are engineered based on domain knowledge. As such, the success of these features relies on their relevance in the context of the predicted property.
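The following is a minimal sketch of the composition-dependent option, converting an oxide composition (mole fractions) into per-atom mole fractions and reproducing the 0.2 Na2O–0.8 SiO2 example above. The stoichiometry dictionary is assumed known for the glass system of interest.

```python
from collections import defaultdict

# Stoichiometry of each oxide (assumed known for the system of interest)
OXIDE_STOICHIOMETRY = {
    "Na2O": {"Na": 2, "O": 1},
    "SiO2": {"Si": 1, "O": 2},
}

def atomic_fractions(oxide_moles):
    """Convert oxide mole fractions into atomic mole fractions."""
    atom_counts = defaultdict(float)
    for oxide, x in oxide_moles.items():
        for element, n in OXIDE_STOICHIOMETRY[oxide].items():
            atom_counts[element] += x * n
    total = sum(atom_counts.values())
    return {el: n / total for el, n in atom_counts.items()}

print(atomic_fractions({"Na2O": 0.2, "SiO2": 0.8}))
# {'Na': 0.133..., 'O': 0.6, 'Si': 0.266...}, i.e., 0.4/3, 1.8/3, and 0.8/3
```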

Once the features are identified, they should be checked for correlation. Note that
correlation does not imply causation. However, in a representative dataset, correlation
analysis can be used to identify how one input variable changes with respect
to another. If two input variables are highly correlated, a change in one is accompanied
by a predictable change in the other (increasing together for a positive correlation, and
moving in opposite directions for a negative one). Such input variables may essentially
contain similar information, and having both of them might be redundant. Thus,
correlation analysis can be used to trim down the feature space by removing highly
correlated input features. It should be noted that in ML models, unlike classical
regression models such as linear regression, the input variables need not be independent;
ML models can handle even a highly dependent input variable space. Nevertheless,
reduction of the feature space may allow the development of a simpler ML model with a
higher degree of interpretability.
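A minimal sketch of such a correlation check is shown below, assuming a pandas DataFrame `df` of input features; one feature from every highly correlated pair is dropped. The threshold of 0.95 is illustrative.

```python
import numpy as np
import pandas as pd

def drop_correlated_features(df: pd.DataFrame, threshold: float = 0.95) -> pd.DataFrame:
    corr = df.corr().abs()
    # Keep only the upper triangle so each pair is checked once
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return df.drop(columns=to_drop)
```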
In addition, dimensionality reduction techniques, such as principal component
analysis, may be employed to project the input features onto a lower-dimensional
domain. In this approach, since we project the input features onto a lower-dimensional
domain, the new features obtained after the dimensionality reduction may not be easily
interpretable. The new features may be a weighted combination of multiple input
features and hence may not be amenable to a physical explanation. However, it has
been observed that dimensionality reduction can occasionally improve the performance
of the ML models.
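A minimal sketch of principal-component-based reduction with scikit-learn is shown below, assuming `X` is an array of input features; the 95% retained-variance target is an illustrative choice.

```python
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X_scaled = StandardScaler().fit_transform(X)   # PCA is sensitive to feature scales
pca = PCA(n_components=0.95)                   # keep components explaining ~95% variance
X_reduced = pca.fit_transform(X_scaled)
print(pca.explained_variance_ratio_)
```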
Most of the approaches mentioned above involve feature engineering based on
physical intuition or the understanding of the characteristics of the system. Thus,
feature engineering is highly subjective and relies heavily on the domain knowledge
of the user. The rise of deep learning comes with the promise that feature engineer-
ing may no longer be required. In deep learning, the raw dataset can be provided to
the neural network, which in turn identifies the relevant features during the learning
process. For instance, using a pixel image of crystals, a convolutional neural network
can automatically identify the features that enable the identification of the crystal struc-
ture. Similarly, a graph neural network can use the topology of the graph structure to
identify the node and edge embeddings that maximize the performance of the model.
In such cases, the burden of feature engineering is significantly reduced. However, it
might still be worth trying to see if feature engineering can enhance the performance
of the ML models in specific cases of deep learning. Overall, feature engineering enables the identification of appropriate features that maximize the performance of
the ML models.

10.4 Model Development

Once the processed dataset with the input features and output labels or values is
ready, we use this dataset to develop the ML models. Before the model training,
the dataset should be split into training, validation, and test sets. While the exact split
can be based on the dataset size, common practices involve ratios such as 60:20:20,
70:15:15, or 80:10:10 for the training, validation, and test sets, respectively. The
validation set is used to develop the optimal model, that is, one without underfitting or
overfitting, and hence is actually part of the training data. The test set, also known as
the hold-out set, is kept unseen by the model and is only used at the end to evaluate the
model performance. Typically, the train:validation:test split is performed randomly.
However, this approach assumes that the dataset is a balanced one, with the
training and test data following a similar distribution for both input features and
output labels or values. Where this assumption does not hold, additional care should
be taken to ensure that the training set is representative of the entire domain
of the input features. For instance, if one of the input features is present in only
100 out of 1000 data points, care should be taken to ensure that data points with the
given feature are present in the training data. A randomized 80:20 split of training and
test data does not automatically ensure this. Further, if the feature is not present in
the training set, then we cannot expect the model to learn the weights associated with
it, and the feature, along with the corresponding data entries, becomes redundant. At
the same time, while ensuring that the training data is indeed representative of the
dataset, care should be taken to avoid data leakage. Data leakage refers to the use
of information outside the training set (that is, from the test set) for developing the
final model. As such, it is advisable to create the training and test dataset initially
and keep it constant throughout.
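A minimal sketch of a 70:15:15 split is shown below, assuming arrays `X` and `y`; fixing the random seed keeps the hold-out test set constant throughout the study, which helps avoid data leakage.

```python
from sklearn.model_selection import train_test_split

# First split off 30%, then split that half-and-half into validation and test
X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.30, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(X_tmp, y_tmp, test_size=0.50, random_state=42)
```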
The model is trained on the training set, and the performance is evaluated on the
validation set to check for overfitting or underfitting. Thus, the optimal model is identified
with the help of the training and validation sets; the performance on both sets should
be comparable. Further details are discussed in the next section on hyperparametric
optimization. The choice of the ML model is completely up to the user, although
some thumb rules can be used to identify an appropriate one. The first among those
is the Occam's razor principle, that is, to use the simplest model among the possible
ones. For instance, if linear regression can do the job, then there is no need to use
neural networks. Figure 10.3 shows a flowchart that can be used as a guide to select
models based on the dataset size and requirements.
While developing an ML model, it is always prudent to start with simple linear
regression or logistic regression for a regression or classification problem, respec-
tively. If the model is not accurate enough, higher-order polynomials can be evaluated.

Fig. 10.3 ML model selection guideline based on the size of the dataset

Note that using higher-order polynomials always runs the risk of overfitting the
model, which can be handled using proper hyperparametric optimization. In some
cases, the model may improve using regularized linear regressions such as lasso,
ridge, or elastic net. If these classical algorithms are not accurate enough (a quality
that has to judged by the user), other models may be used. At this point, the size
of the dataset plays a major role. If the size of the dataset is small (another quality
judged by the user), for instance, less than thousand, then algorithms such as sup-
port vector machine, classification and regression trees (CART), random forest, or
gradient boosted decision trees may be preferable. Note that the CART approaches
with modifications in the regularizer, loss function, or the way new branch is created
have resulted in several algorithms with minor differences. Some other approaches
have tried to combine multiple CART approaches (and sometimes, other algorithms
as well) as an ensemble model as well. Out of these, one algorithm that requires a
special mention is the extreme gradient boosted decision trees (XGBoost) that seems
to work particularly well for both classification and regression tasks. XGBoost has
several features incorporated including multiple in-built regularizers that prevent
overfitting. An interesting aspect of XGBoost is its ability to interpret the relative
importance of the input features based on the number of times they have been used
to create a branch or a leaf. This aspect makes XGBoost an interpretable machine
learning model in comparison to other algorithms that are more black-box in nature.
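A minimal sketch of an XGBoost regressor, including the built-in regularization and the feature importances mentioned above, is shown below; the hyperparameter values and the data arrays (from the earlier split) are illustrative assumptions.

```python
from xgboost import XGBRegressor

model = XGBRegressor(
    n_estimators=500,
    max_depth=4,
    learning_rate=0.05,
    reg_lambda=1.0,        # L2 regularization on leaf weights
)
model.fit(X_train, y_train, eval_set=[(X_val, y_val)], verbose=False)

# Relative importance of each input feature
print(model.feature_importances_)
```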
If the dataset is large (for instance, greater than 1000), then neural network
approaches may work best. Depending on the size of the dataset and the complexity
of the input-output relationship, it could be a simple multilayer perceptron (MLP)
with a single hidden layer or a more complex deep network with multiple hidden lay-
ers, each having several hidden layer units. All the approaches mentioned thus far are
deterministic in nature–for a given set of input features, there is only one output that
can be obtained. In case probabilistic modeling is of interest, then Gaussian process
regression (GPR) should be used. GPR allows one to obtain the best estimate for a
given set of input features along with the standard deviation in the prediction. How-
ever, due to the matrix inversion involved in GPR, the training procedure becomes
extremely expensive, even prohibitive, in cases with large number of datapoints. In
such cases, scalable Gaussian processes may be used, for example, kernel interpola-
tion for scalable structured GP (KISS-GP). Altogether, model development should
rely on choosing the simplest model possible after trying all the models.
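A minimal sketch of Gaussian process regression with scikit-learn, returning both a mean prediction and its standard deviation, is shown below; the RBF kernel choice and data arrays are illustrative assumptions.

```python
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, ConstantKernel

kernel = ConstantKernel(1.0) * RBF(length_scale=1.0)
gpr = GaussianProcessRegressor(kernel=kernel, alpha=1e-6, normalize_y=True)
gpr.fit(X_train, y_train)

# Best estimate together with the standard deviation of the prediction
y_mean, y_std = gpr.predict(X_test, return_std=True)
```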

10.5 Hyperparametric Optimization

Hyperparametric optimization is one of the most crucial steps involved in the training
of a machine learning model. The fundamental difference between the parameters of
a model and the hyperparameters is that the former are updated during the training
process, while the latter are set prior to training and kept constant while the
training is in progress. As such, the performance of an ML model is highly dependent
on the hyperparameters chosen.
There are several aspects involved in hyperparametric optimization. The first step
is the identification of the hyperparameters associated with the selected model. The
usual hyperparameters associated with ML models are the number of training epochs, the loss
function, the learning rate, the regularizer and its associated weight, if any, and the batch size.
Depending on the ML model used, there could be additional hyperparameters such as
the number of hidden layers and hidden layer units for an MLP, the number of branches
and the tree depth for CART-based approaches, or the kernel functions associated with
support vector and Gaussian process regressions. Similarly, in the case of MLPs,
dropout can be used to ensure that the weights of the hidden layer units are optimal. In
dropout, the dropout fraction is another hyperparameter, which
typically varies from 0.1 to 0.3 (as a thumb rule). Due to the large number of options available,
several algorithms mentioned earlier (see Chap. 7), including grid search,
random search, and Bayesian approaches, can be employed for identifying the optimal
hyperparameters.
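A minimal sketch of a random search over a few XGBoost hyperparameters with cross validation is shown below; the parameter grid and the data arrays are illustrative assumptions.

```python
from sklearn.model_selection import RandomizedSearchCV
from xgboost import XGBRegressor

param_distributions = {
    "n_estimators": [100, 300, 500],
    "max_depth": [2, 4, 6],
    "learning_rate": [0.01, 0.05, 0.1],
}
search = RandomizedSearchCV(
    XGBRegressor(),
    param_distributions=param_distributions,
    n_iter=10,
    cv=5,
    scoring="neg_root_mean_squared_error",
    random_state=0,
)
search.fit(X_train, y_train)
print(search.best_params_, search.best_score_)
```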
Another approach typically employed during hyperparametric optimization
is k-fold cross validation (discussed in Chap. 7). In this approach, the training
dataset (including the validation set) is divided into k folds, where k can be any
number ranging from 5 to 50 (as a thumb rule). Among the k folds, n folds may be
chosen for training and the remaining k−n folds for validation. The process is repeated to cover all
possible combinations of choosing n and k−n folds from the k folds. For example, in 10-fold
cross validation, one fold can be taken as the validation set and nine folds as the training
set. This leaves one with ten possible options for the validation set. The training
should be conducted on all ten options, and the validation scores should
be comparable across all ten sets. This ensures that the developed model is optimal
and has reasonable generalization capability.
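A minimal sketch of 10-fold cross validation is shown below, assuming the `model` and training arrays introduced earlier; the ten scores should be comparable if the model generalizes well.

```python
from sklearn.model_selection import cross_val_score

scores = cross_val_score(model, X_train, y_train, cv=10,
                         scoring="neg_root_mean_squared_error")
print(scores.mean(), scores.std())
```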
Fig. 10.4 Performance plot with different error measures

Finally, the model is evaluated by comparing its performance on the test set. The values predicted by the model are compared with the actual values of the test set, which is unseen by the model. There are several measures that are commonly used for evaluation on the test set, such as the root mean squared error (RMSE) or the MSE, the mean absolute error, the mean absolute percentage error, and the coefficient of determination, also known as the $R^2$ value (Fig. 10.4).
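A minimal sketch computing these test-set error measures is shown below, assuming `y_test` and model predictions `y_pred`.

```python
import numpy as np
from sklearn.metrics import (mean_absolute_error,
                             mean_absolute_percentage_error,
                             mean_squared_error, r2_score)

rmse = np.sqrt(mean_squared_error(y_test, y_pred))
mae = mean_absolute_error(y_test, y_pred)
mape = mean_absolute_percentage_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print(f"RMSE={rmse:.3f}, MAE={mae:.3f}, MAPE={mape:.3f}, R2={r2:.3f}")
```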

10.6 Physics-Informed ML for Property Prediction

The approaches mentioned thus far simply use a labeled dataset (that is, with input
features and output properties) and train the ML model on it. They do not take into
account any physical constraints, either derived from domain knowledge or from first
principles, associated with the property of interest. Indeed, feature engineering
may be performed to include as many relevant features as possible based on the
physics or chemistry of the problem. However, no bias or constraint is provided
in the training process itself to respect known physical laws. Including
such constraints may sometimes significantly improve the predictive capability of
the model. Here, we focus on how such constraints can be incorporated in ML
models for property prediction. This approach is termed physics-informed or
symbolic-reasoning-informed ML.
In traditional ML models, the training proceeds as follows. First, a
model is initialized with random weights. Then, the output values corresponding to the input
features in the training dataset are computed with this randomly initialized model.
The error between the predicted and actual outputs is computed using an error metric
such as the MSE or a similar loss function. Using this loss function as the objective,
optimization is carried out to update the weights so as to minimize this error. Thus, the
weights learned in this process are those that minimize the data
loss (that is, the error between the predicted and actual property). If additional information in
terms of physics-based or domain-based knowledge is available regarding the problem,
this information can be included as an inductive bias to ensure that the model obeys
these laws. To this end, the loss function is modified to have a data loss and a physics
loss. Then, the model is trained by minimizing both losses together. Thus, the
learned weights now minimize both the error on the data and the error on the physical
or symbolic law to be followed by the property. This approach only weakly enforces
the physics, as it enters merely as an additional term in the loss function. An alternate approach is to
strongly enforce the physics by using it directly, concurrently with the MLP, to make
the prediction. In this approach, the input features are used to predict the parameters
of the physical equation of interest, which are then employed in the equation to predict
the final property. To distinguish this approach from the physics-informed approach
(where an additional physics loss is included), we refer to it as physics-enforced
ML. To demonstrate the application of the physics-enforced approach, we discuss
two examples below, namely, the prediction of viscosity and hardness in glasses.
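A minimal PyTorch sketch of the weakly enforced (physics-informed) loss described above is given below: the total loss is the data loss plus a weighted physics loss. The `physics_residual` function, the network size, and the weight `lambda_phys` are placeholders for whatever constraint and architecture apply to a given problem.

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(10, 64), nn.ReLU(), nn.Linear(64, 1))  # illustrative MLP
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
mse = nn.MSELoss()
lambda_phys = 0.1  # relative weight of the physics loss (a hyperparameter)

def physics_residual(x, y_pred):
    # Placeholder: return a tensor that is zero when the prediction obeys the
    # known physical law (e.g., a governing equation or monotonicity constraint)
    return torch.zeros_like(y_pred)

def train_step(x, y):
    # x: (batch, 10), y: (batch, 1)
    y_pred = model(x)
    data_loss = mse(y_pred, y)
    phys_loss = (physics_residual(x, y_pred) ** 2).mean()
    loss = data_loss + lambda_phys * phys_loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```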
ViscNet: Predicting Temperature Dependent Viscosity in Glasses
Viscosity is a fundamental property that holds significant importance in the study of
materials with respect to their crystallization and glass-forming behavior. In the realm
of oxide glass-forming liquids, the temperature-dependent behavior of viscosity is
crucial for controlling various aspects of glass manufacturing processes, including
conformation and annealing. It also serves as an essential factor in analyzing kinetic
processes such as crystal nucleation and growth. Moreover, several studies have
proposed the use of viscosity at the liquidus temperature as an indicator of glass-
forming ability, further highlighting its significance in the field of materials science.
While several works have aimed to predict the viscosity directly from the glass
composition and temperature, these approaches lack interpretability and generalizability
to unseen temperatures. To address the interpretability challenge, a gray-box
approach based on physics-informed ML has been proposed for viscosity prediction
[2, 3]. This approach combines the power of neural networks with the integration of
a physical model within the machine learning pipeline. By adopting this gray-box
approach, researchers have been able to enhance viscosity prediction by shifting the
focus of the neural network from directly predicting viscosity to estimating the parameters
of a viscosity model, such as the MYEGA viscosity model. The MYEGA equation [4]
for predicting the viscosity $\eta$ of glasses at a temperature $T$ is given by

$$\log_{10}\eta(T, \eta_\infty, T_g, m) = \log_{10}(\eta_\infty) + \left[12 - \log_{10}(\eta_\infty)\right]\frac{T_g}{T}\exp\left[\left(\frac{m}{12 - \log_{10}(\eta_\infty)} - 1\right)\left(\frac{T_g}{T} - 1\right)\right] \qquad (10.1)$$

where $m = \partial\log_{10}(\eta(T))/\partial(T_g/T)$ is the fragility index, $T_g$ is the glass transition temperature (that is, the temperature at which the glass viscosity is $10^{12}$ Pa·s), and $\eta_\infty$ is the asymptotic viscosity.
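A minimal sketch of Eq. (10.1) is given below: given the three MYEGA parameters, it returns the base-10 logarithm of the viscosity at a temperature. In the physics-enforced setting, these parameters would be the outputs of the MLP rather than fixed numbers; the example values are illustrative only.

```python
import numpy as np

def myega_log10_viscosity(T, log10_eta_inf, Tg, m):
    """MYEGA equation for log10(viscosity) at temperature T (same units as Tg)."""
    a = 12.0 - log10_eta_inf
    return log10_eta_inf + a * (Tg / T) * np.exp((m / a - 1.0) * (Tg / T - 1.0))

# Sanity check: at T = Tg the equation reduces to log10(eta) = 12
print(myega_log10_viscosity(T=800.0, log10_eta_inf=-3.0, Tg=800.0, m=35.0))  # -> 12.0
```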
Fig. 10.5 ViscNet architecture. Reprinted with permission from [2]

The ViscNet architecture is shown in Fig. 10.5. Here, the viscosity parameters, namely, $\eta_\infty$, $T_g$, and $m$, are predicted by the MLP, which takes the composition and periodic-table-based descriptors as input. Further, these predicted parameters
are substituted along with the temperature in the MYEGA equation to predict the
viscosity of a glass composition at a given temperature. Then the loss function is
defined as the error between the predicted and actual viscosity in terms of the MSE.
There are several advantages of this gray-box approach over the traditional data-driven approach, as listed below.
1. Interpretability: The parameters governing the viscosity are predicted rather than the viscosity itself. Thus, the model can also infer meaningful quantities such as $T_g$ and $m$ without directly training on them. These parameters predicted by the MLP provide insights into the behavior of the glass composition and can be verified independently of the viscosity.
2. Generalizability: The model strictly follows the MYEGA equation and hence
can be meaningfully extrapolated to unseen temperatures for a given glass com-
position. Further, the model can be used to predict the viscosity of unseen glass
compositions beyond those in the training dataset in a meaningful fashion. The
performance can further be evaluated as additional parameters such as $T_g$ and $m$
are predicted.
It was also demonstrated that the proposed approach can provide improved viscosity
predictions compared with purely data-driven approaches, thanks to the additional
inductive bias provided by the MYEGA equation.
SRIMP: Symbolic-Reasoning Informed Prediction of Hardness
The hardness of glass, a crucial property, is measured using instrumented indentation
experiments. However, it is important to note that the obtained hardness values are
not solely determined by the intrinsic properties of the glass. They are also influ-
enced by various factors, including the loading procedure, indenter geometry, and
environmental conditions. One significant phenomenon that affects glass hardness is
the indentation size effect (ISE). The ISE refers to the observed behavior where the
hardness of glass monotonically decreases and saturates as the applied load increases.
This behavior poses challenges in comparing hardness values obtained under differ-
ent loading conditions. The underlying cause of the ISE is the stress concentration
generated by sharp contact loading, leading to localized structural changes in the
glass network and resulting in permanent deformation.
To enable meaningful comparisons of hardness values obtained at varying loads,
it is essential to predict load-independent hardness values. To this end, a recent
work [5] proposed combining Bernhardt's law with machine learning to develop
a symbolic-reasoning-informed machine learning procedure (SRIMP) for predicting
glass hardness. The hardness $H$ obtained from indentation experiments is given by

$$H = \frac{2P\sin(\theta/2)}{L_D^2} \qquad (10.2)$$

where $L_D$ is the diagonal length of the indent after unloading, $\theta$ is the tip angle of the indenter, and $P$ is the applied load. According to Bernhardt's model, the ISE is given by
$$\frac{P}{L_D} = a_1 + a_2 L_D \qquad (10.3)$$

Combining these two equations by eliminating $L_D$ and solving the resulting quadratic equation, we obtain
$$H = \frac{C_1}{2P} + C_2 + \frac{\sqrt{C_1^2 + 4C_1C_2P}}{2P} \qquad (10.4)$$

where $C_1 = 2a_1^2\sin(\theta/2)$ and $C_2 = 2a_2\sin(\theta/2)$. Similarly, combining the earlier two equations and eliminating $P$, we get
$$H = 2a_2\sin(\theta/2) + \frac{2a_1\sin(\theta/2)}{L_D} = H_\infty + \frac{\alpha_{ISE}}{L_D} \qquad (10.5)$$

Careful observation of the parameters reveals that $H_\infty = 2a_2\sin(\theta/2)$ is a load-independent hardness, and $\alpha_{ISE} = 2a_1\sin(\theta/2)$ is the load-dependent term. A similar comparison with (10.4) can be used to obtain the load-dependent and load-independent terms in terms of $C_1$ and $C_2$.
Figure 10.6 shows the framework employed in SRIMP. In this work, similar to
ViscNet, the authors propose to predict the parameters $C_1$ and $C_2$ using an MLP;
the hardness is then predicted by substituting the respective load values and the
parameters in (10.4). The error between the predicted and the actual hardness values
is minimized to train the model. The work shows that the hardness predicted by
SRIMP allows one to interpret the load-dependent and load-independent hardness in glasses.
Further, it captures the trend of the ISE as a function of load. Finally, the approach allows
one to identify the contribution of each of the components in the glass towards the
load-dependent and load-independent terms in the hardness.
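A minimal sketch of Eq. (10.4) is given below: given $C_1$ and $C_2$ (which SRIMP predicts with an MLP from the glass composition), it returns the hardness at an applied load. The example parameter values are illustrative only.

```python
import numpy as np

def bernhardt_hardness(P, C1, C2):
    """Load-dependent hardness from Bernhardt's law, Eq. (10.4)."""
    return C1 / (2.0 * P) + C2 + np.sqrt(C1**2 + 4.0 * C1 * C2 * P) / (2.0 * P)

# As P grows, the hardness approaches the load-independent value H_inf = C2
loads = np.array([0.1, 1.0, 10.0, 100.0])
print(bernhardt_hardness(loads, C1=2.0, C2=6.0))
```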

Fig. 10.6 Predicting Hardness with symbolic reasoning informed machine learning. Reprinted
with permission from [5]

10.7 Summary

This chapter discusses the major approaches employed to predict the properties of
materials based on structured data. Further, it provides a comprehensive overview of
the key steps involved in developing machine learning models for structured data,
along with an introduction to the concept of physics-informed machine learning
for predicting material properties. Some of the key challenges inherent in predicting
material properties and the need for advanced computational methods to address these
challenges were briefly discussed. Further, it delves into the essential stages of con-
structing machine learning models for structured data, including data preprocessing,
feature engineering and selection, and model training and evaluation. Note that high-
quality data and effective feature representation are two important aspects to ensure
accurate and robust predictions. Furthermore, the chapter introduces the concept
of physics-informed machine learning, which integrates domain-specific knowledge
and fundamental physical principles into machine learning models. This integra-
tion not only improves prediction accuracy but also ensures that the models adhere
to the governing laws and principles underlying material behavior. Throughout the
chapter, illustrative examples and case studies demonstrate the practical application
of machine learning in predicting material properties across a wide range of materials.
Some interesting additional readings include Refs. [6–13]. We hope these examples
showcase the superiority of machine learning models over traditional methods and
highlight their potential to expedite material discovery and development.

References

1. B. Sanchez-Lengeling, A. Aspuru-Guzik, Inverse molecular design using machine learning: generative models for matter engineering. Science 361(6400), 360–365 (2018). issn: 0036-8075. https://doi.org/10.1126/science.aat2663
2. D.R. Cassar, ViscNet: neural network for predicting the fragility index and the temperature-dependency of viscosity. Acta Materialia 206, 116602 (2021). issn: 1359-6454. https://doi.org/10.1016/j.actamat.2020.116602
3. A. Tandia, M.C. Onbasli, J.C. Mauro, Machine learning for glass modeling, in Springer Hand-
book of Glass, pp. 1157–1192 (2019)
4. J.C. Mauro, Y. Yue, A.J. Ellison, P.K. Gupta, D.C. Allan, Viscosity of glass-forming liquids, in Proceedings of the National Academy of Sciences, vol. 106, no. 47, pp. 19780–19784 (2009)
5. S. Mannan, M. Zaki, S. Bishnoi, D.R. Cassar, J. Jiusti, J.C.F. Faria, J.F. Christensen, N.N. Gosvami, M.M. Smedskjaer, E.D. Zanotto, et al., Glass hardness: predicting composition and load effects via symbolic reasoning informed machine learning. Acta Materialia 255, 119046 (2023)
6. J. Li, K. Lim, H. Yang, Z. Ren, S. Raghavan, P.-Y. Chen, T. Buonassisi, X. Wang, AI applications through the whole life cycle of material discovery. Matter 3(2), 393–432 (2020). issn: 2590-2385. https://doi.org/10.1016/j.matt.2020.06.011
7. P. Raccuglia, K.C. Elbert, P.D. Adler, C. Falk, M.B. Wenny, A. Mollo, M. Zeller, S.A. Friedler, J.
Schrier, A.J. Norquist, Machine-learning assisted materials discovery using failed experiments.
Nature 533(7601), 73–76 (2016)
8. Y. Liu, T. Zhao, W. Ju, S. Shi, Materials discovery and design using machine learning. J. Materiomics 3(3), 159–177 (2017). issn: 2352-8478. https://doi.org/10.1016/j.jmat.2017.08.002. (High-throughput Experimental and Modeling Research toward Advanced Batteries)
9. X.-D. Xiang, X. Sun, G. Briceno, Y. Lou, K.-A. Wang, H. Chang, W.G. Wallace-Freedman,
S.-W. Chen, P.G. Schultz, A combinatorial approach to materials discovery. Science 268(5218),
1738–1740 (1995)
10. S. Bishnoi, S. Singh, R. Ravinder, M. Bauchy, N.N. Gosvami, H. Kodamana, N.A. Krishnan, Predicting Young's modulus of oxide glasses with sparse datasets using machine learning. J. Non-Cryst. Solids 524, 119643 (2019). issn: 00223093. https://doi.org/10.1016/j.jnoncrysol.2019.119643
11. R. Ravinder, K.H. Sridhara, S. Bishnoi, H. Singh Grover, M. Bauchy, Jayadeva, H. Kodamana, N.M.A. Krishnan, Deep learning aided rational design of oxide glasses. Mater. Horiz. (2020). https://doi.org/10.1039/D0MH00162G. (Publisher: Royal Society of Chemistry)
12. R. Bhattoo, S. Bishnoi, M. Zaki, N.A. Krishnan, Understanding the compositional control on electrical, mechanical, optical, and physical properties of inorganic glasses with interpretable machine learning. Acta Materialia 242, 118439 (2023)
13. E. Kim, K. Huang, S. Jegelka, E. Olivetti, Virtual screening of inorganic materials synthesis parameters with deep learning. npj Comput. Mater. 3(1), 1–9 (2017). issn: 2057-3960. https://doi.org/10.1038/s41524-017-0055-6. (Publisher: Nature Publishing Group)
Chapter 11
Material Discovery

Abstract While the previous chapter focussed on the development of ML models
for property prediction, this chapter discusses how those surrogate models can be used
for discovering novel materials. Here, we discuss various algorithms that can be
employed for materials discovery. Various ML techniques are discussed, including
ML-based optimization, material selection charts, generative artificial intelligence
(AI) approaches such as generative adversarial networks (GANs) and variational
autoencoders (VAEs), and reinforcement learning. ML-based optimization enables
efficient exploration of the vast design space, facilitating the identification of materi-
als with desired properties. Material selection charts provide a systematic approach
for screening and selecting materials based on key properties. Generative AI meth-
ods, like GANs and VAEs, offer exciting prospects for generating novel materials
with specified characteristics, expanding the traditional design boundaries. Rein-
forcement learning, a subfield of ML, guides the selection of candidate materials
through sequential decision-making processes. The integration of ML techniques
in materials discovery holds great promise, revolutionizing the field by enabling
faster and more efficient identification of materials with tailored properties. Future
prospects include advancements in ML algorithms, enhanced computational power,
increased data availability, integration with experimental techniques, and the fusion
of ML with emerging technologies like quantum computing and advanced imaging.
These developments will drive innovation, leading to the development of advanced
materials with improved performance and functionality.

11.1 Introduction

In the previous chapter, we focused on the property prediction of materials. In that case, we aim to solve the forward problem of predicting the properties of a mate-
rial when the features such as composition, crystal structure, microstructure, or the
composite arrangement are given. In materials discovery, we aim to solve the inverse
problem of identifying or discovering new composition, structure, or arrangement
having desired target properties. Traditional materials discovery has relied mostly
on expert knowledge, intuition, and a great deal of trial and error. The aim of
ML-driven materials discovery is to accelerate this process, reducing the design-to-deploy period to 3–5 years. The main approaches include:
1. ML-based surrogate model optimization
2. Material selection charts
3. Generative models
4. Reinforcement learning based materials discovery.
In this chapter, we discuss these ML-driven approaches for materials
discovery. Note that these are some of the demonstrated applications of ML for
materials discovery. In addition to these approaches, active research is ongoing to
develop tailored algorithms for materials discovery using deep learning and kernel-based
generative models. All of these approaches, though distinct in application and
overall performance, follow a similar philosophy. Specifically, they try
to learn the underlying distribution of the feature–label space, which is then used
to propose materials for targeted applications (Fig. 11.1).

Fig. 11.1 Materials discovery flowchart. Reprinted with permission from [1]

11.2 ML Surrogate Model Based Optimization

The ML models for property predictions, once trained and validated, are useful in
exploring a larger domain of the input features. The model can be used to explore
new compositions, structures, and processing conditions depending on the input
features. However, most of the input features for property predictions are continuous
variables and can take a wide range of values. Thus, exploring all the possibilities
for the discovery of a new material is challenging.

Fig. 11.2 Materials discovery. Reprinted with permission from [1]

An alternate approach to this end is to use optimization. Traditionally, optimization
approaches have been challenging and not very successful for materials discovery,
as the composition–structure–processing–property functions of materials are
highly complex. ML surrogate models can be used to supplement this information.
ML models have been extensively developed for the property prediction of materials
with various descriptors as input features. These ML models can range from simple
polynomial regressions to more complex models such as NNs and GPs. Depending on
the ML model, several optimization schemes can be used to discover new compositions
or processing parameters that can yield materials with targeted properties.
These approaches include gradient descent, which is useful for convex optimization
problems; Bayesian optimization, which builds on prior knowledge; and evolutionary
algorithms such as genetic algorithms, ant-colony optimization, and particle swarm optimization,
which use a population-based approach to obtain the global minimum (Figs. 11.2 and
11.3).
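A minimal sketch of surrogate-based screening is shown below, assuming a trained surrogate `model.predict` mapping a composition vector to a property: a random search over candidate compositions keeps the best predicted candidate. More sophisticated schemes (Bayesian optimization, genetic algorithms) follow the same pattern of querying the cheap surrogate instead of running experiments.

```python
import numpy as np

rng = np.random.default_rng(0)
n_components = 4  # illustrative number of composition variables

def random_compositions(n):
    x = rng.random((n, n_components))
    return x / x.sum(axis=1, keepdims=True)  # mole fractions sum to 1

candidates = random_compositions(100_000)
predicted = model.predict(candidates)        # surrogate evaluation is cheap
best = candidates[np.argmax(predicted)]      # maximize the target property
print(best, predicted.max())
```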
The optimization approaches can be broadly classified as single-objective or multi-objective,
with or without constraints. In single-objective optimization, the goal is to identify
a material with a single target property, for example, a super-hard ceramic, a corrosion-resistant
metal, or a scratch-resistant glass. However, in almost all practical applications,
a combination of multiple properties may be desirable. This usually leads to a compromise
between the properties, which form the optimization objectives. Moreover,
these properties may have an inverse relationship; for instance, attempts to enhance
property 1 may result in a reduction of property 2 and vice-versa. In such a scenario,
there may exist a large number of nondominated or Pareto optimal solutions.
These solutions are those in which it is impossible to make any component better
without making at least one other component worse, as shown in Fig. 11.4.

Fig. 11.3 Solution landscape of traditional materials discovery. Reprinted with permission from [2]

Fig. 11.4 Illustration of a Pareto optimal surface. Any of the red points represents optimal choices. Other subjective factors may be used to choose between the points on the Pareto surface. Creative Commons-by-SA license (https://en.wikipedia.org/wiki/Pareto_efficiency). Le and Winkler [2]

In contrast to finding materials with a single optimal property, it is usual when
dealing with two or more properties, that is, objectives, to plot candidate materials
on a so-called Pareto plot, where the axes are the properties. This allows us to define
a characteristic boundary on which lie materials for which none of the objectives can
be improved in value without degrading the other objective value. Such boundary
points, the non-dominated data points, define a Pareto front (PF) that represents the
best trade-off between the objectives. Common examples of Pareto fronts include
the Ashby plots, which display two or more properties, such as Young's modulus and
density, for many materials or classes of materials. Methods to estimate such
fronts, especially if an exhaustive search is too tedious, have been studied and applied
for some time. Recent studies have used Monte Carlo sampling methods,
in conjunction with machine learning models, to obtain Pareto fronts for dielectric
polymer data. However, few studies have addressed how to guide experiments or
calculations to recommend optimal points in as few measurements or calculations
as possible, especially where the data sizes are relatively small [1].
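A minimal sketch of extracting the non-dominated (Pareto-optimal) candidates from surrogate predictions of two properties, both to be maximized, is given below; the random property values are placeholders.

```python
import numpy as np

def pareto_front(points):
    """Return a boolean mask of non-dominated rows (all columns maximized)."""
    n = points.shape[0]
    is_optimal = np.ones(n, dtype=bool)
    for i in range(n):
        if is_optimal[i]:
            # Mark every point that point i dominates as non-optimal
            dominated = (np.all(points <= points[i], axis=1)
                         & np.any(points < points[i], axis=1))
            is_optimal[dominated] = False
    return is_optimal

# Example with two predicted properties for 1000 hypothetical candidates
props = np.random.rand(1000, 2)
front = props[pareto_front(props)]
```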

11.3 Material Selection Chart

The materials selection chart was first proposed by Mike Ashby as a unique way to
visualize the variations in multiple and seemingly independent properties of materi-
als. The materials selection chart enables the identification and selection of materials
with target properties or even cost. Figure 11.5 shows a traditional materials selection
chart. Although the traditional materials selection chart is two-dimensional in nature,
the chart can be made multi-dimensional with more features added as the additional
axes. Although these charts are traditionally developed for a large class of materials, such selection
charts can be developed for specific material families as well, for instance, silicate glasses or
aluminum alloys. In such cases, populating these charts requires detailed experimen-
tal characterization for each of the properties of interest for a given composition or
structure. This can be extremely challenging, both economically and time-wise, due
to the large number of sample preparations and experiments that need to be carried
out.
Fig. 11.5 Strength versus Density. Chart created using CES EduPack 2019, ANSYS Granta ©2020 Granta Design

Fig. 11.6 Glass selection chart for elastic modulus and thermal expansion coefficient

An alternate approach is to train ML models for the properties, which can then be
used to develop the materials selection charts. Indeed, the training of ML models
also requires large amounts of data. However, it is not mandatory that all the
properties are available for a given composition. For instance, independent ML models
can be developed for each of the properties, while ensuring that the input features for
all these models are the same and consistent. The datasets may individually be different.
Once the ML models are developed, multiple properties can be predicted for any
given composition or structure within the input feature space. This can then be used
to develop the selection chart for a given material family, for instance, a glass selection chart.
Figure 11.6 shows a glass selection chart developed using ML models for Young’s
modulus and thermal expansion coefficient (TEC). Although the experimental data
is not available for both the Young’s modulus and TEC for a given glass system,
the ML models trained on the dataset can be used to predict the TEC of glasses for
which Young’s modulus is available and vice-versa. The ML models thus allow the
imputation of missing data, and prediction of properties for new compositions based
on which the glass selection charts can be developed. Additional analysis on these
charts can also be used to develop composite materials. The contour lines in the
Fig. 11.6 correspond to compositions with a constant value of . Eα, where . E is the
Young’s modulus and .α is the TEC. Note that . Eα corresponds to the thermal stress
developed in a material subjected to a unit (1.◦ C) change in the temperature. Thus,
materials having constant values of . Eα can be used to develop composite materials
that exhibit zero thermal stress, while having significant different in their modulus
values (Fig. 11.7).

Fig. 11.7 Generative models. Reprinted with permission from [3]

11.4 Generative Models

Generative machine learning models have gained significant attention in materials
discovery, particularly in the realm of organic materials. This section explores the
application of various generative models, such as autoencoders (AEs), variational
autoencoders (VAEs), recurrent neural networks (RNNs), and generative adversarial
networks (GANs), in the inverse design of materials.
The objective of a generative model is to learn the data distribution by training on
extensive data and generating similar data. The loss function quantifies the similar-
ity between the observed distribution and the generated distribution, capturing the
differences between them [4]. These models leverage the sequential or graph repre-
sentations of materials to learn the composition or structure rules and generate novel
hypothetical materials. By learning the underlying rules from a large set of samples,
generative models such as GANs can efficiently sample the design space and gen-
erate new samples with target properties. This capability surpasses other sampling
approaches like random sampling or Monte Carlo sampling [5].
Generative models can be implemented using various machine learning algo-
rithms, including variational autoencoders (VAE), generative adversarial networks
(GAN), reinforcement learning (RL), recurrent neural networks (RNN), and their
combinations. Unlike other generative models, GANs do not directly rely on dis-
crepancies between the data and an assumed model distribution (as in VAE). Within
the GAN framework, a generative model is constructed through adversarial training.
In this setup, the generator and discriminator models engage in a competition. The
generator aims to produce synthetic data by sampling from a noise space, while the
discriminator’s objective is to distinguish between synthetic and real data. Through
alternating training, the generator learns to generate data that the discriminator can-
not classify better than random chance. However, achieving convergence in GANs is
not a straightforward task and can be hindered by issues like mode collapse and the
dominance of the discriminator during training. Ongoing research efforts are focused
on enhancing the training process of GANs, particularly when dealing with discrete
data that lacks differentiability.
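A minimal PyTorch sketch of the adversarial training loop described above is given below, assuming the "real" data are composition vectors normalized to [0, 1]; the network sizes, latent dimension, and learning rates are illustrative only.

```python
import torch
import torch.nn as nn

latent_dim, n_features = 16, 8  # illustrative sizes

G = nn.Sequential(nn.Linear(latent_dim, 64), nn.ReLU(), nn.Linear(64, n_features), nn.Sigmoid())
D = nn.Sequential(nn.Linear(n_features, 64), nn.ReLU(), nn.Linear(64, 1))  # outputs logits

opt_g = torch.optim.Adam(G.parameters(), lr=1e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=1e-4)
bce = nn.BCEWithLogitsLoss()

def train_step(real_batch):
    batch = real_batch.shape[0]
    # Discriminator step: real -> 1, generated -> 0
    z = torch.randn(batch, latent_dim)
    fake = G(z).detach()
    d_loss = bce(D(real_batch), torch.ones(batch, 1)) + bce(D(fake), torch.zeros(batch, 1))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()
    # Generator step: try to fool the discriminator
    z = torch.randn(batch, latent_dim)
    g_loss = bce(D(G(z)), torch.ones(batch, 1))
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
    return d_loss.item(), g_loss.item()
```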
Fig. 11.8 GAN composed of a generator, which maps random vectors into generated samples, and a discriminator, which tries to differentiate real materials and generated ones. Reprinted with permission from [5]

One of the first works to use GANs to generate a large number of materials was by
Dan et al. [5], who trained a GAN on a large amount of data from databases such
as OQMD, ICSD, and the Materials Project to develop MatGAN. Figure 11.8 shows the
architecture of MatGAN. In this work, the authors trained a GAN model to generate
new materials. The model achieved a high level of novelty, generating materials that
have not been seen before, with a novelty percentage of 92.53% when producing 2
million samples. Furthermore, the generated samples exhibited a high percentage of
chemical validity, with 84.5% of the generated materials being chemically valid in
terms of charge neutrality and electronegativity balance. Notably, the GAN model
does not explicitly enforce chemical rules but demonstrates its ability to implicitly
learn and adhere to the underlying composition rules for forming compounds.
Interestingly, some families of materials were not generated successfully by
the GANs. This was attributed to the limited amount of data available for these materials,
which made it difficult to learn the composition rules required to generate such samples. To evaluate these
materials, the authors employed an autoencoder architecture. An autoencoder
consists of an encoder-decoder structure with a bottleneck layer in the middle.
The aim of such a bottleneck layer is to learn a representation with significantly
reduced dimension that retains the maximum information of the original data.
Thus, the autoencoder is an unsupervised approach towards dimensionality reduction
of an input representation. The hypothesis was that any structure that cannot be
generated by the autoencoder cannot be generated by a GAN either. Figure 11.9 shows
the VAE architecture employed to evaluate the GAN. Note that one of the major
challenges associated with ML is to identify or develop a metric that quantifies the difficulty
associated with a learning task. Using such surrogate-model-based approaches
is a useful strategy to identify why a model fails for certain domains of materials.
Figure 11.10 shows the two-dimensional representation of the crystal structures
based on the training and test sets, along with the newly generated structures. The two axes
correspond to the two dimensions after t-SNE-based dimension reduction. The ICSD
materials occupy only a tiny portion of the chemical space of inorganic materials.
Figure 11.10a, b, and c shows training samples (green dots) and leave-out validation
samples (red dots) from ICSD, and 50,000 and 200,000 generated samples (blue dots),

Fig. 11.9 Variational autoencoder for materials discovery. Reprinted with permission from [5]

Fig. 11.10 Inorganic materials space composed of existing ICSD materials and hypothetical mate-
rials generated by GAN-ICSD. Reprinted with permission from [5]

respectively. The GAN approach is thus able to explore a significantly larger domain
of crystals while training on a smaller domain (Fig. 11.11).
Note that the approach of GANs is not limited to crystal structures. With the advent
of additive manufacturing and 3D printing, architectured materials with topology
optimized for different loading conditions are highly desirable. Several efforts in
this direction also attempt to employ GANs [6].
In addition to generation tasks, inverse design requires control over the genera-
tive process to prioritize desired qualities. Variational Autoencoders (VAEs) enable
explicit optimization of properties by operating on a continuous representation. On
the other hand, Generative Adversarial Networks (GANs) and Recurrent Neural
Networks (RNNs) achieve property optimization by biasing the generation process
using techniques like Reinforcement Learning (RL), where generative behaviors are
rewarded or penalized.
Fig. 11.11 Examples of GAN-generated architectured materials with $\tilde{E}_{mean}$ ($\Omega \le 5\%$) achieving more than 94% of $E^{HS}$. Reprinted with permission from [6]

VAEs provide control over data generation through latent variables. An Autoencoder (AE) model consists of an encoding and a decoding network. The encoder compresses the molecule into a vector representation in a lower-dimensional space
known as the latent space, while the decoder maps the latent vector back to the origi-
nal representation. The AE is trained to faithfully reproduce the input data, capturing
essential features. However, it can potentially memorize the training data and lack
generalizability. To address this, VAEs constrain the encoding network to generate
latent vectors following a probability distribution, often Gaussian. Thus, molecules
are represented as probability distributions over the latent space, enabling sampling
and reconstruction even from noisy vectors.
The latent space of a VAE is of particular interest. Molecules are represented
as continuous and differentiable vectors residing on a probabilistic manifold. The
geometry of the latent space allows for sampling nearby points to generate similar
molecules, interpolating along paths to decode intermediate structures. VAEs initially
focused on encoding SMILES characters and later extended to incorporate grammar
and syntax, improving the generation of valid structures.
Optimizing properties in the latent space can be challenging due to the presence
of multiple local minima. One approach involves exploring a smoothed version of
the manifold using techniques like Bayesian optimization or constrained optimiza-
tion with Gaussian processes. By training the VAE to reproduce both molecules and
properties, the latent molecular space self-organizes, placing molecules with similar
properties in close proximity. Preferred dimensions and directions emerge for spe-
cific properties, and recent works have demonstrated local or global optimization
capabilities across the generated distribution by adjusting the quality of Gaussian
processes. Thus, the utilization of VAEs and their latent space provides a framework
for property optimization and exploration in generative modeling, facilitating the
design of molecules with desired characteristics.

11.5 Reinforcement Learning for Optimizing Atomic Structures

Reinforcement learning is emerging as a promising approach in materials science, offering new possibilities for accelerating the discovery and design of advanced
materials. Reinforcement learning is a subfield of machine learning that focuses
on training agents to make sequential decisions in an environment to maximize
cumulative rewards. By formulating materials discovery as a sequential decision-
making problem, reinforcement learning algorithms enable researchers to optimize
material properties and explore vast design spaces systematically.
In the context of materials discovery, reinforcement learning algorithms learn an
optimal policy that guides the selection of candidate materials with desired prop-
erties. These algorithms leverage a trial-and-error approach, iteratively exploring
the materials landscape and adapting their decision-making based on the feedback
received from the environment. The goal is to identify materials configurations that
yield high rewards, such as desirable electronic, optical, or mechanical properties
(Figs. 11.12 and 11.13).
Several reinforcement learning algorithms have been applied in materials dis-
covery with promising results. One commonly used algorithm is Q-learning, which
iteratively updates a value function that estimates the expected cumulative reward for
each state-action pair. Q-learning has been successfully applied to optimize mate-
rial properties by selecting appropriate synthesis parameters or exploring material
phase spaces. Deep Q-networks (DQN) extend Q-learning by employing deep neural
networks to approximate the value function. This allows for the handling of high-
dimensional materials data and enables the discovery of complex structure-property
relationships. DQN has shown great potential in accelerating the discovery of novel
materials with specific functionalities.

Fig. 11.12 Reinforcement learning. Reprinted with permission from [8]

Fig. 11.13 Reinforcement learning for a specific task. Reprinted with permission from [8]

Policy gradient methods, another class of reinforcement learning algorithms, directly optimize the policy function that maps states
to actions. By iteratively adjusting the policy parameters based on gradient informa-
tion, these methods have been employed to search for optimal material configurations
or optimize material synthesis pathways. Actor-critic models combine elements of
both value-based and policy-based methods. They consist of an actor that selects
actions based on a policy and a critic that estimates the value function. By leverag-
ing this combination, actor-critic models have demonstrated improved performance
in materials discovery tasks. Some of the example applications of RL for material
discovery are outlined here [4, 7, 8] (Figs. 11.14 and 11.15).
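To make the Q-learning update concrete, the toy sketch below applies the tabular update Q(s, a) ← Q(s, a) + α[r + γ max_a' Q(s', a') − Q(s, a)] to a hypothetical environment in which the state is a discretized synthesis parameter and the two actions nudge it up or down. The environment, reward, and discretization are invented purely for illustration and do not correspond to any of the studies cited above.

```python
# Sketch: tabular Q-learning on a toy "synthesis parameter" environment.
# The environment, reward, and discretization are hypothetical illustrations.
import numpy as np

n_states, n_actions = 20, 2          # discretized parameter levels; actions: down/up
alpha, gamma, eps = 0.1, 0.9, 0.1    # learning rate, discount factor, exploration rate
Q = np.zeros((n_states, n_actions))
rng = np.random.default_rng(0)

def step(state, action):
    """Move the parameter and return (next_state, reward); reward peaks at level 12."""
    next_state = int(np.clip(state + (1 if action == 1 else -1), 0, n_states - 1))
    reward = -abs(next_state - 12)   # stand-in for a measured property
    return next_state, reward

for episode in range(500):
    s = int(rng.integers(n_states))
    for _ in range(50):
        a = int(rng.integers(n_actions)) if rng.random() < eps else int(np.argmax(Q[s]))
        s_next, r = step(s, a)
        # Q-learning update of the state-action value estimate
        Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])
        s = s_next

print("Greedy action per state:", np.argmax(Q, axis=1))
```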
One of the most challenging problems in the field of materials is to optimize the structure
of a bulk system based on its atomic configuration. This problem is relevant to
identifying a stable structure from a random initial configuration, optimizing
structures with defects, modeling grain boundaries, optimizing liquid or solid bulk
structures, obtaining stable glassy or disordered structures, and even drug discovery.
To address this problem, one approach is to allow the system to learn policies that
discover better minimum-energy structures through reinforcement learning (RL).
Recently, it has been demonstrated that RL combined with graph neural networks
can be used to discover stable structures through a framework named StriderNet [10].
Here, we briefly discuss this framework as an example of how RL can be
used to address a challenging problem in material discovery.
In this work, the authors formulate the problem of discovering stable structures as an RL problem
as follows. Let Ω_c(x_1, x_2, …, x_N) be a configuration of an N-atom system with energy
U_{Ω_c} sampled from the energy landscape U^{Nd} of the system. Starting from Ω_c, our
goal is to obtain the configuration Ω_min exhibiting the minimum energy U_{Ω_min} by
displacing the atoms. To this end, we aim to learn a policy π that displaces the atoms
so that the system moves toward lower-energy configurations while allowing it to
overcome local energy barriers [10].

Fig. 11.14 Visualization of the molecular design process. The RL agent (depicted by a robot arm) sequentially places atoms onto a canvas. By working directly in Cartesian coordinates, the agent learns to build structures from a very general class of molecules. Learning is guided by a reward that encodes fundamental physical properties. Bonds are only for illustration. Reprinted with permission from [9]

Fig. 11.15 Neural-network policies trained by evolutionary reinforcement learning can enact efficient time- and configuration-dependent protocols for molecular self-assembly. A neural network periodically controls certain parameters of a system, and evolutionary learning applied to the weights of a neural network (indicated as colored nodes) results in networks able to promote the self-assembly of desired structures. The protocols that give rise to these structures are then encoded in the weights of a self-assembly kinetic yield net. Reprinted with permission from [7]

Fig. 11.16 Optimization of a bulk binary LJ liquid system using StriderNet [10]

Fig. 11.17 StriderNet architecture for optimizing atomic structures. The atomic structure is transformed into a graph, which is passed to a policy network that predicts node displacements, and the reward is computed. Finally, the policy parameters are updated based on the discounted reward [10]
Figure 11.16 shows the atomic structure of a binary Kob-Andersen Lennard-Jones
liquid with two particle species in the ratio 80:20. Optimizing the structure starting
from a random configuration results in more stable structures with overall lower
energies. Consequently, this also reduces the energy of each atom in the system and
of its local environment. The StriderNet approach takes inspiration from this
observation. It employs an RL approach to learn what displacements should be made
to each of the atoms so that the overall energy reduces. While doing this, care should
be taken to balance exploration and exploitation. Specifically, always moving along
the direction that decreases energy will trap the system in a local minimum, preventing
it from relaxing further. However,
exploring for too long could lead the system toward higher and higher energy states,
thereby making it more challenging to discover the minima. An optimal policy would
allow enough exploration so that the system can overcome local energy barriers.
However, as soon as it reaches a well with extremely low energy values, it should
exploit this fact and reach the minimum.
To address this problem, StriderNet employs a graph reinforcement learning
approach. Figure 11.17 describes the architecture of StriderNet. To achieve
permutation invariance and inductivity, the atomic configuration is represented as a
graph, with nodes representing the atoms and edges representing the bonds between
them. Note that a realistic cutoff is selected to create the graph based on the atomic
structure. Further, the graph is dynamic in nature as the atoms move. Subsequently,
a message-passing GNN is used to embed the graph into a feature space; the message-passing
architecture of the GNN ensures both permutation invariance and inductivity.
The policy network, in turn, predicts the displacements of each of the atoms, based on which the
rewards are computed. Finally, the policy π is learned by maximizing the discounted
rewards. Note that the parameters of π are learned using a set of training graphs
exhibiting diverse energies that are sampled from the energy landscape E^{Nd} of an
atomic system with N atoms in d dimensions. Thus, the initial structure, although
arbitrary and possibly unstable, is realistic and physically feasible. Then, given a
new structure, the parameters of the learned policy network π are adapted to the new
structure while optimizing the new graph structure.
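The discounted-reward objective used to train such a policy can be illustrated independently of the GNN details. The short sketch below computes discounted returns from the per-step rewards of one episode and forms a REINFORCE-style surrogate loss from the log-probabilities of the sampled displacements; the rewards and log-probabilities are placeholders rather than actual StriderNet outputs.

```python
# Sketch: discounted returns and a REINFORCE-style surrogate loss for one episode.
# Rewards and action log-probabilities are placeholders standing in for per-step
# energy reductions and the outputs of a policy network.
import numpy as np

rewards = np.array([-0.2, 0.5, 0.1, 0.8])        # e.g., per-step energy reductions
log_probs = np.array([-1.2, -0.7, -0.9, -0.5])   # log pi(a_t | s_t) of sampled moves
gamma = 0.99

# Discounted return G_t = sum_k gamma^k * r_{t+k}, computed backwards in time
returns = np.zeros_like(rewards)
running = 0.0
for t in reversed(range(len(rewards))):
    running = rewards[t] + gamma * running
    returns[t] = running

# Minimizing this loss increases the probability of actions with high returns.
baseline = returns.mean()
loss = -np.sum(log_probs * (returns - baseline))
print(returns, loss)
```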
The approach was tested on three systems, namely, a binary LJ system, Si modeled
using a Stillinger-Weber potential, and a coarse-grained model of cement hydrate
(C–S–H). It was demonstrated that StriderNet outperforms all the other optimization
algorithms tested, including gradient descent and the fast inertial relaxation
engine (FIRE), and reaches much more stable configurations. Thus, the RL-based
approach can be a strong candidate to bridge the gap in timescales between
simulations and experiments, allowing one to discover stable structures
corresponding to even millions of years of aging.

11.6 Summary

In conclusion, it is evident that the integration of machine learning techniques in materials
discovery has the potential to revolutionize the field, enabling faster and more
efficient discovery of materials with tailored properties. ML-based optimization pro-
vides a systematic approach for exploring the design space and identifying promising
material compositions. Material selection charts offer a valuable tool for screening
and selecting materials based on critical properties, facilitating the decision-making
process for materials scientists. Generative AI approaches, such as GANs and VAEs,
unlock the ability to generate novel materials with desired properties, expanding the
possibilities for materials design. Reinforcement learning algorithms enable discov-
ery of novel configurations and structures and have shown promise in accelerating
the discovery process by guiding the selection of candidate materials and optimizing
atomic structures, potentially having applications in the area of drug discovery.
Looking ahead, the future of materials discovery using ML holds immense poten-
tial. Continued advancements in ML algorithms, computational power, and data
availability will further enhance the capabilities of ML-based approaches. Integra-
tion with experimental techniques, such as high-throughput synthesis and character-
ization, will enable rapid validation and feedback loops, accelerating the discovery
process. Additionally, the development of physics-informed ML models will enable
the incorporation of fundamental scientific principles into the design process, improv-
ing the reliability and interpretability of ML-based predictions. The combination of
ML with other emerging technologies, such as quantum computing and advanced
imaging techniques, will open up new frontiers for materials discovery. Overall, the
continued exploration and integration of ML techniques in materials discovery will
drive innovation, enabling the development of advanced materials with enhanced
performance and functionality.

References

1. A.M. Gopakumar, P.V. Balachandran, D. Xue, J.E. Gubernatis, T. Lookman, Multi-objective
optimization for materials discovery via adaptive design. Sci. Rep. 8(1), 3738 (2018). ISSN:
2045-2322. https://doi.org/10.1038/s41598-018-21936-3. [Online]. https://www.nature.com/articles/s41598-018-21936-3. Accessed 14 Feb 2019
2. T.C. Le, D.A. Winkler, Discovery and optimization of materials using evolutionary approaches.
Chem. Rev. 116(10), 6107–6132 (2016). PMID: 27171499. https://doi.org/10.1021/acs.chemrev.5b00691 [Online]
3. K. Hatakeyama-Sato, K. Oyaizu, Generative models for extrapolation prediction in materials
informatics. ACS Omega (2021)
4. B. Sanchez-Lengeling, A. Aspuru-Guzik, Inverse molecular design using machine learn-
ing: generative models for matter engineering. Science 361(6400), 360–365 (2018). ISSN:
0036-8075. https://doi.org/10.1126/science.aat2663. [Online]. https://science.sciencemag.org/content/361/6400/360
5. Y. Dan, Y. Zhao, X. Li, S. Li, M. Hu, J. Hu, Generative adversarial networks (GAN) based
efficient sampling of chemical composition space for inverse design of inorganic materials.
Npj Comput. Mater. 6(1), 84 (2020). ISSN: 2057-3960. https://doi.org/10.1038/s41524-020-00352-0. [Online]
6. Y. Mao, Q. He, X. Zhao, Designing complex architectured materials with generative adversarial
networks. Sci. Adv. 6(17), eaaz4169 (2020)
7. S. Whitelam, I. Tamblyn, Learning to grow: control of material self assembly using evolutionary
reinforcement learning. Phys. Rev. E 101(5), 052604 (2020)
8. C. Luo, S. Ning, Z. Liu, Z. Zhuang, Interactive inverse design of layered phononic crystals
based on reinforcement learning. Extrem. Mech. Lett. 36, 100651 (2020)
9. G. Simm, R. Pinsler, J.M. Hernández-Lobato, Reinforcement learning for molecular design
guided by quantum mechanics, in International Conference on Machine Learning (PMLR,
2020), pp. 8959–8969
10. V. Bihani, S. Manchanda, S. Sastry, S. Ranu, N. Krishnan, Stridernet: a graph reinforce-
ment learning approach to optimize atomic structures on rough energy landscapes (2023),
arXiv:2301.12477
11. M.-P.V. Christiansen, H.L. Mortensen, S.A. Meldgaard, B. Hammer, Gaussian representation
for image recognition and reinforcement learning of atomistic structure. J. Chem. Phys. 153(4),
044107 (2020)
12. D.E. Rumelhart, G.E. Hinton, R.J. Williams, Learning representations by back-propagating
errors. Nature 323(6088), 533–536 (1986)
13. S.A. Meldgaard, H.L. Mortensen, M.S. Jørgensen, B. Hammer, Structure prediction of surface
reconstructions by deep reinforcement learning. J. Phys. Condens. Matter 32(40), 404005
(2020)
14. R. Bhattoo, S. Bishnoi, M. Zaki, N.A. Krishnan, Understanding the compositional control on
electrical, mechanical, optical, and physical properties of inorganic glasses with interpretable
machine learning. Acta Mater. 242, 118439 (2023)
15. R. Ravinder, K.H. Sridhara, S. Bishnoi, H. Singh Grover, M. Bauchy, Jayadeva, H. Kodamana,
N.M.A. Krishnan, Deep learning aided rational design of oxide glasses. Mater. Horiz. (2020).
Royal Society of Chemistry. https://doi.org/10.1039/D0MH00162G. [Online]. https://pubs.rsc.org/en/content/articlelanding/2020/mh/d0mh00162g. Accessed 10 May 2020
16. S. Singla, S. Mannan, M. Zaki, N.A. Krishnan, Accelerated design of chalcogenide glasses
through interpretable machine learning for composition-property relationships. J. Phys. Mater.
6(2), 024003 (2023)
17. S. Bishnoi, R. Ravinder, H. Singh Grover, H. Kodamana, N.M. Anoop Krishnan, Scal-
able Gaussian processes for predicting the optical, physical, thermal, and mechanical prop-
erties of inorganic glasses with large datasets. Mater. Adv. (2021). Royal Society of
Chemistry. https://doi.org/10.1039/D0MA00764A. [Online]. https://pubs.rsc.org/en/content/articlelanding/2021/ma/d0ma00764a. Accessed 10 Jan 2021
18. T.C. Le, D.A. Winkler, Discovery and optimization of materials using evolutionary approaches.
Chem. Rev. 116(10), 6107–6132 (2016)
Chapter 12
Interpretable ML for Materials

Abstract The ML approaches discussed thus far focused on developing
composition–property models or on using these models for the inverse design of
materials. However, understanding the nature of these black-box models is important
so that an informed decision-making process can be employed. To this end, in
this chapter, interpretable ML models are discussed. First, SHAP, a post-hoc model-agnostic
approach, is employed to interpret composition–property models. SHAP
agnostic approach is employed to interpret the composition property model. SHAP
provides insights into the features governing a property both in a qualitative and quan-
titative manner. Further, SHAP also provides the coupling or interaction between the
input features for a given property. Finally, the use of support vector machines to
interpret the structure–dynamics relationships in materials through a novel machine-
learned metric, namely, “softness” is discussed. Altogether, the chapter outlines how
interpretable ML can be used to gain insights into the black-box functions for mate-
rials response.

12.1 Introduction

The black-box nature of ML models does not allow a domain expert to gain insights
into the features learned by the model. However, there are several approaches, both
model-specific and model-agnostic, that can be used to interpret ML models. Some
of these models have been explained in Chap. 9. Here, we discuss the application
of these approaches to interpret ML models. Specifically, we will discuss how to
interpret
• composition–property models
• interdependence of input features
• physics associated with atomic motion.
Note that these are a few selected problems, which are aimed at giving insights
into how the ML models can be explained. There are several interesting approaches
applied to interpreting images, such as microstructure, failure patterns, and crystal
structure. These are dealt with separately in Chap. 14.


12.2 Composition–Property Relationships

Understanding the composition–property relationships in materials is a crucial and
open problem. ML has proven to be effective in predicting the properties as a func-
open problem. ML has proven to be effective in predicting the properties as a func-
tion of the composition. Most studies suggest that the composition–property relation-
ships are highly nonlinear for most properties and materials. Accordingly, complex
models such as neural networks and tree-based methods, which suffer from poor
interpretability, are commonly used. In such cases, SHAP can be an effective tool to
understand the composition–property relationships by elucidating the importance of
each input feature of the ML model. SHAP measures a feature’s importance by quan-
tifying the prediction error while perturbing a given feature value. If the prediction
error is large, the feature is important; otherwise, the feature is less important. It is an
additive feature importance method that produces unique solutions while adhering
to desirable properties, namely local accuracy, missingness, and consistency.
Figure 12.1 shows the SHAP plot for Young's modulus of inorganic glasses. It
should be noted that the SHAP values are provided with respect to the mean of the
overall output value. That is, in the absence of any information, the best estimate of the
output value is the mean of the dataset. This forms the baseline for SHAP. From this
baseline, when any given feature is added with a given magnitude, the corresponding
SHAP value shows how much of a quantitative change occurs in the output relative to
its mean value; a positive SHAP value means that the output increases, and a negative
SHAP value means that the output decreases. Thus, the SHAP value shows the
influence of a given feature on the output both qualitatively and quantitatively.

Fig. 12.1 SHAP plot for Young’s modulus of inorganic glasses
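A minimal sketch of how such a SHAP analysis is typically set up with the shap library is given below. The glass compositions and property values are random placeholders, and a gradient-boosted tree regressor stands in for whatever composition–property model has actually been trained.

```python
# Sketch: SHAP analysis of a composition-property model (synthetic data).
# The composition matrix and property values are random placeholders.
import numpy as np
import pandas as pd
import shap
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)
X = pd.DataFrame(rng.uniform(0, 50, size=(500, 4)),
                 columns=["SiO2", "Na2O", "B2O3", "Al2O3"])   # mol% (illustrative)
y = 60 + 0.4 * X["SiO2"] - 0.3 * X["Na2O"] + rng.normal(0, 2, 500)  # stand-in modulus

model = GradientBoostingRegressor().fit(X, y)

explainer = shap.TreeExplainer(model)      # model-specific explainer for tree models
shap_values = explainer.shap_values(X)

# Beeswarm (violin-style) summary of per-feature contributions
shap.summary_plot(shap_values, X)
```

The baseline discussed above corresponds to explainer.expected_value, and the SHAP values of each row sum with this baseline to the model prediction for that composition.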



The SHAP values are visualized using a violin plot and a river flow plot. Both
these plots allow one to analyze the SHAP value from different angles: (i) one from
that of a given component, that is, whether a given component affects the output
positively or negatively, and (ii) the other from that of a given composition, that is, in
a given composition which components increase and which decrease the output from
the mean value. Note that, for a given composition, the sum of the SHAP values of all
the components present and the mean output value should give the actual property
value of that composition. Thus, the SHAP values of the components in a given
composition are additive in nature toward the final output value.
The violin plot, also known as the beeswarm plot, represents the contribution of
a given feature toward the output value. The color of the points represents the value
of the feature, and the x-axis represents the corresponding SHAP value; thus, the
violin plot is colored according to the feature value. The violin plot allows one to
identify which features influence the outcome positively and which negatively. It
also shows the features having maximum and minimum influence in a
quantitative fashion. Further, the violin plot also allows one to identify any features
that exhibit mixed effects. Mixed effects are those where a non-monotonic variation
in SHAP value is observed with a monotonic variation in the feature value. It means
that the behavior of the given feature is not independent and is dependent on other
features (elements or compounds) present in the material. This could further be
understood in terms of SHAP interaction values discussed in the next section.
Another way to visualize the SHAP values is through the river flow plot. For a given
material or composition, the river flow plot shows the contribution of each of the
components toward the final property value. Thus, a given line represents one material,
and the intermediate points represent the contribution of different input components
toward the property value. In this case, the y-axis value corresponding to each
component represents the sum of the SHAP values of all the components up to that
component, added to the mean. Thus, the value corresponding to the last component
represents the total property value for a given composition.
Thus, these paths are created by nudging the prediction from the expected value
towards a particular direction representing that specific glass component’s particular
contribution. Note that the river flow plot is colored according to the final output
value. Altogether, the SHAP beeswarm plot along with the river flow plot provides an
approach to analyze the contribution of the individual components toward a given
property value. This can hence be used by experimentalists to design materials
with targeted property values by increasing or decreasing the components in a material
based on their SHAP values for the given property.

12.3 Interaction of Input Features

Traditionally, when developing a mathematical model using linear regression, it is
generally assumed that the input features are independent. This is not necessarily so
in ML or DL models. The input features may be interdependent. The dependency
of the input features is all the more important in materials modeling. For instance,
in oxide glasses, the coordination of boron can be three or four depending on the
presence of a charge-compensating sodium atom in the near vicinity. If sodium is
absent, boron takes a trigonal planar structure (BO₃), while in the presence of a sodium
atom, boron takes a tetrahedral structure (BO₄⁻). Such interdependence of the input
features may not be captured by simple correlation analysis. It should be noted that
the interdependence of features may be specific to each property as well. For instance,
the impact of the dependence of input components for different properties might be
different. This depends on the factors governing the property, whether it is electronic,
structural, thermodynamic, or physical. SHAP dependence and interaction analysis
can be used to elucidate such interdependence of input features (Figs. 12.2 and 12.3).

Fig. 12.2 Dependence plot showing interaction of Na₂O and B₂O₃ for liquidus temperature

Fig. 12.3 Dependence plot showing interaction of Na₂O and B₂O₃ for Young's modulus
Thus, the correlation between several input components in a model can also be
studied using the SHAP interaction values. To obtain the SHAP interaction values,
the variation in the output (predicted) value while perturbing two input components
simultaneously is analyzed. If the magnitude of variation in the output while perturbing
a single input component is the same for different values of the second input
component, this suggests that the two input components are not correlated; other-
wise, they are correlated. The degree of this correlation can also be quantified from
the SHAP interaction values.
It is worth noting that the SHAP interaction values for features are obtained for
each property separately. Thus, SHAP interaction values should not be confused
with correlation functions. Rather, the SHAP interaction values could differ from
property to property, even for the same set of input features and dataset. Thus, the interaction
values could provide insights into which features are related for a given property
providing a direction for investigation. The exact nature of the coupling could be
further investigated employing simulations or experiments.
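For tree-based models, the shap library exposes pairwise interaction values directly. A hedged sketch continuing the synthetic example above is shown below; the model and the feature matrix X are assumed to be those fitted earlier, and the feature names remain placeholders.

```python
# Sketch: SHAP interaction values for a tree-based composition-property model.
# Continues the synthetic example above; 'model' and 'X' are assumed to exist.
import numpy as np
import shap

explainer = shap.TreeExplainer(model)
# Shape: (n_samples, n_features, n_features); off-diagonal entries are pairwise couplings
interaction_values = explainer.shap_interaction_values(X)

i, j = list(X.columns).index("Na2O"), list(X.columns).index("B2O3")
print("Mean |interaction| between Na2O and B2O3:",
      np.abs(interaction_values[:, i, j]).mean())

# Dependence plot colored by the interacting feature (cf. Figs. 12.2 and 12.3)
shap.dependence_plot("Na2O", explainer.shap_values(X), X, interaction_index="B2O3")
```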

12.4 Decoding the Physics of Atomic Motion

The previous sections focused on the use of interpretable ML for decoding the input-
output relationships. In addition to property predictions, ML can be used to gain
insights into materials’ response itself. This can be achieved by learning the patterns
associated with the atomic motion obtained through in silico experiments. Some of
the open questions in materials research include the understanding of the structure-
dynamics relationship in disordered systems. Disordered systems include inorganic
glasses, colloidal gels, metallic glasses, and even granular systems. These systems
exhibit interesting behavior such as ductile deformation, glass transition, and jamming,
the physics of which remain elusive to date. Indeed, the plasticity mechanisms
in crystalline systems are better understood by analysing dislocations and their
propagation. However, the structural origin of ductile behavior in disordered systems
is an active area of research.
One of the first attempts to understand the physics of the structure-dynamics relationship
in disordered systems employing supervised ML approaches was through a
quantity named softness, which attempts to characterize the local structure and its
relation with dynamics [1]. In this work, instead of trying to intuit the relationship
between structure and dynamics, an ML approach using large amounts of data from
either molecular dynamics simulations or experiments was employed [1–6].
This approach was then used to address a variety of interesting problems includ-
ing the relationship between structure and relaxation in out-of-equilibrium systems,
the role of defects in governing the fracture behavior, and universal signatures of
structure-property relationships in disordered solids [1–6].
The idea of softness revolves around the use of support vector machine (SVM) to
classify atoms based on their structural features and displacements. Using an SVM
with a linear kernel provides interpretability in terms of the support vector, that is,
the plane that classifies the data points based on their features and labels. The basic
steps involved in the computation of softness are as follows.

Fig. 12.4 Parametrizing the local structure in a disordered solid through supervised ML. Reprinted with permission from [1]
1. Identify two populations of atoms or clusters of atoms: (i) those about to experience
rearrangements, and (ii) those that are stable.
2. Obtain the degrees of freedom of each of these atoms in a quantitative fashion
using structural features such as bond and angle distribution functions or orien-
tational orders.
3. Learn the function (support vector represented by the plane, in this case) that opti-
mally separates (classifies) the rearranging population from the stable population.
4. Compute the distance of each atom with respect to the support vector; this distance
is defined as the softness, S_i. The farther an atom is from the plane, the more
stable (or unstable) it is (see Fig. 12.4).
There are several ways to perform each of these tasks and the authors have demon-
strated that the results remain qualitatively unaffected by the choice of the method
[1]. Here, we briefly describe some of the common choices for the tasks mentioned
in the list above.
The two common metrics used to identify the populations susceptible (and not
susceptible) to motion are D²_min [7] and p_hop [8, 9]. Note that p_hop is defined with
reference to two time intervals A = [t − 4000δt, t] and B = [t, t + 4000δt], which
are large enough to ensure that the system has undergone notable rearrangement, as

\[
p_{\mathrm{hop}}(i, t) = \sqrt{\left\langle (\mathbf{x}_i - \langle \mathbf{x}_i \rangle_B)^2 \right\rangle_A \, \left\langle (\mathbf{x}_i - \langle \mathbf{x}_i \rangle_A)^2 \right\rangle_B} \tag{12.1}
\]

where ⟨·⟩ represents the average over the given time interval and δt represents the
timestep of the simulation. The intuitive meaning of p_hop is as follows.
If there are no rearrangements in the intervals A and B, then p_hop reduces to the
variance of the particle position over time. If there are rearrangements in the
intervals, then p_hop is proportional to the square of the distance the particle moves between
the two intervals as the system moves from one minimum to the other (see Fig. 12.5).
Using the distribution of p_hop shown in Fig. 12.5c, the two populations of rearranging
and non-rearranging particles can be constructed using a threshold value.
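A direct numpy translation of Eq. (12.1), including the square root, is sketched below for a single-particle trajectory. The trajectory is synthetic, and the averaging window is shortened from 4000δt for illustration.

```python
# Sketch: computing p_hop for one particle from its position trajectory (Eq. 12.1).
# The trajectory is synthetic; in practice x has shape (n_frames, n_dim) from MD.
import numpy as np

rng = np.random.default_rng(0)
x = np.cumsum(0.01 * rng.normal(size=(2000, 3)), axis=0)   # toy trajectory

def p_hop(x, t, w=200):
    """p_hop(i, t) with intervals A = [t-w, t] and B = [t, t+w]."""
    A, B = x[t - w:t], x[t:t + w]
    mean_A, mean_B = A.mean(axis=0), B.mean(axis=0)
    term_A = np.mean(np.sum((A - mean_B) ** 2, axis=1))    # <(x - <x>_B)^2>_A
    term_B = np.mean(np.sum((B - mean_A) ** 2, axis=1))    # <(x - <x>_A)^2>_B
    return np.sqrt(term_A * term_B)

values = np.array([p_hop(x, t) for t in range(200, 1800)])
rearranging = values > np.percentile(values, 95)           # threshold-based labeling
```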
Fig. 12.5 Quantifying rearrangements. a The distance in the inherent structure positions of a particle as a function of time. Hopping events can be noticed here. b The p_hop indicator function of this trajectory. c The distribution of p_hop. A clear cross-over to an exponential distribution is observed at a well-defined value of p_hop. Reprinted with permission from [1]

Once the populations that are mobile are identified, the next step is to obtain
features that quantify the local structure. To this end, a function that counts the
number of particles around a central atom at a distance r ± σ, similar to the pair
distribution function, G_i^X(r, σ), is defined (see Fig. 12.4a) as

\[
G_i^X(r, \sigma) = \frac{1}{\sqrt{2\pi}} \sum_{j \in X} e^{-\frac{(R_{ij} - r)^2}{2\sigma^2}} \tag{12.2}
\]

where R_ij = |x_i − x_j| is the distance between particles i and j, and X ∈ {A, B} refers
to the species of the atom. Due to the large number of possibilities, this function
is computed at discrete intervals of radii r_n = nΔ. Practically, a good selection
of Δ has been found to be 0.1 times the first-peak distance of the pair distribution
function [1].
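Equation (12.2) can be evaluated in a few lines of numpy. The sketch below computes the radial structure functions of one particle for a set of probe radii r_n = nΔ, using synthetic coordinates and ignoring periodic boundary conditions for brevity.

```python
# Sketch: radial structure functions G_i^X(r, sigma) of Eq. (12.2) for one particle.
# Synthetic coordinates; periodic boundaries are ignored for brevity.
import numpy as np

rng = np.random.default_rng(0)
pos = rng.uniform(0, 10, size=(500, 3))                 # particle positions
species = rng.choice(["A", "B"], size=500, p=[0.8, 0.2])
i = 0                                                   # central particle index

def G(i, X, radii, sigma=0.1):
    mask = (species == X) & (np.arange(len(pos)) != i)
    R_ij = np.linalg.norm(pos[mask] - pos[i], axis=1)
    # Sum of Gaussians centered at each neighbor distance, evaluated at each r_n
    return (1.0 / np.sqrt(2 * np.pi)) * np.exp(
        -((R_ij[None, :] - radii[:, None]) ** 2) / (2 * sigma ** 2)).sum(axis=1)

delta = 0.1
radii = np.arange(delta, 3.0, delta)                    # r_n = n * delta
features = np.concatenate([G(i, "A", radii), G(i, "B", radii)])
```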
The final step after the development of the feature vectors is to identify the support
vector that classifies the atoms into mobile and immobile using the identified feature
vectors. To this end, a linear support vector is used due to its high interpretability.
Note that a linear support vector directly provides a hyperplane such that ŵ · G_i + b is
positive when the particle has a tendency to rearrange and negative when the particle
is stable, where ŵ represents the normal to the plane and b represents the bias.
Once the support vector has been obtained after training, any new structure can
be classified based on the hyperplane, which can then be used to directly assess the
propensity of an atom to move based on its current neighborhood. In other words, the
support vector provides direct access to the dynamics of the system based on the structure.
More importantly, this approach clearly demonstrates that the structure of a system
holds the key to its dynamics, thereby shedding some light on the long-standing
controversy regarding the relation between structure and dynamic heterogeneity in
disordered materials. Now, we briefly discuss cases where softness
has been used to decode the physics associated with materials responses.
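Given such structural feature vectors and rearrangement labels, softness is simply the signed distance to the learned hyperplane. A minimal scikit-learn sketch with placeholder features and labels is shown below.

```python
# Sketch: training a linear SVM on structural features and defining softness via
# the signed distance to the hyperplane. Features and labels are placeholders.
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
features = rng.normal(size=(2000, 60))    # e.g., radial G and angular Psi functions
labels = (features[:, 0] + 0.5 * rng.normal(size=2000) > 0).astype(int)  # 1 = rearranging

clf = make_pipeline(StandardScaler(), LinearSVC(C=0.1))
clf.fit(features, labels)

# decision_function returns w . x + b, proportional to the signed distance to the
# plane (divide by ||w|| for the exact distance): positive for particles predicted
# to rearrange, negative for stable ones.
softness = clf.decision_function(features)
print(softness[:5])
```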
Identification of flow defects

In disordered solids, it is hypothesized that flow defects, which are effective in scat-
tering sound waves, are associated with localized particle rearrangements. However,
it is extremely challenging to directly identify and correlate flow defects based on the
local atomic structure alone. To this end, softness was used due to its unique ability
to identify the structure–dynamics relation [5]. In this work [5], the authors studied two
systems: a two-dimensional experimental granular pillar under compression and a
Lennard-Jones (LJ) glass in two and three dimensions (d = 2, 3) above and below
its glass transition. Figure 12.6 shows the two systems studied.
For these systems, SVMs were trained based on the features G_i^X(r, σ) and
Ψ_i^X(r, ξ, λ, ζ), where the former represents the radial features, while the latter represents
the orientational features. The label used to train the SVM classifier was based on
the probability of atomic motion computed using D²_min. Figure 12.7 shows the probability
P(D²_min) that a particle with given structural features was identified as soft
with respect to the observed value of D²_min. Overall, these results suggest that the
SVM is able to identify the soft particles, which are more susceptible to plastic flow
as represented by increased mobility.
Once the stable and unstable particles were identified, the associated structural
features, which were used as the input for the SVM, were investigated to analyse
the correlation between the structure and mobility of the particles. Figure 12.8 shows
the distribution of the radial and orientational features associated with soft and hard
particles. Interestingly, it can be observed that the radial feature, represented by G_A^B,
is unable to distinguish clearly between the hard and soft particles. However,
the orientational feature, represented by Ψ_A^{BB}, is able to clearly differentiate the hard
and soft particles based on their local environment. (Note that the notation for G and

Fig. 12.6 Snapshot configurations of the two systems studied. Particles are colored gray to red according to their D²_min value. Particles identified as soft by the SVM are outlined in black. a A snapshot of the pillar system. Compression occurs in the direction indicated. b A snapshot of the d = 2 sheared, thermal Lennard-Jones system. Reprinted with permission from [5]
Fig. 12.7 Probability that a particle of a given D²_min is soft. The vertical dashed lines are the corresponding D²_min,0 values. a The result for the pillar system, where d_AA refers to the large grain diameter (since this is a granular system with macroscopic grains, thermal fluctuations are negligible). b The result of using an SVM trained at a temperature T (T = 0.1, 0.2, 0.3 and 0.4 shown in different colors) to classify data at the same temperature for the d = 2 LJ glass. c, d Results for species A and B, respectively, for the d = 3 system at T = 0.4, 0.5 and 0.6. Reprinted with permission from [5]

Ψ used in [5] is slightly different from that followed in the present book. For the
sake of consistency with the figure, the notation as per [5] is used in this paragraph.)
In summary, the flow defects in disordered systems can be identified clearly using
SVM-based ML techniques. Moreover, the structural features associated with the
local environment of mobile and immobile particles can also be understood, thanks
to the interpretable nature of SVMs. Similar approaches employing softness have been
widely used for studying several other problems, including the structural relaxation of
glasses, structure–dynamics correlations in self-organising systems, and supercooled
liquids to name a few.

12.5 Summary

In this chapter, the use of interpretable ML approaches to gain insights into the
response of materials was discussed. Specifically, the use of SHAP to analyze the
black-box ML models can provide insights into the role of individual components
in a material. This information can be used for tailoring the design of materials with
targeted properties. Further, SHAP interaction values can be used to understand the
interdependence of the input features for a given property. Finally, a novel approach
to interpret the structure–dynamics relationship in materials through a metric, namely,
Fig. 12.8 Distribution of structural features of stable and unstable particles. a Distribution of G_A^B(i; r_AB^peak) for soft (red, dark) and hard (blue, medium dark, and green, light) particles; r_AB^peak corresponds to the first peak of the partial pair distribution functions g_AB or g_BA. b Distribution of Ψ_A^{BB}(i; 2.07σ_AA, 1, 2), proportional to the density of neighbors with small bond angles near a particle i, for soft (red, dark) and hard (blue, medium dark, and green, light) particles. The inset shows examples of configurations with corresponding radial and bond orientation properties, where dark (light) gray neighbors are of species A (B). Reprinted with permission from [5]

“softness” was discussed. These interpretable approaches provide insights into the
nature of the functions learned by the ML model and thus allow one to gain insights
into the underlying physics associated with the material behavior.

References

1. S.S. Schoenholz, Combining machine learning and physics to understand glassy systems.
J. Phys. Conf. Ser. 1036, 012021 (2018). https://doi.org/10.1088/1742-6596/1036/1/012021. [Online]
2. S.S. Schoenholz, E.D. Cubuk, E. Kaxiras, A.J. Liu, Relationship between local structure and
relaxation in out-of-equilibrium glassy systems. Proc. Natl. Acad. Sci. 114(2), 263–267 (2017)
3. E.D. Cubuk, S.S. Schoenholz, E. Kaxiras, A.J. Liu, Structural properties of defects in glassy
liquids. J. Phys. Chem. B 120(26), 6139–6146 (2016)
4. S.S. Schoenholz, E.D. Cubuk, D.M. Sussman, E. Kaxiras, A.J. Liu, A structural approach to
relaxation in glassy liquids. Nat. Phys. 12(5), 469–471 (2016)
5. E.D. Cubuk, S.S. Schoenholz, J.M. Rieser, B.D. Malone, J. Rottler, D.J. Durian, E. Kaxiras, A.J.
Liu, Identifying structural flow defects in disordered solids using machine-learning methods.
Phys. Rev. Lett. 114(10), 108001 (2015)
6. E.D. Cubuk, R. Ivancic, S.S. Schoenholz, D. Strickland, A. Basu, Z. Davidson, J. Fontaine, J.L.
Hor, Y.-R. Huang, Y. Jiang et al., Structure-property relationships from universal signatures of
plasticity in disordered solids. Science 358(6366), 1033–1037 (2017)
7. M.L. Falk, J.S. Langer, Dynamics of viscoplastic deformation in amorphous solids. Phys. Rev.
E 57(6), 7192–7205 (1998). https://doi.org/10.1103/PhysRevE.57.7192. [Online]. https://link.aps.org/doi/10.1103/PhysRevE.57.7192
8. A. Smessaert, J. Rottler, Distribution of local relaxation events in an aging three-dimensional
glass: Spatiotemporal correlation and dynamical heterogeneity. Phys. Rev. E 88(2), 022314
(2013). https://doi.org/10.1103/PhysRevE.88.022314. [Online]. https://link.aps.org/doi/10.1103/PhysRevE.88.022314
9. R. Candelier, A. Widmer-Cooper, J.K. Kummerfeld, O. Dauchot, G. Biroli, P. Harrowell,
D.R. Reichman, Spatiotemporal hierarchy of relaxation events, dynamical heterogeneities,
and structural reorganization in a supercooled liquid. Phys. Rev. Lett. 105(13), 135702 (2010).
https://doi.org/10.1103/PhysRevLett.105.135702. [Online]. https://link.aps.org/doi/10.1103/PhysRevLett.105.135702
10. C. Molnar, Interpretable Machine Learning. Lulu.com (2020)
11. S.M. Lundberg, S.-I. Lee, A unified approach to interpreting model predictions, in Proceedings
of the 31st International Conference on Neural Information Processing Systems, pp. 4768–4777
(2017)
12. K. Yang, X. Xu, B. Yang, B. Cook, H. Ramos, N.M.A. Krishnan, M.M. Smedskjaer, C.
Hoover, M. Bauchy, Predicting the Young’s modulus of silicate glasses using high-throughput
molecular dynamics simulations and machine learning. Sci. Rep. 9(1), 8739 (2019). ISSN:
2045-2322. https://doi.org/10.1038/s41598-019-45344-3. [Online]. https://www.nature.com/articles/s41598-019-45344-3. Accessed 07 May 2019
13. M. Zaki, Jayadeva, N.A. Krishnan, Extracting processing and testing parameters from mate-
rials science literature for improved property prediction of glasses. Chem. Eng. Process. Pro-
cess. Intensif. 108607 (2021). ISSN: 0255-2701. https://doi.org/10.1016/j.cep.2021.108607. [Online]. https://www.sciencedirect.com/science/article/pii/S0255270121003020
14. R. Bhattoo, S. Bishnoi, M. Zaki, N.A. Krishnan, Understanding the compositional control on
electrical, mechanical, optical, and physical properties of inorganic glasses with interpretable
machine learning. Acta Mater. 242, 118439 (2023)
15. M. Zaki, V. Venugopal, R. Bhattoo, S. Bishnoi, S.K. Singh, A.R. Allu, Jayadeva, N.A. Krishnan,
Interpreting the optical properties of oxide glasses with machine learning and Shapley additive
explanations. J. Am. Ceram. Soc. 105(6), 4046–4057 (2022)
Chapter 13
Machine Learned Material Simulation

Abstract This chapter explores the application of machine learning techniques
in materials simulations, with a focus on three key areas: machine learned interatomic
potentials, physics-informed machine learning for continuum simulations,
and physics-enforced graph neural networks. Machine learned interatomic potentials
offer a powerful approach to accurately model interactions between atoms in
a structure by leveraging machine learning algorithms. Physics-informed machine
learning combines domain-specific knowledge and physical equations to enhance
the accuracy and efficiency of continuum simulations. Physics-enforced graph neu-
ral networks model materials as a graph, while strictly enforcing governing laws
as inductive biases. They enable interpretability and generalizability to significantly
larger sizes than those trained. These approaches hold great promise for accelerat-
ing materials discovery, designing materials with tailored properties, and advancing
our understanding of materials behavior. Future research directions include refining
and expanding these techniques, exploring their applicability to new materials sys-
tems, and developing interatomic potentials that can scale the entire periodic table.
Interdisciplinary collaborations will be crucial in pushing the boundaries of machine
learning in materials science and engineering.

13.1 Introduction

Materials are made of atoms, which, in turn, are made of electrons, protons, and other
subatomic particles. The atomic motion is at the origin of all the microscopic and macroscopic
responses of materials. Materials' response is controlled by phenomena occurring
at different length and time scales. For instance, atomic motion occurs on
the scale of femtoseconds, dislocation motion occurs over a few picoseconds,
fracture propagation occurs over nanoseconds, and creep or fatigue occurs over a
period of days or years. Associated with each of these time and length scales, there
are several simulation techniques which are designed to capture the physics at that
particular length scale while ignoring the details that are not relevant. Some of these
techniques along with the associated length scales are shown in Fig. 13.1. The small-


Fig. 13.1 Multiscale modeling

est length scale, the quantum scale, is traditionally modeled using first-principles simulations
or density functional theory. While these approaches can take into account the electronic
motion and accurately predict the atomic structure, they are limited to a
few tens of atoms and picoseconds owing to the prohibitive computational cost.
Atomistic simulations can model larger system sizes of a few million atoms
and up to a few nanoseconds of simulation time by ignoring the electronic structure
details. The atomic interactions in these simulations are modeled by empirical
interatomic potentials, which are fitted against first-principles simulations
or experimental data. To understand phenomena occurring at larger length scales,
coarse-grained simulations can be used. In these simulations, a cluster of atoms is
combined to form a "bead", and the effective interactions between the beads are fitted
based on all-atom simulations.
Beyond coarse-grained simulations, continuum-based approaches such as phase
field simulations, finite element simulations, or particle based simulations including
smoothed particle hydrodynamics or peridynamics are commonly used. For these
simulations, the constitutive model of the material is given as the input, and the
material response under different loading conditions is studied.
In all these simulations, the force–displacement or the stress–strain response, also
known as the constitutive relationship, forms the basic input that governs the response
of materials. These relationships are either assumed based on domain knowledge
or fitted against experiments or first-principles simulations. ML presents an ideal
candidate to learn these relationships, which can then be used for materials simulations.
Several physics-informed approaches have been developed for modeling materials
at different length scales. In this chapter, we will discuss these approaches
used in accelerating materials simulation using machine learning. Specifically, we
will cover methods at different length scales as follows.
• Machine learning interatomic potentials for molecular modeling
• Physics-informed neural networks for continuum simulations
• Graph-based simulations for atomistic and continuum simulations.
Some of the commonly used approaches for materials simulations using these meth-
ods will be discussed in detail.

13.2 Machine Learning Interatomic Potentials for Atomistic Modeling

Molecular dynamics (MD) simulations model atoms as point particles that interact
with each other based on a force field, also known as an interatomic potential. The interatomic
potentials are fitted against first-principles simulations or experiments for certain properties
of interest, which are then validated against independent measurements. These
interatomic potentials, like other potentials such as gravitational or electric, depend
on the position, type, and charge of the atoms. The iterative dynamics followed in
MD simulations is given in Fig. 13.2. Based on the atomic configuration, the potential
energy of the system can be obtained from the interatomic potential. Once the
potential energy is obtained, the force on each atom can be computed as the negative derivative
of the energy with respect to position (F_i = −∇U(r_i)). Once the forces are obtained,
the accelerations are computed using Newton's laws. Finally, the updated velocities and
positions of the atoms can be obtained using numerical integration algorithms such as
Verlet or velocity Verlet. Thus, the accuracy of MD simulations is dependent on the
reliability of the interatomic potentials used for the simulation. Similarly, the unavailability
of interatomic potentials for a large number of elements and compounds limits
the ability to simulate these systems using MD (Fig. 13.3).
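The iterative loop of Fig. 13.2 can be written compactly. The sketch below uses the velocity Verlet scheme with a simple Lennard-Jones pair force as a stand-in for whatever empirical or machine-learned potential supplies the forces; there is no cutoff, periodic boundary, or thermostat, so it is only a minimal illustration.

```python
# Sketch: velocity Verlet time integration with a Lennard-Jones force as a
# stand-in for any (empirical or machine-learned) interatomic potential.
import numpy as np

def lj_forces(pos, eps=1.0, sigma=1.0):
    """Pairwise Lennard-Jones forces, F_i = -dU/dr_i (no cutoff, no PBC)."""
    forces = np.zeros_like(pos)
    n = len(pos)
    for i in range(n):
        for j in range(i + 1, n):
            rij = pos[i] - pos[j]
            r = np.linalg.norm(rij)
            # -dU/dr of 4*eps*((sigma/r)^12 - (sigma/r)^6), applied along rij
            f_mag = 24 * eps * (2 * (sigma / r) ** 12 - (sigma / r) ** 6) / r
            forces[i] += f_mag * rij / r
            forces[j] -= f_mag * rij / r
    return forces

# Atoms on a small cubic lattice (avoids overlapping initial positions)
grid = np.arange(3) * 1.2
pos = np.array([[x, y, z] for x in grid for y in grid for z in grid], dtype=float)
vel = np.zeros_like(pos)
mass, dt = 1.0, 1e-3

forces = lj_forces(pos)
for step in range(1000):
    vel += 0.5 * dt * forces / mass      # first half kick
    pos += dt * vel                      # drift
    forces = lj_forces(pos)              # recompute forces at new positions
    vel += 0.5 * dt * forces / mass      # second half kick
```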

Fig. 13.2 Iterative dynamics in molecular dynamics for updating atomic positions

Fig. 13.3 ML potentials as a potential solution to the trade-off between cost and accuracy of
conventional atomistic simulations. Potential future developments include hybrid machine learn-
ing/molecular mechanics (ML/MM) methods, more efficient representations to decrease simulation
times and more accurate training data (proposed by an active learning algorithm) to improve the
model accuracy beyond density functional theory. Potential future applications are shown in the
blue box (only approximately positioned according to their system size and accuracy requirements).
They include the simulation of enzymes and biomolecules such as ribosomes, the quantitative sim-
ulation of chemical reactions and reaction networks as well as the atomistic simulation of complex
reactive materials as found in, for example, Li-ion batteries. The inset on the top left shows the
energy landscape of a protein folding simulation, which is a prototypical example of a classical
force field calculation. No covalent bonds are formed or broken during the simulation. The inset
on the right shows an excited state dynamics simulation of a S1 to S0 transition, which requires
ab initio methods to compute the excited state properties. The ‘Coarse graining’, ‘Reactive force
fields’, and ‘QM/MM’ boxes are faded out as these methods are not discussed in depth. Figure
adapted with permission from: Marat Yusupov, Roland Beckmann and Anthony Schuller (biology
image); [1], American Chemical Society (chemistry image); [2], Elsevier (materials image); [3],
PNAS (left inset); [4], AIP (right inset). Reprinted with permission from [5]

This challenge can be addressed using ML-based potentials. In the ML-based
potential approach, no functional form for the potential is assumed, in contrast to
a traditional interatomic potential. Rather, the ML model is trained to learn the
function over a large set of atomic configurations enabling it to simulate almost
any new configuration. Note that training the ML model directly based on ab-initio
simulations can yield functional forms with quantum accuracy while having the
computational cost of classical MD simulations only. As such, ML potentials hold
the key to gaining insights into some of the unsolved problems of the millennium
including protein folding and glass transition [5].
The idea of representing potential energy surfaces using NNs was first proposed
in 1995 [6]. The inputs for the NNs in this work were the position of the center of
Fig. 13.4 Timeline of machine learned interatomic potentials

Fig. 13.5 Fitting procedure for machine learning interatomic potentials. Reprinted with permission
from [8]

mass and the angle of the molecular axis relative to the surface normal. However, these
approaches were restricted to low-dimensional potential energy surfaces due to the
lack of a descriptor that can unambiguously, succinctly, and universally represent
the local environment of an atom. The first ML-based potential that overcame these
limitations was proposed in 2007 [7]. Following this, a variety of ML potentials, each
using either a different descriptor for the atomic environment or a different regressor
to learn the potential energy function (detailed below), have been developed. The
timeline of some of the major ML potentials is shown in Fig. 13.4.
The broad approach employed in the development of an ML-based potential is
shown in Fig. 13.5. These steps can be broadly outlined as follows.
1. Development of a training set. The first step involved in the training of an ML
potential is the development of the training set. The training set should have, at a
minimum, the following information: (i) the positions of all the atoms in a configuration,
which forms the input, and (ii) potential energy of the total system, which forms
the output against which the potential is trained. In addition to these, sometimes

the velocities of the atoms may also be used as input parameters. Similarly, the force
per atom and the potential energy per atom are additional outputs against which
the potential may be trained. This depends on the approach used to train the
ML potential. Note that all of these parameters can be obtained directly from density
functional theory or other first-principles simulation approaches.
2. Representation of the local structure. Once the training set is developed, the
next important aspect is the development of a representation for the local structure
around an atom. It should be noted that the contributions to the potential energy
of a system are mainly from the first neighbors, followed by the second and third
neighbors. Thus, it is reasonable to use a cutoff for computing the representation
of the local structure around an atom. This cutoff can be similar to, lower than
or even higher than the cutoff of an empirical potential. However, the number of
atoms within the selected region interacting with the central atom can vary
for a given configuration even with a cutoff. Thus, a general fixed-size representation that
can be given as an input to the ML model is a requirement when using
classical ML models. This is because the number of input features in a classical
ML model, such as neural networks, support vector machines, or tree-based approaches,
remains constant. An alternative is to use a graph neural network,
which can handle a varying number of neighbors. This approach is discussed in
detail in a later section. The representations that are commonly used are developed
based on the radial and angular functions or order parameters. A detailed study
on the sensitivity (which represents the accuracy with which the descriptor is able
to capture a local environment uniquely and distinctly from a slightly different
one) and dimensionality (which represents the complexity and consequently the
computational cost of the resulting ML model) of these descriptors used to rep-
resent the local atomic environment can be found in [8]. The list of descriptors
presented in this work is shown in Fig. 13.6. Based on this work [8], some of
the essential properties of descriptors and representations for encoding materials
and molecules suggested are as follows.

• Invariance: descriptors should be invariant under symmetry operations, that is,
permutation of atoms and translation and rotation of structure. Note that a recent
family of graph-based potentials show that equivariance is a more desirable
property than invariance. These potentials form a broad class of equivariant
graph neural network potentials.
• Sensitivity (local stability): small changes in the atomic positions should result
in proportional changes in the descriptor, and vice versa.
• Global uniqueness/faithfulness: the mapping of the descriptor should be unique
for a given input atomic environment (that is, the mapping is injective).
• Dimensionality: relatedly, the dimension of the spanned hyper-dimensional
space of the descriptor should be sufficient to ensure uniqueness, but not larger.
• Differentiability: having continuous functions that are differentiable.
• Interpretability: features of the encoding can be mapped directly to structural
or material properties for easy interpretation of results.
Fig. 13.6 Classification of local atomic representations based on their method of construction
(horizontal axis) and when they were first proposed (vertical axis). QSAR and SISSO do not
indicate representations but instead indicate the representations that are constructed or selected
using these methods (see the text). The superscripts a, b, c correspond to the representations that are
classified with multiple methods: direct and connectivity, histogram and mapping functions, and
connectivity and mapping functions. Reprinted with permission from [9]

• Scalability: ideally, descriptors should be easily generalized to any system or
structure with a preference to have no limitations on number of elements, atoms,
or properties.
• Complexity: to have a low computational cost so the method can be fast enough
to scale to the required size of the simulations and to be used in high-throughput
screening of big data.
• Discrete mapping: always map to the same hyper-dimensional space with con-
stant size feature sets, regardless of the input atomic environment.
A library that generates several atomic descriptors directly from the atomic
structure, namely DScribe [9], has been developed and made publicly available
(see https://github.com/singroup/dscribe). This package, a continuously growing
one, has incorporated several of these descriptors, including the Coulomb matrix [10],
Sine matrix [11], Ewald sum matrix [11], Atom-centered Symmetry Functions
(ACSF) [12], Smooth Overlap of Atomic Positions (SOAP) [13], Many-body
Tensor Representation (MBTR) [14], and Local Many-body Tensor Representation
(LMBTR) [14].
3. Regressor for learning the potential energy function. The third aspect of developing
an ML potential involves the use of different types of regressors for training
the potential. These include kernel ridge regression, neural networks, random forests
or decision trees, and Gaussian process regression. Each of these approaches
has its own pros and cons. Specifically, Gaussian process regression enables
the computation of the standard deviation associated with every prediction. This
allows one to assess the reliability of the predicted potential energy associated
with each new configuration. Further, identifying the regions having high
standard deviation can enable an active learning approach for training the interatomic
potentials, thereby reducing the number of data points required to train
the ML potentials. The most common regressor used in training ML
potentials is the feedforward neural network with two or more hidden layers.
Among the potentials mentioned in Fig. 13.4, the SNAP, AGNI, and MTP potentials
employ linear regression or a regularized version of it. The GAP potential employs its
namesake Gaussian process regression. All other potentials use neural networks, shallow
or deep. A minimal sketch combining a descriptor with a simple regressor is given
after this list.
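As a minimal illustration of the descriptor-plus-regressor pipeline of steps 2 and 3 above, the sketch below computes averaged SOAP descriptors with DScribe for a set of slightly rattled Si crystals and fits a kernel ridge model to their total energies. The configurations and energies are placeholders rather than a real DFT training set, and the DScribe argument names follow recent releases and may differ slightly between versions.

```python
# Sketch: descriptor + regressor pipeline for an ML potential (energies only).
# Configurations and energies are placeholders; DScribe argument names follow
# recent releases and may differ slightly between versions.
import numpy as np
from ase.build import bulk
from dscribe.descriptors import SOAP
from sklearn.kernel_ridge import KernelRidge

rng = np.random.default_rng(0)

# Hypothetical training set: rattled Si crystals with stand-in energies
structures, energies = [], []
for _ in range(50):
    atoms = bulk("Si", "diamond", a=5.43).repeat((2, 2, 2))
    atoms.rattle(stdev=0.05, seed=int(rng.integers(1_000_000)))
    structures.append(atoms)
    energies.append(rng.normal())        # placeholder for DFT total energies

soap = SOAP(species=["Si"], r_cut=5.0, n_max=6, l_max=4, average="inner")
X = np.array([soap.create(s) for s in structures])   # one averaged vector per structure

model = KernelRidge(kernel="rbf", alpha=1e-3, gamma=1e-2)
model.fit(X, energies)
```

A real workflow would replace the placeholder energies with DFT values, typically also train on forces, and validate on held-out configurations.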
As this is a very active area of research, several software packages are available for
the development of ML potentials. Many of these packages are linked to widely used
simulation packages such as LAMMPS and VASP. Some of these packages are
listed below.
• Aenet (Atomic Energy Network, see: http://ann.atomistic.net) and N2P2 (Neural
Network Potential Package, see: https://compphysvienna.github.io/n2p2/) are two
software packages for training NN potentials proposed by Behler-Parrinello [7].
N2P2 also has an interface which allows running MD simulations with the LAMMPS
package. MAISE (Module for ab initio structure evolution [15], see:
http://maise.binghamton.edu/wiki/home.html) is another package which enables
automated generation of NN (Behler-Parrinello type) potentials for global structure
optimization, wherein the DFT database is automatically generated employing
an evolutionary algorithm-based sampling procedure.
• ASE (Atomistic Simulation Environment [16] see: https://wiki.fysik.dtu.dk/ase/)
offers a set of Python tools for pre-processing, running, post-processing atomistic
simulations. It enables DFT database generation and ML potential testing through
an environment that is linked to MD simulation packages such as LAMMPS, and
to DFT packages such as VASP and Quantum Espresso.

• MLIP (Machine Learning Interatomic Potentials [17], see: https://mlip.skoltech.ru)
is a package that enables the development of MTP potentials (including multicomponent
potentials) employing an active learning approach.
• KLIFF (KIM-based Learning-Integrated Fitting Framework, see: https://openkim.org)
enables the development of both traditional and NN interatomic potentials.
The package is shared as part of the OpenKIM project and has access to a large
repository of interatomic potentials enabling simulations in LAMMPS, GULP,
ASE, and DL_POLY.
• DeepMD enables the development of neural network based interatomic potentials.
It has an interface with LAMMPS and also has active learning routines that per-
form single-point DFT calculations employing VASP or other ab initio simulation
packages.
• NequIP is a package for developing E(3)-equivariant graph neural network poten-
tials. NequIP is integrated with LAMMPS and ASE so that the trained potentials
can be used for simulations in these packages.
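As referenced in the ASE entry above, the following is a minimal sketch of a typical ASE workflow: build a structure, attach a calculator, and query energies and forces. The built-in EMT calculator is used here purely as a stand-in for a DFT code or a trained ML potential:

# Minimal sketch: build a structure in ASE and evaluate energy/forces with a calculator
from ase.build import bulk
from ase.calculators.emt import EMT

atoms = bulk("Cu", "fcc", a=3.6).repeat((2, 2, 2))  # 32-atom copper supercell
atoms.calc = EMT()   # stand-in; a DFT or ML-potential calculator could be attached instead

energy = atoms.get_potential_energy()  # total energy (eV)
forces = atoms.get_forces()            # per-atom forces (eV/angstrom)
print(energy, forces.shape)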
Traditional ML potentials, fitted using the procedure mentioned above, are agnos-
tic to the physics of the problem. Recently, ML potentials with a physics-informed
inductive bias have been developed [19] that enable the model to learn the
potential functional faster and more accurately. These classes of potentials, also
known as physics-informed neural network potentials (PINNs), use neural networks
to learn the parameters associated with a physics-based potential. The fitting proce-
dure of the PINN potential, along with some of the interesting applications where it
has been used, is shown in Fig. 13.7.
In PINN potentials, for instance, a generic bond-order based functional form may
be assumed for the interatomic potential. The parameters associated with this bond-
order potential are learnt by training the potential against first-principle simulation
data. Here, associated with an atomic structure, the local representations around indi-
vidual atoms are used to predict the parameters of the potential. These parameters are

Fig. 13.7 Fitting procedure for machine learning interatomic potentials along with some of the
commonly used applications. Reprinted with permission from [18]

then used to compute the energy of each individual atom, which, when summed over all the
atoms, gives the total energy of the system. The mean squared error between the
predicted total energy and the energy obtained from the DFT simulations is used to
train the neural network. Note that standard regularization techniques can also be
included in the loss function for improved training of the potential, while avoiding
overfitting. Note that PINN potentials try to combine the best of both the ML potential
and the traditional potential. On the one hand, unlike the traditional potential, the param-
eters of PINN potentials are not constant. Rather, they change dynamically
depending on the local environment of the atomic structure. Further, this function is
learned from the first-principles simulation data. On the other hand, unlike the ML
potential, the functional form of the PINN potential is physically informed, with a high
inductive bias leading to improved performance compared to purely data-driven ML poten-
tials. Thus, PINN potentials have been shown to provide superior performance over
purely ML-based potentials, especially in conditions far from equilibrium. Figure 13.7
shows some example properties calculated with the PINN potentials for Al (a–
d) [19] and Ta (e–h). These include the phonon dispersion curves, the linear
thermal expansion with respect to temperature, the solid–liquid interface
tension computed by the capillary fluctuation method, and crack nucle-
ation and growth on a grain boundary. Predictions are compared with experimental
data wherever available. Further, MD simulations of surfaces in body-centered cubic
Ta on the (110) and (112) planes, the Nye tensor plot of the core
structure, and the Peierls barrier of the screw dislocation in Ta predicted by the PINN
potential (lines) are shown.
In addition, PINNs can be extremely useful in simulations involving interactions
at a very short range much below the equilibrium bond lengths, for example, in the
case of radiation damage or shock simulations. This is because the high repulsive
interactions in the short-range may not be trained properly in purely data-driven
ML potentials due to the sparsity of data in this region. These interactions, being
extremely rare, may not be represented in a dataset of reasonable size, especially
one generated from first-principles simulations. However, despite the lack of data in this region,
PINN potentials can successfully capture the exponential repulsion at short
range, thanks to the physics-informed functional form of the potential (see Fig. 13.8).
Overall, the PINN potential holds promise to tackle a large variety of problems
by combining DFT-level accuracy with MD-level computational efficiency, all the
while respecting the underlying physics of the problem. Thus, PINN potentials
can be reliably extrapolated to regions on which the potential has not been trained
thus far.

Fig. 13.8 Potential energy with respect to pair-wise interatomic distance for traditional, ML, and
PINN potentials. Reprinted with permission from [19]

13.3 Physics-Informed Neural Networks for Continuum Simulations

When simulating systems at the macroscopic scale, the details at the atomic and
microscale may not be relevant. Instead, a homogenized model representing the aver-
aged information from the lower length scales may be capable of simulating the
system with the expected accuracy. For instance, while simulating the deformation of a
steel truss structure (like the Eiffel tower!), the microstructural details or atomic-level
dislocation motions may not be relevant as long as the Young's modulus and the yield
strain are maintained at the macroscopic level. Thus, continuum models ignore the
atomic and microscopic details and rely on fundamental laws and mathematical
models that are capable of capturing the essential details at the macroscopic level. It
should be noted that, as we move toward smaller length scales, the continuum assumptions break
down and atomistic effects become prominent and governing. Although there
are some rules of thumb on the length scales associated with each of these theories, there are
several reformulations, such as non-local theories and multiscale models, that
allow one to adapt the theories over a wider range of length scales and systems.
Modeling of a continuum system relies on a few fundamental relationships. These
include:

• conservation laws (mass, energy, momentum)
• constitutive relationship (stress–strain relationship)
• strain–displacement relationships
• compatibility conditions.

While conservation laws are fundamental in nature, the constitutive relationship
depends on the type and state of the material. In addition, it may vary for a given
material depending on various factors, including the applied strain and strain rate, envi-
ronmental conditions, and testing conditions, to name a few. The strain–displacement
relationship and compatibility conditions ensure the continuity and differentiability
of the domain. However, these relationships may be invalid under certain circum-
stances, such as extreme plastic deformation, spalling, crack propagation, and frac-
ture. In traditional continuum simulations, the governing equations are constructed
based on these relationships and validated against experimental data points, which
are generally sparse in nature. The validated model is then used to simulate the system
under different conditions to analyze its response. In this approach, it is expected that
the parameters associated with the system and the governing equations are known
precisely to simulate the system. For instance, to study the tensile deformation of a
steel bar, an elasto-plastic constitutive model is used with known Young’s modulus
and yield strain. Since there could be energy dissipation in the process, the momentum
conservation equation is used to develop the governing equation as

ρẍ = ∇ · σ + b    (13.1)

where ρ represents the density, x the displacement (so that ẍ is the acceleration), σ the second-
order stress tensor, and b the body force. Further, it is assumed that the boundary
condition is fixed on both ends of the bar. There are several assumptions in this
model, many of which may not hold valid under different circumstances leading to
accumulation of error in the system.
When a large amount of data is available, such approaches can be replaced by purely
data-driven approaches where the constitutive relationship itself is learned directly
using ML. Note that this approach is similar to the development of interatomic

Fig. 13.9 Different types of bias in the data of physics. Reprinted from [23] with permission

potentials using ML, albeit for a continuum system. Further, instead of first-principle
simulations, the data needed for training the ML model has to be generated from a
large number of experimental observations. While such approaches have been used
to study atomistic models, these approaches are not commonly used for traditional
continuum modeling. Instead, the governing laws are directly learned from the data
by applying symbolic regression [20, 21] or even combining symbolic regression
with deep learning [22]. However, for most practical applications, the data available
will neither be too large nor too small. In such cases, a physics-informed neural
network (PINN) can significantly enhance the performance of the simulations with
limited assumptions about the model (Fig. 13.9).
PINNs involve biasing an ML model so that it learns the underlying physics of the
problem, leading to physically consistent solutions. These biasing modes can be broadly
classified into (i) observational, (ii) inductive, and (iii) learning bias [23], as detailed
below.
1. Observational bias can be introduced by including a large amount of observa-
tional data. Sometimes, getting a large amount of data can be challenging and
expensive. In such cases, available data can be augmented while respecting the
physical structure of the data. For instance, a microstructural image of a crystal
can be augmented by applying operations such as rotation, reflection, or even
zooming in to selected areas and trimming the remaining regions. All of these

Fig. 13.10 A PINN with modified loss function for solving the viscous Burgers’ equation.
Reprinted from [23] with permission

augmented images will still correspond to the original image. Observational bias
is one of the most commonly applied modes used in training ML models.
2. Inductive bias refers to the introduction of a specific architecture of an ML model,
for example an NN, which implicitly embeds prior knowledge on the structure
of the data or the predictive task. Examples of such specialized architectures
include convolutional neural networks (which are designed for learning patterns
from images), graph neural networks (which incorporate the specific topology
of the data, such as molecules and structures, in the form of a graph), and recurrent
neural networks (which can learn from data in the form of a series). Note
that these architectures can be further modified to respect more physics-based
features such as symmetry and translation. In such cases, data augmentation
by symmetry operations becomes redundant, thereby reducing the observational bias
and increasing the inductive bias. Introducing inductive bias by converting a
molecular structure to a graph structure has been very effective for predicting material
properties and discovering new entities.
3. Learning bias is the third approach, in which, instead of modifying the architecture
of the ML model, constraints in terms of the physical laws are imposed in
the loss function. Figure 13.10 shows a PINN algorithm for solving the viscous
Burgers' equation for fluid flow, given by

∂u/∂t + u ∂u/∂x = ν ∂²u/∂x²    (13.2)

where u(x, t) represents the speed of the fluid at the spatial and temporal coor-
dinates x and t, respectively, and ν represents the kinematic viscosity or the
diffusion coefficient. Equation 13.2 can be rewritten as

∂u/∂t + u ∂u/∂x − ν ∂²u/∂x² = 0    (13.3)
Thus, in a PINN, this additional residual term can be included as a physics loss in addition
to the data loss. The model is then trained on a loss function
containing both the physics and data losses, so that the learned weights respect
both the physics and the data. However, the model, once trained,
does not have any restrictions during inference, as the physics-based constraint
was employed only in the loss function.
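To make the learning-bias idea concrete, the following is a minimal sketch (assuming PyTorch; the network size, observed data, and collocation points are placeholders) of a PINN for the viscous Burgers' equation, where the residual of Eq. 13.3 is added to the data loss:

# Minimal PINN sketch for the viscous Burgers' equation (assumes PyTorch)
import torch
import torch.nn as nn

class BurgersPINN(nn.Module):
    """Fully connected network u_theta(x, t) approximating the solution u(x, t)."""
    def __init__(self, hidden=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2, hidden), nn.Tanh(),
            nn.Linear(hidden, hidden), nn.Tanh(),
            nn.Linear(hidden, 1),
        )

    def forward(self, x, t):
        return self.net(torch.cat([x, t], dim=1))

def physics_loss(model, x, t, nu=0.01):
    """Mean squared residual of u_t + u*u_x - nu*u_xx = 0 at collocation points."""
    x = x.requires_grad_(True)
    t = t.requires_grad_(True)
    u = model(x, t)
    u_t = torch.autograd.grad(u, t, torch.ones_like(u), create_graph=True)[0]
    u_x = torch.autograd.grad(u, x, torch.ones_like(u), create_graph=True)[0]
    u_xx = torch.autograd.grad(u_x, x, torch.ones_like(u_x), create_graph=True)[0]
    residual = u_t + u * u_x - nu * u_xx
    return (residual ** 2).mean()

# Total loss = data loss on observed points + physics loss on collocation points
model = BurgersPINN()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
x_obs = torch.rand(64, 1); t_obs = torch.rand(64, 1); u_obs = torch.zeros(64, 1)  # placeholder data
x_col = torch.rand(256, 1); t_col = torch.rand(256, 1)                            # collocation points
for step in range(1000):
    optimizer.zero_grad()
    loss = ((model(x_obs, t_obs) - u_obs) ** 2).mean() + physics_loss(model, x_col, t_col)
    loss.backward()
    optimizer.step()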
PINN algorithms have now been widely used in different domains to solve prob-
lems in the areas of solid and fluid mechanics. However, it should be noted that
PINN approaches satisfy the governing laws or additional physics-based
biases only in a weak fashion. This is due to the fact that the training of an ML model with a
physics loss will have a finite loss irrespective of how well-trained the model is. The
propagation of this finite loss during the inference phase will be a governing factor in
deciding the quality of the simulation carried out using PINN algorithms. An alternate
approach to address the issues of PINN algorithms is to enforce the physical laws, such
as conservation laws, in a strong fashion in the architecture itself. This approach is
discussed next.

13.4 Graph Neural Networks

A major limitation of MLPs used for learning the dynamics of a system is that
MLPs are transductive in nature, that is, they work only for the systems they are
trained for. For instance, an MLP-based PINN trained for a 5-spring system (that
is, 5 balls connected by 5 springs) can be used to infer the dynamics of that
system only, and not of any n-spring system. This significantly limits the application
of such approaches to simple systems, since for each system the training trajectory
needs to be generated and the model needs to be trained. Further, such approaches are
not beneficial for training interatomic potentials from DFT trajectories or for use in
continuum simulations, as the trained model cannot be used for any system other than
the one on which it is trained. It has been shown that the transductivity of MLPs can be
addressed by incorporating an additional inductive bias in the structure by accounting
for the topology of the system in a graph framework using a graph neural network
(GNN). GNNs, once trained, have the capability to generalize to arbitrary system sizes.
Most earlier studies on GNNs for dynamical systems use a purely data-driven
approach, where the GNNs are used to learn the updated position and velocity from
the data on trajectories. To address these challenges, several physics-enforced GNNs

have been proposed, such as the Hamiltonian (HGNN) and Lagrangian (LGNN) graph
neural networks and graph neural ODEs. These physics-enforced GNN architectures
are discussed in detail later in this section. In addition, since GNNs are trained
at the node and edge level, they can potentially learn more efficiently from the same
number of data points in comparison to their fully connected counterparts. Further,
since the learning of the function happens at the node and edge level in a GNN, there
are no limitations on the system size on which the trained GNN can be used. Note that
the graph-based architecture makes GNNs directly amenable to molecular or atomic
systems, where the nodes represent the atoms and the edges
represent the bonds. Thus, GNNs are widely used for modeling atomic systems.
However, it should be noted that GNNs are not limited to such discrete systems;
GNNs have also been used to model continuum systems, rigid body systems, and
articulated bodies.

13.4.1 Physics-Enforced GNNs

First, we discuss the preliminaries on the dynamics of particle-based systems
in the Lagrangian, Hamiltonian, and ordinary differential equation
frameworks. Consider a rigid body system comprising n interacting particles.
The configuration of this system is represented as a set of Cartesian coordinates
x(t) = (x_1(t), x_2(t), ..., x_n(t)). Since we are using a graph neural network to model
the physical interactions, it is natural to select the Cartesian coordinates for the fea-
tures such as position and velocity. While this may result in an increased complexity
in the form of the Hamiltonian or Lagrangian, it significantly simplifies the mass
matrix for particle-based systems by making it positive definite.

ODE Formulation of Dynamics

Traditionally, the dynamics of a system can be expressed in terms of D'Alembert's
principle as

Mẍ − F(x, ẋ) = 0    (13.4)

where, in Cartesian coordinates, M is the constant mass matrix that is independent
of the coordinates [24], and F represents the dynamics of the system. Accordingly,
the acceleration ẍ of the system can be computed as:

ẍ = M⁻¹ F(x, ẋ)    (13.5)

This equation is essentially equivalent to Newton's second law of motion and is
applied for solving several problems, including classical molecular dynamics simu-
lations.
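As a simple illustration of how Eq. 13.5 is used in practice, the following is a minimal numpy sketch (with a synthetic harmonic force standing in for a real force field) of the velocity Verlet scheme, the same type of symplectic integrator referred to later in this section:

# Minimal sketch: velocity Verlet integration of x_ddot = M^{-1} F (synthetic force)
import numpy as np

def velocity_verlet(x, v, m, force_fn, dt, n_steps):
    """Integrate positions x and velocities v of n particles with masses m."""
    a = force_fn(x, v) / m[:, None]            # a = M^{-1} F
    for _ in range(n_steps):
        x = x + v * dt + 0.5 * a * dt**2       # position update
        a_new = force_fn(x, v) / m[:, None]
        v = v + 0.5 * (a + a_new) * dt         # velocity update
        a = a_new
    return x, v

# Example: 5 particles in 2D attached to the origin by harmonic springs (F depends only on x)
force = lambda x, v: -1.0 * x                  # F = -k x with k = 1
x0, v0 = np.random.rand(5, 2), np.zeros((5, 2))
xT, vT = velocity_verlet(x0, v0, np.ones(5), force, dt=1e-2, n_steps=1000)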

Lagrangian Dynamics
The standard form of Lagrange's equation for a system with holonomic constraints
is given by

d/dt (∇ẋ L) − (∇x L) = 0    (13.6)

where the Lagrangian is L(x, ẋ, t) = T(x, ẋ, t) − V(x, t), with T(x, ẋ, t) and V(x, t)
representing the total kinetic energy of the system and the potential function from
which generalized forces can be derived, respectively. Accordingly, the dynamics of
the system can be represented using the Euler–Lagrange (EL) equations as

ẍ = (∇ẋẋ L)⁻¹ [∇x L − (∇ẋx L) ẋ]    (13.7)

Here, ∇ẋẋ refers to ∂²/∂ẋ². In Cartesian coordinates, the Lagrangian simplifies to
L(x, ẋ) = (1/2) ẋᵀ M ẋ − V(x). Exploiting the structure of the Lagrangian by decoupling
the kinetic and potential energies, and substituting this expression in Eq. 13.7, we obtain
∇ẋẋ L = M as a constant mass matrix independent of coordinates, ∇ẋx L = 0, and
∇x L = −∇x V(x) = F. Accordingly, the acceleration ẍ can be obtained as

ẍ = M⁻¹ F    (13.8)

Note that the second-order differential equation obtained based on Lagrangian
mechanics is equivalent to the equation of motion obtained based on
D'Alembert's principle.

Hamiltonian Dynamics

Hamiltonian equations of motion are given by

ẋ = ∇px H,   ṗx = −∇x H    (13.9)

where px = ∇ẋ L = M ẋ represents the momentum of the system in Cartesian coor-
dinates and H(x, px) = ẋᵀ px − L(x, ẋ) = T(ẋ) + V(x) represents the Hamilto-
nian of the system. The equations can be simplified by defining Z = [x; px] and
J = [0, I; −I, 0], so that the Hamiltonian equations can be written as

∇Z H + J Ż = 0,   or   Ż = J ∇Z H    (13.10)

since J⁻¹ = −J. Note that these coupled first-order differential equations are equiv-
alent to the Lagrangian formulation in Eq. 13.7.
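As an illustration of how Eq. 13.10 can be evaluated when H is represented by a neural network, the following is a minimal sketch (PyTorch autograd, with a hand-written harmonic-oscillator Hamiltonian standing in for the output of a graph network) that computes Ż = J ∇Z H:

# Minimal sketch: z_dot = J grad_z H(z) via automatic differentiation (synthetic H)
import torch

def hamiltonian(z, m=1.0, k=1.0):
    """Separable H = T(p) + V(x) for a 1D harmonic oscillator; z = (x, p)."""
    x, p = z[..., 0], z[..., 1]
    return p**2 / (2 * m) + 0.5 * k * x**2

def z_dot(z):
    z = z.clone().requires_grad_(True)
    H = hamiltonian(z).sum()
    grad = torch.autograd.grad(H, z)[0]          # grad_z H
    J = torch.tensor([[0.0, 1.0], [-1.0, 0.0]])  # symplectic matrix [[0, I], [-I, 0]]
    return grad @ J.T                            # z_dot = J grad_z H

z = torch.tensor([[1.0, 0.0]])   # initial state: x = 1, p = 0
print(z_dot(z))                  # tensor([[0., -1.]]): x_dot = 0, p_dot = -1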

Physics-enforced GNNs (PEGNNs)


Physics-enforced GNNs take the configuration (position and velocity) as input and
predict abstract quantities such as the Lagrangian, Hamiltonian, or force. These out-
put values are then substituted directly into the physics-based equations defined
earlier to obtain the acceleration of the particles. The acceleration is then integrated
using any symplectic (that is, energy conserving) integrators to obtain the updated
position and velocity. Thus, in physics-enforced GNNs, the neural network essen-
tially learns the function relating the position and velocity to quantities such as force
or energy, by training on trajectory. There are some interesting consequences of this
approach as outlined below.
1. Training on trajectory. The training of PEGNNs is purely on the trajectory, that is, the
dynamics of the evolving system. Thus, no a priori knowledge of the functional
forms or exact values of the abstract quantities such as force, energy, or Lagrangian
is required; these are learned directly by the GNN. Interestingly, it has been demon-
strated that the functions learned by these GNNs, that is, the energies or forces, indeed
correspond to the actual ground truth energies and forces. Accord-
ingly, the physics-enforced approach is useful to learn the interaction potentials
or interaction laws directly from the trajectory.
2. Conservation laws. The strong enforcement of the physics guarantees that
PEGNNs strictly follow the conservation laws of energy and momentum, that is, the
governing equations on which they are trained. Thus, the trajectory predicted by
PEGNNs remains stable in terms of the error in energy and momentum, and the
overall energy of the predicted trajectory remains exactly conserved.
This is in contrast to PINNs, where the governing equations are only weakly satis-
fied and no guarantees on the stability of long-term trajectories can be provided.
3. Generalizability. Thanks to the graph architecture, the PEGNNs trained on a
small system can generalize to large system sizes that are orders of magnitude
larger than the training systems. This is due to the fact that the functions are
learned at the node and edge level based on the local environment and hence they
are agnostic to the overall size of the graph.
4. Interpretability. The learned functions can be meaningfully interpreted in PEG-
NNs to understand the functional forms of the equations relating the observables
such as positions and velocities. Thus, PEGNNs can be used to discover interac-
tion laws, governing the dynamics of systems, directly from their trajectory.
Figure 13.11 shows the architecture of the Hamiltonian graph neural network. The
physical system is modeled as an undirected graph G = (V, E) with nodes as particles
and edges as connections between them. For instance, in an n-ball-spring system, the
balls are represented as nodes and springs as edges. The raw node features are t_p (type
of particle) as a one-hot encoding, x, and ẋ, and the raw edge feature is the distance,
d = ||x_j − x_i||, between two particles i and j. A notable difference of the HGNN
architecture from previous works is the presence of global and local features—local
features participate in message passing and contribute to quantities that depend on
topology, while global features do not take part in message passing. In HGNN, the
position x and velocity ẋ are employed as global features for a node, while d and t_p are
used as local features.

Fig. 13.11 The architecture of Hamiltonian graph neural network

An l-layer message-passing GNN, which takes an embedding of the node and
edge features created by MLPs as input, is used as the graph architecture. The local
features participate in message passing to create an updated embedding for both
the nodes and edges. The final representations of the nodes and edges, z_i and z_ij,
respectively, are passed through MLPs to obtain the Hamiltonian of the system.
The Hamiltonian of the system is predicted as the sum of T and V in the HGNN.
Typically, the potential energy of a system exhibits significant dependence on the
topology of its underlying structure. In order to effectively capture this information,
multiple layers of message passing among interacting particles (nodes) are employed
in HGNN. During the l-th layer of message passing, the node embeddings are itera-
tively updated according to the following expression:

h_i^{l+1} = squareplus( MLP( h_i^l + Σ_{j ∈ N_i} W_V^l · (h_j^l || h_ij^l) ) )    (13.11)

where N_i = {u_j ∈ V | (u_i, u_j) ∈ E} is the set of neighbors of particle u_i, and W_V^l is
a layer-specific learnable weight matrix. h_ij^l represents the embedding of the incoming
edge e_ij on u_i in the l-th layer, which is computed as follows:

h_ij^{l+1} = squareplus( MLP( h_ij^l + W_E^l · (h_i^l || h_j^l) ) )    (13.12)

Similar to W_V^l, W_E^l is a layer-specific learnable weight matrix specific to the edge
set. The message passing is performed over L layers, where L is a hyper-parameter.
The final node and edge representations in the L-th layer are denoted as z_i = h_i^L and
z_ij = h_ij^L, respectively.
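The following is a minimal sketch (PyTorch, dense adjacency for brevity; layer sizes and MLP depths are illustrative and not the authors' implementation) of one message-passing layer in the spirit of Eqs. 13.11 and 13.12, where squareplus is taken as its common definition 0.5·(x + sqrt(x² + b)):

# Minimal sketch: one HGNN-style message-passing layer (illustrative, dense adjacency)
import torch
import torch.nn as nn

def squareplus(x, b=4.0):
    return 0.5 * (x + torch.sqrt(x**2 + b))

class MessagePassingLayer(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.W_V = nn.Linear(2 * dim, dim, bias=False)   # node-side weight W_V^l
        self.W_E = nn.Linear(2 * dim, dim, bias=False)   # edge-side weight W_E^l
        self.mlp_node = nn.Sequential(nn.Linear(dim, dim), nn.SiLU(), nn.Linear(dim, dim))
        self.mlp_edge = nn.Sequential(nn.Linear(dim, dim), nn.SiLU(), nn.Linear(dim, dim))

    def forward(self, h, h_edge, adj):
        # h: (n, dim) node embeddings, h_edge: (n, n, dim) edge embeddings,
        # adj: (n, n) 0/1 adjacency mask
        n, dim = h.shape
        # Eq. 13.11: aggregate W_V . (h_j || h_ij) over the neighbours j of each node i
        pair = torch.cat([h.unsqueeze(0).expand(n, n, dim), h_edge], dim=-1)
        msg = (self.W_V(pair) * adj.unsqueeze(-1)).sum(dim=1)
        h_new = squareplus(self.mlp_node(h + msg))
        # Eq. 13.12: update each edge embedding from W_E . (h_i || h_j)
        hi = h.unsqueeze(1).expand(n, n, dim)
        hj = h.unsqueeze(0).expand(n, n, dim)
        e_new = squareplus(self.mlp_edge(h_edge + self.W_E(torch.cat([hi, hj], dim=-1))))
        return h_new, e_new

layer = MessagePassingLayer(dim=8)
h = torch.randn(5, 8); e = torch.randn(5, 5, 8); adj = (torch.rand(5, 5) > 0.5).float()
h1, e1 = layer(h, e, adj)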


The total potential energy of an n-body system is represented as V = Σ_i v_i +
Σ_ij v_ij. Here, v_i denotes the energy associated with the position of particle i, while
v_ij represents the energy arising from the interaction between particles i and j.
For instance, v_i corresponds to the potential energy of a bob in a double pendu-
lum, considering its position within a gravitational field. On the other hand, v_ij
signifies the energy associated with the expansion and contraction of a spring con-
necting two particles. In the proposed framework, the prediction for v_i is given
by v_i = squareplus(MLP_vi(h_i^0 || x_i)). Similarly, the prediction for the pair-wise
interaction energy v_ij is determined by v_ij = squareplus(MLP_vij(z_ij)).
Finally, the H of the system obtained from the HGNN is substituted into Eq.
13.10 to obtain the acceleration and velocity of the particles. These values are

Fig. 13.12 Evaluation of HGNN on the pendulum, spring, binary LJ, and gravitational systems.
a Predicted and b actual phase space (that is, x_1-position vs. x_2-velocity), predicted with respect
to actual c kinetic energy, d potential energy, and e forces in the 1 (blue square) and 2 (red triangle)
directions of the 5-pendulum system. f Predicted and g actual phase space (that is, 1-position, x_1,
vs. 2-velocity, ẋ_2), predicted with respect to actual h kinetic energy, i potential energy, and j forces
in the 1 (blue square) and 2 (red triangle) directions of the 5-spring system. k Predicted and l actual
positions (that is, x_1 and x_2 positions), predicted with respect to actual m kinetic energy, n pair-wise
potential energy V_ij for the (0–0), (0–1), and (1–1) interactions, and o forces in the 1 (blue square),
2 (red triangle), and 3 (green circle) directions of the 75-particle LJ system. p Predicted and q
actual positions (that is, x_1- and x_2-positions), predicted with respect to actual r kinetic energy, s
potential energy, and t forces in the 1 (blue square) and 2 (red triangle) directions of the gravitational
system

integrated using velocity Verlet, a symplectic integrator, to compute the updated
position. The loss function of HGNN is computed using the predicted and actual
positions at timesteps 2, 3, . . . , T in a trajectory T, which is then back-propagated
to train the MLPs. Specifically, the loss function is as follows:

L = (1/n) Σ_{i=1}^{n} Σ_{t=2}^{T} ( x_i^{T,t} − x̂_i^{T,t} )²    (13.13)

Figure 13.12 shows the results of HGNN on four systems, namely, the n-pendulum,
n-spring, gravitational, and binary Kob–Andersen Lennard-Jones systems.
HGNN learns the dynamics directly from the trajectory in excellent agreement with
the ground truth trajectory. Further, the forces and energies predicted by the HGNN
are also in excellent agreement with the ground truth. This suggests that an HGNN
trained purely on the trajectory can indeed learn the exact forces and energies of each
of the particles in the system without explicit training on them. Thus, HGNN can be
used to learn the interactions within the systems directly from their trajectory.
This can be evaluated further by analyzing the functions learned by the MLPs corre-
sponding to the nodes and edges. Figure 13.13 shows the potential and kinetic
energy functions learned by the HGNN. It is demonstrated that the functions learned by the
HGNN exhibit an exact match with the actual functions. Further, symbolic regres-
sion (SR) can be used to discover the functional forms based on the interpreted data points
of the functions. Table 13.1 shows the functions obtained from SR based on the data
points shown in Fig. 13.13. For most of the functions, the learned functions exhibit
a close match with the original function. The best equation is determined based on
the score function, which is a balance of the complexity and loss. The equation with
the maximum score represents the one with optimal complexity and correspond-
ingly low loss values. Note that the equation obtained from SR also depends on the
hyperparameters and on the number of iterations for which the SR is run. By
increasing the number of iterations, better equations could be discovered (a minimal SR sketch
is given at the end of this section). Altogether,
the PEGNN-based framework can be used to learn the dynamics directly from the

Fig. 13.13 Interpreting the learned functions in HGNN. a Potential energy of the pendulum system
with the 2-position of the bobs. b Kinetic energy of the particles with respect to the velocity for
the pendulum bobs. c Potential energy with respect to the pair-wise particle distance for the spring
system. d The pair-wise potential energy of the binary LJ system for 0–0, 0–1, and 1–1 types of
particles. The results from HGNN are shown with markers, while the original functions are shown
as dotted lines

Table 13.1 Original equation and the best equation discovered by symbolic regression based on
the score for different functions. The loss represents the mean squared error between the data points
from HGNN and the predicted equations
Functions          Original equation                        Discovered equation                      Loss           Score
Kinetic energy     T_i = 0.5 m|ẋ_i|^2                       T_i = 0.500 m|ẋ_i|^2                     7.96 × 10^−10  22.7
Harmonic spring    V_ij = 0.5 (r_ij − 1)^2                  V_ij = 0.499 (r_ij − 1.00)^2             1.13 × 10^−9   3.15
Binary LJ (0–0)    V_ij = 2.0/r_ij^12 − 2.0/r_ij^6          V_ij = 1.90/r_ij^12 − 1.95/r_ij^6        0.00159        2.62
Binary LJ (0–1)    V_ij = 0.275/r_ij^12 − 0.786/r_ij^6      V_ij = 2.33/r_ij^9 − 2.91/r_ij^8         3.47 × 10^−5   5.98
Binary LJ (1–1)    V_ij = 0.216/r_ij^12 − 0.464/r_ij^6      V_ij = 0.215/r_ij^12 − 0.464/r_ij^6      1.16 × 10^−5   5.41

trajectory and further to interpret the learned dynamics and the abstract quantities
governing it. Finally, it can also be used to scale to large system sizes, thus making
the framework an ideal candidate to learn the dynamics from ab initio data, which
can then be used in classical molecular dynamics simulations.
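As referenced before Table 13.1, the symbolic regression step can be run with an off-the-shelf package. The following is a minimal, hedged sketch assuming the PySR package (which requires a working Julia backend); the distance and energy arrays are placeholders standing in for the pair distances and pair-wise energies read off the trained HGNN edge MLP, and the operator choices are illustrative:

# Minimal, hedged sketch: symbolic regression on interpreted pair-wise energies (assumes PySR)
import numpy as np
from pysr import PySRRegressor

r = np.linspace(0.8, 2.5, 200).reshape(-1, 1)   # pair-wise distances (placeholder)
v = 2.0 / r**12 - 2.0 / r**6                    # placeholder "learned" pair energies

model = PySRRegressor(
    niterations=100,
    binary_operators=["+", "-", "*", "/"],
    unary_operators=["square", "cube"],
    model_selection="best",   # pick the equation balancing complexity and loss
)
model.fit(r, v.ravel())
print(model)                  # table of candidate equations with loss and score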

13.5 Summary

In conclusion, this chapter has provided an in-depth exploration of machine learning
techniques for materials simulations, focusing on three key areas: machine learned
interatomic potentials, physics informed machine learning for continuum simula-
tions, and physics-enforced graph neural networks. These approaches offer inno-
vative solutions to address the challenges in accurately modeling and simulating
complex materials systems. Machine learned interatomic potentials have emerged as
powerful tools for capturing the intricate interactions between atoms in materials. By
leveraging ML algorithms, these potentials can effectively reproduce the behavior of
complex materials, enabling efficient and accurate simulations. The development of
accurate and transferable interatomic potentials holds great promise for accelerating
materials discovery and design, as they can scale to almost all elements of the
periodic table and combinations thereof. Physics-informed machine learning for
continuum simulations combines the power of machine learning with the governing
physical equations to enhance the accuracy and efficiency of simulations. By incor-
porating domain-specific knowledge and constraints, these models can learn from
limited data and provide accurate predictions over a wide range of conditions. This

approach bridges the gap between data-driven machine learning and physics-based
modeling, enabling the simulation of large-scale systems with improved accuracy
and reduced computational cost. Finally, physics-enforced graph neural networks
have shown great potential in capturing the underlying dynamics of materials sys-
tems represented as graphs. By incorporating physical principles and constraints
into graph neural networks, these models can effectively learn the interactions while
ensuring that the governing laws are strictly satisfied. Thanks to their interpretability and gen-
eralizability, they hold promise for enabling quantum-accuracy simulations at much
larger length scales.
Collectively, these machine learning approaches for materials simulations offer
novel and powerful tools for advancing our understanding of materials behavior,
accelerating materials discovery, and enabling the design of materials with tailored
properties. By combining data-driven machine learning techniques with the underly-
ing physics and chemistry of materials, materials scientists can unlock new insights,
overcome computational challenges, and drive innovation in materials science and
engineering. Looking ahead, future research should focus on further refining and
expanding these machine learning techniques, exploring their applicability to new
materials systems, and addressing challenges such as interpretability, robustness,
and scalability. Additionally, interdisciplinary collaborations between materials sci-
entists, data scientists, and computational physicists will be crucial in pushing the
boundaries of machine learning for materials simulations and realizing its full poten-
tial in revolutionizing materials research and development.

References

1. G.N. Simm, M. Reiher, Error-controlled exploration of chemical reaction networks with gaus-
sian processes. J. Chem. Theory Comput. 14(10), 5238–5248 (2018)
2. S.J. An, J. Li, C. Daniel, D. Mohanty, S. Nagpure, D.L. Wood III., The state of understanding
of the lithium-ion-battery graphite solid electrolyte interphase (SEI) and its relationship to
formation cycling. Carbon 105, 52–76 (2016)
3. G. Reddy, Z. Liu, D. Thirumalai, Denaturant-dependent folding of GFP. Proc. Natl. Acad. Sci.
109(44), 17832–17838 (2012)
4. Y. Shu, B.G. Levine, Communication: non-radiative recombination via conical intersection at
a semiconductor defect. J. Chem. Phys. 139(8), 081102 (2013)
5. P. Friederich, F. Häse, J. Proppe, A. Aspuru-Guzik, Machine-learned potentials for next-
generation matter simulations. Nat. Mater. 20(6), 750–761 (2021)
6. T.B. Blank, S.D. Brown, A.W. Calhoun, D.J. Doren, Neural network models of potential energy
surfaces. J. Chem. Phys. 103(10), 4129–4137 (1995)
7. J. Behler, M. Parrinello, Generalized neural-network representation of high-dimensional
potential-energy surfaces. Phys. Rev. Lett. 98, 146401 (2007). https://doi.org/10.1103/PhysRevLett.98.146401
8. B. Onat, C. Ortner, J.R. Kermode, Sensitivity and dimensionality of atomic environment rep-
resentations used for machine learning interatomic potentials. J. Chem. Phys. 153(14), 144106
(2020). https://doi.org/10.1063/5.0016005
9. L. Himanen, M.O. Jäger, E.V. Morooka, F. Federici Canova, Y.S. Ranawat, D.Z. Gao, P. Rinke,
A.S. Foster, DScribe: library of descriptors for machine learning in materials science. Comput.
Phys. Commun. 247, 106949 (2020). https://doi.org/10.1016/j.cpc.2019.106949
10. M. Rupp, A. Tkatchenko, K.-R. Müller, O.A. von Lilienfeld, Fast and accurate mod-
eling of molecular atomization energies with machine learning. Phys. Rev. Lett. 108,
058301 (2012). https://doi.org/10.1103/PhysRevLett.108.058301
11. F. Faber, A. Lindmaa, O.A. von Lilienfeld, R. Armiento, Crystal structure representations for
machine learning models of formation energies. Int. J. Quantum Chem. 115(16), 1094–1101
(2015). https://doi.org/10.1002/qua.24917
12. J. Behler, Atom-centered symmetry functions for constructing high-dimensional neural net-
work potentials. J. Chem. Phys. 134(7), 074106 (2011). https://doi.org/10.1063/1.3553717
13. A.P. Bartók, R. Kondor, G. Csányi, On representing chemical environments. Phys. Rev. B
87, 184115 (2013). https://doi.org/10.1103/PhysRevB.87.184115
14. H. Huo, M. Rupp, Unified representation of molecules and crystals for machine learning (2017),
arXiv:1704.06439
15. S. Hajinazar, A. Thorn, E.D. Sandoval, S. Kharabadze, A.N. Kolmogorov, Maise: construction
of neural network interatomic models and evolutionary structure optimization. Comput. Phys.
Commun. 259, 107679 (2021)
16. A.H. Larsen, J.J. Mortensen, J. Blomqvist, I.E. Castelli, R. Christensen, M. Dulak, J. Friis,
M.N. Groves, B. Hammer, C. Hargus, et al.: The atomic simulation environment–a python
library for working with atoms. J. Phys. Condens. Matter 29(27), 273002 (2017)
17. I.S. Novikov, K. Gubaev, E.V. Podryabinkin, A.V. Shapeev, The MLIP package: Moment tensor
potentials with MPI and active learning. Mach. Learn. Sci. Technol. 2(2), 025002 (2020)
18. Y. Mishin, Machine-learning interatomic potentials for materials science. Acta Mater. 214,
116980 (2021)
19. G.P. Pun, R. Batra, R. Ramprasad, Y. Mishin, Physically informed artificial neural networks
for atomistic modeling of materials. Nat. Commun. 10(1), 1–10 (2019)
20. S.L. Brunton, J.L. Proctor, J.N. Kutz, Discovering governing equations from data by sparse
identification of nonlinear dynamical systems. Proc. Natl. Acad. Sci. 113(15), 3932–3937
(2016)
21. M. Schmidt, H. Lipson, Distilling free-form natural laws from experimental data. Science
324(5923), 81–85 (2009)
22. M. Cranmer, A. Sanchez Gonzalez, P. Battaglia, R. Xu, K. Cranmer, D. Spergel, S. Ho, Dis-
covering symbolic models from deep learning with inductive biases. Adv. Neural Inf. Process.
Syst. 33 (2020)
23. G.E. Karniadakis, I.G. Kevrekidis, L. Lu, P. Perdikaris, S. Wang, L. Yang, Physics-informed
machine learning. Nat. Rev. Phys. 3(6), 422–440 (2021)
24. Y.D. Zhong, B. Dey, A. Chakraborty, Benchmarking energy-conserving neural networks for
learning dynamics from data, in Learning for Dynamics and Control, PMLR, 2021, pp. 1218–
1229
25. S. Bishnoi, R. Bhattoo, S. Ranu, N. Krishnan, Enhancing the inductive biases of graph neural
ode for modeling dynamical systems (2022), arXiv:2209.10740
26. R. Bhattoo, S. Ranu, N.A. Krishnan, Learning the dynamics of particle based systems with
Lagrangian graph neural networks. Mach. Learn. Sci. Technol. (2023)
27. R. Bhattoo, S. Ranu, N.A. Krishnan, Learning articulated rigid body dynamics with Lagrangian
graph neural network, in Advances in Neural Information Processing Systems (2022)
28. A. Thangamuthu, G. Kumar, S. Bishnoi, R. Bhattoo, N.A. Krishnan, S. Ranu, Unravelling the
performance of physics-informed graph neural networks for dynamical systems. in Thirty-sixth
Conference on Neural Information Processing Systems Datasets and Benchmarks Track (2022)
Chapter 14
Image-Based Predictions

Abstract This chapter explores the application of machine learning (ML) algo-
rithms to image-based data for a comprehensive understanding of materials. The
focus is on various aspects, including the investigation of structure-property rela-
tionships, prediction of ionic conductivity, accelerated property prediction through
the combination of finite element analysis and image-based modeling, and the use
of molecular dynamics and image-based modeling to predict crack propagation in
atomic systems. Additionally, the chapter discusses the use of neural operators to effi-
ciently learn stress and strain fields from limited ground truth data. The integration
of ML algorithms with image-based data has shown promising results in advancing
materials science, enabling deeper insights into material behavior and accelerating
property prediction. Future directions involve the development of more advanced
neural operator frameworks, integration with quantum mechanics, exploration of
complex material systems, and incorporation of experimental techniques. Overall,
the application of ML algorithms to image-based data offers exciting opportuni-
ties for materials design and optimization, paving the way for the discovery of novel
materials with tailored properties and improved performance in various applications.

14.1 Introduction

Images constitute a majority of the information on materials' structure and properties.
Crystal structure, microstructure, texture, microcracks, phase separation, and many
other structural features of materials are analyzed and understood through images.
Further, the deformation and failure mechanisms of materials, such as crack propaga-
tion, plasticity, creep, spalling, fatigue, and fracture have all been understood either
through in situ or postmortem analysis of images, in addition to other means. Thus,
images form the key element in discerning the structure of the material which even-
tually controls its property. Although traditional analysis of images are carried out
by domain experts, recent success in computer vision suggests that many of the
problems in materials could be tackled using ML techniques.


Here, we will discuss several ML techniques used to analyze images for
understanding and predicting material structure, properties, and responses. Although
the various steps involved in this process are similar to those in the case of traditional ML-
based property prediction as outlined in Chap. 10, the algorithms used in this case vary
significantly. Specifically, images represent unstructured data with a large number
of dimensions to process. For instance, a 64 × 64 pixel image has 4096 pixels, each
representing a unique input feature. This is extremely large for a traditional MLP
to handle. Further, the order in which the pixels are organized does convey
meaningful structural information. To address this challenge, convolutional neural
networks (CNNs) are widely used, which significantly reduce the input dimension
through operations such as convolution and pooling (Fig. 14.2).
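The following is a minimal sketch (PyTorch; layer sizes are illustrative) of how convolution and pooling progressively shrink a 64 × 64 greyscale micrograph before a small fully connected head predicts a single property value:

# Minimal sketch: a small CNN mapping 64 x 64 greyscale micrographs to one property value
import torch
import torch.nn as nn

class MicrographCNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),   # 64 -> 32
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),  # 32 -> 16
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),                                                  # global average pooling
        )
        self.head = nn.Sequential(nn.Flatten(), nn.Linear(64, 32), nn.ReLU(), nn.Linear(32, 1))

    def forward(self, x):
        return self.head(self.features(x))

model = MicrographCNN()
batch = torch.randn(8, 1, 64, 64)   # 8 greyscale 64 x 64 images
print(model(batch).shape)           # torch.Size([8, 1]): one property per image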
The broad approach used in materials for image-based property prediction is
outlined in Fig. 14.1. This involves the preprocessing of the images to make the

Fig. 14.1 Computer vision based approach for predicting material properties. Reprinted with
permission from [1]

Fig. 14.2 Applications of computer vision toward various downstream tasks in materials domain

dataset consistent, for instance, converting them to greyscale, resizing to the same pixel dimensions, etc.
Following this, the architecture of the CNN is finalized and the model is trained
using the training set. The hyperparametric optimization may then be carried out
using the validation set or by k-fold cross validation. Finally, the performance of
the model is evaluated using the test set. In this chapter, we will discuss several ML
algorithms that exploit image-based data for improved understanding of materi-
als. These include the understanding of structure-property linkages, predicting the
ionic conductivity, combining finite element analysis with image-based modeling
for accelerated property prediction, and combining molecular dynamics with image-
based modeling for predicting crack propagation in atomic systems. Finally, we also
discuss neural operators that learn the stress and strain fields efficiently from sparse
ground truth data.

14.2 Structure–Property Prediction Using CNN

Understanding structure–property relationships has been an outstanding problem
in materials engineering. To this extent, various strategies have been employed, as
shown in Fig. 14.3. Note that the input features are represented by x and the output
is represented by y. Thus, the ML model aims to predict the property y as a function of
the known features x. Figure 14.3a shows the most classical approach of extracting
hand-crafted features, such as 'number of voids', 'number of grains', 'average area
of voids', and 'average length of grain boundaries', by a domain expert through manual
analysis of the image. These features are selected in such a way as to express
a large amount of information about the structure in a quantitative and terse fashion.
These features are generally averaged over a large area or number of samples to
ensure statistical relevance and are then used in simple approaches such as ordinary
least square (OLS) regression to predict the target property. Note that such approaches
have high interpretability, but poor generalizability. More importantly, the success
of such methods strongly rely on the intuition of the domain expert.
Improving on this approach, Fig. 14.3b shows the use of more sophisticated, but
still handcrafted, features such as the two-point correlation functions [3–5]. In this
approach, the local states h(r) = {h_1(r), h_2(r), ..., h_n(r)} are defined, where r is the
position vector. Note that the values of the position-dependent features h_i(r) are evalu-
ated directly from the microscope images by image processing and phase extraction.
Note that any spatial information, such as chemical composition, crystal structure,
and crystal orientation, can be provided as h_i(r) so long as the necessary conditions
0 ≤ h_i(r) ≤ 1 and Σ_i h_i(r) = 1 are satisfied. These conditions ensure that the mean
and variance associated with each feature are comparable and in the same interval.
In case this is not satisfied, feature normalization should be carried out to satisfy
these conditions. Based on these features, a two-point correlation function f_np(r)
is defined as

f_np(r) = J⁻¹[ J[h_n(r)]* J[h_p(r)] ]    (14.1)



Fig. 14.3 Comparison of strategies for structure-property linkage; a, b conventional scheme


and c CNN (present scheme). Black and orange arrows indicate manual proceedings and machine
learning, respectively. (OLS: Ordinary least squares regression, FFT: fast Fourier transformation,
PCA: principal component analysis, conv: convolutional layer, GAP: global average pooling, FC:
fully connected layers.). Reprinted with permission from [2]

where n and p denote the two states between which the two-point correlation is
computed, J[·] denotes the Fourier transformation with respect to the position r, and
[·]* the complex conjugate. Once the two-point correlation functions are obtained,
dimensionality reduction techniques such as PCA can be employed to reduce the
size of the input dimension. Specifically, the top n principal components can be
selected and used as input features based on their variance. These features can then
be used in OLS or other regression techniques to predict the target property y. These
techniques suffer from the curse of dimensionality, and reducing it using PCA or other
approaches results in a loss of information. Moreover, the feature engineering in
this approach is still manual and depends on the skill of the domain expert.
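As a concrete illustration of Eq. 14.1, the following is a minimal numpy sketch (with a random two-phase microstructure standing in for a segmented micrograph) that computes FFT-based two-point correlations:

# Minimal sketch: FFT-based two-point correlation of a binary microstructure (numpy only)
import numpy as np

def two_point_correlation(h_n, h_p):
    """f_np = IFFT( conj(FFT(h_n)) * FFT(h_p) ), normalized by the number of pixels."""
    F_n = np.fft.fftn(h_n)
    F_p = np.fft.fftn(h_p)
    f_np = np.fft.ifftn(np.conj(F_n) * F_p).real / h_n.size
    return np.fft.fftshift(f_np)   # center the zero-shift vector for visualization

# Example: random two-phase microstructure (1 = phase A, 0 = phase B/void)
micro = (np.random.rand(64, 64) > 0.5).astype(float)
auto_AA = two_point_correlation(micro, micro)          # phase A autocorrelation
cross_AB = two_point_correlation(micro, 1.0 - micro)   # A-B cross-correlation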

These issues can be addressed using the CNN approach as illustrated in Fig. 14.3c.
In this case, no a priori feature engineering is required—representative features are
extracted from the raw images by the CNN while passing through the convolutional
layers. These features are then used to predict the properties through a fully connected
layer. In other words, CNN predicts the property y directly from the raw images
in an end-to-end fashion without any intermediate intervention in terms of feature
engineering. However, it should be noted that the features learnt by the CNN may
not be easily interpretable by humans. To this extent, post hoc approaches such as
integrated gradients, gradient SHAP, or other interpretability algorithms may be used
[2, 6–8].

14.2.1 Predicting the Ionic Conductivity

Exploiting CNNs for property prediction directly from images can be
extremely useful for materials researchers. A CNN properly trained for a prop-
erty based on the microstructure can then predict the property for new microstructures
directly from the image. Kondo et al. [2] employed this approach to predict the ionic
conductivity of yttria-stabilized zirconia (YSZ) based on their microstructures. The first
column of Fig. 14.4 shows the microstructures of different YSZ samples obtained by vary-
ing the sintering temperature (1400 °C, 1440 °C, and 1480 °C) and sintering time
(1, 5, 10, and 30 hours). The y value represents the ionic conductivity of these samples
in mS/cm. By training with these images (after applying some image augmentation
techniques such as rotation, cropping, and flipping) as inputs, the CNN was able to predict
the ionic conductivity on an unseen dataset reasonably well. It is worth noting that only seven
original images were used to train the CNN, which consists of 70,000 parameters
that are learned during the training.
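The kind of image augmentation mentioned above can be expressed, for example, with torchvision transforms; the following minimal sketch (crop size and rotation range are illustrative) applies rotation, cropping, and flipping to greyscale micrographs before training:

# Minimal sketch: image augmentation for micrographs (assumes PyTorch/torchvision)
from torchvision import transforms

augment = transforms.Compose([
    transforms.Grayscale(num_output_channels=1),   # ensure single-channel input
    transforms.RandomRotation(degrees=90),         # random rotation
    transforms.RandomCrop(size=(224, 224)),        # random crop to a fixed size
    transforms.RandomHorizontalFlip(p=0.5),        # random flip
    transforms.ToTensor(),
])
# `augment(img)` can be applied to each PIL micrograph every epoch, effectively
# enlarging a dataset that contains only a few original images.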
In order to further understand the features learnt by the CNN, a feature visualiza-
tion method similar to CAM [9] and Grad-CAM [10] was employed. Specifically,
this approach identifies the feature maps, which are then used to define masking maps.
The masking maps hide irrelevant features that have little or no role in governing
the ionic conductivity, which is the output property. Figure 14.4 shows the regions
governing low and high ionic conductivity as learnt by the CNN. Specifically, the
blue and red masks represent the regions ignored by the CNN when predicting low
and high ionic conductivity, respectively. Thus, low ionic conductivity YSZ is char-
acterized by increased voids, while high ionic conductivity YSZ contains fewer crystal defects. This
observation is consistent with the experimental results wherein the ionic conductivity
decreases with decreasing sintered density [11]. Altogether, CNNs can be used to
capture the structure–property relationships in a reasonable fashion from very few
data points.

Fig. 14.4 The first column contains the input images, and the second and third columns are the masked
maps for low and high ionic conductivity, respectively. The blue and red regions are locations that
are ignored by the CNN for low and high ionic conductivity, respectively. Reprinted with permission
from [2]

14.2.2 Predicting the Effective Elastic Properties of Composites

Elastic properties of composite materials are highly dependent on the arrangement
of the constituent materials at multiple length scales. The effective properties of
the composite are governed by its complex hierarchical microstructure. To predict
the effective elastic properties of a three-dimensional (3-D) microstructure, Cecen et
al. [1] employed a 3-D CNN. The approach aimed to learn the salient features of the
material microstructures that lead to good predictive performance for the effective
property of interest. The microstructure was given as the input, while the target
output included the Young’s modulus and Poisson’s ratio. Figure 14.5 shows three
different microstructures having two materials A and B (note that B represents void
in this case) and their corresponding two-point correlation functions. To benchmark
the predictions, first, two-point correlation functions along with a simple third order
polynomial regression were used. PCA was used to reduce the dimensionality of
the two-point correlation functions by selecting the first 13 features. The prediction
was found to yield reasonable accuracy. Further, a 3-D CNN was trained on the
data to develop a model to predict the Young’s modulus and the Poisson’s ratio of
the composite. The 3-D CNN model exhibited improved accuracy. To investigate
whether the 3-D CNN was able to learn additional features beyond the 2-point statistics, a

Fig. 14.5 Visualizations of three microstructures that show clearly contrasting architectural features
(top row), and their spatial statistics (bottom row). The overall trends in the structures are reflected
in the isosurface contours of 2-point statistics. Reprinted with permission from [1]

polynomial fit using a concatenated feature set including the CNN features and the 2-
point correlation features was used (a total of 256 + 132651 = 132907 features). This
unified model exhibited significantly improved performance with an error reduction
by almost 50% in comparison to the 3-D CNN approach. It is interesting to note that
both the 2-point statistic and CNN had unique information which complemented
each other in the combined model to yield an improved performance in comparison
to each of the individual models. Overall, the work demonstrated that 3-D CNN can
be effectively used to predict the elastic properties of composite materials based on
the 3-D microstructure, which can then be used to extrapolate for new microstructures
[1].
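The two-point-statistics benchmark described above can be sketched with scikit-learn as follows; the array names, sizes, and synthetic data are illustrative and not from the original study:

# Minimal sketch: PCA on flattened two-point statistics + third-order polynomial regression
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline

# two_point_stats: (n_samples, n_voxels) flattened 2-point correlations (synthetic here)
# youngs_modulus: (n_samples,) effective property, e.g., from finite element analysis
two_point_stats = np.random.rand(100, 31 * 31 * 31)
youngs_modulus = np.random.rand(100)

model = make_pipeline(
    PCA(n_components=13),           # keep the leading principal components
    PolynomialFeatures(degree=3),   # third-order polynomial in the PC scores
    LinearRegression(),
)
model.fit(two_point_stats, youngs_modulus)
predicted = model.predict(two_point_stats)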
Several other works have also followed similar approaches either with minor
variations or for different kinds of materials such as steel, ballistic composites, and
porous materials and have demonstrated that image-based approaches can be used
effectively to predict composite properties [12–17]. It has been further extended
to predict plastic properties [18] as well as the entire stress versus strain curve of
the composite materials [19, 20]. Note that the CNN model can also be potentially
extended to develop tailored microstructures with targeted elastic properties through
topology optimization [21] which can be potentially realized through a 3D printing
or additive manufacturing approach [14, 22].

14.3 Combining CNN with Finite Element Modeling

Unlike homogeneous materials with uniform composition, heterogeneous materials
are composed of distinct constituents. Establishing an explicit relationship between
the macroscale mechanical property and the microstructure of heterogeneous mate-
rials is known to be challenging. Machine learning methods, particularly deep neural
networks, have emerged as powerful tools for uncovering hidden patterns and cor-
relations from large datasets. A recent work [23] presents an approach to implicitly
map the effective mechanical property of heterogeneous materials to their mesoscale
structure. The proposed method is demonstrated using shale as a case study. Shale
is a complex composite with multiple mineral constituents distributed randomly,
each with significantly varying mechanical properties. By employing a stochastic
reconstruction algorithm, a considerable number of shale samples are generated
from mesoscale scanning electron microscopy images. Image processing techniques
are then utilized to convert these images into finite element models. The effective
mechanical properties of the shale samples are evaluated using finite element anal-
ysis. Following this, a convolutional neural network is trained on the images of the
stochastic shale samples and their corresponding effective moduli. The trained net-
work is successfully validated, demonstrating its ability to accurately and efficiently
predict the effective moduli of real shale samples. Importantly, the proposed method
can be extended to predict the effective mechanical properties of other heterogeneous
materials, making it a valuable tool for materials research and engineering (Figs. 14.6
and 14.7).

Fig. 14.6 Materials discovery flowchart. Reprinted with permission from [23]

Fig. 14.7 Convolutional neural network for predicting the modulus of shale. Reprinted with
permission from [23]

14.4 Combining Molecular Dynamics and CNN for Crack Prediction

The understanding of fracture, a critical process in assessing the integrity and sustain-
ability of engineering materials, can be greatly enhanced through advanced machine
learning techniques. Molecular simulation is a powerful tool that allows atomic-level
information on crack propagation to be captured. Traditional approaches to studying
brittle fracture in solids have relied on continuum mechanics modeling methods, such
as the extended finite element method (XFEM), phase field modeling, and cohesive
zone modeling (CZM), among others. These methods aim to estimate fragmentation
patterns and fracture dynamics. However, the dynamic propagation of cracks in brit-
tle materials involves atomistic bond breaking, which necessitates in-depth analysis
using atomistic-level modeling. Unfortunately, incorporating atomistic details into
continuum mechanics models is challenging due to the assumption of a continuum
and the lack of explicit information about chemical bond behavior. While atomistic
models offer sophistication and predictability, they are computationally expensive
and not conducive to rapid material performance predictions. This limitation ham-
pers their effective use in material optimization, particularly when the atomic scale
serves as the fundamental design parameter (Figs. 14.8 and 14.9).
Combining molecular simulation with a physics-based data-driven multiscale
model can be a powerful tool to predict fracture processes [24] in a computation-
ally efficient fashion. By employing atomistic modeling and an innovative image-
processing technique, the researchers compiled a comprehensive training dataset
comprising fracture patterns and toughness values for various crystal orientations.
The predictive power of the machine-learning model was extensively evaluated,
demonstrating excellent agreement not only in computed fracture patterns but also
in fracture toughness values, even under both mode I and mode II loading condi-
tions. Additionally, the model’s capability was examined to predict fracture patterns
in bicrystalline materials and materials with gradients of microstructural crystal ori-
entation, further confirming its outstanding predictive performance. These results

Fig. 14.8 Combining molecular dynamics with LSTM-CNN for predicting crack propagation in
materials. Reprinted with permission from [24]

Fig. 14.9 Several unseen test cases to evaluate the ML model. Here the crack images of overall
fracture were substituted for the dataset for machine learning. a Prediction of overall fracture
of small-difference bicrystal material. b Prediction of overall fracture of big-difference bicrystal
material. c Prediction of overall fracture of gradient crystal material (such a system could be subject
to optimization, e.g., to maximize fracture toughness or crack path tortuosity). Reprinted with
permission from [24]

highlight the significant potential of the developed model, offering promising appli-
cations in material design and development.
It is worth noting that the data generated from MD simulations are in the form
of atomic positions; hence, converting these into continuous image representations requires
additional data processing. The work presented several approaches for data representa-
tion through a processing method for analyzing MD simulation results. Here, the
discrete atomic information was embedded into image-based data structures. Some
of the advantages of the approach are outlined here. Firstly, a dataset with labels
in matrix form could be automatically constructed from MD simulation results,
reducing the manual efforts required in common supervised learning approaches.
Secondly, the dataset could intuitively incorporate information on the temporal and
spatial behavior of cracking for ML models. Moreover, the approach had the poten-
tial to be well integrated with other simulation methods, such as particle methods,
phase field modeling, CZM, or XFEM, given sufficient geometric information. This
opens up several avenues for future research, integrating multiparadigm modeling
into the neural network framework proposed in this study. Moreover, the training set
could even incorporate both experimental and simulation-based data, enabling the
development of predictive models from a rich and diverse set of raw data.
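As a simple illustration of embedding discrete atomic information into an image-based data structure, the Python sketch below bins 2-D atomic coordinates from a synthetic snapshot onto a regular grid; the box size, grid resolution, and the carved-out initial crack are assumptions for illustration and not the processing pipeline of [24].

import numpy as np

def positions_to_grid(xy, box, nbins=64):
    # Bin 2-D atomic coordinates into a normalized occupancy image
    hist, _, _ = np.histogram2d(
        xy[:, 0], xy[:, 1],
        bins=nbins,
        range=[[0.0, box[0]], [0.0, box[1]]],
    )
    return hist / hist.max()

# Synthetic snapshot: 5000 atoms in a 100 x 100 box with a notch near the left edge
rng = np.random.default_rng(1)
atoms = rng.random((5000, 2)) * 100.0
atoms = atoms[~((atoms[:, 0] < 20.0) & (np.abs(atoms[:, 1] - 50.0) < 2.0))]  # initial crack
frame = positions_to_grid(atoms, box=(100.0, 100.0))
print(frame.shape)  # (64, 64); a time sequence of such frames can feed an LSTM-CNN model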
The primary objective of this study was to not only adopt a scalable machine-
learning model to bypass complex, computationally intensive simulations but also
to predict dynamic fracture paths for different crystalline structures and boundary
conditions. The model’s capability to predict crack patterns under different loading
conditions was demonstrated, offering a general framework for diverse fracture sce-
narios. The results exhibited a good agreement with the trend of crack length in the
distribution of crystalline orientations, indicating the potential for exploring more
complex systems. Future studies could improve the model’s performance by incor-
porating additional data from MD simulation results with complicated geometric
conditions. This approach not only aligns with general machine learning methods
but also provides a generalizable and feasible way to represent data
from MD modeling for AI applications.
It is worth noting that the predictive method solely relied on the geometry and
position of the initial crack to make predictions, providing a highly efficient pro-
cess for modeling this complex physical phenomenon and introducing a new mate-
rial design approach at the nanoscale. The machine-learning algorithm for fracture
mechanics may offer new opportunities for designing engineering materials, such
as high-performance composites, and understanding how these materials respond
to various crack propagation scenarios. Some of the results in the bicrystal cases
exhibited deviations between the MD results and the machine-learning model. This
discrepancy can likely be attributed to the model not having been trained on
MD simulations of bicrystals. Nonetheless, these cases were included to explore the
model’s ability to make adequate predictions for variations in angles or bicrystal
interfaces, and the results were promising, demonstrating the predictive power of
the method to extrapolate beyond the cases included in the training set. Future work
could build upon this by implementing an autonomous retraining (or transfer learn-
ing) approach to expand the training set, if necessary, within a multiscale modeling
setup.
Based on the findings of this study, it may be concluded that the AI-based approach
for predicting fracture patterns and possible toughness opens the door to generative
methods, enabling the reverse engineering of mechanical properties. In future work,
the reported method could be further extended by incorporating adversarial training
to generate crack-insensitive materials based on prior probabilities evaluated from the


training dataset. The algorithm could be utilized for designing composite materials
by optimizing microstructures to achieve specific crack patterns.

14.5 Fourier Neural Operator for Stress-Strain Prediction

Several studies have focused on predicting the stresses and strains of materials
directly from their microstructure. In such cases, CNNs become the natural archi-
tectural choice due to their ability to capture local and global patterns while preserving
translation invariance. Indeed, other approaches based on RNNs and generative
models such as cGANs have also been applied to predict the stress-strain evolution
in materials. However, these approaches have several limitations when applied to the
problem of learning stress and strain fields. The major limitation of these models is
their inability to generalize, thereby failing to make predictions for input settings
unseen by the model. These would include different loading conditions, boundary con-
ditions, or different microstructures, to name a few. Additionally, such pixel-to-pixel
learning-based methods are incapable of resolving higher-resolution inputs unseen
during model training. These challenges can be addressed by employing an operator-
based learning approach.
Fourier Neural Operators (FNO) are a class of neural network architectures that
combine the mathematical framework of Fourier analysis with neural networks to
address operator learning tasks. In FNO, the goal is to learn operators that map
inputs to outputs in a data-driven manner, leveraging the expressive power of neural
networks and the efficiency of Fourier analysis. The architecture of an FNO is shown
in Fig. 14.10. Let us consider an operator L defined on a domain that maps an input

Fig. 14.10 Architecture of the Fourier neural operator. Reprinted with permission from [25]

function u to an output function v, i.e., L : u → v. The FNO framework seeks to learn
an approximate representation of the operator L using a neural network. To achieve
this, FNO introduces a spectral representation of the operator L by decomposing
it into a sum of Fourier modes. The Fourier modes capture the spatial frequency
components of the input function u and provide a compact representation of the
operator.
The FNO architecture consists of two main components: an encoder and a decoder.
The encoder maps the input function u to its spectral representation, which is achieved
by applying a Fourier transform to u. Mathematically, this can be expressed as:

û(ξ) = F[u](ξ)                                        (14.2)

where û(ξ) represents the Fourier coefficients of u at frequency ξ and F[·] denotes
the Fourier transform. The decoder, on the other hand, takes the Fourier coefficients
û(ξ) and maps them to the output function v. This mapping is performed using a
neural network that learns the underlying operator L. Mathematically, the decoder
can be represented as:

v = G[û](ξ)                                           (14.3)

where G[·] denotes the neural network mapping.


To train the FNO model, a dataset of input-output pairs (u_i, v_i) is used. The loss
function is defined to measure the discrepancy between the predicted output G[û_i](ξ)
and the true output v_i. The model parameters are then optimized using gradient-based
methods to minimize the loss and improve the accuracy of the learned operator.
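A minimal sketch of the Fourier-space parameterization that underlies Eqs. (14.2) and (14.3) is shown below for a 1-D problem in PyTorch. The channel width and the number of retained modes are arbitrary choices, and the lifting/projection layers and local linear terms of a full FNO are omitted for brevity.

import torch
import torch.nn as nn

class SpectralConv1d(nn.Module):
    # Minimal 1-D Fourier layer: FFT -> learned complex weights on the lowest modes -> inverse FFT
    def __init__(self, channels, modes):
        super().__init__()
        self.modes = modes
        scale = 1.0 / channels
        self.weights = nn.Parameter(
            scale * torch.randn(channels, channels, modes, dtype=torch.cfloat)
        )

    def forward(self, u):                      # u: (batch, channels, grid)
        u_hat = torch.fft.rfft(u)              # Fourier coefficients of the input function
        out_hat = torch.zeros_like(u_hat)
        out_hat[:, :, : self.modes] = torch.einsum(
            "bix,iox->box", u_hat[:, :, : self.modes], self.weights
        )
        return torch.fft.irfft(out_hat, n=u.size(-1))  # back to physical space

layer = SpectralConv1d(channels=4, modes=8)
u = torch.rand(2, 4, 64)   # two input functions sampled on 64 grid points
v = layer(u)               # output has the same shape as the input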
By leveraging the power of neural networks and the efficiency of Fourier analysis,
Fourier Neural Operators provide an effective framework for learning operators in a
data-driven manner. They have shown promising results in various operator learning
tasks, such as partial differential equations, fluid dynamics, and image processing,
where accurate and efficient representations of operators are crucial for modeling
and analysis. Thus, this suggests a promising framework to learn the stress-strain
evolution of materials as well.
To this extent, a framework based on Fourier Neural Operators (FNO) was pro-
posed to predict the non-linear stress-strain response for a 2D hierarchical compos-
ite [25]. The graphical workflow followed in this study is illustrated in Fig. 14.11. A
binary composite, consisting of two materials with different stiffness, was consid-
ered. The initial geometry was generated in a chequered pattern, where each square
was randomly assigned one of the two materials. To obtain the ground truth data,
finite element (FE) simulations were conducted. The FNO model was then applied to
the randomly generated geometric configurations to predict the stress-strain response
of the composite. This prediction was achieved through operator learning in a super-
vised fashion, with the ground truth data extracted from the FE simulations serving
as the training dataset for the FNO model. The work shows that the predictions by
FNO exhibit excellent agreement with the ground truth observations. Moreover,
the accuracy was obtained with a very small dataset of 1000 images, which is orders
of magnitude less than what is required by CNNs. Further, the trained model exhib-

Fig. 14.11 Employing FNO to predict the stress and strain fields. Reprinted with permission
from [25]

ited generalizability to unseen conditions such as different pixel resolution, different


microstructure resolution, different loading conditions and different boundary con-
ditions. Figure 14.12 shows the performance of an FNO trained on a chessboard
geometry tested on completely new microstructures with varying fractions of stiff and
soft components. Further, the new microstructures have a different resolution than the
original training data. The predictions of the strains by FNO exhibit excellent agree-
ment with the ground truth for the unseen conditions. These results suggest that FNO
can be a strong candidate to develop a general model that learns constitutive relation-
ships directly from simulated data and can then be used in multiscale models, thanks to its
super-resolution capability.

14.6 Summary

This chapter focused on the application of machine learning (ML) algorithms to


image-based data for enhanced understanding of materials. Various topics were cov-
ered, including the investigation of structure-property relationships, prediction of
ionic conductivity, integration of finite element analysis with image-based modeling
for accelerated property prediction, and the combination of molecular dynamics with
image-based modeling for predicting crack propagation in atomic systems. Addition-
ally, the chapter explored the use of neural operators to efficiently learn stress and
strain fields from limited ground truth data.
The approaches presented in the chapter suggest that by leveraging image-based
data, researchers can gain a deeper understanding of the relationships between mate-
rial structure and properties. The discussed ML algorithms have demonstrated their
effectiveness in various applications, such as predicting ionic conductivity and crack

Fig. 14.12 The model trained on chessboard geometry of soft and stiff units is tested against
arbitrary non-chequered geometries with varying fractions of soft/stiff units. Direct comparison
of strain in y direction predicted by ML model versus FEM shown for three typical examples.
Reprinted with permission from [25]

propagation. The combination of finite element analysis and image-based model-


ing has enabled accelerated property prediction, while the integration of molecular
dynamics and image-based modeling has provided valuable insights into atomic-
scale behavior. Furthermore, the use of neural operators has shown potential in
efficiently learning stress and strain fields, even when limited ground truth data
is available. This opens up new possibilities for modeling and predicting material
behavior with reduced reliance on extensive experimental or simulation data.
It is worth noting that the field of ML in materials science is rapidly evolving, and
there are several exciting directions for future research. One area of interest is the
development of more advanced neural operator frameworks that can handle sparse or
incomplete data with improved accuracy. Additionally, efforts can be made to explore
the integration of ML algorithms with other computational methods, such as quantum
mechanics, to capture a broader range of material behaviors. Furthermore, expanding
the scope of image-based data analysis to include more complex material systems
and multi-scale phenomena would provide valuable insights for materials design and
optimization. Integration with experimental techniques, such as advanced imaging
and characterization methods, could also enhance the reliability and applicability of
ML models.

References

1. A. Cecen, H. Dai, Y.C. Yabansu, S.R. Kalidindi, L. Song, Material structure-property linkages
using three-dimensional convolutional neural networks. Acta Mater. 146, 76–84 (2018)
2. R. Kondo, S. Yamakawa, Y. Masuoka, S. Tajima, R. Asahi, Microstructure recognition using
convolutional neural networks for prediction of ionic conductivity in ceramics. Acta Mater.
141, 29–38 (2017)
3. Y. Jiao, F. Stillinger, S. Torquato, Modeling heterogeneous materials via two-point correlation
functions. ii. algorithmic details and applications. Phys. Rev. E 77(3), 031135 (2008)
4. Y. Jiao, F. Stillinger, S. Torquato, Modeling heterogeneous materials via two-point correlation
functions: basic principles. Phys. Rev. E 76(3), 031110 (2007)
5. Y. Jiao, F. Stillinger, S. Torquato, A superior descriptor of random textures and its predictive
capacity. Proc. Natl. Acad. Sci. 106(42), 17634–17639 (2009)
6. M. Sundararajan, A. Taly, Q. Yan, Axiomatic attribution for deep networks, in International
Conference on Machine Learning (PMLR, 2017), pp. 3319–3328
7. G. Erion, J.D. Janizek, P. Sturmfels, S.M. Lundberg, S.-I. Lee, Improving performance of deep
learning models with axiomatic attribution priors and expected gradients. Nat. Mach. Intell.
1–12 (2021)
8. S.M. Lundberg, S.-I. Lee, A unified approach to interpreting model predictions, in Proceedings
of the 31st International Conference on Neural Information Processing Systems (2017), pp.
4768–4777
9. B. Zhou, A. Khosla, A. Lapedriza, A. Oliva, A. Torralba, Learning deep features for discrim-
inative localization, in Proceedings of the IEEE Conference on Computer Vision and Pattern
Recognition (2016), pp. 2921–2929
10. R.R. Selvaraju, M. Cogswell, A. Das, R. Vedantam, D. Parikh, D. Batra, Grad-cam: visual
explanations from deep networks via gradient-based localization, in Proceedings of the IEEE
International Conference on Computer Vision (2017), pp. 618–626
11. X. Chen, K. Khor, S. Chan, L. Yu, Influence of microstructure on the ionic conductivity of
yttria-stabilized zirconia electrolyte. Mater. Sci. Eng. A 335(1–2), 246–252 (2002)
12. X. Lei, X. Wu, Z. Zhang, K. Xiao, Y. Wang, C. Huang, A machine learning model for predicting
the ballistic impact resistance of unidirectional fiber-reinforced composite plate. Sci. Rep.
11(1) (2021). https://doi.org/10.1038/s41598-021-85963-3
13. G.X. Gu, C.-T. Chen, M.J. Buehler, De novo composite design based on machine learning
algorithm. Extrem. Mech. Lett. 18, 19–28 (2018)
14. G.X. Gu, C.-T. Chen, D.J. Richmond, M.J. Buehler, Bioinspired hierarchical composite design
using machine learning: simulation, additive manufacturing, and experiment. Mater. Horiz.
5(5), 939–945 (2018)
15. C.-T. Chen, G.X. Gu, Machine learning for composite materials. MRS Commun. 9(2), 556–566
(2019)
16. J. Zhang, Y. Li, T. Zhao, Q. Zhang, L. Zuo, K. Zhang, Machine-learning based design of digital
materials for elastic wave control. Extrem. Mech. Lett. 48, 101372 (2021).
https://doi.org/10.1016/j.eml.2021.101372
17. O. Keles, Y. He, B. Sirkeci-Mergen, Prediction of elastic stresses in porous materials using
fully convolutional networks. Scr. Mater. 197 (2021). https://doi.org/10.1016/j.scriptamat.2021.113805
18. D. Abueidda, S. Koric, N. Sobh, H. Sehitoglu, Deep learning for plasticity and thermo-
viscoplasticity. Int. J. Plast. 136 (2021). https://doi.org/10.1016/j.ijplas.2020.102852
19. C. Yang, Y. Kim, S. Ryu, G.X. Gu, Prediction of composite microstructure stress-strain curves
using convolutional neural networks. Mater. Des. 189, 108509 (2020)
20. A. Yamanaka, R. Kamijyo, K. Koenuma, I. Watanabe, T. Kuwabara, Deep neural network
approach to estimate biaxial stress-strain curves of sheet metals. Mater. Des. 195, 108970
(2020). https://doi.org/10.1016/j.matdes.2020.108970
21. H.T. Kollmann, D.W. Abueidda, S. Koric, E. Guleryuz, N.A. Sobh, Deep learning for topology
optimization of 2d metamaterials. Mater. Des. 196, 109098 (2020)
22. Z. Jin, Z. Zhang, K. Demir, G.X. Gu, Machine learning for advanced additive manufacturing.
Matter 3(5), 1541–1556 (2020)
23. X. Li, Z. Liu, S. Cui, C. Luo, C. Li, Z. Zhuang, Predicting the effective mechanical property of
heterogeneous materials by image based modeling and deep learning. Comput. Methods Appl.
Mech. Eng. 347, 735–753 (2019)
24. Y.-C. Hsu, C.-H. Yu, M.J. Buehler, Using deep learning to predict fracture patterns in crystalline
solids. Matter 3(1), 197–211 (2020)
25. M.M. Rashid, T. Pittie, S. Chakraborty, N.A. Krishnan, Learning the stress-strain fields in
digital composites using Fourier neural operator. iScience 25(11) (2022)
Chapter 15
Natural Language Processing

Abstract This chapter provides an overview of the application of natural language


processing (NLP) techniques in materials science. It explores the challenges and
opportunities in information extraction from both text and tables in materials science
literature. The chapter highlights the significance of materials-domain language mod-
els, such as MatSciBERT, in improving topic classification, relation classification,
and question answering in materials science texts. Additionally, it discusses the use
of graph neural networks (GNNs) in extracting material compositions from tables.
The chapter concludes by emphasizing the potential of NLP techniques to enhance
materials science research and presents future outlooks for advancements in the field.
Overall, this chapter showcases the valuable role of NLP in unlocking the wealth of
information embedded in materials science literature.

15.1 Introduction

Natural Language Processing (NLP) is a field of artificial intelligence (AI) that


focuses on the interaction between computers and human language. It involves the
development of algorithms and models that enable computers to understand, inter-
pret, and generate human language meaningfully. NLP has found applications in
various domains, including materials science, where it plays a crucial role in ana-
lyzing and extracting valuable information from textual data related to materials
research (Fig. 15.1).
In the context of materials science, NLP techniques have been employed to pro-
cess and extract knowledge from scientific literature, patents, technical reports, and
other textual resources. These techniques enable researchers to efficiently search
for relevant information, discover patterns, and gain insights into various aspects of
materials properties, synthesis, characterization, and applications. Specifically, most
of the knowledge in the literature is stored in the form of unstructured data, such as
text, or semi-structured data, such as tables. Developing ML models for composition
or property prediction requires such data to be extracted and stored in a machine-
readable and structured form. Moreover, in a research publication, the information


Fig. 15.1 Applications of natural language processing in materials domain

is spread over multiple entities such as tables, text, and images. Thus, information
extraction from research literature requires a multi-pronged approach that extracts
information from all these different entities, combines them in a meaningful manner
(for instance, a knowledge graph), and presents it to the user in an easily accessible
form. To this end, NLP presents a strong and powerful tool.
Although the field of NLP has been around for more than 60 years, applications
of NLP to materials are not more than a decade old. One of the early approaches in
NLP for materials science involves using rule-based methods. These methods rely on
predefined linguistic rules and patterns to extract relevant information from text. For
instance, in materials science, specific rules can be defined to identify and extract
mentions of materials, properties, synthesis methods, or experimental techniques
from scientific articles. While rule-based approaches can be effective for specific
tasks, they often require manual effort in crafting and maintaining the rules, which
can limit their scalability and adaptability to different contexts. In this context, three
major approaches that have resulted in several seminal studies are briefly discussed
below, namely, Word2Vec, BERT, and ChatGPT.
Word2Vec is a popular algorithm in the field of NLP that learns distributed rep-
resentations (word embeddings) of words based on their contextual usage. These
word embeddings capture semantic and syntactic relationships between words,
allowing algorithms to understand and infer meaning from the text. In materials sci-
ence, Word2Vec models have been used to explore relationships between materials,
properties, and synthesis methods. By leveraging the learned word embeddings,
researchers can perform tasks such as similarity analysis, document classification,
and information retrieval in a more meaningful way.
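A minimal sketch of this kind of analysis using the gensim implementation of Word2Vec is shown below; the three toy "abstracts" stand in for a real corpus of tokenized materials science abstracts.

from gensim.models import Word2Vec

# Each "sentence" is a tokenized abstract (toy examples only)
corpus = [
    ["thermoelectric", "zt", "seebeck", "bi2te3", "doping"],
    ["ferroelectric", "perovskite", "batio3", "polarization"],
    ["thermoelectric", "power", "factor", "pbte", "carrier", "concentration"],
]

model = Word2Vec(corpus, vector_size=50, window=5, min_count=1, sg=1, epochs=50)

# Similarity queries over the learned materials-term embeddings
print(model.wv.most_similar("thermoelectric", topn=3))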

BERT is a state-of-the-art deep learning model introduced by Google in 2018.


It utilizes a transformer architecture to generate contextualized word embeddings
by considering the entire sentence context. BERT has significantly advanced the
field of NLP by achieving state-of-the-art performance in various language under-
standing tasks. In materials science, BERT-based models have been applied for
tasks such as named entity recognition (identifying materials, properties, etc., in
text), sentiment analysis of scientific literature, and text classification based on
materials-related topics. BERT’s ability to capture the contextual information of
words has proven valuable in understanding the nuanced language used in mate-
rials science.
ChatGPT, developed by OpenAI, is a powerful language model based on the GPT
(Generative Pre-trained Transformer) architecture. It has been trained on a vast
amount of text from diverse sources and can generate coherent and contextually
relevant responses to user inputs. In the context of materials science, ChatGPT
can be utilized to assist researchers in answering questions, providing explana-
tions, or generating summaries related to materials properties, synthesis routes,
characterization techniques, and more. It serves as a conversational tool to facil-
itate knowledge dissemination and interactive exploration of materials science
concepts.
In this chapter, the use of NLP to meaningfully extract information from the literature
is discussed. This includes information extraction from unstructured data such as text
and semi-structured data such as tables. Further, the use of NLP to extract additional
information such as metadata, image captions, and to develop knowledge graphs are
briefly discussed. Several interesting applications can be found in Refs. [1–12].

15.2 Materials-Domain Language Model

Most of the earlier works on NLP in materials relied on rule-based approaches
such as ChemDataExtractor, which has also been used for predicting phase diagrams
and for generating battery and superconducting materials databases. A seminal work
following this was the use of word vectors to convert semantic queries into vector
algebra, which was further extended to the prediction of thermoelectrics. Following
this, several works focused on the extraction of synthesis recipes, testing protocols,
and processing conditions using rule-based approaches. This was followed by the
development of Matscholar, a comprehensive materials science search and discovery
engine that is able to automatically identify materials, properties, characterization
methods, phase descriptors, synthesis methods, and applications from a given text
through a custom-built named entity recognition (NER) system. A major departure
from these approaches was the development of materials science language models
such as MatBERT and MatSciBERT.
Figure 15.2 shows the overall workflow employed in the training of MatSciBERT.
While existing LMs like BERT and SciBERT have been trained on large datasets,

Fig. 15.2 MatSciBERT training and evaluation workflow. Reprinted with permission from [13]

they do not include materials-related text. To address this gap, the authors collected
research papers from the materials science domain, specifically in the categories of
inorganic glasses and ceramics, metallic glasses, cement and concrete, and alloys.
They queried the Crossref metadata database and obtained a list of articles, then
downloaded papers from the Elsevier Science Direct database. A custom XML parser
was used to extract text from the downloaded articles, including full sections when
available and abstracts otherwise. For specific material categories like concrete and
alloys, relevant papers were identified through manual annotation and the use of SciB-
ERT classifiers. The resulting dataset, called the Material Science Corpus (MSC),
was divided into training and validation sets, with 85% used for LM training and
15% for validation. Using this dataset, MatSciBERT was pretrained employing the
RoBERTa approach for 10 days on two NVIDIA V100 32 GB GPUs. Once the LM is
pretrained, it can be fine-tuned for several downstream tasks such as named entity
recognition, relation classification, and question answering. One of the
major challenges when testing the LM on downstream tasks is the availability of high-
quality datasets that are manually labeled. Developing such datasets can be extremely
useful for evaluating the performance of LMs on materials-domain-specific tasks.
Results suggest that the LM trained on domain-specific text outperforms the
existing LMs such as BERT and SciBERT. This evaluation was carried out on three
tasks, namely, named entity recognition, abstract classification, and relation classifi-
cation. In all three tasks, it was observed that the domain-specific LM outperformed
the generic LMs. Further, some of the applications of the materials domain specific
LMs are discussed below.
1. Document classification: The increasing number of published manuscripts on
materials-related topics presents a challenge in effectively identifying relevant
papers. Traditional approaches like TF-IDF and Word2Vec, coupled with classification
algorithms, have limitations in capturing contextual meaning. However, the
introduction of MatSciBERT, which leverages contextual embeddings, allows for
improved topic classification. MatSciBERT demonstrates higher accuracy (96%)
compared to simple logistic regression based on TF-IDF (90%), as reported in a
previous study. This approach holds potential for accurate classification of docu-
ments from a larger set of abstracts in the materials science literature.
2. Topic modeling: This is an unsupervised technique that aims to group documents with
similar topics together. Traditionally, algorithms like latent Dirichlet allocation
(LDA) are combined with TF-IDF or Word2Vec to cluster documents based on the
frequency or embeddings of words. However, these approaches lack contextual
information in the clustering process. By utilizing the context-aware embeddings
learned in MatSciBERT, we can greatly enhance the task of topic modeling. This
can be useful to cluster papers with similar topics together and also get an overview
of various topics present in the materials domain or a sub-domain.
3. Information extraction from images: Images contain valuable information
about the structure and properties of materials. Analyzing the captions of these
images can serve as a proxy for identifying relevant images. However, extracting
the relevant keywords from each caption can be a challenging task, as captions
often contain multiple entities. MatSciBERT, which has been fine-tuned on the
Matscholar Named Entity Recognition (NER) dataset, proves to be a useful tool
for extracting information from figure captions. To demonstrate the effectiveness
of MatSciBERT, entities were extracted from around 110,000 image cap-
tions related to inorganic glasses, using the model fine-tuned on the Matscholar
NER dataset. Further, the extracted entities were categorized into different classes,
including descriptors, applications, inorganic materials, material properties, char-
acterization methods, synthesis methods, and symmetry/phase labels. It should be
noted that a single caption can be associated with multiple entities, allowing for a
more comprehensive representation of the information within the image dataset.
The extracted entities can then be used to retrieve relevant images based on spe-
cific queries. For example, queries such as “XRD measurements of glasses used
for coating,” “emission spectra of doped glasses,” or “SEM images of bioglasses
with Ag” can be answered by leveraging the extracted entities. The authors also
presented a comparison between the manual annotations by domain experts and
the entities extracted by the MatSciBERT NER model for a subset of selected
captions. The results demonstrate the ability of the model to extract multiple enti-
ties from each caption, thereby capturing a significant amount of information that
may have been overlooked using traditional approaches (see Fig. 15.3).
4. Relation classification and question answering: MatSciBERT offers potential
for addressing other tasks such as relation classification and question answering in
materials science. The relation classification task demonstrated in this manuscript
provides valuable insights into the sequential aspects of materials science, includ-
ing synthesis and testing protocols, and measurement sequences. By utilizing
MatSciBERT, researchers can discover optimal pathways for material synthesis
or identify new pathways by analyzing the relationships between different steps.
Furthermore, this approach can be employed to investigate the effects of various

Fig. 15.3 Named entities recognised by MatSciBERT. The manual labels identified against each
caption is also included. Reprinted with permission from [13]

testing and environmental conditions on material properties, along with the rel-
evant parameters. Properties such as hardness or fracture toughness, which are
highly sensitive to sample preparation protocols, testing conditions, and equip-
ment used, can benefit from this analysis. MatSciBERT enables the extraction
of information regarding synthesis and testing conditions that may be otherwise
difficult to uncover within the text.
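As a minimal sketch of how such a domain-specific LM can be used in practice, the snippet below loads a pretrained checkpoint with the Hugging Face transformers library and produces contextual token embeddings for a figure caption. The checkpoint name m3rg-iitd/matscibert is assumed here to point to the released MatSciBERT model, and the NER or classification head would be fine-tuned separately as described above.

import torch
from transformers import AutoModel, AutoTokenizer

name = "m3rg-iitd/matscibert"  # assumed Hugging Face hub identifier of MatSciBERT
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModel.from_pretrained(name)

caption = "XRD patterns of Ag-doped bioactive glasses annealed at 600 C"
inputs = tokenizer(caption, return_tensors="pt")
with torch.no_grad():
    hidden = model(**inputs).last_hidden_state  # (1, n_tokens, hidden_size) contextual embeddings

# These token embeddings feed a task head, e.g., an NER classifier fine-tuned on Matscholar labels
print(hidden.shape)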

15.3 Extracting Material Composition from Tables

In materials science articles, important information related to synthesis, characteriza-


tion, and material compositions is typically reported in the text. However, a significant
portion (approximately 85%) of material compositions and associated properties are
reported in tables rather than in the text. This observation highlights the importance
of information extraction (IE) from tables to gain a comprehensive understanding of
a research paper and to expand the coverage of resulting knowledge bases (KBs).
Fig. 15.4 Examples of composition tables: a multi-cell complete-info, b multi-cell partial-info with
caption on top, c single-cell. Reprinted with permission from [14]

However, there are no generic ML-based frameworks that can extract compositions
directly from tables. This challenge in fact introduces a novel natural language processing
(NLP) task: the extraction of materials, their constituents, and their relative
percentages from tables. Developing a model for this task requires addressing several
challenges. Some of these key challenges are described below [14].

Distractor rows and columns: Additional information, such as material properties,
molar ratios, and standard errors, may appear in the same table. For example, in Fig. 15.4a,
the last three rows are distractor rows.
Orientation of tables: Tables can be oriented row-wise or column-wise. The table
in Fig. 15.4a is a column-oriented table.
Different units: Compositions can be in different units such as mol%, weight%,
mol fraction, weight fraction. Some tables express composition in both molar and
mass units.
Material IDs: Authors refer to different materials in their publication by assign-
ing them unique IDs. These material IDs may not be specified every time, (e.g.,
Fig. 15.4c).
Single-cell compositions (SCC): In Fig. 15.4a, all compositions are present in
multiple table cells. Some authors report the entire composition in a single table
cell, as shown in Fig. 15.4c.
Percentages exceeding 100: Sum of coefficients may exceed 100, and
re-normalization is needed. A common case is when a dopant is used; its amount
is reported in excess.
Percentages as variables: Contributions of constituents may be expressed using
variables like x, y.
Partial-information tables: It is also common to have percentages of only some
constituents in the table; the remaining composition is to be inferred based on
paper text or table caption, e.g., Fig. 15.4b. Another example: if the paper is on
silicate glasses, then SiO2 is assumed.
Other corner cases: There are several other corner cases like percentages missing
from the table, compounds with variables (e.g., R2O in the header; the value of R
to be inferred from material ID), and highly unusual placement of information.

Thus, the task of automated extraction of compositions from tables can be formu-
lated as follows. The task involves extracting compositions expressed in a given
table T, along with its caption and the complete text of the publication in which
the table appears. The desired output of the extraction process is a set of tuples
{(id, c_k^id, p_k^id, u_k^id)}, k = 1, ..., K^id, where id represents the material ID as defined by
researchers in the field of materials science. The material ID is used to reference the
composition in text and other tables. Each tuple consists of a constituent element or
compound c_k^id present in the material, the total number of constituents K^id in the
material, the percentage contribution p_k^id > 0 of c_k^id in the composition, and the unit
u_k^id of p_k^id (either mol% or weight%). For example, the desired output tuples
corresponding to ID A1 from Fig. 15.4a are (A1, MoO3, 5, mol%), (A1, Fe2O3, 38, mol%),
and (A1, P2O5, 57, mol%).

Fig. 15.5 Different types of tables in the DisCoMaT framework. Reprinted with permission
from [14]
To address this task, the materials tables are divided into non-composition (NC),
single-cell composition (SCC), multi-cell composition with complete informa-
tion (MCC-CI), and multi-cell composition with partial information (MCC-PI) (see
Fig. 15.5). Further, each row or column is also annotated with one of four labels in the
dataset: ID, composition, constituent, and other. While the training data are created using
distant supervision, the dev and test sets are hand annotated. To extract information from
the tables, the authors proposed a framework named distantly supervised composition
extraction from materials tables (DiSCoMaT). The DiSCoMaT architecture is shown
in Fig. 15.6. The first task is to determine whether the table T is a single-cell
composition (SCC) table, which can be identified based on the presence of multiple numbers
and compounds in single cells. DiSCoMaT utilizes a graph neural network (GNN)
based SCC predictor to classify the table T as an SCC table or not. If it is classified
as an SCC table, the system employs a rule-based composition parser to extract the
compositions. For tables that are not classified as SCC tables, the system employs a
second GNN (referred to as GNN2) to label the rows and columns of the table T as
compositions, material IDs, constituents, or others. If no constituents or composition

Fig. 15.6 DisCoMaT architecture. Reprinted with permission from [14]

predictions are found, the table T is categorized as a non-composition (NC) table. On
the other hand, if constituents or composition predictions are present, it is categorized
as a multi-cell composition (MCC) table. In the case of an MCC table, DiSCoMaT
predicts whether the table contains complete information or if some information is
missing using a partial-information predictor. If the table is determined to have com-
plete information, the predictions made by GNN2 are post-processed to extract the
compositions. If the table is found to have missing information, DiSCoMaT incor-
porates the caption and text of the paper along with GNN2's predictions to perform
the final composition extraction.
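The rule-based parsing of single-cell compositions can be illustrated with the small sketch below; the composition string format, material ID, and default unit are toy assumptions, and the actual DiSCoMaT parser handles many more corner cases (variables, dopants, mixed units).

import re

def parse_single_cell(material_id, cell, unit="mol%"):
    # Parse a string like '5MoO3-38Fe2O3-57P2O5' into (id, constituent, percentage, unit) tuples
    tuples = []
    for coeff, compound in re.findall(r"(\d+(?:\.\d+)?)\s*([A-Za-z][A-Za-z0-9]*)", cell):
        tuples.append((material_id, compound, float(coeff), unit))
    total = sum(t[2] for t in tuples)
    if total and abs(total - 100.0) > 1e-6:  # re-normalize, e.g., when dopants are reported in excess
        tuples = [(i, c, 100.0 * p / total, u) for i, c, p, u in tuples]
    return tuples

print(parse_single_cell("A1", "5MoO3-38Fe2O3-57P2O5"))
# [('A1', 'MoO3', 5.0, 'mol%'), ('A1', 'Fe2O3', 38.0, 'mol%'), ('A1', 'P2O5', 57.0, 'mol%')]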
Results show that DiSCoMaT is able to extract compositions from tables with a
material-level F1 score of 63.53. Thus, although DiSCoMaT presents the first attempt
to extract compositions from tables, further improvements are required to enhance its
performance. The poor performance could be attributed to errors that percolate
through each step of the architecture, as well as to weaker performance on certain
classes of tables, such as partial-information tables.

15.4 Future Directions

With the advent of transformer models such as ChatGPT and GPT-4, the NLP field has
been revolutionized, and many tasks which were thought to be extremely challenging
have become possible with a prompt. However, these language models are still black-
box in nature, and knowing their limitations is crucial for the reliable and secure
development of tools that can be used for practical applications. Accordingly, there
has been a lot of emphasis on data-centric AI, a paradigm shift from model-centric
AI. The development of datasets that probe and aid in understanding the limitations
of such language models is therefore imperative.

Fig. 15.7 MaScQA: A question-answer database on materials domain

To this end, MaScQA is a question-answer database on the materials domain (see
Fig. 15.7). This database of 600 questions and answers covers different types of ques-
tions such as multiple-choice, matching, numerical, theoretical, reasoning-based, and
recollection-based questions. Further, it covers several domains in materials such as
mechanical properties, structure-property relationships, mechanics, and electrical,
optical, and transport properties. Critical analysis of the performance of language
models on such a database can provide insights into the limitations of the models
and the scope for improvement.
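For illustration, a single entry in such a question-answer database could be organized as a simple record like the one below; the field names and the example question are hypothetical and do not reproduce the actual MaScQA schema.

# One MaScQA-style record (illustrative layout only)
record = {
    "question": "A BCC metal has a lattice parameter of 0.33 nm. What is its atomic radius in nm?",
    "question_type": "numerical",
    "domain": "structure of materials",
    "answer": 0.143,  # r = sqrt(3) * a / 4 for a BCC lattice
}
print(record["question_type"], record["answer"])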
Another area of research in NLP would be towards multimodal NLP. Multimodal
NLP is a field that focuses on understanding and processing information from mul-
tiple modalities, such as text, images, and audio, in a unified manner. In the context
of materials science, multimodal NLP becomes particularly relevant as materials
research involves the analysis of various types of data, including textual informa-
tion, images of material structures, and spectroscopic data. By integrating multiple
modalities, multimodal NLP enables a more comprehensive analysis of materials-
related information. For example, combining text and image data allows for a richer
understanding of material properties, structure-property relationships, and synthe-
sis processes. This integration can facilitate tasks such as material classification,
property prediction, and material discovery.
In materials science, textual information provides valuable context and descrip-
tions of experimental procedures, while images offer visual representations of mate-
rial structures and morphologies. By leveraging both modalities, researchers can gain
deeper insights into materials’ behavior and characteristics. Multimodal NLP tech-
niques can extract relevant information from text and images, link them together,
and perform joint analysis to uncover hidden patterns, correlations, and knowledge.
Moreover, multimodal NLP can be used to bridge the gap between different types
of data sources in materials science. For instance, by combining text-based litera-
ture with experimental data and images, researchers can enhance the interpretation
and validation of experimental results, enabling more reliable and efficient material
analysis.
Another open area in NLP would be the development of a materials domain knowl-
edge graph combining information from tables, text, and images. The development of
a materials domain knowledge graph that combines information from tables, text, and
images holds great potential for advancing materials science research. A knowledge
graph is a structured representation of knowledge that captures entities, their proper-
ties, and the relationships between them. By integrating data from diverse modalities,
such as tables, text, and images, we can construct a comprehensive knowledge graph
that encompasses a wide range of materials-related information.
Tables often contain valuable data on material compositions, properties, and exper-
imental results. By extracting information from tables, we can populate the knowl-
edge graph with structured data, enabling efficient querying and analysis. This data
can include material compositions, synthesis methods, characterization techniques,
and various material properties.
Textual information, such as research articles, provides rich descriptions, expla-
nations, and context surrounding materials science. Natural language processing
techniques can be employed to extract relevant information from the text, such as
material synthesis procedures, properties, and relationships. This extracted informa-
tion can then be linked to entities in the knowledge graph, enriching its content and
facilitating semantic connections between different data points.
Images play a crucial role in materials science, providing visual representations
of material structures, morphologies, and properties. Advanced image processing
and computer vision techniques can be applied to extract features and metadata from
images, which can be integrated into the knowledge graph. This allows researchers to
explore the relationships between material structures, properties, and performance,
leveraging both visual and textual information.
By combining information from tables, text, and images, the materials domain
knowledge graph becomes a powerful tool for researchers. It enables holistic explo-
ration and analysis of materials data, facilitating discovery of new materials, identifi-
cation of structure-property relationships, prediction of material behavior, and opti-
mization of synthesis and processing methods. The knowledge graph also supports
data-driven approaches, enabling researchers to navigate and leverage vast amounts
of interconnected materials information.
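A toy sketch of such a knowledge graph, built with the networkx library, is shown below; the entities and relations are illustrative placeholders for what would actually be extracted from text, tables, and image captions.

import networkx as nx

kg = nx.MultiDiGraph()

# Entities extracted from different modalities (toy examples)
kg.add_node("45S5 bioglass", kind="material")
kg.add_node("SiO2-CaO-Na2O-P2O5", kind="composition")    # from a table
kg.add_node("bioactivity", kind="property")              # from text
kg.add_node("SEM image of apatite layer", kind="image")  # from a figure caption

# Relations linking the modalities
kg.add_edge("45S5 bioglass", "SiO2-CaO-Na2O-P2O5", relation="has_composition")
kg.add_edge("45S5 bioglass", "bioactivity", relation="exhibits")
kg.add_edge("SEM image of apatite layer", "45S5 bioglass", relation="characterizes")

# Simple query: everything directly linked to a material
for _, target, data in kg.out_edges("45S5 bioglass", data=True):
    print(data["relation"], "->", target)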

15.5 Summary

This chapter explores the application of natural language processing (NLP) tech-
niques in the field of materials science. It discusses the challenges associated with
information extraction from both text and tables in materials science literature. The
chapter highlights the importance of extracting material compositions, synthesis
methods, and characterization techniques from text, as well as the significance of
extracting material compositions from tables.
The chapter introduces the concept of materials-domain language models, with a
focus on MatSciBERT. These language models are specifically trained on materials
science literature and enable improved topic classification, relation classification, and
question answering. They capture the contextual meaning of materials-related terms
and enhance the understanding of materials science texts. Additionally, the chapter
presents a pipeline approach for information extraction from tables using graph
neural networks (GNNs), namely, DiSCoMaT. The pipeline includes steps such as
identifying different types of tables, labeling rows and columns, and distinguishing
between complete and partial information tables. It discusses the challenges involved
in table analysis and highlights the potential of NLP techniques in addressing these
challenges.
Altogether, the chapter emphasizes the value of NLP techniques in materials
science research. The use of materials-domain language models, such as MatSciB-
ERT, improves the extraction of relevant information from materials science texts
and enhances various downstream tasks. The pipeline approach for table analy-
sis demonstrates the potential of GNNs in handling table structures and extracting
material compositions effectively.
Looking ahead, further advancements in NLP techniques hold great promise for
materials science research. The development of more refined materials-domain lan-
guage models can enhance the understanding of complex materials-related concepts
and facilitate knowledge discovery. Additionally, the application of NLP techniques
to other materials science tasks, such as materials property prediction or materials
design, opens up new avenues for research and innovation. The integration of NLP
with other computational methods, such as computer vision, leading to multimodal
NLP can result in more comprehensive and efficient analysis of materials science
literature.

References

1. C. Manning, H. Schutze, Foundations of Statistical Natural Language Processing (MIT Press, May 1999).
ISBN: 978-0-262-30379-8
2. H. Huo, Z. Rong, O. Kononova, W. Sun, T. Botari, T. He, V. Tshitoyan, G. Ceder, Semi-
supervised machine-learning classification of materials synthesis procedures. npj Comput.
Mater. 5(1), 1–7 (2019). https://doi.org/10.1038/s41524-019-0204-1
3. V. Venugopal, S. Sahoo, M. Zaki, M. Agarwal, N.N. Gosvami, N.M.A. Krishnan, Looking
through glass: knowledge discovery from materials science literature using natural language
processing. Patterns 2(7), 100290 (2021). https://doi.org/10.1016/j.patter.2021.100290
4. E. Kim, Z. Jensen, A. van Grootel, K. Huang, M. Staib, S. Mysore, H.-S. Chang, E. Strubell,
A. McCallum, S. Jegelka, E. Olivetti, Inorganic materials synthesis planning with literature-
trained neural networks. J. Chem. Inf. Model. 60(3), 1194–1201 (2020). https://doi.org/10.1021/acs.jcim.9b00995
5. V. Venugopal, S.R. Broderick, K. Rajan, A picture is worth a thousand words: applying
natural language processing tools for creating a quantum materials database map. MRS
Commun. 9(4), 1134–1141 (2019). https://doi.org/10.1557/mrc.2019.136
6. M.C. Swain, J.M. Cole, ChemDataExtractor: a toolkit for automated extraction of chemical
information from the scientific literature. J. Chem. Inf. Model. 56(10), 1894–1904 (2016).
https://doi.org/10.1021/acs.jcim.6b00207
7. E. Kim, K. Huang, O. Kononova, G. Ceder, E. Olivetti, Distilling a materials synthesis ontology.
Matter 1(1), 8–12 (2019). https://doi.org/10.1016/j.matt.2019.05.011
8. H. Uvegi, Z. Jensen, T.N. Hoang, B. Traynor, T. Aytas, R.T. Goodwin, E.A. Olivetti, Literature
mining for alternative cementitious precursors and dissolution rate modeling of glassy phases.
J. Am. Ceram. Soc. 104(7), 3042–3057 (2021). https://doi.org/10.1111/jace.17631
9. E. Kim, K. Huang, S. Jegelka, E. Olivetti, Virtual screening of inorganic materials synthesis
parameters with deep learning. npj Comput. Mater. 3(1), 1–9 (2017). https://doi.org/10.1038/s41524-017-0055-6
10. V. Tshitoyan, J. Dagdelen, L. Weston, A. Dunn, Z. Rong, O. Kononova, K.A. Persson, G. Ceder,
A. Jain, Unsupervised word embeddings capture latent knowledge from materials science
literature. Nature 571(7763), 95–98 (2019)
11. L. Weston, V. Tshitoyan, J. Dagdelen, O. Kononova, A. Trewartha, K.A. Persson, G. Ceder, A.
Jain, Named entity recognition and normalization applied to large-scale information extraction
from the materials science literature. J. Chem. Inf. Model. 59(9), 3692–3702 (2019)
12. E.A. Olivetti, J.M. Cole, E. Kim, O. Kononova, G. Ceder, T.Y.-J. Han, A.M. Hiszpanski, Data-
driven materials research enabled by natural language processing and information extraction.
Appl. Phys. Rev. 7(4) (2020)
13. T. Gupta, M. Zaki, N.M.A. Krishnan, Mausam, MatSciBERT: a materials domain language model
for text mining and information extraction. npj Comput. Mater. 8(1), 102 (2022)
14. T. Gupta, M. Zaki, N.M.A. Krishnan, et al., DiSCoMaT: distantly supervised composition extraction
from tables in materials science articles (2022). arXiv:2207.01079
Correction to: Machine Learning
for Materials Discovery

Correction to:
N. M. A. Krishnan et al., Machine Learning for Materials
Discovery, Machine Intelligence for Materials
Science, https://doi.org/10.1007/978-3-031-44622-1

The original version of the book was inadvertently published without the electronic
supplementary material (ESM) in Chapters 1, 2, 4, 5, and 6, which has now been added.
The chapters and the book have been updated with these changes.

The updated version of the book can be found at
https://doi.org/10.1007/978-3-031-44622-1

Index

A Distortion, 39, 40, 117


Acronyms, list of, xix
Adaptive Synthetic Sampling Technique
(ADASYN), 25, 45 F
Forgetting factor, 66
Fracture, 13, 15, 44 , 150, 213, 221, 232, 245,
B 253–256, 268
Bagging, 90, 91
Ball-and-stick, 7
Bar graph, 26, 45 G
Batch LMS, 68, 70, 81 Gaussian process, 9, 85, 86, 108, 110, 111,
Bi-modal distribution, 31 140, 183, 184, 200, 228
Bin size, 31, 32 Generative Adversarial Networks (GANs),
Box plot, 37 145, 146, 150, 151, 154, 158, 191, 197–
199, 205
Gradient ascent, 57, 80
C Gradient descent, 13, 62, 65, 66, 68, 69, 81,
Categorical data, 26 96, 148, 156, 193, 205
Central measures, 26, 34–36, 43, 45 Graph Neural Network (GNN), 16, 17, 24,
Confidence intervals, 10, 11, 38 145, 146, 152–154, 158, 181, 202, 205,
Convolutional Neural Network (CNN), 13– 221, 226, 229, 234–236, 238, 239, 242,
15, 50, 145–148, 150, 154, 158, 174, 243, 263, 270, 271, 274
181, 234, 246–249, 251–254, 257, 258

Clustering, 9, 24, 47, 49, 55, 56, 59, 113, H


114, 117–120, 128, 267 Heat map, 25, 26, 28, 45
Crack,14, 230, 232, 245, 247, 253–257, 259 Heavy-tailed, 39
Hierarchical data, 28
Higher-order measures, 25, 39, 45
D Histogram, 10, 11, 25, 26, 31, 32, 35, 41, 45,
Data augmentation, 26, 44, 233 227
Data-driven models, 8
Data imputation, 44, 45
Density plot, 26, 32 I


I
Imbalanced data, 44, 45
In-silico, 5, 7
Interatomic, 16, 153, 181, 221–225, 228, 229, 231, 232, 235, 242
Interquartile range, 43

K
K-Means, 9, 49, 55, 59, 113, 114, 118–120, 128
k Nearest Neighbor (kNN), 41, 44, 52, 53
Kurtosis, 26, 39, 40

L
Learning rate, 66, 70, 74, 132, 139, 184
Least Angle Regression (LAR), 61, 62, 75, 77, 78, 81
Least mean square, 66, 70
Likelihood function, 72, 80, 122, 123
Logistic regression, 9, 53, 59, 61, 62, 78, 80, 81, 85, 96, 182, 267
Loglikelihood, 72, 73, 80, 81, 122
Long short-term memory (LSTM), 15, 145, 146, 148–150, 158

M
Machine learning, 6, 8–10, 46–51, 55, 57, 59, 113, 126, 128, 129, 131, 132, 139, 142, 143, 145, 146, 150, 156, 158–160, 167, 170, 171, 175, 183, 184, 186, 188, 189, 194, 197, 201, 205, 221, 223–225, 229, 242, 243, 245, 248, 252, 253, 255, 256, 259
Materials databases, 5, 43, 265
Mathematical models, 7, 232
Maximum likelihood estimation, 72, 80, 169
Mean, 26, 34–36, 38, 42, 44, 45, 53–55, 62, 72, 76, 90, 108, 109, 114, 118, 137, 155, 161, 163, 165, 170, 185, 210, 211, 230, 242, 247
Measures of spread, 36
Median, 26, 34–37, 41, 43–45
Median absolute deviation, 41, 43
Metallurgy, 4
Mode, 26, 34–36, 45, 197, 233, 234, 254, 258
Models, 7–13, 15, 17, 25, 26, 28, 40–42, 44, 47–57, 59, 61–63, 65, 66, 68, 70–78, 80, 85–90, 93, 96, 99, 108, 109, 113, 114, 118, 120, 122, 123, 126–128, 131–134, 137, 139, 140, 142, 143, 145–146, 147, 150–152, 154, 156–160, 162–165, 168–171, 175–189, 191–199, 202, 205, 206, 209–211, 213, 217, 218, 221–224, 226, 231–236, 242, 243, 247, 251, 252, 254–261, 263–267, 269, 271, 272, 274
Multilayer Perceptron (MLP), 86, 93, 97, 110, 183

N
Natural Language Processing (NLP), 14, 158, 170, 174, 263, 264, 273, 274
Neural network, 9, 16, 17, 24, 41, 50, 51, 55–57, 59, 85, 97, 99, 132, 139, 145–149, 152–154, 159, 165, 181–183, 186, 197, 199, 201–203, 210, 221, 223, 226, 228–231, 233, 234–236, 238, 239, 242, 243, 246, 252, 253, 256–258, 263, 270, 274
Nonlinearly distributed dataset, 73
Normal distribution, 38–40, 109

O
Outlier detection, 25, 26, 40–46, 179
Outlier ensemble, 44
Outliers, 9, 25, 26, 29–31, 35–37, 40–46, 50, 56, 179
Oversampling, 44, 45

P
Parametric methods, 61, 62, 81
Percentile, 36–38
Physics-informed, 17, 185, 186, 189, 206, 222, 223, 229–231, 233
Population, 34, 38, 167, 168, 193, 214, 215
Potential, 8, 9, 13, 14, 43, 51, 59, 104, 146, 151, 153, 158, 162, 171, 175, 181, 189, 201, 205, 206, 223–226, 228–231, 237, 239, 240, 241, 243, 255, 256, 261, 263, 267, 273, 274
PyOD, 40–42, 179

Q
Quartile, 37, 43

R
Random forest, 9, 54, 55, 59, 85, 86, 90, 91, 110, 159, 183, 228
Range, 10, 26, 28, 31, 36–38, 43, 44, 62, 121, 132, 145, 180, 189, 192, 193, 230, 232, 242, 261, 273
Recurrent, 50, 148, 150, 197, 199, 234
Recurrent Neural Network (RNN), 50, 148, 197, 199, 234
Regression, backward stepwise, 76
Regression, batch LMS, 68, 70, 81
Regression, forward stepwise, 76, 77
Regression, gradient descent, 66
Regression, LAR, 77
Regression, linear, 63
Regression, linearisation, 73
Regression, LWR, 73
Regression, random forest, 90
Regression, stepwise, 62, 75–77, 81
Regression, stochastic LMS, 70
Regression, subset selection, 74
Regression tree, 85–87, 90, 93, 183
Reinforcement learning, 9, 10, 47–49, 51, 57–59, 145, 146, 156, 191, 192, 197, 199, 201–203, 205

S
Sample, 31, 34–38, 44, 63, 70, 72, 90, 114, 151, 154, 155, 163, 165, 195, 197, 268
Scatter plot, 25, 26, 29, 45
SHapley Additive exPlanations (SHAP), 159–165, 168, 176, 209–213, 217, 249
Skewness, 26, 39, 40
SMOTE, 25, 44, 45
Softness, 209, 213–218
Standard deviation, 38, 42, 72, 109, 110, 184, 228
Steepest descent, 65
Stochastic LMS, 68, 70, 81, 99
Sturges’ rule, 31
Summarizing statistics, 34
Supervised learning, 9, 48, 50, 59, 113, 151, 256
Support vector, 9, 54, 59, 85, 86, 101, 103, 105, 107, 108, 110, 139, 159, 183, 184, 209, 213–215, 226
Symbols, list of, xix

T
Tree maps, 25, 26, 28, 45
Trial-and-error, 4, 191, 201

U
Undersampling, 44
Unsupervised learning, 9, 47, 48, 55, 59, 113, 114

V
Variance, 26, 36, 38, 56, 74, 90, 109, 114, 115, 126, 133, 154, 214, 247, 248
Variational Autoencoder (VAE), 41, 154–156, 197–200

W
What-if scenarios, 4, 5
Widrow-Hoff learning rule, 66

X
XGboost, 9, 86, 93–95, 138, 140–142, 183