Machine Learning in
Python for Process
Systems Engineering
Ankur Kumar
Jesus Flores-Cerrillo
Dedicated to our spouses, family, friends, motherland, and all the data-science
enthusiasts
www.MLforPSE.com
All rights reserved. No part of this book may be reproduced or transmitted in any form or
in any manner without the prior written permission of the authors.
Every effort has been made in the preparation of this book to ensure the accuracy
of the information presented and obtain permissions for usage of copyrighted materials.
However, the authors make no warranties, expressed or implied, regarding errors or
omissions and assume no legal liability or responsibility for loss or damage resulting
from the use of information contained in this book.
Table of Contents
• Preface
• Part 1 Introduction and Fundamentals
• Chapter 1 Machine Learning for Process Systems Engineering
o 1.1 What are Process Systems
▪ 1.1.1 Characteristics of process data
o 1.2 What is Machine Learning
▪ 1.2.1 Machine learning workflow
▪ 1.2.2 Types of machine learning systems
o 1.3 Machine Learning Applications in Process Industry
▪ 1.3.1 Decision hierarchy levels in a process plant
▪ 1.3.2 Application areas
o 1.4 ML Solution Deployment
o 1.5 The Future of Process Data Science
• Chapter 8 Finding Groups in Process Data: Clustering & Mixture Modeling
o 8.1 Clustering: An Introduction
▪ 8.1.1 Multimode semiconductor manufacturing process
o 8.2 Centroid-based Clustering: K-Means
▪ 8.2.1 Determining the number of clusters via elbow method
▪ 8.2.2 Silhouette analysis for quantifying cluster quality
▪ 8.2.3 Pros and cons
o 8.3 Density-based Clustering: DBSCAN
▪ 8.3.3 Pros and cons
o 8.4 Probabilistic Clustering: Gaussian Mixtures
▪ 8.4.1 Mathematical background
▪ 8.4.2 Determining the number of clusters
o 8.5 Multimode Process Monitoring via GMM for Semiconductor
Manufacturing Process
▪ 8.5.1 Fault detection indices
▪ 8.5.2 Fault detection
Appendix
Preface
Every day we hear stories about new feats being achieved by machine learning (ML) and
artificial intelligence (AI) researchers that have the potential to revolutionize our world.
Through humanoid robots, speech recognition, computer vision, driverless cars, fraud
detection, personalized recommendations, and automated health diagnosis, machine
learning has already become an integral part of our daily life. Moreover, away from the
glitz of these ‘visible’ high-tech products, machine learning has also been making silent
advances in process industries (chemical industry, biopharma industry, steel industry,
etc.) where ML-based solutions are being increasingly used for predictive equipment
maintenance, product quality assurance, process monitoring, fault diagnosis, process
control, and process optimization. With increasing global competition and stricter product
quality standards, industrial plants are relying upon machine learning tools (such as reinforcement-learning-based auto-adaptive process controllers) to provide them a winning edge over competitors.
Perhaps you are reading this book because you too have been inspired by the capabilities
of machine learning and would like to use it to solve problems being faced by your
organization. However, you might be struggling to find a definite guide that can help you
decide which specific methodology to chose among the myriad of available
methodologies. You may have come across a nice research article that showcases an
interesting process systems application of a ML method. However, you might be facing
difficulties trying to understand the intricate details of the algorithm. We won’t be surprised
if you have struggled to find a data-science book that caters to the needs of a process
systems engineer, considers unique characteristics of industrial process systems, and
uses industrial-scale process systems for illustrations. We, the authors, have been in that
phase. A process engineer will arguably find it more relevant and useful to learn principal
component analysis (PCA) by working through a process monitoring application (the most
popular application area of PCA in process industry) and learning how to compute the
monitoring metrics. Similar arguments could be made for several other popular ML
methods. There is a gap in the available machine learning resources for industrial practitioners, and this book attempts to fill it.
In one sense, we wrote this book for our younger selves; a book that we wish had existed
when we started experimenting with machine learning techniques. Drawing from our
years of experience in developing data-driven industrial solutions, this book has been
written with the focus on de-cluttering the world of machine learning, giving a
comprehensive exposition of ML tools that have proven useful in process industry,
providing step-by-step elucidation of implementation details, cautioning against the pitfalls and listing various tips & tricks that we have encountered over the years, and using datasets from industrial-scale process systems for illustrations. We strongly believe in
‘learning by doing’ and therefore we encourage the readers to work through in-chapter
illustrations as they follow along the text. For reader’s assistance, Jupyter notebooks with
complete code implementations are available for download. We have chosen Python as the coding language for the book as it is convenient to use, has a large collection of ML libraries, and is the de facto standard language for ML. No prior experience with Python is assumed. The book has been designed to teach machine learning from scratch and, upon completion, the reader will feel comfortable using ML techniques.
Machine learning will continue to play a significant role in unleashing the next wave of
productivity improvements in process industry. We hope that this book will inspire its
readers to develop novel ML solutions to challenging problems at work. We wish all the
best to the budding process systems data scientists.
Pre-requisites
No prior experience with machine learning or Python is needed. Undergraduate-level
knowledge of basic linear algebra and calculus is assumed.
Book organization
The book follows a holistic and hands-on approach to learning ML where readers first
gain conceptual insight and develop intuitive understanding of a methodology, and then
consolidate their learning by experimenting with code examples. Every methodology is demonstrated using a simple process system or numerical example to illustrate the major characteristics of the method, and then by implementing it on industrial-scale processes.
The book has been divided into four parts. Part 1 provides a perspective on the
importance of ML in process systems engineering and lays down the basic foundations
of ML. Part 2 provides a detailed presentation of classical ML techniques and has been
written keeping in mind the various characteristics of industrial process systems such as
high-dimensionality, non-linearity, multimode operations, etc. Part 3 is focused on
artificial neural networks and deep learning. While deep learning is the current buzzword
in ML community, we would like to caution the reader against the temptation to deploy a
deep learning model for every problem at hand. Often, simpler classical models can
provide as good, if not better, results as those from neural net models. For example, partial least squares (PLS) is still the most popular model for soft sensor development in process industry due to its simplicity and powerful capability of handling noisy and correlated data. Part 4 covers the important topic of deploying an ML solution over the web.
It was a deliberate decision to not divide the book in terms of supervised / unsupervised
/ reinforcement-learning categories or application areas (process modeling, monitoring,
etc.). This is because several methods overlap these categories, which makes it difficult to
put them under a specific category. For example, SVM and SVR methods fall under
supervised category while the related SVDD method falls under unsupervised category.
A similar situation holds for PCA/PCR/PLS methods. A reader who is interested in a specific
application area may use the table of contents as a guide to relevant sections in Parts 2
and 3. Care has been taken to title the subsections in different chapters appropriately.
Symbol notation
The following notation has been adopted in the book for representing different types of
variables:
- lower-case, bold-face letters refer to vectors ($\boldsymbol{x} \in \mathbb{R}^{m \times 1}$) and upper-case, bold-face letters denote matrices ($\boldsymbol{X} \in \mathbb{R}^{n \times m}$)
- individual elements of a vector and a matrix are denoted as $x_j$ and $x_{ij}$, respectively
- the $i$th vector in a dataset is represented as a subscripted lower-case, bold-face letter ($\boldsymbol{x}_i \in \mathbb{R}^{m \times 1}$)
Part 1
Introduction & Fundamentals
Chapter 1
Machine Learning for Process Systems
Engineering
Imagine yourself in the shoes of an operator or engineer responsible for uninterrupted and
optimal operation of an oil refinery. Keeping an eye on each of the thousands of process measurements being made every second to look for process abnormalities or opportunities for plant performance improvement is akin to finding a needle in a haystack. The task is
undoubtedly overwhelming and is the primary reason why plant managers often complain
about having ‘too much data but little knowledge and insight’.
However, unlike humans, computers can be programmed to parse through large amounts of data in real-time, extract patterns and trends, and assist plant personnel in making informed
business and operational decisions. This practice of learning about systems from data or
machine learning has become an indispensable tool in process operations in the age of
increasing global competition and stricter product quality standards.
This chapter provides an overview of how the power of machine learning is harnessed for
process systems engineering. Specifically, the following topics are covered
• Unique characteristics of process data
• Types of ML systems and typical workflow to convert data into insights
• Classical applications of ML techniques in process industry
• Common ML solution deployment infrastructure employed in industry
Let’s now fasten our seat belts as we embark upon this exciting journey of de-mystifying
machine learning for process systems engineering.
Process systems refer to a collection of physical structures that convert raw materials (wood,
natural gas, crude oil, coal, etc.) into final consumer products (paper, fertilizers, diesel, energy,
etc.) or intermediate products which are then used to manufacture other commodity materials.
These process systems can range from a simple water-heating system to complex oil
refineries. Figure 1.1 shows an example (petrochemical) plant comprising several processing
units such as distillation columns, heat exchangers, and pumps. Process industry is a broad
category that encompasses, amongst others, chemical industry, bioprocess industry, power
industry, pharmaceuticals industry, steel industry, semiconductor industry, and waste
management industry.
In process industry, the tasks of optimizing production efficiency, controlling product quality, and monitoring the process are categorized as process systems engineering (PSE) activities.
These tasks often require a mathematical model of the plant. The traditional practice has been
to use first-principles mathematical descriptions of the physico-chemical phenomena occurring inside each process unit to mathematically characterize the whole plant. However, as you may already know, building such fundamental models is time-consuming and difficult for complex systems. Machine learning (ML)-based methods provide a convenient alternative where process data are used to build an empirical plant model which can be used for optimization, control, and monitoring purposes. The availability of large amounts of sensor data has further boosted the trend of incorporating ML techniques for PSE and the demand for process
data scientists.
Let’s take a look at the kind of data generated in process industry. The majority of process data include process stream flowrate, temperature, pressure, level, power, and composition measurements as shown in figure 1.2. Additionally, obtaining vibration signals from rotating equipment (motors, compressors), infrared or visual images, and spectral data is also common nowadays. These indirect data are frequently utilized for predictive equipment
maintenance and product quality control.
Figure 1.2: A process flowsheet with typical flow (FI), temperature (TI), pressure (PI),
composition (Analyzers), level (LI), power (JI) measurements
• Time-varying: Correlations between process variables change over time due to gradual changes in process parameters. For example, the heat transfer coefficient in a heat exchanger may change due to fouling, or catalyst activity may degrade due to aging.
• Multirate sampling: Not all process measurements are made at the same frequency. While temperature or pressure readings are sampled every second or minute, composition/analyzer measurements are often sampled at a much lower frequency (once an hour or a day), as the sketch below illustrates.
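As an aside, the snippet below sketches one common way of aligning such multirate signals on a single time grid with Pandas (introduced in Chapter 2); the tag names, sampling rates, and toy data here are our own hypothetical choices, not plant data.

import numpy as np
import pandas as pd

rng = pd.date_range('2023-01-01', periods=120, freq='min') # 1-minute time grid
temperature = pd.Series(np.random.normal(350, 2, 120), index=rng) # fast (per-minute) signal
analyzer = temperature.resample('60min').mean() # slow (hourly) composition proxy

# align both signals on the fast grid; the slow signal is forward-filled between samples
df = pd.DataFrame({'T': temperature, 'conc': analyzer.reindex(rng).ffill()})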
In this book, we will study several ML methods in detail which have been designed to handle
these different varieties of process systems. Let us first understand what we mean by machine
learning.
At its core, machine learning simply means using computer programs and data for finding relationships between different components of a system and discovering patterns that were not
immediately evident. This characteristic of using data to learn about the system makes
machine learning interesting and different. System-specific knowledge is not explicitly embedded into an ML program; the program extracts the knowledge from data. Figure 1.3
compares the ML approach with a non-ML approach for distillation column modeling: the ML approach does not require system-specific information such as the number of stages or the type of packing.
Figure 1.3: Computer program using first-principles approach (left) and ML approach (right)
for modeling a distillation column
We often marvel at the accuracy of recommendations made by Netflix or Amazon for potential shows or products. These companies do not possess explicit information about our personal preferences or psychology (whether we like sci-fi movies or not). The data does the trick! ML algorithms process past purchase data to discern the likes and dislikes of their customers
and make recommendations. In process industry, manufacturers use ML methods to
determine optimal equipment maintenance schedules using past maintenance and operating-conditions data. Here again, data-based analysis provides considerable convenience over the
alternative method of complex metallurgical analysis.
The reliance on data alone to obtain a reasonably good system approximation is one of the major reasons behind ML’s growing popularity. The barrier of requiring specific domain
knowledge to be able to analyze a system has been circumvented by machine learning. ML
algorithms are also universal in nature. For example, we can use the same data-clustering
algorithm for analyzing demographics data, factory data, or economic data to obtain
actionable knowledge. These properties combined with the surge in the amount of data
collected and the dip in the cost of computational resources has led machine learning to revolutionize several industries.
“Why don’t we just build a model using first-principles and get very accurate models? Why
rely on data?” This is a valid question. There is no doubt that fundamental models have better
accuracy and generalization capability; however, developing fundamental models is often time-consuming and requires expert resources. These models can sometimes be too complex to execute in real-time. Adopting the ML methodology can help get around these difficulties.
The models are built offline using historical process data. This offline exercise is performed
once or repeated at regular intervals for model updates. A brief description of the essential steps is provided below:
• Sample and variable selection: One does not simply dump all the available historical data and sensor measurements into the model training module. Only the portion of historical data that best represents the current process behavior or the behavior of interest is utilized. For process systems, it is common to use data over the past couple of years as training data. If steady-state models are being built, then data from steady-state operation periods are used.
Input variable selection warrants judicious consideration as well. While including too
many model inputs leads to overfitting and high computational complexity, leaving out
important variables leads to underfitting. The basic principle is to include only those
inputs that are known to influence the model outputs. Specific algorithms for variable
selection are covered in Chapter 4.
Distinct from the offline-online paradigm, there is another approach employed in process
industry, especially for nonlinear and multimode processes. It is called just-in-time learning or
lazy learning. As shown in Figure 1.5, the model building exercise is carried out online as
well. When new process data comes in, relevant data are fetched from the historical dataset
that are similar to the incoming samples based on some nearest neighborhood criterion. A
local model is built using the fetched relevant data. The obtained model processes the
incoming samples and is then discarded. A new local model is built when the next samples
come in.
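A minimal sketch of the just-in-time idea is given below, assuming historical arrays X_hist and y_hist are available; the neighborhood size k and the linear local model are hypothetical choices made purely for illustration.

import numpy as np
from sklearn.neighbors import NearestNeighbors
from sklearn.linear_model import LinearRegression

def jit_predict(x_new, X_hist, y_hist, k=25):
    # fetch the k historical samples nearest to the incoming sample
    nn = NearestNeighbors(n_neighbors=k).fit(X_hist)
    _, idx = nn.kneighbors(x_new.reshape(1, -1))
    # build a throwaway local model on the fetched data and use it once
    local_model = LinearRegression().fit(X_hist[idx[0]], y_hist[idx[0]])
    return local_model.predict(x_new.reshape(1, -1))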
Supervised learning
Supervised learning is used when training data includes sets of inputs and associated outputs.
As illustrated in Figure 1.7, supervised learning models learn the input-output relationship and
use this relationship to predict the unknown output for a new input. The outputs can be discrete
values (for classification problems) or continuous values (for regression problems).
Unsupervised learning
Unsupervised learning is used when training data is not divided into inputs and outputs. The primary purpose is to find hidden patterns, if any, in the data. An example situation is illustrated in Figure 1.8, where an unsupervised learning model finds the prevailing structure (distinct clusters) in historical data and buckets the samples into different groups. The model can now be used to assign any new incoming sample to one of the groups. Unsupervised learning is often used in
conjunction with supervised learning. For example, in Figure 1.8, separate local models can
be built via supervised learning for data in different clusters. As you would have guessed, the data would need to be separated into inputs and outputs before application of supervised learning.
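The sketch below illustrates this combination; the toy data, the cluster count, and the linear local models are hypothetical choices for illustration only.

import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LinearRegression

X = np.random.rand(100, 2); y = X.sum(axis=1) # toy stand-ins for plant inputs/outputs

# unsupervised step: find clusters in historical data
kmeans = KMeans(n_clusters=2, n_init=10).fit(X)

# supervised step: fit one local model per cluster
local_models = {c: LinearRegression().fit(X[kmeans.labels_ == c], y[kmeans.labels_ == c]) for c in range(2)}

# route a new sample to its cluster's local model
x_new = np.array([[0.5, 0.5]])
y_pred = local_models[kmeans.predict(x_new)[0]].predict(x_new)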
Reinforcement learning
The RL agent takes actions to adjust the tap opening according to the current system state to
maintain the level at some setpoint. The trivial policy of opening and closing the tap completely
upon any level change can lead to high water level fluctuations. Therefore, during training the
agent learns the best control policy automatically by just interacting with its environment.
Product quality control, safe work environment, optimal operations, and sustainable
operations are the primary objectives in any industrial plant. Today, ML-based solutions are
being utilized for all these purposes. Surrogate models are developed for online prediction of
key quality variables and optimizing plant productivity, process monitoring and fault diagnosis
models are developed for real-time tracking of process operating conditions, data mining is
used for alarm management, data clustering is used for operation mode identification, and the
list goes on.
Several success stories on machine learning applications in process industry are publicly available. Shell1 used recurrent neural networks (RNNs) for early prediction of valve failures,
Saudi Aramco used ML tools for alarm analytics and predictive maintenance of turbines 2, a
polymer manufacturing company used ML-based feature extractions3 for troubleshooting
1 https://www.aiche.org/conferences/aiche-spring-meeting-and-global-congress-on-process-safety/2018/proceeding/paper/37a-digital-twins-predicting-early-onset-failures-flow-valves
2 https://pubs.acs.org/doi/abs/10.1021/acs.iecr.8b06205
3 https://www.yokogawa.com/at/library/resources/references/successstory-sumitomoseika-chemicals-en/
quality control issues. In recent times, there has been a proliferation in the number of commercial data-analytics software (Aspen Mtell, IBM SPSS) and service offerings for process industry. This is a testament to the growing trend of ML-driven process control and operations.
As process data scientists, we should be proud of the fact that process industry has always
been a pioneer in utilizing process data for plant operations. Model predictive control (MPC)
is a classic example which uses a data-based model for process control. It has been used as a standard supervisory controller since long before Big Data and ML became buzzwords. Partial
least squares (PLS), a popular dimensionality reduction-based soft sensing method, has long
been used for online product quality predictions. The new craze about machine learning has
only imparted a renewed push to explore non-traditional applications of ML in process
industry.
At the base level, in Figure 1.11, resides the basic regulatory control layer, which primarily comprises control valves; these valves are used to modulate the flow of process streams
and indirectly the pressures and temperatures as per process requirements. These
requirements are in turn determined by the multivariable control layer which usually consists
of MPC and RTO (real-time optimization) modules. This layer determines the base layer
requirements using multivariable relationships between plant variables to ensure optimal
performance of the plant. This layer is also responsible for ensuring that plant operations
remain safe by ensuring safety-critical variables remain within stipulated bounds. The process
diagnostics layer, if present, ensures reliability of the process through timely fault detection
and diagnosis. This layer may also perform the task of controller performance assessment.
The production scheduling layer has models and methods to determine resource allocation
and short-term production schedules considering external influences such as electricity prices
or raw material price variations. Time-based or predictive equipment maintenance decisions
may also be made in this layer. Results from this layer are communicated to the multivariable
control layer. The top-most layer, planning and logistics, makes enterprise-wide decisions. An enterprise operating multiple facilities uses this layer to determine site-wise production targets based on the enterprise’s strategic goals.
Application areas
After familiarization with the process operation decision hierarchy, we are now well-equipped
to see the broad application areas of ML in a process plant. We have already seen some of
the application areas in Figure 1.6. In this section we will use a furnace (Figure 1.12) as an
example system to investigate these in more detail.
The furnace system consists of several catalyst-filled tubes suspended vertically in a natural-
gas fired furnace. The unreacted gas stream enters the tubes at the top, undergoes chemical reactions as it flows down the tubes, and exits at the bottom. The heat from fuel combustion
provides energy for the chemical reactions.
Soft Sensing
Soft sensors, also called virtual sensors or inferential sensors, are mathematical models used
to estimate the values of unknown process variables using available measurements. Soft
sensor models can be first principles-based or data-based and are employed when physical sensors are very costly or real-time measurements are not possible (as is often the case with composition measurements).
Usage of data-based soft sensors is popular where process systems are too complex to build mechanistic models, or the mechanistic models are too complex to provide the required estimates in real-time. For example, for the furnace system, one can build a computational fluid dynamics (CFD) model to predict the species mole fractions in the processed gas stream
as a function of process inputs – fuel, air, unprocessed gas. This estimate can be used by the multivariable control layer to adjust input flows accordingly, for example, increasing fuel if conversion is low. However, CFD models have large execution times. As an alternative, data-
based models can utilize past process data to estimate an appropriate relationship and
provide compositions in real-time.
As a process modeler, you should pay careful attention that the training data is
sufficiently rich in information, otherwise, a poor/low-accuracy model will result.
One way to ascertain data-richness is to check if process inputs show adequate
variability, as the sketch below illustrates.
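One crude check along these lines (a sketch with hypothetical names, not a prescribed recipe) is to compare each input's spread against its mean:

import numpy as np

X_train = np.random.rand(500, 3) # stand-in for the training inputs
variability = X_train.std(axis=0) / np.abs(X_train.mean(axis=0))
print(variability) # near-zero entries flag inputs with little excitation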
Partial least squares (PLS), principal component regression (PCR), and support vector regression (SVR) are some of the popular ML method choices for estimating process quality variables. In recent times, artificial neural networks (ANNs) have also seen increased usage.
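As a flavor of what is to come, here is a minimal PLS soft sensor sketch using Sklearn; the data below are toy stand-ins for furnace inputs and measured composition, and the number of components is a hypothetical choice.

import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.preprocessing import StandardScaler

X = np.random.rand(200, 3) # fuel, air, feed gas flows (toy data)
y = X @ np.array([0.5, 0.3, 0.2]) + 0.01*np.random.randn(200) # toy composition

X_scaled = StandardScaler().fit_transform(X)
pls = PLSRegression(n_components=2).fit(X_scaled, y)
y_estimate = pls.predict(X_scaled) # real-time composition estimate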
Process Monitoring
Process monitoring/fault detection/abnormality detection is among the most popular applications of ML in process industry. The ML model flags an alarm when current process data are inconsistent with the normal behavior estimated from historical data. These discrepancies could be indications of severe process faults. In the furnace system, there are
White-box/grey-box/black-box models
For soft sensing, terms like white-box, grey-box, black-box are often used. Figure
1.13 clarifies the difference between the terms. White-box models utilize
fundamental mass/energy/momentum conservation laws to relate model inputs and
outputs. Black-box methods build estimation models using only process data. Grey-
box methods combine the two approaches to generate a hybrid model.
Consider our furnace system again. The CFD model would be a white-box model. A PLS model relating furnace inputs to product stream composition would fall in the black-box category. However, to balance the trade-off between model accuracy and
computational expense, a hybrid model can be built. The mechanistic model of radiation energy transfer is the most complex part of the furnace model. One can instead build a black-box model to estimate the amount of radiation energy supplied to the tubes.
This model can then be combined with the mechanistic model of tube to predict
product composition.
Principal component analysis (PCA) is the most popular method employed for process
monitoring. PLS, independent component analysis (ICA), support vector data description (SVDD), and self-organizing maps are some other commonly used methods. We will study all these methods and their applications for process monitoring in the later chapters.
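To give an early flavor (a minimal sketch, not the full treatment of the later chapters), PCA-based monitoring can flag samples whose reconstruction error exceeds a limit derived from normal operating data; all names and the 99th-percentile limit below are hypothetical choices.

import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X_train = np.random.rand(500, 10) # stand-in for normal operating data
scaler = StandardScaler().fit(X_train)
pca = PCA(n_components=3).fit(scaler.transform(X_train))

def spe(X):
    # squared prediction error: distance between samples and their PCA reconstructions
    Xs = scaler.transform(X)
    X_hat = pca.inverse_transform(pca.transform(Xs))
    return ((Xs - X_hat)**2).sum(axis=1)

limit = np.percentile(spe(X_train), 99) # simple empirical control limit
# an alarm is flagged when spe(X_new) > limit for incoming data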
Fault Classification
Once a process fault has been detected, the next challenge lies in timely identification of the root cause of the issue, or identification of the process variables that are responsible for the process upset. This exercise is called fault diagnosis. If historical data on past failures are available, a
classification model could be built to determine the specific fault.
For the furnace system, refractory damage, tube leaks, burner malfunctions, and catalyst damage can
all cause temperature upsets. These different faults tend to impact tube temperatures
differently; these differences are exploited by a fault classification model to identify the current
fault. In this book, we will study how methods like ANNs, support vector machine (SVM) and
linear discriminant analysis (LDA) can help in fault classifications.
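A minimal sketch of the idea with a support vector classifier is shown below; the tube-temperature features and fault labels are toy stand-ins.

import numpy as np
from sklearn.svm import SVC

X_faults = np.random.rand(300, 8) # 8 hypothetical tube temperatures per sample
y_faults = np.random.choice(['tube leak', 'burner malfunction', 'catalyst damage'], 300)

clf = SVC(kernel='rbf').fit(X_faults, y_faults) # learn fault signatures from labeled history
print(clf.predict(X_faults[:1])) # diagnose the fault class of a new sample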
Predictive Maintenance
Predictive maintenance is another very popular usage of ML in process industry. Predictive
maintenance models are built to determine the time to failure of any equipment or detect
4 Real-Time Optimization and Control of Nonlinear Processes Using Machine Learning, Zhang et al., Mathematics, 2019
5 Adaptive PID controller tuning via deep reinforcement learning, US patent 187631, 2019
patterns in process data that could signal an impending process failure. Detailed planned
maintenance can be carried out if failure times are known in advance. In the furnace system,
tube leak occurrences may be preceded by a specific pattern in temperatures of neighboring
tubes. These patterns are identified during model training and then utilized for real-time failure
predictions. Advance warning can help plant operators plan plant shutdown properly.
Forecasting
Uncertainty in product demands and prices of raw materials leads to poor production scheduling. Advanced ML forecasting models are built to determine the optimal production plan to maximize resource utilization and minimize production costs. In the furnace system, frequent furnace temperature swings have a detrimental effect on tube lifespan. If accurate monthly
product demand is known in advance, then the furnace can be operated at a steady state
throughout the month while using product storage to handle momentary spikes in demand.
While working on any project, you are most likely to experiment with several ML algorithms
on your personal laptop/computer. After having done all the hard work and having decided
the final form of your ML workflow, you might find yourself asking the following questions:
The answer to the above questions is summarized in Figure 1.14 which shows a common
architecture followed for deploying a Python machine learning solution from scratch within an
enterprise network. You will install Python and transfer your tool (Python files) to a tool server machine (a virtual or physical machine where the ML tool would run). The tool server will be configured to execute the tool continuously (as a Windows service) or on a schedule. During execution, your ML tool will fetch historical or real-time plant data and store the ML results in a database (MS SQL, MySQL, etc.).
Let’s tackle the second question now. The end-users such as plant operators can access the tool’s results via a web browser. The user interface could be either built using third-party visualization software (Tableau, Sisense, Power BI) or completely custom-built using front-end web frameworks like Bootstrap. If building a custom website, then you will also need to set up a web server (using Python, .Net, etc.) which will serve the user-interface webpage
when requested through web browser. The user interface communicates with the database
to display appropriate data to the end-user. The web server may be configured to execute
your tool on demand as well. The web server may be hosted on a separate machine or on the
tool server machine itself.
That is all it takes to deploy a ML solution in a production environment. If all this IT stuff has
overwhelmed you, don’t worry! It is simpler than it seems, and in Chapter 14 we will build and deploy an end-to-end solution following this architecture.
It is not an exaggeration to say that this is a wonderful time to be a process data scientist.
Process industry is witnessing ever higher product demands due to increasing population and rising living standards globally, but there is also a push to run production facilities more efficiently and sustainably. Consequently, adoption of Industry 4.0, which mandates
utilizing process data for process improvements all along the production chain, is on the rise.
There is palpable interest among process industry executives to implement ML-based
solutions and the responsibility to show that the ML hype is true has fallen on the shoulders
of process data scientists.
It’s a foregone conclusion that ML is a powerful tool for PSE. Process data hold tremendous
power if they are put to use in the right way. However, blind application of ML often leads to
discouraging results. As a process data scientist with expert process knowledge and ML skills,
you are in a unique position to combine process systems knowledge and the power of ML to unleash the true potential of data science in process industry. Let’s cheer to your bright
career prospects as a process data scientist and continue our journey to now learn the
intricate details of ML algorithms.
Summary
In this chapter we tried to get a conceptual understanding of where ML fits in the world of
process industry. We looked at the different types of machine learning workflows and
methodologies. We also explored some application areas in process industry where ML has
proved useful. We hope that you got the chapter’s overarching message that process data
science has already proven to be an indispensable tool in process operations to turn data into
knowledge and support effective decision making. In the next chapter we will take the first
step and learn about the environment you will use to execute your Python scripts containing
ML code.
Chapter 2
The Scripting Environment
In the previous chapter, we studied the various aspects of machine learning and learned about
its different uses in process industry. In this chapter, we will quickly familiarize ourselves with
the Python language and the scripting environment that we will use to write ML code, execute it, and see results. This chapter won’t make you an expert in Python but will give you enough understanding of the language to get you started and help you understand the several in-chapter code implementations in the upcoming chapters. If you already know the basics of
Python, have a preferred code editor, and know the general structure of a typical ML script,
then you can skip to Chapter 3.
The ease of learning and using Python, along with the availability of a plethora of useful open-access packages developed by the user community over the years, has led to the immense popularity of Python. In recent years, the development of specialized libraries for machine and deep learning has made Python the default language of choice in the ML community. These
advancements have greatly lowered the entry barrier into the world of machine learning for
new users.
With this chapter, you are putting your first foot into the ML world. Specifically, the following
topics are covered
• Introduction to Python language
• Introduction to Spyder and Jupyter, two popular code editors
• Overview of Python data structures and scientific computing libraries
• Overview of a typical ML script
Installing Python
One can download the official and latest version of Python from the python.org website.
However, the most popular and convenient way to install and use Python is to install
Anaconda (www.anaconda.com) which is an open-source distribution of Python. Along with
the core Python, Anaconda installs a lot of other useful packages. Anaconda comes with a
GUI called Anaconda Navigator (Figure 2.2) from where you can launch several other tools.
Jupyter Notebooks are another very popular way of writing and executing Python code. These
notebooks allow combining code, execution results, explanatory text, and multimedia resources in a single document. As you can imagine, this makes saving and sharing complete data analyses very easy.
In the next section, we will provide you with enough familiarity on Spyder and Jupyter so that
you can start using them.
Figure 2.3 shows the interface6 (and its different components) that comes up when you launch
Spyder. These are the 3 main components:
• Editor: You can type and save your code here. Clicking the Run button executes the code in the active editor tab.
• Console: Script execution results are shown here. It can also be used for executing Python commands and interacting with variables in the workspace.
• Variable explorer: All the variables generated by running editor scripts or console are
shown here and can be interactively browsed.
Like any IDE, Spyder offers several convenience features. You can divide your script into cells and execute any selected cell on its own if you choose to (by pressing Ctrl+Enter). IntelliSense allows you to autocomplete your code by pressing the Tab key. Extensive debugging functionalities make troubleshooting easier. These are only some of the features available in
Spyder. You are encouraged to explore the different options (such as pausing and canceling
script execution, clearing out variable workspace, etc.) on the Spyder GUI.
6 If you have used MATLAB, you will find the interface very familiar
With Spyder, you have to run your script again to see execution results if you close and reopen
your script. In contrast to this, consider the Jupyter interface in Figure 2.4. Note that the
Jupyter interface opens up in a browser. We can save the shown code, the execution outputs,
and explanatory text/figures as a (.ipynb) file and have them remain intact when we reopen the file in Jupyter notebook.
You can designate any input cell as code or markdown (formatted explanatory text). You can press Shift+Enter to execute any active cell. All the input cells can be executed via
the Cell menu.
This completes our quick overview of Spyder and Jupyter interface. You can choose either of
them for working through the code in the rest of the book.
In the current and next sections, we will see several simple examples of manipulating data
using Python and scientific packages. While these simple operations may seem unremarkable
(and boring) in the absence of any larger context, they form the building blocks of more
complex scripts presented later in the book. Therefore, it will be worthwhile to give these at least a quick glance.
Note that you will find ‘#’ used a lot in these examples; these hash marks are used to insert
explanatory comments in code. Python ignores (does not execute) anything written after # on
a line.
Tuples are another sequence construct like lists, with the difference that their items and sizes cannot be changed. Since tuples are immutable/unchangeable, they are more memory efficient.
# creating tuples
tuple1 = (0,1,'two')
tuple2 = (list1, list2) # assuming list1 = [2, 4, 6, 8] and list2 = ['air', 3, 1, 5]; equals ([2, 4, 6, 8], ['air', 3, 1, 5])
A couple of examples below illustrate list comprehension, which is a very useful way of creating new lists from other sequences:
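# list comprehension examples (representative sketches)
squares = [x**2 for x in [1, 2, 3, 4]] # returns [1, 4, 9, 16]
evens = [x for x in range(10) if x % 2 == 0] # returns [0, 2, 4, 6, 8]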
Note that Python indexing starts from zero. Very often, we need to work with multiple items of
the list. This can be accomplished easily as shown below.
# assuming list4 = [2, 4, 6, 'air', 3, 1, 5] from an earlier example
print(list4[4:len(list4)]) # displays [3,1,5]; len() function returns the number of items in a list
print(list4[4:]) # same as above
print(list4[::3]) # displays [2, 'air', 5]
print(list4[::-1]) # displays list4 backwards [5, 1, 3, 'air', 6, 4, 2]
list4[2:4] = [0,0,0] # list4 becomes [2, 4, 0, 0, 0, 3, 1, 5]
# selectively execute code based on condition
if list1[0] > 0:
    list1[0] = 'positive'
else:
    list1[0] = 'negative'
# list1 becomes ['positive', 4, 6]

# compute sum of squares of numbers in list3
sum_of_squares = 0
for i in range(len(list3)):
    sum_of_squares += list3[i]**2
print(sum_of_squares) # displays 78
Custom functions
Previously, we used Python’s built-in functions and methods (len(), append()) to carry out operations pre-defined for them. Python allows defining our own custom functions as well. The
advantage of custom functions is that we can define a set of instructions once and then re-
use them multiple times in our script and project.
For illustration, let’s define a function to compute the sum of squares of items in a list
def sum_of_squares_fn(alist):
    sum_of_squares = 0
    for item in alist:
        sum_of_squares += item**2
    return sum_of_squares
You might have noticed in our custom function code above that we used
different indentations (number of whitespaces at beginning of code lines) to
separate the ‘for loop’ code from the rest of the function code. This practice is
actually enforced by Python and will result in errors or bugs if not followed.
While other popular languages like C++, C# use braces ({}) to demarcate a
code block (body of a function, loop, if statement, etc.), Python uses
indentation. You can choose the amount of indentation but it must be consistent
within a code block.
This concludes our extremely selective coverage of Python basics. However, this should be sufficient to enable you to understand the code in the subsequent chapters. Let’s continue
now to learn about specialized scientific packages.
While the core Python data-structures are quite handy, they are not very convenient for the
advanced data manipulations we require for machine learning tasks. Fortunately, specialized
packages like NumPy, SciPy, Pandas exist which provide convenient multidimensional
tabular data structures suited for scientific computing. Let’s quickly make ourselves familiar
with these packages.
NumPy
In NumPy, ndarrays are the basic data structures, which put data in a grid of values. The illustrations below show how 1D and 2D arrays can be created and their items accessed.
import numpy as np

# create a 1D array
arr1D = np.array([1,4,6])
Note that the concept of rows and columns does not apply to a 1D array. Also, you would have noticed that we imported the NumPy package before using it in our script (‘np’ is just a short
alias). Importing a package makes available all its functions and sub-packages for use in our
script.
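The 2D case is similar; the array below is a plausible reconstruction consistent with the results quoted next (its elements sum to 25).

# create a 2D array
arr2D = np.array([[1,4,6],[2,5,7]])
print(arr2D[0,1]) # displays the item at row 0 and column 1, i.e., 4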
Executing arr2D.sum() returns the scalar sum over the whole array, i.e., 25.
# slicing
arr8 = np.arange(10).reshape((2,5)) # rearrange the 1D array into shape (2,5)
print((arr8[0:1,1:3]))
>>> [[1 2]]
print((arr8[0,1:3])) # note that a 1D array is returned here instead of the 2D array above
>>> [1 2]
An important thing to note about NumPy array slices is that any changes made to a sliced view modify the original array as well! See the following example:
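# a minimal illustration (assumed here for continuity)
arr = np.arange(6) # array([0, 1, 2, 3, 4, 5])
arr_view = arr[1:4] # a view, not a copy
arr_view[0] = 99 # arr becomes array([0, 99, 2, 3, 4, 5])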
This feature becomes quite handy when we need to work on only a small part of a large array/dataset: we can simply work on a leaner view instead of carrying around the large dataset. However, situations may arise where we need to work on a separate copy of a subarray without worrying about modifying the original array. This can be accomplished via the copy method:
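# continuing the illustration above
arr_copy = arr[1:4].copy() # an independent copy
arr_copy[0] = -1 # arr remains unchanged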
Fancy indexing is another way of obtaining a copy instead of a view of the array being indexed. Fancy indexing simply entails using an integer or boolean array/list to access array items. The examples below clarify this concept:
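# fancy indexing examples (representative sketches)
arr = np.arange(10)
print(arr[[2, 5, 7]]) # integer-array indexing; displays [2 5 7]
print(arr[arr > 6]) # boolean indexing; displays [7 8 9]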
Vectorized operations
Suppose you need to perform element-wise summation of two 1D arrays. One
approach is to access items at each index at a time in a loop and sum them.
Another approach is to sum up items at multiple indexes at once. The latter approach is called a vectorized operation and can lead to a significant boost in computational speed for large datasets and complex operations.
# vectorized operations
vec1 = np.array([1,2,3,4])
vec2 = np.array([5,6,7,8])
vec_sum = vec1 + vec2 # returns array([6,8,10,12]); no need to loop through index 0 to 3
Broadcasting
Consider the following summation of arr2D and arr1D arrays
# item-wise addition of arr2D and arr1D
arr_sum = arr2D + arr1D
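Assuming arr2D (shape (2,3)) and arr1D (shape (3,)) as defined earlier, NumPy ‘broadcasts’ the smaller array across the rows of the larger one before the element-wise sum:

print(arr_sum)
>>> [[ 2 8 12]
 [ 3 9 13]]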
Pandas
Pandas is another very powerful scientific package. It is built on top of NumPy and offers
several data structures and functionalities which make (tabular) data analysis and pre-
processing very convenient. Some noteworthy features include label-based slicing / indexing,
(SQL-like) data grouping / aggregation, data merging / joining, and time-series functionalities.
Series and dataframe are the 1D and 2D array-like structures, respectively, provided by Pandas:
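The creation code below is a reconstruction consistent with the outputs shown in the subsequent examples.

import pandas as pd

s = pd.Series([10, 8, 6]) # 1D series with a default integer index
df = pd.DataFrame({'id': [1, 1, 1], 'value': [10, 8, 6]}) # 2D dataframe with labeled columns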
Note that s.values and df.values convert the series and dataframe into corresponding NumPy
arrays.
7 numpy.org/doc/stable/user/basics.broadcasting.html
Data access
Pandas allows accessing rows and columns of a dataframe using labels as well as integer
locations. You will find this feature pretty convenient.
# column(s) selection
print(df['id']) # returns column 'id' as a series
print(df.id) # same as above
print(df[['id']]) # returns specified columns in the list as a dataframe
>>> id
0 1
1 1
2 1
# row selection
df.index = [100, 101, 102] # changing row indices from [0,1,2] to [100,101,102] for illustration
print(df)
>>> id value
100 1 10
101 1 8
102 1 6
print(df.loc[101]) # returns 2nd row as a series; can provide a list for multiple rows selection
print(df.iloc[1]) # integer location-based selection; same result as above
Data aggregation
As alluded to earlier, Pandas facilitates quick analysis of data. Check out the quick example below for group-based mean aggregation:
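# a representative sketch with hypothetical data
df2 = pd.DataFrame({'id': [1, 1, 2, 2], 'value': [10, 8, 22, 24]})
print(df2.groupby('id').mean()) # mean of 'value' within each id group
>>> value
id
1 9.0
2 23.0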
File I/O
Conveniently reading data from external sources and files is one of the strong suits of Pandas. Below are a couple of examples that we will frequently employ in this book.
# reading from excel and csv files
dataset1 = pd.read_excel('filename.xlsx') # several parameter options are available to customize what data is read
dataset2 = pd.read_csv('filename.csv')
This completes our very brief look at Python, NumPy, and Pandas. If you are new to Python
(or coding), this may have been overwhelming. Don’t worry. Now that you are at least aware
of the different data structures and ways of accessing data, you will become more and more
comfortable with Python scripting as you work through the in-chapter code examples.
By now you understand where to write your ML code and how to check the ML output. Let’s
now take a slightly closer look at the ML script that we executed in the earlier section. This
script will give a fairly good idea about what a typical ML script looks like. You will also
understand how Python, NumPy, and advanced ML packages are utilized for ML scripting.
We will study several ML aspects in much greater detail in the next few chapters, but this
simple script is a good start. The objective of this simple script is to take data for an input and
an output variable from a file and build a regression model between them. The first few code
lines take care of importing the libraries that the script will employ.
# import libraries
import numpy as np
from sklearn.preprocessing import PolynomialFeatures # used for data transformations
from sklearn.preprocessing import StandardScaler # used for data transformations
from sklearn.linear_model import LinearRegression # used for building linear models
from sklearn.metrics import r2_score # used for model assessment
import matplotlib.pyplot as plt # used for plotting
We studied NumPy before; another library8 that we see here is Sklearn which is currently the
most popular library for machine learning in Python. Sklearn provides several modules (like
preprocessing, linear_model, metrics that are used here) to handle different aspects of ML.
The last library is matplotlib9 which is used for creating visualization plots. The next few lines
of code fetch raw data.
# read data
data = np.loadtxt('quadratic_raw_data.csv', delimiter=',')
x = data[:,0:1]; y = data[:,1:] # equivalent to y = data[:,1,None] which returns 2D array
Here NumPy’s loadtxt function is used to read comma-separated data in the file
quadratic_raw_data.csv. The data get stored in a 2D NumPy array, data, where the first and second columns contain data for the input and output variables, respectively. NumPy slicing is
8 If any package/library that you need is not installed on your machine, you can get it by running the command pip install <package-name> on the Spyder console
9 Seaborn is another popular library for creating nice-looking plots
used to separate the x and y data. The next part of the script performs input data pre-processing to transform it into a proper form before generating a regression model.
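The pre-processing code is reproduced below in a form consistent with the description that follows (the exact snippet is a reconstruction).

# pre-process data: generate quadratic features and scale them
poly = PolynomialFeatures(degree=2, include_bias=False) # generates [x, x^2]
X_poly = poly.fit_transform(x)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_poly)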
Here, two pre-processing transformations are done on the input data. First, quadratic powers of x are generated, and then both the x and x² values are scaled. You will learn more about these transformations in the next chapter. Next, a regression model is fitted and used to make predictions:
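# fit model and make predictions (again a reconstruction consistent with the surrounding script)
model = LinearRegression().fit(X_scaled, y)
y_predicted = model.predict(X_scaled)
print('fit R2:', r2_score(y, y_predicted))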
# plot predictions
plt.figure(figsize=(4, 2))
plt.plot(x, y, 'o', label='raw data')
plt.plot(x, y_predicted, label='quadratic fit')
plt.legend()
plt.xlabel('x'), plt.ylabel('y')
Although there are many more advanced ML aspects, the above code is very representative of what goes into an ML script. Hopefully, you also got a feel of the powerful capabilities of the Python ML stack.
Virtual environments
Imagine that you wrote a Python ML code two years ago that makes use of Sklearn
packages and which you now want to execute. However, the version of the Sklearn packages currently installed on your computer is most likely different from what you had two years ago. Therefore, your old script may show errors upon
execution (due to changed syntaxes for example) and it can be frustrating to have
to debug the old code due to package version mismatches. This is one of the many
reasons why you may want to have separate Python environments for separate
projects such that these projects have access to separate sets of libraries. Note
that this does not mean that you need to install Python itself multiple times. The
package venv can help create virtual Python environments where different versions
of the same packages can reside ‘peacefully’. In our hypothetical scenario, if you
had used a virtual environment two years ago, you could simply activate that virtual
environment and run your old script without any package-version-related issues.
Using virtual environments is one of the best practices we recommend you should
adopt as you become a more experienced programmer. For more details you can check out the official documentation at docs.python.org/3/tutorial/venv.html.
Summary
In this chapter we made ourselves familiar with the scripting environment that we will use for
writing and executing ML scripts. We looked at two popular IDEs for Python scripting, the
basics of Python language, and learnt how to manipulate data using NumPy and Pandas. In
the next chapter, you will be exposed to several crucial aspects of ML that you should keep
in mind as you plan your ML strategy for any given problem.
Chapter 3
Machine Learning Model Development:
Workflow and Best Practices
A model is the most crucial part of a machine learning-based solution as it encapsulates the
relationships among process variables extracted from data. A universally accepted completely
automated ML tool that can generate good models for all kinds of problems is not yet available.
ML model development is still an art and it takes an artist to create a good model! While it is
true that we don’t necessarily need to know the algorithmic implementation details of the
models, there are several model development stages which require our active participation.
You cannot obtain a good model by just dumping raw data into a model training module. You
need to ensure that you are providing the right data for model training, using the correct performance assessment metric, properly assessing whether the model is overfitting or underfitting, and fine-tuning the model properly if needed.
There are several established best practices that have been devised to help simplify a modeler’s task. You can use these as guidance as you go about developing your ML model. In
this chapter we will study these best practices and learn what it takes to obtain a good model
(as well as how to quantify the ‘goodness’ of a model!). This knowledge will help us when we
start working with different models in the next parts of the book. Specifically, the following
topics are covered
• Transforming raw process data to obtain better model training dataset
• Overview of performance metrics and best practices for reporting model performance
• Diagnosis of overfitting and underfitting using validation curves
• Tackling overfitting via regularization
• Best practices for model fine-tuning and hyperparameter optimization
The prime objective of the ML modeling task is to obtain a model that generalizes well, which
means that the model performs satisfactorily when supplied with data different from that used
during model training. Achieving this takes more than just executing a ‘model = <some
ML_model>.fit()’ command on the available data. In Chapter 1, we had a brief introduction to the ML modeling workflow. In this chapter, we will learn about the different components of that workflow in more detail. Figure 3.1 lists the common tasks you can expect yourself to carry
out while building a ML model. While separate books can be written on each of these tasks,
we will cover enough details on each of these to get you started on your ML journey.
As you can see, there are four main steps in the workflow. In the first step, the available raw
data are pre-processed to ensure a ‘correct’ dataset is supplied to the subsequent model-fitting step. This step can involve data cleaning (discarding irrelevant variables or
corrupted/incomplete data samples) and data transformation (generating new features or
modifying existing ones). While we will study data-transformation techniques in this chapter,
data selection techniques are treated extensively in the next chapter. The next step of the
workflow involves selection of a suitable model and estimating its parameters. While
parameter estimation is handled by off-the-shelf model libraries available in sklearn, model
selection would require your judgement based on your expert knowledge of the specific
process system. A practical guideline for model selection is Occam’s razor which advices
selection of simplest model that generalizes well. After training, the model is evaluated to
validate the modeling assumptions and assess model’s generalization power. Model
evaluation step often provides good hint about which part of the model (or hyperparameters)
to tweak, which leads to the last step of model tuning. Specialized techniques are available to
assist model tuning. With this broad overview of the workflow, let’s now start learning about
these techniques in detail.
Process data are seldom available in ‘ready-to-model’ format. The raw dataset may contain
highly correlated variables or may have variables with very different value ranges. These data
characteristics tend to negatively influence model training, and therefore, warrant data
transformation to de-correlate or properly scale data. Sometimes, creating new features10 out
of provided variables can help to improve model accuracy. Overall, the aim of this pre-
processing step is to transform the given variables such that model training and model accuracy improve. Let’s look at some commonly employed transformation techniques.
The most common technique is standard scaling or standardization, where each variable (in the training dataset) is transformed to zero mean and unit standard deviation. Within an ML script, you will transform the test data using the computed mean and standard deviation of the training dataset. The illustration below shows how sklearn facilitates standardization:
10 Feature and (independent) variable mean the same thing. While in ML literature, ‘feature’ is used to denote variables which go into predicting outputs, the term ‘independent variable’ is commonly used for this in statistics.
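A minimal sketch, assuming X_train and X_test are NumPy arrays of training and test inputs:

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler().fit(X_train) # learns the mean and standard deviation of each variable
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test) # test data scaled with training statistics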
Another popular technique is min-max scaling, also referred to as normalization. Here, each
variable is transformed to a common range of values (typically 0 to 1). For 0 to 1 normalization,
each variable is subtracted by its minimum value and then divided by its range. This is accomplished in Sklearn as follows:
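Again a minimal sketch with the same assumed arrays:

from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler() # default feature range is 0 to 1
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)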
Although standard scaling is often the default choice, there is no straightforward answer to which scaling method is better; either of the methods will usually work fine. Standard scaling is preferred if variables are Gaussian distributed or if there are (a few) extreme outliers, which can skew the ‘min’ and ‘range’ estimates during normalization.
11 A class is like a blueprint defining the properties and behavior (functions) that are common among its objects. In OOP architecture, we can define a class once and then create its objects as many times as we want.
$x_{\text{MAD scaling}} = \dfrac{x - \mathrm{median}(x)}{\mathrm{MAD}(x)}$
Figure 3.2 shows that under MAD scaling, normal data does get centered around zero!
Figure 3.2: Application of standard and robust MAD scaling on data contaminated with outliers. The solid red line denotes the preferred centered location of 0. Complete code for generating these plots is available in the online code repository.
Sklearn provides a RobustScaler class for robust scaling which is similar to MAD scaling,
except that it uses interquartile range (IQR) instead of MAD as a robust estimator of data
spread. Similar results can be expected with either of these robust scaling methods.
Feature extraction
Process systems have a lot of measurements which leads to increased data analysis
complexity. However, you will often find that the original variable set can be reduced in dimensionality by extracting features which are combinations of the original variables and contain much of the information in the original variable set. This is a consequence of strong correlation among process variables.
Consider, for example, a dataset where the three original variables (x, y, z) exhibit such a strong
linear relationship that it is possible to transform the 3D measurement
space into a 2D (or even 1D) feature space while retaining most of the information about data
variability. Accomplishing this in Sklearn is as simple as writing the following code.
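A sketch along these lines, assuming X is an N x 3 array holding the (x, y, z) measurements:

from sklearn.decomposition import PCA

pca = PCA(n_components=2)                # project the 3D measurements onto 2 components
scores = pca.fit_transform(X)            # extracted 2D features
print(pca.explained_variance_ratio_)     # fraction of data variability retained per component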
After feature extraction, the rest of the analysis/model building is carried out using the extracted
features. Specific techniques are available to decide the optimal number of extractable features
for linear and nonlinear systems. Details on these techniques, the concept of dimensionality
reduction without 'losing information', the PCA method, and other feature extraction methods such
as ICA and FDA are covered in detail in Part 2 of the book. Apart from lower dimensionality,
feature spaces may exhibit other nice characteristics such as uncorrelated or independent
features; therefore, using these features instead of the original variables can generate
better models.
Feature engineering
While feature extraction is an unsupervised exercise (we do not explicitly specify the
underlying correlation structure; algorithms extract relationships automatically), feature
engineering involves careful and explicit transformation of variables which can assist ML
algorithms in delivering better models. For example, if we examine the raw (x vs y) data in
Figure 3.3, it is clear that a quadratic model may be needed. To accomplish this, we will
engineer an additional feature (x²) from the given x data and then fit a linear model between
{x, x²} and y, as shown in the code below
import numpy as np
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.linear_model import LinearRegression

poly = PolynomialFeatures(degree=2, include_bias=False)  # generates [x, x^2]
X_poly = poly.fit_transform(x[:, None])                  # x, y assumed to hold the raw data

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_poly)

model = LinearRegression().fit(X_scaled, y)              # linear fit between {x, x^2} and y
y_predicted = model.predict(X_scaled)
The advantage in the above example was that instead of doing a nonlinear fit between y and
x, we ended up doing just a simple linear fit. This is the power of feature engineering. It won't
be an exaggeration to say that half of your task is done if you can engineer very relevant and
informative features. Unfortunately, no well-defined generic approach currently exists for
effective feature engineering. The task is system-specific and requires domain knowledge.
Nonetheless, let's look at a couple of feature engineering techniques that are commonly
employed.
Mathematical transformations
These transformations make use of functions like log, square, square root, product, temporal
aggregation (using min, max, standard deviation), sin, tanh, etc. to create custom features.
As seen in the previous illustration, these transformations allow usage of simple models, while
making the effective relationship between the original variables more complex. Since, the
selection of proper mathematical function is the most crucial and still largely a manual task,
automated feature engineering (AFE) is currently an active research area in machine learning.
Quite often, AFE techniques involve generating an exhaustive list of features from commonly
used mathematical functions and then performing variable selection to select potentially useful
features. However, universally accepted, generically applicable AFE methods don’t exist yet
and this makes your domain knowledge-guided feature engineering very important!
One-hot encoding
Most ML modeling algorithms require numerical data. But sometimes you may find yourself
working with categorical data: for example, one of the model input variable may have entries
like ‘plant type A’, ‘plant type B’, ‘plant type C’, etc. To handle such situations, you can convert
the categorical data into numerical values. A naive approach would be to simply replace ‘type
A’ with 0, ‘type B’ with 1, ‘type C’ with 2, and so on. However, this approach is erroneous
unless there is a natural order among the plant type categories (maybe plant types A/B/C
represent small/medium/large categories of plants). ML models may end up giving more
importance to plant type C relative to types A (& B) due to type C’s higher designated value
of 2. The preferred approach is to use one-hot encoding, which does the following
transformation
import numpy as np
from sklearn.preprocessing import OneHotEncoder

x = np.array([['type A'],
              ['type C'],
              ['type B'],
              ['type C']])

ohe = OneHotEncoder(sparse=False)  # sparse=False returns a dense array
X_encoded = ohe.fit_transform(x)   # x in numerical form; the 2nd column corresponds to the 'type B' category
As you can infer, a column has been generated for each unique plant type category
(automatically identified by OneHotEncoder in the above code), and a binary value of 0/1 is
assigned depending on the occurrence of that category in a sample. For example, the 3rd sample
is of 'type B' and therefore the 3rd entry in the 2nd column is assigned a value of 1.
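Sklearn's Pipeline class allows chaining the pre-processing transformers and the final estimator
so that they can be fitted and used with single fit/predict calls. For our quadratic fitting example,
a sketch of such a pipeline (the step names 'poly', 'scaler', and 'model' match those used later in
this chapter; x and y are assumed to hold the raw 1D data):

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.linear_model import LinearRegression

pipe = Pipeline([('poly', PolynomialFeatures(degree=2, include_bias=False)),
                 ('scaler', StandardScaler()),
                 ('model', LinearRegression())])

pipe.fit(x[:, None], y)                  # runs all transformers, then fits the model
y_predicted = pipe.predict(x[:, None])   # transforms the data, then predicts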
The above code gives the same results as our previous code without a pipeline and is more
concise (and potentially easier to manage). Figure 3.4 shows the sequence of steps that get
executed behind-the-scenes when we call the fit and predict methods of the pipeline. Really
convenient, isn't it? Note that a pipeline does not have to include an estimator as its last
element – you can have only transformers and call pipeline.transform() to transform your
data through all the included transformers.
Figure 3.4: Streamlining ML workflow by combining transformers and predictors via pipeline
This completes our quick look into data transformation techniques. The takeaway message is
that before dumping all the raw data into model training, you should always give some thought
to what transformations you can apply to the raw data to help the ML algorithm provide
a good model. Carefully designed transformations can alleviate the need for a large number of
training samples, allow usage of simple models, and reduce model training time. Next, we will
see what techniques can be used to evaluate how well our model is performing.
After model parameters have been fitted, we need some means of assessing the model
performance to see how well (or poorly) the model fits the training data. For this purpose,
commonly computed metrics for regression and classification tasks are listed below.
While it is easy to write our own python code to compute these metrics, it is even easier to
use the sklearn.metrics module where these metrics are all implemented. We can import the
desired metric function from the module and use it as shown below for our quadratic fitting example.
# performance metric
from sklearn.metrics import r2_score
print('Fitting metric (R2) = ', r2_score(y, y_predicted))
Let us now take a quick look into the different metrics listed in Table 1.
Regression metrics
R-squared (R²) is a good measure, technically for linear models, of how well the
predicted output (𝑦̂) compares to the actual y. As seen in the metric formula table below, R²
measures how much of the variability in the output variable can be explained by the model. An R²
of 1 indicates a perfect fit.
Table 2: Formulae for common regression metrics. The summation is carried out over the N
samples; 𝑦̄ is the mean of y.
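The standard definitions are:

R² = 1 − ∑ᵢ(𝑦ᵢ − 𝑦̂ᵢ)² / ∑ᵢ(𝑦ᵢ − 𝑦̄)²
MSE = (1/N) ∑ᵢ(𝑦ᵢ − 𝑦̂ᵢ)²
RMSE = √MSE
MAE = (1/N) ∑ᵢ|𝑦ᵢ − 𝑦̂ᵢ|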
R² does not directly convey the magnitude of prediction errors. For this, the other metrics can
be used. RMSE is the most popular metric for evaluating regression models and is computed
as the square-root of the average of squared prediction errors. A related metric is the MSE, which does
not take the square-root. MAE is similar to MSE, except that it averages the absolute values
of errors instead of their squares. Any of these metrics can be used to compare different models: the smaller the values
of these metrics (except R², for which higher is better), the better the model fit.
Several Sklearn models have their own score methods which compute an
evaluation metric suitable for the problem they are designed to solve. For
example, the score() method of a LinearRegression object returns the R²
metric. Pipelines have a score method as well, which simply executes the
score method of the final model. For our quadratic fitting example, it can be
used as follows
# compute metric using score method
R2 = pipe.score(x, y) # no need to import metrics module
Classification metrics
While in regression problems we care about how close the predicted values are to the actual
values, in classification problems we care about an exact match between the actual and
predicted labels. Before we look at different classification metrics, let's first learn about the
confusion matrix. The figure below shows a sample confusion matrix for a binary classification
task; this matrix provides a comprehensive overview of how well a model has correctly (&
incorrectly) classified samples belonging to the positive and negative classes. The shown
matrix indicates that out of 25 (21 + 4) samples that actually belonged to the negative class, the
model correctly labeled 21 of those.
A confusion matrix can be used to compute several single-valued metrics, summarized in the
metric formula table below. The most straightforward metric is 'accuracy', which simply
communicates the ratio of correct predictions to the total number of predictions. Accuracy can,
however, be a misleading metric for imbalanced datasets (where more samples belong to one
particular class). For example, in our example, the overall accuracy is ~95% (88 out of 93), but
the accuracy in correctly detecting the negative class is only 84% (21 out of 25).
Therefore, several other metrics are frequently employed to convey different aspects of a
model's performance.
The precision metric returns the ratio of the number of correct positive predictions to the total number
of positive predictions made by the model, while recall returns the ratio of the number of correct positive
predictions to the total number of positive samples in the data. For a process monitoring tool
(where a positive label implies presence of a process fault), recall denotes the model's ability to
detect process faults, while precision denotes the accuracy of the model's fault predictions.
A model can have high recall if it detects process faults successfully, but the occurrence of lots of
false alarms will lower its precision. In another scenario, if the model can detect only a specific
type of process fault but gives very few false alarms, then it will have low recall and high
precision. A perfect model will have high values (close to 1) for both precision and recall.
However, we just saw that it is possible to have one of these two metrics high and the other
low. Therefore, both metrics need to be reported. However, if only a single metric is desired,
then the F1 score is utilized, which returns a high value if both precision and recall are high and a
low value if either precision or recall is low. FPR (false positive rate) looks at the fraction of
negative samples that get classified as positive. In our example, a high FPR of 0.16 slightly
spoils the very rosy picture painted by the accuracy, precision, and recall numbers!!
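A quick sketch of computing these metrics with sklearn.metrics, assuming y_true and y_pred
hold the actual and predicted binary labels:

from sklearn.metrics import confusion_matrix, precision_score, recall_score, f1_score

cm = confusion_matrix(y_true, y_pred)              # rows: actual class, columns: predicted class
tn, fp, fn, tp = cm.ravel()
print('Accuracy =', (tp + tn)/(tp + tn + fp + fn))
print('Precision =', precision_score(y_true, y_pred))
print('Recall =', recall_score(y_true, y_pred))
print('F1 =', f1_score(y_true, y_pred))
print('FPR =', fp/(fp + tn))                       # no direct FPR function in sklearn.metrics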
For a multiclass problem, the confusion matrix would look something like the plot below, where
there is a row and a column for each class type. Several important clues can be obtained from
a multiclass confusion matrix plot. For example, in the plot below, it seems that some samples
belonging to classes 1 & 2 tend to get classified as class 3, which suggests that some additional
features should be added to help the model better distinguish between classes 1 & 2 and
class 3. Metrics like precision can be computed via a one-vs-all approach, where you would
choose one particular class as the positive class and the rest of the classes as the negative class.
Setting aside a portion of the data for model evaluation is called the holdout method. A good rule of
thumb is to keep 20% of the data in the test set. Holdout can be accomplished in Sklearn using the
train_test_split utility function, as shown below for our quadratic fit example
# train-test split (80/20)
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=1)

# performance metrics
from sklearn.metrics import mean_squared_error as mse
The random_state parameter in the train_test_split function controls the shuffling of the dataset
before applying the split. Shuffling the dataset is a good practice to ensure that there is no
unintentional data pattern between the training and test datasets. For example, assume a dataset
comes from a manufacturing plant and is sorted by production rate: without shuffling, we may
end up with only low-production-rate data in the training set, which is undesirable.
After an estimate of model performance has been made using test data, the
model may be re-trained, if desired, using the complete dataset. Since an
implicit assumption is always made that both the training and test datasets
follow the same distribution, the model's generalization performance will only
get better when re-trained with more data.
Residual analysis
Residual analysis is another way to validate the adequacy of (regression) models. It involves
checking the model errors/residuals to ensure there are no undesired patterns in them.
The presence of any obvious trend in the residuals indicates a need for further data pre-processing
or a change of model. A convenient way to study the residuals is to simply draw scatterplots.
Ideally, residuals would be small, centered around zero, independent of model inputs and
outputs (predicted and measured), and uncorrelated in time (for dynamic models). An ideal
residual plot would look something like this
In the above plot, no systematic pattern is apparent. If systematic patterns are present in a
residual plot, it implies that the model has not been able to extract all the systematic patterns
in the dataset. The specific patterns observed also often provide clues on potential remedies.
Figure 3.5 shows some examples of 'bad' residual plots.
In subplot (a), the residuals show a nonlinear pattern, which immediately conveys that
the model is not able to adequately capture nonlinearities in the data. Nonlinear features may
need to be included, or a different model that can capture complex nonlinearities is warranted.
In subplot (b), heteroscedasticity is apparent as the residuals get larger for higher values of the
predicted output. Fixing this issue often entails an appropriate transformation of variables, such
as a log transformation. Heteroscedasticity may also indicate that a variable is missing. In
subplot (c), while most of the samples show small residuals, several samples have high
positive residuals. The suggested remedy in this case is the same as that for
heteroscedasticity. Finally, in subplot (d), a group of samples appears as outliers. These
outliers may need to be removed or further analyzed to check if they are normal samples; if
the samples turn out to be normal, then the model may need to be changed to account for the
presence of these distinct groups of data.
For dynamic datasets, autocorrelation and cross-correlation (between residuals and past
input/output data) function plots are used to check if the residuals are uncorrelated. Residual
histograms or normal probability plots are also used to check if the residuals are normally
distributed. If the uncorrelatedness or normality assumptions are found to be violated, these plots, like
scatter plots, give crucial hints on potential remedies.
We can see that residual analysis is a powerful diagnostic tool. As a best practice, you should
always check the residual plots to ensure no modeling assumptions have been violated or no
obvious room for improvement exists.
Before we send our model for generalization assessment using the test dataset, we need to
ensure that the obtained model is the best model we can obtain. Or, if our model does not
provide adequate generalization performance, some model adjustments are warranted.
These tasks fall under the 'model tuning' step of the modeling workflow and primarily involve
optimizing the values of model hyperparameters. Model hyperparameters, unlike model
parameters (which are the unknowns estimated during model fitting), are model
configuration variables which need to be specified before model fitting. For our quadratic fitting
example, the weight coefficients of the linear model are the model parameters and the
polynomial degree is the hyperparameter. For ANN models, the number of neurons and layers
are important hyperparameters, while the weight coefficients are estimable model parameters.
Finding optimal hyperparameter settings is also referred to as model selection in the ML
literature.
Figure 3.6 shows the resulting model curves when our pipeline is fitted with different
polynomial degrees. We can see that the model with a high degree fits the training data better,
but the fitting is a bit 'too much'! We would not be comfortable accepting this model even though
it has a very low training error. How do we then select the optimal hyperparameter values before
we make the final evaluation using test data? How can we diagnose if our model is fitting the data
too much or too little? What steps can we take to ensure that neither of these two undesirable
situations occurs? We will answer these important questions in the next few subsections.
To plot a validation curve, we will use a dataset that is kept separate from the data used for
model fitting. We will call it the validation dataset, and the data used to fit the model the fitting or
estimation dataset (you may wonder, 'what happened to the test dataset?' Hold onto that
question; we will explain shortly). Now, a validation curve is simply a plot of model error (or
any performance metric) on the fitting and validation datasets for different hyperparameter values,
as shown in Figure 3.7. When hyperparameters are set such that the model does not have enough
complexity or 'handles' to be able to describe the inherent structure in the data, modeling errors
will be large for both the fitting and validation datasets, leading to underfitting. As model complexity
is increased, the error on the fitting dataset will decrease; however, the resulting model may not
generalize well to the validation data, leading to overfitting. Therefore, overfitting is
characterized by small fitting errors and large validation errors. The sweet spot, where both
fitting and validation errors are small and validation errors just start increasing, corresponds
to the optimal hyperparameters.
If your model is underfitting, your recourse should be to increase model complexity (for
example, by adding more depth in ANNs or using higher degree polynomials). You may also
explore adding more informative features. For overfitting, constrain model flexibility to prevent
it from learning noise or random errors in the fitting dataset. Removing irrelevant features or
adding more fitting data can also help. If the optimal validation error is still very high, then a
change of model may be needed (for example, using SVMs instead of decision trees).
To generate the validation curve, you can use Sklearn's utility function validation_curve,
available in the model_selection module, or write your own code as shown below.
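A sketch using the utility function (pipe, x, and y are assumed from the quadratic fitting example;
note that validation_curve uses k-fold cross-validation internally):

import numpy as np
from sklearn.model_selection import validation_curve

degrees = np.arange(1, 6)
fit_scores, val_scores = validation_curve(pipe, x[:, None], y, param_name='poly__degree',
                                          param_range=degrees,
                                          scoring='neg_mean_squared_error', cv=5)
fit_MSEs = -fit_scores.mean(axis=1)   # negated: sklearn scorers follow 'higher is better'
val_MSEs = -val_scores.mean(axis=1)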
Figure 3.8 shows the validation curve for our quadratic fitting example dataset. Although the
distinction between fitting and validation MSEs is not very pronounced, a 2nd degree polynomial
seems to be the optimal hyperparameter value.
A few more interesting things to note here. While validation MSEs are expected to be higher
than fitting MSEs in general, here we find them to be lower for the first 2 polynomial degrees.
Also, if you extend this validation curve to even higher degrees, you will find the validation MSEs
decreasing instead of continuously increasing! So, does this mean that we can't trust
inferences made using validation curves? You can actually trust these curves. What is
happening here is that, due to the small dataset size, the validation dataset just happened to have
data-points which show these unexpected behaviors. In fact, for small datasets, specialized
multi-fold validation techniques exist, which we will learn about shortly.
Train/validation/test split
We used up the dataset that we had kept aside from the fitting dataset for model validation.
We can no longer use it as a test dataset for generalization assessment (since this dataset
has already been 'seen' by the model and thus is no longer an unseen dataset). The solution
is to set aside another chunk of the dataset, as shown below. This approach is called the 3-way
holdout method.
To implement 3-way holdout, we need to call the train_test_split function twice, as shown below

# train-validate-test split (~56% fit / 24% validation / 20% test overall)
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=1)
x_fit, x_val, y_fit, y_val = train_test_split(x_train, y_train, test_size=0.3, random_state=1)
After model selection is completed, you may, if desired, retrain your model
with the selected optimal hyperparameter settings on the combined fitting and
validation dataset. Moreover, as mentioned before, after model performance
assessment has been made using test data, the final model may be obtained
by retraining on the entire dataset.
K-fold cross-validation
The 3-way holdout method is an elegant way of ensuring that model tuning and final
assessment are based on unbiased estimates of model performance (i.e., using data not seen
during model fitting and model training, respectively). However, this method has several
drawbacks as well. If our original dataset is not large enough, we may not have the
luxury of having 3 distinct subsets of data. Moreover, the performance estimates can be highly
influenced by which data-points make up the subsets and thus may lead to unfortunate
selection of hyperparameters and a skewed final assessment.
The popular alternative is k-fold cross-validation. The idea is simple. Instead of having
a single test (or validation) subset, multiple subsets are created, and the overall performance
estimate is computed as the average of the performances on the individual subsets. Figure 3.9
illustrates this idea, where k-fold cross-validation is used for both final performance
assessment and model tuning.
Figure 3.9: Nested K-fold cross validation for model tuning (inner loop) and final performance
assessment (outer loop)
Very often, k-fold cross-validation is only employed for model tuning, and the final performance
assessment is made using a single test holdout set. This conventional cross-validation
approach is illustrated in Figure 3.10.
Let's apply the k-fold cross-validation technique to find the optimal polynomial degree for our
illustration dataset. We will use the KFold utility class provided by Sklearn, which returns
the indices of the data-points that belong to the fitting & validation subsets in each of the k splits.
import numpy as np
from sklearn.model_selection import KFold
from sklearn.metrics import mean_squared_error as mse

kf = KFold(n_splits=5, shuffle=True, random_state=1)
X = x[:, None]                                   # x, y, and pipe assumed from earlier

max_polyDegree = 5
overall_fit_MSEs, overall_val_MSEs = [], []
for poly_degree in range(1, max_polyDegree+1):   # loop over hyperparameters
    pipe['poly'].degree = poly_degree            # set degree on the PolynomialFeatures step
    split_fit_MSEs, split_val_MSEs = [], []
    for fit_idx, val_idx in kf.split(X):         # loop over the k splits
        pipe.fit(X[fit_idx], y[fit_idx])
        split_fit_MSEs.append(mse(y[fit_idx], pipe.predict(X[fit_idx])))
        split_val_MSEs.append(mse(y[val_idx], pipe.predict(X[val_idx])))
    overall_fit_MSEs.append(np.mean(split_fit_MSEs))
    overall_val_MSEs.append(np.mean(split_val_MSEs))
The figure below shows the resulting validation curve, which is more along the expected lines.
Note that Sklearn provides a cross_val_score function which can be used to replace the loop
over the splits in our code above.
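A sketch of that replacement (the scores are negated because sklearn scorers follow the
'higher is better' convention):

from sklearn.model_selection import cross_val_score
split_val_MSEs = -cross_val_score(pipe, x[:, None], y, scoring='neg_mean_squared_error', cv=5)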
K-fold cross-validation is the default cross-validation technique built into several of Sklearn's
functions, such as validation_curve. The downside of this technique is the high computational
cost, as model fitting is carried out multiple times. Therefore, for deep learning-based solutions,
which anyway employ large datasets, the 3-way holdout method is often the preferred choice.
Regularization
If your model is overfitting, one of the things to do is to check the values of the model
parameters (weight coefficients for linear models). You may find some or all of them with very
large values. This is problematic because the model becomes very sensitive to the input variables
with high coefficients, i.e., the model predicts big response changes for even very small variations
in these input variables. This is unlikely to be true for real process systems. For example, the
following are the model coefficients with 10th degree polynomial features for our illustration
dataset. Look at the higher degree coefficients
If you want to utilize all your training data for model fitting, or a validation subset cannot
be set aside for model selection (e.g., when using Gaussian mixture models), the BIC
(Bayesian information criterion) technique can be used, which provides an optimal trade-off
between model complexity (or the number of model parameters) and model accuracy. We have seen
that model accuracy on training data can be increased by adding more model
parameters, but this can also lead to overfitting. BIC resolves this conundrum by
minimizing a metric that includes a penalty term for the number of model
parameters.
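For a model with k parameters fitted on N samples with maximized likelihood L̂, the
standard definition is

BIC = k ln(N) − 2 ln(L̂)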
We can see that a model with high accuracy/likelihood and fewer parameters will
result in a low BIC value, and hence, the optimal model corresponds to the minimum BIC
value. BIC owes its popularity to its effectiveness and low computational cost: it
has been shown theoretically that, for sufficiently large training data, if the correct
model is included among the set of candidate models, then BIC is guaranteed to
select it as the best model. We will see an application in Chapter 8.
Another popular model selection metric is AIC (Akaike information criterion),
defined as AIC = 2k − 2 ln(L̂). The penalty term is smaller in AIC; consequently,
AIC tends to select more complex models compared to those selected by BIC. Do
remember that more complex models might overfit while simpler models might
underfit. Therefore, the selection between the AIC and BIC methods is problem dependent.
Therefore, to keep model parameters 'simple', modeling algorithms are tweaked to decrease
the likelihood of overfitting. This modulation is called regularization and is a widely used
approach to tackle overfitting. You should always employ some form of regularization in your
model.
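For a linear model with coefficients 𝜃ⱼ, the modified fitting objective takes the standard form

min  ∑ᵢ(𝑦ᵢ − 𝑦̂ᵢ)² + 𝛼 ∑ⱼ 𝜃ⱼ²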
The additional term penalizes the algorithm for choosing high coefficient values. Therefore,
during model fitting, the algorithm will be forced to minimize fitting error while keeping
coefficients small. This version of regularization is called L2 or ridge regularization. The ‘𝛼’
term is a hyperparameter which modulates the magnitude of the penalty term.
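The corresponding standard form with an absolute-value penalty is

min  ∑ᵢ(𝑦ᵢ − 𝑦̂ᵢ)² + 𝛼 ∑ⱼ |𝜃ⱼ|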
Here, the penalty term is simply proportional to the summation of the absolute values of the
coefficients. This is called L1 or lasso regularization. Let's see how these mechanisms help
us with the overfitting problem in our 10th degree polynomial fitting. L1/L2 regularization can be
implemented by just switching the linear model in the pipeline with the regularized linear
models as shown below
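A sketch of such pipelines (the 𝛼 values shown are assumed for illustration):

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.linear_model import Ridge, Lasso

pipe_L2 = Pipeline([('poly', PolynomialFeatures(degree=10, include_bias=False)),
                    ('scaler', StandardScaler()),
                    ('model', Ridge(alpha=0.1))])
pipe_L1 = Pipeline([('poly', PolynomialFeatures(degree=10, include_bias=False)),
                    ('scaler', StandardScaler()),
                    ('model', Lasso(alpha=0.01))])
pipe_L2.fit(x[:, None], y)
pipe_L1.fit(x[:, None], y)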
Figure 3.11 confirms that regularization works!! To see what is happening under the hood,
let's check the model coefficients of the regularized models
print(pipe_L2['model'].coef_)
>>> [ 0.27326  3.49270 -0.35861  0.59074 -1.20522  1.90652 -1.80640  0.86690  0.87244
 -3.20685]
print(pipe_L1['model'].coef_)
While both L1 and L2 regularization ensure that model coefficients remain small, Lasso has
actually pushed several coefficients to zero. This is representative of the general behavior
of Lasso, which is often used for variable selection as the coefficients of irrelevant variables
get set to zero. The advantage of L2 is that it is computationally favorable, as its penalty term
is differentiable, unlike L1's. The choice between L1 and L2 is problem dependent
and requires trial-and-error.
L1 and L2 regularizations can be combined into elastic-net regularization. Not all forms of
regularization involve putting penalty terms in the objective function. You will study methods like
dropout, batch normalization, and early stopping in Chapter 11, which help ensure model
coefficients remain 'stable' to avoid overfitting.
You will have noticed that we have talked about regularizing only the weight
coefficients and not the intercepts. This is true in general. Large intercepts (or
biases in neural networks) will not cause large response changes upon small
variations in inputs, and therefore, regularizing them is often not needed.
Choosing value of 𝜶
The ‘𝛼’ term is an important hyperparameter which should be carefully specified before model
fitting. Very small 𝜶 will make regularization ineffective potentially leading to overfitting. Very
high value will push the model coefficients to zero causing underfitting. The optimal value is
difficult to know before-hand and need to found via cross-validation. However, a general
recommendation is to keep 𝜶 low rather than high.
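A sketch of such a grid search over the polynomial degree (the parameter grid values are
assumed; pipe, x, and y are from the quadratic fitting example):

import numpy as np
from sklearn.model_selection import GridSearchCV

param_grid = {'poly__degree': np.arange(1, 6)}
gs = GridSearchCV(pipe, param_grid, scoring='neg_mean_squared_error', cv=3)
gs.fit(x[:, None], y)
print(gs.best_params_)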
As expected, GridSearchCV also returns the optimal polynomial degree as 2. Note that we
are using 3-fold cross-validation here, as specified via cv=3. By default, GridSearchCV refits
the estimator on the complete supplied dataset using the best-found hyperparameters. The
best estimator can be obtained via gs.best_estimator_ for subsequent predictions
(predictions can also be made via the gs.predict method). In the above code we explicitly supplied
mean squared error as the scoring metric; the default behavior is to use the estimator's default
scorer, which would be R-squared for linear regression in sklearn.
This concludes our quick look into the standard ML model tuning approaches and ML
modeling workflow. You now know how raw data should be transformed before model fitting,
how to evaluate and report model’s performance, and how to find optimal model
hyperparameters during model tuning.
Summary
In this chapter, we learnt the best practices of setting up a ML modeling workflow. During the
course of learning the workflow, we saw several nice utilities (pipelines, kFold, grid-search,
etc.) provided in Sklearn library which make ML scripting easier and manageable. The
takeaway message from this chapter should be that ML modeling is a collaborative work
between you and machine. This chapter exposed you to your side of the task!!
Chapter 4
Data Pre-processing: Cleaning Process Data
A universal statement about machine learning is that more data is always better! However,
not all data-points and/or variables in your dataset are necessarily good for your model. You
may wonder why? The reason is that industrial process data inevitably suffer from issues of
outliers, measurement noise, missing data, and irrelevant variables. While irrelevant variables
become part of the modeling dataset often due to imprecise knowledge of the process system at
hand (for complex processes, it may not be straightforward to manually choose the right
set of model variables right away), other data impurities creep in due to failures of sensors
and data acquisition systems, process disturbances, etc. These impurities negatively impact
the quality of our ML models and can lead to incorrect conclusions.
Fortunately, several methods exist which can be used for automated data cleaning. We will
study these methods in this chapter. Specifically, the following topics are covered
• Removing measurement noise
• Selection of relevant variables for improved model performance
• Outlier removal in univariate and multivariate settings
• Common techniques for handling missing data
Process measurements are inevitably contaminated with high-frequency noise. If not dealt
with appropriately, noise can result in errors in model parameter estimation or
unfavorable characteristics in predicted variables. For example, in Figure 4.1, we can see the
noisy fluctuations in a flow measurement signal. These kinds of fluctuations are generally not
real process variations and are just an artifact of the measurement sensors. If not removed, they
may result in noisy fluctuations in predicted variables, let's say product purity, which would be
very undesirable.
Figure 4.1: Time series signal smoothed by simple moving average (SMA) and SG filters
Two common ways to de-noise process signals are to smooth them using moving-window
averages or Savitzky-Golay (SG) filtering. Figure 4.1 shows the smoothed flow signals. We
can see that crucial process variations have been preserved while noisy fluctuations have
been removed! Let's quickly study the two techniques and learn how to implement them.
In moving-window averaging, the smoothed value at time t is obtained by combining the raw
measurements within a window of past values. In the most common variant, simple moving
average (SMA), the combination is a simple average as shown below

𝑥(𝑡)_smoothed = (1/𝑊) ∑_{𝑗=0}^{𝑊−1} 𝑥(𝑡 − 𝑗)_raw
where W is the window size. To generate the SMA-smoothed signal in Figure 4.1, Pandas
was utilized as follows
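A minimal sketch (flow_signal is assumed to be a 1D array of raw measurements; the window
size W = 15 is an assumed setting):

import pandas as pd

flow = pd.Series(flow_signal)
flow_SMA = flow.rolling(window=15).mean()   # SMA with W = 15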
Alternatives to SMA are LWMA (linearly weighted moving average) and EWMA (exponentially
weighted moving average), where the past measurements are weighted in a finite- and infinite-
length window, respectively. The value of W in the finite-window-length variants depends on 'how
much' smoothing is desired and is guided by knowledge of the underlying process system
(the window size should ideally be smaller than the period over which systematic process
variations occur) or found through trial and error. As would be obvious, a larger W achieves more
smoothing.
SG filter
In SG filtering, the smoothed value at any point t* is realized by approximating the values of x
in an odd-sized window centered at t* with a polynomial of time (t) and then evaluating the
polynomial at t*. Specifically, at point t*, an mth order polynomial of the following form is fitted

𝑥̂(𝑡) = ∑_{𝑘=0}^{𝑚} 𝑏_𝑘 (𝑡 − 𝑡*)^𝑘

and the smoothed value is obtained as 𝑥(𝑡*)_smoothed = 𝑥̂(𝑡*) = 𝑏_0,
where 𝑏_𝑘 are the estimated polynomial coefficients. In Figure 4.1, the SG filter was implemented
using Scipy with 2nd order polynomials as follows
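A minimal sketch (the window length and signal name are assumed settings; the window
length must be odd):

from scipy.signal import savgol_filter

flow_SG = savgol_filter(flow_signal, window_length=15, polyorder=2)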
Note that, under the default setting, the (W-1)/2 values at the extreme edges are handled by
fitting a polynomial to the W values at the edges and evaluating this polynomial to get
smoothed values for the (W-1)/2 data-points nearest to the edges. You are encouraged to
check the official documentation12 for other available settings.
The SG filter enjoys the advantage of preserving the important features of the raw signal better
than moving-window average filters. SMA has another drawback which is apparent in Figure
4.1: if we observe closely around sample 300, where a step change in flow occurs, the SMA-
smoothed signal shows a small time-delay/offset, unlike the SG-smoothed signal, which follows
the original signal very closely during the step change. This time offset can become
problematic for certain problems. SMA, however, enjoys a computational advantage as it is
faster to compute.
SMA and SG filters belong to the category of low-pass filters which block high
frequency component of raw signals and allow low frequencies to pass
through. This effectively removes spurious fast transitions, spikes, and noise
while retaining the slowly evolving systematic variations. On the other hand,
high-pass filters block low frequency components and allow high frequencies.
These are often utilized to remove long-term slow drifts from process signals.
There are other, more advanced methods as well, such as wavelet filters and LOWESS smoothing,
which can be explored if SMA or SG filters don't provide satisfactory performance. Irrespective of the
method deployed, you should pay due attention to tuning the filters to ensure that the smoothed
signal is acceptable around edges, transients, and step changes in the original signal.
Imagine that you are tasked with building a regression (or classification) model for predicting
product quality in a large petrochemical plant. For such complex and highly integrated
processes, it would not be surprising if you don't have enough process insight into which
variables affect product quality, and therefore you end up with hundreds of variables as
potential inputs to your model. The problem with the inclusion of irrelevant inputs is that model
training (and update) time increases, the model becomes prone to overfitting, and there is a
greater requirement of training samples to adequately represent the model input space.
Moreover, an unnecessarily bulky input set makes physical interpretation of the model difficult.
12
https://docs.scipy.org/doc/scipy/reference/generated/scipy.signal.savgol_filter.html
The task of selecting the most relevant subset of inputs/features is called variable selection or
feature selection. A trivially optimal way of variable selection is to explore all possible
combinations of input variables and select the one that gives the best performance on the
validation dataset. However, this can be computationally infeasible when dealing with a large
number of input candidates. To circumvent these issues, several variable selection methods
have been devised which try to find a good subset of relevant model inputs. Figure 4.2 lists
some of these methods that are commonly employed. In the next few subsections, we will
learn these and try to understand their strengths and drawbacks.
As you can see, the methods are divided into three categories depending on the employed
search strategy for the relevant inputs. Filter methods use statistical measures such as
correlation coefficients and mutual information to quantify the relevance of any input w.r.t. the
target variable. Wrapper methods utilize the model to quantify the predictive power of different
input subsets via metrics like MSE, BIC, etc. In the third category, the embedded methods
perform variable selection during the process of model fitting itself. Before we delve deeper
into these methods, let's take a quick look at the dataset we will use in this section.
Illustration dataset
The dataset is provided in the file VSdata.csv and has been simulated using a mechanism
(with slight modifications) devised by Chong & Jun13. The mechanism was designed to ensure
that the dataset mimics a real manufacturing process with interconnected unit processes.
There are 40 input variables, out of which 10 (inputs 17 to 26) are known to be relevant while
the rest are irrelevant to the target variable. The fitting and validation datasets contain 1000 and
250 samples, respectively. The target samples were generated using the relevant inputs as
follows
𝑦_sample = ∑_{𝑖=17}^{26} 𝑐_𝑖 𝑥_{𝑖,sample} + 𝜀_sample
13
Chong & Jun, Performance of some variable selection methods when multicollinearity is present, Chemometrics and
Intelligent Laboratory Systems, 2005
where 𝑐_𝑖 are fixed coefficients and 𝜀_sample is measurement noise. All the inputs are generated
from a multivariate normal distribution with zero mean and covariance matrix 𝛤, which is designed
to induce strong correlations among neighboring process inputs.
Figure 4.3 shows the target data and the input data in the training dataset. From these plots,
it is not immediately clear whether an input is (more) relevant for prediction of the target or
not. A multivariate linear regression with all the inputs gives an R² value of 0.614 on the
validation dataset, while using only the 10 relevant inputs results in a slightly better R² of 0.634. Let
us see if our variable selection techniques can correctly find the relevant inputs in the dataset.
Figure 4.3: Target vs input plots for a simulated manufacturing process dataset
Filter methods
In the most common form of filter methods, the relationship between each input and the target
is estimated using some statistical measure, and the inputs are then ranked according to the
estimated relationship strengths. Once ranked, the top-ranked variables can be picked. Figure
4.4 shows an overview of the strategy.
Correlation coefficients
For continuous inputs and target, the common statistical measure is the Pearson correlation
coefficient, which measures the linear correlation between an input and the target and is given
by

𝑅_𝑥𝑦 = ∑ᵢ(𝑥ᵢ − 𝑥̄)(𝑦ᵢ − 𝑦̄) / √( ∑ᵢ(𝑥ᵢ − 𝑥̄)² ∑ᵢ(𝑦ᵢ − 𝑦̄)² )
For variable selection, one can pick the top k variables with the highest correlations, where k is
known beforehand or decided via cross-validation, or select all variables with correlations
significantly different from zero.
Mutual information
Correlation coefficients can be deceptive for nonlinear dependencies between the target and
inputs and can lead to wrong variable selections. The solution is to use the 'mutual information
(MI)' metric, which can efficiently quantify any kind of dependency between variables.
Mathematically, MI is given by

𝑀𝐼(𝑥, 𝑦) = ∬ 𝑝(𝑥, 𝑦) log[ 𝑝(𝑥, 𝑦) / (𝑝(𝑥)𝑝(𝑦)) ] 𝑑𝑥 𝑑𝑦
where 𝑝(𝑥, 𝑦) denotes the joint probability density of variables x and y, and 𝑝(𝑥) denotes the
marginal probability density of x. From process data, these densities can be estimated using the
k-nearest neighbors method (used by sklearn), KDE, or histograms.
Figure 4.5: Linear correlation and mutual information metrics for zero, linear, and nonlinear
dependencies between target and input
Consider the scenarios in Figure 4.5. The Pearson correlation coefficient fails to identify the
nonlinear (sinusoidal) dependency between target and input, but mutual information is able to
determine the dependencies correctly. Colloquially speaking, MI measures the amount of
information about a variable (y) that is provided by another variable (x). Therefore, if a target
and an input are independent, then no information about y can be obtained from knowing x,
and their MI would be 0. On the other hand, if y depends on x, then MI > 0, and a
higher value implies a higher dependency.
As you would have noticed, filter methods do not entail building any predictive models and
therefore are fast to execute. Filter methods are particularly favored when the number of input
candidates is very large. However, the quality of the 'relevant' input subset may be low. This is
because filter methods consider the inputs in isolation and don't consider the interactions
among the inputs. An input that is ranked low may provide a significant boost in predictive
accuracy when combined with some other input. Therefore, these methods are often used as
a first step of variable selection to remove only the very low-ranked variables, and the remaining
variables are further screened via other, more sophisticated methods.
Sklearn implementation
Sklearn provides the f_regression and mutual_info_regression methods to compute the
statistical metrics we need. f_regression computes the linear correlation between each input
variable and the target, which is then converted into an F score. F scores come from the F-test (you
may remember it from your statistics classes), which statistically checks whether the
difference in modeling errors between 2 linear models (one relating the target to just a constant
and the other to an input and a constant) is significant or just happened by chance. If this
does not ring any bell, don't worry. It suffices to know that the ranking obtained through F
scores is the same as that obtained using correlation coefficients.
# read data
import numpy as np
VSdata = np.loadtxt('VSdata.csv', delimiter=',')
# separate X and y
y = VSdata[:,0]
X = VSdata[:,1:]
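The selection step itself can be sketched as follows (using F scores; the MI-based variant is
analogous with mutual_info_regression):

from sklearn.feature_selection import SelectKBest, f_regression

selector = SelectKBest(f_regression, k=10).fit(X, y)
top_inputs = np.where(selector.get_support())[0] + 1   # 1-based input numbers
print(top_inputs)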
For our simulated dataset, inputs 22, 21, 24, 20, 30, 29, 32, 31, 28, 27 got selected as the top
10 relevant variables, i.e., only 4 inputs got correctly picked. We can see that the selection
is not that good, primarily due to the multivariable dependence of our target. Let's see if
wrapper methods can do any better.
Wrapper methods
Figure 4.6 gives an overview of the wrapper approach to feature selection. Here, unlike filter
methods, a predictive model is used to find the relevant features. The method begins with an
initial selection of a feature subset and enters a loop. In each cycle of the loop, a new model is
built with the currently selected subset, and the decision to include/exclude a feature from the
subset is based on the model's performance or some model attributes. The specific strategy
employed distinguishes the different sequential methods. The loop concludes once the
desired number of relevant features is obtained or some criterion on model performance is
met. Note that the model performance may be assessed using a validation dataset or the fitting
dataset itself using BIC/AIC metrics. Let's study a couple of well-known wrapper methods.
PLS is a very popular regression method for process systems, which we will study in detail in
Chapter 5. We mention it here because PLS allows for the computation of VIP (variable
importance in projection) scores that quantify the importance of each input. A PLS VIP14-based
RFE algorithm can be used to iteratively exclude the inputs that contribute the least to the
prediction of the target variable.
The other variant, backward SFS, works similarly to forward SFS, except that it starts with all
the inputs. The input whose exclusion leads to the best model performance is excluded from
the subset. This procedure is repeated to exclude inputs one by one.
14
The PLSRegression class in sklearn does not have any built-in VIP computation; the pyChemometrics package can be
used, which provides a ChemometricsPLS class with a VIP method
Since wrapper methods use models, they generally outperform filter methods. However, because they
execute multiple model fittings, they can become computationally infeasible for a very large
number of inputs or complex models (such as ANNs). Between forward and backward SFS,
backward SFS accounts for the interaction effects of all the variables better than
forward SFS, but it can entail a higher number of model fittings when k is much lower than the
number of input candidates.
Sklearn implementation
Sklearn provides RFE and SequentialFeatureSelector classes that transform raw input data
matrix into relevant subset using any specified estimator/model. Below is an implementation
for backward SFS using linear regression model.
# scale data
from sklearn.preprocessing import StandardScaler
xscaler = StandardScaler()
X_scaled = xscaler.fit_transform(X)
yscaler = StandardScaler()
y_scaled = yscaler.fit_transform(y[:,None])
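A sketch of the selection step itself (cv=5 is an assumed setting):

from sklearn.linear_model import LinearRegression
from sklearn.feature_selection import SequentialFeatureSelector

sfs = SequentialFeatureSelector(LinearRegression(), n_features_to_select=10,
                                direction='backward', cv=5)
sfs.fit(X_scaled, y_scaled.ravel())
print(np.where(sfs.get_support())[0] + 1)   # 1-based indices of selected inputs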
We can see that the wrapper method has performed much better than the filter method and has
recovered 8 out of 10 relevant inputs! Another impressive thing to highlight here is how
SequentialFeatureSelector allows condensing all the complexities of SFS variable selection
with cross-validation into just a couple of lines of code!!
Embedded methods
Embedded methods make use of algorithms where the selection of variables is an inherent part
of the model fitting process itself. We already saw one of these methods in the previous chapter:
Lasso regression, where irrelevant inputs are eventually removed from the model by assigning
them zero coefficients. Decision trees and random forests are another set of algorithms which
directly provide feature importances after model fitting. We will study them in Chapter 9.
Embedded methods are computationally less expensive than wrapper methods and work best
when the number of samples is much higher than the number of inputs. Below is a quick
implementation of Lasso regression for our simulated dataset, where we use the model
coefficients to judge any input's importance.
Sklearn implementation
We will employ the LassoCV model, which automatically selects the penalization strength using
cross-validation, as sketched below. Lasso recovers 8 out of 10 relevant inputs.
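A sketch of this implementation (cv=5 is an assumed setting; X_scaled and y_scaled are from the
previous snippet):

from sklearn.linear_model import LassoCV

lasso = LassoCV(cv=5).fit(X_scaled, y_scaled.ravel())
relevant_inputs = np.where(np.abs(lasso.coef_) > 0)[0] + 1   # inputs with nonzero coefficients
print(relevant_inputs)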
This concludes our look into variable selection methods. The presence of nonlinearity,
input interactions, computational resource availability, and the number of input candidates
influence the eventual method selection. You are now familiar with the mechanisms of these
methods, which should help you choose the method most suitable for your problem.
Feature extraction and feature selection are similar in the sense that both
achieve feature dimension reduction. Feature extraction creates new
features by using all the original features and therefore does not help in
simplification of original measurement space. The focus is more on
dimensionality reduction and removal of correlation among the features.
Feature selection, as we have seen, discards irrelevant features altogether.
Here the focus is on improved model performance with simpler physical
interpretation due to low number of model inputs.
Outliers are abnormal observations that are inconsistent with the majority of the data.
The presence of outliers in the training dataset results in a model that does not adequately
represent the true process and therefore negatively impacts its prediction accuracy. It is
imperative to remove outliers before model fitting or to use a modeling algorithm that is robust
to outliers.
Several methods are at our disposal - the choice depends on whether we are dealing with a
univariate or multivariate system, the degree of outlier contamination, Gaussian or non-
Gaussian data distribution, etc. Figure 4.8 lists some of these methods that are commonly
employed.
Univariate methods
Univariate methods clean each variable separately. The 3-sigma rule and Hampel identifier
are the popular methods available under this category.
3𝝈 rule
The 3-sigma rule works as follows: if 𝜇 and 𝜎 denote the mean and standard deviation of a
variable, then any sample beyond the range 𝜇 ± 3𝜎 is considered an outlier. This rule
follows from the properties of a Gaussian-distributed variable, for which 99.73% of the data lie
within the 𝜇 ± 3𝜎 range. However, by using a factor other than 3, this rule is often applied to
non-Gaussian variables as well.
Hampel identifier
You will agree that the 3𝜎 rule is quite simple and easy to implement. Expectedly, it has some
limitations. If a variable is severely contaminated with outliers, then 𝜇 and 𝜎 can get impacted
to such an extent that abnormal samples end up falling within the 𝜇 ± 3𝜎 range and fail to be
identified as outliers. The solution is to use the Hampel identifier, which replaces 𝜇 and 𝜎 with the
median and MAD, respectively. For any variable x, the Hampel identifier tags an observation as
an outlier if it lies outside the range median(𝑥) ± 3𝜎_MAD, where 𝜎_MAD = 1.4826 ×
median(|𝑥 − median(𝑥)|).
We worked with the median and MAD in the previous chapter for robust scaling. Using the same
dataset, Figure 4.9 now illustrates the effectiveness of the Hampel identifier for univariate outlier
detection. The 'normal range' computed using the 3-sigma rule is inflated enough to include all
the data within the normal range! The Hampel identifier, however, provides a normal range which
is more representative of the normal data and can therefore flag the abnormal samples as
outliers. We can see that the Hampel identifier provides quite a superior performance with very
minimal additional complexity.
Figure 4.9: Univariate bounds obtained with 3-sigma and Hampel identifier rules for data
contaminated with outliers
Multivariate methods
In process datasets where variables are highly correlated, outliers can escape detection when
variables are 'cleaned' separately. For example, consider Figure 4.10. The red samples are
clearly outliers in the 2D space. However, if we look along the x1 (or x2) axis alone, these
abnormal samples lie close to the center of the data distribution and therefore will not be
considered outliers in the univariate sense.
The solution is to use multivariate methods of outlier detection. These methods take into
consideration the shape of multivariate data distribution and are often based on using some
distance metrics to flag samples that are far away from the center of data distribution. Let’s
study these methods in detail.
Mahalanobis distance
Mahalanobis distance (MD) is a classical multivariate distance measure that takes into
account the covariance/shape of the data distribution when computing distances from the center of
the data. Specifically, the MD of any observation 𝑥ₙ is given by

𝑀𝐷(𝑥ₙ) = √( (𝑥ₙ − 𝑥̄) 𝛴⁻¹ (𝑥ₙ − 𝑥̄)ᵀ )
where 𝑥̄ and 𝛴 are the mean and covariance matrix of the sampled observations. The presence of 𝛴
differentiates MD from the Euclidean distance. Computation of MD essentially converts a
multivariate outlier detection problem into a univariate problem. For illustration, Figure 4.11
shows the MDs for the dataset from Figure 4.10. We can see that the outlier samples are
clearly very distinct from the normal samples and can be easily flagged via univariate
techniques.
Figure 4.11: Multivariate outlier detection via Mahalanobis distance and Hampel identifier
In this illustration, the shown bounds were obtained via the Hampel identifier on the cube roots
of the MDs (the cube root is taken to make the MD distribution approximately Gaussian15), as shown in
the code below
# read data
import numpy as np
data_2Doutlier = np.loadtxt('simple2D_outlier.csv', delimiter=',')
# compute Mahalanobis distances and transform into gaussian distribution using cubic-root
from sklearn.covariance import EmpiricalCovariance
emp_cov = EmpiricalCovariance().fit(data_2Doutlier)
MD_emp_cov = emp_cov.mahalanobis(data_2Doutlier)
MD_cubeRoot = np.power(MD_emp_cov, 0.333)

# Hampel identifier on the cube-root scale; bounds transformed back to the MD scale
median = np.median(MD_cubeRoot)
sigma_MAD = 1.4826*np.median(np.abs(MD_cubeRoot - median))
upperBound_MD = np.power(median + 3*sigma_MAD, 3)
lowerBound_MD = np.power(median - 3*sigma_MAD, 3)
# plot Mahalanobis distances with bounds (last 5 samples are the outliers)
import matplotlib.pyplot as plt
plt.figure(), plt.plot(MD_emp_cov[:-5], '.', markeredgecolor='k', markeredgewidth=0.5, ms=9)
plt.plot(np.arange(300,305), MD_emp_cov[-5:], '.r', markeredgecolor='k', markeredgewidth=0.5, ms=11)
plt.hlines(upperBound_MD, 0, 305, colors='r', linestyles='dashdot', label='Upper bound')
plt.hlines(lowerBound_MD, 0, 305, colors='r', linestyles='dashed', label='Lower bound')
15
Wilson & Hilferty, The distribution of chi-square, Proceedings of the National Academy of Sciences of the United States
of America, 1931
MCD estimator
Consider the data distribution and its Mahalanobis distances in Figure 4.12. Here, again, a
bunch of samples lie away from the normal data but, as shown, the MD-based method fails
miserably in detecting all the outliers. So, what happened here?
Figure 4.12: Outliers in 2D space difficult to detect via the classical Mahalanobis distance method
What's happening here is that the outliers are distorting the estimated shape/covariance
structure of the data distribution. As shown in the middle subplot, the contours of the MDs have
been inflated to such an extent that the MDs of several outliers are similar to those of inliers. This
unwanted effect is called the masking effect. Conversely, if inliers get classified as outliers due to
the impact of outliers, it is called the swamping effect.
Previously, for univariate methods, we saw that a robust outlier detection solution can be
obtained by utilizing robust estimates of data location and spread. A similar concept exists for
multivariate methods. A popular estimator of location (centroid) and scatter (covariance
matrix) is MCD (Minimum Covariance Determinant). MCD finds the h samples out of the whole
dataset for which the covariance matrix has the lowest determinant or, simply speaking, whose
enveloping ellipsoid for a given confidence level has the lowest volume among all possible
subsets of size h. The centroid and covariance of these h samples are returned as the robust estimates.
The MinCovDet class in Sklearn implements the FAST-MCD algorithm, which can efficiently
analyze large datasets with high dimensionality. The superior performance of this method
is validated in Figure 4.13. The MD contours now fit the inlying samples much better, and the
outliers are well beyond the computed upper bound. Note that Sklearn also provides an
EllipticEnvelope class that uses FAST-MCD to directly flag outliers, but it requires
specification of the data contamination level, which may not be known beforehand. The code
below shows the computation of MCD-based MDs; bounds are computed as before.
# read data
import numpy as np
data_2Doutlier = np.loadtxt('simple2D_outlier.csv', delimiter=',')

# compute MCD-based robust Mahalanobis distances
from sklearn.covariance import MinCovDet
MCD_cov = MinCovDet().fit(data_2Doutlier)
MD_MCD = MCD_cov.mahalanobis(data_2Doutlier)
Figure 4.13: Multivariate outlier detection via MCD-based robust Mahalanobis distances
PCA
Not all outliers are bad and warrant removal. For example, consider Figure 4.14, where two
kinds of outliers are highlighted. While samples A tend to break away from the majority
correlation, samples B follow the correlation but lie far away from the majority of the data.
Figure 4.14: Different types of outliers which can be distinguished via PCA
Depending on the context of our problem, we may or may not want to remove samples B. The
MD-based approach won't distinguish between these two kinds of outliers and therefore won't
work for us if we want to retain samples B. In that case, PCA (principal component analysis)
can be employed. We will learn PCA in detail in Chapter 5; there you will understand how
PCA distinguishes between these two different kinds of outlying observations. Classical PCA
is based on the sample covariance of the data; therefore, if the given dataset is highly contaminated
with outliers, MCD-based covariance can be used in PCA for robust analysis.
Data-mining methods
The multivariate methods we studied in the previous section make an implicit assumption about the
Gaussianity or unimodality of the data distribution. In real life, it is not rare to encounter situations
where these assumptions do not hold. One such data distribution is shown in Figure 4.15,
where 3 distinct clusters of data are present with outliers in between them. MD or PCA-based
approaches will fail here!
Advanced data-mining methods are employed to handle these complicated situations. The
methods can be categorized into density-based, clustering-based, and neighbor-based.
Density-based methods (like KDE) rely on estimating the distribution density of the data;
observations lying in low-density regions are flagged as outliers. Clustering-based methods (like
GMM, K-means) find distinct clusters of data; samples belonging to small or low-
importance clusters are flagged as outliers. Neighbor-based methods (like KNN, LOF) rely on
finding the neighbors of each data-point; the neighboring data-points are used to compute a
sample's distance from its k-nearest neighbor or its local density, and samples with large
distances or low densities are flagged as outliers.
We will study most of these techniques in Part 2 of the book. Once you study their nuances,
you will realize that these are very powerful and useful methods, but they do require careful
selection of their intrinsic model parameters such as the number of clusters in Kmeans, the
number of neighbors in LOF, or the kernel function & its width in KDE.
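For a quick flavor of the neighbor-based approach, the sketch below flags outliers with Sklearn's LocalOutlierFactor; the dataset name (data_multimode) and the n_neighbors setting are illustrative assumptions.

# neighbor-based outlier detection via LOF (a minimal sketch)
from sklearn.neighbors import LocalOutlierFactor

lof = LocalOutlierFactor(n_neighbors=20)
labels = lof.fit_predict(data_multimode) # -1 => outlier, 1 => inlier
outliers = data_multimode[labels == -1]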
* Zhu et al., Review and big data perspectives on robust data mining approaches for industrial process modeling with outliers and missing data, Annual Reviews in Control, 2018
+ Frosch & Bro, Robust methods for multivariate data analysis, Journal of Chemometrics, 2005
This concludes our study of methods for outlier detection. Our objective was to help you
understand the pros and cons of different techniques because there is no universal method
applicable to all problems. A good understanding of the underlying process and the nature of
data can greatly help make the right choice of technique.
Missing Data
Missing data issues arise when values of one or more variables are missing in one or
more observations of the dataset. The samples with missing data can simply be discarded if
their number is much smaller than the total number of available samples. However, if this is
not the case or there are reasons to believe that discarding samples will negatively impact
model fitting, imputation techniques can be employed, which attempt to fill in the missing
values of a variable using available data from the same variable or other variables.
A simple imputation strategy is mean imputation, where the missing values of a variable are
replaced with the mean of all available data for the variable, as illustrated in the code
snippet below
# Mean imputation
import numpy as np
from sklearn.impute import SimpleImputer

# sample data (consistent with the KNN imputation output shown below)
sample_data = np.array([[1, 2, np.nan], [3, 4, 3], [np.nan, 6, 5], [8, 8, 7]])
mean_imputeModel = SimpleImputer(missing_values=np.nan, strategy='mean')
print(mean_imputeModel.fit_transform(sample_data))
Another widely used approach is hot-deck imputation, where the missing values of a sample are
replaced using the available values of 'similar' samples. Sklearn provides the KNNImputer class,
which imputes a missing value with the weighted (or unweighted) mean of the corresponding values
from the k-nearest neighbors in the fitting dataset. The distance metric between 2 samples is
computed using the features that neither sample is missing.
# KNN imputation
from sklearn.impute import KNNImputer
knn_imputeModel = KNNImputer(n_neighbors=2)
print(knn_imputeModel.fit_transform(sample_data))
>>> [[1. 2. 4. ]
[3. 4. 3. ]
[5.5 6. 5. ]
[8. 8. 7. ]]
Although not currently available in Sklearn's stable API, more advanced techniques like regression
imputation and likelihood-based imputation also exist. The former imputes the missing
values in a variable using a regression model with other correlated variables as predictors.
The latter uses the expectation-maximization (EM) algorithm for imputing missing values in
a probabilistic framework.
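That said, depending on your scikit-learn version, a form of regression imputation may be available through the experimental IterativeImputer class; a minimal sketch (reusing the same sample_data) is shown below.

# regression-style imputation via the experimental IterativeImputer
from sklearn.experimental import enable_iterative_imputer # enables the experimental class
from sklearn.impute import IterativeImputer

iter_imputeModel = IterativeImputer(max_iter=10, random_state=0)
print(iter_imputeModel.fit_transform(sample_data))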
As with variable selection and outlier handling methods, it cannot be said beforehand which
missing data technique will work best for your problem. A good recourse is to select a few
methods based on your process knowledge, build separate models with them, and select the
best performing one!
Summary
In this chapter, we studied the techniques commonly employed to handle different types and
degrees of data contamination. We looked into the issues of measurement noise, irrelevant
features, outliers, and missing data. A natural pre-processing workflow entails denoising the
dataset first, followed by variable selection and outlier removal. With this, we conclude the
first leg of our ML journey. With the fundamentals in place, let’s start the second phase of the
journey where we will look into several classical and very useful ML algorithms.
Part 2
Classical Machine Learning Methods
Chapter 5
Dimension Reduction and Latent Variable
Methods (Part 1)
PCA and PLS are among the most popular latent variable-based statistical tools and have
been used successfully in several process monitoring and soft sensing applications. This
chapter provides a comprehensive exposition of the PCA and PLS techniques and teaches
you how to apply them on process data. Specifically, the following topics are covered
• Introduction to PCA and PLS
• Process modeling and monitoring via PCA and PLS
• Fault diagnosis for root cause analysis
• Nonlinear and dynamic variants of linear PCA and PLS
16 The popularity of latent-variable techniques for process control and monitoring arose from the pioneering work by John MacGregor at McMaster University.
Mathematical background
Consider a data matrix 𝑿 ∈ ℝ𝑵×𝒎 consisting of N samples of m input variables where each
row represents a data-point in the original measurement space. It is assumed that each
column is normalized to zero mean and unit variance. Let 𝒗 ∈ ℝ𝒎 represent the ‘loading’
vector that projects data-points along PC1 and can be found by solving the following
optimization problem
$$\max_{\mathbf{v} \neq \mathbf{0}} \; \frac{(\mathbf{X}\mathbf{v})^T (\mathbf{X}\mathbf{v})}{\mathbf{v}^T \mathbf{v}} \qquad \text{eq. 1}$$
It is apparent that eq. 1 is trying to maximize the variance of the projected data-points along
PC1. Loading vectors for other PCs are found by solving the same problem with the added
constraint of orthogonality to previously computed loading vectors. Alternatively, loading
vectors can also be computed from eigenvalue decomposition of covariance matrix (S) of X
$$\frac{1}{N-1}\mathbf{X}^T\mathbf{X} = \mathbf{S} = \mathbf{V}\mathbf{\Lambda}\mathbf{V}^T \qquad \text{eq. 2}$$
Above is the form you will find more commonly in PCA literature. The columns of the eigenvector
matrix $\mathbf{V} \in \mathbb{R}^{m \times m}$ are the loading vectors that we need. The diagonal eigenvalue matrix $\mathbf{\Lambda}$
equals $\mathrm{diag}\{\lambda_1, \lambda_2, \ldots, \lambda_m\}$, where $\lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_m$ are the eigenvalues. In fact, $\lambda_j$ is equal
to the variance along the jth PC. If there is significant correlation in the original data, only the first
few eigenvalues will be significant. Let's assume that k PCs are retained; then, the first k
columns of $\mathbf{V}$ (which correspond to the first k $\lambda$s) are taken to form the loading matrix $\mathbf{P} \in \mathbb{R}^{m \times k}$. Transformed data in the PC space can now be obtained
$$\mathbf{t}_j = \mathbf{X}\mathbf{p}_j \quad \text{or} \quad \mathbf{T} = \mathbf{X}\mathbf{P} \qquad \text{eq. 3}$$
where $\mathbf{t}_j$ contains the projected values along the jth PC.
The m dimensional ith row of X has been transformed into k (< m) dimensional ith row of T.
𝑻 ∈ ℝ𝑵×𝒌 is called score matrix and the jth column of T (tj) contains the (score) values along
the jth PC. The scores can be projected back to the original measurement space as follows
$$\hat{\mathbf{X}} = \mathbf{T}\mathbf{P}^T \qquad \text{eq. 4}$$
Figure 2.2: Process data from a polymer manufacturing plant. Each colored curve
corresponds to a process variable
For this dataset, it is reported that the process started behaving abnormally around sample
70 and eventually had to be shut down. Therefore, we use samples 1 to 69 for training the
PCA model using the code below. The rest of the data will be utilized for process monitoring
illustration later.
# normalize data
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

scaler = StandardScaler()
data_train_normal = scaler.fit_transform(data_train)

# PCA
pca = PCA()
score_train = pca.fit_transform(data_train_normal)
After training the PCA model, the loading vectors/principal components can be accessed from the
transpose of the components_ attribute of the pca model. Note that we have not accomplished
any dimensionality reduction yet. PCA has simply provided us with an uncorrelated dataset in
score_train. To confirm this, we can compute the correlation coefficients among the columns
of score_train. Only the diagonal values are 1 while the rest of the coefficients are essentially 0!
# confirm no correlation
corr_coef = np.corrcoef(score_train, rowvar = False)
>>> print('Correlation matrix: \n', corr_coef[0:3,0:3]) # printing only a portion
Correlation matrix:
[[ 1.00000000e+00 8.24652750e-16 -1.88830953e-16]
[ 8.24652750e-16 1.00000000e+00 2.36966153e-16]
[-1.88830953e-16 2.36966153e-16 1.00000000e+00]]
For dimensionality reduction we will need to study the variance along each PC. Note that the
sum of variance along the m PCs equals the sum of variance along the m original dimensions.
Therefore, the variance along each PC is also called explained variance. The attribute
explained_variance_ratio_ gives the fraction of variance explained by each PC, and Figure
2.3 clearly shows that not all 33 components are needed to capture all the information in the data.
Most of the information is captured in the first few PCs itself.

explained_variance = 100*pca.explained_variance_ratio_ # in percentage
cum_explained_variance = np.cumsum(explained_variance)

plt.figure()
plt.plot(cum_explained_variance, 'r+', label = 'cumulative % variance explained')
plt.plot(explained_variance, 'b+', label = 'variance explained by each PC')
plt.ylabel('Explained variance (in %)'), plt.xlabel('Principal component number'), plt.legend()
A popular approach for determining the number of PCs to retain is to select the number of
PCs that cumulatively capture at least 90% (or 95%) of the variance. The captured variance
threshold should be guided by the expected level of noise or non-systematic variation that you
do not expect to be captured. Alternative methods include cross-validation, scree tests, the AIC
criterion, etc. However, none of these methods is universally best in all situations.
Thus, we have achieved ~60% reduction in dimensionality (from 33 to 13) by sacrificing just
10% of the information. To confirm that only about 10% of the original information has been
lost, we will reconstruct the original normalized data from the scores. Figure 2.4 provides a
visual confirmation as well, where it is apparent that the systematic trends in variables have
been reconstructed while noisy fluctuations have been removed.

n_comp = 13 # number of PCs capturing ~90% variance (33 --> 13 dimensions)
V_matrix = pca.components_.T
P_matrix = V_matrix[:,0:n_comp]
data_train_normal_reconstruct = np.dot(score_train[:,0:n_comp], P_matrix.T)
Figure 2.4: Comparison of measured and reconstructed values for a few variables
The 90% threshold could also have been specified during model training itself through the
n_components parameter: pca = PCA(n_components = 0.9). In this case the insignificant PCs
are not computed, fit_transform directly returns the reduced score matrix, and the reconstructed
data can be obtained from the model using the inverse_transform method.

# alternative approach
from sklearn.metrics import r2_score

pca = PCA(n_components = 0.9)
score_train_reduced = pca.fit_transform(data_train_normal)
data_train_normal_reconstruct = pca.inverse_transform(score_train_reduced)
R2_score = r2_score(data_train_normal, data_train_normal_reconstruct)
PCA makes the monitoring task easy by summarizing the state of any complex multivariate
process into two simple indicators or monitoring indices as shown in Figure 2.4. During model
training, statistical thresholds are determined for the indices and, for a new data-point, the new
indices' values are compared against the thresholds. If either of the two thresholds is violated,
then the presence of abnormal process conditions is confirmed. The first index, $T^2$, measures the
distance of a sample's score vector from the origin of the PC space, weighted by the variance
along each PC
$$T^2 = \sum_{j=1}^{k} \frac{t_j^2}{\lambda_j} = \mathbf{t}^T \mathbf{\Lambda}_k^{-1} \mathbf{t} \qquad \text{eq. 5}$$
Assuming multivariate normality, its control limit is given by
$$T^2_{CL} = \frac{k(N^2-1)}{N(N-k)} F_{k,\,N-k}(\alpha) \qquad \text{eq. 6}$$
$F_{k,\,N-k}(\alpha)$ is the (1-α) percentile of the F distribution with k and N-k degrees of freedom. In
essence, $T^2 \leq T^2_{CL}$ represents an ellipsoidal boundary around the training data-points in the
PC space. The second index, Q, represents the distance between the original and
reconstructed data-point. Let $\mathbf{e}_i$ denote the ith row of the residual matrix $\mathbf{E} = \mathbf{X} - \hat{\mathbf{X}}$. Then
$$Q = \sum_{j=1}^{m} e_{i,j}^2 \qquad \text{eq. 7}$$
Again, with the normality assumption, the control limit for Q is given by the following expression
$$Q_{CL} = \theta_1 \left[ \frac{z_\alpha \sqrt{2\theta_2 h_0^2}}{\theta_1} + 1 + \frac{\theta_2 h_0 (h_0 - 1)}{\theta_1^2} \right]^{1/h_0} \qquad \text{eq. 8}$$
where $h_0 = 1 - \frac{2\theta_1 \theta_3}{3\theta_2^2}$ and $\theta_r = \sum_{j=k+1}^{m} \lambda_j^r \;;\; r = 1, 2, 3$
$z_\alpha$ is the (1-α) percentile of a standard normal distribution. We now have all the information
required to generate plots of the monitoring indices, also called control charts, for the training
data.
import scipy.stats

# quantities assumed from the earlier steps
alpha = 0.01 # for 99% control limits
eig_vals = pca.explained_variance_ # variances along the PCs
m = data_train_normal.shape[1]
lambda_k_inv = np.linalg.inv(np.diag(eig_vals[0:n_comp]))

# T2 for training samples
T2_train = np.zeros((data_train_normal.shape[0],))
for i in range(data_train_normal.shape[0]):
    T2_train[i] = np.dot(np.dot(score_train_reduced[i,:],lambda_k_inv),score_train_reduced[i,:].T)

# Q control limit (Jackson & Mudholkar expression)
N = data_train_normal.shape[0]
k = n_comp
theta1 = np.sum(eig_vals[k:])
theta2 = np.sum([eig_vals[j]**2 for j in range(k,m)])
theta3 = np.sum([eig_vals[j]**3 for j in range(k,m)])
h0 = 1 - 2*theta1*theta3/(3*theta2**2)
z_alpha = scipy.stats.norm.ppf(1-alpha)
Q_CL = theta1*(z_alpha*np.sqrt(2*theta2*h0**2)/theta1 + 1 + theta2*h0*(h0-1)/theta1**2)**(1/h0)
Figure 2.5 shows that quite a few data-points in the training data violate the thresholds, which
was not expected with 99% control limits. This indicates that the multivariate normality
assumption does not hold for this dataset. Other specialized ML methods like KDE and SVDD
can be employed for control boundary determination for non-gaussian data. We will study
these methods in later chapters. Alternatively, if N is large, another popular approach is to
directly set the control limits as the 99th percentiles of the training T2 and Q values, as sketched below.
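A minimal sketch of this percentile alternative, reusing T2_train and the training reconstruction from earlier:

# percentile-based control limits (Q_train computed from training reconstruction errors)
Q_train = np.sum((data_train_normal - data_train_normal_reconstruct)**2, axis = 1)
T2_CL = np.percentile(T2_train, 99)
Q_CL = np.percentile(Q_train, 99)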
Figure 2.6: (Left) Flow measurements across a valve (Right) Mean-centered flow readings
with two abnormal instances (samples a and b)
For example, consider the scenario in Figure 2.6. The two flow measurements are expectedly
correlated. Normal data-points lie along the 45° line (PC1 direction), except instances 'a' and
'b', which exhibit different types of abnormalities. For sample 'a', the correlation between the
two flow variables is broken, which may be the result of a leak in the valve. This results in an
abnormally high Qa value; $T^2_a$, however, is not abnormally high because the projected score, ta,
is similar to those of normal data-points. For sample 'b', the correlation remains intact, resulting
in a low (zero) Qb value. The score, tb, however, is abnormally far away from the origin, resulting
in an abnormally high $T^2_b$ value.
Fault detection
It’s time now to check whether our T2 and Q charts can help us detect the presence of process
abnormalities from test data (samples 70 onwards). For this, we will compute the monitoring
statistics for the test data.
# compute scores and reconstruction for test data (using the training PCA model)
data_test_normal = scaler.transform(data_test)
score_test_reduced = np.dot(data_test_normal, P_matrix)
data_test_normal_reconstruct = np.dot(score_test_reduced, P_matrix.T)

# calculate T2_test
T2_test = np.zeros((data_test_normal.shape[0],))
for i in range(data_test_normal.shape[0]): # eigenvalues from training data are used
    T2_test[i] = np.dot(np.dot(score_test_reduced[i,:],lambda_k_inv),score_test_reduced[i,:].T)

# calculate Q_test
error_test = data_test_normal_reconstruct - data_test_normal
Q_test = np.sum(error_test*error_test, axis = 1)
Figure 2.7 juxtaposes the monitoring indexes for training and testing data. By looking at these
plots, it is immediately evident that the test data exhibit severe process abnormality. Both T2
and Q values are significantly above the respective control limits.
Figure 2.7: Monitoring charts for training and test data. Vertical cyan-colored line separates
training and test data
Fault diagnosis
After detection of process faults, the next crucial task is to diagnose the issue and identify
which specific process variables are showing abnormal behavior. The popular mechanism to
accomplish this is based on contribution plots. As the name suggests, a contribution plot is a
plot of the contribution of original process variables to the abnormality indexes. The variables
with highest contributions are flagged as potentially faulty variables.
For SPE (square prediction error) or Q, let’s reconsider eq. 7 as shown below where SPEj
denotes the SPE contribution of the jth variable.
$$SPE = \sum_{j=1}^{m} e_{i,j}^2 = \sum_{j=1}^{m} SPE_j \qquad \text{eq. 9}$$
Therefore, the SPE contribution of a variable is simply the squared error for that variable. If the SPE index
has violated its control limit, then the variables with large SPEj values compared to other
variables are considered the potentially faulty variables. For T2 contributions, the calculations are
not as straightforward. Several expressions have been postulated in the literature17. The
commonly used expression below was proposed by Wise et al.18
$$T^2 \text{ contribution of variable } j = \left[ j\text{th element of } \mathbf{D}^{1/2}\mathbf{x} \right]^2 \qquad \text{eq. 10}$$
$$\mathbf{D} = \mathbf{P}\mathbf{\Lambda}_k^{-1}\mathbf{P}^T$$
Note that these contributions are computed for each data-point. Let’s find which variables
need to be further investigated at 85th sample.
# T2 contribution
sample = 85 - 69
data_point = np.transpose(data_test_normal[sample-1,])
D = np.dot(np.dot(P_matrix,lambda_k_inv),P_matrix.T)
T2_contri = np.dot(scipy.linalg.sqrtm(D),data_point)**2 # vector of contributions
plt.figure()
plt.plot(T2_contri), plt.ylabel('T$^2$ contribution plot')
# SPE contribution
error_test_sample = error_test[sample-1,]
SPE_contri = error_test_sample*error_test_sample # vector of contributions
17 S. Joe Qin, Statistical process monitoring: basics and beyond, Journal of Chemometrics, 2003
18 Wise et al., PLS toolbox user manual, 2006
plt.figure()
plt.plot(SPE_contri), plt.ylabel('SPE contribution plot')
Variable # 23 makes large contributions to both indexes, and in Figure 2.9 we can see that
there was a sharp decline in its value towards the end of the sampling period. A plant operator
can use his/her judgement to further troubleshoot the abnormality and isolate the root-cause.
Several implicit assumptions were made in the previously discussed PCA methodology.
Latent variables were assumed to be a linear combination of measured variables, each
sample was assumed to be statistically independent of past samples, and data were assumed
to be gaussian distributed. It is common to find these assumptions violated in process data
and therefore, in this section, we will relax these assumptions and learn how to implement
PCA for nonlinear and dynamic data. The methodologies primarily involve transforming raw
data appropriately such that standard PCA can be applied on them to get satisfactory results.
Dynamic PCA
To tackle process dynamics, Ku et al.19 proposed a simple extension of static PCA where past
data-points are treated as additional process variables as shown in the illustration below.
Each sample (except the first 𝒍 samples) in the original data matrix has been augmented with
data-points from the past 𝒍 samples. Static PCA can now be performed on this augmented
matrix after mean-centering and scaling just like we did before. Ku et al. have shown that this
approach is efficient at extracting out linear static and dynamic correlations. A simple example
of dynamic correlation could be a system where the current value of one variable depends on the
past value of another, e.g., $x_2(t) = \beta x_1(t-1)$ (the specific form here is only illustrative).
Augmenting just one previous sample would suffice for such a system. An augmented sample
row is similarly generated for test data when using the fitted PCA model to derive the scores
for a test data-point. DPCA approach has proven to be useful in several industrial applications.
The code below shows how to generate the augmented matrix with lag period of 5. Note that
the lag period is a hyperparameter and its value needs to be set judiciously. Cross-validation
19 Ku et al., Disturbance detection and isolation by dynamic principal component analysis, Chemometrics and Intelligent Laboratory Systems, 1995
can be used. Another approach is to apply PCA on original data and observe the ACFs (auto-
correlation function values) of the scores. Lagged values are augmented if autocorrelation is
observed. This process is repeated until no autocorrelation remains.
# generate augmented data matrix with lag = 5
lag = 5
N, m = data_train.shape
data_train_augmented = np.zeros((N-lag+1, lag*m))
for sample in range(lag, N+1):
    dataBlock = data_train.iloc[sample-lag:sample,:].values # pandas dataframe to NumPy array
    data_train_augmented[sample-lag,:] = np.reshape(dataBlock, (1,-1), order = 'F')
Multiway PCA
PCA has been found useful for monitoring batch processes as well where raw data have an
additional layer of variability, the inter-batch variability. To capture the variability in batch data,
the three-dimensional data is unfolded to form a two-dimensional data matrix as shown in
Figure 2.10. Static PCA is applied on this unfolded data matrix after normalizing each
column20. Once a test batch data becomes available, test scores are obtained for further
analysis.
Note that in Figure 2.10, in each 'batch-row', the batch data have been arranged in time-instance
order. However, if desired, you can group them by variables as well. You only need
to ensure that the same rearrangement is adopted for the test data. We will apply
multiway PCA to a semiconductor manufacturing batch process dataset in Chapter 8. A minimal
sketch of batch-wise unfolding is shown below.
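The sketch assumes the batch measurements are stored in a hypothetical 3D NumPy array batch_data of shape (n_batches, n_time, n_vars).

# batch-wise unfolding for multiway PCA (a minimal sketch)
import numpy as np

n_batches, n_time, n_vars = batch_data.shape
unfolded_data = batch_data.reshape((n_batches, n_time*n_vars)) # one row per batch, grouped by time instance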
20 Nomikos & MacGregor, Monitoring batch processes using multiway principal component analysis, AIChE Journal, 1994
Kernel PCA
Kernelized methods have become very attractive for dealing with nonlinear data while
retaining the simplicity of their linear counterparts. For illustration, consider Figure 2.11 where,
unlike in Figure 2.1, the variables are not related linearly. Nevertheless, it is apparent that the
3-dimensional data lie along a lower-dimensional manifold. However, linear PCA-based
abnormality detection will fail to detect the shown abnormal data-point.
In such scenarios, the frequently employed solution is to transform the data into a higher-
dimensional space where different classes (normal and abnormal classes) of data become
linearly separable. To understand this trick, consider Figure 2.12, where the two classes of
data are not linearly separable in the original measurement space. However, when an artificial
variable ($x_1^2 + x_2^2$) is added to the system, the classes become easy to separate. This high-
dimensional space is called the feature space.
Kernel PCA (KPCA) simply entails implementing PCA in the feature space. However, the task
of finding the nonlinear mapping that maps raw data to feature space is not trivial. To
overcome this, another trick, called the kernel trick, is employed. While we will defer the
detailed study of kernel trick until Chapter 7, the reader is encouraged to see the work by Lee
et al.21 for details on computation of monitoring indexes for KPCA-based process monitoring.
The code below shows how the data in Figure 2.11 can be analyzed using Sklearn's KernelPCA class.
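This is a minimal sketch; the dataset name (data_3Dnonlinear) and the kernel settings are illustrative assumptions.

# KPCA on the nonlinear dataset of Figure 2.11 (a minimal sketch)
from sklearn.decomposition import KernelPCA

kpca = KernelPCA(n_components = 2, kernel = 'rbf', gamma = 1)
score_kpca = kpca.fit_transform(data_3Dnonlinear)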
Partial least squares (PLS) is a supervised multivariate regression technique that estimates
linear relationship between a set of input variables and a set of output variables. Like PCA,
PLS transforms raw data into latent components - input (X) and output (Y) data matrices are
transformed into score matrices T and U, respectively. Figure 2.13 provides a conceptual
comparison of PLS methodology with those of other popular linear regression techniques,
principal component regression (PCR) and multivariate linear regression (MLR).
Figure 2.13: PLS, PCR, MLR methodology overview. Note that the score matrix, T, for PLS
and PCR can be different.
21 Lee et al., Nonlinear process monitoring using kernel principal component analysis, Chemical Engineering Science, 2004
While MLR computes the least-squares fit between X and Y directly, PCR first performs PCA
on input data and then computes least-squares fit between the score matrix and Y. By doing
so, PCR is able to overcome the issues of collinearity, high correlation, noisy measurements,
and limited training dataset. However, the latent variables are computed independent of the
output data and therefore, the score matrix may capture those variations in X which are not
relevant for predicting Y. PLS overcomes this issue by estimating the score matrices, T and
U, simultaneously such that the variation in X that is relevant for predicting Y is maximally
captured in the latent variable space.
Note that if the number of latent components retained in PLS or PCR model
is equal to the original number of input variables (m), then PLS and PCR
models are equivalent to MLR model.
The unique favorable properties of PLS, along with its low computational requirements, have led to
its widespread usage in the process industry for real-time process monitoring, soft sensing,
fault classification, and so on.
Mathematical background
PLS performs 3 simultaneous jobs: it
• captures the dominant variation in X
• captures the dominant variation in Y
• maximizes the correlation between the X and Y scores
To see how PLS achieves its objectives, consider again the data matrix $\mathbf{X} \in \mathbb{R}^{N \times m}$ consisting
of N observations of m input variables where each row represents a data-point in the original
measurement space. In addition, we also have an output data matrix with p (≥ 1) output
variables, $\mathbf{Y} \in \mathbb{R}^{N \times p}$. It is assumed that each column is normalized to zero mean and unit
variance in both the matrices. The first latent component scores are given by
$$\mathbf{t}_1 = \mathbf{X}\mathbf{w}_1, \qquad \mathbf{u}_1 = \mathbf{Y}\mathbf{c}_1$$
The vectors w1 and c1, termed weight vectors, are computed such that the covariance
between t1 and u1 is maximized. Since covariance factors into the scores' standard deviations
times their correlation, we can see that by maximizing the covariance, PLS tries to meet all
three objectives simultaneously.
After a component is extracted, X and Y are deflated to obtain residual matrices (a standard
NIPALS-style deflation)
$$\mathbf{E}_1 = \mathbf{X} - \mathbf{t}_1\mathbf{p}_1^T, \qquad \mathbf{F}_1 = \mathbf{Y} - \mathbf{t}_1\mathbf{c}_1^T \qquad \text{eq. 13}$$
In eq. 13, E and F are called residual matrices and represent the parts of X and Y that have
not yet been captured. To find the next component scores, the above steps are repeated
with matrices E1 and F1 replacing X and Y. Note that the maximum number of possible
components equals m. For each component, the weight vectors are found via iterative
procedures like NIPALS or SIMPLS. For now, it suffices to know that once the required
number (k) of components has been computed, the following relationship holds
$$\mathbf{Y} = \mathbf{X}\boldsymbol{\beta} + \mathbf{F}$$
With coefficient matrix, 𝜷, computed, we can predict y for a new input data-point x. If these
algorithmic details appear intimidating, do not worry. Sklearn provides the class
PLSRegression which is very convenient to use as we will see in the next section where we
will develop a PLS-based soft sensor.
Soft sensors are used in process industry to provide estimates or predictions of key process
outputs or product qualities using all other available process variable measurements. Soft
sensing proves especially useful when cost of physical sensors is high or real-time product
quality measurements are not available. For illustration, we will study ‘Kamyr digester’ dataset
from a pulp and paper manufacturing process. In this process, wood chips are processed into
pulp whose quality is quantified by Kappa number. In the dataset, 301 hourly samples of the
Kappa number and 21 other process variables are provided. Figure 2.14 shows that there is
considerable variability in the product quality and our goal now is to develop a soft sensor
application to predict Kappa number using other process data.
Before we build our PLS model, some pre-processing is needed. A quick glance at the data
indicates that there are a lot of missing values. To keep the analysis simple, we will remove
the variables with a large number of missing values and then remove the samples with any
remaining missing value.
# fetch data
import numpy as np
import pandas as pd
data = pd.read_csv('kamyr-digester.csv', usecols = range(1,23))
# find the # of nan entries in each column
na_counts = data.isna().sum(axis = 0)
# remove columns with too many nan entries, then rows with any nan entry (the 100-count threshold is illustrative)
data_cleaned = data.drop(columns = na_counts[na_counts > 100].index).dropna(axis = 0)
# separate X, y
y = data_cleaned.iloc[:,0].values[:, np.newaxis] # StandardScaler requires 2D array
X = data_cleaned.iloc[:,1:].values
We now split the remaining samples into training data (80%) and test data (20%), and
normalize them.
# scale data
from sklearn.preprocessing import StandardScaler
109
`
X_scaler = StandardScaler()
X_train_normal = X_scaler.fit_transform(X_train)
X_test_normal = X_scaler.transform(X_test)
y_scaler = StandardScaler()
y_train_normal = y_scaler.fit_transform(y_train)
y_test_normal = y_scaler.transform(y_test)
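With the data normalized, the PLS model can be fit; a minimal sketch is shown below, using 9 latent components as justified by the cross-validation study later in this section.

# PLS model fitting and prediction
from sklearn.cross_decomposition import PLSRegression
pls = PLSRegression(n_components = 9)
pls.fit(X_train_normal, y_train_normal)
y_test_normal_predict = pls.predict(X_test_normal)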
Although not perfect, we have obtained a reasonably good inferential sensor. You are
encouraged to change the number of latent components to the maximum possible (19 in this
case) to obtain MLR-equivalent accuracies; only negligible improvement in accuracy is obtained.
Figure 2.15: Measured vs predicted Kappa number for training and test data
We used 9 latent components in our PLS model. This was determined via a K-fold cross-
validation procedure. As shown in the code below, the training data is split into 10 folds. For
each candidate n_comp, the average of the MSEs computed over the 10 folds is stored. For
n_comp = 9, a local minimum can be observed in the validation MSE plot.
from sklearn.model_selection import KFold
from sklearn.metrics import mean_squared_error

X_scaler = StandardScaler()
y_scaler = StandardScaler()
fit_MSE = []
validate_MSE = []
for n_comp in range(1,20):
    local_fit_MSE = []
    local_validate_MSE = []
    for fit_index, validate_index in KFold(n_splits = 10).split(X_train):
        X_fit_normal = X_scaler.fit_transform(X_train[fit_index])
        X_validate_normal = X_scaler.transform(X_train[validate_index])
        y_fit_normal = y_scaler.fit_transform(y_train[fit_index])
        y_validate_normal = y_scaler.transform(y_train[validate_index])
        pls = PLSRegression(n_components = n_comp)
        pls.fit(X_fit_normal, y_fit_normal)
        local_fit_MSE.append(mean_squared_error(y_fit_normal, pls.predict(X_fit_normal)))
        local_validate_MSE.append(mean_squared_error(y_validate_normal, pls.predict(X_validate_normal)))
    fit_MSE.append(np.mean(local_fit_MSE))
    validate_MSE.append(np.mean(local_validate_MSE))
Determining the number of retained components can be automated by simply looking at the ratio
of validation MSEs for consecutive n_comp. If the ratio is greater than some threshold (say
0.95), then the search is concluded. The underlying logic is that the number of retained latent
components is not increased unless a significantly better validation prediction is obtained. A
sketch of this rule is shown below.
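This sketch assumes validate_MSE (from the cross-validation loop above) holds the average validation MSE for n_comp = 1, 2, ….

# MSE-ratio stopping rule (0.95 threshold as suggested above)
n_comp_selected = 1
for i in range(1, len(validate_MSE)):
    if validate_MSE[i]/validate_MSE[i-1] > 0.95: # insignificant improvement => stop
        break
    n_comp_selected = i + 1
print('Number of components selected: ', n_comp_selected)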
The PLS framework renders itself useful for process monitoring as well. The overall methodology
is similar to PCA-based monitoring: after PLS modeling, monitoring indices are computed,
control limits are determined, and violations of the control limits are checked for fault detection.
PLS-based monitoring is preferred when process data can be divided into input and output
blocks. For illustration, we will use data collected from an LDPE (low-density polyethylene)
manufacturing process. The dataset consists of 54 samples of 14 process variables and 5 product
quality variables. It is known that a process fault occurs from sample 51 onwards.
Our objective here is to build a fault detection tool that clearly indicates the onset of process
fault. To appreciate the need for such a tool, let’s look at the alternative conventional
monitoring approach. If a plant operator was manually monitoring the 5 quality variables
continuously, he/she could notice a slight drop in values for the last 4 samples. However,
given that the quality variables exhibit large variability during normal operations, it is difficult
to make any decision without first examining other process variables because the quality
variables may simply be responding to ‘normal’ changes elsewhere in the process.
Unfortunately, it would be very inconvenient to manually interpret all the process plots
simultaneously.
# scale data
from sklearn.preprocessing import StandardScaler
from sklearn.cross_decomposition import PLSRegression

X_scaler = StandardScaler()
X_train_normal = X_scaler.fit_transform(X_train) # 14 process variables
Y_scaler = StandardScaler()
Y_train_normal = Y_scaler.fit_transform(Y_train) # 5 quality variables

# PLS model
pls = PLSRegression(n_components = 3)
pls.fit(X_train_normal, Y_train_normal)
22 Kourti & MacGregor, Process analysis, monitoring and diagnosis, using multivariate projection methods, Chemometrics and Intelligent Laboratory Systems, 1995
Computation of the captured variances reveals that just 56% of the information in X can explain
almost 90% of the variation in Y; this implies that there are variations in X which have only a
minor impact on the quality variables.
Tscores = pls.x_scores_
X_train_normal_reconstruct = np.dot(Tscores, pls.x_loadings_.T)
# can also use pls.inverse_transform(Tscores)
A look at t vs u score plots further confirms that linear correlation was a good assumption for this
dataset. We can also see how the correlation becomes poor for higher components.
Figure 2.18: X-scores vs Y-scores. Here tj and uj refer to the jth columns of T and U
matrices, respectively.
$$T^2 = \sum_{j=1}^{k} \frac{t_{i,j}^2}{\sigma_j} = \mathbf{t}_i^T \mathbf{\Lambda}_k^{-1} \mathbf{t}_i$$
$\mathbf{\Lambda}_k$, a diagonal matrix, is the covariance matrix of T with $\sigma_j$ (the variance of the jth component scores)
as its diagonal elements. $T^2_{CL}$ is again obtained by the following expression
$$T^2_{CL} = \frac{k(N^2-1)}{N(N-k)} F_{k,\,N-k}(\alpha)$$
The second and third indices, SPEx and SPEy, represent the residuals or the unmodelled
parts of X and Y, respectively. Let $\mathbf{e}_i$ and $\mathbf{f}_i$ denote the ith rows of E and F, respectively. Then
$$SPE_x = \sum_{j=1}^{m} e_{i,j}^2, \qquad SPE_y = \sum_{j=1}^{p} f_{i,j}^2$$
Note that if output measurements are not available in real-time, then SPEy is not calculated.
With the normality assumption for the residuals, the control limit for an SPE statistic is given by the
following expression
$$SPE_{CL} = g\,\chi^2_\alpha(h), \qquad h = \frac{2\mu^2}{\sigma}, \quad g = \frac{\sigma}{2\mu}$$
where $\chi^2_\alpha(h)$ is the (1-α) percentile of a chi-squared distribution23 with h degrees of freedom; $\mu$ denotes
the mean value and $\sigma$ denotes the variance of the SPE statistic. Note that this expression
could be used for the PCA SPE statistic as well.
# T2 (T_cov_inv: inverse covariance of the training scores)
T_cov_inv = np.linalg.inv(np.cov(Tscores.T))
T2_train = np.zeros((data_train_normal.shape[0],))
for i in range(data_train_normal.shape[0]):
    T2_train[i] = np.dot(np.dot(Tscores[i,:],T_cov_inv),Tscores[i,:].T)

# SPEx
x_error_train = X_train_normal - X_train_normal_reconstruct
SPEx_train = np.sum(x_error_train*x_error_train, axis = 1)

# SPEy (Y residuals from the PLS predictions)
y_error_train = Y_train_normal - pls.predict(X_train_normal)
SPEy_train = np.sum(y_error_train*y_error_train, axis = 1)
23 Yin et al., A review of basic data-driven approaches for industrial process monitoring, IEEE Transactions on Industrial Electronics, 2014
# control limits
# T2
import scipy.stats
alpha = 0.01 # for 99% control limits
N = data_train_normal.shape[0]
k = 3
T2_CL = k*(N**2-1)*scipy.stats.f.ppf(1-alpha, k, N-k)/(N*(N-k))
# SPEx
mean_SPEx_train = np.mean(SPEx_train)
var_SPEx_train = np.var(SPEx_train)
g = var_SPEx_train/(2*mean_SPEx_train)
h = 2*mean_SPEx_train**2/var_SPEx_train
SPEx_CL = g*scipy.stats.chi2.ppf(1-alpha, h)
# SPEy
mean_SPEy_train = np.mean(SPEy_train)
var_SPEy_train = np.var(SPEy_train)
g = var_SPEy_train/(2*mean_SPEy_train)
h = 2*mean_SPEy_train**2/var_SPEy_train
SPEy_CL = g*scipy.stats.chi2.ppf(1-alpha, h)
The monitoring charts clearly indicate that the process has encountered a severe
abnormality at the end of the sampling period. The significantly high SPEx indicates that the
abnormality has strongly affected the input variable correlations.
Variants of classical PLS have been devised to deal with dynamic and non-linear systems.
For dynamic systems, past values of process inputs affect current values of outputs and
therefore including past measurements improves predictive accuracy24. For dynamic PLS, an
augmented input data matrix is built (using the same procedure as that for dynamic PCA)
using lagged values of only input variables or both input and output variables. Using only
lagged inputs leads to a FIR (finite impulse response) model while using both lagged inputs
and outputs leads to an ARX (autoregressive with exogenous variables) model. The value of
lag period can be determined using cross-validation or time-constant of the process.
For nonlinear systems, kernel PLS is an efficient method for soft sensing quality variables.
Using kernel function, the input variables are implicitly mapped onto a high-dimensional
feature space where data behaves more linearly. Linear PLS model is then built between the
feature variables and output variables. The reader is encouraged to see the work of Zhang et
al.25 for details on kernel PLS-based soft sensing.
Model fidelity degrades over time due to changes in process correlations caused by
aging equipment or changing process conditions. Therefore, it becomes crucial
to regularly update PCA/PLS models. There are two broad techniques of model
adaptation: recursive update++ and moving-window update**. In the moving-window
approach, the existing model is discarded, and a completely new process model is built
by replacing the oldest data with new data. In the recursive approach, the existing model
is not discarded; rather, a new model is built by updating the existing model with new
data.
24 Kano et al., Inferential control system of distillation compositions using dynamic partial least squares regression, Journal of Process Control, 2000
25 Zhang et al., Nonlinear multivariate quality estimation and prediction based on kernel partial least squares, Industrial & Engineering Chemistry Research, 2008
++ Li et al., Recursive PCA for adaptive process monitoring, Journal of Process Control, 2000
++ S. Joe Qin, Recursive PLS algorithms for adaptive data modeling, Computers & Chemical Engineering, 1998
** Wang et al., Process monitoring approach using fast moving window PCA
Summary
With this chapter we have reached a significant milestone in our ML journey. You have seen
how hidden process knowledge can be conveniently extracted from process data and
converted into process insights. With PCA and PLS tools in your arsenal you are now well-
equipped to tackle most of the process modeling and monitoring related problems. However,
our journey does not end here. In the next chapter, we will study a few more latent-variable-
based techniques that are equally powerful.
Chapter 6
Dimension Reduction and Latent Variable
Methods (Part 2)
By now you must be very impressed with the powerful capabilities of PCA and PLS
techniques. These methods allowed us to extract uncorrelated latent variables which
maximized the captured variances. However, you may ask, “Are these the best dimensionality
reduction techniques to use for all problems?”. We are glad that you asked! Other powerful
methods do exist which may provide better performance. For example, Independent
component analysis (ICA) can provide latent variables with stricter property of statistical
independence rather than only uncorrelatedness. Independent components may be able to
characterize the process data better than principal components and thus may result in better
monitoring performance. If you are reducing dimensionality with the end goal of classifying
process faults into different categories for fault diagnosis, then, maximal separation between
data from different classes of faults would be your primary concern rather than maximal
capture of data variance. Fisher discriminant analysis (FDA) would be better suited for this
task.
In this chapter, we will learn in detail the properties of ICA and FDA. We will apply these
methods for process monitoring and fault classification for a large-scale chemical plant.
Specifically, the following topics are covered
• Introduction to ICA
• Process monitoring of non-gaussian processes
• Introduction to FDA
• Fault classification for large scale processes
Figure 6.1: Simple illustration of ICA vs PCA. The arrows in the x1 vs x2 plot show the direction
vectors of the corresponding components. Note that the signals t1 and t2 are not independent, as
the value of one variable influences the range of values of the other variable.
ICA uses higher-order statistics for latent variable extraction, instead of only the second-order
statistics (mean, variance/covariance) used by PCA. Therefore, for non-gaussian process data,
ICA can extract latent variables that characterize the data better than principal components.
Independence vs Uncorrelatedness
Before jumping into the mathematics behind ICA, let us take a few seconds to ensure that we
understand the concepts behind independence and uncorrelatedness. Two random variables
y1 and y2 are said to be independent if the value of one signal does not impact the value of
the other signal. Mathematically, this condition is stated as
$$p(y_1, y_2) = p(y_1)\,p(y_2)$$
where $p(y_1)$ is the probability density function of y1 alone, and $p(y_1, y_2)$ is the joint probability
density function. The variables y1 and y2 are said to be uncorrelated if their covariance is zero
$$C(y_1, y_2) = E\{(y_1 - E(y_1))(y_2 - E(y_2))\} = E(y_1 y_2) - E(y_1)E(y_2) = 0$$
where $E(\cdot)$ denotes mathematical expectation. Using the independence condition, it can be
easily shown that if the variables are independent, they are also uncorrelated, but not vice
versa. Therefore, uncorrelatedness is a weaker form of independence.
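A quick numeric illustration of this asymmetry (the symmetric distribution chosen for y1 is an assumption):

# y2 = y1**2 is fully determined by y1, yet their correlation is essentially zero
import numpy as np
rng = np.random.default_rng(0)
y1 = rng.uniform(-1, 1, 100000)
y2 = y1**2
print(np.corrcoef(y1, y2)[0,1]) # ~0 although y2 depends entirely on y1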
Mathematical background
Consider a data matrix $\mathbf{X} \in \mathbb{R}^{m \times N}$ consisting of N observations of m input variables where each
column represents a data-point in the original measurement space. Note that, in contrast to
PCA, the transposed form of the data matrix is employed here. In ICA, it is assumed that the measured
variables are a linear combination of d (≤ m) independent components s1, s2, …, sd.
$$\mathbf{X} = \mathbf{A}\mathbf{S} \qquad \text{eq. 1}$$
where $\mathbf{S} \in \mathbb{R}^{d \times N}$ and $\mathbf{A} \in \mathbb{R}^{m \times d}$ is called the mixing matrix. The objective of ICA is to estimate
the unknown matrices A and S from the measured data X. This is accomplished by finding a
demixing matrix, W, such that the ICs, i.e., the rows of the estimated matrix $\hat{\mathbf{S}}$, become as
independent as possible
$$\hat{\mathbf{S}} = \mathbf{W}\mathbf{X} \qquad \text{eq. 2}$$
Before estimating W, the initial step in ICA involves removing correlations between the
variables in the data matrix X. This step, called whitening or sphering, is accomplished via
PCA.
$$\mathbf{Z} = \mathbf{Q}\mathbf{X} \qquad \text{eq. 3}$$
where $\mathbf{Q} \in \mathbb{R}^{d \times d}$ is called the whitening matrix. Whitening makes the rows of Z uncorrelated. To
see how it helps, let B = QA and consider the following relationships
$$\mathbf{Z} = \mathbf{Q}\mathbf{X} = \mathbf{Q}\mathbf{A}\mathbf{S} = \mathbf{B}\mathbf{S} \qquad \text{eq. 4}$$
Therefore, whitening converts the problem of finding matrix A into that of finding matrix B. The
advantage lies in the fact that B is an orthogonal matrix (this can be shown using the facts that the
whitened variables are uncorrelated and the ICs are independent) and hence fewer parameters
need to be estimated. Using the orthogonality property, the following relationships result
$$\mathbf{S} = \mathbf{B}^T\mathbf{Z} = \mathbf{B}^T\mathbf{Q}\mathbf{X} \;\Rightarrow\; \mathbf{W} = \mathbf{B}^T\mathbf{Q} \qquad \text{eq. 5}$$
The above procedure summarizes the steps involved in ICA. Note that the sets {nA, S/n} and
{A, S} result in the same measured data matrix X for any non-zero scalar n, and therefore the sign and
magnitude of the original ICs cannot be uniquely estimated. This, however, does not affect
26 www.sci.utah.edu/~shireen/pdfs/tutorials/Elhabian_ICA09.pdf is an excellent quick read for more details on this.
usage of ICA for process monitoring because the estimated ICs for both training and test data
get scaled by the same scalar implicitly. FastICA (a popular algorithm for ICA) computes ICs
such that the L2 norm of each IC score is 1. This is the reason why the reconstructed IC
signals in Figure 6.1 seem to be scaled versions of the original IC signals. We will use this fact
later, so do not forget it!
Let’s quickly see the process impact of one of the fault conditions (fault 10) which include
disturbances in one of the feed’s temperature. The impact of this fault can be seen in abnormal
stripper temperature profile (Figure 6a). The plot in PC space (Figure 6b) shows more clearly
how the faulty operation data are different from the normal operation data. We will use ICA
and FDA for detecting and classifying these faults automatically.
# fetch TE data
TEdata_noFault_train = np.loadtxt('d00.dat').T
TEdata_Fault_train = np.loadtxt('d10.dat')

# quick visualize
plt.figure()
plt.plot(TEdata_noFault_train[:,17])
plt.xlabel('sample #'), plt.ylabel('Stripper Temperature'), plt.title('Normal operation')
plt.figure()
plt.plot(TEdata_Fault_train[:,17])
plt.xlabel('sample #'), plt.ylabel('Stripper Temperature'), plt.title('Faulty operation')
Figure 6.3 (a): Normal vs faulty process profile in Tennessee Eastman dataset
Figure 6.3 (b): Normal vs faulty (Fault 10) TE process data in PC space
27 Lee et al., Statistical process monitoring with independent component analysis, Journal of Process Control, 2004
Equation 2 shows that the ith row (wi) of the demixing matrix W corresponds to the ith IC. Therefore,
Lee et al.27 suggested using the Euclidean (L2) norm of wi to quantify the importance of the ith IC
and subsequently ordering the ICs in decreasing order of importance. Figure 6.4 shows that not
all ICs are equally important, as the L2 norms of several ICs are much smaller than the rest.
# fetch TE data and select variables (discarding composition measurements for now)
TEdata_noFault_train = np.loadtxt('d00.dat').T
xmeas = TEdata_noFault_train[:,0:22]
xmv = TEdata_noFault_train[:,41:52]
data_noFault_train = np.hstack((xmeas, xmv))
# scale data
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
data_train_normal = scaler.fit_transform(data_noFault_train)
# fit ICA model and compute L2 norms of the demixing matrix rows
from sklearn.decomposition import FastICA
ica = FastICA(max_iter = 1000).fit(data_train_normal)
W = ica.components_
L2_norm = np.linalg.norm(W, 2, axis = 1)
L2_norm_sorted_pct = 100*np.flip(np.sort(L2_norm))/np.sum(L2_norm)
plt.figure()
plt.plot(L2_norm, 'b'), plt.xlabel('IC number (unsorted)'), plt.ylabel('L2 norm')
plt.figure()
plt.plot(L2_norm_sorted_pct, 'b+'), plt.xlabel('IC number (sorted)'), plt.ylabel('% L2 norm')
Figure 6.4: (Unsorted) L2 norm of each row of W and (sorted) percentage of the L2 norms
After the ordering of the ICs, 2 approaches could be utilized to determine the optimal number
of ICs. Approach 1 uses the sorted L2 norm plot to determine a cut-off IC number beyond
which the norms are relatively insignificant. For our dataset, as seen in Figure 6.4, no clear
cut-off number is apparent. The other approach entails choosing the number of ICs equal to the
number of PCs that would be retained by the variance-based method. This approach also ensures
a fair comparison between ICA and PCA. We will use this 2nd approach in this chapter.
# decide # of ICs to retain via PCA variance method and compute ICs
from sklearn.decomposition import PCA, FastICA
pca = PCA().fit(data_train_normal)
n_comp = np.argmax(np.cumsum(pca.explained_variance_ratio_) >= 0.90) + 1 # PCs capturing 90% variance
ica = FastICA(max_iter = 1000).fit(data_train_normal)
Note that the FastICA.fit method expects each row of the data matrix (X) to represent a sample,
while we used a transposed form in our mathematical description. This may cause confusion
at times. Nonetheless, the shapes of the extracted mixing, demixing, and whitening matrices
are the same in both places.
The first monitoring index, $I^2$, is computed from the retained (dominant) IC scores, $s_i$, of the ith sample
$$I^2 = \mathbf{s}_i^T \mathbf{s}_i \qquad \text{eq. 6}$$
You may wonder why we have not included the covariance matrix as we did for the T2 metric
computation. This is because the variance of each IC score is the same (remember, the L2 norm
of each IC score is 1) and thus inclusion of the covariance matrix is redundant.
The second index, SPE, represents the distance between the measured and reconstructed
data-point in the measurement space. For its computation, let us construct a matrix Bd by
selecting the columns from matrix B whose indices correspond to the indices of the rows
selected from W when we generated matrix Wd. Let $x_i$ and $\hat{x}_i$ denote the ith measured and
reconstructed sample. Then,
$$\hat{x}_i = \mathbf{Q}^{-1}\mathbf{B}_d s_i = \mathbf{Q}^{-1}\mathbf{B}_d\mathbf{W}_d x_i, \qquad e_i = x_i - \hat{x}_i, \qquad SPE = e_i^T e_i \qquad \text{eq. 7}$$
The third metric, $I_e^2$, is based on the excluded ICs. Let We contain the excluded rows of matrix
W. Then
$$s_i^e = \mathbf{W}_e x_i, \qquad I_e^2 = (s_i^e)^T s_i^e \qquad \text{eq. 8}$$
Due to the non-gaussian nature of the ICs, we do not have convenient closed-form equations
for computing the thresholds/control limits of the above computed indices. A standard practice
is to use Kernel density Estimation (KDE) for threshold determination of ICA metrics. Since
we will learn KDE in a later chapter, we will employ the percentile method here.
Below we define a function that takes an ICA model and data matrix as inputs, and returns
the monitoring metrics. Figure 6.5 shows the monitoring charts along with 99% control limits.
# Define function to compute ICA monitoring metrics for training or test samples
def compute_ICA_monitoring_metrics(ica_model, number_comp, data):
    """
    data: numpy array of shape = [n_samples, n_features]

    Returns
    ----------
    monitoring_stats: numpy array of shape = [n_samples, 3]
    """
    n = data.shape[0]

    # model parameters
    W = ica_model.components_
    L2_norm = np.linalg.norm(W, 2, axis=1)
    sort_order = np.flip(np.argsort(L2_norm))
    W_sorted = W[sort_order,:]

    # compute I2
    Wd = W_sorted[0:number_comp,:]
    Sd = np.dot(Wd, data.T)
    I2 = np.array([np.dot(Sd[:,i], Sd[:,i]) for i in range(n)])

    # compute Ie2
    We = W_sorted[number_comp:,:]
    Se = np.dot(We, data.T)
    Ie2 = np.array([np.dot(Se[:,i], Se[:,i]) for i in range(n)])

    # compute SPE
    Q = ica_model.whitening_
    Q_inv = np.linalg.inv(Q)
    A = ica_model.mixing_
    B = np.dot(Q, A)
    B_sorted = B[:,sort_order]
    Bd = B_sorted[:,0:number_comp]
    e = data.T - np.dot(np.dot(Q_inv, np.dot(Bd, Wd)), data.T) # reconstruction error
    SPE = np.array([np.dot(e[:,i], e[:,i]) for i in range(n)])
    return np.column_stack((I2, Ie2, SPE)) # monitoring_stats
# Define function to draw monitoring charts (a sketch; the function name and plotting details are assumptions)
def draw_ICA_monitoring_charts(ICA_statistics, CLs, trainORtest):
    """
    Parameters
    ----------
    ICA_statistics: numpy array of shape = [n_samples, 3]
    CLs: List of control limits
    trainORtest: 'training' or 'test'
    """
    for i, statName in enumerate(['I2', 'Ie2', 'SPE']):
        plt.figure()
        plt.plot(ICA_statistics[:,i]), plt.axhline(CLs[i], color='red')
        plt.ylabel(statName), plt.xlabel('sample # (' + trainORtest + ' data)')
A well-designed monitoring tool has a high FDR and a low FAR. A low FDR and a high FAR
are both undesirable: a low FDR leads to safety issues and economic losses due to
delayed or missed abnormality detection, while a high FAR leads to loss of the user's
confidence in the tool and eventually the tool's demise.
Let us now use these terms to compare the performances of PCA and ICA for detecting Faults
10 and 5. Figures 6.6 and 6.7 show the ICA/PCA monitoring charts for these faults. For Fault
10, ICA gives a significantly higher FDR. There are many samples between 400 and 600 where
the PCA metrics are below their control limits despite the presence of process abnormalities. The
performance difference is even more prominent for the Fault 5 test data. From sample 400 onwards,
the PCA charts incorrectly indicate normal operation although it is known that the faulty
condition remains till the end of the sampling period. Another point to note is that both PCA
and ICA have low FAR values, as not many samples before sample 160 violate the control
limits. Here again, ICA has lower/better FAR values. Similar FAR behavior is observed for the
non-faulty test dataset. We have not shown the FAR values, but you should compute them and
confirm that ICA gives lower FAR values.
28 Lee et al., Statistical monitoring of dynamic processes based on dynamic independent component analysis, Chemical Engineering Science, 2004
29 Yin et al., A comparison study of basic data-driven fault diagnosis and process monitoring methods on the benchmark Tennessee Eastman process, Journal of Process Control, 2012
# Define function to compute the alarm rate (the function name is an assumption)
def compute_alarmRate(monitoring_stats, CLs):
    """
    Parameters
    -----------
    monitoring_stats: numpy array of shape = [n_samples, 3]
    CLs: List of control limits

    Returns
    ----------
    alarmRate: float
    """
    violationFlag = monitoring_stats > CLs
    alarm_overall = np.any(violationFlag, axis=1) # violation of any metric => alarm
    alarmRate = 100*np.sum(alarm_overall)/monitoring_stats.shape[0]
    return alarmRate
# fetch TE test data with Fault 10 (the file name d10_te.dat follows the usual
# TE benchmark convention and is an assumption here)
TEdata_Fault_test = np.loadtxt('d10_te.dat')
xmeas = TEdata_Fault_test[:,0:22]
xmv = TEdata_Fault_test[:,41:52]
data_Fault_test = np.hstack((xmeas, xmv))

# scale data
data_test_scaled = scaler.transform(data_Fault_test)
Figure 6.6: Monitoring charts for Fault 10 data with ICA (top, FDR = 90.8%) and PCA
(bottom, FDR = 75.6%) metrics
Figure 6.7: Monitoring charts for Fault 5 data with ICA (top, FDR = 100%) and PCA (bottom,
FDR = 41.5%) metrics
Although PCA gave poor performance compared to ICA for the 2 test datasets, there is no
general criterion that we could have used to predict this. The monitoring performance depends
on the specific system under study, and a trial-and-error method is frequently employed for the
selection of a monitoring algorithm.
Like PCA and PLS, nonlinear and dynamic variants of ICA have also been developed. The
reader is encouraged to see the works of Lee et al.28 for dynamic ICA and Lee et al.30 for
nonlinear ICA.
Fisher Discriminant Analysis (FDA), also called linear discriminant analysis (LDA), is a
multivariate dimensionality reduction technique which maximizes the ‘separation’ in the lower
dimensional space between data belonging to different classes. For conceptual
understanding, consider the simple illustration in Figure 6.8 where a 2D dataset (with 2
classes of data) has been projected onto 1D spaces by FDA and PCA. The respective 1-D
scores show that while the two classes of data are well segregated in LD space, the
segregation is very poor in PC space. This observation was expected because PCA, while
determining the projection directions, does not consider information about different data
classes. Therefore, if your intention is to reduce dimensionality for subsequent data
classification and training data is labeled into different classes, then FDA is more suitable.
Figure 6.8: Simple illustration of FDA vs PCA. The arrows in the x1 vs x2 plot show the
direction vectors of 1st components of the corresponding methods
Due to the powerful data discrimination ability of FDA, it is widely used in the process industry for
operating mode classification and fault diagnosis. Large-scale industrial processes often
involve several distinct operating modes and fault classes, making such discriminating ability
especially valuable.
30 Lee et al., Fault detection of non-linear processes using kernel independent component analysis, The Canadian Journal of Chemical Engineering, 2007
Mathematical background
To facilitate data classification, FDA not only maximizes the separation between the classes
but also minimizes the variation/scatter within each class. To see how this is achieved, let us
first consider a dataset matrix $\mathbf{X} \in \mathbb{R}^{N \times m}$ consisting of N samples, N1 of which belong to class
1 ($\omega_1$) and N2 to class 2 ($\omega_2$). FDA seeks to find a projection vector $\mathbf{w} \in \mathbb{R}^m$ such that
the projected scalars/samples ($z = \mathbf{w}^T\mathbf{x}$) are maximally separated. Let $\tilde{\mu}_1$ and $\tilde{\mu}_2$ denote the
means of the projected values of classes 1 and 2, respectively
$$\tilde{\mu}_i = \frac{1}{N_i} \sum_{z \in \omega_i} z$$
Class separation could be quantified as the distance between the projected means, $|\tilde{\mu}_1 - \tilde{\mu}_2|$;
this, however, is not a robust measure as it ignores the within-class scatter of the projected samples
$$\tilde{s}_i^2 = \sum_{z \in \omega_i} (z - \tilde{\mu}_i)^2$$
The separation criterion is therefore defined as the scatter-normalized distance between the projected
means. This formulation seeks a projection where the class means are far apart and
samples from the same class are close to each other
$$J(\mathbf{w}) = \frac{|\tilde{\mu}_1 - \tilde{\mu}_2|^2}{\tilde{s}_1^2 + \tilde{s}_2^2}$$
Using the base relation $z = \mathbf{w}^T\mathbf{x}$ and straightforward algebraic manipulations31, one can
equivalently represent the objective $J(\mathbf{w})$ as follows, which also holds for any number (p) of
data classes
$$J(\mathbf{w}) = \frac{\mathbf{w}^T \mathbf{S}_b \mathbf{w}}{\mathbf{w}^T \mathbf{S}_w \mathbf{w}} \qquad \text{eq. 9}$$
$$\mathbf{S}_w = \sum_{j=1}^{p} \mathbf{S}_j, \qquad \mathbf{S}_j = \sum_{\mathbf{x}_i \in \omega_j} (\mathbf{x}_i - \boldsymbol{\mu}_j)(\mathbf{x}_i - \boldsymbol{\mu}_j)^T, \qquad \mathbf{S}_b = \sum_{j=1}^{p} N_j (\boldsymbol{\mu}_j - \boldsymbol{\mu})(\boldsymbol{\mu}_j - \boldsymbol{\mu})^T$$
where $\boldsymbol{\mu} \in \mathbb{R}^m$ and $\boldsymbol{\mu}_j \in \mathbb{R}^m$ denote the mean vectors of all the N samples and of the Nj samples
from the jth class, respectively, in the measurement space. The first FDA vector, w1, is found by
maximizing J(w), and the subsequent vectors are found by solving the same problem with the
added constraints of orthogonality to previously computed vectors. Note that there can be at
most p-1 FDA vectors. Alternatively, like PCA, the vectors can also be computed as solutions
of a generalized eigenvalue problem
$$\mathbf{S}_w^{-1}\mathbf{S}_b\,\mathbf{w} = \lambda\mathbf{w} \qquad \text{eq. 10}$$
where $\lambda$ = J(w). Therefore, the eigenvalues ($\lambda$s) indicate the degree of separability among the
data classes when projected onto the corresponding eigenvectors. The first discriminant/FDA
vector/eigenvector corresponds to the largest eigenvalue, the 2nd FDA vector is associated
with the 2nd largest eigenvalue, and so on. Once the FDA vectors are determined, data-points
can be projected, and classification models can be built in the reduced FDA space. Overall,
the FDA transformation from the m-dimensional space to the (p-1)-dimensional FDA space can be
represented as
$$\mathbf{Z} = \mathbf{X}\mathbf{W}_p$$
where $\mathbf{W}_p \in \mathbb{R}^{m \times (p-1)}$ contains the p-1 FDA vectors as columns and $\mathbf{Z} \in \mathbb{R}^{N \times (p-1)}$ is the data
matrix in the transformed space where each row is a transformed sample. The transformed
samples are optimally separated in the FDA space.
31 Elhabian & Farag, A tutorial on data reduction: Linear Discriminant Analysis, September 2009
# scale data
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
Faultydata_train_scaled = scaler.fit_transform(data_Faulty_train)
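The FDA projection itself can be obtained with Sklearn's LinearDiscriminantAnalysis class; a minimal sketch is below, where y_train (the fault class labels of the stacked training samples) is an assumed variable.

# FDA/LDA projection onto 2 discriminants
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
lda = LinearDiscriminantAnalysis(n_components = 2)
scores_train_lda = lda.fit_transform(Faultydata_train_scaled, y_train)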
Figure 6.9 shows the transformed samples in 2 dimensions after FDA and PCA. FDA is able
to provide a clear separation of the Fault 5 samples; however, it could not separate the Fault 10 and
Fault 19 data. In fact, the 2nd discriminant (FD2) contributes little to the discrimination and therefore
only FD1 is needed to separate the samples of Fault 5. To segregate Faults 10 and 19, kernel
FDA can be explored32. Linear PCA, on the other hand, fails to separate any of the classes.
Figure 6.9: FDA and PCA scores in 2 dimensions with 3 fault classes from TEP training (top)
and test (bottom) dataset
While FDA is a powerful tool for fault diagnosis, it can also be used for fault
detection by including data from ‘no-fault’ or normal plant operation as a
separate class.
32 Hyun-Woo Cho, Nonlinear feature extraction and classification of multivariate process data in kernel feature space, Expert Systems with Applications, 2007
After projecting the samples onto the FDA space, any classification technique can be chosen
to classify or diagnose the specific fault. A popular T2 statistic-based approach entails
computing a T2 control limit for each fault class using the training data. The T2 control limit
($T^2_{CL,j}$) for the jth fault class represents a boundary around the projected samples from the jth
fault in the lower-dimensional space; any given sample lying inside the boundary is deemed to
belong to the jth fault class. Mathematically, this can be specified as
$$T^2_{sample,j} = (\mathbf{z}_{sample} - \tilde{\boldsymbol{\mu}}_j)^T\, \tilde{\mathbf{S}}_j^{-1}\, (\mathbf{z}_{sample} - \tilde{\boldsymbol{\mu}}_j)$$
where $\tilde{\boldsymbol{\mu}}_j$ and $\tilde{\mathbf{S}}_j$ denote the mean and covariance matrix of the projected training samples
belonging to the jth fault class, respectively, and $\mathbf{z}_{sample}$ denotes the projected sample
in the FDA space. $T^2_{CL,j}$ can be obtained using the same expression we used in PCA,
$\frac{k(N_j^2-1)}{N_j(N_j-k)} F_{k,\,N_j-k}(\alpha)$, where k denotes the number of dimensions retained in the FDA space.
For illustration, let us see how many samples from the Fault 5 test data get correctly identified.
n_rows_test = TEdata_Fault5_test.shape[0]
xmeas = TEdata_Fault5_test[:,0:22]
xmv = TEdata_Fault5_test[:,41:52]
data_Faulty_test = np.hstack((xmeas, xmv))
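A sketch of the classification check is shown below; mu_5, S_5_inv, and T2_CL_5 are assumed to have been computed from the projected Fault 5 training samples, and lda is the fitted FDA model from earlier.

# check test samples against Fault 5's T2 control limit
data_Faulty_test_scaled = scaler.transform(data_Faulty_test)
scores_test_lda = lda.transform(data_Faulty_test_scaled)
T2_test = np.array([np.dot(np.dot(z - mu_5, S_5_inv), (z - mu_5).T) for z in scores_test_lda])
print('% identified as Fault 5: ', 100*np.sum(T2_test <= T2_CL_5)/n_rows_test)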
About 98% of the samples have been correctly identified as belonging to Fault 5. As shown
in Figure 6.10, some of the samples which fall far away from the mean violate $T^2_{CL,5}$ and
therefore are not classified as Fault 5.
In our process monitoring applications so far, we have used terms like fault
detection, fault diagnosis, and fault classification. You may, however, also
encounter other terms like fault identification and fault isolation. While these terms may
get used interchangeably, there exist nuanced differences. Fault detection refers
to the task of determining whether abnormal process conditions have occurred.
Fault identification and fault isolation both refer to the task of identifying the process
variables that exhibit abnormal behavior. Identifying these variables can help in
determining the root-cause of the fault. The step of 'fault diagnosis' in the PCA
chapter should have been called 'fault identification'!
Fault diagnosis refers to the task of finding which specific fault has occurred or
determining the root-cause of the fault. Fault classification falls in this category.
While fault isolation is unsupervised, fault diagnosis is supervised.
Summary
With this chapter, we have now covered all the popular classical dimensionality reduction
techniques that are frequently utilized for analyzing process data. This chapter has also
delivered an important message: blind application of a single technique at all times may not
yield the best results. The techniques should be chosen according to the process system
(gaussian vs non-gaussian) and the objective (fault detection vs fault classification) at hand. ICA
and FDA are powerful techniques and there is much more to them than what we have
touched upon in this chapter. While you are encouraged to explore more about these methods (now
that you have the conceptual understanding), we will move on to study another powerful ML
technique called support vector machines in the next chapter.
Chapter 7
Support Vector Machines & Kernel-based
Learning
In previous chapters, we saw methods that help us overcome the ‘curse of dimensionality’.
Wouldn’t it be great to have ML methods that do not ‘mind’ dealing with high-dimensional
systems? SVM (support vector machine) is one such algorithm which excels at dealing with
high-dimensional, nonlinear, and small or medium-sized data. SVMs, by design, minimize
overfitting to provide excellent generalization performance. Another major quality of SVMs is
that they are extremely flexible and can be employed for classification and regression tasks
in both supervised and unsupervised settings. You may already be thinking, “Wow! SVMs
seem to pack a lot of good features.” In fact, before ANNs became the craze in the ML community,
SVMs were the toast of the town. Even today, SVM is a must-have tool in every ML
practitioner’s toolkit.
Above, we mentioned only some of the features of SVMs. You will find more as you work
through this chapter. In terms of use in process industry, SVMs have been employed for fault
classification, process monitoring, outlier detection, soft sensing, etc.
To understand different aspects of SVMs, we will cover the following topics in this chapter
• Fundamentals of SVMs
• The kernel trick for nonlinear modeling
• SVDD (support vector data description) for unsupervised classification
• Fault detection via SVDD for semiconductor manufacturing process
• Fundamentals of SVR (support vector regression)
• Soft sensing via SVR in a polymer plant and a petroleum refinery
The classical SVM is a supervised linear technique for solving binary classification problems.
For illustration, consider Figure 7.1a. Here, in a 2D system, the training data-points belong to
2 distinct (positive and negative) classes. The task is to find a line/linear boundary that
separates these 2 classes. Two sample candidate lines are also shown. While these lines
clearly do the stated job, something seems amiss. Each of them passes very close to some
of the training data-points. This can cause poor generalization: for example, the shown test
observation ‘A’ lies closer to the positive samples but will get classified as negative class by
boundary L2. This clearly is undesirable.
Figure 7.1: (a) Training data distribution with test sample A (b) Optimal separating boundary
The optimal separating line/decision boundary, line L3 in Figure 7.1b, lies as far away as
possible from either class of data. L3, as shown, lies midway between the support planes (planes
that pass through the training points closest to the separating boundary). During model fitting,
SVM simply finds this optimal boundary that corresponds to the maximum margin (the distance
between the support planes). In Figure 7.1, any other orientation or position of L3 will reduce
the margin and will make L3 closer to one class than to the other. Large margins make model
predictions robust to small perturbations in the training samples.
Points that lie on the support planes are called support vectors 33 and completely determine
the optimal boundary, and hence the name, support vector machines. In Figure 7.1, if support
vectors are moved, line L3 may change. However, if any non-support vectors are removed,
L3 won’t get affected at all. We will see later how the sole dependency on the support vectors
imparts computational advantage to the SVMs.
³³ Calling data-points vectors may seem odd. In the general SVM literature, support vectors refer to the vectors originating from the origin whose tips are the data-points on the support planes.
Mathematical background
Let there be N training samples $(x, y)$, where $x$ is an input vector in the m-dimensional input space
and $y$ is the class label (±1). Let the optimal separating hyperplane (a line in 2D space) be
represented as $w^T x + b = 0$, where the model parameters $(w, b)$ are found such that
$$\min_{w,b} \ \frac{1}{2}\|w\|^2 \qquad \text{eq. 1}$$
$$\text{s.t.} \quad y_i(w^T x_i + b) \geq 1, \quad i = 1, \cdots, N$$
Once the model has been fitted, class predictions for a test sample, $x_t$, are made as $\hat{y}_t = sign(w^T x_t + b)$.
The expression inside the sign function is also called the decision function; a positive
decision function results in a positive class prediction and vice-versa.
The optimization formulation in Eq. 1 and all the others that we will see in
this chapter share a very favorable property of possessing a unique global
minimum. This is a huge advantage when compared to other powerful ML
methods like neural networks, where the issue of local minima can be an
inconvenience.
# read data
import numpy as np
data = np.loadtxt('toyDataset.csv', delimiter=',')
X = data[:, [0, 1]]; y = data[:, 2]

# fit linear SVM (hyperparameter values shown are illustrative)
from sklearn.svm import SVC
model = SVC(kernel='linear', C=100).fit(X, y)
The above code provides us the optimal separating boundary shown in Figure 7.1³⁴. As with
other Sklearn estimators, the predict() method can be used to predict the class of any test
observation. We will soon cover the hyperparameters (kernel and C) used in the above code.
³⁴ Check out the online code to see how the separating boundary and support planes are plotted.
Figure 7.2: Presence of the shown bad sample makes perfect linear separation infeasible
To deal with such scenarios, we add a little flexibility into our SVM optimization program by
modifying the constraints as shown below
$$w^T x_i + b \geq 1 - \xi_i \quad \text{for } y_i = 1$$
$$w^T x_i + b \leq -1 + \xi_i \quad \text{for } y_i = -1$$
Here, we use slack variables ($\xi_i$) to allow each sample the freedom to end up on the wrong
side of the support plane and potentially be misclassified during model fitting. However, we
would like to keep the number of such violations low, which we can achieve by
penalizing the violations. The revised SVM formulation looks like this
$$\min_{w,b,\xi} \ \frac{1}{2}\|w\|^2 + C\sum_{i=1}^{N}\xi_i \qquad \text{eq. 2}$$
$$\text{s.t.} \quad y_i(w^T x_i + b) \geq 1 - \xi_i, \quad \xi_i \geq 0, \quad i = 1, \cdots, N$$
The above formulation is called soft margin classification (as opposed to the previous hard
margin classification). Sklearn implements soft margin formulation. The positive constant, C,
is a hyperparameter (C=1 in Sklearn by default) and corresponds to the hyperparameter we
saw in the previous code. For our toy dataset 2 (in Figure 7.2), with the previously shown
code, we end up with the same separating boundary as shown in Figure 7.1. Class prediction
expression remains the same as $\hat{y}_t = sign(w^T x_t + b)$.
C as regularization hyperparameter
The slack variables not only help find a solution in the presence of gross impurities (like the
bad sample in Figure 7.2), but they also help to avoid overfitting noisy data. For example, consider
the scenario in Figure 7.3. If no misclassifications are allowed, we end up with a very small margin, while with a single
misclassification we get a much better margin with potentially better generalization. Therefore,
we see that there is a trade-off between margin maximization and training accuracy.
The hyperparameter C is the knob to control the trade-off. A large value of C implies heavy
penalization of the constraint violations, which will prevent misclassifications, while a small value
of C allows more misclassifications during model fitting for a better margin.
While the soft margin classification formulation is quite flexible, it won’t work for nonlinear
classification problems where curved boundaries are warranted. Consider the dataset in
Figure 7.4. It is clear that a linear boundary is inappropriate here.
However, all is not lost here. One idea to circumvent this issue is to map the original input
variables/features into a higher-dimensional space where they become linearly separable. For
the data in Figure 7.4, the following transformation would work quite well: $\varphi([x_1, x_2]^T) = [x_1, x_2, x_1^2 + x_2^2]^T$.
As we can see in Figure 7.5, in the 3D space, the data is easily linearly separable!
SVM can be trained on the new feature space to obtain the optimal separating hyperplane.
Any new test data point can be transformed via the same mapping function, 𝜑, for its class
determination via the fitted SVM model. While this solution looks great, there remains a small
issue: how do we find an appropriate mapping for a high-dimensional input dataset? As it
turns out, you don’t need to find this mapping explicitly; this is made possible by a neat
‘kernel trick’. Let’s learn what this trick is and how it is used. For this, we will revisit the
mathematical background of SVMs.
$$\min_{\alpha} \ \frac{1}{2}\sum_{i=1}^{N}\sum_{j=1}^{N}\alpha_i\alpha_j y_i y_j x_i^T x_j - \sum_{i=1}^{N}\alpha_i \qquad \text{eq. 3}$$
$$\text{s.t.} \quad \sum_{i=1}^{N} y_i\alpha_i = 0; \quad 0 \leq \alpha_i \leq C, \quad i = 1, \cdots, N$$
Here, the optimization parameters $w, b$ have been replaced by the $\alpha$s (also called Lagrange
multipliers). This equivalent form is called the dual form (which you may remember from your
optimization classes; it’s perfectly fine if you have not encountered this term before). Once the $\alpha$s
have been found, $w$ and $b$ can be computed via the following
$$w = \sum_{i=1}^{N}\alpha_i y_i x_i, \qquad b = \frac{1}{N_s}\sum_{i \in \{SV\}}\left(y_i - w^T x_i\right)$$
where $N_s$ is the number of support vectors and $\{SV\}$ is the set of support vector indices. Any test
data point can be classified as
$$\hat{y}_t = sign\left(\sum_{i=1}^{N}\alpha_i y_i x_i^T x_t + b\right) \qquad \text{eq. 4}$$
In the dual formulation, it is found that the $\alpha$s are non-zero for only the support vectors and zero
for the rest of the training samples. This implies that Eq. 4 can be reduced to
$$\hat{y}_t = sign\left(\sum_{i \in \{SV\}}\alpha_i y_i x_i^T x_t + b\right) \qquad \text{eq. 5}$$
Strictly speaking, support vectors need not lie on the support planes.
For soft margin classification, data-points with non-zero slacks are also
support vectors and their $\alpha$s are non-zero (the defining characteristic of the
support vectors). The presence/absence of the support vectors impacts the
solution (the objective function and/or the model parameters).
At this point, you may be wondering why we have made things more complicated; why not
solve the problem in the original form (Eq. 2), which seemed more interpretable? The reason
for doing this will become clear to you very soon. For now, imagine that you are solving the
nonlinear problem where SVM finds a separating hyperplane in the higher dimension. Eq. 3
will then look like the following
$$\min_{\alpha} \ \frac{1}{2}\sum_{i=1}^{N}\sum_{j=1}^{N}\alpha_i\alpha_j y_i y_j \varphi(x_i)^T\varphi(x_j) - \sum_{i=1}^{N}\alpha_i \qquad \text{eq. 6}$$
$$\text{s.t.} \quad \sum_{i=1}^{N} y_i\alpha_i = 0; \quad 0 \leq \alpha_i \leq C, \quad i = 1, \cdots, N$$
The most crucial observation here is that the transformed variables ($\varphi(x)$) appear only as
inner (dot) products. Now, if somehow we knew the values of these dot products, then we
would not need to know the exact form of $\varphi(x)$ or the mapping at all, and we could fit our SVM
model. This is made possible via kernel functions, $K(x_i, x_j)$, which provide the relationship
$$K(x_i, x_j) = \varphi(x_i)^T\varphi(x_j) \qquad \text{eq. 7}$$
As you can see, kernel functions allow us to compute dot products in the ‘unknown’
transformed space as a function of vectors in the original space! There are several forms of
kernel functions to choose from and thus this choice becomes a model hyperparameter. Once
a kernel function is chosen, Eq. 6 becomes
$$\min_{\alpha} \ \frac{1}{2}\sum_{i=1}^{N}\sum_{j=1}^{N}\alpha_i\alpha_j y_i y_j K(x_i, x_j) - \sum_{i=1}^{N}\alpha_i \qquad \text{eq. 8}$$
$$\text{s.t.} \quad \sum_{i=1}^{N} y_i\alpha_i = 0; \quad 0 \leq \alpha_i \leq C, \quad i = 1, \cdots, N$$
Above is the form in which an SVM model is fitted, and predictions are made as
$$\hat{y}_t = sign\left(\sum_{i \in \{SV\}}\alpha_i y_i K(x_i, x_t) + b\right)$$
These kernel functions allow us to obtain powerful nonlinear classifiers while retaining all the
benefits of the original linear SVM method!
The table below lists some commonly used kernel functions. The first one is simply the familiar
dot product of two vectors, while the 2nd one, the RBF (radial basis function) or gaussian kernel,
is the most popular (and usually the default choice) kernel for nonlinear problems.

Kernel        Function                                          Hyperparameters
Linear        $K(x, z) = x^T z$                                 –
Gaussian      $K(x, z) = \exp(-\|x - z\|^2 / 2\sigma^2)$        $\sigma$
Polynomial    $K(x, z) = (\gamma x^T z + r)^d$                  $\gamma, r, d$
Sigmoid       $K(x, z) = \tanh(\gamma x^T z + r)$               $\gamma, r$
Let’s use the polynomial kernel to illustrate how using kernel functions amounts to higher
dimensional mapping. Assume that we use the following kernel
2
𝑲(𝒙, 𝒛) = (𝒙𝑻 𝒛 + 𝟏)
where 𝒙 = [𝑥1 , 𝑥2 ]𝑻 and 𝒛 = [𝑧1 , 𝑧2 ]𝑻 are two vectors in the original 2D space. We claim
that the above kernel is equivalent to the following mapping
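A quick numerical check (illustrative; not part of the original text) confirms the equivalence:

# verify that the kernel value equals the dot product in the mapped 6D space
import numpy as np
phi = lambda u: np.array([1, np.sqrt(2)*u[0], np.sqrt(2)*u[1], u[0]**2, u[1]**2, np.sqrt(2)*u[0]*u[1]])
x, z = np.array([0.5, -1.0]), np.array([2.0, 0.3])
print((x @ z + 1)**2, phi(x) @ phi(z)) # both print 2.89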
Therefore, if you use the above polynomial kernel, you are implicitly assuming that your
dataset is linearly separable in the shown 6-dimensional feature space! In general, you will
have to experiment with the kernel hyperparameters to determine what exact form of
polynomial kernel works best for your problem.
If you were amazed by the previous illustration, you will find it more interesting to know that
gaussian kernels map the original space into an infinite-dimensional feature space, making
gaussian kernels a very powerful mechanism. Luckily, we don’t need to know the form of this
feature space.
Sklearn Implementation
Let’s try to find the nonlinear classifier boundary for the toy dataset in Figure 7.4 using
gaussian kernel. We will also find optimal values of C and 𝝈 via grid-search and cross-
validation.
# generate data
from sklearn.datasets import make_circles
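The remainder of this code is available online; a minimal sketch of the data generation and grid-search steps, with assumed dataset settings and an illustrative hyperparameter grid, might look as follows:

X, y = make_circles(n_samples=200, noise=0.1, factor=0.3, random_state=1) # settings assumed

# grid-search over C and gamma with cross-validation (grid values are illustrative)
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV
param_grid = {'C': [0.01, 0.1, 1, 10, 100], 'gamma': [0.1, 1, 10, 100]}
gs = GridSearchCV(SVC(kernel='rbf'), param_grid, cv=5).fit(X, y)
print('Optimal hyperparameters:', gs.best_params_)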
You will notice that Sklearn uses the hyperparameter gamma, which is simply $1/(2\sigma^2)$. The optimal
C and gamma come out to be 0.1 and 1, respectively, with the classifier solution shown in
Figure 7.6. The figure also shows the boundary regions for low and high values of the
hyperparameters. As we saw before, large C leads to overfitting (boundary impacted by the
noise). As far as gamma is concerned, a large value (or small $\sigma$) also leads to overfitting.
Figure 7.6: Nonlinear binary classification via kernel SVM and impact of hyperparameters.
[Code for plotting the boundaries is provided online]
A better intuition behind the kernels can help us understand the impact of kernel
hyperparameters on the classification boundaries. Kernels provide an indirect
measure of similarity between any two points in the high-dimensional space.
Consider again the gaussian kernel
$$K(x, z) = \exp(-\|x - z\|^2 / 2\sigma^2)$$
Here, if two points $(x, z)$ are close to each other in the original space, then their
similarity (or kernel value) in the mapped space will be higher, compared to when
$x$ and $z$ are far away from each other. Now, let’s look at the classifier prediction
formula for a sample $z$
$$\hat{y}(z) = sign\left(\sum_{i \in \{SV\}}\alpha_i y_i K(x_i, z) + b\right)$$
Therefore, the classifier is nothing but a sum of gaussian bumps from the support
vectors (plus an offset b)!
Given the bandwidth ($\sigma$), during training, SVM tries to find the optimal values of the
bump multipliers (the $\alpha$s) such that the training samples get correct labels while
keeping maximum separation between the classification boundary and training
samples. The boundary is simply the set of points where the net summation of
bumps and offset becomes zero. Small values of $\sigma$ lead to very localized bumps near
each support vector, resulting in a higher number of support vectors and too many
‘wiggles’ in the separating boundary, which often indicates overfitting.
Support vector data description (SVDD) is the unsupervised form of the SVM algorithm used for
dealing with problems where training samples belong to only one class and the model
objective is to determine if any test/new sample belongs to the same class of data or not.
Consider the motivating example in Figure 7.7. Here, how do we obtain a model of the training
data to be able to call sample X an outlier or abnormal? Such problems are quite common in the
process industry. Process monitoring and equipment health monitoring are some example
areas where most or all of the available process data may belong to only the normal plant operations
class and the modeling objective is to identify any abnormal data-point.
Figure 7.7: 2D training dataset with only one class. Boundary in red shows a potential
description model of the dataset that can be used to distinguish the test sample X from
training samples.
The idea behind SVDD is to envelop training data by a hypersphere (circle in 2D, sphere in
3D) containing maximum number of data-points within a smallest volume. Any new
observation that lies farther than the hypersphere radius from hypersphere center can be
regarded as an abnormal observation. But the data in Figure 7.7 doesn’t look like it can be suitably
enveloped by a circle! That is correct, and our recourse is to use kernel functions to implicitly
project the original data onto a higher-dimensional space where the data can be adequately
enveloped within a compact hypersphere. The projection of the optimal hypersphere onto the
original space will show up as a tight nonlinear boundary around the dataset!
Just like the classical 2-class SVM, only a small set of training samples get to completely define
the hypersphere. These data-points, or support vectors, lie on the circumference or outside of
the hypersphere (or the nonlinear boundary in the original space).
Mathematical background
Assume again that 𝜑(𝒙) represents a data-point in the higher dimensional feature space. In
this space, the optimal hypersphere is found via the following optimization problem
$$\min_{R,a,\xi} \ R^2 + C\sum_{i=1}^{N}\xi_i$$
$$\text{s.t.} \quad \|\varphi(x_i) - a\|^2 \leq R^2 + \xi_i, \quad \xi_i \geq 0, \quad i = 1, \cdots, N$$
As is evident, the above program tries to minimize the radius (R) of the hypersphere
centered at $a$ such that most of the data-points lie within the hypersphere. Slack variables,
$\xi_i$, allow certain samples to fall outside, and the number of such violations is tuned via the
hyperparameter C. As before, the problem is solved in its dual form
$$\min_{\alpha} \ \sum_{i=1}^{N}\sum_{j=1}^{N}\alpha_i\alpha_j K(x_i, x_j) - \sum_{i=1}^{N}\alpha_i K(x_i, x_i)$$
$$\text{s.t.} \quad \sum_{i=1}^{N}\alpha_i = 1; \quad 0 \leq \alpha_i \leq C, \quad i = 1, \cdots, N$$
Like SVM, the alphas indicate the position of the training samples w.r.t. the optimal boundary.
The following relationships hold true
$$a = \sum_{i=1}^{N}\alpha_i\varphi(x_i)$$
$$\|\varphi(x_i) - a\|^2 < R^2 \Rightarrow \alpha_i = 0; \quad \|\varphi(x_i) - a\|^2 = R^2 \Rightarrow 0 < \alpha_i < C; \quad \|\varphi(x_i) - a\|^2 > R^2 \Rightarrow \alpha_i = C$$
$$R^2 = Dist(\varphi(x_s), a)^2$$
where $x_s$ is any support vector lying on the boundary. Any test observation $x_t$ is abnormal if
its distance from the center $a$ in the mapped space is greater than R, where the distance is given
as follows
$$Dist(\varphi(x_t), a)^2 = \|\varphi(x_t) - a\|^2 = K(x_t, x_t) - 2\sum_{i \in \{SV\}}\alpha_i K(x_t, x_i) + \sum_{i \in \{SV\}}\sum_{j \in \{SV\}}\alpha_i\alpha_j K(x_i, x_j)$$
As you can see, specification of the kernel function and other model hyperparameters is all that
is needed; no knowledge of the mapping $\varphi$ is required.
OC-SVM vs SVDD
There is another technique closely related to SVDD, called one-class SVM (OC-SVM). In fact,
OC-SVM is the unsupervised SVM algorithm currently available in Sklearn. OC-SVM finds a
separating hyperplane that best separates the training data from the origin. Its kernelized dual
form is given by
$$\min_{\alpha} \ \frac{1}{2}\sum_{i=1}^{N}\sum_{j=1}^{N}\alpha_i\alpha_j K(x_i, x_j)$$
$$\text{s.t.} \quad \sum_{i=1}^{N}\alpha_i = 1; \quad 0 \leq \alpha_i \leq C, \quad i = 1, \cdots, N$$
You will notice that for the gaussian kernel, the OC-SVM formulation becomes equivalent to that of
SVDD because $K(x_i, x_i) = 1$, and we end up with the same values of the multipliers. The
decision boundaries are the same as well. For other kernels with $K(x_i, x_i) \neq 1$, the results would
be different.
Previously, we saw that C controls the trade-off between the volume of the hypersphere and the
number of misclassifications in the training dataset. C can also be written as
$$C = \frac{1}{Nf}$$
where N is the number of samples and f is the expected fraction of outliers in the training
dataset. A smaller value of f (correspondingly, a larger C) will lead to fewer samples being put
outside the hypersphere. In fact, if C is set to 1 (or greater), the hypersphere will include all the
samples (as $\sum\alpha_i = 1$ and $\alpha_i = C$ for samples outside the hypersphere). Therefore, C can be set with some
educated presumptions on the outlier fraction. In absence of any advance knowledge, f =
0.01 is often specified to exclude the 1% of samples lying farthest from the hypersphere center.
As far as $\sigma$ is concerned, we previously saw that at low values of $\sigma$, the data boundary becomes
very wiggly with a high number of support vectors, resulting in overfitting. Conversely, at high
values of $\sigma$, the boundary tends to become spherical in the original space itself, resulting in
underfitting (or non-compact bounding of data). One approach for bandwidth selection is to
use empirical methods which are based on obtaining a kernel matrix (whose i,jth element is
$K(x_i, x_j)$) with favorable properties. One such method, the modified mean criterion³⁵, gives the
bandwidth as follows
$$\sigma = \sqrt{\frac{\bar{D}^2}{\ln\left(\frac{N-1}{\delta^2}\right)}}$$
$$\bar{D}^2 = \frac{\sum_{i<j}\|x_i - x_j\|^2}{N(N-1)/2}$$
$$\delta = -0.14818008\phi^4 + 0.2846623624\phi^3 - 0.252853808\phi^2 + 0.159059498\phi - 0.001381145$$
$$\phi = \frac{1}{\ln(N-1)}$$
Another approach for bandwidth selection is to choose the largest value of $\sigma$ that gives the desired
confidence level on the validation dataset. For example, for a confidence level of 99%, $\sigma$ is
increased until 99% of validation samples are correctly classified as inliers. Any higher value
of $\sigma$ will include more validation samples within the hypersphere. The modified mean criterion
can be used as the initial guess with a subsequent search made around it. Let’s now find the
nonlinear boundary for the dataset in Figure 7.7.
# read data
import numpy as np
X = np.loadtxt('SVDD_toyDataset.csv', delimiter=',')
³⁵ Kalde & Sadek, The mean and median criterion for kernel bandwidth selection for support vector data description, IEEE, 2017
import scipy.spatial

# modified mean criterion for gaussian kernel bandwidth
N = X.shape[0]
phi = 1/np.log(N-1)
delta = -0.14818008*np.power(phi,4) + 0.2846623624*np.power(phi,3) - 0.252853808*np.power(phi,2) + 0.159059498*phi - 0.001381145
D2 = np.sum(scipy.spatial.distance.pdist(X, 'sqeuclidean'))/(N*(N-1)/2)
sigma = np.sqrt(D2/np.log((N-1)/(delta*delta))) # note the parentheses around delta**2
gamma = 1/(2*sigma*sigma)

# SVM fit
from sklearn.svm import OneClassSVM
model = OneClassSVM(nu=0.01, gamma=gamma).fit(X) # nu corresponds to f
Figure 7.8 shows the data boundary for different values of gamma with f (or nu in Sklearn)
kept at 0.01. A value of gamma (= 1) close to that given by the modified mean criterion method
(= 0.58) provided a satisfactory boundary.
Figure 7.8: SVDD application for data description and impact of model hyperparameter.
We hope that by now you are convinced of the powerful capabilities of SVM for discriminating
between different classes of data and for compact bounding of normal operational data. A big
requirement for successful application of SVM is that the training dataset should be very
representative of the ‘normal’ operation and fully characterize all the expected variations. Next,
we will look at a case study with real process data and then move on to another variant of SVM
utilized for regression applications.
To illustrate a practical application of SVDD for process monitoring, we will use data from a
semiconductor manufacturing process. This batch process dataset contains 19 process
variables measured over the course of 108 normal batches and 21 faulty batches. The batch
durations range from 95 to 112 seconds (see appendix for more details). Figure 7.9 shows
the training samples and the faulty test samples in the principal component space.
Figure 7.9: Normal (in blue) and faulty (in red) batches in PCA score space
For this illustration, the raw data has been processed using multiway PCA and the
transformed 2D (score) data is provided in the Metal_etch_2DPCA_trainingData.csv file. Note
that we could also implement SVDD in the original input space but pre-processing via PCA to
remove variable correlation is generally a good practice. Moreover, we use the 2D PC space
for our analysis just for the ease of illustrating the SVDD boundary. In actual deployment, you
would work in higher dimensional PC space for better accuracy. Let’s see if our model can
identify the faulty samples as outliers or not in the multi-clustered dataset.
# read data
import numpy as np
X_train = np.loadtxt('Metal_etch_2DPCA_trainingData.csv', delimiter=',')

# fit SVM
from sklearn.svm import OneClassSVM
model = OneClassSVM(nu=0.01, gamma=0.025).fit(X_train) # gamma from modified mean criterion = 0.0025

# predict on the faulty test batches (X_test; loading code available online)
y_test = model.predict(X_test) # -1 => outlier
print('Number of faults identified: ', np.sum(y_test == -1), ' out of ', len(y_test))
Figure 7.10 shows the boundary around the training samples and the faulty samples labeled
according to their correct or incorrect identification. Seventeen out of twenty faulty data
samples have correctly been identified as outliers. This example illustrates the power of SVDD
for compactly describing clustered datasets.
Figure 7.10: (a) SVDD / OC-SVM boundary (in red) around metal-etch training dataset in 2D
PC space (b) Position of correctly and incorrectly diagnosed faulty samples
We would get the same results if we used the distances from the hypersphere center for fault
detection. The results from SVDD and OC-SVM will differ if RBFs are not used as the kernel.
Unfortunately, Sklearn currently does not provide an SVDD implementation. Nonetheless, an
SVDD package is available on GitHub³⁶.
³⁶ https://github.com/iqiukp/SVDD
Support vector regression (SVR) is another variant of SVM, used for linear and nonlinear
regression. SVR attempts to find a regression curve that is as close as possible to the training
observations. Geometrically, as shown in Figure 7.11 for kernelized SVR, a tube of pre-specified
width ($\varepsilon$) is fitted to the data and any sample lying outside the tube is penalized. You
will see later that SVR’s optimization program is designed to obtain a good balance between
model generalization capabilities and the ability to fit the training data.
SVR generates a linear hyperplane that best describes the output as a function of the inputs.
For kernelized SVR, the hyperplane translates into an appropriate nonlinear curve in the
original measurement space. As shown in Figure 7.11, in the feature space, most of the data-points
lie within $\varepsilon$ distance of the optimal hyperplane. Slack variables, $\xi$, allow some samples
to fall outside the $\varepsilon$ tube; however, such violations are penalized. Samples lying on the edge
or outside the tube are the support vectors, which completely determine the optimal solution
for a given $\varepsilon$. Adding or removing any non-support vector (or samples inside the tube) does
not affect the solution.
SVR has been found to provide performance superior to ANNs for small and medium-sized
high-dimensional datasets. We will see a couple of applications later in the chapter.
Mathematical Background
SVR’s optimization program for parameter estimation takes the following form in the feature
space
$$\min_{w,b,\xi,\xi^*} \ \frac{1}{2}\|w\|^2 + C\sum_{i=1}^{N}(\xi_i + \xi_i^*)$$
$$\text{s.t.} \quad y_i - w^T\varphi(x_i) - b \leq \varepsilon + \xi_i$$
$$\qquad\quad w^T\varphi(x_i) + b - y_i \leq \varepsilon + \xi_i^*$$
$$\qquad\quad \xi_i, \xi_i^* \geq 0, \quad i = 1, \cdots, N$$
The first part of the objective ($\frac{1}{2}\|w\|^2$) attempts to keep the model as simple as possible,
or the model output as flat as possible. For example, for $y = w^Tx + b$, $w = 0$ gives the simplest
model $y = b$ with the least dependency on the inputs. The 2nd part tries to keep the number of samples
outside of the $\varepsilon$ tube as low as possible. The hyperparameter C controls the trade-off between
the two.
In the SVR formulation, there are no errors associated with samples that
lie within the $\varepsilon$ tube. In the SVM literature, this is referred to as implementing
the $\varepsilon$-insensitive loss function, $L_\varepsilon(e) = \max(0, |e| - \varepsilon)$.
As before, the problem is solved in its dual form
$$\min_{\alpha,\alpha^*} \ \frac{1}{2}\sum_{i=1}^{N}\sum_{j=1}^{N}(\alpha_i - \alpha_i^*)(\alpha_j - \alpha_j^*)K(x_i, x_j) + \varepsilon\sum_{i=1}^{N}(\alpha_i + \alpha_i^*) - \sum_{i=1}^{N}y_i(\alpha_i - \alpha_i^*)$$
$$\text{s.t.} \quad \sum_{i=1}^{N}(\alpha_i - \alpha_i^*) = 0; \quad 0 \leq \alpha_i, \alpha_i^* \leq C$$
If you notice carefully, here we assign two multipliers ($\alpha_i, \alpha_i^*$) to each data-point. Both
multipliers end up being 0 for non-support vectors (training samples lying within the $\varepsilon$ tube).
Once the multipliers have been found, predictions can be made by
$$\hat{y}_t = \sum_{i \in \{SV\}}(\alpha_i - \alpha_i^*)K(x_i, x_t) + b$$
The specification of the kernel function and hyperparameters allows us to fit the model (find the
multipliers) and make predictions. Figure 7.12 shows the impact of the SVR hyperparameters
(C, $\gamma$, $\varepsilon$) with the RBF kernel for a SISO system.
Figure 7.12: Impact of SVR hyperparameters on model fitting quality. Encircled training
samples denote the support vectors selected by the respective models.
Figure 7.12 is along the expected lines w.r.t. gamma and C. Let’s understand the impact of $\varepsilon$.
A large $\varepsilon$ makes the $\varepsilon$ tube big, allowing SVR to keep the model ‘flat’, leading to underfitting. On
the other hand, a very small $\varepsilon$ makes the tube very small, which results in a high number of support
vectors, making the fit prone to overfitting. Nonetheless, it is not uncommon to specify a small $\varepsilon$
and control overfitting via C. To see SVR implementations using Sklearn, let’s check out a
couple of industrial applications in the next sections.
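The impact of $\varepsilon$ can also be checked quickly (an illustrative sketch, not from the text): shrinking the tube increases the number of support vectors.

# illustrative check on a noisy SISO dataset
import numpy as np
from sklearn.svm import SVR
rng = np.random.default_rng(1)
x_demo = np.linspace(0, 5, 60)[:, None]
y_demo = np.sin(x_demo).ravel() + rng.normal(0, 0.1, 60)
for eps in [0.5, 0.1, 0.01]:
    svr = SVR(kernel='rbf', C=10, epsilon=eps).fit(x_demo, y_demo)
    print('epsilon =', eps, ':', len(svr.support_), 'support vectors')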
SVR vs ANN
[We will learn about ANNs in detail in Chapter 11. If you are not familiar with them,
you can re-visit this note later. But we figured that if you already know, this
comparison could throw some more light on the SVR algorithm.]
In ANNs, the number of fitted model parameters (and therefore the tendency to
overfit) increases with the dimensionality of the input vectors. In SVR, however, the
number of parameters depends on the number of support vectors and not on the input
dimensionality. Adding more input variables to an SVR model doesn’t increase model
complexity if the number of support vectors doesn’t change. This greatly helps SVRs
overcome the overfitting issue arising from high input dimensionality. Moreover,
through support vectors, SVR models provide explicit knowledge of the training
data-points which are important in defining the regression prediction; this can often
help in rationalizing model predictions. Overall, SVRs are as powerful as ANNs for
nonlinear modeling, with some unique advantageous features.
For our first illustration, we will apply SVR for soft sensing to predict plant outputs in a
polymer plant. The dataset, obtained from Dupont, consists of 61 samples of 14 variables (10
predictors, 4 plant outputs). We chose this dataset to illustrate the utility of SVRs for small-sized
datasets. Figure 7.13 provides a quick glimpse into the distribution of some of the
process inputs and a process output. The sparse nature of the data distribution is immediately
evident.
Sparse data modeling is known to be a difficult task in machine learning. Nonetheless, let’s
see how well our SVR model can handle this challenge. The code below generates an SVR-based
soft sensor model to predict each of the process outputs one at a time. Hyperparameters
are determined via grid-search and k-fold CV.
# read data
import numpy as np
data = np.loadtxt('polymer.dat')
gs.fit(X, y)
print('Optimal hyperparameter:', gs.best_params_)
y_predicted_SVR = gs.predict(X) # predictions from the best (refitted) estimator

plt.figure()
plt.plot(y, y_predicted_SVR, '.', markeredgecolor='k', markeredgewidth=0.5, ms=9)
plt.plot(y, y, '-r', linewidth=0.5)
plt.xlabel('measured data'), plt.ylabel('predicted data')
Note that all the variables in the raw dataset are already scaled between 0.1 and 0.9 and
therefore no further scaling is done before model fitting. The figure below shows the model
predictions. Predictions from PLS models are provided for comparison. The first 2 outputs
proved difficult for the linear PLS models, but the SVR models could provide good fits. Outputs 3
and 4 seem to be more linearly related to the inputs, with both PLS and SVR models providing
similar fits.
Figure 7.14: SVR and PLS predictions for polymer plant dataset. The red line denotes the
ideal yprediction = ymeasured reference.
In this illustration, we will use a medium-sized dataset that comes from a debutanizer column
operation in a petroleum refinery (see Appendix for system details). The butane (C4) content
in the gasoline bottoms product of the debutanizer column is not available in real-time and,
therefore, is required to be predicted using other process data around the column. The dataset
contains 2394 samples of input-output process values. Seven process variables (pressures,
temperatures, and flows around the column) are used as predictors. Figure 7.15 shows
that the dataset has decent process variability. Note that the data have been provided in normalized
form.
Figure 7.15: Plots of input and output (last plot) variables for the debutanizer column dataset
The output variable shows strong nonlinearity and therefore the PLS model fails miserably, as
shown in Figure 7.16. Let’s build an SVR-based soft sensor.
# read data
import numpy as np
data = np.loadtxt('debutanizer_data.txt', skiprows=5)
X = data[:, 0:7]; y = data[:, 7] # assumed layout: 7 inputs followed by the C4 content

# train/test split (the exact split used here is available in the online code)
from sklearn.model_selection import train_test_split, GridSearchCV
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=100)

# fit SVR via grid-search and 10-fold CV
from sklearn.svm import SVR
model = SVR(epsilon=0.05)
param_grid = [{'gamma': np.linspace(1,10,10), 'C': np.linspace(0.01,500,10)}]
gs = GridSearchCV(model, param_grid, scoring='neg_mean_squared_error', cv=10)
gs.fit(X_train, y_train)
As shown in Figure 7.16, with a very coarse hyperparameter grid, we are able to obtain an SVR
model that provides reasonable accuracy on test data. The difference between training and test
accuracies suggests some overfitting, which may be overcome with a more exhaustive
hyperparameter search.
Figure 7.16: SVR and PLS model performance comparison for debutanizer dataset
This concludes our look into support vector machines. SVMs are in a league of their own and
are well-suited for industrial processes with difficult-to-estimate process parameters. With an
elegant mathematical foundation, just a few hyperparameters, excellent generalization
capabilities, and a guaranteed unique global optimum, SVMs are among the best ML
algorithms.
Summary
In this chapter we studied the support vector machine algorithm and its varied forms for
supervised and unsupervised classification and regression. We saw its applications for binary
classification, process monitoring, fault detection, and soft sensing. Through kernelized
learning, we learned the art of nonlinear modeling. In summary, we have added a powerful
tool to your data science toolkit. Next, we will continue building our toolkit and learn how to
find clusters/groups in process datasets.
Chapter 8
Finding Groups in Process Data: Clustering &
Mixture Modeling
When exploring a dataset, one of the things that you should always do is check whether the data
shows the presence of distinct clusters. Most industrial datasets exhibit multiple operating modes
due to variations in production levels, feedstock compositions, ambient temperature, product
grades, etc., and data-points from different modes tend to group into different clusters.
Whether you are building a soft sensor or a monitoring tool, judicious incorporation of the
knowledge of these data clusters into process models will lead to better performance; failure
to do so will often lead to unsatisfactory results.
In absence of specific process knowledge or when the number of variables is large, it is not
trivial to find the number of clusters or to characterize the clusters. Fortunately, several
methodologies are available which you can choose from for your specific solution.
In this chapter, we will learn some of the popular clustering algorithms and understand their
strengths and weaknesses. We will conclude by building a monitoring tool for a multimode
semiconductor process. Specifically, the following topics are covered
• Introduction to clustering
• Finding groups using classical k-means clustering
• Grouping arbitrarily shaped clusters using DBSCAN
• Probabilistic clustering via gaussian mixture modeling
• Process monitoring of multimode processes
Clustering is an unsupervised task of grouping data into distinct clusters such that the data-points
within a cluster are more similar to each other than to the data-points in other clusters. In
process systems, clustering occurs naturally for multiple reasons. For example, in a power
generation plant, the production level changes according to the demand, leading to significantly
different values of plant variables with potentially different inter-variable correlations at
different production levels. The multimode nature of data distribution causes problems for
traditional ML techniques. To understand this, consider the illustrations in Figure 8.1. In
subfigure (a), data indicates 2 distinct modes of operation. From a process monitoring
perspective, it would make sense to draw separate monitoring boundaries around the two
clusters; doing so would clearly identify the red-colored data-point as an outlier or a fault.
Conventional PCA-based monitoring, on the other hand, would fail to identify the outlier. In
subfigure (b), the correlation between the variables is different in the two clusters. From a soft
sensing perspective, it would make sense to build separate models for the two clusters. A
conventional PLS model would give inaccurate results.
Figure 8.1: Illustrative scenarios for which conventional ML techniques are ill-suited
Once the clusters have been characterized in the training data and cluster-wise models have
been built, the prediction for a new sample can be obtained by either considering only the cluster-model
most suitable for the new sample or combining the predictions from all the models, as
shown in Figure 8.2. The decision fusion module can take various forms. For example,
for a process monitoring application, a simple fusion strategy could be to consider a new
sample as normal if at least one of the cluster-models predicts so. A different strategy
could be to combine the abnormality metrics from all the models and make the prediction based
on this fused metric. Similarly, for a soft sensing application, the response variable predictions from
individual models can be weighted and combined to provide the final prediction.
There are different clustering algorithms, which primarily differ in the way the ‘similarity’
between the data-points is defined. The popular algorithms can be divided into the following
four categories.
• Centroid-based algorithms: In these algorithms, the similarity between data-points is
quantified by the distance of the data-points from the centroids of the clusters. K-Means and
Fuzzy C-Means models belong to this category.
• Density-based algorithms: Density models find areas of high data density in the
multivariable space and assign data-points to the different clusters/regions. DBSCAN and
OPTICS are popular examples of density models.
• Distribution-based algorithms: Here, clusters are characterized via probability distributions
and data-points are assigned to clusters based on the likelihoods of having been generated
from the corresponding distributions. Gaussian mixture models belong to this category.
• Connectivity-based algorithms: These hierarchical models group data-points based on the
connectivity between them, producing nested clusters. Agglomerative clustering is a popular
example.
# fetch data (the .mat file loading via scipy.io.loadmat is shown in the online code)
import scipy.io
variable_names = Etch_data[0,0].variables # Etch_data: struct array holding the batch data
Figure 8.3: Select variable plots for all batches in metal etch dataset. Each colored curve
corresponds to a batch.
Figure 8.3 does indicate multimode operation with mean and covariance changes. It is,
however, difficult to estimate the number of operation modes by examining a high-dimensional
dataset directly. A popular practice is to reduce the process dimensionality via PCA and then
apply clustering to facilitate visualization. Performing PCA serves other purposes as well. We
will see later that the expectation-maximization (EM) algorithm is employed to estimate cluster
parameters in K-Means and GMM models. High dimensionality implies a high number of
parameters to be estimated, which increases the possibility of EM converging to locally optimal
results, and correlated variables cause EM convergence issues. PCA helps to overcome
these two problems simultaneously.
We will employ multiway PCA for this batch process dataset. We will follow the approach of
He et al.³⁷, where for each batch 85 sample points are retained to deal with batch length
variability, the first 5 samples are ignored to eliminate initial fluctuations in sensor measurements,
and 3 PCs are retained.
# scale data (unfolded_dataMatrix: batch-wise unfolded array; see online code for the unfolding step)
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
data_train_normal = scaler.fit_transform(unfolded_dataMatrix)

# PCA
from sklearn.decomposition import PCA
pca = PCA(n_components = 3)
score_train = pca.fit_transform(data_train_normal)
³⁷ He and Wang, Fault detection using the k-nearest neighbor rule for semiconductor manufacturing processes, IEEE Transactions on Semiconductor Manufacturing, 2007
Figure 8.4: Score plot of PC1 and PC2 for calibration batches in metal etch dataset
Figure 8.4 confirms the existence of 3 operating modes. While visual inspection of score plots can
help to decide the number of clusters, we will, nonetheless, learn ways to estimate this in a
more automated way.
K-Means is one of the most popular clustering algorithms due to its simple concept, ease of
implementation, and computational efficiency. Let K denote the number of clusters and $\{x_i\}, i = 1, \cdots, N$
be the set of N m-dimensional points. The cluster assignment of the data-points is
determined such that the following sum of squared errors (SSE), also called cluster inertia, is
minimized
$$SSE = \sum_{k=1}^{K}\sum_{x_i \in \, k\text{th cluster}}\|x_i - \mu_k\|_2^2 \qquad \text{eq. 1}$$
Here, $\mu_k$ is the centroid of the kth cluster and $\|x_i - \mu_k\|_2^2$ denotes the (squared) Euclidean distance of
$x_i$ from $\mu_k$. To solve eq. 1, k-means adopts an intuitive iterative procedure: initialize the K
centroids, assign each data-point to its closest centroid, recompute each centroid as the mean of
its assigned data-points, and repeat the assignment and update steps until the cluster assignments
stop changing. A Sklearn-based fit is sketched below.
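The fit itself is a one-liner in Sklearn (a sketch; the n_init and random_state values are illustrative):

from sklearn.cluster import KMeans
n_cluster = 3
kmeans = KMeans(n_clusters=n_cluster, n_init=10, random_state=100).fit(score_train)
cluster_label = kmeans.labels_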
plt.figure()
plt.scatter(score_train[:, 0], score_train[:, 1], c = cluster_label, s = 20, cmap = 'viridis')

cluster_centers = kmeans.cluster_centers_
cluster_plot_labels = ['Cluster ' + str(i+1) for i in range(n_cluster)]
for i in range(n_cluster):
    plt.scatter(cluster_centers[i,0], cluster_centers[i,1], c = 'red', s = 40, marker = '*', alpha = 0.9)
    plt.annotate(cluster_plot_labels[i], (cluster_centers[i,0], cluster_centers[i,1]))
As expected, Figure 8.5 shows that k-means does a good job at cluster assignment. K-means
clustering results are strongly influenced by the initial selection of cluster centers; a bad selection
can result in improper clustering. To overcome this, the k-means algorithm provides a parameter,
n_init (default value of 10), which determines the number of times independent k-means
clustering is performed with different initial centroid assignments; the clustering with the lowest
SSE is selected as the final model. The strategy for selection of the initial centroids can also be
changed via the init parameter; the default k-means++ option adopts a smarter (compared to the
random option) way to speed up convergence by ensuring that the initial centroids are far
away from each other.
# elbow method: compute cluster inertias for different numbers of clusters
from sklearn.cluster import KMeans
SSEs = [KMeans(n_clusters=k, n_init=10).fit(score_train).inertia_ for k in range(1, 10)]

plt.figure()
plt.plot(range(1,10), SSEs, marker='o')
plt.xlabel('Number of clusters')
plt.ylabel('SSEs')
Figure 8.6: Cluster inertias for different number of clusters for metal-etch dataset
The silhouette coefficient (or value) of a data-point ranges from -1 to 1 and is a measure of how far
the data-point is from the data-points in the neighboring cluster as compared to the data-points in its
own cluster. Formally, for a data-point, the coefficient equals $(b - a)/\max(a, b)$, where a is the mean
distance to the other points in its cluster and b is the mean distance to the points in the nearest
neighboring cluster. A value of 1 indicates that the data-point is far away from the neighboring
cluster, and values close to 0 indicate that the data-point is close to the boundary between the
two clusters. Negative values indicate wrong cluster assignments.
Figure 8.7 shows the silhouette plot for the clustering shown in Figure 8.5. Each of the colored
bands is formed by stacking the silhouette coefficients of all data-points in that cluster, and
therefore the thickness of a band is an indication of the cluster size. The overall silhouette
score is simply the average of the silhouette coefficients of all the data-points. As expected, the
average score is high and cluster 2 shows the highest coefficients, as it is far away from the other
two clusters.
# silhouette plot
from matplotlib import cm
from sklearn.metrics import silhouette_samples

plt.figure()
silhouette_values = silhouette_samples(score_train, cluster_label)
y_lower, y_upper = 0, 0
yticks = []
for i in range(n_cluster):
    cluster_silhouette_vals = silhouette_values[cluster_label == i]
    cluster_silhouette_vals.sort()
    y_upper += len(cluster_silhouette_vals)
    color = cm.nipy_spectral(i / n_cluster)
    plt.barh(range(y_lower, y_upper), cluster_silhouette_vals, height=1.0, edgecolor='none', color=color)
    yticks.append((y_lower + y_upper) / 2)
    y_lower += len(cluster_silhouette_vals)
Figure 8.7: Silhouette plot for metal etch data with 3 clusters determined via k-means. Red
dashed line denotes the average silhouette value.
For comparison, let’s look at a silhouette plot for a sub-optimal clustering in Figure 8.8. Lower
sample-wise coefficients and lower overall score clearly indicate worse clustering.
# generate ellipsoidal clusters
from sklearn.datasets import make_blobs
n_samples = 1500
X, y = make_blobs(n_samples=n_samples, random_state=100)
transformation = [[0.6, -0.6], [-0.4, 0.8]] # illustrative linear map to stretch the clusters
X_transformed = np.dot(X, transformation)

plt.figure()
plt.scatter(X_transformed[:,0], X_transformed[:,1])
DBSCAN, a density-based algorithm, can easily handle irregularly shaped data distributions,
as shown in Figure 8.10 (don’t be alarmed by those disconnected dark-brown colored data-points!
We will come back to these shortly). DBSCAN works by grouping together data-points
that form regions of high data density. Specifically, each data-point is classified into one of
the following 3 categories:
• A core point if there are more than a specified number (minPts) of data-points within a
specified distance ($\varepsilon$) of it
• A border point if fewer than minPts data-points lie within its $\varepsilon$ neighborhood, but the
data-point itself lies within the $\varepsilon$ neighborhood of a core point
• A noise point if it isn’t classified as either a core or a border point
After classification into core, border, and noise points, the clusters are defined by the sets of
connected core and border data-points. Noise data-points are not assigned to any cluster.
The disconnected dark-brown colored data-points that we saw in Figure 8.10 are the noise
points. The ability to deal with arbitrarily shaped data and robustness to noise and outliers make
DBSCAN well-suited for clustering process data.
The big shortcoming of the DBSCAN algorithm is the need to specify reasonable values of
the two hyperparameters, minPts and $\varepsilon$, to ensure optimal clustering. If set improperly, we
may end up with either one giant cluster or thousands of small clusters. Let’s study this aspect
on the metal etch data. Figure 8.11 shows the clustering with properly (minPts = 3, $\varepsilon$ = 5) and
improperly (minPts = 5, $\varepsilon$ = 3) set hyperparameters; the improper setting results in several
noise data-points and two extra clusters. Trial and error or domain-specific knowledge can
help to decide the hyperparameter values.
# fit DBSCAN (the improper hyperparameter setting is shown for illustration)
from sklearn.cluster import DBSCAN
db = DBSCAN(eps=3, min_samples=5).fit(score_train)
cluster_label = db.labels_

plt.figure()
plt.scatter(score_train[:, 0], score_train[:, 1], c = cluster_label, s=20, cmap='viridis')
plt.xlabel('PC1 scores')
plt.ylabel('PC2 scores')

print('Cluster labels: ', np.unique(cluster_label))
>>> Cluster labels: [-1 0 1 2 3 4] # noise points are given the label of -1
Figure 8.11: Clustering via DBSCAN for metal etch data with (a) improper hyperparameters
and (b) proper hyperparameters
While studying the PCA and PLS methodologies, we made assumptions about the gaussian
distribution of latent variables for control limit determination. However, as seen for the metal
etch data, this assumption fails for processes with multiple operating modes. Nonetheless, it
may still be appropriate to characterize data from each individual operating mode/cluster
through local gaussian distributions. This is the underlying concept behind gaussian mixture
models (GMMs) and, as can be seen in Figure 8.12, it works very well for non-hyperspherical
data distributions.
# fit GMM
from sklearn.mixture import GaussianMixture
gmm = GaussianMixture(n_components=3, random_state=100).fit(X_transformed)
cluster_label = gmm.predict(X_transformed)

plt.figure()
plt.scatter(X_transformed[:, 0], X_transformed[:, 1], c = cluster_label, s=20, cmap='viridis')
Another big advantage of GMMs is that we can compute the (posterior) probability of a data-point
belonging to any cluster. This cluster membership measure is provided by the
predict_proba method. Hard clustering is performed by assigning the data-point to the
cluster with the highest probability. Let’s compute the probabilities for a data-point that lies
between clusters 3 and 2 (encircled in Figure 8.12).
# membership probabilities
probs = gmm.predict_proba(X_transformed[1069, np.newaxis]) # predict_proba requires a 2D array
print('Posterior probabilities of clusters 1, 2, 3 for the data-point: ', probs[-1,:])
GMM thinks that the data-point belongs to cluster 3 with 100% probability! This may seem
surprising given that the point seems to lie equidistant (in terms of Euclidean distance) from
clusters 3 and 2. We will study in the next subsection how these probabilities were obtained.
Mathematical background
Let $x \in \mathbb{R}^m$ be an m-dimensional sample obtained from a multimode process with K operating
modes. In GMM, the overall probability density is formulated as a combination of local
gaussian densities. Let $C_i$ denote the ith local gaussian cluster with parameters $\theta_i = \{\mu_i, \Sigma_i\}$
(mean vector and covariance matrix) and density
$$g(x|\theta_i) = \frac{1}{(2\pi)^{m/2}|\Sigma_i|^{1/2}}\exp\left[-\frac{1}{2}(x - \mu_i)^T\Sigma_i^{-1}(x - \mu_i)\right] \qquad \text{eq. 2}$$
$$p(x|\theta) = \sum_{i=1}^{K}\omega_i g(x|\theta_i) \qquad \text{eq. 3}$$
where $\omega_i$ represents the prior probability that a new sample comes from the ith gaussian
component and $\theta = \{\theta_1, \ldots, \theta_K\}$. The GMM model is constructed by estimating the
parameters $\theta_i, \omega_i$ for all the clusters using the training samples $X \in \mathbb{R}^{N \times m}$. The parameters
are estimated by maximizing the log-likelihood of the training dataset, given by
$$\sum_{j=1}^{N}\log\left(\sum_{i=1}^{K}\omega_i g(x_j|\theta_i)\right) \qquad \text{eq. 4}$$
This maximization is accomplished iteratively via the expectation-maximization (EM) algorithm. In the
E-step at iteration s, the posterior probability that the jth sample comes from the ith gaussian
component, $P^{(s)}(C_i|x_j)$, is computed as
$$P^{(s)}(C_i|x_j) = \frac{\omega_i^{(s)}g(x_j|\theta_i^{(s)})}{\sum_{k=1}^{K}\omega_k^{(s)}g(x_j|\theta_k^{(s)})} \qquad \text{eq. 5}$$
In the M-step, the centroid, covariance, and weight of each cluster are updated using the
recomputed memberships from the E-step:
$$\mu_i^{(s+1)} = \frac{\sum_{j=1}^{N}P^{(s)}(C_i|x_j)\,x_j}{\sum_{j=1}^{N}P^{(s)}(C_i|x_j)}$$
$$\Sigma_i^{(s+1)} = \frac{\sum_{j=1}^{N}P^{(s)}(C_i|x_j)(x_j - \mu_i^{(s+1)})(x_j - \mu_i^{(s+1)})^T}{\sum_{j=1}^{N}P^{(s)}(C_i|x_j)}$$
$$\omega_i^{(s+1)} = \frac{\sum_{j=1}^{N}P^{(s)}(C_i|x_j)}{N}$$
The iterations continue until some convergence criterion on the log-likelihood objective is met.
Did you notice the conceptual similarity with the k-means algorithm for finding model
parameters? Previously, we computed posterior probabilities for data-point 1069 using the
predict_proba method. Let us now use eq. 5 to see if we get the same numbers.
import scipy.stats

x = X_transformed[1069, :]
g1 = scipy.stats.multivariate_normal(gmm.means_[0,:], gmm.covariances_[0,:]).pdf(x)
g2 = scipy.stats.multivariate_normal(gmm.means_[1,:], gmm.covariances_[1,:]).pdf(x)
g3 = scipy.stats.multivariate_normal(gmm.means_[2,:], gmm.covariances_[2,:]).pdf(x)
print('Local component densities: ', g1, g2, g3)

# posterior probabilities via eq. 5 (priors from gmm.weights_)
numerators = gmm.weights_*np.array([g1, g2, g3])
print('Posterior probabilities: ', numerators/np.sum(numerators))
There is another method, called the F-J algorithm³⁸, which can be used to find the optimal number
of GMM components³⁹ and the model parameters in an integrated manner; the number of
components does not need to be specified beforehand. Internally, the method
initializes with a large number of components and adaptively adjusts this number by
³⁸ Figueiredo & Jain, Unsupervised learning of finite mixture models, IEEE Transactions on Pattern Analysis and Machine Intelligence, 2002
³⁹ Yu & Qin, Multimode process monitoring with Bayesian inference-based finite gaussian mixture models, AIChE Journal, 2008
eliminating gaussian components with insignificant weights. The F-J method also utilizes the EM
algorithm for parameter estimation, but with a slightly different weight update mechanism in
the M-step. The reader is encouraged to see the cited references for more details. A downside
of the F-J method could be its high computational time.
plt.figure()
plt.scatter(X_transformed[:, 0], X_transformed[:, 1], c = cluster_label, s=20, cmap='viridis') # cluster_label from the F-J GMM fit (see online code)
Figure 8.14: GMM based clustering of ellipsoidal data distribution via F-J method
Figure 8.15 shows the clustering of the metal etch data via the GMM method. The BIC method
correctly identifies the optimal number of components (a sketch of the BIC computation is shown
below). Note that the F-J method results in 4 components for this dataset.
Figure 8.15: BIC plot and GMM clustering of metal etch data
Due to its probabilistic formulation, GMM is widely applied for monitoring process systems. In
this section, we will study one such application for the metal etch process. Figure 8.16 shows
the metal etch calibration and faulty batches in the PCA score space. It is apparent that the
faulty batches tend to lie away from the calibration clusters. Our objective is to develop a
GMM-based monitoring tool that can automatically detect these faulty batches.
Figure 8.16: Calibration (in blue) and faulty (in red) batches in PCA score space
⁴⁰ Xie & Shi, Dynamic multimode process modeling and monitoring using adaptive gaussian mixture models, Industrial & Engineering Chemistry Research, 2012
The global metric is then computed using the posterior probabilities of the test sample for each
gaussian component
$$D_{global}(x_t) = \sum_{k=1}^{K}P(C_k|x_t)\,D_{local}^{(k)}(x_t) \qquad \text{eq. 7}$$
with the corresponding control limit
$$D_{global,CL} = \frac{r(N^2-1)}{N(N-r)}F_{r,N-r}(\alpha) \qquad \text{eq. 8}$$
$F_{r,N-r}(\alpha)$ is the (1-$\alpha$) percentile of the F distribution with r and N-r degrees of freedom, and r is
the variable dimension (we performed GMM in the PCA score space with 3 latent variables, therefore
r = 3). A test sample is considered abnormal if $D_{global} > D_{global,CL}$. The control limit can be
computed via scipy, as sketched below.
Dglobal_train = np.zeros((score_train.shape[0],))
for i in range(score_train.shape[0]):
    x = score_train[i,:,np.newaxis]
    probs = gmm.predict_proba(x.T)[0]
    # eq. 7 with local Mahalanobis distances to each component (assumed local metric; full code online)
    Dlocals = [((x.T-gmm.means_[k:k+1]) @ np.linalg.inv(gmm.covariances_[k]) @ (x.T-gmm.means_[k:k+1]).T)[0,0] for k in range(gmm.n_components)]
    Dglobal_train[i] = np.sum(probs*np.array(Dlocals))

Figure 8.17 shows the control chart for the training data.
Figure 8.17: Global monitoring chart for metal etch calibration data
# unfold test data (n_vars, n_samples, test_dataAll are defined during calibration data processing)
unfolded_TestdataMatrix = np.empty((1, n_vars*n_samples))
for expt in range(test_dataAll.size):
    test_expt = test_dataAll[expt,0][5:90,2:]
    unfolded_TestdataMatrix = np.vstack((unfolded_TestdataMatrix, np.reshape(test_expt, (1,-1)))) # unfolding order assumed
unfolded_TestdataMatrix = unfolded_TestdataMatrix[1:,:] # remove the placeholder first row
# compute Dglobal_test (score_test: PCA scores of the scaled, unfolded test batches)
Dglobal_test = np.zeros((score_test.shape[0],))
for i in range(score_test.shape[0]):
    x = score_test[i,:,np.newaxis]
    probs = gmm.predict_proba(x.T)[0]
    Dlocals = [((x.T-gmm.means_[k:k+1]) @ np.linalg.inv(gmm.covariances_[k]) @ (x.T-gmm.means_[k:k+1]).T)[0,0] for k in range(gmm.n_components)]
    Dglobal_test[i] = np.sum(probs*np.array(Dlocals)) # eq. 7

print('Number of faults identified: ', np.sum(Dglobal_test > Dglobal_CL), ' out of ', len(Dglobal_test))
The Dglobal model here and the SVDD model (in Chapter 7) employed for the etching process dataset
used a single overall monitoring model for all the clusters. Nevertheless, it is a completely
reasonable approach to build separate monitoring models for each cluster and use one of
the two approaches from Figure 8.2 for the final predictions.
Summary
With this chapter, we have conquered another milestone in our journey. We started by
understanding the need for multimode modeling of process systems and then studied several
popular clustering/mixture modeling algorithms. The emphasis on the pros and cons of the
different methods was a deliberate choice to enable you to make an educated selection of
algorithms for your problems. You are going to encounter multimode processes frequently in
your career, and the tools studied in this chapter will help you analyze these systems properly.
Chapter 9
Decision Trees & Ensemble Learning
Imagine that you are in a situation where, even after your best attempts, your model could not
provide satisfactory performance (due to high bias or variance). What if we told you that there
exists a class of algorithms where you can combine several ‘versions’ of your ‘weak’
performing models and generate a ‘strong’ performer that provides more accurate and
robust predictions than its constituent ‘weak’ models? Sounds too good to be true?
It’s true, and these algorithms are called ensemble methods.
Ensemble methods are often a crucial component of winning entries in online ML competitions
such as those on Kaggle. Ensemble learning is based on a simple philosophy that committee
wisdom can be better than an individual’s wisdom! In this chapter, we will look into how this
works and what makes ensembles so powerful. We will study popular ensemble methods like
random forests and XGBoost.
The base constituent models in forests and XGBoost are decision trees which are simple yet
versatile ML algorithms suitable for both regression and classification tasks. Decision trees
can fit complex and nonlinear datasets yet enjoy the enviable quality of providing interpretable
results. We will look at all these features in detail. Specifically, we will cover the following
topics
• Introduction to decision trees and random forests
• Soft sensing application of random forests in concrete construction industry
• Introduction to ensemble learning techniques (bagging, Adaboost, gradient boosting)
• Effluent quality prediction using XGBoost in wastewater treatment plant
Decision trees (DTs) are inductive learning methods which derive explicit rules from data to
make predictions. They partition the feature space into several (hyper)rectangles and then fit
a simple model (usually a constant) in each one. As shown in Figure 9.1 for a binary
classification problem in a 2D feature space, the partition is achieved via a series of if-else
statements. As shown, the model is represented using branches and leaves, which leads to a
tree-like structure and hence the name decision tree model. The questions asked at each
node make it very clear how the model predictions (class A or class B) are being generated.
Consequently, DTs become the model of choice for applications where ease of rationalization
of model results is very important.
Figure 9.1: A decision tree with constant model used for binary classification in a 2D space
The trick in DT model fitting lies in deciding which questions to ask in the if-else statements
at each node of the tree. During fitting, these questions split the feature space into smaller
and smaller subregions such that the training observations falling in a subregion are similar
to each other. The splitting process stops when no further gains can be made or stopping
criteria have been met. Improper choices of splits will generate a model that does not
generalize well. In the next subsection, we will study a popular DT training algorithm called
CART (classification and regression trees), which judiciously determines the splits.
Mathematical background
The CART algorithm creates a binary tree, i.e., at each node two branches are created that split
the dataset in such a way that the overall data ‘impurity’ reduces. To understand this, consider
the following node⁴¹ of a tree. Also assume that we are dealing with a binary classification
problem with input vector $x \in \mathbb{R}^m$.
The algorithm needs to decide which one of the m input variables, xk, will be used to split the
set of n samples at this node and with what threshold r. CART makes this determination by
minimizing the following objective
$$J(k, r) = \frac{n_{left}}{n}I_{left} + \frac{n_{right}}{n}I_{right}$$
where $I_{left/right}$ denotes the data impurity of the left/right subset of data and is given by
$$I = 1 - \sum_{q}p_q^2 \qquad \text{eq. 1}$$
where $p_q$ is the fraction of samples of the corresponding data subset belonging to class q. For
example, if all the samples in a subset belong to class 1, then $p_1$ = 1 and I = 0. Therefore, if
CART could find k and r such that the n samples get perfectly divided class-wise into the left
and right subsets, then the minimum value of J = 0 will be obtained. However, this is usually
not possible, and CART tries to do the best it can. The reduction in impurity ($\Delta I_{node}$) achieved by CART at this node is given by

$$\Delta I_{node} = I_{node} - \frac{n_{left}}{n} I_{left} - \frac{n_{right}}{n} I_{right}$$
CART simply follows the aforementioned branching scheme recursively. It starts from the top
node (root node) and keeps splitting the subsets. If a node cannot be split any further (because
impurity cannot be reduced anymore or some hyperparameter settings such as
min_samples_split, min_samples_leaf prevent any further split), the node becomes a leaf
or terminal node. For prediction, the leaf node corresponding to the test sample is found and
the majority class from the leaf’s training subset is assigned to the test sample. Note that a
probabilistic prediction for class q can also be made by simply looking at the ratio of the leaf’s
training samples belonging to class q.
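To make the prediction mechanism concrete, here is a minimal sketch of fitting a CART classifier in scikit-learn and obtaining both hard and probabilistic predictions; the synthetic data and hyperparameter values are assumptions made for illustration only:

# a minimal CART illustration with synthetic 2D data (data and settings are illustrative)
import numpy as np
from sklearn.tree import DecisionTreeClassifier

np.random.seed(1)
X = np.random.uniform(-1, 1, (100, 2))     # 100 samples, 2 features
y = (X[:, 0]*X[:, 1] > 0).astype(int)      # nonlinear class boundary

dt_model = DecisionTreeClassifier(min_samples_leaf=5).fit(X, y)  # each leaf keeps >= 5 samples
print(dt_model.predict([[0.5, 0.5]]))        # majority class of the test sample's leaf
print(dt_model.predict_proba([[0.5, 0.5]]))  # class ratios in the leaf's training subset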
41. There are algorithms like ID3 that create more than 2 branches at a node.
In classical DTs, as we have seen, the split decision rules (or split
functions) at the nodes are based on a single variable. This results in
axis-aligned hyperplanes that split the input space into several
hyperrectangles. Complex split functions using multiple variables may
also be used which may be more suitable for certain datasets – one such
example is shown below
For regression problems, a tree is built in the same way with a different objective function,
J(k,r), which now is given by
$$J(k, r) = \frac{n_{left}}{n} MSE_{left} + \frac{n_{right}}{n} MSE_{right}$$

$$MSE = \frac{\sum_{sample \in subset} (\hat{y} - y_{sample})^2}{\#\ of\ samples}$$

where $\hat{y}$ is the average output of the samples belonging to a subset. Prediction for a test sample is
also taken as the average output value of all training samples assigned to the test sample’s
leaf node.
You are not confined to using constant predictive models at the leaves of a regression tree. Linear and polynomial predictive models may be more suitable for certain problems. Such DTs are called model trees. While Sklearn allows only the constant model, there exists a package42 called 'linear-tree' that allows building model trees with linear models at the leaves.
42. https://github.com/cerlymarco/linear-tree
Impurity metric
The impurity measure used in the CART objective above is called Gini impurity. Another commonly employed measure is entropy, given as follows for a dataset with 2 classes

$$I_H = -\sum_{q=1}^{2} p_q \log(p_q)$$

$I_H$ becomes 0 when $p_1 = 1$ (or $p_1 = 0$) and 1 when $p_1 = p_2 = 0.5$. Therefore, reduction of entropy leads to more data purity. In practice, both Gini impurity and entropy provide similar results.
# generate data
import numpy as np
x = np.linspace(-1, 1, 50)[:, None]
y = x*x + 0.25 + np.random.normal(0, 0.15, (50,1))
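The model-fitting code behind Figure 9.2 is not reproduced in the extracted text; a plausible sketch, assuming an unregularized tree and one regularized via min_samples_leaf (the value 10 is an assumption), is given below. The regularized model is named model so that the plot_tree call further below can reuse it.

# fit unregularized and regularized DT models (a plausible sketch; settings are assumptions)
from sklearn.tree import DecisionTreeRegressor

model_unreg = DecisionTreeRegressor().fit(x, y)                # grown to full depth; overfits the noise
model = DecisionTreeRegressor(min_samples_leaf=10).fit(x, y)   # regularized; gives a smoother fit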
Figure 9.2: Decision tree regression predictions using unregularized and regularized
models.
You can use the plot_tree function to plot the tree itself to understand how the training
dataset has been partitioned by the DT model.
# plot tree
from sklearn import tree
import matplotlib.pyplot as plt
plt.figure(figsize=(20,8))
tree.plot_tree(model, feature_names=['x'], filled=True, rounded=True)
While DTs appear to be a very flexible and useful modeling mechanism, they are seldom used as standalone models. In Figure 9.2, we saw that DTs can easily overfit and give non-smooth or piecewise constant approximations. Another disadvantage with DTs is instability, i.e., small variations in the training dataset can result in a very different tree. However, there is a reason why we invested time in learning DTs. A single tree may not be useful, but when you combine multiple trees, you get amazing results. We will learn how this is made possible in the next section.
Figure 9.3: A random forest prediction is a combination of the predictions from multiple decision trees. [For a classification problem, Sklearn returns the class corresponding to the highest average class probability]
In a random forest, the trees are grown to full extent, and therefore, hyperparameter selection for the trees is not a concern. This makes RF training and execution simple and quick. In fact, RFs have a very small number of tunable hyperparameters. RFs also lend themselves to the computation of variable importances. All these qualities have led to the popularity of random forests.
Mathematical background
For RF training, different trees need to be generated that are as 'distinct from each other' as possible while still providing good descriptions of the training dataset. This variety among the trees is achieved via the following two means:
1) Using different training datasets for each tree: If each tree is trained on the same dataset, they will end up being identical, make the same kind of errors, and therefore combining them will offer little benefit. Since we only have a single training dataset to train the RF, bootstrapping is employed. If the original dataset has N samples, bootstrapping allows creation of multiple datasets, each with Nb (≤ N) samples, such that each new dataset is also a good representative of the underlying process that generated the original dataset. Each bootstrap dataset is generated by randomly selecting Nb samples with replacement from the original dataset. In RF, Nb = N and the bootstrapping scheme is illustrated below for N = 10.
Figure 9.4: Creation of separate DT models using bootstrap samples. Si denotes the ith training sample.
2) Using random subsets of input variables to find the optimal split: A very non-intuitive, somewhat surprising, but incredibly effective aspect of RF training is that not all the input variables are considered for determining the optimal split function at any node of any tree in the forest. Instead, a random subset of variables is chosen and then the node impurity is minimized with these chosen variables. This random selection is performed at every node. If the input vector $\boldsymbol{x} \in R^m$, then the number of random split variables (M) is recommended to be $\lfloor \sqrt{m} \rfloor$, i.e., the floor of the square root of m.
The above two tricks during training result in trees that are minimally correlated to each other. The figure below summarizes the RF model fitting procedure. For illustration, it is assumed that $\boldsymbol{x} \in R^9$ and M = 3.
You can see that there are only two main hyperparameters to decide: the number of trees and the size of the variable subset. Another advantage with RF is that the constituent trees can be trained in parallel.
Figure 9.6: (left) Random forest regression predictions, (middle) predictions from a couple of constituent decision trees, (right) impact of the number of trees in the RF model on validation error
Figure 9.6 also shows the validation error for different numbers of trees in the forest. What we see here is a general characteristic of RFs: as the number of trees increases, the validation error plateaus out. Therefore, if computation time is not a concern, it is preferable to use as many trees as needed until the error levels out. As far as M is concerned, a lower value leads to greater reduction in variance but increases the model bias. Do note that RF primarily brings down the variance (compared to using single DT models) and not the bias. Therefore, if a DT itself is underfitting, the RF won't be able to provide high accuracy (or low bias). This is why full-grown trees are used in RF. You will find out more about the reason behind this later in this chapter.
We will keep 33% of the data aside for testing, with the number of trees set to a large value (200) and M to 3.

# read data
import numpy as np
data = np.loadtxt('cement_strength.txt', delimiter=',', skiprows=1)
X = data[:,0:-1]
y = data[:,-1]

# split data (the split step is implied by the text above; random_state is an assumption)
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1)

# fit RF model
from sklearn.ensemble import RandomForestRegressor
model = RandomForestRegressor(n_estimators=200, max_features=3, oob_score=True, random_state=1).fit(X_train, y_train)
Figure 9.7: RF and PLS predictions for the concrete compressive strength dataset. The red line denotes the ideal y_prediction = y_measured reference
Two points are noteworthy here. First, we didn't scale the variables before fitting. It's not a mistake! RFs (and DTs) are among the rare ML methods whose solutions are invariant to data scaling. Second, the superior performance of RF over PLS demonstrates the nonlinear modeling capabilities of RF.
OOB accuracy
The oob_score for the concrete dataset is the same (up to 3 decimal places) as the test score! An OOB score may not be a perfect generalization assessment, but it comes in handy for small datasets where keeping aside a test dataset is a luxury one cannot afford.
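For reference, the two scores can be compared directly; a small sketch (R² is sklearn's default score for regressors, and oob_score_ is available because oob_score=True was set above):

# compare out-of-bag score with test score (both are R^2 values for a regressor)
print('OOB R2:', model.oob_score_)              # computed from out-of-bag samples
print('Test R2:', model.score(X_test, y_test))  # computed from the held-out test set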
Feature importances
Let's now look at another handy feature of RF that allows us to rank the model's input/feature variables in terms of their importance for output predictions. Sklearn computes feature importances as the average reduction in node impurity (or MSE for regression) brought about by a feature when it is used for node splitting. First, the importance of a feature j in a tree is calculated as follows
$$FI_j^{tree} = \frac{\sum_{nodes\ splitting\ on\ feature\ j} \Delta I_{node}^{tree}}{\sum_{nodes} \Delta I_{node}^{tree}}$$

The average of $FI_j^{tree}$ over all the trees in the forest gives the desired feature importance

$$FI_j = \frac{\sum_{trees} FI_j^{tree}}{number\ of\ trees}$$
43. For large N, it can be shown that any bootstrap dataset in RF theoretically contains approximately 63% of the samples from the original dataset.
# feature importances
import matplotlib.pyplot as plt
var_names = ['cement','slag','flyash','water','superplasticizer','coarseaggregate','fineaggregate','age']
importances = model.feature_importances_
plt.barh(var_names, importances)
plt.xlabel('Feature importance')
The above results show that cement content and age of concrete are the two most important variables for prediction of concrete strength. This result is along expected lines and further highlights the usefulness of RFs for analyzing nonlinear dependencies in process data.
This concludes our study of RFs. Hopefully, we have been able to convince you that RFs can be quite a powerful weapon in your ML arsenal. Let's proceed to learn about ensemble learning to understand what it is that imparts power to RFs.
The idea of combining multiple 'not so good'/weak models to generate a strong model is not restricted to random forests. The idea is more generic and is called ensemble learning. Specific methods that implement the idea (like RFs) are called ensemble methods. Figure 9.8 below shows an ensemble modeling scheme employing a diverse set of base models. The model shown is a heterogeneous ensemble model as the individual models employ different learning methodologies. In contrast, in homogeneous ensemble models, the base models use the same learning algorithm.
Figure 9.8: Heterogeneous ensemble learning scheme44. RF, which itself is an ensemble method, is used as a base model here.
There are several options at our disposal to aggregate the predictions from the base models.
For classification, we can use majority voting and pick the class that is predicted by most of
the base models. Alternatively, if base models return class probabilities then soft voting can
also be used. For regression, we can use simple or weighted averaging.
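Scikit-learn's voting estimators implement these aggregation options directly; a minimal sketch follows (the particular base models, and X and y, are arbitrary assumptions for illustration):

# heterogeneous ensemble via voting (base models chosen arbitrarily for illustration)
from sklearn.ensemble import VotingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

ensemble = VotingClassifier(
    estimators=[('lr', LogisticRegression()),
                ('rf', RandomForestClassifier()),
                ('svc', SVC(probability=True))],  # probability=True enables soft voting for SVC
    voting='soft')  # 'hard' -> majority voting; 'soft' -> average the class probabilities
ensemble.fit(X, y)  # X, y: any classification training dataset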
In the ensemble world, a weak model is any model that has poor performance due to either
high bias or high variance. However, together, the weak models combine to achieve better,
more accurate (lower bias) and/or robust (lower variance) predictions. Can we then combine
any set of poor performing models and obtain a super accurate ensemble model? Both yes
and no! There are certain criteria that must be met to be able to reap the benefits of ensemble
learning. First, the base models must individually perform better than random guessing.
Second, the base models must be diverse (i.e., the errors made on unseen data must be
uncorrelated). Consider the following simple illustration to understand these requirements.
44. Another popular heterogeneous ensemble technique is stacking, where base models' predictions serve as inputs to another meta-model which is trained separately.
In the above illustration, the ensemble model has an accuracy of 97.1% although each base model is only 60% accurate. While it is relatively easy to fulfill the first criterion of building base models with > 50% accuracy, it is tough to obtain diversification/independence among the base models. In Figure 9.8, if all the base models make identical mistakes all the time, combining them will offer no benefit. We already saw some diversification mechanisms in RF. There are other ways as well, and the schematic in Figure 9.9 below provides an overview of some popular ensemble diversification techniques; the ensemble-accuracy arithmetic quoted above is also verified in the short computation that follows.
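The accuracy arithmetic is easy to check with the binomial distribution. A minimal sketch, assuming independent base-model errors and 85 base classifiers (the exact count in the book's illustration is not recoverable; 85 is an assumption chosen to roughly reproduce the quoted number):

# ensemble accuracy under majority voting, assuming independent base-model errors
# (K = 85 is an assumption; each base model is 60% accurate)
from scipy.stats import binom

K, p = 85, 0.6
ensemble_accuracy = binom.sf(K//2, K, p)  # P(more than half of the K models are correct)
print(ensemble_accuracy)                  # ~0.97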
Figure 9.9: An overview of popular ensemble modeling techniques with homogeneous base
models
Bagging
In the bagging (bootstrap aggregating) technique, the diverse set of base models is generated by employing the same modeling algorithm with different bootstrap samples of the original training dataset. Unlike RF, the input variables are kept the same in each base model and the size of the bootstrap dataset is usually kept less than the original number of training samples. Moreover, the base model is not restricted to be a DT model. As the figure below shows, the base models are trained independently (allowing parallel training) and the ensemble prediction is obtained by combining the base models' predictions.
Figure 9.10: Bagging ensemble technique. Note that if you have multiple independently
sampled original training datasets, then bootstrapping is not needed.
The defining characteristic of bagging is that it helps in reducing variance but not bias. It can be shown that if $\bar{\rho}$ is the average correlation among base model predictions, then the variance of $y_{ensemble}$ equals

$$\bar{\sigma}^2 \left( \bar{\rho} + \frac{1 - \bar{\rho}}{K} \right)$$

where $\bar{\sigma}^2$ is the variance of an individual base model's predictions and K is the number of base models. Therefore, $\bar{\rho} = 1$ implies no reduction in variance.
Bagging can be used for both regression and classification. Sklearn provides BaggingClassifier and BaggingRegressor for these, respectively. A simple illustration below shows how bagging can help achieve smoother results (classification boundaries in this case).

# fit bagging model (Sklearn uses decision trees as base models by default)
from sklearn.ensemble import BaggingClassifier
bagging_model = BaggingClassifier(n_estimators=500, max_samples=50, random_state=100).fit(X, y)
# K=500 and each DT is trained on 50 training samples randomly drawn with replacement
Figure 9.11: Classification boundaries obtained using (left) fully grown decision tree and
(right) bagging with fully grown decision trees
Boosting
In the boosting ensemble technique, the base models are again obtained using the same learning algorithm but are fitted sequentially rather than independently from each other. During training, each base model tries to 'correct' the errors made by the boosted model at the previous step. At the end of the process, we obtain an ensemble model that shows lower bias45 than the base models.
45. Variance may or may not decrease. Boosting has been seen to cause overfitting sometimes.
Boosting is preferred if the base models exhibit underfitting/high bias. Like bagging, boosting is not restricted to DTs as base models, but if DTs are used, shallow trees (trees with only a few levels) that don't exhibit high variance are recommended. Moreover, shallow trees, or any other less complex base models, are computationally tractable to train sequentially during the boosting process. Boosting can be used for both regression and classification, and there are primarily two popular boosting methods, namely, Adaboost and gradient boosting.
Gradient Boosting
Adopting an approach different from Adaboost, gradient boosting corrects the errors of its predecessor by sequentially training each base model on the residual errors made by the previous base model. Figure 9.13 shows the algorithm's scheme.
Figure 9.13: Gradient boosting ensemble scheme. Here, $\hat{y}_i$ denotes the predictions for the training samples from the i-th base model. Note that the hyperparameter ϑ need not be constant (as used here) across the different base models.
In the scheme above, the hyperparameter ϑ ∈ (0, 1] is called the shrinkage parameter or learning rate. It is used to prevent overfitting and is recommended to be assigned a small value like 0.1. A very small learning rate will necessitate the usage of a large number of base models. The number of iterations or base models, K, is another important hyperparameter that needs to be tuned carefully. Too small a K will not achieve sufficient bias reduction while too large a K will cause overfitting. K can be optimized using cross-validation or early stopping (keeping track of the validation error at different stages/iterations of ensemble model training and stopping when the error stops decreasing or the model residuals no longer have any pattern that can be modeled).
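To make the scheme of Figure 9.13 concrete, here is a bare-bones gradient boosting loop for regression with squared-error loss. This is a sketch, not the book's code; shallow trees and a learning rate of 0.1 are assumed, and the helper function names are made up:

# bare-bones gradient boosting for regression with squared-error loss (a sketch)
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gradient_boost_fit(X, y, K=100, learning_rate=0.1, max_depth=3):
    y_pred = np.full(len(y), y.mean())   # initial prediction: mean of targets
    base_models = []
    for _ in range(K):
        residuals = y - y_pred           # errors of the current boosted model
        tree = DecisionTreeRegressor(max_depth=max_depth).fit(X, residuals)
        y_pred = y_pred + learning_rate*tree.predict(X)   # shrunken correction
        base_models.append(tree)
    return y.mean(), base_models

def gradient_boost_predict(X, y_mean, base_models, learning_rate=0.1):
    y_pred = np.full(X.shape[0], y_mean)
    for tree in base_models:
        y_pred += learning_rate*tree.predict(X)
    return y_pred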
If you have been active in the ML world in recent times, then you must have heard the
term XGBoost. It stands for eXtreme Gradient Boosting and is a popular library that
implements several tricks and heuristics to make gradient boosting-based model training very
effective, especially for large datasets. XGBoost uses DTs as base models.
Let's see an application of XGBoost for a soft-sensing problem. The dataset comes from an activated sludge process which treats wastewater from domestic and industrial sewers. Data for 38 variables over each of the 527 days of operation are provided. 7 of the 38 variables characterize the process output or effluent quality of the plant. In this case study, we will create a predictive model for one of these 7 output variables, the conductivity (COND-S) of the output stream. The first 22 measurements, taken at different stages of the process, are used as model inputs (a detailed process description is in the Appendix). Our XGBoost model will act as a soft sensor to predict the output.
# read data
import pandas as pd
data_raw = pd.read_csv('water-treatment.data', header=None, na_values="?" )
X_raw = data_raw.iloc[:,1:23]
y_raw = data_raw.iloc[:,29]
data = pd.concat([X_raw, y_raw], axis=1)
We notice that there are missing values in different columns of the dataset. We will adopt the naïve approach of removing rows with any missing data, as sketched below.
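That removal step can be a one-liner (a minimal sketch, assuming pandas' dropna suffices here):

# remove rows with any missing entry (naive approach)
data = data.dropna(axis=0)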
The figure below shows the data for a couple of inputs and the output. There seems to be adequate variability in the dataset. For model training and assessment, we will split the dataset into fitting, validation, and test datasets.
Because shallower trees are preferred with boosting, the max_depth parameter was used to restrict the tree depth to 3. If desired, this hyperparameter can be tuned via grid search. The early_stopping_rounds parameter is used to instruct XGBoost to stop iterating (or adding more base models) if the validation accuracy does not improve over 2 consecutive iterations. During model fitting, the validation accuracy at each iteration is reported as shown below. Here, rmse is used as the evaluation metric; any other metric can be specified through the eval_metric parameter of the fit method.
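The fitting code itself is not reproduced in the extracted text; a plausible sketch along the lines described is given below. X_fit/y_fit and X_val/y_val denote the fitting and validation splits mentioned above, and note that in recent XGBoost versions early_stopping_rounds has moved from fit to the constructor:

# fit XGBoost model (a plausible sketch; split variable names are assumptions)
from xgboost import XGBRegressor

model = XGBRegressor(max_depth=3)      # shallow trees, as recommended for boosting
model.fit(X_fit, y_fit,
          eval_set=[(X_val, y_val)],   # validation data used for early stopping
          early_stopping_rounds=2)     # stop if validation rmse doesn't improve for 2 rounds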
XGBoost has several other settings that can be used to tune the model. You can check them out in the official documentation46. Figure 9.14 shows the XGBoost regressor model predictions. The model accuracies indicate good performance without much overfitting.
This chapter was a whirlwind tour of ensemble learning, which is an important pillar of modern machine learning. Each topic introduced in the chapter has several more advanced aspects that we could not cover here. However, you now have the requisite fundamentals and familiarity in place to explore these advanced aspects.
Summary
In this chapter, we learnt the decision tree modeling methodology. We saw how random forests can help overcome the high variance issues of trees. Then, we studied the broader concept of ensemble learning, which can be used to overcome the high bias and/or high variance issues of weak models. Two popular ensemble methods, viz., bagging and boosting, were conceptually introduced. Finally, we used XGBoost (a popular implementation of gradient boosted trees) for prediction of effluent quality in a wastewater treatment plant. This chapter has added several powerful techniques to your ML arsenal, and you will find yourself using them quite often.
46. https://xgboost.readthedocs.io/en/stable/python/python_api.html#module-xgboost.sklearn
Chapter 10
Other Useful Classical ML Techniques
In this last chapter on classical ML methods, we will study two popular non-parametric techniques, namely, KDE and kNN. These methods are conceptually and mathematically simple but have proven to be very useful in practice. While KDE allows us to compute probability density functions from process data, kNN facilitates characterizing a new data sample using its (geometrically) neighboring samples. You might recollect that KDE was often mentioned in the earlier chapters as the technique of choice to estimate the thresholds of monitoring metrics for complex process systems; we will see how to do that in this chapter. In multivariate settings, KDE is also employed for outlier detection. The most common usage of kNN is classification, but several other interesting adaptations of traditional kNN have also been reported, and we will study one such application of kNN for fault detection.
Most of the illustrations in this book till now have employed a single ML technique to solve a given problem. However, it is not necessary to use different ML algorithms in silos. Combining different ML methods (such as PLS with PCA, or ANN with PLS) can sometimes provide better performance. Before we end this part of the book on classical ML techniques, we will look at several examples involving smart combinations of ML techniques.
To see one of the utilities of KDE, recollect the two ways we have employed so far to estimate the threshold of a process monitoring metric. We computed the percentile of either the samples (for ICA) or an assumed distribution (such as an F distribution for the PCA T² metric). While the first approach is inappropriate when the number of samples is low, the second approach is impractical for complex systems (or systems where process variables do not follow a Gaussian distribution) where the metric distribution is not known beforehand. In these scenarios, KDE can be employed for control limit estimation.
One can use histograms as well for estimating probability densities. However, as seen in Figure 10.1, non-smooth estimates are obtained, i.e., the densities of neighboring points show sharp jumps. These sharp jumps are not due to the underlying true distribution but are just an artifact of the choice of bin locations. The density estimates from a histogram strongly depend on the choice of the number of bins and the bin locations. These problems become more serious for high-dimensional datasets. Therefore, KDE is preferred over histograms for density estimation.
Mathematical background
In Gaussian mixture modeling, the overall density was approximated as a sum of local Gaussians centered at different clusters. KDE takes this idea to its extreme and puts a local density function (also called a kernel) at each data point, as shown below for a 1D case where the Gaussian kernel, a popular choice for the kernel K, is used, and N is the number of training samples. The parameter h in the above illustration is called the bandwidth or smoothing parameter and determines the width of the local bumps. The overall density at any point, $p_{KDE}(x)$, is obtained by simply summing up the local bumps as shown below.
The kernels are chosen such that $\int_{-\infty}^{\infty} K(x)\,dx$, the total area under the kernel (and hence under the KDE curve), is 1. For the multidimensional case with D dimensions, $p_{KDE}(\boldsymbol{x})$ is given by

$$p_{KDE}(\boldsymbol{x}) = \frac{1}{Nh^D} \sum_{k=1}^{N} K\left(\frac{\boldsymbol{x} - \boldsymbol{x}_k}{h}\right)$$

The above multivariate KDE uses the same bandwidth for all axes; different bandwidths are used to deal with non-uniform spread along different axes. We will deal primarily with univariate problems in this chapter.
The bandwidth, however, is a crucial parameter and should be chosen carefully. Too small a bandwidth leads to spikes in the density estimate due to overfitting; the absence or presence of a single data point can significantly impact the density estimate, resulting in high variance. Too large a bandwidth over-smooths the density curve, potentially masking critical structure in the data; the discrepancy between the true and estimated densities becomes large, resulting in high bias. Figure 10.3 illustrates this bias-variance trade-off.
A popular empirical rule of thumb for choosing the bandwidth is

$$h = 1.06\,\sigma N^{-1/5}$$

where σ is the sample standard deviation. A more robust modified version is given by

$$h = 0.9\,A N^{-1/5}; \quad A = \min\left(\sigma, \frac{IQR}{1.34}\right)$$
where IQR is the interquartile range. These empirical approaches, however, may not always give good results. A more systematic approach is to use cross-validation to choose the bandwidth value that maximizes the probability or likelihood of validation data. For this purpose, GridSearchCV can be used, which automatically returns the optimal bandwidth among the supplied bandwidth values. GridSearchCV uses the score method of the KernelDensity estimator, which returns the log-likelihood (logarithm of density) at the validation data points.
To apply KDE, let us revisit our ICA example from Chapter 6. We will determine the monitoring control limits for the TEP process using KDE. Let us first see how to fit an optimal KDE curve to the monitoring metric values. For illustration, we will use the $I_e^2$ values from the training samples. Note that we use the empirical expression for h to decide the list of bandwidth values for the grid search.

N = len(ICA_statistics_train[:,1])
empirical_h = 1.06*np.std(ICA_statistics_train[:,1])*N**(-1/5)  # note the negative exponent, per the formula above
h_grid = np.linspace(0.1, 5, 50)*empirical_h  # 50 candidate values (bandwidth must be > 0)
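The grid search itself can then be run as follows (a sketch consistent with the description above; the 5-fold cross-validation setting is an assumption):

# find the optimal bandwidth via grid search; KernelDensity's score method returns the
# total log-likelihood of the held-out data
from sklearn.neighbors import KernelDensity
from sklearn.model_selection import GridSearchCV

grid = GridSearchCV(KernelDensity(kernel='gaussian'), {'bandwidth': h_grid}, cv=5)
grid.fit(ICA_statistics_train[:,1][:, None])  # KernelDensity expects a 2D array
kde_optimal = grid.best_estimator_            # KDE fitted with the best bandwidth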
Figure 10.4: Optimal KDE curve fitted to the $I_e^2$ monitoring metric
We can see that GridSearchCV returns an optimal KDE that correctly reflects the distribution
of the underlying data without overfitting the spurious details.
Let's say we want to find the 99% control limit for our monitoring metric. With an optimal KDE curve on hand, the required control limit would be the metric value such that the area under the KDE curve to the right of this value is 0.01, as illustrated in Figure 10.5. By definition, there would be only a 1% probability that the metric value for a test sample (which comes from the same distribution from which the training samples were obtained) is higher than the 99% control limit. In process monitoring parlance, this implies that there is a 1% probability that a fault alarm is a false alarm.
Figure 10.5: 100(1-𝜶) percentile or control limit of a monitoring metric. 𝜶 = 0.01 for a 99%
control limit or 99th percentile.
Numerical integration is needed to find the area under the KDE curve between any two metric values. Let us define a function that takes in metric values and returns the control limit corresponding to a specified percentile. As you can see, we divide the metric axis into a grid (metric_grid) with 100 intervals and compute the area under the KDE curve to the left of each of these grid points. The grid point with the required minimum area (as specified by the percentile parameter) becomes our control limit.

# function to compute a percentile-based control limit via KDE
# (only fragments of the original function survived extraction; the body below is a minimal
# reconstruction consistent with the description above; the bandwidth choice is an assumption)
def get_metric_CL(metric_values, percentile=99):
    """
    parameters
    -----------
    metric_values: numpy array of shape = [n_samples,]
    """
    h = 1.06*np.std(metric_values)*len(metric_values)**(-1/5)  # empirical bandwidth
    kde = KernelDensity(bandwidth=h).fit(metric_values[:, None])
    metric_grid = np.linspace(metric_values.min(), metric_values.max(), 100)
    densities = np.exp(kde.score_samples(metric_grid[:, None]))  # score_samples returns log-density
    areas = np.cumsum(densities)*(metric_grid[1] - metric_grid[0])  # area to the left of each grid point
    metric_CL = metric_grid[np.argmax(areas >= percentile/100)]  # first grid point meeting the area requirement
    return metric_CL
We can now utilize this function to compute the 99% control limits for our ICA monitoring
metrics. We can see that the control limits are similar to what we had obtained earlier in
Chapter 6.
Figure 10.6: ICA monitoring charts for training data with control limits determined via KDE
Data density estimation via KDE has other uses as well. Samples in low-density regions can be segregated from those in high-density regions and discarded as outliers. You can use KDE for the scenario shown in Figure 4.15 for your data pre-processing. Additionally, a process monitoring tool can also be built using KDE by employing the logic that a test sample falling in a low-density region (as estimated from training data) would be an abnormal sample. For all these applications, the procedure for building KDE models remains the same.
The k-nearest neighbors (kNN) technique is a versatile method based on the simple, intuitive idea that the label/value for a new sample can be obtained from the labels/values of the closest neighboring samples (in the feature space) from the training dataset. The parameter k denotes the number of neighboring samples utilized by the algorithm. As shown in Figure 10.7, kNN can be used for both classification and regression. For classification, kNN assigns the test sample to the class that appears most often amongst the k neighbors. For regression, the predicted output is the average of the values of the k neighbors. Due to its simplicity, kNN is widely used for pattern classification and was included in the list of top 10 algorithms in data mining.47
Figure 10.7: kNN illustration for classification (left) and regression (right). The yellow data point denotes the unknown test sample. The grey-shaded region represents the neighborhood with the 3 nearest samples.
kNN belongs to the class of lazy learners, where models are not built explicitly until test samples are received. At the other end of the spectrum, eager learners (like SVM, decision trees, ANN) 'learn' explicit models from training samples. Unsurprisingly, training is slower, and testing is faster, for eager learners. kNN requires computing the distance of the test sample from all the training samples; therefore, kNN also falls under the classification of instance-based learning. Instance-based learners make predictions by comparing the test sample with training instances stored in memory. On the other hand, model-based learners do not need to store the training instances for making predictions.
47. Wu et al., Top 10 algorithms in data mining, Knowledge and Information Systems, 2008.
Conceptual background
Apart from an integer k and input-output training pairs, the kNN algorithm needs a distance metric to quantify the closeness of a test sample to the training samples. The standard Euclidean metric is commonly employed. Once the nearest neighbors have been determined, two approaches, namely uniform and distance-based weighting, can be employed to decide the weight assigned to each neighbor, which determines the neighbor's contribution to the prediction. In uniform weighting, all k neighbors are treated equally while, in distance-based weighting, each of the k neighbors is weighted by the inverse of its distance from the test sample so that closer neighbors have greater contributions. The figure below illustrates the difference between the two weighting schemes for a classification problem.
Figure 10.8: Illustration of the impact of the weighting scheme on kNN output. Dashed circles are for distance reference.
In Figure 10.8, with uniform weighting (also called majority voting for classification problems), the test sample is assigned to class 1 for k = 1 or 3 and class 2 for k = 6. For k = 8, no decision can be made. With distance weighting, the test sample is always classified as class 1 for k = 1, 3, 6, or 8. This illustration shows that distance weighting can help reduce the dependence of the prediction on the choice of k.
For predictions, kNN needs to compute the distance of test samples from all the training samples. For large training sets, this computation can become expensive. However, specialized techniques, such as KDTree and BallTree, have been developed to speed up the extraction of neighboring points without impacting prediction accuracy. These techniques utilize the structure in the data to avoid computing distances from all training samples. The NearestNeighbors implementation in scikit-learn automatically selects the algorithm best suited to the problem at hand. The KNeighborsRegressor and KNeighborsClassifier modules are provided by scikit-learn for regression and classification, respectively.
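A minimal usage sketch follows (the training/test arrays are assumptions; the weights parameter selects between the two weighting schemes described above):

# kNN classification with distance-based weighting (illustrative sketch)
from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier(n_neighbors=5, weights='distance')  # weights='uniform' gives majority voting
knn.fit(X_train, y_train)   # X_train, y_train: any (properly scaled) classification dataset
y_pred = knn.predict(X_test)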
A couple of things to pay careful attention to in kNN are variable selection and variable scaling. Variables that are not important for output predictions should be removed; otherwise, unimportant variables will undesirably impact the determination of the nearest neighbors. Further, the selected variables should be properly scaled to ensure that variables with large magnitudes do not dwarf the contributions of other variables during distance computations.
A few other notable applications of kNN for process systems include the work of Facco et al.
on automatic maintenance of soft sensors50, Borghesan et al. on forecasting of process
disturbances51, Cecilio et al. on detecting transient process disturbances52, and Zhou et al. on
fault identification in industrial processes53. These applications may not utilize the kNN method
directly for classification or regression, but use the underlying concept of similarity of nearest
neighbors.
48. Dong Wang, K-nearest neighbors based methods for identification of different gear crack levels under different motor speeds and loads: Revisited, Mechanical Systems and Signal Processing, 2016.
49. He and Wang, Fault detection using k-nearest neighbor rule for semiconductor manufacturing processes, IEEE Transactions on Semiconductor Manufacturing, 2007.
50. Facco et al., Nearest-neighbor method for the automatic maintenance of multivariate statistical soft sensors in batch processing, Industrial & Engineering Chemistry Research, 2010.
51. Borghesan et al., Forecasting of process disturbances using k-nearest neighbors, with an application in process control, Computers and Chemical Engineering, 2019.
52. Cecilio et al., Nearest neighbors methods for detecting transient disturbances in process and electromechanical systems, Journal of Process Control, 2014.
53. Zhou et al., Fault identification using fast k-nearest neighbor reconstruction, Processes, 2019.
Fault detection by kNN49 (FD-kNN) is based on the simple idea that the distance of a faulty test sample from its nearest training samples (obtained under normal operating plant conditions) must be greater than a normal sample's distance from its neighboring training samples. Incorporating this idea into the process monitoring framework, a monitoring metric (termed the kNN squared distance) is defined for each training sample as follows

$$D_i^2 = \sum_{j=1}^{k} d_{ij}^2$$

where $d_{ij}^2$ is the squared distance of the ith sample from its jth nearest neighbor. After computing the kNN squared distances for all the training samples, a threshold corresponding to the desired confidence limit can be computed. A test sample is considered faulty if its kNN squared distance is greater than the threshold.
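A sketch of the training-phase computation is given below; score_train (the training samples in the monitored space) is an assumed variable name chosen to mirror the score_test used in the test-data code further below. Note the extra neighbor requested so that each training sample's trivial zero-distance to itself can be excluded:

# compute kNN squared distances for the training samples (a sketch)
import numpy as np
from sklearn.neighbors import NearestNeighbors

k = 5
nbrs = NearestNeighbors(n_neighbors=k+1).fit(score_train)  # k+1: each training sample is its own nearest neighbor
d_train, _ = nbrs.kneighbors(score_train)
D2_train = np.sum(d_train[:, 1:]**2, axis=1)  # skip the self-neighbor in column 0
D2_log_train = np.log(D2_train)               # log-transform before fitting KDE (see below)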
Figure 10.9 shows the output from fitting a KDE to the distribution of the kNN squared distances and the resulting monitoring chart for the training samples. Note that we log-transformed the kNN squared distances because applying KDE to D² directly resulted in non-zero densities for D² below 0. Figure 10.10 shows the monitoring chart for the faulty test samples; 15 out of 20 samples are correctly flagged as faulty while 5 samples are misdiagnosed as normal.

# D2_log_test
d2_nbrs_test, indices = nbrs.kneighbors(score_test)  # distances of test samples to the nearest training samples
d2_nbrs_test = d2_nbrs_test[:,0:5]  # we want only the 5 nearest neighbors
d2_sqrd_nbrs_test = d2_nbrs_test**2
D2_test = np.sum(d2_sqrd_nbrs_test, axis=1)
D2_log_test = np.log(D2_test)
The secret behind the success of external analysis is that the nonlinear or dynamic characteristics of process data can be handled by specialized modeling tools, and the output residuals (the difference between output measurements and predictions) can then be handled efficiently by PCA.
Another noteworthy scheme is the integration of artificial neural networks (ANN) and PLS56. The powerful dimension reduction capabilities of PLS are combined with the nonlinear modeling capabilities of ANN to obtain a high-fidelity soft-sensor model.
54. Kano et al., Evolution of multivariate statistical process control: application of independent component analysis and external analysis, Computers & Chemical Engineering, 2004.
55. Ge et al., Robust online monitoring for multimode processes based on nonlinear external analysis, Industrial & Engineering Chemistry Research, 2008.
56. Qin & McAvoy, Nonlinear PLS modeling using neural networks, Computers & Chemical Engineering, 1992.
Jemwa & Aldrich57 have proposed a scheme that combines SVM and decision trees. This scheme involves judicious use of support vectors to reduce the number of samples that go into building the decision tree model. The final result comprises optimal operating strategies of the following type:

If 0.6 ≤ reactant mole fraction ≤ 0.9, then keep reactant temperature within 270-290 K
These were just a few examples to illustrate how very powerful models can be built by a careful and conceptually sound combination of different ML techniques. Note that these combination strategies are more ad hoc and situation-dependent compared to the more systematic ensemble learning techniques. The key takeaway is that, as long as you are aware of the underlying mechanisms and the combination makes theoretical sense, it does not hurt to try out various combinations if your problem cannot be satisfactorily handled by a standalone technique.
Summary
With this chapter, we have come to the end of our journey into the world of classical ML methods. We hope that you already feel empowered and that your interest in machine learning has increased. The methods you have learnt so far should enable you to tackle the majority of the problems you will face as a process data scientist. With the classical techniques conquered, let us now get ready to step into the amazing world of artificial neural networks.
57. Jemwa & Aldrich, Improving process operations using support vector machines and decision trees, AIChE Journal, 2005.
Part 3
Artificial Neural Networks & Deep Learning
Chapter 11
Feedforward Neural Networks
It won't be an exaggeration to say that artificial neural networks (ANNs) have (re)caught the fascination of data scientists and are currently the hottest topic among ML practitioners. Several recent technical breakthroughs and computational advancements have enabled (deep) ANNs to provide remarkable results for a wide range of problems. PSE is no exception to this trend and is increasingly witnessing more applications in areas like surrogate modeling, predictive equipment maintenance, etc.
There is no doubt that ANNs have proven to be very powerful. However, with great power
comes great responsibility! If you are not careful with ANN application, you will end up with
disappointing results. To train an ANN model efficiently for a complex system, several
hyperparameters need to be set carefully, which largely remains a trial-and-error exercise.
Nonetheless, several techniques have been devised to overcome the challenges posed by
ANN model training.
In this chapter, we will demystify the world of ANNs and gain insights into how ANNs work.
We will learn the underlying concepts, the tips & tricks of the trade, common problems
encountered, and the techniques developed to handle these problems. Specifically, the
following topics are covered:
• Introduction to ANNs
• Exploration of nonlinearity in ANNs
• Hyperparameter optimization for ANNs
• Tips & tricks for improved ANN model training
• Nonlinear soft sensing via ANN
• General guidelines for ANN modeling for PSE
Artificial neural networks (ANNs) are nonlinear empirical models which can capture complex relationships between input-output variables via supervised learning or recognize data patterns via unsupervised learning. Architecturally, ANNs were inspired by the human brain and are a complex network of interconnected neurons, as shown in Figure 11.1. An ANN consists of an input layer, a series of hidden layers, and an output layer. The basic unit of the network, the neuron, accepts a vector of inputs from the source input layer or the previous layer of the network, takes a weighted sum of the inputs, and then performs a nonlinear transformation to produce a single real-valued output. Each hidden layer can contain any number of neurons. In fact, an ANN with just 1 hidden layer has the universal approximation property, i.e., it can approximate any continuous function to any given accuracy.
Figure 11.1: Architecture of a single neuron and feedforward neural network with 2 hidden
layers for a MISO (multiple input, single output) system
There are different types of ANN architectures (and ML researchers keep coming up with new ones!), but FFNNs, RNNs, and CNNs are the most common. We will study RNNs in the next chapter; in this chapter, we will focus on FFNNs. CNNs are primarily employed for image processing and are not covered in this book.
Deep learning
The recent popularity of ANNs can be attributed to the successes with deep learning. Deep learning simply refers to machine learning with deep neural networks (DNNs), which are ANNs with two or more hidden layers (see Figure 11.2). While shallow networks with a large number of neurons can theoretically model any complex function, deep networks need far fewer model parameters and hence are theoretically faster to train. Moreover, DNNs enable bypassing manual feature engineering; the first few hidden layers implicitly generate features that are utilized by the downstream layers.
Although DNNs seem superior to shallow networks, until recently there weren't many applications of DNNs. DNNs suffered from the issue of vanishing and exploding gradients during model training. However, in recent years, several technical innovations have produced better nonlinear transformers (activation functions), initialization and regularization schemes, and learning/optimization algorithms. These innovations have largely overcome the model training issues for DNNs and, combined with the availability of cheap computing power and large amounts of data, have resulted in the AI revolution that we are witnessing today.
A few words of caution: The power of DNNs should not make you undermine the importance of feature engineering. Explicit feature engineering contextualizes process data by considering the specific characteristics of the underlying process system. For example, for time-series vibration data, feature engineering via frequency-domain transformation and binning can greatly help subsequent network training; without this, a DNN would require a substantially larger amount of training data to implicitly learn the features.
TensorFlow
For classical ML implementations, we utilized the packages available in Scikit-learn. For
ANNs, other specialized libraries are commonly used which make it very easy to build neural
net models. TensorFlow (by Google) and PyTorch (by Facebook) are the two most popular
deep learning frameworks. These frameworks provide specialized algorithms for efficient
training of ANNs.
Direct application of TensorFlow (TF) is not very straightforward and involves a steep learning curve. Modelers therefore often utilize Keras, a high-level API built on top of TF, for ANN modeling. Moreover, the version of TF released by Google in 2019 integrated Keras directly into TF. This has made it possible to define, train, and evaluate ANN models in just a few lines of code. We will use the TensorFlow Keras API in this book. The next section shows a quick application for modeling a combined cycle power plant to illustrate the ease of usage of these deep learning libraries.
The deep learning revolution has been a shot in the arm for ML researchers in the PSE community, who are now exploring 'daring' applications of ANNs such as replacing model predictive controllers (MPCs) with ANNs by 'learning' the MPC's optimal policies+. However, neural nets are not new to the PSE community. The pre-deep learning era saw several interesting ANN applications for process systems as well. Venkatasubramanian++ and Himmelblau+++ provide excellent accounts of the history of ANNs in PSE.
The 90s was a relatively high-activity period which witnessed applications of ANNs for fault detection, soft-sensing, data reconciliation*, etc. This was followed by a decline in the interest in ANNs due to computational limitations and lack of big data. Today, with these limitations having largely been overcome, ANNs are poised to make a significant impact in PSE.
+Kumar et al., Industrial, large-scale model predictive control with structured neural networks, Computers and Chemical Engineering, 2021
++Venkat Venkatasubramanian, The promise of artificial intelligence in chemical engineering: is it here, finally?,
AIChE Journal, 2018
+++David Himmelblau, Accounts of experiences in the application of artificial neural networks in chemical
engineering, Industrial Engineering & Chemistry Research, 2008
*Karjarla, Dynamic data rectification via recurrent neural networks, Ph.D. Thesis, The University of Texas at
Austin, 1995
We will use data from a CCPP to illustrate the ease with which deep learning models can be
built using TF Keras. The dataset (see Appendix for details) was collected over a period of 6
years and contains hourly average values of ambient temperature (AT), ambient pressure
(AP), relative humidity (RH), and exhaust vacuum (V). These variables influence the net
hourly electrical energy output (also provided in the dataset) of the plant operating at full load
and will be the target variable in our ANN model. Let us first explore the dataset.
Figure 11.3 clearly indicates the impact of input variables on the electrical power (EP).
Figure 11.3: Plots of influencing variables (on x-axis) vs Electrical Power (on y-axis)
There is also a hint of a nonlinear relationship between exhaust vacuum and power. While it may seem that AP and RH do not influence power strongly, it is a known fact that power increases with increasing AP and RH individually58. Let us now build an FFNN model with 2 hidden layers to predict power. We first split the dataset into training and test data, and then scale the variables.
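The split step is not shown in the extracted text; a minimal sketch, assuming the inputs and target have already been loaded into arrays X and y (the test fraction is an assumption):

# split data into training and test sets (test fraction is an assumption)
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)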
# scale data
from sklearn.preprocessing import StandardScaler
X_scaler = StandardScaler()
X_train_scaled = X_scaler.fit_transform(X_train)
X_test_scaled = X_scaler.transform(X_test)
y_scaler = StandardScaler()
y_train_scaled = y_scaler.fit_transform(y_train)
y_test_scaled = y_scaler.transform(y_test)
To build FFNN model, we will import relevant Keras libraries and add different layers of the
network sequentially. The Dense library is used to define a layer that is fully-connected to the
previous layer.
# define model
model = Sequential()
model.add(Dense(8, activation='relu', kernel_initializer='he_normal', input_shape=(4,)))
# 8 neurons in 1st hidden layer
model.add(Dense(5, activation='relu', kernel_initializer='he_normal'))
# 5 neurons in 2nd layer
model.add(Dense(1))
# 1 neuron in output layer
The above 4-line code completely defines the structure of the FFNN. Do not worry about the activation and kernel_initializer parameters; we will study them in more detail in later sections.
58. Pinar Tufekci, Prediction of full load electrical power output of a base load operated combined cycle power plant using machine learning methods, Electrical Power and Energy Systems, 2014.
Next, we will compile and fit the model. Again, do not worry about the optimizer, epochs, and
batch_size parameters right now.
# compile model
model.compile(loss='mse', optimizer='Adam') # mean-squared error is to be minimized
# fit model
model.fit(X_train_scaled, y_train_scaled, epochs=25, batch_size=50)
# predict y_test
y_test_scaled_pred = model.predict(X_test_scaled)
y_test_pred = y_scaler.inverse_transform(y_test_scaled_pred)
The above lines are all it takes to build an FFNN and make predictions. Quite convenient, isn't it? Figure 11.4 compares the actual vs predicted power for the test data.
Figure 11.4: Actual vs predicted target for CCPP dataset (obtained R2 = 0.93)
You can use the model.summary command to visualize the structure of the network and check the number of model parameters, as shown in Figure 11.5. For the relatively simple CCPP dataset, we can obtain a reasonable model with just 1 hidden layer with 2 neurons. Nonetheless, this example has familiarized us with the process of creating a DNN.
Let's consider a sample input $\boldsymbol{x} \in R^4$ from the CCPP dataset and trace its path as it moves through the layers of the network to generate the output. In this forward pass (also called forward propagation), the inputs are first processed by the neurons of the first hidden layer. In the jth neuron of this layer, the weighted sum of the inputs is nonlinearly transformed via an activation function, g

$$a_j = g(\boldsymbol{w}_j^T \boldsymbol{x} + b_j)$$

where $\boldsymbol{w}_j \in R^4$ are the weights applied to the inputs and $b_j$ is the bias added to the sum. Thus, each neuron has 5 parameters (4 weights and a bias), leading to 40 parameters that need to be estimated for the 8 neurons of the 1st layer. The outputs of all the 8 neurons form the vector $\boldsymbol{a}^{(1)} \in R^8$,

$$\boldsymbol{a}^{(1)} = g\left(\boldsymbol{W}^{(1)}\boldsymbol{x} + \boldsymbol{b}^{(1)}\right)$$

where each row of $\boldsymbol{W}^{(1)} \in R^{8 \times 4}$ contains the weights of a neuron. The same activation function is used by all the neurons of a layer. $\boldsymbol{a}^{(1)}$ becomes the input to the 2nd hidden layer,

$$\boldsymbol{a}^{(2)} = g\left(\boldsymbol{W}^{(2)}\boldsymbol{a}^{(1)} + \boldsymbol{b}^{(2)}\right)$$

where $\boldsymbol{W}^{(2)} \in R^{5 \times 8}$. Each neuron in the 2nd layer has 8 weights and a bias parameter, leading to 45 parameters in the layer. The final output layer has a single neuron with 6 parameters and no activation function, giving the network output as follows

$$\hat{y} = \boldsymbol{w}^{(3)T}\boldsymbol{a}^{(2)} + b^{(3)}$$

where $\boldsymbol{w}^{(3)} \in R^5$.
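The complete forward pass can be traced in a few lines of numpy; the sketch below uses randomly initialized parameters, just to make the shapes of the 4-8-5-1 network concrete:

# forward pass through the 4-8-5-1 network with randomly initialized parameters
import numpy as np

relu = lambda z: np.maximum(0, z)
x = np.random.randn(4)                             # one sample input
W1, b1 = np.random.randn(8, 4), np.random.randn(8)
W2, b2 = np.random.randn(5, 8), np.random.randn(5)
w3, b3 = np.random.randn(5), np.random.randn()

a1 = relu(W1 @ x + b1)    # activations of 1st hidden layer, shape (8,)
a2 = relu(W2 @ a1 + b2)   # activations of 2nd hidden layer, shape (5,)
y_hat = w3 @ a2 + b3      # network output (no activation in the output layer)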
Activation functions
Activation functions (AFs) are what impart nonlinear capabilities to neural nets. We used the ReLU activation function for CCPP modeling. While ReLU (rectified linear unit) is the preferred AF nowadays, the pre-deep learning era used to employ the sigmoid activation function. The illustration below shows the form of the sigmoid AF, which transforms the pre-activation, z (weighted sum of inputs plus bias), into the activation a ∈ (0, 1). The saturation of sigmoid functions at 1 (or 0) for large positive (or negative) values of the pre-activation results in the issue of vanishing/zero gradients for deep networks and is the reason why sigmoid AFs are no longer favored for hidden layers.
The table below lists other commonly employed activation functions. We will study in a later section how these functions impart nonlinearity to ANN models.
Figure 11.6: FFNN with softmax activation function. (Subscripts denoting layers are not
shown for clarity)
Softmax is an exponential function that generates normalized activations so that they sum up to 1. In Figure 11.6, activation $a_j$ ($j \in [1, 2, 3]$) is generated as follows

$$a_j = g_{softmax}(z_j) = \frac{e^{z_j}}{\sum_{k=1}^{3} e^{z_k}}$$

In the ANN world, the pre-activations ($z_j$) that are fed as inputs to the softmax function are also called logits. The softmax activations lie between 0 and 1, and hence, they are interpreted as class-membership probabilities. The predicted class label is taken as the class with maximum probability (or activation).
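In code, softmax and the class-label extraction look like this (a small numpy sketch with made-up logit values):

# softmax over logits and predicted class label (numpy sketch; logit values are made up)
import numpy as np

z = np.array([2.0, 1.0, 0.1])        # logits from the output layer's 3 neurons
a = np.exp(z) / np.sum(np.exp(z))    # class-membership probabilities; sums to 1
predicted_class = np.argmax(a)       # class with maximum probability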
As you will see in the next subsection, the predicted class probabilities (and not the predicted class labels) are directly used during model training. For classification tasks where the classes are not mutually exclusive, the sigmoid function is employed in the neurons of the output layer. The sigmoid function is also used for binary mutually exclusive classification problems with a single neuron in the output layer.
For regression problems, the mse (mean squared error) metric is the default choice; the loss for a single sample and the overall cost across samples are given by

$$MSE\ Loss = (y_{sample} - \hat{y}_{sample})^2$$

$$MSE\ Cost = \frac{1}{\#\ of\ samples} \sum_{samples} (y_{sample} - \hat{y}_{sample})^2$$
Note that the above expression is for a scalar output; for a multi-dimensional output, the summation is also carried over the different dimensions. Another popular loss metric for regression is mae (mean absolute error), which measures the absolute difference $|y_{sample} - \hat{y}_{sample}|$. The MAE cost is more robust to outliers compared to MSE as large errors are not heavily penalized through squaring. The mae metric, however, is less efficient for optimization due to its discontinuous gradient. A good compromise can be achieved by the rmse (root mean squared error) metric.
For binary/2-class classification problems, binary cross-entropy is the default loss function. Let y (which can take values 0 and 1) be the true label for a data sample and p (or $\hat{y}$) be the predicted probability (of y = 1) obtained from the sigmoid output layer. The cross-entropy loss is given by

$$Cross\text{-}entropy\ Loss = -y\log(p) - (1-y)\log(1-p) = \begin{cases} -\log(1-p), & y = 0 \\ -\log(p), & y = 1 \end{cases}$$
The above expression generalizes as follows for multiclass classification, where the overall loss is the sum of separate losses for each class label

$$Cross\text{-}entropy\ Loss = -\sum_{c=1}^{\#\ of\ classes} y_c \log(p_c)$$

where $y_c$ and $p_c$ are the binary indicator and predicted probability of a sample belonging to class c, respectively. Note that for multi-class classification, the target variable will be in one-hot encoded form.
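In TF Keras, these loss choices translate into a compile-time setting; a sketch for a 3-class problem (the layer sizes are arbitrary assumptions):

# classification losses in Keras (sketch for a 3-class problem; layer sizes are arbitrary)
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense

clf_model = Sequential()
clf_model.add(Dense(8, activation='relu', input_shape=(4,)))
clf_model.add(Dense(3, activation='softmax'))  # 3 mutually exclusive classes
clf_model.compile(loss='categorical_crossentropy', optimizer='Adam')  # targets must be one-hot encoded
# for a binary problem: Dense(1, activation='sigmoid') with loss='binary_crossentropy'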
In gradient descent-based training, each model parameter $\theta_i$ is iteratively updated in the direction of the negative gradient of the cost function J(θ):

$$\theta_{i,next} = \theta_{i,current} - \eta \frac{\partial}{\partial \theta_i} J(\boldsymbol{\theta})$$

where η (the learning rate), a hyperparameter, determines the magnitude of the update in each iteration. For illustration, Figure 11.7 shows the progression of the cost function towards its minimum through multiple iterations for simple systems (one and two model parameters).
The learning rate influences the speed of optimization convergence. Large learning rates during the initial iterations help to quickly reach close to the minimum; however, large values can also lead to divergence instead of convergence. On the other hand, too small a learning rate leads to very slow convergence. To deal with this trade-off, the learning rate is often adjusted during training via a learning rate schedule (reducing η over iterations) or via mechanisms built into modern optimizers (the more common approach nowadays).
A drawback of the classical (or vanilla) gradient descent is that the optimizer is slow and often gets stuck in local minima. As a remedy, several variations of the vanilla approach have been devised which have been shown to provide better performance. These variations include Momentum optimization, Nesterov Accelerated Gradient, AdaGrad, RMSProp, and Adam optimization. Among these, Adam (adaptive moment estimation) optimization, which combines the ideas of momentum optimization and RMSProp, is recommended as the default optimizer.
Like with everything, Adam may not give good performance for some cases;
in those cases, other optimizers should be tried. With Adam optimization you
don’t have to worry about learning rate scheduling as it’s an adaptive learning
rate algorithm, although the initial learning rate may need adjustments for
some problems. Adam does have a few hyperparameters of its own but
usually the default values work fine.
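If the initial learning rate does need adjusting, it can be passed to the optimizer explicitly; a one-line sketch (0.001 is Adam's usual default in Keras):

# specify the initial learning rate for Adam explicitly
from tensorflow.keras.optimizers import Adam
model.compile(loss='mse', optimizer=Adam(learning_rate=0.001))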
The superior performance of Adam optimization over gradient descent stems from several algorithmic innovations, which include, amongst others, using an exponentially decaying average of past gradients to determine the update direction. You are encouraged to check out this article59 for an intuitive explanation of how these optimizers work.
In the mini-batch gradient descent (MGD) variation, for each parameter update iteration, only a subset of the training dataset goes into the gradient computation. The number of samples in a subset is called the batch size. Batch sizes commonly range between 32 and 256. MGD results in faster convergence
59. Sebastian Ruder, An overview of gradient descent optimization algorithms, arXiv, 2017.
and noisier/oscillating gradients (different mini-batches can generate very different update directions), as shown in Figure 11.8. An advantage of noisier gradients is that they can help the optimizer escape local minima. Stochastic gradient descent (not common now) is the extreme case of the mini-batch variant where the batch size is just one! Although not technically correct, practitioners employ the term SGD even when mini-batches are used.
Figure 11.8: Mini-batch & stochastic gradient descents have noisier gradients compared to
batch gradient descent
In MGD, several iterations (equal to $\frac{\#\ of\ training\ samples}{batch\ size}$) are needed for the optimizer to go through the entire training dataset; this is referred to as the completion of one epoch. At the start of every epoch, the training dataset is usually shuffled (to prevent the optimizer from getting stuck in cycles) and divided into mini-batches, and then the update iterations continue. The number of epochs is another crucial hyperparameter. Training and validation accuracies are plotted against epochs rather than iterations to keep track of model fitting. More epochs will lead to better training accuracy but can result in worse validation accuracy. We will learn more about this later. A simple rule of thumb is to use more epochs for higher batch sizes.
Backpropagation
Backpropagation computes the gradients of the cost function with respect to the model parameters efficiently through smart usage of the chain rule of derivatives. For a conceptual understanding of the mechanism, let's focus on SGD (for ease of explanation) and assume that we are interested in computing the gradient with respect to ω₁, as highlighted in Figure 11.9.
$$\frac{\partial J}{\partial \omega_1} = \frac{\partial J}{\partial a} \times \frac{\partial a}{\partial \omega_1} = \frac{\partial J}{\partial a} \times \left(\frac{\partial a}{\partial z} \times \frac{\partial z}{\partial \omega_1}\right)$$

While $\frac{\partial z}{\partial \omega_1}$ is simply $x_1$ and $\frac{\partial a}{\partial z}$ is the derivative of the activation function, $\frac{\partial J}{\partial a}$ is the unknown element here. However, it is easy to show that this unknown part can be computed if the partial derivatives of the cost function with respect to the activations of the 2nd hidden layer are known, which themselves need the derivative of the cost with respect to the output layer activation, i.e., $\hat{y}$. This final missing element is given by

$$\frac{\partial J}{\partial \hat{y}} = \frac{\partial}{\partial \hat{y}} \left[\frac{1}{2}(y - \hat{y})^2\right] = -(y - \hat{y})$$
The above discussion shows that once the activations, predicted output, and error have been
computed by forward propagation, backward propagation is employed to compute the required
gradients one layer at a time, starting from the last layer. For MGD, through careful
bookkeeping, a compact vector-matrix form of backpropagation can be obtained, just like we
did for the forward pass. Note that, technically, backpropagation refers only to the mechanism
for gradient computation, but the term is often used for the entire model fitting algorithm.
Vanishing/Exploding gradients
In the past, the issue of vanishing or exploding gradients posed great difficulty in model fitting
for deep networks. In backpropagation, we saw that gradients are propagated from the output
layer towards the inner layers. Some of these gradients are derivatives of activation
functions, which, for sigmoid and tanh functions, become very small for large-magnitude
pre-activations.
In the vanishing gradient scenario, gradients of inner layer parameters become very small
due to products of small gradients propagating from the outer layers. This causes a virtual
halt of inner parameter updates, leading to slow convergence and poor model fitting. On the
other hand, sometimes the problem arises due to gradients becoming very large because of
large weights; through backpropagation the problem gets compounded and causes training
failure. The paper by Glorot and Bengio (titled 'Understanding the difficulty of training deep
feedforward neural networks') provides an excellent investigation of these issues. The usage
of the sigmoid activation function and poor initialization schemes were found to be some of
the culprits. Today, these issues have mostly been curtailed through the use of ReLU
functions, batch normalization, correct initialization schemes, etc.
While designing a neural network for complex nonlinear models, one of the most crucial
decisions you will make is choosing between adding more neurons in a layer (increasing
network width) and adding more hidden layers (increasing network depth). Both network width
and depth influence network performance, albeit in different manners. Before we delve further
into this 'width vs depth' question, let us analyze the role of ReLU activation functions in some
more detail; this will provide some clues to help us make the above decision.
Consider an absolute function that takes a scalar input and outputs its absolute value. This
nonlinear function can be perfectly modeled via an FFNN with 1 hidden layer containing 2
nodes with ReLU activation functions, as shown in Figure 11.10.
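To make this concrete, here is a minimal NumPy sketch (our own illustration, not code from the figure) showing that the two-neuron ReLU network reproduces the absolute function exactly, since |x| = ReLU(x) + ReLU(−x):

# sketch: |x| modeled via 2 ReLU neurons with weights +1 and -1 (no biases)
import numpy as np

def relu(z):
    return np.maximum(0, z)

x = np.linspace(-2, 2, 9)
h1, h2 = relu(1.0*x), relu(-1.0*x)   # hidden layer activations
y = 1.0*h1 + 1.0*h2                  # output layer sums the two activations
print(np.allclose(y, np.abs(x)))     # True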
We were able to model the nonlinear absolute function perfectly because the ReLU-based
neurons effectively bifurcated the input space into 2 regions (x ≤ 0 and x ≥ 0) where linear
relationships hold; essentially, the absolute function got modeled via two piecewise linear
functions. Continuing this logic, now assume that the scalar input-output system follows a
nonlinear (quadratic) relationship. This system can again be approximated by using several
piecewise linear functions, as shown in Figure 11.11. In fact, the approximation shown in the
figure was obtained using 1 hidden layer with 5 ReLU neurons.
What we learn from the above illustrations is that having several neurons in a single layer
helps a neural network divide the input space into several local regions and approximate a
complex function via several piecewise simpler functions. Consequently, a very nonlinear
function can be modeled with a very wide network (large number of nodes in one hidden
layer). However, training a very wide network can become problematic.
For a network of fixed width, another way to increase nonlinear capability is to add more
depth/hidden layers to the network, enabling the network to estimate more complicated
features. This is because the successive hidden layers build upon the nonlinear features
generated by the preceding layers to impart more nonlinearity to the overall operation. For any
activation function, in a wide network with 1 hidden layer, the activations of all neurons have a
similar degree of nonlinearity. But in a deep network, activations from different layers exhibit
varied degrees of nonlinearity, with the outermost layer exhibiting the most nonlinearity.
However, as mentioned before, deep networks are prone to vanishing/exploding gradient issues.
You can see that the answer to the 'more width or more depth' question is not very
straightforward. Similar gains in performance can be achieved by increasing network width or
depth. A few words of caution here: higher network depth may fail to give adequate
performance if the network width is too low. For example, irrespective of the network
depth, the absolute function cannot be accurately modeled if the network width is just one.
Therefore, both width and depth are important. A rule of thumb is that, for a system with a low
degree of nonlinearity, a high width-low depth network should be used, while for a very complex
system a high depth network with adequate width should be used.
For modeling process systems, 2 to 3 hidden layers are usually sufficient for
both regression and classification tasks. Going deeper helps if you are working
with inputs that exhibit hierarchical structure, such as images (complex
shapes can be broken down into edges, corners, etc.).
We have seen that before executing the model.fit command, several hyperparameters have
to be specified. These include, among others, network depth, network width, number of
epochs, mini-batch size, learning rate, regularization penalty, and activation function type.
While general recommendations exist for most of the hyperparameters, specifying the number
of hidden layers and the number of neurons in every layer requires some work as these depend
largely on the specific problem at hand.
Common approaches for tuning the hyperparameters are based on manual trials and
automated grid (or random) searches. In the manual approach, there are two strategies. You
can choose a complex network structure (more layers and neurons than you 'feel' necessary)
and use regularization techniques and early stopping (these are discussed in the next section)
to avoid overfitting. Alternatively, you can start with a simple structure (one hidden layer and a
reasonable number of neurons) and gradually increase the network depth and number of
neurons until overfitting begins.
# scale data
from sklearn.preprocessing import StandardScaler
X_scaler = StandardScaler()
X_est_scaled = X_scaler.fit_transform(X_est)
X_val_scaled = X_scaler.transform(X_val)
X_test_scaled = X_scaler.transform(X_test)
y_scaler = StandardScaler()
y_est_scaled = y_scaler.fit_transform(y_est)
y_val_scaled = y_scaler.transform(y_val)
y_test_scaled = y_scaler.transform(y_test)
# import packages
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras import regularizers
from tensorflow.keras.optimizers import Adam
# model function
def FFNN_model(hidden_layers, layer_size, regularizationValue, learningRate):
    model = Sequential()
    model.add(Dense(layer_size, kernel_regularizer=regularizers.L1(regularizationValue),
                    activation='relu', kernel_initializer='he_normal', input_shape=(4,)))
    for _ in range(hidden_layers-1):
        model.add(Dense(layer_size, kernel_regularizer=regularizers.L1(regularizationValue),
                        activation='relu', kernel_initializer='he_normal'))
    model.add(Dense(1))
    model.compile(loss='mse', optimizer=Adam(learning_rate=learningRate))
    return model
# KerasRegressor
from tensorflow.keras.wrappers.scikit_learn import KerasRegressor
model = KerasRegressor(build_fn=FFNN_model, epochs=25, batch_size=50)
# gridSearchCV
import numpy as np
from sklearn.model_selection import GridSearchCV
param_grid = {
    "hidden_layers": [1, 2],
    "layer_size": np.arange(1, 10),
    "regularizationValue": [0.001, 0.01, 0.1],
    "learningRate": [0.05, 0.01, 0.1]}
>>> The best parameters obtained are: {'hidden_layers': 1, 'layer_size': 9, 'learningRate': 0.01,
'regularizationValue': 0.001}
# best model
model = grid_searchcv.best_estimator_.model
Training ANN models is not easy, and there are various issues you may encounter. Maybe
your model is overfitting or underfitting, or the optimizer is getting stuck in local minima, or your
model is not converging. Keeping track of the training process can provide some insight into
how well the model is fitting and hints about potential hyperparameter adjustments. One way
to keep track of how the model fitting is progressing is to plot the training set and validation set
costs against the epochs. Several strategies have been devised to deal with frequently
encountered ANN training issues. We will study these strategies in this section.
Early stopping
As the optimizer iterates over more epochs, the training cost goes down; however, the validation
cost may begin to increase. This is an indication of model overfitting. Figure 11.12 shows an
example (generated with the Kamyr digester dataset). To prevent this, early stopping is
adopted, wherein training is aborted when the validation cost begins to increase. To
implement this, a validation set and early stopping specifications are supplied as shown in the
code below. Note the usage of the 'history' object, which holds a record of losses during training
for drawing the validation plot.
# define model
def FFNN_model():
    model = Sequential()
    model.add(Dense(20, activation='tanh', kernel_initializer='he_normal', input_shape=(19,)))
    model.add(Dense(5, activation='tanh', kernel_initializer='he_normal'))
    model.add(Dense(1))
    model.compile(loss='mse', optimizer='Adam')
    return model
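# supply validation data and the early stopping callback (a sketch; the variable names,
# patience value, and epoch/batch settings are our assumptions based on the discussion above)
from tensorflow.keras.callbacks import EarlyStopping
model = FFNN_model()
es = EarlyStopping(monitor='val_loss', patience=10, restore_best_weights=True)
history = model.fit(X_est_scaled, y_est_scaled, epochs=500, batch_size=32,
                    validation_data=(X_val_scaled, y_val_scaled), callbacks=[es])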
# validation plot
plt.figure()
plt.title('Validation Curves')
plt.xlabel('Epoch')
plt.ylabel('MSE')
plt.plot(history.history['loss'], label='training')
plt.plot(history.history['val_loss'], label='validation')
plt.legend()
Figure 11.12: Validation plots for Kamyr digester dataset without (left) and with (right) early
stopping
With early stopping, we can set the number of epochs to a large value and not worry about
overfitting, as the training will automatically stop at the right time. The early stopping callback
has several parameters that can be used to alter its behavior. For example, the 'patience'
parameter that we used specifies the number of epochs to allow with no improvement on the
validation set, after which the training is stopped. You should check out the official TensorFlow
documentation60 for details on other early stopping parameters.
60. https://www.tensorflow.org/api_docs/python/tf/keras/callbacks/EarlyStopping
Regularization
As we have learnt before, regularization is a way to constrain a model's weight parameters to
avoid overfitting (a large gap between training and validation accuracies). As one of the best
practices, you should always employ regularization for your ANN model fitting. Accordingly,
the regularization penalty becomes one of the important hyperparameters. Keras makes
specifying regularization (L1, L2, or both) very easy, as was shown in the hyperparameter
optimization example where L1 regularization was used while defining the Dense layers.
Keras allows three types of regularizers, namely, kernel_regularizer, bias_regularizer, and
activity_regularizer, to regularize the weight parameters, bias parameters, and neuron
activations, respectively; of these, kernel_regularizer is the recommended default choice.
For deep networks, another form of regularization, called 'dropout', is very popular. In this
strategy, in every training iteration, some neurons are 'dropped out', i.e., their activations
are not computed in the forward pass and their weight updates are not performed. Note that
neurons dropped in the current iteration can become active in the next iteration. Dropout has
been shown to provide superior generalization61 as it forces the network to avoid relying too
much on just a few neurons. In Keras, dropout is specified layer-wise. Two schemes are
illustrated below. In scheme 1, a dropout layer is added between the input and the first hidden
layer and the dropout rate (fraction of neurons to drop) is set to 0.2; therefore, one out of every
5 input variables is randomly excluded in an iteration. In scheme 2, a dropout layer is added
between two hidden layers; here again, one out of every 5 neurons in the first hidden layer is
randomly excluded in an iteration. Note that dropout is only applied during model training.
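A minimal sketch of the two schemes is given below (the layer sizes and input dimension are illustrative assumptions, not the book's exact settings):

from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense, Dropout

# scheme 1: dropout applied to the input layer (rate = 0.2)
model1 = Sequential()
model1.add(Dropout(0.2, input_shape=(19,)))
model1.add(Dense(20, activation='relu', kernel_initializer='he_normal'))
model1.add(Dense(1))

# scheme 2: dropout between two hidden layers (rate = 0.2)
model2 = Sequential()
model2.add(Dense(20, activation='relu', kernel_initializer='he_normal', input_shape=(19,)))
model2.add(Dropout(0.2))
model2.add(Dense(20, activation='relu', kernel_initializer='he_normal'))
model2.add(Dense(1))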
Initialization
Before the first optimization iteration is performed, model parameters need to be initialized
with some initial values. Until not long ago, a trivial approach was to initialize all weights
to zero or to the same value. It was, however, realized that this trivial approach resulted in poor
training. Specialized initialization schemes have since been devised. If you are using ReLU
activation functions, the He initialization scheme should be the default preference; here, the
weight parameters for the neurons in a layer are initialized by drawing samples from a normal
distribution with zero mean and variance 2/fan-in, where fan-in is the number of inputs to the
layer.

61. Srivastava et al., Dropout: A simple way to prevent neural networks from overfitting, The Journal of Machine Learning Research, 2014.
Batch normalization
Batch normalization is another neat trick that has been devised to overcome the
vanishing/exploding gradient issue for deep networks. The initialization schemes ensure that
the weights within a layer are distributed appropriately at the start. However, as training
progresses and model parameters are updated, the distributions of the layers' inputs may drift
away from their 'nice' initial form, which negatively impacts model training. With the batch
normalization strategy, at each training iteration with a mini-batch, the inputs to a layer are
standardized. For deep networks, this strategy has been shown to provide significant training
improvements. Implementing this in Keras is very straightforward: we add a batch
normalization layer (model.add(BatchNormalization())) just before the layer whose inputs we
wish to standardize.
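For concreteness, a minimal sketch (layer sizes are illustrative assumptions) is shown below:

from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense, BatchNormalization

model = Sequential()
model.add(Dense(20, activation='relu', kernel_initializer='he_normal', input_shape=(19,)))
model.add(BatchNormalization())   # standardizes the inputs to the next layer batch-wise
model.add(Dense(20, activation='relu', kernel_initializer='he_normal'))
model.add(Dense(1))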
Transfer Learning
Imagine that you have trained a deep network which provides good performance for
some complex task, such as fault detection. Now you would like to replicate this at
another similar site within your organization. However, you don't have a lot of training
data at this second site (maybe the second site was commissioned recently and
hasn't experienced many process faults yet!). In such situations, you may want to adopt
the popular strategy of 'transfer learning', wherein the model parameters of the inner layers
of a pretrained model are transferred to a new model, as shown in the illustration below.
As alluded to before, inner layers in a DNN 'learn' the low-level features from process
data, and transfer learning simply transfers this learnt logic from one model to another
for related tasks. This leads to faster convergence with limited data*.
* Li et al., Transfer learning for process fault diagnosis: Knowledge transfer from simulation to physical processes, Computers & Chemical Engineering, 2020.
Let us employ all the knowledge we have gained about training FFNN models to develop
a soft sensor for a very nonlinear system. For this, we will re-use the debutanizer column
dataset that we saw in Chapter 7 (see the Dataset Description section in the appendix for
system details). The soft sensor is required to predict the C4 content in the bottoms product
using other process data around the column. The dataset contains 2394 samples of
input-output process values. The sampling time is 15 minutes. Seven process variables
(pressures, temperatures, and flows around the debutanizer column) are used as inputs.
Figure 11.13 shows that the dataset has decent process variability.
Figure 11.13: Plots of input and output (lower-right plot) variables for the debutanizer column
Before we build the FFNN model, a PLS model is built to serve as a reference for modeling
accuracy assessment. Figure 11.14 suggests that the PLS model is grossly inadequate for
predicting C4 content and hints at the presence of strong nonlinearities in the system.
Specifically, the linear trend in the residual plot clearly shows that the PLS model exhibits poor
performance for high and low values of C4 content.
Figure 11.14: Comparison of test data with PLS predictions and residual plot
After a few trials, a network with 2 hidden layers containing 60 and 30 neurons was obtained
that gave 70% accuracy on test data. The code below shows the values of the rest of the
hyperparameters. Figure 11.15 confirms the superior performance of the FFNN model over the
PLS model. The residual plot does not show any significant trend, indicating that the FFNN
model is able to adequately capture the nonlinear relationships between the input and output
variables. This case study demonstrates the powerful capabilities of ANN models for modeling
complex process systems.
# read data
data = np.loadtxt('debutanizer_data.txt', skiprows=5)
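# separate inputs and output, and create estimation/validation/test splits
# (a sketch; the column layout and split fractions are our assumptions, not
# necessarily the authors' settings)
from sklearn.model_selection import train_test_split
X, y = data[:, 0:7], data[:, 7:]
X_est, X_test, y_est, y_test = train_test_split(X, y, test_size=0.2, shuffle=False)
X_est, X_val, y_est, y_val = train_test_split(X_est, y_est, test_size=0.2, shuffle=False)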
# import packages
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras import regularizers
from tensorflow.keras.callbacks import EarlyStopping
from tensorflow.keras.optimizers import Adam
# define model
model = Sequential()
model.add(Dense(60, kernel_regularizer=regularizers.L1(0.0000001), activation='relu',
kernel_initializer='he_normal', input_shape=(7,)))
model.add(Dense(30, kernel_regularizer=regularizers.L1(0.0000001), activation='relu',
kernel_initializer='he_normal'))
model.add(Dense(1, kernel_regularizer=regularizers.L1(0.0000001)))
# compile model
model.compile(loss='mse', optimizer=Adam(learning_rate=0.005))
# fit model
es = EarlyStopping(monitor='val_loss', patience=200)
history = model.fit(X_est, y_est, epochs=2000, batch_size=32, validation_data=(X_val, y_val),
                    callbacks=[es])
# predict y
y_test_pred = model.predict(X_test)
# residual plot
plt.figure()
plt.plot(y_test, y_test-y_test_pred, '*')
plt.xlabel('C4 content test data'), plt.ylabel('residual (raw data- prediction)')
plt.title('residual plot')
# metrics
from sklearn.metrics import r2_score
print('R2 for test dataset:', r2_score(y_test, y_test_pred))
Figure 11.15: Comparison of test data with FFNN predictions and residual plot
ANN modeling is a very broad topic and, with the plethora of tutorials on ANN modeling
available out there, it is easy to get overwhelmed. However, we have seen in this chapter that
ANNs are not as daunting as they may seem if we pay careful attention to a few key concepts.
If you are looking for some quick guidelines on how to set up the ANN hyperparameters for
process systems modeling, then the following suggestions will serve you well.
You will be lucky if you end up finding a good ANN model with the default settings in the very
first attempt. Often, the following adjustments need to be made:
• If validation accuracy is much lower than training accuracy, then increase the regularization
penalty
• If the optimizer is getting stuck in local minima, then adjust Adam's initial learning rate. If that
does not help, then try adding more neurons or hidden layers
• If the loss vs epoch curve is very noisy, then increase the mini-batch size
Summary
Phew! That was a loaded chapter. We covered several techniques that are at our disposal to
train ANN models efficiently. There is a lot more behind each of the aspects that we touched
upon, and we have only scratched the surface. However, you are now familiar with the core
concepts of ANNs and have hands-on experience with modeling the debutanizer column and
combined cycle power plants. You are now well aware of what it takes to develop a good ANN
model for process systems. ANNs are going to be an important weapon in your arsenal which
you will find yourself employing very often. In the next chapter we will learn how ANNs are
deployed for dynamic or temporal data.
Chapter 12
Recurrent Neural Networks
In the previous chapter, we saw that feed-forward neural networks are a powerful mechanism
for capturing complex relationships among process variables. In FFNNs, there is an implicit
assumption of static relationships between network inputs and outputs. There is no notion of
process dynamics or temporal order. However, sometimes you will encounter situations where
the model output depends not just on the current input but on past inputs as well. Previously,
we employed the input augmentation technique (in dynamic PCA/PLS) to deal with dynamic
systems. Unfortunately, in FFNNs, augmentation can lead to a significant increase in the
number of model parameters. Recurrent neural networks are specialized networks for
efficiently dealing with sequential or temporal data. In essence, RNNs are FFNNs with 'memory'.
For process systems, RNNs have been successfully used for system identification, fault
detection & classification, time series forecasting, and predictive maintenance of industrial
equipment. Outside of process industry, RNNs are employed, among others, for natural
language processing (speech recognition, text autofill, language translation, sentiment
analysis, etc.), music composition, and stock market forecasting!
In this chapter, we will learn how RNNs are able to remember things from past to make future
predictions. Since we already covered several neural network concepts in the previous
chapter, we will focus on varied applications of RNNs in this chapter. In the process, we will
look at different network configurations commonly employed. Specifically, the following topics
are covered
• Introduction to RNNs
• Different topological configurations of RNN networks
• System identification via RNN
• Using RNNs for fault classification
• Employing RNNs for predictive maintenance
Recurrent neural networks (RNNs) are ANNs designed for sequential data, where the order
of occurrence of the data holds significance. For example, in a production plant, consistently
increasing temperature measurements may indicate one kind of process fault while
consistently decreasing measurements may indicate another fault type. There is no efficient
mechanism to specify this temporal order of data in an FFNN. RNNs accomplish this by
processing the elements of a sequence recurrently and storing a hidden state that summarizes
the past information during the processing. The basic unit in an RNN is called an RNN cell,
which simply contains a layer of neurons. Figure 12.1 shows the architecture of an RNN
consisting of a single cell and how it processes a data sequence with ten samples or x readings.
Figure 12.1: Representation of an RNN cell in rolled and unrolled format. The feedback
signal in the rolled format denotes the recurrent nature of the cell. Here, the hidden state (h)
is assumed to be the same as the intermediate output (y). h(0) is usually taken as a zero vector.
We can see that the ten samples are processed in the order of their occurrence and not all at
once. An output, y(i), is generated at the ith step and is then fed to the next step for processing
along with x(i+1). By way of this arrangement, y(i+1) is a function of x(i+1) and y(i). Since y(i) itself
is a function of x(i) and y(i-1), y(i+1) effectively becomes a function of x(i+1), x(i), and y(i-1). Continuing
this logic further implies that the final sequence output, y(10), is a function of all ten inputs of
the sequence, that is, x(10), x(9), …, x(1). We will see later how this 'recurrent' mechanism leads
to efficient capturing of temporal patterns in data.
RNN outputs
If the neural layer in the RNN cell in Figure 12.1 contains n neurons (n equals 4 in the shown
figure), then each y(i) or h(i) is an n-dimensional vector. For simple RNN cells, y(i) equals h(i). Let
x be an m-dimensional vector. At any ith step, we can write the following relationship

𝒚(𝒊) = 𝒈(𝑾𝒙 𝒙(𝒊) + 𝑾𝒚 𝒚(𝒊−𝟏) + 𝒃)

where 𝑾𝒙 ∈ 𝑹𝒏×𝒎 with each row containing the weight parameters of a neuron as applied to the
x vector, 𝑾𝒚 ∈ 𝑹𝒏×𝒏 with each row containing the weight parameters of a neuron as applied
to the y vector, 𝒃 ∈ 𝑹𝒏 contains the bias parameters, and 𝒈 denotes the activation function. The
same neural parameters are used at each step.
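To make the recurrence concrete, here is a minimal NumPy sketch (our own illustration; the dimensions and random parameters are arbitrary) of how a single vanilla RNN cell processes a ten-sample sequence with shared parameters:

import numpy as np

n, m = 4, 3                      # neurons in the cell, input dimension
rng = np.random.default_rng(0)
Wx, Wy, b = rng.normal(size=(n, m)), rng.normal(size=(n, n)), np.zeros(n)

def rnn_step(x_i, y_prev):
    # y(i) = g(Wx x(i) + Wy y(i-1) + b), with g = tanh
    return np.tanh(Wx @ x_i + Wy @ y_prev + b)

y = np.zeros(n)                          # h(0) taken as a zero vector
for x_i in rng.normal(size=(10, m)):     # a sequence with ten samples
    y = rnn_step(x_i, y)                 # the same parameters are re-used at each step
print(y.shape)                           # (4,)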
If all the outputs of the sequence are of interest, then the network is called a sequence-to-sequence
or many-to-many network. However, very often only the last step's output is needed, leading to a
sequence-to-vector or many-to-one network. We will see later how we can direct Keras to
discard all outputs except the last one. Moreover, the last step's output may need to be further
processed, and so an FC (fully connected) layer is often added. Figure 12.2 shows one such topology.
Figure 12.3 shows two other popular architectures, namely, vector-to-sequence RNN and
delayed sequence-to-sequence RNN. In the former scheme, a single input returns a
sequence. A real-life example could be the dynamic process response due to a step change in
process inputs. The delayed sequence-to-sequence network (also called encoder-decoder
network) is also a many-to-many network; however, here, the output sequence is delayed.
This scheme is utilized when each step of the output sequence depends on the entire input
sequence. The encoder-decoder scheme is often utilized for language translation because the
initial words of the translation can be influenced by the final words of the sentence being translated.
LSTM networks
RNNs are powerful dynamic models; however, the vanilla RNN (with a single neural layer in
a cell) introduced before faces difficulty learning long-term dependencies, i.e., when the number
of steps in a sequence is large (~ ≥ 10). This happens due to the vanishing gradient problem
during gradient backpropagation. To overcome this issue, LSTM (long short-term memory)
cells have been devised, which are able to learn very long-term dependencies (even greater
than 1000 steps) with ease. Unlike vanilla RNN cells, LSTM cells have 4 separate neural layers,
as shown in Figure 12.4. Moreover, in an LSTM cell, the internal state is stored in two separate
vectors, h(t), the hidden state, and c(t), the cell state. Both these states are passed from one LSTM
cell to the next during sequence processing. h(t) can be thought of as the short-term
state/memory and c(t) as the long-term state/memory (we will see later why), and hence the
name LSTM.
The vector outputs of the FC layers interact with each other and with the long-term state via three
'gates' where element-wise multiplications occur. These gates control what information goes
into the long-term and short-term states at any sequence processing step. While we will
understand the mathematical details later, for now, it suffices to know the following about
these gates:
• The forget gate determines what parts of the long-term state, c(t), are retained and erased
• The input gate determines what parts of the new information (obtained from processing of x(t) and
h(t-1)) are stored in the long-term state
• The output gate determines what parts of the long-term state are passed on as the short-term state
Figure 12.4: Architecture of an LSTM cell. Three FC layers use sigmoid activation functions
and one FC layer uses the tanh activation function. Each of these neural layers has its own
parameters 𝑾𝒙, 𝑾𝒉, and 𝒃
This flexibility in being able to manipulate what information is passed down the chain during
sequence processing is what makes LSTM networks so successful at capturing long-term
patterns in sequential datasets. Consequently, LSTM networks are the default RNNs
nowadays. In the next section we will see a quick application of RNNs for system identification
using LSTM cells.
There is another popular variant of the RNN cell, called the GRU cell. As shown in the illustration
below, the GRU cell is simpler than the LSTM cell. A GRU cell has 3 neural layers and its internal
state is represented using a single vector, h(t). For several common tasks, GRU cell-based
RNNs seem to provide similar performance as LSTM cell-based RNNs and, therefore,
they are slowly gaining more popularity.
The task of building mathematical models of dynamical processes using input and output data
is referred to as system identification (SI). RNNs are aptly suited for SI as they are designed
to capture dynamic relationships. To illustrate this, we will use data from a single input single
output (SISO) heater system where the heater power output is manipulated to maintain the
desired system temperature. The dataset contains 4 hours of training data and 14 minutes of
test data. Data is sampled every second. Our modeling objective is to predict the next
temperature value using the current and past data. Let us first quickly look at the training data.
# read data
data = pd.read_csv('TCLab_train_data.txt')
heaterPower = data[['Q1']].values
temperature = data[['T1']].values
# plot data
plt.plot(temperature, 'k'), plt.ylabel('Temperature'), plt.xlabel('Time (sec)')
plt.figure()
plt.plot(heaterPower), plt.ylabel('Heater Power'), plt.xlabel('Time (sec)')
Figure 12.5: Plots of input and output data of a dynamic heater system
Figure 12.5 clearly shows the dynamic nature of the data; quick, large, and consistent variations
in heater power lead to the temperature being in a transient state most of the time. It is obvious
that the temperature value at any instant would depend not just on the current heater power but
also on past values of heater power. To reduce the amount of past data required, we will use
both power and temperature values from the past to predict the current temperature value. In SI
terminology, we are attempting to build an ARX (autoregressive with exogenous inputs)
model.
X = data[['T1','Q1']].values
y = data[['T1']].values
X_scaler = StandardScaler()
X_scaled = X_scaler.fit_transform(X)
y_scaler = StandardScaler()
y_scaled = y_scaler.fit_transform(y)
Now that we have decided our model input and output variables, there is one more data
re-arrangement that needs to be done. Currently, the shape of the X array is {# of samples, # of
features or input variables}. For an RNN, this input data matrix needs to be converted into the
shape {# of sequence samples, # of time steps, # of features} such that each entry (of shape
{# of time steps, # of features}) along the 0th dimension is a complete sequence of past data
which is used to make a prediction of the temperature value. Here, the # of time steps is taken
to be 70 (following the LSTM example in this reference62) and the # of features equals 2. If you
do not have a good idea of how many time steps to use for your system, then it becomes a
model hyperparameter which will need to be optimized. For each input sequence, a scalar
temperature value is predicted and, therefore, the y array is re-arranged accordingly as shown
in the code below.
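# re-arrange data into sequences (a sketch of the re-arrangement described above;
# variable names are assumptions)
nTimeSteps = 70
X_sequence, y_sequence = [], []
for i in range(nTimeSteps, X_scaled.shape[0]):
    X_sequence.append(X_scaled[i-nTimeSteps:i, :])   # past 70 rows of [T1, Q1]
    y_sequence.append(y_scaled[i])                   # temperature to be predicted
X_sequence, y_sequence = np.array(X_sequence), np.array(y_sequence)
print(X_sequence.shape)   # (# of sequence samples, 70, 2)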
Basically, each block of 70 continuous rows in the (scaled) X array becomes a sequence. The
topology of the RNN that we will build is the same as that shown in Figure 12.2, except for the
number of time steps. As we did for FFNN modeling, we will import the relevant Keras libraries.
In the code below, an LSTM RNN layer is followed by a dense layer.
# import packages
from tensorflow.keras import Sequential, regularizers
from tensorflow.keras.layers import Dense, LSTM

# define model
model = Sequential()
model.add(LSTM(units=25, kernel_regularizer=regularizers.L1(0.001), input_shape=(nTimeSteps, 2)))   # LSTM cell with 25 neurons in each of the 4 neural layers
model.add(Dense(units=1))   # 1 neuron in output layer
The above code completely defines the structure of the required RNN. The single
neuron in the output layer does the job of transforming the 25-dimensional hidden state
vector from the last step of the RNN layer into a scalar value. Note that, by default, the LSTM
layer returns only the last step's output. The model summary shows the number of model
parameters in each layer. We will understand in the next section how we ended up with 2800
parameters in the LSTM layer.

62. https://apmonitor.com/do/index.php/Main/LSTMNetwork
Next, we compile and fit the model. Note that instead of providing an explicit validation dataset,
we used the validation_split parameter to specify a 30% validation split, as sketched below
(the epoch and batch-size settings are our assumptions).
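# compile and fit model (a sketch; epoch and batch-size settings are assumptions)
model.compile(loss='mse', optimizer='Adam')
history = model.fit(X_sequence, y_sequence, epochs=50, batch_size=32, validation_split=0.3)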
Figure 12.7: Measured vs predicted temperature values for SISO heater system
Figure 12.7 shows that our RNN can fit the training data well. Let us now check its
performance on the test dataset.
# scale data
X_test_scaled = X_scaler.transform(X_test)
y_test_scaled = y_scaler.transform(y_test)
# predict y_test (X_test_sequence is generated from X_test_scaled in the same way as the training sequences)
y_test_sequence_pred = model.predict(X_test_sequence)
y_test_pred = y_scaler.inverse_transform(y_test_sequence_pred)
Figure 12.8 confirms the good generalization capability of the RNN model, as the model
predictions match the raw temperature values in the test dataset very well. Note that model
predictions are not available for the first 70 samples.
It is now a good time for a quick comparison between FFNN and RNN modeling. If we had
built an FFNN model with past power and temperature values as additional input variables,
then we would have ended up with a 140-dimensional input vector. Assuming we use a hidden
layer with only 25 neurons followed by a single output neuron, we would have ended up with
more than 3500 model parameters, about 25% more than that used by our RNN model. This
rough thought-experiment shows how RNNs end up with better parameter efficiency for
modeling dynamic systems. And, by now, you know that a lower number of model parameters
generally implies lower chances of over-fitting and better model training. Therefore, for
dynamic systems, RNNs should be the preferred model choice.
Figure 12.8: Measured vs predicted temperatures for SISO heater system test dataset
It is always nice to have a good understanding of the impact of adding additional neurons on
the total number of model parameters; pre-emptive steps can then be taken to prevent
over-fitting as model complexity increases. In a vanilla RNN cell, we saw that 𝑾𝒙 ∈ 𝑹𝒏×𝒎,
𝑾𝒉 ∈ 𝑹𝒏×𝒏, and 𝒃 ∈ 𝑹𝒏. This leads to a total number of parameters equal to 𝑚𝑛 + 𝑛² + 𝑛. In
an LSTM cell, each of the 4 neural layers has its own set of these parameters; therefore, an
LSTM cell has 4(𝑚𝑛 + 𝑛² + 𝑛) model parameters. In the previous heater example, we have
m = 2 and n = 25; therefore, we ended up with 4(50 + 625 + 25) = 2800 model parameters in
the LSTM layer.
Let us now take a sneak peek into the inner workings of an LSTM cell. An LSTM cell may seem
intimidating; however, it is actually simple to understand. Let us look at each of the neural
layers and the gates, and understand the specific roles they play. The core component of the
LSTM cell is the cell state (𝒄𝒕 ∈ 𝑹𝒏) that almost cuts through the cell untouched, except for a
couple of element-wise operations which modify the individual elements/components of 𝒄𝒕.
These modifications determine which components are useful or no longer useful for further
processing of the sequence. To understand these modifications, consider the first of the 4
neural layers.
Here, the input (𝒙𝒕) and the hidden state from the previous step (𝒉𝒕−𝟏) are processed to generate
the vector f (∈ 𝑹𝒏) whose values lie between 0 and 1. If the ith component of f is 0, it implies that
the ith component of 𝒄𝒕−𝟏 gets erased/forgotten after the element-wise multiplication (hence
the name 'forget gate'). The second and third neural layers generate the n-dimensional
vectors i and g, respectively. The component values of i (∈ [0, 1]) determine how much of the
components of g (∈ [−1, 1]) are passed through the input gate and get added to the cell state.
These are all the modifications that occur to the cell state before it is passed on to the next
LSTM cell. However, there is one more major task that occurs inside the LSTM cell. The hidden
state, 𝒉𝒕 ∈ 𝑹𝒏, which acts as the cell output at any sequence step, is generated from the cell
state. The fourth neural layer participates in this task as shown below.
The vector output, o (∈ 𝑹𝒏), from the fourth neural layer determines which parts (and how
much) of the cell state are output as the hidden state at this time step. The cell state is passed
through a tanh function before the output gate to push its values to be between -1 and 1.
We hope that the above description gave you a good conceptual overview of how memories
are created and passed along the RNN chain during sequence processing to capture long-term
patterns in any sequence. We will now continue to look at a few more interesting
applications of LSTMs.
In FFNNs, we created deeper networks by adding more hidden layers between the input and
output layers. Similarly, in RNNs, deep networks can be built by stacking RNN cells on top of
each other, as shown in Figure 12.9. While both the cell state and hidden state are passed
along the chain (horizontally), only the hidden state is passed from one layer to the next
(vertically). The hidden states of the cells in the last layer become the network outputs.
Figure 12.9: Representation of a deep RNN in rolled and unrolled format. The hidden state
(h) is passed from one layer to the next.
As we did for the shallow RNN, we can ignore all outputs except that from the last step to obtain
a sequence-to-vector network in deep RNNs as well. We will see one such application for fault
classification in a large-scale system in the next section.
The way process variables evolve in the presence of a process fault can give crucial clues
about the nature/kind of the process fault. Therefore, a sequence of process values (rather than
just a single snapshot of current process values) is often used as input for fault classification.
RNNs are well suited for classification using sequence data. To illustrate this, we will use data
from the Tennessee Eastman Process (TEP), a large-scale chemical manufacturing process.
In Chapter 6, we used simulated TEP data for fault detection and classification using static
data to introduce the ICA and FDA methods. We will again use simulated data, but from another
source63 where a much bigger dataset is available. The dataset still contains 21 different fault
classes (along with no-fault operation); however, for each fault class, 500 separate
simulation runs were conducted for both the training and test datasets. Each training simulation
run contains 500 time samples from 25 hours of operation and each test run contains 960
time samples from 48 hours of operation. Figure 12.10 shows temporal profiles for the first 10
signals in the dataset for complete simulation runs with no fault, with a fault of class 1, and
with a fault of class 8.
Figure 12.10: Dynamic profile of first 10 signals for complete simulation runs with no fault,
with fault of class 1, and with fault of class 8.
63. Reith, C.A., B.D. Amsel, R. Tran, and B. Maia, Additional Tennessee Eastman process simulation data for anomaly detection evaluation, Harvard Dataverse, Version 1, 2017.
Figure 12.10 suggests that just the end values of the process variables may not accurately
indicate the specific process fault; instead, the temporal profiles hold the key to accurate fault
classification. Our objective is to predict the fault class (including the no-fault class) using the
complete simulation run data. Let's begin by loading the training data into Python. The original
data is in .RData format, which we can read using the pyreadr package.
# read data
import pyreadr
fault_free_training_data = pyreadr.read_r('TEP_FaultFree_Training.RData')['fault_free_training']
# pandas dataframe
fault_free_testing_data = pyreadr.read_r('TEP_FaultFree_Testing.RData')['fault_free_testing']
faulty_training_data = pyreadr.read_r('TEP_Faulty_Training.RData')['faulty_training']
faulty_testing_data = pyreadr.read_r('TEP_Faulty_Testing.RData')['faulty_testing']
The snippet below gives a quick overview of the faulty_training_data dataframe. All four
dataframes follow the same column arrangement.
• Column 1 (faultNumber) denotes the fault class, ranging from 0 (fault-free) to 20
• Column 2 (simulationRun) denotes the simulation run, ranging from 1 to 500
• Column 3 (sample) denotes the sampling instances for a simulation run, ranging from
1 to 500 for training data and 1 to 960 for test data
• Columns 4 to 55 contain the 52 sensor measurements
Before we proceed, we will remove some fault classes from our dataset because these faults
are not recognizable64.
# remove fault classes 3, 9, 15 from faulty datasets
faulty_training_data = faulty_training_data[faulty_training_data['faultNumber'] != 3]
faulty_training_data = faulty_training_data[faulty_training_data['faultNumber'] != 9]
faulty_training_data = faulty_training_data[faulty_training_data['faultNumber'] != 15]
64. Chemical Process Fault Detection Using Deep Learning, www.mathworks.com.
Next, we set aside some simulation runs as a validation dataset. Moreover, only a small fraction
of the faulty simulation data is used for training and validation. In typical real-life scenarios,
the number of faulty instances is small and, therefore, it would be unrealistic to assume equal
numbers of fault-free and faulty samples. The code below accomplishes this and generates
the final training, validation, and test datasets.
# separate validation dataset out of training dataset and create imbalanced faulty dataset
fault_free_validation_data = fault_free_training_data[fault_free_training_data['simulationRun'] > 400]
fault_free_training_data = fault_free_training_data[fault_free_training_data['simulationRun'] <= 400]
faulty_validation_data = faulty_training_data[faulty_training_data['simulationRun'] > 490]
faulty_training_data = faulty_training_data[faulty_training_data['simulationRun'] <= 50]
# convert to numpy
fault_free_training_data = fault_free_training_data.values
fault_free_validation_data = fault_free_validation_data.values
fault_free_testing_data = fault_free_testing_data.values
faulty_training_data = faulty_training_data.values
faulty_validation_data = faulty_validation_data.values
faulty_testing_data = faulty_testing_data.values
Sensor measurements, which start from the 4th column onwards, are the input variables, and
the fault number in the 1st column is our output variable. We separate out the input variables
and scale them, as sketched below.
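# combine fault-free and faulty runs into single arrays (a sketch of the assembly
# implied by the surrounding code)
training_data = np.concatenate((fault_free_training_data, faulty_training_data))
validation_data = np.concatenate((fault_free_validation_data, faulty_validation_data))
testing_data = np.concatenate((fault_free_testing_data, faulty_testing_data))
# inputs: sensor measurements from the 4th column onwards
X_train = training_data[:, 3:]
X_val = validation_data[:, 3:]
X_test = testing_data[:, 3:]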
y_train = training_data[:,0]
y_val = validation_data[:,0]
y_test = testing_data[:,0]
# scale data
from sklearn.preprocessing import StandardScaler
X_scaler = StandardScaler()
X_train_scaled = X_scaler.fit_transform(X_train)
X_val_scaled = X_scaler.transform(X_val)
X_test_scaled = X_scaler.transform(X_test)
Now, as we did in the system identification example, we need to re-arrange the data into
sequence form. In the TEP dataset, each simulation run forms a complete sequence. For
example, each simulation run in the training dataset with 500 samples is a sequence with 500
time steps, as sketched below.
Before we fit the model, a final pre-processing step is done to convert our categorical output
labels into one-hot encoded form. This converts the 1-D output vectors into 21-D matrices.
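# one-hot encode the fault labels (a sketch; variable names are assumptions)
from tensorflow.keras.utils import to_categorical
y_train_sequence_OHE = to_categorical(y_train_sequence, num_classes=21)
y_val_sequence_OHE = to_categorical(y_val_sequence, num_classes=21)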
Alright, we are now ready to define our RNN model. The model's topology is shown below.
Since the output fault classes are mutually exclusive, we will use a softmax output layer. Also,
we specify the return_sequences=True parameter in the first LSTM layer to ensure that the
sequence output is passed from the 1st LSTM layer to the 2nd.
# import packages
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.layers import LSTM
from tensorflow.keras import regularizers
# define model
model = Sequential()
model.add(LSTM(units=128, kernel_regularizer=regularizers.L1(0.0001), return_sequences=True,
               input_shape=(nTimeStepsTrain, 52)))   # 1st LSTM layer with sequence output
model.add(LSTM(units=64, kernel_regularizer=regularizers.L1(0.0001)))   # 2nd LSTM layer without sequence output
model.add(Dense(21, activation='softmax'))   # softmax output layer with 21 neurons
We can now compile and fit our model. A few things are noteworthy in the code below: the
categorical_crossentropy loss, which is suitable for multiclass classification, is used, and an
additional 'accuracy' metric is specified, which will be tracked (along with the loss) during
model fitting.
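# compile and fit model (a sketch; epoch and batch-size settings are assumptions)
model.compile(loss='categorical_crossentropy', optimizer='Adam', metrics=['accuracy'])
history = model.fit(X_train_sequence, y_train_sequence_OHE, epochs=20, batch_size=32,
                    validation_data=(X_val_sequence, y_val_sequence_OHE))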
Figure 12.11 shows that almost 100% accuracies are obtained on both the training and
validation datasets, suggesting the adequacy of our model.
plt.plot(history.history['accuracy'], label='train')
plt.plot(history.history['val_accuracy'], label='val')
plt.title('Validation Curves: Accuracy')
plt.xlabel('Epoch'), plt.ylabel('Accuracy')
Figure 12.11: Evolution of losses and accuracies during model fitting for TEP fault
classification
Let's see how well we can classify the process faults in the test dataset. We will visualize the
results using a confusion matrix.
# confusion matrix for test data
from sklearn.metrics import confusion_matrix
Y_test_sequence_pred = model.predict(X_test_sequence)
y_test_sequence_pred = np.argmax(Y_test_sequence_pred, axis=1)
conf_matrix = confusion_matrix(y_test_sequence, y_test_sequence_pred, labels=list(range(21)))
Figure 12.12: Confusion matrix for TEP fault classification with test dataset
Figure 12.12 shows that the RNN model can correctly identify most of the fault classes. It
faces some difficulty with classes 8, 10, and 16. For example, several simulations with fault 8
are identified as having fault 12. Overall, the model gives ~88% accuracy. Note that the number
of time steps in the training and test sequences does not have to be the same, as shown in this
example where each test sequence has 960 time steps.
Hopefully, this example gave you a good understanding of how to deploy an RNN model for
fault classification using dynamic data. In the next section, we will study another very
important application of RNNs: predictive maintenance.
The process-fault detection models that we have built till now have relied on using the latest
(and recent past) process data to detect process faults that have already occurred. The process
reliability personnel thus get a relatively short amount of time to prepare for maintenance.
Wouldn't it be great if our models could predict the impending failure of any equipment well in
advance (a week or maybe a month)? Such models have been successfully deployed in the
process industry and are referred to as predictive maintenance (PdM) models.
PdM models use historical failure data to extract dynamic patterns as indicators of
failures in the future. In fact, models can be built to predict the precise failure time in the future.
The time until failure of an equipment is also referred to as its remaining useful life (RUL). RNNs
are among the most useful techniques available at our disposal to tackle RUL estimation or
failure prediction problems.
For illustration, we will use an aircraft gas turbine engine dataset which consists of dynamic
data from multiple sensors for several engine operation simulations. Each simulation starts with
an engine (with a different degree of initial wear) operating within normal limits. Engine
degradation starts at some point during the simulation and continues until engine failure.
The training dataset contains complete data until engine failure, while the test dataset contains
data until some point prior to failure. The actual RUL has been provided for the test dataset. Our
objective is to develop a PdM model to predict engine failure using the simulation data in the
test dataset.
Maintenance Strategies
Maintenance strategies in the process industry have been strongly influenced by
advances in ML and sensor technologies. They have evolved from time-based preventive
maintenance, to proactive monitoring-based condition-based maintenance (CBM),
and now to advanced prediction-based predictive maintenance (PdM).
Figure 12.13: Sensor readings from the training dataset (left) and test dataset (right). The actual
RUL for the shown engine ID 90 is 28, as provided in RUL_FD001.txt.
# training data
train_df = pd.read_csv('PM_train.txt', sep=" ", header=None)
train_df.drop(train_df.columns[[26, 27]], axis=1, inplace=True) # last two columns are blank
train_df.columns = ['EngineID', 'cycle', 'OPsetting1', 'OPsetting2', 'OPsetting3', 's1', 's2', 's3',
's4', 's5', 's6', 's7', 's8', 's9', 's10', 's11', 's12', 's13', 's14', 's15', 's16', 's17', 's18', 's19', 's20', 's21']
# test data
test_df = pd.read_csv('PM_test.txt', sep=" ", header=None)
test_df.drop(test_df.columns[[26, 27]], axis=1, inplace=True)
test_df.columns = ['EngineID', 'cycle', 'OPsetting1', 'OPsetting2', 'OPsetting3', 's1', 's2', 's3',
's4', 's5', 's6', 's7', 's8', 's9', 's10', 's11', 's12', 's13', 's14', 's15', 's16', 's17', 's18', 's19', 's20', 's21']
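# true RUL data for the test engines (a sketch; the file name is an assumption based on
# the referenced repository65)
truth_df = pd.read_csv('PM_truth.txt', sep=" ", header=None)
truth_df.drop(truth_df.columns[[1]], axis=1, inplace=True)   # remove redundant 'nan' column
truth_df.columns = ['finalRUL']
truth_df['EngineID'] = truth_df.index + 1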
While most of the code above is self-explanatory, the last part reads the true RUL data,
removes the redundant column with 'nan' data, and adds an additional column, 'EngineID'.
65. https://github.com/umbertogriffo/Predictive-Maintenance-using-LSTM
Next, we will do some dataframe manipulations to compute the RUL and the failure label at any
given cycle for an engine. The code below adds the corresponding columns to the training and
test dataframes.
# training dataset
maxCycle_df = pd.DataFrame(train_df.groupby('EngineID')['cycle'].max()).reset_index()
maxCycle_df.columns = ['EngineID', 'maxEngineCycle']
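# merge the max cycle into train_df and compute the RUL at every cycle (a sketch of the
# computation described above)
train_df = train_df.merge(maxCycle_df, on='EngineID', how='left')
train_df['engineRUL'] = train_df['maxEngineCycle'] - train_df['cycle']
train_df.drop('maxEngineCycle', axis=1, inplace=True)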
w1 = 30
train_df['binaryLabel'] = np.where(train_df['engineRUL'] <= w1, 1, 0 )
# compute maxEngineCycle for test data using data from test_df and truth_df
maxCycle_df = pd.DataFrame(test_df.groupby('EngineID')['cycle'].max()).reset_index()
maxCycle_df.columns = ['EngineID', 'maxEngineCycle']
truth_df['maxEngineCycle'] = maxCycle_df['maxEngineCycle'] + truth_df['finalRUL']
truth_df.drop('finalRUL', axis=1, inplace=True)
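# merge into test_df and compute the RUL and binary labels for the test data (a sketch)
test_df = test_df.merge(truth_df, on='EngineID', how='left')
test_df['engineRUL'] = test_df['maxEngineCycle'] - test_df['cycle']
test_df.drop('maxEngineCycle', axis=1, inplace=True)
test_df['binaryLabel'] = np.where(test_df['engineRUL'] <= w1, 1, 0)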
Next, we will create the sequence samples. For each engine, any continuous block of 50
cycles forms a sequence; the output label for a sequence is decided based on whether the
engine's RUL at the end of the sequence is more than 30 cycles or not. To accomplish this, we
will define a utility function as shown below.
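# a sketch of the utility described above (names and the exact feature list are assumptions;
# scaling of the features is omitted here for brevity)
nSequenceSteps = 50
feature_cols = ['OPsetting1', 'OPsetting2', 'OPsetting3'] + ['s'+str(i) for i in range(1, 22)]

def gen_sequences(engine_df, seq_length, feature_cols, label_col):
    # generate all continuous sequences of seq_length cycles for one engine
    X_seq, y_seq = [], []
    data_array = engine_df[feature_cols].values
    label_array = engine_df[label_col].values
    for start in range(data_array.shape[0] - seq_length + 1):
        X_seq.append(data_array[start:start+seq_length, :])
        y_seq.append(label_array[start+seq_length-1])   # label at the sequence end
    return X_seq, y_seq

# assemble the training sequences engine-by-engine
X_sequence, y_sequence = [], []
for engineID in train_df['EngineID'].unique():
    X_eng, y_eng = gen_sequences(train_df[train_df['EngineID'] == engineID],
                                 nSequenceSteps, feature_cols, 'binaryLabel')
    X_sequence.extend(X_eng)
    y_sequence.extend(y_eng)
X_sequence, y_sequence = np.array(X_sequence), np.array(y_sequence)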
We are now ready to build and compile our RNN. The topology is similar to the previous fault
classification example, except that here we use a single-neuron output layer with a sigmoid
activation function. Moreover, for regularization, we utilize the dropout technique.
# define model
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense, Dropout, LSTM
model = Sequential()
model.add(LSTM(units=100, return_sequences=True, input_shape=(nSequenceSteps, 24)))
model.add(Dropout(0.2))
model.add(LSTM(units=50))
model.add(Dropout(0.2))
model.add(Dense(1, activation='sigmoid'))

# compile model
model.compile(loss='binary_crossentropy', optimizer='Adam', metrics=['accuracy'])
Before we fit our RNN, we need to set aside some sequences for validation. If we look at the
distribution of class labels in the training sequences, we will notice class imbalance. As a best
practice, the validation dataset should be chosen with the same class distribution. This is easily
done in sklearn by using the stratify parameter, as shown below.
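# stratified split so that the validation set preserves the class distribution
# (a sketch; the split fraction and random seed are assumptions)
from sklearn.model_selection import train_test_split
X_train_seq, X_val_seq, y_train_seq, y_val_seq = train_test_split(
    X_sequence, y_sequence, test_size=0.2, stratify=y_sequence, random_state=100)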
We can now fit the model. The validation curves show that we can obtain almost 100%
accuracy on the training dataset.
For the test data, we will create one sequence per engine using the last 50 cycles. The confusion
matrix shows that we can correctly predict failure in 84% of the cases. Hyperparameter tuning
may lead to even better performance.
# input/output test sequences (only the last sequence is used to predict failure)
X_test_sequence = []
y_test_sequence = []
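# a sketch completing the loop described above (variable names are assumptions)
for engineID in test_df['EngineID'].unique():
    engine_df = test_df[test_df['EngineID'] == engineID]
    if engine_df.shape[0] >= nSequenceSteps:   # only engines with at least 50 cycles
        X_test_sequence.append(engine_df[feature_cols].values[-nSequenceSteps:, :])
        y_test_sequence.append(engine_df['binaryLabel'].values[-1])
X_test_sequence = np.array(X_test_sequence)
y_test_sequence = np.array(y_test_sequence)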
In the current example, we attempted to predict whether an engine will fail within the next 30
cycles. In the next subsection, we will try to predict the numeric RUL of each engine directly.
The code will largely remain the same except for a few tweaks, sketched below.
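A minimal sketch of the tweaks (our illustration of the changes implied by the text: a linear output neuron, a regression loss, and the numeric RUL as the label) is shown below:

# regression variant: predict the numeric RUL instead of a binary label
model = Sequential()
model.add(LSTM(units=100, return_sequences=True, input_shape=(nSequenceSteps, 24)))
model.add(Dropout(0.2))
model.add(LSTM(units=50))
model.add(Dropout(0.2))
model.add(Dense(1))                             # linear output for numeric RUL
model.compile(loss='mse', optimizer='Adam')     # regression loss replaces cross-entropy
# sequence labels would now be generated from the 'engineRUL' column instead of 'binaryLabel'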
Figure 12.14 compares the actual versus predicted RUL values for the test engine dataset. It
is apparent that the model performs satisfactorily. Such accurate estimation of the remaining
useful life of process equipment can be of great assistance in maintenance planning and in
avoiding the costs of unexpected equipment failures.
Figure 12.14: Predicted vs observed engine RULs for test aircraft engine dataset
It won't be surprising if you are already thinking about all the different applications where you
can use the predictive capabilities of RNNs (or, more specifically, LSTMs)! Such enthusiasm is
not unwarranted; however, remember that a key requirement for a successful predictive
solution is the availability of a historical database of failure events. Moreover, the available
sensor readings should show some degradation pattern as an indication of impending failure.
As long as these two data requirements are met, you can achieve your predictive objectives.
Summary
In this chapter, we studied the application of ANNs to dynamic/temporal systems. We learnt
about the remarkable capabilities of LSTMs in extracting patterns from long sequences. We
implemented RNNs for a few typical applications in PSE, viz., system identification, fault
classification, and failure prediction. After working out these examples, you will have gained
a very good understanding of how to set up RNN-based solutions for your specific problems.
In the next chapter, we will study reinforcement learning, which is another powerful ANN-based
technique.
Chapter 13
Reinforcement Learning
The ML algorithms we have learnt till now rely upon the supply of all modeling-relevant
input/output data prior to model training. In contrast, reinforcement learning (RL) takes ML a
step further and is designed to collect the required training data by itself through interactions
with the physical system that it is trying to learn about. Through trial and error, an RL model
learns what actions to take to accomplish any given task. During training, actions that result
in favorable results/rewards get reinforced and, after multiple interactions, the model
eventually learns an optimal action plan/policy! Sounds impressive, right?
RL mimics how we humans learn things (such as riding a bicycle) through trial & error and
environment interactions. This concept opens up a plethora of potential RL applications. You
have probably already heard of or seen some of the remarkable feats achieved by RL models,
such as computers playing games better than the best human players or humanoid robots
learning how to run. In this chapter, we will focus on process industry-related applications
of RL, specifically for process control.
RL is a very broad and constantly evolving field, with a lot of RL-specific terminology and
concepts. We will declutter the world of RL in this chapter, and you will learn how to set up
and solve an RL problem. Specifically, this chapter covers the following topics
RL is the branch of ML wherein an agent repeatedly interacts with its environment to learn the
best way to accomplish a task, i.e., the optimal action policy. Figure 13.1 shows the basic setup
of reinforcement learning. As shown, the RL agent receives information about the current
state (st) of the environment, based on which an action (at) is decided (as per the agent's action
policy). As a result of the action, the environment moves to a new state (st+1) and generates
a scalar reward (rt+1) indicating how good the taken action was. Before the agent takes another
action, the learning algorithm uses the information (st, at, rt+1, st+1) to improve its policy, and
then the cycle continues. Eventually, an optimal policy is obtained that maps environment
states to optimal actions such that the total reward earned until task completion is maximized.
Once trained, the learning process can be stopped and the policy function deployed.
Figure 13.1: Reinforcement learning setup depicting an agent’s interactions with its
environment
A simple real-life analogy could be the task of finding the optimal driving route from your office
to your home in a new town. Here, you are the agent, and the environment comprises your car,
the city's road and highway network, traffic, weather, your geospatial location, and basically
everything excluding you. The total reward to maximize could be the negative of the time
taken to reach home (less driving time is better). While driving, depending on the
environment state, you would take decisions on whether to take a highway exit or make
a turn. Assuming no internet (and no Google Maps!), being new in town, you would not
know if taking those exits or turns would help you reach home faster, and therefore you would
explore different possible routes. After several trials, you would gain a good understanding of
the town and eventually would be able to take the optimal action for any given
environment state at any point during the drive. RL follows the same methodology to find the
optimal mapping using some systematic (and very smart) learning algorithms.
In the process industry, although PID and MPC controllers are well-established, their
shortcomings are well-known. While PID controllers perform unsatisfactorily for complex
nonlinear systems, MPCs solve online optimization problems using process models, which
makes online action computation infeasible for large-scale nonlinear systems. Moreover, both
these controllers suffer performance degradation (due to changing process conditions and
process drift) over time, necessitating regular maintenance. Controller maintenance entails
re-identification of process models, which can be time and resource intensive and may require
interference with normal plant operations for training data collection.
Given the aforementioned issues with the current state of the art in process control and the recent successes of RL, interest in leveraging RL technology for process control has
been reignited. Several recent studies66 have demonstrated how RL-based controllers can
provide superior performance. Not requiring online optimization (because optimal action
policy is pre-computed), easy adaptation of action policy under changing environment by
66 Cassol et al., Reinforcement learning applied to process control: A Van der Vusse reactor case study, Computer Aided Chemical Engineering, 2018; Rajesh Siraskar, Reinforcement learning for control of valves, Machine Learning with Applications, 2021; Ma et al., Continuous control of a polymerization system with deep reinforcement learning, Journal of Process Control, 2019
continued learning with new process data are some of the characteristics that make RL-based controllers very promising.
You may be slightly concerned about training an RL agent in the real plant
environment because the generated actions in early stages of training can
be ‘very bad’ and even unsafe. Moreover, for complex systems, thousands
of interactions may be required to reach even a reasonably-good policy;
plant managers will certainly never agree to this! These are valid concerns
and therefore, the common practice is to use a sufficiently accurate model
of the plant and train RL agent offline in a simulated environment. Once
offline learning is complete, the learning process can be turned off and RL
agent deployed in real plant. There are, however, a few good reasons to
keep learning on (continually or sporadically) post-deployment. First, your
plant model will probably not be 100% accurate. Therefore, the RL agent
may use some online interactions to fine-tune its policy. Second, as alluded
to before, the plant behavior may change over time and the agent will need
to tweak its policy to re-adjust to changes in its environment.
We have showered enough praises on RL. Let’s now get down to understanding how RL
actually works. For this, we will first learn some RL terminology and concepts. A quick
disclaimer here that some of these new concepts may seem ‘abstract’ and not immediately
useful. But, as you read through this chapter, we promise that all dots will connect and their
utility will become ‘obvious’.
Markov decision process (MDP)

The agent-environment interaction in RL is typically modeled as a Markov decision process (MDP), wherein the environment's transition to the next state (st+1) depends only on the most recent state (st) and action (at). Such a memoryless characteristic is important in
RL because it allows the agent to only consider the current state when deciding the next
optimal action and not worry about what actions were previously taken to reach st.
Figure 13.3: Transitions in an MDP depend only on the current state and action
In an MDP, the selection of features that completely characterize the state of the environment (and on whose basis the agent acts) becomes crucial. For example, consider the following problem of controlling the liquid level (l) in a tank at a setpoint (lsp):
Here, l, lsp, and the rate of change of l could be used to define the environment state. lsp can
also be substituted with l - lsp as an alternative formulation. Depending on your problem
formulation, data from the past may also be included in the state vector to facilitate the agent’s
decision making.
The agent's job is to take actions that maximize the cumulative reward (also called the return) over the course of the task. The return can be represented as $R_t = \sum_{k=1}^{T-t} r_{t+k}$. However, this direct summation can cause trouble for tasks which don't have a definite end state. The level control task is one such example where $T = \infty$ (because the task is to control the level forever!). To keep $R_t$ bounded, the following formulation is used instead

$$R_t = \sum_{k=1}^{\infty} \gamma^{k-1}\, r_{t+k} = r_{t+1} + \gamma\, r_{t+2} + \gamma^2\, r_{t+3} + \cdots \qquad (1)$$
The return in Eq. (1) is called the discounted return, with discount factor $\gamma \in (0, 1)$. The farther a reward lies in the future, the less importance is accorded to it. A $\gamma$ close to 0 makes the agent consider only the immediate reward ($r_{t+1}$) when choosing $a_t$, while a $\gamma$ close to 1 makes long-term rewards important.
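The effect of the discount factor is easy to check numerically. The short sketch below evaluates Eq. (1) for a made-up reward sequence and two values of γ.

# numerical illustration of the discounted return in Eq. (1); the reward sequence is hypothetical
import numpy as np

rewards = np.array([1.0, 0.0, 1.0, 1.0])                     # r_{t+1}, r_{t+2}, r_{t+3}, r_{t+4}
for gamma in (0.1, 0.9):
    R_t = np.sum(gamma**np.arange(len(rewards)) * rewards)   # sum of gamma^(k-1) * r_{t+k}
    print(f'gamma = {gamma}: R_t = {R_t:.3f}')               # small gamma ~ only the immediate reward counts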
Reward

The reward signal encodes the task objective. For the level control problem, for instance, a natural choice is to penalize the squared deviation of the level from its setpoint:

$$r_t = -(l_t - l_{sp})^2$$

Appropriate crafting of the reward function is very important as it strongly influences the agent's decisions and RL training convergence.
Policy
A policy is a rule or mapping used by an RL agent to determine at given st. A policy can be
deterministic or stochastic. A deterministic policy provides a single action

$$a_t = \mu(s_t)$$

while a stochastic policy ($\pi$) provides a probability distribution over the set of actions

$$a_t \sim \pi(\cdot \mid s_t)$$

The goal in RL is to learn a policy which provides maximum return. The optimal policy is often denoted as $\mu^*$ or $\pi^*$.
Value function
RL algorithms often make use of value functions. The state value function, $V_\mu(s)$, gives the return an agent expects to receive from being in state $s$ and acting according to some policy $\mu$:

$$V_\mu(s) = E[R_t \mid s_t = s,\ \text{policy } \mu]$$

The state-action value function, $Q_\mu(s, a)$, is another useful function that gives the expected return from taking an arbitrary action $a$ (not necessarily following policy $\mu$) from state $s$ and then following policy $\mu$:

$$Q_\mu(s, a) = E[R_t \mid s_t = s,\ a_t = a,\ \text{policy } \mu]$$

When the optimal policy, $\mu^*$, is followed, the value functions are denoted as $V^*(s)$ and $Q^*(s, a)$. In the above definitions, we take expectations ($E[\cdot]$) of the returns because real-life environments are uncertain and stochastic, i.e., taking an action $a_t$ at state $s_t$ can result in slightly different $s_{t+1}$ in different iterations. This transition uncertainty can be denoted by the probability $P(s_{t+1} = s' \mid s_t = s, a_t = a)$. The above definitions incorporate this transition probability during return computation.
Bellman equation
The value functions can also be represented in a recursive form. Consider the optimal state value function:

$$\begin{aligned}
V^*(s) &= E[R_t \mid s_t = s, \mu^*] \\
&= E[r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \cdots \mid s_t = s, \mu^*] \\
&= E[r_{t+1} + \gamma R_{t+1} \mid s_t = s, \mu^*] \\
&= E[r_{t+1} + \gamma V^*(s_{t+1}) \mid s_t = s, \mu^*]
\end{aligned}$$

Note again that $s_{t+1} \sim P(s_{t+1} = s' \mid s_t = s, a_t = a)$ is sampled following the environment's transition probability. The above recursive equations, also called Bellman equations, are central to several RL algorithms. The Bellman equation simply states that the value of the current state (or state-action pair) is the sum of the immediate reward and the (discounted) value of whatever state the environment lands in next.
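An analogous recursion holds for the optimal state-action value function, which is used heavily in the next section:

$$Q^*(s, a) = E\left[r_{t+1} + \gamma \max_{a'} Q^*(s_{t+1}, a') \,\middle|\, s_t = s,\ a_t = a\right]$$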
Alright, we now have enough fundamentals in place to start taking a deeper look into RL
algorithms for optimal policy learning. Let's begin with one of the most popular techniques: Q-
learning.
Q-learning refers to the family of RL algorithms designed for estimating the $Q^*(s, a)$ values. Once the optimal value function is available, an optimal policy can be framed as simply picking the action corresponding to the highest value for the current state, as shown below:

$$a_t = \arg\max_{a} Q^*(s_t, a)$$
To see how $Q^*$ may be obtained, consider an MDP with discrete state and action spaces. We can represent the Q function as a Q-table67.
To find the optimal values for each state/action pair, all the values in the Q-table are first initialized to 0, some initial state is arbitrarily chosen, agent-environment interaction starts, and then the Bellman equation is used in the following iterative form

$$Q^{*,new}(s_t, a_t) \leftarrow (1 - \alpha)\, Q^{*,old}(s_t, a_t) + \alpha \left( r_{t+1} + \gamma \max_{a'} Q^{*,old}(s_{t+1}, a') \right)$$

or

$$Q^{*,new}(s_t, a_t) \leftarrow Q^{*,old}(s_t, a_t) + \alpha \left( r_{t+1} + \gamma \max_{a'} Q^{*,old}(s_{t+1}, a') - Q^{*,old}(s_t, a_t) \right) \qquad \text{(Eq. 2)}$$

where $\alpha$ is the learning rate.
In the pseudo-code above, there are two terms whose meaning may not be immediately clear
to you, namely, ‘episode’ and ‘terminal state’. An episode simply refers to one complete
simulation or experiment. For example, one instance of a drive from your office to home would
be one episode. A terminal state marks the end of an episode. In the driving example, arrival
at home or running out of fuel could be terminal states. For tasks that may not naturally end in a terminal state (such as controlling the fluid level), a maximum number of steps is pre-specified per episode during training.
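For concreteness, here is a minimal Python sketch of the tabular Q-learning loop just described, with an ε-greedy behavior policy. The environment object and its reset()/step() interface are assumed (Gym-style), not taken from the text.

import numpy as np
import random

def tabular_q_learning(env, n_states, n_actions, n_episodes=500,
                       alpha=0.1, gamma=0.9, epsilon=0.1, max_steps=200):
    Q = np.zeros((n_states, n_actions))          # initialize all Q values to 0
    for episode in range(n_episodes):
        s = env.reset()                          # reset to some initial state
        for _ in range(max_steps):               # step cap for continuing tasks
            # epsilon-greedy action selection
            if random.random() < epsilon:
                a = random.randrange(n_actions)
            else:
                a = int(np.argmax(Q[s]))
            s_next, r, done = env.step(a)
            # Bellman-based update (Eq. 2)
            Q[s, a] += alpha * (r + gamma * np.max(Q[s_next]) - Q[s, a])
            s = s_next
            if done:                             # terminal state ends the episode
                break
    return Q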
Simple illustration
If you are not yet entirely convinced about how taking some random actions can help an agent
arrive at an optimal Q-table and an optimal policy, then let’s consider a hypothetical control
problem. We will take our level control problem and consider only the following 4 symbolic
states
The RL agent here has 2 action choices: close or open the valve by a fixed amount. Closing the valve takes the environment to the next higher fluid-level state; the agent, of course, does not know this to begin with. We will set α = 1 and γ = 0.9. A random action is taken if both actions have the same value; otherwise, the greedy action is taken (ε = 0). Let's now run some mock experiments.
Since the agent has reached a terminal state (s3), the next episode will start. The initial state will be reset, but the Q-table will remain intact.
Before we continue with any further iterations, let's pause and observe what we have achieved in just 4 iterations. The current $Q^*$ is already good enough to give the desired optimal policy! In state s1, the RL agent will close the valve because of its higher expected return; in state s2, the agent will again close the valve. With further iterations, the $Q^*$ values will get closer to more accurate estimates, but the optimal policy will remain the same.
The above illustration should have made the Q-learning algorithm clear to you. It is simple to
implement yet is quite powerful. A direct industrial application of Q-learning can be seen in
the work of Syafiie et al. for automatic control of pH neutralization processes68.
68 Syafiie et al., Model-free control of neutralization processes using reinforcement learning, Engineering Applications of Artificial Intelligence, 2007
RL Taxonomy
Q-learning, although a breakthrough RL algorithm, is just one of the many RL algorithms currently out there. Because many of them borrow ideas from each other, it is difficult to comprehensively list and bucket all of them into reasonable categories. Nonetheless, the figure below presents a non-comprehensive picture of the current RL algorithm landscape.
Figure 13.4 focuses on model-free algorithms as these are more popular due to their ease of implementation. Another broad categorization shown is in terms of whether an agent tries to learn an optimal policy directly or indirectly through estimation of $Q^*$ first. Extremely powerful algorithms have recently emerged by combining certain features of policy optimization and Q-learning algorithms. Among these, DDPG has been used in several promising studies on RL-based process control and is therefore covered later in the chapter.
While tabular Q-learning is an elegant method, it is unfortunately not suitable for MDPs with
high-dimensional continuous state and/or action spaces. Consider the fluid level control
problem with fluid level as its state and valve adjustments as action. Here, fluid level and valve
opening can attain any value from 0% to 100%. For tabular Q-learning, we may discretize these variables crudely into 100 intervals each. This already leads to a 100 × 100 Q-table. With finer discretization and more states/actions, the Q-table can become very large and unmanageable. Unfortunately, most process control problems face these issues.
Deep Q-learning
One solution to continuous space problems is to use a function that approximates the Q-table
and returns the Q-value for any state-action pair. When a deep neural network is used as the
function (as shown below), the methodology is called deep Q-learning and the neural network
is called deep Q-network (DQN).
[Figure: schematic of a deep Q-network (DQN) approximating the Q function]
Like tabular Q-learning, the parameters $w$ of the Q-network are updated iteratively such that the output $Q_w(s_t, a_t)$ gets closer to its target value. But what's the target value when the optimal $Q^*(s_t, a_t)$ is unknown? The Bellman equation is used to provide what is called a TD target or bootstrap69 target ($\tilde{y}_t$). Let $(s_t, a_t, r_{t+1}, s_{t+1})$ be a transition tuple obtained via environment interaction at time $t$; then

$$\tilde{y}_t = r_{t+1} + \gamma \max_{a'} Q_w(s_{t+1}, a')$$
The above TD target is used to derive a network parameter update mechanism as shown
below
$$L_t(w) = \frac{1}{2}\left(\tilde{y}_t - Q_w(s_t, a_t)\right)^2 \qquad \text{(loss function)}$$

$$\frac{\partial L_t}{\partial w} \approx -\left(\tilde{y}_t - Q_w(s_t, a_t)\right)\frac{\partial Q_w(s_t, a_t)}{\partial w}$$

(although $\tilde{y}_t$ depends on $w_t$, this dependence is ignored for this partial derivative)

$$w_t \leftarrow w_t - \alpha \frac{\partial L_t}{\partial w} \qquad \text{(parameter update)}$$
The shown loss function is also called mean squared Bellman error (MSBE) which quantifies
the error in satisfying the Bellman equation. This update scheme should be familiar to you
69 In RL literature, the practice of updating value functions using other value estimates is called bootstrapping
because we had seen this in Chapter 11 for ANNs. The pseudo-code below summarizes the
basic form70 of DQN algorithm
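A minimal Python sketch of one such training update is given below; the Keras Q-network q_net (one output per discrete action) and its optimizer are assumptions for illustration, and termination handling is omitted.

# one basic DQN update step (illustrative sketch; no replay memory or target network)
import numpy as np
import tensorflow as tf

def dqn_update(q_net, optimizer, s, a, r, s_next, gamma=0.99):
    # TD (bootstrap) target; treated as a constant w.r.t. the network weights
    y = float(r + gamma * np.max(q_net.predict(s_next[np.newaxis, :], verbose=0)))
    with tf.GradientTape() as tape:
        q_values = q_net(s[np.newaxis, :])    # Q(s, .) for all actions
        q_sa = q_values[0, a]                 # Q(s, a) for the taken action
        loss = 0.5 * tf.square(y - q_sa)      # mean squared Bellman error
    grads = tape.gradient(loss, q_net.trainable_variables)
    optimizer.apply_gradients(zip(grads, q_net.trainable_variables))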
While DQNs have given remarkable results71, their convergence is not guaranteed72.
Moreover, DQNs are not well suited for process control problems which have continuous action spaces, as the computation of $\max_{a'} Q_w(s_{t+1}, a')$ in each step for a continuous variable $a'$ is quite inconvenient, if not impractical. Note that the behavior policy may include solving such an optimization for choosing $a_t$ as well.
Although we will see a detailed RL-based process control case study using
DDPG algorithm in the next section, it is still useful to conceptually understand
DQNs and policy gradient methods. These are precursors to more modern
techniques, and understanding them will provide crucial insights into the DDPG algorithm.
70 Without considering replay memory and target networks; these are introduced in the next section on DDPG.
71 Mnih et al., Playing Atari with deep reinforcement learning, arXiv preprint, 2013
72 The combination of Q-function approximation, bootstrapping, and off-policy behavior is known as the ‘deadly triad’ and is known to cause divergence
Policy-gradient methods

In policy optimization algorithms, the policy itself is parameterized directly, e.g., as a neural network ($\mu_\theta$ or $\pi_\theta$) with parameters $\theta$. The algorithm behind learning these policy functions is simple: during training, all the
rewards obtained during an episode are collected and then the model parameters (𝜽) are
updated so as to increase returns in subsequent episodes. When the update mechanism
makes use of the gradient of model’s performance w.r.t 𝜽, the algorithm is called policy-
gradient algorithm. The issue with these algorithms is that model training is very slow owing to slow convergence. This is partly because updates occur only at the end of episodes (unlike in Q-learning, where updates occur at each step), since the cumulative reward or return must first be observed.
Actor-Critic framework
The strengths and weaknesses of DQN and policy-gradient algorithms show that they are complementary to each other: the weakness of one is the strength of the other. Policy-gradient algorithms don't have to solve any optimization problem at every iteration, while DQNs don't wait for the end of an episode to update. RL researchers recognized this complementarity and proposed the actor-critic (AC) framework (Figure 13.5), which merges the two algorithms.
Figure 13.5: RL agent training within the AC framework. Note that the actor and critic can take any functional form and need not be neural networks.
During any training step, the critic uses the actor to help evaluate $\max_{a'} Q_w(s_{t+1}, a')$ approximately as $Q_w(s_{t+1}, \mu_\theta(s_{t+1}))$ and therefore avoids the difficult optimization; the actor uses the critic to help evaluate whether its predicted action $\mu_\theta(s_t)$ is good or bad via $Q_w(s_t, \mu_\theta(s_t))$ and therefore updates itself immediately instead of waiting till the episode's end to know whether its actions lead to higher returns. This is how the actor and critic help each other. Let's look at the update mechanism in more detail.
AC update mechanism
Let the actor and critic be neural networks with model parameters $w$ and $\theta$, respectively. The critic update is the same as that in deep Q-learning, except that the actor now supplies the next-state action inside the TD target:

$$\tilde{y}_t = r_{t+1} + \gamma\, Q_w\!\left(s_{t+1}, \mu_\theta(s_{t+1})\right)$$

For the actor network, we use the rationale that when an action $a = \mu_\theta(s_t)$ is predicted, the objective is to maximize the return $Q_w(s_t, \mu_\theta(s_t))$; the actor parameters are therefore updated along the gradient of this quantity:

$$\theta \leftarrow \theta + \alpha_a \frac{\partial Q_w(s_t, \mu_\theta(s_t))}{\partial \theta}$$
Note that since the actor’s objective is to maximize the return, it performs gradient ascent,
unlike the critic which does a gradient descent to minimize its TD error. Pseudo code in
Algorithm 3 summarizes the AC algorithm.
With the fundamentals of the AC algorithm (and the underlying mathematics that powers it) in place, we are ready to learn about the DDPG algorithm, which was among the first RL algorithms suitable for complex industrial process control problems.

In 2016, Lillicrap et al.73 published an actor-critic, model-free algorithm, DDPG, that could successfully handle MDPs with high-dimensional continuous state and action spaces. Although it was based on the previously shown AC architecture (with both critic and actor as neural networks), it brought together several innovative ideas from other works (Silver et al.74, Mnih et al.75, etc.) which greatly improved the robustness and stability of RL agent training.
These innovations include the use of a replay memory, target networks, and exploration noise added to the actor's actions; each of these is discussed below.
73 Lillicrap et al., Continuous control with deep reinforcement learning, arXiv, 2016
74 Silver et al., Deterministic policy gradient algorithms, ICML, 2014
75 Mnih et al., Human-level control through deep reinforcement learning, Nature, 2015
Figure 13.6 shows how and where DDPG incorporates these innovations into the AC
framework of Figure 13.5. Don’t worry if this figure looks confusing. We will dissect these new
concepts one-by-one.
Figure 13.6: RL agent training within DDPG framework. Note that 4 neural networks are used.
Replay memory
In Algorithm 3 we saw that the network parameter update is based on a single transition tuple
(st, at, rt+1, st+1). The problem with this (apart from lower computational efficiency compared to mini-batch updates) is that the tuples used at successive updates are correlated and therefore not independent. This has been found to cause convergence issues.
Replay buffer or replay memory (RM) helps to overcome this issue.
A replay memory is simply a large cache of transition tuples. At any training step, a transition
tuple (s, a, r, s’) is generated according to the agent's behavior policy and stored in the RM. If the RM is already full (the number of tuples equals some prespecified size K), the oldest stored tuple is removed. The actor and critic networks are updated using a random mini-
batch of M (M < K) tuples from the RM. The expressions below show the mini-batch update mechanism for the critic. Let $B = \{(s_i, a_i, r_i, s'_i)\}_{i=1}^{M}$ be the mini-batch of transition tuples sampled from the RM. Then,

$$w \leftarrow w + \frac{\alpha_c}{M} \sum_{i=1}^{M} \left(\tilde{y}_i - Q_w(s_i, a_i)\right) \frac{\partial Q_w(s_i, a_i)}{\partial w} \qquad \text{(Eq. 3)}$$

$$\text{where } \tilde{y}_i = r_i + \gamma\, Q_w\!\left(s'_i, \mu_\theta(s'_i)\right)$$

The actor is updated analogously using the mini-batch average of $\partial Q_w(s_i, \mu_\theta(s_i))/\partial \theta$.
We had previously seen how mini-batch update is superior to stochastic update for ANN
training in Chapter 11. Similar observations have been made for RL as well.
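A replay memory is straightforward to implement; the sketch below uses a deque, with the buffer size K and batch size M as illustrative (assumed) values.

# minimal replay-memory sketch (K and M values are arbitrary)
from collections import deque
import random
import numpy as np

K, M = 100000, 64
replay_memory = deque(maxlen=K)              # oldest tuple is dropped automatically once full

def store_transition(s, a, r, s_next, done):
    replay_memory.append((s, a, r, s_next, done))

def sample_minibatch():
    batch = random.sample(replay_memory, M)  # M random tuples --> decorrelated updates
    s, a, r, s_next, done = map(np.array, zip(*batch))
    return s, a, r, s_next, done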
Target networks
Critic networks are trained by minimizing the error between the predicted Q value and TD
target. However, as we noted previously, the TD target itself depends on the critic network
that we are trying to fit. This makes the AC algorithm prone to divergence. This is resolved by
creating replicas of the actor and critic networks, called target networks, whose parameters
evolve slowly but track the main networks being learnt. Let $Q_{w,target}$ and $\mu_{\theta,target}$ denote the target networks. The TD target for the $i$th transition tuple from the RM at any training step is then given by

$$\tilde{y}_i = r_i + \gamma\, Q_{w,target}\!\left(s'_i, \mu_{\theta,target}(s'_i)\right)$$

After the actor and critic networks have been updated (using Eq. 3), the target networks are updated as follows

$$w_{target} \leftarrow \tau w + (1 - \tau)\, w_{target}$$
$$\theta_{target} \leftarrow \tau \theta + (1 - \tau)\, \theta_{target}$$
where $\tau$ is a hyperparameter between 0 and 1 (usually kept close to 0). The DDPG algorithm, therefore, has 4 neural networks that are fitted simultaneously.
Exploration noise

To ensure adequate exploration of the environment during training, noise is added to the actor's output, giving the behavior policy

$$\beta(s_t) = \mu_\theta(s_t) + \mathcal{N}$$

where $\mathcal{N}$ is an exploration noise process (an Ornstein-Uhlenbeck (OU) process in the original DDPG work).
You are now ready to build your RL-based process controllers using DDPG algorithm. Let’s
revisit the fluid level control problem and look at it in more detail.
Figure 13.7 below shows several details of our level control problem76. As alluded to earlier,
the RL agent controller needs to learn how to control the tank liquid level (specifically, keep it between 47.5% and 52.5%) under the influence of inflow disturbances. This simple example
will hopefully help you understand how to setup RL agent training and its subsequent
deployment.
76 The system, problem, and solution mechanism are adopted from the thesis work: E. R. Mageli, Reinforcement learning in process control, Norwegian University of Science and Technology, 2019.
[Figure 13.7: Setup of the level control problem, defining the environment state, the agent action (valve opening $z$), and the reward. The tank level dynamics are

$$\frac{dl}{dt} = \frac{1}{\pi r^2}\left(q_{in} - z\, A_{out}\sqrt{2\,g\,l}\right)$$

with parameters: tank height 10 m, tank radius ($r$) 7 m, pipe radius 0.5 m, pipe area ($A_{out}$) $\pi \times 0.5^2$ m², and $g$ = 9.81 m/s².]
Let us first define the tank environment77 that the RL agent will interact with. If you are not familiar with using Python classes, don't worry. Most of the code is self-explanatory, and appropriate annotations have been provided where necessary.
# Tank class defined here will be used later in our RL agent training script
import numpy as np
import random
class tank_environment:
    """Create an OpenAI-style tank environment"""
    # constructor signature and geometry attributes reconstructed from Figure 13.7 (defaults assumed)
    def __init__(self, pre_def_dist=False, pre_distFlow=None):
        # tank/pipe geometry
        self.height = 10
        self.radius = 7
        self.pipe_Aout = np.pi*0.5**2
        self.level = 0.5*self.height               # start at 50% level
        self.soft_min_level = 0.475*self.height    # profitable band: 47.5% ...
        self.soft_max_level = 0.525*self.height    # ... to 52.5%
77 A popular Python library, Gym, contains several ready-built environments such as cartpole and pendulum.
        # disturbance related
        self.pre_def_dist = pre_def_dist
        self.pre_distFlow = pre_distFlow    # disturbance flows used during testing
        self.distFlow = [1]                 # stores disturbance flows during a training episode
    def step(self, action):    # method signature reconstructed
        """Run one control interval of the environment's dynamics.
        Args:
            action (valve opening): an action provided by the agent
        Returns:
            state (numpy array of size (3,)): agent's observation of the current environment
            reward (float): amount of reward returned after taking the action
            done (bool): indicates whether the episode has been terminated
        """
        # parameters
        g = 9.81
        # the system is simulated for 5 timesteps per action during training;
        # this was found to help training
        for i in range(5):
            # compute rate of change of tank level
            q_dist = self.get_disturbanceFlow()
            q_out = action*self.pipe_Aout*np.sqrt(2*g*self.level)
            dhdt = (q_dist - q_out)/(np.pi*self.radius*self.radius)
            self.level = self.level + dhdt   # Euler update of the level (1 s timestep assumed; reconstructed)
        # check termination (reconstructed): hard level limits violated
        done = bool(self.level <= 0 or self.level >= self.height)
        # compute reward
        if done:
            reward = -10
        elif self.level > self.soft_min_level and self.level < self.soft_max_level:
            reward = 1
        else:
            reward = 0
    def reset(self):
        """
        Description: Reset tank environment to initial conditions
        Returns: Initial state of the environment
        """
        self.level = 0.5 * self.height
        self.distFlow = [1]
With the behavior of the tank environment defined, we can now start writing the code for agent
training. We will begin by importing some packages and defining some utility functions. If you
have defined the tank_environment class in a separate script file (say,
Tank_Environment.py), you will have to import the class in your current script (as done here).
# import packages
import numpy as np
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense
from Tank_Environment import tank_environment
def ActorNetwork():
    """Actor: maps the 3-dimensional state to a valve-opening action in (0, 1)"""
    model = Sequential()
    model.add(Dense(8, activation='relu', input_shape=(n_states,)))
    model.add(Dense(8, activation='relu'))
    model.add(Dense(n_action, activation='sigmoid'))   # sigmoid keeps the action in (0, 1)
    return model

def CriticNetwork():
    """Critic: maps a concatenated [state, action] vector to its Q-value
    (body reconstructed; layer sizes are assumed)"""
    model = Sequential()
    model.add(Dense(8, activation='relu', input_shape=(n_states + n_action,)))
    model.add(Dense(8, activation='relu'))
    model.add(Dense(1, activation='linear'))           # scalar Q-value
    return model
The above two functions will be used later to create our actor, critic, and the targets networks.
The next function performs a soft update of the target networks.
def update_target_network(network, target_network, tau):   # function name/signature reconstructed
    """Soft update: w_target <- tau*w + (1 - tau)*w_target"""
    new_weights = [tau*w + (1 - tau)*w_t for w, w_t in
                   zip(network.get_weights(), target_network.get_weights())]
    target_network.set_weights(new_weights)
    return target_network
The next function generates OU noise that is added as a disturbance to the actor’s action.
The subsequent figure shows an example of the kind of noise pattern generated by this
mechanism. As is apparent, an OU process shows greater inertia (compared to a wildly fluctuating Gaussian process) as OU noise signals tend to evolve in the same direction for long durations; this promotes deep exploration during the RL agent's training.
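The OU generator itself takes only a few lines; a discretized sketch follows (the parameter values shown are illustrative assumptions, not from the text).

def OU_noise(noise_prev, mu=0, theta=0.15, sigma=0.2, dt=1):
    """One step of a discretized Ornstein-Uhlenbeck process (illustrative parameters)"""
    return noise_prev + theta*(mu - noise_prev)*dt + sigma*np.sqrt(dt)*np.random.randn()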
Next, we create a function to sample a batch of stored transition tuples from the RM. You will
notice that an additional variable ‘done’ is part of a transition tuple here. The significance of
this variable will become clear in the next utility function.
def sample_ReplayMemory():
    """
    Returns:
        states: A 2D numpy array with 3 columns
        actions: A 2D numpy array with 1 column
        rewards: A 1D numpy array
        next_states: A 2D numpy array with 3 columns
        dones: A 1D numpy array
    """
    # draw a random mini-batch of transitions (sampling line reconstructed; batch_size assumed global)
    batch = random.sample(replay_memory, batch_size)
    # separate the states, actions, rewards, next_states, dones from the selected transitions
    states, actions, rewards, next_states, dones = [np.array([transition[field_index]
                                                    for transition in batch]) for field_index in range(5)]
    return states, actions, rewards, next_states, dones
$$\tilde{y}_i = r_i + (1 - done_i)\,\gamma\, Q_{w,target}\!\left(s'_i, \mu_{\theta,target}(s'_i)\right) \quad \text{for each } i\text{th tuple in the mini-batch}$$

The revised formula states the obvious fact that the TD target is simply the reward ($r_i$) if the episode or task terminates ($done_i$ equals 1 or True) upon taking action $a_i$ at state $s_i$. The function below implements this formula.
def compute_TDtargets(rewards, target_Q_values, dones, gamma):   # name/signature reconstructed
    """
    Returns:
        td_targets: A 2D numpy array with 1 column
    """
    td_targets = np.zeros_like(target_Q_values)
    for i in range(target_Q_values.shape[0]):
        if dones[i]:
            td_targets[i] = rewards[i]    # terminal transition: target is just the reward
        else:
            td_targets[i] = rewards[i] + gamma*target_Q_values[i]
    return td_targets
Alright, we are now at the last utility function that we will define. This function updates the
main actor and critic networks using data from the sampled minibatch. There are several new
concepts involved here that warrant some explanations. You will notice the use of tf.function
decorator78 which converts a regular Python function to a Tensorflow function resulting in
faster computations. The input signature specifies the shape and type of each Tensor
argument to the function. Within the function, GradientTape API79 is utilized which implements
automatic differentiation for gradient computations. This API allows Tensorflow to keep a
‘record’ of what operations are executed within the context of the tf.GradientTape. When
78 https://www.tensorflow.org/guide/function
79 https://www.tensorflow.org/guide/autodiff
tape.gradient function is called, Tensorflow computes the gradients of the recorded operations
w.r.t. the specified variables.
@tf.function(input_signature=[tf.TensorSpec(shape=(None, 3), dtype=tf.float64),
                              tf.TensorSpec(shape=(None, 1), dtype=tf.float64),
                              tf.TensorSpec(shape=(None, 1), dtype=tf.float64)])
def update_networks(s, a, td_targets):
    """
    (body reconstructed as an illustrative sketch; the critic is assumed to take
    the concatenated [state, action] vector as input)
    Args:
        s: A 2D numpy array with 3 columns
        a: A 2D numpy array with 1 column
        td_targets: A 2D numpy array with 1 column
    """
    s, a, td_targets = tf.cast(s, tf.float32), tf.cast(a, tf.float32), tf.cast(td_targets, tf.float32)
    # critic update: gradient descent on the MSE between predicted Q values and TD targets
    with tf.GradientTape() as tape:
        Q_pred = critic(tf.concat([s, a], axis=1))
        critic_loss = loss_fn(td_targets, Q_pred)
    critic_grads = tape.gradient(critic_loss, critic.trainable_variables)
    critic_optimizer.apply_gradients(zip(critic_grads, critic.trainable_variables))
    # actor update: gradient ascent on Q(s, mu(s)), implemented as descent on -Q
    with tf.GradientTape() as tape:
        actor_loss = -tf.reduce_mean(critic(tf.concat([s, actor(s)], axis=1)))
    actor_grads = tape.gradient(actor_loss, actor.trainable_variables)
    actor_optimizer.apply_gradients(zip(actor_grads, actor.trainable_variables))
# create the actor, critic, and target networks
actor = ActorNetwork()
target_actor = ActorNetwork()
target_actor.set_weights(actor.get_weights())
critic = CriticNetwork()                       # (reconstructed) critic created analogously
target_critic = CriticNetwork()
target_critic.set_weights(critic.get_weights())

critic_optimizer = keras.optimizers.Adam(lr=0.01)
actor_optimizer = keras.optimizers.Adam(lr=0.01)
loss_fn = keras.losses.MeanSquaredError()
# define environment
from Tank_Environment import tank_environment
env = tank_environment() # __init__ function of the class gets called implicitly here
# inside the episode/step training loop (loop headers and noisy action selection not shown here)
# implement action, get back new state and reward, and save in RM
next_state, reward, done = env.step(action[0])
episode_reward = episode_reward + reward
replay_memory.append((state, action, reward, next_state, done))
rewards.append(episode_reward)
The above script completes the implementation of the RL agent's training via the DDPG algorithm. As is commonly encountered with RL training, hyperparameter tuning proved to be difficult even for our simple tank system. Often, training would converge to a trivial RL controller that would simply fully open and close the valve to control the liquid level. Figure 13.8 shows the performance of the RL agent during testing with a reasonable converged solution. The code below shows how to set up RL testing/deployment.
level_hist = []
valve_opening_hist = []
actor = keras.models.load_model('actor_saved')
state = env.reset()   # (reconstructed) initialize environment state

for step in range(n_steps):
    # take action
    action = actor.predict(state.reshape(1,-1)).flatten()
    # step_test is the same as step except that the system is simulated for only 1 timestep per action
    next_state, reward, done = env.step_test(action[0], step)
    # store
    valve_opening_hist.append(action[0])
    level_hist.append(next_state[0]*env.height)
    # check termination
    if done:
        break
    else:
        state = next_state
Figure 13.8: Evaluation of RL controller for the tank liquid level control
As the plots show, the RL agent is able to keep the liquid level within the most profitable region by nicely modulating the valve opening. For example, around 110 seconds, when the disturbance flow increases sharply, the RL controller opens the valve further to keep the level stable. Remember that the RL agent does not have knowledge of the disturbance's magnitude. It has been trained to take an optimal decision using only information about the current liquid level and the rate of change of the level.
As previously mentioned, RL training may throw up several challenges at you. However, you
do have several recourses. Reformulating your problem can be a worthwhile thing to do. For
example, in the tank level control problem, you may use ∆𝒇 (change in valve opening) as the
action and add the current f as another state variable80. You may also engineer the reward
function differently and penalize the controller for making large action moves. More recent algorithms, such as TD3 and PPO, which attempt to tackle training stability issues, may also be tried.
Hopefully, you have now gained adequate understanding of RL fundamentals to help you
judge your situation appropriately and make wise decisions.
This concludes our quick tour of the world of reinforcement learning. As remarked before, RL
is a vast field and many more algorithms have been proposed since DDPG which are gaining
popularity. The exposition in this chapter will help you get started with RL and think of all the
different ways you can utilize this powerful tool. Unlike other ML technologies, RL has not yet
seen widespread adoption in the process industry (due to long training times, difficulty in
imposing process constraints, requirement of adequate simulator for initial offline training,
etc.). However, considering the amount of ongoing RL research and the increasing availability of cheap computational resources, it is not far-fetched to state that RL will play an
important role in industrial process control in the near future.
Summary
In this chapter, we looked at theoretical and practical aspects of RL. We studied two popular
algorithms namely, Q-learning and DDPG, in detail. We developed a RL-based controller for
controlling the fluid level in a tank. We hope that RL is no longer a mystery for you. With this
chapter, we have also reached the end of our ML journey. During this journey, you picked up
several powerful tools. As our parting message, we would just remind you of the ‘No Free Lunch’ theorem, which states that no single method exists that can obtain the best results all the time.
Hopefully, the knowledge you gained from the book will help you make the right (and
educated) modeling choices for your problems. All the best!
80 Sometimes, to ensure that the state vector has all the needed information, you may need to add some past observations and actions to your state vector. This is analogous to the data augmentation we have seen previously.
Part 4
Deploying ML Solutions Over Web
Chapter 14
Process Monitoring Web Application
You have done all the hard work to obtain a ML model that meets performance criteria and
now it’s time to deliver the solution / model to its end-users who will be using the model’s
results on a regular basis. But how do you do it? If the end-users are non-technical (from data-
science perspective) like plant operators, you cannot ask them to have their own Python
installation to run the ML model. Under such circumstances, deployment over the web is a good and frequently employed solution for delivering ML results.
For web deployment, Python provides several frameworks (like Django, Flask, CherryPy) for developing simple to complex web applications fast. Very often, all you may want is a quick prototype that allows you to collect user feedback. Keeping this in mind, we will show you how to build a light-weight web application from scratch and demonstrate how easy it is to do so.
Specifically, we will build a process monitoring tool that provides 24 X 7 fault detection &
diagnosis (FDD) results to plant operators over web. During the course of building this
solution, we will learn the following topics
Figure 14.1 shows the user interface that we will build in this chapter to display the results in
real-time from a process monitoring ML model. The user interface is accessed via a web browser and provides two crucial pieces of information to the plant operators. First, the fault detection component communicates the state of the process to the operators and alerts them if any
process abnormality is detected. Second, the fault diagnosis component identifies the faulty
variables responsible for process abnormality. The monitoring model will be built using PCA
and the methodology introduced in Chapter 5 will be employed for FDD. The process system
is the same as that in Chapter 5, i.e., the polymer processing plant with 33 variables.
A web application primarily has 2 main parts: front-end and back-end. Front-end (or client-
side) is the part that end-users see and interact with directly through web browsers. The back-
end works behind the scenes to deliver information to the browsers. When a user enters your
website’s URL in browser, a request is sent to the back-end which parses this request,
processes it, and sends back a response which is displayed on the front-end. Illustration below
shows the data transmission scheme that we will employ
As you can see, the data analytics occurs on the back-end residing on a dedicated computer
and the results are seen on any machine via browsers. While the front-end in our application
will do only the basic job of displaying FDD results, front-ends can do much more such as
user-data entry validation, animations, etc., and these complex functionalities are handled by
any modern internet browser.
Before we build the full-fledged monitoring tool, let’s build a simple ‘Hello World’ application
to gain some familiarity with the process of developing a web application using Python. We
will use CherryPy81 package, a popular and simple web application framework, which is used
for rapid deployment of web apps and is very easy to learn. CherryPy provides a production-
ready web server which can easily handle medium-scale web applications and therefore, is
suitable for our purpose.
For the simple application, let’s type the following code in a Python script82 helloWorld.py and
execute it in the Anaconda command terminal (to execute from windows command terminal,
you will need to edit Windows Environment variables to add Anaconda to your System Path)
# import package
import cherrypy
81 You can install CherryPy via the pip install cherrypy command. The official CherryPy documentation has some excellent instructive tutorials (https://docs.cherrypy.org/en/latest/tutorials.html). Do check them out if you want to learn more.
82 We will not cover the concepts behind Python classes or CherryPy configuration in detail. While understanding them is useful, it's not a necessity for following the script. You can just follow the structure shown here and tweak it as you become more familiar with developing web applications.
class HelloWorld(object):   # class definition line reconstructed
    @cherrypy.expose
    def index(self):
        return "Hello world!"

# execution settings
cherrypy.config.update({'server.socket_host': '0.0.0.0'})

if __name__ == '__main__':
    cherrypy.quickstart(HelloWorld())   # when this script is executed, host the HelloWorld app
You will see the following ‘strange looking’ output on the terminal
This output simply indicates that a web server has started and is ready to accept incoming
requests on the address http://localhost:8080 (or http://127.0.0.1:8080). Go ahead and type
this address on your browser’s URL address bar. You should see the following
What's happening here is that the web server directs the incoming request to the HelloWorld app, triggering the execution of the default index83 function, which returns the text ‘Hello world!’ to the browser for display. In this illustration, the back-end and the front-end both reside on your PC; however, for tool deployment, you would set up the back-end on some other dedicated machine. Our web server can also listen to requests coming from some other PC in the same workplace network – the URL just needs to explicitly specify that the web server is hosted on your PC. Assuming your PC name is machineName, you can ask any colleague of yours to use the address
http://machineName:8080 on their browser – they should see the same result as above.
83 The ‘index’ function is executed by default if no explicit function (or path segment) is provided in the URL.
This is it! This is all that is needed to create and deploy a web application. You would not have
expected it to be so easy, would you? By now, you might be beginning to get some ideas on
how we can extend this simple web app for our monitoring tool. All we need is that our web
server serves FDD results upon receiving browser requests. We will now add the required
components in our simple web app to enable this functionality.
To provide monitoring results to plant operators, we will replace the HelloWorld class in our helloWorld.py script with an FDDapp class, as sketched below, and save the file as FDD.py.
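A minimal sketch consistent with the description that follows might look like this (the runPCAmodel function is defined shortly):

class FDDapp(object):
    @cherrypy.expose
    def getResults(self):
        processState = runPCAmodel()   # run the saved PCA model on the latest data
        return processState            # plain-text response for now; HTML is added later

if __name__ == '__main__':
    cherrypy.quickstart(FDDapp())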
If you use the web address http://machineName:8080/getResults (after we have defined the
logic for runPCAmodel function), the getResults function will get executed which will execute
runPCAmodel function to generate results and return the current process state information.
For now, our returned output is very simple, but we will modify it later to return more instructive results. Before we look into the code for the runPCAmodel function, an important note: we would not want to (re)train and generate a PCA model every time our web app is accessed. What we can do instead is save the generated PCA model (and other model variables) once after model training and then just re-use the saved model in the runPCAmodel function.
To save and re-use ML models and model variables, we can use the pickle package. Let’s
add the following code to our ProcessMonitoring_PCA.py script from Chapter 5.
"Q_CL": Q_CL,
"T2_CL": T2_CL} # dictionary data structure uses key-value pairs
If you re-run the ProcessMonitoring_PCA.py script, you will now see a PCAmodelData.pickle file in your project folder.
Let's now look at the runPCAmodel function below. You will notice that, for ease of code maintenance and readability, we have separated different tasks into separate helper functions. Note that, if you prefer, you can put the runPCAmodel function in a separate .py file and then import it into your FDD.py script. For simplicity, we will keep all the code in the FDD.py file. As the explanatory comments indicate, we load the data in the saved pickle file into the workspace, fetch the latest process data, transform the latest data via PCA, compute the monitoring indices, check for the presence of a process fault, and create some plots that we will eventually show on the user interface. The resulting FD result is returned. The monitoring index computation code is similar to that in Chapter 5, and therefore the helper functions' details are not shown here, but you are encouraged to check them out in the online script file.
# runPCAmodel function
def runPCAmodel():
    # read saved PCA model data
    with open('PCAmodelData.pickle', 'rb') as f:
        PCAmodelData = pickle.load(f)

    # fetch the latest process data, transform it via PCA, and compute the monitoring indices
    # (currentQ and currentT2 are computed via helper functions; details in the online script)

    # detect fault
    if (currentQ > PCAmodelData["Q_CL"]) or (currentT2 > PCAmodelData["T2_CL"]):
        processState = 'Issue Detected'
    else:
        processState = 'All Good'

    # generate and save metric plot containing historical metrics and current metrics
    generateMetricPlots(currentQ, currentT2, PCAmodelData["Q_CL"], PCAmodelData["T2_CL"])
    # saves 'metricPlot.png' in the working folder

    return processState
Now you know how you can embed your ML code and model into a web application. But our
web app is not complete yet. We still need to instruct our web app to send those generated
metrics and contribution plots to the end users.
Interactive and nice-looking user interfaces cannot be rendered with just plain text output
returned from web servers. Along with different media objects (like figures), explicit
instructions need to be provided to the browsers on how to structure the webpage. Those
instructions are written using HTML (specifies the basic structure of the page) and CSS
(handles the formatting/appearance aspects) languages that browsers understand. Let’s
begin by modifying our web app so that we obtain the following user interface.
<html>
<style>
body
{
font-family: Arial; text-align: center;
}
#topHeader
{
padding: 15px; background: #1abc9c; color: white; font-size: 30px;
}
</style>
<body>
<h1 id='topHeader'>Smart Process Monitoring Tool</h1>
<p style="color:red;"> <b> Issue detected </b> </p>
</body>
</html>
You can in fact go ahead and type the above code in a notepad, save it as sample.html and
open the file in a browser. You will see the above interface. We can embed this code in our
getResults function as per the pseudo-code shown below
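A sketch of that (not-recommended) embedding approach is shown below; the HTML string is abbreviated.

# inside the FDDapp class
@cherrypy.expose
def getResults(self):
    processState = runPCAmodel()
    # embed the HTML directly in the Python code (works, but mixes front-end and back-end)
    return """
    <html>
      <head> <style> /* styling as shown above */ </style> </head>
      <body>
        <h1 id='topHeader'>Smart Process Monitoring Tool</h1>
        <p style="color:red;"> <b> """ + processState + """ </b> </p>
      </body>
    </html>"""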
While the above code structure would technically be correct, it is not a best practice to keep
the front-end code and core back-end code in the same script. The separation can be
achieved as follows

# imports needed for templating (added for completeness)
import os
from jinja2 import Environment, FileSystemLoader

path = os.path.abspath(os.path.dirname(__file__))
env = Environment(loader=FileSystemLoader(path))
@cherrypy.expose
def getResults(self):
    processState = runPCAmodel()   # returns numeric flag: 0 => 'All good' or 1 => 'Issue detected'
    template = env.get_template('frontEndTemplate.html')   # (reconstructed) load the Jinja template
    return template.render(state=processState)
<html>
<head>
<style>
body { font-family: Arial; text-align: center;}
#topHeader {padding: 15px; background: #1abc9c; color: white; font-size: 30px;}
</style>
</head>
<body>
<h1 id='topHeader'> Smart Process Monitoring Tool </h1>
<!-- the {% ... %} blocks below are non-HTML code parsed by Jinja -->
{% if state == 0 %}
<p style="color:green;"> <b> All Good </b> </p> <br>
{% else %}
<p style="color:red;"> <b> Issue Detected </b> </p> <br>
{% endif %}
</body>
</html>
326
`
You will notice some additional elements ({% …%}, if statements) in the
frontEndTemplate.html file. These additional elements are not meant for browsers. This file is
a template that will be used by the Jinja templating engine (imported in our FDD.py script) to
generate the required HTML code. You can notice the logic that has been put in place to
modify the central dashboard message and its text color depending upon the processState
value. The templating framework is a beautiful solution to keep the front-end and back-end
code separate and encourage code-readability.
Alright, now you understand how front-end code can be put into a web app. Let's move ahead and add our figures to the interface. We will add the following code to the body section of the template file for displaying figures.
<body>
…
<center> <img src="/metricPlot.png" height="300" width="1450"> </center> <br> <br>
<center> <img id='contri_img' src="/contributionPlot.png" style="display: none;" height="300"
width="1350"> </center>
</body>
All we did above was include image elements in the HTML code and provide the correct path. A small piece of code is also inserted into the FDD.py script to let the web app know where to find these image files.
# execution settings
cherrypy.config.update({'server.socket_host': '0.0.0.0'})
conf = {'/': {'tools.staticdir.on': True,'tools.staticdir.dir': path}}
If you execute FDD.py and access your website, you would see the figures on the user
interface now. Let's now come to the last and most interesting part: adding some dynamics or interactivity to our interface. What we desire is that the contribution plots be shown only when the user clicks the ‘Show/Hide Contribution Plot’ button (these plots are not of much use when the monitoring indices are below their thresholds). To enable this interactivity, we will employ
another cornerstone language of web development, Javascript. As shown below, a Javascript
function is put in the head section of the template and a button is added to the user interface.
<head>
…
<script>
function toggleDisp(imgID) {
  var x = document.getElementById(imgID);
  if (x.style.display == "none") {
    x.style.display = "block";
  } else {
    x.style.display = "none";
  }
}
</script>
..
</head>
<body>
…
<button type="button" onclick="toggleDisp('contri_img')"> Show/Hide Contribution Plot
</button> <br> <br> <br>
…
</body>
When the button is clicked, the Javascript function gets executed which toggles the display
property of the contribution plots images. There remains one last piece of the puzzle.
Currently, the latest monitoring results are displayed only when a user refreshes/reloads the
webpage. This clearly is undesirable. What we want is that the interface refreshes itself
automatically to display the latest results at a regular interval. This is accomplished by including the following in the head section, which refreshes/re-loads the interface every 30 seconds.
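One common way to do this (likely what is intended here) is the standard HTML meta refresh tag:

<head>
…
<meta http-equiv="refresh" content="30">   <!-- reload the page every 30 seconds -->
…
</head>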
Our web app is now complete. Go ahead and try it out. You should see the interface shown
in Figure 14.1. The top metric plots will first show only a single data-point and new data-points
will show up after every automatic refresh. The world of web development is huge, and we
covered only a small portion of it in this chapter to develop a very basic user interface with
limited functionalities. Technologies like Ajax, jQuery, Angular, React, Bootstrap exist that can
be used to make extremely complex and functionality-rich websites.
Summary
In this chapter we studied the very last phase of the end-to-end development of a machine
learning project, i.e., taking the results from ML models to the non-technical end-users via
web applications. Concepts learnt in this chapter can help you rapidly deploy your tool and
collect crucial user feedback. With this chapter, we have come to the end of our process data
science journey. We hope that you enjoyed reading this book as much as we enjoyed writing
it!
Appendix
Dataset Descriptions
This appendix section provides a quick overview of the process datasets used in the book.
The ML techniques implemented on the datasets are also mentioned.
Figure A1: Polymer manufacturing data. Each curve corresponds to a process variable
For this dataset, it was reported that the process started behaving abnormally around sample 70 and eventually had to be shut down. Dimensionality reduction and process monitoring/fault detection & diagnosis using PCA were illustrated with this data in Chapter 5.
84 https://www.academia.edu/38630159/Multivariate_data_analysis_wiki
85 MacGregor et al., Process monitoring and diagnosis by multiblock PLS methods, Process Systems Engineering, 1994
The feed and reactor operating conditions strongly influence the final product quality and
therefore, an efficient FDD is highly desirable for such processes. For this dataset, a process fault occurs from sample 51 onwards due to an increase in the feed impurity level. Figure A4 shows that fault detection using the quality variables plot is not ‘obvious’. A couple of process variables do seem to show significant deviations, but it would be serendipitous if a plant operator were looking at these specific variables at the right time! An efficient and systematic fault detection mechanism using PLS is illustrated for this data in Chapter 5.
Figure A4: Scaled process (left) and quality (right) variables in LDPE dataset
A bigger dataset for the TEP has also been provided by Rieth et al.86 The dataset still contains 21 different fault classes (along with no-fault operation); however, for each fault class, 500 separate simulation runs were conducted for both the training and test datasets. Each training simulation run contains 500 time samples from 25 hours of operation, and each testing run contains 960 time samples from 48 hours of operation. LSTM-based fault classification has been demonstrated with this data in Chapter 12.
86 Rieth, C.A., B.D. Amsel, R. Tran, and M.B. Cook, Additional Tennessee Eastman process simulation data for anomaly detection evaluation, Harvard Dataverse, Version 1, 2017
The dataset contains 2394 samples of input-output process values. Note that the dataset is provided in its normalized form, and the output values have been shifted by 8 samples to compensate for the effect of the time delay. SVR and ANN have been used for soft sensor development with this data in Chapters 7 and 11, respectively.
87 Fortuna et al., Soft sensors for monitoring and control of industrial processes, Springer, 2007
88 UCI Machine Learning Repository: https://archive.ics.uci.edu/ml/datasets/water+treatment+plant
89 UCI Machine Learning Repository: https://archive.ics.uci.edu/ml/datasets/combined+cycle+power+plant
90 Turbofan engine degradation simulation data set, NASA, https://ti.arc.nasa.gov/tech/dash/groups/pcoe/prognostic-data-repository/