Mechanical Engineering

Data Analytics for Process Engineers
Prediction, Control and Optimization

Synthesis Lectures on Mechanical Engineering
This series publishes short books in mechanical engineering (ME), the engineering branch
that combines engineering, physics and mathematics principles with materials science to
design, analyze, manufacture, and maintain mechanical systems. It involves the production
and usage of heat and mechanical power for the design, production and operation of
machines and tools. This series publishes within all areas of ME and follows the ASME
technical division categories.
Daniela Galatro · Stephen Dawe
© The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Switzerland AG
2024
This work is subject to copyright. All rights are solely and exclusively licensed by the Publisher, whether the whole
or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation,
broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage
and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or
hereafter developed.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does
not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective
laws and regulations and therefore free for general use.
The publisher, the authors, and the editors are safe to assume that the advice and information in this book are
believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give
a warranty, expressed or implied, with respect to the material contained herein or for any errors or omissions that
may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and
institutional affiliations.
This Springer imprint is published by the registered company Springer Nature Switzerland AG
The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland
Preface

Process engineering is the engineering field that pioneered the development of process
simulation and process control, including advanced automation linking machine learn-
ing and optimization tools. At the same time, statistics and data-driven modelling have
fundamentally supported monitoring and troubleshooting capabilities. Therefore, process
engineering has played and is playing a key role in proving the applicability of data ana-
lytics to ensure the reliability of processes. However, the acknowledged literature in data
analytics has presented the progress of data analytics in process engineering in journals,
perhaps underrating the possibility of structurally compiling applicable data analytics tools
and exemplifying their application tailored to the process engineer’s needs. This book has
created an exclusive data analytics domain for process engineers, describing tools that
can be pragmatically used in different contexts and at different levels of expertise, from
undergraduate or graduate students looking to design unit operations or process plants
to plant engineers or researchers looking to monitor, control or optimize processes. Its pragmatism also lies in exposing the reader to these techniques without demanding a rigorous mathematical background, offering prompt applicability in the field or a starting point for further research. The data analytics workflow proposed in this book also differs from a typical workflow conceived for data scientists: data analytics for process engineers begins by understanding the sources of data (mostly continuous data), derived from process and pilot plants, from laboratories, and generated by process simulation. We continue
our workflow with exploratory data analysis (EDA), including familiar tools like simple
visualization techniques and managing outliers and missing values. We also borrowed
three core data scientists’ exploratory data analysis tools: correlograms, clustering and
dimensionality reduction. These techniques complete the skills required to get insights
from the data and identify patterns. Once investigated and cleaned, the data is ready
for the ultimate goals in process engineering: modelling, control and optimization. Data-
based modelling builds up from simple regression models, such as simple or multiple
linear regression models, splines and adaptive regression splines, to non-linear regression
models and non-linear machine learning models, including neural networks, random for-
est and support vector machine. Finally, our fitting options include distribution models
such as normal, Weibull and gamma distributions. We dedicated a section of the data-
based modelling chapter to model performance and verification, as it is crucial to address
the goodness of fit and select the best model for decision-making processes. Furthermore,
we highlight the need for testing causality, as correlation and causality might not coexist.
Moving towards process control, data analytics is presented to support the design of typ-
ical controllers and model predictive control settings. In optimization, we explore simple optimization using grid search, random search and gradient search, and then briefly introduce our readers to evolutionary algorithms, particle swarm optimization and Bayesian inference. Multi-objective optimization is also addressed, as it is common in process engi-
neering to deal with two or more conflicting objectives. While structurally providing data
analytics tools for EDA, data-driven modelling, data-driven control and optimization, we
continuously emphasize the importance of assessing the physical meaning of the outcomes of exploring and fitting data, as tools are simply tools and might deliver meaningless or misleading results if not appropriately used and/or interpreted. Our examples and dis-
cussions contribute to this purpose. R and its interface RStudio were selected as the data
analytics software for this book, since R is a well-known, robust statistical and data analytics environment. Our codes and examples are stored on a GitHub page for free access. At the
end of each chapter, we also included a list of resources (listed in order of appearance),
further recommended readings and references. In our final chapter, called final remarks,
we wrapped up key takeaways from each chapter, plus some interesting topics such as
introduction and documentation about R and RStudio; clarification on definitions like data
analysis, data analytics and machine learning; access to open datasets; data analytics and
the physical meaning of phenomena; and a brief passage on the future of data analytics in
process engineering. Finally, we would like to thank you, our readers, for choosing this book as a complementary tool in your journey in the process engineering field.
Contents

1 Sources of Data
  1.1 Plant Process Data
  1.2 Pilot Plant and Laboratory Data
  1.3 Process Simulation Data
  1.4 Synthetic Data
  1.5 Summary and Final Remarks
  Data Disclosure
  Problems
  Resources
  Recommended Readings
  References
2 Exploratory Data Analysis
  2.1 Types of Data and Types of Exploratory Data Analysis
  2.2 Summary Statistics
  2.3 Simple Visualization
    2.3.1 Time-Series Plot
    2.3.2 Scatter Plot
    2.3.3 Multivariate Scatter Plot
    2.3.4 Box Plot
    2.3.5 Histogram
    2.3.6 Temperature
    2.3.7 Wind Speed
  2.4 Outliers and Missing Values
    2.4.1 Outliers
    2.4.2 Missing Values
  2.5 Correlogram
  2.6 Clustering and Dimensionality Reduction
    2.6.1 K-means Clustering
1 Sources of Data

1.1 Plant Process Data

Plant process data is acquired by the plant automation system at the control center, mostly from the controllers in centralized or distributed control systems (DCSs). The collection of data from process variables is facilitated by instruments and by the interface between the process and the control subsystem, which deliver the measurements to the controllers.
Plant process data is collected and stored in several databases, such as Manufacturing
Execution Systems (MESs) and Laboratory Information Management Systems (LIMSs).
MES is a specialized manufacturing software used to control production and engineering
environments. LIMS is a research and development (R&D) form of MES; essentially,
it’s a real-time system that stores and tracks mostly analytical measurements from a
laboratory on quality and specification parameters of feeds, products, and intermediate
streams. This data is used for operation monitoring, fault detection, performance analysis,
operations, production, maintenance planning, parameter estimation, process simulation,
optimization, and resource planning.
The data collected through the sensor network must be accurate, reproducible, and
reliable. Accuracy refers to the ability of an instrument to measure the true value. Repro-
ducibility is the ability of a sensor to reproduce a value within a specific interval.
Therefore, a sensor can be precise and inaccurate, for instance, when several measure-
ments fall within an interval that does not contain the true value. Reliability refers to the
probability that the data will exist during a certain period.
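As a small illustration (not one of the book's examples), the accuracy and reproducibility of a sensor can be checked from repeated readings against a known reference value; the numbers below are made up:

```{r, echo = TRUE}
# Repeated readings from a sensor compared against a known true value (made-up numbers)
true_value <- 100.0
readings <- c(98.9, 99.1, 99.0, 98.8, 99.2)   # tight spread, but offset from the true value
bias <- mean(readings) - true_value           # accuracy: systematic offset from the true value
spread <- sd(readings)                        # reproducibility (precision): spread of the readings
bias
spread  # small spread with a bias of about -1: the sensor is precise but inaccurate
```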
Effective visualization of plant process data relies on (i) colours to encode meaning, (ii) patterns for quick detection of anomalies, and (iii) clear context to visualizations in terms of time, scale, and area.
A process historian is software that stores large amounts of data generated from different sources, including control and monitoring plant data, laboratory information management, and resource planning. Transforming the big data from historians into actionable infor-
mation is key for diagnostics, production, performance, maintenance, and safety. Initially,
process historians were fed data from the process control system, typically a DCS. Several
centralized and decentralized architectures support modern process historians and include
data-handling techniques like filtering. Moreover, data stored in process historians can be
used by commercially available software for analysis and reporting, including power and
energy consumption, safety and alarm monitoring and management, and mass balancing.
1.2 Pilot Plant and Laboratory Data

Small-scale datasets generated from tests performed in pilot plants and laboratories are just as valuable as big data. Though limited in volume, this pilot data is crucial, as it provides the basis for the feasible scaling-up of process plants.
A pilot plant is a system producing small volumes of products for the purpose of
obtaining insight from the novel processes and technologies used in the plant’s opera-
tion. The data provided in pilot plants is valuable for designing the full-scale plant and
upgrading and/or optimizing existing plants. On the other hand, laboratory data from process plants is generated using standard tests and collected at different sampling points and frequencies, typically providing information on the composition and properties of different streams over varying periods. This data provides
insight into the quality of feed, intermediate, and product streams and is also useful for
calibrating process composition analyzers.
Laboratory data can also be generated and collected from experiments conducted in a
laboratory but not necessarily performed in a pilot plant, aimed at analyzing the impact of
varying input variables on output variables. To capture valid data from experiments and
maximize the learning capability from the data, it is required to conduct a proper design of experiments (DOE) and perform tests using standard methodologies to ensure that all
significant factors and corresponding operating and/or design ranges that control the value
of a group of parameters are considered. The principle of DOE is that a change in one
or more independent variables or input variables is hypothesized to trigger a response in
one or more dependent variables or output variables. The purposes of the experimenta-
tion could be, for instance, to compare alternatives, identify significant inputs (factors)
that affect an output (response), achieve an optimal response, minimize variability, and
balance trade-offs. Several approaches have been used for DOEs, including full factorial
designs, response surface designs, fractional factorial, and mixture designs. All the pos-
sible combinations of levels for all factors are considered in a full factorial DOE. For
example, the output variable, thickness, depends on the input variables: speed, tempera-
ture, and viscosity. If two levels are considered for each input variable (high and low),
then the number of runs for our experiment can be calculated as 2^k, where k is the number
of factors; hence, eight total runs are calculated. A fractional factorial, on the other hand,
would consider only a fraction of the total runs calculated for the full factorial, such as
one-half or one-quarter, depending on the number of factors.
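As a quick illustration (not one of the book's examples), the runs of a two-level full factorial for the three factors above can be enumerated in R; the level labels are arbitrary placeholders:

```{r, echo = TRUE}
# Two-level full factorial for three factors: 2^3 = 8 runs (levels are illustrative)
full_factorial <- expand.grid(Speed = c("low", "high"),
                              Temperature = c("low", "high"),
                              Viscosity = c("low", "high"))
full_factorial
nrow(full_factorial)  # 8 runs; a one-half fraction would keep only 2^(3-1) = 4 of them
```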
A popular computer-aided DOE technique is the D-optimal design for multi-factor
experiments. These designs are constructed to minimize the generalized variance of the
estimated regression coefficients. In the DOE setting, the matrix X represents the data
matrix of independent variables. D-optimal designs minimize the overall variance of the
estimated regression coefficients by maximizing the determinant of X’X, and they are
typically used when resources are limited to run a full factorial design entirely. Example
1.1 illustrates the use of D-optimal DOE.
Example 1.1 Design of Experiment using D-optimal. Design an experiment for eight
runs to compile data about the evolution of the capacity fade of lithium-ion batteries over
time. The capacity fade depends on the temperature, the state-of-charge (SOC) or level of
charge of the battery relative to its capacity, and the C-rate or rate of discharge compared to
its capacity. Use D-optimal DOE.
###Install
```{r, echo = TRUE}
install.packages("skpr")
install.packages("shiny")
```
A random seed is set for reproducibility. A full quadratic model (linear, interaction,
and quadratic terms) is assumed.
###General
```{r, echo = TRUE}
library(skpr)
library(shiny)
set.seed(12345678)
candidateset <- expand.grid(Temp = c(0, 25, 45), SOC = c(20, 50, 100),
                            C_rate = c(1, 2, 3))
design <- gen_design(candidateset = candidateset,
                     model = ~ Temp + SOC + C_rate + Temp*SOC + Temp*C_rate +
                       SOC*C_rate + Temp^2 + SOC^2 + C_rate^2,
                     trials = 8)
design
```
The design matrix, which shows the optimal combination of factors for eight runs, is
shown as follows (Fig. 1.1).
Thus, for instance, run # 1 defines an experiment at 0 °C, 20% SOC, and C-rate of 3.
The number of runs depends on the resources that can be afforded, such as time and
money. Replications are ideal, as they help validate the results. Researchers look at a trade-
off between the amount of Type I and II errors they can afford to risk and the resources.
The more levels and factors we have, the more combinations (runs) are possible; hence,
more time and money are added to your project. For instance, for the previous example,
a three-level factorial design would require 27 runs instead of 8!
1.3 Process Simulation Data

Example 1.2 Simple sensitivity analysis. In this example, we explore a dataset generated from a sensitivity analysis performed in a commercial process simulation software on a gas-treating unit, where methyldiethanolamine (MDEA) is used to remove H2S and CO2
from natural gas. This dataset shows the impact of the amine concentration (in %), amine
flow rate (in USGPM), and reboiler duty of the regeneration tower on the acid gas loading
(in mol/mol) in the bottom stream of the regenerator tower. The complete dataset is included
in the file Ex1.2.csv.
We can see the effect of the amine flow rate on acid gas loading at 40% concentration
of MDEA by plotting both variables. The simplified file used for this scenario is Ex1.2_
A.csv.
###General
```{r, echo = TRUE}
setwd("C:/Book/Chapter1/R-Codes")
data <- read.csv(file = "Ex1.2_A.csv", head = TRUE, sep = ",")
data = data[c(3:7), c(1:4)]
data
AmineFlow = data$AmineFlow
AcidGasLoading = data$AcidGas.Loading
plot(AmineFlow, AcidGasLoading, main = "Acid Gas Loading at 40% amine conc.",
     xlab = "Amine Flow, USGPM", ylab = "Acid Gas Loading", pch = 19)
```
Figure 1.2 shows that the acid gas loading increases as the amine flow increases (there
is a clear trend!), but how sensitive is this change? How does it compare when changing
other input variables? What is the sensitivity ranking (order of importance) of the input
variables? We will review different fitting and sensitivity analysis techniques in Chaps. 2
and 3.
Note: The acid gas loading is the amount of acid gas (H2S and CO2), on a molar basis, that will be removed by a solvent, in this case, removed by MDEA.
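As a rough first look at the sensitivity ranking, ahead of the formal techniques of Chaps. 2 and 3, one could rank the inputs by the absolute correlation of each input with the output. This sketch assumes hypothetical column names AmineConc and ReboilerDuty in Ex1.2.csv (AmineFlow and AcidGas.Loading appear in the chunk above):

```{r, echo = TRUE}
# Crude sensitivity ranking by absolute Pearson correlation with the output
# (column names AmineConc and ReboilerDuty are assumed; adjust to the actual headers)
data_full <- read.csv(file = "Ex1.2.csv", head = TRUE, sep = ",")
inputs <- c("AmineConc", "AmineFlow", "ReboilerDuty")
corrs <- sapply(inputs, function(v) cor(data_full[[v]], data_full$AcidGas.Loading))
sort(abs(corrs), decreasing = TRUE)  # larger |r| suggests a stronger linear effect
```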
1.4 Synthetic Data

In several cases, the amount of data we obtain from any data source is insufficient; this
can be intuitively interpreted as limited data due to limited resources. Simple exploratory
data analysis might also reveal that a significant amount of data can be biased or distorted
instead! In both cases, insufficient and biased data might deliver incorrect outcomes. How
do we get the data we need, at the scale we need, without considerably compromising
accuracy, balance, and data quality? The answer is: generating synthetic data. However,
can artificial data be generated (not directly measured) from machine learning algorithms
without compromising accuracy, balance, and quality? Let us illustrate the generation and
use of synthetic data with Example 1.3.
Example 1.3 Synthetic data. Amine systems are used to remove CO2 and H2S from nat-
ural gas. Their reliability is closely related to complex corrosion problems. Several factors
contribute to corrosion in these systems; for instance, the corrosion rate of carbon steel (in mm/y) in amines such as monoethanolamine (MEA) and diethanolamine (DEA) depends on the fluid temperature (in °C), heat-stable salts or HSAS (in %), the acid gas loading (in mol/
mol), and fluid velocity (in m/s) [1–5]. The dataset in Ex1.3.csv contains 200 points of these
variables. Use the following code to generate synthetic data and compare it with the original
data (for quality check purposes).
###Install
```{r, echo = TRUE}
install.packages("synthpop") #for synthetic data
```
The data was artificially created using the function syn and method cart (regression
trees). m is the number of synthetic copies of the observed data to be generated.
Linear regression uses a single predictive formula holding over the entire data space.
A single global model can be challenged when the interaction between parameters occurs
or the data exhibits non-linearity. An alternative approach to non-linear regression is the
regression tree, designed to approximate a function through a process known as binary
recursive partitioning (an iterative process that splits the data into partitions). It then
continues splitting each partition into smaller groups as the algorithm moves up each
partition.
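For illustration only, a regression tree of this kind can be fitted directly with the rpart package (the book's chunk below relies on synthpop's internal CART instead); the response column name CorrosionRate is an assumption:

```{r, echo = TRUE}
# Illustrative regression tree via binary recursive partitioning (rpart package)
# install.packages("rpart")
library(rpart)
df <- read.csv(file = "C:/Book/Chapter1/R-Codes/Ex1.3.csv")
# The response column name CorrosionRate is assumed; adjust to the actual header
tree <- rpart(CorrosionRate ~ ., data = df, method = "anova", minbucket = 10)
printcp(tree)                          # how each split reduces the error
plot(tree); text(tree, use.n = TRUE)   # visualize the partitions
```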
###Synthetic
```{r, echo = TRUE}
library("synthpop")
df_observed <- read.csv(file = "C:/Book/Chapter1/R-Codes/Ex1.3.csv")
df_synthetic <- syn(df_observed, m = 1, method = "cart", cart.minbucket = 10)
```
The synthetic data is stored in the file df.csv, where a synthetic copy of the original
200 points was generated. The command compare is used in R to graphically represent
differences between the observed and synthetic data when calculating the corrosion rate,
as shown in Fig. 1.3.
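A minimal sketch of these two steps, reusing the df_observed and df_synthetic objects from the chunk above (the output file name df.csv follows the text):

```{r, echo = TRUE}
# Write the synthetic copy to df.csv (df_synthetic$syn holds the generated data frame)
write.csv(df_synthetic$syn, file = "df.csv", row.names = FALSE)
# Graphically compare the distributions of the observed and synthetic variables
compare(df_synthetic, df_observed)
```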
In addition to the previous comparison, two corrosion rate models (fitted artificial neural networks) were obtained in MATLAB®, using (i) measured or observed data and (ii) hybrid data, respectively. Both models were tested with a different dataset, including measured data. The metric used for this comparison was the coefficient of correlation r, which shows a better predictive accuracy of the model as its value approaches 1. In this
case, the r values were 0.9862 and 0.8985, respectively, indicating that the model fitted
with measured or observed data is more accurate than the one fitted with hybrid data;
nevertheless, both models can reasonably predict the corrosion rate.
Synthetic data is built artificially. Therefore, it does not represent events occurring in the
real world. It has been used for both training and testing purposes. The main advantage of
generating synthetic data is to generate large training datasets! Data scientists are looking
at (i) data quality, balance, and variety; and (ii) scalability by supplementing data to
achieve a large scale of diverse inputs.
There are two types of synthetic data: (i) fully synthetic, which does not retain values
from the original data, relying on algorithms based on generative methods; (ii) hybrid,
which combines real and synthetic data, pairing random records from a real dataset
with synthetic records. The main drawback of hybrid datasets is the computational cost
incurred when generating records.
The main challenges of synthetic data involve (i) realism, since it must accurately
reflect the original data; (ii) bias, as the synthetic data can drag the same biases of the
original data.
How can we generate synthetic data? By using one or the combination of the following
methods:
(i) based on the statistical distribution, where we draw numbers from the distribution by observing the real statistical distribution of the real data; hence, similar factual data should be reproduced (a minimal sketch of this approach follows the list).
(ii) based on a model, generating random data with the model we create to explain the
observed behaviour or response.
(iii) deep learning, employing techniques such as Variational Autoencoder or Adversarial
Network models.
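A minimal sketch of approach (i): fit a distribution to an observed variable and sample new values from it. The normal assumption and the stand-in variable below are illustrative only:

```{r, echo = TRUE}
# Distribution-based synthetic data: estimate the observed distribution and draw from it
observed_temp <- rnorm(200, mean = 60, sd = 5)          # stand-in for a measured variable
mu <- mean(observed_temp)
sigma <- sd(observed_temp)                              # observed statistical distribution
synthetic_temp <- rnorm(200, mean = mu, sd = sigma)     # synthetic draws
summary(observed_temp)
summary(synthetic_temp)                                 # similar factual data is reproduced
```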
1.5 Summary and Final Remarks

Process data can be typically obtained from three different sources: (i) plant process data,
(ii) pilot plants and laboratory data, and (iii) process simulation data. The plant automation
system acquires plant process data at the control center, showing values read by different
sensors (instruments) located in the plant. Pilot plants and laboratory data generate limited
data that can be used for plant design purposes or to complete process plant data collected
from existing plants. This data source requires the design of experiments to ensure that
all significant factors and corresponding operating and/or design ranges that control the
value of a group of parameters are considered. Finally, process simulation models can
generate process data through sensitivity analysis. These mathematical models are based
on first principles, data-driven models, or a combination of both types, and they can be
coded or directly used from commercial simulation software.
Insufficient and biased data might deliver incorrect outcomes when predicting values.
Synthetic data can be used to generate artificial data without considerably compromising
accuracy and data quality. Some pending questions to explore when generating synthetic
data are: How much is insufficient in a dataset, and how much would be sufficient when
generating synthetic data? How well does the model generated from the synthetic or
hybrid data perform when extrapolating?
Data Disclosure
The data was generated maintaining the physical meaning of the analyzed phenomenon or process described in each problem.
Problems
1.2 Obtain synthetic data to complete measured data and correlate (visualizing graphically) the viral load in wastewater (in copies/L) versus the number of COVID-19 cases.
Resources
Recommended Readings
Amine gas treating unit: Gulf Professional Publishing. (2014). Oil and gas corrosion prevention:
From surface facilities to refineries.
Design of experiment: Freddi, A., & Salmon, M. (2019). Design principles and methodologies
from conceptualization to first prototyping with examples and case studies. Springer International
Publishing.
Synthetic data: McLachlan, S., Dube, K., Gallagher, T., Simmonds, J. A., & Fenton, N. (2019). Realistic synthetic data generation: The ATEN framework. Biomedical Engineering Systems and Technologies, 497–523. https://doi.org/10.1007/978-3-030-29196-9_25
References
1. American Petroleum Institute. (2016, April). API RP 581, Risk-based inspection technology (3rd
ed.).
2. Corrosion modeling in lean and rich amine systems. (2022). In European Federation of Corrosion (EFC) Series. Corrosion in Amine Treating Units (2nd ed., pp. 55–66). Woodhead Publishing. https://doi.org/10.1016/b978-0-323-91549-6.00007-1
3. Han Yang, J., Lin Xie, J., & Zhang, L. (2017). Study on corrosion of carbon steel in DEA aqueous solutions. IOP Conference Series: Earth and Environmental Science, 113, 012006. https://doi.org/10.1088/1755-1315/113/1/012006
4. Orozco-Agamez, J., Tirado, D., Umaña, L., Alviz-Meza, A., García, S., & Peña, D. (2022). Effects of composition, structure of amine, pressure and temperature on CO2 capture efficiency and corrosion of carbon steels using amine-based solvents: A review. Chemical Engineering Transactions, 96. https://doi.org/10.3303/CET2296085
5. Malo, J. M., Uruchurtu, J., Vasquez, R. C., Rios, G., Trejo, A., & Rinson, R. E. (2000, March).
The effect of diethanolamine solution concentration on the corrosion of steel. In Paper presented
at the CORROSION 2000, Orlando, Florida.
6. 519covid.ca. (n.d.). Retrieved May 9, 2022, from http://www.519covid.ca/
2 Exploratory Data Analysis
2.1 Types of Data and Types of Exploratory Data Analysis

Data can be classified into two groups: structured data and unstructured data. Structured
data is a form of data that is organized, such as categorical or numerical data. Unstruc-
tured data is a form of data that does not have an explicit structure, such as audio,
language text, and images.
In process engineering, we typically use numeric or continuous variables; these vari-
ables can be any value within an infinite or finite interval. Some examples of numeric
variables include temperature, pressure, and concentration. There are two types of numeric
variables: interval and ratio. An interval variable has a numeric scale in which equal differences have the same interpretation throughout the scale, but ratios are not meaningful; for example, a temperature reading that is numerically twice another does not mean the process is twice as hot. On the other hand, a ratio variable is an interval variable whose zero value indicates the absence of the quantity being measured.
In this book, we exclusively deal with numeric variables.
EDA can be cross-classified in two ways: (i) graphical or non-graphical, and (ii) univariate or multivariate.
2.2 Summary Statistics

Example 2.1 Summary statistics. Obtain the summary statistics for a dataset on the inter-
facial hydrolysis of a reagent, showing the effect of hydrolysis on kinetics and drop in pH
versus time at two different amine concentrations (1 and 2). The data is stored in the file
Ex2.1.csv. We can use different packages and commands in R to get the summary statistics, such as summary (available by default, no package required), pastecs, Hmisc, and psych. Let us use them in this example for comparison purposes. Note: An interesting work on interfacial
hydrolysis kinetics of trimesolyl chloride (a side reaction in reverse osmosis) is reported in
reference [1].
###Install
```{r, echo = TRUE}
install.packages("pastecs")
install.packages("Hmisc")
```
###General
```{r, echo = TRUE}
setwd("C:/Book/Chapter2/Examples")
data <-read.csv(file="Ex2.1.csv",head=TRUE,sep=",")
data=data[c(3:47),c(1:3)];
#METHOD 1 - Summary
summary(as.numeric(data$X1))
summary(as.numeric(data$X2))
#METHOD 2 - Pastecs
library(pastecs)
stat.desc(as.numeric(data$X1))
stat.desc(as.numeric(data$X2))
#METHOD 3 - Hmisc
library(Hmisc)
describe(as.numeric(data$X1))
describe(as.numeric(data$X2))
```
summary (for each amine concentration set)
The results from Hmisc are visualized as a data frame (please refer to the corresponding
*.rmd file).
Summary is a basic function of R to provide information about minimum and maxi-
mum values, mean, median, first and third quartiles. Pastecs also includes range, standard
deviation, missing, and null values. Finally, Hmisc also calculates the distinct value of
each column, the frequency of each value, and the proportion of that value in that column.
What do we typically observe in Summary Statistics? For instance, the data provided
in this example for amine concentration X1 has a pH range of 2.8, with a minimum value of 3.7, a maximum value of 6.6, and a standard deviation of 0.7. There is more variability (spread) in X1 in terms of standard deviation compared to X2. No nulls or missing
values are observed in the dataset.
2.3 Simple Visualization

EDA's main graphical tools are time-series, scatter plots, multi-variable charts and matri-
ces, box plots, and frequency histograms. In this section, we discuss these tools with some
examples.
2.3.1 Time-Series Plot

Example 2.2 Time-series plot. The following time-series plot shows the degradation rates
for a polymer. The degradation rate is expressed as the relative mass in percentage. Note:
An interesting work on degradation rates for a high-density polyethylene (HDPE) fiber in
the marine environment is reported in reference [2]. The complete dataset is included in the
file Ex2.2.csv, and the R code is shown as follows:
The package ggplot2, a data visualization package for R, must be installed.
###Install
```{r, echo = TRUE}
install.packages("ggplot2")
```
###General
```{r, echo = TRUE}
library(ggplot2)
setwd("C:/Book/Chapter2/Examples")
data <- read.csv(file = "Ex2.2.csv", head = TRUE, sep = ",")
```
A smoothed trend line has been added using loess, which is short for Local Regression.
Loess is a common method used to smoothen a time series. It is a non-parametric method
where least squares regression is performed in localized subsets.
A model has been created with loess, relating both variables and the function predict
has been used to predict the degradation rate at 120 years, for instance.
Figure 2.1 shows how the relative mass of the polymer decreases over time. The
predicted value at 120 years is 55.8% degradation.
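A minimal sketch of the loess fit and prediction described above, reusing the data object from the chunk; the column names Time and RelativeMass are assumed placeholders:

```{r, echo = TRUE}
# Loess smoothing of the degradation curve and prediction at 120 years
# (column names Time and RelativeMass are assumed; adjust to the headers of Ex2.2.csv)
fit <- loess(RelativeMass ~ Time, data = data, span = 0.75,
             control = loess.control(surface = "direct"))  # allows prediction beyond the data range
predict(fit, newdata = data.frame(Time = 120))              # predicted relative mass at 120 years
```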
Some practical recommendations when collecting data for time-series plots:
• Record data in sequential or chronological order; otherwise, your data cannot be used
in a time-series plot to assess patterns over time.
• Collect data at regular time intervals, for instance, once a day, once a week; other-
wise, your time-series plot may be misleading. Scatter plots are typically used when
collecting data at irregular intervals.
• Collect data at the time intervals when you want to detect patterns, for instance, week-
to-week patterns in a chemical process, with data collected at the same time each week.
Collecting data at the right frequency is crucial: collecting data less frequently will not
allow you to detect patterns. On the other hand, collecting data more frequently might
add noise to the data and will also not allow you to detect patterns clearly!
• Collect the right amount of data: you need to be sure to collect enough data showing
a long-term pattern and not only an anomaly in your process.
• Collect data, if possible, at the same sampling point.
Smoothing
Time series representations have several potential applications; we mostly use them to describe how a process evolves, helping us forecast or predict the future. To achieve this prediction goal, these plots rely on capturing the low-frequency behavior, the high-frequency behavior, or both.
A time-series plot can be decomposed into trend, seasonal, cycling, and noise com-
ponents. A clear trend is observed in Fig. 2.1, which is the plot’s only component. For
instance, seasonal components could tell us that specific trends of a variable are observed
during weekends, different than those observed during the week. Cycling components
could tell us, for instance, a specific pattern repeated over time. Finally, noise shows the
random variation over a given time interval. Smoothing attempts to remove the higher-
frequency behavior to describe the lower-frequency behavior easily; therefore, smoothing
levels can help us remove these components. Thus, a small amount of smoothing removes
the noise component, while more smoothing can remove the seasonal and cyclical com-
ponents to show one isolated trend. A bad smoothing strategy might remove more than
one component at a time, altering the behavior representation of a process. Example 2.3
shows different strategies for smoothing noise in signals.
Example 2.3 Smoothing noise. This example includes a dataset showing signal data (taken
each second) of inlet temperature in a dryer. The corresponding R code is shown as follows,
while the dataset is stored in the Ex2.3.csv file:
###Install
```{r, echo = TRUE}
install.packages("ggplot2")
```
###General
```{r, echo = TRUE}
library(ggplot2)
setwd("C:/Book/Chapter2/Examples")
data <- read.csv(file = "Ex2.3.csv", head = TRUE, sep = ",")
```
Figure 2.2 shows noisy data of temperature in time. Let us apply a smoother (loess)
and compare their efficacy in reducing noise (Fig. 2.3).
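A minimal sketch comparing two smoothing strategies on this signal (a simple moving average and loess); the column names Time and Temperature are assumed placeholders:

```{r, echo = TRUE}
# Two smoothing strategies on the noisy dryer-inlet temperature signal
# (column names Time and Temperature are assumed; adjust to the headers of Ex2.3.csv)
ma5 <- stats::filter(data$Temperature, rep(1/5, 5), sides = 2)  # 5-point moving average
lo <- loess(Temperature ~ Time, data = data, span = 0.3)        # loess smoother
plot(data$Time, data$Temperature, pch = 19, col = "grey",
     xlab = "Time, s", ylab = "Inlet temperature")
lines(data$Time, ma5, col = "blue", lwd = 2)           # moving average
lines(data$Time, predict(lo), col = "red", lwd = 2)    # loess
```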
What defines a smoother is the basis to represent the smooth function and the penalty
used to penalize the basis coefficients of the function to control the degree of smoothness.
How do we select the right smoother? In a low-noise scenario, we typically choose a simple moving average. As the noise increases, we rely on visualization or experimentation to try different smoothers. When selecting the right smoother, we should also ask
ourselves: Can we isolate certain components in our problem? What exactly do we want
to capture when forecasting? (For instance, a trend, or cyclical behaviour).
Notes: loess is a function in R that fits a polynomial surface. Loess regression can be
applied on a numerical vector to smoothen it and to predict Y.
2.3.2 Scatter Plot

A scatter plot or scatter chart uses dots to represent values for two different numeric
variables; they are used to observe relationships between variables. The data is displayed
as a collection of points, with values of one variable shown on the x-axis, and values of
the other variable shown on the vertical axis. Let us illustrate the use of a scatter plot,
solving the following example (Fig. 2.4).
Example 2.4 Scatter plot. This example is associated with a dataset including Ion-1 and
Ion-2 versus Ion-3 (as concentration ratios); this representation determines the effect of
weathering of minerals in the groundwater. Let us visualize the data with a scatter plot and
infer the potential correlation between these two variables. The corresponding R code is
shown as follows, while the data is stored in the Ex2.4.csv file. Note: An interesting work
on the effect of weathering of carbonate minerals in the groundwater is shown in reference
[3].
###General
```{r, echo = TRUE}
setwd("C:/Book/Chapter2/Examples")
data <-read.csv(file="Ex2.4.csv",head=TRUE,sep=",")
data
# Scatter plot
plot(data$Ion1.Ion3, data$Ion2.Ion3,
xlab="Ion1/Ion3 ", ylab="Ion2/Ion3", pch=19)
#abline(lm(data$Ion2.Ion3 ~ data$Ion1.Ion3), col = "red") # regression line (y~x)
```
The scatter plot allows us to infer that there is not a clear relationship between the
variables Ion-2/Ion-3 and Ion-1/Ion-3 (could be linear?). We can visualize the best straight
line by removing # in the last code line and re-running the code chunk (Fig. 2.5).
How can we fit dispersed data? Details about data fitting are included in Chap. 3.
2.3.3 Multivariate Scatter Plot

Multivariate scatter plots are used to look at the relationships between pairs of variables
in one group of plots; hence, they are helpful to describe relationships among three or
more variables. Example 2.5 illustrates the use of these plots.
Example 2.5 Multivariate scatter plot. The degradation or aging of lithium-ion batteries
is seen as the capacity reduction of the batteries over time. When the battery is at rest, aging
depends on time, temperature, and the state of charge (SOC) or charge level of the battery
relative to its capacity. Let us illustrate, by means of a multivariate scatter plot, the effect of the temperature and SOC on the average capacity after storing two sets of battery cells for 500 days: (i) for a battery cell cathode made of nickel-manganese-cobalt or NMC (data stored in the file Ex2.5a.csv) and (ii) a battery cell cathode made of nickel-cobalt-aluminum or NCA (data stored in the file Ex2.5b.csv). Note: An interesting work on the calendar aging of lithium-ion
batteries is shown in reference [4].
###General
```{r, echo = TRUE}
setwd("C:/Book/Chapter2/Examples")
data <-
read.csv(file="Ex2.5a.csv",head=TRUE,sep=",")
data
# Multivariable Scatter plot for NMC cells
pairs(~Capacity+Temp+SOC,data=data,
main="NMC aging",col= ifelse(data$Temp == 35,
"blue",ifelse(data$Temp == 25, "black","red")))
```
The data stored in Ex2.5a.csv includes three columns: Temperature (25, 35, and 50
°C), SOC (from 0 to 100%), and capacity (in %). All temperatures in the scatterplot are
identified by a different colour (35 °C is blue, 25 °C is black, and 50 °C is red). The
corresponding multivariate scatterplot is shown in Fig. 2.6.
A similar R code is required for the NCA data, stored in Ex2.5b.csv. The corresponding
multivariate scatterplot is shown in Fig. 2.7.
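A sketch of the analogous code for the NCA cells, mirroring the NMC chunk above (the same column names Capacity, Temp, and SOC are assumed in Ex2.5b.csv):

```{r, echo = TRUE}
# Multivariable Scatter plot for NCA cells (mirrors the NMC chunk above)
dataNCA <- read.csv(file = "Ex2.5b.csv", head = TRUE, sep = ",")
pairs(~Capacity + Temp + SOC, data = dataNCA,
      main = "NCA aging",
      col = ifelse(dataNCA$Temp == 35, "blue",
                   ifelse(dataNCA$Temp == 25, "black", "red")))
```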
Figures 2.6 and 2.7 show that the capacity of a battery decreases as the temperature
and SOC increase. When comparing cathode chemistries, for instance, NCA cells seem
to degrade faster (capacity drop) than NMC cells at 50 °C and degrade slower at different
temperatures for all SOCs.
2.3.4 Box Plot

Box plots help in showing the distributional characteristics of data. The box plot
terminology is shown in Fig. 2.8.
To build a box plot, use a horizontal or vertical number line and a rectangular box. The
endpoints of the axis are labelled with the smallest and largest data values. The first
quartile (Q1) shows one end of the box, and the third quartile (Q3) marks the other end
of the box. The middle 50% of the data falls inside the box. The whiskers are identified
from the ends of the box to the largest and smallest data values. The second quartile (Q2), or median, lies between Q1 and Q3 and may coincide with either of them, or with both. In some cases, we can encounter dots marking outliers' values, where the whiskers are not
extending to the minimum and maximum values. Example 2.6 shows how to interpret a
box plot.
Example 2.6 Box plot. Exposure to particulate matter with a diameter of less than 10 μm (PM10) poses a significant risk to human health. It is suggested that there is a link
between long-term exposure to PM10 and respiratory mortality [5]. The data in the file
Ex2.6.csv provides values of the days per year of the mean daily PM10 concentration for
each measurement station (denoted with the letter S, from 1 to 8) in a specific region from
2012 to 2022. Let us prepare a box plot to study this data.
###General
```{r, echo = TRUE}
setwd("C:/Book/Chapter2/Examples")
data <-read.csv(file="Ex2.6.csv",head=TRUE,sep=",")
data
boxplot(PM ~ Year, data = data, main = "PM10 per year/station",
        xlab = "Year", ylab = "PM10", ylim = c(300, 370))
```
The data stored in Ex2.6.csv includes three columns: Station (from S1 to S8), Year
(from 2012 to 2022), and PM10. The corresponding box plot is shown in Fig. 2.9.
Figure 2.9 shows, for instance, that in 2013, the PM10 varied between 314 and 339
among the 8 measurement stations. 2014 and 2017 show two outliers or values that
Fig. 2.9 Box plot for PM10, for 8 measurement stations from years 2012 to 2022
notably differ from the dataset. The mean and median share the same value of 331
(including outliers). What other important information does this plot offer us?
2.3.5 Histogram
In a histogram graph, the data is represented by numerical data points grouped according
to specified ranges called bins, showing the frequency distribution of the data. The x-
axis of a histogram represents the intervals of the measured values, while the y-axis
shows the frequency or height of the bars. Frequency refers to the number of times the
values happen within the interval or width of the bar. A histogram is useful to determine
the distribution of the data and provide us with meaningful indicators such as mean and median. Outliers can also be observed in these graphs. The distribution can be symmetric, skewed to the left, or skewed to the right, and unimodal, bimodal, or uniform.
Kurtosis is also studied when analyzing histograms, measuring whether the data is pre-
dominantly normal or not (with outliers) in terms of distribution. A perfectly normal distribution has zero excess kurtosis (a kurtosis of 3), also known as mesokurtic. Negative excess kurtosis, or platykurtic, shows thinner tails and a flatter peak. Positive excess kurtosis, or leptokurtic, shows a fat-tailed distribution with several outliers.
To build a histogram (i) we find the highest and lowest data value in the dataset; (ii) we
compute the range by subtracting the minimum value from the maximum value; (iii) we
use the range to estimate the width of our classes. Once the class width is estimated, (iv)
a class is selected considering the minimum data value, subsequent classes are generated
until a class that includes the maximum data value is obtained. Once we have organized our data by classes, we proceed to draw the histogram: mark the class boundaries on the x-axis, count the number of observations falling in each class, and draw a bar over each class whose height equals its frequency.
Example 2.7 Histogram. The dataset Ex2.7.csv includes observations of the maximum
daily temperature (in °C) and average wind speed (in m/s) at an undisclosed location. Create
histograms for the temperature and wind speed.
###Install
```{r, echo = TRUE}
install.packages("moments")
```
###General
```{r, echo = TRUE}
library(moments)   # provides skewness() and kurtosis()
setwd("C:/Book/Chapter2/Examples")
data <- read.csv(file = "Ex2.7.csv", head = TRUE, sep = ",")
data
#Temperature Histogram
summary(data$Temperature)
hist(data$Temperature,
     xlab = "Temperature, in °C",
     xlim = c(10, 40),
     col = "blue",
     freq = TRUE)
#Wind Speed Histogram
summary(data$Wind)
hist(data$Wind,
     xlab = "Wind speed, in m/s",
     col = "blue",
     freq = TRUE)
print(skewness(data$Temperature))
print(kurtosis(data$Temperature))
print(skewness(data$Wind))
print(kurtosis(data$Wind))
```
This code first provides summary statistics for the variables temperature and wind speed, followed by their skewness and kurtosis values:

2.3.6 Temperature

The skewness and kurtosis of the temperature are:
[1] -0.3587357
[1] 2.630471

2.3.7 Wind Speed

The skewness and kurtosis of the wind speed are:
[1] 0.4850302
[1] 3.324937
2.4 Outliers and Missing Values

Outliers and missing values are frequently observed when collecting data. They are also
called noise, incomplete data, or abnormal values. We often remove them, considering them unnecessary data. However, outliers and missing values express some facts about the data; therefore, we must understand the mechanism that generated such values. Management of outliers and missing values is an important step in exploratory data analysis,
as they might compromise the statistical power of the study, affect the reliability of the
data by introducing bias to the results, and reduce the accuracy of models in predicting
outcomes.
This section summarizes the main techniques to detect and treat outliers and missing
values in datasets.
2.4.1 Outliers
An outlier is an observation or value that significantly differs from other data points.
Outliers may be due to the observed phenomenon's inherent variability or to measurement errors (e.g., instrument failure). Thus, we can distinguish two classes of outliers: (i) extreme values and (ii) mistakes. Extreme values are possible but unlikely responses.
Outliers can be detected using scatter plots, box plots, and histograms. Nevertheless,
there are three statistical tests to detect outliers formally:
• Grubbs's test allows us to detect whether the highest or the lowest value in a dataset is an outlier. This test detects one outlier at a time; the null and alternative hypotheses are H0: the highest (or lowest) value is not an outlier, and H1: the highest (or lowest) value is an outlier. As for any statistical test, if the p-value is less than the chosen significance threshold (typically α = 0.05), then the null hypothesis is rejected, and we can conclude that the lowest or highest value is an outlier (a small illustration with the outliers package follows this list).
• Dixon’s test, like the Grubbs test, detects whether a dataset’s highest or lowest value
is an outlier. It is performed on the suspected outlier individually, and this test is most
useful for small sample sizes (usually n ≤ 25).
• Rosner’s test is used to detect several outliers at once. It is designed to minimize
masking, which happens when an outlier is close in value to another outlier and can
go undetected.
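For illustration, Grubbs's and Dixon's tests are available in the outliers package; this sketch (not one of the book's chunks) reuses the Wind variable from Example 2.7:

```{r, echo = TRUE}
# Grubbs's test on the most extreme value of the Wind variable
# install.packages("outliers")
library(outliers)
grubbs.test(data$Wind)          # tests whether the value farthest from the mean is an outlier
# Dixon's test is intended for small samples (n <= 25), e.g. on a subset:
# dixon.test(data$Wind[1:20])
```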
Detecting outliers is important because it can bias the fit estimates and predictions. Once
the outliers have been identified, there are three approaches you might consider treating
them:
• Imputation: We replace the outlier points with the mean/median/mode. This technique
can be applied depending on the data context; for instance, if the variation is low or the
variable has low leverage over the system’s response, then this approach is acceptable
and could lead to satisfactory results.
• Capping: We can cap observations outside the 1.5 × interquartile range (IQR) limits.
IQR measures the spread of data.
• Prediction: Outliers can be substituted with missing values (NA) and predicted by
considering them as an outcome or response variable.
Note: There are several sources of outliers in process engineering: measurement errors (from instruments), experimental errors and/or failures, power or emergency failures inducing instrument errors, and changes in process conditions (e.g., stream compositions). These sources must be identified, where possible, for troubleshooting purposes and to keep the physical meaning when analyzing data.
2.4.2 Missing Values

Missing values occur when no data value is captured for the variable in an observation.
Missing data can significantly impact the conclusions drawn from the data; hence, they
must be treated. The options for NA values include:
• Deleting the observations: make sure that after you delete your observations you
o Have sufficient data points so the model does not lose representation capability of the physical phenomenon.
o Do not introduce bias (non-representation of classes).
• Deleting the variable is practical when a particular variable has more NAs than the rest
of the dataset, and the variable is not a significant predictor.
• Imputation with mean/median/mode: this is like the approach used for outliers (a minimal sketch follows this list).
• Prediction: this is also like the approach used for outliers.
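A minimal sketch of mean imputation (not one of the book's chunks), using a Wind-like vector as in the examples below:

```{r, echo = TRUE}
# Mean imputation: replace NA values with the mean of the observed values
Wind_imputed <- data$Wind
Wind_imputed[is.na(Wind_imputed)] <- mean(Wind_imputed, na.rm = TRUE)
summary(Wind_imputed)
```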
Let us perform some tests and treatments on outliers and missing values.
Example 2.8 Outliers and missing values. Let us use the dataset Ex2.8.csv (a modification
of Ex2.7.csv) and the Rosner test in R to check suspected outliers for the variable wind.
Once detected, we proceed to cap them.
###Install
```{r, echo = TRUE}
install.packages("EnvStats")
```
###Detect Outliers
```{r, echo = TRUE}
library(EnvStats)
setwd("C:/Book/Chapter2/Examples")
data <-read.csv(file="Ex2.8.csv",head=TRUE,sep=",")
test <- rosnerTest(data$Wind, k=5)
test$all.stats
test
```
The function rosnerTest requires two arguments: the data, and k, the number of sus-
pected outliers; in this case, we chose 5. The corresponding statistic regarding outliers is
generated as follows:
Based on the Rosner test, one outlier (see the column Outlier) is the observation 42
(see the column Obs.Num).
Note: We have other packages in R that can be used for outliers’ detection, including
lofactor, outliers, outlierTest, OutlierDetection, and mvoutlier.
The following code allows for capping the outliers, based on the 1.5 × IQR limits:
###Treat Outliers
```{r, echo = TRUE}
Wind <- data$Wind
qnt <- quantile(Wind, probs=c(.25, .75), na.rm = T)
caps <- quantile(Wind, probs=c(.05, .95), na.rm = T)
H <- 1.5 * IQR(Wind, na.rm = T)
Wind[Wind < (qnt[1] - H)] <- caps[1]
Wind[Wind> (qnt[2] + H)] <- caps[2]
summary(Wind)
```
###Remove NAs
```{r, echo = TRUE}
Wind_clean <- na.omit(Wind)
summary(Wind_clean)
```
Cook’s distance
A graphical way to observe outliers in linear regression is Cook's distance, which shows the influence of each observation on the fitted response values. Cook's distance summarizes how much the fitted regression changes when the ith observation is removed.
A general rule of thumb for investigating outliers is to check whether a data point's Cook's distance is more than 3× the mean of all the distances. Example 2.9 illustrates this rule of thumb and Cook's distance to detect outliers.
Example 2.9 Identifying outliers using Cook’s distance. Let us detect the outliers in the
dataset stored in Ex2.9.csv using Cook’s distance; this file includes values for the acid gas
loading as a function of the amine flow rate. This is an extended dataset of Ex1.2.csv.
###Install
```{r, echo = TRUE}
install.packages("dplyr")
```
###Scatter plot
```{r, echo = TRUE}
setwd("C:/Book/Chapter2/Examples")
data <- read.csv(file = "Ex2.9.csv", head = TRUE, sep = ",")
data
# Scatter plot
plot(data$AmineFlow, data$AcidGas.Loading,
     xlab = "Amine Flow, USGPM", ylab = "Acid Gas Loading, %mol/%mol", pch = 19)
```
Figure 2.11 shows a linear trend in the data when removing observation 12 (possible
outlier).
Outliers are detected using Cook's distance with the following code:
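The chunk referred to here is not reproduced in this extract; a sketch consistent with the description that follows (linear fit, Cook's distance with the 3× mean rule, then refitting without the flagged points) might look like this:

```{r, echo = TRUE}
# Part 1: fit a linear regression of acid gas loading on amine flow
model <- lm(data$AcidGas.Loading ~ data$AmineFlow)
summary(model)
# Part 2: flag observations whose Cook's distance exceeds 3x the mean distance
cooksD <- cooks.distance(model)
influential <- which(cooksD > 3 * mean(cooksD))
influential
# Part 3: remove the flagged observations and refit the model
data_without_outliers <- data[-influential, ]
model2 <- lm(data_without_outliers$AcidGas.Loading ~ data_without_outliers$AmineFlow)
summary(model2)
```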
We fit the data with a linear regression in the first part of the code. The summary
statistics of the fitting is shown as follows:
Residuals:
Min 1Q Median 3Q Max
-0.0008724 -0.0004625 -0.0002204 0.0002252 0.0026975
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -1.968e-03 1.441e-03 -1.365 0.197158
data$AmineFlow 6.300e-05 1.316e-05 4.789 0.000442 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Let us look at the adjusted R-squared value for this example. Chapter 3 covers in detail
the prediction techniques. This value, when close to 1, attests to the model’s accuracy.
The current value of R2 is 0.6278.
The second part of the code allows detecting outliers using Cook’s distance based on
the rule of thumb 3 × the mean of all the distances. The only outlier detected using this
technique is observation 12.
The last part of the code removes the outlier and refits the model. The corresponding
statistics summary after outliers’ removal is:
Residuals:
Min 1Q Median 3Q Max
-7.612e-04 -7.855e-05 2.011e-05 1.748e-04 3.906e-04
Coefficients:
Estimate Std. Error t value
(Intercept) -5.521e-04 4.934e-04 -1.119
data_without_outliers$AmineFlow 4.773e-05 4.577e-06 10.430
Pr(>|t|)
(Intercept) 0.287
data_without_outliers$AmineFlow 4.85e-07 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
The final value of R2 is 0.8998, closer to 1, and greater than 0.6278, proving that
removing the outlier, when physically possible, improves the fitting of the model. Hence,
the acid gas loading and amine flow are linearly correlated.
2.5 Correlogram
Correlation is a statistical measure that shows the extent to which two variables are lin-
early related without stating the cause and effect of this relation. A correlation can be
visualized with a scatterplot, as in Ex.2.9. When working with multiple variables, we can
use a correlogram, a correlation matrix graph. A correlogram is useful to highlight the
most correlated variables in a data table. In R, the plot includes correlation coefficients
coloured according to the value. Ex. 2.10 illustrates the use of a correlogram.
Let us illustrate a correlogram relating the design variables (stored in the file Ex2.10.csv) to the convection heat transfer coefficient, h.
The package corrplot must be installed. The corresponding R code is shown as follows:
###Install
```{r, echo = TRUE}
install.packages("corrplot")
```
The correlograms are generated using the following code for two different visualiza-
tions (circle, and number):
###Correlation
```{r, echo = TRUE}
setwd("C:/Book/Chapter2/Examples")
data <- read.csv(file = "Ex2.10.csv", head = TRUE, sep = ",")
data
M <- cor(data)
library(corrplot)
#Using circles for visualizations
corrplot(M, method = "circle")
#Using numbers (correlation coefficients) for visualizations
corrplot(M, method = "number")
```
###Significance
```{r, echo = TRUE}
#p-values matrix
cor.mtest <- function(mat, ...) {
  mat <- as.matrix(mat)
  n <- ncol(mat)
  p.mat <- matrix(NA, n, n)
  diag(p.mat) <- 0
  for (i in 1:(n - 1)) {
    for (j in (i + 1):n) {
      tmp <- cor.test(mat[, i], mat[, j], ...)
      p.mat[i, j] <- p.mat[j, i] <- tmp$p.value
    }
  }
  colnames(p.mat) <- rownames(p.mat) <- colnames(mat)
  p.mat
}
# matrix of the p-value of the correlation
p.mat <- cor.mtest(data)
head(p.mat[, 1:6]) #"6" is the total number of variables
# Correlogram leaving blank the correlations that are not significant at the 0.05 level
# (this call is not shown in the extract; it is a sketch consistent with Fig. 2.14)
corrplot(M, method = "circle", p.mat = p.mat, sig.level = 0.05, insig = "blank")
```
Figure 2.14 shows that the input variable x5 is the most significant for the chosen significance level when relating to h.
A white box in the correlogram shows that the correlation is not significantly different
from 0 at a specified significance level for a couple of variables; this means there is
no linear relationship between them. To determine whether a correlation coefficient is
significantly different from 0, a correlation test must be performed. Even if the correlation
coefficient between two variables is low, the correlation test shows if we can reject or not
the hypothesis of no correlation in the population!
Correlograms are very helpful for preliminary visual inspection. Nevertheless, we must
consider that outliers and missing values in a time series may seriously affect them; for
example, extreme points will tend to depress the data correlation coefficients towards zero.
2.6 Clustering and Dimensionality Reduction

2.6.1 K-means Clustering

In K-means clustering, clusters are defined so that the intra-cluster variation is minimized; this variation is the sum of squared Euclidean distances between each observation and the mean of the cluster:

$$W(C_k) = \sum_{x_i \in C_k} (x_i - \mu_k)^2 \qquad (2.1)$$

where

x_i is the data point belonging to the cluster C_k.
μ_k is the mean value of the points assigned to the cluster C_k.

Each observation is placed in a cluster such that the sum of squared distances of the observations to their cluster centres μ_k is minimized:

$$\sum_{k=1}^{K} W(C_k) = \sum_{k=1}^{K} \sum_{x_i \in C_k} (x_i - \mu_k)^2 \qquad (2.2)$$
Example 2.11 K-means clustering. The file Ex2.11.csv includes data of the capacity (in
%) of a lithium-ion battery over time (in days). Let us perform clustering to determine the
optimal number of clusters.
###Install
```{r, echo = TRUE}
#another way of getting packages
packages <- function(x){
  x <- as.character(match.call()[[2]])
  if (!require(x, character.only = TRUE)){
    install.packages(pkgs = x, repos = "http://cran.r-project.org")
    require(x, character.only = TRUE)
  }
}
packages(gridExtra)
packages(cluster)
packages(factoextra) # visualization tool (provides fviz_cluster and fviz_nbclust)
```
###Clusters
```{r, echo = TRUE}
setwd("C:/Book/Chapter2/Examples")
data <-
read.csv(file="Ex2.11.csv",head=TRUE,sep=",")
#k-means clustering
set.seed(100)
data_K2 <- kmeans(dataNorm, centers = 2, nstart =
25)
fviz_cluster(data_K2, data = dataNorm)
```
To perform clustering, we must scale the data values so that every variable has zero mean and unit variance (Fig. 2.15). We performed k-means clustering setting two clusters (centers = 2). We can visualize the results as follows:
K-means requires the user to specify a priori the number of clusters to classify the
data. However, there are plots employing the Elbow and Silhouette methods that suggest
the optimal number of clusters, showing the total within-groups sums of squares versus
the number of clusters and looking at the bend in the graph (Fig. 2.16).
The Elbow method plots the explained variation as a function of the number of clusters
and selects the elbow or bend of the curve to suggest the number of clusters to use
(Fig. 2.17).
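As a minimal sketch of how these plots can be generated (using fviz_nbclust from factoextra, already loaded above, applied to the scaled data dataNorm):
###Optimal number of clusters
```{r, echo = TRUE}
#Suggest the number of clusters with the Elbow (total within-cluster sum of
#squares) and Silhouette methods
library(factoextra)
fviz_nbclust(dataNorm, kmeans, method = "wss")        #Elbow plot
fviz_nbclust(dataNorm, kmeans, method = "silhouette") #Silhouette plot
```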
The Elbow method suggests that the optimal number of clusters for this example is 2,
while the Silhouette method suggests that this number is 3. When analyzing the physical
meaning of degradation (capacity) in lithium-ion batteries, it is reasonable to think that
until 80% of capacity, degradation over time is quasi-linear due to the formation and
growth of a layer called Solid Electrolyte Interface (SEI). The SEI layer reduces the
amount of available lithium for the intercalation reaction (the main reaction that occurs
in these batteries when charging or discharging). After 80% of capacity, lithium plating,
another degradation mechanism, is believed to be responsible for the non-linear behavior
of the degradation over time. Hence, two clusters are expected to reflect the lithium-ion battery’s first life (due to the SEI layer) and second life (due to the combined SEI and lithium plating mechanisms).
The first step of principal component analysis (PCA) is to standardize the data so that each variable is expressed as a z-score:

z = \frac{\text{value} − \text{mean}}{\text{standard deviation}}    (2.3)

The covariance matrix of the standardized variables (shown here for three variables x, y, and z) is then computed:

\begin{bmatrix} \mathrm{Cov}(x,x) & \mathrm{Cov}(x,y) & \mathrm{Cov}(x,z) \\ \mathrm{Cov}(y,x) & \mathrm{Cov}(y,y) & \mathrm{Cov}(y,z) \\ \mathrm{Cov}(z,x) & \mathrm{Cov}(z,y) & \mathrm{Cov}(z,z) \end{bmatrix}    (2.4)
• The covariance matrix is used to identify any relationship between the variables. If the
covariance sign is positive, the two variables are correlated (increasing or decreasing);
if the sign is negative, the two variables are inversely correlated (one variable increases
when the other decreases).
• Computation of eigenvectors and eigenvalues of the covariance matrix. These elements
allow for determining the principal components of the data. These components are
uncorrelated, as they are new variables created as linear combinations of the initial vari-
ables. How can we interpret the PCA components? For instance, a 10-dimensional dataset generates ten principal components, but PCA packs the maximum possible information (variance) into the first component; the maximum remaining information is contained in the second component, followed by the third component, and so on. The variance explained by each component is summarized in a scree plot. When observing this plot, we can discard those components with low information, reducing dimensionality while keeping relevant information.
Example 2.12 Let us use the data Ex2.10 to perform a PCA for dimensionality-reduction
purposes.
###Install
```{r, echo = TRUE}
install.packages("factoextra")
```
###PCA
```{r, echo = TRUE}
setwd("C:/Book/Chapter2/Examples")
data <-
read.csv(file="Ex2.10.csv",head=TRUE,sep=",")
head(data)
#Scree plot
res.pca <- prcomp(data, scale = TRUE)
screeplot(res.pca, type = "l", npcs = 6, main =
"Screeplot of the first 6 PCs")
abline(h = 1, col="red", lty=5)
legend("topright", legend=c("Eigenvalue = 1"),
col=c("red"), lty=5, cex=0.6)
The corresponding scree plot and contribution plot are shown as follows (Fig. 2.18):
An eigenvalue less than 1 means that the component explains less variance than a single original variable would, so such components can be discarded. In our case, we should discard 3 components (Fig. 2.19).
The contribution plot (biplot) shows the principal component (PC) scores of the samples (dots) and the loadings of the variables (vectors). The plot shows observations as points in the plane formed by two synthetic variables (two principal components); a sketch of how such a plot can be generated in R follows the list below:
• The more parallel a vector is to a principal component axis, the more that variable contributes to that PC alone.
• The longer the vector, the more variability of this variable is shown by the two principal
components.
• The angles between vectors of different variables show their correlation; small angles
indicate a high positive correlation, right angles represent a lack of correlation, and
opposite angles represent a highly negative correlation.
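As a minimal sketch of how such a plot can be generated with factoextra (assuming res.pca from the PCA chunk above):
###Biplot
```{r, echo = TRUE}
#Variable loadings (vectors) and a biplot of scores and loadings
library(factoextra)
fviz_pca_var(res.pca)                  #loadings only
fviz_pca_biplot(res.pca, repel = TRUE) #scores (dots) and loadings (vectors)
```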
Global sensitivity analysis defines the importance of model inputs and their interactions
regarding the model output. This analysis is called global, as (i) all the input factors are
varied simultaneously, and (ii) it is performed over the range of each input factor.
Variance-based sensitivity analysis or Sobol indices is a technique to perform a global
sensitivity analysis, as it provides a better understanding of the relationship between the
variables of a model. The variance of the model’s output is decomposed into fractions or
measures of sensitivity attributed to the inputs.
The first-order Sobol sensitivity index S_i measures the contribution of each variable or parameter to the output variable, and hence, its effect on the variance of the model:

S_i = \frac{V[E[Y \mid Q_i]]}{V[Y]}    (2.5)

where E[Y | Q_i] is the expected value of the output Y when the variable or parameter Q_i is fixed.
The total Sobol sensitivity index (ST) accounts for the sensitivity of the first-order and higher-order effects (interactions between variables):

ST_i = 1 − \frac{V[E[Y \mid Q_{−i}]]}{V[Y]}    (2.6)

where Q_{−i} includes all uncertain parameters except Q_i.
Note: The sum of the first-order Sobol sensitivity indices cannot exceed one. Likewise,
the sum of the total Sobol sensitivity indices is equal to or greater than one.
Example 2.13 Sobol indices. We use a dataset included in Ex2.13, showing the variable y
(corrosion rate) as a function of the variables x 1 (acid gas), x 2 (heat stable salts), x 3 (velocity),
and x 4 (temperature) to perform a global sensitivity analysis using Sobol indices.
###Install
```{r, echo = TRUE}
install.packages("sensitivity")
```
###Sobol
```{r, echo = TRUE}
library(sensitivity)
setwd("C:/Book/Chapter2/Examples ")
data <-
read.csv(file="Ex2.13.csv",head=TRUE,sep=",")
x1=data$x1
x2=data$x2
x3=data$x3
x4=data$x4
y=data$y
XX1 = data.frame(data[c(1:500), c(1:4)]);
rownames(XX1) <- NULL
XX2 = data.frame(data[c(501:1000), c(1:4)]);
rownames(XX2) <- NULL
#Note: inside an R formula, terms such as x1^2 are interpreted by the formula
#algebra, not as quadratic terms; I(x1^2) would be needed for raw squared terms
modelX <- lm(y ~ x1+x2+x3+x4+x1^2+x2^2+x3^2+x4^2+x1*x2+x1*x3+x1*x4+x2*x3+x2*x4+x3*x4)
#summary(modelX)
sol <-sobol(model=modelX,X1=XX1,X2=XX2,order=1)
print(sol)
plot(sol)
```
We assume that the corrosion rate y is correlated with a linear model with interaction
parameters (modelX). The corresponding summary statistics shows an adjusted R-squared
of 0.7189. When calculating the first-order Sobol indices for the assumed model, it shows
the following contribution of the input variables (x 1 to x 4 ):
Figure 2.20 (and its corresponding Sobol index values) shows that x4 (temperature) is the most influential variable on the corrosion rate, followed by x2, x3, and x1.
2.7 Summary and Final Remarks
Exploratory data analysis (EDA) is a critical process that data analysts, scientists, and
engineers perform to investigate data, discover patterns, pick anomalies, check assump-
tions, and test hypotheses. There is no standard recipe for performing EDA. Nevertheless,
we propose the workflow shown in Fig. 2.21, supported by the techniques, algorithms,
and methods included in this chapter.
The raw data (or data before being processed) shall be processed through six steps,
including (i) summary statistics, (ii) data visualization, (iii) capture and treatment of out-
liers and missing values, (iv) correlation, (v) clustering and dimensionality reduction, and
(vi) sensitivity analysis to generate ‘clean data’ ready for modelling and prediction pur-
poses. The readiness of the data can be revised when modelling and predicting values,
depending on the model’s accuracy, and the physical interpretation of the analyzed phe-
nomenon or process. For instance, our EDA might suggest a dimensionality reduction that
leads to a significant accuracy loss for modelling and prediction purposes; hence, we can
decide to omit this step. Another example can be associated with capturing and treating
outliers; some chemical processes, for example, might exhibit extreme values due to the
intrinsic nature of the processes even under normal operating conditions; therefore, we
cannot simply delete these values. EDA is critical for defining accurate models, and this
analysis must be accompanied by a process engineering interpretation of the process to
be successful and lead to representative models.
Data Disclosure
The data was generated, maintaining the described problem’s physical meaning of the
analyzed phenomenon or process.
Problems
2.1 Analyze the Exploratory Data Analysis reported for the data used for modelling nitrogen dioxide concentration levels across Germany. The article can be downloaded from: https://www.sciencedirect.com/science/article/pii/S2352340921006089. The dataset can be downloaded from: https://zenodo.org/record/5148684#.Yp-J-ajMJro, file DATA_MonitoringSites_DE.csv. Perform your own EDA with the tools provided in this chapter and compare it with the one reported in the article.
2.2 Perform an Exploratory Data Analysis on the water treatment plant experiment described by Souza et al. (2013) [4]. The objective is to estimate the fluoride concentration in the effluent of an urban water treatment plant. The corresponding dataset can be downloaded from: https://home.isr.uc.pt/~rui/publications/datasets.html, see ‘WWTP’. Discuss and report your conclusions about the dataset.
Resources
Recommended Readings
Chakraborty, S., & Dey, L. (2023). Computing for data analysis: Theory and practices. Springer Verlag, Singapore.
Roy, Kavika (2022). Dimensionality reduction techniques in Data Science. KDnuggets. https://www.kdnuggets.com/2022/09/dimensionality-reduction-techniques-data-science.html.
Frost, J. (2023, May 18). Box plot explained with examples. Statistics By Jim. https://statisticsbyjim.com/basics/graph-groups-boxplots-individual-values/.
Glen, G., & Isaacs, K. (2012). Estimating Sobol sensitivity indices using correlations. Environmental Modelling & Software, 37, 157–166. https://doi.org/10.1016/j.envsoft.2012.03.014.
Glen, S. (2022, January 14). Cook’s distance / Cook’s D: Definition, interpretation. Statistics How To. https://www.statisticshowto.com/cooks-distance/.
Giudici, P. (2013). Statistical models for data analysis. Springer.
Irizarry, R. A. (n.d.). Introduction to data science. Chapter 28 Smoothing. http://rafalab.dfci.harvard.edu/dsbook/smoothing.html.
McGregor, M. (2020, September 21). 8 clustering algorithms in machine learning that all data scientists should know. freeCodeCamp.org. https://www.freecodecamp.org/news/8-clustering-algorithms-in-machine-learning-that-all-data-scientists-should-know/.
Menon, K. (2023, January 12). The complete guide to skewness and kurtosis: Simplilearn. Simplilearn.com. https://www.simplilearn.com/tutorials/statistics-tutorial/skewness-and-kurtosis.
n.d. (2017, October 7). Principal component analysis in R: prcomp VS princomp. STHDA. http://www.sthda.com/english/articles/31-principal-component-methods-in-r-practical-guide/118-principal-component-analysis-in-r-prcomp-vs-princomp/.
Pearson, R. K. (2018). Exploratory Data Analysis Using R. CRC Press.
Prakash, K. B. (2022). Data science handbook: A practical approach. Wiley-Scrivener.
Snehal_bm. (2021, July 8). How to treat outliers in a data set?. Analytics Vidhya. https://www.analyticsvidhya.com/blog/2021/07/how-to-treat-outliers-in-a-data-set/.
What is cluster analysis? When should you use it for your results?. Qualtrics. (2022, November 30). https://www.qualtrics.com/experience-management/research/cluster-analysis/.
Zhang, X., Trame, M., Lesko, L., & Schmidt, S. (2015). Sobol sensitivity analysis: A tool to guide the development and evaluation of systems pharmacology models. CPT: Pharmacometrics & Systems Pharmacology, 4(2), 69–79. https://doi.org/10.1002/psp4.6.
References
1. Behera, S., & Suresh, A. K. (2019). Data on interfacial hydrolysis kinetics of an aromatic acid chloride. Data in Brief, 26, 104337. https://doi.org/10.1016/j.dib.2019.104337
2. Andrady, A. (2015). Degradation of plastics in the environment. Plastics and Environmental Sustainability, 145–184. https://doi.org/10.1002/9781119009405.ch6
3. Mallick, J., Singh, C., AlMesfer, M., Kumar, A., Khan, R., Islam, S., & Rahman, A. (2018). Hydro-geochemical assessment of groundwater quality in ASEER region, Saudi Arabia. Water, 10(12), 1847. https://doi.org/10.3390/w10121847
4. Souza, F., Araújo, R., Matias, T., & Mendes, M. (2013). A multilayer-perceptron based method for variable selection in soft sensor design. Journal of Process Control, 23(10), 1371–1378. https://doi.org/10.1016/j.jprocont.2013.09.014
5. California Air Resources Board. Inhalable Particulate Matter and Health (PM2.5 and PM10) | California Air Resources Board. (n.d.). https://ww2.arb.ca.gov/resources/inhalable-particulate-matter-and-health
6. Taiyun Wei, V. S. (2021, November 18). An introduction to corrplot package. https://cran.r-project.org/web/packages/corrplot/vignettes/corrplot-intro.html
3 Data-Based Modelling for Prediction
Simple regression models include simple linear regression, which is the most common
form of regression analysis, multivariate linear regression, splines, multivariate adap-
tive regression splines, other functions (such as exponential, logarithmic, and polynomial
functions), response surface regression, Kriging, among others.
Simple linear regression is the linear regression approach that attempts to model the
relationship between an explanatory or independent variable x and a dependent variable
y.
A simple linear regression model has an equation of the form
y = mx + b (3.1)
where m is the slope and b is the intercept. These parameters are estimated from the sample means of x and y,

\bar{X} = \frac{\sum_{i=1}^{n} x_i}{n}    (3.2)

\bar{Y} = \frac{\sum_{i=1}^{n} y_i}{n}    (3.3)

where n is the number of ordered pairs, as

m = \frac{\sum_{i=1}^{n} (x_i − \bar{X})(y_i − \bar{Y})}{\sum_{i=1}^{n} (x_i − \bar{X})^2}    (3.4)

b = \bar{Y} − m\bar{X}    (3.5)
Scatter plots allow us to observe the linearity between the variables x and y. Analytically,
the linearity is confirmed or not using the coefficient of determination r 2 , which shows the
proportion of the variance in y that is predictable from x:
r^2 = 1 − \frac{RSS}{TSS}    (3.6)

where RSS is the residual sum of squares and TSS is the total sum of squares:

RSS = \sum_{i=1}^{n} (y_i − f(x_i))^2    (3.7)

TSS = \sum_{i=1}^{n} (y_i − \bar{Y})^2    (3.8)
Example 3.1 The hydrolysis of a prescription drug is a first-order reaction with respect to the drug. Data for this reaction at 25 °C and pH 7.0 is provided in the file Ex3.1.csv. Determine the
rate constant.
The following package ggplot2 must be installed. The package ggplot2 is a data
visualization package for R.
A preliminary exploratory data analysis using a scatter plot reveals a linear relationship
between the variables (Fig. 3.1).
###General
```{r, echo = TRUE}
setwd("C:/Book/Chapter3/Examples")
data <-read.csv(file="Ex3.1.csv",head=TRUE,sep=",")
data=data[c(1:4),c(2:3)];
data
plot(data$InitialRate,data$PresDrug,xlab="Concen-
tration of Prescription Drug, M",ylab="Initial Rate,
M/min",pch=19)
```
Fig. 3.1 Graphical relationship between the initial rate and concentration of the prescription drug
For a first-order reaction, the rate is proportional to the reactant concentration (rate = k[A]) and the integrated rate law is [A] = [A]_0 e^{−kt}, where [A] is the concentration of the reactant A (prescription drug) at time t, [A]0 is the initial concentration of the reactant A, and k is the rate constant. To find the value of k, it is necessary to estimate the slope of the linear function between the variables. We use the function lm in R and present the summary statistics of the model:
###Model
```{r, echo = TRUE}
model <-lm(data$InitialRate~data$PresDrug)
summary(model)
```
Call:
lm(formula = data$InitialRate ~ data$PresDrug)
Residuals:
Min 1Q Median 3Q Max
-6.490e-06 -6.184e-07 4.021e-07 1.018e-06 3.767e-06
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.391e-06 1.657e-06 0.84 0.426
data$PresDrug 1.387e-03 3.198e-05 43.37 8.81e-11 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 2.828e-06 on 8 degrees of freedom
Multiple R-squared: 0.9958, Adjusted R-squared: 0.9952
F-statistic: 1881 on 1 and 8 DF, p-value: 8.814e-11
The value of the rate constant k (slope) is 1.387 × 10^−3 min^−1. The intercept of the linear function is 1.391 × 10^−6.
Let us examine the adequacy of the model:
• The residuals are the difference between the dependent variable’s predicted and
observed values. For all points of the dataset, the residuals are negligible.
• The standard error is the standard deviation divided by the square root of the sample
size. In this case, it is reported as negligible.
• t-values denote the confidence we have in the coefficients as a predictor. The higher
the t-value is, the greater the confidence we have. Generally, any t-value greater than
+2 or less than −2 is acceptable; our model reflects this acceptance criterion.
• Pr (>|t|) is the acronym used by R in the model output, related to the probability of
observing any value equal or larger than t. A small p-value indicates that the null
hypothesis is weak; hence, it is unlikely that observed differences are due to chance.
We can use different significance codes or thresholds, such as 0, 0.001, 0.01, 0.05 or 0.1. R marks with * symbols the p-values that fall below a specific threshold.
• The model’s residual standard error is the standard deviation of the residuals. Smaller
residual standard errors imply that the predictions are better. For this model, this value
is negligible.
• R-squared is the coefficient of determination, which ranges from 0 to 1. This statistical measure determines how well the data fits the regression model. The difference between multiple R-squared and adjusted R-squared is that adjusted R-squared accounts for the number of independent variables in the model, while R-squared does not. For this model and dataset, both values are essentially the same and very close to 1, denoting an excellent fit of the data to a linear model.
• F-statistics is used to test the significance of regression coefficients in linear regres-
sion models. If the p-value associated with the F-statistics is ≥ 0.05, then there
is no relationship between the independent and dependent variables. If the p-value
associated with the F-statistics is <0.05, then at least one independent variable is
related to the dependent variable. The F-statistics compares the joint effect of all the
variables together. In the model of this example, the p-value associated with the F-
statistics is <0.05; hence, the independent variable (initial rate) is linearly related to the
dependent variable (concentration of prescription drug).
• In linear regression models, each term is an estimated parameter that uses one degree
of freedom.
A convenient way of showing the residuals of the model is through a residual plot (Fig. 3.2):
###Residual Plot
```{r, echo = TRUE}
#Extract the residuals of the model
resi <- residuals(model)
#Plot residuals against fitted values (this call is implied by Fig. 3.2)
plot(fitted(model), resi, xlab="Fitted values", ylab="Residuals", pch=19)
```
The x-axis displays the fitted values, while the y-axis displays the residuals. From the plot, the spread of the residuals does not show a systematic pattern with the fitted values.
We can also generate a Q-Q plot, which is helpful to show if the residuals follow a
normal distribution. If the data values in this plot fall along a straight line at a 45-degree
angle, then the data follows a normal distribution (Fig. 3.3).
###Q-Q Plot
```{r, echo = TRUE}
#Create the Q-Q plot for residuals
qqnorm(resi)
qqline(resi) #reference line added to complete the chunk
```
We can observe that one residual strays from the line at the lowest theoretical quartiles,
which could indicate that the data is not normally distributed.
Multiple linear regression (MLR) is a statistical technique that uses two or more indepen-
dent variables to predict the outcome of one dependent variable. The MLR model has the
form
y_i = b_0 + \sum_{j=1}^{p} b_j x_{ij} + e_i    (3.10)

where
y_i ∈ R is the real-valued response for the i-th observation
b_0 ∈ R is the regression intercept
b_j ∈ R is the j-th predictor’s regression slope
x_{ij} ∈ R is the j-th predictor for the i-th observation
e_i ∼ N(0, σ^2) is a Gaussian error term, assumed to be an unobserved random variable
i ∈ {1, . . . , n} indexes the observations
p is the number of predictors (p > 1)
The multivariate (multiple) linear regression (MvLR) model has the form

y_{ik} = b_{0k} + \sum_{j=1}^{p} b_{jk} x_{ij} + e_{ik}    (3.11)

where
y_{ik} ∈ R is the k-th real-valued response for the i-th observation
b_{0k} ∈ R is the regression intercept for the k-th response
b_{jk} ∈ R is the j-th predictor’s regression slope for the k-th response
x_{ij} ∈ R is the j-th predictor for the i-th observation
Example 3.2 Multiple linear regression and multivariate linear regression. The degra-
dation of lithium-ion batteries is observed as a decrement in their capacity over time. When
these batteries are at rest, degradation is caused by the temperature and state-of-charge
(SOC). Model the degradation for the dataset included in Ex3.2.csv as a function of the
inverse of the temperature (1/K), SOC (%), and time (d).
###General
```{r, echo = TRUE}
setwd("C:/Book/Chapter3/Examples")
data <-read.csv(file="Ex3.2.csv",head=TRUE,sep=",")
data=data[c(1:83),c(2:4)];
data
```
###Model 1
```{r, echo = TRUE}
model <- lm(data$Capacity ~ data$InvT + data$SOC + data$time)
summary(model)
```
Call:
lm(formula = data$Capacity ~ data$InvT + data$SOC + data$time)
Residuals:
Min 1Q Median 3Q Max
-6.6809 -2.9024 -0.4834 3.2543 8.0319
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 7.950e+01 8.065e+00 9.858 2.08e-15 ***
data$InvT 7.989e+03 2.468e+03 3.237 0.00176 **
data$SOC -8.050e-02 1.489e-02 -5.407 6.61e-07 ***
data$time -1.521e-02 2.662e-03 -5.714 1.88e-07 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
The summary results of the previous model suggest that a multiple linear model does
not accurately relate the stress factors with the capacity, with an adjusted R-square =
0.4501, and a residual standard error of 3.54. The coefficients b0 (intercept), b1 (for
InvT), b2 (for SOC), and b3 (for time) were estimated. All variables exhibit a very low p-value; therefore, they can all be considered significant when predicting capacity.
Now, let us include linear interaction parameters in the previous model and examine
the accuracy of the model.
###Model 2
```{r, echo = TRUE}
model <- lm(data$Capacity ~ data$InvT + data$SOC + data$time + data$InvT*data$SOC + data$InvT*data$time + data$SOC*data$time)
summary(model)
```
Call:
lm(formula = data$Capacity ~ data$InvT + data$SOC + data$time +
data$InvT * data$SOC + data$InvT * data$time + data$SOC *
data$time)
Residuals:
Min 1Q Median 3Q Max
-6.2272 -2.4546 0.1815 2.0677 6.9136
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.585e+02 4.298e+01 3.687 0.000424 ***
data$InvT -1.819e+04 1.341e+04 -1.356 0.179096
data$SOC -5.682e-01 4.704e-01 -1.208 0.230849
data$time -1.724e-01 4.921e-02 -3.504 0.000771 ***
data$InvT:data$SOC 1.715e+02 1.470e+02 1.167 0.246988
data$InvT:data$time 5.521e+01 1.511e+01 3.654 0.000473 ***
data$SOC:data$time -2.474e-04 9.288e-05 -2.664 0.009422 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
As can be seen in the previous results, both the residual standard error and R-squared
improve when compared to model 1. Nevertheless, low p-values are restricted to the inter-
cept and time and the linear interaction parameters between the inverse of the temperature
and time, and SOC and time. Is it possible to reduce the dimensionality of this model by eliminating those variables exhibiting high p-values? How would it affect the accuracy of
the model? Is a linear model enough to accurately predict the capacity as a function of
the inverse of the temperature, SOC, and time? Later in this chapter, we answer these
questions using ‘response surface regression’ when solving this example.
Notes (for an exponential model of the form y = a·b^x):
– b must be non-negative.
– When b > 1, we have an exponential growth model.
– When 0 < b < 1, we have an exponential decay model.
y is the response variable, x is the predictor variable; a and b are the regression
coefficients that describe the relationship between y and x.
R2 denotes the relative predictive power of a logarithmic model; its value varies
between 0 and 1.
Example 3.3 Exponential decay. The isotope IT-99 is used as a radioactive tracer. It decays
by a process called isomeric transition, where it releases gamma rays and low-energy elec-
trons. The decay factors for IT-99 are given in Ex3.3.csv. Find a decay model for this
dataset.
###General
```{r, echo = TRUE}
setwd("C:/Book/Chapter3/Examples")
data <-read.csv(file="Ex3.3.csv",head=TRUE,sep=",")
data=data[c(1:24),c(1:2)];
data
```
###Model
```{r, echo = TRUE}
model <-lm(log(data$decay)~data$time)
summary(model)
betas <- model$coefficients
A <- exp(betas[1]);
b <- exp(betas[2]);
A
b
```
Equation 3.15 is first modelled as a logarithmic function (log in R is the natural logarithm); the coefficients or betas of the model are then extracted, and the coefficients of the exponential function are finally obtained by exponentiating them.
The corresponding summary statistics is:
Call:
lm(formula = log(data$decay) ~ data$time)
Residuals:
Min 1Q Median 3Q Max
-0.0030848 -0.0006722 -0.0003317 0.0003191 0.0050043
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.0050043 0.0006272 -7.979 6.17e-08 ***
data$time -0.2298932 0.0001129 -2035.734 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Intercept)
0.9950082
data$time
0.7946185
The R-squared is essentially 1; therefore, the regression is very accurate. Moreover, the residual standard error is also negligible.
y = a0 + a1 x + a2 x2 + . . . + an xn (3.16)
where ai are the coefficients of the polynomial terms, and n is the degree of the
polynomial function. a0 is typically referred to as the intercept.
Polynomial regression is a special linear regression case since we fit the polyno-
mial equation on the data with a curvilinear relationship between the dependent and
independent variables.
R2 denotes the relative predictive power of a polynomial model; its value varies between 0 and 1.
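As a minimal, self-contained sketch of a polynomial fit in R (with made-up x and y vectors rather than one of the book's datasets), lm can be combined with poly:
###Polynomial regression (sketch)
```{r, echo = TRUE}
#Hypothetical sketch: fit a cubic polynomial y = a0 + a1*x + a2*x^2 + a3*x^3
set.seed(100)
x <- seq(0, 10, by = 0.5)
y <- 2 + 0.5*x - 0.3*x^2 + 0.02*x^3 + rnorm(length(x), sd = 0.5)
poly_model <- lm(y ~ poly(x, 3, raw = TRUE)) #raw = TRUE reports a0..a3 directly
summary(poly_model)
```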
The response surface regression (RSR) explores and finds the relationship between
several independent variables and one or more response or dependent variables. RSR
produces a polynomial regression model with cross-product terms of variables denoting
the interaction between them. For instance, a response variable y, which depends on the
variables x 1 , x 2 , and x 3 , can be modelled using an RSR model with an equation of the
form
y = b_0 + b_1 x_1 + b_2 x_2 + b_3 x_3 + b_{1−2} x_1 x_2 + b_{1−3} x_1 x_3 + b_{2−3} x_2 x_3 + b_{1−1} x_1^2 + b_{2−2} x_2^2 + b_{3−3} x_3^2    (3.17)
where b0 is the intercept, bi are the linear coefficients of the RSR, bi−i are the coefficients
of the quadratic terms, and bi− j are the coefficients of the interaction terms.
Example 3.4 Response surface regression. In this example, we add the quadratic terms for
Example 3.2.
###Model
```{r, echo = TRUE}
b1<-data$InvT;
b2<-data$SOC;
b3<-data$time;
b12<-data$InvT*data$SOC;
b13<-data$InvT*data$time;
b23<-data$SOC*data$time;
b11<-data$InvT*data$InvT;
b22<-data$SOC*data$SOC;
b33<-data$time*data$time;
model <- lm(data$Capacity ~ b1+b2+b3+b12+b13+b23+b11+b22+b33)
summary(model)
```
Call:
lm(formula = data$Capacity ~ b1 + b2 + b3 + b12 + b13 + b23 +
b11 + b22 + b33)
Residuals:
Min 1Q Median 3Q Max
-6.016 -2.481 -0.109 2.230 5.726
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -3.021e+02 1.649e+02 -1.832 0.071009 .
b1 2.640e+05 9.901e+04 2.666 0.009442 **
b2 -5.346e-01 4.829e-01 -1.107 0.271910
b3 -1.806e-01 4.771e-02 -3.786 0.000312 ***
b12 1.566e+02 1.616e+02 0.969 0.335652
b13 5.320e+01 1.438e+01 3.700 0.000416 ***
b23 -2.501e-04 8.876e-05 -2.818 0.006206 **
b11 -4.305e+07 1.515e+07 -2.841 0.005817 **
b22 2.153e-04 6.068e-04 0.355 0.723756
b33 3.516e-05 1.994e-05 1.764 0.081998 .
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
The residual standard error and R-squared have considerably improved compared to
the simple linear model and the one including linear interaction terms. P-values for coef-
ficients b2 and b12 suggest that both terms, the linear for SOC and the linear interaction
parameter between the inverse of the temperature and SOC are not significant. Likewise,
the quadratic term b22 can also be discarded. Let us neglect these terms and re-estimate the model:
###Model - simplified
```{r, echo = TRUE}
b1<-data$InvT;
b3<-data$time;
b13<-data$InvT*data$time;
b23<-data$SOC*data$time;
b11<-data$InvT*data$InvT;
b33<-data$time*data$time;
simplified_model <- lm(data$Capacity ~ b1+b3+b13+b23+b11+b33)
summary(simplified_model )
```
Call:
lm(formula = data$Capacity ~ b1 + b3 + b13 + b23 + b11 + b33)
Residuals:
Min 1Q Median 3Q Max
-6.1257 -2.5201 -0.1696 2.4765 6.2959
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -3.541e+02 1.564e+02 -2.264 0.026404 *
b1 2.820e+05 9.605e+04 2.936 0.004392 **
b3 -1.754e-01 4.672e-02 -3.754 0.000338 ***
b13 5.247e+01 1.426e+01 3.678 0.000436 ***
b23 -2.839e-04 4.687e-05 -6.058 4.91e-08 ***
b11 -4.373e+07 1.472e+07 -2.970 0.003983 **
b33 3.484e-05 1.980e-05 1.760 0.082432 .
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
As seen in the previous summary, the dimensionality reduction has not significantly
impacted the accuracy of the simplified model, as the residual standard error and R-
squared do not significantly differ compared to the full quadratic model.
3.2.5 Splines
Example 3.5 Splines. True boiling point (TBP) distillation is a batch distillation process
used to characterize crude oils. It is generated by plotting the cumulative volume distillation
fraction with temperature. The dataset Ex3.5.csv includes typical TBP data for crude oil.
We are asked to create a model ‘correlating’ the increasing temperature as a function of the
cumulative volume.
The following code allows for loading and plotting the data.
###General
```{r, echo = TRUE}
setwd("C:/Book/Chapter3/Examples")
data <-read.csv(file="Ex3.5.csv",head=TRUE,sep=",")
data=data[c(1:14),c(1:2)];
data
plot(data$Volume...,data$Temp...R,xlab = "Cumula-
tive volume, %", ylab= "Temperature, R")
```
We can infer in Fig. 3.4 that polynomials can be used to represent the true boiling point.
Cubic polynomial splines can indeed be used to model the TBP, allowing for intersecting
many points for interpolation purposes. In this example, we use ‘smoothing splines’, a mathematically more complex version of splines that is smoother and more flexible, as it does not require selecting the number of knots or cut-points.
###Install package
```{r, echo = TRUE}
install.packages("npreg")
```
###Model
```{r, echo = TRUE}
library(npreg)
model <- smooth.spline(x=data$Volume..., y=data$Temp...R)
plot(x=data$Volume...,y=data$Temp...R)
lines(model,col="blue")
```
Figure 3.5 shows the goodness of fit of the smoothing spline model when fitting to the
TBP data.
Example 3.6 MARS. The dataset Ex3.6.csv includes data of the heat transfer coefficient h
as a function of five design parameters (x 1 , x 2 , x 3 , x 4 , x 5 ) associated with a battery thermal
management system (BTMS). Create a model of h as a function of the design parameter
using MARS.
###Install package
```{r, echo = TRUE}
install.packages("earth")
```
###General
```{r, echo = TRUE}
setwd("C:/Book/Chapter3/Examples")
data <-read.csv(file="Ex3.6.csv",head=TRUE,sep=",")
data=data[c(1:35),c(1:6)];
```
###MARS
```{r, echo = TRUE}
library(earth)
mars1 <-earth(data$h~ ., data = data,degree=3)
print(mars1)
summary(mars1)
datax = data[c(1:35), c(1:5)];  #predictors only (all rows, to match data$h in the plot below)
prediction <- predict(mars1, datax)
prediction
plot(prediction, data$h, xlab="predicted 'h'", ylab="real 'h'")
best_line <- lm(prediction ~ data$h)
abline(best_line, col="red")
```
coefficients
(Intercept) 223.082128
h(5.416-x5) -44.360444
h(x5-5.416) 153.465815
x4 * h(x5-5.416) -39.650983
x1 * x4 * h(x5-5.416) 2.320801
Fig. 3.7 Real versus predicted value of the heat transfer coefficient h
The R-squared of the model is close to 1 (0.9897), showing the goodness of fit of
MARS in modelling and consequently predicting h. A key feature of MARS is variable importance: the estimated order of importance of the design variables is x5, x4, and x1, with x2 and x3 unused. This algorithm includes a backward elimination feature selection routine that estimates the error as each predictor is added to the model.
A graphical representation of the goodness of fit is shown in Fig. 3.7, where the real
versus the predicted values are plotted to fit the best line (R2 = 0.9970).
The corresponding R file of this example is Ex3.6.rmd.
3.2.7 Kriging
Kriging is a spatial interpolation method; it uses a limited set of sampled data points
to estimate the value of a variable by interpolation over a continuous spatial field. For
instance, the average monthly carbon dioxide concentration over a city varies across a
random spatial field. It differs from other simple methods like linear regression or splines
since it uses the spatial correlation between sampled points to estimate the variable’s
value through interpolation in the spatial field. Kriging weights are estimated such that
points close to the location of interest have more weight than those located farther away.
The Kriging procedure is performed in two steps: first, the spatial covariance structure
of the sample points is fitted in a variogram; second, weights derived from this structure
are used for interpolation in the spatial field. Remember that covariance measures the
direction of the relationship between two variables; thus, a positive covariance indicates
that both variables tend to be high or low simultaneously, while a negative covariance
means the opposite.
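As an illustration of this two-step procedure only, a minimal sketch with the gstat and sp packages (not used elsewhere in this book), on a made-up spatial dataset with coordinates x and y and a measured variable co2, could look as follows:
###Kriging (sketch)
```{r, echo = TRUE}
#Hypothetical ordinary kriging sketch; requires install.packages(c("sp","gstat"))
library(sp)
library(gstat)
set.seed(100)
samples <- data.frame(x = runif(60, 0, 10), y = runif(60, 0, 10))
samples$co2 <- 400 + 2*samples$x - 1.5*samples$y + rnorm(60, sd = 1)
coordinates(samples) <- ~x+y      #promote the data frame to a spatial object
#Step 1: fit the spatial covariance structure with a variogram
v <- variogram(co2 ~ 1, samples)
v_fit <- fit.variogram(v, model = vgm(psill = 2, model = "Sph", range = 5, nugget = 0.5))
#Step 2: interpolate over a regular grid using the fitted variogram
grid <- expand.grid(x = seq(0, 10, by = 0.5), y = seq(0, 10, by = 0.5))
coordinates(grid) <- ~x+y
gridded(grid) <- TRUE
kriged <- krige(co2 ~ 1, samples, newdata = grid, model = v_fit)
spplot(kriged["var1.pred"])       #map of the interpolated (kriged) values
```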
In simple linear regression, the linear model has two variables, x, the independent variable,
y, the dependent variable, and the parameters m and b, as per Eq. 3.1. We use a specific
method to estimate the parameters of the model and apply a certain criterion function:
\Phi = \sum_{i=1}^{n} (\hat{y}_i − y_i)^2    (3.19)

where \hat{y}_i are the estimated values of the dependent variable, and y_i are the measured values of the dependent variable. Here, we assumed that all the observations are equally reliable; otherwise, a weighted (w) sum of squares may be minimized:

\Phi = \sum_{i=1}^{n} w_{ii} (\hat{y}_i − y_i)^2    (3.20)

In the least squares (LSQ) method, the estimators are the values of the parameters m and b which minimize the objective function \Phi. Thus, we must calculate the derivatives \partial\Phi/\partial m and \partial\Phi/\partial b, equate them to zero, and solve the resulting system of equations to find m and b.
In linear LSQ, the objective function \Phi is a quadratic function of the parameters.
Like simple LSQ, non-linear least squares (NLLSQ) is the form of least squares anal-
ysis used to find n parameters in non-linear models. In NLLSQ the objective function
is quadratic with respect to the parameters only in a region close to its minimum value;
in this case, we use a truncated Taylor series as a good approximation to the model.
Some examples of non-linear least squares solvers include Gauss–Newton (GN), QR
decomposition, and gradient methods.
The default function to solve NLLSQ problems in R is nls, which includes the solvers
GN, Golub-Pereyra for partially linear least-squares problems, and port, an algorithm with
parameter bounds constraints.
Example 3.7 Non-linear least squares. The non-linear degradation of lithium-ion batteries
is observed in the second life of the battery (approximately at 80% of its initial capacity)
due to a degradation mechanism known as lithium plating. A dataset including capacity (in
%) as a function of the number of cycles N (number of times a battery can be fully charged
and discharged) is found in Ex3.7.csv. Find the coefficients of the non-linear model using
the function nls in R.
###Install package
```{r, echo = TRUE}
install.packages("stats")
```
###General
```{r, echo = TRUE}
setwd("C:/Book/Chapter3/Examples ")
data <-read.csv(file="Ex3.7.csv",head=TRUE,sep=",")
data=data[c(1:14),c(1:2)];
data
```
The mathematical function that relates capacity and the number of cycles is:

Capacity = k·N^p    (3.21)
###Model
```{r, echo = TRUE}
library(stats)
model <- nls(data$Capacity~ k*(data$N^p),
data = data,
start = list(k=4, p = -0.1),)
summary(model)
```
Formula: data$Capacity ~ k * (data$N^p)
Parameters:
Estimate Std. Error t value Pr(>|t|)
k 220.70788 13.06278 16.90 9.89e-10 ***
p -0.30227 0.01524 -19.83 1.54e-10 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
The tasks performed by any machine learning algorithm are executed using three
techniques: supervised, unsupervised, and reinforcement learning.
In supervised learning, a model learns from labelled data and can then provide the result of the problem for new inputs. Supervised learning deals with classification and regression problems. Regression problems are used for continuous data; some examples include linear regression and
non-linear regression. Classification problems predict discrete values, for instance, the
classification/reconnaissance of images; some examples include support vector machine
and logistic regression.
In unsupervised learning, there is no clean labelled or complete dataset; hence, the
algorithm is set to find hidden features and clusters in the data; some examples include
neural networks and k-means clustering.
Random forest is a machine learning algorithm that combines the outputs of decision
trees to obtain a single result. Its training algorithm applies bootstrapping or bagging
to tree learning, used to improve accuracy and stability, as well as minimize overfitting
and reduce variance. When bagging a training set A, new training sets are generated by
sampling from A uniformly and with replacement (which means that some observations
may be repeated); this ensures that each bootstrap step is independent. This technique is
useful when neural networks are unstable. Random forest, like neural networks, handles
both classification and regression problems.
The following example illustrates using a simple neural network and random forest.
Example 3.8 Neural network and random forest. The self-diffusion coefficient of a system
(DC) is a function of its concentration, the operating temperature, and the concentration of a
specific salt. Build an unsupervised machine learning prediction model for the corresponding
dataset (with normalized values) in Ex3.8.csv.
###Install package
```{r, echo = TRUE}
install.packages("neuralnet")
install.packages("grid")
install.packages("MASS")
install.packages("nnet")
```
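A minimal sketch of a neuralnet call consistent with the settings described next is shown below; the column names DC, Conc, Temp and Salt are assumptions about Ex3.8.csv, not taken from the book:
###Neural network (sketch)
```{r, echo = TRUE}
#Sketch only: train a single-hidden-layer neural network
library(neuralnet)
setwd("C:/Book/Chapter3/Examples")
data <- read.csv(file="Ex3.8.csv", head=TRUE, sep=",")
set.seed(100)
nn <- neuralnet(DC ~ Conc + Temp + Salt, data = data,
                hidden = 8,            #8 hidden nodes
                err.fct = "sse",       #sum of squares error
                act.fct = "tanh",      #hyperbolic tangent activation
                stepmax = 1000,        #maximum steps for training
                linear.output = FALSE) #activation applied to the output
plot(nn)
```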
The previous code included 8 hidden nodes, the sum of squares error (SSE), and the hyperbolic tangent as an activation function, with a step-max of 1000 (the maximum number of steps for training the neural network). The linear output is set as FALSE (the activation function is applied to the output) (Fig. 3.8).
The NN plot includes three input nodes (independent variables), the hidden layer, and
the output layer, all linked and showing their corresponding weights. The prediction error
for this NN configuration is 14.1. In addition to tuning the number of hidden nodes,
selecting the activation function is key when building and testing NNs.
We used the entire dataset in the previous code to build/train the neural network (NN).
A good practice when building NNs is to split the dataset into train (to build the NN) and
test (for prediction purposes). This practice is known as partitioning. A typical partitioning
or split ratio for many machine learning algorithms is 70/30. An example of a typical code
that can be used for this purpose is:
###Splitting
```{r, echo = TRUE}
sample <- sample(c(TRUE, FALSE), nrow(data), replace=TRUE, prob=c(0.7,0.3))
train <- data[sample, ]
test <- data[!sample, ]
```
Now, let us check the performance of random forest. The corresponding codes for
random forest are shown below:
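A minimal sketch with the randomForest package (which would need to be installed), assuming the column name DC for the self-diffusion coefficient in Ex3.8.csv, could be:
###Random forest (sketch)
```{r, echo = TRUE}
#Sketch only: random forest regression on the same dataset
#requires install.packages("randomForest")
library(randomForest)
set.seed(100)
rf <- randomForest(DC ~ ., data = data, ntree = 500) #500 trees
print(rf)  #reports the mean of squared residuals
```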
The mean of squared residuals for random forest is 0.025. A key parameter for tuning random forest is the number of trees; we typically try values ranging from 50 to 500 trees. Finally, like NN, splitting the dataset into train (to build the random forest model)
and test (for prediction purposes) is a good practice.
Random forest performs accurately when predicting the dependent variable in the pre-
vious example. NN usually requires more data to achieve the same level of accuracy
as random forest, but they learn and benefit (increase accuracy) from large amounts of
data. On the other hand, random forest is computationally less expensive than NNs, often
experiencing no performance gain when a threshold amount of data is reached.
The support vector machine (SVM) algorithms aim to find a hyperplane in an N-dimensional space (N being the number of features) to classify the data points within the boundary of the hyperplane.
Support vectors are used to maximize the margin distance between data points of different
classes by influencing the orientation and location of the hyperplane. The function that
maximizes the margin is hinge loss; hence, the cost function is zero if the predicted value
and the actual value are of the same sign; otherwise, the loss value is calculated. The
hyperplane contains the maximum number of points, so SVM is ideal for fitting dispersed
data.
A machine learning model with several parameters must be learned from the data.
Nevertheless, there are hyperparameters chosen by humans before the training begins,
based on trial and error and even intuition. In SVM, hyperparameters are typically found
by using simple optimization strategies such as grid search (refer to Chap. 6). In the
following example, we perform an SVM-based regression, including the performance
tuning of the model.
Example 3.9 Support vector machine. Fit the dataset (in Ex3.9.csv) relating the total concentration of SARS-CoV-2 in wastewater to the number of virus cases reported by the city hospitals over time.
###Install packages
```{r, echo = TRUE}
install.packages("e1071")
install.packages("ggpubr")
```
###Linear regression
```{r, echo = TRUE}
setwd("C:/Book/Chapter3/Examples ")
data <-read.csv(file="Ex3.9.csv",head=TRUE,sep=",")
data=data[c(1:35),c(2:3)];
data
summary(model)
```
Call:
lm(formula = PCR_hospital ~ Concentration, data = data)
Residuals:
Min 1Q Median 3Q Max
-6725.3 -2857.5 -521.6 2407.4 6963.7
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -1811.2 853.1 -2.123 0.0413 *
Concentration 13913.4 1114.5 12.484 4.77e-14 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
The adjusted R-squared (0.8252) is close to 1; this might suggest a good fit for the data using a linear model; nevertheless, the residual standard error is 3726! Moreover,
a dispersed cluster of data points is located in the bottom left part of Fig. 3.9. These
features suggest that SVM might be a good model for fitting the data. The corresponding
R code for SVM is:
###SVM
```{r, echo = TRUE}
library(e1071)
setwd("C:/Book/Chapter3/Examples")
data <- read.csv(file="Ex3.9.csv", head=TRUE, sep=",")
data = data[c(1:35), c(2:3)];
data
#The SVM fit and base plot are implied by Fig. 3.10 and completed here
modelSVM <- svm(PCR_hospital ~ Concentration, data)
predictedY <- predict(modelSVM, data)
plot(data$Concentration, data$PCR_hospital, pch=16)
points(data$Concentration, predictedY, col = "red", pch=4)
```
The first section of the previous code models the two variables using SVM and plots
the model’s predicted values (red) against the measured concentration of the virus (blue).
A linear model is also added (black) for comparison purposes (Fig. 3.10).
Fig. 3.10 Untuned SVM model and linear regression model for Example 3.9
The SVM model does not accurately adjust the data; therefore, tuning is required.
Three parameters are typically required for tuning SVM models: (i) the regularization
parameter, which helps to balance the model complexity and empirical error; (ii) gamma,
which adjusts for overfitting or underfitting; and (iii) epsilon, which gives a tolerable error
of the regression model. In the following section of the code, we improve the quality of
the SVM model by modifying the parameter epsilon. The tuning is visualized in Fig. 3.11.
###Tuning SVM
```{r, echo = TRUE}
#Add linear model for comparison:
modelL <- lm(PCR_hospital ~ Concentration, data)
predictedYL <- predict(modelL, data)
points(data$Concentration, predictedYL, col = "blue", pch=4)
#Tuning SVM
tuneResult <- tune(svm, PCR_hospital ~ Concentration, data = data,
  ranges = list(epsilon = seq(0.08, 0.15, 0.01), cost = 2^(2:9))
)
print(tuneResult)
#Draw the tuning graph
plot(tuneResult)
tunedModel <- tuneResult$best.model
tunedModelY <- predict(tunedModel, data)
library("ggpubr")
cor(tunedModelY, data$PCR_hospital, method = c("pearson", "kendall", "spearman"))
rsq <- cor.test(tunedModelY, data$PCR_hospital, method = c("pearson", "kendall", "spearman"))
print(rsq)
plot(data$Concentration, data$PCR_hospital, xlab="Virus concentration (copies/L)",
  ylab="Number of reported COVID cases in hospitals", pch=16)
points(data$Concentration, tunedModelY, col = "red", pch=4)
```
In Fig. 3.11, the dark patches show the optimal zone in which the algorithm’s epsilon and computational cost are balanced. R automatically finds the optimal zone once we define a range for epsilon (from 0.08 to 0.15, in our case), which must be changed in the code and re-run, hence obtaining the optimal regression.
As shown in Fig. 3.12, the performance of the tuned SVM model is remarkable, with an R2 value of 0.99.
Fig. 3.12 Tuned SVM model and linear regression model for Example 3.9
In this section, we have studied some supervised and unsupervised machine learning algorithms that are typically used for prediction. In the next section, we explore distribution models, a set of statistical techniques with several applications in the process engineering field.
3.5 Distribution Models
A data distribution is a function that describes the values a variable can take and how frequently they occur, transforming the raw data into a meaningful summary and visualization tool. The first step we shall perform to fit the data to a specific distribution is conducting exploratory data analysis (EDA) to learn about the features in the dataset that might help us find patterns in it.
There are several types of data distribution, including Bernoulli, binomial, normal (Gaussian), Poisson, uniform, Weibull, and gamma, among others. These models are represented by a specific standard parameterization formula, where the corresponding parameters are typically estimated using a maximum likelihood estimator. In this section, we study three distribution models typically used in process, mechanical, and other engineering disciplines.
Example 3.10 Normal distribution. Let us again fit the data included in Ex2.7 (filename renamed Ex3.10.csv) to a normal distribution.
3.5 Distribution Models 93
The function fitdistr from the package MASS allows fitting the data to different types
of distribution, such as normal distribution. We can extract the distribution parameters
using estimate, then verify the normality using a histogram (visual inspection, as in Example 3.7) and the Shapiro–Wilk normality test, which is the formal and recommended test for normality.
###Install packages
```{r, echo = TRUE}
install.packages("MASS")
```
###Normal distribution
```{r, echo = TRUE}
setwd("C:/Book/Chapter3/Examples")
data <- read.csv(file="Ex3.10.csv", head=TRUE, sep=",")
library(MASS)
fit <- fitdistr(data$Temperature, "normal")
class(fit)
para <- fit$estimate
para
hist(data$Temperature, xlab="Temperature")
shapiro.test(data$Temperature)
```
The Shapiro–Wilk normality test indicates whether the data are not significantly different from a normal distribution. The p-value in this case is 0.05102 (> 0.05); therefore, the data can be considered normally distributed.
The Weibull probability density function can be written as f(x; κ, λ) = (κ/λ)(x/λ)^{κ−1} e^{−(x/λ)^κ} for x ≥ 0, where κ > 0 is the shape parameter and λ > 0 is the scale parameter of the distribution. The Weibull distribution effectively provides reliability characteristics using a relatively small sample size.
• A value of κ < 1 indicates that the failure rate decreases over time.
• A value of κ = 1 indicates that the failure rate is constant over time.
• A value of κ > 1 indicates that the failure rate increases over time.
Example 3.11 Weibull distribution. The data in Ex3.11.csv includes the days a device was on test before failure (no censored data). Fit the data to a Weibull distribution and find the device reliability at 15 years.
The data is fitted to a Weibull distribution using fitdist, and the parameters shape and
scale are then estimated. The corresponding summary and plots are shown below:
###Install packages
```{r, echo = TRUE}
install.packages("MASS")
install.packages("fitdistrplus")
install.packages("weibullness")
```
AIC and BIC are discussed in the next section of this chapter (Fig. 3.13).
We performed a test [2] to formally verify that the data follows a Weibull distribution
(p-value > 0.05):
###Weibull distribution
```{r, echo = TRUE}
setwd("C:/Book/Chapter3/Examples")
data <- read.csv(file="Ex3.11.csv", head=TRUE, sep=",")
data <- data[c(1:10), c(1:1)];
library(MASS)
library(fitdistrplus)
#Fit the Weibull distribution (this call is implied by the summary discussed above)
fitW <- fitdist(data, "weibull")
summary(fitW)
plot(fitW)
#Formal Weibull goodness-of-fit test (assumed to be wp.test from the
#weibullness package installed above)
library(weibullness)
wp.test(data)
```
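As a sketch of the reliability question in Example 3.11, assuming the test times in Ex3.11.csv are recorded in days and using the fitted object fitW from the chunk above, the reliability at 15 years is the probability of surviving beyond that time:
###Reliability at 15 years (sketch)
```{r, echo = TRUE}
#Sketch: reliability R(t) = 1 - F(t) at t = 15 years from the fitted Weibull
shape <- fitW$estimate["shape"]
scale <- fitW$estimate["scale"]
t <- 15 * 365 #15 years expressed in days (assumption about the time units)
reliability <- 1 - pweibull(t, shape = shape, scale = scale)
reliability
```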
The gamma probability density function is

f(x; α, β) = \frac{β^{α} x^{α−1} e^{−βx}}{Γ(α)}   for x > 0, α, β > 0    (3.24)

where Γ is the gamma function, α is the shape parameter, and β is the rate parameter.
Gamma distributions have been used to model degradation processes (e.g., lithium-ion
batteries) [3]; in medicine, to model the age distribution of cancer incidence [4], and
other engineering and science applications.
Example 3.12 Gamma distribution. The data in Ex3.12.csv includes the capacity fade (%
capacity) of a capacitor over time (in days). Fit the data to a Gamma distribution and estimate
the capacity after 30 days.
###Install packages
```{r, echo = TRUE}
install.packages("MASS")
install.packages("fitdistrplus")
install.packages("dgof")
```
The data is fitted to a Gamma distribution using fitdist, and the parameters shape
and scale are then estimated. The corresponding summary and plots are shown below
(Fig. 3.14):
###Gamma distribution
```{r, echo = TRUE}
setwd("C:/Book/Chapter3/Examples ")
data <-
read.csv(file="Ex3.12.csv",head=TRUE,sep=",")
data<-data[c(1:17),c(1:2)];
data
library(MASS)
library(fitdistrplus)
#Fit the Gamma distribution (this call is implied by the summary discussed above)
fitG <- fitdist(data$Capacity, "gamma")
summary(fitG)
plot(fitG)
#Gamma test
library(goft)
gamma_test(data$Capacity)
```
We performed a test [5] to formally verify that the data follows a Gamma distribution
(p-value > 0.05):
data: data$Capacity
V = -1.4082, p-value = 0.3194
3.6 Model Performance and Validation
Several metrics are used to measure model performance. Any metric or error estimated with respect to the data used to train or validate a predictive model is called in-sample, while metrics computed on new test data are called out-of-sample. The difference between the predicted value and the actual value for the in-sample data is called the residual for each point, while the corresponding out-of-sample difference is called the prediction error.
A model evaluation can be performed for model selection, comparison and/or tuning.
Many techniques and metrics are used, sometimes simultaneously for cross-validation
purposes, including: (i) regression performance metrics, such as R2 and adjusted R2, mean squared error (MSE) or root mean squared error (RMSE), mean absolute error, F-score, and others; (ii) bias–variance trade-off and/or model complexity metrics,
such as the residual sum of squares; and (iii) model validation and selection metrics, such
as the Akaike information criterion (AIC) and Bayesian information criterion (BIC).
The main errors associated with predictive analytics are both in-sample and out-of-
sample errors. Model performance on training data is typically optimistic; therefore, the
data errors are usually low compared to the out-of-sample errors. A crucial decision-
making process for a data analyst is considering trade-offs between the types of errors. In
many cases, such as evaluating health outcomes using machine learning techniques, false
negatives might be expected, depending on the selected trade-off for the estimated errors.
The AIC is an estimator of prediction error, allowing for estimating the relative quality of statistical models for a given dataset. In practice, we select a set of potential models to represent/predict the data, and then we calculate the models’ corresponding AIC values. Let us say that we have three potential candidates for a given dataset, and we calculate the values of AIC_1, AIC_2, and AIC_3, and let AIC_{min} be the minimum of these three values. Then, we estimate the quantity exp((AIC_{min} − AIC_i)/2), or relative likelihood of a model, indicating the probability that the i-th model minimizes the information loss. Therefore, the best-fit model is the one that explains the greatest amount of variation using the fewest number of independent variables. The BIC is related to the AIC; when comparing several models, the ones with lower BIC are preferred; nevertheless, a lower BIC does not always indicate that one model is better than another!
Example 3.13 Model selection. Let us compare models for the data included in Ex3.2.csv,
using the AIC and BIC.
###Install packages
```{r, echo = TRUE}
install.packages("AICcmodavg")
```
###General
```{r, echo = TRUE}
setwd("C:/Book/Chapter3/Examples ")
data <-read.csv(file="Ex3.2.csv",head=TRUE,sep=",")
data=data[c(1:83),c(1:4)];
data
```
Now, let us fit the data to a linear (model 1) and a full quadratic model (model 2).
###Model 1 - linear
```{r, echo = TRUE}
b1<-data$InvT;
b2<-data$SOC;
b3<-data$time;
model1 <-lm(data$Capacity~b1+b2+b3)
summary(model1)
```
###Model 2 - quadratic
```{r, echo = TRUE}
b1<-data$InvT;
b2<-data$SOC;
b3<-data$time;
b12<-data$InvT*data$SOC;
b13<-data$InvT*data$time;
b23<-data$SOC*data$time;
b11<-data$InvT*data$InvT;
b22<-data$SOC*data$SOC;
b33<-data$time*data$time;
model2 <- lm(data$Capacity ~ b1+b2+b3+b12+b13+b23+b11+b22+b33)
summary(model2)
```
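A minimal sketch of the comparison, using the AICcmodavg package installed above and assuming model1 and model2 from the chunks just shown, could be:
###AIC and BIC comparison (sketch)
```{r, echo = TRUE}
#Sketch: rank the candidate models by information criteria
library(AICcmodavg)
models <- list(model1, model2)
model.names <- c("linear", "full quadratic")
aictab(cand.set = models, modnames = model.names) #AIC-family ranking (AICc by default)
bictab(cand.set = models, modnames = model.names) #BIC-based ranking
```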
Another popular test to decide if a sample comes from a population with a specific distribution is the Kolmogorov–Smirnov test. An important advantage of this test is that the distribution of its test statistic does not depend on the cumulative distribution function being tested; moreover, it is an exact test, so its validity does not rely on having a large sample size. Some of the disadvantages of this test include that it tends to be more sensitive near the centre of the distribution than at the tails, it only applies to continuous distributions, and, perhaps most critically, the location and parameters of the distribution must be predefined.
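As a minimal, self-contained illustration (with simulated values rather than one of the book's datasets), the test is available in base R as ks.test, with the candidate distribution and its parameters specified in advance:
###Kolmogorov-Smirnov test (sketch)
```{r, echo = TRUE}
#Hypothetical illustration of the Kolmogorov-Smirnov test
set.seed(100)
x <- rnorm(50, mean = 10, sd = 2)
ks.test(x, "pnorm", mean = 10, sd = 2) #a large p-value: cannot reject this distribution
```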
3.7 Correlation and Causality
While causation and correlation can exist simultaneously, we cannot always state that correlation implies causation. A correlation implies that there is a statistical association between variables (two or more variables are related). Causation indicates that one event or variable causes another event or response. An appropriate design of experiment (DOE) typically reveals causation; for instance, we can run an experiment where similar groups receive different treatments, and then we can record and analyze the outcomes of each group to finally conclude that a treatment does or does not cause an effect if and only if
the groups have significantly different outcomes. Causal inference in statistics involves
studying a system by measuring one variable that we suspect might affect the measure
of another. Three conditions are required to claim causal inference: (i) covariation, (ii)
discarding rival explanations for the association between variables, and (iii) temporal
ordering. Researchers deal with causality by trying to provide a framework to rightfully
claim it; several efforts include general theories like the Structural Causal Model where
we analyze the covariance of any pair of observed variables and understand the inter-
action and/or confounding of variables as well as counterfactuals, attributions, and other
potential ingredients affecting causal inference.
How is machine learning (ML) addressing causality? Causal supervised learning, for
instance, can help us enhance predictive models (e.g., using invariant feature learning
approaches); causal generative modeling, on the other hand, can provide the basis to
generate counterfactual samples. Causal Reinforcement Learning proved to be efficient in
de-confounding data. But, despite these and several other research efforts and how promis-
ing ML looks at solving the causation dilemma, it is quite challenging to get evaluation
data for ML-causation-based algorithms. This book presents a series of recommended
readings at the end of this chapter, as this subject deserves an extended and deep under-
standing. However, in Example 3.14, we illustrate the use of causal forest to deal with
generated data on a health outcome (binary).
Example 3.14 Causal random forest. A causal forest can be used to estimate the conditional
average treatment effect. When a treatment assignment is binary and unconfounded, we can
estimate the potential outcomes of the two possible treatment states. Let us estimate whether the
variables socioeconomic state, race, parity, sex, and concentration of a carcinogen (in parts
per million, converted to binary, with values above a limit threshold set to 1) might or
might not cause an illness (expressed as a probability, also converted to a binary variable
using a binomial distribution), employing this algorithm in R. The corresponding
data is included in Ex3.14.csv.
The variable BINaml, created in the code below, transforms the probability of getting the illness
into a binary variable (simulating binomial trials).
###Install packages
```{r, echo = TRUE}
install.packages("grf")
#install sufrep from zip file
```
###General
```{r, echo = TRUE}
library(sufrep)
setwd("C:/Book/Chapter3/Examples ")
data <-
read.csv(file="Ex3.14.csv",head=TRUE,sep=",")
data=data[c(1:160),c(1:9)];
dataX=data[c(1:160),c(1:8)];
#The probability of getting the illness
dataY=data$Prob;
BINaml <- rbinom(length(dataY),1,dataY);
```
###Causal forest
```{r, echo = TRUE}
library(grf)
n<-length(dataY)
W<-rbinom(n,1,0.5)
c.forest<-causal_forest(dataX,BINaml,W,tune.num.reps=80)
average_treatment_effect(c.forest)
```
The ATE is a causal estimand that calculates the difference between the potential outcomes
that would be observed if the exposure of all individuals were set to 1 versus 0. The ATE is
interpreted as the difference in risk when everyone (in the population) is exposed versus
when no one is. A value different from zero might indicate causality.
While the previous example and its approach are quite simplistic, we encourage the
reader to verify causation between variables, as this will lead to accurately predicting events
and responses and to explaining why those events and/or responses happen. Better decisions are
based on causation, not correlation.
In this Chapter, we showed the use of data-based modelling for prediction purposes from
datasets to describe the behaviour of a system and/or a process. We explored different
regression models (linear and non-linear), supervised and unsupervised machine learning
algorithms, and distribution models. Moreover, we examined their adequacy by discussing
physical meaning when predicting values (physical validation), model performance and
validation using estimators/errors, and model comparison and selection criteria. Finally,
we emphasized the need for the engineer or researcher
to verify the causation of the model/dataset, as better decision-making processes can be
effectively performed when the cause and effect between variables is understood and
verified.
Data Disclosure
The data was generated, maintaining the described problem’s physical meaning of the
analyzed phenomenon or process.
Problems
3.1 Use the different datasets in this chapter; fit the data and perform predictions
using different regression models, ML techniques and distribution models. Compare
models’ performance.
3.2 For problem 3.1, perform a sensitivity analysis on the training and tuning parameters
of the ML techniques.
3.3 Design and run a causality problem to verify lithium-ion degradation’s cause(s).
Resources
Npreg in R: https://rdocumentation.org/packages/npreg/versions/1.0-9
earth in R: https://rdocumentation.org/packages/earth/versions/5.3.2
stats in R: https://rdocumentation.org/packages/stats/versions/3.6.2
neuralnet in R: https://rdocumentation.org/packages/neuralnet/versions/1.44.2
grid in R: https://rdocumentation.org/packages/grid/versions/3.6.2
MASS in R: https://rdocumentation.org/packages/MASS/versions/7.3-58.3
brnn in R: https://rdocumentation.org/packages/brnn/versions/0.9.2
randomForest in R: https://rdocumentation.org/packages/randomForest/versions/4.7-1.1
caret in R: https://rdocumentation.org/packages/caret/versions/6.0-94
e1071 in R: https://rdocumentation.org/packages/e1071/versions/1.7-13
ggpubr in R: https://rdocumentation.org/packages/ggpubr/versions/0.6.0
fitdistrplus in R: https://rdocumentation.org/packages/fitdistrplus/versions/1.1-11
weibullness in R: https://rdocumentation.org/packages/weibullness/versions/1.23.8
dgof in R: https://rdocumentation.org/packages/dgof/versions/1.4
R-codes and data repository: https://github.com/CHE408UofT/DGSD_UofT
Recommended Readings
Martens, H. (2023). Causality, machine learning and human insight. Analytica Chimica
Acta, 1277, 341585. https://doi.org/10.1016/j.aca.2023.341585
Rodríguez, E. M. (2023, March 27). Causal ML: What is it and what is its importance?.
Plain Concepts. https://www.plainconcepts.com/causal-ml/
Steinwart, I., & Christmann, A. (2008). Support Vector Machines. Springer New York.
Sullivan, W. (2017). Machine Learning for Beginners Guide Algorithms: Decision
tree & random forest introduction. Healthy Pragmatic Solutions Inc.
Thomas, S. (2022, March 2). What is a residual in stats? Outlier. https://articles.outlier.org/what-is-a-residual-in-stats
Burger, S. V. (2018). Introduction to machine learning with R: Rigorous mathematical analysis. O’Reilly Media, Inc.
What is Random Forest? IBM. (n.d.). https://www.ibm.com/topics/random-forest
Wickham, H., Çetinkaya-Rundel, M., & Grolemund, G. (2023). R for Data Science:
Import, Tidy, transform, visualize, and model data. O’Reilly Media, Inc.
References
The modern control theory (MCT), also known as model-based control (MBC), was
introduced in the 1960s, with the parametric state-space model developed by Kalman.
MBC refers to the control basis for linear and non-linear systems. Linear control systems
methodologies typically include robust control, zero-pole assignment, and linear-quadratic
regulator (LQR) design, while non-linear control systems methodologies typically include
feedback linearization, backstepping controllers, etc. MBC requires first modeling or identifying
the plant or process and then designing the controller using a plant model representing the true
system, with certainty equivalence being the fundamental assumption in this theory. Hence, a
model-based controller might only work efficiently if the plant model falls into the assumed model
set. Alternatively, data-driven or data-based control (DBC) approaches were proposed, in which the
controller is designed using online or offline input/output (I/O) data from the controlled process,
without requiring an explicit mathematical model of it. Therefore, DBC controller design uses and
depends only on plant measurement I/O data; stability, robustness, and convergence are guaranteed
by mathematical analysis under reasonable assumptions. DBC works efficiently for processes whose
identification-based or first-principles models are highly non-linear or of too high an order.
DBC approaches include (i) online data-based methods such as simultaneous per-
turbation stochastic approximation, model-free adaptive control, and unfalsified control
methodology; and (ii) offline data-based methods like the PID control method, the itera-
tive feedback tuning (IFT), the correlation-based tuning, and others. Most DBC methods
are designed using controller parameter tuning approaches, and some of them assume the
controller structure a priori. In any case, there is no established theoretical framework
for DBC, and advancements in this theory have skyrocketed with the advent of machine-
learning techniques and hardware capabilities. In this Chapter, we illustrate the use of
data-based modelling for control purposes through two examples: one on basic control theory,
tuning a proportional-integral-derivative (PID) controller, and another regarding model
predictive control.
A simple inspection using a plot reveals that the optimal values for Kp , Ti , and Td
correspond to set 1 (Fig. 4.2).
###Optimal set
```{r, echo = TRUE}
setwd("C:/Book/Chapter4/Examples")
data <-read.csv(file="Ex4.1.csv",head=TRUE,sep=",")
data
plot(data$Set, data$Error, pch = 10, col = 2,
     xlab="Set", ylab="Error", xlim=c(1,12), ylim=c(9.4,10))
```
Nevertheless, minimizing the cost function fun (created using a full quadratic
polynomial) reveals that the optimal values are 10, 0.89, and 0.01, respectively:
###Optimization
```{r, echo = TRUE}
install.packages("stats")
library(stats)
#betas: coefficients of the fitted quadratic model for the error
fun=function(x)
  betas[1]+betas[2]*x[1]+betas[3]*x[2]+betas[4]*x[3]+
  betas[8]*x[1]^2+betas[9]*x[2]^2+betas[10]*x[3]^2
result <- optim(fn=fun, par = c(1,1,0.01),
                lower=c(0,0,0), upper=c(10,1,0.06), method="L-BFGS-B")
result
```
Call:
lm(formula = data$Error ~ b1 + b2 + b3 + b12 + b23 + b13 + b11 +
b22 + b33)
Residuals:
         1          2          3          4          5          6
-5.204e-18  1.792e-17  4.962e-02 -2.114e-02  3.404e-02 -4.237e-02
         7          8          9         10         11         12
-6.704e-02  2.940e-02  1.596e-02  5.719e-02 -9.221e-02  3.657e-02
$par
[1] 10.0000000 0.8921718 0.0100000
$value
[1] 9.747709
Classic control suits most control problems. Model Predictive Control (MPC) is a fairly general
control approach that can be applied to almost all control problems. This approach is based on
real-time optimization of a mathematical model; MPC predicts the future system behavior and
determines the optimal trajectory of the manipulated variable by adjusting a process model. A
simplified block diagram of an MPC-based control loop is shown in Fig. 4.3, where u is the
manipulated variable acting on the controlled variable y.
An excellent example of an MPC code (in Python and Excel) is presented in Ref.
[2]. In their example, a sequence of manipulated variable (MV) adjustments that drive
the controlled variable (CV) is estimated for a linear dynamic model along an expected
reference trajectory or target.
MPC uses a dynamic model of the response of the process variables. Changes to the manipulated
variables (the control moves) are calculated to force the process variables to follow a
predefined trajectory until reaching a target. An optimal
controller is designed by minimizing the error from the trajectory. MPC models can be
linear, empirical, first principle, machine learning, and hybrid-based. MPC is typically
used in MIMO (multiple input–multiple outputs) systems.
In the example presented in Ref. [2], the process variable is driven toward the target in
a SISO system (single input, single output system), by moving the manipulated variable.
The cost function is a quadratic error between Model and Target, which is minimized via
optimization.
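Although Ref. [2] implements this idea in Python and Excel, a minimal R sketch (the first-order process model, horizon, move penalty, and bounds below are our illustrative assumptions, not the model from Ref. [2]) conveys the mechanics: a sequence of MV moves is computed by minimizing a quadratic cost between the predicted CV trajectory and the target.
```{r, echo = TRUE}
# assumed first-order discrete model: y[k+1] = a*y[k] + b*u[k]
a <- 0.9; b <- 0.5
y0 <- 0; target <- 1          # current CV value and target
N  <- 10                      # prediction/control horizon
mpc_cost <- function(u) {
  y <- numeric(N); yk <- y0
  for (k in 1:N) {
    yk   <- a * yk + b * u[k] # predicted CV trajectory
    y[k] <- yk
  }
  # quadratic tracking error plus a small penalty on MV moves
  sum((y - target)^2) + 0.01 * sum(diff(c(0, u))^2)
}
res <- optim(par = rep(0, N), fn = mpc_cost, method = "L-BFGS-B",
             lower = rep(-1, N), upper = rep(1, N))
res$par[1]   # only the first move is applied; the optimization is repeated at the next step
```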
PID controllers are popular in industrial control systems since they compare measured data with a
reference value, minimizing the error between them so that the system reaches and stays at a
setpoint value. The parameters of PID controllers must be tuned based on the
requirements of system performance. Conversely, MPC depends on the process’s dynamic
models; these models are typically linear models obtained by system identification or non-
linear models that can be prescribed by machine learning tools in complex systems. The
optimizer finds an optimal control input to minimize a cost function relating to the model
response and a predefined target. MPC is an ideal option for MIMO systems.
In this Chapter, we showed the use of data-based modelling for control purposes
through one example for tuning a proportional-integral-derivative (PID) controller. While
this example is a simple illustration of the enormous potential of using data-based
modelling for control purposes, for the control of complex systems, such as manufacturing
processes, we recognize that machine learning algorithms offer clear advantages over
traditional control approaches, as they can effectively handle uncertainties and changes in
the process and can be used to derive models of the plant for analysis, simulation,
controller design, and model-based estimation for control.
Data Disclosure
The data was generated, maintaining the described problem’s physical meaning of the
analyzed phenomenon or process.
Problems
4.1 Review, run, and discuss the set of MPC examples provided in: https://github.com/
rhalDTU/MPCR.
4.2 Propose possible control strategies for a distillation column. Discuss their advan-
tages and disadvantages.
Resources
nls in R: https://www.rdocumentation.org/packages/stats/versions/3.6.2/topics/nls.
optim in R: https://www.rdocumentation.org/packages/stats/versions/3.6.2/topics/optim.
R-codes and data repository: https://github.com/CHE408UofT/DGSD_UofT.
Recommended Readings
Dummies guide to PID. PID for Dummies - Control Solutions. (n.d.). https://csimn.com/CSI_pages/PID.html
Hou, Z.-S., & Wang, Z. (2013). From model-based control to data-driven control: Survey, classification and perspective. Information Sciences, 235, 3–35. https://doi.org/10.1016/j.ins.2012.07.014
Kouvaritakis, B., & Cannon, M. (2016). Model predictive control: Classical, robust and stochastic. Springer.
Seborg, D. E. (2019). Process dynamics and control. Wiley.
Zulu, A. (2017). Towards explicit PID control tuning using machine learning. In 2017 IEEE AFRICON, Cape Town, South Africa, 2017 (pp. 430–433). https://doi.org/10.1109/AFRCON.2017.8095520
Zhou, Y. (2022, April 23). A summary of PID control algorithms based on AI-enabled embedded systems. Security and Communication Networks, 1939-0114. https://doi.org/10.1155/2022/7156713
References
When we optimize a process or a system, we look at selecting the best element from
a set of available options under certain conditions or criteria. An optimization problem
consists of minimizing or maximizing a real function, typically called objective function,
cost function, or loss function, by systematically choosing input values from within an allowed
set and computing the value of the function.
An optimization problem is mathematically represented by:
• A local optimum is a minimum or maximum of f within a given region of the
input space. Unless f is convex (in minimization), we can get several local minima.
Nonconvex problems are tackled with global optimization, which looks at converging
to the actual optimal solution of these problems.
• A global minimum is a point where the value of f is smaller than or equal to the value
at all other optimal solutions or feasible points.
and bounded by
$$x_i^L \le x_i \le x_i^U, \quad i = 1, 2, \ldots, n \qquad (L:\ \text{lower},\ U:\ \text{upper}) \qquad (5.3)$$
The non-dominated solution is the set of all solutions that are not dominated by any
element of the solution set; the feasible space of the non-dominated set is called the
Pareto-optimal set, while the boundary of all points mapped from the Pareto-optimal
set is called the Pareto-optimal front (see Fig. 5.1). The goal of the MOO is to find a
diverse set of solutions, interpreted as a favorable trade-off between the objectives and,
therefore, located as close as possible to the Pareto-optimal front.
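As a minimal illustration of these definitions (the two-objective sample points below are arbitrary and both objectives are minimized), the non-dominated set can be extracted as follows:
```{r, echo = TRUE}
# each row is a solution evaluated on two objectives (both minimized)
objs <- rbind(c(1, 9), c(2, 7), c(3, 8), c(4, 4), c(6, 3), c(7, 5))
dominates <- function(p, q) all(p <= q) && any(p < q)
# a solution is non-dominated if no other solution dominates it
non_dominated <- sapply(1:nrow(objs), function(i)
  !any(sapply(1:nrow(objs), function(j) j != i && dominates(objs[j, ], objs[i, ]))))
objs[non_dominated, ]   # approximation of the Pareto-optimal front
```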
When designing PID controllers, for instance, optimization can be conducted by mini-
mizing an error function to find the parameters of the controller, using the optim function
and the method L-BFGS-B, or limited memory Broyden-Fletcher-Goldfarb-Shanno, an
optimization algorithm in the family of quasi-Newton methods, looking at finding zeroes
or local maxima and minima of functions. In this Chapter, we introduce optimization
methods grouped as (i) Grid Search, Random Search, and Gradient Search, (ii) evo-
lutionary algorithms such as genetic algorithms and particle swarm, and (iii) Bayesian
inference, including MOO problems.
5.2 Grid Search, Random Search, and Gradient Search
Grid search is an optimization algorithm that selects the best values from a list of provided
value options. This algorithm is typically used for hyperparameter tuning. A domain (e.g.,
a hyperparameters domain) is divided into a discrete grid; then, every combination of the
grid is tried, and performance metrics are calculated to verify the effectiveness of the
algorithm in finding optimal values.
Random search, on the other hand, initializes x with a random position in the search space,
samples a new position y from a hypersphere surrounding the current position, and, if f(y) < f(x),
moves to the new position by setting x = y; the procedure is iterated until a termination
criterion is met.
Examples 5.1 and 5.2 illustrate the use of grid search and random search to identify
the variables of the Arrhenius equation.
Example 5.1 Arrhenius equation–grid search. The Arrhenius equation represents the
fraction of collisions with enough energy to overcome the activation energy in a chemical
reaction. The formula of the Arrhenius equation is as follows:
$$k = A\,e^{-\frac{E_a}{RT}} \qquad (5.4)$$
where k is the rate constant, A is the pre-exponential factor, Ea is the activation energy, R
is the universal gas constant, and T is the absolute temperature. In the following code, we
use an equally spaced grid to identify (find) the parameters A and Ea.
###Install packages
```{r, echo = TRUE}
install.packages("pracma")
```
```{r, echo = TRUE}
#Arrhenius example: k, R and Temp (true rate constant, gas constant and
#absolute temperature) are assumed to be defined beforehand
start_time <- Sys.time()
n<-20000;
m<-20000;
A<-seq(450,500,length.out=n);   #equally spaced grid for A
Ea<-seq(12,18,length.out=m);    #equally spaced grid for Ea
f_tar<-100000;
f_error<-numeric(n);
for (i in 1:n) {
  for (j in 1:m) {
    kdot<-A[i]*exp(-Ea[j]/(R*Temp));
    f_error[i]<-(k-kdot)^2;
    if (f_error[i]<=f_tar) {
      f_tar<-f_error[i];
      x1_opt<-A[i];
      x2_opt<-Ea[j];
    } else {
      iter<-i;
      break;
    }
  }
}
x1_opt;
x2_opt;
kdot
end_time <- Sys.time()
end_time - start_time
```
By searching A and E a in an equally spaced grid within the ranges of 450–500 and
12–18, respectively, we find the values of 480.0 and 15.6, minimizing an error function
between the true rate constant and the approximated value. The error was estimated to be
a 4.1% relative error, with a running time of 1.2 min.
Example 5.2 Arrhenius equation–random search. In the following code, we use random
search to identify (find) the parameters A and Ea (see example 5.1).
###Random search
```{r, echo = TRUE}
start_time <- Sys.time()
#k, R and Temp are assumed to be defined beforehand, as in Example 5.1
numIter<-10000000;
f_tar <- 100000;
f_error<-numeric(numIter);
Ea<-runif(numIter,12,18);    #random candidates for Ea
A<-runif(numIter,450,500);   #random candidates for A
for (i in 1:numIter) {
  kdot<-A[i]*exp(-Ea[i]/(R*Temp));
  f_error[i]<-(k-kdot)^2;
  if (f_error[i]<=f_tar) {
    f_tar<-f_error[i];
    x1_opt<-A[i];
    x2_opt<-Ea[i];
  } else {
    iter<-i;
    break;
  }
}
x1_opt;
x2_opt;
end_time <- Sys.time()
end_time - start_time
```
By randomly searching A and Ea within the ranges of 450–500 and 12–18, respectively, we find
the values of 475.5 and 17.7, minimizing an error function between the true rate constant and
the approximated value. The error was estimated to be a 1.3% relative error, with a running time
of 1.7 s. Random search performs better than grid search when considering accuracy and
computational cost. The grid search performance can be improved by using a finer mesh or a
non-equally spaced mesh.
Evolutionary algorithms (EAs) are inspired by nature, as they emulate the process of
natural selection. They include four steps: initialization, selection, genetic operators, and
termination. In EAs, fitter members survive and reproduce, while unfit members die off
without contribution to the gene pool of subsequent generations.
Genetic algorithms (GA) are a subset of EA. The process cycle of GA is shown in
Fig. 5.2.
The GA process begins with initialization, a stage where an initial population of can-
didate solutions is randomly generated as binary strings. At each generation step, a pool
of parents is selected using a selection mechanism to pass on the genetic material to sub-
sequent generations. Then, a child population is created by variation operations, such as
crossover and mutation, forming the basis of the next generation. Crossover usually takes
pairs of parents from the parent population using random selection with replacement
until the new child population reaches the same size as the original parent population.
Mutation, on the other hand, introduces new material into the population by randomly
changing codons on the chromosome. Once the child population is created, the children
are evaluated by assigning a fitness value to rank the population. Then, the old popula-
tion is replaced with the new child population, usually with the generational replacement
method. Finally, the GA terminates after a predetermined number of iterations or until a
stopping criterion has been met.
In the previous example, we determined the optimal solutions for the operating parameters
leading to maximizing the profit of a process plant. Highly non-linear surrogate models
obtained from simulated and/or plant data can be coupled to genetic algorithms for mini-
mization or maximization purposes, destined to optimize costs, energy consumption, and
environmental restrictions. Now, let us study the method and solutions generated by using
another popular optimization algorithm: Particle Swarm Optimization.
Particle Swarm Optimization (PSO) is a simple optimization algorithm not dependent
on the gradient or differential form of the objective function. It is a biologically based
algorithm where some members of a flock of birds, for instance, can profit from the
experience of all other members of the flock. Each bird helps find the optimal solution
in a high-dimensional solution space; the approach is heuristic since the found solution is
close to the global optimum. The algorithm starts with several random points or particles
on the plane and then looks for the minimum point in random directions; after a certain
number of iterations, the minimum point of the function becomes the minimum point ever
searched by the swarm of particles.
Example 5.3 Genetic algorithm and particle swarm optimization. The tension–com-
pression spring design problem shown in Fig. 5.3 is a continuous constrained problem
where we look at minimizing the volume V of a coil spring under a constant tension/
compression load [1]. This problem has been discussed by several researchers employing
different optimization algorithms, as shown in reference [1]. The mathematical formulation
of the optimization problem is shown as follows:

$$\min\; V(x) = (x_3 + 2)\, x_2\, x_1^2 \qquad (5.5)$$

where x1 is the wire diameter, x2 is the mean coil diameter, and x3 is the number of active coils.
Let us solve the optimization problem of minimizing the volume V using genetic
algorithms and particle swarm optimization.
The lower and upper bound of each variable is shown in Table 5.1.
The inequality constraints for the optimization problem are given as follows:
$$1 - \frac{x_2^3 x_3}{71785\,x_1^4} \le 0 \qquad (5.6)$$

$$\frac{4x_2^2 - x_1 x_2}{12566\,(x_2 x_1^3 - x_1^4)} + \frac{1}{5108\,x_1^2} - 1 \le 0 \qquad (5.7)$$

$$1 - \frac{140.45\,x_1}{x_2^2 x_3} \le 0 \qquad (5.8)$$

$$\frac{x_2 + x_1}{1.5} - 1 \le 0 \qquad (5.9)$$
Note: In many cases, constraints can also be surrogate models fitted with the corresponding
data.
Genetic Algorithm and Particle Swarm Optimization
```{r, echo = TRUE}
library(GA)
library(pso)
#penalized fitness function, shared by GA and PSO
fitness <- function(x) {
  #objective function
  fitness_value <- (x[3]+2)*x[2]*x[1]^2;
  #constraints
  co1 <- 1 - x[2]^3*x[3]/(71785*x[1]^4);
  co2 <- (4*x[2]^2-x[1]*x[2])/(12566*(x[2]*x[1]^3 - x[1]^4)) + 1/(5108*x[1]^2) - 1;
  co3 <- 1 - (140.45*x[1])/(x[2]^2*x[3]);
  co4 <- (x[2]+x[1])/1.5 - 1;
  #imposing constraints (penalty when any constraint is violated)
  fitness_value <- ifelse(co1 <= 0 & co2 <= 0 & co3 <= 0 & co4 <= 0,
                          fitness_value, fitness_value + 1000)
  return(fitness_value)
}
#genetic algorithm, run with the settings reported in the output below
GA <- ga(type = "real-valued", fitness = fitness,
         lower = c(0.05, 0.25, 2), upper = c(2, 1.3, 15),
         popSize = 150, maxiter = 1000, elitism = 8,
         pcrossover = 0.8, pmutation = 0.1)
summary(GA)
#particle swarm optimization
set.seed(90)
psoptim(rep(NA,3), fn = fitness, lower = c(0.05, 0.25, 2),
        upper = c(2, 1.3, 15))
```
GA settings:
Type = real-valued
Population size = 150
Number of generations = 1000
Elitism = 8
Crossover probability = 0.8
Mutation probability = 0.1
Search domain =
x1 x2 x3
lower 0.05 0.25 2
upper 2.00 1.30 15
GA results:
Iterations = 1000
Fitness function value = 1087.244
Solution =
x1 x2 x3
[1,] 1.989891 1.297155 14.98572
── PSO ───────────────────
$par
[1] 0.05378783 0.40934120 8.76019214
$value
[1] 0.01274305
$counts
function iteration restarts
13000 1000 0
$convergence
[1] 2
$message
Both optimization algorithms (GA and PSO) perform poorly when minimizing the objec-
tive function since this is a convex optimization problem, which means that all the
constraints are convex functions, and the objective function is also a convex function (if
minimizing). In this case, gradient descent is recommended as an effective optimization
algorithm. Would any setting improve the previous solutions?
Now let us solve a multi-objective optimization (MOO) using a popular algorithm:
NSGA-II.
The non-dominated sorting genetic algorithm II (NSGA-II) is an algorithm that effec-
tively deals with issues such as (i) computational complexity, (ii) non-elitism approach,
and (iii) the specification of a sharing parameter. Moreover, the dominance is modified to
solve constrained MOO problems efficiently. In the following example, we illustrate the
use of NSGA-II in solving an optimization problem involving two objective functions.
$$\text{effectiveness} = \frac{1 - e^{-\mathrm{NTU}(1-c)}}{1 - c\,e^{-\mathrm{NTU}(1-c)}} \qquad (5.10)$$
The lower and upper bound of each variable is shown in Table 5.2.
The corresponding code in R is shown as follows:
return(c(f1,f2))
}
res <- nsga2(hex, 2, 2, generations=1000,
             lower.bounds=c(0.1, 1), upper.bounds=c(0.9, 10),
             mprob=0.2, cprob=0.8, popsize=200, vectorized=FALSE)
Figure 5.4 shows different Pareto-optimal solutions for the objective functions effec-
tiveness and cost. For instance, a feasible solution would be a heat exchanger of
effectiveness = 0.9, costing approximately 3,400 USD.
The tuning parameters used in this example include mprob (mutation probability),
cprob (crossover probability), and popsize (size of the population); these values are tested
to determine the sensitivity of the solution to them, for instance, with population sizes of 50, 100,
and 200; mutation probability of 0.1, 0.2, and 0.5 (and less typical values of 0.8 and 1.0);
and crossover rate of 0.5, 0.6, up to 0.9.
5.6 Bayesian Inference and Optimization
Bayesian optimization aims at locating a global maximum or minimum of all feasible val-
ues in the environment; the search follows a guided policy to iteratively find the sampling
location, get the observation value, and refresh the policy. The objective function can be
a black box function; hence, the interaction with the environment is done by sampling
at a specific location. This sampled value is then corrupted by noise (Gaussian), which
is an indirect evaluation of the actual sampling value. A gradient can be used to further
optimize this method and, thus, improve the functional evaluation. Mathematically, this
model can be expressed as the probability distribution of the function (which includes a
perturbation or Gaussian noise) based on a location x and a true function value, which
means that there is a normally distributed probability function for the actual observation
around the objective function, spread by the noise variance.
Bayesian inference requires a prior distribution, the likelihood for a specific parame-
ter, a posterior distribution, and the evidence of the data. In the Bayesian approach, the
parameter of interest is a random variable following a probability distribution over all
feasible values, which can be obtained by employing the Bayes rule.
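As a minimal numeric sketch of this update (a conjugate beta-binomial example with illustrative numbers, unrelated to the datasets in this book):
```{r, echo = TRUE}
# prior Beta(2, 2) on a success probability; observed 7 successes in 10 trials
a_prior <- 2; b_prior <- 2
successes <- 7; trials <- 10
# Bayes rule with a conjugate prior: posterior is Beta(a + successes, b + failures)
a_post <- a_prior + successes
b_post <- b_prior + trials - successes
c(posterior_mean = a_post / (a_post + b_post))
curve(dbeta(x, a_post, b_post), from = 0, to = 1,
      xlab = "parameter", ylab = "posterior density")
```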
Example 5.5 Bayesian optimization. The total cost (in USD) of a pipe carrying a liquid
can be calculated as the combination of individual costs, such as pipe material, installation,
depreciation, flow rate, energy cost for pumping, maintenance, liquid properties, pumping
efficiencies, and taxes. The following function gives the economical optimal pipe diameter
that minimizes this total cost [2]:
$$f = \left[\frac{D^{4.84+n}\, n X E\,(1+F)\bigl(Z + ab\,(1-\varphi)\bigr)}{(1+0.794\,L_e D)\; 0.000189\, Y K \rho^{0.84}\mu^{0.16}\left[(1+M)(1-\varphi)+\dfrac{Z M}{ab'}\right]}\right]^{1/2.84} \qquad (5.12)$$
where:
ab: fractional annual depreciation (a) and maintenance (b) on pipeline (0.2).
ab_dash: fractional annual depreciation on pumping installation (a) and installed cost of
pipeline, including fittings (b) (0.4).
D: pipe diameter (ft – unknown).
E: combined fractional efficiency of pump and motor (0.5).
F: factor for installation and fitting (6.7).
K: energy cost delivered to the motor (0.04 USD/kWh).
Le: factor for friction in fitting, equivalent length in pipe diameter per length of pipe
(2.74 1/ft).
M: factor to express cost of piping installation, in terms of yearly cost of power
delivered to the fluid.
n: exponent in pipe-cost Eq. (1.35).
P: installation cost of pump and motor (150 USD/HP).
X: cost of 1 ft of 1 ft diameter pipe (29.52 USD).
Y: days of operation per year (365 d).
Z: fractional rate of return of incremental investment (0.1).
φ: factor for taxes and other expenses (0.55).
We are aiming to find the optimal diameter by minimizing the pipe cost.
The corresponding code in R is shown as follows:
D<-0.1;
fd <- function(D) {
  numerator<-((D^(4.84+n))*n*X*E*(1+FF)*(Z+ab*(1-fi)));
  denominator<-(1+0.794*Le*D*(0.000189*Y*K*(ro^0.84)*(miu^0.16)))*((1+M)*(1-fi)+(Z*M/ab_dash));
  f<- -(Q-(numerator/denominator)^(1/2.84))^2;
  list(Score=f,Pred=0);
}
All variables are first defined and assigned to their corresponding values. An initial
value of 0.1 is assigned to the pipe diameter. The function fd includes the numerator and
denominator of Eq. 5.12 separately to define the objective function as the squared
error (with a negative sign, since the algorithm maximizes the score, so that the error is
minimized). Score and Pred
are then set as the function and zero, respectively; Pred is a table with validation/cross-
validation prediction for each algorithm iteration. The search bound for the diameter is
then defined between 0.1 and 0.8 ft. Finally, the Bayesian optimization is performed by
setting the number of iterations = 20 and initial points (random points chosen to sample
the target function before the algorithm fits the Gaussian process). Other parameters to
explore for this algorithm include the acquisition function and tuning parameters kappa
and epsilon. The best parameter (the optimal diameter) is then reported by the algorithm.
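A minimal sketch of the optimization call described above, using the rBayesianOptimization package (the seed, the number of initial points, and the acquisition settings shown are illustrative assumptions):
```{r, echo = TRUE}
library(rBayesianOptimization)
set.seed(1)
opt_result <- BayesianOptimization(
  fd,                               # objective defined above (returns Score and Pred)
  bounds = list(D = c(0.1, 0.8)),   # search bound for the diameter (ft)
  init_points = 5,                  # illustrative: random samples taken before fitting the GP
  n_iter = 20,                      # iterations of Bayesian optimization
  acq = "ucb", kappa = 2.576, eps = 0.0)
opt_result$Best_Par
```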
In Example 5.5, the model for approximating the objective function is given (Eq. 5.12).
Nevertheless, Bayesian optimization is ideal for complex optimization problems (e.g., com-
putationally expensive), as the algorithm employs a probabilistic model to optimize the
objective function. Basically, Bayesian optimization has two components: a Bayesian sta-
tistical model (e.g., a Gaussian process) to model the objective function and an acquisition
function for deciding where to sample next. Gaussian processes are ideal for the statis-
tical model, as they are tractable and flexible. The function GP_fit in R can be used to fit such Gaussian process models.
Data Disclosure
The data was generated, maintaining the described problem’s physical meaning of the
analyzed phenomenon or process.
Problems
5.1 Review, run, compare, and discuss all the previous examples using different opti-
mization algorithms. Is there a metric (or set of metrics) to effectively compare these
algorithms?
5.2 Perform sensitivity analyses on each optimization algorithm’s tuning parameters to
evaluate their impact on the found solutions.
Resources
GA documentation: https://www.rdocumentation.org/packages/GA/versions/3.2.3/topics/ga
PSO documentation: https://cran.r-project.org/web/packages/pso/pso.pdf
NSGA-II documentation: https://cran.r-project.org/web/packages/nsga2R/nsga2R.pdf
Bayesian Optimization I: https://cran.r-project.org/web/packages/rBayesianOptimization/rBayesianOptimization.pdf
Bayesian Optimization II: https://www.rdocumentation.org/packages/ParBayesianOptimization/versions/1.2.6/topics/bayesOpt
R-codes and data repository: https://github.com/CHE408UofT/DGSD_UofT
Recommended Readings
Boyd, S. P., & Vandenberghe, L. (2011). Convex optimization. Cambridge Univ. Pr.
Chaves, I.D.G., López, J.R.G., Zapata, J.L.G., Robayo, A.L., Niño, G.R. (2016). Process Optimization
in Chemical Engineering. In: Process Analysis and Simulation in Chemical Engineering. Springer,
Cham. https://doi.org/10.1007/978-3-319-14812-0_7
Deb, K., Pratap, A., Agarwal, S., & Meyarivan, T. (2002). A fast and elitist multiobjective genetic algo-
rithm: NSGA-II. IEEE Transactions on Evolutionary Computation, 6(2), 182-197. https://doi.org/
10.1109/4235.996017.
Kochenderfer, M. J., & Wheeler, T. A. (2019). Algorithms for optimization. The Mit Press.
Edgar, T. F., Himmelblau, D. M., & Lasdon, L. S. (2001). Optimization of Chemical Processes.
McGraw-Hill.
Simon, D. (2013). Evolutionary optimization algorithms: Biologically-inspired and population-based
approaches to computer intelligence. Wiley-Blackwell.
Yu, X., & Gen, M. (2013). Introduction to evolutionary algorithms. Springer London.
References
R is a robust statistical software used for data analysis in different fields. Like Python, it
includes a set of libraries and extensive documentation to build and run codes. RStudio
is an integrated development environment, or interface, for R. We used the desktop form of
this tool, available at https://posit.co/download/rstudio-desktop/, which takes us to the
RStudio installation/download page.
CRAN is the R network that includes up-to-date versions of codes and documenta-
tion for R. CRAN can be accessed through https://cran.r-project.org/. R documentation is
found in the following link: https://www.rdocumentation.org/packages/dgof/versions/1.4.
Finally, CRAN also contains a comprehensive introduction to R worth checking: https://cran.r-project.org/doc/manuals/r-release/R-intro.html.
Data analysis and data analytics are sometimes interchangeable in many contexts, but
they are, indeed, different. Data analysis is extracting meaning from data to make the right
decision. Data analytics is a more complex process since we use data and techniques to
find new and/or complex insights to make enhanced predictions. In data analysis, we
collect, manipulate, and examine the data for insight. In data analytics, we analyze and
work with the data to make the right decision. A third term without a universal definition
is machine learning (ML). We can safely define machine learning as a
branch of computer science that combines data and algorithms to solve complex problems
that could be cost-prohibitive for humans. It is often associated with artificial intelligence
or AI (typically used interchangeably), but ML can be seen as a branch of AI. With
AI, humans want to create intelligent machines simulating human behavior and human
capabilities; ML, on the other hand, while learning and adapting through experience, is
used for more specific tasks and applications.
There are different sources of open datasets all over the web. From shared data collected
by agencies (e.g., COVID-19 datasets) to research data referred to in peer-reviewed papers
(e.g., Data in Brief, by Elsevier, a publishing company), and open datasets. Kaggle is, per-
haps, one of the most visited sites (https://www.kaggle.com/datasets) due to its ample
collection of datasets and codes, learning tools, and discussion forums. One interest-
ing dataset for process engineers to test their knowledge in data analytics is Tennessee
Eastman Process Simulation Dataset for anomaly detection evaluation (https://www.kaggle.com/datasets/averkij/tennessee-eastman-process-simulation-dataset); be aware of the
user agreement of any dataset, for accessing, downloading, modifying, and citing or
referencing.
Assessing the physical meaning of the results generated from exploratory data analy-
sis (EDA), data-based modelling, data-based control, and optimization tasks makes data
analytics meaningful for process engineers.
Clean data typically results from managing outliers and missing data, clustering, and
performing dimensionality reduction. These tasks seem quickly achievable by running a
set of codes and techniques: an eyes-closed process! Let us imagine mathematically omit-
ting outliers associated with an unknown modification in the feed characterization of a
process plant or discarding variables identified in first-principle equations. Would your
model be reliable and representative of your phenomenon? What would be the impact of
using your model in a decision-making process? The physical meaning includes clearly
understanding, when possible, your variables’ potential interaction and effects. Risky sit-
uations can arise when, for instance, correlations defeat causality; our responsibility as
process engineers is to provide meaningful and accurate solutions; therefore, you must
comprehend how your process or system works! In some specific cases, this understand-
ing is quite challenging: in the early stages of investigating degradation in capacitors and
lithium-ion batteries, it was impossible to identify its causes and effects, as they are asso-
ciated with side reactions, challenging to measure and/or quantify. There are also complex
settings such as modelling earth sciences or out-of-space-related data or health outcomes.
While failing to account for confounding variables can play a role in accurately modelling
these systems, our inference techniques shall be supported at least by educated assumptions
and hypotheses and, moreover, mathematically cross-validated using two or more techniques to
solve a specific task, or even by testing our models within the operational envelope to verify
that they keep physical meaning. For instance, we can use a training dataset to generate a model
to predict the pressure drop in equipment, with R2 close to 1 and negligible residuals, and still
test our model to discover that, within an expected range of operating variables, it provides
negative values. We can see similar out-
comes in optimization when our constraints and bounds are poorly defined, for example,
leading to unfeasible physical solutions! The message is clear for process engineers: We
are process engineers, we are not data analysts performing data analytics in a different
field, and as such, we understand the physical meaning of processes and systems.
In Chap. 1, we identified at least three different sources of data for analysis. When ranking
the familiarity of process engineers in acknowledging and/or using one or more of these
sources, we can think that plant process data, pilot plant, and laboratory data are, indeed,
quite familiar when we operate chemical plants or conduct research, for instance; followed
by process simulation data, when we design or troubleshoot a process, and finally, a
probably less familiar concept, synthetic data. Artificially generated data, however, can
be intuitively seen, for example, as interpolating or extrapolating data, as a means of
obtaining meaningful values when measured data is insufficient or biased.
The quality of the plant process data depends on the quality, maintenance, and loca-
tion of the sensors and instruments collecting the data. Gathering this information over
time is crucial for monitoring, controlling, and troubleshooting, but it can also be used for
design and optimization.
The quality of the pilot plant and laboratory data depends on the design of experi-
ments to capture the phenomenon at different process variable ranges, which is typically
constrained by time and resources (e.g., equipment and/or computational cost).
The quality of the data generated by process simulation software might be unquestion-
able if and only if our simulation tool is capable of strictly reproducing the system or
process we simulate, which depends on the accuracy and extent in defining, for instance,
kinetics networks (e.g., side reactions, yields, and catalysts), simulation modes (steady-
state or dynamics), thermodynamics packages, a representative feed characterization,
among others.
The quality of synthetic data is still a debatable subject, as best practices shall be stan-
dardized to make it a safe option in many fields; moreover, machine learning is a relatively
new field of study compared to process simulation, for instance, and new and more effec-
tive algorithms are being developed at a rapid pace. On the other hand, it is understandable
that limited amounts of data inevitably require fast and cost-effective ways to scale our
models; hence, synthetic data can be seen as the best option to generate data. It has been
shown in some applications that synthetic data might even enhance the performance of
predictive models! In any case, there is a critical requirement for ensuring the quality of
synthetic data: using clean data, which might require harmonization (being merged from
different sources). This step shall be followed by assessing the similarity of our synthetic
data with our real data to make proper decisions.
Finally, the combination of sources of data is also a reality in our process engi-
neering industry. Accelerated laboratory tests generate data that can also be harmo-
nized with online data captured by sensors and instruments. Reconciling data via har-
monization is a complex and iterative process that requires, as per any data analysis per
se, an understanding of the physical phenomena being measured and analyzed.
All sources of data are equally important, as they can provide different insights and
serve cross-validation purposes when designing, predicting, controlling, and optimizing
processes.
Simple visualization can save cost and time. We can observe trends and already identify
patterns for further modelling solutions. Furthermore, it allows us to identify outliers and
missing values. It is good practice for data analytics when we start analyzing our data,
but exploratory data analysis is not a trickled-down process. We might manage outliers
using specific techniques to later discover, by simple visualization, that our patterns have
completely changed! Hence, simple visualization can fit at any step of the data analytics
framework as a helpful tool that can save cost, time, and accuracy!
Principal Component Analysis is perhaps the most robust and accurate technique when
studying the relationship between variables. Saturated PCA graphs can be expected as we
increase the number of variables to analyze. We consider that running correlograms (the world
is not always linear, we know…) might help us pre-understand this relationship, which can even
be contradicted when running a PCA or computing Sobol indices.
save us cost and time when analyzing data.
Splitting is a good practice when building and testing your machine learning model,
enhancing the model’s performance when predicting. Generally, we use a training set
including 80% of the data, 10% for a validation set, and 10% for a test set. This rule is
a good split to start with; however, it is not set in stone. Splitting requires finding an optimum
split by carefully analyzing the dimension of the data, the type of model, and even factors
specific to your process or system (e.g., temporal conditions like ramping
up the temperature in a reactor, which would require clustering your data according to
the process events).
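An 80/10/10 split can be sketched as follows (df denotes a generic, already loaded dataset; the proportions and seed are illustrative):
```{r, echo = TRUE}
set.seed(42)
n <- nrow(df)                                  # df: your dataset (assumed already loaded)
idx <- sample(c("train", "val", "test"), size = n,
              replace = TRUE, prob = c(0.8, 0.1, 0.1))
train_set <- df[idx == "train", ]
val_set   <- df[idx == "val", ]
test_set  <- df[idx == "test", ]
sapply(list(train = train_set, val = val_set, test = test_set), nrow)
```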
Another important feature of machine learning tools is using standardization or normalization
to preprocess the data. Both help to avoid distorting the differences in the ranges of values.
The main difference between them is that a Gaussian distribution is assumed in standardization;
hence, variables contribute equally to the analysis. Which one is better? When using
neural networks, for example, if the data has different dimensions, sometimes it is not
helpful to make assumptions about the data distribution; therefore, normalization is the
right data preprocessing option. A process engineer must choose their data preprocessing
technique before fitting a machine learning model to enhance its performance.
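Both preprocessing options can be sketched in a few lines (x is an illustrative numeric variable):
```{r, echo = TRUE}
x <- c(250, 260, 255, 300, 280)          # illustrative process variable
# standardization (z-score): zero mean, unit standard deviation
x_std <- (x - mean(x)) / sd(x)           # equivalent to scale(x)
# normalization (min-max): rescaled to the [0, 1] range
x_norm <- (x - min(x)) / (max(x) - min(x))
rbind(standardized = round(x_std, 2), normalized = round(x_norm, 2))
```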
A particular topic when using machine learning is regularization, which refers to tech-
niques used to minimize overfitting or underfitting of the models. In neural networks, for
instance, Bayesian regularization is a powerful technique (package brnn in R); we invite
our readers to test datasets using this technique, as ML models sometimes efficiently learn
the training data but fail when generalizing for new data.
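A minimal sketch of fitting a Bayesian-regularized neural network with brnn (the toy data and the number of neurons are illustrative assumptions):
```{r, echo = TRUE}
library(brnn)
set.seed(7)
x <- seq(-3, 3, length.out = 60)
y <- sin(x) + rnorm(60, sd = 0.1)        # illustrative noisy data
fit <- brnn(y ~ x, neurons = 3)          # Bayesian regularization limits overfitting
plot(x, y); lines(x, predict(fit), col = 2)   # fitted curve
```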
Finally, what seems obvious might not be. When using model distributions to fit
data, use a reliable test to evaluate the goodness of fit of the distribution. Moreover,
for all other models, using different metrics and comparing between them is a must;
we introduced BIC and AIC, R2 , residuals, and other errors, but there are many other
ways to explore when comparing models’ performance, so make sure the data is fit to
the right model that mathematically and physically represents the process or system.
Machine learning techniques are proven to support tuning tasks for control purposes.
Process control does not always require sophisticated algorithms, as PID controllers can
often do the job. Model predictive control (MPC), on the other hand, is an advanced
subject that was not deeply covered in this book, as it requires understanding the
mathematical principles defining it; in practice, MPC is not a simple task even for one
input and one output variable, but it is a predecessor of modern control in process plants
and complex automated systems, where advanced machine learning techniques are required
to control several tasks on several input and output variables.
Artificial neural networks can be used to approximate the objective function in optimization
problems [1]. This reduces the need for less efficient heuristic solutions that might not find
optimal solutions.
Another crucial consideration when optimizing process engineering problems is select-
ing the right optimization algorithm. As we studied in Chap. 5, convex optimization
problems are not efficiently solved using genetic algorithms or particle swarm optimization. We
recommend, when possible, using gradient-based optimization algorithms and leaving
other, more complex (e.g., non-convex) problems for GA or PSO.
There is so much to learn about data analytics for process engineers in a fast-paced world
where enthusiastic people in this field quickly develop and share thousands of lines of
code and algorithms annually. We clustered the recent trends in (i) enhancing process
control and automation strategies, (ii) providing effective tools for failure analysis and
inspection, and (iii) applying big data analytics.
Some examples of these trends include (i) using convolutional neural networks for
failure localization and 3D characterization of materials and equipment [2, 3]; (ii) using
reinforcement learning in nuclear plants to improve the automated control of reactors, a
process-safety application that might bring us to a fully automatic control operation of
these plants; (iii) big data applications for process monitoring and control, manufacturing,
and modelling and optimization [4, 5].
Process engineering is one of the many fields benefiting from the advancement of
data analytics and machine learning. Process engineers must keep up with this progress
by staying updated in efficiently using these tools to ensure competitiveness in the market
and, ultimately, providing enhanced solutions to design, monitor, troubleshoot, control,
and optimize processes.
References
1. Villarrubia, G., De Paz, J. F., Chamoso, P., & la Prieta, F. D. (2018). Artificial neural networks
used in optimization problems. Neurocomputing, 272, 10–16. https://doi.org/10.1016/j.neucom.
2017.04.075
2. Paulachan, P., Siegert, J., Wiesler, I., & Brunner, R. (2023). An end-to-end convolutional neu-
ral network for automated failure localisation and characterisation of 3D interconnects. Scientific
Reports, 13(1). https://doi.org/10.1038/s41598-023-35048-0.
3. Chang, Z., Wan, Z., Xu, Y., Schlangen, E., & Šavija, B. (2022). Convolutional neural network
for predicting crack pattern and stress-crack width curve of air-void structure in 3D printed
concrete. Engineering Fracture Mechanics, 271, 108624. https://doi.org/10.1016/j.engfracmech.
2022.108624
4. Park, J., Kim, T., Seong, S., & Koo, S. (2022). Control automation in the heat-up mode of a
nuclear power plant using reinforcement learning. Progress in Nuclear Energy, 145, 104107.
https://doi.org/10.1016/j.pnucene.2021.104107
5. Sadat Lavasani, M., Raeisi Ardali, N., Sotudeh-Gharebagh, R., Zarghami, R., Abonyi, J., &
Mostoufi, N. (2021). Big data analytics opportunities for applications in process engineering.
Reviews in Chemical Engineering, 39(3), 479–511. https://doi.org/10.1515/revce-2020-0054