Mechanical Engineering

Data Analytics for Process Engineers
Prediction, Control and Optimization

Synthesis Lectures on Mechanical Engineering
This series publishes short books in mechanical engineering (ME), the engineering branch
that combines engineering, physics and mathematics principles with materials science to
design, analyze, manufacture, and maintain mechanical systems. It involves the production
and usage of heat and mechanical power for the design, production and operation of
machines and tools. This series publishes within all areas of ME and follows the ASME
technical division categories.
Daniela Galatro · Stephen Dawe
© The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Switzerland AG
2024
This work is subject to copyright. All rights are solely and exclusively licensed by the Publisher, whether the whole
or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation,
broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage
and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or
hereafter developed.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does
not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective
laws and regulations and therefore free for general use.
The publisher, the authors, and the editors are safe to assume that the advice and information in this book are
believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give
a warranty, expressed or implied, with respect to the material contained herein or for any errors or omissions that
may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and
institutional affiliations.
This Springer imprint is published by the registered company Springer Nature Switzerland AG
The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland
Preface

Process engineering is the engineering field that pioneered the development of process
simulation and process control, including advanced automation linking machine learn-
ing and optimization tools. At the same time, statistics and data-driven modelling have
fundamentally supported monitoring and troubleshooting capabilities. Therefore, process
engineering has played and is playing a key role in proving the applicability of data ana-
lytics to ensure the reliability of processes. However, the acknowledged literature in data
analytics has presented the progress of data analytics in process engineering in journals,
perhaps underrating the possibility of structurally compiling applicable data analytics tools
and exemplifying their application tailored to the process engineer’s needs. This book has
created an exclusive data analytics domain for process engineers, describing tools that
can be pragmatically used in different contexts and at different levels of expertise, from
undergraduate or graduate students looking to design unit operations or process plants
to plant engineers or researchers looking to monitor, control or optimize processes. Its pragmatism also lies in exposing the reader to these techniques without demanding a rigorous mathematical background, offering prompt applicability in the field or a starting point for further research. The data analytics workflow proposed in this book also differs from a typical workflow conceived for data scientists: data analytics for process engineers begins by understanding the sources of data (mostly continuous data), derived from process and pilot plants, from laboratories, and generated by process simulation. We continue
our workflow with exploratory data analysis (EDA), including familiar tools like simple
visualization techniques and managing outliers and missing values. We also borrowed
three core data scientists’ exploratory data analysis tools: correlograms, clustering and
dimensionality reduction. These techniques complete the skills required to get insights
from the data and identify patterns. Once investigated and cleaned, the data is ready
for the ultimate goals in process engineering: modelling, control and optimization. Data-
based modelling builds up from simple regression models, such as simple or multiple
linear regression models, splines and adaptive regression splines, to non-linear regression
models and non-linear machine learning models, including neural networks, random for-
est and support vector machine. Finally, our fitting options include distribution models
such as normal, Weibull and gamma distributions. We dedicated a section of the data-
based modelling chapter to model performance and verification, as it is crucial to address
the goodness of fit and select the best model for decision-making processes. Furthermore,
we highlight the need for testing causality, as correlation and causality might not coexist.
Moving towards process control, data analytics is presented to support the design of typ-
ical controllers and model predictive control settings. In optimization, we explore simple optimization using grid search, random search and gradient search, and then briefly introduce our readers to evolutionary algorithms, particle swarm optimization and Bayesian inference. Multi-objective optimization is also addressed, as it is common in process engi-
neering to deal with two or more conflicting objectives. While structurally providing data
analytics tools for EDA, data-driven modelling, data-driven control and optimization, we
continuously emphasize the importance of assessing the physical meaning of the outcomes of exploring and fitting data, as tools are simply tools and might deliver meaningless or misleading results if not appropriately used and/or interpreted. Our examples and dis-
cussions contribute to this purpose. R and its interface RStudio were selected as the data
analytics software for this book, since R is a well-known, robust statistical and data analytics environment. Our codes and examples are stored on a GitHub page for free access. At the
end of each chapter, we also included a list of resources (listed in order of appearance),
further recommended readings and references. In our final chapter, called final remarks,
we wrapped up key takeaways from each chapter, plus some interesting topics such as
introduction and documentation about R and RStudio; clarification on definitions like data
analysis, data analytics and machine learning; access to open datasets; data analytics and
the physical meaning of phenomena; and a brief passage on the future of data analytics in
process engineering. Finally, we would like to thank you, our readers, for choosing this book as a complementary tool in your journey in the process engineering field.
Contents

1 Sources of Data
  1.1 Plant Process Data
  1.2 Pilot Plant and Laboratory Data
  1.3 Process Simulation Data
  1.4 Synthetic Data
  1.5 Summary and Final Remarks
  Data Disclosure
  Problems
  Resources
  Recommended Readings
  References
2 Exploratory Data Analysis
  2.1 Types of Data and Types of Exploratory Data Analysis
  2.2 Summary Statistics
  2.3 Simple Visualization
    2.3.1 Time-Series Plot
    2.3.2 Scatter Plot
    2.3.3 Multivariate Scatter Plot
    2.3.4 Box Plot
    2.3.5 Histogram
    2.3.6 Temperature
    2.3.7 Wind Speed
  2.4 Outliers and Missing Values
    2.4.1 Outliers
    2.4.2 Missing Values
  2.5 Correlogram
  2.6 Clustering and Dimensionality Reduction
    2.6.1 K-means Clustering
1 Sources of Data

1.1 Plant Process Data

Plant process data is acquired by the plant automation system at the control center, mostly from the controllers in centralized or distributed control systems (DCSs). The collection of data from process variables is facilitated by instruments and by the interface between the process and the control subsystem, which deliver the measurements to the controllers.
Plant process data is collected and stored in several databases, such as Manufacturing
Execution Systems (MESs) and Laboratory Information Management Systems (LIMSs).
MES is a specialized manufacturing software used to control production and engineering
environments. LIMS is a research and development (R&D) form of MES; essentially,
it’s a real-time system that stores and tracks mostly analytical measurements from a
laboratory on quality and specification parameters of feeds, products, and intermediate
streams. This data is used for operation monitoring, fault detection, performance analysis,
operations, production, maintenance planning, parameter estimation, process simulation,
optimization, and resource planning.
The data collected through the sensor network must be accurate, reproducible, and
reliable. Accuracy refers to the ability of an instrument to measure the true value. Repro-
ducibility is the ability of a sensor to reproduce a value within a specific interval.
Therefore, a sensor can be precise and inaccurate, for instance, when several measure-
ments fall within an interval that does not contain the true value. Reliability refers to the
probability that the data will exist during a certain period.
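As a small illustration (not one of the book's examples), the accuracy and reproducibility of a sensor can be checked from repeated readings against a known reference value; the numbers below are made up:

```{r, echo = TRUE}
# Repeated readings from a sensor compared against a known true value (made-up numbers)
true_value <- 100.0
readings <- c(98.9, 99.1, 99.0, 98.8, 99.2)   # tight spread, but offset from the true value
bias <- mean(readings) - true_value           # accuracy: systematic offset from the true value
spread <- sd(readings)                        # reproducibility (precision): spread of the readings
bias
spread  # small spread with a bias of about -1: the sensor is precise but inaccurate
```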
Effective visualization of plant process data relies on (i) colours to encode meaning, (ii) patterns for quick detection of anomalies, and (iii) clear context to visualizations in terms of time, scale, and area.
A process historian is software that stores large amounts of data generated from different sources, including control and monitoring plant data, laboratory information management, and resource planning. Transforming the big data from historians into actionable infor-
mation is key for diagnostics, production, performance, maintenance, and safety. Initially,
process historians were fed data from the process control system, typically a DCS. Several
centralized and decentralized architectures support modern process historians and include
data-handling techniques like filtering. Moreover, data stored in process historians can be
used by commercially available software for analysis and reporting, including power and
energy consumption, safety and alarm monitoring and management, and mass balancing.
1.2 Pilot Plant and Laboratory Data

Small-scale datasets generated from tests performed in pilot plants and laboratories are just as valuable as big data. Though limited in volume, this pilot data is crucial, as it provides the basis for the feasible scaling-up of process plants.
A pilot plant is a system producing small volumes of products for the purpose of
obtaining insight from the novel processes and technologies used in the plant’s opera-
tion. The data provided in pilot plants is valuable for designing the full-scale plant and
upgrading and/or optimizing existing plants. On the other hand, laboratory data from process plants is generated using standard tests and collected at different sampling points and frequencies, typically providing information on the composition and properties of different streams over varying periods. This data provides
insight into the quality of feed, intermediate, and product streams and is also useful for
calibrating process composition analyzers.
Laboratory data can also be generated and collected from experiments conducted in a
laboratory but not necessarily performed in a pilot plant, aimed at analyzing the impact of
varying input variables on output variables. To capture valid data from experiments and
maximize the learning capability from the data, it is required to conduct a proper design of experiments (DOE) and perform tests using standard methodologies to ensure that all
significant factors and corresponding operating and/or design ranges that control the value
of a group of parameters are considered. The principle of DOE is that a change in one
or more independent variables or input variables is hypothesized to trigger a response in
one or more dependent variables or output variables. The purposes of the experimenta-
tion could be, for instance, to compare alternatives, identify significant inputs (factors)
that affect an output (response), achieve an optimal response, minimize variability, and
balance trade-offs. Several approaches have been used for DOEs, including full factorial
designs, response surface designs, fractional factorial, and mixture designs. All the pos-
sible combinations of levels for all factors are considered in a full factorial DOE. For
example, the output variable, thickness, depends on the input variables: speed, tempera-
ture, and viscosity. If two levels are considered for each input variable (high and low),
then the number of runs for our experiment can be calculated as 2^k, where k is the number
of factors; hence, eight total runs are calculated. A fractional factorial, on the other hand,
would consider only a fraction of the total runs calculated for the full factorial, such as
one-half or one-quarter, depending on the number of factors.
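As a quick illustration (not one of the book's examples), the runs of a two-level full factorial for the three factors above can be enumerated in R; the level labels are arbitrary placeholders:

```{r, echo = TRUE}
# Two-level full factorial for three factors: 2^3 = 8 runs (levels are illustrative)
full_factorial <- expand.grid(Speed = c("low", "high"),
                              Temperature = c("low", "high"),
                              Viscosity = c("low", "high"))
full_factorial
nrow(full_factorial)  # 8 runs; a one-half fraction would keep only 2^(3-1) = 4 of them
```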
A popular computer-aided DOE technique is the D-optimal design for multi-factor
experiments. These designs are constructed to minimize the generalized variance of the
estimated regression coefficients. In the DOE setting, the matrix X represents the data
matrix of independent variables. D-optimal designs minimize the overall variance of the
estimated regression coefficients by maximizing the determinant of X’X, and they are
typically used when resources are limited to run a full factorial design entirely. Example
1.1 illustrates the use of D-optimal DOE.
Example 1.1 Design of Experiment using D-optimal. Design an experiment for eight
runs to compile data about the evolution of the capacity fade of lithium-ion batteries over
time. The capacity fade depends on the temperature, the state-of-charge (SOC) or level of
charge of the battery relative to its capacity, and the C-rate or rate of discharge compared to
its capacity. Use D-optimal DOE.
###Install
```{r, echo = TRUE}
install.packages("skpr")
install.packages("shiny")
```
A random seed is set for reproducibility. A full quadratic model (linear, interaction,
and quadratic terms) is assumed.
###General
```{r, echo = TRUE}
library(skpr)
library(shiny)
set.seed(12345678)
candidateset <- expand.grid(Temp = c(0, 25, 45), SOC = c(20, 50, 100),
                            C_rate = c(1, 2, 3))
design <- gen_design(candidateset = candidateset,
                     model = ~ Temp + SOC + C_rate + Temp*SOC + Temp*C_rate +
                       SOC*C_rate + Temp^2 + SOC^2 + C_rate^2,
                     trials = 8)
design
```
The design matrix, which shows the optimal combination of factors for eight runs, is
shown as follows (Fig. 1.1).
Thus, for instance, run # 1 defines an experiment at 0 °C, 20% SOC, and C-rate of 3.
The number of runs depends on the resources that can be afforded, such as time and
money. Replications are ideal, as they help validate the results. Researchers look at a trade-
off between the amount of Type I and II errors they can afford to risk and the resources.
The more levels and factors we have, the more combinations (runs) are possible; hence,
more time and money are added to your project. For instance, for the previous example,
a three-level factorial design would require 27 runs instead of 8!
1.3 Process Simulation Data

Example 1.2 Simple sensitivity analysis. In this example, we explore a dataset generated from a sensitivity analysis performed in a commercial process simulation software on a gas-treating unit, where methyldiethanolamine (MDEA) is used to remove H2S and CO2
from natural gas. This dataset shows the impact of the amine concentration (in %), amine
flow rate (in USGPM), and reboiler duty of the regeneration tower on the acid gas loading
(in mol/mol) in the bottom stream of the regenerator tower. The complete dataset is included
in the file Ex1.2.csv.
We can see the effect of the amine flow rate on acid gas loading at 40% concentration
of MDEA by plotting both variables. The simplified file used for this scenario is Ex1.2_
A.csv.
###General
```{r, echo = TRUE}
setwd("C:/Book/Chapter1/R-Codes")
data <- read.csv(file = "Ex1.2_A.csv", head = TRUE, sep = ",")
data = data[c(3:7), c(1:4)]
data
AmineFlow = data$AmineFlow
AcidGasLoading = data$AcidGas.Loading
plot(AmineFlow, AcidGasLoading, main = "Acid Gas Loading at 40% amine conc.",
     xlab = "Amine Flow, USGPM", ylab = "Acid Gas Loading", pch = 19)
```
Figure 1.2 shows that the acid gas loading increases as the amine flow increases (there
is a clear trend!), but how sensitive is this change? How does it compare when changing
other input variables? What is the sensitivity ranking (order of importance) of the input
variables? We will review different fitting and sensitivity analysis techniques in Chaps. 2
and 3.
Note: The acid gas loading is the amount of acid gas (H2S and CO2), on a molar basis, that will be removed by a solvent, in this case, removed by MDEA.
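As a rough first look at the sensitivity ranking, ahead of the formal techniques of Chaps. 2 and 3, one could rank the inputs by the absolute correlation of each input with the output. This sketch assumes hypothetical column names AmineConc and ReboilerDuty in Ex1.2.csv (AmineFlow and AcidGas.Loading appear in the chunk above):

```{r, echo = TRUE}
# Crude sensitivity ranking by absolute Pearson correlation with the output
# (column names AmineConc and ReboilerDuty are assumed; adjust to the actual headers)
data_full <- read.csv(file = "Ex1.2.csv", head = TRUE, sep = ",")
inputs <- c("AmineConc", "AmineFlow", "ReboilerDuty")
corrs <- sapply(inputs, function(v) cor(data_full[[v]], data_full$AcidGas.Loading))
sort(abs(corrs), decreasing = TRUE)  # larger |r| suggests a stronger linear effect
```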
1.4 Synthetic Data

In several cases, the amount of data we obtain from any data source is insufficient; this
can be intuitively interpreted as limited data due to limited resources. Simple exploratory
data analysis might also reveal that a significant amount of data can be biased or distorted
instead! In both cases, insufficient and biased data might deliver incorrect outcomes. How
do we get the data we need, at the scale we need, without considerably compromising
accuracy, balance, and data quality? The answer is: generating synthetic data. However,
can artificial data be generated (not directly measured) from machine learning algorithms
without compromising accuracy, balance, and quality? Let us illustrate the generation and
use of synthetic data with Example 1.3.
Example 1.3 Synthetic data. Amine systems are used to remove CO2 and H2S from nat-
ural gas. Their reliability is closely related to complex corrosion problems. Several factors
contribute to corrosion in these systems; for instance, the corrosion rate of carbon steel (in mm/y) in amines such as monoethanolamine (MEA) and diethanolamine (DEA) depends on the fluid temperature (in °C), heat-stable salts or HSAS (in %), the acid gas loading (in mol/
mol), and fluid velocity (in m/s) [1–5]. The dataset in Ex1.3.csv contains 200 points of these
variables. Use the following code to generate synthetic data and compare it with the original
data (for quality check purposes).
###Install
```{r, echo = TRUE}
install.packages("synthpop") #for synthetic data
```
The data was artificially created using the function syn and method cart (regression
trees). m is the number of synthetic copies of the observed data to be generated.
Linear regression uses a single predictive formula holding over the entire data space.
A single global model can be challenged when the interaction between parameters occurs
or the data exhibits non-linearity. An alternative approach to non-linear regression is the
regression tree, designed to approximate a function through a process known as binary
recursive partitioning (an iterative process that splits the data into partitions). It then
continues splitting each partition into smaller groups as the algorithm moves up each
partition.
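For illustration only, a regression tree of this kind can be fitted directly with the rpart package (the book's chunk below relies on synthpop's internal CART instead); the response column name CorrosionRate is an assumption:

```{r, echo = TRUE}
# Illustrative regression tree via binary recursive partitioning (rpart package)
# install.packages("rpart")
library(rpart)
df <- read.csv(file = "C:/Book/Chapter1/R-Codes/Ex1.3.csv")
# The response column name CorrosionRate is assumed; adjust to the actual header
tree <- rpart(CorrosionRate ~ ., data = df, method = "anova", minbucket = 10)
printcp(tree)                          # how each split reduces the error
plot(tree); text(tree, use.n = TRUE)   # visualize the partitions
```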
###Synthetic
```{r, echo = TRUE}
library("synthpop")
df_observed <- read.csv(file = "C:/Book/Chapter1/R-Codes/Ex1.3.csv")
df_synthetic <- syn(df_observed, m = 1, method = "cart", cart.minbucket = 10)
```
The synthetic data is stored in the file df.csv, where a synthetic copy of the original
200 points was generated. The command compare is used in R to graphically represent
differences between the observed and synthetic data when calculating the corrosion rate,
as shown in Fig. 1.3.
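A minimal sketch of these two steps, reusing the df_observed and df_synthetic objects from the chunk above (the output file name df.csv follows the text):

```{r, echo = TRUE}
# Write the synthetic copy to df.csv (df_synthetic$syn holds the generated data frame)
write.csv(df_synthetic$syn, file = "df.csv", row.names = FALSE)
# Graphically compare the distributions of the observed and synthetic variables
compare(df_synthetic, df_observed)
```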
In addition to the previous comparison, two corrosion rate models (fitted artificial neural networks) were obtained in MATLAB®, using (i) measured or observed data and (ii) hybrid data, respectively. Both models were tested with a different dataset, including measured data. The metric used for this comparison was the coefficient of correlation r, which shows a better predictive accuracy of the model as its value approaches 1. In this
case, the r values were 0.9862 and 0.8985, respectively, indicating that the model fitted
with measured or observed data is more accurate than the one fitted with hybrid data;
nevertheless, both models can reasonably predict the corrosion rate.
Synthetic data is built artificially. Therefore, it does not represent events occurring in the
real world. It has been used for both training and testing purposes. The main advantage of
generating synthetic data is to generate large training datasets! Data scientists are looking
at (i) data quality, balance, and variety; and (ii) scalability by supplementing data to
achieve a large scale of diverse inputs.
There are two types of synthetic data: (i) fully synthetic, which does not retain values
from the original data, relying on algorithms based on generative methods; (ii) hybrid,
which combines real and synthetic data, pairing random records from a real dataset
with synthetic records. The main drawback of hybrid datasets is the computational cost
incurred when generating records.
The main challenges of synthetic data involve (i) realism, since it must accurately
reflect the original data; (ii) bias, as the synthetic data can drag the same biases of the
original data.
How can we generate synthetic data? By using one or the combination of the following
methods:
(i) based on the statistical distribution, where we draw numbers from the distribution by observing the real statistical distribution of the real data; hence, similar factual data should be reproduced (a minimal sketch of this approach follows the list).
(ii) based on a model, generating random data with the model we create to explain the
observed behaviour or response.
(iii) deep learning, employing techniques such as Variational Autoencoder or Adversarial
Network models.
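A minimal sketch of approach (i): fit a distribution to an observed variable and sample new values from it. The normal assumption and the stand-in variable below are illustrative only:

```{r, echo = TRUE}
# Distribution-based synthetic data: estimate the observed distribution and draw from it
observed_temp <- rnorm(200, mean = 60, sd = 5)          # stand-in for a measured variable
mu <- mean(observed_temp)
sigma <- sd(observed_temp)                              # observed statistical distribution
synthetic_temp <- rnorm(200, mean = mu, sd = sigma)     # synthetic draws
summary(observed_temp)
summary(synthetic_temp)                                 # similar factual data is reproduced
```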
1.5 Summary and Final Remarks

Process data can be typically obtained from three different sources: (i) plant process data,
(ii) pilot plants and laboratory data, and (iii) process simulation data. The plant automation
system acquires plant process data at the control center, showing values read by different
sensors (instruments) located in the plant. Pilot plants and laboratory data generate limited
data that can be used for plant design purposes or to complete process plant data collected
from existing plants. This data source requires the design of experiments to ensure that
all significant factors and corresponding operating and/or design ranges that control the
value of a group of parameters are considered. Finally, process simulation models can
generate process data through sensitivity analysis. These mathematical models are based
on first principles, data-driven models, or a combination of both types, and they can be
coded or directly used from commercial simulation software.
Insufficient and biased data might deliver incorrect outcomes when predicting values.
Synthetic data can be used to generate artificial data without considerably compromising
accuracy and data quality. Some pending questions to explore when generating synthetic
data are: How much is insufficient in a dataset, and how much would be sufficient when
generating synthetic data? How well does the model generated from the synthetic or
hybrid data perform when extrapolating?
Data Disclosure
The data was generated maintaining the physical meaning of the analyzed phenomenon or process described in each problem.
Problems
1.2 Obtain synthetic data to complete measured data and correlate (visualizing graphically) the viral load in wastewater (in copies/L) versus the number of COVID-19 cases.
Resources
Recommended Readings
Amine gas treating unit: Gulf Professional Publishing. (2014). Oil and gas corrosion prevention:
From surface facilities to refineries.
Design of experiment: Freddi, A., & Salmon, M. (2019). Design principles and methodologies
from conceptualization to first prototyping with examples and case studies. Springer International
Publishing.
Synthetic data: McLachlan, S., Dube, K., Gallagher, T., Simmonds, J. A., & Fenton, N. (2019). Realistic synthetic data generation: The ATEN framework. Biomedical Engineering Systems and Technologies, 497–523. https://doi.org/10.1007/978-3-030-29196-9_25
References
1. American Petroleum Institute. (2016, April). API RP 581, Risk-based inspection technology (3rd
ed.).
2. Corrosion modeling in lean and rich amine systems. (2022). In European Federation of Corrosion (EFC) Series. Corrosion in Amine Treating Units (2nd ed., pp. 55–66). Woodhead Publishing. https://doi.org/10.1016/b978-0-323-91549-6.00007-1
3. Han Yang, J., Lin Xie, J., & Zhang, L. (2017). Study on corrosion of carbon steel in DEA aqueous solutions. IOP Conference Series: Earth and Environmental Science, 113, 012006. https://doi.org/10.1088/1755-1315/113/1/012006
4. Orozco-Agamez, J., Tirado, D., Umaña, L., Alviz-Meza, A., García, S., & Peña, D. (2022). Effects of composition, structure of amine, pressure and temperature on CO2 capture efficiency and corrosion of carbon steels using amine-based solvents: A review. Chemical Engineering Transactions, 96. https://doi.org/10.3303/CET2296085
5. Malo, J. M., Uruchurtu, J., Vasquez, R. C., Rios, G., Trejo, A., & Rinson, R. E. (2000, March).
The effect of diethanolamine solution concentration on the corrosion of steel. In Paper presented
at the CORROSION 2000, Orlando, Florida.
6. 519covid.ca. (n.d.). Retrieved May 9, 2022, from http://www.519covid.ca/
2 Exploratory Data Analysis
2.1 Types of Data and Types of Exploratory Data Analysis

Data can be classified into two groups: structured data and unstructured data. Structured
data is a form of data that is organized, such as categorical or numerical data. Unstruc-
tured data is a form of data that does not have an explicit structure, such as audio,
language text, and images.
In process engineering, we typically use numeric or continuous variables; these vari-
ables can be any value within an infinite or finite interval. Some examples of numeric
variables include temperature, pressure, and concentration. There are two types of numeric
variables: interval and ratio. An interval variable has a numeric scale in which equal differences have the same interpretation throughout the scale, but ratios are not meaningful; for example, a temperature reading that is numerically twice another does not mean the process is twice as hot. On the other hand, a ratio variable is an interval variable whose zero value indicates the absence of the quantity being measured.
In this book, we exclusively deal with numeric variables.
EDA can be cross-classified in two ways: (i) graphical or non-graphical, and (ii) univariate or multivariate.
2.2 Summary Statistics

Example 2.1 Summary statistics. Obtain the summary statistics for a dataset on the inter-
facial hydrolysis of a reagent, showing the effect of hydrolysis on kinetics and drop in pH
versus time at two different amine concentrations (1 and 2). The data is stored in the file
Ex2.1.csv. We can use different packages and commands in R to get the summary statistics, such as summary (available by default, no package required), pastecs, Hmisc, and psych. Let us use them in this example for comparison purposes. Note: An interesting work on interfacial
hydrolysis kinetics of trimesolyl chloride (a side reaction in reverse osmosis) is reported in
reference [1].
###Install
```{r, echo = TRUE}
install.packages("pastecs")
install.packages("Hmisc")
```
###General
```{r, echo = TRUE}
setwd("C:/Book/Chapter2/Examples")
data <-read.csv(file="Ex2.1.csv",head=TRUE,sep=",")
data=data[c(3:47),c(1:3)];
#METHOD 1 - Summary
summary(as.numeric(data$X1))
summary(as.numeric(data$X2))
#METHOD 2 - Pastecs
library(pastecs)
stat.desc(as.numeric(data$X1))
stat.desc(as.numeric(data$X2))
#METHOD 3 - Hmisc
library(Hmisc)
describe(as.numeric(data$X1))
describe(as.numeric(data$X2))
```
summary (for each amine concentration set)
The results from Hmisc are visualized as a data frame (please refer to the corresponding
*.rmd file).
Summary is a basic function of R to provide information about minimum and maxi-
mum values, mean, median, first and third quartiles. Pastecs also includes range, standard
deviation, missing, and null values. Finally, Hmisc also calculates the distinct value of
each column, the frequency of each value, and the proportion of that value in that column.
What do we typically observe in Summary Statistics? For instance, the data provided
in this example for amine concentration X1 has a pH range of 2.8, with a minimum value of 3.7, a maximum value of 6.6, and a standard deviation of 0.7. There is more variability (spread) in X1 in terms of standard deviation compared to X2. No nulls or missing
values are observed in the dataset.
2.3 Simple Visualization

EDA's main graphical tools are time-series, scatter plots, multi-variable charts and matri-
ces, box plots, and frequency histograms. In this section, we discuss these tools with some
examples.
2.3.1 Time-Series Plot

Example 2.2 Time-series plot. The following time-series plot shows the degradation rates
for a polymer. The degradation rate is expressed as the relative mass in percentage. Note:
An interesting work on degradation rates for a high-density polyethylene (HDPE) fiber in
the marine environment is reported in reference [2]. The complete dataset is included in the
file Ex2.2.csv, and the R code is shown as follows:
The package ggplot2, a data visualization package for R, must be installed.
###Install
```{r, echo = TRUE}
install.packages("ggplot2")
```
###General
```{r, echo = TRUE}
library(ggplot2)
setwd("C:/Book/Chapter2/Examples")
data <- read.csv(file = "Ex2.2.csv", head = TRUE, sep = ",")
```
A smoothed trend line has been added using loess, which is short for Local Regression.
Loess is a common method used to smoothen a time series. It is a non-parametric method
where least squares regression is performed in localized subsets.
A model has been created with loess, relating both variables and the function predict
has been used to predict the degradation rate at 120 years, for instance.
Figure 2.1 shows how the relative mass of the polymer decreases over time. The
predicted value at 120 years is 55.8% degradation.
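A minimal sketch of the loess fit and prediction described above, reusing the data object from the chunk; the column names Time and RelativeMass are assumed placeholders:

```{r, echo = TRUE}
# Loess smoothing of the degradation curve and prediction at 120 years
# (column names Time and RelativeMass are assumed; adjust to the headers of Ex2.2.csv)
fit <- loess(RelativeMass ~ Time, data = data, span = 0.75,
             control = loess.control(surface = "direct"))  # allows prediction beyond the data range
predict(fit, newdata = data.frame(Time = 120))              # predicted relative mass at 120 years
```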
Some practical recommendations when collecting data for time-series plots:
• Record data in sequential or chronological order; otherwise, your data cannot be used
in a time-series plot to assess patterns over time.
• Collect data at regular time intervals, for instance, once a day, once a week; other-
wise, your time-series plot may be misleading. Scatter plots are typically used when
collecting data at irregular intervals.
• Collect data at the time intervals when you want to detect patterns, for instance, week-
to-week patterns in a chemical process, with data collected at the same time each week.
Collecting data at the right frequency is crucial: collecting data less frequently will not
allow you to detect patterns. On the other hand, collecting data more frequently might
add noise to the data and will also not allow you to detect patterns clearly!
• Collect the right amount of data: you need to be sure to collect enough data showing
a long-term pattern and not only an anomaly in your process.
• Collect data, if possible, at the same sampling point.
Smoothing
Time series representations have several potential applications; we mostly use them to describe how a process evolves, helping us forecast or predict the future. To achieve this prediction goal, these plots rely on capturing the low-frequency behavior, the high-frequency behavior, or both.
A time-series plot can be decomposed into trend, seasonal, cycling, and noise com-
ponents. A clear trend is observed in Fig. 2.1, which is the plot’s only component. For
instance, seasonal components could tell us that specific trends of a variable are observed
during weekends, different than those observed during the week. Cycling components
could tell us, for instance, a specific pattern repeated over time. Finally, noise shows the
random variation over a given time interval. Smoothing attempts to remove the higher-
frequency behavior to describe the lower-frequency behavior easily; therefore, smoothing
levels can help us remove these components. Thus, a small amount of smoothing removes
the noise component, while more smoothing can remove the seasonal and cyclical com-
ponents to show one isolated trend. A bad smoothing strategy might remove more than
one component at a time, altering the behavior representation of a process. Example 2.3
shows different strategies for smoothing noise in signals.
Example 2.3 Smoothing noise. This example includes a dataset showing signal data (taken
each second) of inlet temperature in a dryer. The corresponding R code is shown as follows,
while the dataset is stored in the Ex2.3.csv file:
###Install
```{r, echo = TRUE}
install.packages("ggplot2")
```
###General
```{r, echo = TRUE}
library(ggplot2)
setwd("C:/Book/Chapter2/Examples")
data <- read.csv(file = "Ex2.3.csv", head = TRUE, sep = ",")
```
Figure 2.2 shows noisy data of temperature in time. Let us apply a smoother (loess)
and compare their efficacy in reducing noise (Fig. 2.3).
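A minimal sketch comparing two smoothing strategies on this signal (a simple moving average and loess); the column names Time and Temperature are assumed placeholders:

```{r, echo = TRUE}
# Two smoothing strategies on the noisy dryer-inlet temperature signal
# (column names Time and Temperature are assumed; adjust to the headers of Ex2.3.csv)
ma5 <- stats::filter(data$Temperature, rep(1/5, 5), sides = 2)  # 5-point moving average
lo <- loess(Temperature ~ Time, data = data, span = 0.3)        # loess smoother
plot(data$Time, data$Temperature, pch = 19, col = "grey",
     xlab = "Time, s", ylab = "Inlet temperature")
lines(data$Time, ma5, col = "blue", lwd = 2)           # moving average
lines(data$Time, predict(lo), col = "red", lwd = 2)    # loess
```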
What defines a smoother is the basis to represent the smooth function and the penalty
used to penalize the basis coefficients of the function to control the degree of smoothness.
How do we select the right smoother? In a low-noise scenario, we typically choose a simple moving average. As the noise increases, we rely on visualization or experimentation to try different smoothers. When selecting the right smoother, we should also ask
ourselves: Can we isolate certain components in our problem? What exactly do we want
to capture when forecasting? (For instance, a trend, or cyclical behaviour).
Notes: loess is a function in R that fits a polynomial surface. Loess regression can be
applied on a numerical vector to smoothen it and to predict Y.
2.3.2 Scatter Plot

A scatter plot or scatter chart uses dots to represent values for two different numeric
variables; they are used to observe relationships between variables. The data is displayed
as a collection of points, with values of one variable shown on the x-axis, and values of
the other variable shown on the vertical axis. Let us illustrate the use of a scatter plot,
solving the following example (Fig. 2.4).
Example 2.4 Scatter plot. This example is associated with a dataset including Ion-1 and
Ion-2 versus Ion-3 (as concentration ratios); this representation determines the effect of
weathering of minerals in the groundwater. Let us visualize the data with a scatter plot and
infer the potential correlation between these two variables. The corresponding R code is
shown as follows, while the data is stored in the Ex2.4.csv file. Note: An interesting work
on the effect of weathering of carbonate minerals in the groundwater is shown in reference
[3].
###General
```{r, echo = TRUE}
setwd("C:/Book/Chapter2/Examples")
data <-read.csv(file="Ex2.4.csv",head=TRUE,sep=",")
data
# Scatter plot
plot(data$Ion1.Ion3, data$Ion2.Ion3,
xlab="Ion1/Ion3 ", ylab="Ion2/Ion3", pch=19)
#abline(lm(data$Ion2.Ion3 ~ data$Ion1.Ion3), col = "red") # regression line (y~x)
```
The scatter plot allows us to infer that there is not a clear relationship between the
variables Ion-2/Ion-3 and Ion-1/Ion-3 (could be linear?). We can visualize the best straight
line by removing # in the last code line and re-running the code chunk (Fig. 2.5).
How can we fit dispersed data? Details about data fitting are included in Chap. 3.
2.3.3 Multivariate Scatter Plot

Multivariate scatter plots are used to look at the relationships between pairs of variables
in one group of plots; hence, they are helpful to describe relationships among three or
more variables. Example 2.5 illustrates the use of these plots.
Example 2.5 Multivariate scatter plot. The degradation or aging of lithium-ion batteries
is seen as the capacity reduction of the batteries over time. When the battery is at rest, aging
depends on time, temperature, and the state of charge (SOC) or charge level of the battery
relative to its capacity. Let us illustrate, by means of a multivariate scatter plot, the effect of the temperature and SOC on the average capacity after storing two sets of battery cells for 500 days: (i) for a battery cell cathode made of nickel-manganese-cobalt or NMC (data stored in the file Ex2.5a.csv) and (ii) a battery cell cathode made of nickel-cobalt-aluminum or NCA (data stored in the file Ex2.5b.csv). Note: An interesting work on the calendar aging of lithium-ion
batteries is shown in reference [4].
###General
```{r, echo = TRUE}
setwd("C:/Book/Chapter2/Examples")
data <-
read.csv(file="Ex2.5a.csv",head=TRUE,sep=",")
data
# Multivariable Scatter plot for NMC cells
pairs(~Capacity+Temp+SOC,data=data,
main="NMC aging",col= ifelse(data$Temp == 35,
"blue",ifelse(data$Temp == 25, "black","red")))
```
The data stored in Ex2.5a.csv includes three columns: Temperature (25, 35, and 50
°C), SOC (from 0 to 100%), and capacity (in %). All temperatures in the scatterplot are
identified by a different colour (35 °C is blue, 25 °C is black, and 50 °C is red). The
corresponding multivariate scatterplot is shown in Fig. 2.6.
A similar R code is required for the NCA data, stored in Ex2.5b.csv. The corresponding
multivariate scatterplot is shown in Fig. 2.7.
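A sketch of the analogous code for the NCA cells, mirroring the NMC chunk above (the same column names Capacity, Temp, and SOC are assumed in Ex2.5b.csv):

```{r, echo = TRUE}
# Multivariable Scatter plot for NCA cells (mirrors the NMC chunk above)
dataNCA <- read.csv(file = "Ex2.5b.csv", head = TRUE, sep = ",")
pairs(~Capacity + Temp + SOC, data = dataNCA,
      main = "NCA aging",
      col = ifelse(dataNCA$Temp == 35, "blue",
                   ifelse(dataNCA$Temp == 25, "black", "red")))
```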
Figures 2.6 and 2.7 show that the capacity of a battery decreases as the temperature
and SOC increase. When comparing cathode chemistries, for instance, NCA cells seem
to degrade faster (capacity drop) than NMC cells at 50 °C and degrade slower at different
temperatures for all SOCs.
2.3.4 Box Plot

Box plots help in showing the distributional characteristics of data. The box plot
terminology is shown in Fig. 2.8.
To build a box plot, use a horizontal or vertical number line and a rectangular box. The
endpoints of the axis are labelled with the smallest and largest data values. The first
quartile (Q1) shows one end of the box, and the third quartile (Q3) marks the other end
of the box. The middle 50% of the data falls inside the box. The whiskers are identified
from the ends of the box to the largest and smallest data values. The second quartile (Q2), or median, lies between Q1 and Q3 and may coincide with either of them, or with both. In some cases, we can encounter dots marking outliers' values, where the whiskers are not
extending to the minimum and maximum values. Example 2.6 shows how to interpret a
box plot.
Example 2.6 Box plot. Exposure to particulate matter with a diameter of less than 10 μm (PM10) poses a significant risk to human health. It is suggested that there is a link
between long-term exposure to PM10 and respiratory mortality [5]. The data in the file
Ex2.6.csv provides values of the days per year of the mean daily PM10 concentration for
each measurement station (denoted with the letter S, from 1 to 8) in a specific region from
2012 to 2022. Let us prepare a box plot to study this data.
###General
```{r, echo = TRUE}
setwd("C:/Book/Chapter2/Examples")
data <-read.csv(file="Ex2.6.csv",head=TRUE,sep=",")
data
boxplot(PM ~ Year, data = data, main = "PM10 per year/station",
        xlab = "Year", ylab = "PM10", ylim = c(300, 370))
```
The data stored in Ex2.6.csv includes three columns: Station (from S1 to S8), Year
(from 2012 to 2022), and PM10. The corresponding box plot is shown in Fig. 2.9.
Figure 2.9 shows, for instance, that in 2013, the PM10 varied between 314 and 339
among the 8 measurement stations. 2014 and 2017 show two outliers or values that
Fig. 2.9 Box plot for PM10, for 8 measurement stations from years 2012 to 2022
notably differ from the dataset. The mean and median share the same value of 331
(including outliers). What other important information does this plot offer us?
2.3.5 Histogram
In a histogram graph, the data is represented by numerical data points grouped according
to specified ranges called bins, showing the frequency distribution of the data. The x-
axis of a histogram represents the intervals of the measured values, while the y-axis
shows the frequency or height of the bars. Frequency refers to the number of times the
values happen within the interval or width of the bar. A histogram is useful to determine
the distribution of the data and provide us with meaningful indicators such as mean and median. Outliers can also be observed in these graphs. The distribution can be symmetric, skewed to the left, or skewed to the right, and unimodal, bimodal, or uniform.
Kurtosis is also studied when analyzing histograms, measuring whether the data is pre-
dominantly normal or not (with outliers) in terms of distribution. A perfectly normal distribution has zero excess kurtosis (a kurtosis of 3), also known as mesokurtic. Negative excess kurtosis, or platykurtic, shows thinner tails and a flatter peak. Positive excess kurtosis, or leptokurtic, shows a fat-tailed distribution with several outliers.
To build a histogram (i) we find the highest and lowest data value in the dataset; (ii) we
compute the range by subtracting the minimum value from the maximum value; (iii) we
use the range to estimate the width of our classes. Once the class width is estimated, (iv)
a class is selected considering the minimum data value, subsequent classes are generated
until a class that includes the maximum data value is obtained. Once we have organized our data by classes, we proceed to draw the histogram: mark the class boundaries on the x-axis, count the number of observations falling in each class, and draw a bar over each class whose height equals its frequency.
Example 2.7 Histogram. The dataset Ex2.7.csv includes observations of the maximum
daily temperature (in °C) and average wind speed (in m/s) at an undisclosed location. Create
histograms for the temperature and wind speed.
###Install
```{r, echo = TRUE}
install.packages("moments")
```
###General
```{r, echo = TRUE}
library(moments)   # provides skewness() and kurtosis()
setwd("C:/Book/Chapter2/Examples")
data <- read.csv(file = "Ex2.7.csv", head = TRUE, sep = ",")
data
#Temperature Histogram
summary(data$Temperature)
hist(data$Temperature,
     xlab = "Temperature, in °C",
     xlim = c(10, 40),
     col = "blue",
     freq = TRUE)
#Wind Speed Histogram
summary(data$Wind)
hist(data$Wind,
     xlab = "Wind speed, in m/s",
     col = "blue",
     freq = TRUE)
print(skewness(data$Temperature))
print(kurtosis(data$Temperature))
print(skewness(data$Wind))
print(kurtosis(data$Wind))
```
This code first provides summary statistics for the variables temperature and wind speed, followed by their skewness and kurtosis values:

2.3.6 Temperature

The skewness and kurtosis of the temperature are:
[1] -0.3587357
[1] 2.630471

2.3.7 Wind Speed

The skewness and kurtosis of the wind speed are:
[1] 0.4850302
[1] 3.324937
2.4 Outliers and Missing Values

Outliers and missing values are frequently observed when collecting data. They are also
called noise, incomplete data, or abnormal values. We often remove them, considering them unnecessary data. However, outliers and missing values express some facts about the data; therefore, we must understand the mechanism that generated such values. Management of outliers and missing values is an important step in exploratory data analysis,
as they might compromise the statistical power of the study, affect the reliability of the
data by introducing bias to the results, and reduce the accuracy of models in predicting
outcomes.
This section summarizes the main techniques to detect and treat outliers and missing
values in datasets.
2.4.1 Outliers
An outlier is an observation or value that significantly differs from other data points.
Outliers may be due to the observed phenomenon's inherent variability or to measurement errors (e.g., instrument failure). Thus, we can distinguish two classes of outliers: (i) extreme values and (ii) mistakes. Extreme values are possible but unlikely responses.
Outliers can be detected using scatter plots, box plots, and histograms. Nevertheless,
there are three statistical tests to detect outliers formally:
• Grubbs's test allows us to detect whether the highest or the lowest value in a dataset is an outlier. This test detects one outlier at a time; the null and alternative hypotheses are H0: the highest (or lowest) value is not an outlier, and H1: the highest (or lowest) value is an outlier. As for any statistical test, if the p-value is less than the chosen significance threshold (typically α = 0.05), then the null hypothesis is rejected, and we can conclude that the lowest or highest value is an outlier (a small illustration with the outliers package follows this list).
• Dixon’s test, like the Grubbs test, detects whether a dataset’s highest or lowest value
is an outlier. It is performed on the suspected outlier individually, and this test is most
useful for small sample sizes (usually n ≤ 25).
• Rosner’s test is used to detect several outliers at once. It is designed to minimize
masking, which happens when an outlier is close in value to another outlier and can
go undetected.
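For illustration, Grubbs's and Dixon's tests are available in the outliers package; this sketch (not one of the book's chunks) reuses the Wind variable from Example 2.7:

```{r, echo = TRUE}
# Grubbs's test on the most extreme value of the Wind variable
# install.packages("outliers")
library(outliers)
grubbs.test(data$Wind)          # tests whether the value farthest from the mean is an outlier
# Dixon's test is intended for small samples (n <= 25), e.g. on a subset:
# dixon.test(data$Wind[1:20])
```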
Detecting outliers is important because it can bias the fit estimates and predictions. Once
the outliers have been identified, there are three approaches you might consider treating
them:
• Imputation: We replace the outlier points with the mean/median/mode. This technique
can be applied depending on the data context; for instance, if the variation is low or the
variable has low leverage over the system’s response, then this approach is acceptable
and could lead to satisfactory results.
• Capping: We can cap observations outside the 1.5 × interquartile range (IQR) limits.
IQR measures the spread of data.
• Prediction: Outliers can be substituted with missing values (NA) and predicted by
considering them as an outcome or response variable.
Note: There are several sources of outliers in process engineering: measurement errors (from instruments), experimental errors and/or failures, power or emergency failures inducing instrument errors, and changes in process conditions (e.g., stream compositions). These sources must be identified, where possible, for troubleshooting purposes and to keep the physical meaning when analyzing data.
2.4.2 Missing Values

Missing values occur when no data value is captured for the variable in an observation.
Missing data can significantly impact the conclusions drawn from the data; hence, they
must be treated. The options for NA values include:
• Deleting the observations: make sure that after you delete your observations you
o Have sufficient data points so the model does not lose representation capability of the physical phenomenon.
o Do not introduce bias (non-representation of classes).
• Deleting the variable is practical when a particular variable has more NAs than the rest
of the dataset, and the variable is not a significant predictor.
• Imputation with mean/median/mode: this is like the approach used for outliers (a minimal sketch follows this list).
• Prediction: this is also like the approach used for outliers.
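A minimal sketch of mean imputation (not one of the book's chunks), using a Wind-like vector as in the examples below:

```{r, echo = TRUE}
# Mean imputation: replace NA values with the mean of the observed values
Wind_imputed <- data$Wind
Wind_imputed[is.na(Wind_imputed)] <- mean(Wind_imputed, na.rm = TRUE)
summary(Wind_imputed)
```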
Let us perform some tests and treatments on outliers and missing values.
Example 2.8 Outliers and missing values. Let us use the dataset Ex2.8.csv (a modification
of Ex2.7.csv) and the Rosner test in R to check suspected outliers for the variable wind.
Once detected, we proceed to cap them.
###Install
```{r, echo = TRUE}
install.packages("EnvStats")
```
###Detect Outliers
```{r, echo = TRUE}
library(EnvStats)
setwd("C:/Book/Chapter2/Examples")
data <-read.csv(file="Ex2.8.csv",head=TRUE,sep=",")
test <- rosnerTest(data$Wind, k=5)
test$all.stats
test
```
The function rosnerTest requires two arguments: the data, and k, the number of sus-
pected outliers; in this case, we chose 5. The corresponding statistic regarding outliers is
generated as follows:
Based on the Rosner test, one outlier (see the column Outlier) is the observation 42
(see the column Obs.Num).
Note: We have other packages in R that can be used for outliers’ detection, including
lofactor, outliers, outlierTest, OutlierDetection, and mvoutlier.
The following code allows for capping the outliers, based on the 1.5 × IQR limits:
###Treat Outliers
```{r, echo = TRUE}
Wind <- data$Wind
qnt <- quantile(Wind, probs=c(.25, .75), na.rm = T)
caps <- quantile(Wind, probs=c(.05, .95), na.rm = T)
H <- 1.5 * IQR(Wind, na.rm = T)
Wind[Wind < (qnt[1] - H)] <- caps[1]
Wind[Wind> (qnt[2] + H)] <- caps[2]
summary(Wind)
```
###Remove NAs
```{r, echo = TRUE}
Wind_clean <- na.omit(Wind)
summary(Wind_clean)
```
Cook’s distance
A graphical way to observe outliers in linear regression is Cook's distance, which shows the influence of each observation on the fitted response values. Cook's distance summarizes how much the fitted regression changes when the ith observation is removed.
A general rule of thumb for investigating outliers is to check whether a data point's Cook's distance is more than 3× the mean of all the distances. Example 2.9 illustrates this rule of thumb and Cook's distance to detect outliers.
Example 2.9 Identifying outliers using Cook’s distance. Let us detect the outliers in the
dataset stored in Ex2.9.csv using Cook’s distance; this file includes values for the acid gas
loading as a function of the amine flow rate. This is an extended dataset of Ex1.2.csv.
###Install
```{r, echo = TRUE}
install.packages("dplyr")
```
###Scatter plot
```{r, echo = TRUE}
setwd("C:/Book/Chapter2/Examples")
data <- read.csv(file = "Ex2.9.csv", head = TRUE, sep = ",")
data
# Scatter plot
plot(data$AmineFlow, data$AcidGas.Loading,
     xlab = "Amine Flow, USGPM", ylab = "Acid Gas Loading, %mol/%mol", pch = 19)
```
Figure 2.11 shows a linear trend in the data when removing observation 12 (possible
outlier).
Outliers are detected using Cook's distance with the following code:
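The chunk referred to here is not reproduced in this extract; a sketch consistent with the description that follows (linear fit, Cook's distance with the 3× mean rule, then refitting without the flagged points) might look like this:

```{r, echo = TRUE}
# Part 1: fit a linear regression of acid gas loading on amine flow
model <- lm(data$AcidGas.Loading ~ data$AmineFlow)
summary(model)
# Part 2: flag observations whose Cook's distance exceeds 3x the mean distance
cooksD <- cooks.distance(model)
influential <- which(cooksD > 3 * mean(cooksD))
influential
# Part 3: remove the flagged observations and refit the model
data_without_outliers <- data[-influential, ]
model2 <- lm(data_without_outliers$AcidGas.Loading ~ data_without_outliers$AmineFlow)
summary(model2)
```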
We fit the data with a linear regression in the first part of the code. The summary
statistics of the fitting is shown as follows:
Residuals:
Min 1Q Median 3Q Max
-0.0008724 -0.0004625 -0.0002204 0.0002252 0.0026975
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -1.968e-03 1.441e-03 -1.365 0.197158
data$AmineFlow 6.300e-05 1.316e-05 4.789 0.000442 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Let us look at the adjusted R-squared value for this example. Chapter 3 covers in detail
the prediction techniques. This value, when close to 1, attests to the model’s accuracy.
The current value of R2 is 0.6278.
The second part of the code allows detecting outliers using Cook’s distance based on
the rule of thumb 3 × the mean of all the distances. The only outlier detected using this
technique is observation 12.
The last part of the code removes the outlier and refits the model. The corresponding
statistics summary after outliers’ removal is:
Residuals:
Min 1Q Median 3Q Max
-7.612e-04 -7.855e-05 2.011e-05 1.748e-04 3.906e-04
Coefficients:
Estimate Std. Error t value
(Intercept) -5.521e-04 4.934e-04 -1.119
data_without_outliers$AmineFlow 4.773e-05 4.577e-06 10.430
Pr(>|t|)
(Intercept) 0.287
data_without_outliers$AmineFlow 4.85e-07 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
The final value of R2 is 0.8998, closer to 1, and greater than 0.6278, proving that
removing the outlier, when physically possible, improves the fitting of the model. Hence,
the acid gas loading and amine flow are linearly correlated.
2.5 Correlogram
Correlation is a statistical measure that shows the extent to which two variables are lin-
early related without stating the cause and effect of this relation. A correlation can be
visualized with a scatterplot, as in Ex.2.9. When working with multiple variables, we can
use a correlogram, a correlation matrix graph. A correlogram is useful to highlight the
most correlated variables in a data table. In R, the plot includes correlation coefficients
coloured according to the value. Ex. 2.10 illustrates the use of a correlogram.
Let us illustrate a correlogram relating the design variables (stored in the file Ex2.10.csv) to the convection heat transfer coefficient, h.
The package corrplot must be installed. The corresponding R code is shown as follows:
###Install
```{r, echo = TRUE}
install.packages("corrplot")
```
The correlograms are generated using the following code for two different visualiza-
tions (circle, and number):
###Correlation
```{r, echo = TRUE}
setwd("C:/Book/Chapter2/Examples")
data <- read.csv(file = "Ex2.10.csv", head = TRUE, sep = ",")
data
M <- cor(data)
library(corrplot)
#Using circles for visualizations
corrplot(M, method = "circle")
#Using numbers (correlation coefficients) for visualizations
corrplot(M, method = "number")
```
###Significance
```{r, echo = TRUE}
#p-values matrix
cor.mtest <- function(mat, ...) {
  mat <- as.matrix(mat)
  n <- ncol(mat)
  p.mat <- matrix(NA, n, n)
  diag(p.mat) <- 0
  for (i in 1:(n - 1)) {
    for (j in (i + 1):n) {
      tmp <- cor.test(mat[, i], mat[, j], ...)
      p.mat[i, j] <- p.mat[j, i] <- tmp$p.value
    }
  }
  colnames(p.mat) <- rownames(p.mat) <- colnames(mat)
  p.mat
}
# matrix of the p-value of the correlation
p.mat <- cor.mtest(data)
head(p.mat[, 1:6]) #"6" is the total number of variables
# Correlogram leaving blank the correlations that are not significant at the 0.05 level
# (this call is not shown in the extract; it is a sketch consistent with Fig. 2.14)
corrplot(M, method = "circle", p.mat = p.mat, sig.level = 0.05, insig = "blank")
```
Figure 2.14 shows that the input variable x5 is the most significant for the chosen significance level when relating to h.
A white box in the correlogram shows that the correlation is not significantly different
from 0 at a specified significance level for a couple of variables; this means there is
no linear relationship between them. To determine whether a correlation coefficient is
significantly different from 0, a correlation test must be performed. Even if the correlation
coefficient between two variables is low, the correlation test shows if we can reject or not
the hypothesis of no correlation in the population!
Correlograms are very helpful for preliminary visual inspection. Nevertheless, we must
consider that outliers and missing values in a time series may seriously affect them; for
example, extreme points will tend to depress the data correlation coefficients towards zero.
2.6 Clustering and Dimensionality Reduction

2.6.1 K-means Clustering

In K-means clustering, clusters are defined so that the intra-cluster variation is minimized; this variation is the sum of squared Euclidean distances between each observation and the mean of the cluster:

$$W(C_k) = \sum_{x_i \in C_k} (x_i - \mu_k)^2 \qquad (2.1)$$

where

x_i is the data point belonging to the cluster C_k.
μ_k is the mean value of the points assigned to the cluster C_k.

Each observation is placed in a cluster such that the sum of squared distances of the observations to their cluster centres μ_k is minimized:

$$\sum_{k=1}^{K} W(C_k) = \sum_{k=1}^{K} \sum_{x_i \in C_k} (x_i - \mu_k)^2 \qquad (2.2)$$
Example 2.11 K-means clustering. The file Ex2.11.csv includes data of the capacity (in
%) of a lithium-ion battery over time (in days). Let us perform clustering to determine the
optimal number of clusters.
###Install
```{r, echo = TRUE}
#another way of getting packages
packages <- function(x){
  x <- as.character(match.call()[[2]])
  if (!require(x, character.only = TRUE)){
    install.packages(pkgs = x, repos = "http://cran.r-project.org")
    require(x, character.only = TRUE)
  }
}
packages(gridExtra)
packages(cluster)
packages(factoextra) # visualization tool (provides fviz_cluster and fviz_nbclust)
```
###Clusters
```{r, echo = TRUE}
setwd("C:/Book/Chapter2/Examples")
data <-
read.csv(file="Ex2.11.csv",head=TRUE,sep=",")
#k-means clustering
set.seed(100)
data_K2 <- kmeans(dataNorm, centers = 2, nstart =
25)
fviz_cluster(data_K2, data = dataNorm)
```
To perform clustering, we must scale the data values so that every variable has zero mean and unit variance (Fig. 2.15). We performed k-means clustering setting two clusters (centers = 2). We can visualize the results as follows:
K-means requires the user to specify a priori the number of clusters to classify the
data. However, there are plots employing the Elbow and Silhouette methods that suggest
the optimal number of clusters, showing the total within-groups sums of squares versus
the number of clusters and looking at the bend in the graph (Fig. 2.16).
The Elbow method plots the explained variation as a function of the number of clusters
and selects the elbow or bend of the curve to suggest the number of clusters to use
(Fig. 2.17).
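As a minimal sketch of how these plots can be generated (using fviz_nbclust from factoextra, already loaded above, applied to the scaled data dataNorm):
###Optimal number of clusters
```{r, echo = TRUE}
#Suggest the number of clusters with the Elbow (total within-cluster sum of
#squares) and Silhouette methods
library(factoextra)
fviz_nbclust(dataNorm, kmeans, method = "wss")        #Elbow plot
fviz_nbclust(dataNorm, kmeans, method = "silhouette") #Silhouette plot
```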
The Elbow method suggests that the optimal number of clusters for this example is 2,
while the Silhouette method suggests that this number is 3. When analyzing the physical
meaning of degradation (capacity) in lithium-ion batteries, it is reasonable to think that
until 80% of capacity, degradation over time is quasi-linear due to the formation and
growth of a layer called Solid Electrolyte Interface (SEI). The SEI layer reduces the
amount of available lithium for the intercalation reaction (the main reaction that occurs
in these batteries when charging or discharging). After 80% of capacity, lithium plating,
another degradation mechanism, is believed to be responsible for the non-linear behavior
of the degradation over time. Hence, two clusters are expected to reflect the lithium-ion battery’s first life (due to the SEI layer) and second life (due to the combined SEI and lithium plating mechanisms).
The first step of principal component analysis (PCA) is to standardize the data so that each variable is expressed as a z-score:

z = \frac{\text{value} − \text{mean}}{\text{standard deviation}}    (2.3)

The covariance matrix of the standardized variables (shown here for three variables x, y, and z) is then computed:

\begin{bmatrix} \mathrm{Cov}(x,x) & \mathrm{Cov}(x,y) & \mathrm{Cov}(x,z) \\ \mathrm{Cov}(y,x) & \mathrm{Cov}(y,y) & \mathrm{Cov}(y,z) \\ \mathrm{Cov}(z,x) & \mathrm{Cov}(z,y) & \mathrm{Cov}(z,z) \end{bmatrix}    (2.4)
• The covariance matrix is used to identify any relationship between the variables. If the
covariance sign is positive, the two variables are correlated (increasing or decreasing);
if the sign is negative, the two variables are inversely correlated (one variable increases
when the other decreases).
• Computation of eigenvectors and eigenvalues of the covariance matrix. These elements
allow for determining the principal components of the data. These components are
uncorrelated, as they are new variables created as linear combinations of the initial vari-
ables. How can we interpret the PCA components? For instance, a 10-dimensional dataset generates ten principal components, but PCA packs the maximum possible information (variance) into the first component; the maximum remaining information is contained in the second component, followed by the third component, and so on. The variance explained by each component is summarized in a scree plot. When observing this plot, we can discard those components with low information, reducing dimensionality while keeping relevant information.
Example 2.12 Let us use the data Ex2.10 to perform a PCA for dimensionality-reduction
purposes.
###Install
```{r, echo = TRUE}
install.packages("factoextra")
```
###PCA
```{r, echo = TRUE}
setwd("C:/Book/Chapter2/Examples")
data <-
read.csv(file="Ex2.10.csv",head=TRUE,sep=",")
head(data)
#Scree plot
res.pca <- prcomp(data, scale = TRUE)
screeplot(res.pca, type = "l", npcs = 6, main =
"Screeplot of the first 6 PCs")
abline(h = 1, col="red", lty=5)
legend("topright", legend=c("Eigenvalue = 1"),
col=c("red"), lty=5, cex=0.6)
The corresponding scree plot and contribution plot are shown as follows (Fig. 2.18):
An eigenvalue less than 1 means that the component explains less variance than a single original variable would, so such components can be discarded. In our case, we should discard 3 components (Fig. 2.19).
The contribution plot (biplot) shows the principal component (PC) scores of the samples (dots) and the loadings of the variables (vectors). The plot shows observations as points in the plane formed by two synthetic variables (two principal components); a sketch of how such a plot can be generated in R follows the list below:
• The more parallel a vector is to a principal component axis, the more that variable contributes to that PC alone.
• The longer the vector, the more variability of this variable is shown by the two principal
components.
• The angles between vectors of different variables show their correlation; small angles
indicate a high positive correlation, right angles represent a lack of correlation, and
opposite angles represent a highly negative correlation.
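As a minimal sketch of how such a plot can be generated with factoextra (assuming res.pca from the PCA chunk above):
###Biplot
```{r, echo = TRUE}
#Variable loadings (vectors) and a biplot of scores and loadings
library(factoextra)
fviz_pca_var(res.pca)                  #loadings only
fviz_pca_biplot(res.pca, repel = TRUE) #scores (dots) and loadings (vectors)
```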
Global sensitivity analysis defines the importance of model inputs and their interactions
regarding the model output. This analysis is called global, as (i) all the input factors are
varied simultaneously, and (ii) it is performed over the range of each input factor.
Variance-based sensitivity analysis or Sobol indices is a technique to perform a global
sensitivity analysis, as it provides a better understanding of the relationship between the
variables of a model. The variance of the model’s output is decomposed into fractions or
measures of sensitivity attributed to the inputs.
The first-order Sobol sensitivity index S_i measures the contribution of each variable or parameter to the output variable, and hence, its effect on the variance of the model:

S_i = \frac{V[E[Y \mid Q_i]]}{V[Y]}    (2.5)

where E[Y | Q_i] is the expected value of the output Y when the variable or parameter Q_i is fixed.
The total Sobol sensitivity index (ST) accounts for the sensitivity of the first-order and higher-order effects (interactions between variables):

ST_i = 1 − \frac{V[E[Y \mid Q_{−i}]]}{V[Y]}    (2.6)

where Q_{−i} includes all uncertain parameters except Q_i.
Note: The sum of the first-order Sobol sensitivity indices cannot exceed one. Likewise,
the sum of the total Sobol sensitivity indices is equal to or greater than one.
Example 2.13 Sobol indices. We use a dataset included in Ex2.13, showing the variable y
(corrosion rate) as a function of the variables x 1 (acid gas), x 2 (heat stable salts), x 3 (velocity),
and x 4 (temperature) to perform a global sensitivity analysis using Sobol indices.
###Install
```{r, echo = TRUE}
install.packages("sensitivity")
```
###Sobol
```{r, echo = TRUE}
library(sensitivity)
setwd("C:/Book/Chapter2/Examples ")
data <-
read.csv(file="Ex2.13.csv",head=TRUE,sep=",")
x1=data$x1
x2=data$x2
x3=data$x3
x4=data$x4
y=data$y
XX1 = data.frame(data[c(1:500), c(1:4)]);
rownames(XX1) <- NULL
XX2 = data.frame(data[c(501:1000), c(1:4)]);
rownames(XX2) <- NULL
#Note: inside an R formula, terms such as x1^2 are interpreted by the formula
#algebra, not as quadratic terms; I(x1^2) would be needed for raw squared terms
modelX <- lm(y ~ x1+x2+x3+x4+x1^2+x2^2+x3^2+x4^2+x1*x2+x1*x3+x1*x4+x2*x3+x2*x4+x3*x4)
#summary(modelX)
sol <-sobol(model=modelX,X1=XX1,X2=XX2,order=1)
print(sol)
plot(sol)
```
We assume that the corrosion rate y is correlated with a linear model with interaction
parameters (modelX). The corresponding summary statistics shows an adjusted R-squared
of 0.7189. When calculating the first-order Sobol indices for the assumed model, it shows
the following contribution of the input variables (x 1 to x 4 ):
Figure 2.20 (and its corresponding Sobol index values) shows that x4 (temperature) is the most influential variable on the corrosion rate, followed by x2, x3, and x1.
2.7 Summary and Final Remarks
Exploratory data analysis (EDA) is a critical process that data analysts, scientists, and
engineers perform to investigate data, discover patterns, pick anomalies, check assump-
tions, and test hypotheses. There is no standard recipe for performing EDA. Nevertheless,
we propose the workflow shown in Fig. 2.21, supported by the techniques, algorithms,
and methods included in this chapter.
The raw data (or data before being processed) shall be processed through six steps,
including (i) summary statistics, (ii) data visualization, (iii) capture and treatment of out-
liers and missing values, (iv) correlation, (v) clustering and dimensionality reduction, and
(vi) sensitivity analysis to generate ‘clean data’ ready for modelling and prediction pur-
poses. The readiness of the data can be revised when modelling and predicting values,
depending on the model’s accuracy, and the physical interpretation of the analyzed phe-
nomenon or process. For instance, our EDA might suggest a dimensionality reduction that
leads to a significant accuracy loss for modelling and prediction purposes; hence, we can
decide to omit this step. Another example can be associated with capturing and treating
outliers; some chemical processes, for example, might exhibit extreme values due to the
intrinsic nature of the processes even under normal operating conditions; therefore, we
cannot simply delete these values. EDA is critical for defining accurate models, and this
analysis must be accompanied by a process engineering interpretation of the process to
be successful and lead to representative models.
Data Disclosure
The data was generated, maintaining the described problem’s physical meaning of the
analyzed phenomenon or process.
Problems
2.1 Analyze the Exploratory Data Analysis reported for the data used for modelling nitrogen dioxide concentration levels across Germany. The article can be downloaded from: https://www.sciencedirect.com/science/article/pii/S2352340921006089. The dataset can be downloaded from: https://zenodo.org/record/5148684#.Yp-J-ajMJro, file DATA_MonitoringSites_DE.csv. Perform your own EDA with the tools provided in this chapter and compare it with the one reported in the article.
2.2 Perform an Exploratory Data Analysis on the water treatment plant experiment described by Souza et al. (2013) [4]. The objective is to estimate the fluoride concentration in the effluent of an urban water treatment plant. The corresponding dataset can be downloaded from: https://home.isr.uc.pt/~rui/publications/datasets.html, see ‘WWTP’. Discuss and report your conclusions about the dataset.
Resources
Recommended Readings
Chakraborty, S., & Dey, L. (2023). Computing for data analysis: Theory and practices. Springer Verlag, Singapore.
Roy, Kavika (2022). Dimensionality reduction techniques in Data Science. KDnuggets. https://www.kdnuggets.com/2022/09/dimensionality-reduction-techniques-data-science.html.
Frost, J. (2023, May 18). Box plot explained with examples. Statistics By Jim. https://statisticsbyjim.com/basics/graph-groups-boxplots-individual-values/.
Glen, G., & Isaacs, K. (2012). Estimating Sobol sensitivity indices using correlations. Environmental Modelling & Software, 37, 157–166. https://doi.org/10.1016/j.envsoft.2012.03.014.
Glen, S. (2022, January 14). Cook’s distance / Cook’s D: Definition, interpretation. Statistics How To. https://www.statisticshowto.com/cooks-distance/.
Giudici, P. (2013). Statistical models for data analysis. Springer.
Irizarry, R. A. (n.d.). Introduction to data science. Chapter 28 Smoothing. http://rafalab.dfci.harvard.edu/dsbook/smoothing.html.
McGregor, M. (2020, September 21). 8 clustering algorithms in machine learning that all data scientists should know. freeCodeCamp.org. https://www.freecodecamp.org/news/8-clustering-algorithms-in-machine-learning-that-all-data-scientists-should-know/.
Menon, K. (2023, January 12). The complete guide to skewness and kurtosis: Simplilearn. Simplilearn.com. https://www.simplilearn.com/tutorials/statistics-tutorial/skewness-and-kurtosis.
n.d. (2017, October 7). Principal component analysis in R: prcomp VS princomp. STHDA. http://www.sthda.com/english/articles/31-principal-component-methods-in-r-practical-guide/118-principal-component-analysis-in-r-prcomp-vs-princomp/.
Pearson, R. K. (2018). Exploratory Data Analysis Using R. CRC Press.
Prakash, K. B. (2022). Data science handbook: A practical approach. Wiley-Scrivener.
Snehal_bm. (2021, July 8). How to treat outliers in a data set?. Analytics Vidhya. https://www.analyticsvidhya.com/blog/2021/07/how-to-treat-outliers-in-a-data-set/.
What is cluster analysis? When should you use it for your results?. Qualtrics. (2022, November 30). https://www.qualtrics.com/experience-management/research/cluster-analysis/.
Zhang, X., Trame, M., Lesko, L., & Schmidt, S. (2015). Sobol sensitivity analysis: A tool to guide the development and evaluation of systems pharmacology models. CPT: Pharmacometrics & Systems Pharmacology, 4(2), 69–79. https://doi.org/10.1002/psp4.6.
References
1. Behera, S., & Suresh, A. K. (2019). Data on interfacial hydrolysis kinetics of an aromatic acid chloride. Data in Brief, 26, 104337. https://doi.org/10.1016/j.dib.2019.104337
2. Andrady, A. (2015). Degradation of plastics in the environment. Plastics and Environmental Sustainability, 145–184. https://doi.org/10.1002/9781119009405.ch6
3. Mallick, J., Singh, C., AlMesfer, M., Kumar, A., Khan, R., Islam, S., & Rahman, A. (2018). Hydro-geochemical assessment of groundwater quality in ASEER region, Saudi Arabia. Water, 10(12), 1847. https://doi.org/10.3390/w10121847
4. Souza, F., Araújo, R., Matias, T., & Mendes, M. (2013). A multilayer-perceptron based method for variable selection in soft sensor design. Journal of Process Control, 23(10), 1371–1378. https://doi.org/10.1016/j.jprocont.2013.09.014
5. California Air Resources Board. Inhalable Particulate Matter and Health (PM2.5 and PM10) | California Air Resources Board. (n.d.). https://ww2.arb.ca.gov/resources/inhalable-particulate-matter-and-health
6. Taiyun Wei, V. S. (2021, November 18). An introduction to corrplot package. https://cran.r-project.org/web/packages/corrplot/vignettes/corrplot-intro.html
3 Data-Based Modelling for Prediction
Simple regression models include simple linear regression, which is the most common
form of regression analysis, multivariate linear regression, splines, multivariate adap-
tive regression splines, other functions (such as exponential, logarithmic, and polynomial
functions), response surface regression, Kriging, among others.
Simple linear regression is the linear regression approach that attempts to model the
relationship between an explanatory or independent variable x and a dependent variable
y.
A simple linear regression model has an equation of the form
y = mx + b (3.1)
where m is the slope and b is the intercept. These parameters are estimated from the sample means of x and y,

\bar{X} = \frac{\sum_{i=1}^{n} x_i}{n}    (3.2)

\bar{Y} = \frac{\sum_{i=1}^{n} y_i}{n}    (3.3)

where n is the number of ordered pairs, as

m = \frac{\sum_{i=1}^{n} (x_i − \bar{X})(y_i − \bar{Y})}{\sum_{i=1}^{n} (x_i − \bar{X})^2}    (3.4)

b = \bar{Y} − m\bar{X}    (3.5)
Scatter plots allow us to observe the linearity between the variables x and y. Analytically,
the linearity is confirmed or not using the coefficient of determination r 2 , which shows the
proportion of the variance in y that is predictable from x:
r^2 = 1 − \frac{RSS}{TSS}    (3.6)

where RSS is the residual sum of squares and TSS is the total sum of squares:

RSS = \sum_{i=1}^{n} (y_i − f(x_i))^2    (3.7)

TSS = \sum_{i=1}^{n} (y_i − \bar{Y})^2    (3.8)
Example 3.1 The hydrolysis of a prescription drug is a first-order reaction with respect to the drug. Data for this reaction at 25 °C and pH 7.0 is provided in the file Ex3.1.csv. Determine the
rate constant.
The following package ggplot2 must be installed. The package ggplot2 is a data
visualization package for R.
A preliminary exploratory data analysis using a scatter plot reveals a linear relationship
between the variables (Fig. 3.1).
###General
```{r, echo = TRUE}
setwd("C:/Book/Chapter3/Examples")
data <-read.csv(file="Ex3.1.csv",head=TRUE,sep=",")
data=data[c(1:4),c(2:3)];
data
plot(data$InitialRate,data$PresDrug,xlab="Concen-
tration of Prescription Drug, M",ylab="Initial Rate,
M/min",pch=19)
```
Fig. 3.1 Graphical relationship between the initial rate and concentration of the prescription drug
For a first-order reaction, the rate is proportional to the reactant concentration (rate = k[A]) and the integrated rate law is [A] = [A]_0 e^{−kt}, where [A] is the concentration of the reactant A (prescription drug) at time t, [A]0 is the initial concentration of the reactant A, and k is the rate constant. To find the value of k, it is necessary to estimate the slope of the linear function between the variables. We use the function lm in R and present the summary statistics of the model:
###Model
```{r, echo = TRUE}
model <-lm(data$InitialRate~data$PresDrug)
summary(model)
```
Call:
lm(formula = data$InitialRate ~ data$PresDrug)
Residuals:
Min 1Q Median 3Q Max
-6.490e-06 -6.184e-07 4.021e-07 1.018e-06 3.767e-06
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.391e-06 1.657e-06 0.84 0.426
data$PresDrug 1.387e-03 3.198e-05 43.37 8.81e-11 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 2.828e-06 on 8 degrees of freedom
Multiple R-squared: 0.9958, Adjusted R-squared: 0.9952
F-statistic: 1881 on 1 and 8 DF, p-value: 8.814e-11
The value of the rate constant k (slope) is 1.387 × 10^−3 min^−1. The intercept of the linear function is 1.391 × 10^−6.
Let us examine the adequacy of the model:
• The residuals are the difference between the dependent variable’s predicted and
observed values. For all points of the dataset, the residuals are negligible.
• The standard error is the standard deviation divided by the square root of the sample
size. In this case, it is reported as negligible.
• t-values denote the confidence we have in the coefficients as a predictor. The higher
the t-value is, the greater the confidence we have. Generally, any t-value greater than
+2 or less than −2 is acceptable; our model reflects this acceptance criterion.
• Pr (>|t|) is the acronym used by R in the model output, related to the probability of
observing any value equal or larger than t. A small p-value indicates that the null
hypothesis is weak; hence, it is unlikely that observed differences are due to chance.
We can use different significance codes or thresholds, such as 0, 0.001, 0.01, 0.05 or 0.1. R marks with * symbols the p-values that fall below a specific threshold.
• The model’s residual standard error is the standard deviation of the residuals. Smaller
residual standard errors imply that the predictions are better. For this model, this value
is negligible.
• R-squared is the coefficient of determination, which ranges from 0 to 1. This statistical measure determines how well the data fits the regression model. The difference between multiple R-squared and adjusted R-squared is that adjusted R-squared accounts for the number of independent variables in the model, while R-squared does not. For this model and dataset, both values are essentially the same and very close to 1, denoting an excellent fit of the data to a linear model.
• F-statistics is used to test the significance of regression coefficients in linear regres-
sion models. If the p-value associated with the F-statistics is ≥ 0.05, then there
is no relationship between the independent and dependent variables. If the p-value
associated with the F-statistics is <0.05, then at least one independent variable is
related to the dependent variable. The F-statistics compares the joint effect of all the
variables together. In the model of this example, the p-value associated with the F-
statistics is <0.05; hence, the independent variable (initial rate) is linearly related to the
dependent variable (concentration of prescription drug).
• In linear regression models, each term is an estimated parameter that uses one degree
of freedom.
A convenient way of showing the residuals of the model is through a residual plot (Fig. 3.2):
###Residual Plot
```{r, echo = TRUE}
#Extract the residuals of the model
resi <- residuals(model)
#Plot residuals against fitted values (this call is implied by Fig. 3.2)
plot(fitted(model), resi, xlab="Fitted values", ylab="Residuals", pch=19)
```
The x-axis displays the fitted values, while the y-axis displays the residuals. From the plot, the spread of the residuals does not show a systematic pattern with the fitted values.
We can also generate a Q-Q plot, which is helpful to show if the residuals follow a
normal distribution. If the data values in this plot fall along a straight line at a 45-degree
angle, then the data follows a normal distribution (Fig. 3.3).
###Q-Q Plot
```{r, echo = TRUE}
#Create the Q-Q plot for residuals
qqnorm(resi)
qqline(resi) #reference line added to complete the chunk
```
We can observe that one residual strays from the line at the lowest theoretical quartiles,
which could indicate that the data is not normally distributed.
Multiple linear regression (MLR) is a statistical technique that uses two or more indepen-
dent variables to predict the outcome of one dependent variable. The MLR model has the
form
y_i = b_0 + \sum_{j=1}^{p} b_j x_{ij} + e_i    (3.10)

where
y_i ∈ R is the real-valued response for the i-th observation
b_0 ∈ R is the regression intercept
b_j ∈ R is the j-th predictor’s regression slope
x_{ij} ∈ R is the j-th predictor for the i-th observation
e_i ∼ N(0, σ^2) is a Gaussian error term, assumed to be an unobserved random variable
i ∈ {1, . . . , n} indexes the observations
p is the number of predictors (p > 1)
The multivariate (multiple) linear regression (MvLR) model has the form

y_{ik} = b_{0k} + \sum_{j=1}^{p} b_{jk} x_{ij} + e_{ik}    (3.11)

where
y_{ik} ∈ R is the k-th real-valued response for the i-th observation
b_{0k} ∈ R is the regression intercept for the k-th response
b_{jk} ∈ R is the j-th predictor’s regression slope for the k-th response
x_{ij} ∈ R is the j-th predictor for the i-th observation
Example 3.2 Multiple linear regression and multivariate linear regression. The degra-
dation of lithium-ion batteries is observed as a decrement in their capacity over time. When
these batteries are at rest, degradation is caused by the temperature and state-of-charge
(SOC). Model the degradation for the dataset included in Ex3.2.csv as a function of the
inverse of the temperature (1/K), SOC (%), and time (d).
###General
```{r, echo = TRUE}
setwd("C:/Book/Chapter3/Examples")
data <-read.csv(file="Ex3.2.csv",head=TRUE,sep=",")
data=data[c(1:83),c(2:4)];
data
```
###Model 1
```{r, echo = TRUE}
model <- lm(data$Capacity ~ data$InvT + data$SOC + data$time)
summary(model)
```
Call:
lm(formula = data$Capacity ~ data$InvT + data$SOC + data$time)
Residuals:
Min 1Q Median 3Q Max
-6.6809 -2.9024 -0.4834 3.2543 8.0319
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 7.950e+01 8.065e+00 9.858 2.08e-15 ***
data$InvT 7.989e+03 2.468e+03 3.237 0.00176 **
data$SOC -8.050e-02 1.489e-02 -5.407 6.61e-07 ***
data$time -1.521e-02 2.662e-03 -5.714 1.88e-07 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
The summary results of the previous model suggest that a multiple linear model does
not accurately relate the stress factors with the capacity, with an adjusted R-square =
0.4501, and a residual standard error of 3.54. The coefficients b0 (intercept), b1 (for
InvT), b2 (for SOC), and b3 (for time) were estimated. All variables exhibit a very low p-value; therefore, they can all be considered significant when predicting capacity.
Now, let us include linear interaction parameters in the previous model and examine
the accuracy of the model.
###Model 2
```{r, echo = TRUE}
model <- lm(data$Capacity ~ data$InvT + data$SOC + data$time + data$InvT*data$SOC + data$InvT*data$time + data$SOC*data$time)
summary(model)
```
Call:
lm(formula = data$Capacity ~ data$InvT + data$SOC + data$time +
data$InvT * data$SOC + data$InvT * data$time + data$SOC *
data$time)
Residuals:
Min 1Q Median 3Q Max
-6.2272 -2.4546 0.1815 2.0677 6.9136
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.585e+02 4.298e+01 3.687 0.000424 ***
data$InvT -1.819e+04 1.341e+04 -1.356 0.179096
data$SOC -5.682e-01 4.704e-01 -1.208 0.230849
data$time -1.724e-01 4.921e-02 -3.504 0.000771 ***
data$InvT:data$SOC 1.715e+02 1.470e+02 1.167 0.246988
data$InvT:data$time 5.521e+01 1.511e+01 3.654 0.000473 ***
data$SOC:data$time -2.474e-04 9.288e-05 -2.664 0.009422 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
As can be seen in the previous results, both the residual standard error and R-squared
improve when compared to model 1. Nevertheless, low p-values are restricted to the inter-
cept and time and the linear interaction parameters between the inverse of the temperature
and time, and SOC and time. Is it possible to reduce the dimensionality of this model by eliminating those variables exhibiting high p-values? How would it affect the accuracy of
the model? Is a linear model enough to accurately predict the capacity as a function of
the inverse of the temperature, SOC, and time? Later in this chapter, we answer these
questions using ‘response surface regression’ when solving this example.
Notes (for an exponential model of the form y = a·b^x):
– b must be non-negative.
– When b > 1, we have an exponential growth model.
– When 0 < b < 1, we have an exponential decay model.
y is the response variable, x is the predictor variable; a and b are the regression
coefficients that describe the relationship between y and x.
R2 denotes the relative predictive power of a logarithmic model; its value varies
between 0 and 1.
Example 3.3 Exponential decay. The isotope IT-99 is used as a radioactive tracer. It decays
by a process called isomeric transition, where it releases gamma rays and low-energy elec-
trons. The decay factors for IT-99 are given in Ex3.3.csv. Find a decay model for this
dataset.
###General
```{r, echo = TRUE}
setwd("C:/Book/Chapter3/Examples")
data <-read.csv(file="Ex3.3.csv",head=TRUE,sep=",")
data=data[c(1:24),c(1:2)];
data
```
###Model
```{r, echo = TRUE}
model <-lm(log(data$decay)~data$time)
summary(model)
betas <- model$coefficients
A <- exp(betas[1]);
b <- exp(betas[2]);
A
b
```
Equation 3.15 is first modelled as a logarithmic function (log in R is the natural logarithm); the coefficients or betas of the model are then extracted, and the coefficients of the exponential function are finally obtained by exponentiating them.
The corresponding summary statistics is:
Call:
lm(formula = log(data$decay) ~ data$time)
Residuals:
Min 1Q Median 3Q Max
-0.0030848 -0.0006722 -0.0003317 0.0003191 0.0050043
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.0050043 0.0006272 -7.979 6.17e-08 ***
data$time -0.2298932 0.0001129 -2035.734 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Intercept)
0.9950082
data$time
0.7946185
The R-squared is essentially 1; therefore, the regression is very accurate. Moreover, the residual standard error is also negligible.
y = a0 + a1 x + a2 x2 + . . . + an xn (3.16)
where ai are the coefficients of the polynomial terms, and n is the degree of the
polynomial function. a0 is typically referred to as the intercept.
Polynomial regression is a special linear regression case since we fit the polyno-
mial equation on the data with a curvilinear relationship between the dependent and
independent variables.
R2 denotes the relative predictive power of a polynomial model; its value varies between 0 and 1.
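As a minimal, self-contained sketch of a polynomial fit in R (with made-up x and y vectors rather than one of the book's datasets), lm can be combined with poly:
###Polynomial regression (sketch)
```{r, echo = TRUE}
#Hypothetical sketch: fit a cubic polynomial y = a0 + a1*x + a2*x^2 + a3*x^3
set.seed(100)
x <- seq(0, 10, by = 0.5)
y <- 2 + 0.5*x - 0.3*x^2 + 0.02*x^3 + rnorm(length(x), sd = 0.5)
poly_model <- lm(y ~ poly(x, 3, raw = TRUE)) #raw = TRUE reports a0..a3 directly
summary(poly_model)
```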
The response surface regression (RSR) explores and finds the relationship between
several independent variables and one or more response or dependent variables. RSR
produces a polynomial regression model with cross-product terms of variables denoting
the interaction between them. For instance, a response variable y, which depends on the
variables x 1 , x 2 , and x 3 , can be modelled using an RSR model with an equation of the
form
y = b_0 + b_1 x_1 + b_2 x_2 + b_3 x_3 + b_{1−2} x_1 x_2 + b_{1−3} x_1 x_3 + b_{2−3} x_2 x_3 + b_{1−1} x_1^2 + b_{2−2} x_2^2 + b_{3−3} x_3^2    (3.17)
where b0 is the intercept, bi are the linear coefficients of the RSR, bi−i are the coefficients
of the quadratic terms, and bi− j are the coefficients of the interaction terms.
Example 3.4 Response surface regression. In this example, we add the quadratic terms for
Example 3.2.
###Model
```{r, echo = TRUE}
b1<-data$InvT;
b2<-data$SOC;
b3<-data$time;
b12<-data$InvT*data$SOC;
b13<-data$InvT*data$time;
b23<-data$SOC*data$time;
b11<-data$InvT*data$InvT;
b22<-data$SOC*data$SOC;
b33<-data$time*data$time;
model <- lm(data$Capacity ~ b1+b2+b3+b12+b13+b23+b11+b22+b33)
summary(model)
```
Call:
lm(formula = data$Capacity ~ b1 + b2 + b3 + b12 + b13 + b23 +
b11 + b22 + b33)
Residuals:
Min 1Q Median 3Q Max
-6.016 -2.481 -0.109 2.230 5.726
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -3.021e+02 1.649e+02 -1.832 0.071009 .
b1 2.640e+05 9.901e+04 2.666 0.009442 **
b2 -5.346e-01 4.829e-01 -1.107 0.271910
b3 -1.806e-01 4.771e-02 -3.786 0.000312 ***
b12 1.566e+02 1.616e+02 0.969 0.335652
b13 5.320e+01 1.438e+01 3.700 0.000416 ***
b23 -2.501e-04 8.876e-05 -2.818 0.006206 **
b11 -4.305e+07 1.515e+07 -2.841 0.005817 **
b22 2.153e-04 6.068e-04 0.355 0.723756
b33 3.516e-05 1.994e-05 1.764 0.081998 .
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
The residual standard error and R-squared have considerably improved compared to
the simple linear model and the one including linear interaction terms. P-values for coef-
ficients b2 and b12 suggest that both terms, the linear for SOC and the linear interaction
parameter between the inverse of the temperature and SOC are not significant. Likewise,
the quadratic term b22 can also be discarded. Let us neglect these terms and re-estimate the model:
###Model - simplified
```{r, echo = TRUE}
b1<-data$InvT;
b3<-data$time;
b13<-data$InvT*data$time;
b23<-data$SOC*data$time;
b11<-data$InvT*data$InvT;
b33<-data$time*data$time;
simplified_model <- lm(data$Capacity ~ b1+b3+b13+b23+b11+b33)
summary(simplified_model )
```
Call:
lm(formula = data$Capacity ~ b1 + b3 + b13 + b23 + b11 + b33)
Residuals:
Min 1Q Median 3Q Max
-6.1257 -2.5201 -0.1696 2.4765 6.2959
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -3.541e+02 1.564e+02 -2.264 0.026404 *
b1 2.820e+05 9.605e+04 2.936 0.004392 **
b3 -1.754e-01 4.672e-02 -3.754 0.000338 ***
b13 5.247e+01 1.426e+01 3.678 0.000436 ***
b23 -2.839e-04 4.687e-05 -6.058 4.91e-08 ***
b11 -4.373e+07 1.472e+07 -2.970 0.003983 **
b33 3.484e-05 1.980e-05 1.760 0.082432 .
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
As seen in the previous summary, the dimensionality reduction has not significantly
impacted the accuracy of the simplified model, as the residual standard error and R-
squared do not significantly differ compared to the full quadratic model.
3.2.5 Splines
Example 3.5 Splines. True boiling point (TBP) distillation is a batch distillation process
used to characterize crude oils. It is generated by plotting the cumulative volume distillation
fraction with temperature. The dataset Ex3.5.csv includes typical TBP data for crude oil.
We are asked to create a model ‘correlating’ the increasing temperature as a function of the
cumulative volume.
The following code allows for loading and plotting the data.
###General
```{r, echo = TRUE}
setwd("C:/Book/Chapter3/Examples")
data <-read.csv(file="Ex3.5.csv",head=TRUE,sep=",")
data=data[c(1:14),c(1:2)];
data
plot(data$Volume...,data$Temp...R,xlab = "Cumula-
tive volume, %", ylab= "Temperature, R")
```
We can infer in Fig. 3.4 that polynomials can be used to represent the true boiling point.
Cubic polynomial splines can indeed be used to model the TBP, allowing for intersecting
many points for interpolation purposes. In this example, we use ‘smoothing splines’, a mathematically more complex version of splines that is smoother and more flexible, as it does not require selecting the number of knots or cut-points.
###Install package
```{r, echo = TRUE}
install.packages("npreg")
```
###Model
```{r, echo = TRUE}
library(npreg)
model <- smooth.spline(x=data$Volume..., y=data$Temp...R)
plot(x=data$Volume...,y=data$Temp...R)
lines(model,col="blue")
```
Figure 3.5 shows the goodness of fit of the smoothing spline model when fitting to the
TBP data.
Example 3.6 MARS. The dataset Ex3.6.csv includes data of the heat transfer coefficient h
as a function of five design parameters (x 1 , x 2 , x 3 , x 4 , x 5 ) associated with a battery thermal
management system (BTMS). Create a model of h as a function of the design parameter
using MARS.
###Install package
```{r, echo = TRUE}
install.packages("earth")
```
###General
```{r, echo = TRUE}
setwd("C:/Book/Chapter3/Examples")
data <-read.csv(file="Ex3.6.csv",head=TRUE,sep=",")
data=data[c(1:35),c(1:6)];
```
###MARS
```{r, echo = TRUE}
library(earth)
mars1 <-earth(data$h~ ., data = data,degree=3)
print(mars1)
summary(mars1)
datax = data[c(1:35), c(1:5)];  #predictors only (all rows, to match data$h in the plot below)
prediction <- predict(mars1, datax)
prediction
plot(prediction, data$h, xlab="predicted 'h'", ylab="real 'h'")
best_line <- lm(prediction ~ data$h)
abline(best_line, col="red")
```
coefficients
(Intercept) 223.082128
h(5.416-x5) -44.360444
h(x5-5.416) 153.465815
x4 * h(x5-5.416) -39.650983
x1 * x4 * h(x5-5.416) 2.320801
Fig. 3.7 Real versus predicted value of the heat transfer coefficient h
The R-squared of the model is close to 1 (0.9897), showing the goodness of fit of
MARS in modelling and consequently predicting h. A key feature of MARS is variable importance: the estimated order of importance of the design variables is x5, x4, and x1, with x2 and x3 unused. This algorithm includes a backward elimination feature selection routine that estimates the error as each predictor is added to the model.
A graphical representation of the goodness of fit is shown in Fig. 3.7, where the real
versus the predicted values are plotted to fit the best line (R2 = 0.9970).
The corresponding R file of this example is Ex3.6.rmd.
3.2.7 Kriging
Kriging is a spatial interpolation method; it uses a limited set of sampled data points
to estimate the value of a variable by interpolation over a continuous spatial field. For
instance, the average monthly carbon dioxide concentration over a city varies across a
random spatial field. It differs from other simple methods like linear regression or splines
since it uses the spatial correlation between sampled points to estimate the variable’s
value through interpolation in the spatial field. Kriging weights are estimated such that
points close to the location of interest have more weight than those located farther away.
The Kriging procedure is performed in two steps: first, the spatial covariance structure
of the sample points is fitted in a variogram; second, weights derived from this structure
are used for interpolation in the spatial field. Remember that covariance measures the
direction of the relationship between two variables; thus, a positive covariance indicates
that both variables tend to be high or low simultaneously, while a negative covariance
means the opposite.
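As an illustration of this two-step procedure only, a minimal sketch with the gstat and sp packages (not used elsewhere in this book), on a made-up spatial dataset with coordinates x and y and a measured variable co2, could look as follows:
###Kriging (sketch)
```{r, echo = TRUE}
#Hypothetical ordinary kriging sketch; requires install.packages(c("sp","gstat"))
library(sp)
library(gstat)
set.seed(100)
samples <- data.frame(x = runif(60, 0, 10), y = runif(60, 0, 10))
samples$co2 <- 400 + 2*samples$x - 1.5*samples$y + rnorm(60, sd = 1)
coordinates(samples) <- ~x+y      #promote the data frame to a spatial object
#Step 1: fit the spatial covariance structure with a variogram
v <- variogram(co2 ~ 1, samples)
v_fit <- fit.variogram(v, model = vgm(psill = 2, model = "Sph", range = 5, nugget = 0.5))
#Step 2: interpolate over a regular grid using the fitted variogram
grid <- expand.grid(x = seq(0, 10, by = 0.5), y = seq(0, 10, by = 0.5))
coordinates(grid) <- ~x+y
gridded(grid) <- TRUE
kriged <- krige(co2 ~ 1, samples, newdata = grid, model = v_fit)
spplot(kriged["var1.pred"])       #map of the interpolated (kriged) values
```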
In simple linear regression, the linear model has two variables, x, the independent variable,
y, the dependent variable, and the parameters m and b, as per Eq. 3.1. We use a specific
method to estimate the parameters of the model and apply a certain criterion function:
\Phi = \sum_{i=1}^{n} (\hat{y}_i − y_i)^2    (3.19)

where \hat{y}_i are the estimated values of the dependent variable, and y_i are the measured values of the dependent variable. Here, we assumed that all the observations are equally reliable; otherwise, a weighted (w) sum of squares may be minimized:

\Phi = \sum_{i=1}^{n} w_{ii} (\hat{y}_i − y_i)^2    (3.20)

In the least squares (LSQ) method, the estimators are the values of the parameters m and b which minimize the objective function \Phi. Thus, we must calculate the derivatives \partial\Phi/\partial m and \partial\Phi/\partial b, equate them to zero, and solve the resulting system of equations to find m and b.
In linear LSQ, the objective function \Phi is a quadratic function of the parameters.
Like simple LSQ, non-linear least squares (NLLSQ) is the form of least squares anal-
ysis used to find n parameters in non-linear models. In NLLSQ the objective function
is quadratic with respect to the parameters only in a region close to its minimum value;
in this case, we use a truncated Taylor series as a good approximation to the model.
Some examples of non-linear least squares solvers include Gauss–Newton (GN), QR
decomposition, and gradient methods.
The default function to solve NLLSQ problems in R is nls, which includes the solvers
GN, Golub-Pereyra for partially linear least-squares problems, and port, an algorithm with
parameter bounds constraints.
Example 3.7 Non-linear least squares. The non-linear degradation of lithium-ion batteries
is observed in the second life of the battery (approximately at 80% of its initial capacity)
due to a degradation mechanism known as lithium plating. A dataset including capacity (in
%) as a function of the number of cycles N (number of times a battery can be fully charged
and discharged) is found in Ex3.7.csv. Find the coefficients of the non-linear model using
the function nls in R.
###Install package
```{r, echo = TRUE}
install.packages("stats")
```
###General
```{r, echo = TRUE}
setwd("C:/Book/Chapter3/Examples ")
data <-read.csv(file="Ex3.7.csv",head=TRUE,sep=",")
data=data[c(1:14),c(1:2)];
data
```
The mathematical function that relates capacity and the number of cycles is:

Capacity = k·N^p    (3.21)
###Model
```{r, echo = TRUE}
library(stats)
model <- nls(data$Capacity~ k*(data$N^p),
data = data,
start = list(k=4, p = -0.1),)
summary(model)
```
Formula: data$Capacity ~ k * (data$N^p)
Parameters:
Estimate Std. Error t value Pr(>|t|)
k 220.70788 13.06278 16.90 9.89e-10 ***
p -0.30227 0.01524 -19.83 1.54e-10 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
The tasks performed by any machine learning algorithm are executed using three
techniques: supervised, unsupervised, and reinforcement learning.
In supervised learning, a model learns from labelled data and can then provide the result of the problem for new inputs. Supervised learning deals with classification and regression problems. Regression problems are used for continuous data; some examples include linear regression and
non-linear regression. Classification problems predict discrete values, for instance, the
classification/reconnaissance of images; some examples include support vector machine
and logistic regression.
In unsupervised learning, there is no clean labelled or complete dataset; hence, the
algorithm is set to find hidden features and clusters in the data; some examples include
neural networks and k-means clustering.
Random forest is a machine learning algorithm that combines the outputs of decision
trees to obtain a single result. Its training algorithm applies bootstrapping or bagging
to tree learning, used to improve accuracy and stability, as well as minimize overfitting
and reduce variance. When bagging a training set A, new training sets are generated by
sampling from A uniformly and with replacement (which means that some observations
may be repeated); this ensures that each bootstrap step is independent. This technique is
useful when neural networks are unstable. Random forest, like neural networks, handles
both classification and regression problems.
The following example illustrates using a simple neural network and random forest.
Example 3.8 Neural network and random forest. The self-diffusion coefficient of a system
(DC) is a function of its concentration, the operating temperature, and the concentration of a
specific salt. Build an unsupervised machine learning prediction model for the corresponding
dataset (with normalized values) in Ex3.8.csv.
###Install package
```{r, echo = TRUE}
install.packages("neuralnet")
install.packages("grid")
install.packages("MASS")
install.packages("nnet")
```
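A minimal sketch of a neuralnet call consistent with the settings described next is shown below; the column names DC, Conc, Temp and Salt are assumptions about Ex3.8.csv, not taken from the book:
###Neural network (sketch)
```{r, echo = TRUE}
#Sketch only: train a single-hidden-layer neural network
library(neuralnet)
setwd("C:/Book/Chapter3/Examples")
data <- read.csv(file="Ex3.8.csv", head=TRUE, sep=",")
set.seed(100)
nn <- neuralnet(DC ~ Conc + Temp + Salt, data = data,
                hidden = 8,            #8 hidden nodes
                err.fct = "sse",       #sum of squares error
                act.fct = "tanh",      #hyperbolic tangent activation
                stepmax = 1000,        #maximum steps for training
                linear.output = FALSE) #activation applied to the output
plot(nn)
```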
The previous code included 8 hidden nodes, the sum of squares error (SSE), and the hyperbolic tangent as an activation function, with a step-max of 1000 (the maximum number of steps for training the neural network). The linear output is set as FALSE (the activation function is applied to the output) (Fig. 3.8).
The NN plot includes three input nodes (independent variables), the hidden layer, and
the output layer, all linked and showing their corresponding weights. The prediction error
for this NN configuration is 14.1. In addition to tuning the number of hidden nodes,
selecting the activation function is key when building and testing NNs.
We used the entire dataset in the previous code to build/train the neural network (NN).
A good practice when building NNs is to split the dataset into train (to build the NN) and
test (for prediction purposes). This practice is known as partitioning. A typical partitioning
or split ratio for many machine learning algorithms is 70/30. An example of a typical code
that can be used for this purpose is:
###Splitting
```{r, echo = TRUE}
sample <- sample(c(TRUE, FALSE), nrow(data), replace=TRUE, prob=c(0.7,0.3))
train <- data[sample, ]
test <- data[!sample, ]
```
Now, let us check the performance of random forest. The corresponding codes for
random forest are shown below:
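A minimal sketch with the randomForest package (which would need to be installed), assuming the column name DC for the self-diffusion coefficient in Ex3.8.csv, could be:
###Random forest (sketch)
```{r, echo = TRUE}
#Sketch only: random forest regression on the same dataset
#requires install.packages("randomForest")
library(randomForest)
set.seed(100)
rf <- randomForest(DC ~ ., data = data, ntree = 500) #500 trees
print(rf)  #reports the mean of squared residuals
```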
The mean of squared residuals for random forest is 0.025. A key parameter for tuning random forest is the number of trees; we typically try values ranging from 50 to 500 trees. Finally, like NN, splitting the dataset into train (to build the random forest model)
and test (for prediction purposes) is a good practice.
Random forest performs accurately when predicting the dependent variable in the pre-
vious example. NN usually requires more data to achieve the same level of accuracy
as random forest, but they learn and benefit (increase accuracy) from large amounts of
data. On the other hand, random forest is computationally less expensive than NNs, often
experiencing no performance gain when a threshold amount of data is reached.
The support vector machine (SVM) algorithms aim to find a hyperplane in an N-dimensional space (N being the number of features) to classify the data points within the boundary of the hyperplane.
Support vectors are used to maximize the margin distance between data points of different
classes by influencing the orientation and location of the hyperplane. The function that
maximizes the margin is hinge loss; hence, the cost function is zero if the predicted value
and the actual value are of the same sign; otherwise, the loss value is calculated. The
hyperplane contains the maximum number of points, so SVM is ideal for fitting dispersed
data.
A machine learning model with several parameters must be learned from the data.
Nevertheless, there are hyperparameters chosen by humans before the training begins,
based on trial and error and even intuition. In SVM, hyperparameters are typically found
by using simple optimization strategies such as grid search (refer to Chap. 6). In the
following example, we perform an SVM-based regression, including the performance
tuning of the model.
Example 3.9 Support vector machine. Fit the dataset (in Ex3.9.csv) relating the total concentration of SARS-CoV-2 in wastewater to the number of virus cases reported by the city hospitals over time.
###Install packages
```{r, echo = TRUE}
install.packages("e1071")
install.packages("ggpubr")
```
###Linear regression
```{r, echo = TRUE}
setwd("C:/Book/Chapter3/Examples ")
data <-read.csv(file="Ex3.9.csv",head=TRUE,sep=",")
data=data[c(1:35),c(2:3)];
data
summary(model)
```
Call:
lm(formula = PCR_hospital ~ Concentration, data = data)
Residuals:
Min 1Q Median 3Q Max
-6725.3 -2857.5 -521.6 2407.4 6963.7
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -1811.2 853.1 -2.123 0.0413 *
Concentration 13913.4 1114.5 12.484 4.77e-14 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
The adjusted R-squared (0.8252) is close to 1; this might suggest a good fit for the data using a linear model; nevertheless, the residual standard error is 3726! Moreover,
a dispersed cluster of data points is located in the bottom left part of Fig. 3.9. These
features suggest that SVM might be a good model for fitting the data. The corresponding
R code for SVM is:
###SVM
```{r, echo = TRUE}
library(e1071)
setwd("C:/Book/Chapter3/Examples")
data <- read.csv(file="Ex3.9.csv", head=TRUE, sep=",")
data = data[c(1:35), c(2:3)];
data
#The SVM fit and base plot are implied by Fig. 3.10 and completed here
modelSVM <- svm(PCR_hospital ~ Concentration, data)
predictedY <- predict(modelSVM, data)
plot(data$Concentration, data$PCR_hospital, pch=16)
points(data$Concentration, predictedY, col = "red", pch=4)
```
The first section of the previous code models the two variables using SVM and plots
the model’s predicted values (red) against the measured concentration of the virus (blue).
A linear model is also added (black) for comparison purposes (Fig. 3.10).
Fig. 3.10 Untuned SVM model and linear regression model for Example 3.9
The SVM model does not accurately adjust the data; therefore, tuning is required.
Three parameters are typically required for tuning SVM models: (i) the regularization
parameter, which helps to balance the model complexity and empirical error; (ii) gamma,
which adjusts for overfitting or underfitting; and (iii) epsilon, which gives a tolerable error
of the regression model. In the following section of the code, we improve the quality of
the SVM model by modifying the parameter epsilon. The tuning is visualized in Fig. 3.11.
###Tuning SVM
```{r, echo = TRUE}
#Add linear model for comparison:
modelL <- lm(PCR_hospital ~ Concentration, data)
predictedYL <- predict(modelL, data)
points(data$Concentration, predictedYL, col = "blue", pch=4)
#Tuning SVM
tuneResult <- tune(svm, PCR_hospital ~ Concentration, data = data,
  ranges = list(epsilon = seq(0.08, 0.15, 0.01), cost = 2^(2:9))
)
print(tuneResult)
#Draw the tuning graph
plot(tuneResult)
tunedModel <- tuneResult$best.model
tunedModelY <- predict(tunedModel, data)
library("ggpubr")
cor(tunedModelY, data$PCR_hospital, method = c("pearson", "kendall", "spearman"))
rsq <- cor.test(tunedModelY, data$PCR_hospital, method = c("pearson", "kendall", "spearman"))
print(rsq)
plot(data$Concentration, data$PCR_hospital, xlab="Virus concentration (copies/L)",
  ylab="Number of reported COVID cases in hospitals", pch=16)
points(data$Concentration, tunedModelY, col = "red", pch=4)
```
In Fig. 3.11, the dark patches show the optimal zone in which the algorithm’s epsilon and computational cost are balanced. R automatically finds the optimal zone once we define a range for epsilon (from 0.08 to 0.15, in our case), which must be changed in the code and re-run, hence obtaining the optimal regression.
As shown in Fig. 3.12, the performance of the tuned SVM model is remarkable, with an R2 value of 0.99.
Fig. 3.12 Tuned SVM model and linear regression model for Example 3.9
In this section, we have studied some supervised and unsupervised machine learning algorithms that are typically used for prediction. In the next section, we explore distribution models, a set of statistical techniques with several applications in the process engineering field.
3.5 Distribution Models
A data distribution is a function that describes the values a variable can take and how frequently they occur, transforming the raw data into a meaningful summary and visualization tool. The first step we shall perform to fit the data to a specific distribution is conducting exploratory data analysis (EDA) to learn about the features in the dataset that might help us find patterns in it.
There are several types of data distribution, including Bernoulli, binomial, normal (Gaussian), Poisson, uniform, Weibull, and gamma, among others. These models are represented by a specific standard parameterization formula, where the corresponding parameters are typically estimated using a maximum likelihood estimator. In this section, we study three distribution models typically used in process, mechanical, and other engineering disciplines.
Example 3.10 Normal distribution. Let us again fit the data included in Ex2.7 (filename renamed Ex3.10.csv) to a normal distribution.
3.5 Distribution Models 93
The function fitdistr from the package MASS allows fitting the data to different types
of distribution, such as normal distribution. We can extract the distribution parameters
using estimate, then verify the normality using a histogram (visual inspection, as in Example 3.7) and the Shapiro–Wilk normality test, which is the formal and recommended test for normality.
###Install packages
```{r, echo = TRUE}
install.packages("MASS")
```
###Normal distribution
```{r, echo = TRUE}
setwd("C:/Book/Chapter3/Examples")
data <- read.csv(file="Ex3.10.csv", head=TRUE, sep=",")
library(MASS)
fit <- fitdistr(data$Temperature, "normal")
class(fit)
para <- fit$estimate
para
hist(data$Temperature, xlab="Temperature")
shapiro.test(data$Temperature)
```
The Shapiro–Wilk normality test indicates whether the data are not significantly different from a normal distribution. The p-value in this case is 0.05102 (> 0.05); therefore, the data can be considered normally distributed.
The Weibull probability density function can be written as f(x; κ, λ) = (κ/λ)(x/λ)^{κ−1} e^{−(x/λ)^κ} for x ≥ 0, where κ > 0 is the shape parameter and λ > 0 is the scale parameter of the distribution. The Weibull distribution effectively provides reliability characteristics using a relatively small sample size.
• A value of κ < 1 indicates that the failure rate decreases over time.
• A value of κ = 1 indicates that the failure rate is constant over time.
• A value of κ > 1 indicates that the failure rate increases over time.
Example 3.11 Weibull distribution. The data in Ex3.11.csv includes the days a device was on test before failure (no censored data). Fit the data to a Weibull distribution and find the device reliability at 15 years.
The data is fitted to a Weibull distribution using fitdist, and the parameters shape and
scale are then estimated. The corresponding summary and plots are shown below:
###Install packages
```{r, echo = TRUE}
install.packages("MASS")
install.packages("fitdistrplus")
install.packages("weibullness")
```
AIC and BIC are discussed in the next section of this chapter (Fig. 3.13).
We performed a test [2] to formally verify that the data follows a Weibull distribution
(p-value > 0.05):
###Weibull distribution
```{r, echo = TRUE}
setwd("C:/Book/Chapter3/Examples")
data <- read.csv(file="Ex3.11.csv", head=TRUE, sep=",")
data <- data[c(1:10), c(1:1)];
library(MASS)
library(fitdistrplus)
#Fit the Weibull distribution (this call is implied by the summary discussed above)
fitW <- fitdist(data, "weibull")
summary(fitW)
plot(fitW)
#Formal Weibull goodness-of-fit test (assumed to be wp.test from the
#weibullness package installed above)
library(weibullness)
wp.test(data)
```
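As a sketch of the reliability question in Example 3.11, assuming the test times in Ex3.11.csv are recorded in days and using the fitted object fitW from the chunk above, the reliability at 15 years is the probability of surviving beyond that time:
###Reliability at 15 years (sketch)
```{r, echo = TRUE}
#Sketch: reliability R(t) = 1 - F(t) at t = 15 years from the fitted Weibull
shape <- fitW$estimate["shape"]
scale <- fitW$estimate["scale"]
t <- 15 * 365 #15 years expressed in days (assumption about the time units)
reliability <- 1 - pweibull(t, shape = shape, scale = scale)
reliability
```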
The gamma probability density function is

f(x; α, β) = \frac{β^{α} x^{α−1} e^{−βx}}{Γ(α)}   for x > 0, α, β > 0    (3.24)

where Γ is the gamma function, α is the shape parameter, and β is the rate parameter.
Gamma distributions have been used to model degradation processes (e.g., lithium-ion
batteries) [3]; in medicine, to model the age distribution of cancer incidence [4], and
other engineering and science applications.
Example 3.12 Gamma distribution. The data in Ex3.12.csv includes the capacity fade (%
capacity) of a capacitor over time (in days). Fit the data to a Gamma distribution and estimate
the capacity after 30 days.
###Install packages
```{r, echo = TRUE}
install.packages("MASS")
install.packages("fitdistrplus")
install.packages("dgof")
```
The data is fitted to a Gamma distribution using fitdist, and the parameters shape
and scale are then estimated. The corresponding summary and plots are shown below
(Fig. 3.14):
###Gamma distribution
```{r, echo = TRUE}
setwd("C:/Book/Chapter3/Examples ")
data <-
read.csv(file="Ex3.12.csv",head=TRUE,sep=",")
data<-data[c(1:17),c(1:2)];
data
library(MASS)
library(fitdistrplus)
#Fit the Gamma distribution (this call is implied by the summary discussed above)
fitG <- fitdist(data$Capacity, "gamma")
summary(fitG)
plot(fitG)
#Gamma test
library(goft)
gamma_test(data$Capacity)
```
We performed a test [5] to formally verify that the data follows a Gamma distribution
(p-value > 0.05):
data: data$Capacity
V = -1.4082, p-value = 0.3194
3.6 Model Performance and Validation
Several metrics are used to measure model performance. Any metric or error estimated with respect to the data used to train or validate a predictive model is called in-sample, while metrics computed on new test data are called out-of-sample. The difference between the predicted value and the actual value for the in-sample data is called the residual for each point, while the corresponding out-of-sample difference is called the prediction error.
A model evaluation can be performed for model selection, comparison and/or tuning.
Many techniques and metrics are used, sometimes simultaneously for cross-validation
purposes, including: (i) regression performance metrics, such as R2 and adjusted R2, mean squared error (MSE) or root mean squared error (RMSE), mean absolute error, F-score, and others; (ii) bias–variance trade-off and/or model complexity metrics,
such as the residual sum of squares; and (iii) model validation and selection metrics, such
as the Akaike information criterion (AIC) and Bayesian information criterion (BIC).
The main errors associated with predictive analytics are both in-sample and out-of-
sample errors. Model performance on training data is typically optimistic; therefore, the
data errors are usually low compared to the out-of-sample errors. A crucial decision-
making process for a data analyst is considering trade-offs between the types of errors. In
many cases, such as evaluating health outcomes using machine learning techniques, false
negatives might be expected, depending on the selected trade-off for the estimated errors.
The AIC is an estimator of prediction error, allowing for estimating the relative quality of statistical models for a given dataset. In practice, we select a set of potential models to represent/predict the data, and then we calculate the models’ corresponding AIC values. Let us say that we have three potential candidates for a given dataset, and we calculate the values of AIC_1, AIC_2, and AIC_3, and let AIC_{min} be the minimum of these three values. Then, we estimate the quantity exp((AIC_{min} − AIC_i)/2), or relative likelihood of a model, indicating the probability that the i-th model minimizes the information loss. Therefore, the best-fit model is the one that explains the greatest amount of variation using the fewest number of independent variables. The BIC is related to the AIC; when comparing several models, the ones with lower BIC are preferred; nevertheless, a lower BIC does not always indicate that one model is better than another!
Example 3.13 Model selection. Let us compare models for the data included in Ex3.2.csv,
using the AIC and BIC.
###Install packages
```{r, echo = TRUE}
install.packages("AICcmodavg")
```
###General
```{r, echo = TRUE}
setwd("C:/Book/Chapter3/Examples ")
data <-read.csv(file="Ex3.2.csv",head=TRUE,sep=",")
data=data[c(1:83),c(1:4)];
data
```
Now, let us fit the data to a linear (model 1) and a full quadratic model (model 2).
###Model 1 - linear
```{r, echo = TRUE}
b1<-data$InvT;
b2<-data$SOC;
b3<-data$time;
model1 <-lm(data$Capacity~b1+b2+b3)
summary(model1)
```
###Model 2 - quadratic
```{r, echo = TRUE}
b1<-data$InvT;
b2<-data$SOC;
b3<-data$time;
b12<-data$InvT*data$SOC;
b13<-data$InvT*data$time;
b23<-data$SOC*data$time;
b11<-data$InvT*data$InvT;
b22<-data$SOC*data$SOC;
b33<-data$time*data$time;
model2 <- lm(data$Capacity ~ b1+b2+b3+b12+b13+b23+b11+b22+b33)
summary(model2)
```
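A minimal sketch of the comparison, using the AICcmodavg package installed above and assuming model1 and model2 from the chunks just shown, could be:
###AIC and BIC comparison (sketch)
```{r, echo = TRUE}
#Sketch: rank the candidate models by information criteria
library(AICcmodavg)
models <- list(model1, model2)
model.names <- c("linear", "full quadratic")
aictab(cand.set = models, modnames = model.names) #AIC-family ranking (AICc by default)
bictab(cand.set = models, modnames = model.names) #BIC-based ranking
```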
Another popular test to decide if a sample comes from a population with a specific distribution is the Kolmogorov–Smirnov test. An important advantage of this test is that the distribution of its test statistic does not depend on the cumulative distribution function being tested; moreover, it is an exact test, so its validity does not rely on having a large sample size. Some of the disadvantages of this test include that it tends to be more sensitive near the centre of the distribution than at the tails, it only applies to continuous distributions, and, perhaps most critically, the location and parameters of the distribution must be predefined.
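As a minimal, self-contained illustration (with simulated values rather than one of the book's datasets), the test is available in base R as ks.test, with the candidate distribution and its parameters specified in advance:
###Kolmogorov-Smirnov test (sketch)
```{r, echo = TRUE}
#Hypothetical illustration of the Kolmogorov-Smirnov test
set.seed(100)
x <- rnorm(50, mean = 10, sd = 2)
ks.test(x, "pnorm", mean = 10, sd = 2) #a large p-value: cannot reject this distribution
```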
3.7 Correlation and Causality
While causation and correlation can exist simultaneously, we cannot always state that correlation implies causation. A correlation implies that there is a statistical association between variables (two or more variables are related). Causation indicates that one event or variable causes another event or response. An appropriate design of experiment (DOE) typically reveals causation; for instance, we can run an experiment where similar groups receive different treatments, and then we can record and analyze the outcomes of each group to finally conclude that a treatment does or does not cause an effect if and only if
the groups have significantly different outcomes. Causal inference in statistics involves
studying a system by measuring one variable that we suspect might affect the measure
of another. Three conditions are required to claim causal inference: (i) covariation, (ii)
discarding rival explanations for the association between variables, and (iii) temporal
ordering. Researchers deal with causality by trying to provide a framework to rightfully
claim it; several efforts include general theories like the Structural Causal Model where
we analyze the covariance of any pair of observed variables and understand the inter-
action and/or confounding of variables as well as counterfactuals, attributions, and other
potential ingredients affecting causal inference.
How is machine learning (ML) addressing causality? Causal supervised learning, for
instance, can help us enhance predictive models (e.g., using invariant feature learning
approaches); causal generative modeling, on the other hand, can provide the basis to
generate counterfactual samples. Causal Reinforcement Learning proved to be efficient in
de-confounding data. But, despite these and several other research efforts and how promis-
ing ML looks at solving the causation dilemma, it is quite challenging to get evaluation
data for ML-causation-based algorithms. This book presents a series of recommended
readings at the end of this chapter, as this subject deserves an extended and deep under-
standing. However, in Example 3.14, we illustrate the use of causal forest to deal with
generated data on a health outcome (binary).
Example 3.14 Causal random forest. A causal forest can be used to estimate the conditional
average treatment effect. When a treatment assignment is binary and unconfounded, we can
estimate the potential outcomes of the two possible treatment states. Let us estimate whether the
variables socioeconomic state, race, parity, sex, and concentration of a carcinogen (in parts
per million, converted to binary, with values above a limit threshold set to 1) might or
might not cause an illness (expressed as a probability, also converted to a binary variable
using a binomial distribution), employing this algorithm in R. The corresponding
data is included in Ex3.14.csv.
The variable BINaml, created in the code below, transforms the probability of getting the illness
into a binary variable (simulating binomial trials).
###Install packages
```{r, echo = TRUE}
install.packages("grf")
#install sufrep from zip file
```
###General
```{r, echo = TRUE}
library(sufrep)
setwd("C:/Book/Chapter3/Examples ")
data <-
read.csv(file="Ex3.14.csv",head=TRUE,sep=",")
data=data[c(1:160),c(1:9)];
dataX=data[c(1:160),c(1:8)];
#The probability of getting the illness
dataY=data$Prob;
BINaml <- rbinom(length(dataY),1,dataY);
```
###Causal forest
```{r, echo = TRUE}
library(grf)
n<-length(dataY)
W<-rbinom(n,1,0.5)
c.forest<-causal_forest(dataX,BINaml,W,tune.num.reps=80)
average_treatment_effect(c.forest)
```
The ATE is a causal estimand that calculates the difference between the potential outcomes
that would be observed if the exposure of all individuals were set to 1 versus 0. The ATE is
interpreted as the difference in risk when everyone (in the population) is exposed versus
when no one is. A value different from zero might indicate causality.
While the previous example and its approach are quite simplistic, we encourage the
reader to verify causation between variables, as this will lead to accurately predicting events
and responses and to explaining why those events and/or responses happen. Better decisions are
based on causation, not correlation.
In this Chapter, we showed the use of data-based modelling for prediction purposes from
datasets to describe the behaviour of a system and/or a process. We explored different
regression models (linear and non-linear), supervised and unsupervised machine learning
algorithms, and distribution models. Moreover, we examined their adequacy by discussing
physical meaning when predicting values (physical validation), model performance and
validation using estimators/errors, and model comparison and selection criteria. Finally,
we emphasized the need for the engineer or researcher
to verify the causation of the model/dataset, as better decision-making processes can be
effectively performed when the cause and effect between variables is understood and
verified.
Data Disclosure
The data was generated, maintaining the described problem’s physical meaning of the
analyzed phenomenon or process.
Problems
3.1 Use the different datasets in this chapter; fit the data and perform predictions
using different regression models, ML techniques and distribution models. Compare
models’ performance.
3.2 For problem 3.1, perform a sensitivity analysis on the training and tuning parameters
of the ML techniques.
3.3 Design and run a causality problem to verify lithium-ion degradation’s cause(s).
Resources
Npreg in R: https://rdocumentation.org/packages/npreg/versions/1.0-9
earth in R: https://rdocumentation.org/packages/earth/versions/5.3.2
stats in R: https://rdocumentation.org/packages/stats/versions/3.6.2
neuralnet in R: https://rdocumentation.org/packages/neuralnet/versions/1.44.2
grid in R: https://rdocumentation.org/packages/grid/versions/3.6.2
MASS in R: https://rdocumentation.org/packages/MASS/versions/7.3-58.3
brnn in R: https://rdocumentation.org/packages/brnn/versions/0.9.2
randomForest in R: https://rdocumentation.org/packages/randomForest/versions/4.7-1.1
caret in R: https://rdocumentation.org/packages/caret/versions/6.0-94
e1071 in R: https://rdocumentation.org/packages/e1071/versions/1.7-13
ggpubr in R: https://rdocumentation.org/packages/ggpubr/versions/0.6.0
fitdistrplus in R: https://rdocumentation.org/packages/fitdistrplus/versions/1.1-11
weibullness in R: https://rdocumentation.org/packages/weibullness/versions/1.23.8
dgof in R: https://rdocumentation.org/packages/dgof/versions/1.4
R-codes and data repository: https://github.com/CHE408UofT/DGSD_UofT
Recommended Readings
Martens, H. (2023). Causality, machine learning and human insight. Analytica Chimica
Acta, 1277, 341585. https://doi.org/10.1016/j.aca.2023.341585
Rodríguez, E. M. (2023, March 27). Causal ML: What is it and what is its importance?.
Plain Concepts. https://www.plainconcepts.com/causal-ml/
Steinwart, I., & Christmann, A. (2008). Support Vector Machines. Springer New York.
Sullivan, W. (2017). Machine Learning for Beginners Guide Algorithms: Decision
tree & random forest introduction. Healthy Pragmatic Solutions Inc.
Thomas, S. (2022, March 2). What is a residual in stats? Outlier. https://articles.outlier.org/what-is-a-residual-in-stats
Burger, S. V. (2018). Introduction to machine learning with R: Rigorous mathematical analysis. O’Reilly Media, Inc.
What is Random Forest? IBM. (n.d.). https://www.ibm.com/topics/random-forest
Wickham, H., Çetinkaya-Rundel, M., & Grolemund, G. (2023). R for Data Science:
Import, Tidy, transform, visualize, and model data. O’Reilly Media, Inc.
References
The modern control theory (MCT), also known as model-based control (MBC), was
introduced in the 1960s, with the parametric state-space model developed by Kalman.
MBC refers to the control basis for linear and non-linear systems. Linear control systems
methodologies typically include robust control, zero-pole assignment, and linear-quadratic
regulator (LQR) design, while non-linear control systems methodologies typically include
feedback linearization, backstepping controllers, etc. MBC requires first modeling or identifying
the plant or process and then designing the controller using a plant model representing the true
system, with certainty equivalence being the fundamental assumption in this theory. Hence, a
model-based controller might only work efficiently if the plant model falls into the assumed model
set. Alternatively, data-driven or data-based control (DBC) approaches were proposed, in which the
controller is designed using online or offline input/output (I/O) data from the controlled process,
without requiring an explicit mathematical model of it. Therefore, DBC controller design uses and
depends only on plant measurement I/O data; stability, robustness, and convergence are guaranteed
by mathematical analysis under reasonable assumptions. DBC works efficiently for processes whose
identification-based or first-principles models are highly non-linear or of too high an order.
DBC approaches include (i) online data-based methods such as simultaneous per-
turbation stochastic approximation, model-free adaptive control, and unfalsified control
methodology; and (ii) offline data-based methods like the PID control method, the itera-
tive feedback tuning (IFT), the correlation-based tuning, and others. Most DBC methods
are designed using controller parameter tuning approaches, and some of them assume the
controller structure a priori. In any case, there is no established theoretical framework
for DBC, and advancements in this theory have skyrocketed with the advent of machine-
learning techniques and hardware capabilities. In this Chapter, we illustrate the use of
data-based modelling for control purposes through two examples: one on basic control theory,
tuning a proportional-integral-derivative (PID) controller, and another regarding model
predictive control.
A simple inspection using a plot reveals that the optimal values for Kp , Ti , and Td
correspond to set 1 (Fig. 4.2).
###Optimal set
```{r, echo = TRUE}
setwd("C:/Book/Chapter4/Examples")
data <-read.csv(file="Ex4.1.csv",head=TRUE,sep=",")
data
plot(data$Set, data$Error, pch = 10, col = 2,
     xlab="Set", ylab="Error", xlim=c(1,12), ylim=c(9.4,10))
```
Nevertheless, minimizing the cost function fun (created using a full quadratic
polynomial) reveals that the optimal values are 10, 0.89, and 0.01, respectively:
###Optimization
```{r, echo = TRUE}
install.packages("stats")
library(stats)
#betas: coefficients of the fitted quadratic model for the error
fun=function(x)
  betas[1]+betas[2]*x[1]+betas[3]*x[2]+betas[4]*x[3]+
  betas[8]*x[1]^2+betas[9]*x[2]^2+betas[10]*x[3]^2
result <- optim(fn=fun, par = c(1,1,0.01),
                lower=c(0,0,0), upper=c(10,1,0.06), method="L-BFGS-B")
result
```
Call:
lm(formula = data$Error ~ b1 + b2 + b3 + b12 + b23 + b13 + b11 +
b22 + b33)
Residuals:
         1          2          3          4          5          6
-5.204e-18  1.792e-17  4.962e-02 -2.114e-02  3.404e-02 -4.237e-02
         7          8          9         10         11         12
-6.704e-02  2.940e-02  1.596e-02  5.719e-02 -9.221e-02  3.657e-02
$par
[1] 10.0000000 0.8921718 0.0100000
$value
[1] 9.747709
Classic control suits most control problems. Model Predictive Control (MPC) is a fairly general
control approach that can be applied to almost all control problems. This approach is based on
real-time optimization of a mathematical model; MPC predicts the future system behavior and
determines the optimal trajectory of the manipulated variable by adjusting a process model. A
simplified block diagram of an MPC-based control loop is shown in Fig. 4.3, where u is the
manipulated variable acting on the controlled variable y.
An excellent example of an MPC code (in Python and Excel) is presented in Ref.
[2]. In their example, a sequence of manipulated variable (MV) adjustments that drive
the controlled variable (CV) is estimated for a linear dynamic model along an expected
reference trajectory or target.
MPC uses a dynamic model of the response of the process variables. Changes to the manipulated
variables (the control moves) are calculated to force the process variables to follow a
predefined trajectory until reaching a target. An optimal
controller is designed by minimizing the error from the trajectory. MPC models can be
linear, empirical, first principle, machine learning, and hybrid-based. MPC is typically
used in MIMO (multiple input–multiple outputs) systems.
In the example presented in Ref. [2], the process variable is driven toward the target in
a SISO system (single input, single output system), by moving the manipulated variable.
The cost function is a quadratic error between Model and Target, which is minimized via
optimization.
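Although Ref. [2] implements this idea in Python and Excel, a minimal R sketch (the first-order process model, horizon, move penalty, and bounds below are our illustrative assumptions, not the model from Ref. [2]) conveys the mechanics: a sequence of MV moves is computed by minimizing a quadratic cost between the predicted CV trajectory and the target.
```{r, echo = TRUE}
# assumed first-order discrete model: y[k+1] = a*y[k] + b*u[k]
a <- 0.9; b <- 0.5
y0 <- 0; target <- 1          # current CV value and target
N  <- 10                      # prediction/control horizon
mpc_cost <- function(u) {
  y <- numeric(N); yk <- y0
  for (k in 1:N) {
    yk   <- a * yk + b * u[k] # predicted CV trajectory
    y[k] <- yk
  }
  # quadratic tracking error plus a small penalty on MV moves
  sum((y - target)^2) + 0.01 * sum(diff(c(0, u))^2)
}
res <- optim(par = rep(0, N), fn = mpc_cost, method = "L-BFGS-B",
             lower = rep(-1, N), upper = rep(1, N))
res$par[1]   # only the first move is applied; the optimization is repeated at the next step
```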
PID controllers are popular in industrial control systems since they compare measured data with a
reference value, minimizing the error between them so that the system reaches and stays at a
setpoint value. The parameters of PID controllers must be tuned based on the
requirements of system performance. Conversely, MPC depends on the process’s dynamic
models; these models are typically linear models obtained by system identification or non-
linear models that can be prescribed by machine learning tools in complex systems. The
optimizer finds an optimal control input to minimize a cost function relating to the model
response and a predefined target. MPC is an ideal option for MIMO systems.
In this Chapter, we showed the use of data-based modelling for control purposes
through one example for tuning a proportional-integral-derivative (PID) controller. While
this example is a simple illustration of the enormous potential of using data-based
modelling for control purposes, for the control of complex systems, such as manufacturing
processes, we recognize that machine learning algorithms offer clear advantages over
traditional control approaches, as they can effectively handle uncertainties and changes in
the process and can be used to derive models of the plant for analysis, simulation,
controller design, and model-based estimation for control.
Data Disclosure
The data was generated, maintaining the described problem’s physical meaning of the
analyzed phenomenon or process.
Problems
4.1 Review, run, and discuss the set of MPC examples provided in: https://github.com/
rhalDTU/MPCR.
4.2 Propose possible control strategies for a distillation column. Discuss their advan-
tages and disadvantages.
Resources
nls in R: https://www.rdocumentation.org/packages/stats/versions/3.6.2/topics/nls.
optim in R: https://www.rdocumentation.org/packages/stats/versions/3.6.2/topics/optim.
R-codes and data repository: https://github.com/CHE408UofT/DGSD_UofT.
Recommended Readings
Dummies guide to PID. PID for Dummies - Control Solutions. (n.d.). https://csimn.com/CSI_pages/PID.html
Hou, Z.-S., & Wang, Z. (2013). From model-based control to data-driven control: Survey, classification and perspective. Information Sciences, 235, 3–35. https://doi.org/10.1016/j.ins.2012.07.014
Kouvaritakis, B., & Cannon, M. (2016). Model predictive control: Classical, robust and stochastic. Springer.
Seborg, D. E. (2019). Process dynamics and control. Wiley.
Zulu, A. (2017). Towards explicit PID control tuning using machine learning. In 2017 IEEE AFRICON, Cape Town, South Africa, 2017 (pp. 430–433). https://doi.org/10.1109/AFRCON.2017.8095520
Zhou, Y. (2022, April 23). A summary of PID control algorithms based on AI-enabled embedded systems. Security and Communication Networks, 1939-0114. https://doi.org/10.1155/2022/7156713
References
When we optimize a process or a system, we look at selecting the best element from
a set of available options under certain conditions or criteria. An optimization problem
consists of minimizing or maximizing a real function, typically called objective function,
cost function, or loss function, by systematically choosing input values from within an allowed
set and computing the value of the function.
An optimization problem is mathematically represented by:
• A local optimum is a minimum or maximum of f within a given region of the
input space. Unless f is convex (in minimization), we can get several local minima.
Nonconvex problems are tackled with global optimization, which looks at converging
to the actual optimal solution of these problems.
• A global minimum is a point where the value of f is smaller than or equal to the value
at all other optimal solutions or feasible points.
and bounded by
$$x_i^L \le x_i \le x_i^U, \quad i = 1, 2, \ldots, n \qquad (L:\ \text{lower},\ U:\ \text{upper}) \qquad (5.3)$$
The non-dominated solution is the set of all solutions that are not dominated by any
element of the solution set; the feasible space of the non-dominated set is called the
Pareto-optimal set, while the boundary of all points mapped from the Pareto-optimal
set is called the Pareto-optimal front (see Fig. 5.1). The goal of the MOO is to find a
diverse set of solutions, interpreted as a favorable trade-off between the objectives and,
therefore, located as close as possible to the Pareto-optimal front.
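As a minimal illustration of these definitions (the two-objective sample points below are arbitrary and both objectives are minimized), the non-dominated set can be extracted as follows:
```{r, echo = TRUE}
# each row is a solution evaluated on two objectives (both minimized)
objs <- rbind(c(1, 9), c(2, 7), c(3, 8), c(4, 4), c(6, 3), c(7, 5))
dominates <- function(p, q) all(p <= q) && any(p < q)
# a solution is non-dominated if no other solution dominates it
non_dominated <- sapply(1:nrow(objs), function(i)
  !any(sapply(1:nrow(objs), function(j) j != i && dominates(objs[j, ], objs[i, ]))))
objs[non_dominated, ]   # approximation of the Pareto-optimal front
```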
When designing PID controllers, for instance, optimization can be conducted by mini-
mizing an error function to find the parameters of the controller, using the optim function
and the method L-BFGS-B, or limited memory Broyden-Fletcher-Goldfarb-Shanno, an
optimization algorithm in the family of quasi-Newton methods, looking at finding zeroes
or local maxima and minima of functions. In this Chapter, we introduce optimization
methods grouped as (i) Grid Search, Random Search, and Gradient Search, (ii) evo-
lutionary algorithms such as genetic algorithms and particle swarm, and (iii) Bayesian
inference, including MOO problems.
5.2 Grid Search, Random Search, and Gradient Search
Grid search is an optimization algorithm that selects the best values from a list of provided
value options. This algorithm is typically used for hyperparameter tuning. A domain (e.g.,
a hyperparameters domain) is divided into a discrete grid; then, every combination of the
grid is tried, and performance metrics are calculated to verify the effectiveness of the
algorithm in finding optimal values.
Random search, on the other hand, initializes x with a random position in the search space,
samples a new position y from a hypersphere surrounding the current position, and, if f(y) < f(x),
moves to the new position by setting x = y; the procedure is iterated until a termination
criterion is met.
Examples 5.1 and 5.2 illustrate the use of grid search and random search to identify
the variables of the Arrhenius equation.
Example 5.1 Arrhenius equation–grid search. The Arrhenius equation represents the
fraction of collisions with enough energy to overcome the activation energy in a chemical
reaction. The formula of the Arrhenius equation is as follows:
$$k = A\,e^{-\frac{E_a}{RT}} \qquad (5.4)$$
where k is the rate constant, A is the pre-exponential factor, Ea is the activation energy, R
is the universal gas constant, and T is the absolute temperature. In the following code, we
use an equally spaced grid to identify (find) the parameters A and Ea.
###Install packages
```{r, echo = TRUE}
install.packages("pracma")
```
```{r, echo = TRUE}
#Arrhenius example: k, R and Temp (true rate constant, gas constant and
#absolute temperature) are assumed to be defined beforehand
start_time <- Sys.time()
n<-20000;
m<-20000;
A<-seq(450,500,length.out=n);   #equally spaced grid for A
Ea<-seq(12,18,length.out=m);    #equally spaced grid for Ea
f_tar<-100000;
f_error<-numeric(n);
for (i in 1:n) {
  for (j in 1:m) {
    kdot<-A[i]*exp(-Ea[j]/(R*Temp));
    f_error[i]<-(k-kdot)^2;
    if (f_error[i]<=f_tar) {
      f_tar<-f_error[i];
      x1_opt<-A[i];
      x2_opt<-Ea[j];
    } else {
      iter<-i;
      break;
    }
  }
}
x1_opt;
x2_opt;
kdot
end_time <- Sys.time()
end_time - start_time
```
By searching A and E a in an equally spaced grid within the ranges of 450–500 and
12–18, respectively, we find the values of 480.0 and 15.6, minimizing an error function
between the true rate constant and the approximated value. The error was estimated to be
a 4.1% relative error, with a running time of 1.2 min.
Example 5.2 Arrhenius equation–random search. In the following code, we use random
search to identify (find) the parameters A and Ea (see example 5.1).
###Random search
```{r, echo = TRUE}
start_time <- Sys.time()
#k, R and Temp are assumed to be defined beforehand, as in Example 5.1
numIter<-10000000;
f_tar <- 100000;
f_error<-numeric(numIter);
Ea<-runif(numIter,12,18);    #random candidates for Ea
A<-runif(numIter,450,500);   #random candidates for A
for (i in 1:numIter) {
  kdot<-A[i]*exp(-Ea[i]/(R*Temp));
  f_error[i]<-(k-kdot)^2;
  if (f_error[i]<=f_tar) {
    f_tar<-f_error[i];
    x1_opt<-A[i];
    x2_opt<-Ea[i];
  } else {
    iter<-i;
    break;
  }
}
x1_opt;
x2_opt;
end_time <- Sys.time()
end_time - start_time
```
By randomly searching A and Ea within the ranges of 450–500 and 12–18, respectively, we find
the values of 475.5 and 17.7, minimizing an error function between the true rate constant and
the approximated value. The error was estimated to be a 1.3% relative error, with a running time
of 1.7 s. Random search performs better than grid search when considering accuracy and
computational cost. The grid search performance can be improved by using a finer mesh or a
non-equally spaced mesh.
Evolutionary algorithms (EAs) are inspired by nature, as they emulate the process of
natural selection. They include four steps: initialization, selection, genetic operators, and
termination. In EAs, fitter members survive and reproduce, while unfit members die off
without contribution to the gene pool of subsequent generations.
Genetic algorithms (GA) are a subset of EA. The process cycle of GA is shown in
Fig. 5.2.
The GA process begins with initialization, a stage where an initial population of can-
didate solutions is randomly generated as binary strings. At each generation step, a pool
of parents is selected using a selection mechanism to pass on the genetic material to sub-
sequent generations. Then, a child population is created by variation operations, such as
crossover and mutation, forming the basis of the next generation. Crossover usually takes
pairs of parents from the parent population using random selection with replacement
until the new child population reaches the same size as the original parent population.
Mutation, on the other hand, introduces new material into the population by randomly
changing codons on the chromosome. Once the child population is created, the children
are evaluated by assigning a fitness value to rank the population. Then, the old popula-
tion is replaced with the new child population, usually with the generational replacement
method. Finally, the GA terminates after a predetermined number of iterations or until a
stopping criterion has been met.
In the previous example, we determined the optimal solutions for the operating parameters
leading to maximizing the profit of a process plant. Highly non-linear surrogate models
obtained from simulated and/or plant data can be coupled to genetic algorithms for mini-
mization or maximization purposes, destined to optimize costs, energy consumption, and
environmental restrictions. Now, let us study the method and solutions generated by using
another popular optimization algorithm: Particle Swarm Optimization.
Particle Swarm Optimization (PSO) is a simple optimization algorithm not dependent
on the gradient or differential form of the objective function. It is a biologically based
algorithm where some members of a flock of birds, for instance, can profit from the
experience of all other members of the flock. Each bird helps find the optimal solution
in a high-dimensional solution space; the approach is heuristic since the found solution is
close to the global optimum. The algorithm starts with several random points or particles
on the plane and then looks for the minimum point in random directions; after a certain
number of iterations, the minimum point of the function becomes the minimum point ever
searched by the swarm of particles.
Example 5.3 Genetic algorithm and particle swarm optimization. The tension–com-
pression spring design problem shown in Fig. 5.3 is a continuous constrained problem
where we look at minimizing the volume V of a coil spring under a constant tension/
compression load [1]. This problem has been discussed by several researchers employing
different optimization algorithms, as shown in reference [1]. The mathematical formulation
of the optimization problem is shown as follows:

$$\min\; V(x) = (x_3 + 2)\, x_2\, x_1^2 \qquad (5.5)$$

where x1 is the wire diameter, x2 is the mean coil diameter, and x3 is the number of active coils.
Let us solve the optimization problem of minimizing the volume V using genetic
algorithms and particle swarm optimization.
The lower and upper bound of each variable is shown in Table 5.1.
The inequality constraints for the optimization problem are given as follows:
$$1 - \frac{x_2^3 x_3}{71785\,x_1^4} \le 0 \qquad (5.6)$$

$$\frac{4x_2^2 - x_1 x_2}{12566\,(x_2 x_1^3 - x_1^4)} + \frac{1}{5108\,x_1^2} - 1 \le 0 \qquad (5.7)$$

$$1 - \frac{140.45\,x_1}{x_2^2 x_3} \le 0 \qquad (5.8)$$

$$\frac{x_2 + x_1}{1.5} - 1 \le 0 \qquad (5.9)$$
Note: In many cases, constraints can also be surrogate models fitted with the corresponding
data.
Genetic Algorithm and Particle Swarm Optimization
```{r, echo = TRUE}
library(GA)
library(pso)
#penalized fitness function, shared by GA and PSO
fitness <- function(x) {
  #objective function
  fitness_value <- (x[3]+2)*x[2]*x[1]^2;
  #constraints
  co1 <- 1 - x[2]^3*x[3]/(71785*x[1]^4);
  co2 <- (4*x[2]^2-x[1]*x[2])/(12566*(x[2]*x[1]^3 - x[1]^4)) + 1/(5108*x[1]^2) - 1;
  co3 <- 1 - (140.45*x[1])/(x[2]^2*x[3]);
  co4 <- (x[2]+x[1])/1.5 - 1;
  #imposing constraints (penalty when any constraint is violated)
  fitness_value <- ifelse(co1 <= 0 & co2 <= 0 & co3 <= 0 & co4 <= 0,
                          fitness_value, fitness_value + 1000)
  return(fitness_value)
}
#genetic algorithm, run with the settings reported in the output below
GA <- ga(type = "real-valued", fitness = fitness,
         lower = c(0.05, 0.25, 2), upper = c(2, 1.3, 15),
         popSize = 150, maxiter = 1000, elitism = 8,
         pcrossover = 0.8, pmutation = 0.1)
summary(GA)
#particle swarm optimization
set.seed(90)
psoptim(rep(NA,3), fn = fitness, lower = c(0.05, 0.25, 2),
        upper = c(2, 1.3, 15))
```
GA settings:
Type = real-valued
Population size = 150
Number of generations = 1000
Elitism = 8
Crossover probability = 0.8
Mutation probability = 0.1
Search domain =
x1 x2 x3
lower 0.05 0.25 2
upper 2.00 1.30 15
GA results:
Iterations = 1000
Fitness function value = 1087.244
Solution =
x1 x2 x3
[1,] 1.989891 1.297155 14.98572
── PSO ───────────────────
$par
[1] 0.05378783 0.40934120 8.76019214
$value
[1] 0.01274305
$counts
function iteration restarts
13000 1000 0
$convergence
[1] 2
$message
Both optimization algorithms (GA and PSO) perform poorly when minimizing the objec-
tive function since this is a convex optimization problem, which means that all the
constraints are convex functions, and the objective function is also a convex function (if
minimizing). In this case, gradient descent is recommended as an effective optimization
algorithm. Would any setting improve the previous solutions?
Now let us solve a multi-objective optimization (MOO) using a popular algorithm:
NSGA-II.
The non-dominated sorting genetic algorithm II (NSGA-II) is an algorithm that effec-
tively deals with issues such as (i) computational complexity, (ii) non-elitism approach,
and (iii) the specification of a sharing parameter. Moreover, the dominance is modified to
solve constrained MOO problems efficiently. In the following example, we illustrate the
use of NSGA-II in solving an optimization problem involving two objective functions.
$$\text{effectiveness} = \frac{1 - e^{-\mathrm{NTU}(1-c)}}{1 - c\,e^{-\mathrm{NTU}(1-c)}} \qquad (5.10)$$
The lower and upper bound of each variable is shown in Table 5.2.
The corresponding code in R is shown as follows:
return(c(f1,f2))
}
res <- nsga2(hex, 2, 2, generations=1000,
             lower.bounds=c(0.1, 1), upper.bounds=c(0.9, 10),
             mprob=0.2, cprob=0.8, popsize=200, vectorized=FALSE)
Figure 5.4 shows different Pareto-optimal solutions for the objective functions effec-
tiveness and cost. For instance, a feasible solution would be a heat exchanger of
effectiveness = 0.9, costing approximately 3,400 USD.
The tuning parameters used in this example include mprob (mutation probability),
cprob (crossover probability), and popsize (size of the population); these values are tested
to determine the sensitivity of the solution to them, for instance, with population sizes of 50, 100,
and 200; mutation probability of 0.1, 0.2, and 0.5 (and less typical values of 0.8 and 1.0);
and crossover rate of 0.5, 0.6, up to 0.9.
5.6 Bayesian Inference and Optimization
Bayesian optimization aims at locating a global maximum or minimum of all feasible val-
ues in the environment; the search follows a guided policy to iteratively find the sampling
location, get the observation value, and refresh the policy. The objective function can be
a black box function; hence, the interaction with the environment is done by sampling
at a specific location. This sampled value is then corrupted by noise (Gaussian), which
is an indirect evaluation of the actual sampling value. A gradient can be used to further
optimize this method and, thus, improve the functional evaluation. Mathematically, this
model can be expressed as the probability distribution of the function (which includes a
perturbation or Gaussian noise) based on a location x and a true function value, which
means that there is a normally distributed probability function for the actual observation
around the objective function, spread by the noise variance.
Bayesian inference requires a prior distribution, the likelihood for a specific parame-
ter, a posterior distribution, and the evidence of the data. In the Bayesian approach, the
parameter of interest is a random variable following a probability distribution over all
feasible values, which can be obtained by employing the Bayes rule.
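As a minimal numeric sketch of this update (a conjugate beta-binomial example with illustrative numbers, unrelated to the datasets in this book):
```{r, echo = TRUE}
# prior Beta(2, 2) on a success probability; observed 7 successes in 10 trials
a_prior <- 2; b_prior <- 2
successes <- 7; trials <- 10
# Bayes rule with a conjugate prior: posterior is Beta(a + successes, b + failures)
a_post <- a_prior + successes
b_post <- b_prior + trials - successes
c(posterior_mean = a_post / (a_post + b_post))
curve(dbeta(x, a_post, b_post), from = 0, to = 1,
      xlab = "parameter", ylab = "posterior density")
```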
Example 5.5 Bayesian optimization. The total cost (in USD) of a pipe carrying a liquid
can be calculated as the combination of individual costs, such as pipe material, installation,
depreciation, flow rate, energy cost for pumping, maintenance, liquid properties, pumping
efficiencies, and taxes. The following function gives the economical optimal pipe diameter
that minimizes this total cost [2]:
$$f = \left[\frac{D^{4.84+n}\, n X E\,(1+F)\bigl(Z + ab\,(1-\varphi)\bigr)}{(1+0.794\,L_e D)\; 0.000189\, Y K \rho^{0.84}\mu^{0.16}\left[(1+M)(1-\varphi)+\dfrac{Z M}{ab'}\right]}\right]^{1/2.84} \qquad (5.12)$$
where:
ab: fractional annual depreciation (a) and maintenance (b) on pipeline (0.2).
ab_dash: fractional annual depreciation on pumping installation (a) and installed cost of
pipeline, including fittings (b) (0.4).
D: pipe diameter (ft – unknown).
E: combined fractional efficiency of pump and motor (0.5).
F: factor for installation and fitting (6.7).
K: energy cost delivered to the motor (0.04 USD/kWh).
Le: factor for friction in fitting, equivalent length in pipe diameter per length of pipe
(2.74 1/ft).
M: factor to express cost of piping installation, in terms of yearly cost of power
delivered to the fluid.
n: exponent in pipe-cost Eq. (1.35).
P: installation cost of pump and motor (150 USD/HP).
X: cost of 1 ft of 1 ft diameter pipe (29.52 USD).
Y: days of operation per year (365 d).
Z: fractional rate of return of incremental investment (0.1).
φ: factor for taxes and other expenses (0.55).
We are aiming to find the optimal diameter by minimizing the pipe cost.
The corresponding code in R is shown as follows:
D<-0.1;
fd <- function(D) {
  numerator<-((D^(4.84+n))*n*X*E*(1+FF)*(Z+ab*(1-fi)));
  denominator<-(1+0.794*Le*D*(0.000189*Y*K*(ro^0.84)*(miu^0.16)))*((1+M)*(1-fi)+(Z*M/ab_dash));
  f<- -(Q-(numerator/denominator)^(1/2.84))^2;
  list(Score=f,Pred=0);
}
All variables are first defined and assigned to their corresponding values. An initial
value of 0.1 is assigned to the pipe diameter. The function fd includes the numerator and
denominator of Eq. 5.12 separately to define the objective function as the squared
error (with a negative sign, since the algorithm maximizes the score, so that the error is
minimized). Score and Pred
are then set as the function and zero, respectively; Pred is a table with validation/cross-
validation prediction for each algorithm iteration. The search bound for the diameter is
then defined between 0.1 and 0.8 ft. Finally, the Bayesian optimization is performed by
setting the number of iterations = 20 and initial points (random points chosen to sample
the target function before the algorithm fits the Gaussian process). Other parameters to
explore for this algorithm include the acquisition function and tuning parameters kappa
and epsilon. The best parameter (the optimal diameter) is then reported by the algorithm.
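A minimal sketch of the optimization call described above, using the rBayesianOptimization package (the seed, the number of initial points, and the acquisition settings shown are illustrative assumptions):
```{r, echo = TRUE}
library(rBayesianOptimization)
set.seed(1)
opt_result <- BayesianOptimization(
  fd,                               # objective defined above (returns Score and Pred)
  bounds = list(D = c(0.1, 0.8)),   # search bound for the diameter (ft)
  init_points = 5,                  # illustrative: random samples taken before fitting the GP
  n_iter = 20,                      # iterations of Bayesian optimization
  acq = "ucb", kappa = 2.576, eps = 0.0)
opt_result$Best_Par
```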
In Example 5.5, the model for approximating the objective function is given (Eq. 5.12).
Nevertheless, Bayesian optimization is ideal for complex optimization problems (e.g., com-
putationally expensive), as the algorithm employs a probabilistic model to optimize the
objective function. Basically, Bayesian optimization has two components: a Bayesian sta-
tistical model (e.g., a Gaussian process) to model the objective function and an acquisition
function for deciding where to sample next. Gaussian processes are ideal for the statis-
tical model, as they are tractable and flexible. The function GP_fit in R can be used to fit such Gaussian process models.
Data Disclosure
The data was generated, maintaining the described problem’s physical meaning of the
analyzed phenomenon or process.
Problems
5.1 Review, run, compare, and discuss all the previous examples using different opti-
mization algorithms. Is there a metric (or set of metrics) to effectively compare these
algorithms?
5.2 Perform sensitivity analyses on each optimization algorithm’s tuning parameters to
evaluate their impact on the found solutions.
Resources
GA documentation: https://www.rdocumentation.org/packages/GA/versions/3.2.3/topics/ga
PSO documentation: https://cran.r-project.org/web/packages/pso/pso.pdf
NSGA-II documentation: https://cran.r-project.org/web/packages/nsga2R/nsga2R.pdf
Bayesian Optimization I: https://cran.r-project.org/web/packages/rBayesianOptimization/rBayesianOptimization.pdf
Bayesian Optimization II: https://www.rdocumentation.org/packages/ParBayesianOptimization/versions/1.2.6/topics/bayesOpt
R-codes and data repository: https://github.com/CHE408UofT/DGSD_UofT
Recommended Readings
Boyd, S. P., & Vandenberghe, L. (2011). Convex optimization. Cambridge Univ. Pr.
Chaves, I.D.G., López, J.R.G., Zapata, J.L.G., Robayo, A.L., Niño, G.R. (2016). Process Optimization
in Chemical Engineering. In: Process Analysis and Simulation in Chemical Engineering. Springer,
Cham. https://doi.org/10.1007/978-3-319-14812-0_7
Deb, K., Pratap, A., Agarwal, S., & Meyarivan, T. (2002). A fast and elitist multiobjective genetic algo-
rithm: NSGA-II. IEEE Transactions on Evolutionary Computation, 6(2), 182-197. https://doi.org/
10.1109/4235.996017.
Kochenderfer, M. J., & Wheeler, T. A. (2019). Algorithms for optimization. The Mit Press.
Edgar, T. F., Himmelblau, D. M., & Lasdon, L. S. (2001). Optimization of Chemical Processes.
McGraw-Hill.
Simon, D. (2013). Evolutionary optimization algorithms: Biologically-inspired and population-based
approaches to computer intelligence. Wiley-Blackwell.
Yu, X., & Gen, M. (2013). Introduction to evolutionary algorithms. Springer London.
References
R is a robust statistical software used for data analysis in different fields. Like Python, it
includes a set of libraries and extensive documentation to build and run codes. RStudio
is an integrated development environment, or interface, for R. We used the desktop form of
this tool, available at https://posit.co/download/rstudio-desktop/, which takes us to the
RStudio installation/download page.
CRAN is the R network that includes up-to-date versions of codes and documenta-
tion for R. CRAN can be accessed through https://cran.r-project.org/. R documentation is
found in the following link: https://www.rdocumentation.org/packages/dgof/versions/1.4.
Finally, CRAN also contains a comprehensive introduction to R worth checking: https://cran.r-project.org/doc/manuals/r-release/R-intro.html.
Data analysis and data analytics are sometimes interchangeable in many contexts, but
they are, indeed, different. Data analysis is extracting meaning from data to make the right
decision. Data analytics is a more complex process since we use data and techniques to
find new and/or complex insights to make enhanced predictions. In data analysis, we
collect, manipulate, and examine the data for insight. In data analytics, we analyze and
work with the data to make the right decision. A third term without a universal definition
is machine learning (ML). We can safely define machine learning as a
branch of computer science that combines data and algorithms to solve complex problems
that could be cost-prohibitive for humans. It is often associated with artificial intelligence
or AI (typically used interchangeably), but ML can be seen as a branch of AI. With
AI, humans want to create intelligent machines simulating human behavior and human
capabilities; ML, on the other hand, while learning and adapting through experience, is
used for more specific tasks and applications.
There are different sources of open datasets all over the web. From shared data collected
by agencies (e.g., COVID-19 datasets) to research data referred to in peer-reviewed papers
(e.g., Data in Brief, by Elsevier, a publishing company), and open datasets. Kaggle is, per-
haps, one of the most visited sites (https://www.kaggle.com/datasets) due to its ample
collection of datasets and codes, learning tools, and discussion forums. One interest-
ing dataset for process engineers to test their knowledge in data analytics is Tennessee
Eastman Process Simulation Dataset for anomaly detection evaluation (https://www.kaggle.com/datasets/averkij/tennessee-eastman-process-simulation-dataset); be aware of the
user agreement of any dataset, for accessing, downloading, modifying, and citing or
referencing.
Assessing the physical meaning of the results generated from exploratory data analy-
sis (EDA), data-based modelling, data-based control, and optimization tasks makes data
analytics meaningful for process engineers.
Clean data typically results from managing outliers and missing data, clustering, and
performing dimensionality reduction. These tasks seem quickly achievable by running a
set of codes and techniques: an eyes-closed process! Let us imagine mathematically omit-
ting outliers associated with an unknown modification in the feed characterization of a
process plant or discarding variables identified in first-principle equations. Would your
model be reliable and representative of your phenomenon? What would be the impact of
using your model in a decision-making process? The physical meaning includes clearly
understanding, when possible, your variables’ potential interaction and effects. Risky sit-
uations can arise when, for instance, correlations defeat causality; our responsibility as
process engineers is to provide meaningful and accurate solutions; therefore, you must
comprehend how your process or system works! In some specific cases, this understand-
ing is quite challenging: in the early stages of investigating degradation in capacitors and
lithium-ion batteries, it was impossible to identify its causes and effects, as they are asso-
ciated with side reactions, challenging to measure and/or quantify. There are also complex
settings such as modelling earth sciences or out-of-space-related data or health outcomes.
While failing to account for confounding variables can play a role in accurately modelling
these systems, our inference techniques shall be supported at least by educated assumptions
and hypotheses and, moreover, mathematically cross-validated using two or more techniques to
solve a specific task, or even by testing our models within the operational envelope to verify
that they keep physical meaning. For instance, we can use a training dataset to generate a model
to predict the pressure drop in equipment, with R2 close to 1 and negligible residuals, and still
test our model to discover that, within an expected range of operating variables, it provides
negative values. We can see similar out-
comes in optimization when our constraints and bounds are poorly defined, for example,
leading to unfeasible physical solutions! The message is clear for process engineers: We
are process engineers, we are not data analysts performing data analytics in a different
field, and as such, we understand the physical meaning of processes and systems.
In Chap. 1, we identified at least three different sources of data for analysis. When ranking
the familiarity of process engineers in acknowledging and/or using one or more of these
sources, we can think that plant process data, pilot plant, and laboratory data are, indeed,
quite familiar when we operate chemical plants or conduct research, for instance; followed
by process simulation data, when we design or troubleshoot a process, and finally, a
probably less familiar concept, synthetic data. Artificially generated data, however, can
be intuitively seen, for example, as interpolating or extrapolating data, as a means of
obtaining meaningful values when measured data is insufficient or biased.
The quality of the plant process data depends on the quality, maintenance, and loca-
tion of the sensors and instruments collecting the data. Gathering this information over
time is crucial for monitoring, controlling, and troubleshooting, but it can also be used for
design and optimization.
The quality of the pilot plant and laboratory data depends on the design of experi-
ments to capture the phenomenon at different process variable ranges, which is typically
constrained by time and resources (e.g., equipment and/or computational cost).
The quality of the data generated by process simulation software might be unquestion-
able if and only if our simulation tool is capable of strictly reproducing the system or
process we simulate, which depends on the accuracy and extent in defining, for instance,
kinetics networks (e.g., side reactions, yields, and catalysts), simulation modes (steady-
state or dynamics), thermodynamics packages, a representative feed characterization,
among others.
The quality of synthetic data is still a debatable subject, as best practices shall be stan-
dardized to make it a safe option in many fields; moreover, machine learning is a relatively
new field of study compared to process simulation, for instance, and new and more effec-
tive algorithms are being developed at a rapid pace. On the other hand, it is understandable
that limited amounts of data inevitably require fast and cost-effective ways to scale our
models; hence, synthetic data can be seen as the best option to generate data. It has been
shown in some applications that synthetic data might even enhance the performance of
predictive models! In any case, there is a critical requirement for ensuring the quality of
synthetic data: using clean data, which might require harmonization (being merged from
different sources). This step shall be followed by assessing the similarity of our synthetic
data with our real data to make proper decisions.
Finally, the combination of sources of data is also a reality in our process engi-
neering industry. Accelerated laboratory tests generate data that can also be harmo-
nized with online data captured by sensors and instruments. Reconciling data via har-
monization is a complex and iterative process that requires, as per any data analysis per
se, an understanding of the physical phenomena being measured and analyzed.
All sources of data are equally important, as they can provide different insights and
serve cross-validation purposes when designing, predicting, controlling, and optimizing
processes.
Simple visualization can save cost and time. We can observe trends and already identify
patterns for further modelling solutions. Furthermore, it allows us to identify outliers and
missing values. It is good practice for data analytics when we start analyzing our data,
but exploratory data analysis is not a trickled-down process. We might manage outliers
using specific techniques to later discover, by simple visualization, that our patterns have
completely changed! Hence, simple visualization can fit at any step of the data analytics
framework as a helpful tool that can save cost, time, and accuracy!
Principal Component Analysis is perhaps the most robust and accurate technique when
studying the relationship between variables. Saturated PCA graphs can be expected as we
increase the number of variables to analyze. We consider that running correlograms (the world
is not always linear, we know…) might help us pre-understand this relationship, which can even
be contradicted when running a PCA or computing Sobol indices.
save us cost and time when analyzing data.
Splitting is a good practice when building and testing your machine learning model,
enhancing the model’s performance when predicting. Generally, we use a training set
including 80% of the data, 10% for a validation set, and 10% for a test set. This rule is
a good split to start with; however, it is not set in stone. Splitting requires finding an optimum
split by carefully analyzing the dimension of the data, the type of model, and even factors
specific to your process or system (e.g., temporal conditions like ramping
up the temperature in a reactor, which would require clustering your data according to
the process events).
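An 80/10/10 split can be sketched as follows (df denotes a generic, already loaded dataset; the proportions and seed are illustrative):
```{r, echo = TRUE}
set.seed(42)
n <- nrow(df)                                  # df: your dataset (assumed already loaded)
idx <- sample(c("train", "val", "test"), size = n,
              replace = TRUE, prob = c(0.8, 0.1, 0.1))
train_set <- df[idx == "train", ]
val_set   <- df[idx == "val", ]
test_set  <- df[idx == "test", ]
sapply(list(train = train_set, val = val_set, test = test_set), nrow)
```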
Another important feature of machine learning tools is using standardization or normalization
to preprocess the data. Both help to avoid distorting the differences in the ranges of values.
The main difference between them is that a Gaussian distribution is assumed in standardization;
hence, variables contribute equally to the analysis. Which one is better? When using
neural networks, for example, if the data has different dimensions, sometimes it is not
helpful to make assumptions about the data distribution; therefore, normalization is the
right data preprocessing option. A process engineer must choose their data preprocessing
technique before fitting a machine learning model to enhance its performance.
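Both preprocessing options can be sketched in a few lines (x is an illustrative numeric variable):
```{r, echo = TRUE}
x <- c(250, 260, 255, 300, 280)          # illustrative process variable
# standardization (z-score): zero mean, unit standard deviation
x_std <- (x - mean(x)) / sd(x)           # equivalent to scale(x)
# normalization (min-max): rescaled to the [0, 1] range
x_norm <- (x - min(x)) / (max(x) - min(x))
rbind(standardized = round(x_std, 2), normalized = round(x_norm, 2))
```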
A particular topic when using machine learning is regularization, which refers to tech-
niques used to minimize overfitting or underfitting of the models. In neural networks, for
instance, Bayesian regularization is a powerful technique (package brnn in R); we invite
our readers to test datasets using this technique, as ML models sometimes efficiently learn
the training data but fail when generalizing for new data.
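A minimal sketch of fitting a Bayesian-regularized neural network with brnn (the toy data and the number of neurons are illustrative assumptions):
```{r, echo = TRUE}
library(brnn)
set.seed(7)
x <- seq(-3, 3, length.out = 60)
y <- sin(x) + rnorm(60, sd = 0.1)        # illustrative noisy data
fit <- brnn(y ~ x, neurons = 3)          # Bayesian regularization limits overfitting
plot(x, y); lines(x, predict(fit), col = 2)   # fitted curve
```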
Finally, what seems obvious might not be. When using model distributions to fit
data, use a reliable test to evaluate the goodness of fit of the distribution. Moreover,
for all other models, using different metrics and comparing between them is a must;
we introduced BIC and AIC, R2 , residuals, and other errors, but there are many other
ways to explore when comparing models’ performance, so make sure the data is fit to
the right model that mathematically and physically represents the process or system.
Machine learning techniques are proven to support tuning tasks for control purposes.
Process control does not always require sophisticated algorithms, as PID controllers can
often do the job. Model predictive control (MPC), on the other hand, is an advanced
subject that was not deeply covered in this book, as it requires understanding the
mathematical principles defining it; in practice, MPC is not a simple task even for one
input and one output variable, but it is a predecessor of modern control in process plants
and complex automated systems, where advanced machine learning techniques are required
to control several tasks on several input and output variables.
Artificial neural networks can be used to approximate the objective function in optimization
problems [1]. This reduces the need for less efficient heuristic solutions that might not find
optimal solutions.
Another crucial consideration when optimizing process engineering problems is select-
ing the right optimization algorithm. As we studied in Chap. 5, convex optimization
problems are not efficiently solved using genetic algorithms or particle swarm optimization. We
recommend, when possible, using gradient-based optimization algorithms and leaving
other, more complex (e.g., non-convex) problems for GA or PSO.
There is so much to learn about data analytics for process engineers in a fast-paced world
where enthusiastic people in this field quickly develop and share thousands of lines of
code and algorithms annually. We clustered the recent trends in (i) enhancing process
control and automation strategies, (ii) providing effective tools for failure analysis and
inspection, and (iii) applying big data analytics.
Some examples of these trends include (i) using convolutional neural networks for
failure localization and 3D characterization of materials and equipment [2, 3]; (ii) using
reinforcement learning in nuclear plants to improve the automated control of reactors, a
process-safety application that might bring us to a fully automatic control operation of
these plants; (iii) big data applications for process monitoring and control, manufacturing,
and modelling and optimization [4, 5].
Process engineering is one of the many fields benefiting from the advancement of
data analytics and machine learning. Process engineers must keep up with this progress
by staying updated in efficiently using these tools to ensure competitiveness in the market
and, ultimately, providing enhanced solutions to design, monitor, troubleshoot, control,
and optimize processes.
References
1. Villarrubia, G., De Paz, J. F., Chamoso, P., & la Prieta, F. D. (2018). Artificial neural networks
used in optimization problems. Neurocomputing, 272, 10–16. https://doi.org/10.1016/j.neucom.
2017.04.075
2. Paulachan, P., Siegert, J., Wiesler, I., & Brunner, R. (2023). An end-to-end convolutional neu-
ral network for automated failure localisation and characterisation of 3D interconnects. Scientific
Reports, 13(1). https://doi.org/10.1038/s41598-023-35048-0.
3. Chang, Z., Wan, Z., Xu, Y., Schlangen, E., & Šavija, B. (2022). Convolutional neural network
for predicting crack pattern and stress-crack width curve of air-void structure in 3D printed
concrete. Engineering Fracture Mechanics, 271, 108624. https://doi.org/10.1016/j.engfracmech.
2022.108624
4. Park, J., Kim, T., Seong, S., & Koo, S. (2022). Control automation in the heat-up mode of a
nuclear power plant using reinforcement learning. Progress in Nuclear Energy, 145, 104107.
https://doi.org/10.1016/j.pnucene.2021.104107
5. Sadat Lavasani, M., Raeisi Ardali, N., Sotudeh-Gharebagh, R., Zarghami, R., Abonyi, J., &
Mostoufi, N. (2021). Big data analytics opportunities for applications in process engineering.
Reviews in Chemical Engineering, 39(3), 479–511. https://doi.org/10.1515/revce-2020-0054