Industrial Applications
of Machine Learning
Chapman & Hall/CRC
Data Mining and Knowledge Discovery Series
Series Editor: Vipin Kumar
Data Classification
Algorithms and Applications
Charu C. Aggarwal
Healthcare Data Analytics
Chandan K. Reddy and Charu C. Aggarwal
Accelerating Discovery
Mining Unstructured Information for Hypothesis Generation
Scott Spangler
Event Mining
Algorithms and Applications
Tao Li
Text Mining and Visualization
Case Studies Using Open-Source Tools
Markus Hofmann and Andrew Chisholm
Graph-Based Social Media Analysis
Ioannis Pitas
Data Mining
A Tutorial-Based Primer, Second Edition
Richard J. Roiger
Data Mining with R
Learning with Case Studies, Second Edition
Luís Torgo
Social Networks with Rich Edge Semantics
Quan Zheng and David Skillicorn
Large-Scale Machine Learning in the Earth Sciences
Ashok N. Srivastava, Ramakrishna Nemani, and Karsten Steinhaeuser
Data Science and Analytics with Python
Jesus Rogel-Salazar
Feature Engineering for Machine Learning and Data Analytics
Guozhu Dong and Huan Liu
Exploratory Data Analysis Using R
Ronald K. Pearson
Human Capital Systems, Analytics, and Data Mining
Robert C. Hughes
Industrial Applications of Machine Learning
Pedro Larrañaga et al.
For more information about this series please visit:
https://www.crcpress.com/Chapman--HallCRC-Data-Mining-and-Knowledge-Discovery-Series/book-series/CHDAMINODIS
Industrial Applications
of Machine Learning
Pedro Larrañaga
David Atienza
Javier Diaz-Rozo
Alberto Ogbechie
Carlos Puerto-Santana
Concha Bielza
CRC Press
Taylor & Francis Group
6000 Broken Sound Parkway NW, Suite 300
Boca Raton, FL 33487-2742
This book contains information obtained from authentic and highly regarded sources. Reasonable
efforts have been made to publish reliable data and information, but the author and publisher cannot
assume responsibility for the validity of all materials or the consequences of their use. The authors and
publishers have attempted to trace the copyright holders of all material reproduced in this publication
and apologize to copyright holders if permission to publish in this form has not been obtained. If any
copyright material has not been acknowledged please write and let us know so we may rectify in any
future reprint.
Except as permitted under U.S. Copyright Law, no part of this book may be reprinted, reproduced,
transmitted, or utilized in any form by any electronic, mechanical, or other means, now known or
hereafter invented, including photocopying, microfilming, and recording, or in any information
storage or retrieval system, without written permission from the publishers.
For permission to photocopy or use material electronically from this work, please access
www.copyright.com (http://www.copyright.com/) or contact the Copyright Clearance Center, Inc.
(CCC), 222 Rosewood Drive, Danvers, MA 01923, 978-750-8400. CCC is a not-for-profit organization
that provides licenses and registration for a variety of users. For organizations that have been granted
a photocopy license by the CCC, a separate system of payment has been arranged.
Trademark Notice: Product or corporate names may be trademarks or registered trademarks, and
are used only for identification and explanation without intent to infringe.
Visit the Taylor & Francis Web site at
http://www.taylorandfrancis.com
and the CRC Press Web site at
http://www.crcpress.com
Contents
Preface xi
2 Machine Learning 19
2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
2.2 Basic Statistics . . . . . . . . . . . . . . . . . . . . . . . . . . 23
2.2.1 Descriptive Statistics . . . . . . . . . . . . . . . . . . . 23
2.2.1.1 Visualization and Summary of Univariate Data 24
2.2.1.2 Visualization and Summary of Bivariate Data 26
2.2.1.3 Visualization and Summary of Multivariate
Data . . . . . . . . . . . . . . . . . . . . . . 26
2.2.1.4 Imputation of Missing Data . . . . . . . . . . 29
2.2.1.5 Variable Transformation . . . . . . . . . . . . 31
2.2.2 Inference . . . . . . . . . . . . . . . . . . . . . . . . . 32
2.2.2.1 Parameter Point Estimation . . . . . . . . . 32
2.2.2.2 Parameter Confidence Estimation . . . . . . 36
2.2.2.3 Hypothesis Testing . . . . . . . . . . . . . . 36
2.3 Clustering . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
2.3.1 Hierarchical Clustering . . . . . . . . . . . . . . . . . . 40
2.3.2 K-Means Algorithm . . . . . . . . . . . . . . . . . . . 42
Bibliography 293
Index 325
Preface
Madrid, Spain
September 2018
1
The Fourth Industrial Revolution
1.1 Introduction
Nowadays, global economies are undergoing a technology shift with all its positive and negative connotations. As we have learned from history, technology changes enrich society in terms of education, cohesion and employment. However, the shifts that have happened in recent history have taken time to build structures capable of setting off the desired leap in industrial development.
Technology shifts, shown in Figure 1.1, are commonly called industrial
revolutions because they are closely related to productivity and have caused
disruptive change in manufacturing processes since the 18th century. As a result,
specific fields of technology were improved. The first industrial revolution
used water and steam power to mechanize production. During the second
industrial revolution, water and steam power were replaced by electricity,
which boosted productivity even further. In the third industrial revolution,
electronic systems and information technologies (IT) were used to increase
factory automation.1
Today’s technology shift is called the fourth industrial revolution (4IR).
It is a blurry mixture of the digital and physical worlds, leveraging emerging
digital technologies that are able to gather and analyze data across production
machines, lines and sites. It merges the third industrial revolution’s IT, such as
computer integrated manufacturing (Bennett, 1985), machine learning (Samuel,
1959), the Internet (Kleinrock, 1961) and many other technologies, with oper-
ational technologies (OT) to create the disruptive technologies that are the
backbone of the 4IR. A technical report published by PricewaterhouseCoopers
(2017) listed the top ten technologies as being:
1. Advanced materials with improved functionality, mechanical and
chemical properties, e.g., nanomaterials.
2. Cloud technology capable of delivering computational capabilities
over the Internet without the need for local and expensive machines.
1 The fourth industrial revolution: what it means and how to respond.
https://www.weforum.org/agenda/2016/01/the-fourth-industrial-revolution-what-it-means-and-how-to-respond/
FIGURE 1.1
Industrial technology shifts.
FIGURE 1.2
Giving added value to data from raw to actionable insights during the 4IR.
value defined as an actionable insight (Figure 1.2). Each data life cycle step
is described below:
quintillion-bytes-of-data-created-every-day-how-does-cpg-retail-manage-it/
3 OPC-UA. https://opcfoundation.org/about/opc-technologies/opc-ua/
4 RTI DDS-Secure. https://www.rti.com/products/secure
5 MQTT. http://mqtt.org/
The above data life cycle is the 4IR backbone for merging digital and
physical worlds. This data life cycle has been adopted around the world, albeit
according to slightly different approaches, which are briefly described in the
following sections.
6 Apache Hadoop. http://hadoop.apache.org/
7 https://www.xilinx.com/
Manufacturing. https://energy.gov/eere/downloads/report-president-capturing-domestic-competitive-advantage-advanced-manufacturing/
• Edge, where the elements are near the connected assets, which is useful for
real-time analytics and control technologies.
• Cloud, where the data are sent out to computing services over the Internet,
which is useful for complex analytics and data storage.
Against this backdrop, GE, together with IBM and SAP, founded the Industrial Internet Consortium (IIC)10 in March 2014, with the aim of bringing
together the companies and technologies needed to speed up the development,
adoption, and widespread sharing of data and information, smart analytics,
and people at work. Although IIoT started out as a mainly American initiative,
the IIC has now gone global with more than 200 member companies around
the world.
https://www.ge.com/digital/blog/everything-you-need-know-about-industrial-internet-things/
10 http://www.iiconsortium.org/
Germany and the USA. However, there are several country-specific variations.
Some of these approaches are briefly described below.
In France, 4IR was adopted in April 2015 as the Industrie du Futur, which is
oriented towards the digital transformation of French industry. It is primarily an
implementation of the achievements of European Commission (EU) initiatives
such as Factory of the Future. Industrie du Futur has borrowed five main
notions from the EU initiatives: (1) Development of the technology supply for
the factories of the future in areas where France can become a leader in the
next three to five years by supporting large structural projects out of industry.
The supply of technologies will be based on additive manufacturing, IoT,
augmented reality, etc. (2) Financial support for companies. (3) Training for
the next generation of employees in the knowledge and skills needed to apply
new technologies in the factories of the future. (4) Support for European and
international cooperation, fostering innovation strategies together with other
European countries, especially Germany, and other international alliances. (5)
Promotion of activities oriented to showcase 4IR-related French developments
and technology know-how.
In Spain, 4IR adoption is driven by Industria Conectada supported by the
Ministry of Economy, Industry and Competitiveness. In this case, the initiative
is designed to provide financial support and assistance to promote the digital
transformation of the Spanish industrial sector. Like Industrie du Futur, the
approach taken by Industria Conectada is aligned with the German Industrie
4.0. However, it takes a specific business solution approach focusing on big
data and analytics, cybersecurity, cloud computing, connectivity and mobility,
additive manufacturing, robotics and embedded sensors and systems as the
main areas of development.
In Asia, there are several approaches: Made in China 2025, Made in India
and ASEAN 4.0 for the Association of Southeast Asian Nations (ASEAN)
whose members include technology development leaders like Singapore and
Malaysia. All these approaches are aligned with Industry 4.0 and designed to
push forward their respective industries in order to increase competitiveness.
Japan, on the other hand, has taken a different approach called Society 5.0.
Society 5.0 is oriented towards the transformation of society into a super
smart society. This policy expects cyber-physical systems (CPS), viewed as the key elements capable
of combining cyber and physical space, to bring about a major societal shift.
Machines and artificial intelligence will be the main players in this fifth stage
of society.
In conclusion, the 4IR is more than technology development: it is an
industrial shift involving economic, technical and societal components aimed
at improving industrial competitiveness at all levels with a potential impact all
over the world. This revolution, and the different adopted policies, is leveraging
the smart industry described in Section 1.2.
FIGURE 1.3
Different levels of industry smartization: component, machine, production and distribution.
FIGURE 1.4
System integration of a production system: PLC, SCADA, MES and ERP levels.
• Edge tier collects the data sourced from different industrial levels: compo-
nent, machine, production line or logistics (see Section 1.2).
• Platform tier processes the data from the edge tier and provides a first
layer of services and feedback, where time is a critical variable for security
and integrity reasons.
• Enterprise tier collects information from the platform tier and deploys a
second layer of services that provides support for high-level decision making.
FIGURE 1.5
Role of machine learning within smart factory predictive maintenance12 .
learning algorithms that learn from the stream and not from databases (Silva
et al., 2013) or novelty detection algorithms, capable of performing online
learning from unknown situations (Faria et al., 2016).
2
Machine Learning

2.1 Introduction
Huge amounts of data have to be visualized, modeled and understood nowa-
days. Standard descriptive statistics provide a rough overview of the data.
Multivariate statistics and machine learning –a burgeoning field of artificial
intelligence– are used for data modeling, that is, to transform data into math-
ematical abstractions of reality that can be manipulated by computers to
produce accurate predictions in both static and dynamic scenarios.
Albeit a mathematical discipline, the practice of statistics has become a
more computational field since the emergence of computers. Besides, machine
learning aims to build algorithm-based systems that learn from data, improving
their performance automatically based on experience. Algorithms search within
a large space of candidate models to find the one that optimizes a previously
specified performance metric (Jordan and Mitchell, 2015).
Statistics and machine learning can be regarded as two different cultures
for arriving at useful conclusions from data (Breiman, 2001b). There are three
main types of conclusions (Fig. 2.1): (a) clustering, aiming to find groups of
similar inputs; (b) prediction, forecasting the response for future inputs; and
(c) association discovery, looking for (probabilistic) relationships among input
and output variables. In industry, these conclusions mostly have to be drawn
from time series or data stream scenarios.
Although complementary, the statistics and machine learning cultures have
some differences, summarized as follows:
• Model assumptions. Statistical models are based on strong assumptions like
Gaussianity, homoscedasticity, etc., which very often do not hold. These
assumptions are not necessary in machine learning algorithms.
• Model selection. The standard criterion for model comparison in statistics
is based on the (penalized or marginal) likelihood. Machine learning drives
the search for the best model according to more specific scores, e.g., the area
under the ROC (receiver operating characteristic) curve (see Section 2.4.1)
which focuses on the correct classification rate in supervised classification
problems. Searching approaches are also quite different: simple selection
methods, like forward selection, backward elimination or stepwise regression,
are popular in statistics, whereas a plethora of more sophisticated heuristic search strategies is used in machine learning.
FIGURE 2.1
Three examples of tasks solved by statistics and machine learning methods.
(a) Clustering. (b) Supervised classification. (c) Discovery of associations.
FIGURE 2.2
The six steps of a CRISP-DM process that transforms a dataset into valuable knowledge for a company: (1) business understanding, (2) data understanding, (3) data preparation, (4) modeling with machine learning, (5) evaluation and (6) deployment.
FIGURE 2.3
Machine learning techniques covered in this chapter.
At the end of this step, a decision on the use of the machine learning
model should be reached.
6. Deployment. If the decision of the previous step is positive, the
model should be implemented in the company. It is important for
the customer to understand the actions needed to make use of the
created models.
In 2015, IBM corporation extended CRISP-DM by releasing a new method-
ology called analytics solutions unified method for data mining/predictive
analytics (also known as ASUM-DM).
This chapter is organized as follows. Section 2.2 presents basic descrip-
tive and inferential statistical methods. Section 2.3 introduces the concept
of clustering, explaining different approaches such as hierarchical clustering,
partitional clustering, spectral clustering, affinity propagation and probabilistic
clustering. Section 2.4 focuses on supervised classification methods, illustrating
non-probabilistic classifiers –k-nearest neighbors, classification trees, rule induc-
tion, artificial neural networks and support vector machines–, and probabilistic
classifiers –logistic regression and Bayesian classifiers– as well as metaclassifiers.
Section 2.5 reviews Bayesian networks, which are solid probabilistic graphical models; their use in dynamic scenarios, a common feature in industry, is discussed in Section 2.6. Section 2.7 reviews some machine learning computational tools.
The chapter closes with Section 2.8 on open issues in machine learning.
data points tend to be very close to the mean, whereas high values indicate
that points spread out over a large range of values. The square of s is the
sample variance, $s^2$. The mean absolute deviation about the mean is defined as $\text{mad} = \frac{1}{N}\sum_{i=1}^{N}|x_i - \bar{x}|$. Since the median is more robust, a related measure is the median of the absolute deviations from the data median, that is, the median of the values $|x_i - Me|$, $i = 1, ..., N$. A dimensionless measure for eliminating the dependence on the measurement units of $s$ is the coefficient of variation (CV), the ratio of the standard deviation to the mean, often multiplied by 100 and only defined if $\bar{x} \neq 0$ as $\text{CV} = \frac{s}{\bar{x}} \cdot 100$. The higher the CV, the greater the dispersion in the
variable. The sample quartiles are the three points that divide an ordered
sample into four groups, each containing a quarter of the points. Thus, the
first or lower quartile, denoted Q1 , has the lowest 25% of the data to its left
and the highest 75% to its right. The third or upper quartile, denoted Q3 ,
has 75% of the data to its left and 25% to its right. The second quartile is
the median Me, with 50-50% on both sides. A sample with 10 divisions has
nine sample deciles. With 100 divisions, there are 99 sample percentiles.
Thus, the first quartile is the 25th percentile. Generally, in a sample quantile
of order k ∈ (0, 1), a proportion k of the data fall to its left and 1 − k to
its right. Since quantiles account for the tendency of data to be grouped
around a particular point, leaving a certain proportion of data to their left
and the rest to their right, they are measures of location but not of centrality.
Quantiles are also building blocks of another important dispersion measure:
the interquartile range, IQR= Q3 − Q1 , i.e., the difference between the
upper and lower quartiles. The range is the difference between the maximum
and minimum value and is also a dispersion measure.
Shape measures characterize the shape of a frequency distribution. They
are defined according to the r-th central moments (or moments about the
mean) of a data sample, $m_r = \frac{1}{N}\sum_{i=1}^{N}(x_i - \bar{x})^r$. Skewness measures the asymmetry of a frequency distribution and is defined as $g_1 = \frac{m_3}{m_2^{3/2}}$. A negative
value of g1 (left-skewed, left-tailed or skewed to the left) indicates that the left
tail of the distribution is longer or fatter than the right side and that the bulk
of the values lie to the right of the mean, that is, the mean is skewed to the
left of a typical central measure of the data. The distribution is usually plotted
as a right-leaning curve. A positive g1 value means the opposite. A zero value
corresponds with rather evenly distributed data on both sides of the mean,
usually implying a symmetric distribution. Another measure of the shape of the
distribution is kurtosis, indicating whether the data are peaked or flat relative
to a normal (Gaussian) distribution. It is applied to bell-shaped (unimodal
symmetric or slightly asymmetric) distributions. Kurtosis is dimensionless and
defined as $g_2 = \frac{m_4}{m_2^2} - 3$. Leptokurtic distributions ($g_2 > 0$) are more peaked
than normal, platykurtic distributions (g2 < 0) are less peaked than normal
and mesokurtic distributions (g2 = 0) have similar, or identical, kurtosis to a
normal distribution.
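A minimal Python sketch of these summary and shape measures, computed on a small synthetic sample (all values are hypothetical), could look as follows:

```python
import numpy as np
from scipy import stats

x = np.array([4.1, 4.5, 4.4, 4.8, 5.0, 5.1, 5.3, 5.6, 6.2, 9.7])  # hypothetical sample

mean = x.mean()
sd = x.std(ddof=0)                       # standard deviation s
cv = 100 * sd / mean                     # coefficient of variation, in %
q1, me, q3 = np.percentile(x, [25, 50, 75])
iqr = q3 - q1                            # interquartile range
g1 = stats.skew(x)                       # skewness g1 = m3 / m2^(3/2)
g2 = stats.kurtosis(x)                   # kurtosis g2 = m4 / m2^2 - 3

print(f"mean={mean:.2f}, s={sd:.2f}, CV={cv:.1f}%")
print(f"Q1={q1:.2f}, Me={me:.2f}, Q3={q3:.2f}, IQR={iqr:.2f}")
print(f"g1={g1:.2f}, g2={g2:.2f}")
```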
The box-and-whisker plot, or boxplot, is a very useful graph as it indicates
whether the data are symmetric or have outliers. The spread is shown via the
IQR, since a box is drawn with lines at Q1 and Q3 . Another line is marked
inside the box at the median. A “whisker” is drawn from Q1 to the smallest
data value greater than the lower fence, which is defined as Q1 − 1.5 IQR.
Similarly, another whisker is drawn from Q3 to the largest data value lower
than the upper fence, defined as Q3 + 1.5 IQR. Any points beyond the whiskers
are depicted by points and are, by convention, considered as outliers.
Fig. 2.4 shows examples of the above visualization methods for univariate
data.
FIGURE 2.4
Plots representing univariate data. (a) Pie charts and (b) barplots are suitable
for categorical and discrete data, (c) histograms for continuous data, and (d)
boxplots for numerical data.
FIGURE 2.5
Plots representing bivariate data. (a) Side-by-side bar plot. (b) Conditional
histogram. (c) Side-by-side boxplot. (d) Scatterplot.
Four major approaches for visualizing multivariate data are Chernoff faces,
parallel coordinates, principal component analysis and multidimensional scal-
ing. Chernoff faces (Chernoff, 1973) display a cartoon human face depicting
the size and shape of different facial features according to the variable val-
ues. The parallel coordinate plot (d’Ocagne, 1885) is a diagram including
parallel vertical equidistant lines (axes), each representing a variable. Then
each coordinate of each observation point is plotted along its respective axis
and the points are joined together with line segments. Principal component
analysis (PCA) (Jolliffe, 1986) describes the variation in a set of correlated
variables in terms of another set of uncorrelated variables, each of which is a
linear combination of the original variables. Usually a number of new variables
less than n will account for a substantial proportion of the variation in the
original variables. Thus, PCA is used for dimensionality reduction but also for
data compression, feature extraction and data visualization.
Multidimensional scaling (MDS) (Torgerson, 1952) is a visualization
technique that creates a map (that has fewer dimensions than the original data)
displaying the relative positions of the data. The map preserves, as closely as
possible, the pairwise distances between data points. The map may consist of
one, two, three, or even more dimensions.
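As an illustrative sketch (not taken from the original text), PCA and MDS projections can be obtained with scikit-learn; the dataset below is synthetic and all names are hypothetical:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import MDS

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))               # hypothetical data: 100 objects, 5 variables
X[:, 3] = 0.8 * X[:, 0] + 0.2 * X[:, 3]     # induce correlation among variables

pca = PCA(n_components=2)
scores = pca.fit_transform(X)               # coordinates on the first two principal components
print("explained variance ratio:", pca.explained_variance_ratio_)

mds = MDS(n_components=2, random_state=0)
map_2d = mds.fit_transform(X)               # 2D map approximately preserving pairwise distances
print(scores[:2], map_2d[:2], sep="\n")
```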
Fig. 2.6 illustrates visualization methods for multivariate data.
FIGURE 2.6
Multivariate data representation. Discrete or categorical variables are
permitted for multipanels. (a) Scatterplot matrix. (b) Multipanel 2D boxplot.
(c) Flat histogram. (d) Chernoff faces. (e) Parallel coordinates. (f) PCA.
2.2.2 Inference
In industry, it is not usually possible to access all the members of a given target
population. For example, it is impossible to access all the pieces produced in
a factory during a given year. Therefore, we must be content to analyze the
information on a smaller number of pieces. Based on the characteristics of this
sample of pieces, we can generalize the results to the annual production of the
entire factory. Thanks to this generalization, referred to in statistical jargon
as inference process, we can estimate parameters from a given probability
distribution representing the population, as well as test hypotheses about
the values of these parameters or even about the actual distributions. This
section introduces the basic concepts of parameter estimation (parameter point
estimation and parameter confidence intervals) and hypothesis testing.
There are different random selection methods. However, if standard pro-
cedures are followed, mathematical expressions can be used to quantify the
accuracy of the estimations. Cluster sampling is based on the idea that the
whole population can be clustered into smaller subpopulations called clusters.
Clusters are homogeneous and are treated as the sampling unit. Suppose that the factory has 1,000 machines playing the role of clusters; cluster sampling
can select 20 of these machines and inspect all the pieces manufactured by this
smaller number of machines. Stratified sampling is used when the target
population can be easily partitioned into subpopulations or strata. Strata are
then chosen to divide the population into non-overlapping and homogeneous
regions, where elements belonging to a given stratum are expected to be similar.
Stratified sampling assumes that the different strata are very heterogeneous.
Simple random samples are taken from each stratum. For example, if our
factory has three types of machines, each producing different pieces, stratified
sampling will select some pieces at random from each of these subpopulations.
In systematic sampling, we have a list of all the members of a given
population and we decide to select every k-th value in our sample. The initial
starting point is selected at random. The remaining values to be sampled are
then automatically determined. For example, suppose we have an ordered
list of the 100,000 pieces produced in a factory on a specified day and we
plan to use systematic sampling to select a sample of size 200. The procedure
is to choose an initial starting point at random between 1 and 500 (since 100,000/200 = 500). If the generated random number is 213, then the units in the sample of size 200 are numbered 213, 713 (213 + 500), 1213 (213 + 2 × 500),
..., and 99,713 (213 + 199 × 500).
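A short sketch of this systematic sampling procedure (illustrative only; the population and sample sizes follow the example above):

```python
import random

population = list(range(1, 100_001))    # ordered list of the 100,000 pieces
n_sample = 200
step = len(population) // n_sample      # 100,000 / 200 = 500

start = random.randint(1, step)         # random starting point between 1 and 500, e.g., 213
sample = population[start - 1::step]    # units start, start + 500, start + 1000, ...
assert len(sample) == n_sample
print(sample[:3], "...", sample[-1])
```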
denotes the probability of value 1, is the underlying distribution for the first
variable. Its probability mass function is $p(x|p) = p^x (1-p)^{1-x}$ for $x = 0, 1$, where $p$ is unknown and should be estimated from the sample. A Gaussian distribution, also called normal distribution, $X \sim \mathcal{N}(x|\mu, \sigma)$, or simply $\mathcal{N}(\mu, \sigma)$, is defined by the density function $f(x|\mu, \sigma) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}$ for $x, \mu \in \mathbb{R}$ and $\sigma \in \mathbb{R}^{+}$, and can model the density of the weight of the piece. In
this case, θ is a vector with two components, µ and σ, that should be estimated
from the sample.
The observed random sample of size N , that is, the values, x1 , x2 , ..., xN of
the N independent and identically distributed (i.i.d.) random variables
X1 , X2 , ..., XN , are combined into a function θ̂ = t(X1 , X2 , ..., XN ), known
as the estimator of θ, which is also a random variable. Its specific value,
called an estimate of θ, is known after taking a sample. The sample mean
$\hat{\theta} = \bar{X} = \frac{1}{N}\sum_{i=1}^{N} X_i$ is an estimator for $p$ and also for $\mu$, whereas the sample variance is an estimator for $\sigma^2$.
FIGURE 2.7
(a) Graphical representation of the concepts of bias and variance: low bias
and low variance (top left), low bias and high variance (top right), high bias
and low variance (bottom left) and high bias and high variance (bottom
right). (b) θ̂1 is an unbiased estimator of θ and θ̂2 is a biased estimator of θ.
However, θ̂2 has a smaller variance than θ̂1 .
α1 (θ1 , ..., θK ) = m1
α2 (θ1 , ..., θK ) = m2
...
αK (θ1 , ..., θK ) = mK .
α1 (µ, σ 2 ) = µ = X̄ = m1
$$\frac{\partial \ln L(p|\mathbf{x})}{\partial p} = \frac{\sum_{i=1}^{N} x_i}{p} - \frac{N - \sum_{i=1}^{N} x_i}{1-p} = 0,$$
and checking that its second-order partial derivative is negative. The MLE is
given by p̂(X1 , ..., XN ) = X̄.
To get the MLE for parameter θ = (µ, σ 2 ) of a normal density, we need
to compute the log-likelihood function of a sample x1 , ..., xN taken from a
N (µ, σ) as
$$\ln L(\mu, \sigma^2|\mathbf{x}) = -\frac{N}{2}\ln(2\pi) - \frac{N}{2}\ln(\sigma^2) - \frac{\sum_{i=1}^{N}(x_i-\mu)^2}{2\sigma^2}.$$
The MLE (µ̂, σ̂ 2 ) is the solution of the following system of equations
$$\frac{\partial \ln L(\mu,\sigma^2|\mathbf{x})}{\partial \mu} = \frac{\sum_{i=1}^{N}(x_i-\mu)}{\sigma^2} = 0, \qquad \frac{\partial \ln L(\mu,\sigma^2|\mathbf{x})}{\partial \sigma^2} = -\frac{N}{2\sigma^2} + \frac{\sum_{i=1}^{N}(x_i-\mu)^2}{2\sigma^4} = 0.$$
Solving the system easily yields
$$\hat{\mu}(X_1, ..., X_N) = \frac{\sum_{i=1}^{N} X_i}{N} = \bar{X}, \qquad \hat{\sigma}^2(X_1, ..., X_N) = \frac{\sum_{i=1}^{N}(X_i - \bar{X})^2}{N} = S_N^2.$$
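These closed-form MLEs can be checked numerically. The following sketch (synthetic data, illustrative only) maximizes the Gaussian log-likelihood with SciPy and compares the result with the sample mean and $S_N^2$:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)
x = rng.normal(loc=5.0, scale=2.0, size=500)    # synthetic sample from N(5, 2)

def neg_log_lik(params):
    mu, log_s2 = params                         # optimize log(sigma^2) to keep it positive
    s2 = np.exp(log_s2)
    n = len(x)
    return 0.5 * n * np.log(2 * np.pi * s2) + np.sum((x - mu) ** 2) / (2 * s2)

res = minimize(neg_log_lik, x0=[0.0, 0.0])
print(res.x[0], np.exp(res.x[1]))               # numerical MLE of mu and sigma^2
print(x.mean(), x.var(ddof=0))                  # closed-form MLE: sample mean and S_N^2
```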
Bayesian estimation considers θ to be a random variable with a known
prior distribution. With the observed sample this distribution is converted, via
the Bayes’ theorem, to a posterior distribution. Choosing a conjugate prior,
i.e., the prior and posterior belong to the same family of distributions, simplifies
calculation of the posterior distribution. Typical examples are the Dirichlet
(Frigyik et al., 2010) and Wishart (Wishart, 1928) distributions. Otherwise,
posteriors are often computed numerically or by Monte Carlo techniques.
The posterior distribution is used to perform inferences on θ. Thus, a
typical point Bayesian estimation is to choose the value of θ that maximizes the
posterior distribution (i.e., its mode). This is called maximum a posteriori
(MAP) estimation.
Bayesian estimation is used in Bayesian networks (Section 2.5), both for
finding the graph structure and for estimating its parameters. Small sample
sizes and data stream scenarios are also suitable for Bayesian estimation.
the null hypothesis and our observations of the phenomenon under study is
statistically significant. Statistical significance means that the differences are
due not to chance, but to a real difference between the phenomenon under
study and the assumptions of the null hypothesis. For example, differences may
be due to chance if they were generated by the observations of the phenomenon
(i.e., the differences would perhaps not have arisen using other samples).
Note that when we decide whether or not to reject H0 , we can make two
different errors:
TABLE 2.1
Contingency table for two discrete variables, X (rows) and Y (columns)

                        Y
            y1    ···   yj    ···   yJ     Marginal
 x1         N11   ···   N1j   ···   N1J    N1•
 ...        ...         ...         ...    ...
 xi         Ni1   ···   Nij   ···   NiJ    Ni•
 ...        ...         ...         ...    ...
 xI         NI1   ···   NIj   ···   NIJ    NI•
 Marginal   N•1   ···   N•j   ···   N•J    N
Table 2.1 contains the number of observations, $N_{ij}$, in the sample taking the $i$-th value in $X$ and, at the same time, the $j$-th value in $Y$. The total number in the $i$-th row ($1 \leq i \leq I$) is $N_{i\bullet} = \sum_{j=1}^{J} N_{ij}$, whereas the total number in the $j$-th column ($1 \leq j \leq J$) is $N_{\bullet j} = \sum_{i=1}^{I} N_{ij}$.
Estimations of $\theta_{i\bullet}$ and $\theta_{\bullet j}$ are given by $\hat{\theta}_{i\bullet} = \frac{N_{i\bullet}}{N}$ and $\hat{\theta}_{\bullet j} = \frac{N_{\bullet j}}{N}$, respectively. The test statistic
$$W = \sum_{i=1}^{I}\sum_{j=1}^{J} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}$$
is used to compare the observed number of cases, $O_{ij} = N_{ij}$, in the sample in each cell $(i, j)$ with the expected number under the null hypothesis, $E_{ij} = \frac{N_{i\bullet} N_{\bullet j}}{N}$. $W$ approximately follows a chi-squared density with $(I-1)(J-1)$ degrees of freedom. The null hypothesis of independence is rejected with a significance level $\alpha$ when the value of $W$ observed in the sample is greater than the quantile $\chi^2_{(I-1)(J-1);1-\alpha}$. The chi-squared approximation is usually satisfactory if the $E_{ij}$ are not too small. A conservative rule is to require all $E_{ij}$ to be five or more.
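As a hedged illustration (the contingency table below is made up), SciPy implements this test directly; chi2_contingency returns the statistic W, the p-value, the degrees of freedom and the expected counts:

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical 2x3 contingency table N_ij: machine type (rows) vs. piece quality (columns)
N = np.array([[30, 45, 25],
              [20, 60, 40]])

W, p_value, dof, expected = chi2_contingency(N, correction=False)
print(f"W = {W:.3f}, dof = {dof}, p-value = {p_value:.4f}")
print("expected counts E_ij:\n", expected)
# Reject the independence hypothesis at level alpha = 0.05 if p_value < 0.05
```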
TABLE 2.2
Blocks, treatments and ranked data in a randomized complete block design
                        Treatments
                1    ···    j    ···    k      Row totals
         1     r11   ···   r1j   ···   r1k     k(k+1)/2
         ...   ...         ...         ...     ...
Blocks   i     ri1   ···   rij   ···   rik     k(k+1)/2
         ...   ...         ...         ...     ...
         b     rb1   ···   rbj   ···   rbk     k(k+1)/2
Column totals  R1    ···   Rj    ···   Rk      bk(k+1)/2
Friedman test assumptions are that all sample populations are continuous
and identical, except possibly for location. The null hypothesis is that all
populations have the same location. Typically, the null hypothesis of no
difference among the k treatments is written in terms of the medians, ψi . Both
hypotheses, H0 and H1 , can be written as:
H0 : ψ1 = ψ2 = ... = ψk
H1 : ψi ≠ ψj for at least one pair (i, j).
The standardized test statistic S, defined as
$$S = \frac{12}{bk(k+1)} \sum_{j=1}^{k} R_j^2 - 3b(k+1) \qquad (2.1)$$
is used to evaluate the null hypothesis. Under the assumption that H0 is true,
S is well approximated by a χ2k−1 distribution. Given a fixed significance level
α, we reject H0 if the value of S observed in the sample is greater than the
quantile χ2k−1;1−α .
1 The name “block” comes from the earliest experimental designs in agriculture, where
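For instance (a sketch with made-up wear measurements), SciPy's friedmanchisquare computes the Friedman statistic of Eq. (2.1), up to tie corrections, from the raw observations, performing the within-block ranking internally:

```python
from scipy.stats import friedmanchisquare

# Hypothetical wear of k = 3 cutting tools (treatments) measured on b = 6 machines (blocks);
# each list holds one treatment across all blocks
tool_a = [8.2, 7.9, 8.5, 9.1, 7.7, 8.0]
tool_b = [7.4, 7.1, 7.9, 8.2, 7.0, 7.3]
tool_c = [8.9, 8.4, 9.0, 9.6, 8.1, 8.7]

S, p_value = friedmanchisquare(tool_a, tool_b, tool_c)
print(f"S = {S:.3f}, p-value = {p_value:.4f}")
# Reject H0 (equal medians) at significance level alpha if p_value < alpha
```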
2.3 Clustering
This section develops two different approaches to clustering: the non-
probabilistic clustering and the probabilistic clustering. The objective of both
approaches is to group or segment a set of objects or instances into subsets
or “clusters.” Similar objects should be within the same group, whereas very
dissimilar objects should be in different groups. For example, groups could be
objects corresponding to the same specific state (idle, constant speed, accelera-
tion/deceleration) of servomotors used to position the machine tool axis (see
Chapter 5). In non-probabilistic clustering including hierarchical clustering and
partitional clustering –such as the K-means algorithm, spectral clustering or
affinity propagation– each object belongs to only one cluster. In probabilistic
clustering, however, each object can be a member of several clusters at the same time, with a membership probability for each cluster.
Mathematically, the dataset D = {x1 , ..., xN } to be clustered contains N
objects, xi = (xi1 , ..., xin ), with i = 1, ..., N , each of which is characterized by n
variables, X = (X1 , ..., Xn ). Hierarchical clustering and K-means clustering
work with a dissimilarity matrix, which is the result of a transformation
of D. The dissimilarity matrix is an $N \times N$ matrix $\mathbf{D} \equiv \left(d(\mathbf{x}^i, \mathbf{x}^j)\right)_{i,j}$, where
d(xi , xj ) denotes the dissimilarity between the i-th and the j-th objects.
Standard dissimilarity measures d(xi , xj ) include the Minkowski dis-
tance for numerical features: $d_{\text{Minkowski}}(\mathbf{x}^i, \mathbf{x}^j) = \left(\sum_{h=1}^{n} |x_h^i - x_h^j|^g\right)^{1/g}$, with
g ≥ 1. The Euclidean distance and the Manhattan distance are special
cases of the Minkowski distance, when g = 2 and g = 1, respectively. For binary
variables, the dissimilarity between objects can be computed based on a contin-
gency table. For example, in symmetric binary variables, where both states are
equally valuable, the dissimilarity can be defined as $d_{\text{binary}}(\mathbf{x}^i, \mathbf{x}^j) = \frac{r+s}{q+r+s+t}$,
,
where q is the number of variables equal to 1 for both objects; t is the number
of variables equal to 0 for both objects; and r + s are the total number of
variables that are unequal for both objects.
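A brief sketch of these dissimilarities (illustrative vectors only), using SciPy's distance functions for the Minkowski family and a direct computation of the symmetric binary measure:

```python
import numpy as np
from scipy.spatial.distance import minkowski, euclidean, cityblock

x_i = np.array([1.0, 3.0, 2.5])
x_j = np.array([2.0, 1.0, 2.0])

print(minkowski(x_i, x_j, p=3))    # Minkowski distance with g = 3
print(euclidean(x_i, x_j))         # g = 2, Euclidean distance
print(cityblock(x_i, x_j))         # g = 1, Manhattan distance

# Symmetric binary dissimilarity (r + s) / (q + r + s + t)
b_i = np.array([1, 0, 1, 1, 0, 0])
b_j = np.array([1, 1, 0, 1, 0, 1])
q = np.sum((b_i == 1) & (b_j == 1))     # 1-1 matches
t = np.sum((b_i == 0) & (b_j == 0))     # 0-0 matches
r_plus_s = np.sum(b_i != b_j)           # mismatches
print(r_plus_s / (q + r_plus_s + t))
```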
Spectral clustering and the affinity propagation algorithm are based on a
similarity matrix, with elements s(xi , xj ) for i, j = 1, ..., N , denoting the
similarity between objects xi and xj . These similarities verify that s(xi , xj ) >
s(xi , xk ) iff xi is more similar to xj than to xk .
new cluster. The clustering results can be obtained by cutting the dendrogram
at different heights.
Agglomerative hierarchical clustering considers that there are initially
N singleton clusters, each of which is associated with one of the objects to be
grouped. At each stage of the algorithm, the most similar pair of clusters are
merged according to a previously defined linkage strategy. This merging process
is repeated until the whole set of objects belongs to one cluster (Fig. 2.8).
FIGURE 2.8
Example of agglomerative hierarchical clustering. (a) Seven points represented
in a two-dimensional space with three clusters Cl1 , Cl2 , Cl3 . (b) The
corresponding dendrogram. The clusters obtained after cutting the
dendrogram at the dotted line are the clusters in (a).
addition of the summed square distances to the centroid within cluster Cli
and cluster Clj :
$$\sum_{\mathbf{x}^k \in Cl_i \cup Cl_j} d\left(\mathbf{x}^k, \mathbf{c}^{ij}\right) - \left[\sum_{\mathbf{x}^i \in Cl_i} d\left(\mathbf{x}^i, \mathbf{c}^i\right) + \sum_{\mathbf{x}^j \in Cl_j} d\left(\mathbf{x}^j, \mathbf{c}^j\right)\right],$$
where d denotes the squared Euclidean distance and cij , ci and cj are the
centroids of clusters Cli ∪ Clj , Cli and Clj , respectively.
Average linkage, complete linkage and Ward’s method are used when the
clusters are expected to be more or less circular clouds.
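As an illustrative sketch (synthetic data, hypothetical parameters), SciPy builds the agglomerative hierarchy for several linkage strategies and cuts the resulting dendrogram into a chosen number of clusters:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(2)
# Three synthetic groups of objects in a two-dimensional space
X = np.vstack([rng.normal([0, 0], 0.3, (10, 2)),
               rng.normal([3, 3], 0.3, (10, 2)),
               rng.normal([0, 3], 0.3, (10, 2))])

Z = linkage(X, method="ward")                      # also: "single", "complete", "average"
labels = fcluster(Z, t=3, criterion="maxclust")    # cut the dendrogram into 3 clusters
print(labels)
# scipy.cluster.hierarchy.dendrogram(Z) would draw a dendrogram like Fig. 2.8(b)
```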
$$S(N, K) = \frac{1}{K!} \sum_{i=0}^{K} (-1)^{K-i} \binom{K}{i} i^N,$$
$$F_{K\text{-means}}(\{Cl_1, ..., Cl_K\}) = \sum_{k=1}^{K} \sum_{\mathbf{x}^i \in Cl_k} \|\mathbf{x}^i - \mathbf{c}^k\|_2^2, \qquad (2.2)$$
where K is the number of clusters, xi = (xi1 , ..., xin ) denotes the n components
of the i-th object in the original dataset, Clk refers to the k-th cluster, and
ck = (ck1 , ..., ckn ) is its corresponding centroid.
Algorithm 2.1 shows the main steps of the K-means algorithm. The K-
means algorithm starts with an initial partition of the dataset. After calculating
the centroids of these initial clusters, each dataset object is reallocated to the
cluster represented by its nearest centroid. Reallocation should reduce the
square-error criterion by taking into account the storage ordering of the objects.
Whenever the cluster membership of an object changes, the corresponding
cluster centroids and the square error should be recomputed. This process is
repeated until all object cluster memberships are unchanged.
Algorithm 2.1: K-means algorithm
1 Start with an initial partition of the dataset into K clusters and calculate their centroids
repeat
2     for i = 1 to N do
3         Reassign object xi to its closest cluster centroid;
4         Recalculate centroids for clusters;
      endfor
until Cluster membership is stabilized
FIGURE 2.9
Example of the evolution of Forgy’s method. Ten objects are initially
partitioned into three clusters and their corresponding centroids are then
computed (left). Reassignments of each object are done according to its
nearest centroid (middle). The centroids of the new three clusters are then
computed (right). The process ends since no object changes its cluster
membership.
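A minimal sketch of the iterations of Algorithm 2.1 (batch Lloyd/Forgy-style reassignments on synthetic data; not the book's own code) is given below:

```python
import numpy as np

def k_means(X, K, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), K, replace=False)]    # Forgy-style initialization
    for _ in range(n_iter):
        # Reassign each object to its closest cluster centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Recalculate centroids for the clusters
        new_centroids = np.array([X[labels == k].mean(axis=0) for k in range(K)])
        if np.allclose(new_centroids, centroids):          # membership has stabilized
            break
        centroids = new_centroids
    return labels, centroids

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.5, (20, 2)), rng.normal(4, 0.5, (20, 2))])
labels, centroids = k_means(X, K=2)
print(centroids)
```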
Once the selected operation is carried out and the unnormalized graph
Laplacian matrix L is calculated, the K eigenvectors corresponding to the
K smallest eigenvalues of L are output. These K vectors are organized into
a matrix with N rows and K columns. Each row of this new matrix can be
interpreted as the transformation of an original object xi , with i = 1, ..., N ,
into a space where the object grouping is easier than in the original space.
Although, in principle, these N transformed objects can be clustered using any
clustering method, standard spectral clustering uses the K-means algorithm.
an exemplar. Both matrices are initialized to all zeros. The algorithm then
performs the following updates iteratively:
1. Responsibility updates are sent around:
$$r(\mathbf{x}^i, \mathbf{x}^k) \leftarrow s(\mathbf{x}^i, \mathbf{x}^k) - \max_{k' \neq k} \left\{ a(\mathbf{x}^i, \mathbf{x}^{k'}) + s(\mathbf{x}^i, \mathbf{x}^{k'}) \right\}.$$
2. Availability updates are then computed as:
$$a(\mathbf{x}^i, \mathbf{x}^k) \leftarrow \min\left(0,\; r(\mathbf{x}^k, \mathbf{x}^k) + \sum_{i' \notin \{i,k\}} \max\left(0, r(\mathbf{x}^{i'}, \mathbf{x}^k)\right)\right) \quad \text{for } i \neq k,$$
$$a(\mathbf{x}^k, \mathbf{x}^k) \leftarrow \sum_{i' \neq k} \max\left(0, r(\mathbf{x}^{i'}, \mathbf{x}^k)\right).$$
The iterations continue until the stopping condition is met. For object
xi , the object xj that maximizes a(xi , xj ) + r(xi , xj ) identifies its exemplar.
A cluster contains all the objects that have the same exemplar, which is
considered as the cluster prototype.
FIGURE 2.10
Example of a finite mixture model with three components for fitting
servomotor power consumption density. The three components correspond to
idle (orange), acceleration/deceleration (blue) and constant speed (green)
states. The density of each component is assumed to follow a univariate
Gaussian.
possibly containing observed and missing data is not referred to here. D corresponds to the
concatenation of data X and Z.
$$L(\boldsymbol{\theta}; \mathbf{X}) = p(\mathbf{X}|\boldsymbol{\theta}) = \sum_{\mathbf{Z}} p(\mathbf{X}, \mathbf{Z}|\boldsymbol{\theta}).$$
Maximizing this function is hard and the EM algorithm tries to find the
maximum likelihood estimate by iteratively applying two steps:
E-step: Calculate the expected value of the log-likelihood function with
respect to the conditional distribution of Z given X under the current
estimate of the parameters $\boldsymbol{\theta}^{(t)}$. To do this, an auxiliary function $Q(\boldsymbol{\theta}|\boldsymbol{\theta}^{(t)})$ is defined as the expected complete-data log-likelihood, $Q(\boldsymbol{\theta}|\boldsymbol{\theta}^{(t)}) = E_{\mathbf{Z}|\mathbf{X},\boldsymbol{\theta}^{(t)}}\left[\ln p(\mathbf{X}, \mathbf{Z}|\boldsymbol{\theta})\right]$. In a multivariate Gaussian mixture model, the density of each data point is $f(\mathbf{x}; \boldsymbol{\theta}) = \sum_{k=1}^{K} \pi_k f_k(\mathbf{x}; \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k),$
where µk is the mean vector and Σk is the variance-covariance matrix for the
k-th component, a multivariate normal density given by
$$f_k(\mathbf{x}; \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k) = (2\pi)^{-\frac{n}{2}} |\boldsymbol{\Sigma}_k|^{-\frac{1}{2}} \exp\left(-\frac{1}{2}(\mathbf{x} - \boldsymbol{\mu}_k)^T \boldsymbol{\Sigma}_k^{-1} (\mathbf{x} - \boldsymbol{\mu}_k)\right).$$
The parameter vector θ = (π1 , ..., πK , µ1 , Σ1 , ..., µK , ΣK ) is composed of
the weights of the different clusters, πk , and of the parameters, θk = (µk , Σk ),
of each component of the mixture.
The missing information z = (z1 , ..., zN ) relates to the assignment (yes/no)
of each data point to each cluster. The auxiliary function of the expected
complete data log-likelihood is
$$Q(\boldsymbol{\theta}|\boldsymbol{\theta}^{(t)}) = \sum_{i=1}^{N}\sum_{k=1}^{K} r_{ik}^{(t)} \log \pi_k + \sum_{i=1}^{N}\sum_{k=1}^{K} r_{ik}^{(t)} \log f_k(\mathbf{x}^i; \boldsymbol{\theta}_k),$$
where $r_{ik}^{(t)} = p(Z_i = k|\mathbf{x}^i, \boldsymbol{\theta}^{(t)})$ is the responsibility that cluster $k$ takes for the $i$-th data point, computed in the E-step as
$$r_{ik}^{(t)} = \frac{\pi_k^{(t)} f_k(\mathbf{x}^i; \boldsymbol{\mu}_k^{(t)}, \boldsymbol{\Sigma}_k^{(t)})}{\sum_{r=1}^{K} \pi_r^{(t)} f_r(\mathbf{x}^i; \boldsymbol{\mu}_r^{(t)}, \boldsymbol{\Sigma}_r^{(t)})}.$$
In the M-step, maximization of $Q$ yields, for the mixture weights,
$$\pi_k^{(t+1)} = \frac{1}{N}\sum_{i=1}^{N} r_{ik}^{(t)}.$$
For $\boldsymbol{\mu}_k$ and $\boldsymbol{\Sigma}_k$,
$$\boldsymbol{\mu}_k^{(t+1)} = \frac{\sum_{i=1}^{N} r_{ik}^{(t)} \mathbf{x}^i}{\sum_{i=1}^{N} r_{ik}^{(t)}},$$
that is, the new mean is a weighted average of all data points, where the weights are the responsibilities of cluster $k$; finally, the variance-covariance matrix $\boldsymbol{\Sigma}_k^{(t+1)}$ is the responsibility-weighted covariance of the data points around $\boldsymbol{\mu}_k^{(t+1)}$.
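In practice, these EM updates are available off the shelf. A hedged sketch with scikit-learn's GaussianMixture on synthetic one-dimensional power-consumption readings (all values invented), in the spirit of Fig. 2.10:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(3)
# Synthetic readings from three regimes: idle, acceleration/deceleration, constant speed
x = np.concatenate([rng.normal(0.5, 0.1, 300),
                    rng.normal(2.0, 0.4, 200),
                    rng.normal(4.0, 0.2, 500)]).reshape(-1, 1)

gmm = GaussianMixture(n_components=3, covariance_type="full", random_state=0)
gmm.fit(x)                                   # runs the EM iterations described above
print("weights pi_k:", gmm.weights_)
print("means mu_k:", gmm.means_.ravel())
print("responsibilities r_ik of the first point:", gmm.predict_proba(x[:1]))
```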
$$\phi: \Omega_X \longrightarrow \Omega_C, \qquad \mathbf{x} \mapsto \phi(\mathbf{x}).$$
bility or the simplicity of the learned model, which are also important in practice.
              φ(x)
              +     -
  C     +    TP    FN
        -    FP    TN
take values in the interval [0, 1]. Values close to 1 are preferred for all the
measures, except for error rate. Values close to 0 are better for error rate.
The Brier score (Brier, 1950) is very popular for probabilistic classifiers.
The Brier score is based on a quadratic cost function and measures the mean
square difference (d, the Euclidean distance) between the predicted probability
assigned to the possible outcomes for each instance and its actual label. It is
defined as
$$\text{Brier}(\phi) = \frac{1}{N}\sum_{i=1}^{N} d^2\left(\mathbf{p}_\phi(c|\mathbf{x}^i), \mathbf{c}^i\right),$$
where pφ (c|xi ) is the vector (pφ (+|xi ), pφ (-|xi )) containing the output of the
probabilistic classifier, and ci = (1, 0) or ci = (0, 1) when the label of the
i-th instance is + or -, respectively. The Brier score for a binary classification
problem verifies 0 ≤ Brier(φ) ≤ 2, and values close to 0 are preferred. The
Brier score can be regarded as a measure of calibration of a set of probabilistic
predictions.
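A small sketch of the Brier score as defined above (hypothetical predictions), i.e., the mean squared Euclidean distance between the predicted probability vectors and the one-hot encoded true labels:

```python
import numpy as np

# Predicted probability vectors (p(+|x), p(-|x)) for four hypothetical instances
p = np.array([[0.9, 0.1],
              [0.6, 0.4],
              [0.2, 0.8],
              [0.7, 0.3]])
# One-hot encoded true labels: +, -, -, +
c = np.array([[1, 0],
              [0, 1],
              [0, 1],
              [1, 0]])

brier = np.mean(np.sum((p - c) ** 2, axis=1))   # 0 <= Brier <= 2, closer to 0 is better
print(brier)                                    # 0.25 for these values
```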
TABLE 2.5
Output of a probabilistic classifier, pφ (c|x), on a hypothetical dataset, with
ten cases, and two labels, +, and -
TABLE 2.6
Ten hypothetical instances used to generate the ROC curve shown in Fig. 2.11
x^i        1     2     3     4     5     6     7     8     9     10
p(+|x^i)   0.98  0.89  0.81  0.79  0.64  0.52  0.39  0.34  0.29  0.26
c^i        +     +     +     -     +     -     +     -     -     -
The area under the ROC curve (AUC) is a summary statistic for the
ROC curve. For any classifier φ, AUC(φ) ∈ [0, 1]. AUC(φ) = 1 in a perfect
classifier (FPR = 0, TPR = 1), whereas AUC(φrandom ) = 0.5 for a random
classifier. We expect AUC(φ) > 0.5 for a reasonable classifier. To compute the
AUC, a rank is assigned to the classifier output for each instance in the order
of decreasing outputs. Then, the AUC is computed as
$$\text{AUC}(\phi) = 1 - \frac{\sum_{i=1}^{N_+} (i - \text{rank}_i)}{N_+ N_-}, \qquad (2.3)$$
where ranki is the rank of the i-th case in the subset of positive labels given
by classifier φ, and N+ and N- denote the number of real positive and negative
cases in D, respectively.
FIGURE 2.11
ROC curve for Table 2.6 data, plotted with the ROCR R package (Sing et al.,
2005).
Example. AUC
The result of applying the above formula to the instances in Table 2.6 is:
$$\text{AUC}(\phi) = 1 - \frac{(1-1) + (2-2) + (3-3) + (5-4) + (7-5)}{5 \cdot 5} = 0.88.$$
This is the same result as illustrated in Fig. 2.11:
AUC(φ) = 0.20 · 0.60 + 0.20 · 0.80 + 0.60 · 1.00 = 0.88.
For a multiclass problem, the AUC can be obtained as the volume under
the ROC surface or, alternatively, as an average AUC of all possible ROC
curves obtained from all class pairs.
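The AUC of Table 2.6 can be reproduced in a few lines. The sketch below (illustrative, not from the book) applies Eq. (2.3) directly and cross-checks the result with scikit-learn's roc_auc_score:

```python
from sklearn.metrics import roc_auc_score

scores = [0.98, 0.89, 0.81, 0.79, 0.64, 0.52, 0.39, 0.34, 0.29, 0.26]  # Table 2.6 (sorted)
labels = [1, 1, 1, 0, 1, 0, 1, 0, 0, 0]                                # + = 1, - = 0

# Eq. (2.3): positions of the positive cases (1-based, decreasing output order)
# minus their ranks within the positive subset
pos_positions = [i + 1 for i, y in enumerate(labels) if y == 1]        # 1, 2, 3, 5, 7
n_pos, n_neg = sum(labels), len(labels) - sum(labels)
auc = 1 - sum(p - r for r, p in enumerate(pos_positions, start=1)) / (n_pos * n_neg)
print(auc)                                  # 0.88
print(roc_auc_score(labels, scores))        # 0.88
```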
Fig. 2.12, adapted from Japkowicz and Mohak (2011), shows the honest
performance estimation methods explained below. They are grouped into
multiple resampling methods, where D is sampled several times, and single
resampling methods, with only one sampling.
FIGURE 2.12
Honest performance estimation methods.
FIGURE 2.13
Hold-out estimation method.
FIGURE 2.14
Four-fold cross-validation method.
are chosen such that the class variable is approximately equally distributed
in all folds and similar to the original class distribution in D. Stratification is
appropriate for unbalanced datasets, where the class variable is far from being
uniformly distributed.
In repeated hold-out, the partition in the hold-out scheme is repeated
several times. The training and test cases are randomly assigned every time.
This resampling has the advantage that estimates are stable (low variance), as
a result of a large number of sampling repetitions. A drawback is that there is
no control over the number of times each case is used in the training or testing
datasets. Repeated k-fold cross-validation performs multiple rounds of
k-fold cross-validation using different partitions. The most popular version is
10 × 10 cross-validation (Bouckaert, 2003), which performs 10 repetitions of
10-fold cross-validation, reducing estimator variance.
Bootstrapping (Efron, 1979) generates estimations by sampling from the
empirical distribution of the observed data. It is implemented by using random
sampling with replacement from the original dataset D to produce several
(B) resamples of equal size N to the original dataset. Thus, all the datasets
$D_b^l$, with $l \in \{1, ..., B\}$, obtained are of size $N$. As the probability of selection is always the same for each of the $N$ cases (i.e., $1/N$), the probability of a case not being chosen after $N$ selections is $(1 - \frac{1}{N})^N \approx \frac{1}{e} \approx 0.368$. A classifier $\phi_b^l$ is induced from $D_b^l$. The $l$-th test set, $D_{b\text{-test}}^l$, with $l \in \{1, ..., B\}$, is then formed by all the cases from $D$ not present in $D_b^l$, that is, $D_{b\text{-test}}^l = D \setminus D_b^l$. The performance measure of $\phi_b^l$ is estimated. The average of all $B$ measures is known as the e0 bootstrap estimate.
The expected number of distinct instances in each of the $B$ datasets $D_b^l$ used for training $\phi_b^l$ is $0.632N$. Hence, the e0 bootstrap estimate may be
pessimistic. The .632 bootstrap estimate addresses this issue by combining
the e0 bootstrap and the resubstitution estimates with the weights 0.632 and
0.368, respectively (see Fig. 2.15). Bootstrap estimation is asymptotically (large
values of B) unbiased with low variance. It is recommended for use with small
datasets.
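A hedged sketch of the 0.632 bootstrap estimate (synthetic data; the k-NN base classifier is chosen only for illustration):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=80, n_features=5, random_state=0)  # synthetic dataset D
clf = KNeighborsClassifier(n_neighbors=3)
rng = np.random.default_rng(0)

B, N = 100, len(y)
e0_scores = []
for _ in range(B):
    idx = rng.integers(0, N, N)                   # bootstrap resample of size N, with replacement
    oob = np.setdiff1d(np.arange(N), idx)         # cases of D not present in the resample
    clf.fit(X[idx], y[idx])
    e0_scores.append(clf.score(X[oob], y[oob]))   # accuracy on the l-th test set

e0 = np.mean(e0_scores)                           # e0 bootstrap estimate
resub = clf.fit(X, y).score(X, y)                 # resubstitution estimate on the whole of D
print(0.632 * e0 + 0.368 * resub)                 # .632 bootstrap estimate
```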
Generally speaking, the honest performance estimation methods described
above should be adapted for industrial scenarios where data arrives sequen-
tially, as in time series or in data streams settings (see Section 2.6.1). These
adaptations should take into account the data arrival time, that is, none of
the instances in the testing dataset should arrive before any of the training
dataset instances.
FIGURE 2.15
0.632 bootstrap method. The final performance measure is a weighted sum of
the e0 bootstrap and the resubstitution estimates, weighted 0.632 and 0.368,
respectively.
the target concept. This comes at the expense of increasing the modeling task
complexity due to the (added) feature subset selection process, especially if n
is large.
A discrete feature Xi is said to be relevant for the class variable C if,
depending on one of its values, the probability distribution of the class variable
changes, i.e., if there exists some xi and c, for which p(Xi = xi ) > 0, such that
p(C = c|Xi = xi ) ≠ p(C = c). A feature is said to be redundant if it is highly
correlated with one or more of the other features. Irrelevant and redundant
variables affect C differently. While irrelevant variables are noisy and bias the
prediction, redundant variables provide no extra information about C.
Feature subset selection can be seen as a combinatorial optimization prob-
lem. The optimal subset S ∗ of features is sought from the set of predictor
variables X = {X1 , ..., Xn }, i.e., S ∗ ⊆ X . Optimal means with respect to an
objective score that, without loss of generality, should be maximized. The
search space cardinality is 2n , which is huge for large values of n. Heuristics are
an absolute necessity for moving intelligently in this huge space and provide
close to optimal solutions. Fig. 2.16 illustrates a toy example of a search space
whose cardinality is 16, with only four predictor variables.
The search is determined by four basic issues:
(a) Starting point. Forward selection begins with no features and successively
adds attributes. Backward elimination begins with all features and succes-
sively removes attributes. Another alternative is to begin somewhere in
the middle and move outwards from this point.
(b) Search organization. An exhaustive search is only feasible for a small
FIGURE 2.16
Each block represents one possible feature subset selection in this problem
with n = 4. The blue rectangles are variables included in the selected subset.
By deletion/inclusion of one feature, we move through the edges.
(c) Evaluation strategy. How feature subsets are evaluated is the largest differ-
entiating factor of feature selection algorithms for supervised classification.
The filter approach, the wrapper approach, embedded methods and hybrid
filter and wrapper approaches are different alternatives that are explained
below.
(d) Stopping criterion. A feature selector must some time stop searching
through the space of feature subsets. A criterion may be to stop when
none of the evaluated alternatives improves upon the merit of the current
feature subset. Alternatively, the search might continue for a fixed number
of iterations.
Filter feature subset selection considers intrinsic data properties to
assess the relevancy of a feature, or a subset of features. Filter methods act as
a screening step and are independent of any supervised classification algorithm.
A popular score in filtering approaches is the mutual information (and related
measures) between each predictor variable and the class variable.
The mutual information between two random variables is based on the
concept of Shannon’s entropy (Shannon, 1948), which quantifies the uncer-
tainty of predictions of the value of a random variable. For a discrete variable
with l possible values, the entropy is
$$H(X) = -\sum_{i=1}^{l} p(X = x_i) \log_2 p(X = x_i).$$
The mutual information between a feature $X$ and the class variable $C$, with $m$ possible labels, is then
$$I(X, C) = H(C) - H(C|X) = \sum_{i=1}^{l}\sum_{j=1}^{m} p(x_i, c_j) \log_2 \frac{p(x_i, c_j)}{p(x_i) p(c_j)}.$$
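A short univariate-filter sketch (synthetic discrete data; purely illustrative): the empirical mutual information of each feature with the class is computed and used to rank the features.

```python
import numpy as np
from sklearn.metrics import mutual_info_score

rng = np.random.default_rng(0)
C = rng.integers(0, 2, 500)                        # binary class variable
X1 = np.where(rng.random(500) < 0.8, C, 1 - C)     # relevant feature: noisy copy of C
X2 = rng.integers(0, 3, 500)                       # irrelevant feature

for name, X in [("X1", X1), ("X2", X2)]:
    mi = mutual_info_score(C, X)                   # I(X, C), in nats
    print(f"I({name}, C) = {mi / np.log(2):.3f} bits")
# A univariate filter keeps the top-ranked features according to I(X, C)
```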
FIGURE 2.17
Filtering approaches for feature subset selection. (a) Univariate filter:
X(1) , ...., X(n) are the original variables ordered according to a feature
relevancy score. The selected feature subset includes the top s variables. (b)
Multivariate filter: a 2n cardinality space is searched for the best subset of
features S ∗ as an optimization problem. Each subset is evaluated according to
a feature relevancy score f .
$$f(S) = \frac{\displaystyle\sum_{X_i \in S} r(X_i, C)}{\sqrt{k + (k-1) \displaystyle\sum_{X_i, X_j \in S} r(X_i, X_j)}},$$
FIGURE 2.18
Wrapper approach for feature subset selection. In this case, each candidate
feature subset Si ⊆ X = {X1 , ..., Xn } is evaluated according to the
(estimated) classification accuracy (Acc) of the classifier $\phi_{S_i}$ built from $S_i$
in the training set Dtraining . Any other performance measure could be used
instead of accuracy.
FIGURE 2.19
Example of a k-NN classifier for classifying the green instance in a binary
(square, diamond) classification problem.
In Fig. 2.19, the test instance (green circle) should be classified in either
the first class (magenta squares) or the second class (yellow diamonds). With
k = 3 neighbors, the instance is assigned to the second class because there
are two diamonds and only one square inside the inner circle. With k = 5
neighbors, it is assigned to the first class because there are three squares vs.
two diamonds inside the outer circle.
The k-NN algorithm does not have a training phase, nor does it induce a
model. Some of the advantages of the k-NN algorithm are that it can learn
complex decision boundaries, it is a local method, it uses few assumptions
about the data, and it can be easily adapted as an incremental algorithm,
especially suitable for data inputs like streams (very common in industrial
applications). The main disadvantages are its high storage requirements and its
low classification speed. In addition, the algorithm is sensitive to the selected
distance (needed to find the neighbors), the value of k, the existence of irrelevant
variables, and noisy data. Another disadvantage is that, as there is no model
specification, no new knowledge about the problem can be discovered.
Although most implementations of k-NN compute simple Euclidean dis-
tances, it has been demonstrated empirically that k-NN classification can be
greatly improved by learning an appropriate distance metric from the training
data. This is the so-called metric learning problem. The neighborhood
parameter k plays an important role in k-NN performance (see the example
in Fig. 2.19). An increment in k should increase the bias and reduce the
classification error variance. The optimum value of k depends on the specific
dataset. It is usually estimated according to the available training sample: the
misclassification rate is estimated using cross-validation methods for different
values of k, and the value with the best rate is chosen. The selection of relevant
prototypes is a promising solution for speeding up k-NN in large datasets.
These techniques lead to a representative training set that is smaller than the
original set and has a similar or even higher classification accuracy for new
incoming data. There are three standard categories of prototype selection
methods (García et al., 2012): condensation methods, edition methods, and
hybrid methods. Condensation methods –like, for example, the condensed
nearest neighbors algorithm (Hart, 1968)– aim to remove superfluous
instances (i.e., any that do not cause incorrect classifications). Edition meth-
ods (Wilson, 1972) are designed to remove noisy instances (i.e., any that do
not agree with the majority of their k-nearest neighbors) in order to increase
classifier accuracy. Finally, hybrid methods combine edition and condensation
strategies, for example, by first editing the training set to remove noise, and
then condensing the output of the edition to generate a smaller subset.
Several variants of the basic k-NN have been developed. The k-NN with
weighted neighbors weighs the contribution of each neighbor depending
on its distance to the query instance, i.e., larger weights are given to nearer
neighbors. Irrelevant variables can mislead k-NN, and k-NN with weighted
predictor variables addresses this problem by assigning to each predictor
variable a weight that is proportional to its relevancy (mutual information)
with respect to the class variable. The distance is thus weighted to determine
neighbors. In k-NN with average distance, the distances of the neighbors
to the query instance are averaged for each class label, and the label associated
with the minimum average distance is assigned to the query instance. k-NN
with rejection can leave an instance unclassified (and then be dealt with by another supervised classification algorithm) if certain guarantees, e.g., a minimum number of votes in the decision rule (much more than k/2 in a binary classification
problem), are not met.
Instance-based learning (IBL) (Aha et al., 1991) extends k-NN by
providing incremental learning, significantly reducing the storage requirements
and introducing a hypothesis test to detect noisy instances. The first algorithm
belonging to this family, IB1, includes the normalization of the predictor
variable ranges and the incremental processing of instances. Using incremental
processing, decision boundaries can change over time as new data arrive.
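As a hedged sketch of the behavior in Fig. 2.19 (coordinates invented), scikit-learn's k-NN classifier can give different predictions for k = 3 and k = 5 on the same query point:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical training set: class 0 ("squares") and class 1 ("diamonds")
X_train = np.array([[1.0, 1.0], [1.5, 2.0], [2.5, 1.0], [3.0, 3.5],
                    [2.0, 2.4], [2.3, 2.6]])
y_train = np.array([0, 0, 0, 0, 1, 1])
query = np.array([[2.2, 2.5]])

for k in (3, 5):
    knn = KNeighborsClassifier(n_neighbors=k)      # Euclidean distance by default
    knn.fit(X_train, y_train)
    print(f"k = {k}: predicted class {knn.predict(query)[0]}")
# weights="distance" would give the k-NN variant with weighted neighbors
```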
FIGURE 2.20
(a) Scatterplot of 14 cases in a classification problem with two predictor
variables and two classes: yes (red) and no (black); (b) A classification tree
model for this dataset.
and the stopping criteria. The most popular induction algorithms are C4.5 and
CART. They can be regarded as variations on a core algorithm called ID3
(Quinlan, 1986) that stands for iterative dichotomiser because the original
proposal used only binary variables. The nodes selected by ID3 maximize
the mutual information (see Section 2.4.2), called information gain in this
context, with the class variable C. After selecting the root node, a descendant
is created for each value of this variable, and the training instances are sorted
to the appropriate descendant node. The best variable at each point of the
tree is selected similarly, considering at this stage the as yet unused variables
in each path as candidate nodes. ID3 stops at a node when the tree correctly
classifies all its instances (all instances are of the same class) or when there
are no more variables to be used. This stopping criterion causes overfitting
problems, which have been traditionally tackled using pruning methods.
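A short sketch of the information gain (mutual information with the class) that ID3 uses to rank candidate splitting variables; it uses only numpy and the data values are illustrative:

# Minimal sketch of the information gain used by ID3 to rank candidate split
# variables: I(X, C) = H(C) - H(C | X). Pure numpy; the data are illustrative.
import numpy as np

def entropy(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(x, c):
    """Mutual information between a discrete predictor x and the class c."""
    gain = entropy(c)
    for value in np.unique(x):
        mask = (x == value)
        gain -= mask.mean() * entropy(c[mask])
    return gain

x = np.array(["a", "a", "b", "b", "b", "a"])
c = np.array([1, 1, 0, 0, 1, 1])
print(information_gain(x, c))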
In prepruning, a termination condition, usually given by a statistical
hypothesis test, determines when to stop growing some branches as the classifi-
cation tree is generated. In postpruning, the full-grown tree is then pruned by
replacing some subtrees with a leaf. Postpruning is more widely used, although
it is more computationally demanding. A simple postpruning procedure is
reduced error pruning (Quinlan, 1987). This bottom-up procedure replaces
a node with the most frequent class label for the training instances associated
with that node as long as this does not reduce tree accuracy. The subtree
rooted by the node is removed and converted into a leaf node. The procedure
continues until any further pruning would decrease accuracy. Accuracy is
estimated with a pruning set or test set.
C4.5 (Quinlan, 1993) is an evolution of ID3 that uses the gain ratio (see
Section 2.4.2) as a splitting criterion and can handle continuous variables
and missing values. C4.5 stops when there are fewer instances to be split
than a given threshold. The set of rules generated from the classification tree
is postpruned: antecedents are eliminated from a rule whenever accuracy
increases, and a rule is deleted if it is left with no antecedents. This prunes
subpaths rather than subtrees.
The classification and regression trees (CART) algorithm (Breiman
et al., 1984) builds binary trees. It implements many splitting criteria, mainly
the Gini index (univariate) and a linear combination of continuous predictor
variables (multivariate). The Gini index of diversity aims at minimizing the
impurity (not all labels are equal) of the training subsets output after branching
the classification tree. It can also be seen as a divergence measure between
the probability distributions of the C values. CART adopts cost-complexity
pruning and can consider misclassification costs. CART can also generate
regression trees, where a real number prediction of a continuous variable C is
found at the leaves.
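An illustrative sketch of the (univariate) Gini index used by CART to score a candidate binary split; threshold and data are made up:

# Illustrative sketch of the Gini index of diversity used by CART to score a
# binary split: the impurity 1 - sum_r p_r^2 of each child, weighted by its size.
import numpy as np

def gini(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def gini_split(x, c, threshold):
    """Weighted Gini impurity after splitting on a continuous predictor x <= threshold."""
    left, right = c[x <= threshold], c[x > threshold]
    n = len(c)
    return (len(left) / n) * gini(left) + (len(right) / n) * gini(right)

x = np.array([1.0, 1.2, 2.5, 3.1, 3.3, 4.0])
c = np.array(["no", "no", "yes", "yes", "no", "yes"])
print(gini_split(x, c, threshold=2.0))  # lower is purer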
Univariate splitting criteria, where the node is split according to the
value of a single variable, like the information gain, gain ratio or Gini index,
produce axis-parallel partitions of the feature space. However, multivariate
splitting criteria –like the linear combination of predictors in CART– result
in obliquely oriented hyperplanes. Fig. 2.21 illustrates both types of feature
space partitioning.
FIGURE 2.21
(a) Hyperrectangle partitioning of a classification tree in the feature space
(univariate splitting); (b) Polygonal partitioning produced by an oblique
classification tree (multivariate splitting); (c) An axis-parallel tree designed to
approximate the polygonal space partitioning of (b). Filling/colors refer to
different class labels.
else if then
add Rule to Ruleset
remove instances covered by Rule from (Pos, Neg)
endif
endwhile
The strategy used by IREP to build a rule is as follows. First, the positive
(Pos) and negative (Neg) instances are randomly partitioned into two subsets,
a growing set and a pruning set producing four disjoint subsets: GrowPos
and GrowNeg (positive and negative instances used for growing the rule,
respectively); PrunePos and PruneNeg (positive and negative instances used
for pruning the rule, respectively). Second, a rule is grown. GrowRule starts
empty and considers adding any literal of the form Xi = xi (if Xi is discrete),
or Xi < xi , Xi > xi (if Xi is continuous). GrowRule repeatedly adds the literal
that maximizes an information gain criterion known as the first-order inductive
learner (FOIL) gain, until the rule covers no negative instances
from the growing dataset. Let R denote a rule and R′ be a more specific rule
output from R after adding a literal. The FOIL criterion rewards literals that
increase the proportion of positive instances among those covered by the rule.
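A small sketch of a common formulation of the FOIL gain (the exact notation may differ): p and n denote the positive and negative growing-set instances covered by R, and p2 and n2 those covered by the specialization R′:

# A common formulation of the FOIL gain (sketch; counts are illustrative):
# the gain rewards literals that increase the rule's precision on positives.
import math

def foil_gain(p, n, p2, n2):
    if p2 == 0:
        return float("-inf")  # the specialized rule covers no positives
    return p2 * (math.log2(p2 / (p2 + n2)) - math.log2(p / (p + n)))

# Example: R covers 50 pos / 30 neg; adding a literal leaves 40 pos / 5 neg.
print(foil_gain(50, 30, 40, 5))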
AQR initially focuses on a class and forms the cover to serve as the
antecedent of the rule for that class label. AQR generates a conjunction of
literals, called a complex, and then removes the instances it covers from the
training dataset. This step is repeated until enough complexes have been found
to cover all the instances of the chosen class. The score used by AQR to trim
the antecedent during the generation of a complex is the maximization of the
positive instances covered, excluding the negative instances. The score used to
pick the best complex is the maximization of the positive instances covered.
The entire process is repeated for each class in turn.
The CN2 algorithm (Clark and Niblett, 1989) produces an ordered list
of IF-THEN rules. In each iteration, CN2 searches for a complex that covers
a large number of instances of a single class and a few other classes. When,
according to an evaluation function, the algorithm has found a good complex,
it removes the instances that it covers from the training dataset and adds the
corresponding rule to the end of the rule list. This process iterates until no
more satisfactory complexes can be found. At each stage of the search, CN2
retains a size-limited set of the best complexes found so far. Next, the system
considers specializations of this set, i.e., by either adding a new conjunctive
term or removing a disjunctive element. CN2 generates and evaluates all
possible specializations of each complex. The complex quality is heuristically
assessed with the entropy of the class variable, estimated from the instances
covered by this complex. Lower entropy is preferred.
A new instance is classified by following the rules in order (from first to last)
until we find a rule that the instance satisfies. This rule assigns its predicted
class to the instance. If no rules are satisfied, then the prediction is the most
frequent class in the training instances.
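A sketch of how an ordered IF-THEN rule list of this kind classifies a new instance; the rule and instance encodings (dicts mapping variables to required values) are illustrative:

# Sketch of classifying a new instance with an ordered rule list, as produced by
# CN2-style learners. Each antecedent is a dict of variable -> required value.

def classify(rule_list, default_label, instance):
    for antecedent, label in rule_list:          # rules are tried in order
        if all(instance.get(var) == val for var, val in antecedent.items()):
            return label                         # the first satisfied rule decides
    return default_label                         # fall back to the most frequent class

rules = [({"pressure": "high", "vibration": "yes"}, "failure"),
         ({"pressure": "high"}, "warning")]
print(classify(rules, "normal", {"pressure": "high", "vibration": "no"}))  # -> warning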
Here we focus on the most commonly used ANN model for supervised clas-
sification: the multilayer perceptron. The multilayer feedforward neural
network, also called multilayer perceptron (MLP) (Minsky and Papert,
1969), consists of a number of interconnected computing units called neurons,
nodes, or cells, which are organized in layers. Each neuron converts the received
inputs into processed outputs. The arcs linking these neurons have weights
representing the strength of the relationship between different nodes. Although
each neuron performs very simple computations, collectively an MLP is able
to efficiently and accurately implement a variety of (hard) tasks. MLPs are
suitable for predicting one or more response (output) variables (discrete or
continuous) simultaneously. Here we address standard supervised classification
problems with a single class variable.
Fig. 2.22 shows the architecture of a three-layer MLP for supervised classi-
fication. Neurons (represented by circles) are organized in three layers: input
layer (circles in yellow), hidden layer (violet), and output layer (red). The
neurons in the input layer correspond to the predictor variables, X1 , ..., Xn ,
whereas the output neuron represents the class variable, C. Neurons in the
hidden layer are connected to both input and output neurons, and do not
have a clear semantic meaning, although they are the key to learning the
relationship between the input variables and the output variable. A vector w
of weights represents the strength of the connecting links. The most commonly
used MLP is a fully connected network (any node of a layer is connected to all
nodes in the adjacent layers) and includes only one hidden layer.
FIGURE 2.22
Structure of a multilayer perceptron for supervised classification, with three
types of nodes (input, hidden and output) organized into layers. Weights w
connect the input layer to the hidden layer, and weights w′ connect the hidden
layer to the output layer.
Each hidden neuron transforms the inputs it receives into an output. This is a
two-step process. In the first step, the inputs, x = (x1 , x2 , x3 , ..., xn ), are
combined with the weights of the connecting links as a weighted sum, e.g.,
$\sum_{i=1}^{n} w_{i2} x_i = \mathbf{w}_2^T \mathbf{x}$ for the second hidden neuron.
In the second step, the hidden node transforms this into an output via a
transfer function, $f(\mathbf{w}_2^T \mathbf{x})$. Generally, the transfer function is a bounded non-
decreasing function. The sigmoid or logistic function, $f(r) = (1 + \exp(-r))^{-1}$,
is one of the most widely used transfer functions.
FIGURE 2.23
Transfer function in a hidden node of a multilayer perceptron.
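A minimal sketch of the two-step computation in a single hidden neuron (weighted sum followed by the sigmoid transfer function); the weights and inputs are illustrative values:

# Sketch of the computation performed by one hidden neuron: a weighted sum of
# the inputs followed by the sigmoid transfer function. Values are illustrative.
import numpy as np

def sigmoid(r):
    return 1.0 / (1.0 + np.exp(-r))

x = np.array([0.2, -1.3, 0.7])      # inputs x1, x2, x3
w2 = np.array([0.5, 0.1, -0.4])     # weights reaching the second hidden neuron

output = sigmoid(w2 @ x)            # f(w2^T x)
print(output)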
$$w_{ij}^{\text{new}} = w_{ij}^{\text{old}} - \eta \frac{\partial E}{\partial w_{ij}},$$
where $\frac{\partial E}{\partial w_{ij}}$ is the gradient of E with respect to $w_{ij}$ and $\eta$ is called the learning
rate and controls the size of the gradient descent step. The backpropagation
FIGURE 2.24
Multidimensional error space E(w, w0 ). The gradient, or steepest, descent
method starts with the initialization of weights at (w, w0 )^(0) . The goal is to
find the optimum point (w, w0 )* . Weights are updated according to
$\partial E / \partial w_{ij}$, the direction of the partial derivative of the error function with
respect to each weight.
algorithm is iteratively run until some stopping criterion is met. Two versions of
weight updating schemes are possible. In the batch mode, weights are updated
after all training instances are evaluated, while in the online mode, the weights
are updated after each instance evaluation. In general, each weight update
reduces the total error by only a small amount. Therefore, many passes of
all instances are often required to minimize the error until a previously fixed
small error value is achieved.
Several aspects should be considered when training ANNs. The most
important are: (a) weight values are initialized as random values near zero; (b)
overfitting is avoided using weight decay, that is, an explicit regularization
method that shrinks some of the weights towards zero; (c) input scaling can
have a big effect on the quality of the final solution, and it is preferable
for inputs to be standardized to mean zero and standard deviation one; (d)
the flexibility of the model for capturing data non-linearities depends on the
number of hidden neurons and layers, and, in general, it is better to have many
hidden units trained with weight decay or another regularization method; (e)
a multistart strategy (many different weight initializations) for minimizing the
non-convex E(w, w0 ) error function is often used.
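A sketch combining the practical recommendations above, assuming scikit-learn is available: inputs standardized, weight decay (the L2 penalty that scikit-learn calls alpha), and a small multistart over weight initializations; the dataset is a stand-in:

# Sketch of MLP training with standardized inputs, weight decay and multistart.
from sklearn.datasets import load_breast_cancer
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)   # stand-in dataset

best_score, best_model = -1.0, None
for seed in range(5):                         # multistart: different weight initializations
    model = make_pipeline(
        StandardScaler(),
        MLPClassifier(hidden_layer_sizes=(20,), alpha=1e-3,
                      max_iter=2000, random_state=seed))
    score = cross_val_score(model, X, y, cv=5).mean()
    if score > best_score:
        best_score, best_model = score, model
print(best_score)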
Recently deep neural networks (Schmidhuber, 2015), defined as ANNs
with multiple hidden layers, have attracted the attention of many researchers,
since their learning process relates to a class of brain development theories
FIGURE 2.25
(a) Many possible separating lines of two linearly separable classes; (b) Linear
SVM classifier maximizing the margin around the separating hyperplane; (c)
Hyperplane wT x + b = 0 for linearly separable data. Its margin is d1 + d2 .
The support vectors have double lines.
The hyperplane H that separates the positive from the negative instances
is described by wT x + b = 0, where vector w is normal (perpendicular) to
the hyperplane, |b|/||w|| is the perpendicular distance from the hyperplane
to the origin and ||w|| is the Euclidean norm (length) of w (Fig. 2.25(c)).
Points above (below) H should be labeled +1 (-1), that is, the decision rule is
φ(x) = sign(wT x + b), and w and b must be found.
Assume that the data satisfy the constraints $\mathbf{w}^T \mathbf{x}^i + b \geq 1$ for $c^i = +1$ and
$\mathbf{w}^T \mathbf{x}^i + b \leq -1$ for $c^i = -1$, which can be combined into
$$c^i(\mathbf{w}^T \mathbf{x}^i + b) \geq 1, \quad \forall i = 1, ..., N. \qquad (2.4)$$
The points for which the equality in Eq. 2.4 holds are the points that lie closest
to H (depicted by double lines in Fig. 2.25(c)). These points are the support
vectors, the most difficult points to classify. Points that lie on the support
hyperplane wT x + b = −1 have distance d1 = 1/||w|| to H, and points that
lie on wT x + b = 1 have distance d2 = 1/||w|| to H. H must be as far from
these points as possible. Therefore, the margin, 2/||w||, should be maximized.
The linear SVM then finds w and b satisfying
$$\max_{\mathbf{w}, b} \frac{2}{\|\mathbf{w}\|} \quad \text{subject to} \quad 1 - c^i(\mathbf{w}^T \mathbf{x}^i + b) \leq 0, \; \forall i = 1, ..., N.$$
This constrained optimization problem is solved by introducing one Lagrange
multiplier $\lambda_i \geq 0$ per constraint, yielding
$$\mathbf{w} = \sum_{i \in S} \lambda_i c^i \mathbf{x}^i, \qquad (2.5)$$
where S denotes the set of indices of the support vectors (for which $\lambda_i > 0$).
Finally, offset b is calculated as
$$b = \frac{1}{|S|} \sum_{s \in S} \Big( c^s - \sum_{i \in S} \lambda_i c^i (\mathbf{x}^i)^T \mathbf{x}^s \Big). \qquad (2.6)$$
Note that the support vectors completely determine the SVM classifier. The
other data points can be disregarded when the learning phase is over. Since
there are not usually too many support vectors, classification decisions are
reached reasonably quickly.
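A sketch of a linear SVM on a linearly separable toy problem, assuming scikit-learn is available; the data are made up, and a large penalty value approximates the hard-margin formulation above:

# Sketch: training a linear SVM and inspecting the support vectors.
import numpy as np
from sklearn.svm import SVC

X = np.array([[1.0, 1.0], [2.0, 1.5], [1.5, 2.0],    # class -1
              [4.0, 4.0], [5.0, 4.5], [4.5, 5.0]])   # class +1
c = np.array([-1, -1, -1, 1, 1, 1])

svm = SVC(kernel="linear", C=1e6)    # large C approximates the hard-margin SVM
svm.fit(X, c)

print(svm.support_vectors_)          # the points that determine the hyperplane
print(svm.coef_, svm.intercept_)     # w and b
print(svm.predict([[3.0, 3.2]]))     # sign(w^T x + b)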
For non-linearly separable data, e.g., outliers, noisy or slightly non-linear
data, the constraints of Eq. 2.4 can, provided that we are still looking for a
linear decision boundary, be relaxed by introducing non-negative slack variables $\xi_i$:
$$c^i(\mathbf{w}^T \mathbf{x}^i + b) \geq 1 - \xi_i, \quad \xi_i \geq 0, \; \forall i = 1, ..., N.$$
FIGURE 2.26
(a) Example with two non-linearly separable classes. (b) Both classes become
linearly separable after the data are mapped to a higher-dimensional space.
where $b = \frac{1}{|S|} \sum_{s \in S} \big( c^s - \sum_{i \in S} \lambda_i c^i K(\mathbf{x}^i, \mathbf{x}^s) \big)$, and S are the indices i of the support vectors.
TABLE 2.7
Typical kernel functions
Polynomial kernels are appropriate when the training data are normalized.
The Gaussian kernel is an example of a radial basis function (RBF) kernel. σ
must be carefully tuned: when σ is decreased, the curvature of the decision
boundary increases (the decision boundary is very sensitive to noise) and
overfitting may occur. With a high σ, the exponential will behave almost
linearly, and the higher-dimensional projection will start to lose its non-linear
power. The exponential kernel is closely related to the Gaussian kernel, merely
omitting the square of the norm. It is also an RBF kernel. The sigmoid (or
hyperbolic tangent) kernel is equivalent to a two-layer perceptron ANN. A
common choice for a here is 1/N .
An appropriate selection of M and the kernel is a key issue for achieving
good performance. The user often selects both using a grid search with ex-
ponentially growing sequences. A validation dataset serves to estimate the
accuracy for each point on the grid. For a user’s guide to SVM, see Ben-Hur
and Weston (2010).
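A sketch of the usual grid search over exponentially growing sequences, assuming scikit-learn is available (scikit-learn names the penalty parameter C and the Gaussian kernel parameter gamma); accuracy is estimated by cross-validation and the dataset is a stand-in:

# Sketch: grid search over exponentially growing hyperparameter sequences.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)   # stand-in dataset
grid = {"C": 2.0 ** np.arange(-5, 16, 2),
        "gamma": 2.0 ** np.arange(-15, 4, 2)}

search = GridSearchCV(SVC(kernel="rbf"), grid, cv=5)
search.fit(X, y)
print(search.best_params_, search.best_score_)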
The multiclass SVM extends the binary SVM to class variables with
more than two categories. The most used option is to combine many binary
SVMs (Hsu and Lin, 2002). For instance, we can train an SVM on each pair of
labels. A new instance is then classified by voting, i.e., by selecting the label
most frequently predicted by these binary SVMs.
$$\text{logit}\,(p(C = 1|\mathbf{x}, \boldsymbol\beta)) = \ln \frac{p(C = 1|\mathbf{x}, \boldsymbol\beta)}{1 - p(C = 1|\mathbf{x}, \boldsymbol\beta)} = \beta_0 + \beta_1 x_1 + \cdots + \beta_n x_n$$
and makes it easier to interpret. Let x and x′ be vectors such that $x_l = x'_l$ for
all $l \neq j$ and $x'_j = x_j + 1$; then $\text{logit}\, p(C = 1|\mathbf{x}', \boldsymbol\beta) - \text{logit}\, p(C = 1|\mathbf{x}, \boldsymbol\beta) = \beta_j$.
4 The term logit stands for logistic probability unit.
The maximum likelihood estimates of β are found by equating to zero the
partial derivatives of the log-likelihood:
$$\frac{\partial \ln L(\boldsymbol\beta)}{\partial \beta_0} = \sum_{i=1}^{N} c^i - \sum_{i=1}^{N} \frac{e^{\beta_0 + \beta_1 x_1^i + \cdots + \beta_n x_n^i}}{1 + e^{\beta_0 + \beta_1 x_1^i + \cdots + \beta_n x_n^i}} = 0$$
$$\frac{\partial \ln L(\boldsymbol\beta)}{\partial \beta_j} = \sum_{i=1}^{N} c^i x_j^i - \sum_{i=1}^{N} x_j^i \frac{e^{\beta_0 + \beta_1 x_1^i + \cdots + \beta_n x_n^i}}{1 + e^{\beta_0 + \beta_1 x_1^i + \cdots + \beta_n x_n^i}} = 0, \quad j = 1, ..., n.$$
These equations have no closed-form solution and are solved iteratively, e.g.,
with the Newton-Raphson method:
$$\hat{\boldsymbol\beta}^{\text{new}} = \hat{\boldsymbol\beta}^{\text{old}} - \left( \frac{\partial^2 \ln L(\boldsymbol\beta)}{\partial \boldsymbol\beta \, \partial \boldsymbol\beta^T} \right)^{-1} \frac{\partial \ln L(\boldsymbol\beta)}{\partial \boldsymbol\beta},$$
where the derivatives are evaluated at βbold . The formula is initialized arbitrarily,
e.g., βbold = 0. Its choice is not relevant. The procedure is stopped when there
is a negligible change between successive parameter estimates or after running
a specified maximum number of iterations.
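A compact sketch of the Newton-Raphson iteration in plain numpy; the design matrix, stopping tolerance and simulated data are illustrative choices:

# Sketch of Newton-Raphson for logistic regression. X includes a leading column
# of ones for beta_0; c holds 0/1 class labels.
import numpy as np

def newton_logistic(X, c, max_iter=25, tol=1e-8):
    beta = np.zeros(X.shape[1])                        # arbitrary initialization
    for _ in range(max_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))            # p(C=1 | x, beta)
        gradient = X.T @ (c - p)                       # d lnL / d beta
        hessian = -X.T @ (X * (p * (1 - p))[:, None])  # d^2 lnL / d beta d beta^T
        step = np.linalg.solve(hessian, gradient)
        beta = beta - step                             # beta_new = beta_old - H^{-1} g
        if np.max(np.abs(step)) < tol:                 # negligible change: stop
            break
    return beta

rng = np.random.default_rng(0)
X = np.column_stack([np.ones(200), rng.normal(size=(200, 2))])
c = (rng.random(200) < 1 / (1 + np.exp(-(0.5 + X[:, 1] - 2 * X[:, 2])))).astype(float)
print(newton_logistic(X, c))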
As with linear regression, multicollinearity among predictor variables must
be removed, since it produces unstable βj estimates. Again as in linear re-
gression, the statistical significance of each variable can be assessed based on
hypothesis tests on the βj coefficients. Testing the null hypothesis H0 : βr = 0
against the alternative hypothesis H1 : βr ≠ 0 amounts to testing the elimina-
tion of Xr , a way of performing feature selection. Two nested models are used,
i.e., all terms of the simpler model occur in the complex model. Most standard
approaches are sequential: forward or backward. In a backward elimination
process, we can test the hypothesis that a simpler model M0 holds against
a more complex alternative M1 , where M0 contains the same terms as M1 ,
The deviance of a model M is defined as
$$D_M = -2 \sum_{i=1}^{N} \left[ c^i \ln\left(\frac{\hat{\theta}_i}{c^i}\right) + (1 - c^i) \ln\left(\frac{1 - \hat{\theta}_i}{1 - c^i}\right) \right],$$
where $\hat{\theta}_i = p(C = c^i | \mathbf{x}^i, \hat{\boldsymbol\beta})$. Note that the first (second) term is considered
zero when ci = 0 (ci = 1). The statistic for testing that M0 holds against M1
is $D_{M_0} - D_{M_1}$, which behaves like an approximate chi-squared statistic $\chi^2_1$. If
H0 is rejected, then we select the complex model (M1 ) over the simpler model
(M0 ). In general, several terms can likewise be eliminated from M1 to yield
M0 , although the degrees of freedom of the chi-squared distribution are equal
to the number of additional parameters that are in M1 but not in M0 (Agresti,
2013). A forward inclusion process works similarly, albeit starting from the
null model and including one variable at a time.
Regularization (Section 2.4.2) can also be used for modeling purposes
in logistic regression (Shi et al., 2010), especially when $N \ll n$ (i.e., the
so-called “large n, small N” problem). $L_1$-regularization, known as lasso, is
designed to solve $\max_{\boldsymbol\beta} \big( \ln L(\boldsymbol\beta) - \lambda \sum_{j=1}^{n} |\beta_j| \big)$, where $\lambda \geq 0$ is the penalization
parameter that controls the amount of shrinkage (the larger the λ, the greater
the shrinkage and the smaller the βj s). The solution includes coefficients that
are exactly zero, thus performing feature subset selection.
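A sketch of lasso logistic regression, assuming scikit-learn is available; note that scikit-learn parameterizes the penalty as C = 1/λ, so smaller C means stronger shrinkage; the dataset and penalty value are stand-ins:

# Sketch: L1-regularized (lasso) logistic regression performing feature selection.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)       # stand-in dataset
X = StandardScaler().fit_transform(X)

lasso = LogisticRegression(penalty="l1", solver="liblinear", C=0.05)
lasso.fit(X, y)

selected = np.flatnonzero(lasso.coef_[0])        # coefficients not shrunk to zero
print(f"{selected.size} of {X.shape[1]} predictors kept:", selected)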
FIGURE 2.27
Taxonomy of discrete Bayesian network classifiers according to three different
factorizations of p(x, c). The group to the left contains the augmented naive
Bayes models.
This is called the Lidstone rule. Special cases are Laplace estimation
(see Section 2.2.2.1) and the Schurmann-Grassberger rule, with α = 1 and
α = 1/Ri , respectively.
Naive Bayes (Minsky, 1961) is the simplest Bayesian network classifier,
where all predictive variables are assumed to be conditionally independent
given the class. When n is high and/or N is small, p(x|c) is difficult to estimate
and this strong assumption is useful. The conditional probabilities for each c
given x are computed as
$$p(c|\mathbf{x}) \propto p(c) \prod_{i=1}^{n} p(x_i|c).$$
Fig. 2.28 shows an example of naive Bayes with six predictor variables.
FIGURE 2.28
Naive Bayes: p(c|x) ∝ p(c)p(x1 |c)p(x2 |c)p(x3 |c)p(x4 |c)p(x5 |c)p(x6 |c).
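A sketch of the naive Bayes decision rule for discrete predictors, estimating p(c) and p(xi|c) from frequency counts with Laplace (α = 1) smoothing; the data, variable names and pandas usage are illustrative:

# Sketch: naive Bayes posterior p(c|x) ∝ p(c) ∏ p(xi|c) with Laplace smoothing.
import numpy as np
import pandas as pd

data = pd.DataFrame({"X1": ["a", "a", "b", "b", "a", "b"],
                     "X2": ["u", "v", "u", "v", "v", "u"],
                     "C":  [ 1 ,  1 ,  0 ,  0 ,  1 ,  0 ]})

def posterior(instance, data, alpha=1.0):
    scores = {}
    for c, group in data.groupby("C"):
        score = len(group) / len(data)                        # p(c)
        for var, value in instance.items():                   # product of p(xi | c)
            counts = group[var].value_counts()
            r = data[var].nunique()
            score *= (counts.get(value, 0) + alpha) / (len(group) + alpha * r)
        scores[c] = score
    total = sum(scores.values())
    return {c: s / total for c, s in scores.items()}          # normalize

print(posterior({"X1": "a", "X2": "u"}, data))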
Naive Bayes will improve its performance if only relevant, and especially
non-redundant, variables are selected to be in the model. In the selective
naive Bayes, probabilities are
$$p(c|\mathbf{x}) \propto p(c|\mathbf{x}_F) = p(c) \prod_{i \in F} p(x_i|c),$$
where F ⊆ {1, 2, ..., n} denotes the indices of the selected features. Filter
(Pazzani and Billsus, 1997), wrapper (Langley and Sage, 1994) and hybrid
approaches (Inza et al., 2004) have been used for selective naive Bayes models.
The semi-naive Bayes model (Fig. 2.29) relaxes the conditional inde-
pendence assumption of naive Bayes trying to model dependences between
the predictor variables. To do this, it introduces new features that are the
Cartesian product of two or more original predictor variables. These new
predictor variables are still conditionally independent given the class variable.
Thus,
$$p(c|\mathbf{x}) \propto p(c) \prod_{j=1}^{K} p(\mathbf{x}_{S_j}|c),$$
where Sj ⊆ {1, 2, ..., n} denotes the indices in the j-th feature (original or
Cartesian product), j = 1, ..., K (K is the number of nodes), Sj ∩ Sl = ∅, for
j ≠ l.
FIGURE 2.29
Semi-naive Bayes: p(c|x) ∝ p(c)p(x1 , x3 |c)p(x5 , x6 |c).
The objective function driving the standard algorithm for learning a semi-
naive Bayes model (Pazzani, 1996) is classification accuracy. The forward
sequential selection and joining algorithm starts from an empty structure.
The accuracy is computed after assigning the most frequent label to all instances.
Then the algorithm chooses the best option (in terms of classification accuracy)
between (a) adding a variable not yet used as conditionally independent of the
already included features (original or Cartesian products), and (b) joining a
variable not yet used with each feature (original or Cartesian products) present
in the classifier. The algorithm stops when accuracy does not improve.
In one-dependence estimators (ODEs), each predictor variable de-
pends on at most one other predictor variable apart from the class variable.
Tree-augmented naive Bayes and superparent-one-dependence estimators are
two types of ODEs.
The predictor variables of the tree-augmented naive Bayes (TAN)
form a tree. Thus, all have one parent, except for one variable, called the root,
which is parentless (Fig. 2.30). Then
$$p(c|\mathbf{x}) \propto p(c) p(x_r|c) \prod_{i=1, i \neq r}^{n} p(x_i|c, x_{j(i)}),$$
where Xr denotes the root node and {Xj(i) } = Pa(Xi ) \ C, for any i ≠ r.
The mutual information of any pair of predictor variables conditioned on C,
I(Xi , Xj |C), is first computed to learn a TAN structure (Friedman et al., 1997).
This measures the information that one variable provides about the other
variable when the value of C is known. Second, a complete undirected graph
with nodes X1 , ..., Xn is built. The edge between Xi and Xj is annotated with
a weight equal to the above mutual information of Xi and Xj given C. Third,
Kruskal’s algorithm (Kruskal, 1956) is used to find a maximum weighted
spanning tree in that graph, containing n − 1 edges. This algorithm selects a
subset of edges from the graph such that they form a tree and the sum of their
weights is maximized. This is performed by sequentially choosing the edge
with the heaviest weight, provided this does not yield a cycle. We then select
FIGURE 2.30
(a) TAN with X3 as root node:
p(c|x) ∝ p(c)p(x1 |c, x2 )p(x2 |c, x3 )p(x3 |c)p(x4 |c, x3 )p(x5 |c, x4 )p(x6 |c, x5 ). (b)
Selective TAN: p(c|x) ∝ p(c)p(x2 |c, x3 )p(x3 |c)p(x4 |c, x3 ).
any variable as the root node and set the direction of all edges as outgoing
from this node to make the undirected tree directed. This tree including only
predictor variables is added to a naive Bayes structure to produce the final
TAN structure.
If the weights are first filtered with a χ2 test of independence, the resulting
classifier is the selective TAN (Blanco et al., 2005) (Fig. 2.30(b)). This may
yield a forest (i.e., a disjoint union of trees) rather than a tree because there
are many root nodes.
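A sketch of the tree-construction step described above: each pair of predictors is weighted by the conditional mutual information I(Xi, Xj|C) estimated from counts, and a maximum weighted spanning tree is extracted (networkx, which implements a Kruskal-based routine, is assumed; the toy data are illustrative):

# Sketch: TAN predictor-tree construction from conditional mutual information.
import numpy as np
import pandas as pd
import networkx as nx

def cond_mutual_info(data, xi, xj, c="C"):
    cmi, n = 0.0, len(data)
    for _, d in data.groupby(c):                 # sum over the values of C
        p_c = len(d) / n
        joint = pd.crosstab(d[xi], d[xj]) / len(d)
        pi = d[xi].value_counts(normalize=True)
        pj = d[xj].value_counts(normalize=True)
        for a in joint.index:
            for b in joint.columns:
                if joint.loc[a, b] > 0:
                    cmi += p_c * joint.loc[a, b] * np.log(joint.loc[a, b] / (pi[a] * pj[b]))
    return cmi

data = pd.DataFrame({"X1": ["a", "a", "b", "b", "a", "b"],
                     "X2": ["u", "v", "u", "v", "v", "u"],
                     "X3": ["s", "s", "t", "t", "s", "t"],
                     "C":  [ 1 ,  1 ,  0 ,  0 ,  1 ,  0 ]})
predictors = ["X1", "X2", "X3"]

G = nx.Graph()
for i, xi in enumerate(predictors):
    for xj in predictors[i + 1:]:
        G.add_edge(xi, xj, weight=cond_mutual_info(data, xi, xj))

tree = nx.maximum_spanning_tree(G)               # edges of the TAN predictor tree
print(sorted(tree.edges(data="weight")))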
Superparent-one-dependence estimators (Keogh and Pazzani, 2002)
(SPODEs) are an ODE where all predictors depend on the same predictor
called the superparent as well as the class. Note that this is a particular case
of a TAN model. Classification is given by
$$p(c|\mathbf{x}) \propto p(c) p(x_{sp}|c) \prod_{i=1, i \neq sp}^{n} p(x_i|c, x_{sp}),$$
where $X_{sp}$ denotes the superparent variable. Averaging the predictions of
several SPODEs, one for each candidate superparent $X_{sp}$ in a set $SP_{\mathbf{x}}^{m}$, yields
$$p(c|\mathbf{x}) \propto p(c, \mathbf{x}) = \frac{1}{|SP_{\mathbf{x}}^{m}|} \sum_{X_{sp} \in SP_{\mathbf{x}}^{m}} p(c) p(x_{sp}|c) \prod_{i=1, i \neq sp}^{n} p(x_i|c, x_{sp}). \qquad (2.9)$$
FIGURE 2.31
3-DB structure: p(c|x) ∝
p(c)p(x1 |c)p(x2 |c, x1 )p(x3 |c, x1 , x2 )p(x4 |c, x1 , x2 , x3 )p(x5 |c, x1 , x3 , x4 )
p(x6 |c, x3 , x4 , x5 ).
Xi enters the model according to the value of I(Xi , C), starting with the
highest. When Xi enters the model, its parents are selected by choosing the
k variables Xj that are already in the model and have the highest values of
I(Xi , Xj |C).
In the Bayesian network-augmented naive Bayes (BAN) (Ezawa
and Norton, 1996), the predictor variables can form any Bayesian network
structure (Fig. 2.32). Probabilities are now given by
$$p(c|\mathbf{x}) \propto p(c) \prod_{i=1}^{n} p(x_i|\mathbf{pa}(x_i)).$$
FIGURE 2.32
Bayesian network-augmented naive Bayes:
p(c|x) ∝ p(c)p(x1 |c)p(x2 |c)p(x3 |c)p(x4 |c, x1 , x2 , x3 )p(x5 |c, x3 , x4 , x6 )p(x6 |c).
$\sum_{i<j}^{k} I(X_i, X_j|C) \geq t_{XX} \sum_{i<j}^{e} I(X_i, X_j|C)$, where $0 < t_{XX} < 1$ is the thresh-
old. Edges are oriented according to the variable ranking in the first step:
higher-ranked variables point towards lower-ranked variables.
If C has parents,
$$p(c|\mathbf{x}) \propto p(c|\mathbf{pa}(c)) \prod_{i=1}^{n} p(x_i|\mathbf{pa}(x_i)).$$
FIGURE 2.33
Markov blanket-based Bayesian classifier:
p(c|x) ∝ p(c|x2 )p(x2 )p(x3 )p(x4 |c, x3 ). The Markov blanket of C is
MBC = {X2 , X3 , X4 }.
Therefore, the relations among variables do not have to be the same for all c:
$$p(c|\mathbf{x}) \propto p(c) \prod_{i=1}^{n} p(x_i|\mathbf{pa}_c(x_i)),$$
where Pac (Xi ) is the parent set of Xi in the local Bayesian network associated
with C = c.
FIGURE 2.34
Bayesian multinet as a collection of trees: p(C = 0|x) ∝ p(C = 0)p(x1 |C =
0, x2 )p(x2 |C = 0, x3 )p(x3 |C = 0)p(x4 |C = 0, x3 ) and p(C = 1|x) ∝ p(C =
1)p(x1 |C = 1)p(x2 |C = 1, x3 )p(x3 |C = 1, x4 )p(x4 |C = 1).
2.4.10 Metaclassifiers
Metaclassifiers (Kuncheva, 2004) have featured prominently in machine
learning since the 1990s and have been used in successful real-world applications.
Metaclassifiers use multiple classification models, called base classifiers, to
make the final decision. The base classifiers are first generated and then
combined.
The main rationale underlying this idea is the no free lunch theorem
(Wolpert and Macready, 1997), whereby a general-purpose (and universally
good) classification algorithm is theoretically impossible. By combining multiple
algorithms, the overall performance is expected to improve; the variance of
the metaclassifier (Section 2.2.2) has been shown to be reduced.
base classifiers is believed to be a good property for a metaclassifier. Base
classifiers should make mistakes in different instances and specialize in problem
subdomains. This resembles the behavior of a patient visiting more than one
doctor to get a second opinion before making a (better) final decision about
his or her health.
The possible ways of combining the outputs of L base classifiers φ1 , ..., φL
in a metaclassifier depend on whether or not base classifiers are probabilistic. A
non-probabilistic classifier outputs predicted class labels, whereas probabilistic
classifiers produce, for each instance, a distribution with the estimated posterior
probabilities of the class variable conditioned on x. In each case we have,
respectively, methods of fusion of label outputs and methods of fusion of
continuous-valued outputs.
The fusion of label outputs defines different voting rules for decision
making. Examples include unanimity vote, majority vote, simple majority
vote, thresholded majority vote and weighted majority vote. In the fusion
of continuous-valued outputs, the estimated posterior probabilities of cj
conditioned on x for classifier φi , i.e., pi (cj |x), can be interpreted as the
confidence in label cj . As with label outputs, there are multiple options for
summarizing all the outputs given by the L classifiers. Examples are the simple
mean, the minimum, the maximum, the median, the trimmed mean5 , and the
weighted average.
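A minimal sketch of fusing continuous-valued outputs with the simple mean combiner; the posterior values produced by the L = 3 base classifiers are illustrative:

# Sketch: each base classifier gives a posterior distribution over the labels for
# the same instance x, and the simple mean of these distributions decides.
import numpy as np

posteriors = np.array([[0.7, 0.3],   # p_1(c_j | x)
                       [0.4, 0.6],   # p_2(c_j | x)
                       [0.8, 0.2]])  # p_3(c_j | x)

fused = posteriors.mean(axis=0)      # simple mean combiner
print(fused, "-> predicted label index:", fused.argmax())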
Besides this plethora of combiners, there are popular metaclassifiers, namely,
stacking, cascading, bagging, random forests, boosting and hybridizations, all
described below.
Stacked generalization (Wolpert, 1992) is a generic methodology, where
the outputs of the base classifiers, φ1 , ..., φL , are combined through another
classifier φ∗ . In its simplest version, two layers are stacked. The base classifiers
form layer-0, which are usually different classification algorithms. Their predic-
tions (output by honest estimation methods) along with the true class labels
are the inputs for the layer-1 classifier, which makes the final decision.
Cascading (Kaynak and Alpaydin, 2000) sorts the L classifiers in increasing order of complexity.
5 The trimmed mean calculates the mean after discarding equal parts from both extremes
FIGURE 2.35
Cascading metaclassifier. The prediction of an instance x by classifier φi is
c* = arg maxc pi (c|x) whenever pi (c*|x) ≥ θi* (and x is a correct
classification). Otherwise, if x is a misclassification or pi (c*|x) < θi*, x is
passed on to φi+1 , probably with a different threshold θ*i+1.
FIGURE 2.36
Bagging metaclassifier. The same type of classifier is trained from bootstrap
replicates. The final decision uses majority vote (label outputs) or
average/median (continuous outputs).
each classifier’s vote is a function of its accuracy in the training set, given by
ln(1/βi ). Note that a zero error (εi = 0 in Step 2d) signals potential overfitting.
In this case, βi = 0, ln(1/βi ) = ∞ and φi has an infinite voting weight that
should be avoided (despotic classifier). Classifiers with εi < 0.5 are called weak
classifiers; they are AdaBoost targets. Fig. 2.37 is a flowchart showing how
AdaBoost.M1 works.
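A short sketch of boosting in practice, assuming scikit-learn's AdaBoostClassifier is available; by default its weak base classifier is a depth-one decision tree (a stump), and the dataset is a stand-in:

# Sketch: AdaBoost with weak base classifiers, evaluated by cross-validation.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)   # stand-in dataset
boost = AdaBoostClassifier(n_estimators=100)
print(cross_val_score(boost, X, y, cv=5).mean())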
Hybrid classifiers hybridize two (or more) classification algorithms to
leverage their strengths. Naive Bayes tree (NBTree) (Kohavi, 1996) com-
bines classification trees and naive Bayes. NBTree recursively splits the instance
space into subspaces, and a (local) naive Bayes classifier is built in each sub-
space. It is this local model that predicts the class label of the instances that
reach the leaf. Lazy Bayesian rule learning algorithm (LBR) (Zheng
and Webb, 2000) combines naive Bayes and rules. To classify a test instance,
LBR generates a rule whose antecedent is a conjunction of predictor-value
pairs and whose consequent is a local naive Bayes classifier created from the
training cases satisfying the antecedent. Logistic model trees (Landwehr
et al., 2003) are classification trees with logistic regression models at the leaves,
applied to instances that reach such leaves.
FIGURE 2.37
Boosting metaclassifier. The base classifiers are all of the same type. The
training set Di of size N sampled from D focuses more on the mistakes of
φi−1 . The final decision uses the weighted majority vote.
p(X1 , ..., Xn ) = p(X1 )p(X2 |X1 )p(X3 |X1 , X2 ) · · · p(Xn |X1 , ..., Xn−1 )
= p(X1 |Pa(X1 )) · · · p(Xn |Pa(Xn )). (2.10)
The first equality is the application of the chain rule. In the second equality,
we use the previous assumption. The resulting expression will (hopefully) have
fewer parameters. Furthermore, this modularity makes the maintenance and
reasoning easier, as explained below.
A BN represents this factorization of the JPD with a directed acyclic graph
(DAG). This is the qualitative component of a BN, called the BN structure.
A graph G is a pair (V, E), where V is the set of nodes and E is the set of edges
between the nodes in V . Nodes are the domain random variables X1 , ..., Xn .
If the edges are directed –called arcs– from one node to another, G is directed.
The parent nodes Pa(Xi ) of a node Xi are all the nodes pointing at Xi as
given by the arcs. Similarly, Xi is their child node. An acyclic graph contains
no cycles, that is, following the direction of the arcs, there is no sequence of
nodes (directed path) starting and ending at the same node.
The other component of the BN is quantitative and is a collection of
conditional probability distributions. They form the BN parameters. For
each node Xi , the distributions are p(Xi |Pa(Xi )), one per Pa(Xi ) value. These
conditional probabilities are multiplied as indicated by the arcs to output the
JPD (see Eq. 2.10). In discrete variables, BN parameters can be arranged
tabularly, yielding a conditional probability table (CPT).
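A small sketch of evaluating the factorization in Eq. 2.10 for one configuration of the variables: the JPD is the product of each node's conditional probability given its parents. The two-node network and its CPT values are hypothetical:

# Sketch: joint probability of one configuration from a DAG and its CPTs.
parents = {"Rain": (), "WetFloor": ("Rain",)}

# CPTs (hypothetical values): map (node value, parent configuration) -> probability
cpt = {
    "Rain":     {(True, ()): 0.2, (False, ()): 0.8},
    "WetFloor": {(True, (True,)): 0.9, (False, (True,)): 0.1,
                 (True, (False,)): 0.05, (False, (False,)): 0.95},
}

def joint_probability(assignment):
    prob = 1.0
    for node, pa in parents.items():
        pa_values = tuple(assignment[p] for p in pa)
        prob *= cpt[node][(assignment[node], pa_values)]   # p(x_i | pa(x_i))
    return prob

print(joint_probability({"Rain": True, "WetFloor": True}))   # 0.2 * 0.9 = 0.18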
has fewer than 100 employees and fewer than 20 machines, this probability is
only 0.10, i.e., p(p|¬e, ¬m) = 0.10.
FIGURE 2.38
Hypothetical BN modeling factory production.
random vectors (sets of nodes in the BN)6 . Checking whether X and Y are
u-separated by Z is a three-step procedure:
FIGURE 2.39
Moralized ancestral graph of the factory production BN in Fig. 2.38.
to directed graphs, that is perhaps harder to verify, but also implies conditional independence
(Verma and Pearl, 1990a).
FIGURE 2.40
(a) A DAG with five variables. There is one immorality at X4 . (b) Its
essential graph. (c)-(e) The three DAGs equivalent to the DAG in (a). DAGs
(a),(c)-(e) form the equivalence class, represented by the essential graph (b).
(see Section 2.2.1). The required parameters are then µ and Σ. An interesting
property is that a variable Xi is conditionally independent of Xj given the
other variables iff wij = 0, where wij is the (i, j)-entry of W.
The JPD in a Gaussian BN can be equivalently defined by the product of
n univariate (linear) Gaussian conditional densities
f (x) = f1 (x1 )f2 (x2 |x1 ) · · · fn (xn |x1 , ..., xn−1 ), (2.12)
each given by
$$f_i(x_i|x_1, ..., x_{i-1}) \sim \mathcal{N}\Big(\mu_i + \sum_{j=1}^{i-1} \beta_{ij}(x_j - \mu_j), \; v_i\Big), \qquad (2.13)$$
f1 (x1 ) ∼ N (µ1 , v1 )
f2 (x2 ) ∼ N (µ2 , v2 )
f3 (x3 ) ∼ N (µ3 , v3 )
f4 (x4 |x1 , x2 ) ∼ N (µ4 + β41 (x1 − µ1 ) + β42 (x2 − µ2 ), v4 )
f5 (x5 |x2 , x3 ) ∼ N (µ5 + β52 (x2 − µ2 ) + β53 (x3 − µ3 ), v5 ).
FIGURE 2.41
A Gaussian Bayesian network.
We can choose between any of the two representations, Eq. 2.11 or Eq. 2.13,
since both are equivalent. There are formulas to transform one into the other.
First, the unconditional means µi are the same in both representations. Second,
matrix W of the multivariate Gaussian density can be built recursively from
BN inference can combine evidence from any part of the network and perform
any kind of query.
Abductive inference is an important type of inference that finds the
values of a set of variables that best explain the observed evidence. In to-
tal abduction, we solve arg maxU p(U|e), i.e., we find the most probable
explanation (MPE), whereas partial abduction solves the same problem
for a subset of variables in u (the explanation set), referred to as the partial
maximum a posteriori (MAP). Note that both tasks combine probability
computations with an optimization problem. Solving a supervised classification
problem, i.e., maxr p(C = cr |x), is a particular case of finding the MPE.
FIGURE 2.42
Exact inference on the factory production example. (a) The bar charts show
the prior distributions p(Xi ) of each node Xi . (b) The evidence of observing a
factory with many machines (M = m) is propagated through the network to
yield the resulting updated (posterior) distributions p(Xi |m).
$$p(X_i|\mathbf{E} = \mathbf{e}) = \frac{p(X_i, \mathbf{e})}{p(\mathbf{e})} \propto \sum_{\mathbf{U}} p(X_i, \mathbf{e}, \mathbf{U}). \qquad (2.14)$$
$$p(P) = \sum_{Y,E,M,F} p(Y, E, M, F, P)$$
FIGURE 2.43
(a) Structure of the BN modeling factory production. (b) Its moral graph,
where a new edge has been added to marry two parents, E and M , with a
common child. (c) Its junction tree, with three cliques C1 = {E, M, P },
C2 = {Y, E, M } and C3 = {M, F }, and their corresponding separators. The
potentials assigned to each clique are ψ1 (E, M, P ) = p(P |E, M ),
ψ2 (Y, E, M ) = p(E|Y )p(M |Y )p(Y ), ψ3 (M, F ) = p(F |M ).
where Nbj is the set of indices of nodes that are neighbors of Cj . Thus, at
clique Cj , all the incoming messages M l→j are multiplied by its own potential
ψj . When the message passing ends, each clique Ci contains the product of its
own potential and all its incoming messages, $\psi_i \prod_{l \in Nb_i} M^{l \to i}$,
that is, the clique marginals. To compute the marginal (unnormalized) distri-
Example. Factory production (exact inference with the junction tree algorithm)
Suppose that our target distribution is p(P |m), i.e., the probability of piece
production (many, p, or few, ¬p) if the factory has many machines. The
application of Eq. 2.16 followed by Eq. 2.15 yields
$$p(C_1, \mathbf{e}) = p(E, m, P) = \psi_1(E, m, P)\, M^{2 \to 1}(E, m) = p(P|E, m) \sum_{Y} \psi_2(Y, E, m)\, M^{3 \to 2}(m)$$
where ΣXi E is the vector with the covariances of Xi , and each variable in E,
ΣEE is the covariance matrix of E, and µE is the unconditional mean of E.
In general BNs using non-parametric density estimation techniques, in-
ference has been performed on networks with a small number of nodes only
(see, e.g., Shenoy and West (2011)). The problem is that the intermediate
including M = ¬m are rejected and discarded, since they are not compatible
with the evidence M = m.
Hence, if e is very unlikely, probabilistic logic sampling wastes many
samples. The likelihood weighting method addresses this problem (Fung and
Chang, 1990; Shachter and Peot, 1989). Other powerful techniques are Gibbs
sampling (Pearl, 1987) and, more generally, Markov chain Monte Carlo
(MCMC) methods. MCMC methods build a Markov chain, the stationary
distribution of which is the target distribution of the inference process. Thus,
the states of the chain (when converged) are used as a sample from the desired
distribution. They are easy to implement and widely applicable to very general
networks (undirected networks included) and distributions.
Parameters $\theta_{ijk}$ are estimated from the dataset $D = \{\mathbf{x}^1, ..., \mathbf{x}^N\}$, where
$\mathbf{x}^h = (x_1^h, ..., x_n^h)$, $h = 1, ..., N$. Let $N_{ij}$ be the number of cases in D in which
$\mathbf{Pa}(X_i) = \mathbf{pa}_i^j$ has been observed, and $N_{ijk}$ be the number of cases in D
where $X_i = k$ and $\mathbf{Pa}(X_i) = \mathbf{pa}_i^j$ have been observed at the same time
($N_{ij} = \sum_{k=1}^{R_i} N_{ijk}$).
The maximum likelihood estimation (Section 2.2.2) finds θbML such that it
maximizes the likelihood of the dataset given the model:
$$\hat{\boldsymbol\theta}^{ML} = \arg\max_{\boldsymbol\theta} L(\boldsymbol\theta|D, G) = \arg\max_{\boldsymbol\theta} p(D|G, \boldsymbol\theta) = \arg\max_{\boldsymbol\theta} \prod_{h=1}^{N} p(\mathbf{x}^h|G, \boldsymbol\theta). \qquad (2.17)$$
In BNs, probabilities $p(\mathbf{x}^h|G, \boldsymbol\theta)$ in Eq. 2.17 are factorized according to
G, that is, $p(\mathbf{x}^h|G, \boldsymbol\theta) = \prod_{i=1}^{n} p(x_i^h|\mathbf{pa}_i^h, \boldsymbol\theta)$. We now use the assumptions of
FIGURE 2.44
(a) A BN with four nodes. (b) A dataset with N = 6 for {X1 , ..., X4 } from
which the BN in (a) has been learned.
TABLE 2.8
Parameters θijk of the BN in Fig. 2.44
Parameters Meaning
θ1 = (θ1−1 , θ1−2 ) (p(X1 = 1), p(X1 = 2))
θ2 = (θ2−1 , θ2−2 , θ2−3 ) (p(X2 = 1), p(X2 = 2), p(X2 = 3))
θ3 = (θ311 , θ312 , . . . , θ361 , θ362 ) (p(X3 = 1|X1 = 1, X2 = 1),
p(X3 = 2|X1 = 1, X2 = 1), . . .
p(X3 = 1|X1 = 2, X2 = 3),
p(X3 = 2|X1 = 2, X2 = 3))
θ4 = (θ411 , θ412 , θ421 , θ422 ) (p(X4 = 1|X3 = 1), p(X4 = 2|X3 = 1),
p(X4 = 1|X3 = 2), p(X4 = 2|X3 = 2))
FIGURE 2.45
Methods for BN structure learning based on score and search.
There are three possible structure spaces: (a) the space of DAGs; (b)
the space of Markov equivalent classes; and (c) the space of orderings. The
cardinality of the space of DAGs is super-exponential in the number of
nodes: the number f (n) of possible BN structures with n nodes is given by
$$\log L(\hat{\boldsymbol\theta}|D, G) = \log p(D|G, \hat{\boldsymbol\theta}) = \log \prod_{i=1}^{n}\prod_{j=1}^{q_i}\prod_{k=1}^{R_i} \hat{\theta}_{ijk}^{\,N_{ijk}} = \sum_{i=1}^{n}\sum_{j=1}^{q_i}\sum_{k=1}^{R_i} N_{ijk} \log \hat{\theta}_{ijk},$$
where $\hat{\theta}_{ijk}$ is usually taken as the maximum likelihood estimate, i.e.,
$\hat{\theta}_{ijk} = \hat{\theta}_{ijk}^{ML} = \frac{N_{ijk}}{N_{ij}}$, the frequency counts in D.
One problem is that this score increases monotonically with the complexity
of the model (known as structural overfitting), as shown in Fig. 2.46. The
optimal structure would be the complete graph. A family of penalized log-
likelihood scores addresses this issue by penalizing network complexity. Their
general expression is
$$Pen(D, G) = \sum_{i=1}^{n}\sum_{j=1}^{q_i}\sum_{k=1}^{R_i} N_{ijk} \log \frac{N_{ijk}}{N_{ij}} - \dim(G)\, pen(N),$$
where $\dim(G) = \sum_{i=1}^{n} (R_i - 1) q_i$ denotes the model dimension (number
of parameters needed in the BN), and pen(N) is a non-negative penaliza-
tion function. The scores differ depending on pen(N): if pen(N) = 1, the
score is called Akaike's information criterion (AIC) (Akaike, 1974) and if
$pen(N) = \frac{1}{2} \log N$, the score is the Bayesian information criterion (BIC)
(Schwarz, 1978).
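A sketch of computing the BIC score of a candidate structure from frequency counts; the structure, data, and the simplification of counting only the parent configurations observed in the data are illustrative choices:

# Sketch: penalized log-likelihood (BIC) of a BN structure from counts.
import numpy as np
import pandas as pd

def bic_score(data, parents):
    n_rows, score, dim = len(data), 0.0, 0
    for node, pa in parents.items():
        r_i = data[node].nunique()
        groups = data.groupby(list(pa)) if pa else [((), data)]
        q_i = 0                                   # only observed parent configurations
        for _, group in groups:
            q_i += 1
            counts = group[node].value_counts().to_numpy(dtype=float)   # N_ijk
            score += np.sum(counts * np.log(counts / counts.sum()))     # N_ijk log(N_ijk/N_ij)
        dim += (r_i - 1) * q_i
    return score - dim * 0.5 * np.log(n_rows)     # pen(N) = (1/2) log N

data = pd.DataFrame({"X1": [1, 2, 1, 2, 1, 1],
                     "X2": [2, 1, 1, 1, 2, 3],
                     "X3": [1, 1, 2, 2, 1, 1]})
print(bic_score(data, {"X1": (), "X2": (), "X3": ("X1", "X2")}))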
FIGURE 2.46
Structural overfitting: the likelihood of the training data is higher for denser
graphs, whereas it degrades for the test data.
where p(D|G, θ) is the likelihood of the data given the BN (structure G and
parameters θ), and f (θ|G) is the prior distribution over the parameters. De-
pending on f (θ|G), we have different scores. If a Dirichlet distribution is set, i.e.,
(θij |G) follows a Dirichlet of parameters αij1 , ..., αijRi , we have the Bayesian
$$K2(D, G) = p(G) \prod_{i=1}^{n}\prod_{j=1}^{q_i} \frac{(R_i - 1)!}{(N_{ij} + R_i - 1)!} \prod_{k=1}^{R_i} N_{ijk}!.$$
The K2 algorithm uses a greedy search method and the K2 score. The
user gives a node ordering and the maximum number of parents that any
node is permitted to have. Starting with an empty structure, the algorithm
incrementally adds, from the set of nodes that precede each node Xi in the
node ordering, the parent whose addition most increases the function:
$$g(X_i, \mathbf{Pa}(X_i)) = \prod_{j=1}^{q_i} \frac{(R_i - 1)!}{(N_{ij} + R_i - 1)!} \prod_{k=1}^{R_i} N_{ijk}!.$$
When the score does not increase further with the addition of a single parent,
no more parents are added to node Xi , and we move on to the next node in
the ordering.
The likelihood-equivalent Bayesian Dirichlet score (BDe score)
(Heckerman et al., 1995) sets the hyperparameters as $\alpha_{ijk} = \alpha\, p(X_i = k, \mathbf{Pa}(X_i) = \mathbf{pa}_i^j | G)$. The equivalent sample size α expresses the user's confi-
dence in the prior network. In the BDeu score (Buntine, 1991), $\alpha_{ijk} = \frac{\alpha}{q_i R_i}$.
The huge search space has led to propose many heuristics for structure learn-
ing, including greedy search, simulated annealing, estimation of distribution
algorithms (EDAs), genetic algorithms, and MCMC methods.
data is updated in real time. In the offline phase, the modeling is performed
on-demand on the stored synopsis.
Supervised classification data stream models should be validated by adapt-
ing the performance evaluation methodology explained in Section 2.4.1 to
the temporal scenario of data streams. Thus, the adaptation of the hold-out
validation method considers that data instances are clustered into chunks.
Each data chunk is first used as a test instance (if it belongs to the future with
respect to the training data chunks) and, once this chunk belongs to the past,
it is then used to update the model. A special case of this hold-out adaptation
method, where chunk size is equal to one instance, is called prequential
validation (Gama et al., 2013).
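A minimal sketch of the prequential (test-then-train) loop: every incoming instance is first used to test the current model and only afterwards to update it. The "model" here is a trivial majority-class predictor, chosen purely for illustration:

# Sketch: prequential evaluation on a label stream with a running error count.
from collections import Counter

def prequential_error(stream_labels):
    counts, errors = Counter(), 0
    for t, label in enumerate(stream_labels, start=1):
        prediction = counts.most_common(1)[0][0] if counts else None  # test first
        errors += int(prediction != label)
        counts[label] += 1                                            # then train
    return errors / t

print(prequential_error(["ok", "ok", "fail", "ok", "fail", "fail", "fail"]))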
Nguyen et al. (2015) report a survey of data stream clustering and super-
vised classification methods. Here we briefly present some techniques. With
regard to clustering methods, STREAM (O’Callaghan et al., 2002) is a par-
titional clustering method for data streams, which extends the K-medians
algorithm8 using a divide-and-conquer strategy and performing clustering incre-
mentally. The data stream is broken down into chunks, each with a manageable
size for storage in main memory. For each chunk, STREAM uses a K-medians
algorithm to select K representatives (medians) which it stores. The process
is repeated for the next chunks. When the number of representative points
exceeds the main memory storage, a second level of clustering is applied, that
is, STREAM operates in a multilevel manner. CluStream (Aggarwal et al.,
2004) is a hierarchical clustering method (Section 2.3.1) for data streams that
uses micro-clusters, a temporal extension of the clustering feature vector, to
capture the summary information about the data stream. CluStream adopts
an online-offline learning approach. In the online phase, it continuously main-
tains a set of micro-clusters in the data stream. When a new micro-cluster is
created, an outlier micro-cluster is deleted or two neighboring micro-clusters
are merged. In the offline phase, it runs the K-means algorithm to cluster the
stored micro-clusters. SWEM (Dang et al., 2009) is a probabilistic clustering
method for data streams based on finite mixture models (see Section 2.3.5) that
uses a fading window. SWEM represents each Gaussian mixture as a vector of
parameters containing the weights of the mixture, and its mean and covariance
matrix. For the first data window, SWEM applies the EM algorithm until
parameter convergence. In the incremental phase, SWEM uses the converged
parameters from the previous window of data instances as the initial values
for the parameters of the new mixture model. Table 2.9 shows the reviewed
data stream clustering algorithms.
As far as supervised classification is concerned, on-demand-stream (Ag-
garwal et al., 2006) is a k-nearest neighbor classifier (Section 2.4.3) that extends
the CluStream method and works with the micro-cluster structure, the tilted
time window, and the online-offline approach. A micro-cluster is extended
8 K-medians is a variation of the K-means algorithm (Section 2.3.2), where the median is used instead of the mean.
with a class label, and it only takes instances with the same class label. Its
offline classification process starts by finding a good window of data instances.
On-Demand-Stream performs 1-NN classification and assigns the label of the
closest micro-cluster to a testing instance. The Hoeffding tree (Domingos
and Hulten, 2000) is a classification tree (Section 2.4.4) for data streams.
A Hoeffding tree uses the Hoeffding bound to choose an optimal splitting
variable within a sufficient amount of received instances. The Hoeffding bound
provides some probabilistic guarantees about this splitting variable selection.
The algorithm is incremental, which satisfies the single-pass constraint. For
each new data item received, it uses Hoeffding bounds to check whether the
best splitting variable is confident enough to create the next-level tree node. A
granular artificial neural network (Leite et al., 2010) has been proposed
to classify data streams. This is built by augmenting the structure of an ANN
(Section 2.4.6) with granular connections formed as intervals. There are two
model learning phases. In the first phase, information granules of incoming
data are constructed. In the second phase, the neural network is built on
the information granules rather than on the original data. Support vector
machines (Section 2.4.7) have also been adapted for data stream scenarios. The
core vector machine algorithm (Tsang et al., 2007) uses minimum enclosed
balls, basically hyperspheres, to represent the data instances that they contain,
thereby extending the support vector machine method. The algorithm first
finds a representative minimum enclosing ball set as a good approximation
of the original dataset. This set is then optimized to find the maximum mar-
gin directly. RGNBC (Babu et al., 2017) is a rough Gaussian naive Bayes
classifier (Section 2.4.9) for data stream classification with recurring concept
drift. Rough set theory is used to detect concept drifts, and then the current
Gaussian naive Bayes classifier is modified to handle the new underlying data
distribution. The online bagging and online boosting algorithms (Oza and
Russell, 2005) are adaptations of traditional bagging and boosting methods
(Section 2.4.10). In online bagging, each data instance is resampled according
to a Poisson distribution whose parameter is equal to one rather than using
the uniform distribution of the bootstrapping. The Poisson distribution is the
result of considering unlimited data instances. In online boosting, the weights
of incoming data instances and the base classifier are adjusted according to
TABLE 2.10
Data stream supervised classification algorithms outline
the current classifier error rates. Table 2.10 shows the reviewed data stream
supervised classification algorithms.
p(X[1]) are the initial conditions, factorized according to the prior Bayesian
network structure. p(X[t] | X[t − 1]) is also factorized over each Xi [t] as
$\prod_{i=1}^{n} p(X_i[t] \mid \mathbf{Pa}[t](X_i))$, where the parent variables of Xi , Pa[t](Xi ), may be
in the same or in the previous time slice.
in the same or in the previous time slice. In continuous domains, a multivariate
Gaussian distribution is mostly assumed for p(X[1]) and univariate conditional
Gaussian distributions for p(Xi [t] | Pa[t](Xi )).
Higher-order Markov models (the probability at time t depends on two or
more previous time slices) fit more complex temporal processes, although they
pose challenges for structure and parameter estimation.
FIGURE 2.47
A dynamic BN structure with four variables X1 , X2 , X3 and X4 and three
time slices (T = 3): (a) prior BN; (b) transition network, with the first-order
Markovian transition assumption; (c) dynamic BN unfolded in time for three
time slices.
· p(X2 [t] | X1 [t], X1 [t − 1], X2 [t − 1], X3 [t − 1])
· p(X3 [t] | X2 [t], X2 [t − 1], X3 [t − 1], X4 [t − 1])
· p(X4 [t] | X3 [t], X3 [t − 1], X4 [t − 1]) .
The constraint-based and the score and search learning methods presented
in Section 2.5.3 can be adapted to learn first-order Markovian dynamic Bayesian
networks. The prior network can be learned from a dataset containing instances
at time t = 1, whereas the transition network can be learned from a dataset
including 2n variables, instances from times t − 1 and also from t (t = 2, ..., T ).
Friedman et al. (1998b) developed an adaptation of score and search
learning methods to dynamic environments. The dynamic hill-climbing
(DHC) algorithm is based on a hill-climbing search procedure that iteratively
improves the BIC score of the prior and transition networks. Trabelsi (2013)
adapted the max-min hill-climbing (MMHC) (Tsamardinos et al., 2006),
a constraint-based learning algorithm, to dynamic settings, developing the
dynamic max-min hill-climbing (DMMHC). DMMHC consists of three
stages. In the first stage, the algorithm tests conditional independences to
identify, for each node, the sets of candidate parent and child nodes (neighbors).
In the second stage, the candidate neighbors for each node are identified in a
three-step procedure: (a) candidate neighbors of variables in the same time
slice t; (b) candidate parents in the past time slice t−1 of variables in time slice
t; and (c) candidate children in the future time slice t + 1 of variables in time
slice t. In the final stage, a restricted hill-climbing search is run considering
the temporal constraints.
Temporal nodes Bayesian networks (Galán et al., 2007) include two
types of nodes: instantaneous nodes and temporal nodes. A temporal node is
defined by a set of states, where each state is determined by an ordered pair
(λ, τ ), with λ denoting the value of a random variable and τ = [a, b] the time
interval in which the state change occurs. Nodes with no intervals defined for
any of their states are called instantaneous nodes.
machine has two components C1 and C2 . The failure can trigger an incorrect
response of C1 , of C2 or of both, C1 and C2 . Therefore, the possible values for
C1 and C2 are correct and incorrect. These three nodes, F, C1 and C2 , are
instantaneous nodes that may generate subsequent changes in other two nodes.
The incorrect response of the first component may produce oil O and/or water
W leaks, whereas the malfunction of the second component may cause oil O
leaks. The severity of the failure and the correctness of the first component
influence the time that it takes for water leaks to occur. The time taken for oil
leaks to occur depends on the three instantaneous nodes. Fig. 2.48 shows the
structure of this temporal nodes Bayesian network.
FIGURE 2.48
Example of a temporal nodes Bayesian network. F , C1 and C2 are
instantaneous nodes, whereas W and O denote temporal nodes. The possible
values of the failure node F are f1 = mild, f2 = moderate and f3 = severe.
Temporal nodes Bayesian networks can be learned from data (Hernández-
Leal et al., 2013).
Continuous time Bayesian networks (Nodelman et al., 2002) over-
come the main limitations of dynamic Bayesian networks and temporal nodes
Bayesian networks by explicitly representing the dynamics of the process by
computing the probability distribution over time when specific events occur.
The nodes of a continuous time Bayesian network represent random variables
whose state evolves continuously over time. Thus, the evolution of each variable
depends on the state of its parents in the graph structure.
A continuous time Bayesian network over X1 , X2 , ..., Xn consists of two
components: (a) an initial probability distribution p0 (X1 , X2 , ..., Xn ) specified
as a Bayesian network; and (b) a continuous transition model specified as a
directed (possibly cyclic) graph and, for each variable Xi (with possible values
xi1 , ..., xiRi ), a set of conditional intensity matrices, defined as
for each instantiation pa(xi ) of the parent variables Pa(Xi ) of node Xi . The
diagonal elements, $q_{x_{ik}}^{\mathbf{pa}(x_i)} = \sum_{x_{ij} \neq x_{ik}} q_{x_{ik}, x_{ij}}^{\mathbf{pa}(x_i)}$, are interpreted as the instantaneous
probability of leaving the value xik , and each off-diagonal element as the instantaneous
probability of transitioning from the k-th possible value of Xi , xik , to its j-th
possible value, xij , for a specific instantiation pa(xi ) of Pa(Xi ).
Continuous time Bayesian networks can deal with both point evidence and
continuous evidence. Point evidence refers to the observation of the value of
a variable at a particular instant in time, whereas continuous evidence refers
to the value of a variable over an entire time interval. Shelton et al. (2010)
developed inference and learning from data algorithms.
FIGURE 2.49
Example of a Markov chain represented as a graph. In addition to the start
state (node 0) and final state (node 5), the other four states are represented by
nodes 1 to 4. A transition probability is associated with each arc in the graph.
world. Using a hidden Markov model, we can compute probabilities for both
observed and hidden events that we consider to be causal factors in our
probabilistic model.
An HMM is specified by the following components: (a) a set of h hidden
states {1, 2, ..., h} of a hidden variable H; (b) an h × h transition probability
matrix A with elements aij , each representing the probability of moving from
state i to state j, that is, $a_{ij} = p(H_t = j|H_{t-1} = i)$, with $\sum_{j=1}^{h} a_{ij} = 1$; (c) a
sequence of T observations o = (o1 , ..., oT ); (d) a sequence of emission proba-
bilities bi (ot ) expressing the probability of an observation ot being generated
from a hidden state i, that is, bi (ot ) = p(Ot = ot |Ht = i) for all t = 1, ..., T .
These bi (ot ) are the elements of an h × K emission probability matrix B, where
K denotes the number of different possible observation values; (e) an initial
probability distribution over initial states π = (π1 , . . . , πh ). The HMM can be
described by a parameter vector, θ = (A, B, π).
A first-order hidden Markov model is based on two simplifying as-
sumptions. First, as a first-order Markov chain, the probability of a partic-
ular hidden state depends only on the previous state: p(Ht |Ht−1 , . . . , H1 ) =
p(Ht |Ht−1 ). Second, the probability of an output observation Ot = ot de-
pends only on the hidden state Ht = ht (with ht ∈ {1, 2, . . . , h}) that pro-
duced the observation and not on any other states or any other observations:
p(ot |h1 , . . . , ht , . . . , hT , o1 , . . . , ot , . . . , oT ) = p(ot |ht ).
FIGURE 2.50
Example of a first-order hidden Markov model.
Given an observation sequence o = (o1 , ..., oT ), HMMs have to solve three
fundamental problems: (i) evaluate its likelihood; (ii) discover the optimal
hidden state sequence; and (iii) learn the HMM parameters, π, aij and bi (ot ).
This is explored below.
Denoting by 0 and F the initial and final states, respectively, the ini-
tialization of the forward algorithm is given by $\alpha_1(j) = \pi_j b_j(o_1)$, and
$a_{0j} = \pi_j$, for $j = 1, ..., h$. The termination of the recursion corresponds
to $\alpha_T(h_F) = \sum_{i=1}^{h} \alpha_T(i) a_{iF}$. This represents $p(o_1, ..., o_T, H_T = h_F)$, which
would be introduced in Eq. 2.19.
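A numpy sketch of the forward recursion, α1(j) = πj bj(o1) and αt(j) = bj(ot) Σi αt−1(i) aij; here the observation likelihood is obtained by summing the final column (rather than through an explicit final state), and the two-state HMM is illustrative:

# Sketch: forward algorithm for a small, illustrative HMM.
import numpy as np

A = np.array([[0.7, 0.3],      # a_ij = p(H_t = j | H_{t-1} = i)
              [0.4, 0.6]])
B = np.array([[0.9, 0.1],      # b_i(o) for observations {normal, slow}
              [0.2, 0.8]])
pi = np.array([0.6, 0.4])
obs = [0, 1, 0]                # "normal speed, slow speed, normal speed"

def forward(obs, A, B, pi):
    alpha = pi * B[:, obs[0]]                  # initialization
    for o in obs[1:]:
        alpha = B[:, o] * (alpha @ A)          # recursion
    return alpha.sum()                         # p(o_1, ..., o_T)

print(forward(obs, A, B, pi))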
2.6.3.2 Decoding
For any model containing hidden variables, the task of determining which
sequence of hidden variables is (has the highest probability of being) the
underlying source of some sequence of observations is a problem called the
decoding task. Given the observation sequence “normal speed, slow speed,
normal speed”, the task of the decoder is to find the best (most probable)
hidden machine quality sequence that can be seen as its underlying source.
A naive approach for solving this problem is to run the forward algorithm
for each possible hidden state sequence and select the sequence with the highest
probability. As explained above, the number of possible hidden sequences is
hT , and the forward algorithm requires h2 T operations for each sequence.
The most common decoding algorithm for HMMs is the Viterbi algorithm
(Viterbi, 1967). Like the forward algorithm, the Viterbi algorithm is a kind
of dynamic programming procedure. vt (j) represents the probability that the
HMM is in its j-th hidden state after considering the first t observations and
passing through the most probable state sequence h0 , h1 , ..., ht−1 . Formally,
The most probable path is obtained by taking the maximum over all possible previous state sequences. The Viterbi algorithm computes the value of $v_t(j)$ recursively as
$$v_t(j) = \max_{i=1,\ldots,h} v_{t-1}(i)\, a_{ij}\, b_j(o_t).$$
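A minimal sketch of the Viterbi algorithm for a discrete HMM, assuming integer-coded observations (an illustration of the recursion above, not the implementation used later in this book):

```python
import numpy as np

def viterbi(obs, A, B, pi):
    """Return the most probable hidden state sequence for a discrete HMM."""
    T, h = len(obs), A.shape[0]
    v = np.zeros((T, h))                  # v[t, j]: best-path probability ending in state j
    back = np.zeros((T, h), dtype=int)    # back-pointers for path recovery
    v[0] = pi * B[:, obs[0]]
    for t in range(1, T):
        scores = v[t - 1][:, None] * A    # v_{t-1}(i) * a_ij
        back[t] = scores.argmax(axis=0)
        v[t] = scores.max(axis=0) * B[:, obs[t]]
    path = [int(v[-1].argmax())]          # most probable final state
    for t in range(T - 1, 0, -1):         # backtrack through the stored pointers
        path.append(int(back[t, path[-1]]))
    return path[::-1]
```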
To learn the HMM parameters, two auxiliary quantities are computed: $\gamma_t(i)$, which is the probability of being in state $i$ at time $t$ given the observed sequence $(o_1, \ldots, o_T)$ and parameters $\theta = (A, B, \pi)$, and $\xi_t(i,j)$, which is the probability of being in the hidden states $i$ and $j$ at times $t$ and $t+1$, respectively, given the sequence of observations $o_1, \ldots, o_T$. The denominators of $\gamma_t(i)$ and $\xi_t(i,j)$ are the same: they represent the probability of the observed sequence $o_1, \ldots, o_T$ given the parameters $\theta = (A, B, \pi)$.
The parameters of the HMM model can now be updated:
• πi = γ1 (i) represents the expected frequency of being in the hidden state i
at time 1.
• $a_{ij} = \frac{\sum_{t=1}^{T-1} \xi_t(i,j)}{\sum_{t=1}^{T-1} \gamma_t(i)}$, which is the expected number of transitions from state $i$ to state $j$ compared to the expected total number of times the hidden state $i$ is observed in the sequence from $t = 1$ to $t = T - 1$.

• $b_i(o_t) = \frac{\sum_{t=1}^{T} I(O_t = o_t)\, \gamma_t(i)}{\sum_{t=1}^{T} \gamma_t(i)}$, where $I(O_t = o_t)$ is an indicator function, that is,
$$I(O_t = o_t) = \begin{cases} 1 & \text{if } O_t = o_t \\ 0 & \text{otherwise,} \end{cases}$$
and $b_i(o_t)$ is the expected number of times that the observed sequence is equal to $o_t$ while being in hidden state $i$ over the expected total number of times that it is in hidden state $i$.
These three steps (forward and backward procedures and the updating
of θ = (A, B, π) parameters) are repeated iteratively until the convergence
criterion is met.
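Assuming that γ and ξ have already been computed in the expectation step from the forward and backward passes (gamma with shape (T, h) and xi with shape (T-1, h, h), observations integer-coded), a minimal numpy sketch of the three update formulas above is:

```python
import numpy as np

def update_parameters(obs, gamma, xi, n_obs_values):
    """Re-estimate (A, B, pi) from gamma[t, i] and xi[t, i, j] (one EM iteration)."""
    obs = np.asarray(obs)
    T, h = gamma.shape
    pi_new = gamma[0]                                            # expected state occupancy at t = 1
    A_new = xi.sum(axis=0) / gamma[:T - 1].sum(axis=0)[:, None]  # transition probability update
    B_new = np.zeros((h, n_obs_values))
    for k in range(n_obs_values):                                # emission probability update
        B_new[:, k] = gamma[obs == k].sum(axis=0) / gamma.sum(axis=0)
    return A_new, B_new, pi_new
```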
Table 2.11 lists the clustering and supervised classification methods sup-
ported by the five tools.
TABLE 2.11
Machine learning tools for clustering and supervised classification
For Bayesian networks, we have selected the following five software tools:
HUGIN15, GeNIe, Open-Markov16, gRain17 and bnlearn18. HUGIN (Madsen
et al., 2005) is a software package –developed by HUGIN EXPERT, a company
located in Aalborg, Denmark– for building and deploying decision support
systems for reasoning and decision making under uncertainty. HUGIN software
is based on Bayesian network and influence diagram technology. The HUGIN
software package consists of the HUGIN Decision Engine (HDE), a GUI and
application program interfaces (APIs) to facilitate the integration of HUGIN
into applications. GeNIe Modeler (Druzdzel, 1999) is a GUI that provides an interactive environment for building and learning Bayesian networks; it is connected with SMILE (Structural Modeling, Inference, and Learning Engine), which provides exact and approximate inference algorithms. It is based on
research at the University of Pittsburgh, USA. Nowadays, it is developed
by BayesFusion, LLC. Open-Markov (Arias et al., 2012) is a software tool
that implements both constraint-based and score+search learning algorithms
and approximate inference methods. It is under development at the National
University for Distance Education in Madrid. gRain (Højsgaard, 2012) is an R
15 https://www.hugin.com/
16 http://www.openmarkov.org/
17 https://CRAN.R-project.org/package=gRain
18 http://www.bnlearn.com/
TABLE 2.12
Software for Bayesian networks
Questions arise about how humans and machine learning systems interact
as well as the societal challenges of adapting to a world where these systems
are increasingly pervasive. During the first industrial revolution, individuals
had to adapt to new ways of communicating, traveling, and working. Nowadays
we wonder if humans will be able to adapt to changes in almost all aspects of
life as a consequence of advances in machine learning.
Educational barriers are hindering the spread of machine learning systems in society. As early as the elementary school level, students could benefit from greater encouragement to develop a science, technology, engineering, and mathematics skill set. In postsecondary curricula, it is common to emphasize only traditional mathematics, excluding any kind of computing and communication skills. Although companies are aggressively hiring new talent in machine learning, there are relatively few people trained in data science, statistics and machine learning. In this sense, initiatives like the automatic statistician project19, which aims to build an artificial intelligence for data science that helps people make sense of their data, are more than welcome.
The modeling capabilities provided by machine learning pose new challenges
to managing privacy. Sometimes, machine learning will use data containing
sensitive information, whereas in other cases machine learning might create
sensitive insights from seemingly irrelevant data. Machine learning tools have the potential to provide new kinds of transparency about data and inferences, and should be applied to detect, remove, or reduce human bias rather than reinforce it. Transparency and interpretability of predictive
machine learning models are a must, as one needs to understand what the
predictive model is doing, and will do in the future, in order to trust and
deploy it.
To push forward the boundaries of machine learning, statisticians, engi-
neers, data scientists, computer scientists and mathematicians can benefit from
different key stakeholders: sociologists, to discuss the ethical and societal tech-
nological issues; psychologists, who could offer valuable insights into the ways
humans interact with technology; unions and industrial psychologists, as voices
of workforce changes from continued development of machine learning meth-
ods; policy makers and regulators, who could contribute to a dialogue about
autonomy and fair information principles; and historians, whose knowledge
about past eras of rapid technological change can be very useful.
Regulating machine learning activities (and artificial intelligence in general) seems feasible and desirable. However, risks and considerations differ across domains. Policy makers should recognize that, to varying degrees and over time, different industries will need distinct, appropriate regulations.
19 https://www.automaticstatistician.com/index/
3
Applications of Machine Learning in Industrial Sectors
The detailed and comprehensive structure for sector and industry analysis
called Industry Classification Benchmark1 of FTSE Russell is used in this
chapter.
3.1.1 Oil
Crude oil is today the world’s leading fuel, and its price has a big impact
on the global environment, economy, and oil exploration and exploitation
activities. Oil price forecasts are very useful for industries, governments
and individuals. Major factors affecting the oil market are demand, supply,
population, geopolitical risks and economic issues. Crude oil price prediction
is a very challenging problem for machine learning methods due to its high
volatility. Several machine learning techniques have been proposed for oil price
prediction.
Nwiabu and Amadi (2017) develop a naive Bayes classifier containing
demand and supply variables to predict upward or downward price movements.
Xie et al. (2006) apply support vector machines to this problem. They are
empirically compared with multilayer perceptron artificial neural networks and
autoregressive integrated moving average (ARIMA) models on the monthly
spot prices of West Texas Intermediate (WTI) crude oil from 1970 to 2003.
Yu et al. (2008) estimate crude oil prices based on an ensemble of artificial
neural networks. First, the original crude spot price series is decomposed into a
finite number of chunks. Then, a one hidden layer perceptron is used to model
each of the dataset chunks. Finally, the prediction results of all the models are
combined using another artificial neural network. The WTI crude oil price and
Brent crude oil price series are used to test the effectiveness of the proposal.
Gao and Lei (2017) noticed that crude oil price series are not necessarily the result of a stationary process and proposed the use of data stream learning algorithms on the WTI benchmark dataset.
In recent years, pirate attacks against shipping and oil field installations
have become more frequent and more serious. Bouejla et al. (2012) provide
an innovative solution that addresses the problem from the perspective of the
entire processing chain, that is, from the detection of a potential threat to
the implementation of a response. Bayesian networks are used to discover the
relationships between 20 variables concerning the characteristics of the threat
3.1.2 Gas
Dissolved gas analysis (DGA) of the insulation oil of power transformers is a research tool used to monitor transformer health and to detect impending failures by recognizing anomalous patterns of DGA concentrations. The failure prediction
problem can be seen as a machine learning task on DGA samples, optionally
exploiting the transformer age, nominal power and voltage. There are two
approaches: binary classification for detecting failure and regression of time
to failure. Mirowski and LeCun (2018) review and evaluate 15 classification
and regression models for this problem, like k-nearest neighbors, classification
trees, artificial neural networks and support vector machines.
Pipelines are one of the most popular and effective ways of transporting
hazardous materials, especially natural gas. However, the rapid development
of gas pipelines and stations in urban areas poses a serious threat to public
safety and assets. A comprehensive methodology for accident scenario and risk
analysis of gas transportation systems, especially in natural gas stations, has
been developed using Bayesian networks (Zarei et al., 2017). The Bayesian network for failure mode and effect analysis is elicited from domain experts and considers a total of 43 variables containing information about failure modes, causes and effects on the system, severity of effects, detection level and risk priority number.
3.2.1 Chemicals
Like other industries, the chemical industry operates in a competitive global
market, where mergers and acquisitions designed to reinforce company position
are commonplace. This industrial sector can expect big opportunities by
embracing machine learning.
Four major application areas of machine learning in the chemical industry
are: (a) manufacturing, (b) drug design, (c) toxicity prediction, and (d) com-
pound classification. The volume of data that has to be handled in chemical
manufacturing to tackle optimization, monitoring, and control problems is
increasing (Wuest et al., 2016). Machine learning can contribute significantly to
speeding up the processes, finding more sustainable and cost-effective solutions,
and making room for innovations. Thus, Ribeiro (2005) used support vector
machines and artificial neural networks for monitoring the quality of in-process
data in a plastic injection molding machine.
The aim in drug design is to identify lead compounds with significant
activity against a selected biological target. The drug target is a protein whose
activity is modulated by its interaction with a chemical compound and may
thus control a disease. Lead compounds are identified at the drug discovery
stage. They are then optimized in the drug development phase, resulting in a
small number of chemicals that are evaluated in human clinical trials. Machine
learning has been used at the drug discovery stage to build functions that
rank the probability that a chemical will have activity against a known target
and to predict ligand-receptor affinity, target structure and the side-effects of
new drugs, as well as for target screening on cells and even in drug delivery
systems (Bernick, 2015). The review by Lima et al. (2016) mentions random
forests, decision trees, artificial neural networks and support vector machines
as the main techniques in new drug discovery, whereas Lavecchia (2015) adds k-
nearest neighbors and naive Bayes. Another important use of machine learning
is to predict the pharmacokinetic and toxicological profile of compounds, i.e.,
the so-called ADME-Tox (absorption, distribution, metabolism, excretion and
toxicity) (Maltarollo et al., 2015).
Machine learning is used in many aspects of toxicity detection and
prediction. Predicting chemical toxicity is an important issue in both the
environmental and drug development fields. In vitro bioassay data are often
used to predict in vivo chemical toxicology. Judson et al. (2008) compare
k-nearest neighbors, artificial neural networks, naive Bayes, classification trees
and support vector machines with filter-based feature selection. Machine
learning is used in drug development to detect toxicities, such as hepatotoxicity,
FIGURE 3.1
Applications of machine learning in the basic materials sector.
marketplace. Master perfumers typically train for ten years before they become
proficient. Goodwin et al. (2017) use different classifiers to predict target
gender and average rating for unseen fragrances, all characterized by a set of
fragrance notes. Also, the data are projected in a 2D space using t-distributed
stochastic neighbor embedding (t-SNE) (van der Maaten and Hinton, 2008),
a non-linear dimensionality reduction technique. This discovers clusters of
perfumes and free spaces without data, which could suggest combinations of
as yet unexplored but promising fragrance notes. Xu et al. (2007) design a
new pigment mixing method based on an artificial neural network to emulate
real-life pigment mixing, as well as to support the creation of new artificial
pigments.
of labs or from other industrial sectors, as there are several critical issues
to be solved, including harsh environments, limited access to communication
infrastructures and so on.
FIGURE 3.2
Applications of machine learning in the industrials sector.
From this general point of view, Hansson et al. (2016) investigate different
machine learning tools that can be applied to heavy industry. They analyze
feature subset selection (see Section 2.4.2), clustering (Section 2.3), artificial
neural networks (Section 2.4.6), support vector machines (Section 2.4.7), clas-
sification trees (Section 2.4.4) and metaclassifiers (Section 2.4.10) with respect
to their industrial applications. Additional techniques beyond those explained in this work could have been applied, as it is possible to obtain similar actionable insights with all of them. However, there is also a real challenge related to the deployment of the algorithms in real working environments, where restrictions such as communications, data storage and computing power needs in the application field determine the optimal algorithm. These integration challenges between algorithms and their infrastructure needs
are being solved by the Fourth Industrial Revolution, which is introducing
machine learning applications to the industrials sector through IT and OT
integration. Some examples are summarized below.
technique for minimizing errors related to false alarms and true positive
detection rates with an increase in accuracy.
Within the industrial engineering sector, applications are related to
industrial equipment or components, such as pumps, engines, compressors,
bearings and elevators. In this scenario, Kowalski et al. (2017) describe a
marine engine application based on a single-hidden layer feedforward neural
network oriented to fault diagnosis. Another example is the application of
anomaly detection techniques from data streams (see Section 2.6.1) to assist
elevator manufacturing and maintenance (Herterich et al., 2016).
With regard to the aerospace industry, there are two levels of appli-
cations: component manufacturing and solutions, such as flight control, or
decision support systems. Machine learning is mainly applied for quality con-
trol in aerospace component manufacturing, where high accuracy rates on
100% inspection are required. In this case, the applications are similar to
other component manufacturing in terms of methodology. However, there are
some interesting applications such as an automatic horizon detection device
to assist the pilot during flight. Fefilatyev et al. (2006) describe the use of
support vector machines, C4.5 classification tree and naive Bayes classifiers.
In this case, accuracy is very high using only a small set of images. Apart
from classification tasks, feature subset selection techniques based on image
transformation and ranking depending on the pixel value are used.
In defense equipment manufacturing, industry efforts focus on the
cyber defense industry. Firewalls and other devices are being designed
to detect evolving threats at different levels. The most useful cyber defense
approach is anomaly detection (deviations from normal behavior), where
different variations have been implemented using machine learning techniques.
Communication signal anomalies are detected by learning normal signal traffic.
However, there are some improvements, such as anomaly detection related to
different user profiles described by Lane and Brodley (1997). In this case, the
user profile is learned from characteristic sequences of actions performed by
different users. To detect anomalies, the system first computes the sequence
similarity in order to classify user behavior. However, other machine learning
techniques, such as artificial neural networks, naive Bayes, k-nearest neighbors
and support vector machines, are applied in cyber defense depending on the
threat (Buczak and Guven, 2016).
Like cyber defense, industrial services, such as financial administration,
mainly use machine learning for fraud detection. Bose and Mahapatra (2001)
analyze the most common techniques such as rule induction, artificial neural
networks and case-based reasoning, all of which are used as data mining
techniques to predict or classify different types of fraud and threats in electronic
transactions.
For industrial transportation, machine learning is mainly applied to
manage traffic at ports and airports where there is a critical mixture between
high-volume traffic, tight scheduling and large vessels with potential hazardous
content. For marine ports, some of the collision avoidance systems are developed
FIGURE 3.3
Applications of machine learning in the consumer services sector.
3.4.1 Retail
Machine learning methods can provide grocers with competitive advantages by avoiding the costly problems of having too much or too little fresh food in stock, as they are the secret to smarter fresh-food replenishment. In addition,
they can make demand forecasts based not just on historical sales data but on
other influential factors such as advertising campaigns, store opening times,
local weather and public holidays.
3.4.2 Media
Entertainment, broadcasting, cinema and television are other fields where
machine learning can help. Examples include: (i) video analysis that tracks emotions such as eeriness, fright and affection, alongside audio analysis of music and tone of voice; (ii) real-time indexing and analysis of broadcast programming to provide advertising partners with increased visibility, transparency and accountability; (iii) development of systems that learn about audience interests and recommend television programs and movies based on taste; (iv) video encoder improvements to provide high picture quality on 4-inch and 5-inch screens.
An example of (iii) is the Netflix Prize4 , an open competition for the best
algorithm to predict user ratings for films. It was launched in 2006 and in
September 2009 the grand prize of $1,000,000 was given to a team which
improved Netflix’s own algorithm by 10%.
Marketing can also take advantage of the real-time analysis of tons of
data (Sterne, 2017). Tableau5 and Qlikview6 provide a rich palette of data
visualization widgets widely used in marketing research. For a marketing
campaign to be successful, understanding which words, phrases, sentences and
even content formats resonate with particular members of the audience is key.
Machine learning algorithms can be applied to the data from all campaigns to
deduce the best textual introduction for emails sent to an audience or even to an
individual, thereby increasing likelihood of success. Other examples include the
analysis of mobile customer behavior to help app publishers identify the most
loyal users and predict customer churn. Armed with this insight, marketers
can take actions across digital channels to deepen customer engagement or
invest more in retaining specific customer segments.
3.4.3 Tourism
Loyalty is one of the strategic marketing targets, as companies can thus enhance competitive advantages with clear benefits: customer willingness to repurchase and promote products can lead to company revenue growth, increased market share, cost reduction, and increased employee job satisfaction. Tourism
loyalty is influenced by several factors, including customer service offered by
employees, tour website functions, consumer perception of the characteristics
of local tourism and customer loyalty in terms of revisiting a destination. Hsu
et al. (2009) apply Bayesian networks to data collected from tourists with a
tour experience. They conclude that tourism loyalty is greater for tourists that
have a better perception of customer service, web functions and local charac-
teristics of the destination, which may lead them to revisit the destination or
recommend it to others. Wong and Chung (2008) apply classification trees to
4 http://www.netflixprize.com/
5 https://www.tableau.com
6 https://www.qlik.com
FIGURE 3.4
Healthcare applications of machine learning.
3.5.1 Cancer
Cancer has been characterized as a heterogeneous disease consisting of many
different subtypes. Early cancer diagnosis and prognosis are a necessity in cancer research, as they can facilitate the subsequent clinical management of
patients. The importance of classifying cancer patients into high- or low-risk
groups is evident for diagnosis and prognosis, as well as for modeling the
progression and treatment of cancerous conditions (Kourou et al., 2015).
Three important problems that have been addressed using machine learning
methods are (i) the prediction of cancer susceptibility (risk assessment) like
breast cancer risk estimation (Ayer et al., 2010), where artificial neural networks
have been applied to a dataset with more than 48,000 mammographic findings,
as well as demographic risk factors and tumor characteristics. The area under
the ROC curve was used in order to assess the model discriminative power; (ii)
the prediction of cancer recurrence in oral cancer (Exarchos et al., 2012) with
clinical, imaging tissue, and genomic data comparing Bayesian classifiers with
artificial neural networks, support vector machines, classification trees and
random forests in combination with multivariate filter and wrapper approaches
for feature subset selection; (iii) the prediction of cancer survival (Park et al.,
2013), evaluating the survival of women who have been diagnosed with breast
cancer, with survivability as the class variable referring to patients who have
and have not survived.
Bayesian networks have been intensively applied in cancer research. Sesen
et al. (2013) have performed personalized lung cancer survival prediction es-
timates and treatment selection recommendations using Bayesian networks,
based on the English Lung Cancer Database, which includes more than 126,000
patients who were diagnosed between 2006 and 2010. The model was con-
structed on a set of relevance variables available when a new patient came in
for a treatment decision. Structure learning was carried out both manually
(elicited by experts) and automatically (with a score and search approach that
uses the K2 metric in combination with simulated annealing). The automatic
approach outperformed the manual method. Cruz-Ramírez et al. (2007) study
the effectiveness of several Bayesian networks classifiers (naive Bayes and
several variants, and unrestricted Bayesian network classifiers) for accurately
diagnosing breast cancer. Gevaert et al. (2006) learn a Bayesian network clas-
sifier from data for predicting breast cancer prognosis by integrating clinical
(including age, diameter, tumor grade, estrogen and progesterone receptor
status, and the presence of angioinvasion and lymphocytic infiltration) and
microarray data (mRNA expression levels of approximately 25,000 genes for
each patient). The two class variable values correspond to poor and good
prognosis. Poor prognosis refers to recurrence within five years after diagno-
sis and good prognosis corresponds to a disease-free interval of at least five
years. The Markov blanket-based classifier structure was built using the K2
search algorithm in combination with the Bayesian Dirichlet scoring metric.
Parameters were estimated assuming a Dirichlet prior distribution.
Onisko and Austin (2015) develop a dynamic Bayesian network for pre-
dicting the risk of developing cervical precancer and invasive cervical cancer.
The aim is to identify women that are at higher risk of developing cervical
cancer and who should be screened differently than indicated in the guidelines.
The data were collected over an eight-year period (2005-2012) and contain two
screening tests (Pap and hrHPV) for more than 790,000 patients. The other
collected variables are diagnostic or therapeutic procedures, patient history
findings, and demographic variables. Some variables are temporal, i.e., they
are repeated for each time step. According to USA cervical cancer guidelines,
the time step chosen for this model was one year.
3.5.2 Neuroscience
Neuroscience studies the nervous system. It is a multidisciplinary branch
of biology, which deals with the anatomy, biochemistry, molecular biology,
and physiology of neurons and neural circuits. Recent technological advances
have led, for example, to high spatial and temporal resolution recordings of
the activity of hundreds of cells located in a relatively small region of the
brain using either imaging approaches or electrodes. These big data offer an
opportunity for machine learning to provide a better understanding of both the
healthy and diseased brain, leading to more successful treatments in the future
(Landhuis, 2017). Bielza and Larrañaga (2014a) review the use of Bayesian
networks in neuroscience.
Some important works in the literature on machine learning applications
for neuroanatomy, neurosurgery, neuroimaging, and neurodegenerative diseases
follow.
In neuroanatomy, DeFelipe et al. (2013) apply machine learning methods
to gain new insights into the classification of cortical GABAergic interneurons
to produce an accepted nomenclature of neuron types, not currently available.
A web-based interactive system was used that collected data from several
experts in the neuroanatomy field about the terminological choices (common
type, horse-tail, chandelier, Martinotti, common basket, arcade, large basket,
Cajal-Retzius, neurogliaform or other) for a set of cortical interneurons. The
3D reconstructions of these neurons were used to measure a large number
of morphological features of each neuron. All the supervised classification
methods introduced in Section 2.4 were applied, with the exception of logistic
regression. Univariate and multivariate filtering were the chosen feature sub-
set selection approaches. Additionally, experts’ opinions were modeled using
Bayesian networks.
Celtikci (2017) reviews more than 50 studies on neurosurgery classified
according to the following topics: hydrocephalus, deep brain stimulation, neu-
rovascular, epilepsy, glioma, radiosurgery, spine, and traumatic brain injury. Six
supervised classification methods (neural networks, Bayesian classifiers, sup-
port vector machines, classification trees, logistic regression and discriminant
analysis) are applied.
Neuroimaging is a widespread technique in cognitive neuroscience. Several
imaging techniques are available. They differ in anatomical coverage, temporal
sampling and imaged hemodynamic properties. The most used modalities
are: fMRI, MRI, and EEG. Abraham et al. (2014) report the use of scikit-
learn software (see Section 2.7) in different neuroimaging tasks. The facilities
provided by the software are illustrated on several problems: decoding the
mental representations of objects in the brain, encoding brain activity and
decoding images, and functional connectivity analysis in the resting state. They use methods like independent component analysis (a variant of PCA), univariate filtering for feature subset selection, hierarchical clustering and K-means,
and logistic regression and support vector machines to tackle these problems.
Bielza and Larrañaga (2014a) review more than 40 papers on Bayesian network
applications in neuroimaging. Dynamic Bayesian networks have been applied
to different problems in fMRI (dyslexia, Parkinson’s disease, schizophrenia,
dementia in elder subjects), MRI (mild cognitive impairment) and EEG (motor
task).
Neurodegenerative diseases and brain disorders cost the economy of the
developed countries a huge amount of money. For example, brain disorders cost Europe an estimated €798 billion in 2010 (Olesen et al., 2012). Parkinson's
disease and Alzheimer’s disease are the two neurodegenerative diseases that
account for the highest financial expenditure. The K-means algorithm has
recently been applied to search for patient subtypes from a large, multicenter,
international, and well-characterized cohort of Parkinson’s disease patients
across all motor stages, using a combination of motor features (bradykine-
sia, rigidity, tremor, axial signs) and specific validated rater-based non-motor
symptom scales (Mu et al., 2017). Multi-dimensional Bayesian network classi-
fiers (Bielza et al., 2011) are used in Borchani et al. (2014) to simultaneously
predict the five items (mobility, self-care, usual activities, pain/discomfort
and anxiety/depression) of the European quality of life-5 dimensions (EQ-5D)
from the Parkinson’s disease questionnaire (PDQ-39). PDQ-39 is a 39-item
self-report questionnaire, which assesses Parkinson’s disease-specific health-
related quality, containing questions about mobility, activities of daily living,
emotional well-being, social support, cognition, communication, bodily dis-
comfort and stigma. Transcript interaction networks induced by ensembles
of Bayesian classifiers have provided new candidate transcripts in the study
of Alzheimer’s disease (Armañanzas et al., 2012). Bind et al. (2015) review
supervised machine learning methods (artificial neural networks, k-nearest
neighbors, support vector machines, naive Bayes classifier, random forest, bag-
ging and boosting) for Parkinson’s disease prediction, whereas Tejeswinee et al.
(2017) apply univariate and multivariate feature subset selection methods in
combination with classification trees, naive Bayes classifiers, support vector
machines, k-nearest neighbors, random forest and boosting in Alzheimer’s and
Parkinson’s disease datasets.
3.5.3 Cardiovascular
Cardiovascular medicine generates a plethora of biomedical, clinical and
operational data as part of patient care delivery. These data are often stored in
diverse data repositories which are not readily usable for cardiovascular research.
However, the application of machine learning techniques in cardiovascular dis-
eases is not new, although there has been an exponential growth in the number
of PubMed-listed publications including “cardiology” and “machine learning”
terms over the last few years (Shameer et al., 2018). Datasets in cardiology
can contain variables from cardiac imaging modalities like echocardiography,
magnetic resonance imaging, single-photon emission computed tomography, near-infrared spectroscopy, intravascular ultrasound, and optical coherence tomography.
3.5.4 Diabetes
Diabetes mellitus is defined as a group of metabolic disorders exerting
significant pressure on human health worldwide. Machine learning has been
applied in the field of diabetes mellitus to address issues like prediction and
diagnosis, diabetes complications, genetic background and environment, and
healthcare and management with prediction and diagnosis being the most
popular category. According to the review by Kavakiotis et al. (2017), 85% of
the applications used supervised classification algorithms and 15% clustering
methods. Support vector machines are the most successful and widely used
algorithm.
3.5.5 Obesity
Nowadays obesity researchers have access to a wealth of data. Sensor and
smartphone app data, electronic medical records, large insurance databases
and publicly available national health data provide input that machine learning
algorithms can transform into mathematical models. DeGregory et al. (2006)
show an empirical comparison of logistic regression, artificial neural networks
and classification trees to predict high levels of body fat percentage from
anthropometric predictor variables taken from a sample of more than 25,000
patient records extracted from the National Health and Nutrition Examination
Survey (NHANES) dataset.
3.5.6 Bioinformatics
Larrañaga et al. (2006) review applications of machine learning to different
bioinformatics topics, where bioinformatics is regarded as an interdisciplinary
field that develops methods and software tools for understanding biological
data. The topics covered in the review are genomics, proteomics, systems
biology, text mining and other applications. Filter, wrapper and hybrid feature
subset selection methods are discussed. The reviewed clustering methods
include hierarchical, K-means and probabilistic clustering. All supervised
classification approaches presented in Section 2.4, except rule induction, are
reviewed. Bayesian networks and hidden Markov models are also accounted
for.
FIGURE 3.5
Applications of machine learning in the consumer goods sector.
3.6.1 Automobiles
Americans drive nearly three trillion miles a year, which translates into many hours spent in traffic, and the figure grows significantly when the entire planet is considered. The time spent in traffic is potentially dangerous, considering that
more than 3,000 lives are lost every day and most accidents are due to human
error. Autonomous vehicles have the potential to improve the quality and
productivity of the time spent in cars, increase the safety and efficiency of the
transportation system, and transform transportation into a utility available
to anyone, anytime. This requires technical advances in many aspects of
vehicle autonomy, ranging from vehicle design to control, perception, planning,
coordination, and human interaction. Schwarting et al. (2018) review recent
advances in planning and decision-making for autonomous vehicles, focused on:
(a) how vehicles decide where to go next; (b) how vehicles use the data provided
by their sensors to make decisions with short and long time horizons; (c) how
interaction with other vehicles affects what they do; (d) how vehicles can learn
how to drive from their history and from human driving; (e) how to ensure that
the vehicle control and planning systems are correct and safe; and (f) how to
ensure that multiple vehicles on the road at the same time are coordinated and
managed to most effectively move people and packages to their destinations.
The review contains some approaches to perception based on machine learning
methods, mainly convolutional neural networks and Bayesian deep learning methods (a paradigm at the intersection between deep learning and Bayesian approaches).
food supply chain, and (b) open, collaborative systems in which the farmer
and every other stakeholder in the chain network are flexible about choosing
business partners for both technology and food production.
Based on their previous experiences, Shakoor et al. (2017) develop an
intelligent system for prediction analysis on farming in Bangladesh. The
system suggests area-based beneficial crop rank before the cultivation process.
Six major crops are considered: Aus rice, Aman rice, Boro rice, potato, jute
and wheat. The prediction is made by analyzing a dataset sourced from the
Yearbook of Agricultural Statistics and the Bangladesh Agricultural Research
Council for the above crops according to the area and using classification trees
and k-nearest neighbors as machine learning models.
The impact of fishing on ecology and conservation has to be studied to
gain a better understanding of the behavior of the global fishing fleets in order
to prioritize and enforce fisheries management and conservation measures
worldwide. Satellite-based automatic information systems (S-AIS) are now
commonly installed on most ocean-going vessels and have been proposed as a
novel tool to explore the movements of fishing fleets in near-real time. de Souza
et al. (2016) present approaches to identify fishing activity from S-AIS data
for three dominant fishing gear types: trawl, longline and purse seine. Using
a large dataset containing worldwide fishing vessel tracks from 2011 to 2015,
hidden Markov models were developed to detect and map fishing activities.
Automatic fruit recognition using machine vision is considered a challenging task because fruits come in various colors, sizes, shapes and textures.
Shukla and Desai (2016) recognize nine different classes of fruits. First, the
fruit images are preprocessed to subtract the background and extract the
blob representing the fruit. Then, visual characteristics, combination of colors,
shapes and textures are used as predictor variables. k-nearest neighbors and
support vector machines were applied.
Crop productivity plays a significant role in India’s economy. Vegetables
are grown throughout the year under particular climatic conditions and culti-
vation periods. The vegetables may be affected by bacteria, viruses or insects.
It is important to monitor the crops to control the spreading of disease, and
thus Tippannavar and Soma (2017) propose a machine learning technique for
vegetable leaf identification and abnormality detection from leaf images. The
leaf part is segmented from the background by threshold and morphological
operations, and then texture and color features are extracted by fractal features
and color correlogram, respectively. Two classification tasks, vegetable identification (six different types of vegetables) and disease identification (disordered or normal leaf), are carried out using k-nearest neighbors and artificial
neural networks.
Bakhshipour et al. (2018) classify different classes of black tea, including
orange pekoe, flowery orange pekoe, flowery broken orange pekoe, and pekoe
dust one, from three types of predictor variables (18 color variables, 13 gray
image texture variables, and 52 wavelet texture variables), acquired and pro-
cessed using a computer vision system. Correlation-based feature selection with
neural networks predict chemometric targets such as pH, alcohol and maximum
volume of foam. Also, Li (2017) proposes the backorder prediction of Danish
brewery craft beer in the early stage of the supply chain based on historical
data. These historical data contain information on the orders for the eight
weeks prior to the week to be predicted. Less than one percent of the products
in the dataset went on backorder. This leads to a big imbalance in the binary
classes. k-nearest neighbors, classification trees, logistic regression, support
vector machines and an ensemble of k-nearest neighbors were applied.
Coffee is a plant whose seeds called coffee beans are grown all over the
world, particularly in Ethiopia. There are three major types of coffee disease
affecting the leaves of a coffee plant: coffee leaf rust, coffee berry disease, and
coffee wilt disease. Mengistu et al. (2016) develop an automatic system for the
recognition of these diseases using imaging and machine learning techniques.
More than 9,000 coffee plant images were preprocessed by removing low
frequency background noise, normalizing the intensity of the image, removing
reflection and applying filters to reduce image noise. Genetic algorithms were
applied for multivariate filtering feature selection showing that color features
are generally more relevant than texture features for this recognition problem.
k-nearest neighbors, artificial neural networks, and naive Bayes were the
supervised classification models used.
issues on a job site to which it then attaches a tag indicating whether they
could lead to a potential fatality. Special attention is given to issues related to
the “fatal four” –fall, struck by, caught between, and electrocution– as they
are the causes behind a high percentage of accidents.
The video game market has become an established and ever-growing
global industry in the leisure goods sector. Serious games are one of its most
interesting fields. A serious game should be educational and, at the same time,
fun and entertaining. A serious game is thus designed to be both attractive
and appealing to a broad target audience and to meet specific educational
goals. Frutos-Pascual and García-Zapirain (2017) review the use of decision
making and machine learning techniques in serious games published in journal
papers between January 2005 and September 2014. A total of 129 papers are
reviewed. From the point of view of machine learning, naive Bayes (13 papers),
artificial neural networks (12 papers), k-nearest neighbors (10 papers), and
support vector machines (5 papers) had the highest inclusion rate.
Machine learning-based predictions in the fashion industry have been
carried out by Dadoun (2017) analyzing Apprl’s dataset. Apprl’s network
consists of bloggers and online magazines, monitoring hundreds of thousands
of visitors to online retailers per month, each generating unique pieces of data.
Apprl started collecting this data in 2012 and now stores more than three
million records. The variables in this dataset include the product brand, the
category (shoes, shirts, etc.), the color, currency of the payment, gender of
the client, stock (whether or not the product was in stock), vendor selling the
product, name of the product, product regular price, name of the publisher
who published the product, date when the product was sold, sale amount
by product, number of clicks a product generated, and a popularity rate
calculated by product. The last three variables are the targets to be predicted
by the machine learning model. k-nearest neighbors, classification trees, logistic
regression, random forest and boosting were applied in these three prediction
problems.
accessible cloud analytics. Fig. 3.6 shows a summary of the machine learning
applications in telecommunications.
FIGURE 3.6
Applications of machine learning in the telecommunications sector.
• It should classify millions of messages a day, and the algorithm therefore has
to scale properly.
• It should work with different languages.
• It should be interpretable in order to understand the underlying decision
criteria of the classifier.
• It should detect not only spam, but also other malicious intents such as
phishing attacks.
is also expected to become more complex. One of the main approaches for
improving network management is to deploy self-organizing networks (SONs)
(Klaine et al., 2017). The 3rd Generation Partnership Project and the Next
Generation Mobile Networks association encourage the use of SONs in 3G
and Long Term Evolution (LTE) networks. SONs can automatically take the
necessary actions to maintain network operation at near optimal levels. Taking
the correct actions involves the construction of more intelligent systems that
can be addressed using machine learning techniques. These machine learning
models can be trained using huge amounts of data that are regularly collected
by network operators to monitor the state of the mobile network. Multiple
machine learning techniques have been applied to the SON model:
FIGURE 3.7
Applications of machine learning in the utilities sector.
FIGURE 3.8
Financial applications of machine learning.
FIGURE 3.9
Applications of machine learning in the information technology sector.
3.10.2 Software
Software developers can also improve their results using machine learning
techniques. Zhang and Tsai (2003) provide an overview of the problems in
software engineering that could be tackled using machine learning. Machine
learning can address the following tasks:
3.10.4 Cybersecurity
Cybersecurity is a prominent example of machine learning applications in a
more connected world. In this field, one of the most common solutions is the
use of antivirus software. Kaspersky, a cybersecurity and antivirus provider, is using decision tree ensembles to improve the detection performance of its products. These ensembles are trained in-lab on constantly renewed selections of files. The same company also announced the use of deep learning models to analyze execution logs of suspicious files to detect malware.
Another cybersecurity-related activity is fraud detection in Internet
applications, especially applications that make use of credit card payments
(Dorronsoro et al., 1997; Srivastava et al., 2008). PayPal is one of the most
important companies managing online payments and uses machine learning
techniques, such as artificial neural networks and deep learning. Visa, one
of the largest card payment organizations, is reported to be using machine
learning systems (such as artificial neural networks and other self-improving
algorithms) for fraud detection.
4
Component-Level Case Study: Remaining Useful Life of Bearings
4.1 Introduction
As mentioned in Section 3.3, bearings (see Fig. 4.1) play a major role in
industry and many mechanical processes, and constitute the weakest servomotor
component. The useful life of a bearing depends on many factors like mechanical
load, fatigue, resonance, heat loads, quality of the materials and many more.
If a bearing inside a complex mechanical system fails, the losses in terms of
production line, time and money are substantial. On this ground, accurate
bearing remaining useful life (RUL) estimations are important.
If the RUL of a bearing is known, its replacement can be planned to avoid
possible accidents. If the RUL is incorrectly predicted, though, there are two scenarios, both of which have associated costs: a failure might occur before
the bearing is replaced or the bearing might be replaced earlier than required
leading to an unnecessary stoppage of a mechanical process. Therefore, it is
important to accurately predict the RUL.
Nowadays, vibrational and thermal sensor readings supply data on the
current state of a bearing. From these fingerprints and previous sensor readings,
it is possible to constantly predict bearing RUL. However, the use of the raw
signal may lead to intractable datasets. To overcome this problem, feature
extraction methodologies can be applied to reduce the size of the datasets.
This chapter is organized as follows. In Section 4.2, we outline current data-driven prognosis techniques for ball bearings, together with the bearing
dataset selected for this case study. In Section 4.3, we examine which features
can be extracted from the raw signal, the role of the frequency domain in
sampling vibrational phenomena, how to filter raw signals and how these
filtering techniques can be useful for the feature extraction. We also explain the
chosen degradation model and its parameters, and the theoretical background
of RUL estimation and its assumptions. In Section 4.5, we report and discuss
the results of the proposed model. Finally, the conclusions of this case study
are outlined, together with gaps and future research directions in this area, in
Section 4.6.
FIGURE 4.1
Images of two bearings: (a) healthy bearing; (b) damaged roller bearing
(bottom part). Image by xersti at Flickr.
For this case study, we use HMMs, which are able to represent the bearings' degradation evolution through hidden states, facilitating the interpretation of the estimated parameters. The fact that bearing health is not directly observable fits naturally with the idea of hidden states.
Dataset  Operating conditions  Label       Purpose   # Samples  # Data
1        1                     Bearing1_1  Training      2,803  7,188,231
2        1                     Bearing1_2  Training        871  2,228,889
3        1                     Bearing1_3  Testing       1,802  4,613,120
4        1                     Bearing1_4  Testing       1,139  2,915,840
5        1                     Bearing1_5  Testing       2,302  5,893,120
6        1                     Bearing1_6  Testing       2,302  5,893,120
7        1                     Bearing1_7  Testing       1,502  3,845,120
8        2                     Bearing2_1  Training        911  2,331,249
9        2                     Bearing2_2  Training        797  2,039,523
10       2                     Bearing2_3  Testing       1,202  3,077,120
11       2                     Bearing2_4  Testing         612  1,566,720
12       2                     Bearing2_5  Testing       2,002  5,125,120
13       2                     Bearing2_6  Testing         572  1,464,320
14       2                     Bearing2_7  Testing         172    440,320
15       3                     Bearing3_1  Training        515  1,317,885
16       3                     Bearing3_2  Training      1,637  4,189,083
17       3                     Bearing3_3  Testing         352    901,120
Let us first make some observations about the time and frequency domains.
Suppose that we have a signal f (t) with period T ∈ R. This signal is composed
of several frequency components that contain important information about
it. However, those frequency components are not explicit in f (t), so a trans-
formation is needed to identify and measure them. Therefore, the Fourier
transform is commonly used to perform the decomposition of the signal
in frequency components. Eq. 4.1 defines the Fourier transform $\hat{f}(z)$ of the signal:
$$\hat{f}(z) = \int_{-T/2}^{T/2} f(t)\, e^{-2\pi i z t}\, dt. \quad (4.1)$$
We also assume that $f(t)$ satisfies the condition $\int_{-T/2}^{T/2} |f(t)|^2\, dt < \infty$. As a result, we are sure that $f(t)$ belongs to a special vector space called $L^2([-\frac{T}{2}, \frac{T}{2}])$, the space of square-integrable functions on the interval $[-\frac{T}{2}, \frac{T}{2}]$. This space includes a special set of functions $E = \{e^{2\pi i \frac{n}{T} t}\}_{n \in \mathbb{Z}}$, which can generate $f(t)$:
$$f(t) = \sum_{n \in \mathbb{Z}} c_n e^{2\pi i \frac{n}{T} t}. \quad (4.2)$$
The right-hand side of this equation is known as the Fourier series of f (t).
On the other hand, Euler's formula indicates that $e^{i\theta} = \cos(\theta) + i\sin(\theta)$, which implies that $f(t)$ can be decomposed into periodic functions. The coefficients
{cn }n∈Z denote which periodic functions or frequencies play a more important
role in f (t). Notice that the function f (t) can be reconstructed only if we know
the coefficients {cn }n∈Z . Therefore, these coefficients fully determine f (t).
We want to find the value of the coefficients $\{c_n\}_{n \in \mathbb{Z}}$. The space $L^2([-\frac{T}{2}, \frac{T}{2}])$ has an inner product, $\langle f, g \rangle = \frac{1}{T}\int_{-T/2}^{T/2} f(t)\, g^*(t)\, dt$. According to this inner product, any two different vectors $f, g \in E$ must have the following properties: $\langle f, g \rangle = 0$ and $\langle f, f \rangle = 1$. It follows from the above properties that each coefficient $c_n$ can be expressed as
$$c_n = \langle f(t), e^{2\pi i \frac{n}{T} t} \rangle = \frac{1}{T}\int_{-T/2}^{T/2} f(t)\, e^{-2\pi i \frac{n}{T} t}\, dt. \quad (4.3)$$
Note also that $c_n$ can be computed as the Fourier transform of $f(t)$ evaluated at $n/T$, i.e., $c_n = \frac{1}{T}\hat{f}(\frac{n}{T})$. We conclude that the coefficients $\{c_n\}_{n \in \mathbb{Z}}$ can be computed directly using the Fourier transform. Fig. 4.2 illustrates the relationship between $\hat{f}(z)$ and $f(t)$, showing how $f(t)$ is decomposed into simpler periodic functions.
FIGURE 4.2
The red plot is the signal $f(t)$ and the blue plot is $\hat{f}(z)$. $f(t)$ can be decomposed into simple periodic functions. The amplitude of each simple periodic function determines the relevance of its frequency.
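Purely as an illustrative sketch (the signal and its frequency components below are made up), the coefficients $c_n$ can be approximated numerically with the fast Fourier transform:

```python
import numpy as np

# Hypothetical periodic signal: two sinusoids at 5 Hz and 12 Hz over one period T = 1 s.
T, fs = 1.0, 1000.0                              # period (s) and sampling rate (Hz)
t = np.arange(0.0, T, 1.0 / fs)
f = np.sin(2 * np.pi * 5 * t) + 0.5 * np.sin(2 * np.pi * 12 * t)

# The discrete Fourier transform approximates c_n = f_hat(n/T) / T:
c = np.fft.rfft(f) / len(f)                      # coefficients c_n, n = 0, 1, 2, ...
freqs = np.fft.rfftfreq(len(f), d=1.0 / fs)      # corresponding frequencies n / T (Hz)
dominant = freqs[np.argsort(np.abs(c))[::-1][:2]]
print(dominant)                                  # the 5 Hz and 12 Hz components dominate
```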
FIGURE 4.3
Example of a fast kurtogram. The frequency spectrum is divided into $2^k$ bands, where $k \in \mathbb{N}$ is the level of decomposition. A band-pass filter is applied to each of the $2^k$ bands. Kurtosis is calculated for each filtered signal, and the filter of the signal with the highest kurtosis value is chosen.
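The band-selection idea described in the caption can be sketched as follows. This is a simplified illustration, not the actual fast kurtogram filter bank: the spectrum is split into 2^k equal bands, each band is isolated with a Butterworth filter, and the band whose filtered signal has the highest kurtosis is returned.

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt
from scipy.stats import kurtosis

def best_band(signal, fs, level):
    """Return ((low, high), kurtosis) for the most impulsive band (level >= 1)."""
    n_bands = 2 ** level
    width = (fs / 2.0) / n_bands
    best = (None, -np.inf)
    for k in range(n_bands):
        low, high = k * width, (k + 1) * width
        if k == 0:                                # first band: low-pass filter
            sos = butter(4, high, btype="lowpass", fs=fs, output="sos")
        elif k == n_bands - 1:                    # last band: high-pass filter
            sos = butter(4, low, btype="highpass", fs=fs, output="sos")
        else:                                     # intermediate bands: band-pass filter
            sos = butter(4, [low, high], btype="bandpass", fs=fs, output="sos")
        filtered = sosfiltfilt(sos, signal)
        kurt = kurtosis(filtered)
        if kurt > best[1]:
            best = ((low, high), kurt)
    return best
```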
The main problem with the Fourier transform is that it assumes that the input signal is stationary, that is, that its intrinsic parameters are always constant. However, this assumption does not hold in real-world applications. This has led to the development of time-frequency representations, for example, the short-time Fourier transform (STFT), which for a signal $f(t)$ and a window function $w$ is
$$\tilde{f}(t, z) = \int_{-\infty}^{\infty} f(\tau)\, w(\tau - t)\, e^{-2\pi i z \tau}\, d\tau.$$
This transform should extract the frequencies at
each point in time, as shown in Fig. 4.4. When a non-stationary signal is being
studied, its frequencies will change over time. The STFT can identify and
track these frequencies.
FIGURE 4.4
The Fourier transform is computed for short times, as a result of which we get
the frequencies over time. (a) A stationary signal, whose frequencies do not
change over time. (b) A non-stationary signal, whose principal frequencies
change over time.
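A hedged sketch of how the STFT tracks time-varying frequencies, using scipy; the sampling rate and the frequency sweep below are illustrative, not taken from the bearing data.

```python
import numpy as np
from scipy.signal import stft

fs = 8000.0                                     # illustrative sampling rate (Hz)
t = np.arange(0.0, 10.0, 1.0 / fs)
# Non-stationary signal: instantaneous frequency sweeps from 1000 Hz to 3000 Hz in 10 s.
x = np.sin(2 * np.pi * (1000.0 + 100.0 * t) * t)

freqs, times, Zxx = stft(x, fs=fs, nperseg=1024)
dominant = freqs[np.abs(Zxx).argmax(axis=0)]    # dominant frequency of each time slice
# "dominant" increases roughly linearly with "times", tracking the sweep.
```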
RUL value will thus be E[RUL] = E[D], where E is the expectation operator. The random variable D can be decomposed into its h hidden states. Suppose also that, in each trial of our machine, we can determine the time spent in each of the hidden states. Therefore, each state will have a random variable D_i. Then the decomposition is D = Σ_{i=1}^{h} D_i, and the expected RUL value at time t can be expressed in terms of the random variables D_i^t, where D_i^t is defined as D_i^t = D_i − D_i(t). D_i^t expresses the difference between the total time for state i (D_i) and the time spent in state i up to timepoint t (D_i(t)). If we also assume that each D_i ∼ N(µ_{D_i}, σ_{D_i}), then we can deduce that D_i^t follows a Gaussian distribution with mean µ_{D_i} − D_i(t) and standard deviation σ_{D_i}, truncated to the left of zero (Le et al., 2015).
To conclude, the expected RUL value at time t given that the current hidden
state is i can be computed according to:
    E[RUL(t)] = µ_{D_i} − D_i(t) + Σ_{j=i+1}^{h} µ_{D_j}.    (4.6)
    E[RUL_b(t)] = µ_{D_i} − D̄_i + Σ_{j=i+1}^{h} µ_{D_j} − η Σ_{k=i}^{h} σ_{D_k}.    (4.8)
These bounds consider the deviations from the mean time duration of each hidden state. Since Eq. 4.7 overestimates and Eq. 4.8 underestimates the RUL, a confidence parameter η (Tobon-Mejia et al., 2012) is added to both equations. For this case study, this parameter is set to η = 0.5.
In Eq. 4.7 – Eq. 4.8 we need to find µ_{D_i} and σ_{D_i} for each hidden state. It is here that we use the learned HMMs. Suppose that we have M training datasets. For each dataset, we use the Viterbi algorithm to estimate the hidden states and, therefore, calculate the duration d_{ij} for each hidden state i and
every dataset j. We can estimate µ_{D_i} and σ_{D_i} using Eq. 4.9 and Eq. 4.10, respectively:

    µ̂_{D_i} = (1/M) Σ_{j=1}^{M} d_{ij},    (4.9)

    σ̂_{D_i} = sqrt( (1/(M − 1)) Σ_{j=1}^{M} (d_{ij} − µ̂_{D_i})² ).    (4.10)
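Once the state durations d_{ij} have been obtained with the Viterbi algorithm, Eq. 4.9, Eq. 4.10 and Eq. 4.6 are straightforward to compute; the array layout below is an assumption of this sketch:

import numpy as np

def duration_stats(durations):
    # durations[i, j]: duration of hidden state i in training dataset j (shape (h, M)).
    durations = np.asarray(durations, dtype=float)
    mu_hat = durations.mean(axis=1)               # Eq. 4.9
    sigma_hat = durations.std(axis=1, ddof=1)     # Eq. 4.10 (1/(M-1) normalization)
    return mu_hat, sigma_hat

def expected_rul(mu_hat, state_i, time_in_state):
    # Eq. 4.6: remaining mean time in the current state i plus the mean
    # durations of the states still to come.
    return (mu_hat[state_i] - time_in_state) + mu_hat[state_i + 1:].sum()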
We will decode the hidden states. The third entry of π2 has probability
one and is recognized as the good state. We find that the highest RMS value
in the µ2 vector corresponds to the second entry; hence it is recognized as the
bad state. We conclude, then, that the third entries represent the good state,
the first entries denote the fair state and the second entries correspond to
the bad state.
Eq. 4.13 shows the learned parameters for operating condition 3. We would expect the bearing state to always be good at the start time. However, looking at vector π3, there is no entry with probability one. Therefore, with this criterion, none of the states can be classified as the good state. Let us
check vector µ3 to determine the good state. We pick the entry with the
smallest RMS value as the good state and the entry with the highest value
as the bad state. Since the first entry has the smallest value, it is recognized
as the good state. On the other hand, the second entry has the highest value
and is therefore decoded as the bad state, with the third entry being the fair
state.
Note that this codification is necessary for RUL estimation since we need
an ordered set of states, indicating, for each pair of states, which represents a
better state for the bearing.
    RUL relative error = (1/T) Σ_{t=1}^{T} w(t) |RUL(t) − R̂UL(t)| / RUL(t),    (4.14)
TABLE 4.3
Relative error for each test dataset
where w(t) = sin(πt/T), and T satisfies RUL(T) = 0. The use of the Hann window avoids overpenalizing the relative error when RUL(t) is close to 0.
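A small sketch of Eq. 4.14; the final timepoint, where RUL(T) = 0, is excluded explicitly to avoid a division by zero (its weight is zero in any case):

import numpy as np

def rul_relative_error(rul_true, rul_pred):
    rul_true = np.asarray(rul_true, dtype=float)
    rul_pred = np.asarray(rul_pred, dtype=float)
    T = len(rul_true)
    w = np.sin(np.pi * np.arange(1, T + 1) / T)     # w(t) = sin(pi * t / T)
    valid = rul_true > 0                            # skip the final point, where RUL(T) = 0
    errors = w[valid] * np.abs(rul_true[valid] - rul_pred[valid]) / rul_true[valid]
    return errors.sum() / T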
Fig. 4.5(a) shows the results for operating condition 1, where the estimated
RUL value slightly overestimates the actual RUL value. As shown in Table 4.3,
the relative error is 10.67%.
Fig. 4.5(b) shows the results for operating condition 2, where predicted and actual RUL values are very close throughout the bearing life. This planning strategy would cost almost no money or time, as the relative error is very low.
The RUL prediction for operating condition 3 is shown in Fig. 4.5(c). In this case, the estimated RUL greatly overestimates the actual value, especially from t = 0 s to t = 700 s. At t = 700 s, the estimated RUL values suddenly decrease and get closer to the actual RUL values up to the end. The actual RUL values never lie within the confidence interval, which is undesirable, producing a relative error higher than 100%, as shown in Table 4.3. Notice that a relative error higher than 100% can only be obtained if the prediction is overestimated.
A larger number of training datasets would allow a better estimation of the transition matrix and of the parameters µ̂_{D_i} from Eq. 4.9 and σ̂_{D_i} from Eq. 4.10. In this case, we only had two training datasets for each operating condition, which does not seem enough to obtain an accurate estimation in some cases. However, we have found that RUL prediction can be accurate and help with prognosis.
transitions and the observations can occur at arbitrary continuous times. This
is a reasonable assumption in the case of the health state of a bearing. Moving
away from HMMs, Cartella et al. (2015) use hidden semi-Markov models,
where the time duration of the hidden states is relevant for the model and
can modify the transition matrix A. As a result, they modify the forward-
backward and Viterbi algorithms to take into account these state durations.
On the other hand, a bearing may fail due to different failure modes, and each
failure mode causes a different degradation process. Le et al. (2015) develop
a multibranch hidden semi-Markov model in which each branch contains a
HMM that represents a failure mode.
Clearly, the research trend in this field is towards weakened Markovian assumptions with the modification of the EM, Viterbi and forward-backward algorithms. The idea is to model more general and realistic situations where the pure Markovian property may lead to inconsistencies. Another important issue is to determine the failure mode online in order to get a more precise model describing degradation, where it would be useful to combine filtering methods for diagnosis (the kurtogram, for example) with prognosis methods.
For this case study we have only used the RMS values to learn the model
parameters. Since only one feature has been used to train the HMMs, model
parameters are not difficult to interpret and the model configuration is clear;
however, if more time or frequency features (which can be extracted as shown
in Fig. 4.6) are used, the model and its parameters can be harder to understand
and interpret. Nevertheless, more insights might be obtained, leading to more
accurate and informative models.
FIGURE 4.5
RUL estimations for the three test datasets selected, one per operating condition. (a) Operating condition 1. (b) Operating condition 2. (c) Operating condition 3. The blue line represents the true RUL value at each timepoint and the yellow line is the expected RUL value predicted from the learned model. The red and black lines stand for the upper and lower bounds of the RUL confidence interval, respectively.
[Diagram: the sampled signal f(t) is filtered (using, e.g., spectral kurtosis) and Fourier transformed; time features h1, ..., hn and spectral features g1, ..., gk are extracted.]
FIGURE 4.6
A diagram of an extended modeling procedure including some frequency features. The idea is to use not only time features such as RMS to train a HMM, but also features from the signal spectrum. In the above diagram, the output from spectral kurtosis is taken into account, and more sophisticated signal processing techniques can be used to remove the noise from the signal. Once the signal has been processed, g1, . . . , gk features are extracted from the spectrum. For example, the amplitudes of the frequencies from the BPFO, BPFI, FTF, BSF and their harmonics can be used as features. Also, the HMM can be replaced with hidden semi-Markov models or autoregressive HMMs.
5
Machine-Level Case Study: Fingerprint of
Industrial Motors
5.1 Introduction
Nowadays, general-purpose performance models for products like pumps, servomotors or spindles are developed based on theory, experience or laboratory conditions, because there is no way of knowing exactly where they will be installed and what their specific operating conditions will be. Besides, these types of products are developed to work under a wide range of operating conditions, where performance is defined as nominal behavior.
At machine level, the fourth industrial revolution provides for the use of data
from different manufacturing assets to gather useful knowledge. This knowledge
can be used to compare the performance of similar parts, e.g., the effect of
a specific boundary condition on throughput time or the effect of operating
conditions on degradation. Additionally, this feature-based knowledge can
be extrapolated from machines to other levels, like machine components at
the bottom level and manufacturing plants at the top level.
Lee et al. (2014) explained that there are similarities between machines per-
forming the same tasks at the same maintenance level, where health conditions
and performance may be similar, leading to a potentially useful pattern.
Therefore, Industrial Internet of Things (IIoT) technologies are capable
of moving away from theory-based or laboratory models to real operating
data-based models. The insights gained from this approach could be useful for
comparing information on anything from machine performance to maintenance,
thus having a direct positive impact on overall plant utilization resulting in
greater productivity.
To illustrate this conceptualization, the case study reported here focuses
on machine components, specifically, machine tool axis servomotors. The
function of this component is to drive the machine tool. Consequently, it
suffers high levels of jerk related to the positioning control systems. Depending
on its usage, this situation leads to high levels of stress that could produce
premature degradation of the internal components. Additionally, servomotors
are rotating machinery, as a result of which extrapolation to other rotation-
based components, e.g., spindles or pumps, is straightforward.
[Plots: (a) servomotor mean torque (N · m) and (b) servomotor mean temperature (°C) vs. time (day), over 14 days.]
FIGURE 5.1
Servomotor values measured during real operation with nominal value 27 N · m at 100°C. (a) Servomotor mean torque. (b) Servomotor mean temperature.
[Diagram: the fingerprints of component 1 and component 2 are combined into a consensual component fingerprint.]
FIGURE 5.2
Generalized component performance fingerprint approach.
This case study focuses on servomotors used for positioning the machine tool axis.
the resulting methodology is easily extrapolated to other machine systems,
like spindles or motors. This is possible because the shaft and bearings are
the main mechanical parts subject to degradation inside rotative machinery
(Fig. 5.3).
A common issue related to servomotors is their performance when brand
new. It is important to know their starting conditions to infer if they are
reliable enough to be installed. For this reason, as described by Siddique
et al. (2005), the performance of the internal rotating components of electrical
motors is studied under laboratory conditions to provide a clear picture.
FIGURE 5.3
(a) Shaft and bearing assembly. (b) Bearing composition.
    FTF = (1/2) (Ω/60) (1 − (Bd/Pd) cos θ),    (5.1)

    BPFI = (Nb/2) (Ω/60) (1 + (Bd/Pd) cos θ),    (5.2)

    BPFO = (Nb/2) (Ω/60) (1 − (Bd/Pd) cos θ),    (5.3)

    BSF = (Pd/(2Bd)) (Ω/60) (1 − ((Bd/Pd) cos θ)²).    (5.4)
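These characteristic frequencies follow directly from the bearing geometry; a small helper, with Ω given in rpm and θ in radians (the function name is an assumption of this sketch):

import math

def bearing_frequencies(omega_rpm, Nb, Bd, Pd, theta=0.0):
    # Characteristic bearing defect frequencies (Hz) from Eq. 5.1 - Eq. 5.4.
    fr = omega_rpm / 60.0                      # shaft rotation frequency (Hz)
    ratio = (Bd / Pd) * math.cos(theta)
    return {
        "FTF": 0.5 * fr * (1 - ratio),
        "BPFI": (Nb / 2.0) * fr * (1 + ratio),
        "BPFO": (Nb / 2.0) * fr * (1 - ratio),
        "BSF": (Pd / (2 * Bd)) * fr * (1 - ratio ** 2),
    }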
1 http://www.iiconsortium.org/smart-factory-machine-learning.htm
FIGURE 5.4
(a) Industrial Internet Consortium testbed. (b) Data acquisition system using
a cyber-physical system.
TABLE 5.1
Servomotor specifications2
Specification Value
Rated speed 3,000 rpm
Static torque 3.0 N · m
Stall current 2.2 A
Rated torque 2.6 N · m
Rated current 2.0 A
Rotor moment of inertia 2.9 kg · m²
TABLE 5.2
Ball bearing ref. 6204 specifications with Bd = 0.312 mm, Nb = 8 balls, Pd = 1.358 mm and θ = 0°
NC-Code
INI                               ; label marking the start of the test cycle
G01 X5000 Y5000 Z5000 F83120      ; linear move of the three axes to position 5000 at feed rate F83120
G01 X0 Y0 Z0                      ; linear move back to the origin
GOTO INI                          ; repeat the cycle from the label
M30                               ; end of program
The accelerometers used to collect the bearing and shaft vibration signal
have a nominal sensitivity of 100 mV/g and a frequency range of 0.2 Hz to
10,000 Hz in ±3 dB. Additionally, power and temperature values are gathered
directly from the NCU, which is a SIEMENS SINUMERIK 840D. This NCU
stores values from variable memory spaces in specific databases where they
are collected by the acquisition system.
FIGURE 5.5
Accelerometer dashboard: time-based signal and fast Fourier transform.
For this case study, we built a dataset covering one week’s operation to get
a representative number of instances. The size is 1,462,585 instances by a total
of 39 variables, 13 per servomotor:
• Angular speed, Ω.
• Power, P .
• Torque, τ .
• Vibration amplitude: A_shaft, A_FTF, A_BPFI, A_BPFO, A_BSF.
• Vibration frequencies: F_shaft, FTF, BPFI, BPFO, BSF.
• Linkage method: the linkage criterion is the between-cluster distance used by the algorithm to decide which clusters to merge. We use Ward's method, which computes the dissimilarity between two clusters, Cl_i and Cl_j, as the difference between the sum of squared distances to the centroid in their joint cluster and the sum of the squared distances to the centroids within the two separate clusters.
• Number of clusters K: selected according to the experts' opinion.
• Connectivity matrix and distance metric for computing the linkage: with Ward's method, the distance must be Euclidean.
The main parameters for the AP algorithm are preference and damping.
• Initialization method for the GMM components, with two options:
  – Random, where the random initialization yielding the least squared error is taken.
  – K-means, where the K-means algorithm is used to get the initial components.
• Number of components, chosen as described for the K-means algorithms.
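A hedged sketch of how the five algorithms and the MDS projection could be run with scikit-learn; the data array, the subsample size and any parameter value not discussed above are assumptions of this sketch:

import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import (KMeans, AgglomerativeClustering,
                             SpectralClustering, AffinityPropagation)
from sklearn.mixture import GaussianMixture
from sklearn.manifold import MDS

X = np.random.rand(500, 13)                 # placeholder: a subsample of the 13 variables
X = StandardScaler().fit_transform(X)

K = 3
labels = {
    "kmeans": KMeans(n_clusters=K, n_init=10).fit_predict(X),
    "agglomerative": AgglomerativeClustering(n_clusters=K, linkage="ward").fit_predict(X),
    "spectral": SpectralClustering(n_clusters=K).fit_predict(X),
    "gmm": GaussianMixture(n_components=K, init_params="kmeans").fit_predict(X),
    "affinity": AffinityPropagation(damping=0.9).fit_predict(X),   # damping value assumed
}

embedding = MDS(n_components=2).fit_transform(X)   # 2-D projection for the MDS-fingerprint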
TABLE 5.5
X-axis servomotor cluster centroids

Cluster  P (W)  T (°C)  A_FTF (g)  A_shaft (g)  A_BSF (g)  A_BPFO (g)  A_BPFI (g)
0        4.5    36.3    0.0003     0.0014       0.0009     0.0004      0.0005
1        23.3   35.6    0.0003     0.0014       0.0009     0.0005      0.0005
2        20.3   38.0    0.0004     0.0012       0.0009     0.0004      0.0005
Point clouds shown in Fig. 5.6 – 5.8 may be referred to as the servomotor MDS-fingerprint, which represents how good their test cycle performance is with respect to each of their variables. Larger point distances to the cluster would denote anomalous servomotors. The distance threshold must be defined after sufficient testing with more servomotors of the same reference. Nevertheless, this approach is outside the scope of this chapter.
After running the five clustering algorithms, we found that the cluster
shapes are similar regardless of the algorithm. Within the denser data region
in particular, there are three predominant clusters with a definite shape and
distribution showing three different servomotor behaviors.
From the engineering point of view, these three behaviors could be
directly related to the servomotor states during operation: idle, accelera-
tion/deceleration and constant speed. Therefore, these three behaviors are
defined as servomotor clusters.
However, the differences between centroids are more noticeable even if they
are in the same area. K-means and agglomerative algorithms show similarities
for some K values. The spectral clustering and GMM centroid positions for
K = 3 are similar too. Nevertheless, centroids are concentrated in the middle
of the instance cloud.
For affinity propagation, the algorithm automatically detects nine different clusters using the parameters described in Section 5.3.6. Shapes and centroid positions are similar, but this algorithm is highly parameter-sensitive: small changes to the parameters, especially the preference, may cause radically different results.
We selected K = 3 and the agglomerative algorithm to illustrate the analysis
of the clustering results in order to study the behavior of each servomotor.
Additionally, power consumption and shaft vibration were the variables selected
for this purpose. Both variables provided interesting information about motor
performance. However, other combinations could be selected depending on
needs.
Results are shown in Fig. 5.9, where red (Cluster 0), green (Cluster 1) and
blue (Cluster 2) clusters stand for different levels of power, validating the three
servomotor clusters detected using the MDS.
For further analysis, centroid coordinates are shown in Tables 5.5 – 5.7.
Analyzing each of the three clusters, we find that:
FIGURE 5.6
MDS for agglomerative hierarchical and K-means algorithm with different
values of K. (a) Agglomerative with K = 3. (b) K-means with K = 3. (c)
Agglomerative with K = 5. (d) K-means with K = 5. (e) Agglomerative with
K = 7. (f) K-means with K = 7.
FIGURE 5.7
MDS for spectral clustering and GMM algorithm with different values of K.
(a) Spectral clustering with K = 3. (b) GMM with K = 3. (c) Spectral
clustering with K = 5. (d) GMM with K = 5. (e) Spectral clustering with
K = 7. (f) GMM with K = 7.
FIGURE 5.8
MDS for affinity propagation algorithm.
TABLE 5.6
Y-axis servomotor cluster centroids
Cluster  P (W)  T (°C)  A_FTF (g)  A_shaft (g)  A_BSF (g)  A_BPFO (g)  A_BPFI (g)
0        5.41   36.6    0.0002     0.0022       0.0052     0.0004      0.0030
1        32.8   38.6    0.0002     0.0023       0.0050     0.0004      0.0029
2        21.1   36.9    0.0002     0.0018       0.0045     0.0003      0.0026
TABLE 5.7
Z-axis servomotor cluster centroids
Cluster  P (W)  T (°C)  A_FTF (g)  A_shaft (g)  A_BSF (g)  A_BPFO (g)  A_BPFI (g)
0        4.6    33.4    0.0002     0.0022       0.0058     0.0008      0.0023
1        24.1   32.8    0.0002     0.0023       0.0057     0.0008      0.0022
2        21.6   34.7    0.0002     0.0018       0.0050     0.0007      0.0020
FIGURE 5.9
Power vs. shaft vibration. (a) X-axis servomotor. (b) Y-axis servomotor. (c)
Z-axis servomotor.
6
Automated Visual Inspection of a Laser Process
6.1 Introduction
One of the main opportunities that machine learning offers to the smart facto-
ries of the future is the possibility of analyzing large amounts of data output by
manufacturing activities while they are underway. The outcome of this analysis
will be to identify patterns enabling the detection of unwanted situations and
anomalies in industrial processes. The aim is to improve production quality
by automatically pinpointing possibly defective manufactured products, so that they can be set aside and revised as soon as they have been produced and before the end of the manufacturing process. This is known as in-process quality control.
Although visual inspection and quality control were traditionally performed
by human experts, automated visual inspection (AVI) systems are being
studied and used more and more often in manufacturing processes in order
to enhance automation (Golnabi and Asadpour, 2007). Malamas et al. (2003)
noted that, even though they are better than machines at visual inspection
and quality control in many situations, human experts are slower, get tired, are
inconsistent, are unable to simultaneously account for a lot of variables, and
are hard to find, train and maintain. Additionally, there are very demanding
situations in manufacturing caused by fast or repetitive analysis requirements
or hazardous environments where computer vision may effectively replace
human inspection.
In typical industrial AVI systems like the one shown in Fig. 6.1, a fixed
camera (or several cameras) with sufficient illumination captures images of a
scene under inspection. These raw images are then preprocessed to remove noise,
background or unwanted reflections. At this point, a set of features containing
key information, such as the size, position or contour of objects, or specific
measurements of certain regions, are extracted from the preprocessed images.
These features are known in advance, and the position of the camera and the
illumination of the scene are arranged in order to optimize their perception.
Machine learning techniques are then applied in order to analyze the extracted
features and make decisions that are communicated to the manufacturing
process control systems for their execution. The feature extraction and analysis
FIGURE 6.1
Typical industrial AVI system.
tasks are performed with software (SW) built ad-hoc for specific applications
because no industrial machine vision system is capable of performing all
analysis tasks in every application field (Malamas et al., 2003). The software
is also programmed in application-specific hardware (HW), such as digital
signal processors, application-specific integrated circuits, or FPGAs, capable of
operating in highly time constrained and computationally intensive processes.
In smart factories, these specialized processors will be the heart of the CPS.
In this case study, we report an AVI system for the in-process quality
control of the laser surface heat treatment of steel cylinders1 . Several works
have highlighted how inspection methods can be based on the output of the
monitoring of laser processes with high-speed thermal cameras, since the
recorded emissions provide information about the stability and dynamics of the
process (Alippi et al., 2001; Jäger et al., 2008; Atienza et al., 2016; Ogbechie
et al., 2017). Thus, any anomalous sequences that are recorded are related to
defects during the laser surface heating process.
In the construction of the AVI system, however, we found that the only
available examples were from correctly processed cylinders. This scenario is
very common in manufacturing inspection applications because significant
aberrations rarely occur in efficient industrial processes (Timusk et al., 2008).
This makes it difficult to train automated systems that rely on statistical
learning because they require datasets with examples of faulty situations,
balanced, whenever possible, against examples of normal conditions (Jäger
et al., 2008; Surace and Worden, 2010). When errors have not been reported
during the training stage, the classification task of discerning between normal
and anomalous products can be performed using one-class classification.
One-class classification is an anomaly detection technique used in machine
learning to solve binary classification problems when all the labeled examples
belong to one of the classes (Chandola et al., 2009).
Additionally, a requirement that is in growing demand in real-life problems
is to have interpretable machine learning models. Here, good model accuracy
is not enough: the model and its operation have to be understandable and the
outcomes and the patterns learned by the machine to yield those outcomes have
to have an explanation. With this meaningful model, decision makers will have
a reliable tool for making their choices. In summary, following the color-coded
levels of interpretability proposed by Sjöberg et al. (1995), we should steer
clear of blackbox models (Section 2.1) and look for more transparent, so-called
gray-box, models capable of providing an interpretation of what the machine
has automatically learned from data. At the other end of the scale, we have
white-box models that are based purely on prior theoretical knowledge (like
a system of differential equations). They are useful only in very controlled and
constrained scenarios.
In order to meet the interpretability requirement, we also aim in this case
study to test if the automatically learned model is capable of capturing the
1 An exhaustive description of the laser process is given in Gabilondo et al. (2015).
FIGURE 6.2
TTT curve with a possible cooling trajectory of a hardening process.
the cost. On the one hand, electron beam needs an inert gas atmosphere with
a relatively small process chamber and expensive peripheral equipment. On
the other hand, laser beam is able to work without any special atmospheric
requirement, and is a very promising technology for industrial applications.
Even though the dynamics of thermal processes are relatively slow, electron
and laser beam are high-density energy sources that induce fast heating-cooling
cycles. For this reason, high-speed thermal cameras that record the generated
radiation are the key technology used to monitor these processes. Thus, as
a result of the combination of fast cycles, requiring data collection in short
sampling times and thermal cameras that provide multidimensional data, an
AVI system for a beam-based heat treatment will have to analyze large amounts
of information. This increases the computational power required by the system
and jeopardizes its capability of providing on-time feedback. CPSs are able to
handle this situation because of their embedded processing capabilities (Baheti
and Gill, 2011).
The laser beam produced a heat-affected zone (HAZ) on the surface of the cylinder that was monitored by a fixed thermal camera. The camera used in the experiment was a high-speed thermal camera with a recording rate of 1,000 frames per second and a region of interest of 32 × 32 pixels, each of which could have 1,024 different colors (10 bits per pixel) proportional to the temperature reading. The field of view of the images was approximately 350 mm², while the area of the laser beam spot was 3 mm². The spot moved along the cylinder width, producing a 200 mm² HAZ. One full rotation of the surface of each cylinder took 21.5 seconds. Therefore, sequences of 21,500 frames were output for each processed cylinder.
A sample image of the HAZ produced during the normal process is shown
in Fig. 6.4(a), where the laser spot is noticeable at the top right of the image
(green circle). The spot was programmed to move along the steel surface
according to a pattern, as represented in Fig. 6.4(b). This pattern was repeated
at a frequency of 100 Hz. Therefore, the camera captured approximately 10
frames per cycle.
Nevertheless, sequence analysis was subject to other difficulties because
the process was not stationary. On the one hand, there was a two-second
(2,000-frame) thermal transient at the beginning of the process until the HAZ
reached a high enough temperature because the cylinders were initially at
room temperature (see Fig. 6.4(c)). On the other hand, the spot pattern was
modified for approximately four seconds (4,000 frames) in order to avoid an
obstacle on the surface of the steel cylinders. The pattern variants are shown
in Fig. 6.5.
During the experiment, no anomalies were detected in the 32 processed
cylinders, and they were considered normal. This is very common in mass-
production industries, where machines are expected to manufacture thousands
of error-free workpieces every day without stopping. For this reason, experts
decided to simulate two different defects in the 32 normal sequences in order
to assess the response of the AVI system to anomalies2 :
• Defect in the laser power supply unit (negative offset): The laser scanner
control was in charge of adjusting the energy that the beam deposited on
the HAZ. A failure in the power supply unit could prevent a high enough
temperature from being reached to correctly treat the surface of the steel
cylinders. This was simulated by introducing a negative offset on the pixel
values. The value of the negative offset was set first to 3.5% and then to
4% of the pixel value range (36 and 41 units, respectively).
• Camera sensor wear (noise): The camera was operating in dirty conditions
due to heat, sparks and smoke that gradually stained or deteriorated the
sensors, producing noise. This situation was simulated by adding Gaussian
noise centered on the real pixel values. The standard deviation of this noise
was set to 2.5% of the pixel value range (26 units).
2 The percentages selected for simulating defects in the correctly processed steel cylinders
FIGURE 6.3
The diagram shows the physical arrangement of the different elements used to
carry out and monitor the laser surface heat treatment of the steel cylinders
(Diaz et al., 2016). The laser beam (dashed-red line) hits the surface of the
rotating steel cylinder. The area of the laser beam spot was smaller than the
width of the cylinder. Therefore, it moved very fast according to a predefined
pattern in order to heat the whole surface of the cylinder. This movement
produced a heat-affected zone (HAZ) that was recorded by a fixed high-speed
thermal camera (blue continuous line).
FIGURE 6.4
(a) An illustrative frame taken from the HAZ by the high-speed thermal
camera during the laser process. (b) The pattern that the spot traced to
produce the HAZ in normal conditions as defined by Gabilondo et al. (2015)
in the U.S. patent property of Etxe-Tar, S.A. The numbers indicate the order
in which the different segments of the pattern were formed. (c) Thermal
transient in the HAZ at the beginning of the process.
FIGURE 6.5
During the heat treatment, the spot was programmed to avoid an obstacle on
the surface of the cylinders. There were three different variants of the spot
movement pattern depending on the position of the obstacle, namely, when it
was at the top (a), in the middle (b) or at the bottom (c) of the HAZ.
FIGURE 6.6
Example of anomalies (a_i) and normal instances grouped by their corresponding class labels (C = c_i) in a two-dimensional dataset. We can distinguish the multiclass classification (a) and the one-class classification (b) scenarios.
An anomaly score is then assigned to new unseen test data, where a large anomaly score means a higher degree of “anomaly” with respect to the normality model. Finally, an anomaly threshold has to be defined to establish the decision boundary, such that new examples are classified as “anomalous” if their anomaly score is higher than the anomaly threshold, or “normal” otherwise.
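In its simplest form, this decision rule can be written as follows; the rule of taking the score of the least normal training example as the threshold is the one used later for the AVI system, and the function names are assumptions of this sketch:

import numpy as np

def fit_threshold(train_scores):
    # Anomaly threshold: the anomaly score of the least normal training example.
    return np.max(train_scores)

def classify(test_scores, threshold):
    # True = anomalous (positive class), False = normal (negative class).
    return np.asarray(test_scores) > threshold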
Parametric approaches
State-space models are often used for carrying out anomaly detection when dealing with time-series data, with hidden Markov models (HMMs) (see Section 2.6.3) being one of the most common approaches. Anomaly detection with HMMs has normally been performed by establishing a threshold on the likelihood of an observation sequence given the normality model (Yeung and Ding, 2003), or by explicitly defining an “anomalous” state in the HMM (Smyth, 1994). For example, in Jäger et al. (2008), HMMs were used for detecting unusual events in image sequences recorded from an industrial laser welding process.
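A hedged sketch of likelihood-based thresholding with an HMM normality model, assuming the hmmlearn package and placeholder feature sequences:

import numpy as np
from hmmlearn.hmm import GaussianHMM

train_seqs = [np.random.randn(200, 4) for _ in range(10)]   # placeholder normal sequences
model = GaussianHMM(n_components=3, n_iter=20).fit(
    np.vstack(train_seqs), lengths=[len(s) for s in train_seqs])

# Per-frame log-likelihood of each training sequence under the normality model.
train_ll = np.array([model.score(s) / len(s) for s in train_seqs])
threshold = train_ll.min()                                   # least normal training sequence

def is_anomalous(seq):
    return model.score(seq) / len(seq) < threshold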
Non-parametric approaches
The most widely used non-parametric technique in anomaly detection is kernel density estimation (KDE), which builds the probability density function of the normal behavior of a system by placing (typically Gaussian) kernels on each data point and aggregating them. For example, in Atienza et al. (2016), KDEs were used for detecting anomalies in a laser surface heat treatment recorded with a high-speed thermal camera by tracking the movement of the laser beam spot.
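For example, scikit-learn's KernelDensity can score test points against the density estimated from normal data; the bandwidth and the data are assumptions of this sketch:

import numpy as np
from sklearn.neighbors import KernelDensity

X_train = np.random.randn(1000, 2)                 # placeholder normal-behavior features
kde = KernelDensity(kernel="gaussian", bandwidth=0.5).fit(X_train)

# Low log-density under the normality model means a high anomaly score.
anomaly_score = -kde.score_samples(np.array([[0.1, -0.2], [8.0, 8.0]]))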
Clustering approaches
Clustering algorithms characterize the normal behavior of a system with a reduced number of prototype points in the space of attributes. Then, the distance of a test instance to its nearest prototype helps to discriminate whether it is a “normal” or an “anomalous” point. The different clustering-based algorithms differ in how they define these prototypes, K-means being the most widely used algorithm for data streams. An advantage of clustering-based algorithms is that they only need to store the information of the prototype points rather than the complete training dataset. Additionally, they allow incremental models to be built, where new data points can constitute a new cluster or change the properties of existing prototypes. However, as in nearest-neighbor approaches, clustering algorithms suffer with high-dimensional data, and their performance highly depends on the proper selection of the number of clusters. In Zeng et al. (2008), a clustering-based algorithm was used to extract key-frames from news, entertainment, home and sports videos.
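A minimal clustering-based scoring sketch, where the anomaly score of a test point is its distance to the nearest K-means prototype (the number of clusters and the data are assumptions):

import numpy as np
from sklearn.cluster import KMeans

X_train = np.random.randn(1000, 2)                 # placeholder normal data
km = KMeans(n_clusters=5, n_init=10).fit(X_train)

def anomaly_score(x):
    # Distance to the nearest prototype (cluster centroid).
    return np.min(np.linalg.norm(km.cluster_centers_ - x, axis=1))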
Reconstruction approaches
Reconstruction-based techniques learn a model that reproduces the normal behavior of the system and compare each test instance with its reconstruction in order to identify outliers, which show a large reconstruction error, i.e., there is a wide distance between the test instance and the output generated by the model. It is important to note that reconstruction-based techniques rely on several parameters that define the structure of the model, and these need to be very carefully optimized, since solutions are very sensitive to them.
Artificial neural networks (Section 2.4.6) are the most widely used reconstruction-based models, and they have been successfully applied to numerous anomaly detection applications (Markou and Singh, 2003). For example, in Markou and Singh (2006), a novelty detection method based on artificial neural networks was used for analyzing image sequences. Additionally, authors like Newman and Jain (1995) or Malamas et al. (2003) have noted that artificial neural networks are very appropriate for AVI applications. A recent example can be found in Sun et al. (2016), where artificial neural networks were employed to automatically inspect thermal fuse images.
FIGURE 6.7
Schematic flowchart of the AVI system. In this one-class classification scenario,
the data acquired from the laser process corresponded to its normal behavior
only. Thus, we added simulated defects to the dataset. Then the dataset was
preprocessed to extract meaningful features from the images and divided into
training and test sets. Only normal sequences were available in the training
set, whereas the test set included both normal and anomalous sequences. The
preprocessed training normal image sequences were used to learn the
normality model with DBNs. Then, their anomaly score (AS) was calculated
and used to establish the anomaly threshold (AT) by selecting the least
normal example. Afterwards, the AVI system classified a new test sequence by
calculating its anomaly score and comparing it with the anomaly threshold: if
the anomaly score was greater than the anomaly threshold, the sequence was
classified as anomalous (positive class), otherwise it was classified as normal
(negative class). Finally, the performance of the AVI system was assessed for
both normal and anomalous sequences based on its classification accuracy.
[Diagram: each frame R[t] = (R1[t], ..., Rm[t]) is segmented into k regions and s statistical measures are extracted from the pixel values of each region, giving the reduced vector Q[t].]
FIGURE 6.8
Dimensionality reduction of the feature vector R[t] to Q[t] based on
segmenting the frames into k different regions and extracting s statistical
measures from their pixel values.
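A sketch of the per-region feature extraction of Fig. 6.8, assuming one boolean mask per region and the four statistics (median, standard deviation, maximum and minimum) used later in this chapter:

import numpy as np

def region_features(frame, region_masks):
    # Reduce a 32x32 frame to s = 4 statistics per region (the vector Q[t] of Fig. 6.8).
    feats = []
    for mask in region_masks:                      # boolean 32x32 mask of each region
        pixels = frame[mask]
        feats.extend([np.median(pixels), pixels.std(), pixels.max(), pixels.min()])
    return np.array(feats)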
FIGURE 6.9
(a) The 14 regions into which the frame was segmented. The regions adjacent
to the edges were considered to be background. (b) The movement pattern of
the spot through the regions under normal conditions. However, this pattern
was changed during the obstacle avoidance stage according to the position of
the obstacle, namely at the top (c), in the middle (d) and at the bottom (e) of
the HAZ.
FIGURE 6.10
Equal-width interval binning discretization was used to reduce the initial
number of possible discrete pixel values (colors) (1,024) to only 10 by
assigning the same label to all values within an interval, all with the same
width (102 colors).
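The discretization itself reduces to an integer division; a sketch assuming raw pixel values in [0, 1023] and 0-based bin labels:

import numpy as np

def equal_width_discretize(pixels, n_values=1024, n_bins=10):
    # Map raw pixel values to n_bins equal-width labels (width ~102 colors).
    width = n_values / n_bins
    return np.minimum((np.asarray(pixels) // width).astype(int), n_bins - 1)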
FIGURE 6.11
Constraints were placed on the arcs between variables in the implementation
of the DBN algorithms. Variables are represented by a number and a color.
Variables labeled with the same number belonged to the same region and
variables labeled with the same color belonged to the same feature extracted
from the regions. These variables could be observed in the past (t − 1) (darker
color) or in the present frame (t) (lighter color). Additionally, permitted arcs
are colored green, whereas banned arcs are colored red. Since time naturally
flows from past to present, arcs in the opposite direction were prohibited (b).
For this particular application, the arcs connecting different types of variables
from different regions (g) were also prohibited. All other arcs, namely
persistent (a), temporal intra-region (c), temporal inter-region (d),
instantaneous intra-region (e) and instantaneous inter-region (f) arcs, were
permitted.
First, arcs connecting variables of the same type from different regions were permitted; if these arcs connected variables in the same time slice, they were called instantaneous inter-region arcs (f),
whereas if they connected variables in different time slices, they were called
temporal inter-region arcs (d). Second, any arcs connecting variables from
the same region were permitted (e.g., medians with minimums). If these arcs
connected variables in the same time slice, they were called instantaneous
intra-region arcs (e), whereas if they connected variables in different time
slices, they were divided into arcs connecting the same variable type, called
persistent arcs (a), and arcs connecting different variable types, called tem-
poral intra-region arcs (c). Also, the number of possible parents of each
variable was limited to two in order to reduce complexity and enhance the
interpretability of the resulting model.
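These structural constraints can be expressed as a simple predicate that a structure-learning routine could use to ban or permit candidate arcs; the variable encoding below is an assumption of this sketch (the two-parent limit would be enforced separately):

def arc_allowed(src, dst):
    # src and dst are (region, feature, slice) tuples, with feature in
    # {"med", "sd", "max", "min"} and slice in {"past", "present"}.
    src_region, src_feat, src_slice = src
    dst_region, dst_feat, dst_slice = dst
    if src_slice == "present" and dst_slice == "past":
        return False          # time cannot flow backwards (arc type b)
    if src_region != dst_region and src_feat != dst_feat:
        return False          # different variable type and different region (arc type g)
    return True               # persistent, intra-region or same-type inter-region arcs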
Finally, we also made the required assumptions set out above in order
to study the causal relations appearing in the normality model learned with
DBNs.
Had the simulated defects been bigger, the anomalies would have been more noticeable and more readily detected by the AVI system, thereby increasing sensitivity. This was
demonstrated by increasing the negative offset anomaly to 4%. Table 6.1 reports
the specificity and sensitivities achieved by the AVI systems learned with each
of the proposed DBN algorithms when applied to normal and anomalous image
sequences.
DHC correctly classified 93.8% of the normal sequences, while DMMHC
achieved just over 90%. Hence, the response of the classification system when
classifying normal sequences was slightly better with DHC. DHC also out-
performed DMMHC when detecting anomalies produced by Gaussian noise
(with a more notable difference in this case), since sensitivity was 100% for
DHC and only 62.5% for DMMHC. However, this tendency was reversed for
the detection of anomalies produced by negative offset. Even though both
algorithms detected 100% of anomalous sequences with a negative offset of 4%,
DMMHC worked better at lower disturbances, achieving a sensitivity of 81.3%
for a negative offset of 3.5%, while DHC scored only 78.1%.
It is vital in industrial applications to detect most of the sequences with
errors (high sensitivity) without triggering false alarms (high specificity). A high
sensitivity ensures the early detection of errors, identifying defective workpieces
to prevent them from being further processed; whereas, a high specificity
avoids having to close down the production line unnecessarily, improving line
availability. This is especially critical for plant managers because they lose
faith in the monitoring system if too many false positives are detected and end
up turning it off to avoid downtimes. Specificity or sensitivity could be more
or less important depending on the specific application. In this particular laser
application, the aim was to reach a trade-off between both measures. Thus, the
best option was to use the DHC algorithm to learn the AVI system normality
model. This ensured the highest specificity with sensitivities better than 78%
for the different types of anomalies.
The AVI system was implemented on a PC with an Intel Core i7 processor
and 16GB of RAM. Here, the proposed methodology met the in-process
classification requirement of taking less than three seconds to classify a new
TABLE 6.2
Mean time and standard deviation (in seconds) taken by the AVI system
learned with the DHC and DMMHC algorithms to classify a new sequence on
a PC with an Intel Core i7 processor and 16GB of RAM (the results reported
after repeating the process 1,000 times)
sequence with both the DHC and DMMHC algorithms. As Table 6.2 shows,
both were able to classify new sequences in approximately two seconds3 .
Finally, note that the widespread lack of examples with errors is not the
only reason why the applied anomaly detection approach is appropriate in
manufacturing. In fact, there is such a wide range of things that could go wrong
in such an uncertain world as industrial practice that it would be unmanageable
to try to learn a model for each possibility. The “generalizability” required
for quality control activities is achieved in anomaly detection by modeling the
normality of the process and then detecting situations that deviate from the
norm.
[Transition network diagram: nodes are labeled with their region number and colored by variable type (med, sd, max, min), in the past and present frames.]
FIGURE 6.12
Transition network learned with the DHC algorithm. A vertical line separates the past and the present frames.
The darker colored nodes represent the state of the variables in the past frame, whereas the lighter
colored nodes represent the state of the variables in the present frame. Only 61
out of the 72 arcs are represented because nine variables in the past frame were
independent, i.e., had no incoming or outgoing arcs. Table 6.3 lists the number
of arcs appearing in the network, broken down by the type of relationship they
produced. Note that all the variables had two parents. This was the maximum
number allowed in order to reduce the complexity of the model.
Some conclusions of the process can be drawn from the information on
the transition network. The median, maximum and minimum were persistent
variables in 85.2% of the cases. This was compatible with the inertia property
of thermal processes where the temperature of a region tends to be stable
unless affected by an external source. This finding was particularly important
for the median of the regions because we wanted the temperature of the HAZ
to be stable at a high enough value to reach the austenite phase. However, the
medians of regions 7 and 8 were not persistent and had an incoming arc from
the medians in the past frame of regions 6 and 8, and in the present frame of
region 9, respectively. This meant that the median temperature of adjacent
regions had a greater impact in these cases.
On the other hand, the standard deviation was never persistent and usually
depended on the values in the present frame of the other variables in its
region, namely the maximum and minimum or the minimum and median.
Another possibility (regions 3, 7 and 8) was that the standard deviation was
instantaneously influenced by that of an adjacent region (regions 5, 12 and
9, respectively), meaning that knowledge of the degree of disorder in the first
region sufficed to infer the degree of disorder in the second region.
Moreover, the common structure within regions was for the median to be
the parent of both minimum and maximum. Then, the median again or the
minimum or maximum was usually the parent of the standard deviation. The
direction of these arcs was aligned with what we might expect from a thermal
point of view, since the maximum and minimum values are usually proportional
to the average heat in a region (represented here by the median). Additionally,
a discrepancy in the trend alignment of at least two of the above variables could
signify a high heterogeneity in the temperature of the region, increasing the
standard deviation. In this way, we concluded that the relationships captured
by the DBN structure seemed reasonable.
Another interesting fact was that the median appeared to be the most
influential variable, since it was normally the ancestor of the other variables
of the region in the present frame. This conclusion was tested using network
centralities that assign a score to nodes in a network based on their structural
properties (Zaman, 2011). They have proved to be very powerful tools for
analyzing complex networks with thousands of nodes and arcs, like tools
used to model web searches in the Internet (Page et al., 1999), or social
networks (Bar-Yossef and Mashiach, 2008; Zaman, 2011). In this particular
case, we wanted to determine which transition network nodes were the most
influential in terms of their capability of reaching more network nodes.
TABLE 6.3
Number (in parentheses) of network arcs learned with the DHC algorithm
according to the type of direct relationship. They are broken down by type of
variable when an arc connected two variables of the same type, which is the
case only for persistent and inter-region arcs. For instance, the arc from
med_4_past to med_5_present is a non-persistent temporal arc that connects
two regions (inter-region) through the medians
Total (72)
├── Instantaneous (42)
│   ├── Intra-region (29)
│   └── Inter-region (13): med (5), sd (3), max (4), min (1)
└── Temporal (30)
    ├── Persistent (23): med (7), sd (0), max (7), min (9)
    └── Not persistent (7)
        ├── Intra-region (3)
        └── Inter-region (4): med (4), sd (0), max (0), min (0)
TABLE 6.4
Ranking of the most influential variables in the transition network according
to outdegree, outgoing closeness, betweenness and Reverse PageRank network
centrality measures. The medians in both the past and present frames are in
bold
The results in Table 6.4 indicate that network centrality measures generally
identified the median as the most influential type of variable, since it occupied
the top positions of the ranking in all cases. To be more precise, if we focus on
Reverse PageRank, we find that, except for region 12 (where the minimum was
the most influential variable), the median of a region, in both the present and
past frames, was more influential than any other type of variable. In fact, of
the top 14 positions, 12 corresponded to the medians of seven different regions.
It is also interesting to see that outgoing closeness yielded similar results
to the findings reported for Reverse PageRank. This is because both took into
account the number of nodes reachable from each node. Outdegree, on the
other hand, was only able to analyze the structure locally. For example, the
most influential node for outdegree was the median of region 4 in the present
frame (med_4_present) with three outgoing arcs. Looking at the network in
Fig. 6.12, however, those arcs were not so influential because they pointed to
leaf nodes. Reverse PageRank was able to take this into account and ranked
med_4_present in the 13th position. Looking at betweenness, we find that
all the nodes (including leaf nodes) in the past frame were scored zero since,
because of the non-symmetry of temporality, they had to be parent nodes.
For this problem, however, it was also important to measure their influence.
This shows how critical it is to select the correct network centrality measure
in order to get useful conclusions.
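With networkx, the four centrality measures can be computed directly on the transition network; Reverse PageRank is simply PageRank on the graph with every arc reversed (the arcs below are illustrative, not the learned network):

import networkx as nx

G = nx.DiGraph()
G.add_edges_from([
    ("med_4_past", "med_4_present"),        # illustrative arcs only
    ("med_4_present", "min_4_present"),
    ("med_4_present", "max_4_present"),
])

outdegree = dict(G.out_degree())
betweenness = nx.betweenness_centrality(G)
out_closeness = nx.closeness_centrality(G.reverse())    # outgoing closeness
reverse_pagerank = nx.pagerank(G.reverse())             # Reverse PageRank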
Markov Blanket
Intuitively, we expected closer regions to have a thermodynamic influence on
each other and to be independent of distant regions. The DBN structure can
answer these questions through the query: Does a given variable belong to the
Markov blanket (Section 2.4.9) of another variable?
Translating this concept to our application, we wanted to identify the minimal set of regions whose known states in the past or present frame shielded the state in the present frame of a specific target region from the influence of the states of the other regions (i.e., made them independent).
Since each target region was composed of a set of four variables, we defined
the Markov blanket of a target region as the union of the Markov blankets
of their variables. Then, even if only one variable from a different region in
the past or present frame was in the Markov blanket of the target region, we
identified that region as part of the Markov blanket of the target region.
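For reference, the Markov blanket of a node can be read directly off the directed graph (its parents, its children and the other parents of its children); a small networkx sketch:

import networkx as nx

def markov_blanket(G, node):
    parents = set(G.predecessors(node))
    children = set(G.successors(node))
    spouses = {p for c in children for p in G.predecessors(c)} - {node}
    return parents | children | spouses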
Fig. 6.13 shows, for each target region (in yellow), which other regions (in
blue) shielded them from the rest (in white). We find that the regions were
locally dependent on other adjacent or fairly close regions (such as regions 7
and 8 in Fig. 6.13(e)). At first glance, we could guess at the trail of the spot.
For example, when it hits several regions at the same time, like regions 6 and
9 with respect to region 8 in Fig. 6.13(f); or when it moves across regions,
like region 10 with respect to region 9 in Fig. 6.13(g), or regions 6 and 8 with
respect to region 7 in Fig. 6.13(e).
FIGURE 6.13
Illustration of the regions with variables within the Markov blanket (in blue)
of the variables of the target region (in yellow). Knowledge of the state of
these regions shielded the target region from the influence of other regions (in
white). As expected, both the regions and their Markov blanket regions were
close. Markov blanket of (a) region 3, (b) region 4, (c) region 5, (d) region 6,
(e) region 7, (f) region 8, (g) region 9, (h) region 10 and (i) region 12.
Causal Influence
By taking a closer look at the direction of inter-region arcs, we were able
to establish some direct causal influences between regions. More precisely,
we wanted to find out which regions had a direct effect on each region. We
defined a region to be the cause of a target region in the present frame if
at least one of its variables in the past or present frame was the parent in
the transition network of a variable of the target region. Fig. 6.14 shows the
parent regions (green) for each of the regions in the present state (yellow). We
made a distinction between two types of parental relationships: relationships
produced by instantaneous arcs only (light green) and relationships produced
by temporal arcs only (dark green). In no case did the same region have a
mixture of these two types of relationships over the target region. Note that
the results for direct causal influence were consistent with a particular case of
the results for Markov blankets, since the focus is on the parents of a target
region from a different region. By definition, they also belong to the Markov
blanket of the target region.
We first analyzed the regions that instantaneously influenced (colored in
light green) the state of other regions. They were regions adjacent to the one
that they were influencing, and the images recorded during the process showed
that they were all consistent with situations where the spot was hitting both
regions at the same time. For this reason, it was possible to somehow infer the
state of the target region from the known state of its neighbor. There were
even some cases where, because of their width, the spot hit the same regions
in consecutive frames, and they became very related. In such situations, some
of the variables of a region were children of the other region or vice versa,
resulting in regions that were simultaneously children and parents of a different
region. This applied to regions 3 and 5, and 6 and 8 and could be an indication
that the DBN detected these highly related regions as artifacts produced when
segmenting the HAZ. Therefore, they could potentially have been merged into
a single region.
We then analyzed the regions that had a temporal influence over another
region, i.e., the state in the past of these regions was conditioning the present
state of their child regions. This was the case of region 4 with respect to region
5 (Fig. 6.14(c)), regions 8 and 6 with respect to region 7 (Fig. 6.14(e)), and
region 10 with respect to region 9 (Fig. 6.14(g)). In all cases, we found that the
connected regions were situated along the same horizontal line. In fact, they
were related to the horizontal movement of the spot when tracing the middle
and bottom sections of the pattern under normal conditions (segments 3 and
7, and 5 in Fig. 6.9(b), respectively). The type of variable that was capable of
capturing these temporal inter-region connections was the median.
From these results, we can conclude that the DBN was able to learn that the
separate temporal characterization of each region was not enough to represent
the thermal properties of the process because there were also spatio-temporal
FIGURE 6.14
Illustration of the regions with variables that were parents (in green) of at
least one variable of the target region (in yellow). There were two types of
parent regions: regions that produced instantaneous influences only (light
green) and regions that produced temporal influences only (dark green). In no
case did the same region have a mixture of these two types of relationships
over the target region. The spot movement patterns during the process were
captured by the direct temporal causal influences. Causal effect in (a) region
3, (b) region 4, (c) region 5, (d) region 6, (e) region 7, (f) region 8, (g) region
9, (h) region 10 and (i) region 12.
FIGURE 6.15
(a) Subgraph of the transition network in Fig. 6.12 representing the nodes
whose parameters were analyzed (med_5_present and med_4_present) and
their parent nodes. (b) The regions of the HAZ involved in the subgraph.
TABLE 6.5
States of the median variables considered for regions 4, 5 and 7

Region         4      5      7
States range   3-6    7-10   6-9
[Figure content: an example CPT p(X | Y, Z) = θ_X laid out as a table with one row per parent configuration pa^j_X, j = 1, ..., 20 (from (y1, z1) to (y4, z5)), and one column per state x1, x2, x3 of X; the same CPT is also shown as three matrices, one per state X = x_k, whose entries sum to one across k for each parent configuration, e.g., p(X = x1 | Y = y2, Z = z3) + p(X = x2 | Y = y2, Z = z3) + p(X = x3 | Y = y2, Z = z3) = 1.]
FIGURE 6.16
CPT of variable X, which has two parents, Pa(X) = {Y, Z}. Here, the CPT can be represented three-dimensionally, where, for each state X = x_k, there is a matrix with the probability of that state conditioned upon the different combinations of states of Y and Z (pa^j_X).
[Heatmaps: (a) CPT of med_4_present given med_4_past and med_7_present; (b) CPT of med_5_present given med_5_past and med_4_past. Cell color encodes the probability, from 0 to 1.]
FIGURE 6.17
CPTs of the median in the present frame of regions 4 (a) and 5 (b). The CPTs
were reduced to the states specified in Table 6.5. Each matrix corresponds to
the probabilities (represented by a color gradient) of a fixed state of the child
variable for the different states of the parent variables. Since the analyzed
variables were persistent, their states in the past frame were always situated
in the rows of the matrices. The matrices were sorted according to the state of
the analyzed variable in ascending order from left to right and top to bottom.
[Heatmap: CPT of med_4_present given med_4_past and med_7_present, with annotations (i)–(viii) marking the probability regions discussed in the text.]
FIGURE 6.18
Annotated version of the CPT of the median in the present frame of region 4
(med_4_present) illustrated in Fig. 6.17(a).
The color gradient of the matrices ranges from cells with probability zero to dark blue cells with probability one. These reduced CPTs were analyzed to check whether the movement pattern of the spot was represented in the causal relations and whether the behavior of the different regions was stable when they were not hit by the spot.
region 7, which was located beside region 4. This maximum temperature was
not stable because it had a very high probability of decreasing rapidly to
state 4 in the next frame (annotation ii). This was, presumably, because of its
proximity to the background. Likewise, state 5, which was the next hottest
state, was reached only from lower temperatures, but this time, with more
disparate values of the median of region 7 (annotation iii). On the one hand,
when the median was very hot (states 8 and 9) (annotation iii-1), this might
mean that the spot was at segment 4 and starting on segment 5. On the other
hand, when the medians of both region 4 and region 7 were colder (states
3 and 7, respectively) (annotation iii-2), this could be indicative of the spot
being at the end of segment 7 and starting on segment 8.
The most stable median temperature of region 4 corresponded to state
4 because it was where the median decreased after exposure to the spot
(annotation ii). This was a highly persistent state, and had a high probability
of continuing into the next frame (annotation iv). It could be reached from state
3 provided the state of region 7 was not the absolute minimum (annotation
v). State 3 was also highly stable (annotation vi), having a higher probability
of being persistent at low values of the median of region 7 (annotation vi-1).
It could plausibly be reached from state 4 if the state of region 7 was cold
(states 6 and 7) (annotation vii), meaning that the spot was distant from both
regions. However, it was striking that it could be reached from the hottest
states (states 5 and 6) in the past frame with region 7 in a very cold state
(state 6) in the present frame (annotation viii) because it meant a sudden drop
in temperature in both regions. This could be an example of what occurred
during the unstable initial heating transient, where a region rapidly cooled
down after the spot stopped hitting it, causing a big temperature dip.
[Heat maps of p(med_5_present | med_5_past, med_4_past) for med_5_present ∈ {7, 8, 9, 10}, annotated with the labels (i)-(v) referenced in the text; the columns of med_4_past are marked "S" (stable) or "U" (unstable), with states 5 and 6 marked "U".]
FIGURE 6.19
Annotated version of the CPT of the median in the present frame of region 5
(med_5_present) illustrated in Fig. 6.17(b).
corroborated by the fact that it was likely for states 7, 8 and 9 to be reached
through a gradual decrease of the median from one state to the next without
big jumps (annotation ii). Looking more closely at the time when the maximum
value of the median in region 5 was reached and the median of region 4 was
stable at state 4 (indicating that the spot was hitting only region 5), we found
that the tendency of region 5 was to decrease its median to state 9 or remain
in this maximum state (annotation iii). This was compatible with the time
when the spot was moving along segments 1, 3 and 7 of the pattern in normal
conditions, since the spot remained above region 5 for several frames.
When the spot was over region 4, which was unstable (states 5 and 6
situated in the last two columns of each matrix and marked with “U”), we saw
that there were probabilities that, after reaching the maximum state, region 5
cooled down to states 9 or 8 (annotation iv). This was compatible with the
movement of the spot to the right during segment 3 and the beginning of
segment 4, since after both had been hit by the spot it moved away. However,
it was surprising that, for high values of the median in region 4 in the previous
frame, there was a high probability of the maximum median in region 5 being
reached (or maintained) irrespective of its past temperature (annotation v).
This revealed something that was out of the question in normal conditions,
namely, the right to left movement of the spot from region 4 to region 5.
FIGURE 6.20
Two consecutive frames recorded from the laser process, showing how the
normal movement pattern of the spot changed to avoid an obstacle at the
bottom of the HAZ, going from regions 4 and 7 to region 5. To be exact, the
spot was covering segment 11 in frame 19515 and segment 12 in frame 19516,
according to the movement pattern illustrated in Fig. 6.9(e).
Nevertheless, this was feasible during the frames where the spot was avoiding
an obstacle at the top or bottom of the HAZ (see Fig. 6.9(c) and Fig. 6.9(e),
respectively). There, the direction of the horizontal segment of the pattern
was inverted, moving from region 4 to region 5 (see Fig. 6.20 for an example).
Experts noted that this phenomenon was particularly pronounced when the
obstacle was at the bottom of the HAZ because the spot hit region 4 for longer,
allowing it to reach higher temperatures. In fact, there was evidence of the
median of region 4 reaching states 7 and 8 in this situation, but this was so
unlikely that they were not visible in the CPT heat maps. This was a clear
example of the major effect that the obstacle avoidance step of the normal
laser process had on the CPT even if it occurred during a small fraction of the
process.
This chapter has presented an AVI system for the detection of anomalies in real image sequences from the laser surface heat
treatment of steel workpieces. The implementation of this AVI system in a
production line will provide on-time feedback about the quality of the process
and minimize product failures and waste. To be precise, wrongly processed
workpieces will be immediately marked and removed from the production line
for later manual inspection.
The normal behavior of the process was learned using DBNs that provided
an interpretable representation of the dynamics of the laser process. We saw how
the structure of the DBN embodied conditional dependence and independence
relationships among the features of the system that were exploited in order to
understand how they interacted locally. These interactions were seen, under
restrictive assumptions, as local causal influences that were reflected in the
parameters of the DBN. We used all this information to verify that the machine
was accurately learning the inherent patterns of the laser process.
Furthermore, DBNs, as shown above, could also be helpful for discovering
new knowledge by finding relationships that were previously unknown, allowing
experts to gain insight into the thermodynamic-spatial behavior that occurred
in the HAZ where the laser spot was moving.
Additionally, thanks to their transparency, we could have detected wrong
or illogical relationships in the DBN produced, for example, by noise in the
measurements. In these situations, DBNs can be “repaired” by deleting or
adding arcs in the structure, or modifying some parameters. This possibility
of adding expert prior knowledge to the machine learning model is a valuable
capability of BNs that is missing in blackbox models like artificial neural
networks.
All the above points highlight that DBNs are a promising tool for the in-
depth analysis of dynamic processes, which are very common in manufacturing
environments.
7
Forecasting of Air Freight Delays

7.1 Introduction
Not all the industrial processes required to create a final product can usually be
enacted in the same physical place. In fact, a hierarchy of industries (often called
supply chain) can be built for most industrial outputs. For example, different
industries, ranging from iron ore mining that extracts the raw materials to
car assembly, through many intermediate processing industries, such as the
metallurgy or machine tool industries, are necessary to produce a car. As a
result, the distribution of goods has a major impact on the correct operation
of a factory or group of factories by transporting materials, workpieces and
final goods.
The distribution of goods is usually called logistics. The Council of Supply
Chain Management Professionals (CSCMP) defines logistics as:
Logistics
The process of planning, implementing, and controlling proce-
dures for the efficient and effective transportation and storage
of goods including services, and related information from the
point of origin to the point of consumption for the purpose of
conforming to customer requirements. This definition includes
inbound, outbound, internal, and external movements.
Each transport leg can comprise up to four trips, although the dataset does not contain any transport leg with
more than three trips.
The dataset contains 3,942 actual business processes, comprising 7,932
transport legs and 56,082 transport services. The information available in the
dataset includes, among other fields, the planned and actual times of each transport service, the masked airports involved and the number of trips in each transport leg.
In total, there are 98 variables in the dataset. The dataset does, of course,
include all actual times because the business processes have finished. However,
actual times are shared in real time with the Cargo 2000 system as the business
process progresses. In this case study, we simulate this behavior to gain insight
into how machine learning method performance improves when new information
about actual times becomes available.
Table 7.1 shows the number of transport services of each type and the rate
of violations of the planned times. Note that 26.6% of the business processes
did not finish on time. The majority of delayed transport services are of the
DEP type. DEP transport services are the most unpredictable, perhaps because
external factors, such as meteorological conditions or airport traffic congestion,
may have a bearing on departure from the airport. It is important to note
that, although 84% of DEP transport services are delayed, the rate of delayed
business processes is much lower, mainly for two reasons:
• Time loss during the DEP process can be recovered across the remaining
transport services.
• A delay in an inbound transport leg because of a DEP process delay will
not delay the entire business process provided the delayed transport leg
finishes before the longest inbound transport leg. The outbound transport
leg starts when the last inbound transport leg has been consolidated at the
hub.
FIGURE 7.1
UML 2.0 diagram of the business process involving up to three shipments that are consolidated before being sent to the
customer (Metzger et al., 2015).
TABLE 7.1
Number of transport services grouped by type and their respective actual rate
of violation of planned times (Metzger et al., 2015)
ij_dep_k_p = 0
ij_rcf_k_p = 0
ij_dep_k_e = 0
ij_rcf_k_e = 0,        ∀ j, k | ij_hops ≠ NA, k > ij_hops        (7.1)

where ij_dep_k_p and ij_rcf_k_p are the planned times for the k-th trip
of the j-th inbound transport leg of the DEP and RCF services, respectively;
the actual times are represented similarly with the suffix _e. The number of trips
for the j-th transport leg is denoted as ij_hops. Eq. 7.1 only affects transport
legs that exist (ij_hops ≠ NA) and applies the imputation to their non-existing
trips (k > ij_hops). The same preprocessing should be applied to the outbound leg.
A zero value does not change the duration of any transport leg or the
business process. Also, as the actual time is equal to the planned time, the
transformation does not add any delayed transport services.
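As a rough illustration (not the preprocessing code actually used in the case study), the zero-imputation of Eq. 7.1 could be sketched in Python as follows; the column-naming pattern (e.g., i1_dep_2_p, i1_hops, o_hops) and the use of pandas are assumptions:

import pandas as pd

def impute_nonexistent_trips(df, legs=("i1", "i2", "i3", "o"), max_trips=4):
    """Set planned and actual DEP/RCF times of non-existing trips to zero (Eq. 7.1)."""
    df = df.copy()
    for leg in legs:
        hops = df.get(f"{leg}_hops")   # number of trips of the leg, NA if the leg does not exist
        if hops is None:
            continue
        for k in range(1, max_trips + 1):
            beyond = hops.notna() & (k > hops)        # existing legs, trips beyond the last one
            for service in ("dep", "rcf"):
                for suffix in ("p", "e"):             # planned and actual times
                    col = f"{leg}_{service}_{k}_{suffix}"
                    if col in df.columns:
                        df.loc[beyond, col] = 0.0
    return df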
This imputation does not solve the problem of how to deal with the non-
existent transport legs, as all transport service times on those transport legs are
marked as missing. The solution to this problem is proposed in Section 7.2.1.3.
This is not the only possible transformation of the time data that solves
the problem of missing data in the non-existing trips. For example, we could
sum all DEP and RCF variables to create a super collapsed transport service
that includes all the planned and actual times of every departure and arrival:
TABLE 7.2
Inbound transport leg data for two business processes. Two instances are
shown as examples of the changing number of missing data. The table has
been divided into two parts because of the high data dimensionality. The
columns on both sides of the table contain information on a different instance
collapsedj_p = Σ_{k=1}^{ij_hops} (ij_dep_k_p + ij_rcf_k_p)

collapsedj_e = Σ_{k=1}^{ij_hops} (ij_dep_k_e + ij_rcf_k_e).
Again, missing data are not considered, although the classifier could, thanks
to the auxiliary variables ij_hops, still ascertain the number of trips in each
transport leg. Furthermore, this transformation generates a smaller set of
variables for the classifier. Nevertheless, we did not use this data representation
in this case study because it does not report updates about the status of the
business process until all the stopover flights have finished. As we are looking
for a finer-grained analysis of the delivery process, the separation of DEP and
RCF is a better option.
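A minimal sketch of this alternative collapsed representation, under the same assumptions about column names and pandas as in the imputation sketch above:

import pandas as pd

def collapse_leg(df, leg, max_trips=4):
    """Sum the DEP and RCF times of all trips of one transport leg into collapsed_p/collapsed_e."""
    out = pd.DataFrame(index=df.index)
    for suffix in ("p", "e"):                  # planned and actual times
        cols = [f"{leg}_{service}_{k}_{suffix}"
                for k in range(1, max_trips + 1)
                for service in ("dep", "rcf")
                if f"{leg}_{service}_{k}_{suffix}" in df.columns]
        # after the zero-imputation of Eq. 7.1, non-existing trips contribute 0 to the sum
        out[f"collapsed_{leg}_{suffix}"] = df[cols].sum(axis=1)
    return out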
FIGURE 7.2
UML 2.0 diagram of the business process taking into account the bottleneck transport leg only.
Structure: Delay → i1_dep_1_place, with CPT P(i1_dep_1_place | Delay):

i1_dep_1_place      Delay = 0    Delay = 1
101                 0            0.0009
...                 ...          ...
815                 0.1496       0.1011
FIGURE 7.3
Example of a CPT in a naive Bayes classifier for a variable with high
cardinality.
As the original IATA codes have been masked, we cannot use
information about the airports. The only known information is the number of
times each airport has been used in the Cargo 2000 dataset. We assume that
the frequency of use of each airport is representative of its real traffic, and
this level of traffic can have an impact on service times. For example, there
are more likely to be landing and take-off delays at a congested than at a low
traffic airport.
Therefore, we create four possible airport labels: low traffic (L), medium
traffic (M), high traffic (H) and also a non-existing tag (NA) for the non-
existent flights (less than three trips in a transport leg). Airport usage has been
computed counting the number of take-off and landing (DEP and RCF) services
for each anonymized IATA airport code. The least used airports that account
for at most the 33% of total airport uses will be tagged as L. The least used
airports that have not been tagged and also account for at most the 33% of total
airport uses will be tagged as M. The rest of the non-tagged airports are tagged
as H. We opted for equal-frequency discretization (Section 2.2.1) because it
provides a fair division of the airports. None of the tags is overrepresented in
the dataset (each airport label contains about the same frequency), whereas
it still provides a division following the criterion that we defined for airport
use. Fig. 7.4 shows the cumulative sum of airport uses. Also, the 33% and
66% thresholds of the total sum of airport uses are represented as blue/orange
horizontal lines, respectively. The label of each airport is color coded: blue for
low traffic airports, orange for medium traffic airports and red for high traffic
airports. We find that there is a clear imbalance in the number of airports for
FIGURE 7.4
Cumulative sum of the number of uses for each airport sorted in ascending
order. The n-th airport on the x-axis corresponds to the n-th least used
airport. Low/medium/high traffic airports are color coded blue/orange/red,
respectively. The blue and orange horizontal lines are the cumulative
maximum values for an airport to be tagged as low/medium traffic airport,
respectively.
each label: there are 221 low traffic airports, 13 medium traffic airports and
three high traffic airports.
This transformation of the airport information makes the job of the classifier
easier because the cardinality of each _place variable is reduced to four. Also,
if an instance with a new airport (not in the dataset) is to be classified, the
airport will be tagged as L and the instance can be classified. Furthermore,
the tag of each airport and the number of uses can be updated as new data
come in. Therefore, new information can update our knowledge about each
airport and improve our classification.
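The airport tagging procedure described above can be sketched as follows; the use of pandas and of a precomputed Series of usage counts per masked airport code are assumptions:

import pandas as pd

def tag_airports(usage_counts):
    """usage_counts: pandas Series with the number of DEP/RCF services per masked airport code."""
    sorted_counts = usage_counts.sort_values()           # least used airports first
    cumulative = sorted_counts.cumsum()
    total = sorted_counts.sum()
    tags = pd.Series("H", index=sorted_counts.index)     # default: high traffic
    tags[cumulative <= 0.66 * total] = "M"                # next block of airports, up to 66% of uses
    tags[cumulative <= 0.33 * total] = "L"                # least used airports, up to 33% of uses
    return tags

# An airport not present in usage_counts (i.e., unseen at training time) would be tagged as "L".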
business process was delayed. Is the value of 300 minutes really meaningful for
predicting a delay in a subsequent shipment? This could be well below delay
expectations for a long-haul international shipment. As we do not know where
each airport is located in the Cargo 2000 dataset, we cannot take into account
the distance to correct the absolute times. Suppose, instead, that we used
relative times in the above example. Then, we would say that the national
shipment trip took 80% of the total planned business process time. If we find
a long-haul international shipment where a given trip accounted for 80% of
the business process time, this international shipment could reasonably be
classified as a possible delay because this is not a usual feature of non-delayed
national or international business processes.
Of course, this correction is not perfect and can underrate/overrate the
expected time for short-/long-haul flights or light/heavy freights because a
particular service execution may take more or less time for flights with different
characteristics. Nevertheless, relative times are more commensurable than
absolute times and are always in the same [0, 1] range.
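The conversion to relative times could be sketched as follows; the name of the column holding the total planned business process time is an assumption, and the final clipping is only one possible way of enforcing the [0, 1] range mentioned above:

import pandas as pd

def to_relative_times(df, time_cols, total_planned_col="total_planned_time"):
    """Express each service time as a fraction of the total planned business process time."""
    rel = df[list(time_cols)].div(df[total_planned_col], axis=0)
    return rel.clip(lower=0.0, upper=1.0)    # keep every feature inside the [0, 1] range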
FIGURE 7.5
Subtree raising applied to node B (in red). The largest branch (C) replaces B,
and the instances in leaves 1 and 2 (in orange) are moved to the children of
node C.
• Minimal required weight: the minimum total weight of the instances covered
by each rule. Commonly, the weight of each instance is the same and equal
to 1. Nevertheless, more important instances may be weighted higher.
• Number of optimization runs: the number of times that the optimization
step is executed.
• Pruning: training with rule pruning.
• Error rate: this parameter checks whether or not there is an error rate of
at least 0.5 in the stopping criterion.
• Learning rate (η): a value in the range [0, 1] that changes the speed at
which the weights of each connection between neurons are updated. A lower
value makes slight changes to the weights, as a result of which it can take
longer to get a (usually local) optimal value. A greater value makes bigger
changes to the weights, possibly leading to a faster convergence to local
optima. However, a higher value can easily overshoot the optimal point,
again possibly failing to reach an optimum. This behavior is illustrated in
Fig. 7.6, where the low learning rate in (a) makes slight changes to the
weight values, usually leading to small changes in the objective function.
The last update comes very close to the optimum point. However, it makes
many more weight updates than the high learning rate example in (b).
• Momentum: a value in the range [0, 1] that uses the direction of previous
weight updates to adjust the weight change speed. Thus, if weights were
previously updated in the same direction, the weight change speed can be
increased because we are definitely moving in the right direction towards the
optimization of the objective function. However, if weights were updated
in opposite directions, the speed of weight change should be decreased.
Fig. 7.6(c) illustrates high momentum weight updates. If the changes are
too large when the weights are near the optimum, the optimum is overshot.
Thus, the momentum causes a reduction in the weight change speed (a numerical sketch of this update rule is given after Fig. 7.6).
FIGURE 7.6
Example of weight optimization with different parameters. (a) Low learning
rate, (b) high learning rate and (c) high momentum.
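The interplay between the learning rate and the momentum described in the two items above can be made concrete with a toy gradient-descent update; this is only an illustrative sketch on a simple quadratic objective, not the weight update used by WEKA's multilayer perceptron implementation:

import numpy as np

def momentum_step(w, grad, v, learning_rate=0.1, momentum=0.9):
    # The velocity v accumulates previous update directions: consistent gradients
    # increase the effective step size, while direction changes damp it.
    v = momentum * v - learning_rate * grad
    return w + v, v

# Toy objective f(w) = 0.5 * ||w||^2, whose gradient is simply w.
w = np.array([2.0, -3.0])
v = np.zeros_like(w)
for _ in range(200):
    w, v = momentum_step(w, grad=w, v=v)
print(w)   # overshoots and oscillates, but slowly converges to the optimum at the origin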
• Complexity constant or cost (M): see Section 2.4.7. It can take any value
in the domain of positive real numbers, but values in the range [0, 1] are
common.
• Tolerance: this parameter controls the amount of permissible error when
solving the SVM optimization problem. This value is usually equal to 10⁻³.
Smaller values yield more accurate results but slower convergence.
• Kernel function (K): see Section 2.4.7.
7.3.8 Metaclassifiers
As discussed in Section 2.4.10, metaclassifiers (Kuncheva, 2004) combine the
results of multiple base classifiers to classify an instance. We use four different
types of metaclassifiers: stacking, bagging, random forest and AdaBoost.M1.
The stacking method stores multiple layers of classifiers. Each layer uses
the results from the previous layer, and the last layer makes the final decision.
Typically, different types of classifiers are used to complement each other.
The stacking generalization has to learn how to combine the classifiers in the
previous layer to achieve the best results. The parameterization required by
the stacking classifier is the definition of the base classifier hierarchy.
The bagging method trains several classifiers using slightly different training
sets. Thus, each classifier is trained with a bootstrap sample of the training
set. These bootstrap samples are usually called bags. The bagging method is
commonly used with unstable classifiers, where a slight change in the training
data can cause large changes in the trained models. A new instance is classified
by majority vote of all classifiers. The bagging method has the following
parameters:
• Size of each bootstrap bag: this parameter controls the number of instances
of each bag used to train each classifier.
• Number of trained classifiers.
The random forest method trains several decision trees with different
datasets, all of which are sampled from the training set. Unlike the bagging
method, not only does it sample instances from the training set, but it also
selects a random set of variables from the training set. As in the bagging
algorithm, majority voting is usually performed to classify a new instance. The
random forest method has the following parameters:
• Size of each bootstrap bag: this parameter controls the number of instances
of each bag used to train each classifier.
• Number of variables to be selected in each bag.
• Number of trees to be trained.
• Parameters controlling the behavior of each tree:
– Minimum number of instances per leaf.
– Maximum depth.
The AdaBoost.M1 method trains several classifiers sequentially, increasing the
weights of the instances misclassified by the previous classifiers. It has the
following parameters:
• Weight sum of each bag: this parameter controls the weight sum of each
training bag. The weight sum of a bag is the sum of the weights in a training
bag. As opposed to the bagging and random forest methods, the weight sum
is used instead of the number of instances. Using the weight sum, the new
training bags tend to contain instances misclassified by previous classifiers
because the instances that are harder to classify tend to have larger weights.
This alleviates the computational burden of classifying easy instances too
often.
• Number of trained classifiers.
The stacking, bagging and AdaBoost.M1 methods have to pick which base
classifiers to use. In the case of random forests, we know that the base classifiers
are trees. The base classifiers can have parameters of their own that can affect
performance. Furthermore, the combination of multiple types of classifiers can
generate a large number of parameters for each metaclassifier.
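The following sketch shows how the four metaclassifiers could be configured with parameters analogous to those listed above; it uses scikit-learn rather than the WEKA implementations employed in the case study, and the stand-in data, the single SVM base classifier and the logistic regression metaclassifier are simplifying assumptions:

from sklearn.datasets import make_classification
from sklearn.ensemble import (AdaBoostClassifier, BaggingClassifier,
                              RandomForestClassifier, StackingClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)   # stand-in data

bagging = BaggingClassifier(DecisionTreeClassifier(), n_estimators=10, max_samples=1.0)
forest = RandomForestClassifier(n_estimators=100, max_features=15,
                                min_samples_leaf=5, max_depth=11)
boosting = AdaBoostClassifier(n_estimators=10)        # reweights hard-to-classify instances
stacking = StackingClassifier(estimators=[("svm", SVC(C=1.0, tol=1e-3))],
                              final_estimator=LogisticRegression())

for name, clf in [("bagging", bagging), ("random forest", forest),
                  ("boosting", boosting), ("stacking", stacking)]:
    print(name, clf.fit(X, y).score(X, y))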
• WEKA classifiers usually have more parameters than are mentioned above.
The remaining parameters are usually devoted to computational concerns
(e.g., whether an algorithm should be parallelized) or possible previous data
preprocessing (e.g., the SMO algorithm has a parameter to normalize or
standardize the data before they are processed by the algorithm).
• Parameters can be configured using WEKA’s graphical user interface (“Ex-
plorer”) or, if WEKA is run from the command line, by entering the name
of each parameter.
• The documentation on WEKA parameters is available at http://weka.
sourceforge.net/doc.stable/.
Note that the results for this dataset are not in any way representative of the overall
performance of each classifier type (no free lunch theorem).
k-NN
k value 4
Search algorithm Linear search
Distance function Minkowski distance with p = 6.5
Weighting scheme Inverse of distance
C4.5
Minimum number of instances per leaf 2
Binary splits No
Collapse tree Yes
No pruning No
Confidence threshold for pruning 0.32
Reduced error pruning No
Subtree raising Yes
RIPPER
Minimal required weight 4
Number of optimization runs 9
Pruning Yes
Error rate Do not check
Multilayer perceptron
Learning rate 0.1
Momentum 0.9
Number of epochs 500
Percentage size of validation set 30
Network topology 1 hidden layer with 42 neurons
Learning rate decay Yes
Logistic regression
Ridge 0.09766
Bayesian classifiers
                                                         NB      TAN     K2
Discretize continuous variables¹                         Yes     Yes     Yes
Scoring metric                                           NA      MDL     AIC
Maximum number of parents per node                       1       2       100,000
Prior count to estimate Bayesian network parameters      NA      0.7     0.5
Stacking
2 layers of classifiers: a base classifier layer and a metaclassifier layer
Base classifier: SVM
    Complexity constant                     1
    Tolerance                               0.001
    Kernel function                         (xᵀ · x)
    Standardize data before training        (Section 7.3.9)
Metaclassifier: MLP
    Learning rate                           0.1
    Momentum                                0.9
    Number of epochs                        500
    Percentage size of validation set       30
    Network topology                        1 hidden layer with 44 neurons
    Learning rate decay                     Yes
    Transfer function                       Sigmoid
1 WEKA Bayesian classifiers, except naive Bayes, only work with discrete variables.
Therefore, they are automatically discretized using the discretization procedure introduced
by Fayyad and Irani (1993).
Bagging
Size of each bootstrap bag 100%
Number of trained classifiers 10
Random forest
Size of each bootstrap bag 100%
Number of variables to be selected in each bag 15
Number of trees to be trained 100
Minimum number of instances per leaf 5
Maximum depth 11
AdaBoost.M1
Weight sum of each bag 100%
Number of trained classifiers 10
H₀¹ : µNB = µC4.5
H₀² : µNB = µSVM          (7.3)
H₀³ : µC4.5 = µSVM
If, for example, we reject H₀¹ and H₀², we could say that naive Bayes performs
better/worse than C4.5 and SVM. The problem with the hypotheses in Eq. 7.3
is how to control the family-wise error (FWER), i.e., the probability of
making at least one type I error. Performing multiple tests increases the
probability of a type I error. Suppose that we want to conduct a post-hoc
test as in Eq. 7.3, making individual tests with α0 = 0.05 for each H0 . The
probability of not making any type I errors in all three tests is equal to
(1 − 0.05)³. Thus, there is a probability of 1 − (1 − 0.05)³ ≈ 0.14 of making at
least one type I error. This value is the true α for all three tests in Eq. 7.3.
The expected α for m comparisons with α0 (probability of type I error in each
comparison) is equal to 1 − (1 − α0)ᵐ. Of course, there is a higher probability
of making a type I error when α0 and the number of classifiers increase because
the number of pairwise comparisons is equal to m = k(k − 1)/2. If k = 13, as
in this case study, and α0 = 0.05 for each test, α ≈ 0.98, which is usually an
unacceptable value for drawing any conclusion.
The Bonferroni correction or Bonferroni-Dunn test (Dunn, 1961) can be
used to adjust α0 in order to control the FWER, i.e., the α of the experiment.
The Bonferroni correction divides α by the number of comparisons being tested
to compute α0 . In our example, there are three hypotheses in Eq. 7.3, hence
α0 = 0.05/3 ≈ 0.0166. With this α0, α is guaranteed to be below 0.05, and, in
fact, α ≈ 0.049. The Bonferroni correction is a very simple post-hoc test with
very low power, especially when the number of comparisons increases.
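The family-wise error arithmetic above can be reproduced directly:

alpha_individual = 0.05

# Three pairwise tests (Eq. 7.3), each performed at alpha0 = 0.05.
print(1 - (1 - alpha_individual) ** 3)       # ~0.14: true family-wise alpha

# Thirteen classifiers give m = k(k-1)/2 = 78 pairwise comparisons.
k = 13
m = k * (k - 1) // 2
print(1 - (1 - alpha_individual) ** m)       # ~0.98: unacceptable family-wise alpha

# Bonferroni correction for the three-test example.
alpha_corrected = 0.05 / 3
print(1 - (1 - alpha_corrected) ** 3)        # ~0.049: the FWER stays below 0.05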
CD = qα √(k(k + 1)/(6b)),          (7.4)

where qα are critical values based on the Studentized range statistic divided by
√2. A table of values can be found in Demšar (2006). More advanced methods
are discussed in further detail in Demšar (2006) and García and Herrera (2008).
We use the AUC as the performance measure because it is suitable for unbalanced data (He and Garcia, 2009). There are no actual times available
at the starting checkpoint, whereas all information is available in the end
checkpoint. We find that performance increases substantially for all classifiers
the more information is available. However, not all services contribute to
improving performance. In fact, there is a sizable performance increase for
DLV services. When the information on other services is received, however,
there is no major performance increase, and, in some cases, there is even a
slight drop in performance. This phenomenon will be explored in more detail
later.
Random forest appears to be the best classifier for this problem, and
stacking and SVM are the worst classifiers. However, we conduct a statistical
test to detect statistically significant differences (α = 0.05). Table 7.4 tabulates
the results shown in Fig. 7.7. These were the average results for stratified
10-fold cross-validation run 30 times with different seeds as recommended by
Pizarro et al. (2002). The rank of each algorithm on each dataset is shown in
parentheses. The sum of ranks, Rj in Eq. 2.1, and the average rank for each
classifier, are shown at the bottom of the table. Before executing the post-hoc
tests, we first need to reject the null hypothesis that the performance of all
classifiers is equal. In our case, k = 13 and b = 17. Therefore, the Friedman
statistic in Eq. 2.1 for our dataset is equal to:
S = 12/(17 · 13 · 14) · (161² + · · · + 109²) − 3 · 17 · 14 = 184.297,
a value of a random variable distributed according to χ2 with 12 degrees of
freedom if the null hypothesis is true. The corresponding p-value is equal to
9.76E-11. As the p-value of the statistic is well below α = 0.05, we can reject
the null hypothesis of equality between classifiers. The next step is to run a
Nemenyi test. First of all, we compute the critical difference (Eq. 7.4):
CD = 3.3127 · √(13 · 14/(6 · 17)) = 4.425.
Any classifiers with a rank difference above 4.425 can be considered to
have a statistically significant difference in performance. These differences are
commonly plotted using the critical difference diagram. Fig. 7.8 illustrates
the differences found between the classifiers in our case study. The diagram
plots an axis representing the classifier ranks. In our case, the ranks range from
1 to 13. The vertical lines connecting the rank axis are labeled with the classifier
name. The critical distance specified at the top of the figure visualizes the
minimum distance between classifiers required for differences to be statistically
significant. The horizontal lines below the axis indicate groups of classifiers
that are not significantly different. Therefore, we can say that there is no
significant difference between random forest, logistic regression, multilayer
perceptron, TAN and bagging classifiers. This is because they are linked by the
first horizontal line below the rank axis. However, there is a difference between
the classifiers in the best group: for example, random forest is significantly
better than AdaBoost.M1, but logistic regression is not significantly better than
AdaBoost.M1 and naive Bayes. In the worst classifiers group, the performance
is clearly poorer for stacking and SVM than for the other classifiers until the
inbound transport leg finishes (Fig. 7.7). Then, there is a slight increase in
performance up to the point that they outperform RIPPER. One possible cause
of the performance of the worst classifiers is that it is not easy to generalize
well using the same parameter configuration for multiple datasets.
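The Friedman statistic and the critical difference can be recomputed from the rank sums in Table 7.4; this is a sketch that assumes SciPy is available for the χ² critical value, and the value of qα is the one quoted above from Demšar (2006):

import math
from scipy.stats import chi2

k, b = 13, 17                                        # number of classifiers and of checkpoint datasets
rank_sums = [161, 160, 205, 58, 190, 42, 115, 64,
             131, 207, 85, 20, 109]                  # R_j values from Table 7.4

S = 12.0 / (b * k * (k + 1)) * sum(r ** 2 for r in rank_sums) - 3 * b * (k + 1)
print(round(S, 3))                                   # approximately 184.3
print(chi2.ppf(0.95, df=k - 1))                      # critical value ~21.03, so the null hypothesis is rejected

q_alpha = 3.3127                                     # Nemenyi critical value for k = 13 and alpha = 0.05
cd = q_alpha * math.sqrt(k * (k + 1) / (6.0 * b))
print(round(cd, 3))                                  # approximately 4.425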
• k-NN: this algorithm merely saves the data and matches instances to the
training set for classification. It does not provide any additional qualitative
information.
• MLP: the usual representation of an artificial neural network is a ma-
trix/vector containing the neuron weights. This model is especially difficult
to interpret because of the large number of weights involved. Neither are
the hidden neurons meaningful in the context of the interpretation.
• SVM: it could be difficult to show the max-margin boundary hyperplane
when the dimensionality of the data projection is above three.
• Metaclassifiers: these classifiers are composed of multiple base classifiers.
Interpretability depends on the number and type of base classifiers. If, for
example, we use a multilayer perceptron as a base classifier, the algorithm
is at least as difficult to interpret as a multilayer perceptron. In the case of
the random forest, the classifier is composed of trees. However, it is rather
difficult to interpret random forests due to the large number of trees.
We report the key findings for the other classifiers, and we reason about
the relations between the dataset and the trained models.
7.4.3.1 C4.5
The simplest way to interpret a C4.5 model is to look at the structure of the
tree. Trees have two important facets: the variables selected as tree nodes and
the branching values used for each node. In the case of discrete values, the
branching usually covers all possible values. However, a cut point is selected
in the case of continuous variables to discretize the range values.
Fig. 7.9 shows the representation of a partial C4.5 classifier learned in
Section 7.4.1. The complete tree is quite large (95 leaves and 189 respective rules).
[Line plot of the AUC (roughly 0.5 to 0.9) of each classifier (k-NN, C4.5, RIPPER, MLP, SVM, logistic regression, naive Bayes, TAN, K2, stacking, bagging, random forest and AdaBoost.M1) at each checkpoint of the business process, from Start through the inbound and outbound service checkpoints to End.]
FIGURE 7.7
Classifier performance at different times of the business process.
TABLE 7.4
AUC values rounded to three decimal places for each classifier on each checkpoint dataset. The rank of each algorithm for the
given datasets is shown in parentheses. The best algorithm for each dataset is marked in bold. Sum and average of the
computed ranks are shown at the bottom of the table
Dataset k-NN C4.5 RIPPER MLP SVM Logistic NB TAN K2 Stack. Bag. RF Boost.
Start 0.671 (8) 0.649 (10) 0.623 (11) 0.690 (5) 0.546 (12) 0.692 (3) 0.692 (4) 0.701 (2) 0.666 (9) 0.544 (13) 0.678 (6) 0.728 (1) 0.677 (7)
i_rcs 0.670 (8) 0.647 (10) 0.616 (11) 0.688 (4) 0.546 (12) 0.693 (3) 0.687 (5) 0.697 (2) 0.664 (9) 0.545 (13) 0.676 (7) 0.727 (1) 0.676 (6)
i_dep_1 0.661 (9) 0.647 (10) 0.621 (11) 0.693 (4) 0.548 (12) 0.694 (3) 0.692 (5) 0.702 (2) 0.667 (8) 0.547 (13) 0.679 (6) 0.728 (1) 0.674 (7)
i_rcf_1 0.665 (9) 0.646 (10) 0.619 (11) 0.693 (4) 0.548 (12) 0.693 (3) 0.688 (5) 0.699 (2) 0.665 (8) 0.546 (13) 0.678 (6) 0.726 (1) 0.673 (7)
i_dep_2 0.663 (9) 0.645 (10) 0.618 (11) 0.693 (4) 0.548 (12) 0.693 (3) 0.688 (5) 0.699 (2) 0.664 (8) 0.547 (13) 0.679 (6) 0.727 (1) 0.674 (7)
i_rcf_2 0.662 (9) 0.649 (10) 0.618 (11) 0.693 (4) 0.550 (12) 0.694 (3) 0.688 (5) 0.699 (2) 0.668 (8) 0.548 (13) 0.678 (6) 0.728 (1) 0.674 (7)
i_dep_3 0.662 (9) 0.649 (10) 0.621 (11) 0.693 (4) 0.550 (12) 0.694 (3) 0.688 (5) 0.699 (2) 0.665 (8) 0.548 (13) 0.679 (6) 0.728 (1) 0.673 (7)
i_rcf_3 0.663 (9) 0.647 (10) 0.618 (11) 0.694 (3) 0.550 (12) 0.694 (4) 0.688 (5) 0.700 (2) 0.668 (8) 0.548 (13) 0.679 (6) 0.728 (1) 0.675 (7)
i_dlv 0.765 (10) 0.770 (9) 0.725 (13) 0.819 (3) 0.732 (11) 0.822 (2) 0.790 (8) 0.808 (5) 0.792 (7) 0.727 (12) 0.808 (4) 0.836 (1) 0.794 (6)
o_rcs 0.762 (10) 0.770 (9) 0.725 (13) 0.819 (3) 0.733 (11) 0.822 (2) 0.791 (8) 0.809 (4) 0.795 (6) 0.728 (12) 0.807 (5) 0.836 (1) 0.793 (7)
o_dep_1 0.763 (10) 0.770 (9) 0.726 (13) 0.820 (3) 0.734 (11) 0.824 (2) 0.792 (8) 0.809 (5) 0.792 (7) 0.730 (12) 0.810 (4) 0.838 (1) 0.795 (6)
o_rcf_1 0.763 (10) 0.767 (9) 0.727 (13) 0.821 (3) 0.737 (11) 0.824 (2) 0.790 (8) 0.807 (5) 0.792 (7) 0.732 (12) 0.810 (4) 0.839 (1) 0.797 (6)
o_dep_2 0.763 (10) 0.768 (9) 0.725 (13) 0.822 (3) 0.739 (11) 0.824 (2) 0.791 (8) 0.807 (5) 0.792 (7) 0.734 (12) 0.810 (4) 0.839 (1) 0.797 (6)
o_rcf_2 0.764 (10) 0.765 (9) 0.724 (13) 0.822 (3) 0.744 (11) 0.826 (2) 0.792 (8) 0.808 (5) 0.793 (7) 0.739 (12) 0.811 (4) 0.839 (1) 0.797 (6)
o_dep_3 0.764 (10) 0.765 (9) 0.729 (13) 0.823 (3) 0.744 (11) 0.825 (2) 0.792 (8) 0.808 (5) 0.792 (7) 0.739 (12) 0.809 (4) 0.839 (1) 0.797 (6)
o_rcf_3 0.764 (10) 0.765 (9) 0.727 (13) 0.823 (3) 0.743 (11) 0.824 (2) 0.792 (8) 0.808 (5) 0.793 (7) 0.739 (12) 0.810 (4) 0.840 (1) 0.796 (6)
End 0.957 (11) 0.970 (8) 0.933 (13) 0.996 (2) 0.976 (6) 0.997 (1) 0.949 (12) 0.970 (9) 0.969 (10) 0.976 (7) 0.995 (3) 0.990 (4) 0.987 (5)
Rj 161 160 205 58 190 42 115 64 131 207 85 20 109
Avg. Rank 9.471 9.412 12.059 3.412 11.176 2.471 6.765 3.765 7.706 12.176 5.0 1.176 6.412
[Critical difference diagram over average ranks 1-13: random forest, logistic regression, multilayer perceptron, TAN, bagging, AdaBoost.M1 and naive Bayes appear on the best-ranked side; K2, C4.5, k-NN, SVM, RIPPER and stacking on the worst-ranked side.]
FIGURE 7.8
Critical difference diagram of the results in Table 7.4.
Only a small portion of the tree is shown here. The subtrees that are
not shown in this figure are represented by the nodes labeled with . . .. The
leaves show the expected class label and the distribution of TRUE/FALSE
instances in parentheses.
First of all, we find that the majority of the variables selected near the
root node are of the DLV type. This suggests that DLV variables play an
important role in detecting delays. The complete tree does not contain any
_place variable. Therefore, we could say that the airport does not have much
bearing on delivery performance. We check if i1_dlv_e is greater than 0.47
in the root of the tree and again greater than 0.70 in the right branch. If
the DLV service of the bottleneck leg accounted for more than 70% of the
time taken by the business process, the process is classified as delayed. If
0.47 ≤ i1_dlv_e ≤ 0.70, then the o_rcf _1_e and the o_dlv_e variables are
checked. When the actual times are lower than a computed threshold, the tree
should classify the Delay as FALSE. Similarly, when actual times are higher,
the tree should classify the Delay as TRUE.
The branch on the left of the root contains instances in which the DLV
service in the bottleneck leg does not take too long. For such instances, the tree
then checks whether the DLV service in the outbound transport leg took too
much time. If this is the case (o_dlv_e ≥ 0.65), the tree classifies the business
process as delayed. Let us look at what happens when 0.40 ≤ o_dlv_e ≤ 0.65,
that is, when the actual DLV times are rather long, albeit not long enough
to confirm a delay. In this case, the tree also checks the planned DLV service
times, which tend to behave contrary to the actual times. When the planned
times are lower than a computed threshold, the tree should classify the Delay
as TRUE. Accordingly, when planned times are higher, the tree should classify
the Delay as FALSE. This behavior makes perfect sense: if the tree is not sure
because the actual times are borderline, then the planned times should be
compared against the actual times. However, the last i1_dlv_p check, where
two leaf nodes are created with two and six instances, respectively, does not obey this rule.
FIGURE 7.9
Partial representation of the C4.5 structure. The nodes representing omitted
subtrees are labeled with . . ..
A possible explanation is overfitting, which causes the tree to make this comparison in order to correctly classify those eight instances.
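As a minimal sketch, the decision path just described can be written as nested conditions; the thresholds t_rcf, t_dlv and t_planned are hypothetical placeholders for cut points that belong to subtrees not visible in Fig. 7.9, and combining the two outbound checks with a logical OR is also an assumption:

def classify_delay(x, t_rcf=0.5, t_dlv=0.5, t_planned=0.2):
    # x maps variable names to relative times, e.g. x["i1_dlv_e"]; t_rcf, t_dlv and
    # t_planned are hypothetical stand-ins for cut points hidden in the omitted subtrees.
    if x["i1_dlv_e"] > 0.47:
        if x["i1_dlv_e"] > 0.70:
            return True                              # bottleneck DLV dominates the process
        # borderline bottleneck DLV: check the outbound actual times
        return x["o_rcf_1_e"] > t_rcf or x["o_dlv_e"] > t_dlv
    if x["o_dlv_e"] >= 0.65:
        return True                                  # outbound DLV took too long
    if 0.40 <= x["o_dlv_e"] < 0.65:
        # borderline actual times: a short planned DLV time suggests a delay
        return x["o_dlv_p"] < t_planned
    return False

print(classify_delay({"i1_dlv_e": 0.75, "o_rcf_1_e": 0.1, "o_dlv_e": 0.1, "o_dlv_p": 0.3}))   # True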
Now, one may wonder why the decision tree attached so much importance
to the DLV variables. The DLV variables also produced a sizable performance
increase in Fig. 7.7. We analyzed the behavior of the different types of services
using descriptive statistics of the data for business processes with Delay=TRUE.
We analyzed two different issues: the rate or frequency and the severity of
service violations when there is a business process violation. We say that
a service has been violated when it takes longer than planned. Table 7.5
summarizes the information regarding the rate of service violations. At the top
of the table, we find that DEP and DLV are the most often violated services.
Thus, 84% of DEP services are violated in a business process where there is
a delay. This information suggests that DEP variables are likely to be more
informative than DLV variables. However, the bottom row of the table shows
how many business processes are affected by at least one service violation.
Thus, we find that 99.61% of the delayed business processes have at least one
DLV service violation (in the bottleneck or the outbound transport legs). This
value is close to 99.14% for DEP times. Thus, the rate of DEP and DLV service
violations can be considered to be more or less the same.
However, the rate of service violations does not provide the whole picture.
We should analyze the severity of each violation to understand its effects on
business process delays. To measure the severity of each service violation,
we have to check whether the delay in the service violations accounted for a
significant amount of time with respect to the business process delay. Table 7.6
summarizes this information. We say that a service satisfies condition VX if
the delay in the service execution is at least X% of the total business process
delay, that is,
sd > pd · X/100, with sd, pd > 0,
where sd denotes the service delay as the difference between the actual and
planned times of the service, and pd denotes the business process delay as the
difference between the actual and planned total business process times. From
Table 7.6, top, we find that more than 50% of DLV services accounted for at
least 50% of the total business process delay. In fact, 37.93% of DLV services
suffered a delay equal to or greater than the business process delay. The service
could be delayed longer than the business process if the other services were
faster than planned. This analysis leads us to conclude that the violations of
the DLV services are much more severe than for other services. The bottom
row of the table shows the number of business processes in which there is at
least one service satisfying condition V50 and V100 . We find that at least one
DLV satisfied condition V50 in a remarkable 98.47% of the violated business
processes. This statistic shows that the delays in the DLV services provide a
pretty good explanation of a business process delay. For this reason, the C4.5
tree tends to use its values.
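The condition V_X can be checked with a one-line function; the delays in the usage example are made-up numbers chosen only to illustrate the inequality:

def satisfies_vx(service_delay, process_delay, x):
    # service_delay = actual - planned service time; process_delay = actual - planned
    # total business process time; both must be positive for the condition to apply.
    return service_delay > 0 and process_delay > 0 and service_delay > process_delay * x / 100.0

print(satisfies_vx(120, 200, 50))    # True: the service delay exceeds 50% of the process delay
print(satisfies_vx(120, 200, 100))   # False: it does not exceed 100% of the process delay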
TABLE 7.5
Descriptive statistic of the rate of service violations when Delay=TRUE.
Percentages are shown in parentheses
TABLE 7.6
Descriptive statistics of process violations severity when Delay=TRUE.
Percentages are shown in parentheses
7.4.3.2 RIPPER
The RIPPER algorithm learned the set of rules shown in Table 7.7. The values
used by the rules were rounded to two decimal places for ease of representation.
RIPPER generated 17 rules to try to recognize a delay. If the business process
does not satisfy any of the above rules, the 18th rule is applied, where the
business process is classified as not delayed. There are many similarities between
the results of RIPPER and C4.5. First of all, this set of rules again denotes the
importance of DLV services. The DLV service values are used exclusively in
the first rules and extensively in the rest of the rules. Also, longer actual times
and shorter planned times tend to be classified as delayed in both models. For
this reason, most actual times are compared using a “greater than” sign, while
planned times are compared using a “less than” sign.
The set of rules is more compact than the tree learned using C4.5. This is
a desirable property because, with so few rules, a human can review all rules
one by one to picture the problem.
TABLE 7.7
Set of rules learned by RIPPER
Nr.  Rule
1    IF (o_dlv_e ≥ 0.54) THEN Delay=TRUE (375.0/17.0)
2    IF (i1_dlv_e ≥ 0.56) THEN Delay=TRUE (285.0/6.0)
3    IF (o_dlv_p ≤ 0.19 AND o_dlv_e ≥ 0.33 AND i1_dlv_p ≤ 0.38) THEN Delay=TRUE (56.0/3.0)
4    IF (o_dlv_p ≤ 0.19 AND o_dlv_e ≥ 0.12 AND i1_dlv_e ≥ 0.18) THEN Delay=TRUE (77.0/2.0)
5    IF (i1_dlv_p ≤ 0.23 AND o_dlv_e ≥ 0.47) THEN Delay=TRUE (18.0/2.0)
6    IF (i1_dlv_p ≤ 0.10 AND o_dlv_p ≤ 0.07 AND i1_dlv_e ≥ 0.10 AND i1_rcs_p ≥ 0.17) THEN Delay=TRUE (42.0/3.0)
7    IF (i1_dlv_p ≤ 0.10 AND o_dlv_p ≤ 0.07 AND o_dlv_e ≥ 0.1) THEN Delay=TRUE (33.0/6.0)
8    IF (i1_dlv_e ≥ 0.14 AND o_dlv_e ≥ 0.28 AND i1_dep_1_e ≥ 0.03 AND i1_dlv_p ≤ 0.41) THEN Delay=TRUE (42.0/5.0)
9    IF (i1_dlv_e ≥ 0.31 AND i1_rcs_e ≥ 0.13) THEN Delay=TRUE (44.0/12.0)
10   IF (i1_dlv_p ≤ 0.09 AND o_dlv_e ≥ 0.17 AND o_dlv_p ≤ 0.21) THEN Delay=TRUE (22.0/6.0)
11   IF (i1_dlv_e ≥ 0.14 AND o_dlv_p ≤ 0.08 AND i1_dlv_p ≤ 0.12 AND o_dlv_p ≥ 0.06) THEN Delay=TRUE (15.0/1.0)
12   IF (o_rcs_e ≥ 0.08 AND i1_dlv_e ≥ 0.42) THEN Delay=TRUE (27.0/8.0)
13   IF (o_dlv_p ≤ 0.05 AND o_rcs_p ≥ 0.53 AND o_dep_1_e ≥ 0.02) THEN Delay=TRUE (14.0/0.0)
14   IF (o_dlv_e ≥ 0.18 AND i1_rcs_p ≥ 0.31 AND o_dep_1_p ≥ 0.02) THEN Delay=TRUE (8.0/1.0)
15   IF (i1_dlv_p ≤ 0.07 AND o_dlv_e ≥ 0.24 AND i1_dep_1_e ≤ 0.03) THEN Delay=TRUE (8.0/1.0)
16   IF (i1_rcf_1_e ≥ 0.11 AND i1_dlv_p ≤ 0.11 AND o_dlv_p ≤ 0.20 AND i1_dlv_e ≥ 0.02) THEN Delay=TRUE (10.0/1.0)
17   IF (o_dlv_e ≥ 0.31 AND i1_dlv_e ≥ 0.27) THEN Delay=TRUE (8.0/2.0)
18   IF ∅ THEN Delay=FALSE (2,858.0/40.0)
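Because the rule set is so compact, it can even be transcribed directly into code; the sketch below covers only the first three rules of Table 7.7 and the default rule, with the remaining rules following the same pattern:

def ripper_delay(x):
    # x maps variable names to relative times, e.g. x["o_dlv_e"].
    if x["o_dlv_e"] >= 0.54:                                                        # rule 1
        return True
    if x["i1_dlv_e"] >= 0.56:                                                       # rule 2
        return True
    if x["o_dlv_p"] <= 0.19 and x["o_dlv_e"] >= 0.33 and x["i1_dlv_p"] <= 0.38:     # rule 3
        return True
    # ... rules 4 to 17 follow the same pattern ...
    return False                                                                    # rule 18: default class

print(ripper_delay({"o_dlv_e": 0.60, "i1_dlv_e": 0.10, "o_dlv_p": 0.30, "i1_dlv_p": 0.30}))   # True (rule 1)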
FIGURE 7.10
Partial representation of the TAN structure, showing how the o_dlv variables
are related.
TABLE 7.8
Conditional probability table of o_dlv_e for the TAN classifier
If we know that the business process has not been delayed,
the actual times of o_dlv usually account for less than 23% of the total
business process time. Note that this applies even when the planned time for
o_dlv is in the range (0.21, 1], where there is only a slight increase in the
probability of higher actual times. We would expect there to be a greater
probability of the range of o_dlv_e being [0, 0.08] if o_dlv_p is in the range
[0, 0.07]. Correspondingly, if o_dlv_p is in the range (0.07, 0.21], the most
probable range for o_dlv_e would be (0.08, 0.23]. However, this is not the case
because the greatest probability is for [0, 0.08], suggesting that o_dlv tends to
take less time when the business process is not delayed.
If Delay=TRUE, there is a greater probability of the values of o_dlv_e
being higher, as expected. There is again a dependence between o_dlv_p and
o_dlv_e because when o_dlv_p increases then o_dlv_e tends to increase.
The pairwise comparison of the rows with equal o_dlv_p with Delay=FALSE
FIGURE 7.11
Markov blanket structure of the Delay variable in the K2 algorithm.
                         o_dlv_p
o_dep_1_p      [0, 0.068]    (0.068, 0.214]    (0.214, 1]
[0, 0.026]     0.155         0.264             0.581
(0.026, 1]     0.435         0.340             0.225
Again, the CFS method selected three out of four DLV variables. Also, the selected subset included some DEP variables.
TABLE 7.10
Ranking of information gain for the 10 best variables
This could be due to the
high impact that they have on business process delays, as shown in Table 7.5.
However, the selection of the variable o_rcf _3_place is noteworthy, because
this variable was not used by the interpretable models in Section 7.4.3.
As an illustrative example, Table 7.11 shows the AUC values for the
classifiers in Section 7.4.1 after applying the above feature subset selection
procedures. The column labeled Full includes the results with the entire
dataset for comparison. The performance of the unfiltered dataset was better
for seven out of 13 classifiers, whereas the information gain and CFS methods
performed better for five and one classifiers, respectively. Note that classifier
parametrization was driven by the unfiltered dataset. Therefore, it is expected
to have an impact on the performance of the filtered datasets. However, the
application of feature subset selection improved classifier performance for
almost half of the classifiers. Therefore, this example shows that feature subset
selection procedures are useful.
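A univariate filter analogous to the information-gain ranking can be sketched with scikit-learn (which has no CFS implementation, so only the ranking step is shown); the synthetic stand-in data are an assumption:

from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, mutual_info_classif

# Stand-in data; in the case study X would hold the Cargo 2000 variables and y the Delay label.
X, y = make_classification(n_samples=500, n_features=98, n_informative=10, random_state=0)

# Univariate filter: keep the 10 highest-scoring variables.
selector = SelectKBest(score_func=mutual_info_classif, k=10).fit(X, y)
print(selector.transform(X).shape)    # (500, 10)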
insights into how the business processes unfold. Not every supervised classifier
is suitable for a qualitative analysis. Therefore, we compared only the most
promising classifiers: classification trees, rules and Bayesian classifiers.
An online classification procedure was shown where each business process
classification is updated when new information about service completion is
received. The addition of actual times increased classifier performance as
expected. However, performance only increased for the key service executions.
The qualitative comparison of classifiers also found key service executions to
be important. Thus, this case study proves that supervised machine learning
classification algorithms are applicable in the distribution industry and play
an important role in detecting weak points in business processes.
we would consider that the business process is currently at the i_rcs checkpoint,
even though we have more information for the i1 transport leg. This is especially
relevant if the i1 transport leg is the bottleneck transport leg. This is not,
fortunately, an overly common scenario because the bottleneck transport
leg is usually the slowest throughout the entire business process. Even so,
an expected bottleneck transport leg could happen to be faster than other
inbound transport legs or faster for some services and then slower for others
later on.
The algorithm tuning is another potential area for improvement. Some
tools, such as AutoWeka (Kotthoff et al., 2017), could help to search for
optimal classifier parameters. Such tools have some limitations (e.g., they
cannot optimize all parameters). However, they can reduce the amount of time
required to find a reasonably good solution.
However, the comparison between the planned and actual times appears
to be reasonable for delay detection and was confirmed by C4.5. The feature
subset selection procedure could potentially be improved by evaluating pairs
of features composed of the planned and actual times of each service execution.
Therefore, the information gain for a service execution, say o_rcs, could be
computed jointly over the pair of variables formed by its planned and actual times, (o_rcs_p, o_rcs_e).
Bibliography

Abraham, A., Pedregosa, F., Eickenberg, M., Gervais, P., Mueller, A., Kossaifi,
J., Gramfort, A., Thirion, B., and Varoquaux, G. (2014). Machine learning
for neuroimaging with scikit-learn. Frontiers in Neuroinformatics, 8:Article
14.
Acevedo, F., Jiménez, J., Maldonado, S., Domínguez, E., and A, N. (2007).
Classification of wines produced in specific regions by UV-visible spectroscopy
combined with support vector machines. Journal of Agricultural and Food
Chemistry, 55:6842–6849.
Aggarwal, C., Han, J., Wang, J., and Yu, P. (2004). A framework for projected
clustering evolving data streams. In Proceedings of the 29th International
Conference on Very Large Data Bases, pages 81–92.
Aggarwal, C., Han, J., Wang, J., and Yu, P. (2006). A framework for on-demand
classification of evolving data streams. IEEE Transactions on Knowledge
and Data Engineering, 18(5):577–589.
Agresti, A. (2013). Categorical Data Analysis. Wiley.
Aha, D., Kibler, D., and Albert, M. (1991). Instance-based learning algorithms.
Machine Learning, 6(1):37–66.
Akaike, H. (1974). A new look at the statistical model identification. IEEE
Transactions on Automatic Control, 19(6):716–723.
Akyildiz, I. F., Su, W., Sankarasubramaniam, Y., and Cayirci, E. (2002).
Wireless sensor networks: A survey. Computer Networks, 38(4):393–422.
Ali, A., Shah, G. A., Farooq, M. O., and Ghani, U. (2017). Technologies and
challenges in developing machine-to-machine applications: A survey. Journal
of Network and Computer Applications, 83:124–139.
Alippi, C., Braione, P., Piuri, V., and Scotti, F. (2001). A methodological
approach to multisensor classification for innovative laser material processing
units. In Proceedings of the 18th IEEE Instrumentation and Measurement
Technology Conference, volume 3, pages 1762–1767. IEEE Press.
Arias, M., Díez, F., M.A. Palacios-Alonso, M. Y., and Fernández, J. (2012).
Atienza, D., Bielza, C., and Larrañaga, P. (2016). Anomaly detection with
a spatio-temporal tracking of the laser spot. In Frontiers in Artificial
Intelligence and Applications Series, volume 284, pages 137–142. IOS Press.
Awoyemi, J., Adelunmbi, A., and Oluwadare, S. (2017). Credit card fraud
detection using machine learning techniques: A comparative analysis. In
2017 International Conference on Computing Networking and Informatics,
pages 1–9. IEEE Press.
Ayer, T., Alagoz, O., Chhatwal, J., Shavlik, J., Kahn, C., and Burnside, E.
(2010). Breast cancer risk estimation with artificial neural networks revisited.
Cancer, 116:3310–3321.
Babu, D. K., Ramadevi, Y., and Ramana, K. (2017). RGNBC: Rough Gaussian
naive Bayes classifier for data stream classification with recurring concept
drift. Arabian Journal for Science and Engineering, 42:705–714.
Baheti, R. and Gill, H. (2011). Cyber-physical systems. The Impact of Control
Technology, 12:161–166.
Bakhshipour, A., Sanaeifar, A., Payman, S., and de la Guardia, M. (2018).
Evaluation of data mining strategies for classification of black tea based on
image-based features. Food Analytical Methods, 11(4):1041–1050.
Ban, G.-Y., El Karoui, N., and Lim, A. E. B. (2016). Machine learning and
portfolio optimization. Management Science, 64(3):1136–1154.
Bar-Yossef, Z. and Mashiach, L.-T. (2008). Local approximation of PageRank
and Reverse PageRank. In Proceedings of the 17th ACM Conference on
Information and Knowledge Management, pages 279–288. ACM.
Barber, D. and Cemgil, T. (2010). Graphical models for time series. IEEE
Signal Processing Magazine, 27(6):18–28.
Baum, L., Petrie, T., Soules, G., and Weiss, N. (1970). A maximization
technique occurring in the statistical analysis of probabilistic functions of
Markov chains. The Annals of Mathematical Statistics, 41(1):164–171.
Bellman, R. E. (1957). Dynamic Programming. Princeton University Press.
Ben-Hur, A. and Weston, J. (2010). A user’s guide to support vector machines.
In Data Mining Techniques for the Life Sciences, volume 609, pages 223–239.
Humana Press.
Bennett, R. G. (1985). Computer integrated manufacturing. Plastic World,
43(6):65–68.
Bernick, J. (2015). The role of machine learning in drug design and delivery.
Journal of Developing Drugs, 4(3):1–2.
Blanco, R., Inza, I., Merino, M., Quiroga, J., and Larrañaga, P. (2005). Feature
selection in Bayesian classifiers for the prognosis of survival of cirrhotic
patients treated with TIPS. Journal of Biomedical Informatics, 38(5):376–
388.
Böcker, A., Derksen, S., Schmidt, E., Teckentrup, A., and Schneider, G. (2005).
A hierarchical clustering approach for large compound libraries. Journal of
Chemical Information and Modeling, 45(4):807–815.
Breiman, L., Friedman, J., Olshen, R., and Stone, C. (1984). Classification
and Regression Trees. Wadsworth Press.
Breunig, M. M., Kriegel, H.-P., Ng, R. T., and Sander, J. (2000). LOF:
identifying density-based local outliers. In Proceedings of the 2000 ACM
SIGMOD International Conference on Management of Data, pages 93–104.
ACM.
Brier, G. (1950). Verification of forecasts expressed in terms of probability.
Monthly Weather Review, 78:1–3.
Buczak, A. L. and Guven, E. (2016). A survey of data mining and machine
learning methods for cyber security intrusion detection. IEEE Communica-
tions Surveys Tutorials, 18(2):1153–1176.
Cartella, F., Lemeire, J., Dimiccoli, L., and Sahli, H. (2015). Hidden semi-
Markov models for predictive maintenance. Mathematical Problems in
Engineering, 2015:1–23.
Catlett, J. (1991). On changing continuous attributes into ordered discrete
attributes. In Proceedings of the European Working Session on Learning,
pages 164–178.
Celtikci, E. (2017). A systematic review on machine learning in neuro-
surgery: The future of decision-making in patient care. Turkish Neurosurgery,
28(2):167–173.
Chen, N., Ribeiro, B., and Chen, A. (2016). Financial credit risk assessment:
A recent review. Artificial Intelligence Review, 45(1):1–23.
Chen, Z., Li, Y., Xia, T., and Pan, E. (2018). Hidden Markov model with
auto-correlated observations for remaining useful life prediction and optimal
maintenance policy. Reliability Engineering and System Safety, In press.
Chong, M., Abraham, A., and Paprzycki, M. (2005). Traffic accident analysis
using machine learning paradigms. Informatica, 29:89–98.
Ciccio, C. D., van der Aa, H., Cabanillas, C., Mendling, J., and Prescher, J.
(2016). Detecting flight trajectory anomalies and predicting diversions in
freight transportation. Decision Support Systems, 88:1–17.
Dempster, A., Laird, N., and Rubin, D. (1977). Maximum likelihood from
incomplete data via the EM algorithm. Journal of the Royal Statistical
Society, Series B, 39(1):1–38.
Demšar, J. (2006). Statistical comparisons of classifiers over multiple data sets.
Journal of Machine Learning Research, 7:1–30.
Diaz, J., Bielza, C., Ocaña, J. L., and Larrañaga, P. (2016). Development
of a cyber-physical system based on selective Gaussian naïve Bayes model
for a self-predict laser surface heat treatment process control. In Machine
Learning for Cyber Physical Systems, pages 1–8. Springer.
Diaz-Rozo, J., Bielza, C., and Larrañaga, P. (2017). Machine learning-based
CPS for clustering high throughput machining cycle conditions. Procedia
Manufacturing, 10:997–1008.
Diehl, C. P. and Hampshire, J. B. (2002). Real-time object classification and
novelty detection for collaborative video surveillance. In Proceedings of the
2002 International Joint Conference on Neural Networks, volume 3, pages
2620–2625. IEEE Press.
d’Ocagne, M. (1885). Coordonnées Parallèles et Axiales: Méthode de Transfor-
mation Géométrique et Procédé Nouveau de Calcul Graphique Déduits de la
Considération des Coordonnées Parallèles. Gauthier-Villars.
Doksum, K. and Høyland, A. (1992). Models for variable-stress accelerated
life testing experiments based on Wiener processes and the inverse Gaussian
distribution. Technometrics, 34:74–82.
Domingos, P. and Hulten, G. (2000). Mining high-speed data streams. In
Proceedings of the 6th ACM SIGKDD International Conference on Knowledge
Discovery and Data Mining, pages 71–80.
Dong, S. and Luo, T. (2013). Bearing degradation process prediction based on
the PCA and optimized LS-SVM model. Measurement, 46:3143–3152.
Dorronsoro, J. R., Ginel, F., Sánchez, C., and Cruz, C. S. (1997). Neural fraud
detection in credit card operations. IEEE Transactions on Neural Networks,
8(4):827–834.
Druzdzel, M. (1999). SMILE: Structural modeling, inference, and learning
engine and GeNIe: A development environment for graphical decision-theoretic
models. In Proceedings of the 16th National Conference on Artificial
Intelligence (AAAI), pages 902–903. Morgan Kaufmann.
Dua, S., Acharya, U., and Dua, P. (2013). Machine Learning in Healthcare
Informatics. Springer.
Dunn, O. J. (1961). Multiple comparisons among means. Journal of the
American Statistical Association, 56(293):52–64.
Fei-Fei, L., Fergus, R., and Perona, P. (2006). One-shot learning of object
categories. IEEE Transactions on Pattern Analysis and Machine Intelligence,
28(4):594–611.
Figueiredo, M. and Jain, A. K. (2002). Unsupervised learning of finite mixture
models. IEEE Transactions on Pattern Analysis and Machine Intelligence,
24(3):381–396.
Florek, K., Łukaszewicz, J., Perkal, J., Steinhaus, H., and Zubrzycki, S. (1951).
Sur la liaison et la division des points d’un ensemble fini. Colloquium
Mathematicum, 2:282–285.
Flores, M. and Gámez, J. (2007). A review on distinct methods and approaches
to perform triangulation for Bayesian networks. In Advances in Probabilistic
Graphical Models, pages 127–152. Springer.
Flores, M. J., Gámez, J., Martínez, A., and Salmerón, A. (2011). Mixture
of truncated exponentials in supervised classification: Case study for the
naive Bayes and averaged one-dependence estimators classifiers. In 11th
International Conference on Intelligent Systems Design and Applications,
pages 593–598. IEEE Press.
Foley, A. M., Leahy, P. G., Marvuglia, A., and McKeogh, E. J. (2012). Current
methods and advances in forecasting of wind power generation. Renewable
Energy, 37(1):1–8.
Forgy, E. (1965). Cluster analysis of multivariate data: Efficiency versus
interpretability of classifications. Biometrics, 21:768–769.
Fournier, F. A., McCall, J., Petrovski, A., and Barclay, P. J. (2010). Evolved
Bayesian network models of rig operations in the Gulf of Mexico. In IEEE
Congress on Evolutionary Computation, pages 1–7. IEEE Press.
Freeman, L. C. (1977). A set of measures of centrality based on betweenness.
Sociometry, 40(1):35–41.
Freund, Y. and Schapire, R. (1997). A decision-theoretic generalization of
on-line learning and an application to boosting. Journal of Computer and
System Sciences, 55(1):119–139.
Frey, B. J. and Dueck, D. (2007). Clustering by passing messages between
data points. Science, 315:972–976.
Friedman, M. (1937). The use of ranks to avoid the assumption of normality
implicit in the analysis of variance. Journal of the American Statistical
Association, 32(200):675–701.
Friedman, N. (1998). The Bayesian structural EM algorithm. In Proceedings of
the 14th Conference on Uncertainty in Artificial Intelligence, pages 129–138.
Morgan Kaufmann.
Friedman, N., Geiger, D., and Goldszmidt, M. (1997). Bayesian network
classifiers. Machine Learning, 29:131–163.
Friedman, N., Goldszmidt, M., and Lee, T. (1998a). Bayesian network
classification with continuous attributes: Getting the best of both
discretization and parametric fitting. In Proceedings of the 15th International
Conference on Machine Learning, pages 179–187.
Friedman, N., Linial, M., Nachman, I., and Pe’er, D. (2000). Using Bayesian
networks to analyze expression data. Journal of Computational Biology,
7(3-4):601–620.
Friedman, N., Murphy, K., and Russell, S. (1998b). Learning the structure of
dynamic probabilistic networks. In Proceedings of the 14th Conference on
Uncertainty in Artificial Intelligence, pages 139–147. Morgan Kaufmann.
Frigyik, B. A., Kapila, A., and Gupta, M. (2010). Introduction to the Dirichlet
distribution and related processes. Technical Report, University of Washing-
ton.
Frutos-Pascual, M. and García-Zapirain, B. (2017). Review of the use of AI
techniques in serious games: Decision making and machine learning. IEEE
Transactions on Computational Intelligence and AI in Games, 9(2):133–152.
Fung, R. and Chang, K.-C. (1990). Weighing and integrating evidence for
stochastic simulation in Bayesian networks. In Proceedings of the 6th Con-
ference on Uncertainty in Artificial Intelligence, pages 209–220. Elsevier.
Fürnkranz, J. and Widmer, G. (1994). Incremental reduced error pruning. In
Machine Learning: Proceedings of the 11th Annual Conference, pages 70–77.
Morgan Kaufmann.
Gabilondo, A., Domínguez, J., Soriano, C., and Ocaña, J. (2015). Method and
system for laser hardening of a surface of a workpiece. US20150211083A1
patent.
Galán, S., Arroyo-Figueroa, G., Díez, F., and Sucar, L. (2007). Comparison
of two types of event Bayesian networks: A case study. Applied Artificial
Intelligence, 21(3):185–209.
Gama, J. (2010). Knowledge Discovery from Data Streams. CRC Press.
Gama, J., Sebastião, R., and Rodrigues, P. (2013). On evaluating stream
learning algorithms. Machine Learning, 90(3):317–346.
Gao, S. and Lei, Y. (2017). A new approach for crude oil price prediction
based on stream learning. Geoscience Frontiers, 8:183–187.
García, S., Derrac, J., Cano, J., and Herrera, F. (2012). Prototype selection
for nearest neighbor classification: Taxonomy and empirical study. IEEE
Transactions on Pattern Analysis and Machine Intelligence, 34(3):417–435.
García, S. and Herrera, F. (2008). An extension on “Statistical Comparisons
of Classifiers over Multiple Data Sets” for all pairwise comparisons. Journal
of Machine Learning Research, 9:2677–2694.
Geiger, D. and Heckerman, D. (1996). Knowledge representation and inference
in similarity networks and Bayesian multinets. Artificial Intelligence, 82:45–
74.
Geng, X., Liang, H., Yu, B., Zhao, P., He, L., and Huang, R. (2017). A scenario-
adaptive driving behavior prediction approach to urban autonomous driving.
Applied Sciences, 7:Article 426.
Gevaert, O., De Smet, F., Timmerman, D., Moreau, Y., and De Moor, B.
(2006). Predicting the prognosis of breast cancer by integrating clinical and
microarray data with Bayesian networks. Bioinformatics, 22(14):184–190.
Inman, R. H., Pedro, H. T., and Coimbra, C. F. (2013). Solar forecasting meth-
ods for renewable energy integration. Progress in Energy and Combustion
Science, 39(6):535–576.
Inza, I., Larrañaga, P., Blanco, R., and Cerrolaza, A. (2004). Filter versus
wrapper gene selection approaches in DNA microarray domains. Artificial
Intelligence in Medicine, 31(2):91–103.
Jäger, M., Knoll, C., and Hamprecht, F. A. (2008). Weakly supervised learning
of a classifier for unusual event detection. IEEE Transactions on Image
Processing, 17(9):1700–1708.
Jain, A. K. (2010). Data clustering: 50 years beyond K-means. Pattern
Recognition Letters, 31(8):651–666.
Japkowicz, N. and Shah, M. (2011). Evaluating Learning Algorithms: A
Classification Perspective. Cambridge University Press.
Jiang, F., Jiang, Y., Zhi, H., Dong, Y., Li, H., Ma, S., Wang, Y., Dong, Q.,
Shen, H., and Wang, Y. (2017). Artificial intelligence in healthcare: Past,
present and future. Stroke and Vascular Neurology, e000101.
John, G. H., Kohavi, R., and Pfleger, K. (1994). Irrelevant features and the
subset selection problem. In Proceedings of the 11th International Conference
on Machine Learning, pages 121–129. Morgan Kaufmann.
Judson, R., Elloumi, F., Setzer, R. W., Li, Z., and Shah, I. (2008). A comparison
of machine learning algorithms for chemical toxicity classification using a
simulated multi-scale data model. BMC Bioinformatics, 9:241.
Kagermann, H., Wahlster, W., and Helbig, J. (2013). Securing the future of
German manufacturing industry. Recommendations for Implementing the
Strategic Initiative INDUSTRIE 4.0. Technical report, National Academy
of Science and Engineering (ACATECH).
Kamp, B., Ochoa, A., and Diaz, J. (2017). Smart servitization within the
context of industrial user–supplier relationships: contingencies according to
a machine tool manufacturer. International Journal on Interactive Design
and Manufacturing, 11(3):651–663.
Kavakiotis, I., Tsave, O., Salifoglou, A., Maglaveras, N., Vlahavas, I., and
Chouvarda, I. (2017). Machine learning and data mining methods in diabetes
research. Computational and Structural Biotechnology Journal, 15:104–116.
Kaynak, C. and Alpaydin, E. (2000). Multistage cascading of multiple clas-
sifiers: One man’s noise is another man’s data. In Proceedings of the 17th
International Conference on Machine Learning, pages 455–462. Morgan
Kaufmann.
Kearns, M. and Nevmyvaka, Y. (2013). Machine learning for market mi-
crostructure and high frequency trading. In High Frequency Trading. New
Realities for Traders, Markets and Regulators, pages 1–21. Risk Books.
Keogh, E. and Pazzani, M. (2002). Learning the structure of augmented
Bayesian classifiers. International Journal on Artificial Intelligence Tools,
11(4):587–601.
Kezunovic, M., Obradovic, Z., Dokic, T., Zhang, B., Stojanovic, J., Dehghanian,
P., and Chen, P.-C. (2017). Predicting spatiotemporal impacts of weather
on power systems using big data science. In Data Science and Big Data: An
Environment of Computational Intelligence, pages 265–299. Springer.
Khare, A., Jeon, M., Sethi, I., and Xu, B. (2017). Machine learning theory and
applications for healthcare. Journal of Healthcare Engineering, ID 5263570.
Kim, D., Kang, P., Cho, S., Lee, H., and Doh, S. (2012). Machine learning-based
novelty detection for faulty wafer detection in semiconductor manufacturing.
Expert Systems with Applications, 39(4):4075–4083.
Kim, J. and Pearl, J. (1983). A computational model for combined causal
and diagnostic reasoning in inference systems. In Proceedings of the 8th
International Joint Conference on Artificial Intelligence, volume 1, pages
190–193.
Klaine, P. V., Imran, M. A., Onireti, O., and Souza, R. D. (2017). A survey
of machine learning techniques applied to self-organizing cellular networks.
IEEE Communications Surveys and Tutorials, 19(4):2392–2431.
Kleinrock, L. (1961). Information Flow in Large Communication Nets. PhD
thesis, MIT.
Kohavi, R. (1996). Scaling up the accuracy of naive-Bayes classifiers: A
decision-tree hybrid. In Proceedings of the 2nd International Conference on
Knowledge Discovery and Data Mining, pages 202–207.
Koller, D. and Friedman, N. (2009). Probabilistic Graphical Models: Principles
and Techniques. The MIT Press.
Koller, D. and Sahami, M. (1996). Toward optimal feature selection. In
Proceedings of the 13th International Conference on Machine Learning,
pages 284–292.
Kotthoff, L., Thornton, C., Hoos, H. H., Hutter, F., and Leyton-Brown, K.
(2017). Auto-WEKA 2.0: Automatic model selection and hyperparameter
optimization in WEKA. Journal of Machine Learning Research, 18(25):1–5.
Kourou, K., Exarchos, T., Exarchos, K. P., Karamouzis, M., and Fotiadis, D.
(2015). Machine learning applications in cancer prognosis and prediction.
Computational and Structural Biotechnology Journal, 13:8–17.
Kowalski, J., Krawczyk, B., and Woźniak, M. (2017). Fault diagnosis of
marine 4-stroke diesel engines using a one-vs-one extreme learning ensemble.
Engineering Applications of Artificial Intelligence, 57:134–141.
Kraska, T., Beutel, A., Chi, E. H., Dean, J., and Polyzotis, N. (2017). The
case for learned index structures. ArXiv 1712.01208.
Kruskal, J. B. (1956). On the shortest spanning subtree of a graph and the
traveling salesman problem. Proceedings of the American Mathematical
Society, 7(1):48–50.
Kuncheva, L. (2004). Combining Pattern Classifiers: Methods and Algorithms.
Wiley-Interscience.
Kurtz, A. (1948). A research test of the Rorschach test. Personnel Psychology,
1:41–53.
Lafaye de Micheaux, P., Drouilhet, R., and Liquet, B. (2013). The R Software.
Fundamentals of Programming and Statistical Analysis. Springer.
Landhuis, E. (2017). Big brain, big data. Nature, 541:559–561.
Landwehr, N., Hall, M., and Frank, E. (2003). Logistic model trees. Machine
Learning, 59(1-2):161–205.
Lee, J., Kao, H.-A., and Yang, S. (2014). Service innovation and smart analytics
for industry 4.0 and big data environment. Procedia CIRP, 16:3–8.
Leite, D., Costa, P., and Gomide, F. (2010). Evolving granular neural network
for semi-supervised data stream classification. In The 2010 International
Joint Conference on Neural Networks, pages 1–8.
Lessmann, S., Baesens, B., Seow, H.-V., and Thomas, L. C. (2015). Benchmark-
ing state-of-the-art classification algorithms for credit scoring: An update of
research. European Journal of Operational Research, 247:124–136.
Lewis, P. (1962). The characteristic selection problem in recognition systems.
IRE Transactions on Information Theory, 8:171–178.
Li, H., Liang, Y., and Xu, Q. (2009). Support vector machines and its
applications in chemistry. Chemometrics and Intelligent Laboratory Systems,
95(2):188–198.
Li, H. and Zhu, X. (2004). Application of support vector machine method
in prediction of Kappa number of kraft pulping process. In Proceedings of
the Fifth World Congress on Intelligent Control and Automation, volume 4,
pages 3325–3330.
Li, K., Zhang, X., Leung, J. Y.-T., and Yang, S.-L. (2016). Parallel ma-
chine scheduling problems in green manufacturing industry. Journal of
Manufacturing Systems, 38:98–106.
Li, S., Xu, L. D., and Wang, X. (2013). Compressed sensing signal and
data acquisition in wireless sensor networks and Internet of Things. IEEE
Transactions on Industrial Informatics, 9(4):2177–2186.
Li, Y. (2017). Backorder prediction using machine learning for Danish craft
beer breweries. Master’s thesis, Aalborg University.
Lima, A., Philot, E., Trossini, G., Scott, L., Maltarollo, V., and Honorio, K.
(2016). Use of machine learning approaches for novel drug discovery. Expert
Opinion on Drug Discovery, 11(3):225–239.
Lin, S.-C. and Chen, K.-C. (2016). Statistical QoS control of network coded
multipath routing in large cognitive machine-to-machine networks. IEEE
Internet of Things Journal, 3(4):619–627.
Lin, S.-W., Crawford, M., and Mellor, S. (2017). The Industrial Internet of
Things Reference Architecture. Technical Report Volume G1, Industrial
Internet Consortium.
Lipton, Z. C. (2016). The mythos of model interpretability. In ICML Workshop
on Human Interpretability in Machine Learning, pages 96–100.
Liu, H., Hussain, F., Tan, C., and Dash, M. (2002). Discretization: An enabling
technique. Data Mining and Knowledge Discovery, 6(4):393–423.
Liu, J., Seraoui, R., Vitelli, V., and Zio, E. (2013). Nuclear power plant
components condition monitoring by probabilistic support vector machine.
Annals of Nuclear Energy, 56:23–33.
Liu, Y., Li, S., Li, F., Song, L., and Rehg, J. (2015). Efficient learning of
continuous-time hidden Markov models for disease progression. Advances in
Neural Information Processing Systems, 28:3600–3608.
Lu, C. and Meeker, W. (1993). Using degradation measures to estimate a
time-to-failure distribution. Technometrics, 35(2):161–174.
Lusted, L. (1960). Logical analysis in roentgen diagnosis. Radiology, 74:178–193.
Madsen, A., Jensen, F., Kjærulff, U., and Lang, M. (2005). The HUGIN
tool for probabilistic graphical models. International Journal of Artificial
Intelligence Tools, 14(3):507–543.
Malamas, E. N., Petrakis, E. G., Zervakis, M., Petit, L., and Legat, J.-D.
(2003). A survey on industrial vision systems, applications and tools. Image
and Vision Computing, 21(2):171–188.
Maltarollo, V., Gertrudes, J., Oliveira, P., and Honorio, K. (2015). Applying
machine learning techniques for ADME-Tox prediction: A review. Expert
Opinion on Drug Metabolism & Toxicology, 11(2):259–271.
McEliece, R. J., MacKay, D. J. C., and Cheng, J.-F. (1998). Turbo decoding
as an instance of Pearl’s “belief propagation” algorithm. IEEE Journal on
Selected Areas in Communications, 16(2):140–152.
McLachlan, G. and Krishnan, T. (1997). The EM Algorithm and Extensions.
Wiley.
McLachlan, G. and Peel, D. (2004). Finite Mixture Models. John Wiley &
Sons.
Mengistu, A. D., Alemayehu, D., and Mengistu, S. (2016). Ethiopian coffee
plant diseases recognition based on imaging and machine learning techniques.
International Journal of Database Theory and Application, 9(4):79–88.
Metzger, A., Leitner, P., Ivanović, D., Schmieders, E., Franklin, R., Carro,
M., Dustdar, S., and Pohl, K. (2015). Comparing and combining predictive
business process monitoring techniques. IEEE Transactions on Systems,
Man, and Cybernetics: Systems, 45(2):276–290.
Michalski, R. S. and Chilausky, R. (1980). Learning by being told and learning
from examples: An experimental comparison of the two methods of knowledge
acquisition in the context of developing an expert system for soybean disease
diagnosis. International Journal of Policy Analysis and Information Systems,
4:125–160.
Minsky, M. (1961). Steps toward artificial intelligence. Proceedings of the
IRE, 49:8–30.
Natarajan, P., Frenzel, J., and Smaltz, D. (2017). Demystifying Big Data and
Machine Learning for Healthcare. CRC Press.
National Academy of Sciences and The Royal Society (2017). The Frontiers
of Machine Learning. The National Academies Press.
Navarro, P., Fernández, C., Borraz, R., and Alonso, D. (2017). A machine
learning approach to pedestrian detection for autonomous vehicles using
high-definition 3D range data. Sensors, 17:Article 18.
Nectoux, P., Gouriveau, R., Medjaher, K., Ramasso, E., Morello, B., Zerhouni,
N., and Varnier, C. (2012). PRONOSTIA: An experimental platform for
bearings accelerated life test. IEEE International Conference on Prognostics
and Health Management, pages 1–8.
Newman, T. S. and Jain, A. K. (1995). A survey of automated visual inspection.
Computer Vision and Image Understanding, 61(2):231–262.
Nguyen, H.-L., Woon, Y.-K., and Ng, W.-K. (2015). A survey on data stream
clustering and classification. Knowledge and Information Systems, 45:535–569.
Niu, D., Wang, Y., and Wu, D. D. (2010). Power load forecasting using
support vector machine and ant colony optimization. Expert Systems with
Applications, 37(3):2531–2539.
Nodelman, U., Shelton, C., and Koller, D. (2002). Continuous time Bayesian
networks. In Proceedings of the 18th Conference on Uncertainty in Artificial
Intelligence, pages 378–387.
Nwiabu, N. and Amadi, M. (2017). Building a decision support system for
crude oil price prediction using Bayesian networks. American Scientific
Research Journal for Engineering, Technology, and Sciences, 38(2):1–17.
O’Callaghan, L., Mishra, N., Meyerson, A., Guha, S., and Motwani, R. (2002).
Streaming-data algorithms for high-quality clustering. In Proceedings of the
18th International Conference on Data Engineering, pages 685–694.
Ogbechie, A., Díaz-Rozo, J., Larrañaga, P., and Bielza, C. (2017). Dynamic
Bayesian network-based anomaly detection for in-process visual inspection
of laser surface heat treatment. In Machine Learning for Cyber Physical
Systems, pages 17–24. Springer.
Olesen, J., Gustavsson, A., Svensson, M., Wittchen, H., and Jönsson, B. (2012).
The economic cost of brain disorders in Europe. European Journal of
Neurology, 19(1):155–162.
Onisko, A. and Austin, R. (2015). Dynamic Bayesian network for cervical
cancer screening. In Biomedical Knowledge Representation, pages 207–218.
Springer.
Oza, N. and Russell, S. (2005). Online bagging and boosting. In 2005 IEEE
International Conference on Systems, Man and Cybernetics, pages 2340–
2345.
Page, L., Brin, S., Motwani, R., and Winograd, T. (1999). The PageRank
citation ranking: Bringing order to the web. Technical report, Stanford
InfoLab.
Pardakhti, M., Moharreri, E., Wanik, D., Suib, S., and Srivastava, R. (2017).
Machine learning using combined structural and chemical descriptors for
prediction of methane adsorption performance of metal organic frameworks
(MOFs). ACS Combinatorial Science, 19(10):640–645.
Park, K., Ali, A., Kim, D., An, Y., Kim, M., and Shin, H. (2013). Robust
predictive model for evaluating breast cancer survivability. Engineering
Applications of Artificial Intelligence, 26:2194–2205.
Park, S., Lee, J., and Son, Y. (2016). Predicting market impact costs
using nonparametric machine learning models. PLOS ONE, 11(2):e0150243.
Parzen, E. (1962). On estimation of a probability density function and mode.
The Annals of Mathematical Statistics, 33(3):1065–1076.
Pazzani, M. (1996). Constructive induction of Cartesian product attributes.
In Proceedings of the Information, Statistics and Induction in Science Con-
ference, pages 66–77.
Pazzani, M. and Billsus, D. (1997). Learning and revising user profiles: The
identification of interesting web sites. Machine Learning, 27:313–331.
Pearl, J. (1982). Reverend Bayes on inference engines: A distributed hierarchical
approach. In Proceedings of the 2nd National Conference on Artificial
Intelligence, pages 133–136. AAAI Press.
Pearl, J. (1987). Evidential reasoning using stochastic simulation of causal
models. Artificial Intelligence, 32(2):245–257.
Pearl, J. (1988). Probabilistic Reasoning in Intelligent Systems. Morgan
Kaufmann.
Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O.,
Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos,
A., Cournapeau, D., Brucher, M., Perrot, M., and Duchesnay, E. (2011).
Scikit-learn: Machine learning in Python. Journal of Machine Learning
Research, 12:2825–2830.
Peng, H., Long, F., and Ding, C. (2005). Feature selection based on mutual in-
formation: Criteria of max-dependency, max-relevance, and min-redundancy.
IEEE Transactions on Pattern Analysis and Machine Intelligence, 27(8):1226–
1238.
Pérez, A., Larrañaga, P., and Inza, I. (2006). Supervised classification with
conditional Gaussian networks: Increasing the structure complexity from
naive Bayes. International Journal of Approximate Reasoning, 43:1–25.
Pérez, A., Larrañaga, P., and Inza, I. (2009). Bayesian classifiers based on kernel
density estimation: Flexible classifiers. International Journal of Approximate
Reasoning, 50:341–362.
Petropoulos, A., Chatzis, S., and Xanthopoulos, S. (2017). A hidden Markov
model with dependence jumps for predictive modeling of multidimensional
time-series. Information Sciences, 412-413:50–66.
Pimentel, M. A., Clifton, D. A., Clifton, L., and Tarassenko, L. (2014). A
review of novelty detection. Signal Processing, 99:215–249.
Pizarro, J., Guerrero, E., and Galindo, P. L. (2002). Multiple comparison
procedures applied to model selection. Neurocomputing, 48(1):155–173.
Platt, J. (1999). Fast training of support vector machines using sequential
minimal optimization. In Advances in Kernel Methods - Support Vector
Learning, pages 185–208. The MIT Press.
Pokrajac, D., Lazarevic, A., and Latecki, L. J. (2007). Incremental local
outlier detection for data streams. In IEEE Symposium on Computational
Intelligence and Data Mining, 2007, pages 504–515. IEEE Press.
PricewaterhouseCoopers (2017). Innovation for the earth. Technical Report
161222-113251-LA-OS, World Economic Forum, Davos.
Qian, Y., Yan, R., and Hu, S. (2014). Bearing degradation evaluation using
recurrence quantification analysis and Kalman filter. IEEE Transactions on
Instrumentation and Measurement, 63:2599–2610.
Quinlan, J. (1986). Induction of decision trees. Machine Learning, 1(1):81–106.
Quinlan, J. (1987). Simplifying decision trees. International Journal of Man-
Machine Studies, 27(3):221–234.
Quinlan, J. (1993). C4.5: Programs for Machine Learning. Morgan Kaufmann.
Rabiner, L. (1989). A tutorial on hidden Markov models and selected
applications in speech recognition. Proceedings of the IEEE, 77(2):257–286.
Rabiner, L. and Juang, B. (1986). An introduction to hidden Markov models.
IEEE Acoustics, Speech and Signal Processing Magazine, 3:4–16.
Rajapakse, J. C. and Zhou, J. (2007). Learning effective brain connectivity
with dynamic Bayesian networks. Neuroimage, 37(3):749–760.
Ribeiro, B. (2005). Support vector machines for quality monitoring in a
plastic injection molding process. IEEE Transactions on Systems, Man, and
Cybernetics, Part C (Applications and Reviews), 35(3):401–410.
Robinson, J. W. and Hartemink, A. J. (2010). Learning non-stationary dynamic
Bayesian networks. Journal of Machine Learning Research, 11:3647–3680.
Saha, S., Saha, B., Saxena, A., and Goebel, K. (2010). Distributed prognostic
health management with Gaussian process regression. IEEE Aerospace
Conference, pages 1–8.
Sahami, M. (1996). Learning limited dependence Bayesian classifiers. In
Proceedings of the 2nd International Conference on Knowledge Discovery
and Data Mining, pages 335–338.
Samuel, A. L. (1959). Some studies in machine learning using the game of
checkers. IBM Journal of Research and Development, 3(3):210–229.
Sarigul, E., Abbott, A., Schmoldt, D., and Araman, P. (2005). An interactive
machine-learning approach for defect detection in computed tomography
(CT) images of hardwood logs. In Proceedings of Scan Tech 2005 Interna-
tional Conference, pages 15–26.
Sbarufatti, C., Corbetta, M., Manes, A., and Giglio, M. (2016). Sequential
Monte-Carlo sampling based on a committee of artificial neural networks
for posterior state estimation and residual lifetime prediction. International
Journal of Fatigue, 83:10–23.
Schmidhuber, J. (2015). Deep learning in neural networks: An overview. Neural
Networks, 61:85–117.
Schölkopf, B., Williamson, R., Smola, A., Shawe-Taylor, J., and Platt, J. (2000).
Support vector method for novelty detection. In 13th Annual Neural Information
Processing Systems Conference, pages 582–588. The MIT Press.
Schwarting, W., Alonso-Mora, J., and Rus, D. (2018). Planning and decision-
making for autonomous vehicles. Annual Review of Control, Robotics and
Autonomous Systems, 1:8.1–8.24.
Shi, J., Yin, W., Osher, S., and Sajda, P. (2010). A fast hybrid algorithm for
large-scale l1-regularized logistic regression. Journal of Machine Learning
Research, 11(1):713–741.
Shigley, J. E., Budynas, R. G., and Mischke, C. R. (2004). Mechanical Engi-
neering Design. McGraw-Hill.
Shigley, J. E. and Mischke, C. R. (1956). Standard Handbook of Machine
Design. McGraw-Hill.
Shukla, D. and Desai, A. (2016). Recognition of fruits using hybrid features
and machine learning. In International Conference on Computing, Analytics
and Security Trends, pages 572–577. IEEE Press.
Siddique, A., Yadava, G., and Singh, B. (2005). A review of stator fault
monitoring techniques of induction motors. IEEE Transactions on Energy
Conversion, 20(1):106–114.
Silva, J. A., Faria, E. R., Barros, R. C., Hruschka, E. R., de Carvalho, A. C.,
and Gama, J. (2013). Data stream clustering: A survey. ACM Computing
Surveys, 46(1):13.
Sing, T., Sander, O., Beerenwinkel, N., and Lengauer, T. (2005). ROCR: Visual-
izing classifier performance in R. Bioinformatics, 21:3940–3941.
Sjöberg, J., Zhang, Q., Ljung, L., Benveniste, A., Delyon, B., Glorennec, P.-Y.,
Hjalmarsson, H., and Juditsky, A. (1995). Nonlinear black-box modeling in
system identification: A unified overview. Automatica, 31(12):1691–1724.
Tibshirani, R., Walther, G., and Hastie, T. (2001). Estimating the number of
clusters in a data set via the gap statistic. Journal of the Royal Statistical
Society: Series B (Statistical Methodology), 63(2):411–423.
Tikhonov, A. (1943). On the stability of inverse problems. Doklady Akademii
Nauk SSSR, 39(5):176–179.
Timusk, M., Lipsett, M., and Mechefske, C. K. (2008). Fault detection us-
ing transient machine signals. Mechanical Systems and Signal Processing,
22(7):1724–1749.
Tippannavar, S. and Soma, S. (2017). A machine learning system for recognition
of vegetable plant and classification of abnormality using leaf texture analysis.
International Journal of Scientific and Engineering Research, 8(6):1558–1563.
Tsang, I., Kocsor, A., and Kwok, J. T. (2007). Simpler core vector machines
with enclosing balls. In Proceedings of the 9th ACM SIGKDD International
Conference on Knowledge Discovery and Data Mining, pages 226–235.
Tüfekci, P. (2014). Prediction of full load electrical power output of a base
load operated combined cycle power plant using machine learning methods.
International Journal of Electrical Power and Energy Systems, 60:126–140.
Tukey, J. (1977). Exploratory Data Analysis. Addison-Wesley.
Tuna, G., Kogias, D. G., Gungor, V. C., Gezer, C., Taşkın, E., and Ayday, E.
(2017). A survey on information security threats and solutions for machine
to machine (M2M) communications. Journal of Parallel and Distributed
Computing, 109:142–154.
Tylman, W., Waszyrowski, T., Napieralski, A., Kaminski, M., Trafidlo, T.,
Kulesza, Z., Kotas, R., Marciniak, P., Tomala, R., and Wenerski, M.
(2016). Real-time prediction of acute cardiovascular events using hardware-
implemented Bayesian networks. Computers in Biology and Medicine, 69:245–
253.
Wang, K.-J., Chen, J. C., and Lin, Y.-S. (2005). A hybrid knowledge discovery
model using decision tree and neural network for selecting dispatching rules
of a semiconductor final testing factory. Production Planning and Control,
16(7):665–680.
Wang, X. and Xu, D. (2010). An inverse Gaussian process model for degradation
data. Technometrics, 52:188–197.
Wong, J.-Y. and Chung, P.-H. (2008). Retaining passenger loyalty through
data mining: A case study of Taiwanese airlines. Transportation Journal,
47:17–29.
Wuest, T., Weimer, D., Irgens, C., and Thoben, K.-D. (2016). Machine learning
in manufacturing: Advantages, challenges, and applications. Production and
Manufacturing Research, 4(1):23–45.
Xie, L., Huang, R., Gu, N., and Cao, Z. (2014). A novel defect detection
and identification method in optical inspection. Neural Computing and
Applications, 24(7-8):1953–1962.
Xie, W., Yu, L., Xu, S., and Wang, S. (2006). A new method for crude oil
price forecasting based on support vector machines. In Lecture Notes in
Computer Science 2994, pages 444–451. Springer.
Xu, D. and Tian, Y. (2015). A comprehensive survey of clustering algorithms.
Annals of Data Science, 2(2):165–193.
Xu, S., Tan, H., Jiao, X., Lau, F., and Pan, Y. (2007). A generic pigment
model for digital painting. Computer Graphics Forum, 26(3):609–618.
Zorriassatine, F., Al-Habaibeh, A., Parkin, R., Jackson, M., and Coy, J. (2005).
Novelty detection for practical pattern recognition in condition monitoring of
multivariate processes: A case study. The International Journal of Advanced
Manufacturing Technology, 25(9-10):954–963.