Outlier Detection Based On Gaussian Process With Application To Industrial Processes

Article history: Received 15 July 2017; Received in revised form 4 September 2018; Accepted 22 December 2018; Available online 27 December 2018.

Keywords: Outlier detection; Gaussian process; Industrial process

Abstract

Due to the extensive usage of data-based techniques in industrial processes, detecting outliers for industrial process data becomes increasingly indispensable. This paper proposes an outlier detection scheme that can be directly used for either process monitoring or process control. Based on traditional Gaussian process regression, we develop several detection algorithms, of which the mean function, covariance function, likelihood function and inference method are specially devised. Compared with traditional detection methods, the proposed scheme requires fewer assumptions and is more suitable for modern industrial processes. The effectiveness of the proposed scheme is verified by experiments on both synthetic and real-life data sets.

© 2019 Elsevier B.V. All rights reserved.

∗ Corresponding author. E-mail address: [email protected] (Z. Mao).
https://doi.org/10.1016/j.asoc.2018.12.029
1. Introduction

For an industrial process, it is of great importance to guarantee product quality as well as control performance simultaneously. Due to the increasing demands on system performance, production quality and economic operation, technical processes in modern industry are becoming much more complicated. This has led to great challenges for process control and process monitoring due to the lack of sufficient knowledge regarding system mechanisms. The usage of process data has thus become an attractive alternative to the relatively traditional model-based techniques relying on physical models from first principles.

For industrial process control, the category of data-based techniques has drawn much attention in recent years. Using the input and output measurements of plants, controller parameters can be directly determined or adapted by data-based control methods. From the practical point of view, model-free adaptive/predictive control, iterative learning control, and automatic tuning control have been deemed the most typical strategies for process control in modern industry. On the other hand, for process monitoring or fault detection, a traditional model-based framework with available physical models obtained from first principles could be competent. For modern large-scale complicated industrial processes, however, constructing a model-based monitoring system seems to be impossible. Hence, utilizing the available process measurements to establish an efficient and reliable monitoring system has become increasingly popular. Approaches based on multivariate statistics are the most popular ones and have been applied in many industrial applications. Under this framework, principal component analysis (PCA) and partial least squares (PLS) are the two most common techniques. Furthermore, many extensions of PCA and PLS have been proposed to cope with the significant process dynamics under industrial operating conditions.

Unfortunately, in practical applications, process data points or system states are prone to be contaminated by abnormal data points, which are usually referred to as outliers. As stated in [1], there are many ways of inducing outliers in the industrial context, such as parameter changes, structural changes, faulty sensors and faulty actuators. Data patterns under these conditions usually deviate much from those under the normal condition. It has been observed in [2] that databases contain data from normal operating conditions, faulty conditions, various operating modes, startup periods, and shutdown periods. The majority are sampled from normal operating conditions and such data are referred to as ''normal data''. The remaining data,
including faulty data, data from shutdown or startup periods, and outliers from various sources, are all categorized into the outlier class. These outliers occur frequently in practice and can have serious consequences [3]. For process control systems, outliers may trigger deviations of the controller parameters, resulting in inferior control performance or even industrial accidents. For industrial process monitoring systems, outliers may induce many false alarms, which may result in unnecessary economic losses. As a consequence, the problem of guaranteeing data quality should be considered prior to the implementation of those data-based techniques. In a nutshell, outliers should be detected as early as possible for the sake of reducing productivity losses as well as health risks.

To that end, several corresponding approaches have been proposed in recent years. For improving the performance of fault detection, [4] proposed an outlier identification and removal scheme, in which a neural network was used to estimate the actual system states and outlier detection was performed by comparing the difference between the estimated and the measured states with the standard deviation over a moving time window. To enhance the efficiency of the PCA-based chiller sensor fault detection, diagnosis and data reconstruction method, [5] presented a training data cleaning strategy based on the Euclidean distance and Z score of training samples. For the safety and reliability of the energy system in the steel industry, [6] proposed an anomaly detection scheme that takes the data features of the energy system into consideration. After classifying the anomalies as trend anomalies for pseudo-periodic data and deviants for generic data, a dynamic time warping based method combined with adaptive fuzzy C means (AFCM) was proposed for trend anomalies and a K-nearest neighbor AFCM was proposed for generic data. Considering the characteristics of time series in process control systems, [7] proposed an improved RBF network to construct the model of the controlled object, and an Auto-Regression Hidden Markov Model (HMM) was used to detect outliers according to the corresponding fitting residuals. Similarly, an HMM was applied to detect outliers in order to construct a robust dynamic PLS model, with which an improved generalized predictive control was developed [8]. For maintaining optimal product disposition and control in advanced semiconductor processing, a robust outlier removal method was proposed in [9], in which robust regression methods were employed to accurately estimate outliers. Indeed, although almost all these outlier detection schemes were verified to facilitate implementations of industrial process control and process monitoring, several limitations, such as the Gaussian distribution assumption in [4] and [5] and the inadaptability to high-dimensional data in [6,7] and [8], would hamper their prospects in practical applications.

In this paper, we extend the Gaussian process (GP) to outlier detection and apply it to industrial processes. Gaussian processes are natural generalizations of multivariate Gaussian random variables to infinite index sets. GP models are routinely used to solve hard machine learning problems [10]. They are attractive due to their flexible non-parametric nature and computational simplicity. These two features are crucial for intricate industrial process data. Suppose we are facing a supervised learning task and we aim to generalize from observed data, in the sense that our ability to predict uncertain aspects of a problem improves after making the observations. This is possible only if we postulate a priori a relationship between the variables we will observe and the ones we need to predict. If this a priori postulate is a very informed one, a parametric approach governed by a finite number of parameters is the method of choice; but if many aspects of the phenomenon are unknown or hard to describe explicitly, a nonparametric approach can be more versatile and powerful. Note that, additionally, for some parametric learning machines like neural networks, the determination of the number of hidden nodes is a nontrivial task. Moreover, the computational simplicity of GP can facilitate the understanding of process data, which is important for controlling the system states. To our knowledge, nevertheless, most research regarding GP uses it for regression [11,12] or classification tasks [13–15]. For example, in [15] the Gaussian process is used for the classification problem, and robustness to noise or outliers on data labels is considered from the practical point of view. But this is essentially different from the problem of outlier detection. Outlier detection belongs to the unsupervised learning paradigm, where data labels are unavailable at the training phase and the robustness of an outlier detection model may be enhanced only in a heuristic manner. In this paper, via specific selections of the mean function, covariance function, likelihood function and inference method, we propose several outlier detection algorithms on the basis of either Gaussian process regression or Gaussian process classification. Compared with traditional outlier detection methods, our proposed scheme may be more suitable for industrial processes. We summarize our contributions as follows.

(1) We extend Gaussian process models to detect outliers in industrial processes;
(2) The mean function, covariance function, likelihood function and inference method of the used GP models are designed specifically;
(3) The effectiveness of the proposed outlier detection scheme is validated with synthetic and real-world data sets.

The rest of this paper is structured as follows. Section 2 summarizes the methodology regarding Gaussian process models. In Section 3, we extend GP models to outlier detection from the perspective of both GP regression and classification. Several discussions referring to Gaussian process for outlier detection are presented in Section 4. A series of experiments are carried out in Section 5. Finally, some conclusions are drawn in Section 6.

2. Gaussian process model

Prior to introducing the GP theory, it is necessary to provide a simple explanation of the difference between a Gaussian process and a Gaussian distribution. A Gaussian process is a generalization of the Gaussian probability distribution. Whereas a probability distribution describes random variables which are scalars or vectors, a stochastic process governs the properties of functions. Therefore, a Gaussian process model does not assume a Gaussian distribution for any observed variable. In addition, it turns out that many models commonly employed in both machine learning and statistics are in fact special cases of, or restricted kinds of, Gaussian processes.

Suppose we are facing a supervised learning task, for which we have obtained N training points containing inputs {x1, . . . , xN} ⊂ R^d and their corresponding outputs {y1, . . . , yN} ⊂ R. We aim to generalize from these observed data, in the sense that our ability to predict uncertain aspects of a problem improves after making the observations. The natural solution is to infer a distribution over functions given the training data, p(f |X, y), (X = [x1, . . . , xN], y = [y1, . . . , yN]), under the assumption that yi = f(xi) for some unknown function f (possibly corrupted by noise). Then we can use this distribution to make predictions for new inputs:

p(y∗ |x∗, X, y) = ∫ p(y∗ |f, x∗) p(f |X, y) df    (1)

Suppose we have defined a prior over the function f; then we can obtain a posterior over this function once we have obtained some data/evidence, through Bayes' theorem:

p(f |X, y) ∝ p(f) p(X, y|f)    (2)
In Eq. (2), p(f) is the defined prior over f and p(X, y|f) is the likelihood function. The necessary condition for this posterior to be tractable is that p(f) is conjugate to p(X, y|f). At first sight, it might seem difficult to work with a distribution over the uncountably infinite space of functions. However, it turns out that for a finite training set we only need to consider the values of the function at the discrete set of input values corresponding to the training set and test set data points, so in practice we can work in a finite space [16]. To this end, a GP assumes that p(f(x1), . . . , f(xN)) is jointly Gaussian, with some mean µ(x) and covariance Σ(X). There are many reasonable choices for this covariance, as long as we specify functions which generate a non-negative definite covariance matrix for X. Furthermore, from the perspective of modeling, we wish to specify the covariance so that points with nearby inputs give rise to similar predictions. Kernel functions are competent for such a task. One widely used kernel function for GP is given by the exponential of a quadratic form, with the addition of constant and linear terms:

κ(x1, x2) = θ0 exp(−(θ1/2)∥x1 − x2∥²) + θ2 + θ3 x1ᵀx2    (3)

Usually, the following kernel (the Gaussian or RBF kernel) is a very representative choice:

κ(x1, x2) = σ² exp(−(x1 − x2)²/(2l²))    (4)

Now we define a Gaussian process as a Gaussian distribution over functions:

f(x) ∼ GP(m(x), κ(x, x′))    (5)

where m(x) = E(f(x)) and κ(x, x′) = E((f(x) − m(x))(f(x′) − m(x′))ᵀ).

2.1. Gaussian process for regression

Here we discuss GP models for the problem of regression. Taking account of the noise on the observed target values, we have:

yi = f(xi) + εi    (6)

where εi is a random noise variable whose value is chosen independently for each observation. Thus the output variable has the following distribution:

p(yi |fi) = N(yi |fi, β⁻¹)    (7)

where β is a hyperparameter representing the precision of the noise and fi = f(xi). Since the noise is independent for each observation, we obtain:

p(y|f) = N(y|f, β⁻¹IN)    (8)

where f = [f1, . . . , fN] and IN denotes the N × N identity matrix.

Assume we are given a test set X∗ of size N∗ × d; our aim is then to predict the corresponding outputs y∗ of size N∗ × 1:

p(y∗ |X∗, X, y) = ∫ p(f |X, y) p(y∗ |X∗, f) df    (9)

By the definition of GP, we have the following joint distribution:

[y; y∗] ∼ N(0, [Ky, K∗; K∗ᵀ, K∗∗])    (10)

Here we assume the mean is zero for simplicity, Ky = K + σy²IN, K = κ(X, X) is N × N, K∗ = κ(X, X∗) is N × N∗, and K∗∗ = κ(X∗, X∗) is N∗ × N∗. Then, by the standard rules for conditioning Gaussians, the posterior of y∗ has the following form:

p(y∗ |X, y, X∗) = N(y∗ |µ∗, Σ∗)    (11)

µ∗ = K∗ᵀ Ky⁻¹ y    (12)

Σ∗ = K∗∗ − K∗ᵀ Ky⁻¹ K∗    (13)

A specific case of the above distribution arises when the test set contains only a single input point x∗. Then we have:

p(y∗ |X, y, x∗) = N(y∗ |k∗ᵀ Ky⁻¹ y, k∗∗ − k∗ᵀ Ky⁻¹ k∗)    (14)

where k∗ = [κ(x∗, x1), . . . , κ(x∗, xN)] and k∗∗ = κ(x∗, x∗). We can also write µ∗ in the following form:

µ∗ = k∗ᵀ Ky⁻¹ y = ∑_{i=1}^{N} αi κ(xi, x∗)    (15)

where α = Ky⁻¹ y.
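As a minimal, self-contained sketch (not the authors' implementation), the kernels of Eqs. (3)–(4) and the predictive equations (12)–(13) translate into a few lines of numpy; all hyperparameter values here are arbitrary placeholders:

```python
import numpy as np

def rbf_kernel(x1, x2, sigma=1.0, length=1.0):
    """Squared exponential (RBF) kernel of Eq. (4)."""
    return sigma ** 2 * np.exp(-np.sum((x1 - x2) ** 2) / (2.0 * length ** 2))

def quad_exp_kernel(x1, x2, theta=(1.0, 1.0, 0.0, 0.0)):
    """Kernel of Eq. (3): exponential of a quadratic form plus
    constant and linear terms. theta = (theta0, theta1, theta2, theta3)."""
    t0, t1, t2, t3 = theta
    return (t0 * np.exp(-0.5 * t1 * np.sum((x1 - x2) ** 2))
            + t2 + t3 * np.dot(x1, x2))

def gram_matrix(kernel, X1, X2):
    """Covariance matrix with entries kernel(X1[i], X2[j])."""
    return np.array([[kernel(a, b) for b in X2] for a in X1])

def gp_posterior(X, y, X_star, kernel, noise_var=1e-2):
    """Predictive mean and covariance of Eqs. (12)-(13), with
    K_y = K + noise_var * I. Linear systems are solved instead of
    forming K_y^{-1} explicitly, which is numerically preferable."""
    K_y = gram_matrix(kernel, X, X) + noise_var * np.eye(len(X))
    K_star = gram_matrix(kernel, X, X_star)
    K_ss = gram_matrix(kernel, X_star, X_star)
    alpha = np.linalg.solve(K_y, y)     # alpha = K_y^{-1} y, cf. Eq. (15)
    mu_star = K_star.T @ alpha          # Eq. (12)
    Sigma_star = K_ss - K_star.T @ np.linalg.solve(K_y, K_star)  # Eq. (13)
    return mu_star, Sigma_star
```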
3. Gaussian process for outlier detection

In this section, we first discuss the problem of outlier detection for industrial processes. Then we present how GP models can be applied to calculate outlier scores from the perspectives of GP regression and GP classification, respectively.

3.1. Outlier detection for industrial processes

As discussed in Section 1, data-based techniques that utilize process data have become attractive and effective solutions for modern industrial process control, process monitoring, and other industrial tasks. It is therefore reasonable to summarize several general features of industrial process data for the sake of proposing an appropriate detection approach. We summarize the main characteristics of industrial process data as follows.

(1) Industrial process measurements in most scenarios are unlabeled. It is very expensive, if not impossible, to label all measurements for online applications. This feature poses the greatest challenge for several traditional outlier detection approaches, such as classification-based ones, in which the training set must include sufficient instances sampled from both the normal and the abnormal class.

(2) Industrial process measurements used for process control and process monitoring or fault detection and isolation (FDI) should be processed in real-time. This feature is prominent for industrial processes and poses a great challenge for outlier detection methods such as nearest neighbor-based ones, in which detection for test instances is time-consuming because the distances from each test instance to all training samples must be calculated. This computational complexity is unaffordable for large-scale data.

(3) Models of outliers (abnormal data) encountered in industrial processes are hard, even impossible, to construct. Measurements under the normal working conditions of a machine are very cheap and easy to obtain; measurements of outliers, on the other hand, would require the destruction of the machine in all possible ways [17]. Since the sources generating outliers in industrial processes are numerous and the forms of outliers are correspondingly diverse, constructing a generative model (or a finite number of models) to describe measurements of outliers is an impossible mission.

(4) In the case that a training set containing both normal data and outliers has been obtained, the problem of data imbalance poses another challenge for several traditional outlier detection approaches, in which balanced class distributions or equal misclassification costs are usually expected or assumed. When presented with complex imbalanced data sets, these approaches fail to properly represent the distributive characteristics of the data and consequently provide unfavorable accuracies across the classes [18].

(5) Due to the fact that modern industrial systems are becoming increasingly complicated, many data-based techniques used for industrial processes inevitably suffer from noise and even outliers when these data-driven models are trained. Noisy data (including outliers) may deteriorate the performance of these techniques, depending on their degree of sensitivity to data corruption. Furthermore, in noisy environments the noise robustness of the approaches can be more important than the performance results themselves [19]. This feature has also caused great trouble for many outlier detection approaches, especially those based on statistical models.

According to the above discussion, an appropriate outlier detection approach for industrial processes should possess the following characteristics: (1) being implemented under the framework of unsupervised learning; (2) being applicable in both on-line and off-line scenarios; (3) being robust to the absence of outlier measurements at the training phase; (4) being robust to the problem of data imbalance at the training phase; (5) being robust to noise and outliers in the training set.

3.2. Definition of outlier

Outlier detection finds extensive use in a wide variety of applications apart from industrial processes, such as fraud detection for credit cards, insurance, or health care, intrusion detection for cyber-security, and military surveillance for enemy activities [20]. The definition of an outlier may differ among applications, but most definitions share one common feature, i.e. outliers are patterns in data that do not conform to a well defined notion of normal behavior. The difference lies in how the corresponding normal behavior is defined.

In industrial applications, operating conditions are usually so harsh that measurements are prone to be contaminated by outliers from various sources, such as parameter changes, structural changes, faulty sensors and faulty actuators. In addition to these outliers, measurements from the startup or shutdown periods are usually unstable and cannot represent the normal system state. Moreover, for several multi-mode systems, measurements from different operating modes are usually quite distinct from each other. In a nutshell, all data not from the normal operating condition are referred to as ''outliers'' in this paper.
3.3. Calculating outlier score from Gaussian process regression

Under the framework of GP regression, we can obtain the predictive distribution of the output for a test instance through Bayesian inference. Since this predictive distribution is described completely by its first and second order moments, it is natural to investigate the predictive mean and variance to calculate outlier scores for new and unseen instances. Such a strategy has already been applied to detect change points in time-series data sets in [21], where both a ''jumping mean'' and a ''jumping variance'' can indicate the onset of outliers.

Now recall Eq. (12), in which K∗ is a function of the test point input value x∗, while Ky and y can be calculated from the training set (X, y). Thus we can regard the predictive mean as a function of the test point input value, and it is natural to use the predictive mean to calculate the outlier score for a new and unseen test instance.

Next recall Eq. (13), in which both K∗ and K∗∗ are functions of x∗ and Ky can be calculated from the training set. Thus the predictive variance is also a function of the test point input value and can likewise be used to calculate an outlier score for a new and unseen test instance. This property is consistent with statistical control theory, i.e. the variance and its changes are strong features indicating the onset of outliers in multivariate systems [22]. We illustrate the aforementioned two properties in Fig. 1: there are 50 one-dimensional (for visualization) training samples whose outputs are all set to one. From this figure, it is not difficult to find that for samples that are far away from the training samples, the predictive means decrease and the predictive variances increase.

Fig. 1. GP regression with mean of zero and covariance of squared exponential kernel. The solid line is the predictive mean and the shaded area is built with the values of the predictive mean plus/minus two predictive standard deviations. The circle points at the line are 50 training points.

Since both the predictive mean and the predictive variance can be used for calculating outlier scores for test points, a more informative feature can also be extracted in the following form:

Z∗ = µ∗ / √Σ∗    (16)

Considering the analysis in Section 3.1, it is reasonable for us to make the following assumption: the examples at hand for training are all sampled from the positive class, i.e. normal instances. As such, our outlier detection model is akin to a data description model, which can also be applied to outlier detection under conditions of unlabeled data or extreme data imbalance [23]. Under this assumption, the training process of our detection approach can be deemed an unsupervised learning process.
3.4. Calculating outlier score from Gaussian process classification

For the task of binary classification solved by a probabilistic approach, the aim is to model the posterior probabilities of the target variable for the test observations, given a set of training data. As introduced for the regression task, however, predictions made by GP models lie on the entire real axis. As a result, a transfer function that can transform the outputs of a GP model into the interval (0, 1) has to be adopted. This transfer function is usually referred to as an activation function, and the logistic sigmoid function is a competent choice:

σ(f) = 1 / (1 + exp(−f))    (17)

So in a binary classification task we can assume a latent function f, define a Gaussian process over f(x), and then transform f(x) to y by the logistic sigmoid function σ. In this situation, we obtain a non-Gaussian stochastic process over the function y(x) ∈ (0, 1). In practical applications, the values of the target variable in binary classification usually belong to the set {0, 1}. Here we define the target variable as t; it conforms to the Bernoulli distribution given the values of f:

p(t|f) = σ(f)ᵗ (1 − σ(f))¹⁻ᵗ    (18)

Again, we denote the training set inputs by X = [x1, . . . , xN] with corresponding observed target variables t = [t1, . . . , tN], and we consider a single test point x∗ with target value t∗. Our goal is to determine the following predictive distribution:

p(t∗ = 1|x∗, X, t) = ∫ p(f∗ |x∗, X, t) p(t∗ = 1|f∗) df∗    (19)

where p(t∗ = 1|f∗) = σ(f∗). Unlike Eq. (9), whose two terms under the integral are both Gaussian distributions, this integral is analytically intractable because p(t∗ = 1|f∗) obeys a Bernoulli distribution. However, we can approximate this integral analytically provided we have a Gaussian approximation to the posterior distribution p(f∗ |x∗, X, t) in Eq. (19). The integral then becomes the convolution of a logistic sigmoid with a Gaussian distribution, and we can evaluate it by the following approximation [24]:

∫ σ(x) N(x|µ, σ²) dx ≅ σ(κ(σ²) µ)    (20)

where

κ(σ²) = (1 + πσ²/8)^(−1/2)    (21)

Referring to the problem of seeking a Gaussian approximation to the posterior distribution p(f∗ |x∗, X, t), three well-known approaches are competent: the Laplace Approximation (LA) [13], Expectation Propagation (EP) [25], and Kullback–Leibler (KL) divergence minimization [26], which comprises the Variational Bound (VB) [27] as a special case. In the case of large training data, an approximation approach like FITC [28] is available.

Under the framework of GP classification, therefore, we can obtain the class posterior conditioned on the training set (X, t) and the test point input value x∗. Thus, it is intuitive to apply GP binary classification directly to outlier detection provided we have a labeled training set. Akin to the case of GP regression, we also assume that all examples at hand for training stem from the positive class, i.e. t = 1, an N × 1 vector of ones.
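The approximation of Eqs. (20)–(21) is a one-liner once the Gaussian approximation of the latent posterior (mean mu, variance s2) is available; a small sketch:

```python
import numpy as np

def sigmoid(f):
    """Logistic sigmoid of Eq. (17)."""
    return 1.0 / (1.0 + np.exp(-f))

def approx_class_probability(mu, s2):
    """Convolution of a logistic sigmoid with N(mu, s2),
    approximated via Eqs. (20)-(21)."""
    kappa = 1.0 / np.sqrt(1.0 + np.pi * s2 / 8.0)   # Eq. (21)
    return sigmoid(kappa * mu)                       # Eq. (20)
```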
4. Discussions

Although the mechanisms for outlier detection based on GP models appear intuitive and simple, constructing a well-performing GP model is inherently non-trivial. In this section, we discuss several issues concerning the construction of Gaussian process models, by which the detection approaches based on them would be influenced significantly.

4.1. Two views of Gaussian process

For the sake of further understanding the implementation of Gaussian process models and their relationship with other statistical models extensively applied in the domain of machine learning, we first discuss two views of the Gaussian process.

The first is the process view [29]. The process view on a zero-mean GP for a latent function f(x) with covariance function Σ(x) is in the spirit of the GP definition introduced in Section 2. This latent function f(x) is defined implicitly for any finite input subset x = {x1, . . . , xN}, which induces a finite-dimensional distribution: f = {f(x1), . . . , f(xN)} ∼ N(0, Σ(x)). This definition is equivalent to imposing a probability constraint on f(x). The covariance function Σ(x) is then equivalent to imposing a smoothness constraint, which indicates that similar data points should have the same class assignments (for classification) or similar outputs (for regression). The process view thus boils down to dealing with the projection of the GP onto a multivariate Gaussian distribution, and hence to simple linear algebra of quadratic forms.

The second is the weight-space view [30]. Gaussian process models originally stem from classical statistical models, such as linear functions, wavelet expansions and neural networks. Via replacing the parameterized latent functions in these models with a Gaussian prior, GP models can be constructed. Thus, they can be interpreted through a weight-space view.

These two views lead to different implementations, but are conceptually equivalent. The former is usually much simpler to work with, while the latter allows us to relate GP models to parametric linear models rather directly.

4.2. Selection of mean function

Here we discuss the selection of the a priori mean function m(x) over the latent function f(x) in the GP model applied in our detection method. In Fig. 1, the a priori mean function is a zero-mean function and the output values for all training samples are set to 1 (y = 1N×1). Under this configuration, test points near training samples have predictive means close to one (corresponding to the set output values of the training samples); otherwise their predictive means approach zero (corresponding to the assumed a priori mean of f(x)). Thus, the predictive means of test inputs can intuitively reflect their similarities to the training set. Actually, we can also set the a priori mean function to any other constant, as in Fig. 2, where the a priori mean function is set to two and the predictive means for test points far away from the training set increase (approach two); this is opposite to the trend described in Fig. 1 but can still reflect the similarities to the training set. However, if we postulate a linear a priori mean function, the result is demonstrated by Fig. 3, where how much test samples resemble the training set cannot be intuitively reflected by the predictive mean. From Figs. 1 to 3, however, we can find that whatever the a priori mean function is, the predictive variance is always a reliable indicator (shaded area) for test instances.

Fig. 2. GP regression with a constant mean (equal to two) and covariance of squared exponential kernel. The solid line is the predictive mean and the shaded area is built with the values of the predictive mean plus/minus two predictive standard deviations. The circle points at the line are 50 training points (same as those in Fig. 1).

As a result, in this paper the a priori mean functions for all GP models used for outlier detection are zero-mean functions (as used in Fig. 1).

4.3. Calculation of covariance function

In a Gaussian process, the covariance of two function values f(x) and f(x′) is calculated indirectly through a function κ with inputs x and x′:

V[f(x), f(x′)] = E[(f(x) − m(x))(f(x′) − m(x′))] = κ(x, x′)    (22)

Functions that achieve the above property are exactly kernel functions. Thus, any kernel function that defines the covariance as a positive semi-definite matrix (Mercer's theorem) is competent. There are several available kernels that can be used directly.
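Since sums (and products) of valid kernels remain positive semi-definite, composite kernels can be assembled from the primitives sketched in Section 2.1. As an illustrative assumption only — the exact composition used in the experiments of Section 5.3 may differ — one natural choice mirrors Eq. (3) by summing squared exponential, constant and linear terms:

```python
import numpy as np

def composite_kernel(x1, x2, sigma=1.0, length=1.0, c=1.0, w=1.0):
    """A composite covariance: RBF term (local smoothness) plus a
    constant (bias) term and a linear term, following the form of
    Eq. (3). Hyperparameter values are placeholders."""
    return (rbf_kernel(x1, x2, sigma, length)
            + c
            + w * np.dot(x1, x2))
```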
Table 1
Compatibility between likelihoods and inference methods.

Likelihood \ Inference    Exact    LA    EP    VB
Gaussian                  √        √     √     √
Student's t                        √           √
Laplacian                          √           √

Fig. 4. An illustrative example of GP regression with outliers. The left-hand part uses the Gaussian likelihood and the right-hand part the Student's t likelihood. The black solid line is the real function; the blue solid line and the red dashed line are the predictive mean and the corresponding standard deviation band.
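The Student's t likelihood with Laplace inference (the pairing adopted in Section 5) is available in standard GP toolboxes. A minimal sketch using the Python library GPy — the degrees of freedom and the data here are assumptions for illustration:

```python
import numpy as np
import GPy

X = np.random.rand(50, 1)            # hypothetical 1-D training inputs
Y = np.ones((50, 1))                 # all outputs set to one, as in Fig. 1

kernel = GPy.kern.RBF(input_dim=1)   # squared exponential covariance
t_lik = GPy.likelihoods.StudentT(deg_free=4.0)
laplace = GPy.inference.latent_function_inference.Laplace()

model = GPy.core.GP(X, Y, kernel=kernel, likelihood=t_lik,
                    inference_method=laplace)
model.optimize()                     # fit hyperparameters

mu, var = model.predict(np.array([[0.5]]))   # predictive mean and variance
```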
Table 2
Confusion matrix of the two-class classification problem.

                                     Actual label
                                     Target class           Negative class
Predicted label    Target class      True Positive (TP)     False Positive (FP)
                   Negative class    False Negative (FN)    True Negative (TN)
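The evaluation metrics reported in Section 5 follow directly from this confusion matrix; a short sketch of the standard definitions (the balanced F1 form of the F-measure is assumed here):

```python
import numpy as np

def evaluation_metrics(tp, fp, fn, tn):
    """G-mean and F-measure from the entries of Table 2."""
    recall = tp / (tp + fn)           # true positive rate
    specificity = tn / (tn + fp)      # true negative rate
    precision = tp / (tp + fp)
    g_mean = np.sqrt(recall * specificity)
    f_measure = 2 * precision * recall / (precision + recall)
    return g_mean, f_measure
```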
Table 3
Summary of results for the Friedman data set w.r.t. G-mean, F-measure and AUC values. The results are average values over all ten training sets. GP-RM, GP-RV, GP-RMV and GP-C denote the methods based on the predictive mean, the predictive variance, and the predictive mean plus variance of Gaussian process regression, and on Gaussian process classification, respectively.

            GMM     KM      KNN     SVDD    PCA     GP-RM   GP-RV   GP-RMV  GP-C
G-mean      0.605   0.632   0.539   0.701   0.513   0.744   0.752   0.748   0.713
F-measure   0.515   0.562   0.541   0.679   0.523   0.711   0.716   0.720   0.718
AUC         0.614   0.602   0.569   0.723   0.608   0.763   0.778   0.771   0.742

… of the Student's t likelihood in the comparison (results for the Laplacian likelihood are similar). We can see that our methods based on the Student's t likelihood outperform all the competitors w.r.t. all three metrics.
5.3. Tennessee Eastman benchmark process

The Tennessee Eastman (TE) benchmark process is a widely used simulation process for evaluating different approaches to process monitoring and FDD [38]. The process has 41 process variables and 12 manipulated variables. Of the 41 process variables, 22 are easily measured while the remaining ones are difficult to measure. Therefore, these 22 variables are usually utilized either to predict other difficult-to-measure variables or for process monitoring.

Due to changes in the G/H ratio and the stripper underflow, the TE process has six basic operation modes. In our experiment, we sample 500 examples from each mode and construct six data sets. In each data set, examples from five modes are regarded as normal data and examples from the remaining mode are regarded as outliers. Then 100 examples from each normal mode and 50 examples from the abnormal mode are selected to construct the training set; all the remaining examples constitute the test set. Thus, the sizes of the training set and the test set are 550 (including 50 outliers) and 2450 (including 450 outliers), respectively. Since the training and test sets are selected randomly, we repeat this process five times and present the average results. Note that outliers are generated in faulty conditions, so they must deviate from the data generated in the normal condition. However, several outliers are still hard to detect since they greatly resemble normal data.
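A sketch of this sampling protocol (the mode arrays are hypothetical stand-ins; each is assumed to hold the 500 examples of one operation mode), which reproduces the 550/2450 split described above:

```python
import numpy as np

rng = np.random.default_rng(0)

def split_te_modes(modes, outlier_mode):
    """modes: dict mapping mode id -> (500, d) array of examples.
    100 examples per normal mode and 50 from the outlier mode go to
    the training set; everything else goes to the test set."""
    train, test = [], []
    for mode_id, data in modes.items():
        data = rng.permutation(data)              # random selection
        n = 50 if mode_id == outlier_mode else 100
        train.append(data[:n])
        test.append(data[n:])
    return np.vstack(train), np.vstack(test)      # 550 train, 2450 test
```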
Firstly, we compare the results of our methods, equipped with the Student's t likelihood and LA inference, against five competitors. The covariance function used here is the squared exponential kernel. As can be seen in Table 4, our proposed methods outperform all the competitors in terms of the three metrics, especially the method based on the predictive variance.

Table 4
Summary of results for the TE data set (with outliers in the training set) w.r.t. G-mean, F-measure and AUC. For the GP methods, the likelihood function is Student's t and the inference method is Laplace.

            GMM     KM      KNN     SVDD    PCA     GP-RM   GP-RV   GP-RMV  GP-C
G-mean      0.715   0.671   0.684   0.743   0.697   0.750   0.766   0.762   0.747
F-measure   0.688   0.639   0.642   0.707   0.670   0.732   0.739   0.735   0.732
AUC         0.810   0.759   0.764   0.853   0.799   0.861   0.883   0.879   0.859

Then, we use a composite kernel function as introduced in Section 4.3 to investigate whether the performance of the GP-based methods can be improved significantly. The results shown in Fig. 8 indicate that for the TE data set, the performance w.r.t. the AUC value can be improved by the composite kernel function, but the improvement is not significant.

Fig. 8. Results of comparison for different covariance functions w.r.t. AUC.

5.4. Electric arc furnace process control

The electric arc furnace (EAF) is widely used in many countries for refining quality steel. Nowadays, in steel making companies the number of EAFs is rapidly increasing, since they are suitable devices to melt scrap and direct reduced iron for steel production. A schematic diagram of the EAF operation is shown in Fig. 9. The scrap is loaded into the furnace and the roof is then closed, before the electrodes bore down into the scrap to transfer electric energy. Natural gas and oxygen are injected into the furnace from the burners and combusted, releasing chemical energy that is also absorbed by the scrap. The scrap keeps melting by absorbing electrical, chemical and radiation energy. When a sufficient amount of space is available within the furnace, another scrap charge is added, and melting continues until a flat bath of molten steel is formed at the end of the batch. Through the evolution of carbon monoxide from the molten metal a slag layer is formed, which contains most of the oxides resulting from the reactions of the metals with oxygen. The slag chemistry is adjusted through oxygen and carbon lancing, besides some direct addition of carbon, lime and dolomite through the roof of the furnace.

Generally, an EAF is among the highest electrical energy consumers in the power grid. The rising cost of energy has put pressure on the steel industry to improve their process control systems to conserve energy without sacrificing quality and equipment. This pressure is more accentuated when we consider the adverse effects of EAFs on the power quality of the feeding power system. Since an EAF is a non-stationary electric load, it can cause voltage fluctuation or flicker. It also produces current harmonics due to its highly nonlinear behavior. The unbalance in the meltdown phase is another adverse effect of such loads in a power system. In the literature regarding control strategies for EAF systems, adaptive control and predictive control are the most prominent ones. In addition, different sets of state variables have been considered by these control strategies in order to reach higher control performance.

A direct adaptive controller for the EAF electrode regulator system was proposed in [40], from which we could find real-time values
Fig. 10. Results of comparison for EAF data set w.r.t. ROC curve.

Fig. 11. The schematic diagram of transonic wind tunnel systems.
Source: Cited from [41].
… situation of EAF data, the required accuracy for the Mach number is higher, which may be a challenge for the detection model.

Identically, for our proposed method we select the one based on the predictive variance, due to its performance on the previous data sets. The covariance function is the composite kernel function, the likelihood function is Student's t, and the inference method is Laplace. We then provide the results of the comparison in terms of the ROC curve, as demonstrated by Fig. 12. We can see that the performance of all methods decreases compared with that for the EAF data set. The main reason is that the WT data set has a higher requirement for accuracy, so that outliers are not easy to identify among normal data. From the ROC curve we can see that the TPR can hardly reach one whatever the FPR is set to. This indicates that the distribution of outliers may partially overlap with that of the normal data.

Fig. 12. Results of comparison for WT data set w.r.t. ROC curve.

6. Conclusions

With more and more data-based techniques applied in modern industrial processes, detecting outliers in industrial process data becomes increasingly significant. This paper proposes an outlier detection scheme based on Gaussian process models, which are routinely used to solve hard machine learning problems. Due to their flexible non-parametric nature and computational simplicity, they are mainly used as effective tools for regression or classification tasks. Via specific selections of the mean function, covariance function, likelihood function and inference method, we develop three outlier detection algorithms based on Gaussian process regression and one based on Gaussian process classification. Compared with traditional detection methods, our proposed scheme has fewer assumptions and is more suitable for modern industrial processes. Finally, we carry out several experiments on both synthetic data sets and real-life industrial process data sets. Through comparison with several competitors, the effectiveness of our proposed scheme has been verified.

Acknowledgments

This work is supported by the National Natural Science Foundation of China (Grant Nos. 61473072 and 61333006).

References

[1] B.S.J. Costa, P.P. Angelov, L.A. Guedes, Fully unsupervised fault detection and identification based on recursive density estimation and self-evolving cloud-based classifier, Neurocomputing 150 (2015) 289–303.
[2] L.H. Chiang, R.J. Pell, M.B. Seasholtz, Exploring process data with the use of robust outlier detection algorithms, J. Process Control 13 (5) (2003) 437–449.
[3] R.K. Pearson, Outliers in process modeling and identification, IEEE Trans. Control Syst. Technol. 10 (1) (2002) 55–63.
[4] H. Ferdowsi, S. Jagannathan, M. Zawodniok, An online outlier identification and removal scheme for improving fault detection performance, IEEE Trans. Neural Netw. Learn. Syst. 25 (5) (2014) 908–919.
[5] Y. Hu, et al., A statistical training data cleaning strategy for the PCA-based chiller sensor fault detection, diagnosis and data reconstruction method, Energy Build. 112 (2016) 270–278.
[6] J. Zhao, et al., Adaptive fuzzy clustering based anomaly data detection in energy system of steel industry, Inform. Sci. 259 (2014) 335–345.
[7] F. Liu, Z. Mao, W. Su, Outlier detection for process control data based on a non-linear auto-regression hidden Markov model method, Trans. Inst. Meas. Control 34 (5) (2012) 527–538.
[8] X. Jin, et al., An improved generalized predictive control in a robust dynamic partial least square framework, Math. Probl. Eng. 2015 (2015) 1–14.
[9] J.C. Robinson, et al., Improved overlay control using robust outlier removal methods, Proc. SPIE 7971 (2011) 79711G.
[10] C.E. Rasmussen, C.K.I. Williams, Gaussian Processes for Machine Learning (Adaptive Computation and Machine Learning), MIT Press, 2006.
[11] P.D. Kirk, M.P. Stumpf, Gaussian process regression bootstrapping: exploring the effects of uncertainty in time course data, Bioinformatics 25 (10) (2009) 1300–1306.
[12] A. Banerjee, D.B. Dunson, S.T. Tokdar, Efficient Gaussian process regression for large datasets, Biometrika 100 (1) (2013) 75–89.
[13] C.K.I. Williams, D. Barber, Bayesian classification with Gaussian processes, IEEE Trans. Pattern Anal. Mach. Intell. 20 (12) (1998) 1342–1351.
[14] J. He, H. Gu, Z. Wang, Multi-instance multi-label learning based on Gaussian process with application to visual mobile robot navigation, Inform. Sci. 190 (2012) 162–177.
[15] H.C. Kim, Z. Ghahramani, Outlier robust Gaussian process classification, in: Joint IAPR International Workshops on Statistical Techniques in Pattern Recognition (SPR) and Structural and Syntactic Pattern Recognition (SSPR), Springer, Berlin, Heidelberg, 2008.
[16] C.M. Bishop, Pattern Recognition and Machine Learning (Information Science and Statistics), Springer-Verlag, New York, 2006.
[17] D.M.J. Tax, R.P.W. Duin, Support vector data description, Mach. Learn. 54 (1) (2004) 45–66.
[18] H. He, E.A. Garcia, Learning from imbalanced data, IEEE Trans. Knowl. Data Eng. 21 (9) (2009) 1263–1284.
[19] J.A. Sáez, et al., Tackling the problem of classification with noisy data using multiple classifier systems: analysis of the performance and robustness, Inform. Sci. 247 (2013) 1–20.
[20] V. Chandola, A. Banerjee, V. Kumar, Anomaly detection: a survey, ACM Comput. Surv. 41 (3) (2009) 1–58.
[21] J. Takeuchi, K. Yamanishi, A unifying framework for detecting outliers and change points from time series, IEEE Trans. Knowl. Data Eng. 18 (4) (2006) 482–492.
[22] S. Kumar, V. Sotiris, M. Pecht, Health assessment of electronic products using Mahalanobis distance and projection pursuit analysis, Int. J. Comput. Inf. Syst. Sci. Eng. (4) (2008) 242.
[23] D.M.J. Tax, One-Class Classification, Ph.D. thesis, Delft University of Technology, 2001.
[24] D. Barber, C.M. Bishop, Ensemble learning for multi-layer networks, in: Advances in Neural Information Processing Systems, 1998.
[25] T.P. Minka, Expectation propagation for approximate Bayesian inference, in: Proceedings of the 17th Conference on Uncertainty in Artificial Intelligence, 2001, pp. 362–369.
[26] M. Opper, C. Archambeau, The variational Gaussian approximation revisited, Neural Comput. 21 (3) (2009) 786–792.
[27] M.N. Gibbs, D.J.C. MacKay, Variational Gaussian process classifiers, IEEE Trans. Neural Netw. 11 (6) (2000) 1458–1464.
[28] E. Snelson, Z. Ghahramani, Sparse Gaussian processes using pseudo-inputs, Adv. Neural Inf. Process. Syst. 18 (2006) 1257–1264.
[29] M. Seeger, Gaussian processes for machine learning, Int. J. Neural Syst. 14 (2) (2004) 69–106.
[30] C.K.I. Williams, Prediction with Gaussian processes: from linear regression to linear prediction and beyond, in: Learning in Graphical Models, NATO Advanced Study Institute, 1998.
[31] J. Shawe-Taylor, N. Cristianini, Kernel Methods for Pattern Analysis, Cambridge University Press, 2004.
[32] R.M. Neal, Monte Carlo implementation of Gaussian process models for Bayesian regression and classification, arXiv preprint physics/9701026, 1997.
[33] P. Jylänki, J. Vanhatalo, A. Vehtari, Robust Gaussian process regression with a Student-t likelihood, J. Mach. Learn. Res. 12 (2011) 3227–3257.
[34] J. Nocedal, S.J. Wright, Numerical Optimization, Springer, 1999.
[35] J. Huang, C.X. Ling, Using AUC and accuracy in evaluating learning algorithms, IEEE Trans. Knowl. Data Eng. 17 (3) (2005) 299–310.
[36] J.H. Friedman, Multivariate adaptive regression splines, Ann. Statist. 19 (1) (1991) 1–67.
[37] R. Ranjan, B. Huang, A. Fatehi, Robust Gaussian process modeling using EM algorithm, J. Process Control 42 (2016) 125–136.
[38] J.J. Downs, E.F. Vogel, A plant-wide industrial process control problem, Comput. Chem. Eng. 17 (3) (1993) 245–255.
[39] S. Bird, et al., Modeling, optimization and estimation in electric arc furnace (EAF) operation, Chem. Eng. (2013).
[40] L. Li, Z. Mao, A direct adaptive controller for EAF electrode regulator system using neural networks, Neurocomputing 82 (2012) 91–98.
[41] X. Wang, P. Yuan, Z. Mao, Ensemble fixed-size LS-SVMs applied for the Mach number prediction in transonic wind tunnel, IEEE Trans. Aerosp. Electron. Syst. 51 (4) (2015) 3167–3181.