0% found this document useful (0 votes)
6 views

Outlier Detection Based On Gaussian Process With Application To

Uploaded by

664682899
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
6 views

Outlier Detection Based On Gaussian Process With Application To

Uploaded by

664682899
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 12

Applied Soft Computing Journal 76 (2019) 505–516

Contents lists available at ScienceDirect

Applied Soft Computing Journal


journal homepage: www.elsevier.com/locate/asoc

Outlier detection based on Gaussian process with application to


industrial processes

Biao Wang, Zhizhong Mao
Department of Control Theory and Control Engineering, Northeastern University, 110819, Shenyang, China

highlights

• An outlier detection method for industrial processes is proposed.


• Gaussian process is extended to calculate outlier scores.
• Extensive experiment results indicate the effectiveness and necessity of our method.

article info a b s t r a c t

Article history: Due to the extensive usage of data-based techniques in industrial processes, detecting outliers for indus-
Received 15 July 2017 trial process data become increasingly indispensable. This paper proposes an outlier detection scheme
Received in revised form 4 September 2018 that can be directly used for either process monitoring or process control. Based on traditional Gaussian
Accepted 22 December 2018
process regression, we develop several detection algorithms, of which the mean function, covariance
Available online 27 December 2018
function, likelihood function and inference method are specially devised. Compared with traditional
Keywords: detection methods, the proposed scheme has less postulation and is more suitable for modern industrial
Outlier detection processes. The effectiveness of the proposed scheme is verified by experiments on both synthetic and
Gaussian process real-life data sets.
Industrial process © 2019 Elsevier B.V. All rights reserved.

1. Introduction with available physical models obtained from first principles could
be competent. For modern large-scale complicated industrial pro-
For an industrial process, it is of great importance to guarantee cesses, however, constructing a model based monitoring system
product quality as well as control performance simultaneously. seems to be impossible. Hence, utilizing the available process
Due to the increasing demands on system performance, production measurements to establish an efficient and reliable monitoring
quality and economic operation, technical processes of modern system becomes increasingly popular. Approaches on the basis of
industry are becoming much more complicated. This has led to multivariate statistical are the most popular ones and have been
great challenges for process control and process monitoring due applied in many industrial applications. Under this framework,
to the lack of sufficient knowledge regarding system mechanism. principal component analysis (PCA) and partial least squares (PLS)
Thus the usage of process data becomes an attractive solution are two most common techniques. Furthermore, many pruning
in contrast to the relatively traditional model-based techniques methods on PCA and PLS have also been proposed to cope with the
relying on physical models from first principles. significant process dynamics under industrial operating condition.
For industrial process control, the category of data-based tech- Unfortunately in practical applications, process data points or
niques has drawn much attention during recent years. Via us- systems states are usually prone to be contaminated by some
ing the input and output measurements of plants, parameters of abnormal data points, which are usually referred to as outliers.
controller could be directly determined or adapted by data-based As stated in [1], there are many ways of inducing outliers in the
control methods. From the practical point of view, the model- industrial context, such as parameter changes, structural changes,
free adaptive/predictive control, iterative learning, and automatic faulty sensors and faulty actuators. Data patterns under these
tuning control have been deemed as the most typical strategies for conditions would usually deviate much from those under the nor-
process control in modern industry. On the other hand, for process mal condition. It has been observed in [2] that databases contain
monitoring or fault detection, traditional model-based framework data from normal operating conditions, faulty conditions, various
operating modes, startup periods, and shutdown periods. The ma-
∗ Corresponding author. jority are sampled from normal operating conditions and such
E-mail address: [email protected] (Z. Mao). data are referred to as ‘‘normal data’’. While the remaining data,

https://doi.org/10.1016/j.asoc.2018.12.029
1568-4946/© 2019 Elsevier B.V. All rights reserved.
506 B. Wang and Z. Mao / Applied Soft Computing Journal 76 (2019) 505–516

including faulty data, data during shutdown or startup periods, is a nontrivial task. Moreover, the computational simplicity of GP
outliers from various sources, are all categorized into the outlier can facilitate the understanding of process data, which is important
class. These outliers occur frequently in practice and can have for controlling the system states. To our knowledge, nevertheless,
serious consequences [3]. For process control systems, outliers may most researches regarding GP usually use it for regression [11,12]
trigger deviations of the controller parameters, resulting in inferior or classification tasks [13–15]. For example in [15], Gaussian pro-
control performance even industrial accidents. On the other hand, cess is used for the classification problem. Robustness to noise or
for industrial process monitoring systems, outliers may induce outliers on data labels is considered from the practical point of
many false alarms, which may result in unnecessary economic view. But this is different from the problem of outlier detection
losses. As a consequence, the problem of guaranteeing data quality essentially. Outlier detection belongs to the unsupervised learn-
should be considered prior to the implementations of those data- ing paradigm where data labels are unavailable at the training
based techniques. In a nutshell, outliers should be detected as early phase and the robustness of an outlier detection model may be
as possible for the sake of reducing productivity losses, as well as enhanced only through a heuristic manner. In this paper, via spe-
health risks as much as possible. cific selections of mean function, covariance function, likelihood
To that end, several corresponding approaches were proposed function and inference method, we propose three outlier detection
during last years. For improving the performance of fault detec- algorithms either on the basis of Gaussian process regression or
tion, [4] proposed an outlier identification and removal scheme, Gaussian process classification. Compared with traditional outlier
in which a neural network was used to estimate the actual sys- detection methods, our proposed scheme may be more suitable
tem states and outlier detection was performed by comparing the for industrial processes. Here we conclude our contributions as
difference between the estimated and the measured states with follows.
the standard deviation over a moving time window. To enhance (1) We extend Gaussian process models to detect outliers in
the efficiency of the PCA-based chiller sensor fault detection, di- industrial processes;
agnosis and data reconstruction method, [5] presented a training (2) The mean function, covariance function, likelihood function
data cleaning strategy based on Euclidean distance and Z score of and inference method of the used GP models are designed specifi-
training samples. For the purpose of the safety and reliability of the cally;
energy system in steel industry, [6] proposed an anomaly detection (3) The effectiveness of proposed outlier detection scheme is
scheme with the consideration of the data feature of the energy validated with synthetic and real-world data sets.
system. After classifying the anomalies as the trend anomaly for The rest of this paper is structured as follows. Section 2 sum-
pseudo-periodic data and deviants for generic data, a dynamic time marizes methodology regarding Gaussian process models. Then
warping based method combining with adaptive fuzzy C means in Section 3, we extend GP models to outlier detection from the
was proposed for trend anomaly and K-nearest neighbor AFCM was perspective of both GP regression and classification. Then, several
proposed for generic data. Considering the characteristics of time discussions referring to Gaussian process for outlier detection are
series in PCSs, [7] proposed an improved RBF network to construct presented in Section 4. A series of experiments are carried out in
the model of controlled object and an Auto-Regression Hidden Section 5. Finally, some conclusions are drawn in Section 6.
Markov Model (HMM) was used to detect outliers according to
corresponding fitting residuals. Similarly, HMM was applied to 2. Gaussian process model
detect outliers in order to construct a robust dynamic PLS model,
with which an improved generalized predictive control was de- It is necessary to provide a simple explanation of difference
veloped finally [8]. For maintaining optimal product disposition between Gaussian process and Gaussian distribution, prior to in-
and control in advanced semiconductor processing, a robust outlier troducing the GP theory. A Gaussian process is a generalization of
removal method was proposed in [9], in which robust regression the Gaussian probability distribution. Whereas a probability distri-
methods were employed to accurately estimate outliers. Indeed, bution describes random variables which are scalars or vectors, a
almost all these proposed outlier detection schemes were verified stochastic process governs the properties of functions. Therefore,
to facilitate implementations of industrial process control and we can confirm that a Gaussian process model never assume a
process monitoring, several limitations such as Gaussian distribu- Gaussian distribution of any variable. In addition, it turns out many
tion assumption in [4] and [5], inadaptable for high-dimensional models that commonly employed in both machine learning and
data in [6,7] and [8], would hamper their prospects in practical statistics are in fact special cases of, or restricted kinds of Gaussian
applications. processes.
In this paper, we extend Gaussian process (GP) to outlier de- Suppose we are facing a supervised learning task, from which
tection and apply it to industrial processes. Gaussian processes are we have obtained N training points containing inputs {x1 , . . . , xN }
natural generalizations of multivariate Gaussian random variables ∈ Rd and their corresponding outputs {y1 , . . . , yN } ∈ R. We
to infinite index sets. Generally, GP models are routinely used to aim to generalize from these observed data, in the sense that
solve hard machine learning problems [10]. They are attractive our ability to predict uncertain aspects of a problem improves
due to their flexible non-parametric nature and computational after making the observations. There is no doubt that the optimal
simplicity. These two features are crucial for intricate industrial solution is to infer a distribution over functions given the training
process data. Provided we are facing a supervised learning task data, p (f |X , y) , (X = [x1 , . . . , xN ] , y = [y1 , . . . , yN ]) under the as-
and we aim to generalize from these observed data, in the sense
sumption that yi = f (xi ) for some unknown function f (possibly
that our ability to predict uncertain aspects of a problem improves
corrupted by noise). Then we can use this distribution, p (f |X , y) to
after making the observations. This is possible only if we postulate
make predictions for new inputs using Eq. (1).
a priori a relationship between the variables we will observe and ∫
the ones we need to predict. However, if this a priori postulate is
p (y∗ |x∗ , X , y) = p (y∗ |f , x∗ ) p (y∗ |X , y) df (1)
a very informed one a parametric approach governed by a finite
number of parameters is the method of choice, but if many aspects Suppose we have defined a prior over function f, then we can
of the phenomenon are unknown or hard to describe explicitly, obtain a posterior over this function once we have obtained some
a nonparametric approach can be more versatile and powerful. data/evidence through Bayesian theory:
Note that, additionally, for some parametric learning machines like
Neural Network, the determination of the number of hidden nodes p (f |X , y) ∝ p (f ) p (X , y|f ) (2)
B. Wang and Z. Mao / Applied Soft Computing Journal 76 (2019) 505–516 507

In Eq. (2), p (f ) is the defined prior over f, p (X , y|f ) is the likelihood µ∗ = K∗T Ky−1 y (12)
function. The necessary condition for this equation is that p (f )
Σ∗ = K∗∗ − K∗ Ky K∗
T −1
(13)
conjugates to p (X , y|f ). As first sight, it might seem difficult to
work with a distribution over the unaccountably infinite space of A specific case of above distribution is that the test set only contains
functions. However, it turns out that for a finite training set we a single input point x∗ . Then we have:
only need to consider the values of the function at the discrete set
P (y∗ |X , y, X ∗ ) ∼ N y∗ |kT∗ Ky−1 y , k∗∗ − kT∗ Ky−1 k∗
( )
of input values corresponding to the training set and test set data (14)
points, and so in practice we can work in a finite space [16]. To this
where k∗ = [κ (x∗ , x1 ) , . . . , κ (x∗ , xN )] and k∗∗ = κ (x∗ , x∗ ). We
end, a GP assumes that p (f (x1 ) , . . . , f (xN )) is jointly Gaussian,
can also write µ∗ in the following form:
with some mean µ (x) and covariance Σ (X ). Actually there are
many reasonable choices for determining this covariance as long N

as we specify functions which can generate a non-negative definite µ∗ = kT∗ Ky−1 y = αi κ (xi , x∗ ) (15)
covariance matrix for X. Furthermore, we wish to specify covari- i=1
ance so that points with nearby inputs will give rise to similar
where α = Ky−1 y.
predictions from the perspective of modeling. In this situation,
kernel functions are competent for such a task. One widely used
kernel function for GP is given by the exponential of a quadratic 3. Gaussian process for outlier detection
form, with the addition of constant and linear terms:
In this section, we firstly discuss the problem of outlier detec-
θ1
( )
κ (x1 , x2 ) = θ0 exp − ∥x1 − x2 ∥2 + θ2 + θ3 xT1 x2 (3) tion for industrial processes. Then we present how GP models can
2 be applied to calculate outlier scores from the perspective of GP
Usually, the following kernel (Gaussian kernel or RBF kernel) is a regression and GP classification, respectively.
very representative choice
( ) 3.1. Outlier detection for industrial processes
1
κ (x1 , x2 ) = σ 2 exp − 2 (x1 − x2 )2 (4)
2l As discussed in Section 1 that data-based techniques that utilize
Now we define a Gaussian process as a Gaussian distribution over process data have become attractive and effective solutions for
functions: modern industrial process control, process monitoring, and other
industrial processes. Therefore, it is reasonable to summarize sev-
f (x) ∼ GP m (x) , κ x, x′
( ( ))
(5) eral general features of industrial process data for the sake of the
where m (x) = E (f (x)) , κ x, x = E (f (x) − m (x)) f x′ proposal of an appropriate detection approach. Here we conclude
( ′
) ( ( ( )
( ))T ) the main characteristics of industrial process data as
−m x′ .
(1) Industrial process measurements in most scenarios are un-
labeled. It is very expensive, if not impossible, to label all measure-
2.1. Gaussian process for regression
ments for online applications. This feature should be the greatest
challenge for several traditional outlier detection approaches like
Here we discuss GP models for the problem of regression. Tak-
ing account of the noise on the observed target values, we have: traditional classification-based ones, in which training set must in-
clude sufficient instances sampled from both normal and abnormal
class.
yi = f (xi ) + εi (6) (2) Industrial process measurements used for process control
and process monitoring or FDI should be processed in real-time.
εi is a random noise variable whose value is chosen independently
for each observation. Thus we can consider the output variable has This feature is prominent for industrial processes and has also
the following distribution: triggered great challenge for several outlier detection methods like
nearest neighbor-based ones, in which detections for test instances
p (yi |fi ) ∼ N yi |fi , βi−1
( )
(7) are time-consuming since there is a need for the calculation of the
distances from the test instances to all the training samples. This
where β is a hyperparameter representing the precision of the
computation complexity is unaffordable for large-scale data.
noise, fi = f (xi ). Since the noise is independent for each obser-
(3) Models of outliers (abnormal data) encountered in indus-
vation, we can obtain the following distribution:
trial processes are hardly, even impossible, to construct. Since
p (y|f ) ∼ N y|f , β −1 IN
( )
(8) measurements on the normal working conditions of a machine
are very cheap and easy to obtain. Measurements of outliers, on
where f = [f1 , . . . , fN ], IN denotes the N × N unit matrix.
the other hand, would require the destruction of the machine
Assuming we are given a test set X∗ of size N ∗ × d, then our aim
in all possible ways [17]. Since resources generating outliers in
is to predict the corresponding outputs f∗ of size N ∗ × 1:
∫ industrial processes are very various and forms of outliers can be
p (y ∗ |X∗ , X , y) = p (f |X , y) p (y ∗ |X∗ , f ) df (9) also respectable, constructing a generative model (or finite number
of models) to describe measurements of outliers is an impossible
By the definition of GP, we have the following distribution: mission.
( ) ( ( )) (4) In the case that a training set containing both normal data
y Ky K∗ and outliers has been obtained, the problem of data imbalance
∼ N 0, T (10)
y∗ K∗ K∗∗ should be another challenge for several traditional outlier detec-
tion approaches, in which balanced class distributions or equal
Here we assume the mean is zero for simplicity. Ky = K + σy2 IN ,
misclassification costs are expected or assumed usually. Therefore,
K = κ (X , X ) is N ×N, K∗ = κ (X , X∗ ) is N ×N∗ , and K∗∗ = κ (X∗ , X∗ )
when presented with complex imbalanced data sets, these ap-
is N∗ × N∗ . Then by the standard rules of conditioning Gaussian, we
proaches fail to properly represent the distributive characteristics
can obtain the posterior of y∗ that has the following form:
of the data and resultantly provide unfavorable accuracies across
p (y ∗ |X , y, X ∗ ) ∼ N (y ∗ |µ∗ , Σ∗ ) (11) the classes of the data [18].
508 B. Wang and Z. Mao / Applied Soft Computing Journal 76 (2019) 505–516

(5) Due to the fact that modern industrial systems are becoming
increasingly complicated, many data-based techniques used for
industrial processes inevitably suffer from noises even outliers
when training these data-driven models. Noisy data (including
outliers) may deteriorate the performance of these techniques
depending on their degrees of sensitiveness to data corruptions.
Furthermore, in noisy environments, the noise robustness of the
approaches can be more important than the performance results
themselves [19]. This feature has also triggered great troubles for
many outlier detection approaches, especially for those based on
statistical models.
According to above discussion, an appropriate outlier detection
approach for industrial processes should possess the following
characteristics: (1) being implemented under the framework of
Fig. 1. GP regression with mean of zero and covariance of squared exponential
unsupervised learning; (2) being applied in both on-line and off-
kernel. The solid line is the predictive mean and the shadow area is built with the
line scenarios; (3) being robust to the absence of outlier measure- values of predictive mean plus/minus two predictive standard deviations. The circle
ments at training phase; (4) being robust to the problem of data points at the line are 50 training points.
imbalance at training phase; (5) being robust to noise and outliers
in the training set.
(for visualization) training samples whose outputs are all set to
3.2. Definition of outlier one. From this figure, it is not difficult to find that for samples
that are far away from training samples, their predictive means
Outlier detection finds extensive use in a wide variety of ap- decrease and predictive variances increase.
plications apart from industrial processes, such as fraud detection Now that both predictive mean and predictive variance can be
for credit cards, insurance, or health care, intrusion detection for
used for calculating outlier scores for test points, a more informa-
cyber-security, and milit surveillance for enemy activities [20].
tive feature can also be extracted with the following form:
The definition of outlier may be different among different applica-
tions, but most definitions share one common feature, i.e. outliers µ∗
Z∗ = √ (16)
are patterns in data that do not conform to a well defined notion Σ∗
of normal behavior. The difference happens when defining the Considering the analysis in Section 3.1, it is reasonable for us to
corresponding normal behavior.
make the following assumption: examples at hand for training are
In industrial applications, operating conditions are usually so
all sampled from positive class, i.e. normal instances. As thus, our
harsh that measurements are prone to be contaminated by out-
outlier detection model is akin to a data description model, which
liers from various sources, such as parameter changes, structural
can also be applied to outlier detection under the conditions of data
changes, faulty sensors and faulty actuators. In addition to these
unlabeled or extreme data imbalance [23]. Under this assumption,
outliers, measurements from the startup or shutdown periods are
the training process of our detection approach can be deemed as
usually unstable and cannot represent the normal system state.
an unsupervised leaning process.
Moreover, for several multi-mode systems, measurements from
different operating modes are usually distinct much from each
other. In a nutshell, all data not from the normal operating con- 3.4. Calculating outlier score from Gaussian process classification
dition are referred to as ‘‘outliers’’ in this paper.
For the task of binary classification being solved by a probabilis-
3.3. Calculating outlier score from Gaussian process regression tic approach, the aim is to model the posterior probabilities of the
target variable for the test observations, given a set of training data.
Under the framework of GP regression, we could obtain the As we introduced in regression task, however, predictions made by
predictive distribution of the output for a test instance through GP models lie on the entire real axis. As a result, a transfer function
Bayesian inference. Since this predictive distribution is described that can transform outputs of GP model into the interval (0, 1)
completely by its first and second order moments, it is natural to
has to be adopted. This transfer function is usually referred to as
investigate the predictive mean and variance to calculate outlier
activation function. And logistic sigmoid function is a competent
scores for new and unseen instances. Such a strategy has already
activation function:
been applied to detect change points in time-series data sets in [21],
where both ‘‘jumping mean’’ and ‘‘jumping variance’’ can indicate 1
σ (f ) = (17)
the onset of outliers. 1 + exp (−f )
Now recall Eq. (12), in which K∗ is a function of the test point
So in a binary classification task we can assume a latent function
input value x∗ , Ky and y can be calculated by training set (X , y).
f and define a Gaussian process over f (x), and then transform
Thus we can regard the predictive mean as a function of test point
f (x) to y by a logistic sigmoid function σ . In this situation, we
input value and it is natural to use the predictive mean to calculate
will obtain a non-Gaussian stochastic process over function y (x) ∈
outlier score for a new and unseen test instance.
(0, 1). In practice applications, values of target variable in a binary
Next recall Eq. (13), in which both K∗ and K∗∗ are functions of x∗
classification usually belong to the set {0, 1}. Here we define the
and Ky can be calculated by training set. Thus the predictive vari-
target variable ast, and it conforms to the Bernoulli distribution
ance is also the function of test point input value and can also be
used to calculate outlier score for a new and unseen test instance. given the values of f :
This property keeps consistent with the statistical control theory, p (t |f ) = σ (f )t (1 − σ (f ))1−t (18)
i.e. the variance and its changes are strong features indicating the
onset of outliers in multivariate systems [22]. We illustrate afore- Again, we denote the training set inputs by X = [x1 , . . . , xN ]
mentioned two properties in Fig. 1: there are 50 one-dimensional with corresponding observed target variables t = [t1 , . . . , tN ]. And
B. Wang and Z. Mao / Applied Soft Computing Journal 76 (2019) 505–516 509

we consider a single test point x∗ with the target value t∗ . Our goal
is to determine the following predictive distribution:

p (t∗ = 1|x∗ , X , t ) = p (f∗ |x∗ , X , t ) p (t∗ = 1|f∗ ) df∗ (19)

where p (t∗ = 1|f∗ ) = σ (f∗ ). Unlike Eq. (9) whose two terms
under the integral are two Gaussian distributions, this integral is
analytically intractable due to the fact that p (t∗ = 1|f∗ ) obeys a
Bernolli distribution. However, we can analytically approximate
this integral provided we have a Gaussian approximation to the
posterior distribution p (f∗ |x∗ , X , t ) in Eq. (19). As thus, the integral
becomes the convolution of a logistic sigmoid with a Gaussian Fig. 2. GP regression with mean of constant (equal to two) and covariance of
distribution. And we can evaluate it by the following approxima- squared exponential kernel. The solid line is the predictive mean and the shadow
tion [24]: area is built with the values of predictive mean plus/minus two predictive standard
∫ deviations. The circle points at the line are 50 training points (same as those in
Fig. 1).
= σ κ σ2 µ
σ (x) N x|µ, σ 2 dx ∼
( ) ( ( ) )
(20)

where
)−1/2 replacing parameterized latent functions in these models with
κ σ 2 = 1 + π σ 2 /8
( ) (
(21) Gaussian prior, GP models can be constructed. Thus, they can be
Referring to the problem of seeking a Gaussian approximation interpreted through a weight-space view.
to the posterior distribution p (f∗ |x∗ , X , t ), three well-known ap- These two views lead to different implementations, but are con-
proaches, i.e. Laplace Approximation (LA) [13], Expectation Prop- ceptually equivalent. The former is usually much simpler to work
agation (EP) [25] and Kullback–Leibler divergence (KL) minimiza- with, while the latter allows us to relate GP models to parametric
tion [26] comprising Variational Bound (VB) [27] as a special case, linear models rather directly.
can be competent. In the case of large training data, approximation
approach like FITC [28] is available. 4.2. Selection of mean function
Under the framework of GP classification, therefore, we could
obtain the class posterior conditioned on training set (X , t ) and Here we discuss the selection of the a priori mean function m (x)
test point input value x∗ . Thus, it is intuitive to apply GP binary over the latent function f (x) in GP model applied in our detection
classification directly for outlier detection provided we have a method. In Fig. 1, the setting for the a priori mean function is a zero-
labeled training set. Akin to the case of GP regression, we also mean function and the output values for all training samples are
assume that all examples at hand for training stem from positive assumed to be 1 (y = 1N ×1 ). Under this configuration, test points
class, i.e. t = 1, a N × 1 matrix. near training samples would have predictive means closed to one
(corresponding to the set output values of training samples), other-
4. Discussions wise their predictive means would approach zero (corresponding
to the assumed a priori mean of f (x)). As thus, the predictive means
Although the mechanisms for outlier detection based on GP of test inputs can intuitively reflect their similarities to training
models appear to be intuitive and simple, constructing a well- set. Actually, we can also set the a priori mean function to other
performed GP model is inherently non-trivial. In this section, we arbitrary constant except one like in Fig. 2, where the a priori mean
would like to discuss several issues concerning constructing Gaus- function is set to two and the predictive means for test points that
sian process models, detection approaches based on which would are far away from training set increase (approach two), which is
be influenced significantly. opposite to trend described in Fig. 1, but still can reflect the similar-
ities to training set. However, if we postulate a linear a priori mean
4.1. Two views of Gaussian process function, result can be demonstrated by Fig. 3, through which how
test samples resemble training set cannot be intuitively reflected
For the sake of further understanding the implementation of by predictive mean. From Figs. 1 to 3, however, we can find that
Gaussian process models and their relationship with other statisti- whatever the a priori mean function is, the predictive variance is
cal models which are extensively applied in the domain of machine always a reliable indicator (shadow area) for test instances.
learning, we firstly discuss two views of Gaussian process. As a result, in this paper, the priori mean functions for all
The first is process view [29]. The process view on a zero-mean GP models used for outlier detection are zero-mean functions
GP for latent function f (x) with covariance function Σ (x) is in (e.g. used in Fig. 1).
the spirit of the GP definition introduced in Section 3. This latent
function f (x) is defined implicitly for any finite input subset x = 4.3. Calculation of covariance function
{x1 , . . . , xN }, which induces a finite-dimensional distribution: f =
{f (x1 ) , . . . , f (xN )} ∼ N (0, Σ (x)). This definition is equivalent ( In) Gaussian process, the covariance of two functions f (x) and
to imposing a probability constraint on f (x). Then covariance f x′ is calculated indirectly through a function k with inputs x and
function Σ (x) is equivalent to imposing an smoothness constraint x′ :
which indicates that similar data points should have the same class
V f (x) , f x′ = E (f (x) − m (x)) f x′ − m x′ = κ x, x′
[ ( )] [ ( ( ) ( ))] ( )
assignments (for classification) or similar outputs (for regression).
As thus, the process view boils down to dealing with the projection (22)
of the GP onto a multivariate Gaussian distribution, thus to simple
linear algebra of quadratic forms. Functions that could achieve the above property are just kernel
The second is weight-space view [30]. Actually, Gaussian process functions. Thus, any kernel function that defines the covariance
models originally stem from classical statistical models, such as as a positive semi-definite matrix (Mercer theorem) would be
linear functions, Wavelet expansions and Neural networks. Via competent. There are several available kernels can be used directly
510 B. Wang and Z. Mao / Applied Soft Computing Journal 76 (2019) 505–516

Table 1
Compatibility between likelihoods and inference methods.
Likelihood\Inference Exact LA EP VB
√ √ √ √
Gaussian
√ √
Student’s t
√ √
Laplacian

Gaussian. σ > 0 is the scaling parameter and in the limit σ → 0


the distribution acquires thicker tails allowing for the possibility of
outlying observations.
Laplace likelihood function:
( )
1 |yi − fi |
f (yi |fi ) = exp − (25)
Fig. 3. GP regression with mean of linear and covariance of squared exponential 2b b
kernel. The solid line is the predictive mean and the shadow area is built with the
values of predictive mean plus/minus two predictive standard deviations. The circle where b > 0 is the scaling parameter. The probability density
points at the line are 50 training points (same as those in Figs. 1 and 2). function (PDF) of the Laplace distribution is also reminiscent of
the Gaussian distribution, whereas the Gaussian is expressed in
terms of the squared difference from the mean and the Laplace is
for a GP, such as polynomial kernel, squared exponential kernel expressed in terms of the absolute difference from the mean, which
(Gaussian kernel), and neural network. In addition, we can also leads to a fatter tail than the Gaussian.
generate more complex kernel functions based on these simple It is noteworthy that the integrals in Eqs. (9) and (18) are
ones. For a more extensive discussion of ‘‘kernel engineering’’, refer never analytically intractable if we use non-Gaussian likelihood
to [31]. functions. Thus, approximation inference is indispensable.
In this paper, extending the content of generating kernel func-
tion is not focused on. And we only employ two simple kernel 4.5. Selection of inference method
functions when devising GP models, one is squared exponential
kernel, and the other is a composite kernel with the following form:
Given the mean function, covariance function, likelihood func-
tion and training set (X , y) (or (X , t )), inference methods are re-
l
∑ sponsible for computing the approximate posterior. For Gaussian
Σ= αi Ki (23) likelihoods, GP inference boils down to computing mean and co-
i=1
variance of a multivariate Gaussian which can be achieved ex-
where α = {α1 , . . . , αl } are the weight parameters that need to actly by simple matrix algebra. While for non-Gaussian likeli-
be optimized. Since the difference of results generated by different hoods, we investigate three approximation inference techniques,
kernel functions is slight for simple tasks, we do not provide namely Laplace Approximation (LA), Expectation Propagation (EP)
illustrative comparative results like those in Section 4.2. and Variational Bayes (VB), to approximate posterior in this paper.
For more detailed contents concerning approximation inference
4.4. Selection of likelihood function approaches for GP, refer to [10].
In contrast to mean functions and covariance functions which
Recall Eq. (9) of the regression case and Eq. (18) of classification can be used in any context, however, there are some restrictions
case, terms p (f |X , y) and p (t = 1|f ) are usually referred to as
on which likelihood functions may be used with which inference
likelihood functions. In regression case, it is set to be Gaussian
method. Based on the rationales of different inference methods, in
distribution due to the Gaussian assumption for noise term, while
Table 1, a compatibility matrix between likelihoods and inference
it is a Bernoulli distribution for classification case. However, for
methods that are employed in this paper is provided. For Student’s
most industrial processes, as discussed in Section 3.1, robustness
t distribution, EP approximation inference could not be used since
to noise or outliers is a primary requirement for any data-based
technique. Therefore, the selection of likelihood function for a it fails to converge for non log concave distributions [33]. For
GP model would have significant influence on the ultimate per- Laplace distribution, Laplace approximation inference cannot be
formance. As we all know, compared with Gaussian distribution, used since it has discontinuous derivatives.
Student’s t distribution and Laplace distribution have longer ‘tails’,
which means that they are much less sensitive than the Gaussian 4.6. Adaptation of hyperparameters
to presence of a few data points which are outliers. For illustration,
we introduce a contrastive example which was provided by [32] in In order to apply GP models as effective tools in practical ap-
Fig. 4. plications, we need to carefully design its specification, which in-
As a result, in this paper, we determine to employ Student’s t cludes both choices for three functions discussed from Section 4.2
and Laplace distribution as the likelihood function of the GP mod- to Section 4.4 and the setting of associated parameters. Since
els. Assume that fi and yi are latent variable and output variable, these parameters do govern the model indirectly, they are usually
respectively. The Student’s t likelihood function has the following referred to as hyperparameters. In this paper, we only discuss the
form: setting of hyperparameters for our applications. In principle, both
( υ+1 ) )− υ+2 1 GP regression and classification enable an automatic way of tuning
Γ (yi − fi )2
(
1
f (yi |fi ) = (2) √ 1+ (24) the hyperparameters. For example, in the case of regression we
Γ υ2 υπ σ υσ 2
can easily derive the logarithmic likelihood function through the
where parameter υ is called the degrees of freedom. For the par- standard form for a multivariate Gaussian distribution:
ticular case of υ = 1, t-distribution reduces to the Cauchy distri- 1 1 ⏐ ⏐ n
bution, while in the limit υ → ∞ the t-distribution becomes a log p (y|X , θ) = − yT Ky−1 y − log ⏐Ky ⏐ − log 2π (26)
2 2 2
B. Wang and Z. Mao / Applied Soft Computing Journal 76 (2019) 505–516 511

Fig. 4. An illustrative example of GP regression with outliers. The left-hand part is Gaussian likelihood and the right-hand part is the Student’s t likelihood. The black solid
line is the real function, then the blue solid line and the red dashed line are the predictive mean and corresponding standard covariance.

Table 2
Confusion matrix of two-class classification problem.
Actual Label
Target Class Negative Class
Predicted Target Class True Positive (TP) False Positive (FP)
Label Negative Class False Negative (FN) True Negative (TN)

Then the predictive mean, variance and marginal likelihood can be


calculated as followings:
E (f∗ ) = kT∗ α, α = LT \(L\y) (27)
v ar (f∗ ) = κ (x∗ , x∗ ) − υ υ, υ = L\k∗
T
(28)
1 ∑ N
Fig. 5. GP regression with zero-mean function and covariance of squared expo- log p (y|X ) = − yT α − log Lii − log 2π (29)
nential kernel whose parameters are derived by the maximum marginal likelihood 2 2
i
principle. The solid line is the predictive mean and the shadow area is built with
the values of predictive mean plus/minus two predictive standard deviations. The
5. Experiments and analysis
circle points at the line are 50 training points (same as those in Figs. 1–3).

In this section, we carry out extensive experiments to investi-


gate the performance of our proposed outlier detection strategies
where Ky is the covariance matrix of the noisy targets y, and θ rep- with respect to properties discussed in Section 3.1.
resents the hyperparameters. Then maximization of the logarith-
mic likelihood function can be viewed as analogous to the type II 5.1. Baselines and metrics
maximum likelihood procedure and can be solved by any gradient-
based optimization algorithms such as conjugate gradients [34]. We compare our approach with five well-known competitors,
In our outlier detection model, nevertheless, we assume that i.e. Gaussian mixture model (GMM), k-means (KM), k-nearest
the training samples at hand all stem from positive class (i.e. y = neighbor (KNN), support vector data description (SVDD), and prin-
1N ×1 ). If the selected covariance function is not appropriate, the cipal component analysis (PCA). All these competitors are data
predictive mean for any unseen instance will always be one, which description methods whose procedures have been summarized in
indicates that the function f (x) = 1 would give a perfect data dd_tools.
fit while being extremely smooth. Using the same training set as Referring to the evaluation criteria for outlier detection meth-
those in previous examples (Figs. 1–3), we show a result whose hy- ods, many metrics for classification tasks have been developed,
perparameters are optimized by the maximum marginal likelihood such as overall accuracy, precision, recall, geometric mean of ac-
principle in Fig. 5. This result also indicates that hyperparameters curacies (G-mean), F-measure, etc. In addition to these metrics for
which lead to smoother functions are preferred by the maximum only one specific threshold, there are also some metrics that can
evaluate the average performance of a classifier for various thresh-
marginal likelihood principle. Whereas we can still find that the
olds, such as ROC-curve and PR-curve. As the research community
predictive variance is a sensitive variable.
continues to develop a greater number of intricate and promising
In a nutshell, adaptation of hyperparameters for our outlier
learning algorithms, it becomes paramount to have standardized
detection model is a hard work which may be distinct from the
evaluation metrics to properly assess the effectiveness of such
general setting of a GP model. But this problem can be alleviated if
algorithms. In this paper, we choose three metrics that are more
we incorporate further model assumptions and do it in application- suitable for data in industrial processes (data imbalance), namely
specific manner. G-mean, F-measure, and ROC-curve.
Prior to the introductions of these metrics, a representation of
4.7. A computational issue classification performance is formulated by a confusion matrix as
illustrated in Table 2. It is worthy to state that outlier samples
For GP regression model, the calculations of predictive mean, belong to positive class and normal samples belong to negative
predictive variance and marginal likelihood function all need to class.
compute Ky−1 . But for reasons of numerical stability, it is unwise Then we can formulate G-mean as:

to directly invert Ky . A more robust alternative is to (compute) a TP TN
Cholesky decomposition, Ky = LLT (L = cholesky K + σy2 I ). G-mean = × (30)
TP + FN TN + FP
512 B. Wang and Z. Mao / Applied Soft Computing Journal 76 (2019) 505–516

This metric evaluates the degree of inductive bias in terms of a ratio


of positive accuracy and negative accuracy.
F-measure can be formulated as:
1 + β 2 · Recall · Precision
( )
F -Measure = (31)
β 2 · Recall + Precision
where β is a coefficient to adjust the relative importance of preci-
sion versus recall (usually, β = 1).
TP TP
Recall = , Precision = (32)
TP + FN TP + FP
F-measure combining recall and precision as a measure could
provide more insight into the functionality of a classifier than the
accuracy metric.
The ROC (receiver operating characteristic) curve represents Fig. 6a. Results of comparison for Friedman data set w.r.t. G-mean.
the trade-off between the true positive rate and the false positive
rate.1 Note that true positive rate and false positive rate are two
mutually exclusive quantities, and ROC curve could evaluate the
general performance rather than performance at only one work-
ing point. In general, the area under the curve (AUC) is used to
measure the performance of outlier detection algorithms. The AUC
of specific algorithm is defined as the surface area under its ROC
curve. It can be easily found that for an outlier detection task, the
AUC of a perfect algorithm is one, implying that all outliers have
been identified along with none misclassified normal data. And for
algorithms whose AUCs are smaller than 0.5, we usually call them
invalid machines since 0.5 is usually regarded as the performance
of ‘‘random guessing’’. Here we employ method in [35] to calculate
the AUC.

5.2. Friedman data set


Fig. 6b. Results of comparison for Friedman data set w.r.t. F-measure.
Friedman constructed the following regression problem which
accepts a 10-dimensional input vector and yet the function value
depends only on the first five input dimensions [36].

f (x) = 10 sin (π x1 x2 ) + 20 (x3 − 0.5)2 + 10x4 + 5x5 (33)


The purpose of the remaining input dimensions (x6 , . . . , x10 ) is to
complicate the task. We generate ten training sets and each has 100
data points. The input values X are sampled from a 10-dimensional
uniform distribution over the range [0, 1]. The corresponding out-
put values Y are generated by evaluating the above function with
the input locations and then corrupt them using normal noise
with zero mean and unit variance. As thus, the input and output
variables construct a 11-dimensional data set (X , Y ). At the test
phase, we generate 1000 instances, of which the input values are
also sampled from a 10-dimensional uniform distribution over the
range [0, 1] and the corresponding output values are noise free Fig. 6c. Results of comparison for Friedman data set w.r.t. AUC.
function values evaluated at input locations. To created outliers at
the test set, we can either select several dimensions and replace
with samples drawn from a normal distribution over the range
in training set. Our methods (especially GP-RV and GP-RMV) still
[0, 10] (with output values unchanged), or change the outputs
outperform the competitors. When we inject outliers into training
values with samples drawn from a normal distribution with mean set, however, our methods are influenced heavily. This is mainly
15 and variance 9, which is identical to that adopted in [37]. So if we because that more outliers in the test set are classified into the
regard samples generated from the Eq. (31) as normal data, these normal class due to the outliers in the training set. Similarly,
simulated outliers would deviate from the training set. Note that other competitors are also influenced by these abnormal training
maybe particular outliers cannot be identified when the normal instance. It is noteworthy that SVDD performs best in this scenario.
class and outlier class overlaps. Then we choose to use Student’s t and Laplace as the likelihood
Firstly, we compare our methods whose likelihood functions function, and use LA and EP as the inference approach, respectively.
are all Gaussian (exact inference method) with other five competi- Results of comparison with Gaussian likelihood are shown in Fig. 7.
tors. Three histograms in terms of three metrics in Fig. 6 show the For all the proposed methods, we can easily find that Student’s t
results, in which the comparison is obvious. From these histograms and Laplace have strong robustness to outliers in the training set
we can see that all methods perform well when none outlier exists (especially Student’s t). Here we only show the results w.r.t AUC
values for the consideration of space.
1 Note that normal data is regarded as positive in this paper. So true positive rate Finally, we also show the comparison in terms of three metrics
indicates the rate of correctly detected normal data. with other competitors in Table 3. Here we only provide the results
B. Wang and Z. Mao / Applied Soft Computing Journal 76 (2019) 505–516 513

Table 4
Summary of results for TE data set (with outliers in training set) w.r.t G-mean,
F-measure and AUC. For GP methods, the likelihood function is Student’s t and the
inference method is Laplace.
GMM KM KNN SVDD PCA GP-RM GP-RV GP-RMV GP-C
G-mean 0.715 0.671 0.684 0.743 0.697 0.750 0.766 0.762 0.747
F-measure 0.688 0.639 0.642 0.707 0.670 0.732 0.739 0.735 0.732
AUC 0.810 0.759 0.764 0.853 0.799 0.861 0.883 0.879 0.859

Fig. 7. Results of comparison for different likelihood functions w.r.t. AUC.

Table 3
Summary of results for Friedman data set w.r.t. G-mean, F-measure and AUC
values. The results are average values over all ten training sets. GP-RM, GP-RV, GP-
RMV and GP-C represent methods based on predictive mean, predictive variance,
predictive mean plus variance of Gaussian process regression and Gaussian process
classification. Fig. 8. Results of comparison for different covariance functions w.r.t. AUC.
GMM KM KNN SVDD PCA GP-RM GP-RV GP-RMV GP-C
G-mean 0.605 0.632 0.539 0.701 0.513 0.744 0.752 0.748 0.713
F-measure 0.515 0.562 0.541 0.679 0.523 0.711 0.716 0.720 0.718
AUC 0.614 0.602 0.569 0.723 0.608 0.763 0.778 0.771 0.742 methods can be improved significantly. Results shown in Fig. 8
indicates that for TE data set, performance w.r.t. AUC value can be
improved by the composite kernel function, but this improvement
is not significant.
of Student’s t likelihood in the comparison (results of Laplace
is similar). We can see that our methods based on Student’s t 5.4. Electric arc furnace process control
likelihood outperform all the competitors w.r.t all three metrics.
Electric arc furnace (EAF) is widely used in many countries
for refining quality steel for industry. Nowadays in steel making
5.3. Tennessee Eastman benchmark process companies, the number of EAFs is rapidly increasing since they
are suitable devices to melt scrap and direct reduced iron for steel
The Tennessee Eastman (TE) benchmark process is a widely production. A schematic diagram of the EAF operation is shown
used simulation process for evaluating different approaches re- in Fig. 9. The scrap is loaded into the furnace and the roof is then
ferring to process monitoring and FDD [38]. The process has 41 closed, before the electrodes bore down the scrap to transfer elec-
process variables and 12 manipulated variables. Of the 41 process tric energy. Natural gas and oxygen are injected into the furnace
variables, 22 of them are easily measured while the remaining from the burners which get combusted releasing chemical energy
are difficult to measure. Therefore, these 22 variables are usually that is also absorbed by the scrap. The scrap keeps melting through
utilized to either predict other difficult-to measure variables or absorbing electrical chemical and radiation energy. When suffi-
process monitoring. cient amount of space is available within the furnace, another scrap
charge is added and melting continues until a flat batch of molten
Due to changes in the G/H ratio and stripper underflow, TE
steel is formed at the end of the batch. Through the evolution of
process has six basic operation modes. In our experiment, we
carbon monoxide from molten metal a slag layer is formed, which
sample 500 examples from each mode and construct six data sets.
contains most of the oxides resulting from the reactions of the
In each data set, examples from five modes are regarded as normal
metals with oxygen. Slag chemistry is adjusted through oxygen
data and examples from the rest mode are regarded as outliers.
and carbon lancing, beside some direct addition of carbon, lime and
Then 100 examples from each normal mode and 50 examples from
dolomite through the roof of the furnace.
the abnormal mode are selected to construct the training set, all
Generally, an EAF is among the highest electrical energy con-
the remaining examples construct the test set. As thus, the size of sumers in the power grid. The rising cost of energy has put pressure
training set and test set is 550 (including 50 outliers) and 2450 on the steel industry to improve their process control systems
(including 450 outliers), respectively. We repeat this process five to conserve energy without sacrificing quality and equipment.
times and present the average results since the training and test This pressure is more accentuated when we consider the adverse
set are selected randomly. Note that outliers are generated in faulty effects of EAFs on the power quality of its feeding power system.
conditions, so they must deviate from the data generated in normal Since an EAF is a non-stationary electric load, it can cause voltage
condition. However, there are still several outliers hard to detect fluctuation or flicker. It also produces current harmonics due to its
since they greatly resemble normal data. highly nonlinear behavior. The unbalance in the meltdown phase
Firstly, we compare results of our methods equipping with is another adverse effect of such loads in a power system. In the
Student’s t as likelihood and LA as inference with five competitors. literature regarding control strategies for EAF systems, adaptive
The covariance function used here is squared exponential kernel. control and predictive control are the most prominent ones. In
As can be seen in Table 4, our proposed methods outperform all the addition, different sets of state variables have been considered by
competitors in terms of three metrics, especially method based on these control strategies in order to reach higher control perfor-
the predictive variance. mance.
Then, we choose to use a composite kernel function as intro- A direct adaptive controller for EAF electrode regulator system
duced in Section 4.3 to investigate if the performance of GP-based was proposed in [40], from which we could find real-time values
514 B. Wang and Z. Mao / Applied Soft Computing Journal 76 (2019) 505–516

Fig. 9. Electric arc furnace operations.


Source: Cited from [39].

Fig. 10. Results of comparison for EAF data set w.r.t. ROC curve.
Fig. 11. The schematic diagram of transonic wind tunnel systems.
Source: Cited from [41].

of several variables would be used to design the controller. Then, a


17-dimension data set sampled from 8 process variables (primary
potential energy is translated to kinetic energy before it reaches the
voltage, secondary voltage, primary current, secondary current,
test section. The test section is closed in a large plenum chamber
short net resistance, short net reactance, and arc impedance) is
and scale model is mounted at this section. Air exchange can be
used in this experiment. This data set contains 5000 examples,
implemented through slots at the top and bottom walls. Then part
of which 500 are outliers. 1000 normal examples are used as
of the air is injected back into the wind tunnel through the plenum
the training set, and all the rest examples constitute the test set.
exhaust valve and the plenum injector, and the remaining part
Note that those outliers are measurements generated from faulty
is distributed uniformly through mesh screens. After the mesh
conditions, startup and shutdown periods. They deviate much from
the data sampled at the stable stage. screen, the air goes through a diffuser and deflects when passing
Then we provide results of comparison in terms of ROC curve, the first corner. Finally, part of the air is ejected out through the
which can provide insight into the relationship between true pos- main exhaust hydraulic servo valve, and the rest returns to the
itive rate (TPR) and false positive rate (FPR). For our proposed compressor intake.
method, we select the one based on the predictive variance due It is necessary to keep the Mach number constant at a prede-
to its performance for previous data sets. The covariance function fined set point because most measured variables are a function of
is composite kernel function, the likelihood function is Student’s the Mach number, for given test conditions of stagnation pressure
t and the inference method is Laplace. Results of comparison are and temperature. In addition, the predictive controller for the
shown in Fig. 10, from which we can see that when the FPR is Mach number has significantly better performance than the PID
greater than 0.4, the TPR could reach nearly one, which indicates controller and the key issue in predictive control is the prediction
that nearly none normal example is misclassified. While for other of Mach number. Correspondingly, detecting the contaminative
competitors, only accepting more outliers can reduce the number data in Mach number time series is a significant work, which may
of misclassified normal data, for example, 0.6 for SVDD. be ignored by most researchers.
As stated in [41], there are totally 5 main impacting variables
5.5. Wind tunnel process control (displacement of the main control hydraulic servo valve, the dis-
placement of the main exhaust hydraulic servo valve, the dis-
A wind tunnel (WT) system is used for testing scale mod- placement of mesh screen hydraulic servo valve, the stagnation
els, mostly of airplanes, in the speed region of 0 to Mach 1.3. A pressure, the angle of attack) that have strong connection with the
schematic diagram of transonic wind tunnel is shown in Fig. 11. control of WT. Therefore a 5-dimension WT data set is used in this
The air is injected into the wind tunnel through the main control experiment. This data set contains 5000 examples, of which 500
hydraulic servo valve and the main injector. Then it passes the are outliers. Normal samples stem from the stable stage and are
third and the fourth corner in order to reach the stilling chamber, used as training data. Outliers are sampled from the startup and
where air speed is low relatively. After the stilling chamber, the shutdown periods and used in the test data. Different from the
B. Wang and Z. Mao / Applied Soft Computing Journal 76 (2019) 505–516 515

[3] R.K. Pearson, Outliers in process modeling and identification, IEEE Trans.
Control Syst. Technol. 10 (1) (2002) 55–63.
[4] H. Ferdowsi, S. Jagannathan, M. Zawodniok, An online outlier identification
and removal scheme for improving fault detection performance, IEEE Trans.
Neural Netw. Learn. Syst. 25 (5) (2014) 908–919.
[5] Y. Hu, et al., A statistical training data cleaning strategy for the PCA-based
chiller sensor fault detection, diagnosis and data reconstruction method,
Energy Build. 112 (2016) 270–278.
[6] J. Zhao, et al., Adaptive fuzzy clustering based anomaly data detection in
energy system of steel industry, Inform. Sci. 259 (3) (2014) 335–345.
[7] F. Liu, Z. Mao, W. Su, Outlier detection for process control data based on a
non-linear auto-regression hidden Markov Model method, Trans. Inst. Meas.
Control 34 (5) (2012) 527–538.
[8] X. Jin, et al., An improved generalized predictive control in a robust dynamic
partial least square framework, Math. Probl. Eng. 2015 (12) (2015) 1–14.
[9] J.C. Robinson, et al., Improved overlay control using robust outlier removal
methods, Proc. SPIE 7971 (11) (2011) 79711G–79711G-10.
[10] C.E. Rasmussen, C.K.I. Williams, Gaussian processes for machine learning
(adaptive computation and machine learning), Int. J. Neural Syst. 14 (481)
(2005) 69–106.
[11] P.D. Kirk, M.P. Stumpf, Gaussian process regression bootstrapping: exploring
Fig. 12. Results of comparison for WT data set w.r.t. ROC curve.
the effects of uncertainty in time course data, Bioinformatics 25 (10) (2009)
1300–1306.
[12] A. Banerjee, D.B. Dunson, S.T. Tokdar, Efficient Gaussian process regression for
large datasets, Biometrika 100 (1) (2013) 75.
situation of EAF data, the required accuracy for Mach number is
[13] C.K.I. Williams, D. Barber, Bayesian classification with Gaussian processes,
higher, which may be a challenge for the detection model. IEEE Trans. Pattern Anal. Mach. Intell. 20 (12) (1998) 1342–1351.
Identically, for our proposed method, we select the one based [14] J. He, H. Gu, Z. Wang, Multi-instance multi-label learning based on Gaussian
on the predictive variance due to its performance for previous process with application to visual mobile robot navigation, Inform. Sci. 190
data sets. The covariance function is composite kernel function, (3) (2012) 162–177.
the likelihood function is Student’s t and the inference method is [15] H.C. Kim, Z. Ghahramani, Outlier Robust Gaussian process classification,
in: Joint IAPR International Workshops on Statistical Techniques in Pattern
Laplace. Then we provide results of comparison in terms of ROC Recognition (SPR) and Structural and Syntactic Pattern Recognition (SSPR),
curve as demonstrated by Fig. 12. We can see that performance Springer, Berlin, Heidelberg, 2008.
of all methods decrease compared with that for EAF data set. The [16] C.M. Bishop, Pattern Recognition and Machine Learning (Information Science
main reason is that TW data set has a higher requirement for and Statistics), Springer-Verlag New York, Inc., 2006, 049901.
accuracy so that outliers are not easy to identified from normal [17] D.M.J. Tax, R.P.W. Duin, Support vector data description, Mach. Learn. 54 (1)
(2004) 45–66.
data. From the ROC curve we can see that the TPR can hardly reach
[18] H. He, E.A. Garcia, Learning from imbalanced data, IEEE Trans. Knowl. Data
one whatever the FTR is set. This indicates that the distribution of Eng. 21 (9) (2008) 1263–1284.
outliers may be partially overlapping with that of normal data. [19] J.A. Sáez, et al., Tackling the problem of classification with noisy data using
multiple classifier systems: Analysis of the performance and robustness, Inf.
6. Conclusions Sci. Int. J. 247 (15) (2013) 1–20.
[20] V. Chandola, A. Banerjee, V. Kumar, Anomaly detection: A survey, ACM Com-
put. Surv. 41 (3) (2009) 1–58.
With more and more data-based techniques applied in modern [21] J. Takeuchi, K. Yamanishi, A unifying framework for detecting outliers and
industrial processes, detecting outliers for industrial process data change points from time series, IEEE Trans. Knowl. Data Eng. 18 (4) (2011)
become increasingly significant. This paper proposes an outlier 482–492.
detection scheme based on Gaussian process models, which are [22] S. Kumar, V. Sotiris, M. Pecht, Health assessment of electronic products using
routinely used to solve hard machine learning problems. Due to mahalanobis distance , projection pursuit analysis, Int. J. Comput. Inf. Sci. Syst.
Sci. Eng. (4) (2008) 242.
their flexible non-parametric nature and computational simplic-
[23] D.M.J. Tax, One-class classification, Delft University of Technology., 2001.
ity, they are mainly used as effective tools for regression tasks [24] D. Barber, C.M. Bishop, Ensemble learning for multi-layer networks, in: Con-
or classification tasks. Via specific selections of mean function, ference on Advances in Neural Information Processing Systems, 1998.
covariance function, likelihood function and inference method, [25] T.P. Minka, Expectation propagation for approximate Bayesian inference,
we develop three outlier detection algorithms based on Gaussian 2013, vol. 17, p. 362–369.
[26] M. Opper, C. Archambeau, The variational gaussian approximation revisited,
process regression and one based on Gaussian process classifica-
Neural Comput. 21 (3) (1989) 786–792.
tion. Compared with traditional detection methods, our proposed [27] M.N. Gibbs, D.J.C. Mackay, Variational Gaussian process classifiers, IEEE Trans.
scheme has less assumptions and is more suitable for modern Neural Netw. 11 (6) (2000) 1458–1464.
industrial processes. Finally, we carry out several experiments on [28] E. Snelson, Z. Ghahramani, Sparse Gaussian process using pseudo-inputs, Adv.
both synthetic data set and real-life industrial processes’ data sets. Neural Inf. Process. Syst. 18 (1) (2006) 1257–1264.
[29] M. Seeger, Gaussian processes for machine learning, J. Amer. Statist. Assoc. 14
Through comparison with several competitors, the effectiveness of
(481) (2008) 69–106.
our proposed scheme has been verified. [30] C.K.I. Williams, Prediction with gaussian processes: from linear regression to
linear prediction and beyond, in: Nato Advanced Study Institute on Learning
Acknowledgments in Graphical MODELS, 1998.
[31] J. Shawe-Taylor, N. Cristianini, Kernel Methods for Pattern Analysis, Cam-
This work is supported by National Natural Science Foundation bridge university press, 2014.
[32] R.M. Neal, Monte Carlo implementation of Gaussian process models for
of China (Grant Nos. 61473072 and 61333006).
Bayesian regression and classification, Physics (1997).
[33] P. Jylänki, J. Vanhatalo, A. Vehtari, Robust Gaussian process regression with a
References student- t likelihood, J. Mach. Learn. Res. 12 (7) (2011) 1910–1918.
[34] S. Wright, J. Nocedal, Numerical optimization. Springer Science 35.67-68
[1] B.S.J. Costa, P.P. Angelov, L.A. Guedes, Fully unsupervised fault detection and (1999): 7.
identification based on recursive density estimation and self-evolving cloud- [35] J. Huang, C.X. Ling, Using AUC and accuracy in evaluating learning algorithms,
based classifier, Neurocomputing 150 (2015) 289–303. IEEE Trans. Knowl. Data Eng. 17 (3) (2005) 299–310.
[2] L.H. Chiang, R.J. Pell, M.B. Seasholtz, Exploring process data with the use of [36] Kooperberg, Charles, Multivariate adaptive regression splines, Ann. Statist. 19
robust outlier detection algorithms, J. Process Control 13 (5) (2003) 437–449. (1) (1991) 1–67.
516 B. Wang and Z. Mao / Applied Soft Computing Journal 76 (2019) 505–516

[37] R. Ranjan, B. Huang, A. Fatehi, Robust Gaussian process modeling using EM [40] L. Li, Z. Mao, A direct adaptive controller for EAF electrode regulator system
algorithm, J. Process Control 42 (2016) 125–136. using neural networks, Neurocomputing 82 (4) (2012) 91–98.
[38] J.J. Downs, E.F. Vogel, A plant-wide industrial process control problem, Com- [41] X. Wang, P. Yuan, Z. Mao, Ensemble fixed-size LS-SVMs applied for the Mach
put. Chem. Eng. 17 (3) (1993) 245–255. number prediction in transonic wind tunnel, IEEE Trans. Aerosp. Electron.
[39] S. Bird, et al., Modeling, optimization and estimation in electric arc furnace Syst. 51 (4) (2015) 3167–3181.
(EAF) operation, Chem. Eng. (2013).

You might also like