
Received: 4 May 2020 Revised: 31 August 2020 Accepted: 1 September 2020

DOI: 10.1002/eng2.12291

RESEARCH ARTICLE

Early failure detection of paper manufacturing machinery using nearest neighbor-based feature extraction

Wonjae Lee1, Kangwon Seo1,2

1 Department of Industrial and Manufacturing Systems Engineering, University of Missouri, Columbia, Missouri
2 Department of Statistics, University of Missouri, Columbia, Missouri

Correspondence
Kangwon Seo, Department of Industrial and Manufacturing Systems Engineering, University of Missouri, E3437M Thomas & Nell Lafferre Hall, Columbia, MO 65211, USA.
Email: [email protected]

Abstract
In a paper manufacturing system, it is important to detect machine failure before it occurs and to take the necessary maintenance actions to prevent an unexpected breakdown of the system. Multiple sensor data collected from a machine provide useful information on the system’s health condition. However, it is hard to predict the system condition ahead of time because of the lack of clear warning signs of future failures, the rarity of failure events, and the wide range of sensor signals, which might be correlated with each other. We present two feature extraction techniques based on the nearest neighbor, combined with machine learning algorithms, to detect a failure of the paper manufacturing machinery before it occurs from multistream system monitoring data. First, for each sensor stream, the time series data is transformed into binary form by extracting the class label of the nearest neighbor. We feed these transformed features into a decision tree classifier for failure classification. Second, expanding on this idea, the relative distance to the local nearest neighbor is measured, which results in a real-valued feature, and a support vector machine is used as the classifier. Our proposed algorithms are applied to the dataset provided by the Institute of Industrial and Systems Engineers 2019 data competition, and the results show better performance than other state-of-the-art machine learning techniques.

KEYWORDS
1-nearest neighbor, feature extraction, multistream time series classification, rare event prediction,
relative distance

This is an open access article under the terms of the Creative Commons Attribution License, which permits use, distribution and reproduction in any medium, provided the original work is properly cited.
© 2020 The Authors. Engineering Reports published by John Wiley & Sons Ltd.
Engineering Reports. 2020;e12291. wileyonlinelibrary.com/journal/eng2

1 INTRODUCTION

Pulp and paper production requires highly complex and integrated processes by chemical or mechanical means, which include wood preparation, pulping, chemical recovery, bleaching, and papermaking.1 In advanced papermaking facilities, the systems are continuously monitored so that the operators can manage and control the processes and detect any possible incidents that might cause an abrupt production break. To do this, a wide range of sensors are deployed in many different parts of the manufacturing equipment to measure important process variables and monitor the system status. These sensors generate large amounts of multistream measurements. For instance, the motivating dataset for this research contains system monitoring measurements captured by 61 different sensors located in a paper manufacturing machine.2 These raw measurements ought to be processed and analyzed appropriately to obtain useful information regarding the system’s health condition. The general purpose of this article is to develop a practical pipeline to process, analyze, and interpret system monitoring data given in the form of a multistream time series (MSTS) in order to detect a system failure that may occur in the near future.
One challenging problem that we aim to resolve through this project is that the machine failure has to be predicted ahead of its physical occurrence. Traditional system monitoring tools, such as control chart-based quality control techniques, focus on detecting the assignable causes of an abnormal system status as soon as possible after it occurs. As such, the average run length has been used as the main performance metric for comparing various types of control charts.3 In the paper manufacturing process, however, once the machine failure occurs, the system instantly stops and there is no benefit in detecting the failure afterward. Therefore, it is important to perceive any symptomatic signal that precedes a machine breakdown, even if only a few seconds earlier. To achieve this goal, we define our problem as a binary classification task in which we aim to distinguish the precursory signs from the normal signals. This problem definition motivated us to use the terminology “MSTS” rather than “multivariate time series,” as multivariate data implies multiple responses in the statistical literature.
Other difficulties for this task may be attributed to the multistream nature of the given data and a lack of failure-labeled
observations. Although there exist several feature-based classification algorithms for the time series data, it is problematic
to generate and select a proper set of features from high dimensional multistream data. As an alternative, the deep learning
techniques are emerging as competent tools for handling such data.4,5 However, these techniques require a substantially large amount of data, which is not the case for our problem: the dataset includes only 124 machine breakdown points among more than 18 000 time points, so the failure-labeled points make up only 0.67% of the whole dataset. Such an extremely imbalanced dataset makes it even harder to build a high-performance model, since there are not enough failure-labeled data to train it.
To solve the aforementioned problems, we rely on machine learning algorithms, which have been recognized as more powerful techniques for predictive tasks than traditional approaches that do not incorporate these techniques,5 with properly processed variables and informative features. Specifically, for each sensor or variable, we transform the time series instance into a scalar extracted from its nearest neighbor and feed the transformed variables into a suitable off-the-shelf machine learning algorithm to make a classification. The nearest neighbor-based algorithm has been recognized as one of the most effective classification methods for time series data.6 In this article, we exploit the advantages of the 1-nearest neighbor (1-NN) but extend the method to MSTS data. The objective of the resulting algorithms is to extract suitable features for MSTS classification. First, we extract the class label of the nearest neighbor considering only a single variable, which results in a binary feature for each variable. Second, the relative distance to the nearest neighbor is measured, which is anticipated to provide more useful information on an instance’s nearest neighbor. In this research, we demonstrate how to predict the paper machine failure before it occurs (ie, early detection) and how to find the variables that have a significant effect on causing failures using these nearest neighbor-based features.
The rest of this article is organized as follows. Section 2 reviews related work in time series classification. Section 3 presents the overall procedure and describes the dataset, the preprocessing, and the two algorithms we propose in this article. In Section 4, we evaluate the performance of the proposed algorithms on the real-world dataset of paper manufacturing sensor signals. Finally, we conclude our research in Section 5.

2 RELATED WORK

A wide range of algorithms have been used and proposed to solve classification problems with univariate time series data. Sykacek and Roberts7 propose an approach with a latent feature representation by applying Bayesian theory to hierarchical time series processing. Esmael et al8 suggest a hybrid approach to improve the accuracy of the time series classifier with hidden Markov models. Jović et al9 examine the capability of four common decision tree ensembles on a biomedical time series dataset. Eads et al10 employ a support vector machine (SVM) for time series classification with features extracted from the time series data. Cui et al11 demonstrate convolutional neural networks for the time series classification problem to incorporate feature extraction and classification in a single framework. These algorithms have been employed as a single classifier or as a combination of multiple methods, sometimes called an ensemble, to improve the performance of classification.12 Although the ensemble-based classifier is known as a prominent algorithm for time series classification tasks,13 it requires much computation for training, which may not be suitable for a large dataset. Meanwhile, Tan et al14 describe the nearest neighbor classifier based on the Euclidean distance as a fast and promising classification algorithm for big datasets.
Recently, MSTS data has gained great attention, and many researchers have proposed new methods to solve the
multistream-based problem. Orsenigo and Vercellis15 describe a classification method based on a temporal extension
of discrete SVMs with the notions of warping distance and softened variable margin in the set of multivariate input
sequences. Weng and Shen16 implement a new approach for MSTS classification. The eigenvectors of row-row and
column-column covariance matrices of MSTS samples are calculated to extract features and a 1-NN classifier is used for
the classification. The authors show that distance-based methods with 1-NNs are an effective way to classify MSTS. Other
algorithms have also been used to deal with MSTS. Zhang et al17 address the challenges of MSTS data by presenting a
real-time multiple profiles sensor-based process monitoring system.
Feature extraction is considered as one of the popular techniques for MSTS classification. Rodríguez and Alonso18 use
the boosting algorithm to generate new features and a SVM is applied with these metafeatures. Kadous and Sammut19
seek to generate classifiers that are comprehensible and accurate with metafeatures. The authors describe applications
of the sign language recognition and the electrocardiogram signal classification. Li et al20 suggest feature vector selection
approaches for MSTS classification using singular value decomposition.
Profile monitoring with the principal component analysis (PCA) method is another way to manage MSTS. Kim et al21 develop a method to detect profile changes of multistream tonnage signals for forging process monitoring and to classify fault patterns, while Chang and Yadama22 propose a statistical process control framework to monitor nonlinear profiles and identify mean shifts in a profile with discrete wavelet transformation and B-splines. Paynabar et al23 suggest a multiway extension of the PCA technique to classify multistream profile data. Grasso et al24 suggest multiway PCA to deal with the reduction of data dimensionality and the fusion of all the sensor outputs. Their article carries out two main multiway extensions of traditional PCA to handle MSTS.
Deep learning has provided prominent results for this application with the popularity of the neural networks. Zheng
et al25 propose a deep learning framework for MSTS classification using features extracted by a 1-NN with dynamic time
warping (DTW). Karim et al4 utilize the long short-term memory fully convolutional network (LSTM-FCN) and attention
LSTM-FCN for MSTS classification. Wang et al5 utilize a recurrent neural network and adaptive differential evolution
algorithm for the same task. Despite its popularity, deep learning requires a large volume of data, and it is not suitable for our problem due to a lack of labeled data.
An imbalanced classification problem, in which the distribution of class labels is severely skewed, needs to be handled carefully because learning algorithms perform poorly when one class is underrepresented. This is because most algorithms assume that distributions of the dataset are balanced.26 The
sampling methods, which consist of oversampling and undersampling techniques, are commonly used to improve classifier accuracy by providing a balanced distribution.27 The cost-sensitive method is an alternative for the imbalanced learning problem that uses different cost matrices describing the cost of misclassifying data instances.28 However, the failures in the paper machine occur so rarely that traditional techniques have difficulty in training models effectively. Active learning is one of the most prominent methods applied to extremely imbalanced data. To deal with highly imbalanced classes, Attenberg et al29 propose guided learning, an alternative technique in which the agent asks humans to find training examples representing the different classes. Kazerouni et al30 suggest an active learning algorithm to learn a binary classifier on a highly imbalanced dataset where most data have negative labels and only a very small number are positive. Hybrid active learning is presented to leverage an explore-exploit trade-off to improve on margin sampling. Moreover, this active learning technique is combined with state-of-the-art deep learning techniques to improve performance. Fang et al31 reformulate active learning as a reinforcement learning problem in which the policy plays the role of the active learning heuristic. An agent in the environment tries to find the data to be labeled in a validation set based on the deep Q-network. Haussmann et al,32 however, choose a deep Bayesian neural network for both the base predictor and the policy network to effectively incorporate the input distribution.

3 MATERIALS AND METHODS

3.1 Dataset description

The dataset was provided by the Institute of Industrial and Systems Engineers (IISE) 2019 data competition and records real sensor observations from a paper manufacturing process.2 Many different types of data are collected over a period of time using a variety of sensors located on the machines. Some sensors measure raw materials (eg, amount of pulp fiber, chemicals, and so on) and the others represent process variables (eg, blade type, couch vacuum, rotor speed, and so on). Overall, 61 different sensor signals are collected, and 1 month of monitoring data is recorded every 2 minutes for a paper manufacturing machine, which results in a dataset of 61 streaming signals at 18 398 time points. In addition, for each time point, the system condition (ie, normal or break) is recorded in a binary response variable. Despite such a large number of measurements, failures occur at only 124 time points (0.67% of total observations) during operation, and this rarity makes it hard to predict a failure before it occurs. Table 1 summarizes the dataset. A data-driven approach is used for this problem instead of incorporating physical models, since no details about the sensors and no domain knowledge were provided.
Predicting failures for a pulp-and-paper mill is critical because a break has a significant impact on the entire process. Even though paper breaks rarely take place during operation, a single failure causes a significant loss of time and labor for identifying the cause of the failure and replacing any broken parts. Once the machine fails, the entire process has to be halted until the problem is found and fixed. This maintenance procedure takes more than an hour and incurs a substantial cost. This indicates that even a small reduction in failures through early detection could yield significant cost savings for the industry.

3.2 Procedure

The overall procedure of the proposed algorithms in this article is presented in Figure 1 consisting of preprocessing,
class label of the local nearest neighbor (CL-LNN) and relative distance of the local nearest neighbor (RD-LNN) with
corresponding machine learning techniques. The original MSTS dataset is preprocessed before carrying out two types of
feature extraction methods and these features are fed into a decision tree or SVM based on the extracted data types for
early failure detection. More detailed information is described in the following sections.

3.3 Data preprocessing

TABLE 1 Dataset description

| Element | Value | Remark |
| --- | --- | --- |
| Number of variables: continuous variables | 59 | s1 ∼ s27, s29 ∼ s60 |
| Number of variables: categorical variables | 2 | s28 (8 categories), s61 (2 categories) |
| Number of measurements: normal | 18 274 | Recorded every 2 minutes |
| Number of measurements: abnormal (failure) | 124 | |

FIGURE 1 Flow chart of the proposed method for early failure detection. The dataset is split into training and test datasets, and each passes through standardization, second derivative, moving class label, and time window processing. In the CL-LNN branch, the 1-NN (with LOO-CV on the training data) yields a binary feature matrix that is fed into a decision tree. In the RD-LNN branch, the median and standard deviation of the distances yield the relative distance, a numerical feature matrix that is fed into a support vector machine. Each trained model takes the test data as input for early failure detection.

The MSTS data obtained from the paper manufacturing machinery is given as

$$
\mathrm{MSTS} = (\mathbf{s}_1 \ \mathbf{s}_2 \ \cdots \ \mathbf{s}_p \ \mathbf{c}) =
\begin{pmatrix}
s_{1,1} & s_{1,2} & \cdots & s_{1,p} & c_1 \\
s_{2,1} & s_{2,2} & \cdots & s_{2,p} & c_2 \\
\vdots & \vdots & \ddots & \vdots & \vdots \\
s_{T,1} & s_{T,2} & \cdots & s_{T,p} & c_T
\end{pmatrix},
\tag{1}
$$

where st,j, t = 1, … , T, j = 1, … , p, are sensor signals measured at time point t from the jth sensor, T = 18 398 is the number of measurement time points (18 274 normal and 124 failure observations, see Table 1), p = 61 is the number of variables from the different sensors, and the ct are records of the system condition at each time point of measurement (ie, ct = 0 for normal, and ct = 1 for break). This sensor information
is preprocessed to implement the classification algorithms. First, the entire dataset is split into training and test datasets before standardization is conducted for each variable, since the test dataset should remain unknown during modeling. For the experiments in Section 4, we use 90% of the whole dataset for training and 10% for testing. The training dataset is standardized first, and then the mean and SD of the training dataset are applied to standardize the test dataset. To standardize, each measurement is scaled by subtracting the corresponding mean and then dividing by the SD so that the mean becomes 0 and the SD becomes 1, as follows.

$$
s_{t,j} \leftarrow \frac{s_{t,j} - \mathrm{mean}(\mathbf{s}_j)}{\mathrm{std}(\mathbf{s}_j)}, \quad j = 1, \ldots, p,
\tag{2}
$$

where the notation ← indicates that the variable on the left-hand side is replaced with the new value on the right-hand side, and mean(sj) and std(sj) are the mean and SD, respectively, of the original measurement data from the jth sensor. Standardization scales the data to mean 0 and SD 1, which usually improves the performance of the algorithm.
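As a minimal sketch of Equation (2), the following Python code standardizes each sensor with statistics taken from the training data only and reuses them for the test data. The function names, the toy values, and the choice of the population SD (`pstdev`) are assumptions made for illustration; the article does not state which SD estimator is used.

```python
from statistics import mean, pstdev


def fit_standardizer(train_columns):
    """Return per-sensor (mean, SD) computed from the training data only."""
    return [(mean(col), pstdev(col)) for col in train_columns]


def standardize(columns, stats):
    """Apply Equation (2) to each sensor column using the given statistics.

    A constant sensor (SD of 0) would need separate handling; it is not
    covered in this sketch.
    """
    return [[(value - mu) / sd for value in col]
            for col, (mu, sd) in zip(columns, stats)]


# Toy data: one list per sensor (hypothetical values).
train = [[1.0, 2.0, 3.0], [10.0, 20.0, 30.0]]
test = [[4.0], [40.0]]

stats = fit_standardizer(train)
train_scaled = standardize(train, stats)
test_scaled = standardize(test, stats)  # test reuses the training mean and SD
```

Fitting the statistics on the training split alone keeps information from the test data out of the model, which is the reason the article gives for splitting before standardizing.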
The derivative is then applied to sense sudden changes in the sensor signals. The derivative of a time series is taken here as the absolute difference between neighboring points of a single sensor stream. That is,

$$
s'_{t,j} = |s_{t,j} - s_{t-1,j}|, \qquad s''_{t,j} = |s'_{t,j} - s'_{t-1,j}|, \qquad j = 1, \ldots, p,
\tag{3}
$$

where s′t,j and s′′t,j represent the first and second derivatives of st,j, respectively. The first derivative is related to a gradual change in the time series, which may not be sensitive to a sudden machine breakdown, while the second derivative is more useful for detecting sharp changes that appear in the streaming signals. For the rest of this article, we use the second derivative to capture precursors of immediate failure.

FIGURE 2 Early detection by moving column c by 1 row
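The absolute-difference derivatives of Equation (3) can be sketched in a few lines of Python. The function names and the toy signal are hypothetical. The sketch also assumes that the first points, which have no predecessor, are simply dropped, so the output is two points shorter than the input; the article does not say how these boundary points are treated.

```python
def abs_diff(series):
    """Absolute difference between neighboring points: |s_t - s_{t-1}|."""
    return [abs(b - a) for a, b in zip(series, series[1:])]


def second_derivative(series):
    """Apply the absolute difference twice, as in Equation (3)."""
    return abs_diff(abs_diff(series))


signal = [0.0, 0.0, 0.0, 5.0, 0.0]   # flat signal with one sharp spike
first = abs_diff(signal)             # [0.0, 0.0, 5.0, 5.0]
second = second_derivative(signal)   # [0.0, 5.0, 0.0]
```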
In this project, we aim to detect a failure before it occurs. One simple way to achieve this goal is to use the class label k time points ahead as the class label of the current instance, so that classifiers learn to predict ct+k, the system condition k time units ahead.2

$$
c_t \leftarrow c_{t+k}, \quad k = 1, 2, \ldots
\tag{4}
$$

We set k = 1, which implies that we build a model to detect a failure 2 minutes earlier than its occurrence. Figure 2 depicts this process.
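The label shift of Equation (4) can be illustrated with a short Python sketch. The helper names are hypothetical, and dropping the last k rows (which have no future label) is an assumption; the article does not describe how the trailing rows are handled.

```python
def shift_labels(labels, k=1):
    """Replace c_t with c_{t+k}; the last k time points have no future label."""
    if k < 1:
        raise ValueError("k must be a positive integer")
    return labels[k:]


def align_rows(rows, k=1):
    """Drop the last k rows so rows and shifted labels have equal length."""
    return rows[:len(rows) - k]


labels = [0, 0, 0, 1, 0]                 # break recorded at the 4th time point
rows = ["t1", "t2", "t3", "t4", "t5"]    # stand-ins for sensor rows
shifted = shift_labels(labels, k=1)      # [0, 0, 1, 0]
kept = align_rows(rows, k=1)             # ["t1", "t2", "t3", "t4"]
```

After the shift, the row at `t3` carries the break label, so a classifier trained on these pairs learns to flag the observation one step (2 minutes) before the recorded break.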
In classification problems with streaming data, temporal sequence data can normally secure more information than a data point sampled at a single time step.33 Accordingly, we extract small fragments of sequences through what we call time window processing. For a given window size m, a window instance consists of the last m sensor measurements up to time t, which correspond to the rows of the MSTS data given in Equation (1) with time indices t − m + 1, … , t. The class label of the window instance is given as ct so that it represents the system condition at the last time point of the window. These window instances provide the features to be used in a machine learning algorithm. In addition, we address the severe imbalance of the class labels in the original MSTS data while constructing the window instances by balancing the two labels to some extent. That is, for time window processing we select all the time points t where ct = 1 and only a random subset of the time points t where ct = 0, so that the difference between the numbers of instances of the two classes is not too large. The constructed window instances and their class labels are given in the following form.

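$$
(\mathbf{W} \ \mathbf{y}) =
\begin{pmatrix}
\mathbf{w}'_{1,1} & \mathbf{w}'_{1,2} & \cdots & \mathbf{w}'_{1,p} & y_1 \\
\mathbf{w}'_{2,1} & \mathbf{w}'_{2,2} & \cdots & \mathbf{w}'_{2,p} & y_2 \\
\vdots & \vdots & \ddots & \vdots & \vdots \\
\mathbf{w}'_{n,1} & \mathbf{w}'_{n,2} & \cdots & \mathbf{w}'_{n,p} & y_n
\end{pmatrix},
\tag{5}
$$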

where each row represents one window instance. That is, wi,j is the sequence of length m of the second derivatives of the jth sensor signal, and yi is the class label of the ith window instance. Note that the row index i = 1, … , n merely distinguishes the window instances and does not necessarily correspond to a time point. The time window processed training dataset (Wtrain ytrain) and test dataset (Wtest) are used as the input of Algorithms 1 and 2.
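A minimal sketch of time window processing with partial class balancing is given here in Python. The function name, the `normal_per_break` ratio, and the toy signals are assumptions for illustration; the article only states that all break time points are kept and that normal time points are randomly subsampled.

```python
import random


def make_windows(series_by_sensor, labels, m, normal_per_break=2, seed=0):
    """Build window instances ending at selected time points.

    series_by_sensor: one list of values per sensor, all of equal length.
    labels: class label per time point (1 = break, 0 = normal).
    normal_per_break: hypothetical number of normal windows kept per break.
    """
    rng = random.Random(seed)
    candidates = range(m - 1, len(labels))   # a window needs m past points
    breaks = [t for t in candidates if labels[t] == 1]
    normals = [t for t in candidates if labels[t] == 0]
    n_normal = min(len(normals), normal_per_break * len(breaks))
    chosen = sorted(breaks + rng.sample(normals, n_normal))
    # Each instance holds one length-m sequence per sensor (indices t-m+1..t).
    windows = [[s[t - m + 1:t + 1] for s in series_by_sensor] for t in chosen]
    y = [labels[t] for t in chosen]
    return windows, y


sensor_a = [float(v) for v in range(20)]
sensor_b = [float(v % 3) for v in range(20)]
labels = [0] * 20
labels[7] = labels[15] = 1
W, y = make_windows([sensor_a, sensor_b], labels, m=3)
```

Each element of `W` corresponds to one row of Equation (5): a list of p sequences of length m, with the label of the last time point in the window stored in `y`.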

3.4 1-NN for time series classification

In the field of data mining and machine learning, one of the most frequently studied problems is classification.34 The classification process evaluates the similarities among instances in a dataset to assign them to designated classes. One of the differences between time series classification problems and traditional classification problems is that the attributes are arranged in order and input features may be correlated. The 1-NN is a popular classifier for time series classification, as its performance can compete with the most complex classifiers.6 When a new time series instance is observed, the 1-NN classifier looks for the instance in the training dataset that has the shortest distance to the new instance and predicts the class of the new instance as the class label of that closest instance. A distance measure such as the Euclidean distance is used to compare two time series instances. For one-dimensional time series data, the Euclidean distance between two time series instances wi and wk is measured by

$$
D_{ED}(\mathbf{w}_i, \mathbf{w}_k) = \sqrt{\sum_{t=1}^{m} (w_{it} - w_{kt})^2},
\tag{6}
$$

where wi and wk are the window instances to be compared with each other, each consisting of measurements t = 1, … , m.
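The 1-NN rule with the Euclidean distance of Equation (6) can be sketched as follows. The function names and the toy windows are hypothetical and serve only to illustrate the rule for a single sensor stream.

```python
import math


def euclidean(w_i, w_k):
    """Euclidean distance of Equation (6) between two equal-length windows."""
    if len(w_i) != len(w_k):
        raise ValueError("windows must have the same length")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(w_i, w_k)))


def predict_1nn(query, train_windows, train_labels):
    """Return the label of the training window closest to the query."""
    best = min(range(len(train_windows)),
               key=lambda k: euclidean(query, train_windows[k]))
    return train_labels[best]


train_windows = [[0.0, 0.1, 0.0], [0.0, 2.0, 4.0], [0.1, 0.0, 0.1]]
train_labels = [0, 1, 0]
label = predict_1nn([0.0, 1.8, 3.9], train_windows, train_labels)
```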
The other renowned distance measure for time series data is DTW, which is a method to find the optimal alignment between two time-dependent sequences. It has been widely used in the field of pattern recognition and broadly tested on benchmark time series data. DTW was originally designed to compare different speech patterns for automatic speech recognition, to solve the problem of distortions along the time axis.35 It stretches and realigns one time series to better match the other.36 To find the DTW distance, the matrix M is built, where the (t, t′)th element of M is d(wit, wkt′) = (wit − wkt′)². Then a warping path is defined as a monotonically increasing sequence of indices p = {(0, 0), … , (t, t′), … , (m, m)}. The DTW distance is given by the warping path that has the minimum cumulative distance between the two sequences.

$$
D_{DTW}(\mathbf{w}_i, \mathbf{w}_k) = \min_{p} \sqrt{\sum_{h=1}^{H} M_h},
\tag{7}
$$

where H is the length of the warping path and Mh is the matrix element corresponding to the hth element of the warping path p.37 Figure 3 depicts how Euclidean matching and DTW matching compare similarities between two time series instances. In brief, the Euclidean distance compares the two sequences point by point at the same time index regardless of their shapes, while DTW measures the distance by taking into account the shapes of the two sequences. However, because of the computational complexity of DTW, it may not be suitable for real-time sensor streaming data, in which the nearest neighbor instance must be found quickly.
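To make the contrast between the two measures concrete, here is a minimal dynamic-programming sketch of the DTW distance in Equation (7), using the squared pointwise cost defined for M. It is a textbook formulation without a warping-window constraint and is not taken from the article's implementation; the function name and the toy sequences are assumptions.

```python
import math


def dtw_distance(w_i, w_k):
    """DTW distance with squared pointwise cost, as in Equation (7)."""
    n, m = len(w_i), len(w_k)
    inf = float("inf")
    # cost[a][b]: minimum cumulative cost aligning the first a and b points
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for a in range(1, n + 1):
        for b in range(1, m + 1):
            d = (w_i[a - 1] - w_k[b - 1]) ** 2
            cost[a][b] = d + min(cost[a - 1][b],      # reuse a point of w_k
                                 cost[a][b - 1],      # reuse a point of w_i
                                 cost[a - 1][b - 1])  # advance both
    return math.sqrt(cost[n][m])


a = [0.0, 0.0, 1.0, 2.0, 1.0, 0.0]
b = [0.0, 1.0, 2.0, 1.0, 0.0, 0.0]   # same shape, shifted by one step
dist_dtw = dtw_distance(a, b)
dist_ed = math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
```

For two windows of length m, this table has on the order of m² cells, whereas the Euclidean distance needs a single pass of m terms, which illustrates why DTW is costlier when many nearest neighbor searches are required.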

FIGURE 3 1-NN comparison between Euclidean and DTW matching. 1-NN, 1-nearest neighbor; DTW, dynamic time warping

To select between the Euclidean distance and the DTW distance, a separate experiment is conducted to compare their performance, the result of which is shown in Section 4. For our proposed algorithms, the Euclidean distance is used to measure the distance between two time series instances, as the experiment shows that the Euclidean distance requires much less time than DTW without a significant difference in performance between the two methods.

3.5 Nearest neighbor-based feature extraction

One possible way to extend the 1-NN for single-stream time series data to the case of multistream signals would be to use the sum of the Euclidean distances measured by Equation (6) for each variable as the similarity between two multistream window instances. In this case, however, the information of all variables is aggregated, which results in the loss of each variable’s information and of the relationships between variables. Instead, we look for the nearest neighbor considering each variable separately, which we call the local nearest neighbor (ie, the nearest neighbor in an embedded space of a single stream), and extract scalar features from it. These features are fed into different classification algorithms depending on the types of extracted features.

Algorithm 1. CL-LNN feature extraction


Input: Multistream window instances Wtrain for training and Wtest for testing, class labels of training instances ytrain, index set of training data Train, index set of test data Test
Output: Binary feature matrices Xtrain with elements xij, i ∈ Train, and Xtest with elements xij, i ∈ Test

for i ∈ Train do
    for j ∈ {1, … , p} do
        d∗ ← L    (L is a large number used for initialization)
        for k ∈ Train ⧵ {i} do    (all instances in training data except itself, LOO-CV)
            d ← DED(wij, wkj)
            if d ≤ d∗ then
                k∗ ← k, d∗ ← d
            end if
        end for
        xij ← yk∗    (store class label of the local nearest neighbor as a feature)
    end for
end for

for i ∈ Test do
    for j ∈ {1, … , p} do
        d∗ ← L    (L is a large number used for initialization)
        for k ∈ Train do    (all instances in training data)
            d ← DED(wij, wkj)
            if d ≤ d∗ then
                k∗ ← k, d∗ ← d
            end if
        end for
        xij ← yk∗    (store class label of the local nearest neighbor as a feature)
    end for
end for

The first feature we propose is the CL-LNN, which is given as 0 or 1 for each variable. Algorithm 1 outlines the procedure of the CL-LNN feature extraction, in which the MSTS data is converted into the binary feature matrices Xtrain and Xtest. The local nearest neighbor is found by leave-one-out cross-validation (LOO-CV) for each variable on the training dataset Wtrain. That is, for an instance of the training dataset, LOO-CV searches all the other instances in the training dataset except the instance itself and chooses the one closest to it, which is simple but effective for 1-NN.38 On the other hand, for an instance of the test dataset, the algorithm simply searches for the nearest neighbor in the training dataset. The nearest neighbor is found by computing the Euclidean distance based on the time window for each variable as in Equation (6). Note that, for a given instance, the CL-LNN features for different variables may vary because the nearest neighbor for each variable could be different. These binary features are fed into the decision tree classifier for model training and prediction, which will be described in more detail in Section 3.6. The features keep the original information of each sensor signal by considering each variable separately, and correlations between different variables are expected to be handled by the decision tree algorithm.

FIGURE 4 Three different cases based on nearest neighbor-based feature extraction
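The CL-LNN extraction of Algorithm 1 can be sketched in Python as follows. This is an illustrative reading of the algorithm rather than the authors' code: the function names, the `leave_one_out` flag, and the toy data are assumptions, and ties are resolved in favor of the first closest instance, whereas the `d ≤ d∗` test in Algorithm 1 keeps the last one.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length windows (Equation 6)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def cl_lnn(queries, train, y_train, leave_one_out=False):
    """Class label of the local nearest neighbor, one binary feature per sensor.

    queries, train: lists of window instances; instance[j] is the window
    (list of floats) of sensor j. With leave_one_out=True the queries are the
    training instances themselves and instance i is excluded from its search.
    """
    p = len(train[0])
    features = []
    for i, q in enumerate(queries):
        row = []
        for j in range(p):
            candidates = [k for k in range(len(train))
                          if not (leave_one_out and k == i)]
            k_star = min(candidates,
                         key=lambda k: euclidean(q[j], train[k][j]))
            row.append(y_train[k_star])
        features.append(row)
    return features


# Toy data with two sensors and four training instances (hypothetical values).
W_train = [
    [[0.0, 0.0], [1.0, 1.0]],
    [[0.1, 0.0], [5.0, 5.0]],
    [[3.0, 3.0], [1.1, 1.0]],
    [[3.1, 3.0], [5.1, 5.0]],
]
y_train = [0, 0, 1, 1]
W_test = [[[2.9, 3.0], [1.0, 1.05]]]

X_train = cl_lnn(W_train, W_train, y_train, leave_one_out=True)
X_test = cl_lnn(W_test, W_train, y_train)
```

In this toy example the two sensors disagree for every instance, which shows how the local nearest neighbor of each variable can differ and why the features are left for a downstream classifier to combine.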
Another feature we propose in this article is called the RD-LNN. While the CL-LNN can be thought of as a hard-classification feature whose outcome is given as exactly 0 or 1, the RD-LNN provides soft-classification features, which can be seen as probability-like features. Although the binary feature extracted by the CL-LNN tells us the class of the instance closest to the one under consideration, it cannot measure the significance or strength of the extracted feature. Let us consider the three cases of classifying instances with nearest neighbor-based feature extraction shown in Figure 4. In the first case, there is a clear decision boundary that makes it easy to separate two distinct groups, and the CL-LNN might show superior performance. However, outliers in the second case make it more challenging to classify the target instance. Suppose we know that the CL-LNN for a given instance is, say, 1 (a break signal). To build a robust prediction model, we may also want to know how reliable and accurate this signal is. In the second case, even though the nearest neighbor is a break signal, it is an outlier with respect to the majority of the other break signals, so relying solely on the class label of the nearest neighbor may be risky. To compensate for this weakness of binary features, we may consider measuring the distances from the nearest neighbor to the other instances. If the distance is large, the nearest neighbor is thought to be located far from the majority of its class and does not provide reliable information, whereas if the distance is small, the nearest neighbor is thought to represent its class and the information it provides is more accurate. In lieu of a direct distance measure, we use a probability measure, which is similar to the computation of a P-value in statistical hypothesis testing. Rare events (ie, breaks) in our dataset, however, appear to be hard to distinguish from normal instances, so the two groups cannot be clearly separated, as in the third case. In this situation, we found that it is more effective to measure the relative distance for each group separately, instead of applying the same nearest neighbor to different groups. Specifically, given an instance whose class label has to be predicted, the Euclidean distances to all the other instances in the training dataset are computed. For each class label (y = 0, y = 1), the nearest neighbor is found. Let d∗0 and d∗1 be the distances to the nearest neighbors with class labels 0 and 1, respectively. We can also find an approximating normal distribution of the distances for each class. Let X0 and X1 be random variables with these approximated normal distributions. The RD-LNN features are computed by P(X0 ≤ d∗0) and P(X1 ≤ d∗1) for each class, which can be interpreted as the probability that an observation is located farther than the nearest neighbor from the center of each class. That is,
\[ P_i = P(X_i \le d_i^*) = \Phi\left( \frac{d_i^* - \mu_i}{s_i} \right), \quad i = 0, 1, \tag{8} \]

where Φ is the cumulative distribution function of the standard normal random variable, and μ_i and s_i are the median and
SD of the distances between the target instance and all the training instances with label i. The smaller the RD-LNN value is, the
less reliable the corresponding label is considered to be.
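As a purely illustrative calculation of Equation (8), with hypothetical numbers that are not taken from the dataset, suppose the nearest break instance lies at distance d_1^* = 2.0, while the distances to the break instances have median μ_1 = 5.0 and SD s_1 = 1.5:

```latex
\[
P_1 = \Phi\left( \frac{2.0 - 5.0}{1.5} \right) = \Phi(-2) \approx 0.023,
\qquad
\text{whereas } d_1^* = 4.7 \text{ gives } P_1 = \Phi(-0.2) \approx 0.42 .
\]
```

In this sketch, a value near 0 indicates that the nearest neighbor is much closer to the target than the bulk of the instances of its class, and a value near 0.5 indicates that it sits at a typical distance.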

Algorithm 2. RD-LNN feature extraction

Input: Multistream window instances Wtrain for training and Wtest for testing, class labels of training instances ytrain, index set of training data Train, index set of test data Test
Output: Numeric feature matrices Xtrain with elements x_ij0 and x_ij1, i ∈ Train, and Xtest with elements x_ij0 and x_ij1, i ∈ Test
for i ∈ Train do
    for j ∈ {1, …, p} do
        d0 = d1 = NULL                        ▷ initialize arrays to store distance values
        for k ∈ Train ⧵ {i} do                ▷ all instances in training data except itself (LOO-CV)
            d = D_ED(x_ij, x_kj)
            if y_k = 0 then append d to d0 end if
            if y_k ≠ 0 then append d to d1 end if
        end for
        ▷ extract features from distances with label 0
        d_0^* ← min(d0)                       ▷ distance to the nearest neighbor with label 0
        q0 ← interquartile(d0)
        μ_0 ← median(q0), s_0 ← stdev(q0)
        x_ij0 ← P(X ≤ d_0^*), where X ∼ N(μ_0, s_0^2)
        ▷ extract features from distances with label 1
        d_1^* ← min(d1)                       ▷ distance to the nearest neighbor with label 1
        q1 ← interquartile(d1)
        μ_1 ← median(q1), s_1 ← stdev(q1)
        x_ij1 ← P(X ≤ d_1^*), where X ∼ N(μ_1, s_1^2)
    end for
end for
for i ∈ Test do
    for j ∈ {1, …, p} do
        d0 = d1 = NULL                        ▷ initialize arrays to store distance values
        for k ∈ Train do                      ▷ all instances in training data
            d = D_ED(x_ij, x_kj)
            if y_k = 0 then append d to d0 end if
            if y_k ≠ 0 then append d to d1 end if
        end for
        ▷ extract features from distances with label 0
        d_0^* ← min(d0)                       ▷ distance to the nearest neighbor with label 0
        q0 ← interquartile(d0)
        μ_0 ← median(q0), s_0 ← stdev(q0)
        x_ij0 ← P(X ≤ d_0^*), where X ∼ N(μ_0, s_0^2)
        ▷ extract features from distances with label 1
        d_1^* ← min(d1)                       ▷ distance to the nearest neighbor with label 1
        q1 ← interquartile(d1)
        μ_1 ← median(q1), s_1 ← stdev(q1)
        x_ij1 ← P(X ≤ d_1^*), where X ∼ N(μ_1, s_1^2)
    end for
end for

Algorithm 2 describes the procedure that generates numerical features by measuring, for each class, the probability
that represents the relative position of the nearest neighbor compared with the other instances with
the same class label. We found that this feature extraction technique improves classification performance
when the features extracted by RD-LNN are fed into an SVM. Note that two features are extracted from each variable,
one for each class label (y = 0, y = 1). A normal distribution is fitted to the distance data, with its mean
and SD set to the sample median and the SD of the distances within the interquartile range (ie, data ranging from
the first quartile Q1 to the third quartile Q3) to minimize the effect of outliers.
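The per-variable computation described above can be sketched in Python using only the standard library. This is an illustration under stated assumptions, not the original implementation: the function names are hypothetical, windows are plain lists of numbers, quartiles are computed with the "inclusive" method (the article does not say which quartile definition was used), and each class must contribute enough distinct distances for the SD to be positive.

```python
import math
from statistics import NormalDist, median, quantiles, stdev


def euclidean(a, b):
    """Euclidean distance between two equally long windows."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def rd_lnn_feature(distances):
    """P(X <= d*) for one class, with X ~ N(median, SD) of the interquartile distances."""
    d_star = min(distances)  # distance to the nearest neighbor of this class
    q1, _, q3 = quantiles(distances, n=4, method="inclusive")  # assumed quartile rule
    core = [d for d in distances if q1 <= d <= q3]  # distances within [Q1, Q3]
    mu, s = median(core), stdev(core)  # needs at least two values and s > 0
    return NormalDist(mu, s).cdf(d_star)


def rd_lnn(target, train_windows, train_labels):
    """Return the two RD-LNN features (label 0, label 1) for one variable."""
    d0 = [euclidean(target, w) for w, y in zip(train_windows, train_labels) if y == 0]
    d1 = [euclidean(target, w) for w, y in zip(train_windows, train_labels) if y != 0]
    return rd_lnn_feature(d0), rd_lnn_feature(d1)
```

For training instances, the leave-one-out step of Algorithm 2 would correspond to removing the target window from `train_windows` before the call.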

3.6 Training model

Different machine learning techniques are used for CL-LNN and RD-LNN, chosen according to the type of features each
algorithm produces, to train the model and predict failures 2 minutes ahead. First, the C5.0 decision tree algorithm,
an improved version of its predecessor C4.5, is applied to the binary CL-LNN features to classify between normal and
abnormal conditions. To improve model performance, we implemented adaptive boosting, in which
many trees are built and vote for the best class. We set the number of boosting iterations to 10. A cost matrix is also
employed to penalize the two types of errors differently: a cost of 1 is assigned to a false
positive and 5 to a false negative, since failing to detect a break can be the more expensive mistake. Second, for RD-LNN, the
numerical feature matrix (Xtrain) extracted from the training dataset (Wtrain) is fed into an SVM to generate the model, and the other
matrix (Xtest), extracted from the test dataset (Wtest), is used to evaluate the performance of that model. To
train the SVM model, a radial kernel is chosen as the kernel function, which transforms the input data into the form required for training
and predicting. The cost parameter is set to 1 to trade off the correct classification of training examples
against maximization of the decision function's margin. The gamma parameter, which defines how far
the influence of a single training example reaches, is set to 0.5. These parameters were selected heuristically through experiments on
our dataset.
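To make the two settings above concrete, the sketch below shows the usual Gaussian form of a radial kernel, exp(−γ‖x − z‖²), with γ = 0.5, and the total misclassification cost under a 1:5 cost matrix. Both function names are hypothetical, and it is an assumption that the SVM software used in the experiments parameterizes the radial kernel in this way; the article does not name the software.

```python
import math


def rbf_kernel(x, z, gamma=0.5):
    """Gaussian (radial) kernel exp(-gamma * ||x - z||^2); gamma = 0.5 as in the text."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * squared_distance)


def misclassification_cost(false_positives, false_negatives, cost_fp=1, cost_fn=5):
    """Total cost when a false negative is penalized five times as much as a false positive."""
    return cost_fp * false_positives + cost_fn * false_negatives
```

A larger gamma makes the kernel value fall off faster with distance, so each training example influences a smaller neighborhood.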

4 RESULTS OF EXPERIMENT

4.1 Performance analysis

We compare our methods with four other approaches, which include a type of artificial neural network and
general machine learning models without the feature extraction technique proposed in this article. The first method
is an Autoencoder, comprising an encoder and a decoder, for extremely rare event classification (footnote 1). The encoder
learns the features of the input data, normally in a reduced dimension, while the decoder regenerates the original data
from the encoder output. This method uses a dense-layer Autoencoder that selects instances at random without
considering the correlation among instances. The second approach improves on the first by constructing
an LSTM (long short-term memory) Autoencoder that takes temporal features into account (footnote 2). Both methods also attempt to
detect failures 2 minutes ahead with the same dataset we use in this article. In addition, we compare against the classifiers without
the feature extraction technique (ie, decision tree without CL-LNN, SVM without RD-LNN) in order to show the benefit of
the proposed algorithms.
Table 2 shows the prediction results in the form of confusion matrices to compare the performance of the six methods.
From these results alone, all six methods look comparable, and it is hard to tell which method performs
better. The table also shows the trade-off between true positives/negatives and false positives/negatives. RD-LNN,
however, shows the lowest number of false positives among the six methods.
Table 3 provides other metrics to compare the performance of the six methods. In the table, four metrics are
used to evaluate the performance of the classification algorithms. Precision (also known as the positive predictive
value) is defined as the proportion of true positives among all instances predicted as positive. Recall (also known as sensitivity or
true positive rate) is the number of true positives divided by the number of true positives plus the number of false negatives.
In addition, the false positive rate (1 − specificity) is the proportion of actual negatives that are incorrectly predicted as positive, which corresponds to the probability of falsely rejecting the null hypothesis in a
test. Since, however, the distribution of class labels is highly skewed, another performance metric, F-measure, has been

Footnote 1: The implementation of the Autoencoder follows https://github.com/cran2367/autoencoder_classifier/blob/master/autoencoder_classifier.ipynb
Footnote 2: The implementation of the LSTM Autoencoder follows https://github.com/cran2367/lstm_autoencoder_classifier/blob/master/lstm_autoencoder_classifier.ipynb

TABLE 2 Confusion matrix

Method              TN (predicted 0, actual 0)   FN (predicted 0, actual 1)   FP (predicted 1, actual 0)   TP (predicted 1, actual 1)
Autoencoder         2726                         22                           173                          3
LSTM autoencoder    3355                         19                           272                          8
Decision tree       1636                         19                           179                          6
SVM                 1769                         24                           46                           1
CL-LNN              1514                         19                           282                          6
RD-LNN              1762                         22                           34                           3

Abbreviations: CL-LNN, class label of the local nearest neighbor; LSTM, long short-term memory; RD-LNN, relative
distance of the local nearest neighbor; SVM, support vector machine.

TABLE 3 Performance with four metrics

Item                  Autoencoder   LSTM autoencoder   Decision tree   SVM     CL-LNN   RD-LNN
Precision             0.017         0.029              0.032           0.021   0.021    0.081*
True positive rate    0.120         0.296*             0.240           0.040   0.240    0.120
False positive rate   0.060         0.075              0.099           0.024   0.157    0.019*
F-measure             0.030         0.052              0.057           0.028   0.038    0.097*

Note: An asterisk marks the highest performance for each measure (boldfaced in the original).
Abbreviations: CL-LNN, class label of the local nearest neighbor; LSTM, long short-term memory; RD-LNN, relative
distance of the local nearest neighbor; SVM, support vector machine.

used to measure the performance of a rare event classification problem. F-measure (also sometimes called the F1 score or
F-score) combines precision and recall using the harmonic mean, a type of average used for rates.
Based on the table, RD-LNN shows the best performance in precision, false positive rate, and F-measure among the
six methods, while the LSTM autoencoder performs best only in true positive rate. Note that RD-LNN shows outstanding
performance compared with the others in F-measure, which is well suited to represent performance on a highly
imbalanced dataset.
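The four metrics can be recomputed from a confusion matrix with a few lines of Python. The sketch below is illustrative (the function name is hypothetical, and zero denominators are not handled); it uses the RD-LNN counts from Table 2 and the F-measure formula given in the remark column of Table 4.

```python
def classification_metrics(tn, fn, fp, tp):
    """Precision, recall (TPR), false positive rate, and F-measure from confusion counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    false_positive_rate = fp / (fp + tn)
    f_measure = 2 * tp / (2 * tp + fp + fn)
    return precision, recall, false_positive_rate, f_measure


# RD-LNN counts from Table 2
metrics = classification_metrics(tn=1762, fn=22, fp=34, tp=3)
print([round(m, 3) for m in metrics])  # [0.081, 0.12, 0.019, 0.097]
```

The printed values agree with the RD-LNN column of Table 3.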
Another metric used to measure the performance is a receiver operating characteristic curve, or ROC curve,
which represents the diagnostic ability of a binary classifier. This tool is suitable to visualize and compare the per-
formance of our proposed algorithms. The true positive rate (TPR or sensitivity) is plotted in the ROC curve against
the false positive rate (FPR or 1 - specificity) at different threshold settings to exhibit how much a model is able to
distinguish classes.
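As a minimal sketch of how such a curve and its area can be obtained from classifier scores, the following Python code sweeps a threshold over the scores and integrates with the trapezoidal rule. The function names and the toy scores are hypothetical; labels are assumed to be 0 or 1 with at least one instance of each class.

```python
def roc_points(scores, labels):
    """(FPR, TPR) points obtained by lowering the decision threshold over the scores."""
    positives = sum(labels)
    negatives = len(labels) - positives
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # instances with tied scores enter together so that ties form one segment
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / negatives, tp / positives))
    return points


def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum((x2 - x1) * (y1 + y2) / 2 for (x1, y1), (x2, y2) in zip(points, points[1:]))


toy_points = roc_points([0.9, 0.8, 0.7, 0.6], [1, 0, 1, 0])
print(auc(toy_points))  # 0.75
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation of the two classes.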
In Figure 5, the ROC curves of the six methods are plotted to compare their performance using the area under the
ROC curve (AUC), which represents the degree of separability. The LSTM-Autoencoder, which considers temporal features,
shows better performance than the Autoencoder, and the AUC of RD-LNN, which considers
the relative distance to detect failures, is higher than that of CL-LNN. The decision tree and SVM without the feature extraction
we proposed also perform worse than RD-LNN. Overall, RD-LNN has the largest AUC,
0.724, and we reach the same conclusion that RD-LNN performs better than the other five
methods. Figure 6 summarizes the performance comparison based on F-measure and AUC. LSTM-Autoencoder and RD-LNN
appear to be better than the others in AUC, while RD-LNN is the only one that stands out
in F-measure.
An additional experiment was conducted separately to choose the distance measure between Euclidean distance
and DTW distance, considering that the 1-NN method requires computationally demanding calculation of the distances between the target data
point and all the points in the training set. In the experiment comparing the two distance measures (Table 4),
we found that DTW distance takes far longer to complete the same task (about 5.13 hours) than Euclidean distance
(about 4.6 minutes), while the two show almost the same performance. A likely reason why Euclidean distance performs as well
as DTW in this experiment is that DTW distance is particularly well suited to applications such as automatic
speech recognition, in which speaking speed varies over time, whereas the time-series data used here are
sampled at equal time intervals.
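The difference in running time is plausible from the structure of the two computations: Euclidean distance needs one pass over the m time points, while the standard dynamic-programming formulation of DTW fills an m × m table. The sketch below illustrates both for univariate sequences. It is not the implementation used in the experiment; the function names are hypothetical, and the squared pointwise cost with a final square root is one common convention that is assumed here.

```python
import math


def euclidean(a, b):
    """Euclidean distance: O(m) work for two windows of length m."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def dtw(a, b):
    """Unconstrained DTW distance by dynamic programming: O(len(a) * len(b)) work."""
    n, m = len(a), len(b)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = (a[i - 1] - b[j - 1]) ** 2
            # extend the cheapest of the three admissible warping moves
            cost[i][j] = step + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return math.sqrt(cost[n][m])


a = [0.0, 1.0, 2.0, 3.0]
b = [1.0, 2.0, 3.0, 3.0]  # same shape as a, shifted by one time step
print(dtw(a, b), euclidean(a, b))
```

For equally long sequences DTW never exceeds the Euclidean distance, because the diagonal alignment is one of the admissible warping paths; the two coincide when no warping helps.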
FIGURE 5 ROC curves of six different methods (true positive rate plotted against false positive rate). AUC by panel: Autoencoder, 0.694; LSTM-Autoencoder, 0.715; Decision Tree, 0.594; SVM, 0.378; CL-LNN, 0.601; RD-LNN, 0.724. ROC, receiver operating characteristic

4.2 Effects of window size and the number of normal instances

In this subsection, the key parameters that strongly influence the performance of RD-LNN are examined. First, the
window size m = 20 was determined experimentally by considering F-measure as well as running time, which is also
an important factor when the method is deployed in a real-life application.
Figure 7 shows how F-measure and running time (footnote 3) vary with the window size m. F-measure shows a downward
trend as the window size increases, while running time increases almost linearly because a larger window
size demands more computation to calculate the distance. The relationship between F-measure and window size indicates
that we need to find the window size that captures the patterns that precede failures. We substitute
zero for F-measure when the algorithm is not able to detect any true failure 2 minutes ahead. A window
size of 20 shows good performance with a moderate computing burden, and it is used as the window size of the proposed
algorithm in this article.

Footnote 3: Running time may vary with computer performance. The computer used in this experiment: Windows 10 Pro, Intel Core i7,
16 GB RAM, 64-bit.

FIGURE 6 Performance comparison with AUC and F-measure. AUC: Autoencoder, 0.694; LSTM-Autoencoder, 0.715; Decision Tree, 0.594; SVM, 0.378; CL-LNN, 0.601; RD-LNN, 0.724. F-measure: Autoencoder, 0.030; LSTM-Autoencoder, 0.052; Decision Tree, 0.057; SVM, 0.028; CL-LNN, 0.038; RD-LNN, 0.097. AUC, area under the ROC curve

TABLE 4 Comparison between Euclidean and DTW distance

Item                  Euclidean distance   DTW distance   Remark
TN (True negative)    1514                 1528
FN (False negative)   19                   19
TP (True positive)    6                    6              Detect failures
FP (False positive)   282                  268
F-measure             0.038                0.040          (2 × TP)/(2 × TP + FP + FN)
AUC                   0.601                0.618
Running time          4.6 minutes          5.13 hours

Note: The running times were boldfaced in the original to emphasize the difference between the two methods (the other
metrics are similar).
Abbreviations: AUC, area under the ROC curve; DTW, dynamic time warping.

FIGURE 7 The effect of the window size in RD-LNN for the training process (x-axis: window size, 20 to 100; y-axes: F measure and running time in minutes). RD-LNN, relative distance of the local nearest neighbor

Another parameter that needs to be determined carefully is the number of normal instances randomly selected for the training
dataset. We examined the effect of the number of normal instances on performance, as depicted in Figure 8.
Note that 99 failures are included in the training dataset, and the class distribution between failures and normal instances
needs to be balanced to handle the imbalanced dataset. F-measure increases when the number of normal
instances for training increases from 100 to 200 and then decreases significantly beyond 200, while running time keeps rising
with the number of normal instances. This indicates that randomly selecting 200 normal instances for the training
dataset provides better performance than the other settings.
FIGURE 8 The effect of the number of normal instances randomly selected in training data set (x-axis: number of normal instances in training dataset, 100 to 300; y-axes: F measure and running time in minutes)

TABLE 5 Root cause analysis through decision tree algorithm

Rank   Variable   Importance      Rank   Variable   Importance      Rank   Variable   Importance
1      s3         100%            21     s41        62%             41     s1         0%
2      s20        100%            22     s39        58%             42     s2         0%
3      s21        100%            23     s37        58%             43     s7         0%
4      s32        100%            24     s35        49%             44     s10        0%
5      s33        100%            25     s35        49%             45     s12        0%
6      s40        100%            26     s34        46%             46     s13        0%
7      s6         81%             27     s30        44%             47     s24        0%
8      s60        70%             28     s28        44%             48     s28        0%
9      s15        65%             29     s27        40%             49     s29        0%
10     s26        65%             30     s26        39%             50     s36        0%
11     s57        64%             31     s21        38%             51     s42        0%
12     s18        60%             32     s21        36%             52     s45        0%
13     s31        59%             33     s18        34%             53     s49        0%
14     s44        56%             34     s18        34%             54     s50        0%
15     s14        49%             35     s18        32%             55     s52        0%
16     s51        49%             36     s14        31%             56     s54        0%
17     s37        46%             37     s14        31%             57     s55        0%
18     s34        46%             38     s13        29%             58     s56        0%
19     s11        45%             39     s12        29%             59     s58        0%
20     s46        44%             40     s11        28%             60     s59        0%
                                                                    61     s61        0%

4.3 Root cause analysis

Root cause analysis is carried out with the decision tree algorithm by measuring the importance of each variable to find the critical ones that cause
the failure of paper manufacturing machinery. The variable importance is estimated
as the percentage of training dataset samples that fall into all the terminal nodes after the split.
In Table 5, the 61 variables are listed in order of importance from 1 to 61. Six variables (s3, s20, s21, s32, s33, and
s40), with an importance of 100%, are the most important variables for detecting failures before they occur. In
other words, these six variables have the most impact on the classification model. One interesting fact is that a categorical
variable (s28) and a binary variable (s61) do not make any contribution to this model.

TABLE 6 Cost benefit analysis based on RD-LNN

Item                    Gain (by TPR)               Loss (by FPR)                                                  Remark
Cost/occurrence         $10 000/True positive       $100/False positive                                            Assumed based on the relevant articles
Number of occurrence    124 × 12 month = 1488       (2 minutes × 30 × 24 hour × 365 day) − 1488 = 524 112          1 year
Occurrence rate         Recall = 12.0%              FPR = 1.9%                                                     Test result
Cost/year               $1 785 600                  −$995 813
Total cost              $789 787

Abbreviations: BCs, boundary conditions; FPR, false positive rate; RD-LNN, relative distance of the local nearest neighbor; TPR, true positive rate.

The decision tree which consists of three types of nodes (ie, root nodes, decision nodes, and terminal (or leaf) nodes)
and branches also shows similar results. As we can expect from variable importance, the most important variable is on
the root node which is located on the top of the decision tree.

4.4 Cost benefit analysis

The experiment suggests that RD-LNN is able to detect three of the 25 paper breaks 2 minutes ahead. In
this section, we analyze how much the proposed algorithm could contribute to industry even though
its performance is not high enough to detect every failure before it occurs. Table 6 shows that even the small
reduction in failures achieved by this algorithm can save a significant amount of cost every year. The gain
is calculated based on the recall of 12%, and the loss caused by false alarms is estimated to find the total cost that can
be saved over a year. Ranjan et al2 imply that a break costs more than 10 000 dollars. We assumed that failures
would occur 124 times per month based on our dataset, that is, 1488 times per year. Since the classification algorithm can detect 12% of failures,
about 1.79 million dollars can be saved per year by preventing 179 possible failures. However, we also need to consider the
other side: the negative effect of a false alarm, which gives a warning even though the machine is in the normal
state. We assumed that a false alarm would cost 100 dollars because people might stop working and need to check the
machine status to find out the problem. Based on the fact that data are captured every 2 minutes, the number of normal instances per year is obtained by subtracting 1488, the number of
failures that occur every year, from the total number of instances per year (Table 6). The total loss caused by false alarms would
be less than 1 million dollars given the FPR of 1.9%. If both positive and negative factors are considered together,
we can conclude that the proposed algorithm can save more than 700 thousand dollars in total
per year.
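The arithmetic behind Table 6 can be reproduced with a short Python sketch. The function and variable names are hypothetical; all numbers are the assumptions stated in the text and in Table 6, including the instance count, which is taken exactly as written there.

```python
def annual_cost_benefit(breaks, recall, gain_per_detection,
                        normal_instances, fpr, loss_per_false_alarm):
    """Yearly gain from detected breaks, loss from false alarms, and the net saving."""
    gain = breaks * recall * gain_per_detection
    loss = normal_instances * fpr * loss_per_false_alarm
    return gain, loss, gain - loss


breaks_per_year = 124 * 12                            # 1488 breaks per year
normal_per_year = 2 * 30 * 24 * 365 - breaks_per_year  # 524 112, as written in Table 6
gain, loss, net = annual_cost_benefit(breaks_per_year, 0.12, 10_000,
                                      normal_per_year, 0.019, 100)
print(round(gain), round(loss), round(net))  # 1785600 995813 789787
```

The net saving is sensitive to the assumed cost of a false alarm: under the same counts, the loss would equal the gain at roughly 179 dollars per false alarm.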

5 DISCUSSION AND CONCLUSION

It is crucial to detect failures early to save cost and labor in a paper manufacturing facility. However, it is challenging
to detect machine failure in advance because the data consist of MSTS and failures occur rarely
during operation without any clear symptom, which we call an extremely rare event problem. In this research, two
methods based on the nearest neighbor, CL-LNN and RD-LNN, are proposed to extract proper features for early
failure detection of paper manufacturing machinery. The data are preprocessed in several steps: splitting data,
standardization, moving class label, second derivative, and time window processing. CL-LNN measures Euclidean distance
to extract the class label of the nearest neighbor, which is fed into the decision tree classifier for failure
classification. The other algorithm, RD-LNN, extracts relative distances, generating numerical values that are suitable
for training an SVM. Experiments are carried out on the dataset provided by the IISE 2019 data competition to
show the competitiveness of our proposed methods. The dataset is preprocessed, and the proposed algorithms are combined
with other machine learning techniques. Through the experiments, we find that RD-LNN is able to extract features
effectively to detect abnormal conditions in the MSTS dataset, which could make a considerable contribution to industry
by saving cost.

Considering the fact that sensor measurements are collected every 2 minutes and it takes less than 20 seconds
to analyze one measurement with our algorithm to detect a failure, this algorithm would be a feasible solution in a
real-world environment where a prior warning is given so that technicians can take appropriate actions to prevent a
breakdown. However, it would be possible to find a more efficient way to deal with computation complexity when
deploying to a real-world environment. One possible solution for the real-time application is that, based on the fact that
Euclidean distance is calculated based on squared differences between two instances at m time points (see Equation (6)),
if we store these squared differences from, say, t = 1 to t = m, it can be easily updated, when a new signal is mea-
sured at t = m + 1, by dropping one at t = 1 and adding one at t = m + 1. In this case, by reutilizing previously computed
results at t = 2, … , m, it is only required to compute one for t = m + 1 which will let us save much time to calculate
the distance.
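The update described above can be sketched in Python with a fixed-length queue of squared differences. This is only an illustration: the class name is hypothetical, the signals are univariate, and it assumes that the two windows being compared advance together by one time step, so that only the newest squared difference has to be computed.

```python
import math
from collections import deque


class RollingEuclidean:
    """Euclidean distance over the last m time points, updated in O(1) per new measurement."""

    def __init__(self, m):
        self.m = m
        self.squared = deque()  # stored squared differences, oldest first
        self.total = 0.0

    def push(self, x, y):
        """Add the newest pair of measurements and return the current window distance."""
        d = (x - y) ** 2
        self.squared.append(d)
        self.total += d
        if len(self.squared) > self.m:
            self.total -= self.squared.popleft()  # drop the term at the oldest time point
        return math.sqrt(max(self.total, 0.0))    # guard against tiny negative rounding error


roller = RollingEuclidean(m=3)
for x, y in zip([1.0, 2.0, 3.0, 4.0, 5.0], [0.0] * 5):
    distance = roller.push(x, y)
print(distance)  # distance over the last three points: sqrt(9 + 16 + 25)
```

Over long runs the running sum can accumulate floating-point error, so recomputing it from the stored terms from time to time would be a reasonable safeguard.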
It should also be noted that the test dataset is standardized with the mean and SD obtained from the training
dataset, since these parameters of the test dataset are not available during model training. This could
have a negative impact on performance if new measurements differ significantly from the
previous ones (training dataset). Although we assume in this article that future examples will have a mean and SD
similar to those of the training dataset, this issue can be alleviated by updating those parameters as new
measurements are obtained.
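A minimal sketch of this standardization step for a single sensor is given below. The function names and the toy values are hypothetical; the point is only that the mean and SD are estimated on the training data and then reused, unchanged, for the test data.

```python
from statistics import mean, stdev


def fit_standardizer(train_values):
    """Estimate the mean and sample SD from the training data only."""
    return mean(train_values), stdev(train_values)


def standardize(values, mu, sd):
    """Apply previously estimated parameters to any data, including unseen test data."""
    return [(v - mu) / sd for v in values]


mu, sd = fit_standardizer([1.0, 2.0, 3.0])  # toy training measurements
print(standardize([4.0], mu, sd))           # [2.0]
```

Updating the parameters as new measurements arrive would amount to calling `fit_standardizer` again on an extended or more recent set of measurements.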
Even though the cost-benefit analysis shows promising results, further research to overcome the rare event
situation is still necessary, since performance improvement is limited by insufficient labeled data, a problem from which most
machine learning algorithms suffer. More effort should be made to overcome the lack of failure
data, which is commonly encountered when collecting data on rare events such as machine failures, spam email, and fraudulent
credit card transactions. The concept of active learning could provide a possible solution to handle
the extremely rare event problem where the dataset is severely imbalanced (skewed) with a small number of initial
training data available. The basic idea of active learning is that a machine learning
algorithm can achieve better performance with fewer labeled training data if it is allowed to choose the data from which it
learns. Therefore, we might be able to get better performance by adopting active learning algorithms in our future
research.

ACKNOWLEDGEMENT
We are very grateful to the two anonymous reviewers and the Editor-in-Chief for their comments on the article.

PEER REVIEW INFORMATION


Engineering Reports thanks Giovanna Martinez Arellano and other anonymous reviewer(s) for their contribution to the
peer review of this work.

CONFLICT OF INTEREST
The authors have no potential conflict of interest to declare.

PEER REVIEW
The peer review history for this article is available at https://publons.com/publon/10.1002/eng2.12291.

DATA AVAILABILITY STATEMENT


The data that support the findings of this study are openly available in arXiv.org at https://arxiv.org, reference number
arXiv:1809.10717.

ORCID
Kangwon Seo https://orcid.org/0000-0002-2128-4079

REFERENCES
1. Bajpai P. Basic Overview of Pulp and Paper Manufacturing Process. New York, NY: Springer; 2015:11-39.
2. Ranjan C, Mustonen M, Paynabar K, Pourak K. Dataset: rare event classification in multivariate time series; 2018. arXiv preprint
arXiv:1809.10717.
3. Montgomery DC. Introduction to Statistical Quality Control. Hoboken, NJ: John Wiley & Sons; 2012.
4. Karim F, Majumdar S, Darabi H, Harford S. Multivariate lstm-fcns for time series classification. Neural Netw. 2019;116:237-245.

5. Wang L, Wang Z, Liu S. An effective multivariate time series classification approach using echo state network and adaptive differential
evolution algorithm. Exp Syst Appl. 2016;43:237-249.
6. Christ M, Kempa-Liehr AW, Feindt M. Distributed and parallel time series feature extraction for industrial big data applications; 2016.
arXiv preprint arXiv:1610.07717.
7. Sykacek P, Roberts SJ. Bayesian time series classification. Advances in Neural Information Processing Systems, Vancouver, Canada in 2001.
Cambridge, MA: MIT Press; 2002:937-944.
8. Esmael B, Arnaout A, Fruhwirth RK, Thonhauser G. Improving time series classification using Hidden Markov models. Paper presented
at: Proceedings of the 2012 12th International Conference on Hybrid Intelligent Systems (HIS), Pune, India; 2012:502-507; IEEE.
9. Jović A, Brkić K, Bogunović N. Decision tree ensembles in biomedical time-series classification. Paper presented at: Proceedings of the
Joint DAGM (German Association for Pattern Recognition) and OAGM Symposium; 2012:408-417; Springer, Berlin, Heidelberg.
10. Eads DR, Hill D, Davis S, et al. Genetic algorithms and support vector machines for time series classification. Applications and Sci-
ence of Neural Networks, Fuzzy Systems, and Evolutionary Computation, Seattle, Washington. Vol 4787. Bellingham, Washington: SPIE;
2002:74-85.
11. Cui Z, Chen W, Chen Y. Multi-scale convolutional neural networks for time series classification; 2016. arXiv preprint arXiv:1603.06995.
12. Baesens B, Van Gestel T, Viaene S, Stepanova M, Suykens J, Vanthienen J. Benchmarking state-of-the-art classification algorithms for
credit scoring. J Operat Res Soc. 2003;54(6):627-635. https://doi.org/10.1057/palgrave.jors.2601545.
13. Lines J, Taylor S, Bagnall A. Hive-cote: the hierarchical vote collective of transformation-based ensembles for time series classifica-
tion. Paper presented at: Proceedings of the 2016 IEEE 16th International Conference on Data Mining (ICDM), Barcelona, Spain:
2016:1041-1046.
14. Tan CW, Petitjean F, Webb GI. FastEE: fast ensembles of elastic distances for time series classification. Data Mining Knowl Discov.
2020;34:231-272. https://doi.org/10.1007/s10618-019-00663-x.
15. Orsenigo C, Vercellis C. Combining discrete SVM and fixed cardinality warping distances for multivariate time series classification. Pattern
Recognit. 2010;43(11):3787-3794.
16. Weng X, Shen J. Classification of multivariate time series using two-dimensional singular value decomposition. Knowl Based Syst.
2008;21(7):535-539.
17. Zhang C, Yan H, Lee S, Shi J. Multiple profiles sensor-based monitoring and anomaly detection. J Qual Technol. 2018;50(4):344-362.
18. Rodríguez JJ, Alonso CJ. Support Vector Machines of Interval-Based Features for Time Series Classification. New York, NY: Springer;
2004:244-257.
19. Kadous MW, Sammut C. Classification of multivariate time series and structured data using constructive induction. Mach Learn.
2005;58(2-3):179-216.
20. Li C, Khan L, Prabhakaran B. Feature Selection for Classification of Variable Length Multiattribute Motions. New York, NY: Springer;
2007:116-137.
21. Kim J, Huang Q, Shi J, and Chang T. Online Multichannel Forging Tonnage Monitoring and Fault Pattern Discrimination Using Principal
Curve. ASME. J. Manuf. Sci. Eng. 2006;128(4):944-950. https://doi.org/10.1115/1.2193552.
22. Chang SI, Yadama S. Statistical process control for monitoring non-linear profiles using wavelet filtering and B-spline approximation. Int
J Product Res. 2010;48(4):1049-1068. https://doi.org/10.1080/00207540802454799.
23. Paynabar K, Jin J, Pacella M. Analysis of multichannel nonlinear profiles using uncorrelated multilinear principal component analysis
with applications in fault detection and diagnosis. IIE Trans. 2013;45(11):1235-1247.
24. Grasso M, Colosimo BM, Pacella M. Profile monitoring via sensor fusion: the use of PCA methods for multi-channel data. Int J Product
Res. 2014;52(20):6110-6135. https://doi.org/10.1080/00207543.2014.916431.
25. Zheng Y, Liu Q, Chen E, Ge Y, Zhao JL. Time series classification using multi-channels deep convolutional neural networks. Paper
presented at: Proceedings of the International Conference on Web-Age Information Management; 2014:298-310; Springer, Cham.
26. He H, Garcia EA. Learning from imbalanced data. IEEE Trans Knowl Data Eng. 2009;21(9):1263-1284.
27. Estabrooks A, Jo T, Japkowicz N. A multiple resampling method for learning from imbalanced data sets. Comput Intell. 2004;20(1):
18-36.
28. Ting KM. An instance-weighting method to induce cost-sensitive trees. IEEE Trans Knowl Data Eng. 2002;14(3):659-665.
29. Attenberg J, Provost F. Why label when you can search? Alternatives to active learning for applying human resources to build classification
models under extreme class imbalance. Proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery and
Data Mining; 2010:423-432.
30. Kazerouni A, Zhao Q, Xie J, Tata S, Najork M. Active learning for skewed data sets; 2020. arXiv preprint arXiv:2005.11442.
31. Fang M, Li Y, Cohn T. Learning how to active learn: a deep reinforcement learning approach; 2017; arXiv preprint arXiv:1708.02383.
32. Haussmann M, Hamprecht FA, Kandemir M. Deep active learning with adaptive acquisition; 2019. arXiv preprint arXiv:1906.11471.
33. Li X, Ding Q, Sun JQ. Remaining useful life estimation in prognostics using deep convolution neural networks. Reliab Eng Syst Safety.
2018;172:1-11.
34. Baradwaj BK, Pal S. Mining educational data to analyze students; 2012. arXiv preprint arXiv:1201.3417.
35. Srikanthan S, Kumar A, Gupta R. Implementing the dynamice warping algorithm in multithreaded environments for real time and unsu-
pervised pattern discovery. Paper presented at: Proceedings of the 2011 2nd International Conference on Computer and Communication
Technology (iccct-2011), Allahabad, India; 2011:394-398.
36. Sakoe H. Dynamic-programming approach to continuous speech recognition. Paper presented at: Proceedings International Congress of
Acoustics; 1971; Budapest.

37. Górecki T, Łuczak M. Non-isometric transforms in time series classification using DTW. Knowl Based Syst. 2014;61:98-108.
38. Dau HA, Silva DF, Petitjean F, Forestier G, Bagnall A, Keogh E. Judicious setting of dynamic time warping’s window width allows more
accurate classification of time series; 2017:917-922.

How to cite this article: Lee W, Seo K. Early failure detection of paper manufacturing machinery using nearest
neighbor-based feature extraction. Engineering Reports. 2020;e12291. https://doi.org/10.1002/eng2.12291
