Article
An Autoencoder Gated Recurrent Unit for Remaining
Useful Life Prediction
Yi-Wei Lu 1 , Chia-Yu Hsu 1, * and Kuang-Chieh Huang 2
1 Department of Industrial Engineering and Management, National Taipei University of Technology,
Taipei 10608, Taiwan; [email protected]
2 Department of Information Management, Yuan Ze University, Taoyuan 32003, Taiwan;
[email protected]
* Correspondence: [email protected]; Tel.: +886-2-27712171
Received: 6 August 2020; Accepted: 11 September 2020; Published: 15 September 2020
Abstract: With the development of smart manufacturing, in order to detect abnormal conditions of
the equipment, a large number of sensors have been used to record the variables associated with
production equipment. This study focuses on the prediction of Remaining Useful Life (RUL). RUL
prediction is part of predictive maintenance, which uses the development trend of the machine to
predict when the machine will malfunction. High accuracy of RUL prediction not only reduces
the consumption of manpower and materials, but also reduces the need for future maintenance.
This study focuses on detecting faults as early as possible, before the machine needs to be replaced
or repaired, to ensure the reliability of the system. It is difficult to extract meaningful features from
sensor data directly. This study proposes a model based on an Autoencoder Gated Recurrent Unit
(AE-GRU), in which the Autoencoder (AE) extracts the important features from the raw data and the
Gated Recurrent Unit (GRU) selects the information from the sequences to forecast RUL. To evaluate
the performance of the proposed AE-GRU model, an aircraft turbofan engine degradation simulation
dataset provided by NASA was used and a comparison made of different recurrent neural networks.
The results demonstrate that the AE-GRU is better than other recurrent neural networks, such as
Long Short-Term Memory (LSTM) and GRU.
Keywords: remaining useful life; predictive maintenance; deep learning; autoencoder; gated
recurrent unit
1. Introduction
Industry 4.0 depends on automated, smart factories, and employs sensor data and the methods of
big data analysis, in the expectation that the equipment in the factory will operate automatically and
self-correct to improve the product yield [1]. Sensor-related data are collected during the manufacturing
processes, such as temperature, pressure, power, humidity, and chemical analysis for equipment
monitoring [2,3]. These temporal patterns represent the equipment condition, and poor product quality
is often associated with abnormal changes of environment or inappropriate operation settings [4].
Equipment condition can be recorded by sensor data from the past and kept as time series data. When
analyzing equipment sensor data, not only the large amount of data recorded should be considered,
but also the time series characteristics. The idea of a smart factory is to link and intellectualize the
manufacturing process. In the past, automation merely used machines to improve production efficiency and yield and to reduce costs, whereas intelligence goes further by applying technologies such as IoT sensors to monitor and control production lines, with machines communicating directly with automation equipment [5]. The combination of cloud computing, data analysis, and software and hardware integration is also an important part of intelligence. However, there are numerous parameters in the smart factory, and the data are often
highly correlated between equipment and processes. Therefore, the analysis must go beyond a single
stage in the equipment or process, and be comprehensive.
Equipment maintenance methods are divided into the following three types: (1) corrective (repair) maintenance [6]; (2) preventive maintenance [7]; (3) predictive maintenance [8]. Corrective maintenance is a method of maintenance that is performed after some or all of the equipment fails. The most
common equipment maintenance application is preventive maintenance, also known as scheduled
maintenance, which carries out machine inspections, component replacement and other maintenance at
prescribed times. Predictive maintenance is an equipment maintenance method based on the condition
of the machine. It predicts the time when the machine may be damaged according to the development
trend of the past.
Predictive Maintenance (PdM) has been used to monitor the historical health status of equipment
and make timely adjustments to the equipment, which is quite different from the methods of routine
maintenance employed in the past. PdM not only saves unnecessary costs, but also allows early repairs
when the equipment is about to break down. It can avoid unpredictable machine stoppages caused
by unexpected breakdown and improper operation. Remaining Useful Life (RUL) is an important
indicator in PdM. The definition of device or system RUL is the period from the current state to the time
when the device begins to operate abnormally [9]. There are two types of RUL prediction: model-based
and data-driven [10]. The model-based method establishes the model based on the historical trend
of equipment. Generally, model-based methods perform better than data driven methods when
there is little historical data. Data-driven methods use the health status data and data collected by
sensors. For modeling, the collected data must be closely related to the device status. The advantage
of data-driven methods is that the algorithm is based on learning the correlation between input and
output data. If there are sufficient data, the stability and accuracy of data-driven methods are better
than model-based methods.
RUL prediction can be described as a time series forecasting problem. The main purpose of time
series forecasting is to build models from historical records and predict future situations. Time series
forecasting can be divided into statistical models, machine learning models and deep learning models.
Common statistical models include Autoregression (AR), Moving Average (MA), Autoregressive
Moving Average (ARMA), and Autoregressive Integrated Moving Average (ARIMA). Machine learning
builds a function or model to detect patterns in training samples and then uses these patterns for
prediction [11]. Support Vector Machine (SVM) [12], Random Forest (RF) [13], and Extreme Gradient Boosting (XGB) [14] are representative machine learning algorithms for time series forecasting, but it
is difficult to design an effective machine learning algorithm without substantial knowledge of the
data [15]. In addition, the existing methods for analyzing time series data need pre-processing based
on experience, which causes some losses of information. With improved process technology and more
sensors, few existing methods can handle multi-sensor data, which is likely to cause misjudgment or
missed detection.
Deep learning methods are preferred to predict and analyze time series data without predefined
features. Deep learning is a branch of machine learning based on artificial neural networks with
representation learning [16–18], which can learn the key information to make a response by using many
hidden layers integrated in the network. The main difference between deep learning and machine
learning is feature extraction. Usually, the features of machine learning are manually selected in
advance; deep learning not only identifies features through model training, but is also much better
than other algorithms at calculating results. Makridakis et al. [19] compared several time series analysis methods on the M3-Competition dataset. Most of the machine learning and deep learning algorithms perform better than traditional statistical methods. Alfred et al. [20] demonstrated that algorithms based
on neural networks are more efficient in learning time series data than other algorithms.
Feature extraction is an important data pre-processing method in machine learning for high
prediction accuracy. In the past, the dimension reduction of most data features was accomplished
through Principal Component Analysis (PCA). PCA retains the maximum variation of features by
projecting variables into another space to represent the original complete data with fewer features, to
achieve the effect of data dimension reduction. The difference between PCA and Autoencoder (AE)
is that AE is a non-linear dimension reduction method [21], and an unsupervised training method.
First, the model compresses the original data through the encoder structure, and then restores the
compressed data through the decoder. The AE rebuilds the original data with fewer features for data
representation [22]. It can compress the key information of high dimensional input data into low
dimensional features. AE is often used for feature extraction and dimensional reduction in time series
forecasting problems [23]. Moreover, traditional machine learning methods require manual selection of features, and the quality of the features affects the models’ results. Deep learning has the ability to automatically
extract features. It can reduce the time spent on feature engineering and help experts to make decisions.
Most of the relevant studies on RUL prediction use deep learning as a prediction model, which not
only can save time in selecting features, but also makes prediction results much better than traditional
machine learning models.
Recurrent Neural Networks (RNN) are a deep learning model that deals specifically with time
series data and can select important features from equipment sensors. The Long Short-Term Memory network (LSTM) is a variant of RNN [24]. For example, Zhang et al. [25] constructed an LSTM model to predict the RUL of lithium batteries, predicting after how many full charge and discharge cycles the capacity will fall below the normal threshold, based on the historical decline rate of capacity.
Heimes [9] set the maximum remaining engine life at 130 on the PHM08 data set, and predicted the
RUL using an RNN model. Zhang et al. [25] also constructed an LSTM model to predict the RUL
of the jet engine. One hundred jet engines were used as training data and 100 were used as testing
data. Each jet engine has 24 features that record the sensor value from normal to fault, and the RUL
are predicted by the change in these values. Mathew et al. [26] predicted the RUL of jet engines, using 24 parameters of the original data and comparing ten machine learning methods (Linear Regression, Decision Tree, SVM, Random Forest, KNN, K-means, Gradient Boost, Adaboost, Deep Learning, ANOVA), and verified the validity of the models through RMSE. Zheng et al. [27] used an LSTM model
to predict the RUL on C-MAPSS data set, PHM08 data set, and Milling data set. The LSTM model
is better than the other models. Chen et al. [28] proposed a two-phase model for RUL prediction by
Kernel Principal Component Analysis (KPCA) and Gated Recurrent Unit (GRU), respectively. To the
best of our knowledge, little research has integrated AE and GRU for RUL prediction.
This study develops an Autoencoder Gated Recurrent Unit (AE-GRU) model to predict the RUL
of equipment. In particular, AE is used to select features, and then the correlation between sensors is
found through the GRU model, and the RUL is predicted by a Multi-Layer Perceptron (MLP) model.
The first part is the Autoencoder (AE) model which is composed of an encoder and decoder. The
second part is the GRU model which is a type of RNN that can deal with time series data. The GRU
model finds key information in historical sensor data in combination with the MLP model. The MLP
model calculates the extracted information and combines with the backpropagation algorithm to
predict the RUL effectively.
2. Literature Review
Time series are statistical data, arranging events or data in order of their occurrence. The main
purpose of time series forecasting is to build models from historical records and predict future situations.
… weights will cause the problem of gradient vanish because of the heavy calculation, which will let the gradient descent method stay at the local minimum. The traditional neural network has different parameters to be learned in each hidden layer, but the parameters of the RNN only have different inputs. This feature greatly reduces the training burden of the model.

LSTM solves the problem of the gradient vanish in the random gradient descent in recurrent neural networks. The biggest difference between the LSTM and RNN is that each neuron of LSTM has a control gate function, which is input, forget, and output. These gates have their own weights. The calculation of the weights after data input determines whether the switches are turned on or off. The input gate controls whether data can be written to the memory space, and the forget gate determines whether the contents of the previous memory space are retained. The output gate controls whether the memory space operation result can be output. Although adding these gates to each neuron generates more weights to be calculated, it can solve the problem of gradient vanish found in the recurrent neural network.

Figure 1 is a diagram of the LSTM network unit architecture, and the related formulas are shown as Equations (1)–(6). The LSTM unit receives the input vector x_t and the previous output h_(t−1). Each unit contains four gates (i, g, f, o) and a cell. x is the input vector of this layer, h is the output vector, ⊙ is the element-wise multiplication, W is the weight matrix, b is an error vector, and (i, f, o, g, c) represent the input gate, forget gate, output gate, cell input, and cell activation, respectively. The forget gate controls how much information each memory unit needs to forget, the input gate controls how much new information each memory unit adds, and the output gate controls how much information each memory unit outputs. The cell input is the information input of the current time and the information of the previous time. Cell activation controls whether the input gate and forget gate information can be obtained.

i_t = sigmoid(W_xi x_t + W_hi h_(t−1) + b_i) (1)

f_t = sigmoid(W_xf x_t + W_hf h_(t−1) + b_f) (2)

o_t = sigmoid(W_xo x_t + W_ho h_(t−1) + b_o) (3)

g_t = tanh(W_xg x_t + W_hg h_(t−1) + b_g) (4)

c_t = f_t ⊙ c_(t−1) + i_t ⊙ g_t (5)

h_t = tanh(c_t) ⊙ o_t (6)

Figure 1. Structure of Long Short-Term Memory (LSTM) unit.
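To make Equations (1)–(6) concrete, the following minimal NumPy sketch performs one LSTM step; the input and recurrent weight matrices are folded into a single matrix per gate, and the sizes and random initialization are illustrative assumptions rather than settings from the paper.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM step following Equations (1)-(6).
    W[k] maps the concatenated [x_t, h_prev] to gate k; b[k] is its bias."""
    xh = np.concatenate([x_t, h_prev])
    i_t = sigmoid(W["i"] @ xh + b["i"])   # input gate,  Eq. (1)
    f_t = sigmoid(W["f"] @ xh + b["f"])   # forget gate, Eq. (2)
    o_t = sigmoid(W["o"] @ xh + b["o"])   # output gate, Eq. (3)
    g_t = np.tanh(W["g"] @ xh + b["g"])   # cell input,  Eq. (4)
    c_t = f_t * c_prev + i_t * g_t        # cell state,  Eq. (5)
    h_t = np.tanh(c_t) * o_t              # output,      Eq. (6)
    return h_t, c_t

# Illustrative sizes: 24 sensor inputs, 32 hidden units (assumed, not from the paper).
n_in, n_hid = 24, 32
rng = np.random.default_rng(0)
W = {k: rng.normal(scale=0.1, size=(n_hid, n_in + n_hid)) for k in "ifog"}
b = {k: np.zeros(n_hid) for k in "ifog"}
h, c = lstm_step(rng.normal(size=n_in), np.zeros(n_hid), np.zeros(n_hid), W, b)
```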
The gated recurrent unit (GRU) is a variant of LSTM [29]. As shown in Figure 2, (z, r, H, H̃) are the update gate, reset gate, activation, and candidate activation, respectively. The detailed formulas are shown in Equations (7)–(10). The update gate is used to control how much historical information and new information needs to be forgotten in the current state. The reset gate is used to control how much information is available from the candidate state. The candidate activation can be regarded as the new information at the current time. The reset gate is used to control how much historical information needs to be retained. The activation is generated by the update gate and candidate activation. The update gate in the activation controls how much new and old information is retained. New information and old information have a complementary relationship. If a lot of new information is retained, the old information considered will be less, and vice versa.
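Since Equations (7)–(10) are not reproduced in the text above, the sketch below follows one common GRU formulation (gate conventions vary slightly between references) simply to illustrate the roles of the update gate, reset gate, and candidate activation described here; weight shapes and initialization are illustrative assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x_t, h_prev, W, b):
    """One GRU step: update gate z, reset gate r, candidate activation h_tilde."""
    xh = np.concatenate([x_t, h_prev])
    z_t = sigmoid(W["z"] @ xh + b["z"])              # update gate
    r_t = sigmoid(W["r"] @ xh + b["r"])              # reset gate limits how much history is used
    xh_reset = np.concatenate([x_t, r_t * h_prev])
    h_tilde = np.tanh(W["h"] @ xh_reset + b["h"])    # candidate activation (new information)
    h_t = z_t * h_prev + (1.0 - z_t) * h_tilde       # complementary blend of old and new information
    return h_t

n_in, n_hid = 24, 32                                 # assumed sizes
rng = np.random.default_rng(0)
W = {k: rng.normal(scale=0.1, size=(n_hid, n_in + n_hid)) for k in "zrh"}
b = {k: np.zeros(n_hid) for k in "zrh"}
h = gru_step(rng.normal(size=n_in), np.zeros(n_hid), W, b)
```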
GRU simplifies the input gate and forget gate of LSTM into an update gate, and combines the cell states and hidden states together. Therefore, the GRU unit retains the advantages of LSTM, and further reduces the model training time by reducing the parameters in the model. According to the analysis results in [30], the results of the RNN model are poor and the results of GRU and LSTM are similar, and better than RNN. However, the GRU has fewer parameters than LSTM, so the trend in deep learning models applied to time series analysis is toward GRU.

In view of the excellent feature extraction performance of AE, and the fast calculation advantage of GRU, in this study the RUL prediction model works by extracting the important features from the original data using AE. After pre-processing of the extracted features, the GRU model and DNN (deep fully connected neural networks) are applied to predict the RUL.
Bao et al. [35] proposed a three-stage framework for financial time series forecasting. Stage one applied a wavelet transform algorithm to remove noise from the original data. Stage two stacks AEs to reduce the data dimensions and pick out features automatically. Stage three applies an LSTM model to make predictions. Zhang et al. [36] divided Sea Surface Temperature (SST) prediction into two parts. The first part is short-term forecasting, which predicts the SST after 1 and 3 days. The second part is long-term forecasting, which predicts the weekly average and monthly average of SST. It finds the relationship between sequences through LSTM, and then makes the prediction through the fully connected layer. How et al. [37] use the angle changes on different motion sensors of the NAO robot to classify the current actions of the robot through an LSTM model. Truong et al. [38] construct an LSTM model to identify changes in human type and object type, and even predict possible human actions through the action combinations. Kuan et al. [39] constructed an MS-GRU (multilayered self-normalizing gated recurrent units) model, which has good results in predicting power loading. Zhang and Kabuka [40] used the GRU model to predict the traffic volume. The training data included the historical weather data of the previous 100 h to predict the traffic volume in the next 12 h. It was found that weather conditions can improve the prediction accuracy.
3. Research Framework

This study constructs an AE-GRU model for RUL prediction to increase equipment life, reduce unexpected harm caused by sudden shutdown of machinery, and improve the reliability of system operation. The proposed AE-GRU includes the steps of data pre-processing, feature extraction, and RUL prediction, as shown in Figure 3. The data pre-processing step first defines the engine life to predict the exact RUL of the engine, and then standardizes the data to avoid errors caused by different scales of the data. It then uses the deep learning autoencoder model to extract the features of the original data. GRU is specialized in processing time series related data, and considers the changes of historical characteristics. The data is converted from 2D to 3D; the new dimension is the time data. The last step of data processing is to change the value range of the RUL. Because the original data is linearly decreasing, the results in model training and prediction are not optimal. Finally, the RUL expectancy is predicted through the GRU model and a fully-connected layer neural network.

Figure 3. Research framework of Autoencoder Gated Recurrent Unit (AE-GRU).

The symbols used in this research are defined as follows:

N: number of engines
S: number of sensors
n: length of sensing data
Y: number of neurons
L: number of hidden layers
W: weight
b: error vector
Features: new features after extraction
X_ij: the sensing data of the jth sensor of the ith engine; i = 1 . . . N, j = 1 . . . S
X_it: the current time t of the ith engine; i = 1 . . . N, t = 1 . . . T
h_ly: the yth neuron in the lth hidden layer; l = 1 . . . L, y = 1 . . . Y
T: engine usage time
RUL: remaining life
RUL': remaining life after transform
μ: the average value of sensing data x_i
σ: standard deviation of sensing data x_i
x'_i: data after standardization, where i is the data length; i = 1 . . . n
y_i: true remaining life; i = 1 . . . n
y'_i: predicted remaining life; i = 1 . . . n

The engine usage time T is taken as the maximum recorded time point of the engine (Equation (11)).

T = Max(X_it) (11)
In the original engine data, the scales of the values measured by different sensors are different, such as pressure, temperature, and humidity. In order to avoid the difference in scales and make all the sensor values comparable, the original data are all converted to Z-scores. The average value μ is the sum of the sensor data x_i divided by n, the number of observations (Equation (13)). The standard deviation σ is also calculated (Equation (14)). The transformed data x'_i is calculated as the difference between the sensor value and the average value, divided by the standard deviation (Equation (15)). The average value of x'_i is equal to 0 and its standard deviation is equal to 1.

μ = (1/n) Σ_(i=1)^n x_i (13)

σ = sqrt( (1/n) Σ_(i=1)^n (x_i − μ)^2 ) (14)

x'_i = (x_i − μ) / σ (15)
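A minimal sketch of this per-sensor Z-score standardization (Equations (13)–(15)), assuming the training records sit in a NumPy array with one column per sensor; the small epsilon guarding against constant sensors is an addition for numerical safety, not part of the formula above.

```python
import numpy as np

def zscore_standardize(X, eps=1e-8):
    """Standardize each sensor column to zero mean and unit standard deviation.
    X has shape (n_records, n_sensors)."""
    mu = X.mean(axis=0)                           # Equation (13)
    sigma = X.std(axis=0)                         # Equation (14)
    return (X - mu) / (sigma + eps), mu, sigma    # Equation (15)

# Example with random stand-in data (24 sensor columns, as in the PHM08 records).
X_train = np.random.default_rng(0).normal(loc=50.0, scale=5.0, size=(1000, 24))
X_std, mu, sigma = zscore_standardize(X_train)
print(X_std.mean(axis=0).round(6), X_std.std(axis=0).round(6))
```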
3.1.1. Feature Extraction

AE is used to reduce the data dimension and achieve the effect of feature extraction. As shown in Figure 4, the encoder will reduce the dimension of the data and transform the original data into a new space. The features in this space can describe the data more concisely than the original features. The orange neurons in the middle of the figure are the features in the new space. The feature value is obtained by multiplying the weight matrix with the neurons of the previous hidden layer and adding the error vector (Equation (16)), where f() is the activation function, which is ReLU in this model, W is the weight matrix, h_l is the output of the lth hidden layer, and b is the error vector (bias).

Features = f(W ∗ h_l) + b (16)

Figure 4. AE structure.
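For illustration, a small fully connected autoencoder of this kind can be written in Keras as below; the sizes (24 sensor inputs compressed to 15 features, matching the dimensions used in Section 4) and the training settings are assumptions for the sketch, not the exact architecture reported in the paper.

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

n_sensors, n_features = 24, 15            # compress 24 sensor inputs to 15 features

inputs = keras.Input(shape=(n_sensors,))
encoded = layers.Dense(n_features, activation="relu")(inputs)     # encoder (Equation (16))
decoded = layers.Dense(n_sensors, activation="linear")(encoded)   # decoder reconstructs the input

autoencoder = keras.Model(inputs, decoded)
encoder = keras.Model(inputs, encoded)    # reused later to extract features for the GRU
autoencoder.compile(optimizer="adam", loss="mse")

# Stand-in standardized sensor data; in practice this is the PHM08 training set.
X = np.random.default_rng(0).normal(size=(1000, n_sensors)).astype("float32")
autoencoder.fit(X, X, epochs=10, batch_size=128, verbose=0)
X_features = encoder.predict(X, verbose=0)   # shape (1000, 15)
```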
The data collected by sensors is a series of continuous data. However, the general model only uses the current sensor value to predict the RUL; the change in historical sensor values is not considered. In this study, a new data processing method is proposed to take historical information into account. The format of the data will be converted from the original two-dimensional (samples, features) to three-dimensional (samples, time steps, features). Zero-padding is often used in signal processing or Convolutional Neural Networks (CNN). The purpose is to keep or increase the dimension of the data without affecting the information of the data itself. In this study, when the current sensor record has no historical data, the zero-padding method will be used to keep the data dimension.
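A sketch of this conversion for a single engine is given below: each record is paired with its preceding time steps, and records near the start of the series, which lack a full history, are zero-padded on the left. The window length of 5 and the array sizes are illustrative (the experiments later use time steps of 1, 5, and 10).

```python
import numpy as np

def to_sequences(X, time_steps=5):
    """Convert (samples, features) into (samples, time_steps, features).
    Row t contains records t-time_steps+1 .. t; missing history is zero-padded."""
    n, f = X.shape
    out = np.zeros((n, time_steps, f), dtype=X.dtype)
    for t in range(n):
        start = max(0, t - time_steps + 1)
        window = X[start:t + 1]
        out[t, time_steps - len(window):] = window   # left-pad with zeros
    return out

X_engine = np.random.default_rng(0).normal(size=(200, 15))  # 200 cycles, 15 AE features
X_seq = to_sequences(X_engine, time_steps=5)
print(X_seq.shape)   # (200, 5, 15)
```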
Figure 5. Maximum life value transform: (a) original RUL; (b) RUL after maximum useful life transform.
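The transform itself caps the linearly decreasing RUL target at a maximum useful life so that early, healthy cycles share one constant label. A short sketch follows; the cap of 130 cycles is borrowed from the value Heimes [9] used on PHM08 and stands in for whatever cap is actually chosen in a given experiment.

```python
import numpy as np

def piecewise_rul(total_cycles, max_life=130):
    """Return the RUL label for each cycle of one engine, clipped at max_life."""
    rul = np.arange(total_cycles - 1, -1, -1)   # linearly decreasing RUL: N-1, ..., 1, 0
    return np.minimum(rul, max_life)            # early cycles are capped at max_life

rul = piecewise_rul(total_cycles=200)
print(rul[:3], rul[-3:])   # [130 130 130] ... [2 1 0]
```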
3.2. RUL Prediction Model

The RUL prediction model of this study combines the advantages of effective feature extraction and fast calculation. The RUL prediction model is divided into four steps, as shown in Figure 6. Step 1 is the input layer. X_t represents the sensor data collected at the current time, X_(t−1) represents the sensor data collected at the previous moment, and X_(t−2) represents the sensor data collected two moments earlier. The number of historical time points of sensor data to input to the model is adjustable. Step 2 is the GRU layer. The number of GRU layers is set to two in this paper. The number of GRU neurons will be tuned in the experimental part, and the best parameter combination selected for the research. The purpose of this step is to identify time series correlation in the input data through GRU. However, GRU cannot directly predict the RUL. Therefore, a DNN must be added to predict the RUL. Step 3 is the DNN layer. The DNN converts the features extracted by GRU to the prediction dimension to perform the prediction. The best parameter combination from the experiment result will be used to decide the number of neurons, as in Step 2. The objective function of the DNN is applied to calculate the difference between the predicted value and the true value. In this study, Mean Square Error (MSE) is used as the objective function (Equation (18)), and the gradient descent method is used to minimize the objective function and train the model. Step 4 is the output layer of the model. The output is the value of the predicted RUL at the current time. The model evaluation is conducted by calculating the Root Mean Square Error (RMSE).

MSE = (1/n) Σ_(i=1)^n (y_i − y'_i)^2 (18)
Figure 6. Model structure of AE-GRU for RUL prediction.
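Assembling the four steps in Keras might look roughly as follows, using the 15 AE features and 5-step window from the earlier sketches and the G(64,64)N(8,8) layer sizes that Section 4.2.3 settles on; this is a sketch of the described architecture, not the authors' exact implementation.

```python
from tensorflow import keras
from tensorflow.keras import layers

time_steps, n_features = 5, 15                        # windowed AE features, as assumed earlier

model = keras.Sequential([
    keras.Input(shape=(time_steps, n_features)),      # Step 1: input layer
    layers.GRU(64, return_sequences=True),            # Step 2: first GRU layer
    layers.GRU(64),                                    #         second GRU layer
    layers.Dense(8, activation="relu"),                # Step 3: DNN layers
    layers.Dense(8, activation="relu"),
    layers.Dense(1),                                   # Step 4: predicted RUL
])

# MSE objective (Equation (18)); RMSE is tracked for evaluation.
model.compile(optimizer="nadam", loss="mse",
              metrics=[keras.metrics.RootMeanSquaredError()])
model.summary()
```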
4. Result Evaluation

4.1. Data Collection

To validate the proposed AE-GRU, a Turbofan Engine Degradation Simulation Data Set (Prognostics and Health Management 2008, PHM08) (https://www.nasa.gov/) is used for performance evaluation in this section. The simulation was carried out with a NASA tool, C-MAPSS, to simulate real large commercial aircraft turbofan engines. The data consist of a large number of time series cycles, which come from different engines of the same type. Each engine starts with a different wear level. The first section introduces the engine data collected by the sensors, and the second section describes the parameter settings of the model. The performance of different activation functions is compared and performance is optimized. Finally, the number of hidden layers and neurons is compared. In the third section, we will extract the best parameters and apply them to different models, and test the models by analyzing the root mean square error (RMSE).

There are data for 218 turbo engines, and a total of 45,918 records in the dataset. Every record has 26 original features and the customized RUL. The experimental design and verification of this study adopts k-fold cross validation to ensure the stability of the model.

Figure 7 is a diagram of the sensor parameters of the turbine engines in this study. The sensor parameters of Engine 1 are presented visually. The parameter values of the sensors vary in range and the parameter trends are not consistent. The RUL of the engine will be predicted from the changes of these sensor parameters in this research.

4.2.1. Activation Function

The relu (rectified linear unit) activation function is shown in Equation (19).

f(x) = max(0, x) (19)

In addition, since the relu activation function does not need to perform exponential calculations, the convergence speed is fast and the calculation complexity levels are low. So, this study adopted relu as the activation function.
4.2.2. Optimizer

The purpose of optimizers is to minimize the loss function and the gap between the predicted value and the real value. This experiment sets the following parameters and observes the training result of the GRU model under different optimizers:

1. Epochs: 100;
2. Activation: relu;
3. Batch_size: 128;
4. Hidden layer, number of neurons: 1, 50;
5. Inputs: 24.

This experiment applies 5-fold cross-validation to compare a total of seven optimization algorithms: SGD (569.8), RMSprop (438.7), Adagrad (4586.9), Adadelta (426.6), Adam (436.6), Adamax (436.8), Nadam (420.7). The parameter setting of each optimization algorithm is the best result proposed in the literature. Kingma and Ba [42] proposed Adam's optimization algorithm. Most of the current neural networks apply Adam to optimize the loss function. The advantages of Adam are quick convergence and dealing with high noise and the problem of sparse gradients. The Nadam optimization algorithm proposed by Dozat [43] changes the momentum part of Adam and accelerates the convergence rate of the model. The study confirms that on this PHM08 dataset, Nadam converges faster than Adam (Figure 9). Therefore, the optimizer used in this study is Nadam.

Figure 9. Comparison of training results with different optimizers.
4.2.3. Number of Hidden Layers and Neurons

This experiment compares three different batch sizes: 32 (Table 2), 64 (Table 3), 128 (Table 4). The early stopping function has been adopted in the experiment; if the validation MSE does not improve for 10 consecutive epochs, the training will stop and the previous best model will be used on the test dataset. Early stopping is commonly used with gradient descent algorithms to avoid overfitting. The GRU and DNN layers are set to two layers. The RMSE results of different numbers of neurons in training are compared. The input of this experiment is the 24 parameters of the current raw sensor data, and there is no historical information (time steps = 1).
Table 2. Comparison of Root Mean Square Error (RMSE) result (batch size = 32).

Networks            Epochs   Training Time   RMSE
G(32,32)N(8,8)      33       98.4 s          20.18
G(64,64)N(8,8)      20       89.3 s          20.24
G(32,64)N(8,8)      19       76.8 s          20.37
G(32,64)N(16,16)    28       103.7 s         20.24
G(96,96)N(8,8)      25       150.3 s         20.20
G(96,96)N(16,16)    24       151.4 s         20.18
The experimental results show that the larger the batch size, the faster the model training speed.
Therefore, the batch size in this study will be set to 128. When the batch size is greater, not only is the training time shorter, but the RMSE results are also better. So, the network architecture of this study will be set
to G(64,64)N(8,8).
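The training setup described in Sections 4.2.2 and 4.2.3 (Nadam optimizer, batch size 128, up to 100 epochs, early stopping after 10 epochs without validation improvement, keeping the best model) can be sketched as follows; the data arrays are placeholders standing in for the prepared PHM08 sequences.

```python
import numpy as np
from tensorflow import keras

# Placeholder arrays standing in for the windowed PHM08 features and RUL labels.
rng = np.random.default_rng(0)
X_train, y_train = rng.normal(size=(4000, 5, 15)).astype("float32"), rng.uniform(0, 130, 4000)
X_val, y_val = rng.normal(size=(1000, 5, 15)).astype("float32"), rng.uniform(0, 130, 1000)

early_stop = keras.callbacks.EarlyStopping(
    monitor="val_loss", patience=10, restore_best_weights=True)

# The G(64,64)N(8,8) predictor sketched after Figure 6.
model = keras.Sequential([
    keras.Input(shape=(5, 15)),
    keras.layers.GRU(64, return_sequences=True),
    keras.layers.GRU(64),
    keras.layers.Dense(8, activation="relu"),
    keras.layers.Dense(8, activation="relu"),
    keras.layers.Dense(1),
])
model.compile(optimizer="nadam", loss="mse",
              metrics=[keras.metrics.RootMeanSquaredError()])
model.fit(X_train, y_train, validation_data=(X_val, y_val),
          epochs=100, batch_size=128, callbacks=[early_stop], verbose=0)
```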
Input dimension       15          10          5
Reconstruction loss   7.7 × 10−6  2.6 × 10−5  3.0 × 10−5
The experimental results show that the dimension reduction effect is the best when the input is
reduced to 15 dimensions. In 10 or 5 dimensions, the effect of restoration is poor because of the loss
of original data information. In this study, the AE will reduce the data dimension from 24 to 15 and
compare the performance with other models.
After 5-fold cross-validation, (Tables 6 and 7), it is found that when the input of AE-GRU is
reduced from the original 24 features to 15, the RMSE results are better than those of other models.
This confirms that AE-GRU can effectively extract features and accurately predict the RUL. Time steps
is the number of historical data points. Time steps = 5 uses the historical information of the current time point and the previous four time records. Time steps = 10 uses the current time information and the previous nine time points. The results show that when time steps = 5, the results of the models are
quite close. When time steps = 10, because of the large amount of historical data, the RNN has the problem of disappearing gradients, while the general DNN has no time series characteristics, and the prediction results are the worst.
Table 6. RMSE results of different models under 5-fold cross-validation (Time steps = 5).

Fold      DNN (Inputs: 24)   RNN (Inputs: 24)   LSTM (Inputs: 24)   GRU (Inputs: 24)   AE-GRU (Inputs: 15)
1         18.57              18.06              17.80               17.68              17.64
2         18.76              18.22              18.00               17.88              17.70
3         18.73              18.47              18.19               18.03              17.98
4         19.11              18.44              18.45               18.16              18.20
5         19.12              18.43              18.34               18.12              18.20
Average   18.86              18.32              18.16               17.96              17.94

Table 7. RMSE results of different models under 5-fold cross-validation (Time steps = 10).

Fold      DNN (Inputs: 24)   RNN (Inputs: 24)   LSTM (Inputs: 24)   GRU (Inputs: 24)   AE-GRU (Inputs: 15)
1         19.21              15.08              11.20               10.31              10.39
2         19.21              14.26              11.52               11.05              10.02
3         19.07              15.86              10.78               10.75              10.70
4         19.77              15.07              11.49               11.33              10.69
5         19.69              17.27              10.81               10.54              11.31
Average   19.39              15.51              11.16               10.79              10.62
retains the advantages of LSTM. The AE-GRU model proposed in this study has a shorter training
time, and also takes advantage of the features extracted by AE. The convergence speed of the model is
much faster. The validity of the model is evaluated and verified through the 5-fold cross-validation.
The results of the root mean square error are better than other deep learning methods, and the research
model can find the engine that is about to fail early and in time to maintain the equipment, reducing
unnecessary costs.
The AE-GRU model proposed in this study has good accuracy in the prediction of RUL. With the
improvement of process technology, the cost of equipment will become higher and higher. Therefore,
more complex models may be needed in future research, such as a bi-directional recurrent network.
In this study, only a uni-directional recurrent network is considered. In some cases, the output at the
current moment is not only related to the previous state, but also closely related to the state after it.
The pre-processing method in this study extracts features from the original data. Although the
features can effectively represent the data set, they may contain a little noise. In the future, the pre-
processing method might also employ an advanced version of AE, such as denoising AE (DAE).
The combination of these two methods can further improve the prediction of RUL.
If the algorithm can be effectively implemented, predictive maintenance can be widely used in many
applications, providing effective information for early maintenance of machinery, and self-correcting
parameters can improve the yield to achieve the target of smart factories.
Author Contributions: Conceptualization, Y.-W.L. and C.-Y.H.; methodology, C.-Y.H.; software, K.-C.H.;
validation, Y.-W.L. and K.-C.H.; formal analysis, Y.-W.L. and K.-C.H.; investigation, Y.-W.L. and K.-C.H.; resources,
C.-Y.H.; data curation, K.-C.H.; writing—original draft preparation, Y.-W.L.; writing—review and editing, C.-Y.H.;
visualization, Y.-W.L.; supervision, C.-Y.H.; project administration, C.-Y.H.; funding acquisition, C.-Y.H. All authors
have read and agreed to the published version of the manuscript.
Funding: This research is supported by Ministry of Science and Technology, Taiwan (MOST 107-2221-E-027-127-
MY2; MOST108-2745-8-027-003).
Conflicts of Interest: The authors declare no conflict of interest.
References
1. Wang, S.; Wan, J.; Li, D.; Zhang, C. Implementing Smart Factory of Industrie 4.0: An Outlook. Int. J. Distrib.
Sens. Netw. 2016, 12, 3159805. [CrossRef]
2. Chien, C.-F.; Hsu, C.-Y.; Chen, P.-N. Semiconductor Fault Detection and Classification for Yield Enhancement
and Manufacturing Intelligence. Flex. Serv. Manuf. J. 2013, 25, 367–388. [CrossRef]
3. Hsu, C.-Y.; Liu, W.-C. Multiple Time-Series Convolutional Neural Network for Fault Detection and Diagnosis
and Empirical Study in Semiconductor Manufacturing. J. Intell. Manuf. 2020, 1–14. [CrossRef]
4. Fan, S.-K.S.; Hsu, C.-Y.; Tsai, D.-M.; He, F.; Cheng, C.-C. Data-Driven Approach for Fault Detection and
Diagnostic in Semiconductor Manufacturing. IEEE Trans. Autom. Sci. Eng. 2020, 1–12. [CrossRef]
5. Shrouf, F.; Ordieres, J.; Miragliotta, G. Smart factories in Industry 4.0: A review of the concept and of energy
management approached in production based on the Internet of Things paradigm. In Proceedings of the
2014 IEEE International Conference on Industrial Engineering and Engineering Management, Selangor,
Malaysia, 9–12 December 2014; pp. 697–701.
6. Kothamasu, R.; Huang, S.H.; VerDuin, W.H. System health monitoring and prognostics—A review of current
paradigms and practices. Int. J. Adv. Manuf. Technol. 2006, 28, 1012–1024. [CrossRef]
7. Peng, Y.; Dong, M.; Zuo, M.J. Current status of machine prognostics in condition-based maintenance:
A review. Int. J. Adv. Manuf. Technol. 2010, 50, 297–313. [CrossRef]
8. Soh, S.S.; Radzi, N.H.; Haron, H. Review on scheduling techniques of preventive maintenance activities
of railway. In Proceedings of the 2012 Fourth International Conference on Computational Intelligence,
Modelling and Simulation, Kuantan, Malaysia, 25–27 September 2012; pp. 310–315.
9. Heimes, F.O. Recurrent neural networks for remaining useful life estimation. In Proceedings of the 2008
International Conference on Prognostics and Health Management, Denver, CO, USA, 6–9 October 2008;
pp. 1–6.
Processes 2020, 8, 1155 17 of 18
10. Li, Y.; Shi, J.; Gong, W.; Liu, X. A data-driven prognostics approach for RUL based on principle component
and instance learning. In Proceedings of the 2016 IEEE International Conference on Prognostics and Health
Management (ICPHM), Ottawa, ON, Canada, 20–22 June 2016; pp. 1–7.
11. Murphy, K.P. Machine Learning: A Probabilistic Perspective; MIT Press: Cambridge, MA, USA, 2012.
12. Cortes, C.; Vapnik, V. Support-Vector Networks. Mach. Learn. 1995, 20, 273–297. [CrossRef]
13. Breiman, L. Random forests. Mach. Learn. 2001, 45, 5–32. [CrossRef]
14. Chen, T.; Guestrin, C. XGBoost: A Scalable Tree Boosting System. In Proceedings of the 22nd ACM
SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA,
13–17 August 2016; ACM: New York, NY, USA, 2016.
15. Han, Z.; Zhao, J.; Leung, H.; Ma, K.F.; Wang, W. A review of deep learning models for time series prediction.
IEEE Sens. J. 2019. [CrossRef]
16. Bengio, Y.; Courville, A.; Vincent, P. Representation Learning: A Review and New Perspectives. IEEE Trans.
Pattern Anal. Mach. Intell. 2013, 35, 1798–1828. [CrossRef]
17. Schmidhuber, J. Deep Learning in Neural Networks: An Overview. Neural Netw. 2015, 61, 85–117. [CrossRef]
18. LeCun, Y.; Bengio, Y.; Hinton, G. Deep learning. Nature 2015, 521, 436–444. [CrossRef] [PubMed]
19. Makridakis, S.; Spiliotis, E.; Assimakopoulos, V. Statistical and Machine Learning Forecasting Methods:
Concerns and Ways Forward. PLoS ONE 2018, 13, e0194889. [CrossRef] [PubMed]
20. Alfred, R.; Obit, J.H.; Ahmad Hijazi, M.H.; Ag Ibrahim, A.A. A performance comparison of statistical and
machine learning techniques in learning time series data. Adv. Sci. Lett. 2015, 21, 3037–3041.
21. Hinton, G.E.; Salakhutdinov, R.R. Reducing the dimensionality of data with neural networks. Science 2006,
313, 504–507. [CrossRef]
22. Lin, P.; Tao, J. A Novel Bearing Health Indicator Construction Method Based on Ensemble Stacked
Autoencoder. In Proceedings of the 2019 IEEE International Conference on Prognostics and Health
Management (ICPHM), San Francisco, CA, USA, 17–20 June 2019; pp. 1–9.
23. Yin, C.; Zhang, S.; Wang, J.; Xiong, N.N. Anomaly Detection Based on Convolutional Recurrent Autoencoder
for IoT Time Series. IEEE Trans. Syst. Man Cybern. Syst 2020, 1–11. [CrossRef]
24. Hochreiter, S.; Schmidhuber, J. Long short-term memory. Neural Comput. 1997, 9, 1735–1780. [CrossRef]
25. Zhang, Y.; Xiong, R.; He, H.; Liu, Z. A LSTM-RNN method for the lithuim-ion battery remaining useful
life prediction. In Proceedings of the 2017 Prognostics and System Health Management Conference
(PHM-Harbin), Harbin, China, 9–12 July 2017; pp. 1–4.
26. Mathew, V.; Toby, T.; Singh, V.; Rao, B.M.; Kumar, M.G. Prediction of Remaining Useful Lifetime (RUL)
of turbofan engine using machine learning. In Proceedings of the 2017 IEEE International Conference on
Circuits and Systems (ICCS), Batumi, Georgia, 5–8 December 2017; pp. 306–311.
27. Zheng, S.; Ristovski, K.; Farahat, A.; Gupta, C. Long short-term memory network for remaining useful life
estimation. In Proceedings of the 2017 IEEE international conference on prognostics and health management
(ICPHM), Dallas, TX, USA, 19–21 June 2017; pp. 88–95.
28. Chen, J.; Jing, H.; Chang, Y.; Liu, Q. Gated recurrent unit based recurrent neural network for remaining useful
life prediction of nonlinear deterioration process. Reliab. Eng. Syst. Safe. 2019, 185, 372–382. [CrossRef]
29. Cho, K.; Van Merriënboer, B.; Gulcehre, C.; Bahdanau, D.; Bougares, F.; Schwenk, H.; Bengio, Y. Learning
phrase representations using RNN encoder-decoder for statistical machine translation. arXiv 2014,
arXiv:1406.1078.
30. Chung, J.; Gulcehre, C.; Cho, K.; Bengio, Y. Empirical evaluation of gated recurrent neural networks on
sequence modeling. arXiv 2014, arXiv:1412.3555.
31. Chen, Z.; Liu, Y.; Liu, S. Mechanical state prediction based on LSTM neural netwok. In Proceedings of the
2017 36th Chinese Control Conference (CCC), Dalian, China, 26–28 July 2017; pp. 3876–3881.
32. Zheng, J.; Xu, C.; Zhang, Z.; Li, X. Electric load forecasting in smart grids using long-short-term-memory
based recurrent neural network. In Proceedings of the 2017 51st Annual Conference on Information Sciences
and Systems (CISS), Baltimore, MD, USA, 22–24 March 2017; pp. 1–6.
33. ElSaid, A.; Wild, B.; Higgins, J.; Desell, T. Using LSTM recurrent neural networks to predict excess vibration
events in aircraft engines. In Proceedings of the 2016 IEEE 12th International Conference on e-Science,
Baltimore, MD, USA, 23–27 October 2016; pp. 260–269.
Processes 2020, 8, 1155 18 of 18
34. Cenggoro, T.W.; Siahaan, I. Dynamic bandwidth management based on traffic prediction using Deep Long
Short Term Memory. In Proceedings of the 2016 2nd International Conference on Science in Information
Technology (ICSITech), Balikpapan, Indonesia, 26–27 October 2016; pp. 318–323.
35. Bao, W.; Yue, J.; Rao, Y. A deep learning framework for financial time series using stacked autoencoders and
long-short term memory. PLoS ONE 2017, 12, e0180944. [CrossRef]
36. Zhang, Q.; Wang, H.; Dong, J.; Zhong, G.; Sun, X. Prediction of sea surface temperature using long short-term memory. IEEE Geosci. Remote Sens. Lett. 2017, 14, 1745–1749. [CrossRef]
37. How, D.N.T.; Sahari, K.S.M.; Yuhuang, H.; Kiong, L.C. Multiple sequence behavior recognition on humanoid
robot using long short-term memory (LSTM). In Proceedings of the 2014 IEEE international symposium
on robotics and manufacturing automation (ROMA), Kuala Lumpur, Malaysia, 15–16 December 2014;
pp. 109–114.
38. Truong, A.M.; Yoshitaka, A. Structured LSTM for human-object interaction detection and anticipation.
In Proceedings of the 2017 14th IEEE International Conference on Advanced Video and Signal Based
Surveillance (AVSS), Lecce, Italy, 29 August–1 September 2017; pp. 1–6.
39. Kuan, L.; Yan, Z.; Xin, W.; Yan, C.; Xiangkun, P.; Wenxue, S.; Zhe, J.; Yong, Z.; Nan, X.; Xin, Z. Short-term
electricity load forecasting method based on multilayered self-normalizing GRU network. In Proceedings
of the 2017 IEEE Conference on Energy Internet and Energy System Integration (EI2), Beijing, China,
26–28 November 2017; pp. 1–5.
40. Zhang, D.; Kabuka, M.R. Combining weather condition data to predict traffic flow: A GRU-based deep
learning approach. IET Intell. Transp. Syst. 2018, 12, 578–585. [CrossRef]
41. Sharma, S. Activation functions in neural networks. Int. J. Eng. Appl. Sci. Technol. 2020, 4, 310–316.
[CrossRef]
42. Kingma, D.P.; Ba, J. Adam: A method for stochastic optimization. arXiv 2014, arXiv:1412.6980.
43. Dozat, T. Incorporating nesterov momentum into adam. In Proceedings of the 2016 International Conference
on Learning Representations (ICLR), San Juan, Puerto Rico, 2–4 May 2016; pp. 1–4.
© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access
article distributed under the terms and conditions of the Creative Commons Attribution
(CC BY) license (http://creativecommons.org/licenses/by/4.0/).