
Article
An Autoencoder Gated Recurrent Unit for Remaining
Useful Life Prediction
Yi-Wei Lu 1, Chia-Yu Hsu 1,* and Kuang-Chieh Huang 2
1 Department of Industrial Engineering and Management, National Taipei University of Technology,
Taipei 10608, Taiwan; [email protected]
2 Department of Information Management, Yuan Ze University, Taoyuan 32003, Taiwan;
[email protected]
* Correspondence: [email protected]; Tel.: +886-2-27712171

Received: 6 August 2020; Accepted: 11 September 2020; Published: 15 September 2020

Abstract: With the development of smart manufacturing, in order to detect abnormal conditions of
the equipment, a large number of sensors have been used to record the variables associated with
production equipment. This study focuses on the prediction of Remaining Useful Life (RUL). RUL
prediction is part of predictive maintenance, which uses the development trend of the machine to
predict when the machine will malfunction. High accuracy of RUL prediction not only reduces
the consumption of manpower and materials, but also reduces the need for future maintenance.
This study focuses on detecting faults as early as possible, before the machine needs to be replaced
or repaired, to ensure the reliability of the system. It is difficult to extract meaningful features from
sensor data directly. This study proposes a model based on an Autoencoder Gated Recurrent Unit
(AE-GRU), in which the Autoencoder (AE) extracts the important features from the raw data and the
Gated Recurrent Unit (GRU) selects the information from the sequences to forecast RUL. To evaluate
the performance of the proposed AE-GRU model, an aircraft turbofan engine degradation simulation
dataset provided by NASA was used, and the model was compared with different recurrent neural
networks. The results demonstrate that the AE-GRU outperforms other recurrent neural networks, such as
Long Short-Term Memory (LSTM) and GRU.

Keywords: remaining useful life; predictive maintenance; deep learning; autoencoder; gated
recurrent unit

1. Introduction
Industry 4.0 depends on automated, smart factories, and employs sensor data and the methods of
big data analysis, in the expectation that the equipment in the factory will operate automatically and
self-correct to improve the product yield [1]. Sensor-related data are collected during the manufacturing
processes, such as temperature, pressure, power, humidity, and chemical analysis for equipment
monitoring [2,3]. These temporal patterns represent the equipment condition, and poor product quality
is often associated with abnormal changes of environment or inappropriate operation settings [4].
Equipment condition can be recorded by sensor data from the past and kept as time series data. When
analyzing equipment sensor data, not only the large amount of data recorded should be considered,
but also the time series characteristics. The idea of a smart factory is to link and intellectualize the
manufacturing process. In the past, automation merely used machines to improve production efficiency
and yield and to reduce costs, but intelligence further applies technologies such as IoT sensors to monitor
and control production lines. The combination of cloud computing, data analysis, and software and
hardware integration is also an important part of intelligence. Machines communicate with automation
equipment [5]. However, there are numerous parameters in the smart factory, and the data are often
highly correlated between equipment and processes. Therefore, the analysis must go beyond a single
stage in the equipment or process, and be comprehensive.
Equipment maintenance methods are divided into the following three types: (1) corrective (repair)
maintenance [6]; (2) preventive maintenance [7]; and (3) predictive maintenance [8]. Corrective maintenance
is performed after some or all of the equipment fails. The most
common equipment maintenance application is preventive maintenance, also known as scheduled
maintenance, which carries out machine inspections, component replacement and other maintenance at
prescribed times. Predictive maintenance is an equipment maintenance method based on the condition
of the machine. It predicts the time when the machine may be damaged according to the development
trend of the past.
Predictive Maintenance (PdM) has been used to monitor the historical health status of equipment
and make timely adjustments to the equipment, which is quite different from the methods of routine
maintenance employed in the past. PdM not only saves unnecessary costs, but also allows early repairs
when the equipment is about to break down. It can avoid unpredictable machine stoppages caused
by unexpected breakdown and improper operation. Remaining Useful Life (RUL) is an important
indicator in PdM. The definition of device or system RUL is the period from the current state to the time
when the device begins to operate abnormally [9]. There are two types of RUL prediction: model-based
and data-driven [10]. The model-based method establishes the model based on the historical trend
of equipment. Generally, model-based methods perform better than data-driven methods when
there is little historical data. Data-driven methods use the health status data and data collected by
sensors. For modeling, the collected data must be closely related to the device status. The advantage
of data-driven methods is that the algorithm is based on learning the correlation between input and
output data. If there are sufficient data, the stability and accuracy of data-driven methods are better
than those of model-based methods.
RUL prediction can be described as a time series forecasting problem. The main purpose of time
series forecasting is to build models from historical records and predict future situations. Time series
forecasting can be divided into statistical models, machine learning models and deep learning models.
Common statistical models include Autoregression (AR), Moving Average (MA), Autoregressive
Moving Average (ARMA), and Autoregressive Integrated Moving Average (ARIMA). Machine learning
builds a function or model to detect patterns in training samples and then uses these patterns for
prediction [11]. Support Vector Machine (SVM) [12], Random Forest (RF) [13], and Extreme Gradient
Boosting (XGB) [14] are representative machine learning algorithms for time series forecasting, but it
is difficult to design an effective machine learning algorithm without substantial knowledge of the
data [15]. In addition, the existing methods for analyzing time series data need pre-processing based
on experience, which causes some losses of information. With improved process technology and more
sensors, few existing methods can handle multi-sensor data, which is likely to cause misjudgment or
missed detection.
Deep learning methods are preferred to predict and analyze time series data without predefined
features. Deep learning is a branch of machine learning based on artificial neural networks with
representation learning [16–18], which can learn the key information to make a response by using many
hidden layers integrated in the network. The main difference between deep learning and machine
learning is feature extraction. Usually, the features of machine learning are manually selected in
advance; deep learning not only identifies features through model training, but is also much better
than other algorithms at calculating results. Makridakis et al. [19] compared several time series analysis
methods on the M3-Competition dataset. Most of the machine learning and deep learning algorithms
performed better than traditional statistical methods. Alfred et al. [20] demonstrated that algorithms based
on neural networks are more efficient in learning time series data than other algorithms.
Feature extraction is an important data pre-processing method in machine learning for high
prediction accuracy. In the past, the dimension reduction of most data features was accomplished
through Principal Component Analysis (PCA). PCA retains the maximum variation of features by
Processes 2020, 8, 1155 3 of 18

projecting variables into another space to represent the original complete data with fewer features, to
achieve the effect of data dimension reduction. The difference between PCA and Autoencoder (AE)
is that AE is a non-linear dimension reduction method [21], and an unsupervised training method.
First, the model compresses the original data through the encoder structure and then restores the
compressed data through the decoder. The AE rebuilds the original data with fewer features for data
representation [22]. It can compress the key information of high dimensional input data into low
dimensional features. AE is often used for feature extraction and dimensional reduction in time series
forecasting problems [23]. Moreover, machine learning methods require manual selection of features,
and the quality of the features affects the models' results. Deep learning has the ability to automatically
extract features. It can reduce the time spent on feature engineering and help experts to make decisions.
Most of the relevant studies on RUL prediction use deep learning as a prediction model, which not
only can save time in selecting features, but also makes prediction results much better than traditional
machine learning models.
Recurrent Neural Networks (RNN) are a deep learning model that deals specifically with time
series data and can select important features from equipment sensors. Long Short-Term Memory
(LSTM) networks are a variant of RNN [24]. For example, Zhang et al. [25] constructed an LSTM model
to predict the RUL of lithium batteries, predicting the number of full charge and discharge cycles after
which the capacity falls below the normal threshold, based on the historical rate of capacity decline.
Heimes [9] set the maximum remaining engine life at 130 on the PHM08 data set, and predicted the
RUL using an RNN model. Zhang et al. [25] also constructed an LSTM model to predict the RUL
of the jet engine. One hundred jet engines were used as training data and 100 were used as testing
data. Each jet engine has 24 features that record the sensor value from normal to fault, and the RUL
are predicted by the change in these values. Mathew et al. [26] predicted the RUL of jet engines, using
24 parameters of the original data and comparing ten machine learning methods (Linear Regression,
Decision Tree, SVM, Random Forest, KNN, K-means, Gradient Boost, Adaboost, Deep Learning,
ANOVA), and verified the validity of the model through RMSE. Zheng et al. [27] used an LSTM model
to predict the RUL on C-MAPSS data set, PHM08 data set, and Milling data set. The LSTM model
is better than the other models. Chen et al. [28] proposed a two-phase model for RUL prediction by
Kernel Principal Component Analysis (KPCA) and Gated Recurrent Unit (GRU), respectively. To the
best of our knowledge, little research has integrated AE and GRU for RUL prediction.
This study develops an Autoencoder Gated Recurrent Unit (AE-GRU) model to predict the RUL
of equipment. In particular, AE is used to select features, and then the correlation between sensors is
found through the GRU model, and the RUL is predicted by a Multi-Layer Perceptron (MLP) model.
The first part is the Autoencoder (AE) model which is composed of an encoder and decoder. The
second part is the GRU model which is a type of RNN that can deal with time series data. The GRU
model finds key information in historical sensor data in combination with the MLP model. The MLP
model calculates the extracted information and combines with the backpropagation algorithm to
predict the RUL effectively.

2. Literature Review
Time series are statistical data that arrange events or observations in order of their occurrence. The main
purpose of time series forecasting is to build models from historical records and predict future situations.

2.1. Recurrent Neural Networks


RNN is the most common time series analysis method in deep learning. The main difference
between RNN and general neural network is that the neurons of RNN between hidden layers are not
independent, but influence each other. The neurons of RNN also operate in order. The neurons of
the recurrent neural network have the function of temporarily storing memory. The previously input
data will be temporarily stored in the internal memory, so the neuron can have different output values
according to the previous state. However, in the training of RNN, too many hidden layer neurons and
different weights will cause the problem of the vanishing gradient because of the heavy calculation,
which will let the gradient descent method stay at a local minimum. The traditional neural network has
different parameters to be learned in each hidden layer, but the parameters of the RNN only have
different inputs. This feature greatly reduces the training burden of the model.

LSTM solves the problem of the vanishing gradient in the stochastic gradient descent of recurrent
neural networks. The biggest difference between the LSTM and RNN is that each neuron of LSTM
has a control gate function, which is input, forget, and output. These gates have their own weights.
The calculation of the weights after data input determines whether the switches are turned on or off.
The input gate controls whether data can be written to the memory space, and the forget gate
determines whether the contents of the previous memory space are retained. The output gate controls
whether the memory space operation result can be output. Although adding these gates to each
neuron generates more weights to be calculated, it can solve the problem of the vanishing gradient
found in the recurrent neural network.

Figure 1 is a diagram of the LSTM network unit architecture, and the related formulas are shown
as Equations (1)–(6). The LSTM unit receives the input vector $x_t$ and the previous output $h_{t-1}$.
Each unit contains four gates (i, g, f, o) and a cell. $x$ is the input vector of this layer, $h$ is the output
vector, $\odot$ is the element-wise multiplication, $W$ is the weight matrix, $b$ is a bias (error) vector,
and (i, f, o, g, c) represent the input gate, forget gate, output gate, cell input, and cell activation,
respectively. The forget gate controls how much information each memory unit needs to forget,
the input gate controls how much new information each memory unit adds, and the output gate
controls how much information each memory unit outputs. The cell input combines the information
of the current time and the information of the previous time. The cell activation controls whether the
input gate and forget gate information can be obtained.

$$i_t = \mathrm{sigmoid}(W_{xi} x_t + W_{hi} h_{t-1} + b_i) \quad (1)$$

$$f_t = \mathrm{sigmoid}(W_{xf} x_t + W_{hf} h_{t-1} + b_f) \quad (2)$$

$$o_t = \mathrm{sigmoid}(W_{xo} x_t + W_{ho} h_{t-1} + b_o) \quad (3)$$

$$g_t = \tanh(W_{xg} x_t + W_{hg} h_{t-1} + b_g) \quad (4)$$

$$c_t = f_t \odot c_{t-1} + i_t \odot g_t \quad (5)$$

$$h_t = \tanh(c_t) \odot o_t \quad (6)$$

Figure 1. Structure of Long Short-Term Memory (LSTM) unit.
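To make Equations (1)–(6) concrete, the following is a minimal NumPy sketch of a single LSTM step. It is illustrative only: the dictionary-based weight layout and the function name lstm_step are our own, not the paper's implementation.

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def lstm_step(x_t, h_prev, c_prev, W, b):
        """One LSTM step following Equations (1)-(6).
        W is a dict of weight matrices, b a dict of bias vectors."""
        i_t = sigmoid(W["xi"] @ x_t + W["hi"] @ h_prev + b["i"])  # input gate, Eq. (1)
        f_t = sigmoid(W["xf"] @ x_t + W["hf"] @ h_prev + b["f"])  # forget gate, Eq. (2)
        o_t = sigmoid(W["xo"] @ x_t + W["ho"] @ h_prev + b["o"])  # output gate, Eq. (3)
        g_t = np.tanh(W["xg"] @ x_t + W["hg"] @ h_prev + b["g"])  # cell input, Eq. (4)
        c_t = f_t * c_prev + i_t * g_t                            # cell state, Eq. (5)
        h_t = np.tanh(c_t) * o_t                                  # hidden output, Eq. (6)
        return h_t, c_t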

The gated recurrent unit (GRU) is a variant of LSTM [29]. As shown in Figure 2, ($z$, $r$, $H$, $\tilde{H}$)
are the update gate, reset gate, activation, and candidate activation, respectively. The detailed formulas
are shown in Equations (7)–(10). The update gate is used to control how much historical information
and new information needs to be forgotten in the current state. The reset gate is used to control how
much information is available from the candidate state and how much historical information needs
to be retained. The candidate activation can be regarded as the new information at the current time.
The activation is generated by the update gate and the candidate activation. The update gate in the
activation controls how much new and old information is retained. New information and old
information have a complementary relationship: if a lot of new information is retained, less old
information is considered, and vice versa.

$$z_t = \mathrm{sigmoid}(W_{xz} x_t + W_{hz} H_{t-1} + b_z) \quad (7)$$

$$r_t = \mathrm{sigmoid}(W_{xr} x_t + W_{hr} H_{t-1} + b_r) \quad (8)$$

$$\tilde{H}_t = \tanh(W_{x\tilde{H}} x_t + W_{h\tilde{H}} (r_t \odot H_{t-1}) + b_{\tilde{H}}) \quad (9)$$

$$H_t = z_t \odot H_{t-1} + (1 - z_t) \odot \tilde{H}_t \quad (10)$$

Figure 2. Structure of gated recurrent neural network.
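As with the LSTM above, a minimal NumPy sketch of Equations (7)–(10) may help; the names and the dictionary weight layout are again illustrative assumptions.

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def gru_step(x_t, H_prev, W, b):
        """One GRU step following Equations (7)-(10)."""
        z_t = sigmoid(W["xz"] @ x_t + W["hz"] @ H_prev + b["z"])  # update gate, Eq. (7)
        r_t = sigmoid(W["xr"] @ x_t + W["hr"] @ H_prev + b["r"])  # reset gate, Eq. (8)
        H_cand = np.tanh(W["xh"] @ x_t + W["hh"] @ (r_t * H_prev) + b["h"])  # candidate, Eq. (9)
        H_t = z_t * H_prev + (1.0 - z_t) * H_cand                 # new activation, Eq. (10)
        return H_t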

GRU simplifies the input gate and forget gate of LSTM into an update gate, and combines the cell
states and hidden states together. Therefore, the GRU unit retains the advantages of LSTM, and further
reduces the model training time by reducing the parameters in the model. According to the analysis
results in [30], the results of the RNN model are poor and the results of GRU and LSTM are similar,
and better than RNN. However, the GRU has fewer parameters than LSTM, so the trend in deep
learning models applied to time series analysis is toward GRU.

In view of the excellent feature extraction performance of AE, and the fast calculation advantage
of GRU, in this study the RUL prediction model works by extracting the important features from the
original data using AE. After pre-processing of the extracted features, the GRU model and DNN
(deep fully connected neural networks) are applied to predict the RUL.

2.2. RNN Applications in Time Series Forecasting
Chen et al. [31] performed a prediction of mechanical state (PMS), and extracted the collected sensor
data of the machine through the feature extraction of empirical mode decomposition. This method
makes the unstable signal slightly stable for building the LSTM model. Zheng et al. [32] predicted the
short-term power system load capacity, with the time range collected from a few hours to several weeks.
The characteristics of power system load data are non-linear, non-stationary, and non-seasonal, so the
research applies an LSTM model to solve this problem. ElSaid et al. [33] applied an LSTM model
to predict aircraft engine vibration values in the next 5, 10, and 20 s. In total, 76 parameters are
recorded in the aircraft flight data recorder. Fifteen key features were selected by experts with a
professional background. The features were normalized to make the values range between 0 and 1,
and used to make predictions. Cenggoro and Siahaan [34] constructed a DLSTM (Deep Long
Short-Term Memory) neural network to predict traffic flow. The input layer had 5 input values,
there were 5 hidden layers, and each hidden layer had 100 neurons. Bao et al. [35] conducted a stock
price prediction divided into three stages. Stage one applied a wavelet transform algorithm to remove
noise from the original data. Stage two stacks AEs to reduce the data dimensions and pick out features
automatically. Stage three applies an LSTM model to make predictions. Zhang et al. [36] divided Sea
Surface Temperature (SST) prediction into two parts. The first part is short-term forecasting, which
predicts the SST after 1 and 3 days. The second part is long-term forecasting, which predicts the weekly
average and monthly average of SST. It finds the relationship between sequences through LSTM,
and then makes the prediction through the fully connected layer. How et al. [37] used the angle
changes on different motion sensors of the NAO robot to classify the current actions of the robot
through an LSTM model. Truong et al. [38] constructed an LSTM model to identify changes in human
type and object type, and even predict possible human actions through the action combinations.
Kuan et al. [39] constructed an MS-GRU (multilayered self-normalizing gated recurrent units) model,
which has good results in predicting power loading. Zhang and Kabuka [40] used the GRU model to
predict the traffic volume. The training data included the historical weather data of the previous 100 h
to predict the traffic volume in the next 12 h. It was found that weather conditions can improve the
prediction accuracy.

3. Research Framework
This study constructs an AE-GRU model for RUL prediction to increase equipment life, reduce
unexpected harm caused by sudden shutdown of machinery, and improve the reliability of system
operation. The proposed AE-GRU includes the steps of data pre-processing, feature extraction, and RUL
prediction, as shown in Figure 3. The data pre-processing step first defines the engine life to predict
the exact RUL of the engine, and then standardizes the data to avoid errors caused by different scales
of the data. It then uses the deep learning autoencoder model to extract the features of the original
data. GRU is specialized in processing time series related data, and considers the changes of historical
characteristics. The data are converted from 2D to 3D; the new dimension is the time data. The last
step of data processing is to change the value range of the RUL, because with the original linearly
decreasing data the results in model training and prediction are not optimal. Finally, the RUL
expectancy is predicted through the GRU model and a fully-connected layer neural network.

Figure 3. Research framework of Autoencoder Gated Recurrent Unit (AE-GRU).

The symbols used in this research are defined as follows:

$N$: number of engines
$S$: number of sensors
$n$: length of sensing data
$Y$: number of neurons
$L$: number of hidden layers
$W$: weight
$b$: bias (error) vector
$Features$: new features after extraction
$X_{ij}$: the sensing data of the $j$th sensor in the $i$th engine; $i = 1 \ldots N$, $j = 1 \ldots S$
$X_{it}$: the sensing data at the current time $t$ in the $i$th engine; $i = 1 \ldots N$, $t = 1 \ldots T$
$h_{ly}$: the $y$th neuron in the $l$th hidden layer; $l = 1 \ldots L$, $y = 1 \ldots Y$
$T$: engine usage time
$RUL$: remaining life
$RUL'$: remaining life after transform
$\mu$: the average value of sensing data $x_i$
$\sigma$: standard deviation of sensing data $x_i$
$x'_i$: data after standardization, where $i$ is the data index; $i = 1 \ldots n$
$y_i$: true remaining life; $i = 1 \ldots n$
$y'_i$: predicted remaining life; $i = 1 \ldots n$

3.1. Data Pre-Processing


In the original engine data, there are only usage records for each engine, and there is no specific
RUL of the engine recorded. Therefore, this study defines the time from good to bad for each engine,
so that we can apply supervised learning methods to the experiment. It is possible to predict the RUL
accurately using the parameter changes recorded by the sensors. As shown in Table 1, after finding the
available time for each engine record (Equation (11)), the difference between the maximum time and
the current time (Equation (12)) is the RUL of the engine.

$$T = \max_t(X_{it}) \quad (11)$$

$$RUL = T - X_{it} \quad (12)$$

Table 1. Definition of Remaining Useful Life (RUL).

Engine   Usage Time   Parameter 1   Parameter 2   ...   Parameter 24   RUL
1        1            10.0047       0.2501        ...   17.1735        222
1        2            0.0015        0.0003        ...   23.3619        221
...      ...          ...           ...           ...   ...            ...
1        223          34.9992       0.84          ...   8.6695         0
...      ...          ...           ...           ...   ...            ...
218      132          35.007        0.8419        ...   8.6761         1
218      133          25.007        0.6216        ...   8.512          0
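This labeling (Equations (11) and (12), Table 1) can be sketched in a few lines of pandas; the column names engine and time are illustrative assumptions, not the dataset's actual field names.

    import pandas as pd

    def add_rul(df: pd.DataFrame) -> pd.DataFrame:
        """Label each record with its RUL following Equations (11) and (12)."""
        df = df.copy()
        max_time = df.groupby("engine")["time"].transform("max")  # T = Max(X_it), Eq. (11)
        df["RUL"] = max_time - df["time"]                         # RUL = T - X_it, Eq. (12)
        return df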

In the original engine data, the scales of the values measured by different sensors are different,
such as pressure, temperature, humidity, etc. In order to avoid the difference in scales and make all
the sensor values standard, the original data are all converted to Z-scores. The average value $\mu$ is the
sum of the sensor data $x_i$ divided by $n$, the number of observations (Equation (13)). The standard
deviation $\sigma$ is also calculated (Equation (14)). The transformed data $x'_i$ is calculated as the
difference between the sensor value and the average value, divided by the standard deviation
(Equation (15)). The average value of $x'_i$ is equal to 0 and the standard deviation equal to 1.

$$\mu = \frac{1}{n}\sum_{i=1}^{n} x_i \quad (13)$$

$$\sigma = \sqrt{\frac{1}{n}\sum_{i=1}^{n} (x_i - \mu)^2} \quad (14)$$

$$x'_i = \frac{x_i - \mu}{\sigma} \quad (15)$$
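A direct per-sensor implementation of Equations (13)–(15) is a few lines of pandas (a sketch; sensor_cols is an assumed list of sensor column names):

    def standardize(df, sensor_cols):
        """Convert each sensor column to Z-scores per Equations (13)-(15)."""
        df = df.copy()
        for col in sensor_cols:
            mu = df[col].mean()               # average value, Eq. (13)
            sigma = df[col].std(ddof=0)       # population standard deviation, Eq. (14)
            df[col] = (df[col] - mu) / sigma  # standardized data, Eq. (15)
        return df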
3.1.1. Feature Extraction

AE is used to reduce the data dimension and achieve the effect of feature extraction. As shown
in Figure 4, the encoder reduces the dimension of the data and transforms the original data into a
new space. The features in this space can describe the data more concisely than the original features.
The orange neurons in the middle of the figure are the features in the new space. The feature value is
obtained by multiplying the weight matrix with the neurons of the previous hidden layer and adding
the bias (Equation (16)), where $f(\cdot)$ is the activation function, which is ReLU in this model, $W$ is
the weight matrix, $h_l$ is the output of the $l$th hidden layer, and $b$ is the bias (error) vector.

$$Features = f(W \ast h_l) + b \quad (16)$$

Figure 4. AE structure.
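A compact Keras sketch of such an autoencoder may be helpful; the 24-to-15 compression matches the setting evaluated in Section 4.3, but the single-layer encoder/decoder layout here is a simplified assumption rather than the paper's exact architecture.

    from tensorflow import keras
    from tensorflow.keras import layers

    n_inputs, n_features = 24, 15  # 24 raw sensors compressed to 15 features

    inputs = keras.Input(shape=(n_inputs,))
    encoded = layers.Dense(n_features, activation="relu")(inputs)  # encoder, cf. Eq. (16)
    decoded = layers.Dense(n_inputs)(encoded)                      # decoder rebuilds the input

    autoencoder = keras.Model(inputs, decoded)
    encoder = keras.Model(inputs, encoded)  # used afterwards to extract features
    autoencoder.compile(optimizer="nadam", loss="mse")
    # autoencoder.fit(X_train, X_train, epochs=100, batch_size=128)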

The data collected by sensors is a series of continuous data. However, the general model only uses
the current sensor value to predict the RUL; the change in historical sensor values is not considered.
In this study, a new data processing method is proposed to take historical information into account.
The format of the data will be converted from the original two-dimensional (samples, features) to
three-dimensional (samples, time steps, features). Zero-padding is often used in signal processing or
Convolutional Neural Networks (CNN). The purpose is to keep or increase the dimension of the data
without affecting the information of the data itself. In this study, when the current sensor record has
no historical data, the zero-padding method will be used to keep the data dimension.
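A sketch of this 2D-to-3D conversion with zero-padding (NumPy; the function below assumes the rows of X belong to a single engine and are in time order):

    import numpy as np

    def to_sequences(X, window):
        """Convert (samples, features) into (samples, time steps, features).
        Early records without enough history are zero-padded at the front."""
        n, f = X.shape
        out = np.zeros((n, window, f))
        for t in range(n):
            start = max(0, t - window + 1)
            hist = X[start:t + 1]
            out[t, window - len(hist):, :] = hist  # most recent record last
        return out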

3.1.2. Maximum RUL Transformation

The maximum and minimum life of the engine is 356 and 127 cycles, respectively, and the average
engine life is 209. The literature related to the prediction of RUL converts the maximum RUL to a
specific reasonable value in the experiment. In this study, the transform process refers to the setting of
T = 130 in Heimes [9]. An RUL greater than T will be defined as T; an RUL less than T will not change
(Equation (17)). Figure 5 shows the life transform of one of the engines in the data set. Through this
conversion, the RUL changes from the original linear decline to a nonlinear decline. Both the model
training and the estimation of the RUL have better results in this setting.

$$RUL' = \begin{cases} T, & \text{if } RUL \geq T \\ RUL, & \text{if } RUL < T \end{cases} \quad (17)$$

Figure 5. Maximum life value transform: (a) original RUL; (b) RUL after maximum useful life transform.
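Equation (17) amounts to capping the label; for example, in NumPy (assuming rul holds the labels from Table 1):

    import numpy as np

    T = 130  # maximum RUL, following Heimes [9]
    rul_transformed = np.minimum(rul, T)  # Eq. (17): values >= T become T, others unchanged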
3.2. RUL Prediction Model

The RUL prediction model of this study combines the advantages of effective feature extraction
and fast calculation. The RUL prediction model is divided into four steps, as shown in Figure 6.
Step 1 is the input layer: $X_t$ represents the sensor data collected at the current time, $X_{t-1}$ the
sensor data collected at the previous moment, and $X_{t-2}$ the sensor data collected two moments
earlier. The number of historical time points of sensor data to input to the model is adjustable. Step 2
is the GRU layer. The number of GRU layers is set to two in this paper. The number of GRU neurons
will be tuned in the experimental part, and the best parameter combination selected for the research.
The purpose of this step is to identify time series correlation in the input data through GRU. However,
GRU cannot directly predict the RUL; therefore, a DNN must be added to predict the RUL. Step 3
is the DNN layer. The DNN converts the features extracted by GRU to the prediction dimension to
perform the prediction. The best parameter combination from the experimental results will be used to
decide the number of neurons, as in Step 2. The objective function of the DNN is applied to calculate
the difference between the predicted value and the true value. In this study, Mean Square Error (MSE)
is used as the objective function (Equation (18)), and the gradient descent method is used to minimize
the objective function and train the model. Step 4 is the output layer of the model. The output is the
value of the predicted RUL at the current time. The model evaluation is conducted by calculating the
Root Mean Square Error (RMSE).

$$MSE = \frac{1}{n}\sum_{i=1}^{n} (y_i - y'_i)^2 \quad (18)$$
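A Keras sketch of Steps 1–4 is given below, using the configuration selected later in Section 4.2.3 (two GRU layers of 64 units and two dense layers of 8 neurons); it is an illustrative reading of Figure 6, not the authors' released code.

    from tensorflow import keras
    from tensorflow.keras import layers

    time_steps, n_features = 5, 15  # windowed AE features, as in Section 4.3

    model = keras.Sequential([
        keras.Input(shape=(time_steps, n_features)),  # Step 1: input layer
        layers.GRU(64, return_sequences=True),        # Step 2: first GRU layer
        layers.GRU(64),                               #         second GRU layer
        layers.Dense(8, activation="relu"),           # Step 3: DNN layers
        layers.Dense(8, activation="relu"),
        layers.Dense(1),                              # Step 4: predicted RUL
    ])
    model.compile(optimizer="nadam", loss="mse")      # MSE objective, Eq. (18)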
Figure 6. Model structure of AE-GRU for RUL prediction.

4. Result Evaluation

4.1. Data Collection

To validate the proposed AE-GRU, a Turbofan Engine Degradation Simulation Data Set
(Prognostics and Health Management 2008, PHM08) (https://www.nasa.gov/) is used for performance
evaluation in this section. It was carried out with a NASA tool, C-MAPSS, to simulate real large
commercial aircraft turbofan engines. The data consist of many time series of cycles, which come from
different engines of the same type. Each engine starts with a different wear level. The first section
introduces the engine data collected by the sensors, and the second section describes the parameter
settings of the model: the performance of different activation functions is compared and optimized,
and, finally, the number of hidden layers and neurons is compared. In the third section, we extract the
best parameters, apply them to different models, and test the models by analyzing the root mean
square error (RMSE).

There are data for 218 turbo engines, and a total of 45,918 records in the dataset. Every record
has 26 original features and the customized RUL. The experimental design and verification of this
study adopts k-fold cross-validation to ensure the stability of the model.

Figure 7 is a diagram of the sensor parameters of the turbine engines in this study. The sensor
parameters of Engine 1 are presented visually. The parameter values of the sensors vary in range,
and the parameter trends are not consistent. The RUL of the engine will be predicted by the changes
of these sensor parameters in this research.

Figure 7. Diagram of turbine engine sensor parameters.

4.2. Hyperparameters Setting

Based on the experience of adjusting the model, the hyperparameters in the neural network will
significantly affect the results of the model. This section finds the best parameter combination to apply
in the model to predict the RUL value.
In experiments on neural networks, the number of hidden layers and the number of neurons are
often the most difficult to choose. The training results will also depend on the choice of batch size
and epochs. Batch size means the number of samples to work through in one iteration; an epoch is
completed when the model has gone through enough batches to cover the entire training dataset once.
For example, if a training dataset has 3000 records and the batch size is set to 500, it takes 6 iterations
to complete an epoch.
4.2.1. Activation Function
The main function of the activation function in the neural network is to introduce nonlinear
characteristics. If there is no activation function in the neural network, the input and output will only
correlate with a simple linear relationship and will not be able to handle complex issues, so the
activation function is very important in deep learning.
This experiment sets the following parameters to observe the training result of the GRU model
with different activation functions:
1. Epochs: 100;
2. Optimizers: Adam;
3. Batch_size: 128;
4. Hidden layer, number of neurons: 1, 50;
5. Inputs: 24.
This experiment uses 5-fold cross-validation to compare a total of seven activation functions and
calculates the average loss of each activation function: softmax (3055.5), softplus (517.0), softsign (438.9),
relu (434.8), tanh (445.3), sigmoid (990.7), hard sigmoid (1051.9). According to the training results,
we find that the softmax performance is relatively poor, and the relu activation function performs
best (Figure 8). The softmax activation function is a blend of multiple sigmoids, which is relatively
suited to multiclass classification problems rather than a regression problem. The advantage of the
relu activation function is that neuron deactivation will only happen when the output of the linear
transformation is zero. The neurons will not all be activated at the same time, but a certain number of
neurons will be activated at a time. The objective function of relu can be expressed as [41]:

$$f(x) = \max(0, x) \quad (19)$$
In addition,
In addition, since
since the
the relu
relu activation function does
activation function does not
not need
need to
to perform
perform exponential
exponential calculations,
calculations,
the convergence speed is fast and the calculation complexity levels are low. So, this study adopted
the convergence speed is fast and the calculation complexity levels are low. So, this study adopted relu
relu as the activation function.
as the activation function.

Figure 8. Comparison of training result in different


different activation
activation function
function

4.2.2. Optimizer
The purpose of optimizers is to minimize the loss function and the gap between the predicted
value and the real value. This experiment sets the following parameters and observes the training
result of the GRU model under different optimizers:
1. Epochs: 100;
2. Activation: relu;
3. Batch_size: 128;
4. Hidden layer, number of neurons: 1, 50;
5. Inputs: 24.
This experiment applies 5-fold cross-validation to compare a total of seven optimization
algorithms: SGD (569.8), RMSprop (438.7), Adagrad (4586.9), Adadelta (426.6), Adam (436.6),
Adamax (436.8), and Nadam (420.7). The parameter setting of each optimization algorithm is the best
result proposed in the literature. Kingma and Ba [42] proposed Adam's optimization algorithm;
most current neural networks apply Adam to optimize the loss function. The advantages of Adam are
quick convergence and dealing with high noise and the problem of sparse gradients. The Nadam
optimization algorithm proposed by Dozat [43] changes the momentum part of Adam and accelerates
the convergence rate of the model. The study confirms that on this PHM08 dataset, Nadam converges
faster than Adam (Figure 9). Therefore, the optimizer used in this study is Nadam.

Figure 9. Comparison of training results with different optimizers.
4.2.3. Number of Hidden Layers and Neurons
This experiment compares three different batch sizes: 32 (Table 2), 64 (Table 3), and 128 (Table 4).
The early stopping function has been adopted in the experiment; if the validation MSE does not
improve for 10 consecutive epochs, the training stops and the previous best model is used on the test
dataset. Early stopping is commonly used with gradient descent to avoid overfitting, as sketched
below. The GRU and DNN layers are set to two layers. The RMSE results for different numbers of
neurons in training are compared. The input of this experiment is the 24 parameters of the current
raw sensor data, and there is no historical information (time steps = 1).
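In Keras, this early stopping rule can be expressed roughly as follows (a sketch; the monitored quantity and patience mirror the description above):

    from tensorflow.keras.callbacks import EarlyStopping

    early_stop = EarlyStopping(
        monitor="val_loss",         # validation MSE
        patience=10,                # stop after 10 epochs without improvement
        restore_best_weights=True,  # reuse the previous best model on the test set
    )
    # model.fit(X_train, y_train, validation_data=(X_val, y_val),
    #           batch_size=128, epochs=100, callbacks=[early_stop])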
Table 2. Comparison of Root Mean Square Error (RMSE) result (batch size = 32).

Networks           Epochs   Training Time   RMSE
G(32,32)N(8,8)     33       98.4 s          20.18
G(64,64)N(8,8)     20       89.3 s          20.24
G(32,64)N(8,8)     19       76.8 s          20.37
G(32,64)N(16,16)   28       103.7 s         20.24
G(96,96)N(8,8)     25       150.3 s         20.20
G(96,96)N(16,16)   24       151.4 s         20.18

Table 3. Comparison of RMSE result (batch size = 64).

Networks Epochs Training Time RMSE


G(32,32)N(8,8) 21 52.0 s 20.32
G(64,64)N(8,8) 24 75.8 s 20.25
G(32,64)N(8,8) 23 63.8 s 20.33
G(32,64)N(16,16) 24 66.7 s 20.36
G(96,96)N(8,8) 15 80.1 s 20.44
G(96,96)N(16,16) 14 78.3 s 20.44

Table 4. Comparison of RMSE result (batch size = 128).

Networks Epochs Training Time RMSE


G(32,32)N(8,8) 42 60.4 s 20.16
G(64,64)N(8,8) 34 70.1 s 20.07*
G(32,64)N(8,8) 33 60.3 s 20.20
G(32,64)N(16,16) 52 85.6 s 20.19
G(96,96)N(8,8) 42 119.4 s 20.15
G(96,96)N(16,16) 36 115.7 s 20.12

The experimental results show that the larger the batch size, the faster the model training speed.
Therefore, the batch size in this study will be set to 128. When the batch size is larger, not only is the
training time shorter, but the RMSE results are also better. So, the network architecture of this study
will be set to G(64,64)N(8,8).

4.3. Model Evaluation

4.3.1. Model Comparison


This section uses 5-fold cross-validation to evaluate the validity of the AE-GRU RUL model in this study, and compares the results with those of existing models. In total, 218 engines are divided into five parts of 44, 44, 44, 44 and 42 engines. Four parts are used for training each time, and the remainder are test data. After splitting the data, the AE model reduces the dimension of the data, and the GRU then predicts the RUL of the engine. The performance of the models is compared using the RMSE. The dimension of the data is reduced from the 24 inputs of the original data to 15, 10, or 5. The AE extracts the characteristics of the original data by non-linear transformation through multiple hidden layers; the loss incurred in the compression and restoration process for each setting is shown in Table 5.

Table 5. Loss of dimension reduction by AE.

Reduced dimension     15           10           5
Reconstruction loss   7.7 × 10⁻⁶   2.6 × 10⁻⁵   3.0 × 10⁻⁵
The experimental results show that the dimension reduction works best when the input is reduced to 15 dimensions. With 10 or 5 dimensions, the reconstruction is poor because too much of the original information is lost. In this study, the AE therefore reduces the data dimension from 24 to 15, and the resulting performance is compared with that of other models.
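A minimal sketch of this dimension-reduction step is shown below, assuming a Keras implementation; the intermediate width of 20 units is an illustrative assumption, as the paper reports only the input and code sizes.

import tensorflow as tf

# Dense autoencoder: compress 24 sensor features to a 15-dimensional code,
# then reconstruct; the encoder half later feeds the GRU.
inputs = tf.keras.Input(shape=(24,))
h = tf.keras.layers.Dense(20, activation="relu")(inputs)
code = tf.keras.layers.Dense(15, activation="relu")(h)        # bottleneck
h = tf.keras.layers.Dense(20, activation="relu")(code)
recon = tf.keras.layers.Dense(24, activation="linear")(h)

autoencoder = tf.keras.Model(inputs, recon)
autoencoder.compile(optimizer="nadam", loss="mse")  # reconstruction loss as in Table 5
# autoencoder.fit(X_train, X_train, epochs=50, batch_size=128)

encoder = tf.keras.Model(inputs, code)  # reduces 24 features to 15 for the GRU
# X_train_15 = encoder.predict(X_train)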
After 5-fold cross-validation (Tables 6 and 7), it is found that when the input of AE-GRU is reduced from the original 24 features to 15, the RMSE results are better than those of the other models. This confirms that AE-GRU can effectively extract features and accurately predict the RUL. Time steps is the number of historical data points: time steps = 5 uses the information of the current time point plus the previous four records, and time steps = 10 uses the current time point plus the previous nine. The results show that when time steps = 5, the results of the models are quite close. When time steps = 10, because of the large amount of historical data, the RNN suffers from the vanishing gradient problem, while the plain DNN has no time series characteristics, and its prediction results are the worst.
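How the time-steps input might be assembled is sketched below; make_windows and engine_readings are hypothetical names, and the windowing is applied per engine so that windows never span two engines.

import numpy as np

# Each sample stacks the current record with the previous (time_steps - 1)
# records of the same engine:
# (n_cycles, n_features) -> (n_windows, time_steps, n_features).
def make_windows(sensor_seq: np.ndarray, time_steps: int) -> np.ndarray:
    windows = [
        sensor_seq[t - time_steps + 1 : t + 1]
        for t in range(time_steps - 1, len(sensor_seq))
    ]
    return np.stack(windows)

# Example: with time_steps = 5 each window holds the current cycle plus the
# four preceding cycles, matching the setting of Table 6.
# X_engine = make_windows(engine_readings, time_steps=5)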
Table 6. RMSE results of different models under 5-fold cross-validation (Time steps = 5).

Fold      DNN            RNN            LSTM           GRU            AE-GRU
          (Inputs: 24)   (Inputs: 24)   (Inputs: 24)   (Inputs: 24)   (Inputs: 15)
1         18.57          18.06          17.80          17.68          17.64
2         18.76          18.22          18.00          17.88          17.70
3         18.73          18.47          18.19          18.03          17.98
4         19.11          18.44          18.45          18.16          18.20
5         19.12          18.43          18.34          18.12          18.20
Average   18.86          18.32          18.16          17.96          17.94
Table 7. RMSE results of different models under 5-fold cross-validation (Time steps = 10).

Fold      DNN            RNN            LSTM           GRU            AE-GRU
          (Inputs: 24)   (Inputs: 24)   (Inputs: 24)   (Inputs: 24)   (Inputs: 15)
1         19.21          15.08          11.20          10.31          10.39
2         19.21          14.26          11.52          11.05          10.02
3         19.07          15.86          10.78          10.75          10.70
4         19.77          15.07          11.49          11.33          10.69
5         19.69          17.27          10.81          10.54          11.31
Average   19.39          15.51          11.16          10.79          10.62
4.3.2. RUL Prediction
Figures 10–12 show the test results for the five models (DNN, RNN, LSTM, GRU, and AE-GRU). As the figures show, all five models suffer from unstable fluctuation. Neither DNN nor RNN is ideal in training or testing. LSTM, GRU and AE-GRU perform well on the training set but are precarious on the testing set, because there is not enough training data; deep learning requires a large amount of data to identify features. While the RUL is still flat at the 130-cycle cap, the AE-GRU proposed in this study predicts the engine RUL more accurately and does not fluctuate like DNN and RNN. When the RUL begins to decline, AE-GRU can predict the decline in advance.
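The capped RUL target that produces this flat-then-declining shape can be written as a one-line transformation; the sketch below uses NumPy, with piecewise_rul as a hypothetical helper name.

import numpy as np

# The label decreases linearly toward failure but is capped at 130 cycles,
# since very large RUL values carry little maintenance information.
def piecewise_rul(total_cycles: int, max_rul: int = 130) -> np.ndarray:
    rul = np.arange(total_cycles - 1, -1, -1)  # ..., 2, 1, 0 at failure
    return np.minimum(rul, max_rul)            # flat at 130, then declining

print(piecewise_rul(200)[:3])   # [130 130 130]
print(piecewise_rul(200)[-3:])  # [2 1 0]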

Figure 10. Model testing results comparison (Engine 1).
Figure 11. Model testing results comparison (Engine 2).

Figure 12. Model testing results comparison (Engine 3).
5. Conclusions
This study proposes the AE-GRU model for RUL prediction. First, pre-processing was applied to the sensor data collected by different sensors. Because the scales of different sensors are not the same, we standardize the sensor data so that they can be rearranged to the same scale while retaining the time series characteristics of the data. Next, the RUL is defined; the purpose of defining the RUL is to allow the user to accurately predict the remaining life of the equipment, i.e., how long the engine can still work. Feature extraction then finds the characteristics of the sensor data by deep learning. These characteristics directly affect the predictive ability of the model, and a good model is indispensable for characteristics that can fully represent the data. Data dimension conversion takes the historical sensor data into account, because the data have the characteristics of a time series. The last step of data pre-processing is the transformation of the maximum life value: an engine with an RUL greater than 130 cycles is set to 130 cycles, because when using the RUL prediction, attention focuses on the RUL value when the engine is about to break, and the situation where the RUL value is still very large is less important. Then the GRU is combined with the back-propagation algorithm to learn the parameters. Because there are a large number of parameters to learn, the experiment uses a GPU graphics card for parallel operation to increase the processing speed. At the same time, the ReLU activation function and the Nadam optimizer are used to make the learning of the parameters converge more quickly. Combined with the best combination of parameters found under the experimental design, the prediction of the RUL achieves better results.
The contribution of this study is that it applies the optimized version of LSTM, the GRU. GRU can effectively capture the characteristics of the sensor data, and the characteristics of the data are extracted by the AE before the RUL model prediction. The GRU model has many fewer parameters than LSTM but still
retains the advantages of LSTM. The AE-GRU model proposed in this study has a shorter training
time, and also takes advantage of the features extracted by AE. The convergence speed of the model is
much faster. The validity of the model is evaluated and verified through the 5-fold cross-validation.
The root mean square error results are better than those of the other deep learning methods, and the model can identify an engine that is about to fail early enough to maintain the equipment in time, reducing unnecessary costs.
The AE-GRU model proposed in this study has good accuracy in the prediction of RUL. With the
improvement of process technology, the cost of equipment will become higher and higher. Therefore,
more complex models may be needed in future research, such as a bi-directional recurrent network; in this study, only a uni-directional recurrent network is considered. In some cases, the output at the current moment is related not only to the previous state, but also closely to the state after it.
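A hypothetical sketch of such an extension, assuming Keras's Bidirectional wrapper around a GRU, is shown below; it is a direction for future work rather than part of the model evaluated here.

import tensorflow as tf

# Bi-directional GRU: the sequence is read forwards and backwards, so each
# output depends on both past and future states. Sizes are illustrative.
bi_model = tf.keras.Sequential([
    tf.keras.layers.Bidirectional(tf.keras.layers.GRU(64),
                                  input_shape=(10, 15)),  # (time steps, AE features)
    tf.keras.layers.Dense(8, activation="relu"),
    tf.keras.layers.Dense(1),
])
bi_model.compile(optimizer="nadam", loss="mse")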
The pre-processing method in this study extracts features from the original data. Although the features can effectively represent the data set, they may contain a little noise. In the future, the pre-processing might also employ an advanced version of the AE, such as a denoising AE (DAE); combining a DAE with the GRU could further improve the prediction of RUL.
If the algorithm can be implemented effectively, predictive maintenance can be widely used in many applications, providing effective information for early maintenance of machinery; together with self-correcting parameters that improve the yield, this supports the goal of smart factories.

Author Contributions: Conceptualization, Y.-W.L. and C.-Y.H.; methodology, C.-Y.H.; software, K.-C.H.;
validation, Y.-W.L. and K.-C.H.; formal analysis, Y.-W.L. and K.-C.H.; investigation, Y.-W.L. and K.-C.H.; resources,
C.-Y.H.; data curation, K.-C.H.; writing—original draft preparation, Y.-W.L.; writing—review and editing, C.-Y.H.;
visualization, Y.-W.L.; supervision, C.-Y.H.; project administration, C.-Y.H.; funding acquisition, C.-Y.H. All authors
have read and agreed to the published version of the manuscript.
Funding: This research is supported by the Ministry of Science and Technology, Taiwan (MOST 107-2221-E-027-127-MY2; MOST 108-2745-8-027-003).
Conflicts of Interest: The authors declare no conflict of interest.

© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).
