
aerospace
Article
Deep Neural Network Feature Selection Approaches
for Data-Driven Prognostic Model of Aircraft Engines
Phattara Khumprom * , David Grewell and Nita Yodo
Industrial and Manufacturing Engineering, North Dakota State University, Fargo, ND 58102, USA;
[email protected] (D.G.); [email protected] (N.Y.)
* Correspondence: [email protected]; Tel.: +1-701-231-9818

Received: 29 July 2020; Accepted: 3 September 2020; Published: 4 September 2020
Abstract: Predicting Remaining Useful Life (RUL) of systems has played an important role in various
fields of reliability engineering analysis, including in aircraft engines. RUL prediction is a critically
important part of Prognostics and Health Management (PHM), which is the reliability science
that is aimed at increasing the reliability of the system and, in turn, reducing the maintenance cost.
The majority of the PHM models proposed during the past few years have shown a significant
increase in the amount of data-driven deployments. While more complex data-driven models are
often associated with higher accuracy, there is a corresponding need to reduce model complexity.
One possible way to reduce the complexity of the model is to use feature (attribute or variable)
selection and dimensionality reduction methods prior to the model training process. In this work,
the effectiveness of multiple filter and wrapper feature selection methods (correlation analysis,
relief forward/backward selection, and others), along with Principal Component Analysis (PCA)
as a dimensionality reduction method, was investigated. A basis algorithm of deep learning,
Feedforward Artificial Neural Network (FFNN), was used as a benchmark modeling algorithm.
All those approaches can also be applied to the prognostics of aircraft gas turbine engines. In this
paper, the aircraft gas turbine engines data from the NASA Ames prognostics data repository was used
to test the effectiveness of the filter and wrapper feature selection methods not only for the vanilla
FFNN model but also for the Deep Neural Network (DNN) model. The findings show that applying
feature selection methods helps to improve overall model accuracy and significantly reduces the
complexity of the models.
Keywords: data-driven; machine learning; deep learning; DNN; feature selection; Prognostic and
Health Management; aircraft gas turbine engines; C-MAPSS
1. Introduction
Modern computational capability has become more powerful over the past decades. This has
induced a new trend of employing various data-driven models in many fields. Despite the fact that
modern computers can complete complex tasks, researchers are still searching for solutions to reduce
the computational time and complexity of the data-driven models to increase the likelihood that the
models can be employed in real-time operation.
The same challenge also applies to aerospace data, in this case the estimation of the
Remaining Useful Life (RUL) of aircraft gas turbine engines. The main purpose
of this work is to prove the theory that a particular group or set of prognostics features (attributes
or variables) from the aircraft gas turbine engines data can be selected prior to the training phase of
Artificial Neural Network (ANN) modeling in order to reduce the complexity of the model. The same
assumption is also believed to be applicable to the Deep Neural Network (DNN) model. It might
also be applied to other complex deep learning models, e.g., Convolutional Neural Network (CNN),
Recurrent Neural Network (RNN), and their variations.
Aerospace 2020, 7, 132; doi:10.3390/aerospace7090132 www.mdpi.com/journal/aerospace
In order to validate the aforementioned theory, the prognostics of aircraft gas turbine engines
dataset or Commercial Modular Aero-Propulsion System Simulation (C-MAPSS) dataset derived from
NASA Ames Prognostics Center of Excellence (PCoE) [1] was used to develop preliminary vanilla
ANN models with selected features from different feature selection methods. Furthermore, to prove
that similar assumptions can also be deployed to other deep learning algorithms, the Deep Neural
Network or DNN models have also been developed based on some selected features derived from
the ANN validation models. The final goal was to determine which feature selection method was the
most suitable for the deep learning model in general to predict prognostics state or Remaining Useful
Life for aircraft gas turbine engines data. End results from various feature selection methods were
compared against those obtained using the original features. The ANN and DNN models with selected
features were studied and compared based on their performance.
Based on the aforementioned goal, the main contributions of this work are:
1. Extract meaningful features for neural network-based and deep learning data-driven models
from the C-MAPSS dataset.
2. Suggest a novel neural network-based feature selection method for aircraft gas turbine engines
RUL prediction.
3. Develop deep neural network models from selected features.
4. Show how the developed methodology can improve the RUL prediction model by comparing its
performance/error and complexity to the model derived from original features.
1.1. Neural Network for RUL Prediction

Prognostic and Health Management (PHM) is aimed at improving reliability and reducing the cost
of maintenance of the system’s elements [2]. Remaining Useful Life (RUL) in PHM is defined as the
amount of time left before systems or elements can no longer perform their intended function. Therefore,
RUL prognostics are used to evaluate the equipment’s life status in order to plan future maintenance [3].
With enough condition monitoring data, data-driven machine learning methods can be used to learn the
degradation patterns directly from data in order to generate predictive prognostics models. Data-driven
models using machine learning have advantages over physics-based [4] and traditional data-driven
statistical-based models [5]. For example, machine learning models can be implemented without
prior degradation knowledge [6]. Neural network algorithms in particular have been receiving more
attention than other machine learning algorithms because they have outperformed other algorithms
and can approximate high dimensional non-linear regression functions directly from
raw data [7].
The Artificial Neural Networks (ANN) model is fundamentally based on biological neural
networks. Sigmoid functions are applied to the nodes of ANN to connect and sum the total weights of
the neural network. A sigmoid function is a Gaussian spheroid function, which can be expressed as:
$$Y(x) = e^{-\frac{\|x - c\|^{2}}{2\sigma^{2}}} \tag{1}$$
The hidden neurons in ANN measure the distance between the input vector x and the centroid c
from the data cluster. The measured values are the output of the ANN. In Equation (1), the σ parameter
represents the radius of the hypersphere determined by iteratively selecting the optimum width.
The weights of the neural network are updated at the neural nodes using error back-propagation,
which is a stochastic gradient descent technique. Then the weights of each individual neural node
are fed forward to the next layer. This technique is often referred to as Feedforward Neural Network
(FFNN). This is how ANN “learns” the data pattern through its weights [8].
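As a minimal sketch of Equation (1), the snippet below evaluates the Gaussian activation for a few input vectors using only the Python standard library. The function name `gaussian_activation` and the centroid and width values are illustrative assumptions, not values taken from the experiments.

```python
import math

def gaussian_activation(x, c, sigma):
    """Equation (1): exp(-||x - c||^2 / (2 * sigma^2))."""
    # Squared Euclidean distance between the input vector and the centroid.
    sq_dist = sum((xi - ci) ** 2 for xi, ci in zip(x, c))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))

# Hypothetical centroid and width, chosen only for illustration.
centroid = [0.0, 0.0]
sigma = 1.0
at_centroid = gaussian_activation([0.0, 0.0], centroid, sigma)
near = gaussian_activation([1.0, 0.0], centroid, sigma)
far = gaussian_activation([3.0, 4.0], centroid, sigma)
```

The activation equals 1 at the centroid and decays as the input moves away from it, with sigma controlling how quickly the response falls off.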
In 2006, Geoffrey Hinton suggested the early design of deep learning algorithms based on the
aforementioned FFNN [9]. The vanilla FFNN generally consists of only the hidden layer with a sigmoid
activation function described in Equation (1). Multiple configurations of deep learning algorithms,
such as Deep Neural Network (DNN), Convolutional Neural Network (CNN), Recurrent Neural
Network (RNN), etc., have been widely used as data-driven modeling algorithms. Most of them have
outperformed other well-known data-driven algorithms proposed in the past.
One aspect to keep in mind before employing any deep learning algorithm is that each deep
learning algorithm might be suitable for different tasks. This heavily depends upon the different data
characteristics and type of target models. The deep learning algorithms also include different types of
activation functions and optimizers. These are the key differences between deep learning algorithms
and the vanilla ANN or FFNN that were proposed in the early years [10].
In this work, we only employed DNN with auto-encoder as a modeling algorithm. All encoding
and decoding processes happen inside the hidden layers of the network through parameterized
functions [9,10]. The construction of DNN with auto-encoder is briefly illustrated in Figure 1. Unlike
the ANN that uses the sigmoid function as an activation function, our DNN layers used Rectified Linear
Units (ReLU) as the activation function. The ReLU function can be simply expressed as:
$$f(x) = x^{+} = \max(0, x) \tag{2}$$
where x is the input to a neuron and + represents the positive part of its argument. The ReLU
function has been demonstrated to achieve better training on general regression tasks for deeper networks
compared to other activation functions such as the logistic sigmoid and the hyperbolic tangent
(tanh) [10]. Therefore, the ReLU function was chosen for modeling Remaining Useful
Life (RUL) prediction for our PHM data, while the ANN with sigmoid function was used as a
validation algorithm for feature selection methods.
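The contrast between the rectifier of Equation (2) and the logistic sigmoid can be sketched as follows. This is an illustrative comparison only; the sample inputs are arbitrary, and the observation about gradients is a commonly cited explanation rather than a result from this work.

```python
import math

def relu(x):
    """Equation (2): f(x) = max(0, x)."""
    return max(0.0, x)

def relu_grad(x):
    # Derivative of ReLU: 1 for positive inputs, 0 otherwise.
    return 1.0 if x > 0 else 0.0

def sigmoid(x):
    """Logistic sigmoid, 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x):
    # Derivative of the logistic sigmoid: s(x) * (1 - s(x)).
    s = sigmoid(x)
    return s * (1.0 - s)

samples = [-5.0, -1.0, 0.0, 1.0, 5.0]  # arbitrary illustrative inputs
relu_values = [relu(x) for x in samples]
relu_grads = [relu_grad(x) for x in samples]
sigmoid_grads = [sigmoid_grad(x) for x in samples]
```

For large positive inputs the ReLU gradient stays at 1, whereas the sigmoid gradient never exceeds 0.25 and shrinks toward zero as the input grows in magnitude. This saturation is one commonly given reason why deeper networks tend to train more easily with ReLU.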
Figure 1. Auto-encoder Deep Neural Networks Construction.

The estimation of the RUL or “health state” of a system or its components is one of the main tasks
for prognostics analysis. The RUL estimations often involve the prediction of the life span based on
time or cycles, which is also known as the regression task. In PHM, the RUL is determined using
the historical data collected from the system’s sensors or signals. The ANN-based or deep learning
data-driven models have been proven to work relatively well with these types of PHM tasks [11].
However, one of the challenges is to reduce the complexity of the neural network prior to the training
stage. This might possibly be done by reducing the input training data. One possible way that can
help in reducing the complexity of the model is to select only meaningful features or attributes from
the raw dataset before model training.
1.2. Related Works

Multiple deep learning algorithms have been used to generate data-driven models to predict RUL
for C-MAPSS aircraft gas turbine engines data. It can be observed from the literature [12–20] that
the most suitable deep learning algorithm for training high accuracy C-MAPSS models is the
Long-Short Term Memory Recurrent Neural Network (LSTM). Hybrid deep neural network layers
with LSTM are also under ongoing investigation and experimentation on the C-MAPSS dataset. This approach
is believed to achieve the highest accuracy among the algorithms that have been employed. The most
important drawback of the hybrid models is the high complexity of the model architectures. These
models can also have limitless variations and architecture structures. It is best to reduce the complexity
of the model as much as possible, and one way to achieve that is to limit the number of input nodes.
This is the area where feature selection methods can be brought in.
There are many publications on applying ANN-based or deep learning algorithms to C-MAPSS
aircraft gas turbine engines data. However, none of the previous works introduced feature
selection approaches into their model architectures. Also, the usefulness of any particular feature
selection method has not been addressed in any prior work.
The next paragraph summarizes the contributions of past publications on this topic.
We specifically included only the works that employed deep neural network algorithms for prognostics
modeling of C-MAPSS aircraft gas turbine engines data. It is worth noting that there are
other research works that used other data-driven or machine learning algorithms, which
are not mentioned here.
Chen Xiongzi et al. (2011) conducted a comprehensive survey of the three main data-driven
methods for aircraft gas turbine engines, namely particle filtering methods, neural network, and
relevance vector machine methods [12]. Mei Yuan et al. (2016) applied RNN methods for
fault diagnosis and estimation of remaining useful life of engines [13]. Faisal Khan et al. (2018)
used particle filter algorithms to generate the arbitrary input data points before training their models
with neural networks. Unlike the vanilla neural network algorithm, their models employed the radial
basis function (RBF) as the activation function instead of the original sigmoid function [14]. Xiang Li et al.
(2018) applied the Convolutional Neural Network (CNN) with a time window approach to generate
a feature extraction model of engine data [15]. Ansi Zhang et al. (2018) proposed a supervised
domain adaptation approach that exploits labeled data from the target domain to fine-tune a
bi-directional Long-Short Term Memory Recurrent Neural Network (LSTM) previously trained on the
source domain [16]. Zhengmin Kong et al. (2019) more recently developed models based on
CNN. They employed CNN as part of the network layers in their experiment and proposed hybrid
models by combining the CNN layers with LSTM layers. Their approaches were reported to achieve
the highest accuracy over the other standard methods [17]. Other previously published works [18–20]
mostly focused on adopting the LSTM network and proposing new models without addressing
complexity reduction in their approaches. While each work proposed different network
architectures and the performance of the models has improved over time, what they failed
to address is whether complexity reduction in ANN-based models can play a role in improving
the models. This work aims to address this issue with a feature selection approach that reduces
learning times.
The rest of the paper is organized as follows: Section 2 covers the methodology outlining all
methods and approaches used for the defined problem. Section 3 describes the experimental setup
with detail of data description and comparing final results from all models. Section 4 discusses and
compares results from all models. Lastly, a final conclusion and possible future works highlight are
discussed in Section 5.
2. Methodology
In this section, all essential details of the auto-encoder deep neural network used in our experiment
are discussed. The problem definition and all notations are also clearly defined, as well as the
illustration of how our proposed deep neural network architecture can be applied to RUL prediction
for aircraft gas turbine engines within the feature selection and neural network modeling framework.
2.1. Problem Definition
Starting with the raw data, which is denoted as $D_S = \{(x_S^i, y_S^i)\}_{i=1}^{N_s}$, the data contains $N_s$ training
samples, where $x_S^i \in \mathcal{X}_S$ is a feature with a length of $T_i$ and $q_S$ is the number of features, in which
$x_S^i = \{x_t^i\}_{t=1}^{T_i} \in \mathbb{R}^{q_S \times T_i}$. In addition, $y_S^i \in \mathcal{Y}_S$ is denoted as Remaining Useful Life (RUL), also with
the length $T_i$ (feature space and RUL space are within the same length), with
$y_S^i = \{y_t^i\}_{t=1}^{T_i} \in \mathbb{R}_{\geq 0}^{T_i}$, where $t \in \{1, 2, \ldots, T_i\}$, and $x_t^i \in \mathbb{R}^{q_S}$ and $y_t^i \in \mathbb{R}_{\geq 0}$ represent the $t$-th
measurement of all variables and RUL label, respectively. Similarly, the estimated target domain is
$D_T = \{x_T^i\}_{i=1}^{N_T}$, where $x_T^i \in \mathcal{X}_T$ and $\mathcal{X}_T \in \mathbb{R}^{q_T \times T_i}$, with no labels. The source and target domains,
$D_S$ and $D_T$, are assumed to possibly have a different probability distribution, $P(X_S) \neq P(X_T)$. The
primary goal is to define a function $g$ that can derive or learn from the source data that can
approximate the corresponding RUL for the target domain at the testing time, such that
$y_T^i \approx g(x_T^i)$, with the preliminary assumption that mapping between input ($x$) and output ($y$) is
somehow similar across all domains.
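The notation above can be made concrete with a small sketch that builds one labeled source sample as a $q_S \times T_i$ array with one RUL label per time step, plus an unlabeled target sample. The helper name `make_source_sample`, the random sensor values, and in particular the linearly decreasing RUL labels are assumptions made for illustration; the labeling scheme is not specified in this section.

```python
import random

def make_source_sample(q_s, t_i, seed=0):
    """Build one hypothetical (x_S^i, y_S^i) pair.

    x has shape q_s x t_i (one row per feature, one column per time step);
    y has length t_i and holds non-negative RUL labels.
    """
    rng = random.Random(seed)
    x = [[rng.random() for _ in range(t_i)] for _ in range(q_s)]
    # Assumed labeling: cycles remaining until the final recorded cycle.
    y = [float(t_i - t) for t in range(1, t_i + 1)]
    return x, y

# Hypothetical sizes: 3 features, 5 time steps.
x_s, y_s = make_source_sample(q_s=3, t_i=5)

# Source domain D_S: labeled samples whose lengths T_i may differ.
D_S = [make_source_sample(3, t_i, seed=i) for i, t_i in enumerate([5, 7])]
# Target domain D_T: feature arrays only, with no RUL labels.
D_T = [make_source_sample(3, 6, seed=99)[0]]
```

Each sample keeps its own length $T_i$, so the source domain is naturally a list of variable-length arrays rather than a single rectangular tensor.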
2.2. Deep Neural Network Architecture
While there are existing deep learning algorithms that have been proposed to accommodate
PHM of aircraft gas turbine engines data modeling [12–20], this work focuses on using a deep neural
network with auto-encoder with a specific use case and specifications that fit into the problem definition
previously identified.
The DNN used in this work is based on the feedforward architecture provided by the H2O package through its
Python API [21]. H2O is based on multi-layer feedforward neural networks for predictive modeling [22].
The following are some of the H2O DNN features used for this experiment.
owing are some of the H2O DNN features used for this experiment.  Supervised training protocol for regre
• Supervised training protocol for regression tasks  A multi-threaded and distributed par
ed training protocol for regression tasks
• A multi-threaded and distributed parallel computation that can be run oncluster
node a single or a
threaded and distributed parallel computation that can be run on a single or a multi-
multi-node cluster  Automatic, per-neuron, adaptive learn
ster
• Automatic, per-neuron, adaptive learning rate for fast convergence  Optional specification of the learning r
ic, per-neuron, adaptive learning rate for fast convergence
• Optional specification of the learning rate, annealing, and momentum options Regularization options to prevent mod
specification of the learning rate, annealing, and momentum options
• Regularization options to prevent model overfitting  Elegant and intuitive web interface (Fl
zation options to prevent model overfitting
 Grid search for hyperparameter optim
and intuitive web •interface Elegant
(Flow)and intuitive web interface (Flow)
• Grid search forand hyperparameter optimization and model selection  Automatic early stopping based on the
rch for hyperparameter optimization model selection
tolerance
ic early stopping • based Automatic
on the convergenceearly stopping based on
of user-specified the toconvergence
metric a user-specified of user-specified metric to a
user-specified tolerance  Model check-pointing for reduced run
e
 Automatic pre- and post-processing fo
•
heck-pointing for reduced Modelruncheck-pointing
times and model fortuning
reduced run times and model tuning
 Additional expert parameters for mod
ic pre- and post-processing
• for categorical
Automatic numerical data for categorical numerical data
pre- and post-processing
 Deep auto-encoders for unsupervised
nal expert parameters • for model tuning
Additional expert parameters for model tuning
to-encoders for unsupervised feature learning. In the proposed DNN model, deep n
features from the time length, 𝑇𝑖 . The hidd
roposed DNN model, deep neural network layers are used to extract the temporal
vector ℎ𝑡−1 ∈ ℝℎ , input vector (as define
the time length, 𝑇𝑖 . The hidden state units of the neural consist of, the hidden state
function, 𝑓. All operations in DNN layers c
ℝℎ , input vector (as defined in problem definition), xti ∈ Rq , and the activation
All operations in DNN layers can be written as: 𝑖𝑡 = 𝑓(
Aerospace 2020, 7, 132 6 of 32
• Deep auto-encoders for unsupervised feature learning.
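Several of the listed features correspond to keyword arguments of H2O's deep learning estimator. The sketch below shows how such a configuration might look; only the keyword names follow the H2O Python API, while the layer sizes, regularization strengths, epochs, and stopping settings are hypothetical values chosen for illustration (the 0.4 dropout ratio is the one stated in this paper). The helper `train_dnn` and the `"RUL"` column name are likewise assumptions.

```python
# Hypothetical hyperparameters: the keys are H2O Deep Learning keyword
# arguments, the values are illustrative assumptions, not the paper's settings.
dnn_params = {
    "hidden": [64, 32, 16],                    # assumed hidden layer sizes
    "activation": "RectifierWithDropout",      # ReLU units with dropout
    "hidden_dropout_ratios": [0.4, 0.4, 0.4],  # one ratio per hidden layer
    "distribution": "huber",                   # Huber loss for regression
    "adaptive_rate": True,                     # per-neuron adaptive learning rate
    "l1": 1e-5,                                # regularization (assumed strength)
    "l2": 1e-5,
    "epochs": 100,
    "stopping_metric": "RMSE",                 # early stopping criterion
    "stopping_tolerance": 1e-3,
    "stopping_rounds": 5,
    "seed": 42,
}

def train_dnn(train_frame, valid_frame, features, target="RUL"):
    """Train a feedforward DNN; needs the h2o package and a running H2O cluster."""
    from h2o.estimators import H2ODeepLearningEstimator
    model = H2ODeepLearningEstimator(**dnn_params)
    model.train(x=features, y=target,
                training_frame=train_frame, validation_frame=valid_frame)
    return model
```

The import is placed inside `train_dnn` so that the configuration can be inspected without an H2O installation. Calling the function requires `h2o.init()` to have been run and the data to be loaded as H2O frames.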
In the proposed DNN model, deep neural network layers are used to extract the temporal features
from the time length, $T_i$. The hidden state units of the neural network consist of the hidden state vector
$h_{t-1} \in \mathbb{R}^{h}$, the input vector (as defined in the problem definition), $x_t^i \in \mathbb{R}^{q}$, and the activation function, $f$.
All operations in DNN layers can be written as:
$$i_t = f\left(W_i x_t^i + W_i' h_{t-1} + b_i\right) \tag{3}$$
$$o_t = f\left(W_o x_t^i + W_o' h_{t-1} + b_o\right) \tag{4}$$
where $i$ and $o$ represent input and output states, $W$ and $W'$ are matrices of updated weights and
weights from the hidden state, and $b$ is the bias vector.
Unlike in vanilla ANN, in the proposed DNN, the activation function $f$ is the Rectified Linear
function [23] instead of the sigmoid function. The DNN activation function can be represented as:
$$f(\alpha) = \max(0, \alpha) \in \mathbb{R}^{+} \tag{5}$$
where, in this case, $\alpha$ represents the state functions (Equations (3) and (4)) that fire into the input neuron.
Another important aspect of the DNN model architecture is the loss function, denoted by $\mathcal{L}$.
For this work, the Huber loss function [24] was selected because it has proven to work best in terms
of accurately projecting the RUL, $y_S^i \in \mathcal{Y}_S$, of the source domain, $D_S$. The Huber loss function can be
described as:
$$\mathcal{L}_y^i(\theta_f, \theta_y) = \begin{cases} \frac{1}{2}\left\|\hat{y}_t^i - y_t^i\right\|_2^2, & \text{for } \left\|\hat{y}_t^i - y_t^i\right\|_1 \leq 1 \\ \left\|\hat{y}_t^i - y_t^i\right\|_1 - \frac{1}{2}, & \text{otherwise} \end{cases} \tag{6}$$
where $\theta_f$ is the space representation of the target input that is mapped through the feature extraction
layers into a new space. In addition, $\theta_y$ is the domain regression space generated by the logistic
regressor [24], and $\hat{y}_t^i$ is the RUL prediction from the source domain.
The objective in training the DNN is to minimize the prediction loss, $\mathcal{L}_y^i$, which can be described by:
$$\min_{\theta_f, \theta_y} \left[ \frac{1}{N_s} \sum_{i=1}^{N_s} \mathcal{L}_y^i(\theta_f, \theta_y) \right] \tag{7}$$
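The layer operation of Equations (3) to (5) and the loss of Equations (6) and (7) can be sketched in a few lines of plain Python. This is a toy illustration under stated assumptions: the weights, sizes, predictions, and labels are hypothetical, the RUL values are treated as scalars (so the norms in Equation (6) reduce to absolute values), and only the value of the averaged loss is evaluated, not its minimization.

```python
def relu(values):
    """Equation (5): element-wise max(0, a)."""
    return [max(0.0, a) for a in values]

def matvec(matrix, vector):
    # Plain matrix-vector product for small illustrative sizes.
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]

def layer_state(w, w_prime, b, x_t, h_prev):
    """Common form of Equations (3) and (4): f(W x_t + W' h_{t-1} + b)."""
    wx = matvec(w, x_t)
    wh = matvec(w_prime, h_prev)
    return relu([p + q + r for p, q, r in zip(wx, wh, b)])

def huber(y_hat, y):
    """Equation (6) for a scalar RUL value, with threshold 1."""
    r = abs(y_hat - y)
    return 0.5 * r * r if r <= 1.0 else r - 0.5

def mean_huber(predictions, targets):
    """Quantity minimized in Equation (7): the average loss over samples."""
    return sum(huber(p, t) for p, t in zip(predictions, targets)) / len(targets)

# Hypothetical sizes and weights: q = 2 input features, h = 2 hidden units.
x_t = [1.0, -2.0]
h_prev = [0.5, 0.0]
W = [[1.0, 0.0], [0.0, 1.0]]
W_prime = [[1.0, 0.0], [0.0, 1.0]]
b = [0.0, 0.5]
i_t = layer_state(W, W_prime, b, x_t, h_prev)

# Hypothetical RUL predictions and labels (in cycles).
predictions = [10.5, 11.0, 20.0]
targets = [10.0, 10.0, 10.0]
objective = mean_huber(predictions, targets)
```

The third prediction misses by 10 cycles but contributes only 9.5 to the sum rather than the 50 a squared-error loss would give, which illustrates why the Huber loss is less sensitive to large RUL errors.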
The DNN model used in this work is depicted in Figure 2. This DNN model architecture is
trained to predict for each input, $x^i$, real value $y^i$ and its domain label $d^i$ for the source domain and
only domain label for the target domain. The first part of the DNN architecture is the feature extractor,
$g_f$, that decomposes the inputs and maps them into the hidden state, $h_{t-1} \in \mathbb{R}^{h}$. The model then
embeds the output space as a feature space $f$ of the deeper layers and repeats this process as needed.
As previously detailed, this vector space parameter that is the result of feature mapping is $\theta_f$, i.e.,
$f = g_f(\theta_f)$. This feature space $f$ is first mapped to a real-value $y_t^i$ variable by the function,
$g_y(f; \theta_y)$, which is composed of fully-connected neural network layers with parameter $\theta_y$. The
dropout layer with a rate of 0.4 was applied to avoid the overfitting issue [25].
Another goal is to find the feature space that is domain invariant, i.e., finding a feature space $f$
in which $P(X_S)$ and $P(X_T)$ are similar. This is one of the challenges
challenges in in training, which which can be
Another goal is 𝑆to find the feature space that is domain invariant, i.e., finding atraining, feature space 𝑓can be
improved
improved by applying
by applying the “feature
thesimilar.
“feature selection”
selection” prior
prior to training (detailed in the further section).
in which 𝑃(𝑋𝑆 ) and 𝑃(𝑋𝒯 ) are This is one of the to training (detailed
challenges in training, in which
the further can be section).
Another
Another objective
objectivethe is to minimize
is to“feature
minimize the weights
the weights of feature extractor in the direction of the regression
improved by applying selection” prioroftofeature training extractor
(detailed in the in the direction
further of section).
the regression
loss, ℒℒ𝑦𝑖𝑦𝑖.. In
loss, In more
more detail,
detail, the the modelmodel loss loss function
function can can be be used
used to to derive
derive the the final
final learning
learning function,
function, 𝑔, 𝑔,
Another objective is to minimize the weights of feature extractor in the direction of the regression 𝑖
through parameter 𝜃, which means the RUL prediction
parameter 𝜃, which means the RUL prediction result (described in Equation (6)), 𝓎̂𝑡𝑡 = result (described in Equation (6)), 𝓎̂𝑖 =
loss, ℒthrough
𝑖
𝑦 . In more detail, the model loss function can be used to derive the final learning function, 𝑔,
𝑔 (𝑔 (𝑔𝑓𝑓(𝜃(𝜃𝑓𝑓); );𝜃𝜃𝑦𝑦). ).
through𝑔𝑦𝑦parameter 𝜃, which means the RUL prediction result (described in Equation (6)), 𝓎̂𝑖𝑡 =
The way the DNN algorithm update its learning weights, 𝜃, is through the gradient descent
𝑔𝑦 (𝑔𝑓 (𝜃𝑓 ); The 𝜃𝑦 ). way the DNN algorithm update its learning weights, 𝜃, is through the gradient descent
update
update [26] in the form of; of;
The way [26] the DNN in the formalgorithm update its learning weights, 𝜃, is through the gradient descent
update [26] in the form of; 𝜕ℒ𝑦𝑖𝑦𝑖
𝜕ℒ
𝜃𝑓𝑓 ← 𝜃𝑓𝑓 − 𝜆 ( 𝜕𝜃 ))
𝜃 ← 𝜃 − 𝜆 ( (8)
(8)
𝜕ℒ𝑦𝑖 𝜕𝜃𝑓𝑓
𝜃𝑓 ← 𝜃𝑓 − 𝜆 ( ) (8)
𝜕𝜃𝑓 𝜕ℒ 𝜕ℒ𝑖𝑦𝑖
𝜃𝜃𝑦𝑦 ← − 𝜆𝜆(( 𝑦 ))
← 𝜃𝜃𝑦𝑦 − (9)
(9)
𝜕ℒ𝑦𝑖 𝜕𝜃 𝜕𝜃𝑦𝑦
𝜃 ← 𝜃 − 𝜆( ) (9)
Usually, the Stochastic Continuous Greedy (SCG) estimate is used to update Equations (8) and (9).
The learning rate, $\lambda$, represents the learning steps taken by the SCG as training progresses.
Figure 2. Proposed Deep Neural Networks Model Architecture.
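The update rule in Equations (8) and (9) can be illustrated with a small numerical sketch. The example below assumes the piecewise regression loss defined earlier in this section (quadratic for absolute errors up to 1, linear beyond) and uses a toy linear predictor in place of the network. The data, the function names, and the value of `lam` are hypothetical and are not taken from the experiments in this paper.

```python
import numpy as np

def piecewise_loss(y_hat, y):
    """Per-sample loss: quadratic for absolute errors up to 1, linear beyond."""
    err = np.abs(y_hat - y)
    return np.where(err <= 1.0, 0.5 * err ** 2, err - 0.5)

def loss_gradient(y_hat, y):
    """Derivative of the per-sample loss with respect to the prediction y_hat."""
    return np.clip(y_hat - y, -1.0, 1.0)

def gradient_step(theta, grad, lam):
    """theta <- theta - lam * dL/dtheta, the form of Equations (8) and (9)."""
    return theta - lam * grad

# Hypothetical toy data: a linear predictor y_hat = X @ theta stands in for the network.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
theta_true = np.array([2.0, -1.0, 0.5])
y = X @ theta_true

theta = np.zeros(3)
history = []
for _ in range(500):
    y_hat = X @ theta
    history.append(float(piecewise_loss(y_hat, y).mean()))
    # Chain rule: dL/dtheta = X^T (dL/dy_hat), averaged over the samples.
    grad = X.T @ loss_gradient(y_hat, y) / len(y)
    theta = gradient_step(theta, grad, lam=0.1)
```

In this sketch the same step function serves both parameter groups; in the DNN, $\theta_f$ and $\theta_y$ would each receive their own gradient from backpropagation.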
2.3. Feature Selection Methods for Neural Network Architectures

In prognostic applications, feature extraction occurs after receiving raw data from sensors.
The feature extraction usually involves signal processing and analysis in the time or frequency domain.
The purpose is to transform raw signals into more informative data that well-represents the system [27].
In other words, feature extraction is the process of translating sensor signals into data. In contrast,
the purpose of feature selection is to select a particular set of features in the dataset that is believed
to be more relevant for modeling. These feature selection processes always execute after the feature
extraction and occur in between pre-processing and the training or pre-training phase of the data
modeling framework.
Three common feature selection strategies have been discussed in the literature: (1) filter approach,
(2) wrapper approach, and (3) embedded approach. This paper will only discuss the filter and wrapper
approaches. Figure 3 shows the process flow and the different roles of feature extraction and feature
selection in the data modeling process.
Figure 3. Role of feature extraction and feature selection in the prognostics modeling process.
Filter methods employ statistical, correlation, and information theory to identify the importance
of the features. The performance measurement metrics of filter methods usually use the local criteria
that do not directly relate to model performance [28].
There are currently multiple baseline filter methods popularly employed for feature selection
processes. However, the result from the experiments showed that only the correlation-based methods
were suitable for the case study data. This is due to the fact that correlation-based methods evaluate
the feature with a direct correlation to the target variable. In other words, the correlation-based filter
methods make selections based on the modeling objectives, which can imply that these methods are
more suitable to the data with the target variable. The correlation-based filter method included in
this work is Pearson correlation [29,30]. Additionally, the results from other statistical-based methods,
namely Relief algorithm, Deviation selection, SVM selection, and PCA selection [31], were also included
to provide a complete comparison.
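As an illustration of a correlation-based filter, the sketch below ranks features by the absolute value of their Pearson correlation with the target. It is a minimal example on synthetic data, not the implementation used in this work; the function name `pearson_rank` and the sensor names `s1` to `s3` are hypothetical.

```python
import numpy as np

def pearson_rank(X, y, names):
    """Return (name, r) pairs sorted by |r|, the Pearson correlation with y."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    denom = np.sqrt((Xc ** 2).sum(axis=0) * (yc ** 2).sum())
    # Constant columns have zero variance; report r = 0 for them.
    r = np.divide(Xc.T @ yc, denom, out=np.zeros(X.shape[1]), where=denom > 0)
    order = np.argsort(-np.abs(r))
    return [(names[i], float(r[i])) for i in order]

# Hypothetical data: one sensor that trends with RUL, one noise sensor, one constant sensor.
rng = np.random.default_rng(1)
n = 500
rul = np.linspace(200.0, 0.0, n)
X = np.column_stack([
    -0.05 * rul + rng.normal(scale=0.5, size=n),
    rng.normal(size=n),
    np.full(n, 5.0),
])
ranking = pearson_rank(X, rul, ["s1", "s2", "s3"])
```

A filter of this kind scores each feature independently of any model, which is why it is cheap compared with the wrapper methods discussed next.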
Wrapper methods use a data-driven algorithm that performs the modeling for the dataset to
select the set of features that yield the highest modeling performance [32]. Wrapper methods are
typically more computationally intensive compared to filter methods. There are four main baseline
wrapper methods [32]: (1) forward selection, (2) backward elimination, (3) brute force selection, and
(4) evolutionary selection.
Forward selection and backward elimination are search algorithms with different starting and
stopping conditions. The forward selection starts with an empty selection set of features, then adds an
attribute in each searching round. Only the attribute that provides the highest increase in performance
is retained. Afterwards, another new searching cycle is started with the modified set of selected
features. The searching of forward selection stops when the added attribute in the next round does not
further improve the model performance.
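The forward selection loop described above can be sketched as follows. To keep the example fast and self-contained, a least-squares model scored by k-fold RMSE stands in for the ANN validation model used in this work; that substitution, the `min_gain` stopping threshold, and the synthetic data are assumptions made only for illustration.

```python
import numpy as np

def cv_rmse(X, y, k=5):
    """k-fold RMSE of a least-squares model (a cheap stand-in for the ANN)."""
    folds = np.array_split(np.arange(len(y)), k)  # contiguous folds, no shuffling
    errors = []
    for test_idx in folds:
        train = np.ones(len(y), dtype=bool)
        train[test_idx] = False
        A = np.column_stack([X[train], np.ones(train.sum())])
        coef, *_ = np.linalg.lstsq(A, y[train], rcond=None)
        B = np.column_stack([X[test_idx], np.ones(len(test_idx))])
        errors.append(np.sqrt(np.mean((B @ coef - y[test_idx]) ** 2)))
    return float(np.mean(errors))

def forward_selection(X, y, min_gain=1e-3):
    """Greedily add the attribute that lowers the validation RMSE the most."""
    selected, remaining = [], list(range(X.shape[1]))
    best = np.inf
    while remaining:
        scores = {j: cv_rmse(X[:, selected + [j]], y) for j in remaining}
        j_best = min(scores, key=scores.get)
        if best - scores[j_best] <= min_gain:
            break  # the next attribute does not improve performance further
        selected.append(j_best)
        remaining.remove(j_best)
        best = scores[j_best]
    return selected, best

# Hypothetical data: only columns 0 and 2 carry information about y.
rng = np.random.default_rng(2)
X = rng.normal(size=(300, 5))
y = 3.0 * X[:, 0] - 2.0 * X[:, 2] + rng.normal(scale=0.1, size=300)
selected, best_rmse = forward_selection(X, y)
```

Each round refits the model once per remaining attribute, which shows why wrapper methods become expensive when the validation model is a neural network.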
In contrast, the backward elimination method works in reverse. Backward elimination starts with
the set of all attributes, and the search then eliminates attributes until removing a further
attribute no longer improves modeling performance. The brute force selection method uses search
algorithms that try all combinations of attributes. Evolutionary selection employs a genetic
algorithm to select the best set of features based on a fitness function measurement [33]. Because
of computational and time limitations, the brute force selection could not be included in this
experiment. Only forward selection, backward elimination, and evolutionary selection were
implemented [34].
2.4. Neural Network Data-Driven Modeling Framework

In general, the modeling framework for this experiment is similar to a data-driven modeling
framework that was developed from a cross-industry standard process for data mining (CRISP-DM) [35].
The standard construction consists of five phases: (1) definition states phase, (2) preprocessing phase,
(3) training phase, (4) testing phase, and (5) evaluating phase [36]. In addition to the standard
construction, the feature engineering phase and pre-training phase might be important prior to the
training phase.
The feature engineering phase was introduced in “Features selection procedure for prognostics:
An approach based on predictability” [34] and the pre-training phase was introduced in “The difficulty
of training deep architectures and the effect of unsupervised pre-training” [37] to overcome issues in
training deep learning models, while it also helped to improve some aspects of model performance.
The details of these two additional phases have also been detailed by others [34,37].
As mentioned in Section 2.1, one of the challenges of training the deep learning model is to seek
a feature space $f$ in which $P(X_S)$ and $P(X_{\mathcal{T}})$ are similar. Selecting only the meaningful features is
believed to help reduce the dissimilarity in the feature space that affects the predictability of the model.
This is also a way to reduce the complexity of the model architecture and might also improve the
prediction accuracy of the deep learning models. One possible framework that incorporates the feature
engineering phase and pre-training phase into the CRISP-DM standard is illustrated in Figure 4.
Figure 4. The prognostic data-driven framework for neural network algorithms.
3. Experimental Setup and Results

The first part of the experiment was designed to compare the effectiveness of using different
feature selection methods and filtering for ANN modeling of the prognostics dataset. The aircraft gas
turbine engines dataset with 21 attributes was fed into different filter and wrapper feature selection
methods to identify particular sets of features prior to the model training phase. The selected sets of
features were then used as training features or training attributes for the ANN model. The second part
was to test the feature selected using ANN modeling with the DNN architecture. The results from
different sets of features were compared in order to determine the most suitable set of selected features.
Finally, the final-best DNN model for predicting RUL of aircraft gas turbine engines was determined.
3.1. C-MAPSS Aircraft Engines Data

Commercial Modular Aero-Propulsion System Simulation (C-MAPSS) is a simulation tool used to
generate the turbofan engine degradation run-to-failure test dataset. This test dataset was derived
from the NASA Ames prognostics data repository [1]. The C-MAPSS dataset is one of the most popular
benchmark datasets used in the prognostics and diagnostics research community. This dataset provides
a set of editable input parameters to simulate various operational conditions for aircraft gas turbine
engines [38]. The operational conditions include sea-level temperature, Mach number, and altitude.
The C-MAPSS dataset includes four sub-datasets described in Table 1.
Table 1. Commercial Modular Aero-Propulsion System Simulation (C-MAPSS) dataset description [38].
                              C-MAPSS
Description                   FD001   FD002   FD003   FD004
Number of training engines    100     260     100     248
Number of testing engines     100     259     100     248
Operational conditions        1       6       1       6
Fault modes                   1       1       2       2
Each sub-dataset (FD001, FD002, FD003, and FD004) contains a number of training engines with
run-to-failure information and a number of testing engines with information terminating before failure
is observed. As for operating conditions, each dataset can have one or six operational conditions based
on altitude (0–42,000 feet), throttle resolver angle (20–100°), and Mach (0–0.84). As for fault mode, each
dataset can have one mode or two modes, which are HPC degradation and Fan degradation.
Sub-datasets FD002 and FD004 are generated with six operational conditions, which are believed
to be a better representation of general aircraft gas turbine engines operation compared to FD001 and
FD003, which are generated from only one operational condition. Therefore, either data from
FD002 or FD004 can be selected for a complete experiment. In this study, the data from the FD002 set
were selected as the training dataset. With our current model validation set-up (described in
Section 3.2), the wrapper methods required roughly 2 to 3 weeks to complete the run. We also kept the
number of data points consistent between the ANN feature selection validation and the DNN model
training. Our experiments were designed this way in order to clearly demonstrate the effectiveness
of the feature selection methods used for neural network-based algorithms.
There are 21 features included in the C-MAPSS dataset for every sub-dataset. These attributes
represent the sensor signals from the different parts of the aircraft gas turbine engines, as illustrated in
Figure 5 [39]. Short descriptions of the features and the plots of all 21 sensor signals of sub-dataset
FD002 are illustrated in Figure 6.
Figure 5. Engine and sensor points (left) and engine parts modules connections (right) [39].
Figure 6. Example of Sensor signals (NRc and Ps30) and all feature descriptions.
It has been suggested by multiple literature references to normalize the raw signal before
performing modeling and analysis [13–15]. Figure 7 shows the data signals before and after applying
z-normalization:

$$\tilde{x}_t^{i,j} = \frac{x_t^{i,j} - \min\left(x^{j}\right)}{\max\left(x^{j}\right) - \min\left(x^{j}\right)} \tag{10}$$

where $x_t^{i,j}$ denotes the original i-th data point of the j-th feature at time t and $x^{j}$ is the vector of
all inputs of the j-th feature. Each attribute value was normalized individually and scaled down to the
same range across all data points.
From the dataset, aircraft gas turbine engines start with various initial wear levels, but all are
considered to be at “healthy state” at the start of each record. The engines begin to degrade at a point
in time at higher operation cycles until they can no longer function normally. This is considered as
the time when the engine system is being at the “unhealthy state”. The training datasets have been
collected over the time of run-to-failure information to cover entire life until the engines fail.
It is also reasonable to estimate RUL as a constant value when the engines operate in normal
conditions [38]. Therefore, a piece-wise linear degradation model can be used to define the observed
RUL value in the training datasets. That is, after an initial period with constant RUL values, it can be
assumed that the RUL targets decrease linearly.
Figure 7. Example of before (left) and after (right) z-normalization.
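The scaling in Equation (10) can be sketched in a few lines. As written, the equation rescales each feature to the range [0, 1] using that feature's own minimum and maximum. The handling of constant features below (mapping them to 0) is an assumption added so that the example does not divide by zero; the source does not describe that case.

```python
import numpy as np

def normalize_features(X):
    """Scale each column of X to [0, 1] following the form of Equation (10)."""
    x_min = X.min(axis=0)
    x_max = X.max(axis=0)
    span = x_max - x_min
    # Assumption: a constant feature (span == 0) is mapped to 0 for every data point.
    safe_span = np.where(span > 0, span, 1.0)
    return (X - x_min) / safe_span

# Hypothetical readings: three data points of three features, the last one constant.
X = np.array([[1.0, 10.0, 5.0],
              [2.0, 20.0, 5.0],
              [3.0, 40.0, 5.0]])
X_scaled = normalize_features(X)
```

When test data are scaled, the minimum and maximum would normally be taken from the training data so that both sets share the same range.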
Figure 8 illustrates the RUL curves of all unseen or test datasets containing testing engines from
the FD002 and FD004 datasets. Figure 9 shows an example of the RUL curves of one degradation engine
from the FD002 and FD004 datasets. The same degradation behavior is also applied to the training set.
These RUL curves represent the health state or prognostic of the aircraft gas turbine engines over cycles
until the end-of-life, or the point that the aircraft gas turbine engines can no longer operate normally.
Figure 8. RUL curve of all testing engines: FD002 (top) and FD004 (bottom).
The degradation behavior of the aircraft gas turbine engines can be observed more clearly in Figure 9.
We presume that the RUL is constant until the critical point at which the performance of
the engine starts to degrade. In the degradation phase, the RUL is represented by a linear function.
Hence, the entire RUL curve is identified as a piece-wise linear degradation function. The critical point,
Rth, is the point where the aircraft engines started to degrade. The critical points of the aircraft gas
turbine engines were predefined based on the condition described by the data source, the NASA Ames
prognostics data repository [1].
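A piece-wise linear RUL target of this kind can be generated as in the sketch below. It assumes that the constant part of the curve equals a cap value `r_th` and that RUL reaches zero at the last recorded cycle of a run-to-failure engine. Both the interpretation of Rth as a cap on the RUL value and the numbers in the example are assumptions for illustration only.

```python
import numpy as np

def piecewise_rul(n_cycles, r_th):
    """RUL target per cycle for one run-to-failure engine, held constant at r_th early on."""
    cycles = np.arange(1, n_cycles + 1)
    linear_rul = n_cycles - cycles  # decreases by one per cycle and is 0 at the last cycle
    return np.minimum(linear_rul, r_th)

# Hypothetical engine with 10 recorded cycles and a cap of 3 cycles.
targets = piecewise_rul(n_cycles=10, r_th=3)
```

The result stays at the cap during the healthy phase and then decreases linearly to zero, which matches the shape described for the curves in Figure 9.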
Figure 9. Example of RUL curve of one testing engine: FD002 (top) and FD004 (bottom).
To measure and evaluate the performance of the models with selected features, root mean square
error (RMSE) and the scoring algorithm as suggested in [39] were used.
RMSE is commonly used as a performance indicator for regression models. The following is the
formula of RMSE:

$$RMSE = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left[x_i - \bar{x}_i\right]^2} \tag{11}$$

where $n$ is the number of prediction datasets, $x_i$ is the real value, and $\bar{x}_i$ is the prediction value.
In this case, the $x$ parameters refer to the data points in the RUL curve, while $x_i$ is the actual RUL
value and $\bar{x}_i$ is the RUL value predicted by our models.
The scoring algorithm is as described in the formula below:

$$s = \begin{cases} \sum_{i=1}^{n} e^{-\left(\frac{d}{a_1}\right)} - 1 & \text{for } d < 0 \\ \sum_{i=1}^{n} e^{\left(\frac{d}{a_2}\right)} - 1 & \text{for } d \geq 0 \end{cases} \tag{12}$$

where $s$ is the computed score, $n$ is the number of units under test (UUT), $d = \hat{t}_{RUL} - t_{RUL}$
(Estimated RUL − True RUL), $a_1 = 10$, and $a_2 = 13$. In other words, $d$ is the difference between the
predicted and observed RUL values, and $s$ is summed over all examples. From the formula, the scoring
metric penalizes positive errors more than negative errors as these have a higher impact on
maintenance policies. Also, note that the lower score means better prediction performance of the
model [39].
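Both evaluation measures are short to compute. The sketch below follows Equations (11) and (12) as stated above, with $d$ taken as predicted minus true RUL; `a1` and `a2` are exposed as arguments with the values quoted in the text as defaults. The function names and the example numbers are hypothetical.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean square error between true and predicted RUL values, Equation (11)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def score(y_true, y_pred, a1=10.0, a2=13.0):
    """Exponential score of Equation (12), summed over all units under test."""
    d = np.asarray(y_pred, dtype=float) - np.asarray(y_true, dtype=float)
    per_unit = np.where(d < 0, np.exp(-d / a1) - 1.0, np.exp(d / a2) - 1.0)
    return float(per_unit.sum())

# Hypothetical predictions for two engines.
example_rmse = rmse([10, 20], [13, 16])
example_score = score([10, 20], [13, 16])
```

A perfect prediction gives zero for both measures, and unlike RMSE the score grows exponentially with the size of each individual error.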
3.2. Training Procedure and Hyperparameters Selection

For training, the data from input sensors, operational settings, and labeled RUL values from the
source data, and only sensors and settings from the target dataset, were used. The raw data were
normalized, and the feature selection was applied before the start of all model training. For the
training process, the training dataset (as a source) from dataset FD002 was used. The FD002 and FD004
test datasets were used to validate the models and calculate prediction errors (RMSE and Score). As for
wrapper methods, we used ANN as a validation algorithm. Cross-validation within the FD002
training data was employed for measuring the performance of the wrapper algorithms. The set-up
parameters for ANN validation were fine-tuned based on the best model that was derived from
complete attributes (21 features) modeling:
• 5 Folds Cross-Validation
• 1000 Training cycles
• 0.001 Learning rate
• 0.9 Momentum
• Linear sampling.
For the DNN hyperparameters selection, the model parameters in H2O DNN algorithm varied as
described in Table 2. The grid search to identify the range of the learning rate, λ, was performed after
fine-tuning the remaining parameters manually. Additionally, the training sample per iteration was set
to auto-tuning, and batch size was set to 1 for all variations.
Table 2. Hyperparameters values evaluated in the proposed Deep Neural Network (DNN) model.
Hyperparameters Range
Epoch {100, 1000, 5000, 7000, 10,000}
Training sample per iteration AUTO
Batch size 1
Leaning rate annealing {10−10 , 10−8 , 10−5 , 10−1 }
Momentum {0.1, 0.2, 0.3, 0.5, 0.6, 0.8, 0.99}
L1: Regularization that constraint the absolute value {10−20 , 10−15 , 10−10 , 10−5 , 10−1 , 0}
L2: Regularization that constraint the sum of square weights {10−20 , 10−15 , 10−10 , 10−5 , 10−1 , 0}
Max w2: Maximum sum of square of incoming weight into the neuron {0, 10, 100, 10,000, ∞}
The best-case scenario is the combination of the following hyperparameters: Epoch = 5000, Learning
rate = 10^−8, Momentum = 0.99, L1 = 10^−5, L2 = 0, and Max w2 set to infinity. These are the
hyperparameters employed in the final proposed DNN model.
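To give a sense of the size of the search space in Table 2, the sketch below enumerates the value ranges as a plain Python grid. It is only an illustration: the dictionary keys are hypothetical labels rather than H2O parameter names, the "learning_rate" key stands for the learning-rate row of Table 2, and the study tuned most parameters manually rather than running the full Cartesian product.

```python
import itertools
import math

# Value ranges copied from Table 2 (the keys are illustrative labels, not H2O argument names).
SEARCH_SPACE = {
    "epochs": [100, 1000, 5000, 7000, 10000],
    "learning_rate": [1e-10, 1e-8, 1e-5, 1e-1],
    "momentum": [0.1, 0.2, 0.3, 0.5, 0.6, 0.8, 0.99],
    "l1": [1e-20, 1e-15, 1e-10, 1e-5, 1e-1, 0.0],
    "l2": [1e-20, 1e-15, 1e-10, 1e-5, 1e-1, 0.0],
    "max_w2": [0.0, 10.0, 100.0, 10000.0, math.inf],
}

# Settings held constant for all variations.
FIXED = {"train_samples_per_iteration": "AUTO", "batch_size": 1}

# Best combination reported in the text.
BEST = {
    "epochs": 5000,
    "learning_rate": 1e-8,
    "momentum": 0.99,
    "l1": 1e-5,
    "l2": 0.0,
    "max_w2": math.inf,
}


def iter_grid(space):
    """Yield every combination of the search space as a dictionary."""
    keys = list(space)
    for values in itertools.product(*(space[k] for k in keys)):
        yield dict(zip(keys, values))


n_combinations = math.prod(len(v) for v in SEARCH_SPACE.values())
best_in_grid = all(BEST[k] in SEARCH_SPACE[k] for k in SEARCH_SPACE)
first_candidate = next(iter_grid(SEARCH_SPACE))
```

With these ranges the full product contains 25,200 combinations, which suggests why a staged approach (manual tuning followed by a narrower grid search) is attractive on a laptop-class machine.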
3.3. Experimental Setup and Results

All experiments were run on an Intel® Core i7-10510U (10th generation, 4 cores) processor with
8 MB cache, a 1.8 GHz clock speed, and up to 4.9 GHz boost speed, with 16 GB RAM and
Intel® UHD integrated graphics. The DNN architecture was implemented using Python 3.6 with the H2O
library/package [21]. The experimental results presented in this section are broken down into three
parts: (1) features selected using the feature selection methods, (2) results and models from ANN with the
selected features, and (3) the proposed DNN model. All RMSE and all performance measurements of DNN
models reported in this paper are the average results from 20 trials.
3.3.1. Feature Selection for Aircraft Engine Dataset

All of the feature selection methods considered were performed with the C-MAPSS dataset. The filter
methods include Deviation selection, PCA selection, Relief algorithm selection, SVM selection, and
Pearson correlation selection. For wrapper methods, only three methods were implemented:
forward selection, backward elimination, and evolutionary selection.
Table 3 shows the ranking of attributes based on coefficients and weights calculated from each
filter feature selection method. It is important to note that the ranking of the attributes based on
different methods is dependent upon the statistical measures or weights obtained from each method.
Table 3. Attribute values from different filter methods.

| Pearson Correlation: Attribute | Weight | Relief Algorithm: Attribute | Weight | SVM: Attribute | Weight | PCA: Attribute | Weight | Deviation: Attribute | Weight |
|---|---|---|---|---|---|---|---|---|---|
| farB | −0.0648807 | P15 | 2.55555 × 10^−5 | epr | 28.062965 | htBleed | 0.24226001 | PCNfR_dmd | 1.00002156 |
| Ps30 | −0.0426395 | Nf_dmd | 4.29878 × 10^−13 | T2 | 24.031467 | T30 | 0.24219398 | farB | 1.00000884 |
| T50 | −0.0377657 | farB | −1.76803 × 10^−13 | Nf_dmd | 15.921074 | Ne | 0.24213648 | P15 | 1.00000751 |
| BPR | −0.0320325 | T2 | −3.5083 × 10^−13 | Nf | 15.293535 | T50 | 0.24212279 | epr | 1.00000215 |
| NRc | −0.0308729 | P2 | −1.41209 × 10^−12 | T24 | 11.169562 | T24 | 0.23799320 | P2 | 1.00000079 |
| htBleed | −0.0254014 | PCNfR_dmd | −3.58802 × 10^−12 | W31 | 9.070028 | epr | 0.23251894 | T2 | 1.00000049 |
| T30 | −0.0253007 | phi | −8.18383 × 10^−7 | W32 | 8.806654 | Ps30 | 0.23247642 | Ne | 1.00000022 |
| Ne | −0.0133643 | Nf | −1.94057 × 10^−6 | PCNfR_dmd | 6.597514 | phi | 0.22942893 | T50 | 1.00000016 |
| T24 | −0.0063673 | NRF | −2.22812 × 10^−6 | NRF | 5.849870 | P30 | 0.22931342 | Ps30 | 1.00000013 |
| P2 | −0.0031016 | P30 | −3.43389 × 10^−6 | P30 | 5.529144 | W31 | 0.22654883 | W32 | 1.00000013 |
| P15 | −0.0028634 | T24 | −3.2525 × 10^−5 | phi | 5.262733 | W32 | 0.22654313 | T24 | 1.00000011 |
| T2 | −0.0023212 | W31 | −6.1066 × 10^−5 | Ne | 0.026252 | P15 | 0.21870245 | T30 | 1.00000000 |
| phi | −0.0004811 | W32 | −6.76249 × 10^−5 | P15 | −0.151776 | Nf | 0.21427293 | P30 | 0.99999998 |
| P30 | −0.0003329 | epr | −9.125 × 10^−5 | P2 | −0.726430 | Nf_dmd | 0.21420247 | NRF | 0.99999993 |
| epr | 0.0013847 | Ne | −0.00017538 | farB | −16.274719 | T2 | 0.21253812 | BPR | 0.99999985 |
| Nf | 0.0026742 | BPR | −0.000324083 | T30 | −24.291950 | P2 | 0.20884536 | NRc | 0.99999984 |
| W32 | 0.0029798 | NRc | −0.000344686 | htBleed | −24.530502 | farB | 0.20473956 | phi | 0.99999978 |
| Nf_dmd | 0.0030117 | Ps30 | −0.000364589 | NRc | −32.369914 | NRc | 0.18353047 | Nf | 0.99999973 |
| W31 | 0.0030517 | T50 | −0.000397835 | T50 | −40.853420 | NRF | 0.14637480 | W31 | 0.99999957 |
| NRF | 0.0044269 | T30 | −0.000422547 | Ps30 | −53.894591 | PCNfR_dmd | 0.14634719 | htBleed | 0.99999829 |
| PCNfR_dmd | 0.0048232 | htBleed | −0.000613424 | BPR | −65.865476 | BPR | −0.21428742 | Nf_dmd | 0.99998466 |
For the Pearson correlation, an attribute was selected only if its coefficient was below −0.01;
attributes with coefficients above this value were not selected [29,30]. For PCA, the features were
selected based on weight (selected if the weight is more than 0.2) and the PCA matrix [31]. For the
Relief algorithm, the attributes were not selected if the calculated weight was below zero [31]. For
deviation selection, a feature was selected if its weight was higher than 1 [31]. It is important to
note that the weights of the attributes calculated using the Relief algorithm were unacceptably low
(less than 10^−12) and there were very large gaps between calculated weights. Similar results were
observed with other filter selection methods, including the SVM. It was found that when features were
selected using filter methods that produced statistically low weights, the models trained on those
features were unable to provide usable prediction results.
The following are the features selected by each filtering method. In addition to the
feature weights from Pearson correlation selection and PCA selection in Table 3, the Pearson correlation
matrix and PCA matrix are also provided in Appendices A and B.
• Pearson correlation; 8 attributes: T30, T50, Ne, Ps30, NRc, BPR, farB, and htBleed.
• Relief algorithm; 2 attributes: P15 and Nf_dmd.
• SVM selection; 11 attributes: T2, T24, P30, Nf, epr, phi, NRF, Nf_dmd, PCNfR_dmd, W31, and W32.
• PCA selection; 17 attributes: T2, T24, T30, T50, P2, P15, P30, Nf, Ne, epr, Ps30, phi, farB, htBleed,
Nf_dmd, W31, and W32.
• Deviation selection; 11 attributes: T2, T24, T50, P2, P15, Ne, epr, Ps30, farB, PCNfR_dmd, and W32.
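As a minimal sketch of how such threshold rules reproduce the lists above, the snippet below applies a cut-off to the Pearson and PCA weights copied from Table 3. The helper name `select_by_weight` is hypothetical, and the Pearson rule is written as "coefficient below −0.01" on the assumption that this is the rule that yields the eight listed attributes.

```python
# Weights copied from Table 3.
PEARSON_WEIGHTS = {
    "farB": -0.0648807, "Ps30": -0.0426395, "T50": -0.0377657, "BPR": -0.0320325,
    "NRc": -0.0308729, "htBleed": -0.0254014, "T30": -0.0253007, "Ne": -0.0133643,
    "T24": -0.0063673, "P2": -0.0031016, "P15": -0.0028634, "T2": -0.0023212,
    "phi": -0.0004811, "P30": -0.0003329, "epr": 0.0013847, "Nf": 0.0026742,
    "W32": 0.0029798, "Nf_dmd": 0.0030117, "W31": 0.0030517, "NRF": 0.0044269,
    "PCNfR_dmd": 0.0048232,
}
PCA_WEIGHTS = {
    "htBleed": 0.24226001, "T30": 0.24219398, "Ne": 0.24213648, "T50": 0.24212279,
    "T24": 0.23799320, "epr": 0.23251894, "Ps30": 0.23247642, "phi": 0.22942893,
    "P30": 0.22931342, "W31": 0.22654883, "W32": 0.22654313, "P15": 0.21870245,
    "Nf": 0.21427293, "Nf_dmd": 0.21420247, "T2": 0.21253812, "P2": 0.20884536,
    "farB": 0.20473956, "NRc": 0.18353047, "NRF": 0.14637480,
    "PCNfR_dmd": 0.14634719, "BPR": -0.21428742,
}


def select_by_weight(weights, keep):
    """Return the attribute names whose weight satisfies the predicate `keep`."""
    return [name for name, weight in weights.items() if keep(weight)]


# Assumed rule: keep attributes with a Pearson coefficient below -0.01.
pearson_selected = select_by_weight(PEARSON_WEIGHTS, lambda w: w < -0.01)
# Rule from the text: keep attributes with a PCA weight above 0.2.
pca_selected = select_by_weight(PCA_WEIGHTS, lambda w: w > 0.2)
```

Under these rules the sketch returns 8 attributes for the Pearson correlation and 17 for PCA, matching the counts listed above.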
In reference to the wrapper methods, below are the sets of features selected from each method.
It is important to note that for the wrapper methods, ANN validation with the modeling set-up,
as mentioned in Section 3.2 was used. Figure 10 shows the validation process using ANN for
evolutionary selection.
Unlike forward selection and backward elimination methods, which are both based on search
algorithms [32], Evolutionary selection is based on genetic algorithms [40]. However,
instead of using a fitness function from genetic theory, the evolutionary selection method used ANN
validation as the fitness measurement. The parameters set up in our evolutionary selection experiment
are: population size = 10, maximum number of generations = 200, tournament selection with
0.25 size, initial probability for attributes (features) to be switched = 0.5, crossover probability = 0.5
with uniform crossover, and mutation probability = 1/(number of attributes).
Figure 10. Validation result using Evolutionary selection.
It is also important to note that, in this case, the brute force algorithm was not used. The brute
force algorithm evaluates every possible feature subset and can therefore derive the best feature set
from the data. However, with limited computational capability, its run time is impractical. Therefore,
we did not include the brute force algorithm in this experiment.
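A short calculation illustrates why exhaustive search is costly here. It assumes that each candidate subset would be scored with the same 5-fold ANN validation used for the wrapper methods; the variable names are illustrative.

```python
from math import comb

N_FEATURES = 21  # sensor attributes in the C-MAPSS data used here
FOLDS = 5        # cross-validation folds of the ANN validation set-up

# Every non-empty subset of the 21 attributes is a candidate feature set.
n_subsets = 2 ** N_FEATURES - 1

# The same count, broken down by subset size.
subsets_by_size = {k: comb(N_FEATURES, k) for k in range(1, N_FEATURES + 1)}

# Number of ANN trainings if every subset were validated with 5 folds.
n_ann_trainings = n_subsets * FOLDS
```

This gives 2,097,151 candidate subsets and more than ten million ANN trainings, compared with a few thousand candidate evaluations for a search limited to a population of 10 over 200 generations.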
• Backward elimination; validation RMSE 46.429 from 19 attributes; T2, T30, P2, P15, P30, Nf, epr,
Ps30, phi, NRF, NRc, BPR, farB, htBleed, Nf_dmd, PCNfR_dmd, W31, and W32.
• Evolutionary selection; validation RMSE 46.451 from 14 attributes; T2, T30, T50, P2, Nf, Ne, epr,
Ps30, NRc, BPR, farB, htBleed, W31, and W32.
• Forward selection; validation RMSE 46.480 from 11 attributes; T2, T30, T50, P2, P15, Ps30,
NRc, BPR, farB, htBleed, and Nf_dmd.
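The evolutionary selection settings described in this section can be sketched as a small genetic algorithm over boolean feature masks. This is a simplified illustration rather than the implementation used in the study: the fitness function is a least-squares validation RMSE standing in for the 5-fold ANN validation, the data are synthetic, and all function names are hypothetical.

```python
import numpy as np


def rmse_fitness(mask, X_tr, y_tr, X_va, y_va):
    """Stand-in fitness: validation RMSE of a least-squares fit on the masked columns."""
    if not mask.any():
        return np.inf  # an empty feature set cannot be scored
    A_tr = np.c_[X_tr[:, mask], np.ones(len(X_tr))]
    A_va = np.c_[X_va[:, mask], np.ones(len(X_va))]
    coef, *_ = np.linalg.lstsq(A_tr, y_tr, rcond=None)
    return float(np.sqrt(np.mean((A_va @ coef - y_va) ** 2)))


def evolutionary_selection(fitness, n_features, pop_size=10, max_generations=200,
                           tournament_fraction=0.25, p_initialize=0.5,
                           p_crossover=0.5, seed=0):
    """Return (best_mask, best_score) found by a basic genetic algorithm."""
    rng = np.random.default_rng(seed)
    p_mutation = 1.0 / n_features
    k = max(1, int(np.ceil(tournament_fraction * pop_size)))  # tournament size
    pop = rng.random((pop_size, n_features)) < p_initialize
    scores = np.array([fitness(ind) for ind in pop])
    best_mask, best_score = pop[scores.argmin()].copy(), scores.min()
    for _ in range(max_generations):
        # Tournament selection: the lowest-RMSE contender becomes a parent.
        parents = np.empty_like(pop)
        for i in range(pop_size):
            contenders = rng.choice(pop_size, size=k, replace=False)
            parents[i] = pop[contenders[scores[contenders].argmin()]]
        children = parents.copy()
        # Uniform crossover on consecutive pairs of parents.
        for i in range(0, pop_size - 1, 2):
            if rng.random() < p_crossover:
                swap = rng.random(n_features) < 0.5
                children[i, swap] = parents[i + 1, swap]
                children[i + 1, swap] = parents[i, swap]
        # Mutation: flip each bit with probability 1 / number of attributes.
        pop = children ^ (rng.random(children.shape) < p_mutation)
        scores = np.array([fitness(ind) for ind in pop])
        if scores.min() < best_score:
            best_mask, best_score = pop[scores.argmin()].copy(), scores.min()
    return best_mask, float(best_score)


# Synthetic demonstration: only columns 0, 3 and 7 carry signal.
data_rng = np.random.default_rng(42)
X = data_rng.normal(size=(400, 21))
y = 3.0 * X[:, 0] - 2.0 * X[:, 3] + 1.5 * X[:, 7] + data_rng.normal(scale=0.1, size=400)
X_tr, X_va, y_tr, y_va = X[:300], X[300:], y[:300], y[300:]


def demo_fitness(mask):
    return rmse_fitness(mask, X_tr, y_tr, X_va, y_va)


best_mask, best_rmse = evolutionary_selection(demo_fitness, n_features=21, max_generations=40)
```

Replacing `demo_fitness` with a cross-validated ANN score would bring the sketch closer to the wrapper set-up, at the cost of one model training per fold for every mask evaluated.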
3.3.2. DNN Models and Results

Table 4 summarizes RMSE and prediction score results from all DNN models. The complete RUL
best fit prediction curves for testing data of all feature selection methods are illustrated in Figure 11
for FD002 test data, and in Figure 12 for FD004 test data, respectively. The blue curves represent the
actual RUL from the dataset, and the red lines/dots are the prediction points from our feature selection
DNN models. For illustration purposes, Figures 13 and 14 include the prediction curve from one
engine of each testing data FD002 and FD004 in order to demonstrate how DNN predicts RUL of one
degradation cycle. Additionally, Table 5 includes all DNN models and all prediction error values
measured from the DNN models using FD002 test dataset, i.e., absolute error, relative error, relative
error lenient, relative error strict, normalized absolute error, root relative squared error, squared error,
correlation, squared correlation, prediction average, Spearman rho, and Kendall tau. The number of
hidden nodes in the DNN layers was identified based on the best models fine-tuned from one-layer
ANN models for each feature selection method. We used the same number of hidden nodes from the
best ANN models to construct the DNN model layers. Note that we only present the DNN models
from feature selection methods that provided usable prediction results. Therefore, the results from
the Relief algorithm and SVM selection are not presented here.
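For reference, the two error measures reported in Table 4 can be sketched as follows. The Score function here is the asymmetric scoring function commonly used with the C-MAPSS data, with constants 13 (early predictions) and 10 (late predictions); this is an assumption, since the definition used in this paper is given outside this section and should take precedence if it differs.

```python
import numpy as np


def rmse(y_true, y_pred):
    """Root mean square error between actual and predicted RUL."""
    d = np.asarray(y_pred, dtype=float) - np.asarray(y_true, dtype=float)
    return float(np.sqrt(np.mean(d ** 2)))


def prediction_score(y_true, y_pred, a_early=13.0, a_late=10.0):
    """Asymmetric score: late predictions (d > 0) are penalized more than early ones."""
    d = np.asarray(y_pred, dtype=float) - np.asarray(y_true, dtype=float)
    penalties = np.where(d < 0, np.exp(-d / a_early) - 1.0, np.exp(d / a_late) - 1.0)
    return float(np.sum(penalties))


# Toy usage: the same 10-cycle error costs more when the prediction is late.
early = prediction_score([100.0], [90.0])
late = prediction_score([100.0], [110.0])
toy_rmse = rmse([100.0, 50.0], [110.0, 40.0])
```

Because the penalty grows exponentially with the error, a few badly mispredicted engines can dominate the total Score, which is one reason the Score and RMSE rankings in Table 4 need not agree.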
Table 4. Best root mean square error (RMSE) and Prediction Score results of RUL prediction from all
DNN models.

| Methods | RMSE FD002 | RMSE FD004 | Score FD002 | Score FD004 |
|---|---|---|---|---|
| Original data | 45.439 | 45.302 | 645,121 | 427,968 |
| SVM | Unusable | Unusable | Unusable | Unusable |
| Relief algorithm | Unusable | Unusable | Unusable | Unusable |
| Backward elimination | 45.121 | 45.436 | 645,132 | 211,129 |
| Deviation | 45.374 | 45.630 | 740,936 | 256,776 |
| Evolutionary Selection (Best Overall) | 44.717 | 44.953 | 518,025 | 355,458 |
| Forward selection | 45.242 | 46.505 | 1,353,749 | 423,997 |
| PCA | 45.368 | 45.108 | 1,450,397 | 406,872 |
| Pearson correlation | 45.272 | 46.216 | 502,579 | 338,400 |
Figure 11. (a–g) All RUL prediction curves for FD002. Panels: (a) Original Data; (b) Backward Elimination; (c) Deviation Selection; (d) Evolutionary Selection; (e) Forward Selection; (f) PCA Selection; (g) Pearson Correlation Selection. Each panel plots RUL (0 to 200) against Cycles (0 to 20,000) for the actual RUL and the DNN prediction.
Figure 12. (a–g) All RUL prediction curves for FD004. Panels: (a) Original Data; (b) Backward Elimination; (c) Deviation Selection; (d) Evolutionary Selection; (e) Forward Selection; (f) PCA Selection; (g) Pearson Correlation Selection. Each panel plots RUL (0 to 200) against Cycles (0 to 15,000) for the actual RUL and the DNN prediction.
Figure 13. (a–g) RUL prediction points for one engine of FD002 test data.
Figure 14. (a–g) RUL prediction points for one engine of FD004 test data.
Table 5. The best DNN Models for FD002 test data.

| Feature Selection Method | Model (Layer, Unit, Type) | Output Weights | Errors |
|---|---|---|---|
| Original (All 21 Attributes) | 1, 21, Input; 2, 12, Rectifier; 3, 12, Rectifier; 4, 12, Rectifier; 0.4 Dropout; 5, 1, Linear | Layer 2: −0.389707; Layer 3: −0.954436; Layer 4: −0.798112; Layer 5: 1.135641 | root_mean_squared_error: 45.439 +/− 0.000; absolute_error: 37.062 +/− 26.289; relative_error: 285.29% +/− 1071.56%; relative_error_lenient: 40.87% +/− 26.92%; relative_error_strict: 290.30% +/− 1070.51%; normalized_absolute_error: 0.933; root_relative_squared_error: 0.963; squared_error: 2064.669 +/− 2549.829; correlation: 0.426; squared_correlation: 0.182; prediction_average: 68.095 +/− 47.177; spearman_rho: 0.406; kendall_tau: 0.28 |
| Backward Elimination | 1, 19, Input; 2, 11, Rectifier; … | …; Layer 3: −0.791862; … | …; absolute_error: 36.707 +/− 26.240; relative_error: 275.51% +/− 1043.67%; …; root_relative_squared_error: 0.956; … |
| Deviation Selection | 1, 11, Input; 2, 7, Rectifier; … | …; Layer 3: −0.962801; … | …; absolute_error: 37.420 +/− 25.662; relative_error: 283.25% +/− 1026.67%; …; root_relative_squared_error: 0.962; … |
| Evolutionary Selection * | 1, 14, Input; 2, 9, Rectifier; … | …; Layer 3: −0.729643; … | …; absolute_error: 36.402 +/− 25.971; relative_error: 271.60% +/− 1022.51%; …; root_relative_squared_error: 0.948; … |
| Forward Selection | 1, 11, Input; 2, 7, Rectifier; 3, 7, Rectifier; 4, 7, Rectifier; 0.4 Dropout; 5, 1, Linear | …; Layer 3: −1.333539; Layer 4: −1.583420; Layer 5: 0.341112 | …; absolute_error: 36.817 +/− 26.294; relative_error: 275.71% +/− 1038.01%; relative_error_lenient: 41.12% +/− 26.56%; relative_error_strict: 282.81% +/− 1036.64%; normalized_absolute_error: 0.927; root_relative_squared_error: 0.959; squared_error: 2046.830 +/− 2564.139; correlation: 0.403; squared_correlation: 0.163; prediction_average: 68.095 +/− 47.177; spearman_rho: 0.390; kendall_tau: 0.272 |
| PCA Selection | 1, 17, Input; 2, 10, Rectifier; 3, 10, Rectifier; 4, 10, Rectifier; 0.4 Dropout; 5, 1, Linear | Layer 2: −0.022651; Layer 3: −1.327223; Layer 4: −1.541491; Layer 5: 1.298059 | root_mean_squared_error: 45.368 +/− 0.000; absolute_error: 36.694 +/− 26.680; relative_error: 264.95% +/− 1016.32%; relative_error_lenient: 41.31% +/− 26.37%; relative_error_strict: 275.07% +/− 1014.64%; normalized_absolute_error: 0.924; root_relative_squared_error: 0.962; squared_error: 2058.300 +/− 2623.562; correlation: 0.390; squared_correlation: 0.152; prediction_average: 68.095 +/− 47.177; spearman_rho: 0.382; kendall_tau: 0.266 |
| Pearson Correlation Selection | 1, 8, Input; 2, 6, Rectifier; 3, 6, Rectifier; 4, 6, Rectifier; 0.4 Dropout; 5, 1, Linear | Layer 2: −0.853966; Layer 3: −1.340343; Layer 4: −0.972141; Layer 5: 0.786599 | root_mean_squared_error: 45.272 +/− 0.000; absolute_error: 37.002 +/− 26.084; relative_error: 269.63% +/− 1010.61%; relative_error_lenient: 41.23% +/− 26.36%; relative_error_strict: 277.67% +/− 1009.11%; normalized_absolute_error: 0.932; root_relative_squared_error: 0.960; squared_error: 2049.533 +/− 2474.691; correlation: 0.382; squared_correlation: 0.146; prediction_average: 68.095 +/− 47.177; spearman_rho: 0.364; kendall_tau: 0.253 |

* Note: Complete model layers for the Proposed Evolutionary Selection DNN model will be described in detail in Appendix C.
Due to the fluctuations in the prediction results from the DNN algorithm, we ran our experiments
(training and testing) 100 times for each model. The results in Table 4 are the best prediction results.
The fluctuations across 100 iterations for FD002 and FD004 are presented in Figure 15. In addition to
the best prediction, we include the mean RMSE and error distributions from the 100 test runs, as
illustrated in Table 6 and Figure 16. These fluctuations in prediction errors are commonly found in most
deep learning algorithms due to the random initial training weights assignment and the amplification
effect from the optimizer function in deeper networks. The fluctuations in the prediction result can be
more obvious when models are more complex and take a large number of input attributes. We will
discuss more on this topic in Section 4.
Figure 15. (a,b) RMSE Fluctuation for FD002 and FD004 test data.
Figure 16. (a–g) Prediction Error Distributions.
Table 6. Mean RMSE from all DNN models.

| Average RMSE | FD002 | FD004 |
|---|---|---|
| Original Data | 48.398 | 50.541 |
| BW Elimination | 47.907 | 50.331 |
| Deviation | 48.160 | 50.081 |
| Evo Selection | 47.452 | 49.650 |
| FW Selection | 48.434 | 50.708 |
| PCA | 48.072 | 49.737 |
| Pearson | 49.203 | 52.111 |
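The size of the run-to-run fluctuation can be read off by comparing the best RMSE values (Table 4) with the mean RMSE values (Table 6). The sketch below does this arithmetic; the dictionary layout and variable names are illustrative, and each tuple holds the (FD002, FD004) values.

```python
# Best RMSE per method from Table 4, as (FD002, FD004).
BEST_RMSE = {
    "Original data": (45.439, 45.302),
    "Backward elimination": (45.121, 45.436),
    "Deviation": (45.374, 45.630),
    "Evolutionary selection": (44.717, 44.953),
    "Forward selection": (45.242, 46.505),
    "PCA": (45.368, 45.108),
    "Pearson correlation": (45.272, 46.216),
}

# Mean RMSE per method from Table 6, as (FD002, FD004).
MEAN_RMSE = {
    "Original data": (48.398, 50.541),
    "Backward elimination": (47.907, 50.331),
    "Deviation": (48.160, 50.081),
    "Evolutionary selection": (47.452, 49.650),
    "Forward selection": (48.434, 50.708),
    "PCA": (48.072, 49.737),
    "Pearson correlation": (49.203, 52.111),
}

# Gap between the mean and the best run, per method and dataset.
gaps = {
    method: tuple(round(mean - best, 3)
                  for best, mean in zip(BEST_RMSE[method], MEAN_RMSE[method]))
    for method in BEST_RMSE
}

# Method with the lowest mean RMSE on each test set.
lowest_mean_fd002 = min(MEAN_RMSE, key=lambda m: MEAN_RMSE[m][0])
lowest_mean_fd004 = min(MEAN_RMSE, key=lambda m: MEAN_RMSE[m][1])
```

On these numbers the Evolutionary selection model has the lowest mean RMSE on both test sets, and the mean exceeds the best run by roughly 2.7 (FD002) and 4.7 (FD004) RMSE units for that model.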
4. Discussion

As mentioned in the related works (Section 1.2), there have been a number of efforts in developing
deep learning models for the C-MAPSS aircraft gas turbine engines dataset [12–20]. Currently, the deep
learning model with the highest accuracy was proposed by Zhengmin Kong et al. [17]. Their deep
learning architecture consists of CNN and LSTM-RNN combined layers and can achieve 16.13 RMSE,
while our best Evolutionary DNN model can achieve 44.717 RMSE. This indicates that the performance of
our DNN models is poorer than that of the modern hybrid deep learning models developed in recent years.
However, to the best of our knowledge, no work has addressed the complexity of the models and
the computational burden of model training. Hybrid deep neural network layers are generally
overly complex and require substantially more computational time and resources compared to our
proposed Evolutionary DNN. All models proposed in recent years also took all features from the
C-MAPSS dataset and disregarded the feature performance benchmark. Different from those models,
our proposed approach applies feature selection prior to the model training phase to help reduce
the number of input attributes and, as a result, to reduce the model complexity. The reduction in
complexity when using fewer input features is more evident for high-complexity hybrid deep neural
network layers.
Additionally, as illustrated in Figures 15 and 16, fluctuations in prediction errors can be noticed
when training deep learning models. This effect occurs not only in DNN but also in other types
of network layers, such as LSTM-RNN, CNN, and other modern hybrid layers. Based on the results
demonstrated in Table 4 and Figures 12–16, the key observations of such an effect are as follows:
(1) Utilizing fewer features to train the model has been shown to lower the error distribution range,
compared to using more features. This is because the initial random weights assigned to the
hidden nodes are smaller when using fewer features in model training. In other words, the models
are more robust and reliable when using fewer features. The same observation also applies to the
fluctuation of the prediction errors, in that the prediction results are more stable when using fewer
features in model training.
(2) In terms of model performance and accuracy, although using selected features does not
always guarantee better results, the feature selection methods still help by reducing
the computational burden while offering better prediction performance. In our experiment, the
Evolutionary selection achieved both better performance and complexity reduction.
We emphasize that our current goal is not to improve on model performance compared against
other existing works; rather, we aim to provide baseline results and demonstrate the significant effect of
using feature selection on deep learning models, which has not been addressed before. We believe
that the end results can be further improved when applying our feature selection results in modern
hybrid deep neural network architectures.
For our experimental results in general, as mentioned, the best accuracy based on the RMSE
results in Table 4 was obtained from the Evolutionary method. The complexity of the model was also
significantly reduced by using a smaller set of features, from 21 attributes to only 14 attributes.
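One way to make the reduction in complexity concrete is to count trainable parameters (weights plus biases) of a fully connected network. The sketch below assumes the 21-12-12-12-1 layout of the all-attribute model in Table 5 and, as an assumption, a 14-9-9-9-1 layout for the Evolutionary selection model (its complete layers are described in Appendix C); the function name is hypothetical.

```python
def count_parameters(layer_sizes):
    """Weights plus biases of a fully connected network with the given layer widths."""
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))


# All 21 attributes: input, three hidden layers of 12 units, one linear output.
full_model = [21, 12, 12, 12, 1]
# Assumed layout for the 14-attribute Evolutionary selection model.
evolutionary_model = [14, 9, 9, 9, 1]

n_full = count_parameters(full_model)
n_evolutionary = count_parameters(evolutionary_model)
relative_reduction = 1.0 - n_evolutionary / n_full
```

Under these assumptions the parameter count drops from 589 to 325, a reduction of about 45%, even though the number of input attributes falls by only one third.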
When considering the complexity and computational time, the filter methods were less complex
and faster to run because they do not require repeatedly training and testing ANN models for validation
during the process. In this study, when performing the selection process, most of the filter methods
required only 5–10 min, while wrapper methods required 10 h to 10 days to complete.
It is also important to note that the curve fitting and pattern recognition have been vastly improved,
as can be seen when comparing the RUL prediction curves in Figures 11–14. In greater detail, the DNN
models from most of the selected feature sets can reasonably capture the trend both before and after
the degradation intervals of the aircraft gas turbine engines.
In summary, our Evolutionary DNN model architecture performs best as a simplified deep neural
network data-driven model for C-MAPSS aircraft gas turbine engines data. The feature selection
phase (as described in the modeling framework in Figure 4) must be included as a standard in the
modeling framework for such a PHM dataset. This is one way to potentially improve the overall
performance for RUL prediction for the prognostics of aircraft gas turbine engines data as well as other
prognostic datasets.
5. Conclusions and Future Work

Even though we already included the deep neural network algorithms and proposed a new DNN
model architecture in this work, the features selected must still be tested with other new deep learning
algorithms and methods. As demonstrated in the related works [12–20], RNN, LSTM, and
CNN models have been shown to produce more accurate RUL predictions than shallow DNN
models. However, further improvements can be achieved by applying new algorithms to the selected
features. One of the aspects that such selected features can improve is the reduction in the complexity
of the model. Reducing input features when employing more complex deep learning algorithms can
significantly reduce the model training time, possibly from days to hours. This work aims to be a
baseline for using selected features to generate a data-driven neural network model for the prognostics
of aircraft gas turbine engines data. More complex deep learning algorithms, however, still need to
be implemented and tested to assess the effectiveness of such a feature selection technique. Additionally,
it is also possible to use a dimensionality reduction technique, such as PCA, to transform the data
from the selected features to reduce dimensionality, which can possibly improve prediction accuracy and
reduce complexity. These are the key aspects that should be tested and experimented with in the future.
Lastly, we also believe that our study will be of great benefit to aviation communities. We aim to
raise awareness and discussion on how each aircraft gas turbine engine feature can significantly
help improve the overall life-span of the engines. Although we only provided insights from a
data science perspective, we strongly believe that further studies in aviation communities will
build on the results achieved in this work.
Author Contributions: Conceptualization, P.K.; methodology, P.K.; software, P.K.; validation, P.K.; formal analysis,
P.K.; investigation, P.K., and D.G.; resources, P.K.; data curation, P.K.; writing—original draft preparation, P.K.;
writing—review and editing, P.K., D.G., and N.Y.; visualization, P.K.; supervision, D.G.; project administration, D.G.
and N.Y.; funding acquisition, P.K. All authors have read and agreed to the published version of the manuscript.
Funding: This research received no external funding.
Conflicts of Interest: The authors declare no conflict of interest. The funders had no role in the design of the
study; in the collection, analyses, or interpretation of data; in the writing of the manuscript, or in the decision to
publish the results.
Appendix A
Table A1. Pearson Correlation Matrix.

Attributes T2 T24 T30 T50 P2 P15 P30 Nf Nc epr Ps30 phi NRF NRc BPR farB htBleed Nf_dmd PCNfR_dmd W31 W32 RUL
T2 1.0000 0.9441 0.8709 0.8979 0.9864 0.9864 0.9731 0.5725 0.8618 0.8266 0.7060 0.9729 0.1643 0.3528 −0.5426 0.7936 0.8732 0.5720 0.1642 0.9777 0.9777 −0.0023
T24 0.9441 1.0000 0.9822 0.9810 0.9158 0.9441 0.9686 0.8106 0.9785 0.9051 0.8957 0.9688 0.4801 0.6241 −0.7779 0.8050 0.9830 0.8103 0.4800 0.9624 0.9624 −0.0064
T30 0.8709 0.9822 1.0000 0.9896 0.8429 0.8848 0.9290 0.8957 0.9978 0.9290 0.9607 0.9295 0.6209 0.7520 −0.8759 0.8047 0.9987 0.8954 0.6208 0.9171 0.9171 −0.0253
T50 0.8979 0.9810 0.9896 1.0000 0.8841 0.9196 0.9567 0.8439 0.9873 0.9616 0.9368 0.9571 0.5447 0.7156 −0.8467 0.8591 0.9902 0.8436 0.5446 0.9464 0.9464 −0.0378
P2 0.9864 0.9158 0.8429 0.8841 1.0000 0.9963 0.9798 0.5242 0.8329 0.8438 0.6736 0.9795 0.1136 0.3305 −0.5253 0.8241 0.8455 0.5237 0.1135 0.9857 0.9857 −0.0031
P15 0.9864 0.9441 0.8848 0.9196 0.9963 1.0000 0.9933 0.5944 0.8762 0.8782 0.7339 0.9931 0.1981 0.4075 −0.5955 0.8403 0.8871 0.5940 0.1980 0.9964 0.9964 −0.0029
P30 0.9731 0.9686 0.9290 0.9567 0.9798 0.9933 1.0000 0.6791 0.9226 0.9187 0.8054 1.0000 0.3070 0.5081 −0.6842 0.8577 0.9309 0.6787 0.3069 0.9991 0.9991 −0.0003
Nf 0.5725 0.8106 0.8957 0.8439 0.5242 0.5944 0.6791 1.0000 0.9033 0.7829 0.9726 0.6801 0.9028 0.9245 −0.9712 0.5913 0.8937 1.0000 0.9028 0.6559 0.6558 0.0027
Nc 0.8618 0.9785 0.9978 0.9873 0.8329 0.8762 0.9226 0.9033 1.0000 0.9291 0.9643 0.9231 0.6349 0.7711 −0.8855 0.7996 0.9979 0.9030 0.6347 0.9100 0.9100 −0.0134
epr 0.8266 0.9051 0.9290 0.9616 0.8438 0.8782 0.9187 0.7829 0.9291 1.0000 0.8924 0.9192 0.5087 0.7271 −0.8475 0.9141 0.9297 0.7827 0.5086 0.9092 0.9091 0.0014
Ps30 0.7060 0.8957 0.9607 0.9368 0.6736 0.7339 0.8054 0.9726 0.9643 0.8924 1.0000 0.8062 0.8001 0.8931 −0.9654 0.7326 0.9597 0.9724 0.8000 0.7848 0.7847 −0.0426
phi 0.9729 0.9688 0.9295 0.9571 0.9795 0.9931 1.0000 0.6801 0.9231 0.9192 0.8062 1.0000 0.3084 0.5094 −0.6853 0.8579 0.9314 0.6797 0.3083 0.9991 0.9991 −0.0005
NRF 0.1643 0.4801 0.6209 0.5447 0.1136 0.1981 0.3070 0.9028 0.6349 0.5087 0.8001 0.3084 1.0000 0.9277 −0.8842 0.2952 0.6173 0.9031 1.0000 0.2766 0.2765 0.0044
NRc 0.3528 0.6241 0.7520 0.7156 0.3305 0.4075 0.5081 0.9245 0.7711 0.7271 0.8931 0.5094 0.9277 1.0000 −0.9574 0.5425 0.7496 0.9245 0.9275 0.4792 0.4792 −0.0309
BPR −0.5426 −0.7779 −0.8759 −0.8467 −0.5253 −0.5955 −0.6842 −0.9712 −0.8855 −0.8475 −0.9654 −0.6853 −0.8842 −0.9574 1.0000 −0.6644 −0.8742 −0.9712 −0.8842 −0.6601 −0.6601 −0.0320
farB 0.7936 0.8050 0.8047 0.8591 0.8241 0.8403 0.8577 0.5913 0.7996 0.9141 0.7326 0.8579 0.2952 0.5425 −0.6644 1.0000 0.8060 0.5910 0.2950 0.8554 0.8553 −0.0649
htBleed 0.8732 0.9830 0.9987 0.9902 0.8455 0.8871 0.9309 0.8937 0.9979 0.9297 0.9597 0.9314 0.6173 0.7496 −0.8742 0.8060 1.0000 0.8934 0.6172 0.9191 0.9190 −0.0254
Nf_dmd 0.5720 0.8103 0.8954 0.8436 0.5237 0.5940 0.6787 1.0000 0.9030 0.7827 0.9724 0.6797 0.9031 0.9245 −0.9712 0.5910 0.8934 1.0000 0.9030 0.6554 0.6554 0.0030
PCNfR_dmd 0.1642 0.4800 0.6208 0.5446 0.1135 0.1980 0.3069 0.9028 0.6347 0.5086 0.8000 0.3083 1.0000 0.9275 −0.8842 0.2950 0.6172 0.9030 1.0000 0.2765 0.2764 0.0048
W31 0.9777 0.9624 0.9171 0.9464 0.9857 0.9964 0.9991 0.6559 0.9100 0.9092 0.7848 0.9991 0.2766 0.4792 −0.6601 0.8554 0.9191 0.6554 0.2765 1.0000 0.9999 0.0031
W32 0.9777 0.9624 0.9171 0.9464 0.9857 0.9964 0.9991 0.6558 0.9100 0.9091 0.7847 0.9991 0.2765 0.4792 −0.6601 0.8553 0.9190 0.6554 0.2764 0.9999 1.0000 0.0030
RUL −0.0023 −0.0064 −0.0253 −0.0378 −0.0031 −0.0029 −0.0003 0.0027 −0.0134 0.0014 −0.0426 −0.0005 0.0044 −0.0309 −0.0320 −0.0649 −0.0254 0.0030 0.0048 0.0031 0.0030 1.0000
Appendix B
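Table A2. Principal Components (PC) Matrix. Columns: component, standard deviation, proportion of variance, cumulative variance, and the eigenvector entries for each attribute.

Component Standard_Deviation Proportion_of_Variance Cumulative_Variance T2 T24 T30 T50 P2 P15 P30 Nf Nc epr Ps30 phi NRF NRc BPR farB htBleed Nf_dmd PCNfR_dmd W31 W32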
PC 1 4.1098 0.8043 0.8043 0.2125 0.2380 0.2422 0.2421 0.2088 0.2187 0.2293 0.2143 0.2421 0.2325 0.2325 0.2294 0.1464 0.1835 −0.2143 0.2047 0.2423 0.2142 0.1463 0.2265 0.2265
PC 2 1.8911 0.1703 0.9746 0.2432 0.0759 −0.0129 0.0338 0.2694 0.2294 0.1739 −0.2437 −0.0234 0.0343 −0.1505 0.1731 −0.4207 −0.3244 0.2381 0.1294 −0.0105 −0.2440 −0.4208 0.1900 0.1900
PC 3 0.6210 0.0184 0.9930 −0.2203 −0.2194 −0.1148 0.0276 −0.0668 −0.0637 −0.0351 −0.1583 −0.1055 0.4065 −0.0113 −0.0345 −0.0745 0.2675 −0.1636 0.7255 −0.1142 −0.1582 −0.0747 −0.0391 −0.0391
PC 4 0.2765 0.0036 0.9966 −0.1973 −0.1956 −0.1354 −0.0125 0.1133 0.1197 0.1552 −0.1492 −0.0729 0.3738 −0.0685 0.1558 −0.0679 0.3512 −0.2851 −0.6037 −0.1303 −0.1498 −0.0681 0.1400 0.1399
PC 5 0.1898 0.0017 0.9983 0.2407 0.1279 0.1338 0.2365 −0.1105 −0.1263 −0.1444 −0.1089 0.2504 0.0171 0.1765 −0.1436 −0.2556 0.5771 0.3024 −0.1262 0.1371 −0.1126 −0.2598 −0.1894 −0.1890
PC 6 0.1429 0.0010 0.9993 −0.1080 −0.0508 −0.1018 −0.2787 0.1666 0.1661 0.1402 −0.0006 0.0850 −0.5969 −0.2325 0.1398 0.0629 0.5242 0.1348 0.1816 −0.0949 −0.0067 0.0564 0.1473 0.1473
PC 7 0.0912 0.0004 0.9997 0.3962 0.2324 −0.0820 −0.3642 −0.0725 −0.1049 −0.1372 0.1574 0.2384 0.2671 −0.5964 −0.1412 −0.0471 0.0686 −0.2038 0.0105 −0.0991 0.1585 −0.0485 0.0028 0.0024
PC 8 0.0467 0.0001 0.9998 0.2674 0.1477 −0.3715 0.0978 −0.0681 −0.0160 0.1226 −0.0583 0.0358 −0.3458 0.2701 0.1267 −0.1464 −0.0277 −0.6182 0.0391 −0.1879 −0.0548 −0.1362 −0.1777 −0.1775
PC 9 0.0363 0.0001 0.9999 0.0132 −0.0133 0.7894 −0.0969 −0.0175 −0.0065 0.0249 −0.0527 −0.0336 −0.0869 0.0295 0.0252 −0.0552 0.0009 −0.1860 0.0047 −0.5529 −0.0523 −0.0533 −0.0472 −0.0457
PC 10 0.0328 0.0001 0.9999 0.0008 −0.0389 0.1788 −0.7296 −0.0113 −0.0056 0.0201 −0.0967 −0.0508 0.0283 0.3284 0.0208 −0.1045 0.0016 −0.1610 0.0012 0.5025 −0.0962 −0.1026 −0.0400 −0.0441
PC 11 0.0311 0.0000 1.0000 0.1051 0.1076 −0.2848 −0.3377 0.0413 0.0245 −0.0107 0.1014 0.1245 0.2238 0.5163 −0.0125 0.0465 0.0011 0.3575 −0.0016 −0.5287 0.1028 0.0463 0.0864 0.0873
PC 12 0.0138 0.0000 1.0000 −0.2578 −0.2810 −0.0020 0.0108 0.0167 −0.0099 −0.0864 −0.1056 0.8618 −0.0568 0.0304 −0.0848 −0.0164 −0.2225 −0.1099 −0.0024 −0.0025 −0.1110 −0.0253 0.0644 0.0496
PC 13 0.0118 0.0000 1.0000 0.0477 −0.0185 0.0025 0.0839 −0.0086 −0.1363 −0.4283 0.0260 −0.1718 −0.1710 0.1638 −0.4312 −0.1047 0.0539 −0.2109 −0.0039 0.0099 0.0176 −0.1249 0.4655 0.4744
PC 14 0.0101 0.0000 1.0000 −0.0021 −0.0067 −0.0005 −0.0023 0.0098 0.0070 −0.0020 0.0012 0.0090 0.0009 0.0001 −0.0028 0.0001 −0.0025 −0.0016 0.0000 0.0020 0.0007 −0.0008 −0.7132 0.7008
PC 15 0.0071 0.0000 1.0000 0.5000 −0.7791 0.0160 0.0235 −0.1718 −0.1149 0.1418 0.1483 −0.0349 −0.0279 0.0250 0.1270 −0.0272 0.0099 0.0680 0.0015 0.0178 0.1666 −0.0003 0.0504 0.0501
PC 16 0.0058 0.0000 1.0000 0.0583 −0.2380 −0.0142 0.0054 0.6356 0.4221 −0.2979 0.1703 −0.0600 0.0165 0.0288 −0.2753 −0.0146 0.0194 −0.0802 −0.0018 −0.0146 0.1399 −0.0673 −0.2421 −0.2634
PC 17 0.0025 0.0000 1.0000 0.0239 −0.0042 0.0006 −0.0014 −0.0321 0.0275 −0.7104 −0.0225 0.0000 0.0017 −0.0024 0.7009 0.0127 −0.0002 0.0038 0.0001 −0.0002 −0.0181 0.0232 0.0068 0.0081
PC 18 0.0011 0.0000 1.0000 −0.0506 0.0059 0.0006 0.0011 0.0861 −0.0696 −0.0029 −0.4001 0.0033 −0.0050 0.0004 −0.0007 −0.5777 0.0057 −0.0026 0.0000 0.0008 0.4897 0.5014 −0.0056 −0.0055
PC 19 0.0008 0.0000 1.0000 −0.2540 0.0089 0.0007 0.0033 0.4119 −0.6623 0.1074 0.3333 −0.0010 −0.0155 0.0051 0.1922 −0.1498 −0.0006 −0.0017 −0.0002 0.0011 0.2199 −0.3095 −0.0178 −0.0180
PC 20 0.0004 0.0000 1.0000 0.3279 −0.0012 0.0002 0.0002 0.4451 −0.4099 −0.0213 −0.4508 0.0004 0.0104 0.0000 −0.0349 0.3864 0.0000 −0.0001 0.0002 0.0002 −0.3040 0.2821 0.0029 0.0029
PC 21 0.0002 0.0000 1.0000 0.0385 −0.0005 0.0000 0.0000 0.0661 −0.0567 −0.0051 0.4972 0.0001 0.0022 −0.0001 −0.0091 −0.4057 0.0000 0.0000 0.0000 0.0000 −0.5855 0.4860 0.0009 0.0010
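The three summary columns of Table A2 are tied together by a standard PCA identity: the proportion of variance of a component is its squared standard deviation divided by the sum of all squared standard deviations, and the cumulative column is the running sum. The following is a minimal sketch that recomputes those columns from the standard deviations listed in the table; the helper name `components_for` and the 99% threshold are illustrative assumptions, not values taken from the paper.

```python
from itertools import accumulate

# Standard deviations of PC 1 ... PC 21, copied from Table A2.
std_devs = [
    4.1098, 1.8911, 0.6210, 0.2765, 0.1898, 0.1429, 0.0912,
    0.0467, 0.0363, 0.0328, 0.0311, 0.0138, 0.0118, 0.0101,
    0.0071, 0.0058, 0.0025, 0.0011, 0.0008, 0.0004, 0.0002,
]

# Variance of each component is the square of its standard deviation.
variances = [s ** 2 for s in std_devs]
total_variance = sum(variances)

# Proportion of variance and its running (cumulative) sum.
proportions = [v / total_variance for v in variances]
cumulative = list(accumulate(proportions))


def components_for(threshold, cumulative_variance):
    """Smallest number of leading components whose cumulative variance
    reaches the given threshold (hypothetical helper)."""
    for k, c in enumerate(cumulative_variance, start=1):
        if c >= threshold:
            return k
    return len(cumulative_variance)


for k in range(3):
    print(f"PC {k + 1}: proportion={proportions[k]:.4f} cumulative={cumulative[k]:.4f}")
print("components needed for 99% of variance:", components_for(0.99, cumulative))
```

The sum of squared standard deviations comes out close to 21, which would be consistent with PCA applied to 21 standardized inputs, although that is an inference from the numbers rather than a statement in the table.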

Appendix C
Figure A1. Final Proposed Evolutionary DNN Model description.
References
1. Saxena, A.; Goebel, K. Turbofan Engine Degradation Simulation Data Set. NASA Ames Prognostics Data
Repository, NASA Ames Research Center, Moffett Field. 2008. Available online: http://ti.arc.nasa.gov/project/
prognostic-data-repository (accessed on 10 May 2019).
2. Atamuradov, V.; Medjaher, K.; Dersin, P.; Lamoureux, B.; Zerhouni, N. Prognostics and health management
for maintenance practitioners-review, implementation and tools evaluation. Int. J. Progn. Health Manag.
2017, 8, 1–31.
3. Papakostas, N.; Papachatzakis, P.; Xanthakis, V.; Mourtzis, D.; Chryssolouris, G. An approach to operational
aircraft maintenance planning. Decis. Support Syst. 2010, 48, 604–612. [CrossRef]
4. Cubillo, A.; Perinpanayagam, S.; Esperon-Miguez, M. A review of physics-based models in prognostics:
Application to gears and bearings of rotating machinery. Adv. Mech. Eng. 2016, 8, 1687814016664660.
[CrossRef]
5. Si, X.-S.; Wang, W.; Hu, C.-H.; Zhou, D.-H. Remaining useful life estimation—A review on the statistical data
driven approaches. Eur. J. Oper. Res. 2011, 213, 1–14. [CrossRef]
6. Lei, Y.; Li, N.; Guo, L.; Li, N.; Yan, T.; Lin, J. Machinery health prognostics: A systematic review from data
acquisition to RUL prediction. Mech. Syst. Signal Process. 2018, 104, 799–834. [CrossRef]
7. Faghih-Roohi, S.; Hajizadeh, S.; Nunez, A.; Babuška, R.; De Schutter, B. Deep Convolutional Neural Networks
for Detection of Rail Surface Defects. In Proceedings of the 2016 International Joint Conference on Neural
Networks (IJCNN), Vancouver, BC, Canada, 24–29 July 2016; pp. 2584–2589.
8. Mehrotra, K.; Mohan, C.K.; Ranka, S. Elements of Artificial Neural Networks; MIT Press: Cambridge, MA,
USA, 1997.
9. Hinton, G.E.; Salakhutdinov, R.R. Reducing the Dimensionality of Data with Neural Networks. Science 2006,
313, 504–507. [CrossRef]
10. Schmidhuber, J. Deep learning in neural networks: An overview. Neural Netw. 2015, 61, 85–117. [CrossRef]
11. Zhao, G.; Zhang, G.; Ge, Q.; Liu, X. Research Advances in Fault Diagnosis and Prognostic Based on
Deep Learning. In Proceedings of the 2016 Prognostics and System Health Management Conference
(PHM-Chengdu), Chengdu, China, 19–21 October 2016; pp. 1–6.
12. Xiongzi, C.; Jinsong, Y.; Diyin, T.; Yingxun, W. Remaining Useful Life Prognostic Estimation for Aircraft
Subsystems or Components: A Review. In Proceedings of the 2011 10th International Conference on
Electronic Measurement & Instruments, Chengdu, China, 16–19 August 2011; Volume 2, pp. 94–98.
13. Yuan, M.; Wu, Y.-T.; Lin, L. Fault Diagnosis and Remaining Useful Life Estimation of Aero Engine Using
LSTM Neural Network. In Proceedings of the 2016 IEEE International Conference on Aircraft Utility Systems
(AUS), Beijing, China, 10–12 October 2016; pp. 135–140.
14. Khan, F.; Eker, O.F.; Khan, A.; Orfali, W. Adaptive Degradation Prognostic Reasoning by Particle Filter with
a Neural Network Degradation Model for Turbofan Jet Engine. Data 2018, 3, 49. [CrossRef]
15. Li, X.; Ding, Q.; Sun, J.-Q. Remaining useful life estimation in prognostics using deep convolution neural
networks. Reliab. Eng. Syst. Saf. 2018, 172, 1–11. [CrossRef]
16. Zhang, A.; Wang, H.; Li, S.; Cui, Y.; Liu, Z.; Yang, G.; Hu, J. Transfer Learning with Deep Recurrent Neural
Networks for Remaining Useful Life Estimation. Appl. Sci. 2018, 8, 2416. [CrossRef]
17. Kong, Z.; Cui, Y.; Xia, Z.; Lv, H. Convolution and Long Short-Term Memory Hybrid Deep Neural Networks
for Remaining Useful Life Prognostics. Appl. Sci. 2019, 9, 4156. [CrossRef]
18. Zheng, S.; Ristovski, K.; Farahat, A.; Gupta, C. Long Short-Term Memory Network for Remaining Useful
Life estimation. In Proceedings of the 2017 IEEE International Conference on Prognostics and Health
Management (ICPHM), Dallas, TX, USA, 19–21 June 2017; pp. 88–95.
19. Wu, Y.-T.; Yuan, M.; Dong, S.; Lin, L.; Liu, Y. Remaining useful life estimation of engineered systems using
vanilla LSTM neural networks. Neurocomputing 2018, 275, 167–179. [CrossRef]
20. Ellefsen, A.L.; Bjoerlykhaug, E.; Æsøy, V.; Ushakov, S.; Zhang, H. Remaining useful life predictions for
turbofan engine degradation using semi-supervised deep architecture. Reliab. Eng. Syst. Saf. 2019, 183,
240–251. [CrossRef]
21. Candel, A.; Parmar, V.; LeDell, E.; Arora, A. Deep Learning with H2O; H2O.ai Inc.: Mountain View, CA,
USA, 2016.
22. Bengio, Y. Learning Deep Architectures for AI. Found. Trends Mach. Learn. 2009, 2, 1–127. [CrossRef]
23. Goodfellow, I.; Warde-Farley, D.; Mirza, M.; Courville, A.; Bengio, Y. Maxout networks. In Proceedings of
the International Conference on Machine Learning, Atlanta, GA, USA, 16−21 June 2013; pp. 1319–1327.
24. Ben-David, S.; Blitzer, J.; Crammer, K.; Kulesza, A.; Pereira, F.C.N.; Vaughan, J.W. A theory of learning from
different domains. Mach. Learn. 2010, 79, 151–175. [CrossRef]
25. Srivastava, N.; Hinton, G.; Krizhevsky, A.; Sutskever, I.; Salakhutdinov, R. Dropout: A simple way to prevent
neural networks from overfitting. J. Mach. Learn. Res. 2014, 15, 1929–1958.
26. Ganin, Y.; Ustinova, E.; Ajakan, H.; Germain, P.; LaRochelle, H.; LaViolette, F.; Marchand, M.; Lempitsky, V.
Domain-Adversarial Training of Neural Networks. J. Mach. Learn. Res. 2016, 17, 189–209. [CrossRef]
27. Guyon, I.; Gunn, S.; Nikravesh, M.; Zadeh, L.A. (Eds.) Feature Extraction: Foundations and Applications.
Springer: New York, NY, USA, 2008.
28. Guyon, I.; Elisseeff, A. An introduction to variable and feature selection. J. Mach. Learn. Res. 2003, 3,
1157–1182.
29. Benesty, J.; Chen, J.; Huang, Y.; Cohen, I. Pearson Correlation Coefficient. In Noise Reduction in Speech
Processing; Springer: Berlin/Heidelberg, Germany, 2009; pp. 1–4.
30. Sarwate, D. Mean-square correlation of shift-register sequences. IEEE Proc. F Commun. Radar Signal Process.
1984, 131, 101. [CrossRef]
31. Sun, Y. Iterative RELIEF for Feature Weighting: Algorithms, Theories, and Applications. IEEE Trans. Pattern
Anal. Mach. Intell. 2007, 29, 1035–1051. [CrossRef]
32. Derksen, S.; Keselman, H.J. Backward, forward and stepwise automated subset selection algorithms:
Frequency of obtaining authentic and noise variables. Br. J. Math. Stat. Psychol. 1992, 45, 265–282. [CrossRef]
33. Vafaie, H.; Imam, I.F. Feature Selection Methods: Genetic Algorithms vs. Greedy-Like Search. In Proceedings
of the International Conference on Fuzzy and Intelligent Control Systems, Louisville, KY, USA,
26 June−2 July 1994; Volume 51, p. 28.
34. Javed, K.; Gouriveau, R.; Zemouri, R.; Zerhouni, N. Features Selection Procedure for Prognostics:
An Approach Based on Predictability. IFAC Proc. Vol. 2012, 45, 25–30. [CrossRef]
35. Wirth, R.; Hipp, J. CRISP-DM: Towards a Standard Process Model for Data Mining. In Proceedings
of the 4th International Conference on the Practical Applications of Knowledge Discovery and Data Mining;
Springer: London, UK, 2000; pp. 29–39.
36. Khumprom, P.; Yodo, N. A Data-Driven Predictive Prognostic Model for Lithium-Ion Batteries based on a
Deep Learning Algorithm. Energies 2019, 12, 660. [CrossRef]
37. Erhan, D.; Manzagol, P.A.; Bengio, Y.; Bengio, S.; Vincent, P. The difficulty of training deep architectures and
the effect of unsupervised pre-training. AISTATS 2009, 5, 153–160.
38. Saxena, A.; Goebel, K.; Simon, D.; Eklund, N. Damage Propagation Modeling for Aircraft Engine
Run-to-Failure Simulation. In Proceedings of the 2008 International Conference on Prognostics and
Health Management, Denver, CO, USA, 6–9 October 2008; pp. 1–9.
39. Frederick, D.K.; DeCastro, J.A.; Litt, J.S. User’s Guide for the Commercial Modular Aero-Propulsion System
Simulation (C-MAPSS); NASA/TM-2007-215026; NASA: Washington, DC, USA, 1 October 2007.
40. Van der Drift, A. Evolutionary selection, a principle governing growth orientation in vapour-deposited
layers. Philips Res. Rep. 1967, 22, 267–288.
© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access
article distributed under the terms and conditions of the Creative Commons Attribution
(CC BY) license (http://creativecommons.org/licenses/by/4.0/).
