
Received: 17 September 2020 Revised: 9 May 2021 Accepted: 2 June 2021

DOI: 10.1002/stc.2811

RESEARCH ARTICLE

A probabilistic Bayesian recurrent neural network for remaining useful life prognostics considering epistemic and aleatory uncertainties

Jose Caceres1 | Danilo Gonzalez1 | Taotao Zhou2 | Enrique Lopez Droguett3

1 Mechanical Engineering Department, University of Chile, Santiago, Chile
2 Center for Risk and Reliability, University of Maryland, College Park, Maryland, USA
3 Department of Civil and Environmental Engineering, and B. John Garrick Institute for the Risk Sciences, University of California Los Angeles, Los Angeles, California, USA

Correspondence
Taotao Zhou, Center for Risk and Reliability, University of Maryland, College Park, MD, USA.
Email: [email protected]

Summary
Deep learning-based approaches have emerged as a promising solution to handle big machinery data from multi-sensor suites in complex physical assets and predict their remaining useful life (RUL). However, most recent deep learning-based approaches deliver a single-point estimate of RUL, as these models represent the weights of a neural network as deterministic values and hence cannot convey uncertainty in the RUL prediction. This practice usually provides overly confident predictions that might cause severe consequences in safety-critical industries. To address this issue, this paper proposes a probabilistic Bayesian recurrent neural network (RNN) for RUL prognostics considering epistemic and aleatory uncertainties. The epistemic uncertainty is handled by Bayesian RNN layers, extensions of the Frequentist RNN layers using the Flipout method. The aleatory uncertainty is covered by a probabilistic output that follows a Gaussian distribution parameterized by the two neurons in the output layer. The network is trained using Bayes by backprop with the Flipout method. The proposed model is demonstrated on the open-access Commercial Modular Aero-Propulsion System Simulation (C-MAPSS) dataset of turbofan engines, together with a comparative study against the Frequentist RNN counterparts, the Monte Carlo Dropout-based RNN, and the state-of-the-art models for the C-MAPSS datasets. The results demonstrate the promising performance and robustness of the proposed model in RUL prognostics.

KEYWORDS
aleatory uncertainty, Bayesian methods, epistemic uncertainty, prognostics and health
management, recurrent neural networks, remaining useful life

1 | INTRODUCTION

Modern engineered systems are generally configured with hardware, software, human, organizations, and their interac-
tions.1 These features exponentially increase the complexity of systems and, in turn, boost the possibility of system fail-
ure. Proper understanding and modeling of such system behaviors are important and urgent to ensure the reliability
and safety of system design and operations. With the rapid advances in the Internet of Things (IoT), modern engineered
systems tend to be increasingly instrumented with network-connected devices, and massive quantities of multi-
dimensional data have been generated from multi-sensor suites.2 These big machinery data have been recognized as a

Struct Control Health Monit. 2021;28:e2811. wileyonlinelibrary.com/journal/stc © 2021 John Wiley & Sons, Ltd. 1 of 21
https://doi.org/10.1002/stc.2811

valuable resource, based on which one can leverage deep learning techniques to uncover and explore hidden features
to gain insights into complex system performance.3
In the past decade, prognostics and health management (PHM) has attracted growing attention from both academia
and industry. PHM utilizes sensor technology and data analytics to detect the degradation of engineered systems, diag-
nose the type of faults, predict the remaining useful life (RUL), and proactively manage failures. Typically, the raw sensor data are first preprocessed to enhance data quality; fault-relevant features are then identified; and techniques such as machine learning-based approaches are utilized to formulate a predictive model for diagnostic and prognostic
purposes. Most PHM studies focus on critical machinery components and infrastructures including bearings,4,5 gears,6,7
batteries,8,9 bridges,10,11 and railways.12 However, these studies are mainly developed based on conventional machine learning models with shallow configurations. Achieving satisfactory performance may require large amounts of time and domain expertise to manually design features.13
More recently, deep learning has emerged as a promising approach that enables handling big machinery data and automatically learning feature representations from data without deep domain knowledge. The key idea of deep learning
is to use multiple layers to progressively learn a more abstract and composite representation of the input and ultimately
formulate an end-to-end predictive framework.14 Typically, some well-known deep learning models include
autoencoders, convolutional neural networks (CNNs), recurrent neural networks (RNNs), deep belief networks, and
deep Boltzmann machines. Variants of the above models are also under active development such as variational
autoencoders, generative adversarial networks, and long short-term memory (LSTM). In the context of PHM, research
has been conducted to explore the advantages of deep learning techniques. For instance, Modarres et al.15 developed a
CNNs-based method for automated damage recognition with applications to real concrete bridge crack images; Zhang
et al. presented an LSTM RNN model to predict the RUL of lithium-ion batteries.16 Chen et al. proposed a domain-
adaptation-based transfer learning approach to diagnose gear faults under varying working conditions.17 Verstraete
et al. developed an unsupervised and semi-supervised deep learning-enabled generative adversarial network-based
methodology for fault diagnostics of rolling element bearings.18 Correa-Jullian et al.19 presented an LSTM-based perfor-
mance prediction framework for a solar thermal system and demonstrated its advantages over shallow learning models.
Interested readers can find a comprehensive review of the applications of deep learning to handle big machinery data in Jia et al.20 and Zhao et al.21
The above research is mainly developed based on Frequentist deep learning models that represent the weights as a
deterministic value and thus cannot properly convey uncertainty in the predictions. This practice usually provides
overly confident predictions, in turn, leading to unreliable decisions that might cause severe deleterious consequences
in safety-critical industries. Therefore, there is an urgent need to properly consider the uncertainties22,23 attributed to
the incomplete knowledge (i.e., epistemic uncertainty) and the noisy observations (i.e., aleatory uncertainty) in the deep
learning-based prognostic method. Bayesian methods have long been used to frame a principled approach for uncer-
tainty quantification in making reasoned decisions for engineering applications.24 For instance, Sankararaman and
Mahadevan25 developed a Bayesian approach to facilitate quantification and updating of the uncertainty in damage
diagnosis. Arangio and Beck26 proposed a two-step Bayesian framework for the damage identification and quantifica-
tion in a long-span suspension bridge. These successful works highlight the advantages of Bayesian methods and
suggest that combining deep learning and Bayesian methods could enable a unified treatment of uncertainty
and improve decision-making.27
There have been limited studies considering the uncertainty of deep learning methods in the context of PHM. Peng et al.28 presented a deep learning approach from a Bayesian viewpoint to address uncertainty in health
prognostics and implement the Bayesian approximation using Monte Carlo Dropout (MC Dropout).29 Unlike the regu-
lar dropout, MC Dropout is applied at both training and testing steps. The addition of dropout between every layer can
switch off some portion of neurons in each layer and generate random predictions as samples from a probability distri-
bution that is considered equivalent to performing approximate variational inference. It is worth noting that the approach of Peng et al.28 suffers from two main drawbacks. First, MC Dropout can only capture the epistemic uncertainty regarding the weights by activating or deactivating neurons but ignores the aleatory uncertainty due to inherent noise in the data.
Second, MC Dropout has been criticized for not being able to converge with increasing amounts of data and performing
poorly even on a simple linear network.30 This implies that MC Dropout provides no Bayesian approximation, and the
corresponding deep learning methods would be problematic in support of reliable uncertainty quantification for health
prognostics.
To properly consider the epistemic and aleatory uncertainties, this paper proposes a probabilistic Bayesian RNN for RUL prognostics. This is accomplished in two aspects of the network configuration. Firstly, develop

Bayesian RNN layers by extending the Frequentist RNN layers based on the Flipout method. The weights of Bayesian
RNN layers are represented in the form of a probability distribution rather than a deterministic value. Given a prior dis-
tribution specified for the weights of the neural network, a regularization effect is introduced and the epistemic uncer-
tainty in RUL predictions can be captured through the Bayesian posterior distribution of the weights. Secondly,
consider the aleatory uncertainty by a probabilistic output of RUL prediction following a Gaussian distribution. The
corresponding mean and standard deviation are estimated from the two-dimensional output produced by a dense-flipout layer, which is an extension of the Frequentist dense layer using the Flipout method. In doing so, both the mean and the noise of the observations can be learned from the input data, in turn allowing one to handle the heterogeneous noise commonly encountered in PHM and achieve a more accurate RUL prediction. The network training is conducted through
the Bayes by backprop with the Flipout method that improves the scalability and training efficiency with a lower vari-
ance of gradient estimate.
The proposed method is demonstrated by an open-access turbofan engine dataset31 known as C-MAPSS using four
architectural types including Vanilla RNN (VRNN), LSTM, gated recurrent units (GRUs), and Just Another NETwork
(JANET). Furthermore, comparative studies are presented to show the advantages of the proposed model over the
Frequentist RNN; the advantages of Bayes by backprop over the MC Dropout through the model training; and
the improved performance of the proposed model in relation to state-of-the-art models developed for the C-MAPSS
dataset.
The remainder of this paper is organized as follows. Section 2 summarizes the background of RNN, Bayesian
methods, and weight perturbations to make the discussion self-contained. Section 3 presents the proposed probabilistic
Bayesian RNN for RUL prognostics. Section 4 demonstrates the proposed model using the turbofan engine dataset and
provides a benchmark study of different RNN architectures, different inference algorithms, and state-of-the-art predic-
tive models. Section 5 provides concluding remarks.

2 | B ACKGROUND

2.1 | Recurrent neural networks

RNNs are a family of neural networks specialized in processing sequential data. The key idea is to configure a single building unit, called the cell, that memorizes information from the current input and passes it through the same cell sequentially, producing a single output at every time step, called the hidden state. This hidden state also serves as extra input for the next time step, alongside the actual input for that step, so the cell has information about the last result predicted. With the hidden state of the past time step and the current input, the cell can capture the temporal behavior of the data and produce an output for the current time step accordingly. As such, RNN architectures are generally composed of a single cell or a stack of cells, and the network of neuron-like nodes is organized into successive layers. Besides the VRNN,32 some RNN variants were developed with different cell designs to tackle vanishing gradients and/or computational issues: LSTM,33 GRU,34 and JANET.35

2.2 | Bayesian methods

Bayesian methods provide a principled way to represent uncertainty by placing a prior distribution over model parame-
ters or latent variables and then marginalizing these parameters given any new observations to obtain an updated distri-
bution, referred to as a posterior distribution.36 The posterior distribution is often computationally intractable, and hence, approximate inference techniques are required, such as Markov chain Monte Carlo (MCMC) and variational inference.37 Deriving the posterior distribution in the context of neural networks becomes more challenging due to the large number of data points and parameters. MCMC methods generally converge slowly and are computationally expensive, while variational inference methods can scale to large models and datasets. Therefore, variational inference has received much attention, and two types of methods have been proposed to perform variational inference in deep learning38: (1) implicit methods that exploit model uncertainties through noisy optimization. MC Dropout29 is the most popular implicit method in practice due to its simplicity. However, there is much debate as to the validity of it being Bayesian, as discussed in Section 1; (2) explicit methods that model parameters as probability distributions. A prominent explicit method is
the Bayes by backprop39 as a generalization for the Gaussian reparameterization trick to derive a Monte Carlo estimator

of the evidence lower bound (ELBO) gradient, in turn, allowing one to perform mean-field variational inference by the
usual backpropagation algorithm. Therefore, our proposed model for RUL prognostic is trained based on Bayes by
backprop.

2.3 | Weight perturbation

Weight perturbation is a key step of Bayes by backprop to sample the weights stochastically during training. However, in the vanilla Bayes by backprop,39 the perturbation is sampled just once per mini-batch, and all examples in the mini-batch share a base perturbation for the sake of computational efficiency. This practice leads to a high variance of the gradient estimate because the high correlation of weight perturbations across examples makes the variance difficult to eliminate by averaging. Many research efforts have been made to reduce the variance of the gradient estimator.40,41 Most notably, Wen et al.42 proposed Flipout as an efficient implementation that performs antithetic sampling to draw quasi-independent weights, in turn decorrelating the gradients between different training examples in a mini-batch and reducing the variance of the gradient estimates. In doing so, the variance of the gradient estimate can be much reduced by averaging over a mini-batch. These operations on a mini-batch can be vectorized and hence admit an efficient implementation on GPUs.
To make the discussion in Section 3 self-contained, we present a brief description of the mathematical operation of the Flipout method as follows. In the Flipout method, the variational distribution is represented as the sum of a mean W̄ and a perturbation ΔW following a symmetric distribution around zero. ΔŴ is a base weight perturbation drawn from the perturbation distribution and is shared by the training examples in a batch. Then the element-wise multiplication of ΔŴ by a different rank-one sign matrix for each example in the mini-batch results in ΔW_n, which is identically distributed to ΔŴ and generates an unbiased estimator for the loss gradients:

ΔW_n = ΔŴ ∘ r_n s_n^⊤   (1)

where r_n and s_n are random sign vectors sampled uniformly from ±1 and ∘ denotes element-wise multiplication. Flipout applies to a variety of network architectures.42 Consider a Frequentist dense layer and denote the corresponding operation as Dense() in Equation 2. Modifying the weights of the Frequentist dense layer based on the Flipout method results in the dense-flipout layer in Equation 3, which enables one to quantify uncertainties in the prediction of y_n. As r_n and s_n are sampled independently, one can obtain derivatives with respect to the weights and the input, thus allowing backpropagation.

y_n = Dense(W^⊤ x_n)   (2)

y_n = Dense((W̄ + ΔŴ ∘ r_n s_n^⊤)^⊤ x_n)   (3)
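The Flipout operation in Equations 1-3 can be sketched in NumPy for a mini-batch. This is an illustrative sketch under stated assumptions, not the authors' implementation: the function name `dense_flipout`, the shapes, and the omission of a bias term are all choices made for this example.

```python
import numpy as np

rng = np.random.default_rng(0)

def dense_flipout(x, W_mean, dW_base, rng):
    """Flipout forward pass for a dense layer (cf. Equations 1-3).

    x:       (batch, in_dim) mini-batch of inputs
    W_mean:  (in_dim, out_dim) mean of the variational weight distribution
    dW_base: (in_dim, out_dim) base perturbation shared across the batch
    Each example n receives a pseudo-independent perturbation
    dW_n = dW_base * r_n s_n^T, with r_n and s_n random sign vectors.
    """
    batch = x.shape[0]
    s = rng.choice([-1.0, 1.0], size=(batch, W_mean.shape[0]))  # input-side signs
    r = rng.choice([-1.0, 1.0], size=(batch, W_mean.shape[1]))  # output-side signs
    # mean term plus the decorrelated perturbation term:
    # ((x ∘ s) dW_base) ∘ r equals x (dW_base ∘ s r^T) for each example
    return x @ W_mean + ((x * s) @ dW_base) * r

x = rng.normal(size=(4, 3))
W_mean = rng.normal(size=(3, 2))
dW_base = 0.1 * rng.normal(size=(3, 2))
y = dense_flipout(x, W_mean, dW_base, rng)
print(y.shape)  # (4, 2)
```

Because the sign matrices have zero mean, the perturbation term averages out over sign draws, so the layer's expected output is the mean term `x @ W_mean`.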

3 | P RO P OS ED M ODEL FO R RU L PR OGN O STI CS

This section presents a probabilistic Bayesian RNN that quantifies both epistemic uncertainty and aleatory uncertainty
in the context of RUL prognostics. Particularly, the epistemic uncertainty is captured by the Bayesian posterior over the
weights that are represented by a probabilistic distribution. The aleatory uncertainty is reflected by probabilistically
treating the model output that parameterizes the RUL prediction with a Gaussian distribution. The network is trained
using Bayes-by-backprop algorithm with the Flipout method. Ultimately, multiple samples can be drawn to obtain a
probability distribution characterizing the uncertainty of RUL predictions. Figure 1 displays a flowchart of the proposed
framework consisting of three main parts: network configuration, network training, and RUL prognostics, the details of
which are discussed in the following subsections.
15452263, 2021, 10, Downloaded from https://onlinelibrary.wiley.com/doi/10.1002/stc.2811 by Xian Jiaotong University, Wiley Online Library on [19/05/2024]. See the Terms and Conditions (https://onlinelibrary.wiley.com/terms-and-conditions) on Wiley Online Library for rules of use; OA articles are governed by the applicable Creative Commons License
CACERES ET AL. 5 of 21

3.1 | Network configuration

Suppose the health condition of a component or system can be described by a temporal series consisting of multi-
dimensional sensor data. For the data preprocessing, a sliding window is applied to cut the signal into batches, and the preprocessed data are reshaped into the format [number of windows, window length, number of sensors]. As illus-
trated in Figure 2, a sequential data Xi acquired in T time steps is fed into a Bayesian RNN layer with K units. This pro-
duces a vector corresponding to each recurrent unit, which is followed by a flatten layer. Then a dense-flipout layer is
used to produce the results with two output dimensions. Note that this network has twice as many parameters as its Frequentist counterpart because the proposed network is fully Bayesian and every weight in the probabilistic Bayesian RNN is parameterized with two parameters instead of one. The reasoning behind the network configuration design in terms of
uncertainty quantification is explained below.
The aleatory uncertainty is addressed by creating a probabilistic output that follows a Gaussian distribution. A
dense-flipout layer is used to produce the mean and standard deviation parameters of the Gaussian distribution in the
output layer. As such, both the observation estimate and noise are treated as variables and can be learned from the data
input. This allows one to handle data with heterogeneous noise, rather than the common assumption of homogeneous
noise, leading to a more accurate prediction of RUL.
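As a hedged illustration of this probabilistic output, the sketch below maps a two-dimensional raw network output to the mean and standard deviation of a Gaussian and evaluates the corresponding negative log-likelihood. The softplus transform used to keep the standard deviation positive is an assumption of this sketch rather than a detail stated in the text.

```python
import numpy as np

def softplus(z):
    return np.log1p(np.exp(z))

def gaussian_output(raw):
    """Map a (batch, 2) raw output to (mu, sigma); sigma kept positive via softplus."""
    mu, sigma = raw[:, 0], softplus(raw[:, 1]) + 1e-6
    return mu, sigma

def gaussian_nll(y, mu, sigma):
    """Gaussian negative log-likelihood: the aleatory (data-noise) part of the loss.
    Because sigma is predicted per input, the noise is heteroscedastic."""
    return np.mean(0.5 * np.log(2 * np.pi * sigma**2) + (y - mu)**2 / (2 * sigma**2))

raw = np.array([[100.0, 0.5], [80.0, -1.0]])  # illustrative raw outputs for 2 engines
mu, sigma = gaussian_output(raw)
y_true = np.array([98.0, 83.0])               # illustrative ground-truth RUL values
loss = gaussian_nll(y_true, mu, sigma)
```

Minimizing this loss lets the network trade off fit against its own predicted noise level, which is how heterogeneous noise is accommodated.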
The epistemic uncertainty is considered by the Bayesian RNN layer, which is created by modifying the weights of
the Frequentist recurrent cell operations using the Flipout method. To illustrate the extension of these recurrent cells,

FIGURE 1 Flowchart of the probabilistic Bayesian recurrent neural network for RUL prognostics

FIGURE 2 Illustration of the network configuration of the proposed probabilistic Bayesian RNN
15452263, 2021, 10, Downloaded from https://onlinelibrary.wiley.com/doi/10.1002/stc.2811 by Xian Jiaotong University, Wiley Online Library on [19/05/2024]. See the Terms and Conditions (https://onlinelibrary.wiley.com/terms-and-conditions) on Wiley Online Library for rules of use; OA articles are governed by the applicable Creative Commons License
6 of 21 CACERES ET AL.

we start with the Frequentist VRNN layer. The cell of Frequentist VRNN is designed with one hidden gate, the activa-
tion function of which usually is a hyperbolic tangent. Equation 4 describes the operation inside the cell of a
Frequentist VRNN:

h_t = tanh(U_h x_t + W_h h_{t-1} + b_h)   (4)

where h_t is the hidden state at time t, t ∈ {0, …, k} is the discrete-time index with k the prediction horizon, U_h denotes the weight matrix connecting the layer to the input vector x_t at time t, W_h denotes the weight matrix connecting the layer to the previous hidden state h_{t-1}, b_h denotes the bias term, and tanh is the hyperbolic tangent function. Then Equation 4 can be rewritten in the form of Equation 5 based on the operations of dense layers.
    
h_t = tanh(Dense(U_h^⊤ x_t) + Dense(W_h^⊤ h_{t-1}))   (5)
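Equation 4 can be sketched as a plain NumPy loop. This is a minimal Frequentist VRNN forward pass over one sequence; the function and variable names are chosen for illustration only.

```python
import numpy as np

def vrnn_cell(x_seq, U_h, W_h, b_h):
    """Frequentist vanilla RNN (Equation 4): h_t = tanh(U_h x_t + W_h h_{t-1} + b_h).

    x_seq: (T, input_dim) one input sequence; returns the hidden state at each step.
    """
    h = np.zeros(W_h.shape[0])  # initial hidden state
    states = []
    for x_t in x_seq:
        h = np.tanh(U_h @ x_t + W_h @ h + b_h)
        states.append(h)
    return np.stack(states)

rng = np.random.default_rng(1)
T, n_in, n_hidden = 5, 3, 4
states = vrnn_cell(rng.normal(size=(T, n_in)),
                   rng.normal(size=(n_hidden, n_in)) * 0.1,
                   rng.normal(size=(n_hidden, n_hidden)) * 0.1,
                   np.zeros(n_hidden))
print(states.shape)  # (5, 4)
```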

As noted in Section 2.3, transforming a Frequentist dense layer into a dense-flipout layer would empower uncertainty
quantification. Therefore, replacing the two Frequentist dense layers in Equation 5 with two dense-flipout layers,
respectively, would result in an ensemble dense-flipout layer as shown in Equation 6. This ultimately provides a Bayes-
ian VRNN layer to quantify uncertainty when dealing with sequential data.
    
h_t = tanh( Ū_h x_t + (ΔÛ_h (x_t ∘ s_{h,t})) ∘ r_{h,t} + W̄_h h_{t-1} + (ΔŴ_h (h_{t-1} ∘ s_{h,t})) ∘ r_{h,t} )   (6)

where Ū_i and W̄_i are the means of the weight matrices, and ΔÛ_i and ΔŴ_i are the perturbations of the weight matrices applied to the input vector and the hidden vector, respectively. r_{i,t} and s_{i,t} are independent random sign vectors for each time step and each cell. Therefore, W̄_i, Ū_i, ΔŴ_i, and ΔÛ_i are the trainable parameters of the recurrent layers that need to be learned from the training data.
Note that it is straightforward to extend other architectural types of Frequentist RNN to Bayesian RNN. For
instance, the LSTM cell has four gates: the input gate, the current state gate, the forget gate, and the output gate, all of
which should be replaced with dense-flipout layers to enable the Bayesian behaviors for uncertainty quantification. The
details are provided in Appendix A on the extension of LSTM, GRU, and JANET.
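A minimal NumPy sketch of one Bayesian VRNN step following Equation 6 is given below. The batch dimension is dropped, the bias is omitted, and all names and weight scales are illustrative assumptions of this sketch.

```python
import numpy as np

def bayesian_vrnn_step(x_t, h_prev, U_mean, dU_base, W_mean, dW_base, rng):
    """One step of the Bayesian VRNN cell (cf. Equation 6): both weight matrices
    are a variational mean plus a Flipout-perturbed base perturbation, with fresh
    sign vectors drawn at each step."""
    n_hidden, n_in = U_mean.shape
    r = rng.choice([-1.0, 1.0], size=n_hidden)    # output-side signs
    s_x = rng.choice([-1.0, 1.0], size=n_in)      # input-side signs
    s_h = rng.choice([-1.0, 1.0], size=n_hidden)  # hidden-side signs
    pre = (U_mean @ x_t + (dU_base @ (x_t * s_x)) * r
           + W_mean @ h_prev + (dW_base @ (h_prev * s_h)) * r)
    return np.tanh(pre)

rng = np.random.default_rng(2)
n_in, n_hidden = 3, 4
h = bayesian_vrnn_step(rng.normal(size=n_in), np.zeros(n_hidden),
                       rng.normal(size=(n_hidden, n_in)) * 0.1,
                       rng.normal(size=(n_hidden, n_in)) * 0.01,
                       rng.normal(size=(n_hidden, n_hidden)) * 0.1,
                       rng.normal(size=(n_hidden, n_hidden)) * 0.01,
                       rng)
```

Extending an LSTM, GRU, or JANET cell follows the same pattern: each gate's dense operation gets its own mean, base perturbation, and sign vectors.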

3.2 | Network training

Denote the training dataset as D = {X, Y}, where X = {x_i}_{i=1}^{T} represents the multi-sensor monitoring data and Y = {y_i}_{i=1}^{T} represents the RUL data. The network training aims to estimate the parameters θ of a variational distribution q(w|θ) that approximates the true posterior distribution p(w|D). This is conducted by minimizing the cost function
known as the ELBO in Equation 7. In particular, the training process needs a balance between two terms in ELBO:
(a) the complexity cost, which represents the Kullback–Leibler (KL) divergence between the variational distribution q
(wj θ) and the prior distribution of the weights p(w), and (b) the likelihood cost, which is the expected value of the
likelihood.

F(D, θ) = KL(q(w|θ) || p(w)) - E_{q(w|θ)}[log p(D|w)]   (7)

where F(D, θ) is the ELBO-based cost, the KL term is the complexity cost, and the expectation term is the likelihood cost.

By rearranging the term of complexity cost, the cost function can be written as

F(D, θ) = E_{q(w|θ)}[log q(w|θ)] - E_{q(w|θ)}[log p(w)] - E_{q(w|θ)}[log p(D|w)]   (8)

To reduce the computational complexity, the cost function can be approximated by an unbiased Monte Carlo estimate
in Equation 9. Monte Carlo samples are generated through weight perturbation, and the reparameterization trick is
applied to guarantee that backpropagation works. Particularly, the Flipout method is adopted for the weight perturba-
tion to enhance the scalability and training efficiency with a reduced variance of gradient estimate.

F(D, θ) ≈ (1/N) Σ_{i=1}^{N} [ log q(w^(i)|θ) - log p(w^(i)) - log p(D|w^(i)) ]   (9)

where the superscript denotes the ith Monte Carlo sample w^(i) drawn from the variational posterior distribution q(w|θ). In the backward pass, gradients of the variational parameters θ are calculated via backpropagation so that their values can be updated by an optimizer. Each step of the optimization process is described as follows:

1. Sample ϵ ~ N(0, I).
2. Let the base weight perturbation Δŵ = log(1 + exp(ρ)) ∘ ϵ.
3. Sample two random sign vectors r_n and s_n uniformly from ±1.
4. Let the weight w = w̄ + Δŵ ∘ r_n s_n^⊤.
5. Let f(w, θ) = log q(w|θ) - log p(w) - log p(D|w), where θ = (w̄, Δŵ).
6. Calculate the gradient with respect to w̄:

   Δ_w̄ = ∂f(w, θ)/∂w + ∂f(w, θ)/∂w̄

7. Calculate the gradient with respect to Δŵ:

   Δ_Δŵ = ∂f(w, θ)/∂w ∘ ϵ + ∂f(w, θ)/∂Δŵ

8. Update the variational parameters:

   w̄ ← w̄ - α Δ_w̄
   Δŵ ← Δŵ - α Δ_Δŵ
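The update loop above can be illustrated on a toy scalar regression. The sketch below follows the listed steps with a single weight, a standard normal prior, unit observation noise, and finite-difference gradients standing in for backpropagation; all of these simplifications, and the learning rate, are assumptions of this sketch rather than the paper's settings.

```python
import numpy as np

rng = np.random.default_rng(3)
x_data = rng.normal(size=50)
y_data = 2.0 * x_data + 0.1 * rng.normal(size=50)  # true weight is 2.0

def f(w, mu, sigma):
    """f(w, theta) = log q(w|theta) - log p(w) - log p(D|w), a single-sample
    estimate of the cost in Equation 9 (N = 1)."""
    log_q = -0.5 * np.log(2 * np.pi * sigma**2) - (w - mu)**2 / (2 * sigma**2)
    log_prior = -0.5 * np.log(2 * np.pi) - w**2 / 2       # p(w) = N(0, 1)
    resid = y_data - w * x_data
    log_lik = np.sum(-0.5 * np.log(2 * np.pi) - resid**2 / 2)  # unit noise
    return log_q - log_prior - log_lik

mu, rho, lr = 0.0, -3.0, 5e-4
for _ in range(2000):
    sigma = np.log1p(np.exp(rho))          # softplus keeps sigma positive
    eps = rng.normal()                     # step 1
    sign = rng.choice([-1.0, 1.0])         # step 3, scalar analogue of r s^T
    d = 1e-5                               # finite-difference step
    # gradient w.r.t. mu: the sampled weight w = mu + sigma*eps*sign moves with mu
    g_mu = (f(mu + d + sigma * eps * sign, mu + d, sigma)
            - f(mu - d + sigma * eps * sign, mu - d, sigma)) / (2 * d)
    sig_p, sig_m = np.log1p(np.exp(rho + d)), np.log1p(np.exp(rho - d))
    g_rho = (f(mu + sig_p * eps * sign, mu, sig_p)
             - f(mu + sig_m * eps * sign, mu, sig_m)) / (2 * d)
    mu, rho = mu - lr * g_mu, rho - lr * g_rho  # step 8
```

After training, `mu` settles near the conjugate posterior mean of the weight (slightly below 2.0 because of the prior), showing that noisy single-sample ELBO gradients are sufficient for the update rule.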

3.3 | RUL prognostics with uncertainty quantification

Suppose the model p(wj D) is successfully trained, in which the epistemic uncertainty is covered by the variational dis-
tribution q(wj θ) and the aleatory uncertainty is covered by the Gaussian probability distribution g(Dj τ) parametrized
by τ = (μ, σ) in the output layer. Then it is desired to make the RUL prognostics y* for a new data input x*. Particularly,
the RUL prediction estimate is accomplished by a Monte Carlo estimate of RUL by drawing M samples through sto-
chastic forward passes of the network. A point estimate of RUL prediction is determined by computing the median of
the predictions regarding each sample. The uncertainty of RUL prediction is characterized by intervals with a certain
confidence level.

p(y*|x*, D) = E_{p(w|D)}[ g(y*|τ) p(τ|x*, w) ]
            ≈ E_{q(w|θ)}[ g(y*|τ) p(τ|x*, w) ]
            ≈ (1/M) Σ_{m=1}^{M} g(y*|τ) p(τ|x*, w_m)   (10)
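The Monte Carlo prediction of Equation 10 can be sketched as follows. Here `toy_net` is a stand-in for the trained Bayesian network (an assumption of this sketch): each forward pass resamples the weights (epistemic part) and each draw then samples the Gaussian output (aleatory part).

```python
import numpy as np

def predict_rul(stochastic_forward, x_star, M=200, level=0.90):
    """Monte Carlo RUL prediction (cf. Equation 10): M stochastic forward passes,
    each yielding (mu, sigma); one RUL draw per pass, then the median and a
    central prediction interval are reported."""
    rng = np.random.default_rng(42)
    draws = np.empty(M)
    for m in range(M):
        mu, sigma = stochastic_forward(x_star, rng)  # epistemic: weights resampled
        draws[m] = rng.normal(mu, sigma)             # aleatory: Gaussian output
    lo, hi = np.percentile(draws, [100 * (1 - level) / 2, 100 * (1 + level) / 2])
    return float(np.median(draws)), (float(lo), float(hi))

def toy_net(x, rng):
    """Illustrative stand-in network: epistemic jitter on a single weight."""
    w = 1.0 + 0.05 * rng.normal()
    return w * float(np.sum(x)), 2.0                 # (mu, sigma)

point, (lo, hi) = predict_rul(toy_net, np.array([30.0, 20.0]))
```

The interval width reflects both uncertainty sources: the spread of `mu` across passes and the predicted observation noise `sigma`.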

4 | APPL ICATION TO THE RUL PROGNOSTICS OF TURBOFAN ENGINES

This section presents the application of the proposed approach to predict the RUL of a turbofan engine. Section 4.1 pro-
vides a brief description of the turbofan engine dataset. Section 4.2 presents the data preparation and model develop-
ment. Section 4.3 discusses the results of RUL predictions with uncertainty quantification and the performance of the
probabilistic Bayesian RNN regarding four different architectures. The model performance assessment is carried out
utilizing three comparative studies as follows, and the details are provided in Section 4.4. The model was developed
based on Python v3.6, TensorFlow v1.13, and TensorFlow Probability v0.643 using a desktop with Intel Core i7 8700
CPU, 32-GB DDR4 RAM, and 12-GB NVIDIA Titan XP GPU.

• Comparison between the probabilistic Bayesian RNN and the Frequentist RNN.
• Comparison of the probabilistic Bayesian RNN trained with Bayesian approximations based on either Bayes by backprop or MC Dropout.
• Comparison between the probabilistic Bayesian RNN and state-of-the-art models developed for the C-MAPSS datasets.

4.1 | Problem and data description

The turbofan engine datasets were generated by the Commercial Modular Aero-Propulsion System Simulation
(C-MAPSS) tool developed by NASA.31 The operational status of each engine starts in a healthy state with some initial
degradation and degrades over service time in units of operational cycles until a failure occurs. Various degradation
patterns are simulated by the randomly chosen initial degradation, the pre-defined fault injection parameters, and
operational condition. Then multidimensional sensor data are collected and can be utilized to characterize the
degradation behaviors of a turbofan engine.
As shown in Table 1, the C-MAPSS contains four datasets that represent varying degrees of complexity due to the
different operational conditions and simultaneous fault modes involved in the dataset generation. Each dataset is
divided into a training set and a test set. The training set records the sensory data until failures. The number of training
samples varies depending on the number of engines and the number of data points per engine. The test set records the
sensory data up to some random cycle within each engine's useful life, and the corresponding RUL is recorded as the ground-truth target to be predicted. Table 2 displays the 21 sensor measurements as recorded in each dataset.44 It is
noticed that the measurements in sensors #1, #5, #6, #10, #16, #18, and #19 remain almost constant and are not infor-
mative to reflect the evolution of engine degradation. As such, the above seven sensors are excluded in each dataset for
the following analysis.
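Screening out such near-constant sensors can be sketched with a simple variance check; the threshold, function name, and synthetic data below are illustrative assumptions, not the paper's procedure.

```python
import numpy as np

def informative_sensors(data, threshold=1e-6):
    """Return the indices of sensors whose measurements actually vary.
    data: (samples, sensors). Near-constant channels, like sensors #1, #5, #6,
    #10, #16, #18, and #19 in C-MAPSS, carry no degradation signal."""
    return np.where(np.std(data, axis=0) > threshold)[0]

rng = np.random.default_rng(4)
data = rng.normal(size=(100, 5))
data[:, 2] = 14.62  # one constant channel, mimicking a non-informative sensor
keep = informative_sensors(data)
print(keep)  # [0 1 3 4]
```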

TABLE 1 Description of the four turbofan engine datasets in C-MAPSS

Dataset   Operational conditions   Fault modes   Engines in training set   Engines in test set
FD001     1                        1             100                       100
FD002     6                        1             258                       259
FD003     1                        2             100                       100
FD004     6                        2             248                       248

4.2 | Data preparation and model development

The dataset, consisting of 14 sensor measurements (features), is preprocessed by min–max normalization, which rescales each sensor measurement into the range [0, 1]. The normalized data are then reshaped into the format [windowNo, windowLength, sensorNo], where windowNo is the number of windows, windowLength is the number of time steps in each window, and sensorNo is the number of sensor measurements. These data are generated by applying a sliding window, which increases the number of training samples, standardizes the length of the model input, and accelerates model training. Figure 3 illustrates the data generation process using a sliding time window for an engine with 11 operational cycles until failure. Each row of the raw data represents the 14 sensor measurements collected at one time step. The window length is six operational cycles, and the sliding window is moved one operational cycle at a time. The ground-truth RUL value of each engine in the training dataset is constructed from a linear RUL function with a one-step-ahead prediction, that is, by subtracting the current cycle from the total number of cycles within the useful lifetime. Furthermore, the length of the sliding window is fixed and is selected according to the minimum

TABLE 2 Description of the sensor measurements in each dataset of C-MAPSS

Index | Description                                                 | Units
1     | Total temperature at fan inlet                              | R
2     | Total temperature at low-pressure compressor (LPC) outlet   | R
3     | Total temperature at high-pressure compressor (HPC) outlet  | R
4     | Total temperature at low-pressure turbine (LPT) outlet      | R
5     | Pressure at fan inlet                                       | psia
6     | Total pressure in bypass duct                               | psia
7     | Total pressure at HPC outlet                                | psia
8     | Physical fan speed                                          | rpm
9     | Physical core speed                                         | rpm
10    | Engine pressure ratio                                       | -
11    | Static pressure at HPC outlet                               | psia
12    | Ratio of fuel flow to Ps30                                  | pps/psi
13    | Corrected fan speed                                         | rpm
14    | Corrected core speed                                        | rpm
15    | Bypass ratio                                                | -
16    | Burner fuel–air ratio                                       | -
17    | Bleed enthalpy                                              | -
18    | Demanded fan speed                                          | rpm
19    | Demanded corrected fan speed                                | rpm
20    | High-pressure turbine (HPT) coolant bleed                   | lbm/s
21    | Low-pressure turbine (LPT) coolant bleed                    | lbm/s

F I G U R E 3 An illustration of data generation using a sliding time window size of 6 for an engine with 11 operational cycles within the
useful lifetime

number of operational cycles among the engines in each dataset. The resulting sliding window length and the size of each prepared dataset are shown in Table 3.
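The normalization and sliding-window preparation described above can be sketched as follows. This is an illustrative sketch under the stated assumptions (stride of one cycle, linear RUL target), not the authors' code; function names are hypothetical:

```python
import numpy as np

def minmax_scale(X, lo, hi):
    """Rescale each sensor column into [0, 1] using training-set min/max."""
    return (X - lo) / (hi - lo + 1e-12)

def make_windows(X, window_len):
    """Slide a window of `window_len` cycles over X ([n_cycles, n_sensors])
    with a stride of one cycle. Returns windows of shape
    [n_windows, window_len, n_sensors] and linear RUL targets, i.e., the
    cycles remaining after the last cycle covered by each window."""
    n_cycles = X.shape[0]
    windows, ruls = [], []
    for start in range(n_cycles - window_len + 1):
        end = start + window_len
        windows.append(X[start:end])
        ruls.append(n_cycles - end)   # linear RUL: 0 at failure
    return np.stack(windows), np.array(ruls)

# Engine with 11 cycles and 14 sensors, window length 6 (as in Figure 3).
X = np.random.default_rng(0).random((11, 14))
Xn = minmax_scale(X, X.min(axis=0), X.max(axis=0))
W, y = make_windows(Xn, window_len=6)   # 6 windows, targets 5 ... 0
```

For an 11-cycle engine and a window of 6, this yields six windows with RUL targets 5, 4, 3, 2, 1, 0, matching the construction in Figure 3.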
Each of the four datasets is utilized to train and test four architectural types of RNN (i.e., VRNN, LSTM, GRU, and JANET). The model hyperparameters are tuned by grid search over the 10 major hyperparameters displayed in Table 4, whose last column lists the best value found for each. The resulting model architecture consists of one recurrent layer of 32 units and three dense Flipout layers of 32, 16, and 2 neurons, respectively. A polynomially decaying learning rate is applied to smooth learning in the final epochs: over 60 epochs, the learning rate decays to 70% of its value at epoch 1.
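The learning-rate schedule just described can be sketched as a small function. The polynomial power is not stated in the text, so a linear polynomial (power = 1) is assumed here for illustration:

```python
def polynomial_decay_lr(epoch, initial_lr=1e-3, decay_epochs=60,
                        end_fraction=0.7, power=1.0):
    """Polynomial learning-rate schedule: starts at initial_lr (epoch 1),
    reaches end_fraction * initial_lr at `decay_epochs`, then stays flat.
    power=1.0 (linear decay) is an assumption; the paper does not state it."""
    end_lr = end_fraction * initial_lr
    t = min(max(epoch - 1, 0), decay_epochs - 1) / (decay_epochs - 1)
    return (initial_lr - end_lr) * (1.0 - t) ** power + end_lr
```

With the selected values from Table 4 (initial LR 0.001, decay over 60 epochs, end LR 70%), the schedule starts at 0.001 and flattens at 0.0007 from epoch 60 onward.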
Since this is a probabilistic model, the RUL prediction is obtained by stochastic forward passes through the network. For each run of the probabilistic Bayesian RNN, one hundred samples are drawn to characterize the distribution of the predicted RUL value for each test engine. Denote \widehat{RUL}^{k}_{i,j} as the predicted RUL value for the kth engine from the jth sample of the ith run. The median of these samples, \widehat{RUL}^{k}_{i}, is utilized as the final predicted value for the kth engine in the ith run. Note that the prior in the probabilistic Bayesian RNN is a mean-field Gaussian distribution with mean 0 and standard deviation 1.
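The sampling-and-median procedure can be sketched as follows. A Gaussian toy model stands in for the stochastic forward passes of the actual Bayesian RNN (each pass draws fresh weights, so repeated passes on the same input yield different predictions); all names and numbers below are illustrative:

```python
import numpy as np

rng = np.random.default_rng(42)

def stochastic_forward_passes(n_samples=100):
    """Stand-in for n_samples stochastic forward passes of a Bayesian RNN
    on one test window. In the real model each sample is one forward pass
    with freshly sampled weights; here a Gaussian toy plays that role."""
    return rng.normal(loc=80.0, scale=5.0, size=n_samples)

samples = stochastic_forward_passes(100)        # samples j = 1..100 for one run
rul_hat = np.median(samples)                    # final point prediction for the run
lower, upper = np.percentile(samples, [5, 95])  # 90% probability interval
```

The 5th/95th percentiles give the kind of 90% interval shown later in Figure 4.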
The model performance is examined by aggregating the prediction performance over all engines in the test set of each dataset. The prediction performance for the kth engine is measured by the difference between its ground-truth RUL value RUL^{k} and the value \widehat{RUL}^{k} predicted by a model. To account for random effects in training and testing, three metrics are evaluated: the mean \overline{RMSE}, the standard deviation STD, and the coefficient of variation COV of the root mean square error (RMSE), where RMSE_n is the RMSE value obtained in the nth run, N is the total number of test engines, and M is the total number of runs.

TABLE 3 Dataset size for C-MAPSS

Dataset | Window length | Sensors | Training samples | Test samples
FD001   | 20            | 14      | 14,347           | 100
FD002   | 21            | 14      | 29,526           | 259
FD003   | 30            | 14      | 17,864           | 100
FD004   | 18            | 14      | 46,326           | 248

TABLE 4 Hyperparameters in the model development for the C-MAPSS dataset

Hyperparameter        | Grid search space   | Selected value
Bayes RNN neurons     | [16, 32, 64]        | 32
Dense Flipout layers  | [1, 2, 3]           | 2
Dense Flipout neurons | [64, 32, 16]        | 32, 16
Epochs                | [50, 100, 300]      | 100
Learning rate (LR)    | [1e-2, 1e-3, 1e-4]  | 0.001
Scale                 | [1e0, 5e0, 1e1]     | 1e1
Softplus              | [1e-3, 5e-3, 1e-2]  | 1e-3
LR decay (epochs)     | [40, 60, 100]       | 60
End LR                | [50%, 70%]          | 70%
KL regularizer        | [1e-5, 5e-6, 1e-6]  | 1e-6

\overline{RMSE} = \frac{1}{M}\sum_{n=1}^{M} RMSE_n, \quad RMSE_n = \sqrt{\frac{1}{N}\sum_{k=1}^{N}\left(RUL^{k} - \widehat{RUL}^{k}_{n}\right)^{2}}    (11)

STD = \sqrt{\frac{1}{M-1}\sum_{n=1}^{M}\left(RMSE_n - \overline{RMSE}\right)^{2}}    (12)

COV = \frac{STD}{\overline{RMSE}} \times 100\%    (13)
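Equations (11)–(13) can be evaluated directly from the per-run point predictions. The sketch below is illustrative (variable names are assumptions); COV is computed as a percentage, matching how it is reported in the result tables:

```python
import numpy as np

def aggregate_metrics(rul_true, rul_pred):
    """rul_true: ground-truth RULs, shape [N] (N test engines).
    rul_pred: per-run point predictions, shape [M, N] (M runs).
    Returns the mean, sample standard deviation, and coefficient of
    variation (in %) of the per-run RMSE values."""
    rmse_per_run = np.sqrt(((rul_pred - rul_true) ** 2).mean(axis=1))  # RMSE_n
    rmse_mean = rmse_per_run.mean()
    std = rmse_per_run.std(ddof=1)   # sample standard deviation (M - 1 divisor)
    cov = 100.0 * std / rmse_mean
    return rmse_mean, std, cov

# Two runs over four engines: the run RMSEs are 2.0 and 4.0.
rul_true = np.zeros(4)
rul_pred = np.array([[2.0, 2.0, 2.0, 2.0],
                     [4.0, 4.0, 4.0, 4.0]])
rmse_mean, std, cov = aggregate_metrics(rul_true, rul_pred)
```

In this toy case the mean RMSE is 3.0 and the sample STD is sqrt(2), giving a COV of about 47%.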

4.3 | Results and discussions

Table 5 displays the results obtained by the probabilistic Bayesian RNN for the different model architectures and datasets. The best performing model (lowest RMSE) for each dataset is highlighted in bold and is selected for the following analysis. A comparison between the RUL predicted by the best performing model and the ground-truth RUL is shown in Figure 4. The uncertainty of the RUL prediction for each test engine is represented by an interval with a 90% confidence level. Furthermore, the RUL predictions of three test engines (i.e., X1, X2, and X3) are presented with their approximated distributions in the lower-left corner, where the star symbol marks the ground-truth RUL value, contrasted with the probability distribution obtained by the probabilistic Bayesian RNN. Some important insights are summarized below:

• The ground truth RUL values are bounded by 90% probability intervals of the RUL prediction. This demonstrates the
ability of the probabilistic Bayesian RNN to quantify the uncertainty based on multi-sensor time series.
• The effects of operating conditions on the uncertainty bounds of the RUL predictions can be examined by comparing two scenarios: FD002 and FD004, under complex operating conditions, versus FD001 and FD003, with a single operating condition. The former scenario results in wider RUL probability distributions (a higher level of uncertainty). Intuitively, the more complex the operating conditions, the higher the variability and, therefore, the uncertainty of the system states.
• As shown in Figure 4, the RUL predictions in the four datasets follow the expected trend: as the operating cycles increase, the RUL decreases. Also note that the model with the best results for the most complex

TABLE 5 Results based on four different architectures of the probabilistic Bayesian RNN

Model                        | FD001 RMSE | STD  | COV  | FD002 RMSE | STD  | COV
Probabilistic Bayesian VRNN  | 14.28      | 0.97 | 6.79 | 16.78      | 1.00 | 5.96
Probabilistic Bayesian LSTM  | 14.00      | 0.92 | 6.57 | 17.43      | 1.01 | 5.79
Probabilistic Bayesian GRU   | 14.08      | 1.05 | 7.46 | 17.58      | 1.03 | 5.86
Probabilistic Bayesian JANET | 15.60      | 1.02 | 6.54 | 18.44      | 0.97 | 5.26

Model                        | FD003 RMSE | STD  | COV  | FD004 RMSE | STD  | COV
Probabilistic Bayesian VRNN  | 13.86      | 1.00 | 7.22 | 20.11      | 1.24 | 6.17
Probabilistic Bayesian LSTM  | 13.70      | 1.00 | 7.30 | 21.00      | 1.07 | 5.10
Probabilistic Bayesian GRU   | 12.31      | 0.93 | 7.55 | 21.42      | 1.13 | 5.28
Probabilistic Bayesian JANET | 12.70      | 0.93 | 7.32 | 21.95      | 1.17 | 5.33

FIGURE 4 Comparison between the real RUL and RUL predictions obtained from the best performing model for each C-MAPSS
dataset

datasets (FD002 and FD004) is the probabilistic Bayesian VRNN, which does not have a long-term memory cell. Thus, it can be argued that for these more complex datasets, long-term information is not as relevant as short-term information. Overall, all the proposed models deliver good RUL predictions; no individual model outperforms all the others.

4.4 | Model performance assessment

4.4.1 | Comparison with the Frequentist RNN

This section aims to compare the proposed probabilistic Bayesian RNN with the Frequentist RNN, both of which share
the same hyperparameters. The results are summarized in Table 6 with the lowest RMSE value highlighted in bold.

• The probabilistic Bayesian RNN has lower RMSE values than its Frequentist counterparts in almost all scenarios. The only exception occurs in FD001, where the Frequentist JANET model provides a slightly lower RMSE than the probabilistic Bayesian one.
• The probabilistic Bayesian RNN significantly outperforms the Frequentist RNN when dealing with datasets FD002,
FD003, and FD004, which are more complex in terms of multiple operating conditions and the existence of two
mixed fault modes.

TABLE 6 Results of the probabilistic Bayesian RNN versus Frequentist RNN

FD001
Model | Frequentist RMSE | STD  | COV  | Bayesian RMSE | STD  | COV
VRNN  | 14.35            | 1.15 | 8.01 | 14.28         | 0.97 | 6.79
LSTM  | 15.26            | 0.93 | 6.09 | 14.00         | 0.92 | 6.57
GRU   | 14.12            | 0.38 | 2.69 | 14.08         | 1.05 | 7.46
JANET | 14.11            | 0.59 | 4.18 | 15.60         | 1.02 | 6.54

FD002
Model | Frequentist RMSE | STD  | COV  | Bayesian RMSE | STD  | COV
VRNN  | 18.99            | 0.45 | 2.37 | 16.78         | 1.00 | 5.96
LSTM  | 18.08            | 0.46 | 2.54 | 17.43         | 1.01 | 5.79
GRU   | 17.97            | 0.46 | 2.56 | 17.58         | 1.03 | 5.86
JANET | 18.98            | 0.20 | 1.05 | 18.44         | 0.97 | 5.26

FD003
Model | Frequentist RMSE | STD  | COV  | Bayesian RMSE | STD  | COV
VRNN  | 14.76            | 0.89 | 6.03 | 13.86         | 1.00 | 7.22
LSTM  | 14.34            | 0.55 | 3.84 | 13.70         | 1.00 | 7.30
GRU   | 13.60            | 0.34 | 2.50 | 12.31         | 0.93 | 7.55
JANET | 13.96            | 0.27 | 1.93 | 12.70         | 0.93 | 7.32

FD004
Model | Frequentist RMSE | STD  | COV  | Bayesian RMSE | STD  | COV
VRNN  | 23.37            | 1.22 | 5.22 | 20.11         | 1.24 | 6.17
LSTM  | 21.47            | 0.19 | 0.88 | 21.00         | 1.07 | 5.10
GRU   | 21.67            | 0.13 | 0.60 | 21.42         | 1.13 | 5.28
JANET | 22.64            | 0.67 | 2.96 | 21.95         | 1.17 | 5.33

• One should also note that even when the improvement in RMSE values is not significant, the probabilistic Bayesian RNN is capable of quantifying the uncertainty and delivering the results as an RUL probability distribution, in turn providing more information than the Frequentist RNN.
• Note that the Frequentist RNN presents RUL predictions with lower standard deviations than the probabilistic Bayesian RNN. This can be partially explained by the fact that the STD of the Frequentist RNN is calculated from only three runs, whereas the STD of the proposed probabilistic Bayesian RNN considers the deviation across all the samples from the model.

4.4.2 | Comparison with the MC Dropout-based RNN

This section discusses the performance of the probabilistic Bayesian RNN in relation to the MC Dropout RNN. For the
sake of comparison, all the considered models share the same architectures and hyperparameters. The MC Dropout is
implemented with a dropout rate of 0.25. The results are summarized in Table 7, in which the lowest RMSE value for
each dataset is highlighted in bold. Some important observations are discussed as follows:

• The probabilistic Bayesian RNN outperforms the MC Dropout RNN in almost all scenarios except FD001, in which the MC Dropout JANET, with a dropout rate of 0.25, generated a slightly lower RMSE value than the probabilistic Bayesian JANET.
• In the relatively more complex datasets (i.e., FD002, FD003, and FD004) involving multiple operational conditions
and fault modes, the probabilistic Bayesian RNN generates lower RMSE values compared to MC Dropout RNN. This
again demonstrates the advantage of probabilistic Bayesian RNN in handling more uncertain contexts.
• Note also that the MC Dropout RNN provides lower values of STD or COV because it only captures the variability between activations induced by the stochasticity of random dropouts. In other words, the MC Dropout RNN quantifies only the epistemic uncertainty, as discussed in Section 1.

4.4.3 | Comparison with state-of-the-art models for C-MAPSS datasets

The C-MAPSS datasets have been extensively utilized for performance benchmarking of RUL prognostics based on shallow learning and/or deep learning methods. Ramasso and Saxena45 reviewed over 70 papers involving the C-MAPSS datasets between 2010 and 2014 and proposed to categorize the RUL prognostic approaches into three major categories, mainly focused on shallow learning methods. After 2014, deep learning approaches attracted much attention, and various deep learning frameworks were proposed for RUL prognostics that show better performance than the shallow learning methods. For instance, a combination of CNN and LSTM yields a sequential model consisting of four convolutional layers and a fully connected layer with an LSTM46; an Attn-DLSTM model47 emphasizes the time domain in the prediction, developing more interpretable results; and a sequential information bottleneck network (SIBN) model48 learns sequential information and uses an encoder–decoder architecture to improve the learning representation from the data. This literature used one or more of the C-MAPSS datasets, and various metrics were adopted to measure model performance.
For benchmarking purposes, this paper focuses on the literature that uses all four C-MAPSS datasets and reports RMSE metric values for model performance evaluation. This enables a straightforward comparison with our proposed probabilistic Bayesian RNN. In particular, four deep learning-based models are selected for performance benchmarking: a multi-objective deep belief networks ensemble (MODBNE) model that uses a multi-objective evolutionary algorithm (MOEA) to simultaneously train many deep belief networks46; a deep convolutional neural network (DCNN) that uses many stacked convolutional layers to process the data and extract more complex features49; a deep LSTM (DLSTM) model that stacks many LSTM layers to train a deep model50; and a domain adaptive CNN (AdaBN-CNN) that uses batch normalization for domain adaptation, allowing the network to train on one dataset and test on another through the application of two different CNNs.51
Table 8 summarizes the results retrieved from the four above-mentioned models as well as the results from the pro-
posed probabilistic Bayesian RNN: probabilistic Bayesian LSTM for FD001, probabilistic Bayesian VRNN for FD002,

TABLE 7 Comparison of the probabilistic Bayesian RNN with MC Dropout-based RNN

FD001
Model | Bayesian RMSE | STD  | COV  | MC Dropout RMSE | STD  | COV
VRNN  | 14.28         | 0.97 | 6.79 | 14.73           | 0.83 | 5.63
LSTM  | 14.00         | 0.92 | 6.57 | 14.96           | 0.68 | 4.54
GRU   | 14.08         | 1.05 | 7.46 | 14.25           | 0.68 | 4.77
JANET | 15.60         | 1.02 | 6.54 | 14.91           | 0.66 | 4.42

FD002
Model | Bayesian RMSE | STD  | COV  | MC Dropout RMSE | STD  | COV
VRNN  | 16.78         | 1.00 | 5.96 | 19.92           | 0.35 | 1.75
LSTM  | 17.43         | 1.01 | 5.79 | 18.24           | 0.33 | 1.80
GRU   | 17.58         | 1.03 | 5.86 | 18.51           | 0.31 | 1.67
JANET | 18.44         | 0.97 | 5.26 | 20.67           | 0.35 | 1.69

FD003
Model | Bayesian RMSE | STD  | COV  | MC Dropout RMSE | STD  | COV
VRNN  | 13.86         | 1.00 | 7.22 | 15.33           | 0.70 | 4.56
LSTM  | 13.70         | 1.00 | 7.30 | 15.11           | 0.57 | 3.77
GRU   | 12.31         | 0.93 | 7.55 | 14.52           | 0.65 | 4.47
JANET | 12.70         | 0.93 | 7.32 | 14.59           | 0.58 | 3.97

FD004
Model | Bayesian RMSE | STD  | COV  | MC Dropout RMSE | STD  | COV
VRNN  | 20.11         | 1.24 | 6.17 | 22.72           | 0.33 | 1.45
LSTM  | 21.00         | 1.07 | 5.10 | 21.53           | 0.27 | 1.25
GRU   | 21.42         | 1.13 | 5.28 | 21.75           | 0.28 | 1.28
JANET | 21.95         | 1.17 | 5.33 | 22.44           | 0.38 | 1.69

TABLE 8 Results of the probabilistic Bayesian RNN against state-of-the-art deep learning models

Dataset | MODBNE RMSE (STD) | DCNN RMSE (STD) | DLSTM RMSE (STD) | AdaBN-CNN RMSE (STD) | Probabilistic Bayesian RNN RMSE (STD)
FD001   | 15.40 (-)         | 12.61 (0.19)    | 16.14 (-)        | 13.17 (-)            | 14.00 (0.92)
FD002   | 25.05 (-)         | 22.36 (0.32)    | 24.49 (-)        | 20.87 (-)            | 16.78 (1.00)
FD003   | 12.51 (-)         | 12.64 (0.14)    | 16.18 (-)        | 14.97 (-)            | 12.31 (0.93)
FD004   | 28.66 (-)         | 23.31 (0.39)    | 28.17 (-)        | 24.57 (-)            | 20.11 (1.24)

probabilistic Bayesian GRU for FD003, and probabilistic Bayesian VRNN for FD004. The best result with the lowest
RMSE value for each dataset is highlighted in bold. The results show the promising performance of probabilistic Bayes-
ian RNN. Some important insights are discussed as follows:

• The probabilistic Bayesian RNN delivers lower RMSE values in all datasets but FD001, in which the best performance is obtained by the DCNN. The relative difference in RMSE between the DCNN and the probabilistic Bayesian RNN is around 9.93%, indicating comparable performance on FD001.
• The advantages of the probabilistic Bayesian RNN in dealing with cases involving complex operational conditions and simultaneous fault modes are demonstrated by the significant improvement in RMSE over the other four models on FD004. This observation is consistent with those in Sections 4.3 and 4.4.
• STD values were reported only for the DCNN model and are lower than those of the probabilistic Bayesian RNN. It is worth noting that the STD values for the DCNN model cover only part of the uncertainty sources, in turn leading to overconfident predictions, since the DCNN, as a Frequentist model, is limited in treating either epistemic or aleatory uncertainties. In contrast, our proposed probabilistic Bayesian RNN can systematically treat both epistemic and aleatory uncertainties; its STD values reflect the contribution of more uncertainty sources and are hence larger than those of the DCNN.

5 | CONCLUSIONS

This paper proposed a probabilistic Bayesian RNN for RUL prognostics considering epistemic and aleatory uncertainties, extending the Frequentist RNN into a Bayesian RNN based on the Flipout method and parametrizing the RUL prediction with a Gaussian distribution produced by a dense Flipout layer. We demonstrated the application of the proposed model with four architectural types of RNN (i.e., VRNN, LSTM, GRU, and JANET) to perform RUL prognostics with uncertainty quantification using the C-MAPSS dataset. The model performance was assessed in a benchmark study against the Frequentist RNN counterparts, the MC Dropout-based RNN, and state-of-the-art models for the C-MAPSS datasets. The results indicated that the proposed model outperformed its Frequentist counterparts as well as the MC Dropout-based RNN. The advantage of the proposed model was most evident in the more complex scenarios involving multiple operating conditions and/or a combination of two fault modes, as represented by the FD002, FD003, and FD004 datasets. In sum, epistemic and aleatory uncertainties are of great importance in RUL prognostics and need to be properly addressed in deep learning-based methods. The capability to characterize and propagate uncertainty through the neural network also enhances the predictive accuracy of RUL. RUL prediction with uncertainty quantification provides a valuable tool to assist engineers in the operation and maintenance of complex systems, supporting more appropriate operating and maintenance policies that avoid catastrophic or unexpected failures.

ACKNOWLEDGMENTS
The authors received no financial support for the research, authorship, and/or publication of this article.

AUTHOR CONTRIBUTIONS
Jose Caceres: conceptualization; data curation; software; writing—original draft preparation. Danilo Gonzalez:
conceptualization; data curation; software; writing—original draft preparation. Taotao Zhou: conceptualization; data
curation; methodology; formal analysis; visualization; supervision; writing—original draft preparation; writing—review
and editing. Enrique Lopez Droguett: conceptualization; methodology; supervision; writing—review and editing;
resources.

ORCID
Taotao Zhou https://orcid.org/0000-0002-2881-4781

ENDNOTE
1. Frequentist RNN refers to a recurrent neural network that represents its weights as deterministic values.

REFERENCES
1. Mohaghegh Z, Kazemi R, Mosleh A. Incorporating organizational factors into Probabilistic Risk Assessment (PRA) of complex socio-
technical systems: a hybrid technique formalization. Reliab Eng Syst Safe. 2009;94(5):1000-1018.

2. Perera C, Liu CH, Jayawardena S. The emerging internet of things marketplace from an industrial perspective: a survey. IEEE Trans
Emerg Top Comput. 2015;3(4):585-598.
3. Lei Y, Li N, Guo L, Li N, Yan T, Lin J. Machinery health prognostics: a systematic review from data acquisition to RUL prediction. Mech
Syst Signal Pr. 2018;104:799-834.
4. Randall RB, Antoni J. Rolling element bearing diagnostics—a tutorial. Mech Syst Signal Pr. 2011;25(2):485-520.
5. Wang D, Tsui K. Statistical modeling of bearing degradation signals. IEEE Trans Reliab. 2017;66(4):1331-1344.
6. Lei Y, Lin J, Zuo MJ, He Z. Condition monitoring and fault diagnosis of planetary gearboxes: a review. Measurement. 2014;48:292-305.
7. Kundu P, Darpe AK, Kulkarni MS. Gear pitting severity level identification using binary segmentation methodology. Struct Control
Health Monit. 2020;27(3):e2478. https://doi.org/10.1002/stc.2478
8. Goebel K, Saha B, Saxena A, Celaya J, Christophersen J. Prognostics in battery health management. IEEE Instru Meas Mag. 2008;11(4):
33-40.
9. Zhang J, Lee J. A review on prognostics and health monitoring of Li-ion battery. J Power Sources. 2011;196(15):6007-6014.
10. Neves AC, Leander J, Gonzalez I, Karoumi R. An approach to decision-making analysis for implementation of structural health moni-
toring in bridges. Struct Control Health Monit. 2019;26(6):e2352. https://doi.org/10.1002/stc.2352
11. Zhao H, Ding Y, Li A, Sheng W, Geng F. Digital modeling on the nonlinear mapping between multi-source monitoring data of in-service
bridges. Struct Control Health Monit. 2020;27(11):e2618. https://doi.org/10.1002/stc.2618
12. Carboni M, Crivelli D. An acoustic emission based structural health monitoring approach to damage development in solid railway axles.
Int J Fatigue. 2020;139:105753. https://doi.org/10.1016/j.ijfatigue.2020.105753
13. Hamadache M, Jung JH, Park J, Youn BD. A comprehensive review of artificial intelligence-based approaches for rolling element bear-
ing PHM: shallow and deep learning. JMST Adv. 2019;1(1):125-151.
14. Goodfellow I, Bengio Y, Courville A. Deep Learning. Cumberland: MIT Press; 2016.
15. Modarres C, Astorga N, Droguett EL, Meruane V. Convolutional neural networks for automated damage recognition and damage type
identification. Struct Control Health Monit. 2018;25(10):e2230. https://doi.org/10.1002/stc.2230
16. Zhang Y, Xiong R, He H, Pecht M. Long short-term memory recurrent neural network for remaining useful life prediction of lithium-
ion batteries. IEEE Trans Veh Technol. 2018;67(7):5695-5705.
17. Chen C, Shen F, Xu J, Yan R. Domain adaptation-based transfer learning for gear fault diagnosis under varying working conditions.
IEEE Trans Instrum Meas. 2020;70:1-10.
18. Verstraete D, Droguett EL, Meruane V, Modarres M, Ferrada A. Deep semi-supervised generative adversarial fault diagnostics of rolling
element bearings. Struct Health Monit. 2019;19(2):390-411.
19. Correa-Jullian C, Cardemil J, Droguett EL, Behzad M. Assessment of deep learning techniques for prognosis of solar thermal systems.
Renew Energy. 2020;145:2178-2191.
20. Jia F, Lei Y, Lin J, Zhou X, Lu N. Deep neural networks: a promising tool for fault characteristic mining and intelligent diagnosis of
rotating machinery with massive data. Mech Syst Signal Process. 2016;72-73:303-315.
21. Zhao R, Yan R, Chen Z, Mao K, Wang P, Gao R. Deep learning and its applications to machine health monitoring. Mech Syst Signal Pro-
cess. 2019;115:213-237.
22. Der Kiureghian A, Ditlevsen O. Aleatory or epistemic? Does it matter? Struct Saf. 2009;31(2):105-112.
23. Kendall A, Gal Y. What uncertainties do we need in Bayesian deep learning for computer vision? arXiv preprint arXiv:1703.04977. 2017.
24. Droguett EL, Mosleh A. Bayesian methodology for model uncertainty using model performance data. Risk Anal: Int J. 2008;28(5):
1457-1476.
25. Sankararaman S, Mahadevan S. Bayesian methodology for diagnosis uncertainty quantification and health monitoring. Struct Control
Health Monit. 2013;20(1):88-106.
26. Arangio S, Beck JL. Bayesian neural networks for bridge integrity assessment. Struct Control Health Monit. 2012;19(1):3-21.
27. Neal RM. Bayesian Learning for Neural Networks (Vol. 118). Springer Science & Business Media; 2012.
28. Peng W, Ye Z, Chen N. Bayesian deep-learning-based health prognostics toward prognostics uncertainty. IEEE Trans Ind Electron. 2020;
67(3):2283-2293.
29. Gal Y, Ghahramani Z. Dropout as a Bayesian approximation: representing model uncertainty in deep learning. In: Proceedings of the
33rd International Conference on Machine Learning, PMLR. Vol.48; 2016:1050-1059.
30. Osband I, Aslanides J, Cassirer A. Randomized prior functions for deep reinforcement learning. In: 32nd Conference on Neural Informa-
tion Processing Systems (NeurIPS 2018). Montréal, Canada, December 3–8; 2018.
31. Saxena A, Goebel K. Turbofan Engine Degradation Simulation Data Set. Moffett Field, CA: NASA Ames; 2008.
32. Rumelhart D, Hinton G, Williams R. Learning representations by back-propagating errors. Nature. 1986;323(6088):533-536.
33. Hochreiter S, Schmidhuber J. Long short-term memory. Neural Comput. 1997;9(8):1735-1780.
34. Cho K, van Merrienboer B, Gulcehre C, et al. Learning phrase representations using RNN encoder-decoder for statistical machine trans-
lation, arXiv preprint arXiv:1406.1078. 2014.
35. Van der Westhuizen J, Lasenby J. The unreasonable effectiveness of the forget gate, arXiv preprint arXiv:1804.04849. 2018.
36. MacKay DJ. Information Theory, Inference and Learning Algorithms. Cambridge University Press; 2003.
37. Salimans T, Kingma D, Welling M. Markov chain Monte Carlo and variational inference: bridging the gap. In: Proceedings of the 32nd
International Conference on Machine Learning, PMLR. Vol.37; 2015:1218-1226.

38. Filos A, Farquhar S, Gomez AN, et al. A systematic comparison of Bayesian deep learning robustness in diabetic retinopathy tasks, arXiv
preprint arXiv:1912.10481. 2019.
39. Blundell C, Cornebise J, Kavukcuoglu K, Wierstra D. Weight uncertainty in neural network. In: Proceedings of the 32nd International
Conference on Machine Learning, PMLR. Vol.37; 2015:1613-1622.
40. Miller AC, Foti NJ, D'Amour A, Adams RP. Reducing reparameterization gradient variance. In: 31st Conference on Neural Information
Processing Systems (NIPS 2017). Long Beach, CA, USA, December 4–9; 2017.
41. Li Y. Topics in approximate inference. 2020. Available at: http://yingzhenli.net/home/pdf/topics_approx_infer.pdf
42. Wen Y, Vicol P, Ba J, Tran D, Grosse R. Flipout: efficient pseudo-independent weight perturbations on mini-batches. In: International
Conference on Learning Representations 2018. Vancouver, Canada, April 30–May 3; 2018.
43. Dillon J, Langmore I, Tran D, et al. Tensorflow distributions, arXiv preprint arXiv:1711.10604. 2017.
44. Saxena A, Goebel K, Simon D, Eklund N. Damage propagation modeling for aircraft engine run-to-failure simulation. In: 2008 Interna-
tional Conference on Prognostics and Health Management. Denver, Colorado, USA, October 6–9; 2008.
45. Ramasso E, Saxena A. Performance benchmarking and analysis of prognostic methods for CMAPSS datasets. Int J Progn Health Manag.
2014;5(2):1-15.
46. Zhang X, Dong Y, Wen L, Lu F, Li W. Remaining useful life estimation based on a new convolutional and recurrent neural network. In:
2019 IEEE 15th International Conference on Automation Science and Engineering (CASE). Vancouver, Canada, August 22–26; 2019.
47. Das A, Hussain S, Yang F, Habibullah MS, Kumar A. Deep recurrent architecture with attention for remaining useful life estimation. In:
TENCON 2019–2019 IEEE Region 10 Conference (TENCON). Kerala, India, October 17–20; 2019.
48. Zhang Y, Li Y, Jia L, Wei X, Murphey YL. Sequential information bottleneck network for RUL prediction. In: 2019 IEEE Symposium
Series on Computational Intelligence (SSCI). Xiamen, China, December 6–9; 2019.
49. Li X, Ding Q, Sun JQ. Remaining useful life estimation in prognostics using deep convolution neural networks. Reliab Eng Syst Safe.
2018;172:1-11.
50. Wu J, Hu K, Cheng Y, Zhu H, Shao X, Wang Y. Data-driven remaining useful life prediction via multiple sensor signals and deep long
short-term memory neural network. ISA Trans. 2020;97:241-250.
51. Li J, He D. A Bayesian optimization AdaBN-DCNN method with self-optimized structure and hyperparameters for domain adaptation
remaining useful life prediction. IEEE Access. 2020;8:41482-41501.

How to cite this article: Caceres J, Gonzalez D, Zhou T, Droguett EL. A probabilistic Bayesian recurrent neural
network for remaining useful life prognostics considering epistemic and aleatory uncertainties. Struct Control
Health Monit. 2021;28(10):e2811. https://doi.org/10.1002/stc.2811

APPENDIX A

This section presents the extension of the frequentist RNN to its Bayesian counterpart for three architectural types (i.e., LSTM, GRU, and JANET).

A.1 | Bayesian long short-term memory

The long short-term memory (LSTM) was introduced to address the vanishing gradient problem, a common issue for RNNs trained on long sequences of data.33 The LSTM cell consists of three gates: the forget, input, and output gates. The forget gate (f_t) processes the previous hidden state (h_{t-1}) and the current input (x_t) through a sigmoid activation to discard information from the cell state that is no longer relevant; the input gate (i_t) updates the cell state according to the previous hidden state and the current input; and the output gate (o_t) generates the new hidden state (h_t) from the updated cell state (c_t). All the operations of the LSTM cell are shown in Figure A1.
Equations A1–A5 describe the operations inside the LSTM cell:

f_t = \sigma\left( U_f h_{t-1} + W_f x_t + b_f \right)    (A1)

i_t = \sigma\left( U_i h_{t-1} + W_i x_t + b_i \right)    (A2)

FIGURE A1 Cell design of an LSTM

o_t = \sigma\left( U_o h_{t-1} + W_o x_t + b_o \right)    (A3)

c_t = f_t \circ c_{t-1} + i_t \circ \tanh\left( U_c h_{t-1} + W_c x_t + b_c \right)    (A4)

h_t = o_t \circ \tanh(c_t)    (A5)

where W_f, W_i, W_o, and W_c denote the weight matrices of the four layers for their connection to the current input vector x_t; U_f, U_i, U_o, and U_c denote the weight matrices of the four layers for their connection to the previous hidden state h_{t-1}; and b_f, b_i, b_o, and b_c denote the corresponding bias terms.
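To make Equations A1–A5 concrete, the following is a minimal NumPy sketch of a single (frequentist) LSTM step; the dimensions, initialization, and variable names are illustrative only and mirror the notation above.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, U, W, b):
    """One LSTM step (Equations A1-A5); U, W, b are dicts keyed by gate."""
    f_t = sigmoid(U["f"] @ h_prev + W["f"] @ x_t + b["f"])  # forget gate (A1)
    i_t = sigmoid(U["i"] @ h_prev + W["i"] @ x_t + b["i"])  # input gate  (A2)
    o_t = sigmoid(U["o"] @ h_prev + W["o"] @ x_t + b["o"])  # output gate (A3)
    c_t = f_t * c_prev + i_t * np.tanh(
        U["c"] @ h_prev + W["c"] @ x_t + b["c"])            # cell state  (A4)
    h_t = o_t * np.tanh(c_t)                                # hidden state (A5)
    return h_t, c_t

rng = np.random.default_rng(0)
n_in, n_hid = 3, 4  # illustrative sizes
U = {g: 0.1 * rng.standard_normal((n_hid, n_hid)) for g in "fioc"}
W = {g: 0.1 * rng.standard_normal((n_hid, n_in)) for g in "fioc"}
b = {g: np.zeros(n_hid) for g in "fioc"}
h, c = np.zeros(n_hid), np.zeros(n_hid)
h, c = lstm_step(rng.standard_normal(n_in), h, c, U, W, b)
print(h.shape)  # (4,)
```

Note that h_t is bounded in (−1, 1) because it is an elementwise product of a sigmoid output and a tanh output, while c_t is unbounded.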
According to the above definition of an LSTM cell and the Flipout method, the Bayesian LSTM is implemented as follows, where \bar{U} and \bar{W} denote the means of the weight posteriors, \Delta\hat{U} and \Delta\hat{W} denote sampled weight perturbations, and s_{\cdot,t} and r_{\cdot,t} denote random sign vectors:
       
f_t = \sigma\left( \bar{U}_f h_{t-1} + \left( \Delta\hat{U}_f \left( h_{t-1} \circ s_{f,t} \right) \right) \circ r_{f,t} + \bar{W}_f x_t + \left( \Delta\hat{W}_f \left( x_t \circ s_{f,t} \right) \right) \circ r_{f,t} \right)    (A6)

i_t = \sigma\left( \bar{U}_i h_{t-1} + \left( \Delta\hat{U}_i \left( h_{t-1} \circ s_{i,t} \right) \right) \circ r_{i,t} + \bar{W}_i x_t + \left( \Delta\hat{W}_i \left( x_t \circ s_{i,t} \right) \right) \circ r_{i,t} \right)    (A7)

o_t = \sigma\left( \bar{U}_o h_{t-1} + \left( \Delta\hat{U}_o \left( h_{t-1} \circ s_{o,t} \right) \right) \circ r_{o,t} + \bar{W}_o x_t + \left( \Delta\hat{W}_o \left( x_t \circ s_{o,t} \right) \right) \circ r_{o,t} \right)    (A8)

c_t = f_t \circ c_{t-1} + i_t \circ \tanh\left( \bar{U}_c h_{t-1} + \left( \Delta\hat{U}_c \left( h_{t-1} \circ s_{c,t} \right) \right) \circ r_{c,t} + \bar{W}_c x_t + \left( \Delta\hat{W}_c \left( x_t \circ s_{c,t} \right) \right) \circ r_{c,t} \right)    (A9)

h_t = o_t \circ \tanh(c_t)    (A10)
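The Flipout perturbation that appears in every gate of Equations A6–A10 follows a single pattern: \bar{W} x + (\Delta\hat{W}(x \circ s)) \circ r, with s an input-sized sign vector and r an output-sized sign vector. The following is a minimal sketch of that pattern under illustrative names and dimensions; in practice \Delta\hat{W} is sampled from the learned posterior perturbation distribution rather than fixed as here.

```python
import numpy as np

def flipout_matvec(W_mean, dW_hat, x, s, r):
    """Flipout pre-activation for one example:
    W_mean @ x + (dW_hat @ (x * s)) * r,
    where s and r are random +/-1 vectors that decorrelate the
    perturbations seen by different examples in a mini-batch."""
    return W_mean @ x + (dW_hat @ (x * s)) * r

rng = np.random.default_rng(1)
n_in, n_out = 3, 4  # illustrative sizes
W_mean = rng.standard_normal((n_out, n_in))       # posterior mean
dW_hat = 0.1 * rng.standard_normal((n_out, n_in)) # sampled perturbation
x = rng.standard_normal(n_in)
s = rng.choice([-1.0, 1.0], size=n_in)
r = rng.choice([-1.0, 1.0], size=n_out)
y = flipout_matvec(W_mean, dW_hat, x, s, r)

# With s = r = +1, the perturbation reduces to an ordinary
# (W_mean + dW_hat) @ x, which is a quick sanity check:
y_plain = flipout_matvec(W_mean, dW_hat, x, np.ones(n_in), np.ones(n_out))
assert np.allclose(y_plain, (W_mean + dW_hat) @ x)
```

Because each example gets its own sign flip, examples in a batch see pseudo-independent weight samples while sharing one \Delta\hat{W}, which is the variance-reduction idea behind Flipout.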

A.2 | Bayesian gated recurrent unit

The gated recurrent unit (GRU) was also proposed to overcome the vanishing gradient problem.34 GRU cells are designed with only update and reset gates, rather than the three gates of LSTMs. This difference improves efficiency, since fewer gates mean fewer operations and shorter computation time; however, LSTMs usually perform better on longer time series. The basic GRU cell is shown in Figure A2. Note that GRUs merge the cell and hidden states into a single state, so a GRU cell takes only one recurrent input, whereas an LSTM cell takes two (the previous cell and hidden states).
The GRU operations are described in Equations A11–A14:

FIGURE A2 Cell design of a GRU

z_t = \sigma\left( U_z h_{t-1} + W_z x_t + b_z \right)    (A11)

r_t = \sigma\left( U_r h_{t-1} + W_r x_t + b_r \right)    (A12)

\tilde{h}_t = \tanh\left( U_{\tilde{h}} \left( r_t \circ h_{t-1} \right) + W_{\tilde{h}} x_t + b_{\tilde{h}} \right)    (A13)

h_t = \left( 1 - z_t \right) \circ h_{t-1} + z_t \circ \tilde{h}_t    (A14)

where W_z, W_r, and W_{\tilde{h}} denote the weight matrices of the three layers for their connection to the current input vector x_t; U_z, U_r, and U_{\tilde{h}} denote the weight matrices for their connection to the previous hidden state h_{t-1}; and b_z, b_r, and b_{\tilde{h}} denote the corresponding bias terms.
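Equations A11–A14 can be sketched in NumPy as follows; dimensions and initialization are illustrative only, and the variable names mirror the notation above.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x_t, h_prev, U, W, b):
    """One GRU step (Equations A11-A14); U, W, b are dicts keyed z, r, h."""
    z_t = sigmoid(U["z"] @ h_prev + W["z"] @ x_t + b["z"])  # update gate (A11)
    r_t = sigmoid(U["r"] @ h_prev + W["r"] @ x_t + b["r"])  # reset gate  (A12)
    h_tilde = np.tanh(
        U["h"] @ (r_t * h_prev) + W["h"] @ x_t + b["h"])    # candidate   (A13)
    return (1.0 - z_t) * h_prev + z_t * h_tilde             # new state   (A14)

rng = np.random.default_rng(2)
n_in, n_hid = 3, 4  # illustrative sizes
U = {g: 0.1 * rng.standard_normal((n_hid, n_hid)) for g in "zrh"}
W = {g: 0.1 * rng.standard_normal((n_hid, n_in)) for g in "zrh"}
b = {g: np.zeros(n_hid) for g in "zrh"}
h = np.zeros(n_hid)
for _ in range(5):  # run a short sequence through the single state
    h = gru_step(rng.standard_normal(n_in), h, U, W, b)
print(h.shape)  # (4,)
```

The single returned state plays both the cell-state and hidden-state roles, which is the structural simplification the text describes.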
The implementation of the Bayesian GRU cell is described in the following equations:
     
z_t = \sigma\left( \bar{U}_z h_{t-1} + \left( \Delta\hat{U}_z \left( h_{t-1} \circ s_{z,t} \right) \right) \circ r_{z,t} + \bar{W}_z x_t + \left( \Delta\hat{W}_z \left( x_t \circ s_{z,t} \right) \right) \circ r_{z,t} \right)    (A15)

r_t = \sigma\left( \bar{U}_r h_{t-1} + \left( \Delta\hat{U}_r \left( h_{t-1} \circ s_{r,t} \right) \right) \circ r_{r,t} + \bar{W}_r x_t + \left( \Delta\hat{W}_r \left( x_t \circ s_{r,t} \right) \right) \circ r_{r,t} \right)    (A16)

\tilde{h}_t = \tanh\left( \bar{U}_{\tilde{h}} \left( h_{t-1} \circ r_t \right) + \left( \Delta\hat{U}_{\tilde{h}} \left( \left( h_{t-1} \circ r_t \right) \circ s_{\tilde{h},t} \right) \right) \circ r_{\tilde{h},t} + \bar{W}_{\tilde{h}} x_t + \left( \Delta\hat{W}_{\tilde{h}} \left( x_t \circ s_{\tilde{h},t} \right) \right) \circ r_{\tilde{h},t} \right)    (A17)

h_t = \left( 1 - z_t \right) \circ h_{t-1} + z_t \circ \tilde{h}_t    (A18)

A.3 | Bayesian Just Another NETwork

Given the perspective introduced by GRUs (fewer gates, faster training, and improved results), it is natural to ask whether all the gates are necessary. In this context, the cell design of Just Another NETwork (JANET) uses only a forget gate (f_t). As shown by Van der Westhuizen and Lasenby,35 the JANET architecture not only can outperform the common LSTM network on the MNIST and pMNIST datasets but also requires fewer computational resources. The basic JANET cell is shown in Figure A3.
The JANET operations are described in Equations A19–A21:
 
f_t = \sigma\left( U_f h_{t-1} + W_f x_t + b_f \right)    (A19)

FIGURE A3 Cell design of a JANET

c_t = f_t \circ h_{t-1} + \left( 1 - f_t \right) \circ \tanh\left( U_c h_{t-1} + W_c x_t + b_c \right)    (A20)

h_t = c_t    (A21)

where W_f and W_c denote the weight matrices of the two layers for their connection to the current input vector x_t; U_f and U_c denote the weight matrices for their connection to the previous hidden state h_{t-1}; and b_f and b_c denote the corresponding bias terms.
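A minimal NumPy sketch of Equations A19–A21 follows; dimensions, initialization, and the positive forget-gate bias are illustrative only.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def janet_step(x_t, h_prev, U_f, W_f, b_f, U_c, W_c, b_c):
    """One JANET step (Equations A19-A21): a single forget gate blends
    the previous state with a tanh candidate state."""
    f_t = sigmoid(U_f @ h_prev + W_f @ x_t + b_f)  # forget gate (A19)
    c_t = f_t * h_prev + (1.0 - f_t) * np.tanh(
        U_c @ h_prev + W_c @ x_t + b_c)            # blended state (A20)
    return c_t                                     # h_t = c_t     (A21)

rng = np.random.default_rng(3)
n_in, n_hid = 3, 4  # illustrative sizes
U_f = 0.1 * rng.standard_normal((n_hid, n_hid))
U_c = 0.1 * rng.standard_normal((n_hid, n_hid))
W_f = 0.1 * rng.standard_normal((n_hid, n_in))
W_c = 0.1 * rng.standard_normal((n_hid, n_in))
b_f = np.ones(n_hid)   # illustrative positive forget bias
b_c = np.zeros(n_hid)
h = np.zeros(n_hid)
h = janet_step(rng.standard_normal(n_in), h, U_f, W_f, b_f, U_c, W_c, b_c)
print(h.shape)  # (4,)
```

Because (1 − f_t) reuses the forget gate as an implicit input gate, JANET needs only two weighted transforms per step, which is where its computational savings over the LSTM come from.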
As with the previously described cells, the Bayesian JANET is described by the following equations:
        
f_t = \sigma\left( \bar{U}_f h_{t-1} + \left( \Delta\hat{U}_f \left( h_{t-1} \circ s_{f,t} \right) \right) \circ r_{f,t} + \bar{W}_f x_t + \left( \Delta\hat{W}_f \left( x_t \circ s_{f,t} \right) \right) \circ r_{f,t} \right)    (A22)

c_t = f_t \circ h_{t-1} + \left( 1 - f_t \right) \circ \tanh\left( \bar{U}_c h_{t-1} + \left( \Delta\hat{U}_c \left( h_{t-1} \circ s_{c,t} \right) \right) \circ r_{c,t} + \bar{W}_c x_t + \left( \Delta\hat{W}_c \left( x_t \circ s_{c,t} \right) \right) \circ r_{c,t} \right)    (A23)

h_t = c_t    (A24)
