Risk-Based Fault Detection Using Bayesian Networks
Article
Risk-Based Fault Detection Using Bayesian Networks Based on
Failure Mode and Effect Analysis
Bálint Levente Tarcsay 1, *,† , Ágnes Bárkányi 1,† , Sándor Németh 1 , Tibor Chován 1 , László Lovas 2
and Attila Egedy 1
Abstract: In this article, the authors focus on the introduction of a hybrid method for risk-based
fault detection (FD) using dynamic principal component analysis (DPCA) and failure mode and
effect analysis (FMEA) based Bayesian networks (BNs). The FD problem has garnered great interest
in industrial application, yet methods for integrating process risk into the detection procedure are
still scarce. It is, however, critical to assess the risk each possible process fault holds to differentiate
between non-safety-critical and safety-critical abnormalities and thus minimize alarm rates. The
proposed method utilizes a BN established through an FMEA of the supervised process and the
results of DPCA to estimate a modified risk priority number (RPN)
of different process states. The RPN is used in parallel with the FD procedure, incorporating the results
of both to differentiate between process abnormalities and highlight critical issues. The method is
showcased using an industrial benchmark problem as well as the model of a reactor utilized in the
emerging liquid organic hydrogen carrier (LOHC) technology.
Keywords: fault detection; dynamic risk assessment; Bayesian networks; FMEA; DPCA
for FD and compare these metrics calculated from process data to predefined statistical
thresholds to decide whether a sample can be categorized as normal or abnormal [6].
The performance of the FD method therefore is usually evaluated using only the FAR and
MAR metrics. The issue with this approach is that since no risk is associated with the
out-of-control process states in traditional MSPM FD, alarms will be raised regardless of
whether the detected abnormality is just a simple nuisance that holds no process risk or if
it is a state that could cause severe damage if left unchecked [7]. Therefore, many nuisance
alarms are raised, which could lead to alarm floods in complex systems, especially when,
due to fault propagation, other alarms are raised as well [8].
To circumvent this issue, methods have been developed that take the risk of each
indicated fault into account during the FD process through dynamic risk assessment (DRA);
however, the arsenal of techniques which utilize risk assessment in coordination with
FD is still sparse [9]. Among the first instances of such methods was a technique which
proposed a PCA model for the supervision of chemical processes and incorporated risk
estimation using a quantitative risk assessment model. Using this technique, alarms were
only raised when a fault was detected by the PCA metrics and the predicted risk exceeded
a defined threshold [7]. Later on, self-organizing maps were utilized to address the FD
problem of non-linear systems. Using a probabilistic approach, faults were categorized
into several classes based on severity, and FD was performed while taking the risk into
account as well [10]. Qualitative models have also been investigated for risk-based
FD, such as the use of the R-vine copula and event tree methods for the
supervision of non-linear and non-Gaussian processes [11].
Techniques for risk-based FD are attracting growing research interest, as the previous
examples show, but remain relatively scarce. Most methods propose risk estimation
techniques that are related in some manner to traditional techniques of industrial risk
assessment, such as hazard and operability study (HAZOP) [12], event trees (ETs) [13],
fault trees (FTs) [14] or FMEA [15]; however, these traditional methods are not intrinsically
integrated into the framework or performed rigorously [7].
For example, in the previously noted articles [5,7,10], risk was generally defined as
the product of the probability (P) of a fault occurring that leads to an unwanted
catastrophic event and the severity score (S) assigned to each fault consequence, as per
Equation (1), from which a dynamic risk profile for the process was calculated [16]. This
procedure and its basic logic are fairly similar to the calculation of RPN in FMEA:
lower the probability of catastrophic events occurring even if, due to some fault, a critical
process variable shows abnormal behavior [17]. Therefore, the probability of safety-critical
events occurring may not be properly characterized by simply observing the deviation
of process variables from their normal states without taking the general construction of
the system, presence of fail-safes, inherent safety, and possible fault propagation paths
into account.
To overcome this issue, established risk assessment methods and models from the
literature have been evaluated to propose hybrid techniques for more rigorous risk-based
FD [18]. Based on previous trends, the most popular methods for quantitative risk
assessment are probabilistic graphical models, such as dynamic event and fault trees, event
sequence diagrams, Markov models [19], Monte Carlo simulation [20], BNs, Petri nets [21],
etc., which are used to estimate the risk of certain system states under both static and
dynamic conditions [22].
In recent years, BNs have gained especially great popularity in the risk assessment
community, with many applications aiming to extend their applicability and combine
them with previously established methods such as FMEA or ETs [23]. The appeal of these
techniques is that they estimate the probability of process risks more rigorously than the
traditional FMEA or HAZOP techniques, address the entirety of the system (fail-safes
and components included), and take possible failure propagation paths into account [23].
In light of this, the key idea of this article is to extend the framework of risk-based FD
using a method based on dynamic principal component analysis (DPCA) and BN-FMEA-based
risk assessment, which can give a more accurate estimate of process risk
by taking fault propagation paths, fail-safes, and inherent safety into account
when evaluating possible abnormal process states.
The authors utilize DPCA to characterize the observed process and produce indicators
for the presence of process faults and risk events. After establishing the model under
normal operating conditions, the presence of characteristic faults is observed, and statistical
indicators such as the Q statistic are calculated for the different fault scenarios. In parallel with
the FD procedure, a risk profile of the system is tracked based on BN-FMEA. Severity
scores are assigned based on the deviating principal components, detectability is evaluated
using the MAR metrics of the DPCA technique, and the probability of fault presence is
evaluated using the BN. This approach results in a modified RPN, which is used to indicate
whether a process state poses significant risk to process operations through alarm
and warning signals. In the paper, the following definitions of alarm and warning signals
are used:
• Alarms: Signals with intense visual and vocal prompts used to signal operators
that a shutdown of the supervised system or other immediate and severe actions
are necessary.
• Warnings: Signals with vocal and visual prompts which signal to operators that
process functions are lost/process states changed due to process faults, but immediate
shutdown is not necessary, as the disturbances are not critical from a safety perspective.
As can be seen, safety-critical system states are highlighted using alarm signals, while
non-safety-critical process conditions are still recognized by warning signals. The main
contributions of our approach can be summarized in the following points.
• Development of a risk-based fault detection method which combines standardized
expert knowledge (failure mode and effect analysis) with data-based techniques
(Bayesian network) for risk assessment.
• Introduction of a modified RPN containing both FD and risk assessment considerations
for the raising of alarms.
In the following, the mathematical formalization and background of the employed
techniques are introduced in Section 2. The flowchart and general logic of the proposed
algorithm are formalized in Section 3. Case studies for method evaluation are given in
Section 4, including both a case study of an FD benchmark problem and a case study utilizing
a dehydrogenation reactor of the liquid organic hydrogen carrier (LOHC) technology.
The discussion and critical evaluation of the results are shown in Section 5.
Sensors 2024, 24, 3511 4 of 25
2.1. Principal Component Analysis (PCA) and Dynamic Principal Component Analysis (DPCA)
Consider a data set describing the behavior of a process, denoted by $X \in \mathbb{R}^{n \times p}$ with $n$
observations and $p$ process variables. The columns of $X$ are centered and scaled to have a
mean of zero and unit variance. The centered and scaled matrix is denoted as $\tilde{X}$.
The sample covariance matrix $Z \in \mathbb{R}^{p \times p}$ of $\tilde{X}$ may be calculated according to Equation (2):
$$Z = \frac{1}{n-1}\tilde{X}^{T}\tilde{X} \qquad (2)$$
The eigenvalue decomposition of the covariance matrix $Z$ according to Equation (3)
results in $P \in \mathbb{R}^{p \times p}$, which is a matrix containing the eigenvectors of $Z$, while $\Lambda \in \mathbb{R}^{p \times p}$ is
a diagonal matrix containing the eigenvalues of $Z$.
$$Z = P \Lambda P^{T} \qquad (3)$$
After arranging the eigenvectors in descending order of their corresponding eigenvalues,
we obtain the matrix $\tilde{P}$ with the ordered eigenvalues $\tilde{\lambda}_i$. The optimal number of PCs to be retained
($a$) can be calculated according to Equation (4), where $\theta$ is the cumulative value of the
eigenvalues to be retained, provided that $\sum_{i=1}^{p} \tilde{\lambda}_i \geq \theta$ holds true:
$$a = \arg\min_{q} \left( \sum_{i=1}^{q} \tilde{\lambda}_i - \theta \right)^{2} \qquad (4)$$
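The PCA transformation is then realized in the form of Equation (5) with $T \in \mathbb{R}^{n \times a}$,
where $\tilde{P}_a$ contains the first $a$ columns of $\tilde{P}$:
$$T = \tilde{X}\tilde{P}_a \qquad (5)$$
The PCA data decomposition model thus takes the form shown in Equation (6), where
the matrix $E \in \mathbb{R}^{n \times p}$ is the prediction error:
$$\tilde{X} = T\tilde{P}_a^{T} + E \qquad (6)$$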
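The decomposition in Equations (2) to (6) can be sketched in a few lines of NumPy. This is a minimal illustration rather than the authors' implementation: the function name `fit_pca` and the returned dictionary keys are hypothetical, and `theta` is interpreted as an absolute cumulative eigenvalue target, as in Equation (4).

```python
import numpy as np


def fit_pca(X, theta):
    """Fit a PCA model by eigendecomposition of the sample covariance matrix.

    theta is the cumulative eigenvalue target of Equation (4). For standardized
    data the eigenvalues sum to p, so theta = 0.9 * p aims at about 90% of the
    total variance (an illustrative choice, not a value from the article).
    """
    X = np.asarray(X, dtype=float)
    n = X.shape[0]
    mean = X.mean(axis=0)
    std = X.std(axis=0, ddof=1)
    X_tilde = (X - mean) / std                     # zero mean, unit variance
    Z = X_tilde.T @ X_tilde / (n - 1)              # Equation (2)
    eigvals, eigvecs = np.linalg.eigh(Z)           # Equation (3), ascending order
    order = np.argsort(eigvals)[::-1]              # re-sort in descending order
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    cumulative = np.cumsum(eigvals)
    a = int(np.argmin((cumulative - theta) ** 2)) + 1   # Equation (4)
    P_a = eigvecs[:, :a]                           # first a eigenvectors
    T = X_tilde @ P_a                              # Equation (5)
    E = X_tilde - T @ P_a.T                        # residual in Equation (6)
    return {"X_tilde": X_tilde, "eigvals": eigvals, "P_a": P_a, "T": T, "E": E}
```

By construction, `T @ P_a.T + E` reproduces the standardized data, which is a convenient sanity check when experimenting with the number of retained PCs.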
Assuming that the PCs are normally distributed, an upper control limit can be established,
which can be used to filter abnormal outlier points. In the case of the $T^2$ statistic
for a given confidence level $\alpha$ [26], the control limit $T_{\alpha}^{2}$ may be calculated according to
Equation (8), where $F$ is the F-distribution [27]:
$$T_{\alpha}^{2} = \frac{a(n+1)(n-1)}{n^{2} - na} F(\alpha, a, n-a) \qquad (8)$$
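The Q statistic for an i-th data point can be calculated according to Equation (9), where
$I \in \mathbb{R}^{p \times p}$ is a unit matrix with appropriate dimensions [26]:
$$Q_i = \left(\tilde{X}_i - T\tilde{P}_a^{T}\right)^{T}\left(\tilde{X}_i - T\tilde{P}_a^{T}\right) = \tilde{X}_i^{T}\left(I - \tilde{P}_a\tilde{P}_a^{T}\right)\tilde{X}_i = E_i^{T}E_i \qquad (9)$$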
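A control limit for the Q statistic has been proposed by Jackson and Mudholkar [26].
For a given confidence level $\alpha$, the control limit $Q_{\alpha}$ is calculated according to Equation (10):
$$Q_{\alpha} = \theta_1 \left[ 1 + \frac{d_{\alpha}\sqrt{2\theta_2 h_0^{2}}}{\theta_1} + \frac{\theta_2 h_0 (h_0 - 1)}{\theta_1^{2}} \right]^{\frac{1}{h_0}} \qquad (10)$$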
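As an illustration of Equation (10), the sketch below evaluates the control limit from the eigenvalues of the discarded PCs. It assumes the standard Jackson and Mudholkar definitions, which are not spelled out in this excerpt: $\theta_i$ is the sum of the i-th powers of the discarded eigenvalues, $h_0 = 1 - 2\theta_1\theta_3/(3\theta_2^2)$, and $d_{\alpha}$ is the standard normal quantile at the chosen confidence level. The function name `q_control_limit` is hypothetical.

```python
import math
from statistics import NormalDist


def q_control_limit(discarded_eigenvalues, confidence=0.95):
    """Upper control limit of the Q statistic following Equation (10).

    Assumes theta_i = sum(lambda_j ** i) over the discarded eigenvalues and
    h0 = 1 - 2 * theta1 * theta3 / (3 * theta2 ** 2).
    """
    theta1 = sum(lam for lam in discarded_eigenvalues)
    theta2 = sum(lam ** 2 for lam in discarded_eigenvalues)
    theta3 = sum(lam ** 3 for lam in discarded_eigenvalues)
    h0 = 1.0 - 2.0 * theta1 * theta3 / (3.0 * theta2 ** 2)
    d_alpha = NormalDist().inv_cdf(confidence)     # standard normal quantile
    bracket = (
        1.0
        + d_alpha * math.sqrt(2.0 * theta2 * h0 ** 2) / theta1
        + theta2 * h0 * (h0 - 1.0) / theta1 ** 2
    )
    return theta1 * bracket ** (1.0 / h0)
```

A sample whose Q value exceeds this limit would be flagged as abnormal at the chosen confidence level.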
most commonly identified based on observed issues of similar systems, on historical
data, or, in the case of novel processes, on system decomposition and analysis techniques such as
product–function analysis, function–component relationship analysis, function–structure
relationship analysis, etc., during brainstorming sessions of the team of professionals. More
rigorous methods such as fault tree analysis (FTA) or event tree analysis (ETA) are also often
applied to perform the failure mode analysis [32].
After analyzing the failure modes and their propagation through the system, the risk
evaluation of each failure mode is compiled. The risk evaluation of failure modes in
FMEA can be performed in a wide variety of ways, with the most common being the
application of the risk priority number (RPN) [31]. This involves either the addition or
multiplication of three factors associated with the failure mode: severity (S),
occurrence (O), and detectability (D). Two main approaches are used to express these
scores: expressing the risk factors S, O, and D through fuzzy logic, or
using a 10-level integer scale to quantify each measure [30]. The RPN score using
the latter approach is traditionally calculated as per Equation (11):
RPN = S · O · D (11)
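A minimal sketch of Equation (11) on the 10-level integer scale is given below. The function name and the range check are illustrative assumptions, not part of the FMEA procedure described in the article.

```python
def risk_priority_number(severity, occurrence, detectability):
    """Classical RPN of Equation (11) with scores on a 1-10 integer scale."""
    scores = {"severity": severity, "occurrence": occurrence, "detectability": detectability}
    for name, score in scores.items():
        if not isinstance(score, int) or not 1 <= score <= 10:
            raise ValueError(f"{name} must be an integer between 1 and 10, got {score!r}")
    return severity * occurrence * detectability
```

With this scale the RPN ranges from 1 to 1000, so a failure mode with S = 9, O = 3 and D = 2 (hypothetical scores) yields an RPN of 54.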
While the FMEA method is a great tool to ensure process quality and safety in the
design stages of a system, it lacks capabilities for online diagnosis, which limits its
usefulness for system supervision and decision-making [33]. FMEA is therefore often
enhanced by or fused with more rigorous risk assessment techniques with a probabilistic
framework to enable system diagnosis as well. A common approach is the integration of
FMEA into Bayesian networks or Markov models [34].
$$\mathrm{CPF}(x_1, x_2, \ldots, x_i) = \prod_{j=1}^{i} \mathrm{CPF}\left(x_j \mid \mathrm{pa}(x_j)\right) \qquad (12)$$
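The factorization in Equation (12) can be illustrated with a small, hypothetical five-node network of binary variables in which x2 and x3 depend on x1, x4 depends on x2 and x3, and x5 depends on x4. All probabilities below are invented for illustration, and the brute-force enumeration is only practical for very small networks.

```python
from itertools import product

# Parent sets of each node (hypothetical structure).
PARENTS = {"x1": (), "x2": ("x1",), "x3": ("x1",), "x4": ("x2", "x3"), "x5": ("x4",)}

# P(node = True | parent values); every number here is a made-up example value.
CPT = {
    "x1": {(): 0.1},
    "x2": {(True,): 0.8, (False,): 0.05},
    "x3": {(True,): 0.6, (False,): 0.1},
    "x4": {(True, True): 0.95, (True, False): 0.7, (False, True): 0.5, (False, False): 0.01},
    "x5": {(True,): 0.9, (False,): 0.02},
}


def joint_probability(assignment):
    """Joint probability of a full assignment as the product in Equation (12)."""
    prob = 1.0
    for node, parents in PARENTS.items():
        p_true = CPT[node][tuple(assignment[p] for p in parents)]
        prob *= p_true if assignment[node] else 1.0 - p_true
    return prob


def posterior(query, evidence):
    """P(query = True | evidence) by enumerating all joint assignments."""
    nodes = list(PARENTS)
    numerator = denominator = 0.0
    for values in product([True, False], repeat=len(nodes)):
        assignment = dict(zip(nodes, values))
        if any(assignment[k] != v for k, v in evidence.items()):
            continue
        p = joint_probability(assignment)
        denominator += p
        if assignment[query]:
            numerator += p
    return numerator / denominator
```

Calling `posterior("x1", {"x5": True})` gives the probability of the root cause given an observed symptom, which is the type of diagnostic query used for risk assessment.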
[Figure: example Bayesian network with binary nodes X1 to X5 and their conditional probability tables. X2 and X3 depend on X1, X4 depends on X2 and X3, and X5 depends on X4.]
[Figure: flowchart of the proposed algorithm. Start; BN-FMEA (risk assessment); acceptable risk? If no, raise alarm. If yes, fault detection; fault present? If yes, issue warning. If no, stop.]
During the online supervision process, the real-time system behavior is observed. The
occurrence probability of failure modes can be calculated directly, as a function of time,
from the observed process variables, which act as symptoms. The detectability of individual
failure modes, based on the missed alarm rate, and the severity scores, assigned based on expert
knowledge, are used to create an RPN score for each possible failure mode. Should the
risk level for any failure mode indicated by the RPN exceed the acceptable range, alarms
are instantly raised, while in the case of no serious process risk, the evaluation of the
fault presence is performed. Thus, for real-time risk assessment, the detectability derived
from DPCA, the occurrence probability of failures estimated from the observed process
variables using the Bayesian network, and the severity assigned using expert knowledge and
FMEA are combined. If an observed anomaly holds no significant process risk, it
is still analyzed, and warnings are issued if faults are detected based on the DPCA results,
but alarms are not raised. If no faults are present, no immediate actions are taken.
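The decision logic described above can be summarized as a short sketch. The function name, argument names and return format are hypothetical; only the order of the checks (risk first, fault detection second) follows the text.

```python
def supervise(rpn_by_mode, rpn_threshold, q_value, q_limit):
    """Return the signal for one time step and the failure modes that caused it.

    rpn_by_mode: mapping from failure mode name to its current RPN score.
    An alarm is raised if any RPN exceeds the acceptable threshold; otherwise
    a warning is issued if the Q statistic exceeds its control limit.
    """
    critical = [mode for mode, rpn in rpn_by_mode.items() if rpn > rpn_threshold]
    if critical:
        return "alarm", critical
    if q_value > q_limit:
        return "warning", []
    return "none", []
```

For example, with an RPN threshold of 100, a leakage RPN of 120 produces an alarm, whereas a valve fouling RPN of 40 combined with a Q value above its limit produces only a warning.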
problem, including the measured input and output variables, system model, and system
parameters, is provided below.
Parameter | Unit | Value
q1 | m³ s⁻¹ | 1.5 × 10⁻⁴
q3 | m³ s⁻¹ | 1.5 × 10⁻⁴
µ1,2 | – | 0.5
µ2,3 | – | 0.5
µ3,0 | – | 0.6
µf | – | 0.6
S | m² | 1.5 × 10⁻²
Sp | m² | 5 × 10⁻⁵
Sf | m² | 5 × 10⁻⁵
li,0 | m | 0
The training data for the DPCA method are obtained by observing system behavior
around the steady state under the conditions shown in Table 1. The development of
the steady-state conditions can be seen in Figure 4. The system of differential equations
describing liquid level changes is solved using Euler’s explicit method.
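Euler's explicit method can be sketched as follows. The integrator itself is generic; the three-tank balance equations in `derivatives` are an assumed textbook-style form (two inflows, tanks connected in series by pipes with Torricelli-type flow) and are not taken from the article, whose model equations are not reproduced in this excerpt. Parameter values follow the operating conditions listed above; the fault-related parameters are omitted.

```python
import math

G = 9.81  # gravitational acceleration, m s^-2
PARAMS = {"q1": 1.5e-4, "q3": 1.5e-4, "mu12": 0.5, "mu23": 0.5, "mu30": 0.6,
          "S": 1.5e-2, "Sp": 5e-5}


def pipe_flow(mu, area, level_from, level_to):
    """Signed flow through a connecting pipe (assumed Torricelli-type law)."""
    diff = level_from - level_to
    return mu * area * math.copysign(math.sqrt(2.0 * G * abs(diff)), diff)


def derivatives(levels, p=PARAMS):
    """Assumed level balances for tanks 1, 2 and 3 (fault terms omitted)."""
    l1, l2, l3 = levels
    q12 = pipe_flow(p["mu12"], p["Sp"], l1, l2)
    q23 = pipe_flow(p["mu23"], p["Sp"], l2, l3)
    q30 = pipe_flow(p["mu30"], p["Sp"], l3, 0.0)
    return ((p["q1"] - q12) / p["S"],
            (q12 - q23) / p["S"],
            (p["q3"] + q23 - q30) / p["S"])


def euler(f, y0, dt, n_steps):
    """Explicit Euler integration: y[k+1] = y[k] + dt * f(y[k])."""
    y = tuple(y0)
    trajectory = [y]
    for _ in range(n_steps):
        dy = f(y)
        y = tuple(yi + dt * di for yi, di in zip(y, dy))
        trajectory.append(y)
    return trajectory
```

Under these assumptions, integrating from empty tanks with a 1 s step until the levels stop changing gives the steady state around which training data can be generated.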
Figure 4. Steady-state liquid level within the three tanks under the operating conditions of Table 1.
$$u = u_{\mathrm{steady\,state}} \cdot (m + \sigma N) \qquad (13)$$
The changes in the input volumetric flow and the liquid levels compared to their
steady-state values are shown in Figure 5.
The DPCA model was constructed in accordance with the algorithm proposed by
Ku et al. in the original article detailing DPCA [24]. The PCA transform was calculated as
per Equations (3)–(6). Both the lag number and the number of PCs were tuned.
To determine linear relations, a threshold ($\lambda_{\min}$) was set for the eigenvalues of the
corresponding PC scores. The threshold was determined using
Equation (14):
$$\lambda_{\min} = \max_{1 \le i \le p} \lambda_i \cdot 10^{-4} \qquad (14)$$
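Two building blocks of this tuning step can be sketched in plain Python: the time-lagged data matrix used by DPCA and the eigenvalue threshold of Equation (14). The function names are hypothetical, and counting the eigenvalues below the threshold as linear relations is an interpretation of the procedure of Ku et al. rather than a transcription of it.

```python
def lagged_matrix(X, lags):
    """Augment each sample with its `lags` predecessors.

    Row t of the result is [x(t), x(t-1), ..., x(t-lags)] for t = lags .. n-1,
    where X is a list of samples and each sample is a list of variables.
    """
    return [
        [value for k in range(lags + 1) for value in X[t - k]]
        for t in range(lags, len(X))
    ]


def eigenvalue_threshold(eigenvalues, relative=1e-4):
    """Threshold of Equation (14): largest eigenvalue times 1e-4."""
    return max(eigenvalues) * relative


def count_linear_relations(eigenvalues, relative=1e-4):
    """Number of eigenvalues small enough to be treated as linear relations."""
    limit = eigenvalue_threshold(eigenvalues, relative)
    return sum(1 for lam in eigenvalues if lam < limit)
```

The eigenvalues passed to `count_linear_relations` would come from a PCA of the lagged matrix, repeated for each candidate lag number.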
Figure 5. Changes in the inlet volumetric flows (left) and the liquid levels (right) within the tanks
compared to their steady-state conditions.
During the tuning, lag values from 0 to 2 are tested. The eigenvalues of
the DPCA transformation as well as the threshold are plotted for the investigated instances
in the form of a scree plot in Figure 6. As the lag number
increases, the numeric values of the first few eigenvalues also increase; therefore, at lag 2,
the third PC also becomes more significant. Based on the results, an optimal lag number
of 2 is determined, and the first two PCs are retained. The number of new relations, $r_{new}$, is
calculated during the different iterations and plotted in the
second subplot of Figure 6. The subplot shows that the transform reveals new successive
relations due to the addition of lags, and the autocorrelation between data is properly
contained within the PCs; however, after a lag number of 2, no new relationships can
be observed.
Figure 6. Eigenvalues of the PCs (left) and the number of new relations for different lag values (right).
The auto- and cross-correlation in the discarded PCs are examined to validate the results,
as proposed by Ku et al. [24]. The results for the first two discarded PCs for both the original
PCA transform and the chosen DPCA transform with a lag value of 2 are displayed in
Figures 7 and 8. It must be noted that for 0 lags, the optimal number of PCs to be retained is 1,
and thus, the correlation plots are displayed for PCs 2 to 3. Comparing Figures 7 and 8
shows that the DPCA transform with 2 lags significantly decreases the autocorrelation
of the discarded PC scores, meaning that the dynamic tendencies of the data are mostly
captured in the transform.
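The autocorrelation check on a discarded PC score series can be sketched with the usual sample autocorrelation estimator. The function names are hypothetical, and the bound of 1.96 divided by the square root of n is the common approximate 95% band for white noise, used here as an assumed acceptance criterion.

```python
import math


def autocorrelation(series, max_lag):
    """Sample autocorrelation coefficients r_0 .. r_max_lag of a score series."""
    n = len(series)
    mean = sum(series) / n
    dev = [x - mean for x in series]
    denom = sum(d * d for d in dev)
    return [
        sum(dev[t] * dev[t + k] for t in range(n - k)) / denom
        for k in range(max_lag + 1)
    ]


def white_noise_bound(n):
    """Approximate 95% band for the autocorrelation of white noise."""
    return 1.96 / math.sqrt(n)
```

If the coefficients at non-zero lags stay within the band, the discarded scores carry little remaining dynamic information.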
Figure 7. Autocorrelation plots for the first three discarded PCs in basic PCA.
Figure 8. Autocorrelation plots for the first three discarded PCs in basic DPCA, with a lag number
of 2.
To validate the performance of the model on the training data set, the Q statistic was
calculated and evaluated against the upper control limit calculated from Equation (10),
corresponding to a 95% confidence level. The results are shown in Figure 9; the low value
of the Q statistic and the fact that it never exceeds the control limit indicate that the
model accurately represents the process.
Figure 9. Q-statistic for original training data (Figure 5) with the trained DPCA model.
Based on the analysis, a preliminary BN was established to model the process. The
graphical representation of the BN is shown in Figure 10. The model was developed using the
results of the FMEA as well as expert knowledge and process data. The connections
between the observed variables (in this case the liquid levels in the tanks) and the fault
scenarios were established using historical data obtained through a process simulation of 7200 h, with a
data set containing 100 setpoint changes and 500 fault scenarios, including simultaneously
occurring fault instances. Since each valve and leakage failure mode has the same associated
severity and detectability scores, the individual valve and leakage failures ($f_1$–$f_6$) were
not represented. Using the historical data obtained through the simulations, the CPTs of
the liquid level values associated with valve fouling and leakage scenarios were calculated
using maximum likelihood estimation (MLE) [44]. The probabilities of valve
fouling and leakage and the conditional probability of human injury could not be estimated from data, as no
historical data were available to the authors; therefore, the authors utilized expert knowledge
to give estimates for these probabilities.
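For discrete variables with fully observed data, maximum likelihood estimation of a CPT reduces to relative frequency counting. The sketch below illustrates this under that assumption; the record format, the function name and the example node names are hypothetical.

```python
from collections import Counter, defaultdict


def estimate_cpt(records, child, parents):
    """Estimate P(child | parents) by relative frequencies (MLE for complete data).

    records: list of dicts mapping node names to observed states.
    Returns {parent_state_tuple: {child_state: probability}}.
    """
    counts = defaultdict(Counter)
    for record in records:
        parent_states = tuple(record[p] for p in parents)
        counts[parent_states][record[child]] += 1
    return {
        parent_states: {state: c / sum(counter.values()) for state, c in counter.items()}
        for parent_states, counter in counts.items()
    }
```

Parent configurations that never occur in the data receive no entry, which is one reason why rare events are often assigned probabilities from expert knowledge instead.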
[Figure 10: Bayesian network of the three-tank system with nodes for valve fouling (V), leak (L), the states (normal, low, high) of the liquid levels L1 and L3, and human injury.]
The valve fouling and leak instances both have two possible states, “False-F” and
“True-T”, while the liquid level states can be “Low-L”, “Normal-N” or “High-H”. The
liquid level states were assigned using Equation (15), where $m(l_i)$ and $\sigma(l_i)$ are the
mean and standard deviation of the respective i-th liquid level over the simulation interval,
and $l_i(t)$ is the i-th liquid level at a given time stamp $t$:
$$\mathrm{State}_{l_i}(t) = \begin{cases} L & \text{if } l_i(t) < m(l_i) - \sigma(l_i) \\ N & \text{if } m(l_i) - \sigma(l_i) \le l_i(t) \le m(l_i) + \sigma(l_i) \\ H & \text{otherwise} \end{cases} \qquad (15)$$
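Equation (15) translates directly into a small discretization routine. In this sketch the population standard deviation is used, which is an assumption, since the text does not state which estimator was applied; the function name is hypothetical.

```python
from statistics import mean, pstdev


def level_states(levels):
    """Map a liquid level series to the states 'L', 'N' or 'H' per Equation (15)."""
    m = mean(levels)
    s = pstdev(levels)          # assumed estimator of the standard deviation
    states = []
    for level in levels:
        if level < m - s:
            states.append("L")
        elif level <= m + s:
            states.append("N")
        else:
            states.append("H")
    return states
```

The resulting state sequence is the kind of discrete evidence that can be entered into the BN at each time stamp.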
Subsequently, the risk profile of each failure mode could be continuously calculated
as a function of time through Equation (11). Since the severity and detectability of both
fault modes are static and known a priori, the time-dependent part of the RPN score
is the probability of a fault mode occurring. Using the BN, the risks of valve fouling and
leakage are continuously calculated using the PCs. Thus, for each failure mode, an RPN risk
profile can be observed, and an acceptable RPN threshold can be given. In the following,
the results of the method for the three-tank system are shown through a case study with a
timescale of 300 h and 5 simulated fault scenarios.
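A time-dependent RPN profile of this kind can be sketched as below. How the fault probability is mapped to an occurrence score is not stated in this excerpt, so the linear scaling by 10 is an assumption, and the severity and detectability scores in the usage note are hypothetical values.

```python
def rpn_profile(probabilities, severity, detectability, occurrence_scale=10.0):
    """RPN as a function of time with static S and D and time-varying occurrence.

    probabilities: fault probabilities from the BN, one per time stamp.
    The occurrence score is assumed to be occurrence_scale * probability.
    """
    return [severity * detectability * occurrence_scale * p for p in probabilities]


def exceeds_threshold(profile, threshold):
    """Flag the time stamps at which the RPN exceeds the acceptable threshold."""
    return [rpn > threshold for rpn in profile]
```

With hypothetical scores S = 6 and D = 2, a fault probability of 1 gives an RPN of 120, which exceeds a threshold of 100, while a probability of 0.5 gives 60 and does not.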
Figure 11 shows the changes in the input volumetric flow of the system as well as the
values of the fault signals over the observation period. Also displayed are the changes in
liquid level, compared to their steady-state values, and the values of the Q statistic for FD.
It can be seen that the greater values of the Q statistic correspond well to the fault signals.
The FD warning signals, which are raised when the Q statistic exceeds its control
limit, are shown in Figure 12 and correspond to the fault signals. After running a simulation of 10,000 h with 1000
randomly generated fault signals and 50 set point changes, the FAR and MAR values were
estimated to be 1.7% and 12.9%, respectively.
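The two rates can be estimated from labeled simulation data as sketched below. The sample-wise definitions used here (false alarms relative to normal samples, missed detections relative to faulty samples) are the commonly used ones and are an assumption, since the exact definitions are not given in this excerpt.

```python
def far_mar(fault_present, detected):
    """Estimate the false alarm rate and missed alarm rate from boolean series.

    fault_present[i] is True if a fault was active at sample i;
    detected[i] is True if the monitoring statistic exceeded its limit.
    """
    pairs = list(zip(fault_present, detected))
    normal = sum(1 for fault, _ in pairs if not fault)
    faulty = len(pairs) - normal
    if normal == 0 or faulty == 0:
        raise ValueError("both normal and faulty samples are required")
    false_alarms = sum(1 for fault, flag in pairs if flag and not fault)
    missed = sum(1 for fault, flag in pairs if fault and not flag)
    return false_alarms / normal, missed / faulty
```

The MAR obtained this way is the kind of quantity from which a detectability score of a failure mode could be derived.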
Figure 11. Changes in inlet volumetric flow (upper left), system fault signals (upper right),
system level changes (lower left) and values of the Q statistic for the three-tank benchmark problem
(lower right).
Figure 12. Warning signals for the case study of the three-tank benchmark problem.
Using the values of the liquid levels, the probability of failure modes was calculated
using the BN structure of Figure 10. The resulting probabilities are shown as a function
of time in Figure 13. When comparing Figure 13 with the fault signals in Figure 11, it can
be seen that both leakages and valve fouling can be reliably identified using the BN. In
the case of leakages, the distinction is almost perfect. In the case of valve fouling,
instances of leakages also result in small probability values of valve fouling, but the actual
valve fouling events possess significantly higher probability scores.
Figure 13. Probability of leakage (left) and valve fouling (right) failure modes in the three-tank
benchmark problem.
The RPN scores as a function of time are shown in Figure 14 for both failure modes.
In the case of valve fouling, the low probability events which belong to leakages induce no
great differences in the final RPN score, while actual valve fouling events are characterized
by the maximal possible RPN for this failure mode (40). In the case of leakages, the
accurately identified leakage events all achieve their maximum RPN value (120). The RPN
threshold for this application is also shown; it was chosen as 100.
Figure 14. RPN scores of failure modes in the three-tank benchmark case study.
Finally, the actual alarm signals are shown in Figure 15, taking both process risk and
FD results into account. When compared with the warning signals in Figure 12, it can be
seen that while all fault instances may be reliably detected using the DPCA technique, the
FMEA-based BN risk analysis was able to single out the safety-critical events, which require
the attention of operators and possible shutdowns of the system to prevent accidents in
the technology.
Figure 15. Alarm signals in the three-tank benchmark case study taking both process risk and FD
results into account.
4.2. Case Study of the Liquid Organic Hydrogen Carrier (LOHC) Technology’s
Dehydrogenation Reactor
The LOHC technology is a promising industrial process for the safe storage and
transportation of hydrogen, which is fundamental for the hydrogen-based economy [45].
One of the critical questions of hydrogen-based energy involves the safe and economically
sustainable transport and storage of hydrogen during its lifecycle, since hydrogen is a
low-density and highly explosive gas [46].
Various solutions have been proposed for this problem, such as binding hydrogen to
metal hydrides or storing hydrogen as a high-pressure gas or in a liquefied state. As an
alternative to these techniques, the LOHC process for hydrogen transport and storage
involves chemically binding hydrogen during a reaction to a liquid organic carrier molecule
for safe transportation, which can be economically beneficial, as it allows storing hydrogen
at ambient conditions [45]. The transport of hydrogen during this procedure is based on
two steps: the first involves the binding of the hydrogen (hydrogenation) into the LOHC
molecule, and the second is the release of hydrogen (dehydrogenation) at
the site of use.
In this study, the kinetics of the reaction were assumed to follow the Langmuir–
Hinshelwood–Hougen–Watson (LHHW) kinetics, which is suitable for heterogeneous
reactions in the presence of a solid catalyst [48]. Parameters of the kinetic equation were
identified using experimental data from a laboratory-scale plug flow dehydrogenation reactor.
In our case study, we utilized a simplified structure of the system. The assumed
layout of the studied unit is displayed in Figure 16. The reactor is fed H2 and MCH,
and the gas streams are mixed before entering the system in the mixer unit (1.), in which
their concentration ratio is controlled. After creating the proper mixture, the gas stream is
heated in a heat exchanger (2.) to the temperature of the operating point before entering
the reactor (3.). The reactor is an adiabatic plug flow reactor where the dehydrogenation
process takes place. After exiting the reactor, the temperature (4.) and concentration of H2
and MCH in the outlet stream are measured using a sensor unit (5.).
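[Figure 16: layout of the studied unit with H2 and MCH feeds, mixer (1.), heat exchanger (2.), reactor (3.) with its length and diameter, temperature sensor (4.), and concentration sensor (5.).]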
The constructional parameters of the pilot reactor such as length (l), diameter (d),
cross-section area (A) and volume (V) are shown in Table 3.
The observed variables are the concentrations of MCH and H2 in the
feed, the inlet temperature $T_{in}$, the concentrations of the components at the
outlet, and the outlet temperature $T_{out}$.
Using the identified kinetic parameters, a first-principle model for the system was
developed. The flow regime was approximated as being an ideal plug flow. During the
calculations of energy and component mass balance, the convection (in the longitudinal
direction) and source terms due to reaction were accounted for. Under the above assump-
tions, the component mass and energy balances for the unit were given as a system of
partial differential equations shown in Equation (17):
$$\begin{aligned} \frac{\partial c_i}{\partial t} &= -v_x \frac{\partial c_i}{\partial x} + r_i \\ \frac{\partial T}{\partial t} &= -v_x \frac{\partial T}{\partial x} + \frac{\sum_{i=1}^{N} \Delta H_{r,i}\, r_i}{\rho c_p} \end{aligned} \qquad (17)$$
In the equation, $c_i$ refers to the concentration of the i-th component, $v_x$ is the flow
velocity in the longitudinal direction within the reactor, $r_i$ is the reaction source term for a
specific component, $\Delta H_{r,i}$ refers to the reaction heat of specific reactions taking place, and $\rho$
and $c_p$ are the density and heat capacity of the medium within the reactor.
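To illustrate how a balance of the form of Equation (17) can be integrated numerically, the sketch below applies a first-order upwind discretization with explicit time stepping to the component balance of a single species. It is not the MATLAB solution used in the study: the first-order rate law $r = -kc$ stands in for the LHHW kinetics, the energy balance is omitted, and all parameter values are hypothetical.

```python
import math


def simulate_pfr(c_in, v, length, k, n_cells=100, t_end=5.0, cfl=0.5):
    """Integrate dc/dt = -v dc/dx - k c on a uniform grid with an upwind scheme.

    c_in is the inlet (boundary) concentration, the reactor is initially empty,
    and the time step follows from the chosen CFL number v * dt / dx.
    Returns the concentration in each cell at t_end.
    """
    dx = length / n_cells
    dt = cfl * dx / v
    c = [0.0] * n_cells
    for _ in range(int(round(t_end / dt))):
        upstream = [c_in] + c[:-1]             # value in the neighbouring upstream cell
        c = [
            cj - v * dt / dx * (cj - cu) - dt * k * cj
            for cj, cu in zip(c, upstream)
        ]
    return c


def analytic_outlet(c_in, v, length, k):
    """Steady-state outlet concentration of an ideal plug flow reactor."""
    return c_in * math.exp(-k * length / v)
```

For a first-order reaction the numerical outlet concentration can be compared with the analytic plug flow solution, which gives a quick check of the grid resolution before more complex kinetics are inserted.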
The mathematical model of the system was solved using MATLAB R2020b, with the
appropriate initial and boundary conditions. The initial (x, t = 0) and boundary (x = 0, t)
conditions as well as the parameters of the material within the unit are given in Table 4,
where B is the inlet volumetric flow rate of the feed. The dependence of the heat capacity and
density of the material on temperature was studied, and it was found that in
the investigated regime, the material properties showed no significant changes. In light of
this, both the density and heat capacity of the material system were assumed to be constant
during the investigations.
Should the flow conditions in the system not allow the use of ideal system models such
as plug flow, computational fluid dynamics methods may be utilized instead to
obtain data pertaining to system behavior. If material properties such as density
or heat capacity vary greatly over the observation period, empirical
functions may be fitted to account for their changes due to temperature fluctuations. While
these changes may increase the computational load of data generation, they have no
impact on the procedure of the proposed supervision algorithms. If DPCA performance
were to deteriorate, alternative non-linear methods such as kernel principal component
analysis (KPCA) may be used to characterize system behavior, estimate the missed alarm rate,
and perform fault detection.
Table 4. Initial and boundary conditions as well as material parameters within the unit.
Quantity | Unit | Value | Quantity | Unit | Value
c_MCH | mol m⁻³ | 12 | c_H2 | mol m⁻³ | 0
c_TOL | mol m⁻³ | 0 | T | K | 593
c_H2 | mol m⁻³ | 3 | ρ | kg m⁻³ | 2.99
T | K | 593 | c_p | J kg⁻¹ K⁻¹ | 0.23
Fault Root Cause | Function | Potential Failure Mode | Potential Cause of Failure | Failure Consequences | Current Process Controls | Recommended Actions | Severity Score (S) | Detectability Score (D)
Heat exchanger | Temperature control | Abnormal temperature profile | Heat exchanger fouling | Explosion, catalyst fouling | Outlet temperature sensor | Heat exchanger cleaning, process shutdown | 9 | 2
Mixer | Inlet concentration control | Abnormal inlet concentration profile | Valve sticking | Explosion, product loss | Inlet composition sensor | Shutdown, valve change | 10 | 4
The characteristic safety indicators of the process are the changes within the mixture
temperature and the concentration of hydrogen and MCH. Thus, the risk level of the
process is determined based on the deviation of these three variables. The root causes
for the deviations include the failure of the process heat exchanger due to fouling, which
causes deviation of the process temperature from its nominal values; this, in turn, can lead
to catalyst deactivation within the reactor unit.
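[Figure: Bayesian network of the LOHC reactor. Root nodes: mixer failure (M) and heat exchanger failure (HX). Symptom nodes: outlet H2 concentration, outlet MCH concentration, and outlet temperature (OT). Consequence nodes: explosion (E) and catalyst fouling (CT). The CPT entries recoverable from the figure are listed below as extracted.]

CPT of the outlet H2 concentration:
M | HX | H2 normal | H2 low | H2 high
F | F | 0.98 | 0.01 | 0.01
T | F | 0.63 | 0.05 | 0.32
F | T | 0.08 | 0.01 | 0.91

CPT of the outlet temperature (rows shown for MCH = N):
M | HX | MCH | OT normal | OT low | OT high
F | F | N | 0.98 | 0.01 | 0.01
T | F | N | 0.95 | 0.02 | 0.03
F | T | N | 0.96 | 0.02 | 0.02
T | T | N | 0.9 | 0.06 | 0.04

CPT of explosion (E):
H2 | OT | E false | E true
N | L | 0.96 | 0.04
L | L | 0.99 | 0.01
H | L | 0.72 | 0.28
N | H | 0.12 | 0.88
L | H | 0.23 | 0.77
H | H | 0.02 | 0.98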
Mixer failure, heat exchanger failure, explosion, and catalyst fouling each have
two possible states, “False-F” and “True-T”; the CPTs of these occurrences,
similarly to the previous case study, were filled out using expert knowledge, as no process
data were initially available. In contrast, the relationships between the failure modes
and the failure symptoms (inlet concentration and outlet temperature deviation) were
estimated from simulation case studies, as before, using maximum likelihood
estimation. The failure symptoms have three possible states, “Normal-N”, “Low-L” and
“High-H”, which were determined analogously to Equation (15).
After training the DPCA model using observation data of the process obtained for
1000 h using a set of 500 observed process faults and 100 set point changes, the FMEA-based
BN was utilized to simultaneously detect faults and observe process risks. There are
three distinct types of process faults: $f_1^{MCH}$, the fault of the mixer causing changes
in the MCH inlet concentration; $f_1^{H_2}$, the change in the inlet H2 concentration
due to mixer failure; and $f_2$, the fault of the heat exchanger resulting in abnormal
outlet temperature.
The steady-state concentration and temperature profile of the unit are shown in
Figure 18 under the conditions given in Table 4.
The changes in the steady-state boundary conditions and the possible fault signals
are shown in Figure 19 for 1 h of simulation time with 5 set point changes and 5 fault
signals. It can be seen that all faults could reasonably be isolated using the Q
statistic.
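The Q statistic used for detection is the squared prediction error of a time-lagged sample with respect to the retained principal component subspace. The sketch below illustrates only that calculation under stated assumptions: the loading vector, the lag order, the data, and the control limit `q_limit` are hypothetical, the loadings are assumed orthonormal, and the data are assumed to be already centered and scaled. Fitting the DPCA model itself is not shown.

```python
def lagged_rows(series, lags):
    """Stack each sample with its `lags` predecessors (DPCA-style augmentation).

    series: list of samples, each a list of variable values.
    Returns rows [x(t), x(t-1), ..., x(t-lags)] for every t >= lags.
    """
    rows = []
    for t in range(lags, len(series)):
        row = []
        for k in range(lags + 1):
            row.extend(series[t - k])
        rows.append(row)
    return rows


def q_statistic(x, loadings):
    """Squared prediction error of x with respect to the loading subspace.

    loadings: list of loading vectors, assumed orthonormal and as long as x.
    """
    scores = [sum(p_i * x_i for p_i, x_i in zip(p, x)) for p in loadings]
    recon = [
        sum(score * p[i] for score, p in zip(scores, loadings))
        for i in range(len(x))
    ]
    return sum((x_i - r_i) ** 2 for x_i, r_i in zip(x, recon))


# Hypothetical single-variable series, one lag, one retained component.
series = [[1.0], [2.0], [4.0]]
rows = lagged_rows(series, lags=1)
loadings = [[1.0, 0.0]]          # hypothetical orthonormal loading vector
q_values = [q_statistic(row, loadings) for row in rows]

q_limit = 2.0                    # hypothetical control limit
flags = [q > q_limit for q in q_values]
```

A sample is flagged when its Q value exceeds the control limit; in practice the limit is derived from the training data rather than set by hand.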
Figure 18. Steady-state operating point of the LOHC technology under the conditions in Table 4.
Figure 19. Fault detection procedure of the LOHC process, operating point changes (upper left), fault
signals (upper right), system responses (lower left), and Q statistic (lower right).
The warning signals raised due to fault presence are shown in Figure 20; they
coincide with the intervals in which faults are present.
The risk of each failure mode was calculated using the trained BN. The probabilities
of the heat exchanger and mixer failure modes as a function of time are displayed in
Figure 21.
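The probability of a failure mode given an observed symptom state follows from Bayes' rule applied over the network. The sketch below enumerates a two-parent fragment (mixer failure M and heat exchanger failure HX with one three-state symptom) and is only an illustration, not the authors' inference code. The prior failure probabilities and the (T, T) row of the CPT are hypothetical; the other three rows reuse values listed in the network figure.

```python
from itertools import product

# Hypothetical prior probabilities of each failure mode being true.
PRIOR_TRUE = {"M": 0.05, "HX": 0.05}

# P(symptom | M, HX); the (True, True) row is a hypothetical placeholder.
SYMPTOM_CPT = {
    (False, False): {"N": 0.98, "L": 0.01, "H": 0.01},
    (True, False): {"N": 0.63, "L": 0.05, "H": 0.32},
    (False, True): {"N": 0.08, "L": 0.01, "H": 0.91},
    (True, True): {"N": 0.05, "L": 0.01, "H": 0.94},
}


def prior(node, state):
    """Prior probability that `node` is in the boolean `state`."""
    p_true = PRIOR_TRUE[node]
    return p_true if state else 1.0 - p_true


def failure_posteriors(symptom_state):
    """Return P(M=True | symptom) and P(HX=True | symptom) by enumeration."""
    joint = {}
    for m, hx in product((False, True), repeat=2):
        joint[(m, hx)] = (
            prior("M", m) * prior("HX", hx) * SYMPTOM_CPT[(m, hx)][symptom_state]
        )
    evidence = sum(joint.values())
    p_m = sum(p for (m, _), p in joint.items() if m) / evidence
    p_hx = sum(p for (_, hx), p in joint.items() if hx) / evidence
    return {"M": p_m, "HX": p_hx}


post_high = failure_posteriors("H")
post_normal = failure_posteriors("N")
```

Evaluating `failure_posteriors` at each sampling instant with the current symptom state yields a failure probability trajectory over time.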
Figure 21. Probability of heat exchanger (left) and mixer (right) failure modes as a function of time.
Finally, the alarm signals based on the RPN score are shown in Figure 23.
Figure 23. Alarm signals as a function of time for the LOHC reactor.
By comparison, it can be seen that the changes in the RPN scores of the different failure
modes correspond to the actual failure scenarios shown in Figure 19. The RPN scores are
defined by the extent to which the given process variables deviate from their expected
steady-state values; in this case, all three process faults carried significant risks.
However, as seen in Figure 15, when non-safety-critical faults are present, they are
eliminated by the RPN screening. This way, the FD capabilities of the system are not
decreased, since warnings still indicate fault presence, while alarm floods can be
prevented, as only critical failures are highlighted.
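The screening logic described above can be sketched as a two-stage decision: the Q statistic decides whether a warning is raised, and the RPN score decides whether that warning is escalated to an alarm. This is an illustrative reading of the procedure, not the authors' code. The function name `screen`, both thresholds, and the sample values are hypothetical, and the RPN score is taken as a given input rather than computed.

```python
def screen(q, rpn, q_limit, rpn_threshold):
    """Return (warning, alarm) for one sample.

    warning: the Q statistic exceeds its control limit (a fault is present).
    alarm:   a warning is active and the RPN score exceeds the risk threshold.
    """
    warning = q > q_limit
    alarm = warning and rpn > rpn_threshold
    return warning, alarm


# Hypothetical (Q statistic, RPN score) samples.
observations = [
    (0.5, 10.0),   # normal operation
    (3.0, 5.0),    # fault present, low risk
    (4.0, 40.0),   # fault present, high risk
]
decisions = [screen(q, rpn, q_limit=2.0, rpn_threshold=20.0) for q, rpn in observations]
```

Because an alarm requires an active warning, every fault that the Q statistic detects is still reported; only the escalation depends on the risk score.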
This was tested for both case studies. In both instances, 10,000 random fault
signals were generated, and the numbers of alarms and warnings were compared. In the case
of the three-tank system, the ratio of alarms to warnings was 0.63, with a MAR of 12.5%.
For the LOHC example, the alarm-to-warning ratio was 0.89 with a MAR of 4.3%. In both
instances, the number of alarm signals was significantly decreased; only safety-critical
faults were highlighted, while faults which posed no significant risks were effectively
filtered out as warnings.
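The reported alarm-to-warning ratios translate directly into the share of warnings that were not escalated to alarms. The short sketch below performs that arithmetic on the two ratios given above, assuming that alarms are a subset of warnings so that the reduction equals one minus the ratio; the dictionary keys are only illustrative labels.

```python
# Alarm-to-warning ratios reported for the two case studies.
ratios = {"three-tank": 0.63, "LOHC": 0.89}

# Share of warnings that did not become alarms, assuming alarms are a
# subset of warnings.
reductions = {name: 1.0 - ratio for name, ratio in ratios.items()}

# Unweighted mean over the two case studies.
mean_reduction = sum(reductions.values()) / len(reductions)

percentages = {name: round(100 * value) for name, value in reductions.items()}
mean_percentage = round(100 * mean_reduction)
```

Under this assumption, the reduction is larger for the three-tank system than for the LOHC reactor, where most faults are safety critical.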
5. Discussion
In conclusion, the results show that the method could effectively diagnose systems,
pinpoint the presence of faults, and differentiate between safety-critical and
non-safety-critical process abnormalities.
Compared with previously introduced methods, the technique has the advantage of
being based on a standard risk assessment technique (FMEA), which is widely available for
industrial applications. In addition, as opposed to previous works, where risk was calculated
based on a probability of system malfunction and associated severity scores, in this study,
the modified RPN for risk assessment took fault propagation paths, fault detectability, and
severity into account, integrating the established frameworks of FMEA, Bayesian networks,
and DPCA.
Through the tunable RPN threshold, the safety restrictions can be relaxed or tightened,
effectively shifting between recognizing all faults as warnings and recognizing all faults
as alarms (the latter case being equivalent to using the DPCA results as they are for
alarm raising).
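The effect of the tunable threshold can be illustrated by counting how many warnings are escalated to alarms as the threshold changes. In this minimal sketch, the RPN scores and the threshold values are hypothetical and serve only to show the two limiting cases.

```python
def count_alarms(rpn_of_warnings, rpn_threshold):
    """Number of warnings whose RPN score exceeds the threshold."""
    return sum(1 for rpn in rpn_of_warnings if rpn > rpn_threshold)


# Hypothetical RPN scores of samples that already raised a warning.
rpn_of_warnings = [5.0, 12.0, 30.0, 45.0, 80.0]

alarms_by_threshold = {
    threshold: count_alarms(rpn_of_warnings, threshold)
    for threshold in (0.0, 20.0, 100.0)
}
```

A threshold of zero escalates every warning, which reproduces plain DPCA-based alarm raising, while a threshold above the largest score leaves every detected fault as a warning only.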
6. Conclusions
In this work, the authors introduced a risk-based fault detection (FD) method which
utilizes dynamic principal component analysis (DPCA) for FD and a Bayesian network (BN),
constructed using failure mode and effect analysis (FMEA), as a risk assessment tool.
The method was used for the online supervision of systems and was showcased using
a three-tank benchmark model and the model of a laboratory scale reactor used during
the dehydrogenation step of the liquid organic hydrogen carrier (LOHC) technology. In
both cases, the method managed to effectively reduce the number of process alarms by
filtering out non-safety-critical process faults, thus reducing the possibility of alarm floods.
The reduction in superfluous alarm signals was 37% for the three-tank system and 11% for
the LOHC reactor.
On average, the use of the technique reduced the number of raised alarms by 20–30%
in the observed case studies while being sensitive enough to pinpoint the presence of all faults.
Author Contributions: Conceptualization, B.L.T.; methodology, B.L.T.; software, B.L.T.; validation,
Á.B., T.C., A.E., L.L. and S.N.; formal analysis, Á.B., T.C., A.E., L.L. and S.N.; investigation, B.L.T.;
resources, A.E., L.L. and S.N.; data curation, B.L.T.; writing—original draft preparation, B.L.T.;
writing—review and editing, Á.B., T.C., A.E., L.L. and S.N.; visualization, B.L.T.; supervision, Á.B.,
T.C. and S.N.; project administration, Á.B., T.C., L.L., A.E. and S.N.; funding acquisition, A.E., L.L.
and S.N. All authors have read and agreed to the published version of the manuscript.
Funding: This work has been supported by the project Aquamarine—Hydrogen-based energy storage
solution at Hungarian Gas Storage Ltd. funded by the Ministry of Technology and Industry under
grant agreement No 2020-3.1.2-ZFR-KVG-2020-00001. This work has been implemented by the
TKP2021-NVA-10 project with the support provided by the Ministry of Culture and Innovation of
Hungary from the National Research, Development and Innovation Fund, financed under the 2021
Thematic Excellence Programme funding scheme.
Institutional Review Board Statement: Not applicable.
Informed Consent Statement: Not applicable.
Data Availability Statement: No new data were created or analyzed in this study. Data sharing is
not applicable to this article.
Conflicts of Interest: Author László Lovas has been involved as an expert at Hungarian Gas
Storage Ltd.
References
1. Venkatasubramanian, V.; Rengaswamy, R.; Yin, K.; Kavuri, S.N. A review of process fault detection and diagnosis: Part I:
Quantitative model-based methods. Comput. Chem. Eng. 2003, 27, 293–311. [CrossRef]
2. Venkatasubramanian, V.; Rengaswamy, R.; Kavuri, S.N.; Yin, K. A review of process fault detection and diagnosis: Part III:
Process history based methods. Comput. Chem. Eng. 2003, 27, 327–346. [CrossRef]
3. Venkatasubramanian, V.; Rengaswamy, R.; Kavuri, S.N. A review of process fault detection and diagnosis: Part II: Qualitative
models and search strategies. Comput. Chem. Eng. 2003, 27, 313–326. [CrossRef]
4. Bersimis, S.; Panaretos, J.; Psarakis, S. Multivariate Statistical Process Control Charts and the Problem of Interpretation: A Short
Overview and Some Applications in Industry. Econom. eJournal 2006. [CrossRef]
5. Zadakbar, O.; Imtiaz, S.; Khan, F. Why risk-based multivariate fault detection and diagnosis? IFAC Proc. Vol. 2013, 46, 672–677.
[CrossRef]
6. Misra, M.; Yue, H.H.; Qin, S.J.; Ling, C. Multivariate process monitoring and fault diagnosis by multi-scale PCA. Comput. Chem.
Eng. 2002, 26, 1281–1293. [CrossRef]
7. Zadakbar, O.; Imtiaz, S.; Khan, F. Dynamic risk assessment and fault detection using a multivariate technique. Process Saf. Prog.
2013, 32, 365–375. [CrossRef]
8. Lucke, M.; Chioua, M.; Grimholt, C.; Hollender, M.; Thornhill, N.F. Advances in alarm data analysis with a practical application
to online alarm flood classification. J. Process Control 2019, 79, 56–71. [CrossRef]
9. Kanes, R.; Marengo, M.C.R.; Abdel-Moati, H.; Cranefield, J.; Véchot, L. Developing a framework for dynamic risk assessment
using Bayesian networks and reliability data. J. Loss Prev. Process Ind. 2017, 50, 142–153. [CrossRef]
10. Yu, H.; Khan, F.; Garaniya, V. Risk-based fault detection using Self-Organizing Map. Reliab. Eng. Syst. Saf. 2015, 139, 82–96.
[CrossRef]
11. Amin, M.T.; Khan, F.; Ahmed, S.; Imtiaz, S. Risk-based fault detection and diagnosis for nonlinear and non-Gaussian process
systems using R-vine copula. Process Saf. Environ. Prot. 2021, 150, 123–136. [CrossRef]
12. Isimite, J.; Rubini, P. A dynamic HAZOP case study using the Texas City refinery explosion. J. Loss Prev. Process Ind. 2016,
40, 496–501. [CrossRef]
13. Rutt, B.; Catalyurek, U.; Hakobyan, A.; Metzroth, K.; Aldemir, T.; Denning, R.; Dunagan, S.; Kunsman, D. Distributed dynamic
event tree generation for reliability and risk assessment. In Proceedings of the 2006 IEEE Challenges of Large Applications in
Distributed Environments, Paris, France, 19 June 2006; pp. 61–70.
14. Yazdi, M.; Kabir, S.; Walker, M. Uncertainty handling in fault tree based risk assessment: State of the art and future perspectives.
Process Saf. Environ. Prot. 2019, 131, 89–104. [CrossRef]
15. Lipol, L.S.; Haq, J. Risk analysis method: FMEA/FMECA in the organizations. Int. J. Basic Appl. Sci. 2011, 11, 74–82.
16. Bao, H.; Khan, F.; Iqbal, T.; Chang, Y. Risk-based fault diagnosis and safety management for process systems. Process Saf. Prog.
2011, 30, 6–17. [CrossRef]
17. Khan, F.I.; Amyotte, P.R. How to make inherent safety practice a reality. Can. J. Chem. Eng. 2003, 81, 2–16. [CrossRef]
18. Aven, T. Risk assessment and risk management: Review of recent advances on their foundation. Eur. J. Oper. Res. 2016, 253, 1–13.
[CrossRef]
19. Jon, M.H.; Kim, Y.P.; Choe, U. Determination of a safety criterion via risk assessment of marine accidents based on a Markov
model with five states and MCMC simulation and on three risk factors. Ocean Eng. 2021, 236, 109000. [CrossRef]
20. Sadeghi, N.; Fayek, A.R.; Pedrycz, W. Fuzzy Monte Carlo simulation and risk assessment in construction. Comput.-Aided Civ.
Infrastruct. Eng. 2010, 25, 238–252. [CrossRef]
21. Kabir, S.; Papadopoulos, Y. Applications of Bayesian networks and Petri nets in safety, reliability, and risk assessments: A review.
Saf. Sci. 2019, 115, 154–175. [CrossRef]
22. Faghih-Roohi, S.; Xie, M.; Ng, K.M. Accident risk assessment in marine transportation via Markov modelling and Markov Chain
Monte Carlo simulation. Ocean Eng. 2014, 91, 363–370. [CrossRef]
23. Weber, P.; Medina-Oliva, G.; Simon, C.; Iung, B. Overview on Bayesian networks applications for dependability, risk analysis and
maintenance areas. Eng. Appl. Artif. Intell. 2012, 25, 671–682. [CrossRef]
24. Ku, W.; Storer, R.H.; Georgakis, C. Disturbance detection and isolation by dynamic principal component analysis. Chemom. Intell.
Lab. Syst. 1995, 30, 179–196. [CrossRef]
25. Choi, S.W.; Lee, C.; Lee, J.M.; Park, J.H.; Lee, I.B. Fault detection and identification of nonlinear processes based on kernel PCA.
Chemom. Intell. Lab. Syst. 2005, 75, 55–67. [CrossRef]
26. Jackson, J.E.; Mudholkar, G.S. Control procedures for residuals associated with principal component analysis. Technometrics 1979,
21, 341–349. [CrossRef]
27. Mashuri, M.; Ahsan, M.; Lee, M.H.; Prastyo, D.D. PCA-based Hotelling’s T2 chart with fast minimum covariance determinant
(FMCD) estimator and kernel density estimation (KDE) for network intrusion detection. Comput. Ind. Eng. 2021, 158, 107447.
[CrossRef]
28. Dong, Y.; Qin, S.J. A novel dynamic PCA algorithm for dynamic data modeling and process monitoring. J. Process Control 2018,
67, 1–11. [CrossRef]
29. Luo, R.; Misra, M.; Himmelblau, D.M. Sensor fault detection via multiscale analysis and dynamic PCA. Ind. Eng. Chem. Res. 1999,
38, 1489–1495. [CrossRef]
30. Wu, Z.; Liu, W.; Nie, W. Literature review and prospect of the development and application of FMEA in manufacturing industry.
Int. J. Adv. Manuf. Technol. 2021, 112, 1409–1436. [CrossRef]
31. Bouti, A.; Kadi, D.A. A state-of-the-art review of FMEA/FMECA. Int. J. Reliab. Qual. Saf. Eng. 1994, 1, 515–543. [CrossRef]
32. Peeters, J.; Basten, R.J.; Tinga, T. Improving failure analysis efficiency by combining FTA and FMEA in a recursive manner. Reliab.
Eng. Syst. Saf. 2018, 172, 36–44. [CrossRef]
33. Spreafico, C.; Russo, D.; Rizzi, C. A state-of-the-art review of FMEA/FMECA including patents. Comput. Sci. Rev. 2017, 25, 19–28.
[CrossRef]
34. Brun, A.; Savino, M.M. Assessing risk through composite FMEA with pairwise matrix and Markov chains. Int. J. Qual. Reliab.
Manag. 2018, 35, 1709–1733. [CrossRef]
35. Barua, S.; Gao, X.; Pasman, H.; Mannan, M.S. Bayesian network based dynamic operational risk assessment. J. Loss Prev. Process
Ind. 2016, 41, 399–410. [CrossRef]
36. Farmani, R.; Henriksen, H.J.; Savic, D.; Butler, D. An evolutionary Bayesian belief network methodology for participatory
decision making under uncertainty: An application to groundwater management. Integr. Environ. Assess. Manag. 2012, 8, 456–461.
[CrossRef] [PubMed]
37. Liang, G.; Yu, B. Maximum pseudo likelihood estimation in network tomography. IEEE Trans. Signal Process. 2003, 51, 2043–2053.
[CrossRef]
38. Brahim, I.B.; Addouche, S.A.; El Mhamedi, A.; Boujelbene, Y. Build a Bayesian network from FMECA in the production of
automotive parts: Diagnosis and prediction. IFAC-PapersOnLine 2019, 52, 2572–2577. [CrossRef]
39. Theilliol, D.; Noura, H.; Ponsart, J.C. Fault diagnosis and accommodation of a three-tank system based on analytical redundancy.
ISA Trans. 2002, 41, 365–382. [CrossRef] [PubMed]
40. Sainz, M.A.; Armengol, J.; Vehí, J. Fault detection and isolation of the three-tank system using the modal interval analysis.
J. Process Control 2002, 12, 325–338. [CrossRef]
41. Köppen-Seliger, B.; García, E.A.; Frank, P.M. Fault detection: Different strategies for modelling applied to the three tank
benchmark—A case study. In Proceedings of the 1999 European Control Conference (ECC), Karlsruhe, Germany, 31 August–3
September 1999; pp. 4432–4437.
42. Tarcsay, B.L.; Bárkányi, Á.; Chován, T.; Németh, S. A Dynamic Principal Component Analysis and Fréchet-Distance-Based
Algorithm for Fault Detection and Isolation in Industrial Processes. Processes 2022, 10, 2409. [CrossRef]
43. Bhattacharjee, P.; Dey, V.; Mandal, U. Risk assessment by failure mode and effects analysis (FMEA) using an interval number
based logistic regression model. Saf. Sci. 2020, 132, 104967. [CrossRef]
44. Ji, Z.; Xia, Q.; Meng, G. A review of parameter learning methods in Bayesian network. In Advanced Intelligent Computing Theories
and Applications: 11th International Conference, ICIC 2015, Fuzhou, China, 20–23 August 2015; Part III 11; Springer: Berlin/Heidelberg,
Germany, 2015; pp. 3–12.
45. Niermann, M.; Beckendorff, A.; Kaltschmitt, M.; Bonhoff, K. Liquid Organic Hydrogen Carrier (LOHC)–Assessment based on
chemical and economic properties. Int. J. Hydrogen Energy 2019, 44, 6631–6654. [CrossRef]
46. Rao, P.C.; Yoon, M. Potential liquid-organic hydrogen carrier (LOHC) systems: A review on recent progress. Energies 2020,
13, 6040. [CrossRef]
47. Hamayun, M.H.; Maafa, I.M.; Hussain, M.; Aslam, R. Simulation study to investigate the effects of operational conditions on
methylcyclohexane dehydrogenation for hydrogen production. Energies 2020, 13, 206. [CrossRef]
48. Sekine, Y.; Higo, T. Recent trends on the dehydrogenation catalysis of liquid organic hydrogen carrier (LOHC): A review. Top.
Catal. 2021, 64, 470–480. [CrossRef]
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual
author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to
people or property resulting from any ideas, methods, instructions or products referred to in the content.