An Integrated Method For Incorporating Common Cause Failures in System Analysis
Zhihua Tang, University of Virginia, Charlottesville
Joanne Bechta Dugan, University of Virginia, Charlottesville
Key Words: Common Cause Failure, Common Cause Group, Binary Decision Diagram, Failure Process, Unreliability.
m    number of components in a CCG.
$R_m^{(k)}$    the probability of the occurrence of k success events among m components subject to a CCF.
p    the reliability of a component.
$P_j$    the probability that a component survives the j-component failure process.
$U_{TMR}$    the system unreliability of the TMR system ignoring CCF.
$U_{TMR}^{CCF}$    the system unreliability of the TMR system including CCF.
λ    failure rate.

1 The singular & plural of an acronym are always spelled the same.

Figure 1: Triple Modular Redundancy (TMR)

Taking CCF into account, the total failure probability of component A can be expressed as follows:

$$A_T = A_I + C_{AB} + C_{AC} + C_{ABC}, \tag{1}$$

where,
$A_T$ = total failure of component A,
$A_I$ = failure of component A from s-independent causes,
Authorized licensed use limited to: b-on: Universidade de Lisboa Reitoria. Downloaded on March 17,2021 at 13:49:40 UTC from IEEE Xplore. Restrictions apply.
$C_{AB}$ = failure of components A and B (but not C) from common causes,
$C_{AC}$ = failure of components A and C (but not B) from common causes,
$C_{ABC}$ = failure of components A, B and C from common causes.

Similar expressions can be developed for components B and C. Thus, by assuming that the voter is perfect, the cut sets of the TMR system are: $\{A_I, B_I\}$, $\{A_I, C_I\}$, $\{B_I, C_I\}$, $\{C_{AB}\}$, $\{C_{AC}\}$, $\{C_{BC}\}$ and $\{C_{ABC}\}$. It is evident that if only independence is assumed, only the first three cut sets are involved. Neglecting the common cause cut sets makes the result of the reliability analysis optimistic.

There are two basic approaches for incorporating CCF into system analysis: explicit and implicit methods (Ref. 5). The explicit method models CCF as shared basic events in the system fault tree, then applies conventional fault tree analysis approaches to analyze the fault tree with CCF basic events. In the implicit method, the fault trees are built without considering CCF and then the algebraic system unreliability expression is derived in a certain form. Such an expression is then evaluated so that the contribution of CCF is correctly included.

Unfortunately, both explicit and implicit methods are ill-suited to CCF analysis of large-scale fault trees. The explicit method may involve adding a large number of CCF basic events into a fault tree. This is tedious to handle, especially for high levels of redundancy. As for the implicit method, the algebraic expression of system unreliability is usually not easy to derive. This makes the implicit method feasible only for hand-calculating the unreliabilities of relatively small systems.

The method described in this paper combines the implicit method with the efficient Boolean function manipulation method, BDD. It retains the advantage of the implicit method and makes the implicit method more practical. The CCF analysis in the DFT model is also discussed.

2. CCF ANALYSIS REVIEW

2.1 General Assumptions

1) CCF exist only among s-identically distributed components.
2) Each component is subject to at most one kind of CCF.
3) All components subject to the same CCF are categorized as a common cause group (CCG).
4) Each CCF is associated with a set of failure processes (independent failures and CC failures), which are named with the number of components involved.
5) All failure processes are mutually statistically independent (s-independent).

2.2 Explicit Method

In the explicit method, any component involving a CCF is modeled as a set of basic events in the system fault tree. Each basic event represents a type of failure process, that is, the system is modeled with failure process level basic events. Since all failure processes are s-independent, the resultant fault tree can be solved using traditional fault tree analysis techniques. In practice, the system fault tree with explicit CCF modeling can be built by expanding a system fault tree with extra CCF events. Taking the TMR system (shown in Figure 1) as an example, the fault tree and the expanded fault tree are illustrated in Figure 2.

Figure 2: A TMR Fault Tree and Its CCF Expanding

In the expanded fault tree, each component is extended from the single basic event to three basic events. One of them represents the independent failure process; the other two represent the CCF, which are shared with other involved components. After being expanded, the system can be evaluated with either BDD or Markov models.

A major disadvantage of the explicit method is that the number of expanded basic events is subject to combinatorial explosion as the redundancy level increases; hence the costs (time and space) of system analysis increase remarkably. Generally, $\sum_{j=2}^{m} \binom{m}{j}$ extra basic events (that is, $2^m - m - 1$) are needed to explicitly model a system with m redundant components.

2.3 Implicit Method

An implicit method considers CCF during analysis rather than in the modeling stage; it does not need the tedious expansion of basic events to include CCF. The implicit method consists of three main steps (Ref. 3):

1) For each CCG with m s-identical components, calculate the probabilities of k ($1 \le k \le m$) success events, $R_m^{(k)}$.

2) Determine the system reliability/unreliability expression in terms of the individual-component reliabilities (success probabilities), without considering any CCF. This expression is a linear function of each individual-component reliability and is reduced to the “sum of products” form.
3) In any term containing a product of k success probabilities of k basic events belonging to the same CCG of m s-identical components, replace that product with $R_m^{(k)}$.

As an example, the unreliability (without considering CCF) of the TMR system (as shown in Figure 1 and assuming the voter is perfect) is expressed as:

$$U = 1 - 3p^2 + 2p^3, \tag{2}$$

where p is the reliability of each redundant component.

After taking CCF into account, $p^2$ and $p^3$ are replaced by $R_3^{(2)}$ and $R_3^{(3)}$ respectively. The resulting TMR unreliability including CCF is:

$$U_{TMR}^{CCF} = 1 - 3R_3^{(2)} + 2R_3^{(3)}. \tag{3}$$

The terms $R_m^{(k)}$ were originally derived in Ref. 6. They are calculated using equations (4) and (5), where $P_j$ is the reliability associated with the j-component failure process:

$$R_m^{(1)} = \prod_{j=1}^{m} \left[P_j\right]^{\binom{m-1}{j-1}}, \tag{4}$$

$$R_m^{(k)} = \prod_{n=m-k+1}^{m} R_n^{(1)}. \tag{5}$$

Although the implicit method does not need to expand the basic events, in practice its limitations are also obvious:

1) The implicit method is practical only for relatively small systems (e.g. fewer than 10 components). When the system becomes larger, it is very difficult to determine the reliability

3. INCORPORATING THE IMPLICIT METHOD INTO BDD

3.1 BDD

A BDD is a directed acyclic graph representation of a Boolean function. A BDD has two terminal nodes, 0 and 1, encoding the two corresponding constant functions. Each internal node has two edges, which are called the 0-edge and the 1-edge respectively. A BDD is derived by reducing a binary decision tree, which represents the recursive execution of Shannon’s decomposition. The following reduction rules are used in the construction of a BDD from a binary decision tree.

1) Delete all the redundant nodes whose two edges point to the same node.
2) Merge all the isomorphic subgraphs.

With these two reduction rules, a BDD can represent a Boolean function efficiently. For details about BDD, the reader is referred to two papers by Bryant (Refs. 7, 8).

3.2 Incorporating the Implicit Method into BDD

One major inconvenience in the implicit method is the need to derive and manipulate the algebraic Boolean expression of system unreliability. This inconvenience constrains the use of the implicit method to relatively small systems. One way to overcome it is to find an efficient method to derive and manipulate the algebraic Boolean expression of system unreliability. As is well known, the BDD is among the most efficient data structures for handling Boolean functions. It is therefore natural to incorporate the implicit method into BDD.

Since a static fault tree is essentially a Boolean formula and can be efficiently encoded by means of BDD, the BDD converted from a static fault tree is also essentially a Boolean formula. The product of the variables (either in operational state or in failed state) on the edges along one 1-path (the path from the root node to the 1-terminal node) corresponds to one term in the Boolean formula. After replacing the variables with the corresponding probabilities, the sum of products obtained from all 1-paths is the algebraic Boolean expression of system unreliability. Taking the TMR system as an example, the BDD is constructed as shown in Figure 3, and we can find three 1-paths.

[Figure 3: BDD of the TMR system, with internal nodes A, B, B, C and terminal nodes 1 and 0]

From the BDD shown in Figure 3, we can get the Boolean formula corresponding to the example TMR system as:

$$f_{TMR}(A, B, C) = AB + A\bar{B}C + \bar{A}BC. \tag{6}$$

By replacing the variables with the corresponding reliabilities/unreliabilities, we can get the algebraic expression of system unreliability as:

$$U_{TMR} = q_A q_B + q_A p_B q_C + p_A q_B q_C. \tag{7}$$

Note that equation (7) assumes that components A, B and C are s-independent. If components A, B and C are subject to a CC, we need to calculate the joint probabilities for each 1-path. The system unreliability is expressed as:
$$U_{TMR}^{CCF} = \Pr\{AB\} + \Pr\{A\bar{B}C\} + \Pr\{\bar{A}BC\}. \tag{8}$$

Ref. 9 derived a general formula for calculating the joint probability of the occurrence of a set of events subject to a CC. Given the size of the CCG, m, the size of the event set, g, and the number of success events, k, the joint probability is calculated as:

$$\Pr\{E_1 E_2 \ldots E_g\} = \sum_{i=0}^{g-k} \binom{g-k}{i} \cdot (-1)^i R_m^{(i+k)}. \tag{9}$$

For example, the joint probabilities in equation (8) are calculated as follows:

$$\Pr\{AB\} = \sum_{i=0}^{2-0} \binom{2-0}{i} \cdot (-1)^i R_3^{(i+0)} = 1 - 2R_3^{(1)} + R_3^{(2)}; \tag{10}$$

$$\Pr\{A\bar{B}C\} = \sum_{i=0}^{3-1} \binom{3-1}{i} \cdot (-1)^i R_3^{(i+1)} = R_3^{(1)} - 2R_3^{(2)} + R_3^{(3)}; \tag{11}$$

Usually the DFT models are evaluated by converting the DFT into Markov models.

Incorporating CCF analysis into Markov models is conceptually straightforward: additional transitions are added to represent the occurrences of CCF. This can be done based on the conventional Markov models, and the resultant Markov models have the same states as the original Markov models. Consider the TMR example. The Markov model that ignores the CCF is shown in Figure 4. All components have the same failure rate λ. In the initial state (state 1), all components are operational. The failure of any one of the three components will trigger a transition from the initial state to state 2, which represents a system configuration with 2 operational components and one failed component. Furthermore, any component failure in state 2 will lead to the system failed state (state F), according to the failure criteria of a TMR system.

[Figure 4 diagram: state 1 to state 2 at rate 3λ; state 2 to state F at rate 2λ]
Figure 4: The TMR Markov model ignoring CCF
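The three-state model of Figure 4 is small enough to check numerically. The following is a minimal sketch, not the authors' implementation: it integrates the forward equations dP/dt = P·Q for the chain 1 → 2 → F with rates 3λ and 2λ, and compares the probability of state F with equation (2). The exponential lifetime p = exp(−λt), the function names, the Runge-Kutta step count, and the numeric values of λ and t are all assumptions made for illustration.

```python
import math


def tmr_generator(lam):
    """Generator matrix Q for states (1, 2, F) of the TMR model ignoring CCF."""
    return [
        [-3.0 * lam, 3.0 * lam, 0.0],  # state 1: any of 3 components fails
        [0.0, -2.0 * lam, 2.0 * lam],  # state 2: any of 2 remaining fails
        [0.0, 0.0, 0.0],               # state F: absorbing
    ]


def derivative(p, q):
    """Right-hand side of dP/dt = P * Q (row vector times matrix)."""
    n = len(p)
    return [sum(p[i] * q[i][j] for i in range(n)) for j in range(n)]


def solve_markov(q, p0, t, steps=2000):
    """Integrate dP/dt = P * Q from 0 to t with the classical RK4 scheme."""
    h = t / steps
    p = list(p0)
    for _ in range(steps):
        k1 = derivative(p, q)
        k2 = derivative([x + 0.5 * h * k for x, k in zip(p, k1)], q)
        k3 = derivative([x + 0.5 * h * k for x, k in zip(p, k2)], q)
        k4 = derivative([x + h * k for x, k in zip(p, k3)], q)
        p = [x + h / 6.0 * (a + 2.0 * b + 2.0 * c + d)
             for x, a, b, c, d in zip(p, k1, k2, k3, k4)]
    return p


def tmr_unreliability_closed_form(lam, t):
    """Equation (2), assuming p = exp(-lam * t) for each component."""
    p = math.exp(-lam * t)
    return 1.0 - 3.0 * p ** 2 + 2.0 * p ** 3


lam, t = 1e-3, 500.0  # hypothetical failure rate and mission time
probs = solve_markov(tmr_generator(lam), [1.0, 0.0, 0.0], t)
u_markov = probs[2]  # probability of being in state F at time t
u_formula = tmr_unreliability_closed_form(lam, t)
print(u_markov, u_formula)
```

Under these assumptions the two values should agree to within integration error, since the chain gives P_F(t) = 1 − 3exp(−2λt) + 2exp(−3λt), which is equation (2) with p = exp(−λt).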
effects of CCF and causal failure must be carefully examined in order to determine the correct destinations of transitions.

The Markov model including the CCF can be solved as a standard Markov model. This usually involves solving a set of linear, ordinary differential equations, which results in a set of state probabilities. The system unreliability is the probability of the system being in any failed state.

ACKNOWLEDGMENTS

We would like to thank NASA Langley Research Center, which funded the work reported in this paper under NASA Contract NAS1-02076.

REFERENCES

1. A. Mosleh et al., “Procedure for Treating Common-Cause Failures in Safety and Reliability Studies,” U.S. Nuclear Regulatory Commission, NUREG/CR-4780, Vol. I and II, Washington, DC, 1988.
2. S. Mitra et al., “Common-Mode Failures in Redundant VLSI Systems: A Survey,” IEEE Transactions on Reliability, Vol. 49, No. 3, September 2000, pp. 285-295.
3. J. K. Vaurio, “An Implicit Method for Incorporating Common-Cause Failures in System Analysis,” IEEE Transactions on Reliability, Vol. 47, No. 2, June 1998, pp. 173-180.
4. B. W. Johnson, “An Introduction to the Design and Analysis of Fault-Tolerant Systems,” in D. K. Pradhan, editor, Fault-Tolerant Computer System Design, Prentice Hall, 1996, pp. 1-84.
5. K. N. Fleming, A. Mosleh, “Common-cause data analysis and implications in system modeling,” Proc. Int’l Topical Meeting on Probabilistic Safety Methods & Applications, Feb. 24 - Mar. 1, 1985, Vol. 1, pp. 3/1-3/12.
6. K. C. Chae, G. M. Clark, “System Reliability in the Presence of Common-cause Failures,” IEEE Transactions on Reliability, Vol. R-35, April 1986, pp. 32-35.
7. R. E. Bryant, “Graph-based algorithms for Boolean function manipulation,” IEEE Transactions on Computers, Vol. C-35, No. 8, 1986, pp. 677-691.
8. R. E. Bryant, “Symbolic Boolean manipulation with ordered binary-decision diagrams,” ACM Computing Surveys, Vol. 24, No. 3, September 1992, pp. 293-318.
9. Z. Tang, Common Cause Failure Analysis and Improved Solution Techniques for Dynamic Fault Trees, Master's thesis, University of Virginia, 2002.
10. J. B. Dugan, S. J. Bavuso and M. A. Boyd, “Dynamic Fault Tree Models for Fault Tolerant Computer Systems,” IEEE Transactions on Reliability, Vol. 41, No. 3, September 1992, pp. 363-377.

BIOGRAPHIES

Zhihua Tang
Department of Electrical & Computer Engineering
351 McCormick Road; PO Box 400743
University of Virginia
Charlottesville VA 22904-4743 USA

e-mail: [email protected]

Zhihua Tang received his B.S. degree in Applied Physics from Northeastern University, Shenyang, China, in 1996, the M.S. degree in Computer Science from Shenyang Institute of Computing Technology, Chinese Academy of Science, China, in 1999, and the M.S. degree in Electrical Engineering from University of Virginia, Charlottesville, VA in 2002. He is now a Ph.D. student in the Department of Electrical & Computer Engineering at the University of Virginia. He is a student member of IEEE.

Joanne Bechta Dugan, Ph.D.
Department of Electrical & Computer Engineering
University of Virginia
351 McCormick Road PO Box 400743
Charlottesville, Virginia 22904-4743 USA

e-mail: [email protected]

Joanne Bechta Dugan was awarded the B.A. degree in Mathematics and Computer Science from La Salle University, Philadelphia, PA in 1980, and the M.S. and Ph.D. degrees in Electrical Engineering from Duke University, Durham, NC in 1982 and 1984, respectively. Dr. Dugan is currently Professor of Electrical and Computer Engineering at the University of Virginia. She has performed and directed research on the development and application of techniques for the analysis of computer systems that are designed to tolerate hardware and software faults. Her research interests include hardware and software reliability engineering, fault tolerant computing, and mathematical modeling using dynamic fault trees, Markov models, Petri nets, and simulation. Professor Dugan is a member of Phi Beta Kappa, Eta Kappa Nu, Tau Beta Pi and IEEE; is an IEEE Fellow; was Associate Editor of the IEEE Transactions on Reliability for 10 years; and is currently Associate Editor of the IEEE Transactions on Software Engineering. She is a past winner of both the P.K. McElroy and the Alan O. Plait Awards.