Fault Tree Analysis
Fault tree analysis (FTA) is a graphical method commonly used in both reliability
engineering and system safety engineering (though it is better known in reliability
circles). It is a deductive approach that is very powerful as a qualitative analysis
tool and that can also be quantified. You postulate a top event, or fault, such as a
train derailment, and then branch down from it, listing the faults in the system that
must occur for the top event to occur. This top-down method forces you to work through
the system systematically, listing the various sequential and parallel events or
combinations of faults that must occur for the undesired top event. Logic gates and
standard Boolean algebra allow you to quantify the fault tree with event probabilities
and thus determine the probability of the top event.
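As a rough illustration of how gates turn event probabilities into a top-event
probability, the following Python sketch combines inputs through AND and OR gates. It
assumes the input events are statistically independent, which the text above does not
state; the function names, the example tree, and the numbers are illustrative only.

```python
from math import prod

def and_gate(probabilities):
    """Probability that all independent input events occur."""
    return prod(probabilities)

def or_gate(probabilities):
    """Probability that at least one independent input event occurs."""
    return 1.0 - prod(1.0 - p for p in probabilities)

# Hypothetical two-level tree: TOP = A OR (B AND C)
p_a, p_b, p_c = 1e-3, 1e-2, 2e-2
p_top = or_gate([p_a, and_gate([p_b, p_c])])
print(f"P(top) = {p_top:.6e}")
```

An AND gate multiplies its input probabilities, so redundancy drives the result down, while an OR gate lets any single input raise it.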
It is important to understand that this is not a model of all possible system failures
or all possible causes but rather a model of particular system failure modes and
their constituent faults that lead to the top event. Not all system or component
failures are listed, only the ones leading to the top event. As with the other safety
analysis techniques discussed previously, only credible faults are assessed. The faults
can be events associated with component hardware failures, software glitches, human
errors, and environmental conditions; in short, any of the elements that make up
the complete system.
The fault tree was first developed in 1961 for the U.S. military intercontinental
missile program. The U.S. Nuclear Regulatory Commission published a guide in
1981, and since then, FTA has been used in almost every engineering discipline
around the world, from mass transit to commercial nuclear power plants, chemical
process plants, oil drilling platforms, NASA satellites, and aircraft control centers.
Fault trees are used extensively in accident investigation. NASA used fault trees to
recreate the events that led up to the Challenger and Columbia Space Shuttle
accidents. Fault trees, combined with event trees and other root cause analyses,
have been used very effectively in accident investigation, including the
investigation of a plutonium spill at a Boulder, Colorado, National Institute of
Standards and Technology laboratory.
206 System Safety Engineering and Risk Assessment: A Practical Approach
Dynamic FTA is used more commonly in computer systems fault analysis and
involves employing Markov analysis to evaluate the tree. Dynamic fault trees are
also frequently used to model fault-tolerant systems. The challenge is that the
tree grows very quickly and can become cumbersome to manipulate.
NASA succinctly defines (Stamatelatos et al., 2002) the process of conducting
an FTA:
1. Identify the objective for the FTA: determines what the engineer wants to
   know before starting the analysis
2. Define the top event of the fault tree: states the end result that is being
   investigated and should give the information needed to meet the objective
   defined in Step 1 (it defines the fault mode of the system)
Downloaded by [Wayne State University] at 13:03 16 August 2016
3. Determine the scope of the FTA: bounds how far the analysis should go
   and determines which faults will be included and their boundary conditions
4. Define the resolution of the FTA: details the level of fault causes that will
   be followed to reach the top event
5. Define the ground rules of the FTA: determines the naming scheme for
   the analysis and how the fault tree will be modeled
6. Construct the fault tree: builds the actual fault tree (graphically and
   logically)
7. Evaluate the fault tree: conducts quantitative and qualitative analysis of
   the fault tree through cut sets and Boolean algebra
8. Interpret and present results: explains to the reader what all this means
   (this is the most important part of the analysis; results must be put into a
   context that makes sense and is understandable)
The component fault is the state of existence of that component that contributes
to the mechanism that leads to the next-level fault. In understanding what the
component fault is, it is important to consider what state the component is in and
when it is in that state. Component faults comprise primary, secondary, and
command faults. However, most primary and secondary faults consist of component
failures, so they are usually called primary and secondary failures.
A primary failure is a failure that occurs under normal operating and environmental
conditions. A secondary failure is a failure that occurs outside of normal
conditions. A command fault occurs when a component performs as designed but
produces the output signal at the wrong time. Roberts et al. (1981) demonstrate
command faults with a humorous story from the American Civil War.
It appears that General Beauregard had sent his courier to deliver a message to
one of the commanders in the field. The battle situation changed, and sometime
later, the general sent another message with updated information. The battle
situation changed again, and the general amended the previous messages with a third.
The messages all reached the commander in the field (as designed), but in the
wrong order. Because the messages arrived in the incorrect order, that fault caused
the battle commander to take the wrong action, with disastrous results.
Fault tree symbols are divided into four categories: primary event, intermediate
event, gate, and transfer. Figure 7.1 defines each of the symbols used in fault tree
generation.
Primary events are end events; in other words, for one reason or another, they are
not studied further. For example, the circle or basic event describes a fault that is an
initiating event itself and has no inputs depicted in the fault tree. Some examples are
as follows: K1 timer contacts inadvertently open; K2 relay contacts fail to close;
battery 2A is 0 V; or pressure switch contacts fail to open.
An ellipse, or conditioning event, is a sort of message bubble that records any
conditions or restrictions that apply to any of the logic gates. This symbol is used
primarily with INHIBIT and PRIORITY AND gates.
FIGURE 7.1 Fault tree symbols. (From Roberts, N.H. et al., Fault Tree Handbook,
NUREG-0492, U.S. Nuclear Regulatory Commission, Washington, DC, 1981, p. IV-3.)
FIGURE 7.2 Fault tree of sudden stop of maglev train. (From Dorer, R.M. and Hathaway,
W.T., Safety of High Speed Magnetic Levitation Transportation Systems: Preliminary Safety
Review of the Transrapid Maglev System, DOT-VNTSC-FRA-90-3, U.S. Department of
Transportation, Washington, DC, 1991, p. C-4.)
any of the four intermediate events occur, it will lead to the top event. Transfer gate A
under the "vehicle leaves guideway" event indicates that that fault is further developed
on another page. The diamonds are undeveloped events that the analyst felt did not need
to be pursued any further for this study. However, that does not mean that the analyst
may not wish, sometime in the future, to take one or more of those diamonds and
investigate their faults.
Fault trees are relatively easy to construct. However, there are a few rules
that should be followed. The U.S. Nuclear Regulatory Commission's Fault Tree
Handbook, NUREG-0492 (Roberts et al., 1981), though published many years ago,
is still a classic and provides some good ground rules:
• Write the statements that are entered in the event boxes as faults; state
  precisely what the fault is and when it occurs (e.g., motor fails to start when
  power is applied).
• If the answer to the question, "Can this fault consist of a component failure?"
  is "Yes," classify the event as a state-of-component fault, add an OR gate
  below the event, and look for primary, secondary, and command modes.
  If the answer is "No," classify the event as a state-of-system fault, and look
  for the minimum necessary and sufficient immediate cause or causes. As a
  general rule, when energy originates outside the component, the event may
  be classified as state of the system.
• If the normal functioning of a component propagates a fault sequence, then
  it is assumed that the component functions normally. In other words, no
  miracles are allowed. If a fault is going to occur, it must occur.
• All the inputs to a particular gate should be completely defined before
  further analysis of any one of them is undertaken.
• Gate inputs should be properly defined fault events, and gates should not be
  connected directly to other gates. Many people shortcut the FTA by hooking
  the outputs of gates directly into another gate without describing the
  event. Do not do that. It is sloppy.
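One way to hold these ground rules in code is a small tree structure in which every
gate belongs to a described event, so a gate can never feed another gate directly. The
Python sketch below is only an illustration; the class name `FaultEvent`, its fields,
and the motor example events are assumptions rather than anything defined in the
Fault Tree Handbook.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class FaultEvent:
    """A described fault; its gate (if any) combines other described faults."""
    description: str
    gate: Optional[str] = None          # "AND", "OR", or None for a primary event
    inputs: List["FaultEvent"] = field(default_factory=list)

    def __post_init__(self):
        if self.gate not in (None, "AND", "OR"):
            raise ValueError(f"unsupported gate: {self.gate}")
        if (self.gate is None) != (len(self.inputs) == 0):
            raise ValueError("a gate needs inputs, and inputs need a gate")

    def occurs(self, failed: set) -> bool:
        """True if this fault occurs given the set of failed primary events."""
        if self.gate is None:
            return self.description in failed
        results = [child.occurs(failed) for child in self.inputs]
        return all(results) if self.gate == "AND" else any(results)

# Hypothetical state-of-component fault with primary, secondary, and command inputs.
primary = FaultEvent("motor winding open (primary failure)")
secondary = FaultEvent("motor overheated by blocked vent (secondary failure)")
command = FaultEvent("no start signal sent to motor (command fault)")
top = FaultEvent("motor fails to start on demand", "OR",
                 [primary, secondary, command])

print(top.occurs({command.description}))
print(top.occurs(set()))
```

Because every input to a gate is itself a `FaultEvent` with a description, the structure cannot express an undescribed gate-to-gate connection.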
TABLE 7.1
Boolean Manipulation Rules
(X' denotes the complement of X; ∅ is the null set and Ω is the universal set.)
Algebraic Rule         Set Theory Representation          Engineering Representation
Commutative law        X ∩ Y = Y ∩ X                      X * Y = Y * X
                       X ∪ Y = Y ∪ X                      X + Y = Y + X
Associative law        X ∩ (Y ∩ Z) = (X ∩ Y) ∩ Z          X * (Y * Z) = (X * Y) * Z or X(YZ) = (XY)Z
                       X ∪ (Y ∪ Z) = (X ∪ Y) ∪ Z          X + (Y + Z) = (X + Y) + Z
Distributive law       X ∩ (Y ∪ Z) = (X ∩ Y) ∪ (X ∩ Z)    X(Y + Z) = XY + XZ
                       X ∪ (Y ∩ Z) = (X ∪ Y) ∩ (X ∪ Z)    X + Y * Z = (X + Y) * (X + Z)
Idempotent law         X ∩ X = X                          X * X = X
                       X ∪ X = X                          X + X = X
Law of absorption      X ∩ (X ∪ Y) = X                    X * (X + Y) = X
                       X ∪ (X ∩ Y) = X                    X + X * Y = X
Complementation        X ∩ X' = ∅                         X * X' = ∅
                       X ∪ X' = Ω                         X + X' = Ω = 1
                       (X')' = X                          (X')' = X
De Morgan's theorem    (X ∩ Y)' = X' ∪ Y'                 (X * Y)' = X' + Y'
                       (X ∪ Y)' = X' ∩ Y'                 (X + Y)' = X' * Y'
Other operations       ∅ ∩ X = ∅                          ∅ * X = ∅
                       ∅ ∪ X = X                          ∅ + X = X
                       Ω ∩ X = X                          Ω * X = X
                       Ω ∪ X = Ω                          Ω + X = Ω
                       ∅' = Ω                             ∅' = Ω
                       Ω' = ∅                             Ω' = ∅
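The rules in Table 7.1 can be spot-checked by brute force, since each variable takes
only two values. The following Python sketch, offered as an illustration rather than
part of the original table, treats `*` as `and`, `+` as `or`, and the complement as
`not`, and tests a few of the laws over every truth assignment. The helper name
`holds_for_all` is hypothetical.

```python
from itertools import product

def holds_for_all(law, n_vars):
    """Check a Boolean identity for every assignment of n_vars variables."""
    return all(law(*values) for values in product([False, True], repeat=n_vars))

laws = {
    "distributive": lambda x, y, z: (x or (y and z)) == ((x or y) and (x or z)),
    "idempotent":   lambda x: (x and x) == x and (x or x) == x,
    "absorption":   lambda x, y: (x and (x or y)) == x and (x or (x and y)) == x,
    "de_morgan":    lambda x, y: (not (x and y)) == ((not x) or (not y)),
}
arity = {"distributive": 3, "idempotent": 1, "absorption": 2, "de_morgan": 2}

for name, law in laws.items():
    print(name, holds_for_all(law, arity[name]))
```

The same helper can be pointed at any candidate simplification before it is used to reduce a fault tree by hand.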
[Figure 7.4: branch of a large fault tree; node labels B1, B2, C3 to C6, and Z1 to Z12.]
Figure 7.4 is a typical branch of a large fault tree. There are a number of ways
to solve a fault tree: top-down substitution, bottom-up substitution, and even using
Monte Carlo simulations (with actual failure data). Also, a number of computer
programs can solve (and draw) the tree. It is impossible to keep up to date with the
changes in software programs for fault trees. Here are some of the software
programs on the market:
A = B1 * B2
B1 = C3 + Z1 + Z2 + C4
B2 = C3 + Z3 + Z4 + C5
Start at the top event and then create Boolean equations for each level or branch on the
tree. Once the next couple of levels have been written, you can use the various Table 7.1
substitution laws. So, combining B1 and B2 and through Boolean manipulation,
A = C3 + Z1 * Z3 + Z1 * Z4 + Z2 * Z3 + Z2 * Z4 + C4 * Z3
    + C4 * Z4 + C4 * C5 + Z1 * C5 + Z2 * C5
Note that two branches are repeated in the tree, the C3 and C6 branches. It is not
uncommon that the fault scenario is repeated in a large fault tree. If one subsystem
feeds various plant units, then that branch will be repeated wherever it occurs. Parallel
pumps, dual motors, or even single units (e.g., emergency backup power units) are simple
examples of repeat branches. This is a very important point to note: if a repeat branch
happens to be failure prone, then its faults will be replicated throughout the fault tree:
C3 = Z5 * Z6
C4 = C6 + Z11 + Z12
C5 = C6 + Z9 + Z10
C6 = Z7 + Z8
Again, using Boolean manipulation, the final fault scenario that leads to the top
event, A, can be written as
A = (Z7) + (Z8) + (Z5 * Z6) + (Z1 * Z3) + (Z1 * Z4) + (Z2 * Z3) + (Z2 * Z4)
A cut set is a collection of basic events that will lead to the top event. A minimal cut set
is the smallest combination of component failures, which, if they all occur, will cause the
top event to occur. A single-component minimal cut set means that if that single
component fails, then the top event will occur. In the aforementioned example, parentheses
have been placed around the cut sets. If the components indicated in the parentheses
fail, then the system will fail. As can be seen, there are numerous single-point failures.
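Reducing a list of cut sets to the minimal ones is an application of the idempotent and
absorption laws: any cut set that contains another cut set is redundant. A minimal
Python sketch follows; the function name `minimal_cut_sets` and the events P, Q, and R
are made up for illustration and are not taken from Figure 7.4.

```python
def minimal_cut_sets(cut_sets):
    """Drop duplicates and any cut set that is a superset of another one."""
    unique = {frozenset(cs) for cs in cut_sets}
    return {cs for cs in unique
            if not any(other < cs for other in unique)}

candidates = [{"P"}, {"P", "Q"}, {"Q", "R"}, {"R", "Q"}]
for cs in sorted(minimal_cut_sets(candidates), key=lambda s: (len(s), sorted(s))):
    print(sorted(cs))
```

Here the set containing P and Q is absorbed by the single-event cut set P, and the two orderings of Q and R collapse into one.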
Obviously, the bottom-up method of FTA is the exact opposite of what was just
demonstrated. You start at the lowest level, substitute the Boolean equations, and solve
The fault tree is drawn, and then the Boolean equations and minimal cut sets are
derived for the top event. Probability estimates can be generated from hardware
failure data, human error estimation, maintenance frequency, etc. Probability
estimates are then assigned to the events. Be sure to take into consideration the
uncertainty limits of your failure data. Through the laws of probability, combine the
latter pumps fluid from an infinitely large reservoir into the tank. We shall assume
that it takes 60 s to pressurize the tank. The pressure switch has contacts, which are
closed when the tank is empty. When the threshold pressure has been reached, the
pressure switch contacts open, de-energizing the coil of relay K2 so that relay K2
contacts open, removing power from the pump, causing the pump motor to cease
operation. The tank is fitted with an outlet valve that drains the entire tank in an
essentially negligible time; the outlet valve, however, is not a pressure relief valve.
When the tank is empty, the pressure switch contacts close, and the cycle is repeated.
Initially, the system is considered to be in its dormant mode: switch S1 contacts open,
relay K1 contacts open, and relay K2 contacts open; that is, the control system is
de-energized. In this de-energized state, the contacts of the timer relay are closed. We
will also assume that the tank is empty and the pressure switch contacts are therefore
closed.
System operation is started by momentarily depressing switch S1. This applies
power to the coil of relay K1, thus closing K1 contacts. Relay K1 is now electrically
self-latched. The closure of relay K1 contacts allows power to be applied to the coil
of relay K2, whose contacts close to start up the pump motor.
The timer relay has been provided to allow emergency shutdown in the event that
the pressure switch fails to open. Initially, the timer relay contacts are closed and the
timer relay coil is de-energized. Power is applied to the timer coil as soon as relay
K1 contacts are closed. This starts a clock in the timer. If the clock registers 60 s of
continuous power application to the timer relay coil, the timer relay contacts open
(and latch in that position), breaking the circuit to the K1 relay coil (previously latched
closed) and thus producing system shutdown. In normal operation, when the pressure
switch contacts open (and consequently relay K2 contacts open), the timer resets to 0 s.
FIGURE 7.5 Pressure tank system. (From Roberts, N.H. et al., Fault Tree Handbook,
NUREG-0492, U.S. Nuclear Regulatory Commission, Washington, DC, 1981, p. V-III.)
Figure 7.6 is the resulting fault tree. In constructing the fault tree from the
pressure tank schematic, it is obvious that the top event should be "rupture of pressure
tank after the start of pumping." This is a fairly simplified tree, in which piping,
wiring, etc., have been ignored. The Fault Tree Handbook makes a good point of
emphasizing that the fault must specify what happens and when it occurs.
An OR gate is drawn because the top event can be caused by a component failure.
This is a good example of the use of primary and secondary component failures. The
circle or primary failure of the tank could be due to things such as material fatigue
and poor workmanship. If there is concern that the tank does not meet the minimum
necessary design specifications (i.e., ASME Section VIII), then the circle could be
another rectangle (or secondary failure). However, in this case, we feel that the tank
was designed appropriately. Likewise, the diamond is highly unlikely and would not
need to be developed further.
So, now, we concentrate on the secondary failure of tank rupture. The Fault Tree
Handbook again emphasizes a critical point with primary and secondary faults,
namely, that a primary failure is one in which a component fails in the environment
for which it is qualified and the secondary failure is one in which it fails in an
environment for which it is not qualified. These are important distinctions.
This secondary failure is composed of component failures, so again, an OR gate
is drawn. The same logic as used earlier is applied here to draw the secondary
failure and the diamond. The INHIBIT gate documents that the input to the fault is a
continuous, t > 60 s pump operation. Remember, this is a conditional fault. The pump
must operate longer than 60 s for the failure to occur.
The concept of state-of-component and state-of-system faults is worth discussing
briefly here. If a state-of-component fault exists (in other words, the fault occurs
because of a component failure), then OR gates are used. The use of OR gates connotes
that any of the listed fault inputs can cause the event. If a state-of-system fault occurs,
that means that something in the system failed that caused the event to occur and
thus connotes an AND gate: all the fault inputs must occur for the event to occur.
The fact that two faults are in place without a gate between them is not incorrect;
it only indicates that the author wishes to detail the failure sequence. If more detail is
needed to understand the process, then a string of rectangles in series can be drawn.
It is obvious that for the pump to operate continuously, it must have power for longer
than 60 s.
From there, an OR gate is drawn for the state-of-component faults; however, "EMF
applied to K2 relay coil for t > 60 s" is a state-of-system fault and thus requires an
AND gate. This erroneous command signal to the component is due to other faults
in the system.
On the left side of the AND gate, all the events end as circles or diamonds. In other
fault trees, or if the top event is highly significant (such as rupture of the reactor in a
nuclear power plant), these entries may need to be evaluated further. Likewise,
the far right side of the fault tree ends with similar faults.
FIGURE 7.6 Pump-motor pressure tank fault tree. (From Roberts, N.H. et al., Fault Tree
Handbook, NUREG-0492, U.S. Nuclear Regulatory Commission, Washington, DC, 1981, p. V-III.)
FIGURE 7.7 Fault tree example. (From Roberts, N.H. et al., Fault Tree Handbook,
NUREG-0492, U.S. Nuclear Regulatory Commission, Washington, DC, 1981, p. VIII.)
Legend: Faults
E1                Top event
E2, E3, E4, E5    Intermediate fault events
R                 Primary failure of timer relay
S                 Primary failure of pressure switch
S1                Primary failure of switch S1
K1                Primary failure of relay K1
K2                Primary failure of relay K2
T                 Primary failure of pressure tank
The remainder of the fault tree goes into further detail about how the relay circuit
can fail. In Figure 7.7, the fault tree has been simplified and the Boolean expressions
developed:
E1 = T + E2
   = T + (K2 + E3)
   = T + K2 + (S * E4)
   = T + K2 + S * (S1 + E5)
   = T + K2 + (S * S1) + (S * E5)
   = T + K2 + (S * S1) + S * (K1 + R)
   = T + K2 + (S * S1) + (S * K1) + (S * R)
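The substitution above can also be carried out mechanically. The following Python
sketch is one possible top-down expansion, assuming the gate structure read off the
equations (E1, E2, E4, and E5 are OR gates and E3 is an AND gate); the dictionary
layout and the function name `cut_sets` are illustrative choices, not the handbook's
algorithm.

```python
from itertools import product

# Gate definitions taken from the Boolean equations for Figure 7.7.
GATES = {
    "E1": ("OR",  ["T", "E2"]),
    "E2": ("OR",  ["K2", "E3"]),
    "E3": ("AND", ["S", "E4"]),
    "E4": ("OR",  ["S1", "E5"]),
    "E5": ("OR",  ["K1", "R"]),
}

def cut_sets(event):
    """Return the cut sets of an event as a set of frozensets of basic events."""
    if event not in GATES:                      # basic event
        return {frozenset([event])}
    kind, inputs = GATES[event]
    child_sets = [cut_sets(name) for name in inputs]
    if kind == "OR":                            # union of the inputs' cut sets
        combined = set().union(*child_sets)
    else:                                       # AND: one cut set from each input
        combined = {frozenset().union(*combo) for combo in product(*child_sets)}
    # Absorption: discard any cut set that contains a smaller one.
    return {cs for cs in combined if not any(o < cs for o in combined)}

for cs in sorted(cut_sets("E1"), key=lambda s: (len(s), sorted(s))):
    print(" * ".join(sorted(cs)))
```

Each printed line is one minimal cut set; the two single-event lines correspond to the single-point failures T and K2 in the final expression for E1.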
TABLE 7.2
Failure Probabilities for Pressure Tank Example
Component          Symbol    Failure Probability (Pr)
Pressure tank      T         $5 \times 10^{-6}$
Relay K2           K2        $3 \times 10^{-5}$
Pressure switch    S         $1 \times 10^{-4}$
Relay K1           K1        $3 \times 10^{-5}$
Timer relay        R         $1 \times 10^{-4}$
Switch S1          S1        $3 \times 10^{-5}$
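$$\Pr(T) = 5 \times 10^{-6}$$
$$\Pr(K2) = 3 \times 10^{-5}$$
$$\Pr(S * K1) = (1 \times 10^{-4})(3 \times 10^{-5}) = 3 \times 10^{-9}$$
$$\Pr(S * R) = (1 \times 10^{-4})(1 \times 10^{-4}) = 1 \times 10^{-8}$$
$$\Pr(S * S1) = (1 \times 10^{-4})(3 \times 10^{-5}) = 3 \times 10^{-9}$$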
So, by summing the minimal cut sets, the top-event probability of occurrence is
$$\Pr(E1) \approx 3.5 \times 10^{-5}$$
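The arithmetic can be reproduced with a few lines of Python. This sketch assumes
independent basic events, uses the probabilities from Table 7.2, and compares the
simple sum of the minimal cut set probabilities (the rare-event approximation) with an
exact value found by enumerating every combination of failed and working components.
The variable names are illustrative.

```python
from itertools import product
from math import prod

# Basic event probabilities from Table 7.2.
P = {"T": 5e-6, "K2": 3e-5, "S": 1e-4, "K1": 3e-5, "R": 1e-4, "S1": 3e-5}
MIN_CUT_SETS = [{"T"}, {"K2"}, {"S", "S1"}, {"S", "K1"}, {"S", "R"}]

# Rare-event approximation: add up the cut set probabilities.
approx = sum(prod(P[e] for e in cs) for cs in MIN_CUT_SETS)

# Exact value for independent events: enumerate every failed/working combination.
names = list(P)
exact = 0.0
for states in product([False, True], repeat=len(names)):
    failed = {n for n, s in zip(names, states) if s}
    if any(cs <= failed for cs in MIN_CUT_SETS):
        exact += prod(P[n] if n in failed else 1.0 - P[n] for n in names)

print(f"sum of cut sets: {approx:.4e}")
print(f"exact:           {exact:.4e}")
```

With probabilities this small the two figures agree closely, which is why summing the minimal cut sets is a common shortcut; the enumeration only works for small trees because it visits every combination of component states.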
• Try to model at the highest level for which you have data; the more data
  used, the more uncertainty in the model.
• Do not put too many inputs that have very small probabilities into gates.
• Do not spend too much time on passive components in a system. Remember,
  the fault tree really looks at functions, not components.
• Do not model human errors of commission because they are very difficult
  to capture realistically and can skew results.
Fault trees are extremely powerful methods to demonstrate your safety system's
fault tolerance to an accident. The next time you want to demonstrate
how many things must go wrong for an accident, use fault trees. Fault trees are
great tools to educate a nonengineer (e.g., in a lawsuit) about how hard it is for
something to occur.
REFERENCES
Dorer, R. M. and Hathaway, W. T. 1991. Safety of High Speed Magnetic Levitation
Transportation Systems: Preliminary Safety Review of the Transrapid Maglev System.
DOT-VNTSC-FRA-90-3. Washington, DC: U.S. Department of Transportation.
Roberts, N. H., Vesely, W. E., Haasl, D. F., and Goldberg, F. F. 1981. Fault Tree Handbook.
NUREG-0492. Washington, DC: U.S. Nuclear Regulatory Commission.
Stamatelatos, M., Caraballo, J., and Vesely, W. August 2002. Fault Tree Handbook with
Aerospace Applications. Version 1.1. Washington, DC: NASA Office of Safety and
Mission Assurance, NASA Headquarters.
FURTHER READING
Anderson, T. and Lee, P. A. 1981. Fault Tolerance: Principles and Practice. Englewood Cliffs,
NJ: Prentice-Hall.
Center for Chemical Process Safety. 1999. Guidelines for Chemical Process Quantitative Risk
Analysis, 2nd edn. Hoboken, NJ: Wiley.
Center for Chemical Process Safety. 2008. Guidelines for Hazard Evaluation Procedures, 3rd
edn. Hoboken, NJ: Wiley.
Haasl, D. F. 1965. Advanced concepts in fault tree analysis. System Safety Symposium,
Seattle, WA.
Henley, E. J. and Kumamoto, H. 2000. Probabilistic Risk Assessment and Management for
Engineers and Scientists. Hoboken, NJ: Wiley-IEEE Press.
International Electrotechnical Commission. 2006. Fault Tree Analysis. IEC 61025. Geneva,
Switzerland: International Electrotechnical Commission.
Lacey, P. 2011. An application of fault tree analysis to the identification and management
of risks in government funded human service delivery. Proceedings of the Second
International Conference on Public Policy and Social Sciences, Kuching, Sarawak,
Malaysia. http://papers.ssrn.com/sol3/papers.cfm?abstract_id=2171117, downloaded
July 9, 2013.
Lapp, S. A. and Powers, G. J. 1977. Computer-aided synthesis of fault trees. IEEE Transactions
on Reliability, R-26: 2-13.
Long, A. Beauty and the beast: Use and abuse of fault tree as a tool. No date.
http://www.fault-tree.net/papers/long-beauty-and-beast.pdf, downloaded May 17, 2014.
National Institute of Standards and Technology. 2009. Root Cause Analysis Report of
Plutonium Spill at Boulder Laboratory. Gaithersburg, MD.
http://www.nist.gov/public_affairs/releases/upload/root_cause_plutonium_010709.pdf,
downloaded May 17, 2014.