Control Systems Safety Evaluation and Reliability
Third Edition
William M. Goble
ISBN 978-1-934394-80-9
Goble, William M.
Control systems safety evaluation and reliability / William M. Goble.
-- 3rd ed.
p. cm. -- (ISA resources for measurement and control series)
Includes bibliographical references and index.
ISBN 978-1-934394-80-9 (pbk.)
1. Automatic control--Reliability. I. Title.
TJ213.95.G62 2010
629.8--dc22
2010015760
This book has been made possible only with the help of many other persons.
Early in the process, J. V. Bukowski of Villanova taught a graduate course in
reliability engineering where I was introduced to the science. This course and
several subsequent tutorial sessions over the years provided the help necessary to
get started.
Many others have helped develop the issues important to control system safety
and reliability. I want to thank co-workers: John Grebe, John Cusimano, Ted Bell,
Ted Tucker, Griff Francis, Dave Johnson, Glenn Bilane, Jim Kinney, and Steve
Duff. They have asked penetrating questions, argued key points, made
suggestions, and provided solutions to complicated problems. A former boss,
Bob Adams, deserves a special thank-you for asking tough questions and
demanding that reliability be made a prime consideration in the design of new
products.
Fellow members of the ISA84 standards committee have also helped develop the
issues. I wish to thank Vic Maggioli, Dimitrios Karydos, Tony Frederickson, Paris
Stavrianidis, Paul Gruhn, Aarnout Brombacher, Ad Hamer, Rolf Spiker, Dan
Sniezek and Steve Smith. I have learned from our debates.
Finally, I wish to thank my wife Sandy and my daughters Tyree and Emily for their
patience and help. Everyone helped proofread, type, and check math. While the
specific help was greatly appreciated, it is the encouragement and support for
which I am truly thankful.
PREFACE
Chapter 1  INTRODUCTION
    Control System Safety and Reliability
    Standards
    Exercises
    Answers to Exercises
    References
This general approach is very consistent with that of those who work to
economically optimize their designs. Design constraints must be balanced in
order to provide the optimal design. The ultimate economic success of the
process is affected by all of the design constraints. True design optimization
requires that alternative designs be evaluated in the context of the
constraints. Numeric targets, and methods to quantitatively evaluate safety and
reliability, are the tools needed to include this dimension in the optimization
process.
Failure rate data, the primary input required for most methods, is not
precisely specified or readily available. Precise failure rate data requires an
extensive life test where operational conditions match expected usage.
Several factors prevent this testing. First, current control system components
from quality vendors have achieved a general level of reliability that allows
them to operate for many, many years. Precise life testing requires that units
be operated until failure. The time required for this testing is far beyond the
usefulness of the data (components are obsolete before the test is complete).
Second, operational conditions vary significantly between control system
installations. One site may have failure rates that are much higher than those
of another site. Last, variations in usage will affect the reliability of a
component. This is especially true when design faults exist in a product.
Design faults are probable in the complex components used in today's systems.
Design faults, "bugs," are almost expected in complicated software.
Software reliability has been the subject of intense research for over a
decade. These efforts are beginning to show some results. This is important to
the subject of control systems because of the explosive growth of software
within these systems. Although software engineering techniques have provided
better design fault avoidance methods, the growth has outstripped the
improvements. Software reliability may well be the control system reliability
crisis of the future.
Safety and reliability are important design constraints for control systems.
When those involved in system design share a common vocabulary, understand the
evaluation methods, include all site variables, and understand how to evaluate
software reliability, then safety and reliability can become true design
parameters. This is the goal.
William M. Goble
Ottsville, PA
April 2010
Dr. William M. Goble has more than 30 years of experience in analog and
digital electronic circuit design, software development, engineering
management and marketing. He is currently a founding member and
Principal Partner with exida, a knowledge company focused on
automation safety and reliability.
Given the importance of safety and reliability, how are they achieved? How are
they measured? The science of reliability engineering has advanced considerably
in recent decades. That science offers a number of fundamental concepts used to
achieve high reliability and high safety. These concepts include high-strength
design, fault-tolerant design, on-line failure diagnostics, and high
common-cause strength. All of these important concepts will be developed in
later chapters of this book. When these concepts are understood and used, great
benefits can result.
Reliability Engineering
The science of reliability engineering has developed a number of qualitative
and semi-quantitative techniques that allow an engineer to understand system
operation in the presence of a component failure. These techniques include
failure modes and effects analysis (FMEA), qualitative fault tree analysis
(FTA), and hazard and operability studies (HAZOP). Other techniques, based on
probability theory and statistics, allow the control engineer to quantitatively
evaluate the reliability and safety of control system designs. Reliability
block diagrams and fault trees use combinational probability to evaluate the
system-level probability of success, probability of safe failure, or
probability of dangerous failure. Another popular technique, Markov modeling,
shows system success and failure via circles called states. These techniques
will be covered in this book.
Life-cycle cost modeling may be the most useful technique of all to answer
questions of optimal cost and justification. Using this analysis tool, the
output of a reliability analysis in the language of statistics is converted to
the clearly understood language of money. It is frequently quite surprising how
much money can be saved by using reliable and safe equipment. This is
especially true when the cost of failure is high.
Perspective
The field of reliability engineering is relatively new compared to other
engineering disciplines, with significant research having been driven by
military needs in the mid-1940s. Introductory work in hardware reliability was
done in conjunction with the German V2 rocket program, where innovations such
as the 2oo3 (two-out-of-three) voting scheme were invented [Ref. 1, 2]. Human
reliability research began with American studies done on radar operators and
gunners during World War II. Military systems were among the first to reach
complexity levels at which reliability engineering became important. Methods
were needed to answer important questions about these increasingly complex
systems.
Control systems and safety protection systems have also followed an
evolutionary path toward greater complexity. Early control systems were simple.
Push buttons and solenoid valves, sight gauges, thermometers, and dipsticks
were typical control tools. Later, single-loop pneumatic controllers dominated.
Not only were most of these machines inherently reliable, many also failed in
predictable ways. With a pneumatic system, when the air tubes leaked, the
output went down. When an air filter clogged, the output went to zero. When the
hissing noise changed, a good technician could run diagnostics just by
listening to determine where the problem was. Safety protection systems were
built from relays and sensing switches. With the addition of safety springs and
special contacts, these devices would virtually always fail with the contacts
open. Again, they were simple devices that were inherently reliable with
predictable, (mostly) fail-safe failure modes.
The inevitable need for better processes eventually pushed control systems to a
level of complexity at which sophisticated electronics became the optimal
solution for control and safety protection. Distributed microcomputer-based
controllers introduced in the mid-1970s offered economic benefits, improved
reliability, and flexibility.
These questions are best answered using quantitative reliability and safety
analysis. Markov analysis has been developed into one of the best techniques
for answering such questions, especially when time-dependent variables such as
imperfect proof testing are important. Failure Modes, Effects and Diagnostic
Analysis (FMEDA) has been developed and refined as a new tool for quantitative
measurement of diagnostic capability. These new tools and refined methods have
made it easier to optimize designs using reliability engineering.
Standards
Many new international standards have been created in the world of
reliability engineering. Standards now provide detailed methods of
determining component failure rates [Ref. 3]. Standards provide checklists
of issues that should be addressed in qualitative evaluation. Standards
define performance measures against which quantitative reliability and
safety calculations can be compared. Standards also provide explanations
and examples of how systems can be designed to maximize safety and
reliability.
[Figure: Safety integrity level ranges, relating PFDavg bands (e.g., 10^-2 to
10^-1) to risk reduction (e.g., 10 to 100).]

[Figure: Safety lifecycle. Create conceptual process design; perform hazard and
risk analysis; develop safety requirements and determine SIL; perform detail
SIS design; verify that safety SIS requirements have been met (verification of
requirements/SIL); create maintenance and operations procedures; perform
maintenance and operations; perform periodic testing; modification or
decommission.]
The controversy may also come from the experiences that gave rise to another
famous quotation, "Garbage in, garbage out." Poor failure rate estimates and
poor simplifying assumptions can ruin the results of any reliability and safety
evaluation. Good qualitative reliability engineering should be used to prevent
garbage from going into the evaluation. Qualitative engineering provides the
foundation for all quantitative work.
Quantitative safety and reliability evaluation can add great depth and
insight into the design of a system and design alternatives. Sometimes
intuition can be deceiving. After all, it was once intuitively obvious that
the world was flat. Many aspects of probability and reliability can appear
counter-intuitive. The quantitative reliability evaluation either verifies the
qualitative evaluation or adds substantially to it. Therein lies its value.
Exercises
1.1 Are methods used to determine safety integrity levels of an industrial
process presented in ANSI/ISA-84.00.01-2004 (IEC 61511 Mod)?
1.2 Are safety integrity levels defined by order-of-magnitude quantitative
numbers?
1.3 Can quantitative evaluation techniques be used to verify safety
integrity requirements?
1.4 Should quantitative techniques be used exclusively to verify safety
integrity?
Answers to Exercises
1.1 Yes, ANSI/ISA-84.00.01-2004 (IEC 61511 Mod) describes the concept of
safety integrity levels and presents example methods on how to determine the
safety integrity level of a process.
1.2 Yes, in the ISA-84.01-1996, IEC 61508 and ANSI/ISA-84.00.01-2004
(IEC 61511 Mod) standards.
1.3 Yes, if quantitative targets (typically an SIL and required reliability)
are defined as part of the safety requirements.
1.4 Not in the opinion of the author. Qualitative techniques are
required as well in order to properly understand how the system
works under failure conditions. Qualitative guidelines should be
used in addition to quantitative analysis.
References
1. Coppola, A. "Reliability Engineering of Electronic Equipment: A Historical
Perspective." IEEE Transactions on Reliability. IEEE, April 1984.
Random Variables
The concept of a random variable seems easy to understand, and yet many
questions and statements indicate misunderstanding. For example, the random
variable in reliability engineering is time to failure. A manager reads that on
average an industrial boiler explodes every fifteen years (the average time to
failure is fifteen years) and knows that the unit in his plant has been running
fourteen years. He calls a safety engineer to determine how to avoid the
explosion next year. This is clearly a misunderstanding.
The process of failure is like many other processes that have variations in
outcome that cannot be predicted by substituting variables into a formula.
Perhaps the exact formula is not understood. Or perhaps the variables
involved are not completely understood. These processes are called
random (stochastic) processes, primarily because they are not well
characterized.
Some random variables can have only certain values. These random variables are
called discrete random variables. Other variables can have a numerical value
anywhere within a range. These are called continuous random variables.
Statistics are used to gain some knowledge about these random variables and the
processes that produce them.
Statistics
Statistics are usually based on data samples. Consider the case of a researcher
who wants to understand how a computer program is being used. The researcher
calls six computer program users at each of twenty different locations and asks
what specific program function is being used at that moment. The program
functions are sorted into five numbered categories. The results of the survey
(sample data) are presented in Table 2-1. This is a list of data values.
Table 2-1. Survey Results (sites 10-20 shown); each row lists the function
category reported by each of six users at one site.

Site 10: 2 2 2 2 2 2
Site 11: 2 5 2 2 3 2
Site 12: 3 2 2 4 2 2
Site 13: 5 2 1 2 2 3
Site 14: 2 2 2 3 2 2
Site 15: 3 2 2 4 2 2
Site 16: 2 2 2 3 2 5
Site 17: 2 3 1 2 2 3
Site 18: 1 2 2 2 2 2
Site 19: 2 2 3 2 3 1
Site 20: 2 2 2 2 1 2
Histogram
One of the more common ways to organize data is the histogram. A histogram is a
graph with data values on the horizontal axis and the quantity of samples with
each value on the vertical axis. A histogram of the data from Table 2-1 is
shown in Figure 2-1.
EXAMPLE 2-1
Problem: Using the survey data of Table 2-1, what is the probability that a
randomly chosen answer falls within category five?
Solution: The histogram shows that three answers from the total of one hundred
and twenty were within category five. Therefore, the chances of getting an
answer in category five are 3/120, which is 2.5%.
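As a sketch of how such a histogram is tallied (not from the book; only the
Table 2-1 rows shown above are used, not the full 120 samples):

# Tally survey answers into a histogram, as in Figure 2-1.
from collections import Counter

samples = [
    2, 2, 2, 2, 2, 2,  # Site 10
    2, 5, 2, 2, 3, 2,  # Site 11
    3, 2, 2, 4, 2, 2,  # Site 12
    5, 2, 1, 2, 2, 3,  # Site 13
]

counts = Counter(samples)
total = len(samples)
for category in sorted(counts):
    bar = "#" * counts[category]
    print(f"category {category}: {bar} ({counts[category]}/{total})")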
A probability density function (PDF) describes the probability of getting a
random variable value within a range. The random variable values typically form
the horizontal axis, and probability numbers (a range of 0 to 1) form the
vertical axis. For a discrete random variable, the probabilities of all
outcomes sum to one:

\sum_{i=1}^{n} P(x_i) = 1    (2-2)

and for a continuous random variable, the PDF integrates to one:

\int_{-\infty}^{+\infty} f(x)\,dx = 1    (2-3)
Figure 2-2 shows a discrete PDF for the toss of a pair of fair dice. There are
36 possible combinations that add up to 11 possible outcomes. The probability
of getting a result of seven is 6/36 because there are six combinations that
result in a seven. The probability of getting a result of two is 1/36 because
there is only one combination of the dice that will give that result. Again,
the probabilities total to one.
Distributions in which all outcomes are equally likely are called uniform
distributions. The probability of getting an outcome within an interval is
proportional to the length of the interval. For example, the probability of
getting a result between 101.0 and 101.2 is 0.02 (an interval length of 0.2
times a probability density of 0.1). As the length of the interval is reduced,
the probability of getting a result within the interval drops. The probability
of getting one exact particular value is zero.
P(a \le X \le b) = \int_{a}^{b} f(x)\,dx    (2-4)
EXAMPLE 2-2
Problem: A random variable has a uniform PDF with f(x) = 0.001 over its range.
What is the probability of getting a result between 110 and 120?
Solution: Using Equation 2-4,

P(110 \le X \le 120) = \int_{110}^{120} 0.001\,dx
                     = 0.001x \Big|_{110}^{120} = 0.12 - 0.11 = 0.01
EXAMPLE 2-3
Problem: A continuous random variable has the PDF f(t) = kt for 0 \le t \le 4,
and f(t) = 0 otherwise. What value of k makes this a valid PDF?
Solution: Using Equation 2-3,

\int_{0}^{4} kt\,dt = 1

which equals

\frac{kt^2}{2} \Big|_{0}^{4} = 1

Evaluating,

\frac{16k}{2} = 1,  so  k = 0.125
EXAMPLE 2-4
Problem: A random variable, time to failure, has the exponential PDF
f(t) = 0.01e^{-0.01t}. What is the probability of failure during the first 50
hours of operation?
Solution: Using Equation 2-4,

P(0 \le T \le 50) = \int_{t=0}^{t=50} 0.01\,e^{-0.01t}\,dt

Substituting u = 0.01t (so du = 0.01\,dt), the integral becomes

\int_{u=0}^{u=0.5} 0.01\,e^{-u}\,\frac{du}{0.01} = \left[-e^{-u}\right]_{0}^{0.5}

Evaluating,

= \left[-e^{-0.01(50)}\right] - \left[-e^{-0.01(0)}\right]
= -0.60653 + 1 = 0.39347
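A one-line numerical check of Example 2-4 (a sketch, not from the book): the
exponential integral has the closed form F(t) = 1 - e^{-kt}, so

import math

# F(50) = 1 - e^(-0.01 * 50) for the exponential CDF with k = 0.01
print(1.0 - math.exp(-0.01 * 50.0))  # 0.39346934...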
The cumulative distribution function (CDF), F, gives the probability that the
random variable is less than or equal to a given value. For a discrete random
variable,

F(x_n) = \sum_{i=1}^{n} P(x_i)    (2-5)

For a continuous random variable,

F(x) = \int_{-\infty}^{x} f(x)\,dx    (2-6)

Every CDF satisfies F(+\infty) = 1, F(-\infty) = 0, and dF/dx \ge 0. This means
that the area under the PDF curve between a and b is simply the CDF value at b
minus the CDF value at a.
EXAMPLE 2-5
Problem: Calculate and graph the discrete CDF for a dice toss process where the
PDF is as shown in Figure 2-2.
Solution: The CDF is obtained by summing the PDF values (Equation 2-5) for all
outcomes up to each value; the result is plotted in Figure 2-4.

[Figure 2-4. Dice Toss Cumulative Distribution Function]
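A small sketch (not from the book) that builds the dice-toss PDF and
accumulates it into the CDF of Figure 2-4:

from fractions import Fraction
from itertools import product

# PDF: count the 36 equally likely dice combinations for each sum.
pdf = {}
for a, b in product(range(1, 7), repeat=2):
    pdf[a + b] = pdf.get(a + b, Fraction(0)) + Fraction(1, 36)

# CDF: running sum of the PDF (Equation 2-5).
running = Fraction(0)
for outcome in sorted(pdf):
    running += pdf[outcome]
    print(outcome, pdf[outcome], running)  # e.g., outcome 7: PDF 1/6, CDF 7/12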
EXAMPLE 2-6
Problem: Calculate and graph the CDF for a uniform PDF that equals 0.1 between
x = 100 and x = 110, and zero elsewhere.
Solution: During the interval from 100 to 110, the PDF equals a constant, 0.1.
The PDF equals zero for other values of x. Therefore, using Equation 2-6,

F(x) = \int_{100}^{x} 0.1\,dx = 0.1(x - 100),  for 100 \le x \le 110

The CDF is zero for values of x less than 100. The CDF is one for values of x
greater than 110. This CDF is plotted in Figure 2-5.
Mean
The average value of a random variable is called the mean or expected
value of a distribution. For discrete random variables, the equation for
mean, E, is
E(x) = \bar{x} = \sum_{i=1}^{n} x_i P(x_i)    (2-8)
EXAMPLE 2-7
Problem: What is the mean value of the dice toss distribution of Figure 2-2?
Solution: Using Equation 2-8,

E(x) = 2(1/36) + 3(2/36) + 4(3/36) + 5(4/36) + 6(5/36) + 7(6/36)
     + 8(5/36) + 9(4/36) + 10(3/36) + 11(2/36) + 12(1/36)
     = 252/36 = 7
For a set of n data samples, the sample mean is

\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}    (2-9)
EXAMPLE 2-8
Problem: A pair of fair dice is tossed ten times. The results are 7, 2,
4, 6, 10, 12, 3, 5, 4, and 2. What is the sample mean?
Solution: Using Equation 2-9, the samples are added and divided by
the number of samples. The answer is 5.5. Compare this to the
answer of seven obtained in Example 2-7. The difference of 1.5 is
large but not unreasonable since our sample size of ten is small.
EXAMPLE 2-9
Problem: The dice tossing of Example 2-8 continues until one hundred tosses
have been made. What is the sample mean of the one hundred samples?
Solution: Using Equation 2-9, the samples are added and divided by the number
of samples. The total of all one hundred samples (including those from Example
2-8) is 725. Dividing this by the number of samples yields a sample mean result
of 7.25. This result is closer to the distribution mean obtained in Example
2-7.
For continuous random variables, the mean is

E(x) = \int_{-\infty}^{+\infty} x f(x)\,dx    (2-10)
EXAMPLE 2-10
Problem: What is the mean of the random variable with the PDF of Example 2-3,
f(t) = t/8 for 0 \le t \le 4?
Solution: Using Equation 2-10,

E(t) = \int_{0}^{4} t \cdot \frac{t}{8}\,dt

Integrating,

E(t) = \frac{1}{8} \cdot \frac{t^3}{3} \Big|_{0}^{4} = \frac{64}{24} = 2.667
Median
The median value (or simply, median) of a sample data set is a measure
of center. The median value is defined as the data sample where there are
an equal number of samples of greater value and lesser value. It is the
middle value in an ordered set. If there are two middle values, it is the
value halfway between those two values.
Knowing both the mean and the median allows one to evaluate symmetry.
In a perfectly symmetric PDF, such as a normal distribution, the mean and
the median are the same. In an asymmetric PDF, such as the exponential,
the mean and the median are different.
EXAMPLE 2-11
EXAMPLE 2-12
Variance
As noted, the mean is a measure of the center of mass of a distribution.
Knowing the center is valuable, but more must be known. Another important
characteristic of a distribution is spread: the amount of variation. A process
under control will be consistent and will have little variation. A measure of
variation is important for control purposes [Ref. 1, 2].
One way to measure variation is to subtract values from the mean. When this is
done, the question "How far from the center is the data?" can be answered
mathematically. For each discrete data point, calculate

(x_i - \bar{x})    (2-11)

Squaring these deviations and weighting each by its probability gives the
variance of a discrete distribution:

\sigma^2(x_i) = \sum_{i=1}^{n} (x_i - \bar{x})^2 P(x_i)    (2-12)
where x_i refers to each discrete outcome and P(x_i) refers to the probability
of realizing each discrete outcome. If a set of sample data is given, the
formula for sample variance is

s^2(x_i) = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n}    (2-13)

where x_i refers to each piece of data in the set. The size of the data set is
given by n. As in the case of the sample mean, the sample variance will provide
a result similar to the actual variance as the sample size grows larger. But
there is no guarantee that these numbers will be equal. For this reason, the
sample variance is sometimes called an estimator of the variance.
For continuous random variables, the variance is

\sigma^2(x) = \int_{-\infty}^{+\infty} (x - \bar{x})^2 f(x)\,dx    (2-14)
Standard Deviation
The standard deviation is the square root of variance. The formula for standard
deviation is

\sigma(x) = \sqrt{\sigma^2(x)}    (2-15)

Standard deviation is often assigned the lowercase Greek letter sigma (σ). It
is a measure of spread just like variance and is most commonly associated with
the normal distribution.
EXAMPLE 2-13
Problem: What is the variance of the dice toss distribution of Figure 2-2?
Solution: In Example 2-7 the mean was calculated as seven. Using Equation 2-12,

\sigma^2(x_i) = (2-7)^2(1/36) + (3-7)^2(2/36) + (4-7)^2(3/36) + (5-7)^2(4/36)
              + (6-7)^2(5/36) + (7-7)^2(6/36) + (8-7)^2(5/36) + (9-7)^2(4/36)
              + (10-7)^2(3/36) + (11-7)^2(2/36) + (12-7)^2(1/36)
              = 5.834
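The dice results of Examples 2-7 and 2-13 can be checked directly from
Equations 2-8 and 2-12; a minimal sketch (not from the book):

# Two-dice distribution: outcome -> number of combinations out of 36.
combos = {2: 1, 3: 2, 4: 3, 5: 4, 6: 5, 7: 6, 8: 5, 9: 4, 10: 3, 11: 2, 12: 1}

mean = sum(x * n / 36 for x, n in combos.items())                    # Eq. 2-8
variance = sum((x - mean) ** 2 * n / 36 for x, n in combos.items())  # Eq. 2-12

print(mean)      # 7.0
print(variance)  # 5.8333...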
EXAMPLE 2-14
Common Distributions
Several well-known distributions play an important role in reliability
engineering. The exponential distribution is widely used to represent the
probability of component failure over a time interval. This is a direct result
of a constant failure rate assumption. The normal distribution is used in many
areas of science. In reliability engineering it is used to represent strength
distributions. It is also used to represent stress. The lognormal distribution
is used to model repair probabilities.
Exponential Distribution
The exponential distribution is commonly used in the field of reliability. In
its general form it is written

f(x) = k\,e^{-kx}, for x \ge 0;  f(x) = 0, for x < 0    (2-16)

The CDF is

F(x) = 1 - e^{-kx}, for x \ge 0;  F(x) = 0, for x < 0    (2-17)

These equations are valid for values of k greater than zero. The CDF will reach
the value of one at x = \infty (which is expected). Figure 2-6 shows a plot of
an exponential distribution PDF and CDF where k = 0.6. Note: PDF(x) =
d[CDF(x)]/dx.
Normal Distribution
The normal distribution is the most well known and widely used probability
distribution in general science fields. It is so well known because it applies
(or seems to apply) to so many processes. In reliability engineering, as
mentioned, it primarily applies to measurements of product strength and
external stress.

The PDF is perfectly symmetrical about the mean. The spread is measured by
variance. The larger the value of variance, the flatter the distribution.
Figure 2-7 shows a normal distribution PDF and CDF where the mean equals four
and the standard deviation equals one. Because the PDF is perfectly symmetric,
the CDF always equals one half at the mean value of the PDF. The PDF is given
by

f(x) = \frac{1}{\sigma\sqrt{2\pi}}\,e^{-\frac{(x-\mu)^2}{2\sigma^2}}    (2-18)

where \mu is the mean value and \sigma is the standard deviation. The CDF is
given by

F(x) = \frac{1}{\sigma\sqrt{2\pi}} \int_{-\infty}^{x}
       e^{-\frac{(x-\mu)^2}{2\sigma^2}}\,dx    (2-19)
The standard normal distribution is a normal distribution that has a mean value
of zero and a standard deviation of one. Tables can be generated (Appendix A
and Ref. 2, Table 1.1.12) showing the numerical values of its cumulative
distribution function for different values of z, where

z = \frac{x - \mu}{\sigma}    (2-20)

Any normal distribution with any particular \mu and \sigma can be translated
into the standard normal distribution by scaling its variable x into z using
Equation 2-20. Through the use of these techniques, numerical probabilities can
be obtained for any range of values for any normal distribution.
EXAMPLE 2-15
P(T \le 70) = 0.99865
Lognormal Distribution
The random variable X has a lognormal PDF if lnX has a normal distribu-
tion. Thus, the lognormal distribution is related to the normal distribution
and also has two parameters.
Exercises
2.1 A failure log is provided. Create a histogram of failure categories:
software failure, hardware failure, operations failure, and maintenance
failure.
FAILURE LOG - 10 sites
Symptom - Failure Category
1. PC crashed - Software
2. Fuse blown - Hardware
3. Wire shorted during repair - Maintenance
4. Program stopped during use - Software
5. Unplugged wrong module - Maintenance
6. Invalid display - Software
7. Scan time overrun - Software
8. Water in cabinet - Maintenance
9. Wrong command - Operations
10. General protection fault - Software
11. Cursor disappeared - Software
12. Coffee spilled - Hardware
13. Dust on thermocouple terminal - Hardware
14. Memory full error - Software
15. Mouse nest between circuits - Hardware
16. PLC crashed after download - Software
17. Output switch shorted - Hardware
18. LAN addresses duplicated - Maintenance
19. Computer memory chip failed - Hardware
20. Power supply failed - Hardware
21. Sensor switch shorted - Hardware
22. Valve stuck open - Hardware
23. Bypass wire left in place - Maintenance
24. PC virus causes crash - Software
25. Hard disk failure - Hardware
Outcome - Probability
1 - 0.1
2 - 0.2
3 - 0.5
4 - 0.1
5 - 0.1

What is the average failure time in days? What is the median failure time in
days?
2.7 A controller will fail when the temperature gets above 80°C. The ambient
temperature follows a normal distribution with a mean of 40°C and a standard
deviation of 10°C. What is the probability of failure?
Answers to Exercises
2.1 Software failures - 9, Hardware failures - 10, Maintenance failures -
5, Operations failures - 1. The histogram is shown in Figure 2-10.
The Venn diagram is shown in Figure 2-11.
2.2 The chances that the next failure will be a software failure are 9/25.
2.3 Figure 2-12 shows the PDF. Figure 2-13 shows the CDF.
2.4 The mean is calculated by multiplying the values by the probabilities per
Equation 2-8. Mean = 1(0.1) + 2(0.2) + 3(0.5) + 4(0.1) + 5(0.1) = 2.9.
2.5 The variance is calculated using Equation 2-12. Answer: 1.09.
2.6 Mean of the data equals 427.94 days. Average failure time in hours equals
427.94 × 24 = 10,271 hours. Median failure time in days equals (183 + 228)/2 =
205.5; in hours, 4,932 hours. Note that this assumes that failure occurred
precisely at the end of each 24-hour day. This is not accurate but is a common
assumption.
2.7 Using a standard normal distribution, z = (80 - 40)/10 = 4.00. From
Appendix A at z = 4.00, F(z) = 0.999969. The probability of failure then equals
1 - 0.999969 = 0.000031.
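Exercise 2.7 can also be checked without the Appendix A table by computing the
standard normal CDF with the error function; a sketch (not from the book):

import math

def normal_cdf(x, mu, sigma):
    """CDF of a normal distribution, via the error function."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

# P(T > 80) for ambient temperature T ~ Normal(mean=40, std=10)
print(1.0 - normal_cdf(80.0, 40.0, 10.0))  # ~3.17e-05 (table gives 0.000031)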
References
1. Mamzic, C. L. Statistical Process Control. Research Triangle Park:
ISA, 1996.
Failures
A failure occurs when a system, a unit, a module, or a component fails to
perform its intended function. Control systems fail. Everyone understands that
this potentially expensive and potentially dangerous event can occur. To
prevent future control system failures, the causes of failure are studied. When
failure causes are understood, system designs can be improved. To obtain
sufficient depth of understanding, all levels of the system must be examined.
For safety and reliability analysis purposes, four levels are defined in this
book. A system is built from units. If the system is redundant, multiple units
are used. Non-redundant systems use a single unit. Units are built from
modules, and modules are built from components (see Figure 3-1). Many real
control systems are constructed in just this manner. Although system
construction could be defined in other ways, these levels are optimal for use
in safety and reliability analysis, especially the analysis of fault-tolerant
systems. These terms will be used throughout the book.
Failure Categorization
A group of control system suppliers and users once brainstormed at an
ISA standards meeting on the subject of failures. A listing of failure
sources and failure types resulted:
Humidity
Software bugs
Temperature
Power glitches
[Figure 3-1. System hierarchy: SYSTEM > UNIT > MODULE > COMPONENT]
This list is quite diverse and includes a number of different failure types, as
well as failure sources, which need to be sorted if any good understanding is
to be gained.

When thinking about all system failures, two types emerge: random failures and
systematic failures (Figure 3-2). The functional safety standard IEC 61508
[Ref. 1] defines a random failure as "Failure occurring at a random time which
results from one or more of the possible degradation mechanisms in the
hardware." It notes that "There are many degradation mechanisms occurring at
different rates in different components and, since manufacturing tolerances
cause components to fail due to these mechanisms after different times in
operation, failures of equipment comprising many components occur at
predictable rates but unpredictable (i.e., random) times."
[Figure 3-2. Failure types: random failures and systematic failures]
Other failures are called functional failures (or systematic failures). A
systematic failure occurs when no components have failed, yet the system does
not perform its intended function. IEC 61508 defines a systematic failure as a
"Failure related in a deterministic way to a certain cause, which can only be
eliminated by a modification of the design or of the manufacturing process,
operational procedures, documentation or other relevant factor."
The failure may or may not even appear repeatable, depending on the data
entered.
There is debate about how to classify some failures [Ref. 6]. One view is that
all failures are systematic because an engineer should have properly
anticipated all stress conditions and designed the system to withstand them!
Different analysts may classify the same failure as random or systematic, with
the distinction often depending on the assumptions made by the analyst. If a
component manufacturer specifies a strict requirement for clean air per an ISA
standard and a failure is caused by intermittent bad air resulting from failure
of the air filter, then the manufacturer will likely classify the failure as
systematic. The (unrealistic) assumption is that the component user can design
an air supply system that never allows dirty air. The component user, who
designed and installed the air filter system, knows that air quality can vary
randomly due to random events (e.g., air filter failure) and would likely
classify the same failure as random. Of course, more than a few of these random
failures will likely convince the component user to find a better-designed
component that can withstand random air quality failure events.
Both random and systematic failures have attributes that are important to
control system safety and reliability analysis. This failure information is
needed to help determine how to prevent future failures of both types.
Information is recorded about the failure source and the effect on the control
system's function. The term "failure stress source" is used to represent the
cause of a failure: the cause of death. This information should include all the
things that have stressed the system. The items from the earlier brainstormed
list can be classified as follows:
Item - Class
Humidity - External failure stress source
Software bugs - Internal design failure stress source
Temperature - External failure stress source
Power glitches - External failure stress source
Stuck pneumatic actuator - Failure type: random failure
Dirty air - External failure stress source
System design errors - Internal design failure stress source
Electrostatic discharge (ESD) - External failure stress source
Sticky O-ring - Failure type: random failure
Broken wires - Failure type: random failure
Corrosion-induced open circuits - Failure type: random failure
Random component failures - Failure type: random failure
Repairman error - External failure stress source
RFI - External failure stress source
Operator closing wrong switch - Failure type: systematic failure
Vibration - External failure stress source
Broken structural member - Failure type: random failure
Improper grounding - External failure stress source
Wrong configuration loaded - Failure type: systematic failure
Incorrect replacement - Failure type: systematic failure
Wrong software version installed - Failure type: systematic failure
Random Failures
Random failures may be permanent (hard error) or transient (soft error). In
some technologies a random failure is most often permanent and attributable to
some component or module. For example, a system that consists of a single-board
controller module fails. The controller output de-energizes and no longer
supplies current to a solenoid valve. The controller diagnostics identify a bad
output transistor component. A failure analysis of the output transistor shows
that it would not conduct current and failed with an open circuit. The failure
occurred because a thin bond wire inside the transistor melted. Plant
Operations reports a nearby lightning strike just before the controller module
failed. Lightning causes a surge of electrical stress that can exceed the
strength of a transistor. It should be noted that lightning is considered a
random event.
Systematic Failures
If all the physical components of a control system are working properly
and yet the system fails to perform its function, that failure is classified as
systematic. Systematic (or functional) failures have the same attributes as
physical failures. Each failure has a failure stress source and an effect on
the control function. The failure stress sources are almost always design
faults, although sometimes a maintenance error or an installation error
causes a systematic failure.
The exact source of a systematic failure can be terribly obscure. Often the
failure will occur when the system is asked to perform some unusual function,
or perhaps the system receives some combination of input data that was never
tested. Some of the most obscure failures involve combinations of stored
information, elapsed time, input data, and function performed. The ability to
resist such stress may be considered a measure of strength in the design.
Failure stress sources are either internal or external to the system (see
Figure 3-3). Internal failure stress sources typically result in decreased
strength. Internal sources include design faults (product) and manufacturing
faults (process). The faults can occur at any level: component, module, unit,
or system. External failure stress sources increase stress. They include
environmental sources, maintenance faults, and operational faults. A failure is
often due to a combination of internal weakness (decreased strength) and
external stress. Most, if not all, failures due to external stress are
classified as random. Internal failure stress sources can be classified as
either random or systematic.
Internal Design
Design faults (insufficient strength) can cause system failures. They are a
major source of systematic failures. Many case histories are recorded [Ref.
8, 9, 10]. In some cases the designer did not understand how the system
would be used. Sometimes the designer did not anticipate all possible
environmental or operational conditions. Different designers working on
different parts of the system may not have completely understood the
whole system. These faults may apply to component design, module
design, unit design, or system design.
In one famous example, the BMEWS (Ballistic Missile Early Warning System) radar
in Thule, Greenland, had detected the rising moon. The designers did not
anticipate this input. Reports indicate that this design was strengthened via a
redesign shortly after the failure occurred. This is characteristic of design
faults that are properly remedied: strength increases.

Design faults can be uncovered by review techniques, such as a walkthrough
process where a second party (not the designer) describes the design to a group
of reviewers. Other techniques involve the use of experienced personnel who
specifically look for design faults.
Internal Manufacturing
Manufacturing defects occur when some step in the process of manufacturing a
product (component, module, unit, or system) is done incorrectly. For example,
a controller module failed when it no longer supplied current to a valve
positioner. It was discovered that a resistor in the output circuit had a
corroded wire that no longer conducted electricity. This was a surprise, since
the resistor wire is normally coated to protect against corrosion. A closer
examination showed that the coating had not been applied correctly when the
resistor was made. Small voids in the coating occurred because the coating
machine had not been cleaned on schedule.
External Environmental
Control systems are used in industrial environments where many things are
present that help cause failures (Figure 3-4).

[Figure 3-4. External environmental failure sources, including temperature and
humidity]
Temperature can directly cause failure. A PC-based industrial computer module
that performs a data logging function is mounted in a ventilated shed. One hot
day, the ventilation fan fails and the temperature reaches 65°C. The power
supply overtemperature circuit shuts down the computer.
Many other industrial failure stress sources are present. Large electrical
currents are routinely switched, generating wideband electrical noise.
Mechanical processes such as mixing, loading, and stamping cause shock and
vibration. Consider an example: a controller module mounted near a turbine had
all its calibration data stored in a programmable memory mounted in an
integrated circuit socket. After several months of operation, the controller
failed. Its outputs went dead. An examination of the failed controller module
showed that the memory had vibrated out of the socket. The controller computer
was programmed to safely de-energize the outputs if calibration data was lost.

Failure rates caused by environmental sources vary with the product design.
Some products are stronger and are designed to withstand higher environmental
stress; they do not fail as frequently.
1. They can cause direct failure when their magnitude exceeds design limits.
For example, an integrated circuit mounted in a plastic package works well even
when hot (some are rated to junction temperatures of 150°C). But when the
temperature exceeds the glass transition point of the plastic (approximately
180°C), the component fails. In another example, the keyboard for a personal
computer worked well in a control room even through the humid days of summer.
But it failed immediately when a glass of water tipped into it. For this
failure:

Title: Operator console failed
Failure: Keyboard failed: water spill
Failure Type: Random
Primary Failure Stress Source: Water on keyboard
Secondary Failure Stress Source: None identified
Such environmental stress levels should be taken into account when calculating
failure rates, and used as adjustment factors based on experience. In the
safety and reliability analysis of control systems, many other types of stress
must also be considered.
Stress
Stress varies with time. Consider, for example, stress caused by ambient
temperature. Ambient temperature cycles every day as the earth rotates.
Temperature levels change with the season and other random events. When
temperature is viewed as a whole, the stress level can be characterized as
random.
[Figure 3-5. Normally Distributed Stress]
When evaluating stress levels that cause failures, the probability of getting
a stress level lower than a particular value must be determined. This is
represented by an inverse distribution function as shown in Figure 3-6.
Strength
As mentioned above, the strength of a product is a measure of the ability of
the product to resist a failure stress source (stressor).

[Figure 3-6. Probability of Getting a Stress Level Lower Than a Particular
Value]

[Figure 3-7. Strength Versus Inverse Stress CDF; the region where stress
exceeds strength is marked with a crosshatch.]
If y represents strength and x represents stress, the margin is

w = y - x    (3-1)

The product succeeds when w > 0 and fails when w \le 0. The failure probability
is represented by the area where the stress-strength curves interfere. This
area is reduced when more safety factor (the difference between mean strength
and mean stress) is used by the designer.
EXAMPLE 3-1
Solution: How much of the area under the normal curve that represents stress
(ambient temperature) exceeds the strength (90°C) of the transistor? First,
relate the transistor strength to ambient temperature. Since the transistor
operates 30 degrees hotter than ambient, it will fail when the ambient
temperature exceeds 90°C - 30°C = 60°C.
A simulation can show how many modules will fail as a function of time, given a
particular stress-strength relationship. A simulation was done [Ref. 16] using
a normally distributed stress and a constant strength. Choosing values (Figure
3-8) similar to previous work done at Eindhoven University in the Netherlands
[Ref. 15], the results are shown in Figure 3-9. Within the limits of the
simulation, a relatively constant number of modules would fail as a function of
time.
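A Monte Carlo version of that simulation fits in a few lines. This sketch is
not from the book; it uses the Figure 3-8 values (normal stress with mean 10
and standard deviation 1 against a constant strength of 12.7) and a hypothetical
population size:

import random

random.seed(1)
strength = 12.7
units = 100_000  # hypothetical starting population

for period in range(1, 11):
    # Each surviving unit sees one random stress event per time period.
    failed = sum(1 for _ in range(units) if random.gauss(10.0, 1.0) > strength)
    print(f"period {period}: failure rate = {failed / units:.5f}")
    units -= failed
# The printed rate stays near P(stress > 12.7) ~ 0.0035 in every period:
# the roughly constant failure rate seen in Figure 3-9.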
[Figure 3-8. Stress Versus Strength Simulation Values, Identical Strength:
stress is normal with mean 10 and standard deviation 1; strength is constant at
12.7.]

[Figure 3-9. Failure Rate as a Function of Operation Time Interval, Identical
Strength]
Strength Varies
Actual manufacturing processes are not ideal, and newly manufactured products
are not identical in strength. In addition to the effects of manufacturing
defects, the raw materials used to make components vary from batch to batch,
the component manufacturing process varies, and the module-level manufacturing
process can differ in any number of places. The result of this variation in
strength can be characterized by a probability density distribution.
Experience, supported by the central limit theorem, would suggest another
normal distribution.
[Figure 3-10. Stress Versus Strength Simulation Values, Normal Strength: stress
is normal with mean 10.0 and standard deviation 1.0; strength is normal with
mean 12.7 and standard deviation 0.5.]

[Figure 3-11. Failure Rate as a Function of Operating Time Interval, Normal
Strength]
Strength Decreases
Strength changes with time. Although there are rare circumstances where product
strength increases with time, a vast majority of strength factors decrease.
Even in the absence of changes in stress, as strength decreases, the likelihood
of failure increases, and the rate at which failures occur will increase with
time.

[Figure 3-12. Stress-Strength with Decreasing Strength: stress is normal with
mean 10.0 and standard deviation 1.0; starting strength is normal with mean
12.7 and standard deviation 0.5 at t = 0, with the strength distribution shifted
lower by t = 800.]
[Figure: Failure rate as a function of operating time interval with decreasing
strength; the failure rate increases with time.]
Measuring Strength
Strength is measured by testing a product (a component, a module, a unit, or a
system) until it fails (or at least until the limits of the test equipment have
been reached). When several identical devices are actually tested to
destruction during a design qualification test, a good indication of strength
is obtained. Many industrial stressors have been characterized by standard
tests.
Temperature
Temperature stress testing is defined by the IEC 60068-2 standard [Ref. 17].
Different temperature levels for operation and storage are defined. A series of
tests under various conditions are defined. The most common tests for
operational temperature stress are 68-2-1 Test Ab, 68-2-2 Test Bb, and 68-2-14
Tests Na and Nb. A typical industrial specification might read "Operating
Temperature: 0 to 60°C, tested per IEC 60068-2-2."

Although the test standards do not require operation beyond the test limits,
testing to destruction should be done during design qualification to ensure
that the measured strength has a good margin. A high-strength industrial
controller should successfully operate at least 30°C beyond the specification.
High temperature-related strength is designed into industrial electronic
modules by adding large metal heat sinks and heat sink blocks to critical
components like microprocessors and power output drivers.
Humidity
Humidity testing is also done per the IEC 60068-2 series of standards. The most
common testing is done using 68-2-3 Test Ca for operational humidity and
68-2-30 Test Dd for cycling humidity. Common specifications for industrial
devices are 5 to 95% non-condensing for operation and 0 to 100% condensing for
storage. Control modules with plastic spray coating and contact lubricant will
resist humidity to much higher stress levels.
Mechanical Shock/Vibration
Mechanical shock and vibration stressors can be quite destructive to control
systems, especially when they are installed near rotating equipment like pumps
and compressors. Testing is typically done by vibrating the product under test
at various frequencies between 10 and 150 Hz per IEC 68-2-6 Test Fc. Although
displacement and acceleration are specified for different portions of the
frequency spectrum, it is useful to think of the stressor levels in Gs. IEC
68-2-27 Test Ea defines how mechanical shocks are applied.
Corrosive Atmospheres
Failures due to corrosive atmospheres are common in control systems, especially
in the chemical, oil and gas, and paper industries. Some of the corrosive gases
attack the copper in electronic assemblies. The ISA-71.04 standard [Ref. 18]
specifies several levels of corrosion stress. Testing can be done to verify the
ability of a system to withstand corrosive atmospheres over a period of time.
Electromagnetic Fields
The IEC 61000-4 series of standards [Ref. 19] specifies levels of various
electromagnetic stressors. IEC 61000-4-3 specifies test methods for radiated
electromagnetic fields (EMF) and IEC 61000-4-6 specifies test methods for
conducted electromagnetic disturbances. Industrial control equipment is
typically specified to withstand radiated EMF (sometimes called radio frequency
interference, RFI) at levels of 10 V/m over frequencies between 80 MHz and
1000 MHz. There should be enough strength to withstand 10 V conducted
disturbances over a 150 kHz to 80 MHz range. Module shielding and EMF filtering
will add strength to a module.
Electrostatic Discharge
IEC 61000-4-2 explains how to determine strength against electrostatic
discharge. An electrostatic discharge can occur when two dissimilar materials
are rubbed together. When an operator gets out of a chair or a person removes a
coat, an electrostatic discharge may result. Equipment that will withstand a
15 kV air discharge or an 8 kV contact discharge is suited for use in an
industrial environment. High strength can be provided by grounded conductive
(metal) enclosures that provide a safe path for electrostatic discharge.
Exercises
3.1 What failure stress source has caused the most failures in your plant?
3.2 Can software design faults cause failures of computer-based controllers?
3.3 What failure stress sources are primarily responsible for infant mortality
failures?
3.4 What method can be used to cause manufacturing defects to become failures
in a newly manufactured product?
3.5 How is a wearout mechanism related to the strength of a product?
3.6 The input module to a process control system will withstand a maximum of
2000 volts transient electrical overstress without failure. Input voltage
transients for many installations over a period of time are characterized by a
normal distribution with a mean of 1500 volts and a standard deviation of 200
volts. What is the probability of module failure due to input voltage transient
electrical overstress?
3.7 A stress is characterized as having a normal distribution with a mean of 10
and a standard deviation of 1. If devices will fail when subjected to stresses
above 12.7, what is the probability of failure?
3.8 Contact lubricant is a fluid used to seal electrical connectors and provide
strength against what stressors?
Answers to Exercises
3.1 The answer depends on a particular plant site.
3.2 Yes, software design faults can cause control systems to fail. This
risk may go up as the quantity and complexity of software in our
systems increases.
3.3 Manufacturing defects.
3.4 Stress screening during manufacture, burn-in.
3.5 Wearout occurs when the strength of a device decreases with time.
3.6 Using a standard normal distribution with x = 2000, μ = 1500, and σ = 200,
z = 2.50. From Appendix A, the probability of failure = 1 - 0.993791 =
0.006209.
References
1. IEC 61508:2000, Functional Safety of Electrical/Electronic/Programmable
Electronic Safety-related Systems. Geneva: International Electrotechnical
Commission, 2000.
7. Watson, G. F. "Three Little Bits Breed a Big, Bad Bug." IEEE Spectrum. New
York: IEEE, May 1992.
17. IEC 60068-2-2 Ed. 4.0 b:1974, Environmental Testing, Part 2: Tests. Tests
B: Dry Heat. Geneva: International Electrotechnical Commission, 2000.
Reliability Definitions
As the field of reliability engineering has evolved, several measures have been
defined to express useful parameters that specifically relate to successful
operation of a device. Based on that work, additional measures have been more
recently defined that specifically relate to safety engineering. These measures
have been defined to give the different kinds of information that engineers
need to solve a number of different problems.
Time to Failure
The term "random variable" is well understood in the field of statistics. It is
the independent variable: the variable being studied. Samples of the random
variable are taken. Statistics are computed about that variable in order to
learn how to predict its future behavior.

The sample average (or mean) of the failure times can be calculated. For this
set of test data, the sample mean time to failure, MTTF, is calculated to be
3,248 hours. This measurement provides some information about the future
performance of similar modules.
Reliability
Reliability is a measure of success. It is defined as the probability that a
device will be successful; that is, that it will satisfactorily perform its
intended function when required to do so, if operated within its specified
design limits. The definition includes four important aspects of reliability,
the first of which is probability.
Unreliability
Unreliability, F(t), a measure of failure, is defined as the probability that a
device will fail during the operating time interval, t. In terms of the random
variable T,

F(t) = P(T \le t)    (4-2)

Unreliability equals the probability that the failure time will be less than or
equal to the operating time interval. Since any device must be either
successful or failed, F(t) is the one's complement of R(t).
Availability
Reliability is a measure that requires success (that is, successful operation)
for an entire time interval. No failures (and subsequent repairs) are allowed.
This measurement was not enough for engineers who needed to know the chance of
success when repairs may be made. Availability is defined as the probability
that a device is successful (operating) at any moment in time.

[Figure: Reliability, availability, unreliability, and unavailability measures]
Unavailability
Unavailability, a measure of failure, is also used for repairable devices. It
is defined as the probability that a device is not successful (is failed) at
any moment in time. Unavailability is the one's complement of availability;
therefore,

U(t) = 1 - A(t)    (4-4)
EXAMPLE 4-1
Problem: A controller has an availability of 0.99. What is the
unavailability?
Solution: Using Equation 4-4, unavailability = 1 - 0.99 = 0.01.
Probability of Failure
The probability of failure during any interval of operating time is given by a
probability density function (see Chapter 2). The probability density function
for failure probability is defined as

f(t) = \frac{dF(t)}{dt}    (4-5)

This can be interpreted as the probability that the failure time, T, will occur
between a point in time t and the next interval of operation, t + \Delta t, and
is called the probability of failure function.

The probability of failure function can provide failure probabilities for any
time interval. The probability of failure between the operating hours of 2000
and 2200, for example, is

P(2000, 2200) = \int_{2000}^{2200} f(t)\,dt    (4-7)
EXAMPLE 4-2
Problem: A device has a constant failure rate of 0.0002 failures per hour. What
is the probability that it will fail after the warranty (6 months, 4,380 hr)
and before plant shutdown (12 months, 8,760 hr)?
Solution: Using Equation 4-7 with the exponential density f(t) =
0.0002e^{-0.0002t}, this evaluates to

P(4380, 8760) = e^{-0.0002(4380)} - e^{-0.0002(8760)} = 0.4164 - 0.1734 = 0.243

This result states that the probability of failure during the interval from
4,380 hours to 8,760 hours is 24.3%.

MTTF
MTTF is merely the mean or expected failure time. It is defined from the
statistical definition of expected or mean value (Equation 2-10). Using the
random variable operating time interval, t, and recognizing that there is no
negative time, we can update the mean value equation and substitute the
probability density function f(t):

E(t) = \int_{0}^{+\infty} t\,f(t)\,dt    (4-8)
Substituting
d[R(t)]
f ( t ) = ------------------- (4-9)
dt
+
E( t) = t d[ R ( t ) ]
0
+
E(T) = [ tR ( t ) ] 0 R ( t ) dt
0
The first term equals zero at both limits. This leaves the second term,
which equals MTTF:
$$\text{MTTF} = E(T) = \int_0^{+\infty} R(t)\,dt \qquad \text{(4-10)}$$
For a constant failure rate,
$$\text{MTTF} = \frac{1}{\lambda} \quad \text{by definition} \qquad \text{(4-11)}$$
NOTE: The formula MTTF = 1/λ is valid for single components with a constant failure rate or a series of components, all with constant failure rates. See The Constant Failure Rate later in this chapter.
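Equation 4-10 can be verified numerically for the constant failure rate case. A minimal sketch, with an assumed failure rate, that integrates R(t) and compares the result to 1/λ:

    import math

    lam = 1.0e-4  # assumed constant failure rate, failures per hour

    # integrate R(t) = exp(-lam * t) far enough that the tail is negligible
    t_end, steps = 2.0e5, 200000
    dt = t_end / steps
    mttf_numeric = sum(math.exp(-lam * (i + 0.5) * dt) for i in range(steps)) * dt

    print(mttf_numeric)  # approximately 10000 hours
    print(1.0 / lam)     # 10000 hours, Equation 4-11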
MTBF
MTBF is the mean time between failures of a repairable device. It equals the sum of the mean time to failure and the mean time to restore:
$$\text{MTBF} = \text{MTTF} + \text{MTTR} \qquad \text{(4-12)}$$
[Figure 4-4. MTTF, MTTR, and MTBF: a period of successful operation (MTTF) followed by a repair (MTTR), with MTBF spanning both]
The term MTBF has been misused. Since MTTR is usually much smaller
than MTTF, MTBF is often approximately equal to MTTF. MTBF, which by
definition only applies to repairable systems, is often substituted for
MTTF, which applies to both repairable and non-repairable systems.
EXAMPLE 4-5
Problem: A repairable controller has an MTTF of 87,600 hours and an average repair time (MTTR) of 2 hours. What is the MTBF?
Solution: Using formula 4-12, the MTBF = 87,602 hours. The MTBF is effectively equal to the MTTF.
Failure Rate
Failure rate, often called hazard rate by reliability engineers, is a com-
monly used measure of reliability that gives the number of failures per
unit time from a quantity of components exposed to failure.
Failure rate has units of inverse time. It is a common practice to use units of failures per billion (10^9) hours. This unit is known as FIT for Failure unIT. For example, a particular integrated circuit will experience seven failures per billion operating hours at 25°C and thus has a failure rate of seven FITs.
Note that the measure failure rate is most commonly attributed to a sin-
gle component. Although the term can be correctly applied to a module,
unit, or even system where all components are needed to operate (called a
series system), it is a measure derived in the context of a single component.
The failure rate function is related to the other reliability functions. It can
be shown that
$$\lambda(t) = \frac{f(t)}{R(t)} \qquad \text{(4-14)}$$
Consider the failure log for a highly accelerated life test (HALT) as shown
in Table 4-2. The number of failures is decreasing during the first few
weeks. The number then remains relatively constant for many weeks.
Toward the end of the test the number begins to increase.
Failure rate is calculated in column four and equals the number of fail-
ures divided by the number of module hours (surviving modules times
hours) in each weekly period. The failure rate also decreases at first, then
remains relatively constant, and finally increases. These changes in the
failure rates of components are typically due to several factors including
variations in strength and strength degradation with time. Note that, in
such a test, the type and level of stress do not change.
Table 4-2. HALT failure log (weeks 7 through 38)
Week | Modules Operating | Failures | Failure Rate (per hr)
7 | 28 | 2 | 0.0004
8 | 26 | 1 | 0.0002
9 | 25 | 1 | 0.0002
10 | 24 | 0 | 0.0000
11 | 24 | 2 | 0.0005
12 | 22 | 1 | 0.0002
13 | 21 | 1 | 0.0003
14 | 20 | 0 | 0.0000
15 | 20 | 1 | 0.0003
16 | 19 | 0 | 0.0000
17 | 19 | 1 | 0.0003
18 | 18 | 1 | 0.0003
19 | 17 | 0 | 0.0000
20 | 17 | 1 | 0.0003
21 | 16 | 1 | 0.0004
22 | 15 | 0 | 0.0000
23 | 15 | 1 | 0.0004
24 | 14 | 0 | 0.0000
25 | 14 | 1 | 0.0004
26 | 13 | 0 | 0.0000
27 | 13 | 1 | 0.0005
28 | 12 | 0 | 0.0000
29 | 12 | 1 | 0.0005
30 | 11 | 0 | 0.0000
31 | 11 | 1 | 0.0005
32 | 10 | 1 | 0.0006
33 | 9 | 1 | 0.0007
34 | 8 | 1 | 0.0007
35 | 7 | 1 | 0.0008
36 | 6 | 1 | 0.0010
37 | 5 | 2 | 0.0024
38 | 3 | 3 | 0.0059
[Figure 4-5. Decreasing Failure Rate]
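The failure rate column of Table 4-2 can be reproduced with a short calculation. This sketch assumes 168 operating hours per module per weekly period, which matches the table values:

    week_data = [  # (modules operating, failures during the week), from Table 4-2
        (28, 2), (26, 1), (25, 1), (24, 0), (24, 2), (22, 1),
    ]
    HOURS_PER_WEEK = 168

    for modules, failures in week_data:
        # failure rate = failures / module-hours in the period
        rate = failures / (modules * HOURS_PER_WEEK)
        print(f"{modules:2d} modules, {failures} failures -> {rate:.4f} per hour")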
Components wear at different rates. Imagine a life test of 100 fan motors in
which the motor bearings wore at exactly the same rate; one day all the
motors would fail at the same instant. Because components do not wear at
the same rate, they do not fail at the same time. However, as a group of
components approach wear-out, the failure rate increases.
Bathtub Curve
As we have seen, a group of components will likely be exposed to many
kinds of environmental stress: chemical, mechanical, electrical, and physi-
cal. The strength factors as initially manufactured will vary and strength
will change at different rates as a function of time. When the failure rates
due to these different failure sources are superimposed, the well-known
bathtub curve results.
The failure rate of a group of components decreases in early life. The left
part of the curve has been called the roller coaster curve (Ref. 2, 3). The
failure rate will be relatively constant after the components containing
manufacturing defects are removed. This failure rate can be very low if the
components have few design faults and high strength. As physical
[Figure: bathtub curve, failure rate versus time, with a relatively constant failure rate during useful life]

The Constant Failure Rate
During the useful-life period of the bathtub curve, the failure rate is approximately constant: λ(t) = λ. With a constant failure rate, the reliability functions take the exponential form:
$$f(t) = \lambda e^{-\lambda t} \qquad \text{(4-15)}$$
$$F(t) = 1 - e^{-\lambda t} \qquad \text{(4-16)}$$
and
$$R(t) = e^{-\lambda t} \qquad \text{(4-17)}$$
Not every group of components shows a long period of constant failure rate; some show a decreasing failure rate. In these cases, though, the constant failure rate represents a worst-case assumption and can still be used.
Substituting the exponential reliability function into Equation 4-10,
$$\text{MTTF} = \int_0^{+\infty} R(t)\,dt = \int_0^{+\infty} e^{-\lambda t}\,dt$$
and integrating,
$$\text{MTTF} = -\frac{1}{\lambda}\left[e^{-\lambda t}\right]_0^{+\infty}$$
When the exponential is evaluated, the value at t = infinity is zero and the value at t = 0 is one. Substituting these results, we have a solution:
$$\text{MTTF} = -\frac{1}{\lambda}[0 - 1] = \frac{1}{\lambda} \qquad \text{(4-18)}$$
EXAMPLE 4-8
Problem: A motor has a constant failure rate of 150 FITs. What is the motor reliability for a mission time of 1000 hours?
Solution: 150 FITs equals 0.00000015 failures per hour. Using Equation 4-17,
$$R(1000) = e^{-0.00000015 \times 1000} = 0.99985$$

EXAMPLE 4-10
Problem: A controller has a constant failure rate of 276 FITs (0.000000276 failures per hour). What is its reliability for a one-year (8760 hour) mission?
Solution: Using Equation 4-17,
$$R(8760) = e^{-0.000000276 \times 8760} = 0.9976$$

EXAMPLE 4-11
Problem: What is the MTTF of the controller from Example 4-10?
Solution: Using Equation 4-18,
$$\text{MTTF} = \frac{1}{0.000000276} = 3{,}623{,}188 \text{ hr}$$
A Useful Approximation
Certain functions can be approximated by a series of other functions, and one of these approximations is useful in reliability engineering. For all values of x, it can be shown that
$$e^x = 1 + x + \frac{x^2}{2!} + \frac{x^3}{3!} + \frac{x^4}{4!} + \cdots \qquad \text{(4-19)}$$
For small values of x, the higher-order terms are negligible and
$$e^{-x} \approx 1 - x$$
A rearrangement yields
$$1 - e^{-x} \approx x$$
Therefore,
$$P(\text{failure}) = 1 - e^{-\lambda t} \approx \lambda t \qquad \text{(4-20)}$$
when λt is small.
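The quality of the approximation in Equation 4-20 is easy to see numerically. A small sketch comparing 1 - e^(-λt) with λt for an assumed failure rate:

    import math

    lam = 1.0e-5  # assumed constant failure rate, failures per hour
    for t in (100, 1000, 10000, 100000):
        exact = 1 - math.exp(-lam * t)  # Equation 4-16
        approx = lam * t                # Equation 4-20
        print(t, round(exact, 6), round(approx, 6))
    # The approximation is excellent while lam * t is small, and it is
    # conservative (larger than the exact value) as lam * t grows.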
Restore Rate
The restore rate is the reciprocal of the average restore time:
$$\mu = \frac{1}{\text{MTTR}} \qquad \text{(4-21)}$$
EXAMPLE 4-13
Problem: A device has an average restore time (MTTR) of 4 hours. What is the restore rate?
Solution: Using Equation 4-21,
$$\mu = \frac{1}{4} = 0.25 \text{ per hour}$$
For single components with a constant failure rate and a constant restore rate, steady-state availability can be calculated (see Chapter 8 for more detail) using the formula:
$$A = \frac{\mu}{\lambda + \mu} \qquad \text{(4-22)}$$
Substituting Equations 4-18 and 4-21 gives the equivalent form:
$$A = \frac{\text{MTTF}}{\text{MTTF} + \text{MTTR}} \qquad \text{(4-23)}$$
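A quick sketch of Equations 4-22 and 4-23, using an assumed failure rate and repair time, shows that the two forms agree:

    lam = 1.0e-5     # assumed failure rate, per hour
    mttr = 8.0       # assumed mean time to restore, hours

    mu = 1.0 / mttr      # restore rate, Equation 4-21
    mttf = 1.0 / lam     # Equation 4-18

    a_rates = mu / (lam + mu)        # Equation 4-22
    a_times = mttf / (mttf + mttr)   # Equation 4-23
    print(a_rates, a_times)          # both approximately 0.99992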
Safety Terminology
When evaluating system safety an engineer must examine more than the
probability of successful operation. Failure modes of the system must also
be reviewed. The normal metrics of reliability, availability, and MTTF
only suggest a measure of success. Additional metrics to measure safety
include probability of failure on demand (PFD), average probability of
failure on demand (PFDavg), risk reduction factor (RRF), and mean time
to fail dangerously (MTTFD). Other related terms are probability of failing
safely (PFS), mean time to fail safely (MTTFS), and diagnostic coverage.
These terms are especially useful when combined with the other reliability
engineering terms.
Failure Modes
A failure mode describes the way in which a device fails. Failure modes
must be considered in systems designed for safety protection applications,
called Safety Instrumented Systems (SIS). Two failure modes are impor-
tantsafe and dangerous. ISA standard 84.00.01-2004 (IEC 61511 Mod.)
defines safe state as state of the process when safety is achieved. In
the majority of the most critical applications, designers choose a de-ener-
gized condition as the safe state. Thus a safe failure mode describes any
failure that causes the device to go to the safe state. A device designed for
these safety protection applications should de-energize its outputs to
achieve a safe state. Such a device is called normally energized.
A safe failure in such a device (Figure 4-8) happens when the output de-
energizes even though there is no potentially dangerous condition. This is
frequently called a false trip. There are many different reasons that this
can happen. Input circuits can fail in such a way that the logic solver
thinks a sensor indicates danger when it does not. The logic solver itself
can miscalculate and command the output to de-energize. Output circuits
can fail open circuit. Many of the components within an SIS can fail in a
mode that will cause the system to fail safely.
[Figure 4-7. Successful Operation, Normally Energized System: during successful operation the input switch opens and the PLC solid state output switch de-energizes the solenoid]
[Figure 4-8. Safe Failure, Normally Energized System: the input circuit fails de-energized, the logic solver reads an incorrect logic 0 (or solves the logic incorrectly, or incorrectly sends a logic 0), or the output fails de-energized, so the load de-energizes with no dangerous condition present]
There are many component failures that might cause dangerous system
failure, especially if a system is not designed for high safety. An IEC 61508
Certified PLC is specifically designed to avoid this failure mode using a
number of design techniques.
[Figure 4-9. Dangerous Failure, Normally Energized System: the pressure sense input circuit fails energized or the solid state output switch fails energized, so the load cannot be de-energized on demand]
PFS/PFD/PFDavg
There is a probability that a normally energized SIS will fail with its out-
puts de-energized. This is called probability of failure safely (PFS). There
is also a probability that the system will fail with its outputs energized.
This is called probability of failure on demand (PFD). The term refers to
the fact that when a safety protection system is failed dangerously, it will
NOT respond when a demand occurs. Figure 4-10 shows the relationship
of safe and dangerous failure modes to overall system operation.
[Figure 4-10. Relationship of PFS and PFD to reliability and availability]
$$A(t) = 1 - [\text{PFS}(t) + \text{PFD}(t)] \qquad \text{(4-25)}$$
$$\text{PFDavg} = \frac{1}{T}\int_0^T \text{PFD}(t)\,dt \qquad \text{(4-26)}$$
Since PFD increases with time, the average value over a period of time is
typically calculated numerically (Figure 4-11).
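The numerical averaging can be sketched in a few lines. Assuming PFD(t) ≈ λ_D t for a single channel between proof tests (an illustrative model, not the only one), Equation 4-26 becomes:

    lam_d = 1.0e-6   # assumed dangerous failure rate, per hour
    T = 8760.0       # assumed proof test interval, hours
    steps = 10000
    dt = T / steps

    # average PFD(t) over the interval (Equation 4-26)
    pfd_avg = sum(lam_d * (i + 0.5) * dt for i in range(steps)) * dt / T
    print(pfd_avg)        # approximately lam_d * T / 2
    print(lam_d * T / 2)  # 0.00438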
MTTFS/MTTFD
As has been mentioned, MTTF describes the average amount of time until
a device fails. The definition includes all failure modes. In industrial con-
trol systems, the measure of interest is the average operating time between
failures of a particular mode, ignoring all other modes. In the case of an
SIS, the mean time to fail safely (MTTFS) and the mean time to fail danger-
ously (MTTFD) are of interest.
Consider a repairable PLC whose failure times are logged by failure mode. Suppose it fails dangerously, is repaired and placed back in service, and later fails dangerously again. So far the system has failed dangerously twice: the first time after operating 2327 hours, the second after operating 8537 hours (4016 + 4521). The PLC is again repaired and placed in service, and the failure times shown in Table 4-3 are recorded.
EXAMPLE 4-19
Problem: A PLC has measured failure data from Table 4-3. What is the MTTFD?
Solution: The MTTFD equals the average of the operating time intervals that ended in a dangerous failure, ignoring the other failure modes.
Diagnostic Coverage
The ability to detect a failure is an important feature in any control or
safety system. This feature can be used to reduce repair times and to con-
trol operation of several fault tolerant architectures. The measure of this
ability is known as the diagnostic coverage factor, C. The diagnostic cover-
age factor measures the probability that a failure will be detected given
that it occurs. Diagnostic coverage is calculated by adding the failure rates
of detected failures and dividing by the total failure rate, which is the sum
of the individual failure rates. As an example consider a system of ten
components. The failure rates and detection performance are shown in
Table 4-4:
Although only one component failure out of a possible ten is detected, the
coverage factor is 0.991 (or 99.1%). For this example the detected failure
rate is 0.00991. This number is divided by the total failure rate of 0.01. The
coverage factor is not 0.1 as might be assumed by dividing the number of
detected failures by the total number of known possible failures. Note that
the result would have been quite different if Component 1 was NOT
detected, while Component 2 was detected.
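The coverage arithmetic is a weighted average of failure rates, not a count of detected failures. A sketch with assumed component failure rates consistent with the description above:

    # (failure rate, detected?) for ten components; assumed values where the
    # one detected component dominates the total failure rate
    components = [(0.00991, True)] + [(0.00001, False)] * 9

    total_rate = sum(rate for rate, _ in components)
    detected_rate = sum(rate for rate, detected in components if detected)
    coverage = detected_rate / total_rate
    print(total_rate, coverage)  # 0.01 and 0.991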
The coverage factors C1 and C2 can also be defined for specific failure modes, such as safe and dangerous, resulting in terms such as:
CS1 - Coverage for safe failures due to single unit reference diagnostics
Exercises
4.1 Which term requires successful operation for an interval of time:
reliability or availability?
4.2 Which term is more applicable to repairable systems: reliability or
availability?
4.3 Unavailability for a system is given as 0.001. What is the
availability?
4.4 When does the formula MTTF = 1/λ apply?
4.5 Availability of the process control system is quoted as 99.9%. What
is the unavailability?
4.6 A control module has an MTTF of 60 years. Assuming a constant
failure rate, what is the failure rate?
4.7 A control module has an MTTF of 60 years. It has an average repair
time of 8 hours. What is the steady-state availability?
4.8 A control module has an MTTF of 60 years. What is the reliability
of this module for a time period of six months?
4.9 An SIS has a PFS of 0.002 and a PFD of 0.001. What is the
availability?
Answers to Exercises
4.1 Reliability.
4.2 Availability.
4.3 Availability = 0.999.
4.4 The formula MTTF = 1/λ applies to single components with a constant failure rate or series systems with a constant failure rate.
4.5 Unavailability = 0.001.
4.6 The failure rate equals 0.000001903 = 1903 FITs.
4.7 Availability = 0.9999847.
4.8 Reliability = 0.9917.
4.9 Availability = 0.997.
References
1. Billinton, R., and Allan, R. N. Reliability Evaluation of Engineering Systems: Concepts and Techniques. New York: Plenum Press, 1983.
FMEA / FMEDA
Failure Modes and Effects Analysis (FMEA) and Failure Modes Effects and Diagnostic Analysis (FMEDA) are commonly used analysis techniques in the fields of reliability and safety engineering. Both techniques will be discussed in this chapter.

FMEA Procedure
The minimum steps required in the FMEA process (here, at the component level) are simple:
1. List all components.
2. For each component, list all failure modes.
3. For each component/failure mode, list the effect on the next higher level.
4. For each component/failure mode, list the severity of the effect.
At a module or unit level, simply list the functional failure modes of that
level. Often these modes will be identified by another lower level FMEA.
FMEA Limitations
Because each component is reviewed individually, combinations that
cause critical problems are not addressed. In fault tolerant systems, com-
mon cause failures (see Chapter 10) are rarely identified since they require
more than one component failure.
FMEA Format
A FMEA is documented in a tabular format as shown in Table 5-1. Com-
puter spreadsheets are ideal tools for a FMEA. Each column in the table
has a specific definition.
When the FMEA is done at the component level, column one describes the
name of the component under review. Column two is available to list the
part number or code number of the component under review. Column
three describes the function of the component. A good functional descrip-
tion of each component can do an effective job in helping to document sys-
tem operation.
Column four describes the known failure modes of the components. One
row is typically used for each component failure mode. Examples of com-
ponent failure modes include fail shorted, fail open, drift, stuck at one,
stuck at zero, etc., for electronic components. Mechanical switch failure
modes might include stuck open, stuck closed, contact weld, ground
short, etc. (Ref. 2 provides a database listing of failure modes for possible
system components.) Column five describes the cause of the failure mode
of column four. Generally this is used to list the primary stress causing the failure: heat, chemical corrosion, dust, electrical overload, RFI, human operational error, etc. (see Chapter 3).
Column six describes how this component failure mode affects the function of the module (or subsystem) of which the component is a part. Column seven lists the criticality of this component failure mode. In safety evaluations, this column may be used to indicate safe versus dangerous failures.
Column eight is used to list the failure rate of the particular component
failure mode. The use of this column is optional when FMEAs are being
done for qualitative purposes, and required for quantitative FMEA. When
quantitative failure rates are desired and specific data for the application
is not available, failure rates and failure mode percentages are available
from handbooks (Ref. 2).
EXAMPLE 5-1
Solution: We must first list all failure modes for each of the system components. We must then fill out every relevant column for each failure in the table. Table 5-2
shows the results of this system level FMEA. The FMEA has
identified six critical items that should be reviewed to determine the
need for correction. We could consider installing a smart IEC 61508
certified temperature transmitter with automatic diagnostics. We
could install two drain pipes and pipe them in parallel. This would
prevent a single clogged drain from causing a critical failure. A level
sensor on the water tank could warn of insufficient water level. Many
other possible design changes could be made to mitigate the critical
failures or to reduce the number of false trips.
FMEDA
Additional columns are added to the chart as shown in Table 5-3. The 10th
column is an extension to the original MIL standard (Ref. 1) for the pur-
pose of identifying that this component failure is detectable by a diagnos-
tic technique. A number 1 is entered to designate that this failure is
detectable. A number 0 is entered if the failure is not detectable. Column
11 is an extension to the standard used to identify the diagnostic used to
detect the failure. The name of the diagnostic should be listed. Perhaps the
error code generated or the diagnostic function could also be listed.
Column 12 lists the failure mode number, the fraction of the component failure rate that results in a safe failure. Column 13 shows the safe detected failure rate, obtained by multiplying the failure rate (Column 8) by the failure mode number (Column 12) and the detectability (Column 10). The safe undetected failure rate is shown in column 14. This number is calculated by multiplying the failure rate (Column 8) by the failure mode number (Column 12) and one minus the detectability (Column 10).
Column 15 lists the dangerous detected failure rate. The number is
obtained by multiplying the failure rate (Column 8) by one minus the
failure mode number (Column 12) and the detectability (Column 10).
Column 16 shows the calculated failure rate of dangerous undetected
failures. The number is obtained by multiplying the failure rate (Column
8) by one minus the failure mode number (Column 12) and one minus the
detectability (Column 10).
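The column arithmetic generalizes to a small function. This sketch (names are illustrative) splits a component failure rate into the four categories using the failure mode number and the detectability:

    def fmeda_split(rate, safe_fraction, detectability):
        # Columns 13-16: safe detected, safe undetected,
        # dangerous detected, dangerous undetected
        sd = rate * safe_fraction * detectability
        su = rate * safe_fraction * (1 - detectability)
        dd = rate * (1 - safe_fraction) * detectability
        du = rate * (1 - safe_fraction) * (1 - detectability)
        return sd, su, dd, du

    # R1 row of Table 5-4: rate 0.5, failure mode number 1 (safe), detectability 0
    print(fmeda_split(0.5, 1.0, 0.0))  # (0.0, 0.5, 0.0, 0.0)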
[Figure 5-3. AC Input Circuit: single optocoupler input with an isolated 5V supply]
The FMEDA for this circuit is shown in Table 5-4. Component failure rates
must be listed for each component failure mode since diagnostic coverage
calculations are done based on a weighted average of failure rates.
Table 5-4. FMEDA (excerpt)
1 Name | 2 Code | 3 Function | 4 Mode | 5 Cause | 6 Effect | 7 Criticality | 8 Failure Rate | 9 Remarks | 10 Detectability | 11 Diagnostics | 12 Mode | 13 SD | 14 SU | 15 DD | 16 DU
R1-1K | 4555-1 | Input filter | Short | - | No filter | Safe | 0.5 | - | 0 | - | 1 | 0 | 0.5 | 0 | 0
[Figure 5-4. AC Input Circuit Designed for High Diagnostic Coverage: two optocoupler channels (OC1, OC2) read by the microprocessor]
Another AC input circuit is shown in Figure 5-4. This circuit was designed
for high diagnostic coverage. Very high levels of diagnostic coverage are
needed for high safety and high availability systems. The circuit uses two
sets of opto-couplers. The microprocessor that reads the inputs can read
both opto-couplers. Under normal operation both readings should be the
same. In addition, readings must be taken four times per AC voltage cycle.
This allows the microprocessor to read a dynamic signal. When all compo-
nents are operating properly, a logic 1 is a series of pulses. This circuit
design is biased toward fail-safe with a normally energized input sensor.
Table 5-5 shows the FMEDA.
FMEDA Limitations
The FMEDA technique is most useful on mechanical devices to accurately
show the impact of automatic diagnostic devices like partial stroke test
boxes or the effectiveness of a manual proof test procedure. The FMEDA
technique is also effective for determining the diagnostic coverage factors of automatic diagnostics and manual proof test procedures on simple electronic circuits and processes where the failure modes of the components are relatively well known.
There are times when the effect of a particular component failure is not
easily predicted from analysis. A simple test technique called fault injec-
tion testing is used to quickly determine the effect. Particular component
failure modes are actually simulated in an operating unit and the impact is
observed and documented in the FMEDA.
Exercises
5.1 List the steps for a FMEA.
5.2 List the limitations of a FMEA.
5.3 Can the diagnostic ability of a circuit, module, unit, or system be verified by fault injection testing?
5.4 Can a FMEA or a FMEDA benefit from a review?
5.5 Perform a FMEA for a piece of equipment in your plant.
Answers to Exercises
5.1 The steps for a FMEA are:
1. List all components.
2. For each component, list all failure modes.
3. For each component/failure mode, list the effect on the
next higher level.
4. For each component/failure mode, list the severity of
the effect.
5.2 A FMEA does not identify combinations of failures since each
component is reviewed individually. Operational and mainte-
nance failures may be missed. All failure modes of components
must be known or they will be missed.
5.3 Yes, within practical limits the evaluation of the diagnostic ability
of a circuit, module, unit, or system can be verified by fault injec-
tion testing.
5.4 Yes. As with most human engineering activities, a FMEA or a
FMEDA can benefit from a review.
5.5 Answer depends on process and plant.
References
1. U.S. MIL-STD-1629: Failure Mode and Effects Analysis. Springfield, VA: National Technical Information Service.
Fault Trees
A fault tree, when well done, can also be a valuable engineering document
describing how the system is supposed to operate under various fault con-
ditions. This provides necessary documentation for more detailed reliabil-
ity and safety analysis (Ref. 2).
While it is true that fault trees are used primarily during engineering
design to help identify potential design weaknesses, fault trees can also be
of great value when investigating causes of failures or accidents. All of the
trigger events and contributing events can be documented in a graphical
format showing the overall relationship between events and a resultant
failure.
An example fault tree appears in Figure 6-2. We identify the system failure
event: Fire. In a normal atmosphere, we know that two additional things
are required to start a fire: an ignition source and combustible material.
Working down the tree, we identify sources of combustible material and
the basic faults involved, which include a fuel leak and a fuel spill. We also
identify trigger events that may provide an ignition source.
[Figure 6-2. Example fault tree: FIRE at the top of an AND gate; one OR gate collects ignition sources (spark, smoking) and the other collects combustible material (fuel leak, fuel spill)]
At the bottom of the fault tree are basic faults and trigger events.
These are normally considered to be the root cause elements of any failure.
The basic fault is represented by a circle. A trigger event is represented by
the house symbol.
[Figure 6-3. Fault Tree Symbols: AND gate, OR gate, inhibit gate, house (trigger) event]
[Figure: example in which 'Shutdown fails' results when the operator pushes the wrong button while the shutdown alarm sounds]

The priority AND gate is used to show that inputs must be received in a particular sequence.
[Figure: additional fault tree symbols - conditional event, alternative conditional event, transfer in, transfer out, incomplete event that needs attention, and priority AND (event 1 before event 2)]
Consider the fault tree of Figure 6-7. A power system is being
reviewed. The system has three independent sources of power: the com-
mercial utility, a local generator, and a battery system. The AND gate at
the top of the drawing indicates that all three sources must fail in order to lose
power.
There are many ways in which utility power can fail, but the main concern is the protective utility breaker. Therefore the drawing shows the incomplete event symbol, indicating that other utility power failure issues are not developed further.

[Figure 6-7. Power system fault tree: 'Power system failure' is an AND of three branches - utility power failure (utility breaker blown OR other causes), generator failure (generator off line AND operator fails to restart), and battery failure (batteries discharged AND charger fails)]
Care must be taken when calculating probabilities in a fault tree. This is especially true when events into an AND gate are not independent or when PFDavg is to be calculated. It is also important that the probability unions be carefully considered so that multiple instances of a given probability are calculated only once.
AND Gates
With an AND gate, all inputs must be non-zero for the gate to output a
non-zero probability. For example, with a two input AND gate, both
events A and B must have a non-zero probability for the gate output to be
non-zero. Referring to failure events, both failures must occur for the gate output to be failed. If these events are independent (see Appendix C for definitions of independent events and mutually exclusive events), then the probability of getting failure event A and failure event B is given by the formula:
$$P(A \cap B) = P(A) \times P(B)$$
EXAMPLE 6-1
[Figure: AND gate example - 'Power System fails' requires both 'Battery System fails' and 'Commercial Power fails']
OR Gates
With an OR gate, any non-zero input allows the gate output to be non-zero. For two independent failure events A and B, the gate output probability is given by:
$$P(A \cup B) = P(A) + P(B) - P(A) \times P(B)$$
EXAMPLE 6-3
[Figure: OR gate example - 'Pressure signal not available' if either transmitter fails]

EXAMPLE 6-4
[Figure: OR gate example - 'Temperature signal not available' if any of the thermocouples fails]
Approximation Techniques
Often, in order to speed up and simplify the calculation, the faults and
events in a fault tree are assumed to be mutually exclusive and indepen-
dent. Under this assumption, probabilities for the OR gates are added.
--``,,`,,,`,,`,`,,,```,,,``,``,,-`-`,,`,,`,`,,`---
Probabilities for the AND gates are multiplied. This approximation tech-
nique can provide rough answers when probabilities are low, as is fre-
quently the case with failure probabilities. The approach is generally
conservative when working with failure probabilities because the method
gives an answer that is larger than the accurate answer. For failure proba-
bility this may be sufficient.
EXAMPLE 6-6
Problem: The power system fault tree of Figure 6-11 is assigned the following failure probabilities: utility breaker blown = 0.001; utility other = 0.000001; generator off line = 0.2; operator fails to restart = 0.01; batteries discharged = 0.1; charger fails = 0.01. What is the probability of power system failure?

[Figure 6-11. Power system fault tree with failure probabilities]
Solution: Working from the bottom up, the failure probability for battery system failure = 0.1 × 0.01 = 0.001. The failure probability for a generator failure = 0.2 × 0.01 = 0.002. The failure probability for utility power failure ≈ 0.001. The system failure probability = 0.001 × 0.002 × 0.001 = 0.000000002.
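The gate arithmetic of Example 6-6 is easy to script. A minimal sketch in which AND gates multiply independent input probabilities and OR gates take the complement of all inputs failing to occur:

    def and_gate(*probs):
        # all independent input events must occur
        p = 1.0
        for x in probs:
            p *= x
        return p

    def or_gate(*probs):
        # one minus the probability that no input event occurs
        q = 1.0
        for x in probs:
            q *= (1.0 - x)
        return 1.0 - q

    battery = and_gate(0.1, 0.01)        # batteries discharged AND charger fails
    generator = and_gate(0.2, 0.01)      # off line AND operator fails to restart
    utility = or_gate(0.001, 0.000001)   # breaker blown OR other
    print(and_gate(battery, generator, utility))  # approximately 2e-9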
Common Mistake: When building a fault tree and using the Gate Solution
method, it is important to model individual components only once. It is
easy to add something in twice, especially in complex systems.
Consider the case of a pressure switch and a process connection for that
pressure switch (Figure 6-12). If the pressure switch has a PFD of 0.005
and the process connection has a PFD of 0.02, the PFD of the system could
be modeled with a fault tree OR gate as shown in Figure 6-13.
[Figure 6-12. Pressure switch with its process connection]
Assume that the designer felt this probability of fail-danger was too high
and wanted a second (redundant) pressure switch. The system is designed
to trip if either switch indicates a trip (1oo2 architecture). One might
assume the fault tree to be two copies of Figure 6-13 with an additional
AND gate as shown in Figure 6-14.
When solving this fault tree, one must understand whether the process connection boxes represent two independent failure events, each with its own probability, or one failure event. A simple Gate Solution approach that assumes independent events would get the answer 0.0249 × 0.0249 = 0.00062. If both boxes are marked identically, it often means they represent one event. Physically this means that two pressure switches are connected to a common pressure connection. In that case the correct answer is (0.005 × 0.005) + 0.02 - (0.02 × 0.005 × 0.005) = 0.020. Of course it is recommended that the fault tree be drawn more clearly, as is done in Figure 6-15.
[Figure 6-15. Redrawn fault tree: 'Fail-Danger' is the OR of the common pressure connection failure and the AND of the two pressure switch failures]
1oo1 Architecture
For a 1oo1 component or system we can use the simplified approximation formula below to calculate the PFDavg:
$$\text{PFDavg} = \frac{\lambda_D T}{2}$$
1oo2 Architecture
In a 1oo2 architecture, two elements are available to perform the shut-
down function and only one is required. This can be represented by an
AND gate for probability of dangerous failure. Since the PFDavg repre-
sents the unavailability of the system when a demand occurs, then for a
1oo2 architecture both units A and B must fail dangerously for a loss of
safety function.
If the PFDavg is used as the input from each unit, then the PFDavg for the 1oo2 architecture, using incorrect gate probability calculations, would be
$$\text{PFDavg} = \left(\frac{\lambda_D T}{2}\right)^2$$
(This is the incorrect result of multiplying average values, i.e., the averaging is done before the logic multiplication.)
A more accurate equation to calculate the PFDavg for a 1oo2 system is:
$$\text{PFDavg} = \frac{\lambda_D^2 T^2}{3}$$
This is because
$$\frac{1}{T}\int_0^T A(t)\,B(t)\,dt \ne \left[\frac{1}{T}\int_0^T A(t)\,dt\right] \times \left[\frac{1}{T}\int_0^T B(t)\,dt\right]$$
where
$$\text{PFDavg} = \frac{1}{T}\int_0^T \text{PFD}(t)\,dt$$
The correct way to solve a fault tree is to numerically calculate PFD values
as a function of time and average the numerical values (Chapter 13 of Ref.
3). Alternatively, an analytical equation for PFD can be obtained and inte-
grated (this book, Chapter 14). In both cases the solution technique per-
forms averaging after the logic has been applied.
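The order of averaging is easy to demonstrate. Assuming PFD(t) = λ_D t for each unit of a 1oo2 pair, this sketch multiplies first and averages afterward, then compares with the incorrect multiply-the-averages value:

    lam_d = 1.0e-6   # assumed dangerous failure rate, per hour
    T = 8760.0       # assumed test interval, hours
    steps = 100000
    dt = T / steps

    # correct: form PFD(t) for the pair, then average over the interval
    avg_after = sum((lam_d * (i + 0.5) * dt) ** 2 for i in range(steps)) * dt / T

    exact = (lam_d * T) ** 2 / 3      # closed form average of (lam_d * t)^2
    wrong = (lam_d * T / 2) ** 2      # averaging before multiplying
    print(avg_after, exact, wrong)    # first two agree; wrong is 25% low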
Fault trees are also used to document system failure events. They can rep-
resent a rather complex set of circumstances and describe the situation(s)
that led to the failure in a way much clearer than words.
EXAMPLE 6-7
A building communications system failed when switchgear power was lost and the backup alarms went unacknowledged. The emergency batteries did supply power for six hours until they were exhausted. The fault tree below documents the failure events.

[Figure: 'Communications Failure' fault tree. Switchgear power failure branch: utility power failure (utility power switched off for load shedding, shutdown relay mis-set, other causes), generator failure, and other causes. Battery system failure branch: alarms not acknowledged (20th floor alarm unacknowledged, 14th floor alarm unacknowledged, 15th floor alarm unmanned).]
Exercises
6.1 The thermocouples of Example 6-4 have an unavailability of 0.05.
What is the unavailability of the sensor subsystem? (Use the full,
accurate method)
6.2 What is the answer to exercise 6-1 using the approximate method?
What is the error?
6.3 The power system of Figure 6-11 has probabilities of failure as fol-
lows:
Utility breaker blown = 0.1
Utility other = 0.02
Generator off-line = 0.1
Operator fails to restart = 0.02
Batteries discharged = 0.2
Charger failed = 0.02
What is the probability of system failure?
6.4 What are the advantages and limitations of fault tree analysis?
6.5 Draw a fault tree to document a failure in your plant.
Answers to Exercises
6.1 0.142625
6.2 0.15; the error is almost 5%.
6.3 0.000000944.
6.4 Fault Tree Analysis is a top-down approach that is capable of iden-
tifying combinations of failure events that cause system failure. It
is systematic and allows the review team to focus on a specific fail-
ure. It is limited, however, in that it depends on the skill of the
reviewers. In addition, a fault tree can only show one failure (or
failure mode) on a single drawing. This sometimes obscures inter-
action among multiple failure modes.
6.5 Answer depends on specific circumstances.
References
1. Henley, E. J., and Kumamoto, H. Probabilistic Risk Assessment: Reliability Engineering, Design and Analysis. New York: IEEE Press, 1992, pg. 3.
Reliability Block Diagrams
Many modules, units, and systems with one failure mode, such as are
used in industrial control applications, can be modeled through the use of
simple block diagrams. These block diagrams are used to show probabil-
ity of success/failure combinations and may show devices (components,
modules or units) in series, in parallel, or in combination configurations.
[Figure 7-1. Reliability block diagram modeling process: Physical Model (Understand Failure Modes, Understand Operation) → Reliability Block Diagram → Analyze model using Rules of Probability]
The first step in the process of reliability block diagram modeling (Figure 7-1) is to convert from a physical model into a reliability block diagram model. This step is often the hardest and is certainly the most critical.
A reliability block diagram may be viewed as showing the success
paths. For each device that is working successfully, the path goes
through a box representing that device. For each device that has failed, the
path is blocked by the box. For all combinations of successful and failed
devices, if the analyst can find a path horizontally across the reliability
block diagram, those devices are sufficient to allow the system to operate.
Consider the control system drawn in Figure 7-2. Three sensors are
present. All three sensors are wired to each of two controllers. Each con-
troller implements a voting algorithm in order to tolerate the failure of
one sensor. Either controller is capable of operating the valve.
The reliability block diagram model for this system is illustrated in Figure
7-3. This model shows several success paths. One such path through the
block diagram model consists of Sensor A, Sensor B, Controller A, and
Valve. If only these four devices operate, the system operates.
The rules of probability are used to evaluate the reliability block diagram.
Normally, work is done with reliability block diagram device probabili-
ties, and device failures are assumed to be independent. Sometimes it is
easier to use probability of failure, and sometimes it is easier to use proba-
bility of success. The general term probability of success may mean reli-
ability or it may mean availability. If working with non-repairable
systems, reliability (probability of success over the operating time interval
t) is used. In repairable systems, availability (probability of success at time
t) is used.
Series Systems
A series system (Figure 7-4) is defined as any system in which all
devices must work for the system to work. Taking the pessimistic perspec-
tive, a series system fails if any component fails. A prime example of a
series system would be the string of antique Christmas tree lights with
which the author struggles each year. The lights are wired in such a way
that when one bulb fails, all the lights go out! A series system offers no
fault tolerance; there is no redundancy.
[Figure 7-4. Series System: devices A and B in series]
Consider the series system shown in Figure 7-4. The system has two components, A and B, and:
$$R_S = R_A \times R_B \qquad \text{(7-1)}$$
In general, for n devices in series:
$$R_S = \prod_{i=1}^{n} R_i \qquad \text{(7-2)}$$
These equations are a direct result of one of the rules of probability, which states that
$$R_S = 1 - F(A) - F(B) + F(A) \times F(B) \qquad \text{(7-3)}$$
For constant failure rates,
$$R_A = e^{-\lambda_A t} \quad \text{and} \quad R_B = e^{-\lambda_B t}$$
This equals
$$R_S = e^{-(\lambda_A + \lambda_B)t} \qquad \text{(7-4)}$$
Thus, a series system of devices with constant failure rates has an overall constant failure rate equal to the sum of the device failure rates:
$$\lambda_S = \sum_{i=1}^{n} \lambda_i \qquad \text{(7-6)}$$
EXAMPLE 7-1
Solution: Since the system will fail if any one of the devices fails, the reliability block diagram (Figure 7-5) consists of a series system. As per Equation 7-6, the failure rates are added.
EXAMPLE 7-2
Solution: The availability of the series system is the product of the device availabilities:
$$A_S = 0.95 \times 0.9 \times 0.7 \times 0.7 \times 0.7 \times 0.8 = 0.2346$$
EXAMPLE 7-4
Solution: Substituting a five-year (43,800 hour) operating time interval,
$$R(5 \text{ years}) = e^{-\lambda_{TOTAL} \times 43800} = 0.853$$
The MTTF of a series system follows from Equation 4-10:
$$\text{MTTF} = \int_0^{\infty} R(t)\,dt = \int_0^{\infty} [R_1(t) \times R_2(t) \times \cdots \times R_n(t)]\,dt$$
For a series system with constant device failure rates (exponential PDF):
$$\text{MTTF}_{SYSTEM} = \int_0^{\infty} e^{-\sum_{i=1}^{n} \lambda_i t}\,dt = \frac{1}{\sum_{i=1}^{n} \lambda_i} = \frac{1}{\lambda_1 + \lambda_2 + \cdots + \lambda_n} \qquad \text{(7-7)}$$
Therefore, for a series system with constant device failure rates, where
$$\lambda_{TOTAL} = \lambda_1 + \lambda_2 + \cdots + \lambda_n$$
$$\text{MTTF}_{SERIES\ SYSTEM} = \frac{1}{\lambda_{TOTAL}} \qquad \text{(7-8)}$$
Parallel Systems
A parallel system is defined as a system that is successful if any one of
the devices is successful. From the pessimistic perspective, the parallel
system fails only if all of its devices fail. A parallel system offers fault tol-
erance that is accomplished through redundancy.
[Figure 7-7. Parallel System: devices A and B in parallel]
$$R_S = R_A + R_B - R_A \times R_B \qquad \text{(7-9)}$$
Note that the system fails only if both A and B fail. The formula for system failure is:
$$F_S = F_A \times F_B \qquad \text{(7-10)}$$
In general, for n devices in parallel:
$$F_S = \prod_{i=1}^{n} F_i \qquad \text{(7-11)}$$
$$R_S = 1 - F_S = 1 - \prod_{i=1}^{n} F_i \qquad \text{(7-12)}$$
EXAMPLE 7-6
Problem: A vessel jacket is cooled by two redundant cooling systems, each consisting of a water tank, a pump (availability 0.7), and a power source in series. Either cooling system will fail if any of its devices fails, but the jacket will be cooled if either cooling system works. What is the probability that the jacket will not be cooled?

[Figure: two parallel cooling paths, each a series of water tank, pump, and power source]
EXAMPLE 7-7
[Figure: system with redundant sensor-valve paths and a shared controller]
The chances of successful operation for the next month are now 0.8954 (90%). This is a substantial improvement over the results of Example 7-1, in which a one-month reliability of 0.6406 was calculated.
The MTTF of a parallel system is obtained in the same way:
$$\text{MTTF}_{SYSTEM} = \int_0^{\infty} [R_A + R_B - R_A R_B]\,dt$$
for a two device parallel system. If these two devices have constant failure rates (exponential PDF), then:
$$\text{MTTF}_{SYSTEM} = \int_0^{\infty} \left[e^{-\lambda_A t} + e^{-\lambda_B t} - e^{-(\lambda_A + \lambda_B)t}\right] dt = \left[-\frac{1}{\lambda_A}e^{-\lambda_A t} - \frac{1}{\lambda_B}e^{-\lambda_B t} + \frac{1}{\lambda_A + \lambda_B}e^{-(\lambda_A + \lambda_B)t}\right]_0^{\infty}$$
$$\text{MTTF}_{PARALLEL\ SYSTEM} = \frac{1}{\lambda_A} + \frac{1}{\lambda_B} - \frac{1}{\lambda_A + \lambda_B} \qquad \text{(7-13)}$$
Notice that the MTTF does not equal one over λ. A parallel system of devices with constant failure rates no longer has an exponential PDF:
$$\text{MTTF}_{PARALLEL\ SYSTEM} \ne \frac{1}{\lambda_{TOTAL}}$$
The equation MTTF = 1/λ does not apply to any system that has parallel devices. This includes any system that has redundancy. Triple modular redundant systems, dual systems, and partially redundant systems are all in this category.
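Equation 7-13 can be checked numerically. A sketch with two assumed failure rates that integrates the parallel-system reliability of Equation 7-9 and compares with the closed form:

    import math

    la, lb = 1.0e-4, 2.0e-4  # assumed constant failure rates, per hour

    def r_parallel(t):
        # Equation 7-9 with exponential device reliabilities
        ra, rb = math.exp(-la * t), math.exp(-lb * t)
        return ra + rb - ra * rb

    t_end, steps = 2.0e5, 200000
    dt = t_end / steps
    mttf_numeric = sum(r_parallel((i + 0.5) * dt) for i in range(steps)) * dt

    mttf_formula = 1 / la + 1 / lb - 1 / (la + lb)  # Equation 7-13
    print(mttf_numeric, mttf_formula)               # both ~11667 hours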
k-out-of-n Systems
Many systems, especially load sharing systems, are designed with extra devices. In systems where the remaining devices can carry the load, only k of the n installed devices are needed for system success. Consider a tank instrumented with five thermocouples where any four are sufficient, a 4oo5 system. The number of combinations of four devices taken from five is:
$$\binom{5}{4} = \frac{5!}{4!(5-4)!} = 5$$
Notice the notation used. Two numbers stacked within parentheses is the notation for combinations of numbers. (See Appendix C for a detailed explanation of combinations.)
[Figure: five thermocouples installed on a tank]
Given that there are five combinations of four devices, we can build a
series/parallel reliability block diagram. The block diagram in Figure 7-12
has five parallel paths. Each path has four devices in series. An examina-
tion of the block diagram shows that each path is a different combination
of four devices (out of a possible five).
It is easy to calculate the probability of success for one path. Using Equation 7-2, the probabilities of success for the five paths are:
$$R_{PATH1} = R_1 R_2 R_3 R_4$$
$$R_{PATH2} = R_1 R_2 R_3 R_5$$
$$R_{PATH3} = R_1 R_2 R_4 R_5$$
$$R_{PATH4} = R_1 R_3 R_4 R_5$$
$$R_{PATH5} = R_2 R_3 R_4 R_5$$
The probability of success for the system is the union of these five path probabilities. These numbers cannot merely be added; the five paths are NOT mutually exclusive. To obtain the union, the path probabilities must be added; then the combinations of two intersections must be subtracted; the combinations of three intersections must be added; then the combinations of four intersections subtracted; and the combinations of five intersections must be added. This equation would be quite long!
Fortunately, if all devices are the same, the calculation can be drastically simplified. Note that the intersection of path 1 and path 2 is:
$$R_{PATH1} \cap R_{PATH2} = R_1 R_2 R_3 R_4 R_5 = R_I$$
The same result occurs for all intersections, including intersections of two at a time, three at a time, four at a time, and even five at a time. Taking advantage of this fact, a reasonable equation can be written for the union of the five paths:
$$R_{SYSTEM} = \sum_{j=1}^{5} R_{PATHj} - \binom{5}{2}R_I + \binom{5}{3}R_I - \binom{5}{4}R_I + \binom{5}{5}R_I = \sum_{j=1}^{5} R_{PATHj} - 10R_I + 10R_I - 5R_I + 1R_I$$
If all the devices in the model are the same, further simplification is possible. In the thermocouple system example, all the thermocouples are the same. Therefore:
$$R_1 = R_2 = R_3 = R_4 = R_5 = R$$
$$R_{SYSTEM} = 5R^4 - 4R^5 \qquad \text{(7-14)}$$
The general form of the equation can be derived by answering the question, "When is the system successful?" For the 4oo5 system, the system is successful when exactly four devices are good and one is bad, or when all five devices are good.

What is the probability of getting one combination of four good and one bad? This can be written as
$$R \times R \times R \times R \times (1 - R)$$
In general, the probability of one combination of k good devices and n - k bad devices is
$$R^k (1 - R)^{n-k}$$
Since there are $\binom{n}{k}$ such combinations, the probability of exactly k successes out of n devices is
$$\binom{n}{k} R^k (1 - R)^{n-k} \qquad \text{(7-15)}$$
The system succeeds when k or more devices are good, so the terms for k through n are summed:
$$R_{SYSTEM} = \sum_{j=k}^{n} \binom{n}{j} R^j (1 - R)^{n-j} \qquad \text{(7-16)}$$
For the 4oo5 thermocouple problem, n equals five and k equals four. Substituting these values into Equation 7-16:
$$R_{SYSTEM} = 5[R^4(1-R)^1] + 1[R^5(1-R)^0]$$
Simplifying yields
$$R_{SYSTEM} = 5R^4 - 4R^5 \qquad \text{(7-17)}$$
Of course, this is the same as Equation 7-14, showing that both derivation methods yield the same result.
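Equation 7-16 is short enough to implement directly. A sketch of the binomial form, checked against Equation 7-17 for the 4oo5 case:

    from math import comb

    def r_koon(k, n, r):
        # Equation 7-16: sum the exactly-j-successes terms for j = k..n
        return sum(comb(n, j) * r**j * (1 - r)**(n - j) for j in range(k, n + 1))

    r = 0.95
    print(r_koon(4, 5, r))       # 4oo5 system
    print(5 * r**4 - 4 * r**5)   # Equation 7-17, same value (0.97741...)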
[Figure: three control computers feeding voting logic]
EXAMPLE 7-9
Problem: Each of three control computers in a 2oo3 voting configuration has an availability of 0.95. What is the availability of the 2oo3 computer group?
Solution: This system requires that two out of three (2oo3) control computers be successful. The voting logic circuit must also be successful. The reliability block diagram is shown in Figure 7-14. We will first solve the left side of the reliability block diagram. Using Equation 7-16 with n = 3 and k = 2:
$$A_{2oo3} = 3[(0.95)^2(1-0.95)^1] + [(0.95)^3(1-0.95)^0] = 3(0.95)^2 - 2(0.95)^3 = 0.99275$$
The control system shown in Figure 7-15 has three sensors, two controllers, and two valves. For the system to be successful, a sensor must signal a controller-valve set. The two controllers each have two inputs. Each controller is wired to two of the sensors and can operate successfully if one of the two sensors to which it is wired is successful. The sensors have an availability of 0.8. The controllers have an availability of 0.95. The valves have an availability of 0.7.

To start the analysis, each controller and its associated valve can be treated as a series system with an availability of 0.95 × 0.7 = 0.665.
The probability of success for this system can be obtained using the event
space method. With five devices in the block diagram, 32 (25 = 32) combi-
nations of devices can be expected. The combinations are listed in groups
according to the number of failed devices. Group 0 has one combination,
all devices successful. It is listed as item 1 in Table 7-1.
Next, all combinations of one failure are listed. With five devices, five
combinations are expected, with failed devices marked with an asterisk.
These are listed as items two through six in Table 7-1.
In the next step, all combinations of two failures are listed. The equation
for combinations (Equation C-16, Appendix C) is used to determine that
$$\binom{5}{2} = \frac{5!}{2!(5-2)!} = 10$$
combinations of two failures exist. These are listed as items 7 through 16. Similarly,
$$\binom{5}{3} = \frac{5!}{3!(5-3)!} = 10$$
combinations of three failures exist.
After listing all combinations of failed devices, the combinations that will cause system failure are identified. Of course, the combination in Group 0 represents system success, and a path exists across the block diagram for every combination of one failed device. It can be concluded, then, that all Group 1 combinations represent system success. Group 2 must be examined carefully; however, combination 7 has devices A and B failed. Looking again at Figure 7-16, a path still exists across the block diagram using devices C and E; therefore, the system is still successful with combination 7. As the other combinations of two failures are examined, no system failures can be found until combination 16. That combination has devices D and E failed. There is no path across the block diagram when these devices fail. This is illustrated in Figure 7-17.
The process continues through the remaining groups. Table 7-5 compiles the list of all combinations and shows which combinations are successful.
Table 7-5 (continued). Event space combinations; an asterisk marks a failed device.
11 A B* C* D E Success
12 A B* C D* E Success
13 A B* C D E* Success
14 A B C* D* E Success
15 A B C* D E* Success
16 A B C D* E* Failure
Group 3:
17 A* B* C* D E Failure
18 A* B* C D* E Success
19 A* B* C D E* Failure
20 A B* C* D* E Failure
21 A B* C* D E* Success
22 A* B C* D* E Success
23 A B C* D* E* Failure
24 A* B C D* E* Failure
25 A B* C D* E* Failure
26 A* B C* D E* Success
Group 4:
27 A B* C* D* E* Failure
28 A* B C* D* E* Failure
29 A* B* C D* E* Failure
30 A* B* C* D E* Failure
31 A* B* C* D* E Failure
Group 5:
32 A* B* C* D* E* Failure
If the combination probabilities are added, the result should equal 1. This
is a good checking mechanism.
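The event space method is mechanical enough to automate. This sketch enumerates all 32 combinations for the five-device diagram, using the success paths A-D, B-D, B-E, and C-E (the paths behind Equation 7-19) and the availabilities from the example:

    from itertools import product

    avail = {"A": 0.8, "B": 0.8, "C": 0.8, "D": 0.665, "E": 0.665}
    paths = [("A", "D"), ("B", "D"), ("B", "E"), ("C", "E")]  # success paths

    total = 0.0
    for states in product([True, False], repeat=5):
        up = dict(zip("ABCDE", states))
        p = 1.0
        for name, ok in up.items():
            p *= avail[name] if ok else 1.0 - avail[name]
        if any(up[x] and up[y] for x, y in paths):
            total += p  # at least one success path is intact
    print(total)  # system availability, approximately 0.866 with these values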
Adding the four path probabilities gives the success approximation:
$$R_{SYSTEM} = R_{AD} + R_{BD} + R_{BE} + R_{CE} \qquad \text{(7-19)}$$
A more accurate result subtracts the pairwise intersections:
$$R_{SYSTEM} = R_{AD} + R_{BD} + R_{BE} + R_{CE} - (R_{AD} \cap R_{BD}) - (R_{AD} \cap R_{BE}) - (R_{AD} \cap R_{CE}) - (R_{BD} \cap R_{BE}) - (R_{BD} \cap R_{CE}) - (R_{BE} \cap R_{CE}) \qquad \text{(7-20)}$$
Again, not a very useful result. The only thing determined from the suc-
cess approximation approach is that there is a probability of success some-
where between 0.07395 and 1. The method works much better when
device probabilities of success are low. In our example, the numbers are
near 1.
Exercises
7.1 A system consists of a power source, a pump, and a valve. The sys-
tem operates only if all three components operate. The power
source has an availability of 0.8. The pump has an availability of
0.7. The valve has an availability of 0.75. Draw a reliability block
diagram of the system. What is the system availability?
7.2 A second pump and a second valve are added in parallel with the
system of Exercise 7.1. This is shown in Figure 7-18. What is the
system availability?
7.3 A control system requires four sensors, an input module, a controller, an output module, and two valves, all of which must operate for system success. The device availabilities are:
Sensor 0.8
Input Module 0.95
Controller 0.9
Output Module 0.95
Valve 0.75
What is the system availability?
7.4 A controller consultant suggests that your control system (the sys-
tem from Exercise 7.3) could be improved by adding a redundant
controller. If a second controller is put in parallel with the first,
how much does system availability improve?
7.5 A nonrepairable controller module has a constant failure rate of
500 FITS. What is the MTTF for the controller?
7.6 Two nonrepairable controller modules with a constant failure rate
of 500 FITS are used in parallel. What is the system MTTF?
7.7 You wish to approximate the probability of failure for a reliability
block diagram. All block diagram devices have a probability of
success in the range of 0.95 to 0.998. Which model method should
you use?
7.8 You have a reliability block diagram with six devices. Each device
has one failure mode. How many combinations must be listed in
an event space evaluation?
7.9 You have a reliability block diagram with four devices. Each
device has two failure modes. How many combinations must be
listed in an event space evaluation?
Answers to Exercises
7.1 This is a series system. The system availability is 0.8 0.7 0.75 =
0.42
7.2 System availability equals 0.6195.
7.3 System availability equals 0.187, not very impressive for compo-
nents with such high availabilities. Note that the system availabil-
ity is always much lower than component availabilities. (Did you
get an answer of 0.487? Do not forget there are four sensors and
two valves.)
7.4 System availability now equals 0.206, not much of an increase.
7.5 MTTF = 2,000,000 hours
7.6 System MTTF = 3,000,000 hours
7.7 Approximate using probabilities of failure. The approximation methods work best when the probabilities being combined are small; with success probabilities of 0.95 to 0.998, the failure probabilities (0.002 to 0.05) are small.
7.8 With one failure mode, each device has two states, so 2^6 = 64 combinations must be listed.
7.9 With two failure modes, each device has three states, so 3^4 = 81 combinations must be listed.
Repairable Systems
Repairable systems are typical in an industrial environment. It is possible
and highly desirable to install systems in places where they can be
repaired. Such systems offer many advantages in terms of system avail-
ability and safety. Several different fault tolerant system configurations of
repairable modules have been created. Some systems are fully repairable.
All components in the system can be repaired. In some systems, not all
components can be repaired. These are called partially repairable systems.
Markov Models
Markov modeling, a reliability and safety modeling technique that uses state diagrams, can fulfill these modeling needs. The Markov modeling technique uses only two simple symbols (Figure 8-1). It provides a complete set of evaluation tools when compared with many other reliability and safety evaluation techniques (Ref. 1).
[Figure 8-1. Markov model symbols: state (circle) and transition arc (arrow)]
An example Markov model is shown in Figure 8-2. This model shows how
the symbols are used. Circles (states) show combinations of successfully
operating devices and failed devices. Each state is given an identification
number unique to each model. Possible device failures and repairs are
shown with transition arcs, arrows that go from one state to another. Some states are called transient states. These states have both in and out arcs (Figure 8-2, states 0, 1, 2, and 4). Some states are called absorbing states. These have only incoming arcs (Figure 8-2, states 3 and 5).
[Figure 8-2. Example Markov model with states: 0 OK; 1 Degraded Detected; 2 Degraded Undetected; 3 Fail Energized; 4 Comm loss; 5 Fail Safe]
Some states represent system success, while others represent system failure states. It should be noted that multiple failure modes for a device can be shown on one drawing with more than one failure state circle (Figure 8-2, state 3 and state 5).
The Markov model building technique involves the definition of all mutu-
ally exclusive success/failure states in a device. These are represented by
labeled circles. The system can transition from one state to another when-
ever a failure or a repair occurs. Transitions between states are shown
with arrows (transition arcs) and are labeled with the appropriate failure
or repair probabilities (often approximated using failure/repair rates).
This model is used to describe the behavior of the system with time. If
time is modeled in discrete increments (for example, once per hour), simu-
lations can be run using the probabilities shown in the models. Calcula-
tions can be made showing the probability of being in each state for each
time interval. Since some states represent system success, the probabilities
of these states are added to obtain system reliability or system availability
as a function of time. Many other related metrics are also obtained using
various model solution techniques.
[Figure 8-3. Markov Model, Single Nonrepairable Component: state 0 (OK) transitions to state 1 (Fail) with probability λΔt]
A single repairable component with one failure mode has a Markov model
as shown in Figure 8-4. The two states are the same as previously
described for the nonrepairable component. Two transitions are present.
The upper transition represents a failure probability, movement from state 0 to state 1. The lower transition represents a restore probability, movement from failure state 1 to success state 0. The restore rate is represented by the lowercase Greek letter mu (μ). The repair rate times a time increment (Δt) represents the probability of making that movement during the time increment.
[Figure 8-4. Markov Model, Single Repairable Component: state 0 (OK) to state 1 (Fail) with probability λΔt; state 1 back to state 0 with probability μΔt]
Time-Dependent Probabilities
Consider an industrial forging process. A forging machine stamps out
large pieces once every ten minutes (six times an hour). Records show that
1 time out of 100, the machine fails. The average repair time is 20 minutes.
A discrete time Markov model would do an excellent job of modeling this
process. A good selection for the time interval would be 10 minutes. Two
states are required; state 0, defined as success, and state 1, defined as
failure.
The system starts in state 0, success. From state 0, the system will either
stay in state 0 or move to state 1 in each time interval. There is a 1 in 100
(0.01) probability that the system will move from state 0 to state 1. For each
time interval, the system must either move to a new state or stay in the
present state with a probability of one. The probability that the system will
stay in state 0 is therefore 99 out of one hundred (1 - 0.01 = 0.99). Once the
system has failed, it will either stay in state 1 (has not been repaired) or
move to state 0 (has been repaired). The probability of moving from state 1
to state 0 in any time interval is 0.5 (10 minute interval/20 minute repair
time). The system will stay in state 1 with a probability of 1 - 0.5 = 0.5.
The Markov model for this process is shown in Figure 8-5.
Transition Matrix
The model can be represented by showing its probabilities in matrix form
(Ref. 5). An n n matrix is drawn (n equals the number of states) showing
all probabilities. This matrix is known as the Stochastic Transition Proba-
bility Matrix. It is often called the Transition Matrix, nickname P. The
transition matrix for the forging machine is written as follows:
$$P = \begin{bmatrix} 0.99 & 0.01 \\ 0.5 & 0.5 \end{bmatrix} \qquad \text{(8-1)}$$
Each row and each column represents one of the states. In Equation 8-1,
row 0 and column 0 represent state 0, while row 1 and column 1 represent
state 1. If more states existed they would be represented by additional
rows and columns. The numerical entry in a given row and column is the
probability of moving from the state represented by the row to the state
represented by the column. For example, in Equation 8-1, the number in
row 0, column 1 (0.01) represents the probability of moving from state 0 to
state 1 during the next time interval. The number in row 0, column 0 (0.99)
represents the probability of moving from state 0 to state 0 (i.e., of
remaining in state 0) during the next time interval. The other entries have
similar interpretations. A transition matrix contains all the necessary
information about a Markov model. It is used as the starting point for
further calculations.
Steady-State Availability
The behavior of the forging system can be seen by following the tree dia-
gram in Figure 8-6 (Ref. 6). Starting at state 0, the system moves to state 1
or stays in state 0 during the first time interval (T). Behavior during subse-
quent time intervals is shown as the tree diagram branches out. The step
probabilities are marked above each arrow in the diagram.
Certainly, one of the most commonly asked questions about a system like
this is, How much system downtime should we expect? A system is
down when it is in the failure state. With a Markov model, we can
translate this question into, What is the average probability of being in
state 1? The probability of being in state 1 can be calculated using the tree
diagram.
Consider, for example, the path that stays in state 0 for time intervals one, two, three, and four. The probability of following this path is calculated in Equation 8-2:
$$0.99 \times 0.99 \times 0.99 \times 0.99 = 0.9606 \qquad \text{(8-2)}$$
This procedure can be followed to find path probabilities for each time
interval. At time interval one, two paths exist. The upper path has one step
with a probability of 0.99. The lower path has one step with a probability
of 0.01. To find the total probabilities of being in each state after the time
interval, add all paths to a given state. For time interval one, there is only
one path to each state so no addition is necessary. The probability of being
in state 0 equals 0.99 and the probability of being in state 1 equals 0.01.
Notice that the probability of being in either state 0 or state 1 is one! (0.99 +
0.01 = 1). At any time interval, the state probabilities should sum to one.
This is a good checking mechanism.
After the second time interval, the probability of being in state 0 equals the
sum of two paths as calculated in Equation 8-3:

$$(0.99 \times 0.99) + (0.01 \times 0.5) = 0.9851 \qquad \text{(8-3)}$$
The probability of being in state 1 also equals the sum of two paths. The
same method is used repeatedly to obtain the path probabilities for time
intervals three and four as shown in Table 8-2.
Notice that the values are changing less and less with each time interval. A
plot of the two probabilities is shown in Figure 8-7. The values are heading
toward a steady state. This behavior is characteristic of fully repairable
systems. If the tree diagram were fully developed, and those values
reflected in Table 8-2, it would be seen that the steady-state probability of
being in state 1 is 0.01961. This is the answer to the question about down-
time. We should expect the system to be down 1.961% of the time on the
average. For such fully repairable systems with constant failure and repair
rates, downtime is known as steady-state unavailability. Note that in
this simple example, the same number could be calculated from the failure
records if detailed maintenance records are kept.
The tree diagram approach is not practical for Markov models of realistic
size. However, there is another method for calculating steady-state proba-
bilities. Remember that the transition matrix, P, is a matrix showing prob-
abilities for moving from any one state to another state in one time
interval. This matrix can be multiplied by itself to get transition probabili-
ties for multiple time intervals.
$$P^2 = P \times P = \begin{bmatrix} 0.99 & 0.01 \\ 0.5 & 0.5 \end{bmatrix} \begin{bmatrix} 0.99 & 0.01 \\ 0.5 & 0.5 \end{bmatrix} = \begin{bmatrix} 0.9851 & 0.0149 \\ 0.745 & 0.255 \end{bmatrix}$$
This process can be continued as long as desired to obtain the n-step prob-
ability transition matrix. For example:

$$P^4 = P^2 \times P^2 = \begin{bmatrix} 0.98152 & 0.01848 \\ 0.92387 & 0.07613 \end{bmatrix}$$

After multiplying these further, notice that the result changes less and less
with each step.
A point is reached where $P^{n+1} = P^n$. The numbers will not change further.
This matrix, labeled $P^L$, is known as the limiting state probability
matrix:

$$P^L = \begin{bmatrix} 0.98039 & 0.01961 \\ 0.98039 & 0.01961 \end{bmatrix}$$

The top and bottom rows of the limiting state probability matrix are
the same numbers. The probability of going to state 0 in n steps is the same
regardless of starting state.
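This convergence is easy to demonstrate numerically; a short sketch along the lines of the discussion above (NumPy's matrix_power simply multiplies P by itself n times):

```python
import numpy as np

P = np.array([[0.99, 0.01],
              [0.50, 0.50]])

# Raise P to successively higher powers and watch the rows converge
for n in (1, 2, 4, 8, 16, 32):
    print(n, np.linalg.matrix_power(P, n).round(5))
# By n = 32 both rows equal [0.98039  0.01961]; the n-step transition
# probabilities no longer depend on the starting state.
```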
State probabilities for each time interval can also be obtained by multiplying a row vector of state probabilities, S, by the transition matrix: $S^{n+1} = S^n P$.
S0 is the starting probability list (time interval 0). For example, if a system
always starts in one particular state, S0 will contain a single one and a
quantity of zeros. The forging machine example always starts in state 0.
The starting probability row S would be

$$S^0 = [1 \quad 0] \qquad \text{(8-4)}$$
As with P, the numbers change less and less each time. Eventually, there is
no significant change:

$$S^{18} = S^{17} = S^L \qquad \text{(8-5)}$$

$$S^L P = [0.98039 \quad 0.01961] \begin{bmatrix} 0.99 & 0.01 \\ 0.5 & 0.5 \end{bmatrix} = [0.98039 \quad 0.01961]$$

Again, the limiting state probability row, labeled $S^L$, has been reached.
The limiting state probability matrix can be created by merely replicating
this row as often as necessary, taking advantage of the fact that all rows in
the limiting state probability matrix are the same.
The limiting state probabilities have been reached when $S^n$ multiplied
by P equals $S^n$. This fact allows an algebraic relationship to solve the prob-
lem directly. Limiting state probabilities exist when

$$S^L P = S^L \qquad \text{(8-6)}$$

Written out for the forging machine, this gives

$$0.99 S_1^L + 0.5 S_2^L = S_1^L \qquad \text{(8-7)}$$

and

$$0.01 S_1^L + 0.5 S_2^L = S_2^L \qquad \text{(8-8)}$$
Rearranging Equation 8-7:

$$0.01 S_1^L = 0.5 S_2^L$$
The problem has two variables and only one equation; it would appear
that no further progress can be made. However, an earlier rule can help.
The probabilities in a row should always add to one. This gives the addi-
tional equation:
$$S_1^L + S_2^L = 1 \qquad \text{(8-9)}$$
Substituting,

$$S_2^L = \frac{0.01}{0.5} S_1^L$$

$$S_1^L + \frac{0.01}{0.5} S_1^L = 1$$

Finally:

$$S_1^L = \frac{1}{1.02} = 0.98039$$

$$S_2^L = 1 - 0.98039 = 0.01961$$
This method works for any fully repairable system in which each state represents either
system success or system failure. All failure probabilities and repair prob-
abilities are assumed to be constant. To calculate steady-state availability,
identify the system success states and sum their probabilities. The sum of
the failure state probabilities provides the steady-state unavailability.
One success state, state 0, is present in the forging process example (Figure
8-4). Thus, steady-state availability for this forging process is 0.98039 or
98.039%. One failure state exists, state 1; therefore, steady-state unavail-
ability is 0.01961 or 1.96%.
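The algebraic method above amounts to solving a small linear system, which generalizes to any number of states. A hedged sketch (NumPy assumed; lstsq is used so the redundant balance equation does no harm):

```python
import numpy as np

def limiting_probabilities(P):
    """Solve S.P = S together with sum(S) = 1 (Equations 8-6 and 8-9)."""
    n = P.shape[0]
    A = np.vstack([P.T - np.eye(n), np.ones(n)])
    b = np.append(np.zeros(n), 1.0)
    S, *_ = np.linalg.lstsq(A, b, rcond=None)
    return S

P = np.array([[0.99, 0.01],
              [0.50, 0.50]])
print(limiting_probabilities(P).round(5))   # [0.98039  0.01961]
```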
EXAMPLE 8-1
[Figure 8-8. Control System Markov Model: success states 0, 1, and 2 (OK); failure state 3 (Fail)]

The limiting state probabilities for this model are:

$S_0^L = 0.958254$
$S_1^L = 0.018975$
$S_2^L = 0.018975$
$S_3^L = 0.003794$
EXAMPLE 8-2
The steady-state availability is the sum of the success state probabilities:

$$A_{ss} = S_0^L + S_1^L + S_2^L = 0.958255 + 0.018975 + 0.018975 = 0.996205$$
Time-Dependent Availability
When a discrete time Markov model begins at t = 0 in a particular starting
state, the availability will vary with time. Figure 8-7 showed this behavior
for the forging system example. If the system is fully repairable with con-
stant repair probabilities, the availability and unavailability will eventu-
ally reach a steady-state value. Until steady-state is reached, the
availability and unavailability values may change in significant ways.
EXAMPLE 8-3
Solution: The state probabilities are computed one step (one hour) at a
time, and the availability (the sum of the success state probabilities) is
plotted for each step.

[Figure 8-9. Availability (t): probability (0.995 to 1) versus time increment (100 to 500)]
It can be seen that availability decreases from a value of 1.0 toward the
steady-state value (0.996205 from Example 8-2) as time passes. The time
increments for this model are one hour. The numbers on the time line
shown in Figure 8-9 and Figure 8-10 are hours.
Absorbing States
Sometimes a system fails in such a way that it cannot be repaired. An obvi-
ous example of this is when a failure causes a destructive major explosion.
In other situations, it is desirable to model the failure behavior of a system
over a period during which repair of some failures is not possible.

[Figure 8-10. Unavailability (t): probability (0 to 0.004) versus time increment (100 to 500)]
A Markov model of such system behavior would show one or more failure
states from which there is no exit. State 1 in Figure 8-3 is an example. Such
states are known as absorbing states. They are typically applied when-
ever a failure occurs for which there is no feasible repair during the time
period of interest.
Time-Dependent Reliability
Reliability has been defined as the probability of system success over a
time interval. This definition of reliability does not allow for system
repair. It fits perfectly with systems that cannot be fully repaired at
the system level.
Repairs can be made, however, below the system level. A system that has
a discrete time Markov model with more than one success state may move
between those states without altering system reliability. Component or
module failures and subsequent repairs that cause movement only
between system success states do not cause system failure. When a com-
ponent or module failure causes the system to move from a success state
to an absorbing failure state, system failure occurs.
Reliability for such systems can be calculated for any time interval using
methods similar to those used to calculate availability as a function of
time.

[Figure 8-11. Partially Repairable Control System Markov Model: success states 0, 1, and 2 (OK); absorbing failure state 3 (Fail)]
EXAMPLE 8-4
Solution: Using a discrete time Markov model with a one hour time
increment, reliability is calculated by multiplying each successive S
by P, the transition matrix built from the model of Figure 8-11.
Adding the probabilities from states 0, 1, and 2, the reliability for the
first hour equals 1.0 (to the precision displayed).
Continuing the process, the reliability values for 750 hours are
calculated. These are plotted in Figure 8-12.
[Figure 8-12. Reliability (t), Partially Repairable Control System: probability falls from 1.0 toward roughly 0.95 over 750 one-hour time increments]
Hand calculation of these matrix operations quickly becomes impractical for com-
plex systems. A shortcut is available, however. Many spreadsheet pro-
grams in common use have the ability to numerically invert a matrix. This
tool can be used to make quick work of previously time-consuming MTTF
calculations.
The N matrix, obtained by inverting (I - Q), where Q is the portion of the
transition matrix covering the transient (success) states, provides the
expected number of time increments that the
system dwells in each system success state (a transient state) as a function
of starting state. In our example, the top row states the number of time
increments per transient state if we start in state 0. The middle row gives
the number of time increments if we start in state 1. The bottom row states
the number of time increments if we start in state 2. If a system always
starts in state 0, we can add the numbers from the top row to get the total
number of time increments in all system success states. When this is multi-
plied by the time increment, we obtain the MTTF when the system starts
in state 0. In our example, this number equals 26,250 hours since we used a
time increment of one hour. If we started this system in state 1, we would
expect the system to fail after 26,000 hours on the average. If we should
start the system in state 2, we would also expect 26,000 time increments to
pass until absorption (26,000 hours until failure).
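The same shortcut is just as quick in a few lines of Python. The sketch below assumes NumPy and uses illustrative failure and repair rates (not the values behind the 26,250-hour figure above) purely to show the mechanics of N = (I - Q)^-1:

```python
import numpy as np

dt = 1.0                       # time increment, hours
lam, mu = 1.0e-4, 0.25         # illustrative failure/repair rates

# Q: transient (success-state) portion of the transition matrix
Q = np.array([
    [1 - 2*lam*dt, lam*dt,            lam*dt     ],
    [mu*dt,        1 - (mu + lam)*dt, 0.0        ],
    [0.0,          0.0,               1 - lam*dt ],
])

N = np.linalg.inv(np.eye(3) - Q)  # expected dwell increments per state
print("MTTF from state 0:", N[0].sum() * dt, "hours")  # sum the top row
print("MTTF from state 1:", N[1].sum() * dt, "hours")
```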
This situation can also be accurately modeled with a discrete time Markov
model using numerical solution techniques. Consider the case of a safety
instrumented system with periodic proof testing, which typically includes
an inspection and test, or series of tests. While the time increments are
counting and have not reached the test time, the probability of a restore
operation from a fail-danger state is zero because the failure is not known.
When the time increment reaches the periodic test time, a proof test is per-
formed and repair is made if a failure is detected by the test.
The Figure 8-13 model shows a simplified 1oo2 system without common cause
or diagnostics. (For a complete model of a 1oo2 system, see Chapter 14.)
Failure rates with superscript S are fail-safe and cause an output to de-
energize. Failure rates with superscript D are fail-danger and cause an
output to energize. Restore rates for failures detected by automatic diag-
nostics have the subscript O, indicating an on-line restore. Restore rates
with subscript P indicate a restoration only when periodic proof test
detects the failure. The periodic restore arc from state 3 to state 0 is not
constant. At test time, the system either works correctly or it is repaired in
a finite period of time. At the end of the test time, the system is restored to
state 0. The probability of moving from state 3 to state 0 is a time-depen-
dent function as shown in Figure 8-14.
[Figure 8-14. Probability of the state 3 to state 0 restore transition versus operating time: zero during the operating time interval, one at the proof test time]

This model can be solved using discrete time matrix multiplication for the
case where a perfect periodic proof test is done to detect failures in state 3.
The P matrix is normally:

$$P = \begin{bmatrix}
1-(2\lambda^D+2\lambda^S)\Delta t & 2\lambda^D \Delta t & 2\lambda^S \Delta t & 0 \\
\mu_O \Delta t & 1-(\mu_O+\lambda^S+\lambda^D)\Delta t & \lambda^S \Delta t & \lambda^D \Delta t \\
\mu_S \Delta t & 0 & 1-\mu_S \Delta t & 0 \\
0 & 0 & 0 & 1
\end{bmatrix}$$
When the time increment counter equals the end of the proof test and
repair period, the matrix is changed to represent the known probabilities of
failure. The P matrix used then is:

$$P_{test} = \begin{bmatrix}
1 & 0 & 0 & 0 \\
1 & 0 & 0 & 0 \\
1 & 0 & 0 & 0 \\
1 & 0 & 0 & 0
\end{bmatrix}$$

This matrix represents the probability of detecting and repairing the fail-
ure. The 1 in each row indicates the assumption of perfect proof testing and repair.
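The time-dependent solution described here is a loop: multiply S by the normal P each increment, and substitute the proof-test matrix whenever the counter reaches the test time. A sketch under assumed (illustrative) rates and a one-year test interval:

```python
import numpy as np

dt = 1.0                                   # one-hour increments
lD, lS, muO, muS = 1e-6, 4e-6, 0.1, 0.25   # illustrative rates

P = np.array([
    [1-(2*lD+2*lS)*dt, 2*lD*dt,          2*lS*dt,  0    ],
    [muO*dt,           1-(muO+lS+lD)*dt, lS*dt,    lD*dt],
    [muS*dt,           0,                1-muS*dt, 0    ],
    [0,                0,                0,        1    ],
])
P_test = np.zeros_like(P)
P_test[:, 0] = 1.0          # perfect proof test: every state restored

S = np.array([1.0, 0.0, 0.0, 0.0])
test_interval = 8760        # proof test once per year (assumption)
worst = 0.0
for t in range(1, 2 * test_interval + 1):
    S = S @ (P_test if t % test_interval == 0 else P)
    worst = max(worst, S[3])   # track the fail-danger probability
print("peak PFD over two test periods:", worst)
```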
Modeling Imperfect Proof Test and Repair
In the previous analysis, an assumption was made that is not realistic. The
assumption of perfect proof test and repair is made in many safety
integrity (i.e., low demand SIS verification) models because of the limita-
tions in the modeling tool. When the models are solved using discrete time
matrix multiplication, this limitation is easily overcome. To
include the effect of imperfect proof test and repair, the failure rate going
to the fail-energize (absorbing) state can be split into failures detected dur-
ing a periodic proof test and those that are not. The split is based on proof
test effectiveness, which can be determined with a FMEDA (Chapter 5).
The upgraded Markov model is shown in Figure 8-16.
[Figure 8-16. Markov model with imperfect proof test: the dangerous failure rate from the degraded state is split between state 3 (detected at the proof test) and absorbing state 4 (Fail-Energize)]

$$P = \begin{bmatrix}
1-(2\lambda^D+2\lambda^S)\Delta t & 2\lambda^D \Delta t & 2\lambda^S \Delta t & 0 & 0 \\
\mu_O \Delta t & 1-(\mu_O+\lambda^S+\lambda^D)\Delta t & \lambda^S \Delta t & E\lambda^D \Delta t & (1-E)\lambda^D \Delta t \\
\mu_S \Delta t & 0 & 1-\mu_S \Delta t & 0 & 0 \\
0 & 0 & 0 & 1 & 0 \\
0 & 0 & 0 & 0 & 1
\end{bmatrix}$$

where E is the proof test effectiveness.
When the time counter equals the end of the proof test and repair period,
the matrix is changed to represent the known probabilities of failure. The
P matrix then used is:

$$P_{test} = \begin{bmatrix}
1 & 0 & 0 & 0 & 0 \\
1 & 0 & 0 & 0 & 0 \\
1 & 0 & 0 & 0 & 0 \\
1 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 1
\end{bmatrix}$$
This matrix indicates that all failures are detected and repaired except
those from state 4, where they remain failed. A plot of the PFD (state 3 or
state 4) as a function of operating time interval is shown in Figure 8-17.
The dotted line at the bottom of the figure shows the impact of the failures
that are not detected during the proof test. Those undetected failures
cause increasing PFD as a function of operating time interval.
Modeling Notation
It is common practice not to display the Δt in Markov model drawings or
in matrix descriptions. When this is done, the drawing in Figure 8-16
would look like Figure 8-18. It is understood that a time increment is being
used in the solution of such drawings and matrix descriptions.
[Figure 8-18. Shortcut Notation for the Markov Model of Figure 8-16]
Exercises
8.1 A system contains three modules. Each module can be either oper-
ating successfully or failed. How many possible states may exist in
the Markov model for this system?
8.2 A system has the Markov model shown in Figure 8-19. The system
is fully repairable. The arcs are labeled with probabilities in units
of failure probability per hour. The system is successful in states 0
and 1. Calculate the limiting state probability row. What is the
steady-state availability?
[Figure 8-19. Markov Model for Exercise 8.2: success states 0 and 1 (OK), failure state 2]
Answers to Exercises
8.1 $2^3 = 8$
8.2 The P matrix is

$$P = \begin{bmatrix}
0.9898 & 0.01 & 0.0002 \\
0.05 & 0.945 & 0.005 \\
0 & 0.05 & 0.95
\end{bmatrix}$$

(rows and columns ordered by state 0, 1, 2; each row sums to one)
References
The system MTTF (Mean Time to Failure) can be calculated using matrix
algebra. Since MTTF is a measure that does not consider system failure
and subsequent repair, the model is modified by eliminating the repair arc
from state 2 to state 1. The P matrix for the modified model is:
$$P = \begin{bmatrix}
1-2\lambda & 2\lambda & 0 \\
\mu_o & 1-(\lambda+\mu_o) & \lambda \\
0 & 0 & 1
\end{bmatrix}$$
Using this new P matrix, solve for MTTF by following the steps detailed in
Chapter 8 (see Appendix B for Matrix math). Assume the system starts in
state 0. After inverting the (I - Q) matrix and adding the top row (reference
Appendix B for details of the derivation):
$$\text{MTTF} = \frac{3\lambda + \mu_o}{2\lambda^2} \qquad \text{(9-1)}$$
Failures that are detected by on-line diagnostics must
be distinguished from those that are not. This distinction must be made
because diagnostic coverage affects repair time.
$$\mu_o = \frac{1}{T_R} \qquad \text{(9-2)}$$
The variable TR refers to average restore time. The on-line restore rate
applies to all failures that are covered (detected by on-line diagnostics).
In Figure 9-2, a new Markov model is presented that accounts for the dif-
ference between detected failures and undetected failures in the degraded
(not fully operational) state. From state 0, a detected failure will take the
system to state 1. Repairs are made from state 1 to state 0 at the on-line
repair rate. An undetected failure will take the system to state 2. When the
system is in state 1 or in state 2, one controller is operating. From these
states another failure will cause system failure.
$$\text{MTTF}_{COV} = \frac{3\lambda + 3\mu_o - 2C\mu_o}{2\lambda^2 + 2\lambda\mu_o - 2C\lambda\mu_o} \qquad \text{(9-3)}$$
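A quick numeric check of Equations 9-1 and 9-3 (the values chosen here match Exercises 9.1 through 9.3 at the end of this chapter):

```python
def mttf_ideal(lam, mu_o):
    # Equation 9-1: ideal dual architecture
    return (3 * lam + mu_o) / (2 * lam**2)

def mttf_cov(lam, mu_o, c):
    # Equation 9-3: dual architecture with diagnostic coverage c
    return ((3 * lam + 3 * mu_o - 2 * c * mu_o) /
            (2 * lam**2 + 2 * lam * mu_o - 2 * c * lam * mu_o))

lam = 0.0001          # failures per hour
mu_o = 1.0 / 4.0      # Equation 9-2 with TR = 4 hours
print(mttf_ideal(lam, mu_o))        # 12,515,000 hours
print(mttf_cov(lam, mu_o, 0.7))     # roughly 26,651 hours
```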
Imperfect Switching
Another assumption made in the ideal model of Figure 9-1 was that no switch-
ing mechanism is present. Many practical implementations of redundancy
include a switching mechanism to select the appropriate module output.
The new system is drawn in Figure 9-4. It has an output selector switch
that chooses which module output to route to the system output.
If the failure is not detected by the diagnostics, one of two things will
happen.
A plot of MTTF versus coverage for this system is shown in Figure 9-6. The effect
of diagnostic coverage is again critical because one half of the
undetected failures cause immediate failure.
Diagnostics can detect these dangerous failures and allow the system to be
quickly repaired. There is a significant improvement in RRF when such a
system has good diagnostics (and plant maintenance policies that ensure
reasonably quick repair).
The RRF calculations assume a repair time of 24 hours. The plot shows
RRF is highest when the dangerous diagnostic coverage is 100%.
Because automatic diagnostics matter so much in safe and
reliable control systems, the ability to measure and evaluate those diag-
nostics is important. This is done using a FMEDA (Chapter 5) and verified
with testing that simulates failures and records diagnostic performance.
EXAMPLE 9-1
The superscript S is used for the safe coverage factor, $C^S$. The superscript D is used for the dangerous coverage
factor, $C^D$.
EXAMPLE 9-2
Solution: The total safe failure rate in Table 9-1 is 70.04 FITS. The
safe detected failure rate is 65.02 FITS; therefore, the safe coverage
factor = 65.02/70.04 = 0.928. The total dangerous failure rate is 23.5 FITS. The
dangerous detected failure rate is 15 FITS. The dangerous coverage
factor is 15/23.5 = 0.64. This circuit is based on a conventional PLC input circuit
with added diagnostics. Many would judge such conventional PLC
circuits to be insufficient for safety applications.
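The arithmetic of Example 9-2, written out as a tiny helper (names are illustrative):

```python
def coverage_factor(detected_fits, total_fits):
    # Coverage = detected failure rate / total failure rate
    return detected_fits / total_fits

c_safe = coverage_factor(65.02, 70.04)      # C^S for Table 9-1
c_dangerous = coverage_factor(15.0, 23.5)   # C^D for Table 9-1
print(round(c_safe, 3), round(c_dangerous, 2))   # 0.928 0.64
```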
Measurement Limitations
Although the FMEDA technique can provide good diagnostic coverage
factor numbers when done accurately, the main limitation is that the
FMEDA creator(s) and its reviewers must know about all possible failure
modes of the components used in the circuit/module/unit. Should unan-
ticipated (unknown) failure modes occur, they may or may not be
detected by the on-line diagnostics as designed. Fortunately, the failure
rates of those failure modes are likely to be very small, or someone would
already know about them. This is especially true for components that are used in
many applications. The possibility of unknown/undetected failure modes
is higher for new components. When unknown failure
modes are considered likely, this can be indicated on the FMEDA as shown
in Table 9-2. This is the FMEDA of Table 5-4 with an unknown failure
mode added. Notice that the dangerous diagnostic coverage dropped from
99.96% to 99.73%.
Diagnostic Techniques
As control computer capability increases, we expect better diagnostic cov-
erage. The computer HAL from the movie 2001 said, "I've just picked
up a fault in the AE35 unit. It's going to 100% failure within 72 hours."
While our newest machines are not yet quite at this level, new automatic
diagnostic techniques are constantly being developed and used to
improve diagnostic coverage.
High coverage factors (C > 95%) are hard to achieve. Controllers must be
designed from the ground up with self-diagnostics as a goal. Electronic
hardware must be carefully designed to monitor the correct operation of
each circuit. Software must properly interact with the hardware. In the
past, control computers have been estimated to provide 93% to 95% cover-
age [Ref. 5 and 6]. More recent designs have achieved diagnostic coverage
factors greater than 99% [Ref. 7 and 8]. To achieve these levels of coverage,
a number of diagnostic techniques have been developed. They can be clas-
sified (Chapter 4) in two ways: a comparison to a predetermined reference
and a comparison to a known good operating unit.
Reference Diagnostics
A comparison to a predetermined reference is the most commonly used
diagnostic technique. The auto mechanic measures oil pressure, wet and
dry compression pressure, or mechanical clearances. Then these results
are compared to predetermined reference values to judge whether a
failure is present.
Analog-to-digital converters are useful for more than process input mea-
surement. Certain voltages and currents within a module indicate failure.
Circuits can monitor the voltage of any power source. If a component fail-
ure causes the supply voltage to exceed bounds, the failure is detected.
Power consumption can also be measured, and many failures are indi-
cated by an increase in current.
Many digital circuits use known sequences of bit patterns. If the sequence
of binary numbers is added, a sum results. The same number should be
calculated every time the sequence repeats. If the number is different, a
failure has occurred.
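A hedged sketch of such a sequence-sum diagnostic; the sample values and the mask width are assumptions for illustration:

```python
def sequence_sum(values, mask=0xFFFF):
    # Add the known sequence of binary numbers, wrapping like a
    # fixed-width register; a healthy circuit gives the same sum
    # every time the sequence repeats.
    return sum(values) & mask

known_good = sequence_sum([0x12, 0x34, 0x56, 0x78])   # reference pass
latest = sequence_sum([0x12, 0x34, 0x57, 0x78])       # one bit flipped
print(latest == known_good)   # False: a failure has been detected
```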
Output current sensors can be used to detect open and short circuits in
output devices. If the current rises above an upper bound, a short failure is
detected. Outputs can be pulsed briefly to verify that the output is able to
switch. The normal output state is restored automatically as soon as the
switch is verified as good. The I/O power can be monitored, which allows
detection of failed I/O power or possibly a failed cutoff switch.
If a test signal injected into an input circuit does not produce the expected
reading, the input circuit has failed. This diagnostic method will detect stuck-at-
one and stuck-at-zero input circuit failures.
Analog input signals have better diagnostic potential than discrete signals.
In normal situations, an analog signal varies. One good diagnostic tech-
nique is the use of a stuck signal detector. If the analog signal gives the
exact same reading for several scans in a row, it has probably failed.
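A hedged sketch of a stuck signal detector; the scan count threshold is an assumption:

```python
from collections import deque

class StuckSignalDetector:
    """Flag an analog input that gives the exact same reading for
    several scans in a row (the threshold of 10 scans is assumed)."""

    def __init__(self, scans=10):
        self.readings = deque(maxlen=scans)

    def update(self, value):
        self.readings.append(value)
        window_full = len(self.readings) == self.readings.maxlen
        return window_full and len(set(self.readings)) == 1

detector = StuckSignalDetector()
for value in [4.02] * 10:          # a frozen analog signal
    stuck = detector.update(value)
print(stuck)                        # True: probable input failure
```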
Comparison Diagnostics
A comparison to another operating unit is also useful. Results of dynamic
calculations can be compared. In a dual configuration, any disagreement
may identify a fault. In a triple configuration, a voting circuit is used to
identify when one unit disagrees with the other two. It is likely the dis-
agreeing unit has failed. This can be a useful diagnostic technique for
many failures.
Coverage effectiveness depends on the amount of data com-
pared. In general, the more data compared, the higher the effectiveness.
A complete evaluation of diagnostic coverage for a safety function
must include the valves, sensors, field transmitters, limit switches, sole-
noids, and other devices, along with the associated wiring, junction boxes,
and connections.
Field devices with microprocessors built into them are known as "smart"
devices. The ability to put a microprocessor into a field device allows
diagnostic capabilities never before possible. Techniques formerly used in
a controller module are now practical within a field device.
Figure 9-13. Differential Pressure Measurements with Plugged Impulse Lines (Ref. 9)
A safety shutoff valve may sit motionless for years. Many component failures occur that can cause the
valve to stick. These failures include the cold welding of O-rings and seals
and corrosion between moving parts. A controller or a smart device in the
field can automatically test for this condition and indicate the failure. A
partial stroke test can be set up to move the valve a small amount. This can
detect a significant percentage of dangerous failures in the final element
assembly (Ref. 10).
Diagnosing failures in some field devices, even "dumb" field devices, can
be done with intelligent input/output circuits in a control system module.
Output current sensors can detect field device failures. If the average out-
put current exceeds an upper bound for too long, this indicates a short cir-
cuit failure of the load device or the field wiring. If an output channel is on
and a minimum current is not being drawn, this indicates open circuit fail-
ure of the field device or the associated wiring.
Comparison diagnostic techniques are popular for use with field sensors,
analog and discrete. For discrete sensors, Boolean comparisons can detect
differences in sensors, although care must be taken that scanning differ-
ences and noise do not cause false diagnostics. The logic of Figure 9-15 has
a timer to filter out temporary differences between two discrete inputs
labeled A and B. A Time OFF (TOFF) timer is used for the normally ener-
gized signals. In Figure 9-16, 2oo3 voting logic compares three discrete
inputs labeled A, B and C. The majority signal drives a coil. Additional
logic could be added to specifically compare each of the three combina-
tions of two signals. When one signal appears in two comparison mis-
matches, it is likely to be the bad signal.
[Figure 9-15. Discrete comparison diagnostic: a TOFF timer filters temporary differences before driving a diagnostic alarm]
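The 2oo3 vote and the pairwise comparison described above, sketched in Python (channel names are illustrative):

```python
from itertools import combinations

def vote_2oo3(a, b, c):
    # Majority of three discrete inputs drives the output coil
    return (a and b) or (b and c) or (a and c)

def likely_bad_channel(inputs):
    # A channel that appears in two pairwise mismatches is probably
    # the failed one.
    mismatched = [set(p) for p in combinations(inputs, 2)
                  if inputs[p[0]] != inputs[p[1]]]
    for name in inputs:
        if sum(name in m for m in mismatched) == 2:
            return name
    return None

inputs = {"A": True, "B": True, "C": False}
print(vote_2oo3(*inputs.values()), likely_bad_channel(inputs))  # True C
```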
An equivalent technique can be used for analog signals. Analog signals are sent to
a median selector. The median selector's output is used as the process sig-
nal. In addition, comparisons must be made between the analog signals, look-
ing for differences greater than a certain magnitude that last longer than a set time.
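A sketch of the analog version: median selection plus a deviation comparison that must persist before alarming (the limit and dwell values are assumptions):

```python
import statistics

def update_deviation_timers(signals, timers, limit=0.5, dt=0.1, dwell=5.0):
    """Use the median of the redundant signals as the process signal;
    accumulate time for any channel deviating from the median by more
    than `limit`, and alarm only after `dwell` seconds."""
    med = statistics.median(signals)
    alarms = []
    for i, s in enumerate(signals):
        timers[i] = timers[i] + dt if abs(s - med) > limit else 0.0
        if timers[i] >= dwell:
            alarms.append(i)
    return med, alarms

timers = [0.0, 0.0, 0.0]
med, alarms = update_deviation_timers([50.1, 50.2, 58.0], timers)
print(med, alarms)   # 50.2 [] -- channel 2 alarms only if it persists
```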
Exercises
9.1 Records indicate that actual repair time for the equipment in the
plant averages four hours. What is the repair rate for immediately
detected failures?
9.2 A dual controller system has the ideal Markov model shown in
Figure 9-1. Using the repair rate from exercise 9.1 and a controller
module failure rate of 0.0001 failures/hour, what is the system
MTTF?
9.3 A dual controller has a watchdog timer diagnostic with a coverage
of 0.7. Considering the more detailed Markov model of Figure 9-2
that considers diagnostic coverage, what is the MTTF using the
failure and repair rate numbers from exercise 9-2?
Answers to Exercises
9.1 The repair rate for immediately detected failures is 0.25 repairs per
hour.
9.2 Using equation 9-1, the MTTF = 12,515,000 hours or 1428 years!
(8760 hours per year)
9.3 Using equation 9-4, the MTTF now equals 26,651 hours or 3 years.
This is much less than the answer from 9.2 showing the impact of
realistic modeling that includes diagnostic coverage.
9.4 Using 9-4, the MTTF is 109,246 hours or 12.47 years.
9.5 Thermocouple burnout results in an open circuit of the thermocou-
ple. One technique for detecting this failure would involve con-
necting a small current source so that the voltage across the
thermocouple would go to a few volts (instead of the normal milli-
volts) when an open occurs.
9.6 A shorted 4-20 mA transmitter in a 2-wire circuit could be detected
if the current goes above the 20-mA rating by a significant amount
(a 21-mA threshold is recommended).
9.7 A frozen D/A converter might be detected by checking to deter-
mine if the digital output remains the same for several readings.
This assumes that normal process noise creates variations for each
reading when the D/A is working.
9.8 For most switches there is no electrical difference between a closed
switch and a short circuit. Therefore detection of such a failure is
extremely hard. Most control system designers are replacing such
switches with analog transmitters, which offer better diagnostic
capability.
References
1. Smith, S. E. "Fault Coverage in Plant Protection Systems." ISA
Transactions, Vol. 30, No. 1. Research Triangle Park, NC: ISA, 1991.
10. van Beurden, I. and Amkreutz, R. "The Effects of Partial Valve Stroke
Testing on SIL Level." Sellersville, PA: exida, 2001.
Common-Cause Failures
A common-cause failure is defined as the failure of more than one device
due to the same stress (cause). A common-cause failure negates the bene-
fits of a fault tolerant system [Ref. 1, 2, and 3]. For example, many fault tol-
erant systems provide two modules to prevent system failure when a
module failure occurs. If both modules fail due to a single stress, such a
fault tolerant system will fail; a similar problem occurs when a fault toler-
ant system provides three or more redundant devices. This is not a theo-
retical problem. Actual study of some field installations has shown that
the reliability metrics PFD, MTTF, and so on, are much worse than reli-
ability models predicted. The autopsy reports of failures in these situa-
tions indicate that, in some cases, more than one redundant device has
failed due to a common-cause stress.
In one case history, the door of a control rack cabinet was opened to check
on a status display. Just before the maintenance technician was finished, a
call came on the walkie-talkie: "It's time for lunch." The simple response
was, "I'll be there soon." In the cabinet were three controller modules
mounted in the same card rack in a fault tolerant system. When the
technician keyed the walkie-talkie to answer, the radio transmission so
close to the open cabinet upset all three controller modules at once, and
the redundant system failed from a single stress.
Several loose wire lugs in the bottom of a cabinet created excess resistance
in I/O conductors that normally carry several amps. These high resistance
connections were generating heat in the cabinet. Two microprocessor
cards in the cabinet were configured redundantly. As the temperature
went up, the precise timing needs of the digital signals were no longer met
and both microprocessors failed in a short period of time. The system
failed and shut down a chemical process. For this failure:
An engineer was adding new logic to a dual redundant PLC. When the
download command was sent, a "memory full" error message was
received just before both units crashed and shut down. For this failure:
A new engineer noticed that the pressure readings in the boiler burner
management system (BMS) were a little off. He recalibrated all three trans-
mitters using the wrong procedure and set all three to the wrong range. If
the pressure ever went into shutdown range, none of the three transmit-
ters would have sent the correct signal. For this failure:
Two valves were piped in series to ensure fuel flow could be shut off
when one or both were closed. If one valve should stick open, the second
valve would do the job. To avoid power dissipation during the long peri-
ods where the valves would not be used, the system was designed to be
energized to trip. A fire started near the process unit. This was sensed by
the safety instrumented system and both valve outputs were energized.
Unfortunately, the cables for the valves were routed through the same
tray and that tray was over the fire. Both cables burned, and the valves did
not close. For this failure:
All of the things that cause failure can be the source of a common-cause
failure. As we have seen, they may be internal and include design errors
and manufacturing errors. They may be external and include environmen-
tal stress, maintenance faults, and operational errors.
Electrical stress includes voltage spikes, lightning, and high current levels.
Mechanical stress includes shock and vibration. Chemical stress includes
corrosive atmospheres and salt air. Physical stress includes temperature
and humidity.
Design errors, most often software design errors, are a major source of
common-cause failure. Consider the design process. The complexity of
modern control systems increases the chances of design errors. In addi-
tion, during product development, testing is extensive, but system com-
plexity may prevent complete testing. In many cases, the system is not
tested in such a way that a design fault is apparent. Then, one day, the sys-
tem receives inputs that require proper operation of something that was
designed incorrectly. System elements do not operate as needed. The sys-
tem does not operate properly. By definition, this is a system failure. If
redundant elements are identical, they will suffer from the same design
errors and may fail in exactly the same way, given the same inputs.
Common-Cause Modeling
There are several models available to predict the effects of common-cause
susceptibility. One of the models should be used in the safety and reliabil-
ity analysis of redundant systems. Without proper consideration of com-
mon-cause effects, safety and reliability models can produce extremely
optimistic results.
The area within the rectangle of Figure 10-3 represents the total rate of
stress events (stress rate) where stress is high enough to cause a failure.
When only one component is subjected to the stress, the stress rate equals
the failure rate. Thus, the area within the rectangle represents the rate at
which one or more components fail due to stress, that is, the failure rate. Over a
portion of the area, stress is high enough that two or more units fail due to
stress. That portion is designated with the Greek lower case letter beta, β.
The beta factor is used to divide the failure rate into the common-cause
portion, $\lambda^C$, and the normal (independent) portion, $\lambda^N$. The following
equations are used:
$$\lambda^C = \beta\lambda \qquad \text{(10-1)}$$

and

$$\lambda^N = (1-\beta)\lambda \qquad \text{(10-2)}$$
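Equations 10-1 and 10-2 in executable form (a trivial sketch):

```python
def beta_split(lam, beta):
    # Equations 10-1 and 10-2: divide a device failure rate into its
    # common-cause and normal (independent) portions.
    lam_c = beta * lam
    lam_n = (1 - beta) * lam
    return lam_c, lam_n

print(beta_split(25_000, 0.1))   # (2500.0, 22500.0) FITS, as in Example 10-1
```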
EXAMPLE 10-1
Problem: A dual power supply system is modeled with three states. In
state 0, both power supplies are successfully operating. In state 1, one power supply has failed but the
system is successful. In state 2, both power supplies have failed
and the system has failed. If common cause were not considered, $\lambda_1$
would be 50,000 FITS (25,000 FITS for each operating power
supply), $\lambda_2$ would be 25,000 FITS, and $\lambda_3$ would be zero. What would
the failure rates be for $\lambda_1$, $\lambda_2$, and $\lambda_3$ if the beta factor is 0.1?
Solution: Using the beta model, the failure rates for each power
supply are divided into normal and common cause. Using a beta
factor of 0.1:

$$\lambda^C = \beta\lambda = 0.1 \times 25{,}000 = 2{,}500 \text{ FITS}$$

$$\lambda^N = (1-\beta)\lambda = 0.9 \times 25{,}000 = 22{,}500 \text{ FITS}$$

For the beta factor of 0.1, the common-cause failure rate is 2,500
FITS. The normal mode failure rate is 22,500 FITS for each power
supply. The failure rates in the Markov model are $\lambda_1 = 2 \times 22{,}500 =
45{,}000$ FITS, $\lambda_2 = 25{,}000$ FITS, and $\lambda_3 = 2{,}500$ FITS (the
common-cause arc).
The MESH model differs from the beta model in that it assumes that all failures are due to a stress event (all failures
occur when stress exceeds an associated strength; see Chapter 3).
The objective of the MESH model is to calculate the failure rates of one,
two, three, or more failures per stress event. These failure rates are defined
as:
λ(1) = the failure rate where one unit fails per stress event;
λ(2) = the failure rate where two units fail per stress event;
λ(3) = the failure rate where three units fail per stress event;
and so on, up to
λ(n) = the failure rate where n units fail per stress event.
The calculation starts with an estimate of the probability that one, two,
three, etc. units will fail per stress event. These probabilities are:
P1 = The probability that one unit will fail per stress event;
P2 = The probability that two units will fail per stress event;
P3 = The probability that three units will fail per stress event;
and so on.
Pn = The probability that n units will fail per stress event.
Note that the sum of all the probabilities must equal one since every stress
event will fail some quantity of units.
The stress event rate is represented by the Greek letter nu, $\nu$. If only one
unit is exposed, the stress event rate will equal the failure rate, $\nu = \lambda$. If
multiple units (n = the number of units) are exposed to the stress event,
then sometimes more than one unit will fail with each stress event. Under
such conditions, the stress event rate is less than n times the failure rate:

$$\nu < n\lambda \qquad \text{(10-3)}$$

The relationship between stress event rate and individual failure rates is
given by

$$\nu = n\lambda / M \qquad \text{(10-4)}$$
where M is the average number of units failed per stress event. This can be
calculated using the formula for the expected value of a discrete probabil-
ity density function (Equation 2-8).
Once M is calculated, the stress event rate can be calculated using Equa-
tion 10-4. The failure rates for one unit, two units, three units, and so forth
are calculated using:

$$\lambda(1) = P_1\nu; \quad \lambda(2) = P_2\nu; \quad \lambda(3) = P_3\nu; \quad \ldots$$
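A sketch of the MESH calculation under assumed per-event probabilities (the P values below are illustrative estimates, not data from the text):

```python
def mesh_rates(lam, n, p):
    """MESH model: lam = single-unit failure rate, n = units exposed,
    p[k] = probability that k units fail per stress event."""
    assert abs(sum(p.values()) - 1.0) < 1e-9   # probabilities sum to one
    M = sum(k * pk for k, pk in p.items())     # mean units failed/event
    nu = n * lam / M                           # Equation 10-4
    return {k: pk * nu for k, pk in p.items()} # lambda(k) = Pk * nu

# Two exposed units, assuming P1 = 0.9 and P2 = 0.1:
print(mesh_rates(lam=25_000, n=2, p={1: 0.9, 2: 0.1}))  # rates in FITS
```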
EXAMPLE 10-2
[Figure 10-5. Markov Model of Dual Power System with MESH Failure Rates]
EXAMPLE 10-3
EXAMPLE 10-4
Some references multiply the beta number by three when modeling 2oo3
systems [Ref. 6, 7 and 8]. The 2oo3 architecture fails if two of the three
units fail. Therefore, the beta number should be multiplied by three
because there are three combinations of two units exposed to the stress.
Any combination of those three will cause system failure. However, com-
mon-cause failure simulation and further research show that the multi-
plier is 3/2 [Ref. 9]. Examples of how this is used are in Chapter 14 where
2oo3 systems have a 3/2 multiplier on the beta factor.
Common-Cause Avoidance
Control system designers at all levels must recognize that common-cause
failures drastically reduce system safety and reliability in redundant sys-
tems. Therefore these systems must be designed to achieve desired reli-
ability goals even when common-cause failure rates are included in
reliability models. Designers must recognize the failure sources that are
responsible for common-cause failures. Specific solutions must be imple-
mented to combat common-cause failures. The common-cause defense
rules can be grouped into categories that result in three basic rules.
The technique has been tested [Ref. 10 and 11] and has had some success
in hardware and software. However, there are serious tradeoffs and cost
considerations that must be taken into account. The testing has shown
design diversity has not eliminated all common-cause design errors. Sys-
tem-level design errors have not been eliminated. In addition, many new
design faults can be introduced by the diversity itself.
There are significant tradeoffs between the extra effort required to repli-
cate a design more than once (extra design training, extra design docu-
mentation, extra maintenance training, extra spare parts, etc.) and the
effort required to avoid faults during design. The extra complexity created
when connecting diverse machines into a fault tolerant system creates
design faults. Given the new, inevitable problems caused by diversity,
such systems should be approached with caution.
EXAMPLE 10-5
EXAMPLE 10-6
identical in design and manufacture. What is the estimated beta
factor?
EXAMPLE 10-7
In practice, the more complex common-cause models often offer
little benefit. In systems with two or three redundant units, it is not neces-
sary to use multiple parameter common-cause models. The simplest
approach to modeling triple systems is to use the beta model with the 3/2
factor when three units are exposed to the same stress [Ref. 9] and beta
alone when two units are exposed to the same stress.
Occasionally (as with nuclear systems) a fourth unit is added to the design
so that three units can be active even if one unit is removed from service
for maintenance. In systems with four redundant units an accurate model
might use the MESH or the extended beta model for common-cause fail-
ures. When estimating parameters for the MESH model, most agree that
$P_2 > P_3 > P_4$, etc. Most estimates put the ratio of successive factors in a range of
2X to 10X. When using the extended beta model, most estimates put the
$\beta_2/\beta_3$ ratio at a value of 2X to 10X as well. Little hard statistical evidence
has been published to support these estimates, however.
EXAMPLE 10-8
When including the effects of common cause, the block diagram of Figure
10-6 is used. The normal failure rate for a power supply is $\lambda^N = (1-\beta)\lambda = 0.95
\times 0.00005 = 0.0000475$ failures per hour. The common-cause failure rate is $\lambda^C =
\beta\lambda = 0.05 \times 0.00005 = 0.0000025$ failures per hour.
The reliability for the power system including common cause = 0.99785
× 0.99750 = 0.99536.
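These figures can be reproduced directly; the sketch below assumes a 1,000-hour operating interval, an assumption chosen because it yields the reliability values quoted in the example:

```python
import math

lam, beta, t = 0.00005, 0.05, 1000.0   # t = 1000 h is an assumption
lam_n = (1 - beta) * lam               # 0.0000475 per hour
lam_c = beta * lam                     # 0.0000025 per hour

r_single = math.exp(-lam_n * t)        # one supply, normal failures
r_pair = 1 - (1 - r_single) ** 2       # 1oo2 parallel block: 0.99785
r_cc = math.exp(-lam_c * t)            # common-cause series block: 0.99750
print(round(r_pair, 5), round(r_cc, 5), round(r_pair * r_cc, 5))
# 0.99785 0.9975 0.99536
```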
The effects of common cause can be modeled using a fault tree in a man-
ner similar to the application of reliability block diagrams. A fault tree
showing common cause is drawn in Figure 10-7. Again, two power sup-
plies are provided, either of which can operate the system. Without con-
sidering common cause, both must fail in order for the system to fail.
When common cause is considered, it is the equivalent of another failure
that will fail the system. This is shown as another input to an OR gate.
Figure 10-9. Markov Model Showing a Three Device System Without Common Cause
Figure 10-10 shows the same system with possible common-cause failures
included. From state 0, one, two or three devices can fail per stress event.
The arc marked $\lambda_3(1)$ means three devices exposed, one fails. The arc
marked $\lambda_3(2)$ means three devices exposed, two fail. The arc marked $\lambda_3(3)$
represents the case where all three devices fail due to a common cause.
Other arcs are marked with the other possibilities for common-cause fail-
ure. The Markov modeling method is quite flexible when considering
common cause.
Figure 10-10. Markov Model Showing a Three Device System with MESH Common
Cause
Exercises
10.1 List sources of common-cause failures.
10.2 How can one software fault cause the failure of two redundant
controllers?
10.3 Describe the concept of diversity.
10.4 How could you achieve hardware and software diversity in a sys-
tem design?
10.5 Redundant pressure sensors are used to measure pressure in a
high pressure protection system. They are fully isolated electri-
cally. They are mounted several feet apart on the process vessel.
They are identical in design. What is the estimated beta factor?
10.6 A sensor has a dangerous failure rate of 0.00005 failures per hour.
Two of these sensors are used in a 1oo2 safety configuration (the
system will not fail dangerous unless both sensors fail). The beta
factor is estimated to be 0.025. What is the dangerous common-
cause failure rate?
Answers to Exercises
10.1 Design errors, manufacturing errors, maintenance faults, and oper-
ational errors. Another source is environmental stress which
includes electrical stress, mechanical stress, chemical stress (corro-
sive atmospheres, salt air, etc.), physical stress and (in software)
heavy usage.
10.2 A software fault can fail two or more redundant controllers if both
use the same software and if the two controllers are subject to the
same software stress (e.g., identical inputs, identical timing, and
identical machine state). Common-cause software failures may be
reduced with diverse software and/or asynchronous operation
and timing.
10.3 Diversity is the use of different designs in redundant components
(modules, units) in order to reduce susceptibility to a common
stress. Diversity works best when the diverse components respond
differently to a common stress. Diversity does not work well when
the diverse components use basically the same technology such
that they respond the same way to a common stress.
10.4 Hardware diversity can best be achieved when redundant compo-
nents are of different technologies, such as a mechanical switch
and an electrical switch. Some level of diversity is achieved when
programmable electronic circuits are redundant with non-pro-
grammable circuits. Software diversity is accomplished through
independent development of different programs, ideally using
different algorithms, tools, and development teams.
10.6 The dangerous common-cause failure rate is $\lambda^C = \beta\lambda =
0.025 \times 0.00005 = 0.00000125$ failures per hour.
References
1. Gray, J. "A Census of Tandem System Availability Between 1985 and
1990." IEEE Transactions on Reliability, Vol. 39, No. 4, Oct. 1990.
Software Failures
As we saw in Chapter 10, design errors, most often software design errors,
are a major source of common-cause failure. These failures are different
from most other failure types in that all software failures are inadvertently
designed into the system. Software does not wear out. Software is
"manufactured" (copied) with no undetectable duplication errors; there
are no latent manufacturing defects.
This situation leads some to believe that software failures cannot be mod-
eled using statistical techniques. The argument states that a computer pro-
gram always fails given a particular computer execution sequence.
Therefore, reliability (as mathematically defined) equals 0 for that execu-
tion sequence. Reliability equals one for input sequences that do not result
in failure. The system is completely deterministic, not probabilistic.
An industrial operator's console had been operating in the plant for two
years with no problems. A new operator joined the staff and during one of
his first shifts the console stopped updating the display screen and
responding to operator commands shortly after an alarm acknowledg-
ment. The console was powered down and restarted perfectly. There were
no hardware failures. Since the manufacturer had over 500 units in the
field with 12 million total operating hours, it was hard to believe a signifi-
cant software fault existed in such a mature product.
The problem could not be duplicated after many hours of testing so a test
engineer visited the site and interviewed the operator. During this visit it
was observed that "this guy is very fast on the keyboard." That input
allowed the problem to be traced. It was discovered that if an alarm
acknowledgment (Ack) key was struck within 32 milliseconds of the alarm
silence (Sil) key, a software routine would overwrite an area of memory and
cause the computer to crash. For this failure:
In most instances the operation was always successful and the software fault was
masked. Occasionally, the dynamic memory allocation algorithm picked
memory that had not been cleared. The system failure occurred only when
the software module did not append the zero in combination with a mem-
ory allocation in an uncleared area of memory. For this failure:
There are many things in addition to inputs that cause software to fail.
Consider the above cases. Timing of input data was involved. Size of input
data caused a failure. Changing operation (dynamic memory allocation)
contributed. There are more. With such a wide variety of failure sources,
each of which can be treated statistically, there is a solid basis for the sta-
tistical analysis of the reliability and safety of software.
We have all heard the complaints: "The network is down. My computer hung again. I forgot to save my file and just blew
away four hours' work." Our experience is far from perfect. As our soft-
ware dependency increases, our incentive for higher levels of software
reliability is greater.
Intuitively, we may guess that the software failure rate has some relation-
ship to the number of faults (human design errors, "bugs") in the soft-
ware. Fault count is a strength factor. A program with few faults is
stronger than one with many faults. Software strength is also affected by
the amount of stress rejection designed into the software. Software that
checks for valid inputs and rejects invalid inputs will fail much less fre-
quently. Consider the communications example above. Although the soft-
ware did check for correct data frame format, it did not check for correct
data format. If it had, it is likely the failure would not have occurred.
We may also guess that the software failure rate relates to the way in
which the software is used. The usage stress to a software system is the
combination of inputs, timing of inputs and stored data seen by the CPU.
Inputs and the timing of inputs may be a function of other computer sys-
tems, operators, internal hardware, external hardware, or any combination
of these things.
Software Complexity
Why are software failure rates going up? Why does the strength of soft-
ware seem to be decreasing? As mentioned earlier, many think the answer
is complexity that is growing beyond the ability of the tools that help
humans deal with complexity. To understand this growth in complexity, a
view from the microprocessor unit (MPU) is needed. An MPU always
starts the same way. It looks at a particular place in memory and expects
to find a command. If that command is valid, the MPU begins the execu-
tion of a long series of commands from that point, reading inputs and gen-
erating outputs. There are three ways to view this process: as digital states,
as a sequence of digital states called a path, or as a series of inputs.
State Space
The first digital computers were state machines: digital circuits that
moved from state to state depending on the input conditions and memory
contents. A state was represented by a number of binary bits stored on
flip-flop (bistable latch) circuits. One group of flip-flop circuits was called
the program counter. Other groups were called registers.
The machine was successful if it moved from state to state through the
state space as intended by the software engineer. But, if any bits were in
error, the machine entered an unintended state. From there it was
not likely to follow the desired path or perform its intended function. This
constituted a failure.
Path Space
A sequence of states followed by a computer during the execution of a
program is called a path. The collection of all possible paths is called the
path space. A particular path is determined by the contents of memory and
the input received by the computer during the program execution. Simple
MPUs such as those installed in appliances repeatedly execute only a few
paths; general-purpose computers execute many more.
For example, a small program containing loops might generate paths such as:

1234567
121234567
12121234567
1212121234567
123454567
12345454567
1234545454567
121212345454567
However, if we count only a single repetition through each loop, the pro-
gram has three paths: 1234567, 121234567, and 123454567. A path that
includes only a single loop iteration is called a control flow path. Con-
trol flow path structures associated with common program constructs are
shown in Figure 11-3.
Paths are identified and counted in computer programs for many reasons,
including the development of program test strategies. Theoretically, every
path should be tested. If this is done, all software design faults should be
discovered. Test coverage is a measure of the percentage of paths tested.
EXAMPLE 11-1
Problem: How many control flow paths are present in the program
diagrammed in Figure 11-4?
Solution: There are three control flow paths:

12345678,
1212345678, and
1234565678.
All program control flow paths could be tested with three test runs.
Figure 11-4. Program State Diagram
Data-Driven Paths
Control flow path testing does not account for path variations due to input
data. Whenever the number of loop iterations is controlled by input data,
each loop iteration count should be considered a different path. The pro-
gram from Figure 11-4 would have two paths for each possible data value
of time, an input obtained in step 4. For the input value of one, the paths
are 12345678 and 1212345678, because the loop from step 6 to step 5 does
not occur. Table 11-1 lists data-dependent paths for the valid data range of
1 through 5.
Even testing all these 10 paths may not find all software design faults. Cer-
tain errors of omission are not detected until the program is tested with
unexpected inputs. What happens when an input of zero is received by
our toaster program? Assume that the time value is stored in an eight-bit
binary number. The program follows its path to step 5. The number is dec-
remented. A zero is decremented to a binary minus one, represented by
eight binary ones (11111111). This is the same representation as the num-
ber 255. In step 6, the time number will not equal zero. The program will
decrement 255 times! By the time the heater is turned off, the toast will be
ashes and the kitchen will be full of smoke. Most users would consider
this behavior to be product failure (a systematic failure). In order to fully
test this program, two paths would need to be tested for all possible input
numbers. With 256 input numbers, 512 paths would need to be tested.
To prevent this failure, a data range check is added to the program: if the
input time value is invalid, the heater is never turned on and the
untoasted bread is popped up. Invalid numbers are thus rejected. Figure
11-5 shows the modified state diagram. For each valid input, two paths
exist as before (a total of 10). For all inputs above the limit, only two paths
exist, and for all inputs below the limit, only two paths exist. The path space
has been reduced. A total of only 14 paths need to be tested.
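The effect of the range check can be sketched in a few lines (the toaster logic here is an illustration of the state diagram, not production code):

```python
def toast(time_value):
    # Range check: reject invalid inputs before the countdown loop,
    # preventing the 0 -> 255 wraparound failure described above.
    if not 1 <= time_value <= 5:
        return "pop up untoasted"      # invalid input rejected
    heater_on = True                   # step: energize heater
    while time_value > 0:              # safe: value known to be valid
        time_value -= 1
    heater_on = False                  # step: de-energize heater
    return "toast done"

print(toast(3))   # toast done
print(toast(0))   # pop up untoasted, instead of 255 heating cycles
```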
Asynchronous Functions
The path count goes up by orders of magnitude whenever a computer is
designed to perform more than one function in an asynchronous manner.
Most computers implement asynchronous functions with a feature known
as an interrupt. An electronic signal is sent to the computer. The com-
puter is designed in such a way that whenever the signal is received, the
computer stops following its path, saves enough information to return
later to the same spot in the original path, and then starts following a new
path. In effect, every time an interrupt occurs, a new path is created.
Imagine our new product development team has identified a need in the
toaster market. They have estimated thousands of additional units could
be sold if we add a clock to the side of the toaster. In order to keep the cost
down, we add only a digital display and a 1-second timer. The timer will
interrupt the microprocessor. The microprocessor will update the digital
clock display. The state diagram for our enhanced toaster program is
shown in Figure 11-6.
Since the interrupt that causes the computer to go to state B may occur at
any time relative to the path execution, the timer interrupt is asynchro-
nous to the main program. There are many more paths. Consider the case
when the time input equals one. We previously had two possible paths:
12123456789A and 123456789A. If state B can occur between any of the
states of these paths, there are 11 new variations of the first path and 9 of
the second, in addition to the two original paths. Two paths became 22!
The path space has increased by an order of magnitude.
Figure 11-6. Enhanced Program State Diagram with Data Range Checking
Not testing all paths is a problem only if one of those paths results in some
unacceptable operation of the computer. This happens. Software design-
ers often cannot foresee all the possible interactions of these asynchro-
nous tasks. The probability of interaction is increased whenever common
resources, such as memory, are shared by different tasks. Asynchronous
tasks should not be used as a design solution unless necessary. When
asynchronous tasks are necessary, resources such as memory or I/O
devices should not be shared.
Once the computer starts following the path required to receive an instruction from the voice
recognition system, that path cannot be interrupted. To solve this prob-
lem, the computer logic gives the communication interrupt the highest
priority. The communication protocol normally takes less than 1 second to
complete a message. Under error conditions, such as electrical noise, mes-
sages are repeated up to 10 times. Four messages are required for the voice
recognition system to deliver a complete command.
One morning Emily goes into the kitchen and says, "Light toast," expect-
ing a 10-second heat time. She turns on the blender (generating lots of
electrical noise) and pushes down the toast lever. Tyree walks in just then
and says, "Medium toast next." The message from the voice recognition
system interrupts the toaster as its main program is decrementing the
timer. Because of the blender noise, the messages take 40 seconds. The
toast, given 50 seconds of heat instead of 10 seconds, is burned. This sys-
tematic failure occurred because the computer followed an unanticipated,
untested path.
Input Space
So far, we have viewed computer operation in terms of state space and
path space. The computer is successful if it stays in successful states; it is
successful if it follows successful paths. Other states or other paths cause
systematic failure. Dr. J. V. Bukowski explains in Reference 5 that there is a
third way of looking at the problem: Consider the inputs. An input condi-
tion or sequence of input conditions will cause a particular path to be fol-
lowed. Programs accept input from the external world and the computer's
memory. The input space is the collection of all possible input conditions or
sequences of input conditions [Ref. 5].
These models serve a number of useful purposes. They have some poten-
tial to measure expected field failure rates. This information is needed for
accurate system level reliability evaluations. The models also provide
information to prospective buyers. They indicate the level of software
quality. In addition, the models provide information useful in comparing
new software development processes. There are many other uses for the
information.
During software testing, design faults are repaired as soon as they are discovered. Thus, this test
period represents a reliability growth process. The expected failure rate
drops as reliability grows.
Basic Model
Versions of the basic model were developed independently by Jelinski-
Moranda and Shooman and were first published in 1972 [Ref. 11]. In both
versions, the model depends on the measurement of time between fail-
ures. The model assumes there is some quantity of software design faults
at the beginning of the test, and that these faults are independent of each
other. The model assumes all faults are equally likely to cause failure and
an ideal repair process is in place, in which a detected fault is repaired in
negligible time and no new faults are introduced during the repair pro-
cess. It also assumes the failure rate is proportional to the current number
of faults in the program. A constant, k, relates failure rate to the number of
faults that remain in the program. This is illustrated in Figure 11-7.
During the test period, the failure rate as a function of cumulative number
of faults equals
λ(n_c) = k [N₀ − n_c(t)]     (11-2)
N₀ is the number of faults at the beginning of the test period. The cumula-
tive number of faults that have been repaired is given by nc(t). Faults
remaining in the software at any time during the test period are calculated
by subtracting nc(t) from N0. After the test period, faults are not detected
and repaired. The failure rate then remains constant.
EXAMPLE 11-2
λ(n_c) = k [N₀ − n_c(τ)]     (11-3)

The Greek lowercase letter tau (τ) represents execution time. Normally,
for control system computers, execution time equals calendar time, since
the computers are dedicated to the control function.
Proceeding with the model development, it is noted that with both the
basic model and the basic execution time model, it is assumed the detec-
tion of any fault is equally likely. This means in any given time period, a
constant percentage of faults will be detected. To illustrate the concept, we
estimate a program has 1000 faults (5000 lines of source times 0.2 average
faults per line). Assume each week we find a constant 25 percent of the
remaining faults. Table 11-2 lists the results.
--``,,`,,,`,,`,`,,,```,,,``,``,,-`-`,,`,,`,`,,`---
Since the number of faults detected (and repaired) each period is a constant
percentage of the faults remaining, the rate of fault repair is proportional
to the number of faults remaining. The number of faults remaining equals
the total starting number of faults (N₀) minus the cumulative quantity of
repaired faults. Therefore:

dn_c/dτ = k [N₀ − n_c(τ)]     (11-6)

Solving this differential equation gives the cumulative number of repaired
faults as a function of execution time:

n_c(τ) = N₀ [1 − e^(−kτ)]     (11-7)

Substituting Equation 11-7 into Equation 11-3, the failure rate as a function
of execution time is

λ(τ) = k N₀ e^(−kτ)     (11-8)
EXAMPLE 11-3
The parameters used in the curve are N₀ = 140 and k = 0.01. Using
these parameters along with our total execution time of 400 hours in
Equation 11-8, a failure rate is obtained.
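Carrying the arithmetic through: λ(400) = 0.01 × 140 × e^(−0.01 × 400) = 1.4 × e^(−4) ≈ 0.026 failures per hour.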
Figure 11-10. Failure Rate versus Test Time (actual data with fitted curve; x-axis is execution time)
EXAMPLE 11-4
Problem: The boss will not accept our graphic curve fitting technique.
Redo Example 11-3 using numerical linear data regression.
λ(τ) = A e^(Bτ)     (11-9)

then taking the natural log of both sides of the equation results in

ln λ(τ) = ln A + Bτ
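Since ln λ(τ) is linear in τ, ordinary least-squares regression recovers ln A and B. A hypothetical Python sketch (the data pairs below are placeholders, not the book's measurements):

```python
import numpy as np

# Placeholder (tau, lambda) observations; real values would come from
# the recorded failure times of Example 11-3.
tau = np.array([50.0, 100.0, 150.0, 200.0, 250.0, 300.0])
lam = np.array([0.90, 0.55, 0.33, 0.20, 0.12, 0.075])

# Fit ln(lambda) = ln(A) + B*tau (Equation 11-9 in logarithmic form).
B, lnA = np.polyfit(tau, np.log(lam), 1)
A = np.exp(lnA)

# For the BET model, lambda(tau) = k*N0*e^(-k*tau), so k = -B, N0 = A/k.
k, N0 = -B, A / -B
print(f"A = {A:.3f}, B = {B:.5f}, k = {k:.4f}, N0 = {N0:.0f}")
```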
The Logarithmic Poisson (LP) model assumes that failure rate variation as
a function of cumulative repaired faults is an exponential.
λ(n_c) = λ₀ e^(−θ n_c)     (11-10)
For the LP model, the graph indicates that some faults are more likely to
cause failures. When they are removed, the failure rate drops fast. Other
faults are less likely to cause failure. As they are removed, the failure rate
drops more slowly. We should also notice that the rate does not drop sig-
nificantly past a certain point in the fault removal process. This is characteristic of the LP model.
The cumulative number of repaired faults for the LP model is given by the
equation:
n_c(τ) = (1/θ) ln(λ₀ θ τ + 1)     (11-11)
This should be compared with Equation 11-7 for the BET model. The cumula-
tive number of repaired faults for both models is plotted in Figure 11-12.
The expected number of repaired faults tends towards N₀ for the BET
model as execution time increases. For the LP model, however, the
expected number of repaired faults tends toward infinity. This is characteristic
of the LP model.
Figure 11-12. Faults versus Test Time
λ(τ) = λ₀ / (λ₀ θ τ + 1)     (11-12)
The LP model is used in a manner similar to the BET model. Failure times
are recorded during product tests. Parameters are estimated by best-fit
curve fitting. Figure 11-13 shows the data from Table 11-2 fitted with both
a BET curve and an LP curve.
Operational Profile
Earlier in the chapter we discussed the concept of an input space, the
collection of all possible inputs. These inputs cause the computer to execute
different paths through its program. Some failures occur only when cer-
tain inputs are received by the computer. Often, sets of inputs are grouped
according to computer function. Commands that tell the computer to exe-
cute certain functions, followed by input data, represent a logical group-
ing within the input space.
Faults enter software for many reasons: misunderstood functionality,
design error, coding error, etc. These reasons generally result in
independent faults. Occasionally, dependent faults occur, but we have no
strong reason to doubt this assumption.
We conclude then that this assumption is not valid for most testing pro-
cesses. Two effects have been attributed to this assumption violation. First,
the data is generally noisy. The actual data in Figures 11-10 and 11-13,
for example, bounces around quite a bit. This could result from nonran-
dom testing.
New fault introduction does not prevent us from using the model; it
merely changes the parameters. If the new fault introduction rate is simi-
lar to the fault removal rate or larger, the failure rate will not decline. This
is an indication that something more drastic needs to be done. Redesign or
abandonment of the program is in order.
Although the assumption is not met in practice, again the effect is mini-
mal. One such effect is that actual data does not correlate well with the
best-fit curve. When this happens, alternatives exist. We may switch to the
LP model or model each major element of the operational profile as a sep-
arate program.
Model Usage
Several studies have been done using these models, as well as a few oth-
ers, on actual software test data. The results have been moderately suc-
cessful. Usually, one model fits better than others for a particular piece of
software. The characteristics of the software that allow one model or
another to work best have not yet clearly been identified. Some patterns
seem to be forming.
The LP model seems to fit better for larger, more complex programs, espe-
cially when the operational profile is not uniform. The BET model seems
to fit better when program size is changing. Research continues in this
important area.
Exercises
11.1 List several system failures that are caused by software.
11.2 Describe how software failures can be modeled statistically.
11.3 What strength factors can be attributed to a software design?
11.4 What stress factors cause a program to fail?
11.5 Estimate how much software complexity has grown in the last
decade.
11.6 Calculate the number of control flow paths in the program dia-
grammed in Figure 11-15.
11.7 Are all software design faults detected by executing all control
paths?
11.8 How many control flow paths are present in the program dia-
grammed in Figure 11-7?
11.9 We have tested a program for 500 execution hours. A BET model
shows a good correlation using parameters of k = 0.01 and N₀ =
372. If we continue testing and repairing problems for another 200
execution hours, what field failure rate could be expected?
11.10 The program of Exercise 11.9 has a customer-specified maximum
software failure rate of 0.0001 failures/hour. How many hours of
testing and repair will be required to meet the specification?
11.11 We have tested a program for 1000 execution hours. An LP model
shows a good correlation using parameters of λ₀ = 2 and θ = 0.03.
What is the expected field failure rate?
11.12 We have a software reliability goal of 0.0001 failures per hour for
the program of Exercise 11.11. How many test hours are required?
11.13 Is it reasonable to expect the program from Exercise 11.11 to ever
achieve the reliability goal of 0.0001 failures per hour if new faults
are added when old faults are repaired?
Answers to Exercises
11.1 The answer can vary according to experience. Author's list: crash
during word processor use that destroyed chapter 4, PC hung up
during email file transfer, PC hung up during printing with three
applications open, PC crashed after receiving an email message, etc.
11.2 Software failures can be modeled statistically because much like
hardware, the failure sources create stress that can be character-
ized as random variables.
11.3 Software strength is increased when fewer design faults are
present, when stress rejection is added to the software, when
software execution is consistent (less variable: no multitasking,
few or no interrupts, little or no dynamic memory allocation), and
when software diagnostics operate on-line to detect and report
faults.
11.4 Stress on a software program includes inputs (especially unex-
pected inputs), the timing of the inputs, the contents of memory,
and the state of the machine.
11.5 Software complexity appears to have grown over six orders of
magnitude.
References
1. Gray, J. "A Census of Tandem System Availability Between 1985 and
1990." IEEE Transactions on Reliability, Vol. 39, No. 4, Oct. 1990.
14. Musa, J. D., Iannino, A., and Okumoto, K. Software Reliability: Mea-
surement, Prediction, Application. New York: McGraw-Hill, 1987.
Key Issues
The amount of detail to be included in a safety and reliability model
depends on the objectives of the modeling. The amount of effort (and
--``,,`,,,`,,`,`,,,```,,,``,``,,-`-`,,`,,`,`,,`---
cost!) is affected by the level of detail in the model when modeling is done
manually, but it should be noted that for a given level of detail, costs
depend much more on the available computer tools.
Probability Approximations
When the failure rate of a component, module, or unit is known, the prob-
ability of failure for a given time interval is approximated by multiplying
the failure rate times the time interval (see "A Useful Approximation,"
Chapter 4). In safety instrumented systems it is a good practice to periodi-
cally inspect the system for failures. In this situation the time interval used
in the calculation is the inspection period. While this method is an approx-
imation, errors tend to be in the pessimistic direction and therefore the
method is conservative. The failure probabilities for system components
can be used in fault tree diagrams to calculate system failure probabilities.
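As a quick numerical check (a sketch, not part of the text), the approximation λ · TI can be compared with the exact probability 1 − e^(−λ·TI); the approximation is always the larger of the two, so it errs on the conservative side. The rate and interval below are assumed values:

```python
import math

lam = 5.0e-6    # dangerous failure rate, failures/hour (assumed)
TI = 4380.0     # periodic inspection interval, hours (six months)

approx = lam * TI                      # the useful approximation
exact = 1.0 - math.exp(-lam * TI)      # exact probability
print(f"approximate F(TI) = {approx:.6f}")   # 0.021900
print(f"exact       F(TI) = {exact:.6f}")    # 0.021662 (smaller)
```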
[Figure 12-1. Ideal fault tree for the 1oo2 switch system; top event "System fails," F(4380) = 0.00048]

Figure 12-1 shows a simple ideal fault tree for the 1oo2 series wired switch
system. The probability of short circuit (dangerous) failure for one switch
is approximated by multiplying the short circuit failure rate, λ^D, times the
periodic inspection interval, TI (time interval). The system fails short circuit
only if both switches A and B fail short circuit. Therefore, the approximate
probability of the system failing short circuit is given by

PFD = (λ^D TI)²     (12-1)
This simple fault tree assumes that there is no common cause. It assumes
that perfect inspection and perfect repairs are made at each inspection
period. It does not account for more rapid repairs made if diagnostics
detect the failure. The model assumes constant failure rates.
If the two switches had different failure rates (diverse designs) then the
equation would be

PFD = (λ_A^D TI)(λ_B^D TI)     (12-2)
EXAMPLE 12-1
time interval? What is the risk reduction factor (RRF)?
Since both units must fail for the system to fail, system probability of
failure using Equation 12-1 is
A more useful metric averages the probability of failure over the entire
time interval. This metric is called PFDavg.
The approximate PFDavg over the time interval for the 1oo2 system is
given by:
PFDavg(t) = (1/t) ∫₀ᵗ (λ^D t′)² dt′

Substituting t = TI,

PFDavg(TI) = (1/TI) ∫₀^TI (λ^D t′)² dt′

and integrating,

PFDavg(TI) = (1/TI) (λ^D)² [t³/3]₀^TI

which evaluates to

PFDavg = (λ^D)² TI² / 3     (12-3)
NOTE: As will be shown later in this chapter, this simplified equation rep-
resents an ideal situation. This simplified equation should not be used for
safety design verification.
EXAMPLE 12-2
The Markov model for this simple system is shown in Figure 12-2, with
failure arcs 2λ^D (state 0 to 1) and λ^D (state 1 to 2). The P matrix is

P = | 1 − 2λ^D   2λ^D      0    |
    | 0          1 − λ^D   λ^D  |
    | 0          0         1    |
Solving this simple model using a P matrix multiplication technique yields
the PFD as a function of the operating time interval of 4380 hours as
shown in Figure 12-3.
[Figure 12-3. PFD versus operating time interval for the 1oo2 Markov model; the PFD rises from 0 to about 0.0005 over the 4380-hour interval]
Taking the average of these PFD numbers gives a result of 0.0001585. The
Markov model RRF is 6309. This is higher than the value of Example 12-2
(6255). The difference again shows the impact of the approximation in the
simplified equation, an error that is not present in the numeric Markov
solution. However, the differences are not significant in this case. Note:
the differences do become significant for longer time intervals and higher
failure rates.
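A numeric Markov solution like this is easy to reproduce. The sketch below is hypothetical Python; the failure rate λ^D = 5 × 10⁻⁶ per hour is an assumption back-computed from F(4380) = 0.00048 rather than a value stated here:

```python
import numpy as np

lam = 5.0e-6    # dangerous failure rate per switch (assumed)
TI = 4380       # operating time interval, hours

# P matrix for the ideal 1oo2 model: state 0 both OK,
# state 1 one switch failed dangerous, state 2 system failed.
P = np.array([[1 - 2*lam, 2*lam,   0.0],
              [0.0,       1 - lam, lam],
              [0.0,       0.0,     1.0]])

S = np.array([1.0, 0.0, 0.0])     # start in state 0
pfd_sum = 0.0
for _ in range(TI):
    S = S @ P                     # advance one hour
    pfd_sum += S[2]               # PFD = probability of state 2

pfd_avg = pfd_sum / TI
print(f"PFDavg = {pfd_avg:.3e}, RRF = {1/pfd_avg:.0f}")
# Expect PFDavg near 1.6e-4 and RRF near 6300, consistent with the text.
```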
Using the beta model, the dangerous failure rate is divided into normal
and common cause portions:

λ^DN = (1 − β) λ^D     (12-4)

and

λ^DC = β λ^D     (12-5)
A new fault tree is shown in Figure 12-4. The formula for approximate
PFD using this fault tree is

PFD = (λ^DN TI)² + λ^DC TI     (12-6)
[Figure 12-4. 1oo2 fault tree with common cause: the system fails if units A and B both fail or if a common cause failure occurs]
The equation for PFDavg for Figure 12-4 can be derived by integrating the
PFD equation and dividing by the time interval. The average approxima-
tion with common cause is given by:
PFDavg(TI) = (1/TI) ∫₀^TI [ (λ^DN t′)² + λ^DC t′ ] dt′

and integrating,

PFDavg(TI) = (1/TI) [ (λ^DN)² t³/3 + λ^DC t²/2 ]₀^TI

which evaluates to

PFDavg = (λ^DN)² TI²/3 + λ^DC TI/2     (12-7)
EXAMPLE 12-3
Figure 12-5. 1oo2 System Markov Model with Common Cause (failure arcs 2λ^DN, λ^D, and λ^DC)

The Markov model for a 1oo2 system with common cause is shown in Figure
12-5. The P matrix for this model is

P = | 1 − (2λ^DN + λ^DC)   2λ^DN     λ^DC |
    | 0                    1 − λ^D   λ^D  |
    | 0                    0         1    |
[Figure 12-6. PFD, 1oo2 System Markov Model with Common Cause; the PFD rises from 0 to about 0.003 over the 4380-hour operating time interval]
Taking the average of these PFD numbers gives a result of 0.00116. The
RRF is 860. This is higher than the value of Example 12-3 (816).
On-line Diagnostics
As we have seen, system reliability and safety can be substantially
improved when automatic diagnostics are programmed into a system to
detect component, module, or unit failures. This can also benefit the sys-
tem by reducing actual repair time (the time between failure detection
and the completion of repairs). Diagnostics can identify and annunciate
failures. Repairs can be made quickly as the diagnostics indicate exactly
where to look for the problem.
Imagine the two switches used in the 1oo2 example have microcomputers
that periodically open the switch for a few microseconds and check to ver-
ify that the current flowing through the switch begins to drop. With such a
diagnostic it is possible to detect many of the short circuit failure modes in
the switch.
SU - Safe, undetected
SD - Safe, detected
DU - Dangerous, undetected
DD - Dangerous, detected
The appropriate diagnostic coverage factors are used as follows:

λ^SU = (1 − C^S) λ^S     (12-8)
λ^SD = C^S λ^S     (12-9)
λ^DU = (1 − C^D) λ^D     (12-10)

and

λ^DD = C^D λ^D     (12-11)
EXAMPLE 12-4
The system fails when one combination of the two switches fails. The OR
gate in the left side fault tree accounts for this.
An equation can be developed from the fault tree. The probability that a
switch will be failed dangerous if the failure is undetected is approxi-
mated by multiplying the dangerous undetected failure rate times the
inspection interval (DU TI). The probability that a switch will be failed
dangerous when the failure is detected actually depends on maintenance
policy. The switch will be failed dangerous only until it is repaired. The
shorter the repair time, the lower the probability of dangerous failure.
This probability is approximated by multiplying the dangerous detected
failure rate times the repair time (λ^DD RT). The resulting equation for the
fault tree is

PFD = (λ^DU TI)² + 2 λ^DD RT λ^DU TI + (λ^DD RT)²     (12-12)
The equation for PFDavg can be derived by integrating the PFD equation
and dividing by the time interval. This approximation is given by:
PFDavg = (1/TI) ∫₀^TI [ (λ^DU t)² + 2 λ^DD λ^DU RT t + (λ^DD RT)² ] dt

which evaluates to

PFDavg = (1/TI) [ (λ^DU)² t³/3 + λ^DD λ^DU RT t² + (λ^DD RT)² t ]₀^TI

PFDavg = (λ^DU)² TI²/3 + λ^DD λ^DU RT TI + (λ^DD RT)²     (12-13)
EXAMPLE 12-5
Substitute failure rate and repair time data into Equation 12-13 to
obtain PFDavg.
EXAMPLE 12-6
Solution: The first term of Equation 12-13 is

PFDavg = (λ^DU TI)² / 3

Substituting our values, PFDavg = 0.0000004 and RRF = 2,502,033.
The PFDavg represents only 45% of the previous value
(0.00000089), and the RRF is more than double. This result is
dangerous and misleading. The first term of Equation 12-13 is often
published as the Simplified Equation to be used for a 1oo2 system.
The author asserts it should only be used on components with no
diagnostics or a low coverage factor. For a low coverage factor (less
than 50%) this simplification will be more accurate (about 95%).
An on-line repair rate from state 1 to state 0 assumes that the system can
be repaired without a shutdown. In state 2, one switch has failed danger-
ous and the failure is not detected by on-line diagnostics and therefore no
repair rate from this state is shown.
In state 3, the system has failed dangerous with both failures detected by
diagnostics. From this state, the first switch repair takes the model back to
state 1. In state 4, the system has failed dangerous, but the failure is
detected in only one of the switches. Therefore, repair takes the model
back to state 2. The model assumes maintenance policy allows on-line
repair of the system without shutting down the process. State 5 represents
the condition where the system has failed dangerous and the failures are
not detected by diagnostics.
[Figure 12-8 diagram: states 0 (OK), 1 (degraded, one failure detected), 2 (degraded, one failure undetected), 3 (system fail danger, detected), 4 (system fail danger, one detected/one undetected), 5 (system fail danger, undetected); failure arcs 2λ^DD, 2λ^DU, λ^DD, λ^DU and repair arcs μ_O]

Figure 12-8. 1oo2 System Markov Model Accounting for Diagnostics (No Common Cause)
The P matrix for this model is

P = | 1 − 2λ^D   2λ^DD            2λ^DU     0          0         0     |
    | μ_O        1 − (λ^D + μ_O)  0         λ^DD       λ^DU      0     |
    | 0          0                1 − λ^D   0          λ^DD      λ^DU  |
    | 0          2μ_O             0         1 − 2μ_O   0         0     |
    | 0          0                μ_O       0          1 − μ_O   0     |
    | 0          0                0         0          0         1     |
EXAMPLE 12-7
P     0          1          2          3          4          5
0     0.99999    9.5×10⁻⁶   5×10⁻⁷     0          0          0
1     0.013889   0.986106   0          4.75×10⁻⁶  2.5×10⁻⁷   0
2     0          0          0.999995   0          4.75×10⁻⁶  2.5×10⁻⁷
3     0          0.027778   0          0.972222   0          0
4     0          0          0.013889   0          0.986111   0
5     0          0          0          0          0          1
The four failure rates λ^SU, λ^SD, λ^DU, and λ^DD are divided using the beta
model:

λ^SDN = (1 − β) λ^SD     (12-14)
λ^SDC = β λ^SD     (12-15)
λ^SUN = (1 − β) λ^SU     (12-16)
λ^SUC = β λ^SU     (12-17)
λ^DDN = (1 − β) λ^DD     (12-18)
λ^DDC = β λ^DD     (12-19)
λ^DUN = (1 − β) λ^DU     (12-20)
λ^DUC = β λ^DU     (12-21)
EXAMPLE 12-8
λ^SDN = 0.00000405
λ^SDC = 0.00000045
λ^SUN = 0.00000045
λ^SUC = 0.00000005
λ^DDN = 0.000004275
λ^DDC = 0.000000475
λ^DUN = 0.000000225
λ^DUC = 0.000000025
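These splits are mechanical enough to script. In the hypothetical Python sketch below, the inputs λ^S = λ^D = 5 × 10⁻⁶, C^S = 0.9, C^D = 0.95, and β = 0.1 are back-computed from the listed results rather than quoted from the example:

```python
# Split total safe/dangerous rates into the eight categories.
lam_s, lam_d = 5.0e-6, 5.0e-6   # total safe/dangerous rates (assumed)
cs, cd = 0.90, 0.95             # diagnostic coverage factors (assumed)
beta = 0.10                     # common cause beta factor (assumed)

# Equations 12-8 through 12-11: detected/undetected split.
rates = {"SD": cs * lam_s, "SU": (1 - cs) * lam_s,
         "DD": cd * lam_d, "DU": (1 - cd) * lam_d}

# Equations 12-14 through 12-21: normal/common cause split.
for name, lam in rates.items():
    print(f"{name}N = {(1 - beta) * lam:.9f}   {name}C = {beta * lam:.9f}")
```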
Figure 12-9. 1oo2 System Fault Tree - Diagnostics and Common Cause
Figure 12-10. Alternative 1oo2 System Fault Tree - Diagnostics and Common Cause
PFD = (λ^DUN TI)² + 2 λ^DDN RT λ^DUN TI + (λ^DDN RT)² + λ^DDC RT + λ^DUC TI     (12-22)
The equation for PFDavg can be derived by integrating the PFD equation
and dividing by the time interval. This approximated average is given by:
PFDavg = (1/TI) ∫₀^TI [ (λ^DUN t)² + 2 λ^DDN λ^DUN RT t + (λ^DDN RT)² + λ^DDC RT + λ^DUC t ] dt

which evaluates to

PFDavg = (1/TI) [ (λ^DUN)² t³/3 + λ^DDN λ^DUN RT t² + (λ^DDN RT)² t + λ^DDC RT t + λ^DUC t²/2 ]₀^TI

PFDavg = (λ^DUN)² TI²/3 + λ^DDN λ^DUN RT TI + (λ^DDN RT)² + λ^DDC RT + λ^DUC TI/2     (12-23)
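Equation 12-23 is easy to evaluate numerically. The sketch below (hypothetical Python) uses the failure rates of Example 12-8 with an assumed six-month inspection interval and 72-hour repair time; with those assumptions the result is roughly 9 × 10⁻⁵ (RRF near 11,100), consistent with the RRF of 11,151 quoted in the answer to Exercise 12.3:

```python
def pfd_avg_1oo2(dun, ddn, ddc, duc, ti, rt):
    """Equation 12-23: approximate PFDavg for a 1oo2 system with
    on-line diagnostics and common cause."""
    return ((dun * ti) ** 2 / 3.0
            + ddn * dun * rt * ti
            + (ddn * rt) ** 2
            + ddc * rt
            + duc * ti / 2.0)

# Failure rates from Example 12-8; TI and RT are assumptions.
pfd = pfd_avg_1oo2(dun=2.25e-7, ddn=4.275e-6, ddc=4.75e-7,
                   duc=2.5e-8, ti=4380.0, rt=72.0)
print(f"PFDavg = {pfd:.3e}, RRF = {1/pfd:.0f}")
```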
EXAMPLE 12-9
Figure 12-11. 1oo2 System Markov Model - Diagnostics and Common Cause
EXAMPLE 12-10
Problem: Using the Markov model of Figure 12-11 and the failure
rates of Example 12-8, calculate the PFDavg and RRF of the 1oo2
system for a time interval of six months.
Solution: When the failure rates and repair rates are substituted into
the P matrix, the following numeric values result:
P 0 1 2 3 4 5
0 0.9999905 0.00000855 0.00000045 0.000000475 0 0.000000025
1 0.013888889 0.986106111 0 0.00000475 0.00000025 0
2 0 0 0.999995 0 0.00000475 0.00000025
3 0 0.027777778 0 0.972222222 0 0
4 0 0 0.013888889 0 0.986111111 0
5 0 0 0 0 0 1
This fault tree can be analytically evaluated although the equations get
complicated. The approximate probability of dangerous failure for one of
the lower OR gates is:
Figure 12-12. Alternative 1oo2 System Fault Tree - Diagnostics, Common Cause, and Imperfect Proof Test
PFD = (λ^DDN RT + E λ^DUN TI + (1 − E) λ^DUN LT)² + λ^DDC RT + E λ^DUC TI + (1 − E) λ^DUC LT     (12-24)

Multiplying out the squared term of Equation 12-24 expands the equation
into a sum of individual terms, among them:

E λ^DUC TI     (12-26)
(1 − E) λ^DUC LT     (12-27)
(λ^DDN RT)²     (12-28)
(E λ^DUN TI)²     (12-32)

together with the remaining square and cross-product terms.
EXAMPLE 12-11
Problem: Using failure rates from Example 12-8, calculate the PFD for
each term of Equation 12-24 for the fault tree in Figure 12-12 and the
total PFD for a unit lifetime of 10 years. Assume E is 80%.
It is clear that some terms are more significant than others, with the
most significant being those involving imperfect proof test failures
over the device's operating lifetime. The total PFD is 0.00058518.
EXAMPLE 12-12
Problem: Using failure rates from Example 12-8, calculate the PFDavg
and the RRF for the fault tree of Figure 12-12 for a unit lifetime of 10
years. Assume E is 80%.
A Markov model can also include the effect of imperfect proof testing. The
Markov model of Figure 12-14 is similar to that of Figure 12-11. An additional
periodic repair rate from state 5 to state 0 was added. Figure 12-14 also
has an added state, state 6. In this state, dangerous failures are not
detected by either on-line diagnostics or the periodic inspection and test.
[Figure 12-13. Alternative 1oo2 System Fault Tree - Diagnostics, Common Cause, and Imperfect Proof Test: PFD vs. Operating Time, 2 to 10 years, PFD rising to about 0.0006]
Figure 12-14. Markov Model Showing the Impact of Imperfect Proof Testing
EXAMPLE 12-13
Problem: Using the Markov model of Figure 12-14 and the failure
rates of Example 12-8, calculate the PFDavg and RRF of the 1oo2
system for a unit lifetime of 10 years. Assume E is 80%.
Solution: When the failure rates and repair rates are substituted into
the P matrix, the following numeric values result:
P 0 1 2 3 4 5 6
0 0.999991 0.00000855 0.00000045 0.000000475 0 0.00000002 0.000000005
1 0.013889 0.98610611 0 0.00000475 0.00000025 0 0
2 0 0 0.999995 0 0.00000475 0.0000002 0.00000005
3 0 0.02777778 0 0.972222222 0 0 0
4 0 0 0.01388889 0 0.98611111 0 0
5 0 0 0 0 0 1 0
6 0 0 0 0 0 0 1
At the end of the periodic inspection, test, and repair time, the P matrix
is:
P 0 1 2 3 4 5 6
0 1 0 0 0 0 0 0
1 1 0 0 0 0 0 0
2 1 0 0 0 0 0 0
3 1 0 0 0 0 0 0
4 1 0 0 0 0 0 0
5 1 0 0 0 0 0 0
6 0 0 0 0 0 0 1
[Figure 12-15. PFD vs. Operating Time for the Markov Model Showing the Impact of Imperfect Proof Testing (2 to 10 years; PFD scale 0 to 0.0006)]
Diagnostic Failures
What happens when the subsystem performing the on-line diagnostics
fails? Most of the time, this does not immediately impact the safety func-
tion, but it does cause loss of the automatic diagnostics that detect poten-
tially dangerous failures. This secondary failure may have a significant
impact on PFD and PFDavg depending on the quality of the diagnostics.
Systems with a very good diagnostic coverage factor will suffer the most
when a diagnostic failure occurs (Ref. 2).
Figure 12-17 shows the impact of diagnostic failure. A new state (state 1) is
added to the model. In state 1, the diagnostics have failed. It can be seen
that the worst-case impact is that all failures previously classified as dan-
gerous detected are no longer detected and therefore go to state 3, danger-
ous undetected failure. This level of modeling is needed whenever the
diagnostic coverage is in the 95% plus range to be sure the results are not
optimistic (Ref. 3).
Figure 12-17. Markov Model Single Switch with Failure of Automatic Diagnostics
[Fault tree: the system fails dangerous if A and B have an initial common cause failure, fail due to common stress (λ^DDC, E λ^DUC, or (1 − E) λ^DUC), or if switches A and B each fail (λ^DDN, E λ^DUN detected by proof test, (1 − E) λ^DUN not detected, or an initial PFD)]
In Closing
It can be seen that on-line diagnostics have a major impact on the accuracy
and validity of modeling results. Building a model that ignores on-line
diagnostics can be unnecessarily pessimistic. Common cause has a major
impact, as does imperfect proof testing and repair. A model that does not
consider common cause and imperfect proof testing and repair will pro-
duce overly optimistic results.
Exercises
12.1 Example 12-9 uses a fault tree that accounts for both common
cause and on-line diagnostic capability. Repeat that example using
a smaller beta factor of 0.02.
12.2 Compare the result of Exercise 12.1 to the result of Example 12-3.
What is the % difference?
12.3 Repeat Example 12-9 for a periodic inspection interval of one year
(8760 hours).
12.4 Why is the RRF result of Exercise 12.6 so much higher than the
others?
Answers to Exercises
12.1 Lowering the beta to 0.02 improves the RRF to 53,631.
12.2 The RRF from Example 12-1 was 6255. The RRF from Example 12-3
was 816. Example 12-3 RRF was about 13% of that of Example 12-1.
This shows almost an order of magnitude difference in the result.
12.3 The RRF dropped from 11,151 to 6,864.
12.4 Exercise 12.6 used a simplified equation that does not consider
PFD due to repair time, and it does not consider common cause or
imperfect proof testing.
References
1. Bukowski, J. V. Modeling and Analyzing the Effects of Periodic
Inspection on the Performance of Safety-Critical Systems. IEEE
Transactions on Reliability (Vol. 50, No. 3). New York: IEEE,
Sep. 2001.
Safety Model
Construction
System Model Development
A successful reliability and safety evaluation of a system depends, in large
part, on the process used to define the model for the system. Although not
sufficient by itself, a knowledge of proper system operation is essential.
Perhaps more important is an understanding of system operation under
failure conditions. One of the best tools to systematically gain such an
understanding is Failure Modes and Effects Analysis (FMEA) (Chapter 5).
A series of steps, including an FMEA, can be taken to help ensure the con-
struction of an accurate reliability and safety model. The following steps
are recommended:
If the system fails such that the pressure switch will not open, both con-
troller outputs are always energized or the valve is jammed closed, it is
called a dangerous failure because the safety system cannot relieve the
pressure. If the system fails in such a way that the pressure switch fails
open, either controller output is failed de-energized or the valve fails
open, it is called a safe failure, since pressure is inadvertently relieved.
The FMEA chart is presented in Table 13-1. Each system level component
is listed along with its failure modes. The system effect for each failure is
listed and categorized. The failure rates are typically obtained from the
component manufacturers. These are given in units of FITS (failures per
10⁹ hours).
Table 13-1. System FMEA (failure rates in FITS)

Component           Failure Mode     Cause                  System Effect              Category    Rate   Comment
Valve VALVE1        jam closed       corrosion, dirt        cannot trip                dangerous   100
                    coil open        elec. surge            false trip                 safe        50     if coil opens, valve opens
                    coil short       corrosion, wire        false trip                 safe        50     if coil shorts, valve opens
Pres. switch PSH1   sense short      power surge,           output energize -          dangerous   100
                                     overpressure           cannot trip
                    open             many                   false trip                 safe        400
                    ground fault     corrosion              cannot trip                dangerous   100    assume grounding of positive side
Controller 1 C1     no comlink       many                   no effect                  ---         145    no effect on safety function
(logic solver)      output energize  surge, heat            cannot trip                dangerous   230
                    output open      many                   false trip                 safe        950
Controller 2 C2     no comlink       many                   no effect                  ---         145    no effect on safety function
(logic solver)      output energize  surge, heat            cannot trip                dangerous   230
                    output open      many                   false trip                 safe        950
Starting with the two primary failure categories, all FMEA failure rates
can be placed into one of the two categories using a Venn diagram (Figure
13-2). For example, a short circuit failure of the pressure switch belongs in
the dangerous (valve closed) category. The total failure rate is divided:
λ^TOT = λ^S + λ^D     (13-1)

where the superscript S represents a safe failure and the superscript D
represents a dangerous failure.
EXAMPLE 13-1
Problem: Divide the failure rates of Table 13-1 into the two
categories, safe and dangerous, based on the FMEA table.
Solution: For each component, add the failure rates for each mode:
λ^S_C = 950 × 10⁻⁹ failures per hour

λ^D_C = 230 × 10⁻⁹ failures per hour
must also divide the dangerous failures into two groups, dangerous com-
mon-cause failures (DC) and dangerous normal failures (DN). The failure
rates are divided into two mutually exclusive groups where:
λ^S = λ^SC + λ^SN     (13-2)

and

λ^D = λ^DC + λ^DN     (13-3)
The beta factor is based on the chances of a common stress failing multiple
redundant components. The factors to be considered are physical location,
electrical separation, inherent strength of the components versus the envi-
ronment, and any diversity in the redundant components. Though there
may be different beta factors for safe failures and dangerous failures, the
considerations are the same; therefore, the same β is typically used.
The failure rates must be further classified into those that are detected by
on-line diagnostics and those that are not. Detected failures are classified
as Detected. Those not detected by on-line diagnostics are classified as
Undetected. Both safe and dangerous normal failures are classified. Safe
and dangerous common-cause failures are also classified.
EXAMPLE 13-2
λ^SN_C = 0.9 (950) = 855
λ^SC_C = 0.1 (950) = 95
λ^DN_C = 0.9 (230) = 207
λ^DC_C = 0.1 (230) = 23
from the undetected failures. The eight failure rate categories (Chapter 12)
are calculated as follows:
λ^SDN = C^S λ^SN     (13-4)
λ^SUN = (1 − C^S) λ^SN     (13-5)
λ^DDN = C^D λ^DN     (13-6)
λ^DUN = (1 − C^D) λ^DN     (13-7)
λ^SDC = C^S λ^SC     (13-8)
λ^SUC = (1 − C^S) λ^SC     (13-9)
λ^DDC = C^D λ^DC     (13-10)
λ^DUC = (1 − C^D) λ^DC     (13-11)
EXAMPLE 13-3
Fractional values are rounded up (X becomes X + 1), as the normal rules
of rounding do not seem pessimistic enough.
As a check, add up the failure rate categories for each component. The
total (with allowance as appropriate for rounding) should equal the start-
ing number. For a controller, the total equals 1180. All totals match.
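The check is easy to automate. The hypothetical Python sketch below splits the controller's rates into the eight categories and confirms they sum back to the 1180 FITS starting total; the coverage factors are placeholders, since any values must preserve the total:

```python
def eight_categories(lam_s, lam_d, beta, cs, cd):
    """Equations 13-2/13-3 followed by 13-4 through 13-11."""
    cats = {}
    for mode, lam, cov in (("S", lam_s, cs), ("D", lam_d, cd)):
        normal, common = (1 - beta) * lam, beta * lam
        cats[mode + "DN"] = cov * normal
        cats[mode + "UN"] = (1 - cov) * normal
        cats[mode + "DC"] = cov * common
        cats[mode + "UC"] = (1 - cov) * common
    return cats

# Controller rates from Table 13-1 (FITS), beta = 0.1 per Example 13-2;
# the coverage factors here are illustrative assumptions.
cats = eight_categories(lam_s=950, lam_d=230, beta=0.1, cs=0.9, cd=0.9)
print(sum(cats.values()))   # 1180.0 -- matches the starting total
```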
The first step in fault tree construction for the dangerous failure mode is
shown in Figure 13-3. Three major events will cause the system to fail dan-
gerous: a dangerous switch failure, a dangerous failure in both controllers,
and a dangerous valve failure.
The process continues using the failure rate checklist. A switch can fail
dangerous if it fails dangerous detected or if it fails dangerous undetected.
An OR gate is added to the fault tree showing these events. The only dan-
gerous valve failure is dangerous undetected. A basic fault showing this
condition is added to the fault tree.
There are a number of ways in which both controllers can fail dangerous.
These include a detected common cause failure or an undetected common
cause failure. Both controllers can fail dangerous if any of four combina-
tions of dangerous detected and dangerous undetected occur. These are
also added to the fault tree. The complete model is shown in Figure 13-4.
The arc drawn from state 0 to the new failure state is labeled with the symbol
λ1. This failure rate includes all of the safe failure rates circled in Figure 13-6.
Four failure rates cause new states. These are circled in Figure 13-8. When
the additional system success states are added to the Markov model, the
interim diagram looks like Figure 13-9. Since states 2 and 3 represent fail-
ures detected by on-line diagnostics, an on-line repair rate (made without
shutting down, assuming that maintenance policy allows) goes from these
states to state 0.
λ2 = λ^DDN_C1     (13-13)
λ3 = λ^DDN_C2     (13-14)
λ4 = λ^DUN_C1     (13-15)
λ5 = λ^DUN_C2     (13-16)
The remaining failure rates from Figure 13-5 cause the system to fail with
the valve closed. These are circled in Figure 13-10. To show this system
behavior, two more failure states are added to the model. State 6 repre-
sents a condition where the system has failed with the valve closed (dan-
gerous) but the failure is detected. In state 7, the system has failed
dangerous and the failure is not detected. Assuming maintenance policy
allows it, an on-line repair can be made from state 6. This is indicated with
a repair arc from state 6 to state 0. The new states are shown in Figure 13-11,
where:

λ6 = λ^DDC_C + λ^DD_PSH1     (13-17)
λ7 = λ^DUC_C + λ^DU_PSH1 + λ^DU_VALVE1     (13-18)
A check of the failure rates originally listed in Figure 13-5 will show that
all failure rate categories have been included in the Markov model.
The Markov model continues by remembering the rule: For each success-
ful system state, list all failure rate categories for all successful compo-
nents. The interim model now has four successful system states that must
be considered. Construction continues from state 2. In this state the system
has one controller, the pressure switch, and the valve working success-
fully. State 2 failure rates are circled in Figure 13-12.
An examination shows that these failures will send the system to either
state 1 (the valve open failure state), state 6 (the valve closed with all fail-
ures detected state), or new system failure states where some failures are
detected and some are not. These will be modeled with two new states in
order to show the effects of on-line repair. Four arcs are added from state 2
as shown in Figure 13-13, where:
λ8 = λ^SU_PSH1 + λ^SD_C + λ^SU_C + λ^SD_VALVE1 + λ^SU_VALVE1     (13-19)
λ9 = λ^DD_PSH1 + λ^DD_C     (13-20)
λ10 = λ^DU_PSH1 + λ^DU_VALVE1     (13-21)
λ11 = λ^DU_C     (13-22)
The model is still not complete. States 4 and 5 are system success states
where additional failures will fail the system. Figure 13-16 shows the fail-
ure rates in state 4. Any safe failure rate will take the Markov model to
state 1. That group of failure rates is the same group as in states 2 and 3.
Any dangerous detected failure rate will take the system to a new failure
state where there is a combination of detected and undetected component
failures. A repair from that new state will return the system to state 4. Any
dangerous undetected failure will take the system to state 7 where all
dangerous failures are undetected. These are labeled:

λ12 = λ^DU_PSH1 + λ^DU_VALVE1 + λ^DU_C     (13-23)
State 5 has the same failure and repair rates. Figure 13-17 shows the arcs
from states 4 and 5. Figure 13-18 shows these additions to the completed
Markov model. The system is successful in states 0, 2, 3, 4, and 5. The sys-
tem has failed safe in state 1. The system has failed dangerous in states 6,
7, 8, 9, 10, 11, 12, and 13.
Note that the repair arcs from dangerous system failure states where com-
ponent failures are detected (states 6, 8, 9, 10, 11, 12, and 13) are valid only
if repairs are made to the system without shutting it down. In some com-
panies, operators are instructed to shut down the process if a dangerous
detected failure occurs. The model could be changed to show this by
replacing the current repair arc with an arc from those states to state 1
with the average shutdown rate (1/average shutdown time).
Figure 13-19. First Simplification of Markov Model
Further state merging is possible. Both state 2 and state 3 have identical
transition rates to state 0. These two states also have the same exit rate to
states 1, 6, and 7. There are no more exit rates to check; these two states can
be merged. A similar situation exists for state 4 and state 5. These also can
be merged. A repair rate may be added from state 1 to state 0 if a process is
restarted after a shutdown. The simplified Markov model is redrawn in
Figure 13-20.
The P matrix for the simplified model is

P = | 1 − Σᵢ₌₁⁷ λᵢ   λ1        λ2 + λ3                          λ4 + λ5              λ6               λ7  |
    | μ_SD          1 − μ_SD   0                                0                    0                0   |
    | μ_O           λ8         1 − (μ_O + λ8 + λ9 + λ10 + λ11)  0                    λ9 + λ10 + λ11   0   |
    | 0             λ8         0                                1 − (λ8 + λ9 + λ12)  λ9               λ12 |
    | μ_O           0          0                                0                    1 − μ_O          0   |
    | 0             0          0                                0                    0                1   |
From this matrix many reliability and safety measures can be calculated,
including MTTF, MTTFS, MTTFD, time-dependent PFD, PFDavg,
availability, PFS, and many others. All calculations can be done on a
personal computer spreadsheet.
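One such measure, sketched below in hypothetical Python rather than a spreadsheet, is MTTF computed from the transition matrix: restrict P to the success states, form the fundamental matrix N = (I − Q)⁻¹, and sum the starting state's row (each step of P represents one hour):

```python
import numpy as np

def mttf_hours(P, success_states, start=0):
    """MTTF from a discrete-time Markov transition matrix with a
    one-hour time step. N = (I - Q)^-1 gives the expected number of
    hours spent in each success state before absorption."""
    Q = P[np.ix_(success_states, success_states)]
    N = np.linalg.inv(np.eye(len(success_states)) - Q)
    return N[start].sum()

# Demonstration on the ideal 1oo2 model with an assumed failure rate.
lam = 5.0e-6
P = np.array([[1 - 2*lam, 2*lam,   0.0],
              [0.0,       1 - lam, lam],
              [0.0,       0.0,     1.0]])
print(f"MTTF = {mttf_hours(P, [0, 1]):,.0f} hours")  # 3/(2*lam) = 300,000
```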
Exercises
13.1 Why is a system level FMEA necessary to accurately model control
system reliability?
13.2 List degraded modes of operation that are valid for a control sys-
tem. Are these modes of operation considered failures?
13.3 A control system has no on-line diagnostics and no parallel (redun-
dant) components. The control system can fail only in a de-ener-
gized condition. How many failure rate categories exist?
13.4 Under what circumstances are common-cause failures distin-
guished from normal failures?
13.5 When must detected versus undetected failures be modeled differ-
ently?
Answers to Exercises
13.1 An FMEA (or equivalent) is necessary to understand how system
components fail and how those failures affect the system.
13.2 Many different degraded modes of operation exist. In some cases,
the response time of the system slows. This should be considered a
failure only if the system response time becomes slower than the
requirement. In other cases, an automatic diagnostic function fails.
This has an impact on reliability and safety and can be modeled
but it is typically not a failure per the stated requirements. If one
considers the necessary performance of the system as described by
the requirements, other degraded modes of operation may become
clear.
13.3 One failure mode, no diagnostics, and no common cause: one fail-
ure rate should be modeled.
13.4 Common cause failures must be modeled when redundant compo-
nents exist in a system.
13.5 It is important to distinguish detected failures from undetected
failures when the system takes a different action with these fail-
ures; for example, when detected failures de-energize a switch and
undetected failures do not. It is also important to distinguish these
failures when a different repair situation will exist, which is usu-
ally the case.
References
1. Shooman, M. L., and Laemmel, A. E. "Simplification of Markov
Models by State Merging." 1987 Proceedings of the Annual Reliability
and Maintainability Symposium. New York: IEEE, 1987.
System Architectures

Introduction
There are many ways in which to arrange control system components
when building a system. Some arrangements have been designed to maxi-
mize the probability of successful operation (reliability or availability).
Some arrangements have been designed to minimize the probability of
failure with outputs energized. Some arrangements have been designed to
minimize the probability of failure with outputs de-energized. Other
arrangements have been designed to protect against other specific failure
modes.
Four of the architectures, 1oo1, 1oo2, 2oo2 and 2oo3, have existed since the
early days of relay logic. The architectures with the D designation have
one or more output switches controlled by automatic diagnostics. These
diagnostics are used to control system failure modes and to modify the
failure behavior of units within the system. These architectures were
developed starting in the 1980s when microcomputer systems had enough
computing power to perform good automatic diagnostics.
EXAMPLE 14-1
What are the failure rate categories for the single board controller?
What are the diagnostic coverage factors for both safe and
dangerous failures?
Solution: The unit has eight input circuits, four output circuits, and
one common set of circuitry. It is assumed that this is a series system
(the failure of any circuit will cause failure of the entire unit) so the
failure rates may be added per Equation 7-6. For the single board
controller (SBC):
Conventional                  λ^SD   λ^SU   λ^DD   λ^DU   Total
Conv. In                      0      70     0      24     94
Conv. Out                     26     27     4      34     91
Conv. MPU                     144    203    289    272    908
Total (8 In + 4 Out + MPU)    248    871    305    600    2024
The diagnostic coverage factor for safe failures is the detected safe
failure rate (SD) divided by the total safe failure rate.
CS = 248 / (248 + 871) = 22.2%
The diagnostic coverage factor for dangerous failures is the detected
dangerous failure rate (DD) divided by the total dangerous failure
rate.
CD = 305 / (305 + 600) = 33.7%
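The totals and coverage factors follow mechanically from the per-circuit rates. A hypothetical Python sketch of the bookkeeping in Example 14-1:

```python
# Per-circuit FMEDA failure rates in FITS: (SD, SU, DD, DU).
circuits = {"in":  (0, 70, 0, 24),        # one input circuit
            "out": (26, 27, 4, 34),       # one output circuit
            "mpu": (144, 203, 289, 272)}  # common circuitry
counts = {"in": 8, "out": 4, "mpu": 1}    # circuits per unit

# Series system: unit rate is the sum of circuit rates (Equation 7-6).
totals = [sum(counts[c] * circuits[c][i] for c in circuits)
          for i in range(4)]
sd, su, dd, du = totals
print(f"SD={sd} SU={su} DD={dd} DU={du} total={sum(totals)}")

cs = sd / (sd + su)   # safe coverage: detected safe over total safe
cd = dd / (dd + du)   # dangerous coverage
print(f"CS = {cs:.1%}, CD = {cd:.1%}")   # 22.2% and 33.7%
```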
EXAMPLE 14-2
What are the failure rate categories for the single safety controller?
What are the diagnostic coverage factors for both safe and
dangerous failures?
Solution: The unit has eight input circuits, four output circuits, and
one common set of circuitry. It is assumed that this is a series system
so the failure rates may be added per Equation 7-6. For this SSC:
The diagnostic coverage factor for safe failures is the detected safe
failure rate (SD) divided by the total safe failure rate.
EXAMPLE 14-3
Solution: The four failure rate categories are divided using Equations
12-14 through 12-21.
EXAMPLE 14-4
Solution: The four failure rate categories are divided using Equations
12-14 through 12-21.
16.7 failures per billion hours
System Configurations
A number of configurations exist in real implementations of control sys-
tems. Some simplified configurations are representative. For these simpli-
fied configurations, fault trees and Markov models are developed.
The models will account for on-line diagnostic capability and common
cause. The Markov models will be solved using time dependent solutions
for a one year mission time. It is assumed that no manual inspection is
done during this year. Therefore, the important variable of manual proof
test coverage is not modeled. This approach shows the architectural differ-
ences more clearly but does not reflect the reality of industrial operations,
where mission times are much longer, with periodic manual proof testing
done during the mission. Models for those situations should include proof
test coverage as explained in Chapter 12.
1oo1Single Unit
The single controller with single microprocessing unit (MPU) and single
I/O (Figure 14-2) represents a minimum system. No fault tolerance is pro-
vided by this system. No failure mode protection is provided. The elec-
tronic circuits can fail safe (outputs de-energized, open circuit) or
dangerous (outputs frozen or energized, short circuit). Since the effects of
on-line diagnostics should be modeled, four failure categories are
included: DD (dangerous detected), DU (dangerous undetected), SD (safe
detected), and SU (safe undetected).
A fault tree for the probability of dangerous failure (sometimes called
probability of failure on demand, PFD), assuming perfect repair, gives

PFD_1oo1 = λ^DD RT + λ^DU MT     (14-1)

Integrating over the mission time period and dividing by MT gives the
average:

PFDavg_1oo1 = λ^DD RT + λ^DU MT/2     (14-2)
EXAMPLE 14-5
EXAMPLE 14-6
Figure 14-4. 1oo1 PFS Fault Tree

PFS_1oo1 = λ^SD SD + λ^SU SD     (14-3)

where SD in the time terms is the average startup time after a shutdown.
EXAMPLE 14-7
EXAMPLE 14-8
From the OK state, the 1oo1 system can reach three other states. State 1
represents the fail-safe condition. In this
state, the controller has failed with its outputs de-energized. State 2 repre-
sents the fail-danger condition with a detected failure. In this state, the
controller has failed with its outputs energized but the failure is detected
by on-line diagnostics and can be repaired. The 1oo1 system has also failed
dangerous in state 3 but the failure is not detected by on-line diagnostics.
Note that state 2 in a 1oo1 architecture can only exist if no automatic shut-
down mechanism has been added to the system. Although partial auto-
matic shutdown can be achieved in a 1oo1 architecture, a second switch is
required to guarantee an automatic shutdown for all dangerous failures
(for example, the output switch). Therefore, a single architecture with auto-
matic shutdown is different and was first named 1oo1D in an earlier edi-
tion of this book [Ref. 1].
EXAMPLE 14-9
Problem: Calculate the PFS and PFD from the Markov model of a
1oo1 single safety controller using failure rates from Example 14-2.
Average repair time is 72 hours. Average startup time after a
shutdown is 96 hours. The mission time interval is one year.
Solution: The failure rate and repair rate numbers are substituted
into a P matrix. The numeric matrix is:
The PFS at 8760 hours is 0.00015216. The PFD is the sum of state 2
and state 3 probabilities. At 8760 hours the PFD is 0.00021766. The
PFDavg is calculated by direct numerical average of the time
dependent PFD results. The PFDavg is 0.00013833.
The PFD for the single board controller in a 1oo1 architecture might be
In a 1oo2 architecture, both units must fail dangerous for the system to
fail dangerous. The 1oo2 configuration typically utilizes two independent
main processors, each with its own independent I/O (see Figure 14-7).
The system offers a low probability of failure on demand, but it increases
the probability of a fail-safe failure.
Figure 14-7. 1oo2 Architecture
PFD = (λ^DUN MT)² + 2 λ^DDN RT λ^DUN MT + (λ^DDN RT)² + λ^DDC RT + λ^DUC MT     (14-4)

The equation for the PFDavg can be derived by integrating the PFD equation
and dividing by the time interval. This average approximation is given by:
The equation for the PFDavg can be derived by integrating the PFD equa-
tion and dividing by the time interval. This average approximation is
given by:
PFDavg = (1/MT) ∫₀^MT [ (λ^DUN t)² + 2 λ^DDN λ^DUN RT t + (λ^DDN RT)² + λ^DDC RT + λ^DUC t ] dt

which evaluates to

PFDavg = (1/MT) [ (λ^DUN)² t³/3 + λ^DDN λ^DUN RT t² + (λ^DDN RT)² t + λ^DDC RT t + λ^DUC t²/2 ]₀^MT

PFDavg = (λ^DUN)² MT²/3 + λ^DDN λ^DUN RT MT + (λ^DDN RT)² + λ^DDC RT + λ^DUC MT/2     (14-5)
It should be noted that Equation 12-23 is the same as Equation 14-5 except
for the MT notation instead of the TI notation. The reader may compare
the equations to see how the level of model detail impacts the equation.
EXAMPLE 14-10
EXAMPLE 14-11
EXAMPLE 14-12
EXAMPLE 14-13
The on-line repair rate from state 4 to state 0 assumes that the repair tech-
nician will inspect the system and repair all failures when making a ser-
vice call. If that assumption is not valid, state 4 must be split into two
states, one with both controllers failed detected and the other with one
detected and one undetected. The state with both detected will repair to
state 0. The state with only one detected will repair to state 2. The assump-
--``,,`,,,`,,`,`,,,```,,,``,``,,-`-`,,`,,`,`,,`---
tion made for this model does simplify the model and has no significant
impact on the fail-danger probability unless coverage factors are low.
(repair and absorbing rows of the P matrix:)

| μ_SD   0   0   1 − μ_SD   0         0 |
| μ_O    0   0   0          1 − μ_O   0 |
| 0      0   0   0          0         1 |
Numeric solutions for the PFD, PFS, MTTF (mean time to failure) and
other reliability metrics can be obtained from this matrix using a spread-
sheet.
EXAMPLE 14-14
Solution: First the failure rate data is substituted into the transition
matrix:
EXAMPLE 14-15
EXAMPLE 14-16
[Fault tree: System Fails Safely; unit A fails (λ^SN) AND unit B fails (λ^SN)]
Ignoring common cause, an open circuit failure in both A and B must occur
for the system to fail safe. The first order approximation equation to solve
for PFS is
EXAMPLE 14-17
Problem: Two single board controllers are used in a 2oo2
architecture system. The failure rates are obtained from the FMEDA
of a conventional micro PLC (Example 14-3). The average repair time
is 72 hours. Mission time is one year (8760 hours). The average time
to restart the process after a shutdown is 96 hours. What is the
approximate PFS?
Solution: Using Equation 14-8, PFS2oo2 = 0.000058.
EXAMPLE 14-18
Problem: Two single safety controllers are wired into a 2oo2
architecture system. The failure rates are obtained from an FMEDA
(Example 14-4). The average repair time is 72 hours. Mission time is
one year (8760 hours). The average time to restart the process after
a shutdown is 96 hours. What is the approximate PFS?
Solution: Using Equation 14-8, PFS2oo2 = 0.00000336.
(repair and absorbing rows of the P matrix:)

| μ_SD   0   0   1 − μ_SD   0         0 |
| μ_O    0   0   0          1 − μ_O   0 |
| 0      0   0   0          0         1 |
Numeric solutions for the various reliability and safety metrics are practi-
cal and precise.
EXAMPLE 14-19
The PFS and PFD are calculated by multiplying the P matrix by a row
matrix S starting with the system in state 0. When this is done, the
PFD at 8760 hours = 0.00043076 and the PFS at 8760 hours =
0.000000321. Note that the PFS is very low. This is expected for a
2oo2 architecture.
Figure 14-15. 1oo1D Architecture
EXAMPLE 14-20
λ^SD = 0 FITS,
λ^SU = 30 FITS, and
λ^AU = 20 FITS.
Solution:
EXAMPLE 14-21
It is clear that the first term of the equation is most significant. This is
true until the diagnostic coverage gets above 98%. Then the second
term becomes significant and should not be ignored.
EXAMPLE 14-22
P = | 1 − (λ^SD + λ^SU + λ^DD + λ^DU + λ^AU)   λ^AU                               λ^SD + λ^SU + λ^DD   λ^DU        |
    | 0                                        1 − (λ^SD + λ^SU + λ^DD + λ^DU)    λ^SD + λ^SU          λ^DD + λ^DU |
    | μ_SD                                     0                                  1 − μ_SD             0           |
    | 0                                        0                                  0                    1           |
EXAMPLE 14-23
Two outputs from each controller unit are required for each output unit.
The two outputs from the three controllers are wired in a voting circuit,
which determines the actual output (Figure 14-20). The output will equal
the majority. When two sets of outputs conduct, the load is energized.
When two sets of outputs do not conduct, the load is de-energized.
A closer examination of the voting circuit shows that it will tolerate a fail-
ure in either failure modedangerous (short circuit) or safe (open circuit).
Figure 14-21 shows that when one unit fails open circuit, the system effec-
tively degrades to a 1oo2 configuration. If one unit fails short circuit the
system effectively degrades to a 2oo2 configuration. In both cases, the sys-
tem remains in successful operation.
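This degradation behavior follows directly from majority logic. A toy sketch in hypothetical Python (True meaning a unit's output conducts) verifies both cases:

```python
def vote_2oo3(a: bool, b: bool, c: bool) -> bool:
    """Load is energized when a majority of units conduct."""
    return (a and b) or (a and c) or (b and c)

# Unit A stuck open (never conducts): both B and C must conduct to
# keep the load energized -- the system behaves like 1oo2.
assert all(vote_2oo3(False, b, c) == (b and c)
           for b in (False, True) for c in (False, True))

# Unit A stuck short (always conducts): either B or C keeps the load
# energized -- the system behaves like 2oo2.
assert all(vote_2oo3(True, b, c) == (b or c)
           for b in (False, True) for c in (False, True))
print("degradation behavior verified")
```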
The 2oo3 system fails dangerous when any leg of the voting circuit fails
short circuit: the AB leg, the AC leg, or the BC leg. These are shown in the
top level events of the PFD fault tree of Figure 14-23. Each leg consists of
two switches wired in series like a
1oo2 configuration. The subtree for each leg is developed for the 1oo2 con-
figuration and each looks like Figure 14-8, the 1oo2 PFD fault tree. It
should be noted that the system will also fail dangerous if all three legs fail
dangerous. This can happen due to common cause or a combination of
three independent failures. Since this is a second order effect, it can be
assumed to be negligible for first order approximation purposes. This is
indicated in the fault tree with the incomplete event symbol. An approx-
imation equation for PFD can be derived from the fault tree. The first
order approximate equation for the PFD is:
Note that a factor of 3/2 was used to scale the beta model for common
cause for a three unit system. This was explained in Chapter 10.
EXAMPLE 14-24
EXAMPLE 14-25
dangerous failures except that the failure modes are different. This is the
result of the symmetrical nature of the architecture. Note that each major
event in the top level of the fault tree is equivalent to the 2oo2 fault tree of
Figure 14-13. The approximate equation for the PFS derived from this fault
tree is:
EXAMPLE 14-26
EXAMPLE 14-27
From states 1 and 2 the system is operating. Further safe failures will fail
the system with outputs de-energized (safe). Further dangerous failures
will lead to secondary degradation. From states 3 and 4 the system is oper-
ating in 2oo2 mode. Additional dangerous failures will fail the system
with outputs energized. Further safe failures degrade the system again.
Repair rates are added to the diagram. It is assumed that the system is
inspected and that all failed controllers are repaired if a service call is
made. Therefore, all repair rates transition to state 0.
Figure 14-26. 2oo3 Markov Model
where 1 indicates one minus the sum of all other row elements.
Reliability and safety metrics can be calculated from this P matrix using
numerical techniques.
EXAMPLE 14-28
The PFS and PFD are calculated by multiplying the P matrix by a row
matrix S starting with the system in state 0. When this is done, the
PFD at 8760 hours = 0.00000822 and the PFS at 8760 hours =
0.00000665.
2oo2D Architecture
The 2oo2D is an architecture that consists of two 1oo1D controllers
arranged in a 2oo2 style (Figure 14-28). Since the 1oo1D protects against
dangerous failures when the diagnostics detect the failure, two controllers
can be wired in parallel to protect against a false trip. Since the 2oo2
arrangement provides the best fault tolerance against false trips, a
well-implemented 2oo2D offers highly effective protection against false
trips while still providing safety. Effective
diagnostics are important to this architecture as an undetected dangerous
failure in either controller will fail the system dangerous.
EXAMPLE 14-29
EXAMPLE 14-30
The 2oo2D architecture shows good tolerance to both safe and dangerous
EXAMPLE 14-31
Problem: Using failure rate values from Example 14-29, calculate the
PFS and PFD of a 2oo2D system.
2oo2D Safety
P 0 1 2 3 4 5
0 0.99999512 0.00000464 0.00000016 0.00000005 0.00000000 0.00000004
1 0.01388889 0.98610864 0.00000000 0.00000245 0.00000002 0.00000000
2 0.00000000 0.00000000 0.99999753 0.00000245 0.00000000 0.00000002
3 0.01041667 0.00000000 0.00000000 0.98958333 0.00000000 0.00000000
4 0.01388889 0.00000000 0.00000000 0.00000000 0.98611111 0.00000000
5 0.00000000 0.00000000 0.00000000 0.00000000 0.00000000 1.00000000
Multiplying the P matrix times a starting matrix S gives the results that
the PFD at 8760 hours is 0.00031194 and the PFS at 8760 hours is
0.00000510.
1oo2D Architecture
The 1oo2D architecture is similar to the 2oo2D but it has extra control lines
to provide 1oo2 safety functionality. These control lines signal diagnostic
information between the two controller units. The controllers use this
information to control the diagnostic switches. Figure 14-32 shows the
1oo2D architecture. Comparing this to Figure 14-28, note the added con-
trol lines. 1oo2D is designed to tolerate both safe and dangerous failures.
The primary difference between 2oo2D and 1oo2D can be seen when a
dangerous undetected failure occurs in one controller. This is shown in
Figure 14-33. The upper unit has a failure that causes the output switch to
fail short circuit. The failure is not detected by the self-diagnostics in that
unit so the diagnostic switch is not opened by its control electronics. How-
ever, when the failure is detected by the other unit, the diagnostic switch
is opened via the additional control line.
EXAMPLE 14-32
EXAMPLE 14-33
The Markov model for a 1oo2D system is shown in Figure 14-36. Four sys-
tem success states are shown: 0, 1, 2 and 3. State 1 represents a safe
detected failure or a dangerous detected failure. As with 2oo2D, the result
of any detected failure is the same since the diagnostic switch de-energizes
the output whenever a failure is detected.
Another system success state, state 2, represents the situation in which one
controller has failed dangerous undetected. The system will operate
correctly in the event of a process demand because the other unit will
detect the failure and de-energize the load via its 1oo2 control lines (Figure
14-33).
The third system success state is shown in Figure 14-37. One unit has
failed with its output de-energized. The system load is maintained by the
other unit which will still respond properly to a process demand.
In state 1 the system has degraded to 1oo1D operation. A second safe fail-
ure or a dangerous detected failure will fail the system safe. As with
1oo1D, a dangerous undetected failure will fail the system dangerous.
Here, however, the system fails to state 5 where one unit has failed
detected and one unit has failed undetected. Note that an assumption has
been made that all units will be inspected and tested during a service call
so an on-line repair rate exits from state 5 to state 0.
In state 2 one unit has failed with a dangerous undetected failure. The sys-
tem is still successful since it will respond to a demand as described above.
From this state, any other component failure will fail the system danger-
ous. In Figure 14-38, a dangerous undetected failure occurred first in the
top unit and a safe detected failure then occurred in the lower unit. Since it
is assumed that any component failure in a unit causes the entire unit to
fail, it must be assumed that the control line from the lower unit to the
upper unit will not work. Under those circumstances, the system will not
respond to a process demand; it has failed dangerously.
Figure 14-38. 1oo2D with Two Failures, State 5
In state 3 one unit has failed safe undetected. In this condition the system
has also degraded to 1oo1D operation. Additional safe failures (detected
or undetected) or dangerous detected failures will cause the system to fail
safe. An additional dangerous undetected failure will fail the system dan-
gerous taking the Markov model to state 6 where both units have an unde-
tected failure. Failures from this state are not detected until there is a
manual proof test.
The transition matrix for the Markov model of Figure 14-36 is:
EXAMPLE 14-34
Problem: Using failure rate values for a single board safety rated
controller from Example 14-29, calculate the PFS and PFD of a
1oo2D system.
1oo2D Safety
P 0 1 2 3 4 5 6
0 0.999995115 0.000004637 0.000000035 0.000000163 0.000000049 0.000000000 0.000000000
1 0.013888889 0.986108644 0.000000000 0.000000000 0.000002449 0.000000018 0.000000000
2 0.000000000 0.000000000 0.999997533 0.000000000 0.000000000 0.000002366 0.000000101
3 0.000000000 0.000000000 0.000000000 0.999997533 0.000002449 0.000000000 0.000000018
4 0.010416667 0.000000000 0.000000000 0.000000000 0.989583333 0.000000000 0.000000000
5 0.013888889 0.000000000 0.000000000 0.000000000 0.000000000 0.986111111 0.000000000
6 0.000000000 0.000000000 0.000000000 0.000000000 0.000000000 0.000000000 1.000000000
In a variation of 1oo2D, if the automatic self-diagnostics do not detect the
fault, the comparison diagnostics in both units will de-energize the
diagnostic switches when a mismatch is detected in order to ensure safety.
The notation C2
will be used to indicate this additional diagnostic coverage factor. This is
one of several variations of the 1oo2D architecture. Additional variations
and modeling are shown in Reference 3.
EXAMPLE 14-35
Problem: Two single safety controllers with diagnostic channels are
used in a 1oo2D architecture system. Using the failure rates for a
safety rated single board computer with a diagnostic channel from
Example 14-25 and a coverage factor for comparison of 99.9%, what
is the approximate PFD?
Solution: Using Equation 14-17, PFD1oo2D = 0.0000000004 × 8760
+ [(1 − 0.999) × 0.0000000176 × 8760] = 0.00000315.
The system fails safe if either unit fails safe undetected, if both units fail
safe, or if there is an undetected failure that causes a comparison
mismatch. Note that the fault tree shows an incomplete symbol, which
indicates that the failures not detected by self-diagnostics or by the
comparison diagnostics are not included in the tree since they are orders
of magnitude smaller and therefore considered insignificant. The first
order approximation techniques can be used to generate a formula for the
probability of the system failing safe from this fault tree:

PFS1oo2D = (λSDC + λSUC + λDDC + 2C2·λSUN + 2C2·λDUN) × SD
           + (λSDN × RT + λDDN × RT)²    (14-18)
EXAMPLE 14-36
detected by the comparison. Previously undetected safe and dangerous
failures detected by the comparison diagnostics are added to the arc from
state 0 to state 4. Only failures undetected by the comparison diagnostics
will cause a transition to states 2 and 3.
The transition matrix for the Markov model of Figure 14-42 is:
EXAMPLE 14-37
Problem: Using failure rate values from Example 14-29, calculate the
PFS and PFD of a 1oo2D system with comparison diagnostics.
Comparing Architectures
When the results of the example calculations are examined, several of the
main features of different architectures become apparent. The results for
the four classic architectures (1oo1, 1oo2, 2oo2 and 2oo3) implemented
with a conventional micro PLC are compiled in Table 14-2. Of the four
architectures, the highest safety rating goes to the 1oo2 architecture, with
the lowest PFD of 0.00013. The 2oo3 architecture also does well, with a
PFD of 0.00031. Note that PFD for the 2oo3 is roughly three times worse
than 1oo2. This is to be expected as there are three sets of parallel switches
in 2oo3. It should also be noted that all PFD and PFS results will certainly
be different for different failure rates and other parameter values.
Of the classic architectures, 2oo2 has the lowest PFS as expected. Even so,
the 2oo2 PFS should theoretically be much lower. However, a close exami-
nation of the PFS equation shows that a unit with a safe undetected failure
would remain failed for the entire mission time. When this happens a sec-
ond safe failure in the other unit causes system failure. This situation dom-
inates the PFS result and clearly shows that automatic diagnostics are as
important as the architecture.
Table 14-3 compares the results for the four classic architectures using the
single safety controller models. It is interesting to note that the fault tree
results and the Markov results are similar for this set of parameters. The
failure rates used in the examples are sufficiently small to allow the first
order approximation to be reasonably accurate.
A comparison of all results for the safety PLC is shown in Table 14-4. It is
interesting to note that the fault tree results (denoted ft) and the
Markov results (denoted mm) are practically identical, with
the Markov providing slightly lower numbers. This is to be expected as
the Markov numerical solution technique provides more precision than
the approximation techniques. The author favors the Markov approach,
not because of these differences in the results, but because the Markov
model development is more systematic. At least for the author, with
techniques other than Markov models it is too easy to neglect
combinations of two or more failures that might have an impact for some
ranges of parameters.
The architectures 2oo3, 1oo2D, and 1oo2D with comparison diagnostics all
provide excellent safety. 2oo2, 2oo2D, 2oo3, 1oo2D, and 1oo2D with com-
parison diagnostics provide excellent operation without an excessive false
trip rate (low PFS).
Table 14-4. Single Safety Controller Model Results for All Architectures
(Architecture Comparison - 61508 Certified Safety PLC)

Architecture   PFDft        PFSft        PFDavg(ft)   PFDmm        PFSmm        PFDavg(mm)
1oo1           0.00021773   0.00015216   0.00013889   0.00021766   0.00015210   0.00013833
1oo2           0.00000440   0.00030128   -            0.00000440   0.00030112   0.00000279
2oo2           0.00043110   0.00000336   -            0.00043076   0.00000321   0.00027380
1oo1D          0.00015896   0.00023510   -            0.00015827   0.00023500   0.00007903
2oo3           0.00000822   0.00000826   -            0.00000822   0.00000665   0.00000521
2oo2D          0.00031467   0.00000548   -            0.00031194   0.00000510   0.00015600
1oo2D          0.00000318   0.00000524   -            0.00000345   0.00000510   0.00000166
1oo2D Comp.    0.00000315   0.00002364   -            0.00000315   0.00002368   0.00000158
The 2oo2D architecture can provide the best compromise if the automatic
diagnostics are excellent. This level of automatic diagnostics is being
achieved in new designs, especially those with microprocessors specifi-
cally designed for use in IEC 61508 certified systems. This new approach is
being pursued because failure probability comparison results like Table
14-4 show the results to be superior to the traditional approach of using
2oo3. When new architectures are used to provide better overall designs,
the value of probabilistic analysis as a design tool is clear.
Exercises
14.1 A safety instrumented function (SIF) uses three analog input chan-
nels, two digital input channels, and two digital output channels
on a conventional micro PLC controller. The following dangerous
failure rates are given:
Analog Input Channel dangerous failure rate = 7 FITS
Digital Input Channel dangerous failure rate = 24 FITS
Digital Output Channel dangerous failure rate = 34 FITS
Common and Main Processing Circuits dangerous
failure rate = 355 FITS
What is the total dangerous failure rate for the portion
of the controller used in the SIF?
14.2 We are given the following coverage factors for the single board
controller of Exercise 14.1:
Analog Input Channel = 0.97
Digital Input Channel = 0.99
Digital Output Channel = 0.95
Main Processing Circuits = 0.99
What are the DD and DU failure rates of all failure categories when
the board is used in a 1oo1 system configuration?
14.3 Two redundant digital input channels are implemented on a com-
mon circuit board and a beta factor of 3% is estimated. If the DU
failure rate of the digital input circuit is 0.24 FITS, what are the DUC
and DUN failure rates?
Answers to Exercises
14.1 The total dangerous failure rate is 3 × 7 + 2 × 24 + 355 + 2 × 34 = 492
FITS.
14.2 AI λDD = 0.97 × 7 = 6.8 FITS
AI λDU = (1 − 0.97) × 7 = 0.2 FITS
DI λDD = 0.99 × 24 = 23.8 FITS
DI λDU = (1 − 0.99) × 24 = 0.2 FITS
Com λDD = 0.99 × 355 = 351.4 FITS
Com λDU = (1 − 0.99) × 355 = 3.6 FITS
DO λDD = 0.95 × 34 = 32.3 FITS
DO λDU = (1 − 0.95) × 34 = 1.7 FITS
14.3 DI λDUC = 0.03 × 0.24 = 0.0072 FITS
DI λDUN = (1 − 0.03) × 0.24 = 0.2328 FITS
14.4 1oo2, 2oo3, 1oo2D, 1oo2D/Comparison. 2oo2D can achieve high
safety (low PFD) if it has effective automatic diagnostics.
14.5 2oo2, 2oo2D, 2oo3, 1oo2D, 1oo2D Comparison
14.6 Yes, previously undetected failures will be detected and repaired.
This results in higher safety and higher availability in redundant
systems.
14.7 AU modeling must be done when automatic diagnostic coverage
exceeds 98%. AU modeling should be considered when automatic
diagnostic coverage exceeds 95%.
14.8 Answer d. The D in the architecture name 1oo1D means that the
architecture has an independent switch to de-energize the output
when a failure is detected by the automatic diagnostics.
References
1. Goble, W. M. Evaluating Control System Reliability: Techniques and
Applications, First Edition. Research Triangle Park: ISA, 1992.
Risk Cost
Risk is usually defined as the probability of a failure event multiplied by
the consequences of the failure event. The consequences of a failure event
are measured in terms of risk cost. The concept of risk cost is a statistical
concept. An actual cost is not incurred each year. Actual cost is incurred
only when there is a failure event (an accident). The individual event cost
can be quite high. If event costs are averaged over many sites for many
years, an average risk cost per year can be established. If actions are taken
to reduce the chance of a failure event or the consequences of a failure
event, risk costs are lowered.
EXAMPLE 15-1
Risk Reduction
There are risks in every activity of life. Admittedly some activities involve
more risk than others. According to Reference 1, the chance of dying dur-
ing a 100 mile automobile trip in the midwestern United States is 1 in
588,000. The average chance each year of dying from an earthquake or vol-
cano is 1 in 11,000,000.
While inherent risk and even acceptable risk are very hard to quantify,
risk reduction is a little easier. Several methods have been proposed to
determine the amount of risk reduction needed, to at least an
order-of-magnitude level (Ref. 2).
demand. In the case of a de-energize-to-trip system, this is the probability
that the system will fail with its outputs energized. For low demand
systems, a dangerous condition occurs infrequently; therefore PFDavg
(Chapter 4) is the relevant measure of probability of dangerous failure. In
such cases the risk reduction factor (RRF) achieved is defined as:

RRF = 1 / PFDavg
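As a small illustration of this definition, the sketch below converts a PFDavg into an RRF and maps it onto the low-demand SIL bands of IEC 61508 (RRF 10-100 for SIL1 up through 100,000 for SIL4, consistent with the exercise answers at the end of this chapter).

def risk_reduction_factor(pfd_avg: float) -> float:
    # RRF = 1 / PFDavg for a low-demand safety instrumented function
    return 1.0 / pfd_avg

def sil_band(rrf: float) -> str:
    # Low-demand SIL bands per IEC 61508
    if 10 <= rrf < 100:
        return "SIL1"
    if 100 <= rrf < 1_000:
        return "SIL2"
    if 1_000 <= rrf < 10_000:
        return "SIL3"
    if 10_000 <= rrf < 100_000:
        return "SIL4"
    return "outside the SIL1-SIL4 bands"

rrf = risk_reduction_factor(0.1)
print(rrf, sil_band(rrf))   # 10.0 SIL1 (compare Exercise 15.2)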
[Table: safety integrity level vs. average probability of failure on demand (PFDavg, low demand), risk reduction factor (RRF), and typical applications such as chemical processes.]
[Risk graph: consequence (C1-C4), exposure frequency (F1, F2), possibility of avoidance (P1, P2), and demand rate (W1-W3) branches map to outcomes ranging from NSS/NS through SIL1-SIL4 and NPES.]
NS - No Safety Requirements
NSS - No Special Safety Requirements
NPES - Single SIS Insufficient
EXAMPLE 15-2
Figure 15-4. Screen Shot: Risk Graph for Personnel, Equipment, and Environmental (Ref. 5)
EXAMPLE 15-3
SIS Architectures
An SIS consists of three categories of subsystems: sensors/transmitters,
controllers, and final elements, which work together to detect and prevent,
or mitigate the effects of, a hazardous event. It is important to design a
system that meets RRF requirements.
production uptime (minimize false trips). In order to achieve these goals,
system designers often use redundant equipment in various architectures
(Chapter 14). These fault tolerant architectural configurations apply to
field instruments (sensors/transmitters and final elements) as well as to
controllers.
Sensor Architectures
An SIS must include devices capable of sensing potentially dangerous
conditions. There are many types of sensors used including flame detec-
tors (infrared or ultraviolet), gas detectors, pressure sensors, thermocou-
ples, RTDs (resistance temperature detectors), and many types of discrete
switches. These sensors can fail, typically in more than one failure mode.
Some sensor failures can be detected by on-line diagnostics in the sensor
itself or in the controller to which it is connected. In general, all of the reli-
ability and safety modeling techniques can be used.
EXAMPLE 15-4
The pressure transmitter is configured to send its output to 20.8 mA
(over-range) if a failure is detected by the internal diagnostics. The
transmitter is fully tested and calibrated every five years. In this
application, what is the PFDavg for a single transmitter in a 1oo1
architecture for a five year mission time? What is the Spurious Trip
Rate (STR)?
Solution: The trip amplifier will falsely trip if the transmitter fails with
its output signal high. The system will fail dangerous if the transmitter
fails with its output signal low or if it has a dangerous undetected
failure. The failure rates are therefore:
The term Spurious Trip Rate refers to the average rate at which a
subsystem will cause a shutdown when no dangerous condition
occurs. The STR is equal to the safe failure rate of 3.23 × 10⁻⁷ trips
per hour.
EXAMPLE 15-5
Solution: The logic solver will detect out-of-range current signals and
hold the last pressure value. Therefore, these failures are dangerous
detected (DD). The total DD failure rate is 356 FITS (one FIT = 1 × 10⁻⁹
failures per hour). Using Equation 14-2,
Spurious Trip Rate: Because the logic solver is configured to hold the
last pressure reading on failure of the sensor, no false trips will occur.
The STR is zero.
Figure 15-6 shows two discrete sensors measuring the same process vari-
able. These two sensors can be configured in a 1oo2 architecture by simply
adding logic to initiate a shutdown if either of the two sensors signals a
dangerous condition. Like the 1oo2 controller architecture (Chapter 14),
this configuration will substantially reduce the chance of a dangerous fail-
ure but will almost double the chance of a safe failure. Note that common
cause failures apply when redundant configurations are used.
EXAMPLE 15-6
Problem: What is the PFDavg for a 1oo2 sensor subsystem for a five
year mission time? What is the Spurious Trip Rate for the sensor
subsystem?
Solution: PFDavg can be calculated using Equation 14-5. Note that
there are no detected failures; therefore RT (average repair time) = 0.
PFDavg is 0.00014.
[Figure 15-6. Two discrete pressure sensors wired to separate logic solver discrete inputs; 1oo2 logic trips if either sensor indicates a trip is needed.]
The 1oo2 sensor concept can be applied to analog sensors as well. Figure
15-7 shows two analog sensors measuring the same process variable. A
high select or low select function block (depending on the fail-safe
direction) is used to select which analog signal will be used in the
calculation.
[Figure 15-7. Two 4-20 mA analog pressure transmitters wired to logic solver analog inputs; a high or low select (depending on the trip function) implements the 1oo2 logic.]
EXAMPLE 15-7
Problem: What are the PFDavg and Spurious Trip Rate for a 1oo2
analog sensor subsystem?
Solution: The logic solver will detect out-of-range signals and hold
the last pressure value. Therefore, these out-of-range failure rates
are classified as dangerous detected (DD). The total DD failure rate
for each sensor is 356 FITS. Considering common cause, the failure
rates are:
PFDavg = 0.000088
STR = 0.
[Figure: final element assembly consisting of an air supply, solenoid valve, actuator, and valve.]
EXAMPLE 15-8
[Figure: logic solver discrete output driving two valve assemblies.]
EXAMPLE 15-9
Problem: Using the failure rates of Example 15-8, what are the STR
and PFDavg of a 1oo2 final element assembly? Assume a common-
cause beta factor of 10%. No diagnostics are available. The final
element assembly is removed from service and tested/rebuilt every
five years.
λSUC = 24 FITS
λSUN = 219 FITS
λDUC = 105 FITS
λDUN = 942 FITS
EXAMPLE 15-10
Solution: Assuming a constant failure rate for the impulse line, the
failure rate is calculated using Equation 4-18: λ = 1/(20 × 8760) =
0.0000057 failures per hour. Equation 14-2 can be used to
approximate the PFDavg. Without the diagnostic algorithm the entire
failure rate must be classified as dangerous undetected.
Exercises
15.1 A process is manned continuously and has no risk reduction mech-
anism. An accident could cause death to multiple persons. Danger-
ous conditions do build slowly and alarm mechanisms should
warn of dangerous conditions before an accident. Using a risk
graph, determine how much risk reduction is needed.
15.2 A quantitative risk assessment indicates an inherent risk cost of
$250,000 per year for an industrial process. Plant management
would like to reduce the risk cost to less than $25,000 per year.
What risk reduction factor is required? What SIL classification is
this?
15.3 What components must be considered in the analysis of an SIF?
15.4 A process connection clogs every year on average. This is a danger-
ous condition. No diagnostics can detect this failure. Assuming a
constant failure rate, what is the dangerous undetected failure
rate?
15.5 Using the failure rate of Exercise 15.4, what is the approximate
PFDavg for a three-month inspection interval in a 1oo1
architecture?
Answers to Exercises
15.1 Death to multiple persons is classified as a C3 consequence. The
frequency of exposure is continuous, F2. Alarms give a possibility
of avoidance, P1. Since no protection equipment is installed the
probability of unwanted occurrence is high, W3. The risk graph
indicates SIL3. The risk reduction factor needed for an SIF would
be in the range of 1,000 to 10,000.
15.2 The necessary risk reduction factor is 250,000/25,000 = 10. This is
classified as SIL1.
15.3 SIF reliability and safety analysis should consider all components
from sensor process connection to valve process connection. Typi-
cally this includes impulse lines, manifolds, sensors, controllers,
power supplies, solenoids, air supplies, valve actuators, and valve
elements. If communications lines are required for safe shutdown
by an SIS then they must be included in the analysis.
15.4 All failures are dangerous undetected. The dangerous undetected
failure rate is 1/8760 = 0.000114155 failures per hour.
15.5 Using Equation 14-2, the approximate PFDavg is calculated as
0.125. At this high failure rate the approximation method is
expected to have some error. Use of the full equation or a Markov
based tool would eliminate the error.
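The size of that error is easy to check numerically. The sketch below compares the first-order approximation λ·TI/2 used in this answer with the exact time-averaged unavailability for a constant failure rate, PFDavg = 1 − (1 − e^(−λ·TI))/(λ·TI); this closed form assumes the simple single-component model, so it is an illustration rather than the book's full equation.

import math

lam = 1.0 / 8760.0   # dangerous undetected failure rate from Exercise 15.4, per hour
ti = 2190.0          # three-month inspection interval, hours

approx = lam * ti / 2.0
exact = 1.0 - (1.0 - math.exp(-lam * ti)) / (lam * ti)

print(f"first-order PFDavg = {approx:.4f}")   # 0.1250
print(f"exact PFDavg       = {exact:.4f}")    # 0.1152, about 8% lower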
References
1. A Fistful of Risks. Discover Magazine. NY: Walt Disney Magazine
Publishing Co., May 1996.
7. Failure Modes Effects and Diagnostic Report, exida, Delta Controls S21
Pressure Switch, DRE 06/06-33 R001. Surrey, West Molesey: Delta
Controls Ltd., Oct. 2006.
9. Safety Equipment Reliability Handbook, Volume 3, Third Edition,
Page 139, Bettis CB/CBA Series Spring Return Actuator. Sellers-
ville: exida, 2007.
12. Wehrs, D., Detection of Plugged Impulse Lines Using Statistical Process
Monitoring Technology. Chanhassen: Rosemount, 2006.
The cost factors that need to be included in a lifecycle cost analysis vary
from system to system. Specific cost factors may not be relevant to a spe-
cific system. The object is to include all costs of procurement and opera-
tion over the lifecycle of a system. The various costs occur at different
times during system life. This is not a problem, since commonly-used life-
cycle costing techniques account for interest rates and the time value of
money.
The two primary categories of costs are procurement cost and operating
cost. These costs are added to obtain total lifecycle cost:

Total Lifecycle Cost = Procurement Cost + Operating Cost
Procurement Costs
Procurement costs include system design cost, purchase cost of the equip-
ment including initial training, installation cost, and start-up/system com-
missioning cost. These costs occur only once. The total is obtained by
summing:

Procurement Cost = Design Cost + Purchase Cost + Installation Cost + Start-up Cost
The next step, detailed design work, usually represents the biggest cost.
The effort required depends to a great extent on the available system tools
and experience of the engineering staff. System tools that include a control
point database, good report-generation facilities, a graphics language pro-
gramming interface, and other computer-aided design assistance can
really cut costs. Drawings and other documentation costs are also affected
by the choice of tools; computer-aided graphics tools save time and reduce
system level design errors.
Good training can reduce engineering costs. This is especially true when
the engineers have no experience with the control system. A good training
program can jump start the design effort.
Purchase Costs
Purchase costs always include the cost of equipment. Sometimes
neglected are the additional costs of cabinetry, environmental protection
(air conditioning, etc.), wire termination panels, factory acceptance tests (if
required), initial training, initial service contracts, and shipping. Purchase
costs normally represent the focal point of any system comparison. This is
quite understandable, since capital expenditures often require a tedious
approval procedure and many projects compete for a limited capital bud-
get. These costs are also easy to obtain and have a high level of certainty.
However, it is a mistake to consider only purchase cost.
Installation Costs
Installation costs must account for delivery of the equipment to the final
location, placing and mounting, any weatherization required, piping, and
wiring. These costs are affected by the design of the equipment. Small
modular cabinets are easier to move and mount compared to large cabinet
assemblies. Wiring is simplified when wire termination panels are easy to
access. Some systems offer field mountable termination panels to reduce
wiring costs. Remote I/O and distributed I/O are also concepts that
are designed to reduce wiring costs by placing the I/O near the
transmitter.
Start-Up Costs
Start-up costs must include the inevitable system test, debug process, and
safety functional verification. Configuration tools that help manage sys-
tem change, and automatic change documentation, can cut costs in this
area. Testing and full functional verification are a big portion of start-up
costs. Testing is expedited when the control system allows detailed
monitoring and forcing of I/O points. On-line display of system variables, ideally
within the context of the system design drawings, can also speed the test-
ing and debugging of complex systems.
Operating Costs
Operating costs can overshadow procurement costs in many systems.
Depending on the consequences of lost production, operating costs may
dominate any lifecycle cost study. One experienced system designer
stated, "Our plant manager simply cannot tolerate a shutdown due to
control system failure!" There are reasons for this attitude; the cost of a
shutdown can be extremely high.
The cost of a shutdown is not the only operating cost. Other operating
costs include the cost of system engineering changes (both software and
hardware).
In the case of Safety Instrumented Systems (SIS), the risk cost (Chapter 15)
is increased when the system probability of failure on demand goes up.
While safety ratings in terms of the required safety integrity level are usu-
ally dictated by a safety study, reduced risk cost may justify even higher
levels of safety integrity.
Engineering Changes
Engineering changes are part of operating costs. All systems are inevitably
changed. As a system is used, it becomes obvious that changes will
improve system operation. Everything cannot be anticipated by the origi-
nal designers. System-level design faults must be repaired. The strength of
the system design can be increased.
The cost of making a change can vary widely. Upper bound and lower
bound estimates should be made. Factors to be considered when making
the estimate include the procedures required to change the system, the
ease of changing the documentation, and the ease of testing and verifying
the changes. In the case of Safety Instrumented Systems a review of initial
hazard analysis and tasks such as updates to the periodic maintenance test
procedures must be considered. System capabilities affect these estimates.
Systems that have automatic updating of system design documentation
can be changed with much less effort.
Consumption Costs
System operation requires energy consumption and parts consumption.
These costs are typically estimated on an annual basis. Energy consump-
tion includes both the energy required to operate the control equipment
and the energy required to maintain an equipment environment when the
control system does not have sufficient internal strength to withstand spe-
cific environmental stresses.
The timing of the failure may affect cost. If the braking system of your
automobile fails just as another vehicle pulls in front of you, the costs are
likely to be high. Brake failure when coming to a stop on a long straight
empty road is likely to cost much less. This illustrates the concept of fail-
ure on demand. The concept is particularly important to an SIS. A cata-
strophic event can be extremely expensive.
All of these factors complicate the task of estimating the expected cost of
system failure. Though the determination is complicated, general risk
analysis techniques apply. Reliability analysis techniques provide the
probability of failure information. Reliability parameters can be used to
calculate the expected costs. Costs of failure are multiplied by the proba-
bility of failure.
Repair labor and lost production costs are associated with system
downtime and should be estimated in consistent units (typically on a
per-hour basis). For a fully repairable, single-failure-state system, the
yearly failure cost is obtained by multiplying the hourly cost of downtime
by the system unavailability and by the yearly operating hours (Equations
16-3 and 16-4).
The number of operating hours used in Equation 16-3 and Equation 16-4 is
typically the number of hours in one year (8,760). In systems that do not
operate continuously, actual operational hours are used.
CFailure = CE × UE    (16-5)
COP = [(CEC + CFM + CCC + CFailure) × years of life]    (16-6)
A lifecycle cost analysis requires a reliability analysis, costs estimates, and
a lifecycle analysis. The hardest part is sometimes getting started. What
factors must be accounted for in your system? Bringing together the points
that have been discussed earlier in the chapter, Table 16-1 shows a check-
list of lifecycle cost items that should be considered.
EXAMPLE 16-1
Design Cost
52 Hours @ $75/Hour Engineering Time
22 Hours @ $45/Hour Drawing/Documentation Time
16 Hours @ $75/Hour Safety Review
One repair technician for 10 machines.
Consumption Costs
Electricity $1,200/Year
Lubricating Oil $200/Year
Filters $100/Year
Failure Costs
Repair Labor Rate $100/Hour
Lost Production Cost $2,000/Hour
52 × $75 = $3,900
22 × $45 = $990
16 × $75 = $1,200
Purchase $120,000
Installation Cost
Truck Rental $300
32 Hours @ $75/Hour
Start-Up Cost
Training Course Fee $1,500
80 Hours @ $75/Hour Training Time
10 Hours @ $75/Hour Equipment Assembly
The reliability analysis has determined that the forging machine has a
steady-state availability of 98.04% and a steady-state unavailability of
1.96%. Using unavailability, the yearly failure cost is calculated by
using Equation 16-3.
$367,561.60 × 10 = $3,675,616
Please note that failure costs for this example are much larger than
other costs.
EXAMPLE 16-2
EXAMPLE 16-3
There are two failure modes in the temperature control system. If the
control system fails with its outputs energized, the excess heat will
destroy the entire extruding machine. Including the cost of lost
production, this event would cost $1,200,000. If the temperature
control system fails with its outputs de-energized, the heat is removed
and the material in the extruder will harden. The extruding machine
must be rebuilt if this occurs. The cost of a machine rebuild, including
the cost of lost production, is $200,000 per event.
Discount Rate
Almost everyone is familiar with the concept of compound interest. An
amount of money (or principal) is invested at a particular interest rate. The
interest earned is reinvested so that it too earns interest. Inflation is
another familiar concept. When considering the time value of money, both
interest rates and inflation rates must be taken into account. A term that
combines both is called the discount rate. While it might seem logical to
simply take the interest rate and subtract the inflation rate, the concept of
discount rate also includes some judgment regarding the uncertainty of
numbers estimated for the future. Financial experts also consider financial
risk. Risky projects are often given a higher discount rate. The discount
rate R is expressed as a percentage. The best source for the discount rate
estimate is the corporate financial officer.
At the end of the first year, the invested principal M has grown to

M + MR = M(1 + R)

This amount is considered the principal for the second year. At the begin-
ning of the third year, the compounded amount of money is:

M(1 + R)(1 + R) = M(1 + R)²

This formula can be generalized for any number of years because the pat-
tern continues. For any quantity of years, N, the future value of money,
FV, after N years equals:

FV = M(1 + R)^N    (16-9)
EXAMPLE 16-4
Present Value
Another way to account for the time value of money is to calculate the
present value of some future expense. This is the equivalent of asking how
much money to invest now in order to pay for some future expense. The
equation for present value is obtained from Equation 16-9. Look at Exam-
ple 16-4. The compound amount $447,700 is the future value. The princi-
pal of $250,000 is the present value of $447,700. To directly calculate a
present value when a future value is known, starting with Equation 16-9
and substituting M for FV and PV for M, the equation for PV is:
PV = M / (1 + R)^N    (16-10)

where PV is the present value, M is the future amount, R is the discount
rate, and N is the number of years.
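The two equations are simple enough to check with a few lines of code. In the sketch below, the 6% rate and ten-year term are assumptions chosen to match the quoted figures ($250,000 growing to roughly $447,700).

def future_value(m: float, r: float, n: int) -> float:
    return m * (1.0 + r) ** n        # Equation 16-9

def present_value(m: float, r: float, n: int) -> float:
    return m / (1.0 + r) ** n        # Equation 16-10

print(f"{future_value(250_000, 0.06, 10):,.0f}")    # ~447,712
print(f"{present_value(447_700, 0.06, 10):,.0f}")   # ~249,993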
EXAMPLE 16-5
EXAMPLE 16-6
Total yearly costs are the sum of the yearly fixed costs and the yearly
failure costs. Total yearly costs are $367,561.60.
For each cost, calculate the cost with discount. This is done using
Equation 16-9. Initial costs must include finance charges for the
entire 10-year period. The calculation is as follows:
The calculation is repeated for each year and the results are added.
A personal computer spreadsheet program is highly recommended
for this type of problem. Such programs are quick to set up, reduce
mistakes, and provide great flexibility. Discount rates and costs can
be varied each year. A listing obtained from a spreadsheet shows
yearly costs for the problem.
EXAMPLE 16-7
Year 6 Present Value 274,280.13
Year 7 Present Value 261,219.17
Year 8 Present Value 248,780.16
Year 9 Present Value 236,933.48
Year 10 Present Value 225,650.94
Initial costs 137,000.00
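The spreadsheet-style listing above can be reproduced with a short script. The sketch below discounts the total yearly cost of $367,561.60 from Example 16-6 over ten years; the 5% discount rate is an assumption that matches the listed present values to within fractions of a dollar.

yearly_cost = 367_561.60   # total yearly cost from Example 16-6
rate = 0.05                # assumed discount rate
initial_costs = 137_000.00

total = initial_costs
for year in range(1, 11):
    pv = yearly_cost / (1.0 + rate) ** year
    total += pv
    print(f"Year {year:2d} present value {pv:12,.2f}")
print(f"Total lifecycle cost {total:14,.2f}")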
Annuities
An annuity is a sequence of payments made at fixed periods of time over a
given time interval (it is assumed that payments are made at the end of
each period). Yearly lifecycle costs can be modeled as an annuity. Both
future value costs and present value costs can be calculated.
The future value of an annuity can be obtained by applying Equation 16-9
to each year's payment:

FV = M + M(1 + R)^1 + M(1 + R)^2 + … + M(1 + R)^(N−1)

Multiplying both sides by (1 + R) and subtracting the original series
leaves only the first and last terms:

FV(1 − (1 + R)) = M − M(1 + R)^N

Therefore,

FV = M[(1 + R)^N − 1] / R
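A quick numeric check of the closed form against the direct year-by-year summation, with an assumed $1,000 yearly payment at 5% over ten years:

M, R, N = 1_000.0, 0.05, 10

direct = sum(M * (1.0 + R) ** k for k in range(N))   # M + M(1+R) + ... + M(1+R)^(N-1)
closed = M * ((1.0 + R) ** N - 1.0) / R

print(f"{direct:,.2f}  {closed:,.2f}")               # both 12,577.89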
EXAMPLE 16-8
EXAMPLE 16-9
Risk costs are calculated by multiplying the cost of an event times the
probability of an event. If an SIS is used, the probability of an event equals
the probability of an event without the SIS times the PFD of the SIS.
PEVENT with SIS = PEVENT without SIS × PFDSIS    (16-13)
Risk costs are included in yearly costs and may be treated as an annuity.
EXAMPLE 16-10
Problem: With no risk reduction, the probability of an event is 0.01. If
an SIS is added with a PFD of 0.001, what is the probability of an
event?
Solution: Using Equation 16-13, PEVENT with SIS = 0.01 × 0.001 =
0.00001.
Other cost factors that must be considered when adding an SIS to a pro-
cess include the cost of a false trip. An SIS may fail safe, falsely tripping
the system. The decreased risk cost must be greater than the increased
production downtime cost. Other lifecycle cost factors are much the same
for an SIS as for other process control systems.
EXAMPLE 16-11
Solution: The risk cost without the SIS equals 0.01 × $2,000,000 =
$20,000/year. The risk cost with the SIS equals 0.01 × 0.001 ×
$2,000,000 = $20/year. With the SIS added to the system,
incremental trip costs equal 0.01 × $1,000 = $10/year. The cost
comparisons are:
No SIS SIS
Procurement Costs: $0 $50,000
Yearly Risk/Failure Cost: $20,000 $30
Yearly Operational Cost: $0 $600
Converting yearly expenses to present value and adding the totals for
each year:
No SIS Discount 5%
Total Yearly $20,000 Cumulative $0
PV year 1 $19,048 Year 1 $19,048
PV year 2 $18,141 Year 2 $37,188
PV year 3 $17,277 Year 3 $54,465
PV year 4 $16,454 Year 4 $70,919
PV year 5 $15,671 Year 5 $86,590
Total lifecycle costs for five years $86,590
SIS Discount 5%
Total Yearly $630 Cumulative $50,000
PV year 1 $600 Year 1 $50,600
PV year 2 $571 Year 2 $51,171
PV year 3 $544 Year 3 $51,716
PV year 4 $518 Year 4 $52,234
PV year 5 $494 Year 5 $52,728
Total lifecycle costs for five years $52,728
Exercises
16.1 A control system has a procurement cost of $100,000. Fixed yearly
costs are $5,000. The system has an availability of 99%. The failure
costs are $1,000 per hour. The system will be used for five years.
During this time period, inflation and interest are identical so the
discount rate will be zero (no need to account for the time value of
money). Calculate the lifecycle cost.
16.2 The control system of Exercise 16.1 is fully repairable, with avail-
ability calculated using MTTF equals 9900 hours and MTTR equals
100 hours. An expert system can be purchased at a cost of $10,000.
This expert system will diagnose error messages and identify
failed components. The MTTR will be reduced to 50 hours. What is
the new availability? What is the new lifecycle cost? Should the
expert system be purchased?
16.3 The control system of Exercise 16.1 is available with dual redun-
dant modules. The extra modules will increase fixed yearly costs to
$6,000. The MTTF will increase to 100,000 hours. The MTTR
remains at 50 hours when the expert system is used. The procure-
ment cost increases to $210,000. What is the new availability? What
is the new lifecycle cost? Should the dual redundant modules be
purchased?
16.4 An expenditure of $100,000 must be made. The discount rate is 4%.
What is the future value of this expenditure after five years?
16.5 An expenditure of $150,000 at today's cost must be made in five
years. What is the amount that must be invested now (present
value) in order to purchase the item in five years? Assume that the
discount rate is 5%.
16.6 Repeat Exercise 16.1 assuming that the discount rate is 5%. How
much did the lifecycle cost change?
Answers to Exercises
16.1 Yearly failure costs equal 0.01 × $1,000 per hour × 8,760 hours per
year = $87,600. The totals are:
Availability 0.99
Procurement Costs: $100,000
Yearly Risk/Failure Cost: $87,600
Yearly Operational Cost: $5,000
Discount Rate 0%
Total Yearly $92,600 Cumulative $100,000
1 $92,600 Year 1 $192,600
2 $92,600 Year 2 $285,200
3 $92,600 Year 3 $377,800
4 $92,600 Year 4 $470,400
5 $92,600 Year 5 $563,000
16.6 At a discount rate of 5%, the lifecycle cost dropped from $563,000
to $500,910.
Availability 0.99
Procurement Costs: $100,000
Yearly Risk/Failure Cost: $87,600
Yearly Operational Cost: $5,000
Discount Rate 5%
Total Yearly $92,600 Cumulative $100,000
1 $88,190 Year 1 $188,190
2 $83,991 Year 2 $272,181
3 $79,991 Year 3 $352,173
4 $76,182 Year 4 $428,355
5 $72,555 Year 5 $500,910
Standard normal distribution function Φ(z):
z 0.00 0.01 0.02 0.03 0.04
0.3 0.617912 0.621720 0.625516 0.629301 0.633072
0.4 0.655422 0.659098 0.662758 0.666403 0.670032
0.5 0.691463 0.694975 0.698469 0.701945 0.705402
0.6 0.725748 0.729070 0.732372 0.735654 0.738915
0.7 0.758037 0.761149 0.764239 0.767306 0.770351
0.8 0.788146 0.791031 0.793893 0.796732 0.799547
0.9 0.815941 0.818590 0.821215 0.823816 0.826392
1.0 0.841346 0.843754 0.846137 0.848496 0.850831
1.1 0.864335 0.866502 0.868644 0.870763 0.872858
1.2 0.884931 0.886862 0.888769 0.890653 0.892513
1.3 0.903201 0.904903 0.906584 0.908242 0.909878
1.4 0.919244 0.920731 0.922197 0.923643 0.925067
1.5 0.933194 0.934479 0.935745 0.936993 0.938221
1.6 0.945202 0.946302 0.947385 0.948450 0.949498
1.7 0.955435 0.956368 0.957285 0.958186 0.959071
1.8 0.964070 0.964853 0.965621 0.966376 0.967117
1.9 0.971284 0.971934 0.972572 0.973197 0.973811
2.0 0.977251 0.977785 0.978309 0.978822 0.979325
2.1 0.982136 0.982571 0.982998 0.983415 0.983823
2.2 0.986097 0.986448 0.986791 0.987127 0.987455
2.3 0.989276 0.989556 0.989830 0.990097 0.990359
2.4 0.991803 0.992024 0.992240 0.992451 0.992657
2.5 0.993791 0.993964 0.994133 0.994297 0.994458
2.6 0.995339 0.995473 0.995604 0.995731 0.995855
2.7 0.996533 0.996636 0.996736 0.996834 0.996928
2.8 0.997445 0.997523 0.997599 0.997673 0.997745
2.9 0.998134 0.998193 0.998250 0.998305 0.998359
3.0 0.998650 0.998694 0.998736 0.998777 0.998817
3.1 0.999033 0.999065 0.999096 0.999126 0.999156
z 0.05 0.06 0.07 0.08 0.09
0.0 0.519939 0.523922 0.527903 0.531882 0.535857
0.1 0.559618 0.563560 0.567495 0.571424 0.575346
0.2 0.598707 0.602569 0.606420 0.610262 0.614092
0.3 0.636831 0.640577 0.644309 0.648028 0.651732
0.4 0.673646 0.677243 0.680823 0.684387 0.687934
0.5 0.708841 0.712261 0.715662 0.719044 0.722406
0.6 0.742155 0.745374 0.748572 0.751749 0.754904
0.7 0.773374 0.776374 0.779351 0.782306 0.785237
0.8 0.802339 0.805107 0.807851 0.810571 0.813268
0.9 0.828945 0.831474 0.833978 0.836458 0.838914
1.0 0.853142 0.855429 0.857692 0.859930 0.862145
1.1 0.874929 0.876977 0.879001 0.881001 0.882978
1.2 0.894351 0.896166 0.897959 0.899729 0.901476
1.3 0.911493 0.913086 0.914658 0.916208 0.917737
1.4 0.926472 0.927856 0.929220 0.930564 0.931889
1.5 0.939430 0.940621 0.941793 0.942948 0.944084
1.6 0.950529 0.951544 0.952541 0.953522 0.954487
1.7 0.959942 0.960797 0.961637 0.962463 0.963274
1.8 0.967844 0.968558 0.969259 0.969947 0.970622
1.9 0.974413 0.975003 0.975581 0.976149 0.976705
2.0 0.979818 0.980301 0.980774 0.981238 0.981692
2.1 0.984223 0.984614 0.984997 0.985372 0.985738
2.2 0.987776 0.988090 0.988397 0.988697 0.988990
2.3 0.990614 0.990863 0.991106 0.991344 0.991576
2.4 0.992858 0.993054 0.993245 0.993431 0.993613
2.5 0.994614 0.994767 0.994915 0.995060 0.995202
2.6 0.995976 0.996093 0.996208 0.996319 0.996428
2.7 0.997021 0.997110 0.997197 0.997282 0.997365
2.8 0.997814 0.997882 0.997948 0.998012 0.998074
2.9 0.998411 0.998462 0.998511 0.998559 0.998605
3.0 0.998856 0.998894 0.998930 0.998965 0.998999
3.1 0.999184 0.999211 0.999238 0.999264 0.999289
3.2 0.999423 0.999443 0.999462 0.999481 0.999499
3.3 0.999596 0.999611 0.999624 0.999638 0.999651
3.4 0.999720 0.999730 0.999740 0.999750 0.999759
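The tabulated values are the standard normal distribution function Φ(z) and can be reproduced from the error function, as the short sketch below shows (agreement is within one digit in the last decimal place of the table).

import math

def phi(z: float) -> float:
    # Standard normal cumulative distribution function
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

print(f"{phi(1.00):.6f}")   # 0.841345, table row 1.0, column 0.00
print(f"{phi(1.05):.6f}")   # 0.853141, table row 1.0, column 0.05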
Matrix Math

The Matrix
A matrix is an array of numeric values or algebraic variables. The matrix is
written with rows and columns enclosed in brackets. There is no single
numerical value. An example 3 × 3 matrix (third order) is shown below.
    | a11 a12 a13 |
P = | a21 a22 a23 |
    | a31 a32 a33 |
Column Matrix
If a matrix has only one column, it is known as a column matrix or column
vector. An example is shown below.
    |  3  |
C = |  1  |
    | 0.6 |
    |  2  |
Row Matrix
If a matrix has only one row, it is known as a row matrix or row vector. An
example is shown below.
S = | 1 0 0 0 0 |
Identity Matrix
The identity matrix is a square matrix in which all the elements are zero
except those on a diagonal from upper left to lower right. The diagonal
elements are unity.
1 0 0 0 0
0 1 0 0 0
I = 0 0 1 0 0
0 0 0 1 0
0 0 0 0 1
Matrix Addition
Two matrices of the same size can be added. The elements in
corresponding positions are summed:

| a b c |   | g h i |   | a+g b+h c+i |
| d e f | + | j k l | = | d+j e+k f+l |    (B-1)
Matrix Subtraction
Two matrices of the same size can be subtracted. The elements in
corresponding positions are subtracted.
| a b c |   | g h i |   | a-g b-h c-i |
| d e f | - | j k l | = | d-j e-k f-l |    (B-2)
Matrix Multiplication
Two matrices may be multiplied if the number of columns of the first
matrix equals the numbers of rows of the second matrix. The result will be
a matrix that has a quantity of rows equal to that of the first matrix and a
quantity of columns equal to that of the second matrix.
| a b c |   | g h |   | ag+bi+ck  ah+bj+cl |
| d e f | × | i j | = | dg+ei+fk  dh+ej+fl |    (B-3)
            | k l |
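A small numeric illustration of Equation B-3: a 2 × 3 matrix times a 3 × 2 matrix produces a 2 × 2 result.

import numpy as np

A = np.array([[1, 2, 3],
              [4, 5, 6]])      # 2 rows, 3 columns
B = np.array([[7, 8],
              [9, 10],
              [11, 12]])       # 3 rows, 2 columns

print(A @ B)                   # [[ 58  64]
                               #  [139 154]]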
Matrix Inversion
There is no procedure that allows matrices to be divided. However, we
can obtain the reciprocal, or "inverse," of a square matrix. The inverse of
one matrix can be multiplied by another matrix in a manner analogous to
algebraic division.
The inverse (M⁻¹) of a square matrix (M) is another square matrix defined
by the relation:

M M⁻¹ = I    (B-4)

This is analogous to the scalar relation 4 × (1/4) = 1.
EXAMPLE B-1
    | 4 0 5 |
M = | 0 1 6 |
    | 3 0 4 |

and

      |  4  0  -5 |
M⁻¹ = | 18  1 -24 |
      | -3  0   4 |

Multiplying M by M⁻¹ gives the identity matrix:

| 4 0 5 |   |  4  0  -5 |   | 1 0 0 |
| 0 1 6 | × | 18  1 -24 | = | 0 1 0 |
| 3 0 4 |   | -3  0   4 |   | 0 0 1 |
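The multiplication in Example B-1 can be checked numerically in a couple of lines:

import numpy as np

M = np.array([[4, 0, 5], [0, 1, 6], [3, 0, 4]])
M_inv = np.array([[4, 0, -5], [18, 1, -24], [-3, 0, 4]])

print(M @ M_inv)         # the 3x3 identity matrix
print(np.linalg.inv(M))  # the same inverse, computed directly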
The inverse can be found by appending the identity matrix to form a
composite matrix:

| a b c | 1 0 0 |
| d e f | 0 1 0 |
| g h i | 0 0 1 |

and manipulating rows until the left side becomes the identity matrix:

| 1 0 0 | j k l |
| 0 1 0 | m n o |
| 0 0 1 | p q r |

The matrix

      | j k l |
M⁻¹ = | m n o |
      | p q r |

is then the inverse of

    | a b c |
M = | d e f |
    | g h i |
EXAMPLE B-2
Problem: Find the inverse of the matrix

    | 4 0 5 |
M = | 0 1 6 |
    | 3 0 4 |

Solution: Append the identity matrix to form the composite:

| 4 0 5 | 1 0 0 |
| 0 1 6 | 0 1 0 |
| 3 0 4 | 0 0 1 |

The objective is to manipulate matrix rows until the left side of the
composite equals the identity matrix. One good strategy is to put zeros
into the left side. As a first step, manipulate rows in order to replace
the five with a zero. Multiply row 3 by 5/4. The result is

| 4    0 5 | 1 0 0   |
| 0    1 6 | 0 1 0   |
| 15/4 0 5 | 0 0 5/4 |

Replacing row 1 with row 1 minus row 3 puts a zero in its third column:

| 1/4  0 0 | 1 0 -5/4 |
| 0    1 6 | 0 1  0   |
| 15/4 0 5 | 0 0  5/4 |

Next, replace the 15/4 with a zero. To accomplish this, use rule 2; any
row may be multiplied by a nonzero scalar (row 1 = 15 × row 1).

| 15/4 0 0 | 15 0 -75/4 |
| 0    1 6 | 0  1  0    |
| 15/4 0 5 | 0  0  5/4  |

Next, use rule 3; any row may be added to the multiple of another row
(row 3 = row 3 − row 1).

| 15/4 0 0 | 15  0 -75/4 |
| 0    1 6 | 0   1  0    |
| 0    0 5 | -15 0  80/4 |

The 6 on the left side is the next target. Multiply row 3 by 6/5.

| 15/4 0 0 | 15  0 -75/4 |
| 0    1 6 | 0   1  0    |
| 0    0 6 | -18 0  24   |

Replacing row 2 with row 2 minus row 3 clears the 6:

| 15/4 0 0 | 15  0 -75/4 |
| 0    1 0 | 18  1 -24   |
| 0    0 6 | -18 0  24   |

Finally, multiply row 1 by 4/15 and row 3 by 1/6:

| 1 0 0 |  4 0  -5 |
| 0 1 0 | 18 1 -24 |
| 0 0 1 | -3 0   4 |

The identity matrix is present in the left side of the composite matrix.
The job is finished and the inverted matrix equals:

      |  4  0  -5 |
M⁻¹ = | 18  1 -24 |
      | -3  0   4 |
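The row-manipulation procedure of Example B-2 is mechanical enough to code directly. The sketch below is a bare-bones Gauss-Jordan inversion (no pivoting, so it assumes the diagonal never becomes zero, which holds for this example):

import numpy as np

def gauss_jordan_inverse(m):
    n = m.shape[0]
    aug = np.hstack([m.astype(float), np.eye(n)])     # composite [M | I]
    for col in range(n):
        aug[col] /= aug[col, col]                     # scale the pivot row to 1
        for row in range(n):
            if row != col:
                aug[row] -= aug[row, col] * aug[col]  # clear the rest of the column
    return aug[:, n:]                                 # the right side is now M^-1

M = np.array([[4, 0, 5], [0, 1, 6], [3, 0, 4]])
print(gauss_jordan_inverse(M))   # [[ 4.  0. -5.], [18.  1. -24.], [-3.  0.  4.]]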
EXAMPLE B-3
Problem: Figure 9-1 shows the Markov model for an ideal dual-
controller system. The I − Q matrix for this model is

        | 2λ   -2λ  |
I − Q = | -μ   λ+μ  |

The equation for MTTF (Equation 9-1) is found from the N matrix, the
inverse of the I − Q matrix. Invert the I − Q matrix and find the equation
for MTTF.

Solution: Append the identity matrix to form the composite:

| 2λ  -2λ | 1 0 |
| -μ  λ+μ | 0 1 |

Multiply row 1 by (λ+μ)/(2λ):

| λ+μ -(λ+μ) | (λ+μ)/(2λ) 0 |
| -μ   λ+μ   | 0          1 |

Replace row 2 with the sum of row 1 and row 2:

| λ+μ -(λ+μ) | (λ+μ)/(2λ) 0 |
| λ    0     | (λ+μ)/(2λ) 1 |

Multiply row 1 by 1/(λ+μ) and row 2 by 1/λ:

| 1 -1 | 1/(2λ)      0   |
| 1  0 | (λ+μ)/(2λ²) 1/λ |

Replace row 2 with row 2 minus row 1, then replace row 1 with the sum
of the new row 2 and row 1:

| 1 0 | (λ+μ)/(2λ²) 1/λ |
| 0 1 | μ/(2λ²)     1/λ |

The right side of the composite is the N matrix:

    | (λ+μ)/(2λ²)  1/λ |
N = | μ/(2λ²)      1/λ |

Starting from state 0, the MTTF is the sum of the elements in the first
row of N:

MTTF = (λ+μ)/(2λ²) + 2λ/(2λ²) = (3λ+μ)/(2λ²)
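A numeric spot-check of the result, with assumed illustrative rates λ = 0.001 and μ = 0.1 per hour:

import numpy as np

lam, mu = 1e-3, 1e-1

i_minus_q = np.array([[2 * lam, -2 * lam],
                      [-mu, lam + mu]])
N = np.linalg.inv(i_minus_q)

print(N[0].sum())                          # MTTF from state 0: 51,500 hours
print((3 * lam + mu) / (2 * lam ** 2))     # closed form gives the same value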
Probability Theory

Introduction
There are many events in the world for which there is not enough knowl-
edge to accurately predict an outcome. If a penny is flipped into the air, no
one can predict with certainty that it will land heads up. If a pair of dice
is rolled, no one possesses enough knowledge to state which numbers will
appear face up on each die. These events are called random.
While random events like a coin toss do not seem related to reliability
engineering, the concept of a random event as previously defined in
Chapter 3 is the same. In reliability analysis the experiment is operating
the device for another time intervalthe possible outcomes of the experi-
ment are successful operation, failure in mode 1, failure in mode 2, etc. If a
controller module is run for another hour will it fail or not? This is much
like asking if the result of the next coin flip will be heads or tails.
Probability
Probability is a quantitative method of expressing the likelihood of an
event. A probability is assigned a number between zero and one, inclu-
sive. A probability assignment of zero means that the event is never
expected. A probability assignment of one means that the event is always
expected.
P(E) = lim (N→∞) n/N    (C-2)
Venn Diagrams
A convenient way to depict the outcomes of an experiment is through the
use of a Venn diagram. These diagrams were created by John Venn (1834-
1923), an English mathematician and cleric. They provide visual represen-
tation of data sets, including experimental outcomes. The diagrams are
drawn by using the area of a rectangle to represent all possible outcomes;
this area is known as the sample space. Any particular outcome is shown by
using a portion of the area within the rectangle.
A fair coin is defined as a coin that is equally likely to give a heads result
or a tails result. For a fair coin flip, the Venn diagram of possible outcomes
is shown in Figure C-1. There are two expected outcomes: heads and tails.
Each has a well-known probability of one-half. The diagram shows the
outcomes, with each allocated area in proportion to its probability.
For the toss of a fair pair of dice (fair meaning that every number is
equally likely to come up on each die), the possible outcomes are shown in
Figure C-2. The outcomes do not occupy the same area on the diagram.
Some outcomes are more likely than others; these occupy more area. The
area occupied by each outcome is proportional to its probability.
Complementary sets are easily shown on Venn diagrams. Since the diagram
represents the entire sample space, all area not enclosed within an event is
the complement of the event set. In Figure C-5, the set A is represented by
a circle. Its complement is set B, represented by the remainder of the dia-
gram.
Combining Probabilities
Certain rules help to combine probabilities. Combinations of events are common in the field of reliability evaluation. System failures often occur only when certain combinations of events happen during certain times. When two events A and B are independent (the outcome of one does not affect the other), the probability that both occur is the product of the individual probabilities:

$$P(A \cap B) = P(A) \times P(B) \tag{C-3}$$
EXAMPLE C-1
Problem: Two fair coins are flipped into the air. What is the probability
that both coins will land with heads showing?
Solution: Each coin toss has only two possible outcomes: heads or
tails. Each outcome has a probability of one-half. The coin tosses are
independent. Therefore,
$$P(H_1 \cap H_2) = P(H_1) \times P(H_2) = \frac{1}{2} \times \frac{1}{2} = \frac{1}{4}$$
EXAMPLE C-2
Problem: A pair of fair dice is rolled. What is the probability of getting "snake eyes" (a single dot showing on each die)?
Solution: The outcome of one die does not affect the outcome of the other die. Therefore, the events are independent. The probability of getting one dot can be obtained by noting that there are six sides on the die and that each side is equally likely. The probability of getting one dot is one-sixth (1/6). The probability of getting snake eyes is represented as:

$$P(1, 1) = \frac{1}{6} \times \frac{1}{6} = \frac{1}{36}$$
Check the area occupied by the 2 result on Figure C-2. Is that area
equal to 1/36?
EXAMPLE C-3
Problem: A controller fails only if the input power fails and the
controller battery also fails. Assume that these factors are
independent. The probability of input power failure is 0.0001. The
probability of battery failure is 0.01. What is the probability of
controller failure?
Solution: Since input power and battery failures are independent, the probability of both events is given by Equation C-3:

$$P(\text{controller failure}) = 0.0001 \times 0.01 = 0.000001$$
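A few lines of code confirm the arithmetic; the variable names are hypothetical, but the probabilities come from the problem statement:

```python
p_power_fail = 0.0001    # probability that input power fails
p_battery_fail = 0.01    # probability that the battery fails

# Independent events: joint probability is the product (Equation C-3)
p_controller_fail = p_power_fail * p_battery_fail
print(p_controller_fail)  # 1e-06
```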
Probability Summation
If the probability of getting a result from set A equals 0.2 and the probabil-
ity of getting a result from set B equals 0.3, what is the probability of get-
ting a result from either set A or set B?
It would be natural to assume that the answer is 0.5, the sum of the above
probabilities, but that answer is not always correct. Look at the Venn dia-
gram in Figure C-8. If the area of set A (6/36) is added to the area of set B
(6/36), the answer (12/36) is too large. (The answer should be 11/36.)
Since there is an intersection between sets A and B, the area of the intersection has been counted twice. When summing probabilities, the intersections must be subtracted. Thus, the probability of the union of event sets A and B is given by:

$$P(A \cup B) = P(A) + P(B) - P(A \cap B) \tag{C-4}$$

When the events are mutually exclusive, the intersection is empty and the sum is exact:

$$P(A \cup B) = P(A) + P(B) \tag{C-5}$$
EXAMPLE C-4
EXAMPLE C-5
EXAMPLE C-6
Problem: A pair of fair dice is rolled. What is the probability that both dice show even numbers?
Solution: On each die are six numbers. Three of the numbers are odd (1, 3, 5) and three of the numbers are even (2, 4, 6). All numbers are mutually exclusive. Equation C-5 gives the probability of getting an even number on one die:

$$P(\text{Even}) = P(2, 4, 6) = P(2) + P(4) + P(6) = \frac{1}{6} + \frac{1}{6} + \frac{1}{6} = \frac{1}{2}$$
The dice are independent, so the probability that both dice show even numbers is:

$$P(\text{Even, Even}) = P(\text{Set A Even}) \times P(\text{Set B Even}) = \frac{1}{2} \times \frac{1}{2} = \frac{1}{4}$$
EXAMPLE C-7
Problem: A pair of fair dice is rolled. What is the probability that at least one die shows a two?
Solution: Let set A be the outcomes in which the first die shows a two and set B be the outcomes in which the second die shows a two. Using Equation C-4:

$$P(A \cup B) = P(A) + P(B) - P(A \cap B) = \frac{1}{6} + \frac{1}{6} - \frac{1}{36} = \frac{11}{36}$$
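The 11/36 result can also be verified by brute-force enumeration of the 36 equally likely outcomes; a sketch assuming the reading of Example C-7 given above:

```python
from itertools import product

outcomes = list(product(range(1, 7), repeat=2))  # all 36 (die1, die2) pairs
a = [o for o in outcomes if o[0] == 2]           # set A: first die shows a two
b = [o for o in outcomes if o[1] == 2]           # set B: second die shows a two
both = [o for o in outcomes if o == (2, 2)]      # the intersection

# Inclusion-exclusion (Equation C-4): P(A or B) = P(A) + P(B) - P(A and B)
p_union = (len(a) + len(b) - len(both)) / len(outcomes)
print(p_union)  # 11/36 = 0.3055...
```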
Conditional Probability
It is often necessary to calculate the probability of some event under specific circumstances. The probability of event A, given that event B has occurred, may need to be calculated. Such a probability is called a conditional probability. The situation can be envisioned by examining the Venn diagram in Figure C-9. Suppose it is known that event B has occurred. This means that only the state space within the area of circle B needs to be examined. This is a substantially reduced area! The desired probability is the area of circle A within circle B, divided by the area of circle B, expressed by:
$$P(A \mid B) = \frac{P(A \cap B)}{P(B)} \tag{C-6}$$
EXAMPLE C-8
Problem: A pair of fair dice is rolled. Given that the first die shows a two, what is the probability that both dice show a two?
Solution: The probability of {2,2}, given that one die has a two, is given by Equation C-6:

$$P(2, 2 \mid \text{first die} = 2) = \frac{1/36}{1/6} = \frac{1}{6}$$
In this case, the answer is intuitive since the outcome of each die is independent. The problem could have been solved by noting that

$$P(A \cap B) = P(A) \times P(B) = \frac{1}{6} \times \frac{1}{6}$$

so that

$$P(A \mid B) = \frac{P(A) \times P(B)}{P(B)} = P(A) \tag{C-7}$$

In the example, the result could have been calculated by merely knowing the probability of getting a two on the second die.
Rearranging Equation C-6 gives the multiplication rule:

$$P(A \cap B) = P(A \mid B) \times P(B) \tag{C-8}$$

This states that the intersection of events A and B can be obtained by multiplying the probability of A, given B, times the probability of B. When the statistics are kept in a conditional format, this equation can be useful.
EXAMPLE C-9
Problem: A pair of fair dice is rolled. Given that exactly one die shows a two, what is the probability that the sum of the dice is seven?
Solution: There are only two ways to get a sum of seven, given that one die has a two. Those two combinations are {2,5} and {5,2}. There are 10 combinations that show a two on exactly one die. These sets are {2,1}, {2,3}, {2,4}, {2,5}, {2,6}, {1,2}, {3,2}, {4,2}, {5,2}, and {6,2}. Using Equation C-6:

$$P(A \mid B) = \frac{P(A \cap B)}{P(B)} = \frac{2/36}{10/36} = \frac{2}{10}$$
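Equation C-6 can be checked the same way by enumerating outcomes; a sketch of Example C-9:

```python
from itertools import product

outcomes = list(product(range(1, 7), repeat=2))
# Event B: exactly one die shows a two (10 of the 36 outcomes)
b = [o for o in outcomes if (o[0] == 2) != (o[1] == 2)]
# Event A and B: the sum is seven as well -- only (2,5) and (5,2)
a_and_b = [o for o in b if sum(o) == 7]

print(len(a_and_b) / len(b))  # (2/36)/(10/36) = 2/10 = 0.2
```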
Bayes Rule
Consider an event A. The state space in which it exists is divided into two mutually exclusive sections, B and B′ (Figure C-11). Event A can be written as:

$$A = (A \cap B) \cup (A \cap B') \tag{C-9}$$

Since the two pieces are mutually exclusive, the probabilities add:

$$P(A) = P(A \cap B) + P(A \cap B') \tag{C-10}$$

Applying Equation C-8 to each term expresses this with conditional probabilities:

$$P(A) = P(A \mid B)P(B) + P(A \mid B')P(B') \tag{C-11}$$

This states that the probability of event A equals the conditional probability of A, given that B has occurred, weighted by the probability of B, plus the conditional probability of A, given that B has not occurred, weighted by the probability of B′. This is known as Bayes rule. It is widely used in many aspects of reliability engineering.
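A numeric sketch of Equation C-11 with assumed, purely illustrative numbers: suppose a device sees a high-stress condition (event B) with probability 0.1, and its failure probability depends on the stress level:

```python
p_b = 0.1                # P(B): high-stress condition (assumed)
p_a_given_b = 0.05       # P(A|B): failure probability under high stress (assumed)
p_a_given_not_b = 0.001  # P(A|B'): failure probability otherwise (assumed)

# Total probability (Equation C-11): P(A) = P(A|B)P(B) + P(A|B')P(B')
p_a = p_a_given_b * p_b + p_a_given_not_b * (1 - p_b)
print(p_a)  # 0.0059
```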
EXAMPLE C-10
The probabilities of failure for each shift are calculated by dividing the number of failures during each shift by the number of hours in each shift. Substituting the numbers into Equation C-12 gives the overall probability of failure:

P(failure) = 0.001139
EXAMPLE C-11
EXAMPLE C-12
Problem: How many two-letter character strings can be made from the letters A, B, and C if no letter is repeated?
Solution: Using the basic counting rule: 3 × 2 = 6. This can be verified by creating all the character strings. Starting with the letter A, they are {A,B}, {A,C}, {B,A}, {B,C}, {C,A}, and {C,B}.
EXAMPLE C-13
Problem: A controller is ordered by selecting one of three models, one of four communications options, one of two I/O options, and one of two memory options. How many controller variations exist?
Solution: Four steps are required to select the controller. In the first step, one of three models is selected. One of four communications options is chosen in the second step. One of two I/O options is chosen in the third step. One of two memory options is chosen in the fourth step. Using the first counting rule, the number of variations is:

3 × 4 × 2 × 2 = 48
EXAMPLE C-14
Problem: How many four-letter sequences can be made from the letters A, B, C, and D if each letter is used only once?
Solution: Using the basic counting rule: 4 × 3 × 2 × 1 = 24
Permutations
An ordered arrangement of objects without repetition is known as a per-
mutation. The four-letter sequences from Example C-14 are permutations.
The number of permutations of n objects is n! (n! is pronounced "n factorial" and is the mathematical notation for the product 1 × 2 × 3 × ⋯ × (n − 1) × n).
EXAMPLE C-15
Problem: In how many different orders can six objects be arranged?
Solution: The number of permutations of six objects is 6! = 720.
$$P(n, r) = n(n-1)(n-2)\cdots(n-r+1) \tag{C-13}$$

If this expression is multiplied by

$$\frac{(n-r)(n-r-1)\cdots 1}{(n-r)(n-r-1)\cdots 1}$$

(which equals one), then

$$P(n, r) = \frac{n(n-1)(n-2)\cdots(n-r+1)(n-r)(n-r-1)\cdots 1}{(n-r)(n-r-1)\cdots 1} \tag{C-14}$$

The numerator is n!. The denominator is (n − r)!. Therefore:

$$P(n, r) = \frac{n!}{(n-r)!} \tag{C-15}$$
Equation C-15 is used to determine permutations, the number of ways
that r objects from a set of n objects can be arranged in order without repe-
tition.
EXAMPLE C-16
Problem: How many permutations of four objects taken two at a time are there?
Solution: Using Equation C-15, P(4, 2) = 4!/(4 − 2)! = 24/2 = 12. This can be verified by listing the permutations: AB, AC, AD, BA, BC, BD, CA, CB, CD, DA, DB, and DC.
EXAMPLE C-17
Problem: How many permutations of six objects taken four at a time are there?
Solution: Using Equation C-15, P(6, 4) = 6!/(6 − 4)! = 720/2 = 360. This can be verified by using the basic counting rule. The first step has six possibilities. The second step has five. The third step has four possibilities, and the fourth step has three. The basic counting rule tells us:
$$P(6, 4) = 6 \times 5 \times 4 \times 3 = 360$$
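Python's standard library computes these counts directly; a quick check of Equation C-15 against Example C-17:

```python
import math

# P(n, r) = n! / (n - r)!
print(math.perm(6, 4))                             # 360
print(math.factorial(6) // math.factorial(6 - 4))  # 360, from Equation C-15
```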
Combinations
Combinations are groupings of elements in which order does not count. Since order does not count, the number of combinations of n objects taken r at a time is always less than the number of permutations.
Consider the number of permutations of three letters (3! = 6). They are:
ABC, ACB, BAC, BCA, CAB, and CBA. If order does not count (three
objects are taken three at a time), all these arrangements are the same.
There is only one combination. The number has been reduced by a factor
of 3!.
$$C(n, r) = \frac{n!}{r!(n-r)!} \tag{C-16}$$
Comparing this formula with Equation C-15, note that the number of per-
mutations is reduced by a factor r! to obtain the number of combinations.
EXAMPLE C-18
Problem: How many combinations of six objects taken four at a time are there?
Solution: Using Equation C-16:

$$C(6, 4) = \frac{6!}{4!(6-4)!} = \frac{720}{24 \times 2} = 15$$
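The same check works for combinations (Equation C-16 and Example C-18):

```python
import math

# C(n, r) = n! / (r! (n - r)!)
print(math.comb(6, 4))                                               # 15
print(math.factorial(6) // (math.factorial(4) * math.factorial(2)))  # 15
```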
Exercises
C.1 Three fair coins are tossed into the air. What is the probability that
all three will land heads up?
C.2 A control loop has a temperature transmitter, a controller, and a
valve. All three devices must operate successfully for the loop to
operate successfully. The loop will be operated for one year before
the process is shut down and overhauled. The temperature trans-
mitter has a probability of failure during the next year of 0.01. The
controller has a probability of failure during the next year of 0.005.
The valve has a probability of failure during the next year of 0.05.
Assume that temperature transmitter, controller, and valve fail-
ures are independent but not mutually exclusive. What is the prob-
ability of failure for the control loop?
C.3 Three fair coins are tossed into the air. What is the probability that
at least one coin will land heads up?
C.4 A pair of fair dice is rolled. If the result is a six or an eight, you hit
Park Place or Boardwalk and go broke. If the result is nine or more,
you pass GO and collect $200.00. What is the probability of going
broke on the next turn? What is the probability of passing GO?
C.5 A control system has four temperature-control loops. The system
operates if three or more loops operate. The system fails if fewer
than three loops operate. How many combinations of loop failures
cause system failure?
C.10 A control system has three controllers. The system fails if two
controllers fail or if all three controllers fail. A controller is repair-
able and has a steady-state probability of success of 0.95. What is
the probability that two controllers are failed and the other control-
ler is successful? How many combinations of two failed controllers
and one successful controller exist? What other combinations
result in system failure? What is the overall probability of system
failure?
C.11 Using the Venn diagram of Figure C-7, estimate the probabilities of
the various failure sources.
C.12 During a year, the probability of getting a dangerous condition in
an industrial boiler is 0.00001. The probability of a safety instru-
mented protection system failing to respond to the dangerous
demand is 0.000001. If there is a dangerous condition AND the
protection system does not respond, there will be a boiler explo-
sion. What is the probability of a boiler explosion?
C.13 The probability of an explosion in an industrial process is 0.00002.
The insurance underwriter wants a safety instrumented system
designed that will reduce the probability of an explosion to
0.0000001. What is the maximum probability of failing danger-
ously allowed in the safety instrumented system?
Answers to Exercises
C.1 Each coin must land heads-up. The probability of that event is 1/2. The combination of all three heads events is 1/2 × 1/2 × 1/2 = 1/8.
C.2 P(control loop success) = P(transmitter success) × P(controller success) × P(valve success) = (1 − 0.01)(1 − 0.005)(1 − 0.05) = 0.99 × 0.995 × 0.95 = 0.9358. P(control loop failure) = 1 − 0.9358 = 0.0642. Note that an approximation could be obtained by adding the failure probabilities. The approximation would be: P(approx. control loop failure) = 0.01 + 0.005 + 0.05 = 0.065. This probability summation method is based on Equation C-5 and is not exact because the failures are not mutually exclusive. The exact method is done by expanding Equation C-4. While the approximate method is not exact, it is usually faster and produces a conservative result.
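The exact and approximate methods are easy to compare in a few lines, using the failure probabilities from the exercise:

```python
p_fail = [0.01, 0.005, 0.05]  # transmitter, controller, valve

p_success = 1.0
for p in p_fail:
    p_success *= 1 - p        # all three devices must succeed
exact = 1 - p_success         # exact loop failure probability

approx = sum(p_fail)          # conservative summation approximation

print(round(exact, 4), approx)  # 0.0642 versus 0.065
```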
C.3 P(at least one heads up) = 1 - P(no heads up) = 1 - 1/8 = 7/8. Alter-
natively, the problem could be solved by creating a list of all com-
binations of three coin toss outcomes. There will be eight possible
combinations. Each combination will be mutually exclusive. It will
be seen that seven of the eight combinations will have at least one
heads up.
C.4 P(going broke) = P(6 or 8) = P(6) + P(8) since outcomes are mutu-
ally exclusive. P(going broke) = 5/36 + 5/36 = 10/36. P(passing
GO) = P(9 or more) = P(9) + P(10) + P(11) + P(12) = 4/36 + 3/36 +
2/36 + 1/36 = 10/36.
C.5 All combinations of two loops operating, one loop operating, and zero loops operating represent system failure. Combinations of three or four loops operating represent system success. Combinations of one loop operating are 4!/(1!(4 − 1)!) = 4. Combinations of two loops operating are 4!/(2!(4 − 2)!) = 6. Combinations of zero loops operating (all four failed) are 4!/(0!(4 − 0)!) = 1. There is a total of 11 combinations of successful/failed loops that represent system failure.
C.6 The number of combinations is given by the basic counting rule: 3 × 2 × 1 = 6.
C.7 The number of combinations of letters is given by the basic counting rule. In this case: 5 × 4 × 3 × 2 × 1 = 120.
C.8 Using the formula for combinations: 3!/(2!(3 − 2)!) = 3.
C.9 P(one successful controller and one failed controller) = 2 × 0.95 × (1 − 0.95) = 0.095.
Bibliography
1. Johnsonbaugh, R. Essential Discrete Mathematics. NY: Macmillan
Publishing Company, 1987.
Appendix D
Test Data
Reliability parameters can be calculated from life test data. Table 4-2
shows accelerated reliability life test data for a set of fifty modules. A
number of variables are used to describe this data. The original number of
modules in the test is denoted by the variable No. The number of modules
surviving after each time period t is denoted by the variable Ns. The
cumulative number of modules that have failed is denoted by the variable
Nf. The reliability function can be calculated as follows:

$$R(t) = \frac{N_s(t)}{N_o} \tag{D-1}$$

The unreliability function is the fraction of the original population that has failed:

$$F(t) = \frac{N_f(t)}{N_o} \tag{D-2}$$

The probability density function is approximated by the change in unreliability over each time interval Δt:

$$f(t) = \frac{F(t_n) - F(t_{n-1})}{\Delta t}$$

This equals

$$f(t) = \frac{1}{N_o} \times \frac{N_f(t_n) - N_f(t_{n-1})}{\Delta t} \tag{D-3}$$
Since the failure rate equals the PDF divided by the reliability function,

$$\lambda(t) = \frac{f(t)}{R(t)} = \frac{1}{R(t)} \times f(t) = \frac{N_o}{N_s(t_{n-1})} \times \frac{1}{N_o} \times \frac{N_f(t_n) - N_f(t_{n-1})}{\Delta t}$$

$$\lambda(t) = \frac{1}{N_s(t_{n-1})} \times \frac{N_f(t_n) - N_f(t_{n-1})}{\Delta t} \tag{D-4}$$
Using the data from Table 4-2, at the end of the first week forty-one modules survived and nine modules failed. The calculations for week one are shown below.

$$R(t_1) = \frac{41}{50} = 0.82$$

$$F(t_1) = \frac{9}{50} = 0.18$$

$$f(t_1) = \frac{1}{50} \times \frac{9 - 0}{24 \times 7} = 0.00107 \text{ failures/hr}$$

$$\lambda(t_1) = \frac{1}{50} \times \frac{9 - 0}{24 \times 7} = 0.00107 \text{ failures/hr}$$
For week two, thirty-six modules survived and fourteen had failed cumulatively:

$$R(t_2) = \frac{36}{50} = 0.72$$

$$F(t_2) = \frac{14}{50} = 0.28$$

$$f(t_2) = \frac{1}{50} \times \frac{14 - 9}{24 \times 7} = 0.00059 \text{ failures/hr}$$

$$\lambda(t_2) = \frac{1}{41} \times \frac{14 - 9}{24 \times 7} = 0.00072 \text{ failures/hr}$$
Table D-1 shows the calculations for the first ten weeks.
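These calculations are easy to script. The sketch below assumes weekly survivor counts shaped like those in Table 4-2 (the week-three count is assumed for illustration), with 168 hours per week:

```python
No = 50                # modules at the start of the test
Ns = [50, 41, 36, 31]  # survivors at t0, t1, t2, t3 (t3 count assumed)
dt = 24 * 7            # hours per weekly interval

for n in range(1, len(Ns)):
    Nf_now, Nf_prev = No - Ns[n], No - Ns[n - 1]
    R = Ns[n] / No                               # Equation D-1
    F = Nf_now / No                              # Equation D-2
    f = (Nf_now - Nf_prev) / (No * dt)           # Equation D-3
    lam = (Nf_now - Nf_prev) / (Ns[n - 1] * dt)  # Equation D-4
    print(f"week {n}: R={R:.2f} F={F:.2f} f={f:.5f} lambda={lam:.5f}")
```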
The failure rate calculated from Table D-2 is shown in Figure D-5.
Compare this to Figure D-4, which was calculated from summary
(censored) data. The censored data produce a noisier plot. In general, the finer the failure-time resolution of the data, the better the analysis.
Appendix E
Continuous-Time Markov Modeling
When the size of the time increment in a discrete Markov model is reduced, accuracy is increased as approximations are reduced. Taken to the limit, the time increment approaches zero. The limit as the time increment (Δt) approaches zero is labeled dt. At the limit, we have achieved continuous time.

$$\lim_{\Delta t \to 0} \Delta t = dt$$

Assume the model starts in state 0 at time t. The model will be in state 0 during the next instant (time = t + Δt) only if it stays in state 0. This can be expressed mathematically as:

$$S_0(t + \Delta t) = S_0(t)(1 - \lambda \Delta t)$$
Figure E-1. Markov Model for Single Non-Repairable Component
$$S_0(t + \Delta t) - S_0(t) = -\lambda S_0(t) \Delta t$$

$$\frac{S_0(t + \Delta t) - S_0(t)}{\Delta t} = -\lambda S_0(t) \tag{E-1}$$

The left side of Equation E-1 is the derivative with respect to time. Taking the limit as Δt approaches zero results in:

$$\frac{dS_0(t)}{dt} = -\lambda S_0(t) \tag{E-2}$$

Similarly, for the failed state:

$$\frac{dS_1(t)}{dt} = \lambda S_0(t) \tag{E-3}$$
Equations E-2 and E-3 are first-order differential equations with constant coefficients. One of the easiest ways to solve such equations is to use a Laplace Transform to convert from the time domain (t) to the frequency domain (s). Taking the Laplace Transform:

$$sS_0(s) - S_0(0) = -\lambda S_0(s)$$

and

$$sS_1(s) - S_1(0) = \lambda S_0(s)$$

Since the system starts in state 0, substitute S0(0) = 1 and S1(0) = 0. This results in:

$$sS_0(s) - 1 = -\lambda S_0(s) \tag{E-4}$$

and

$$sS_1(s) = \lambda S_0(s) \tag{E-5}$$

Rearranging Equation E-4:

$$(s + \lambda)S_0(s) = 1$$

Therefore:

$$S_0(s) = \frac{1}{s + \lambda} \tag{E-6}$$

Substituting Equation E-6 into Equation E-5:

$$sS_1(s) = \frac{\lambda}{s + \lambda}$$

$$S_1(s) = \frac{\lambda}{s(s + \lambda)}$$
Taking the inverse Laplace Transform of each expression:

$$S_0(t) = e^{-\lambda t} \tag{E-7}$$

and

$$S_1(t) = 1 - e^{-\lambda t} \tag{E-8}$$
Since state 0 is the success state, reliability is equal to S0(t) and is given by
Equation E-7. Unreliability is equal to S1(t) and is given by Equation E-8.
This result is identical to the result obtained in Chapter 4 (Equations 4-16
and 4-17) when a component has an exponential probability of failure.
Thus, the Markov model solution verifies the clear relationship between
the constant failure rate and the exponential probability of failure over a
time period.
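The agreement can be confirmed numerically by stepping the discrete model with a small Δt and comparing the result to Equation E-7; the failure rate below is an assumed illustrative value:

```python
import math

lam = 1e-4       # assumed failure rate, per hour
dt = 0.1         # time step, hours
steps = 100_000  # 10,000 hours total

s0 = 1.0         # model starts in state 0
for _ in range(steps):
    s0 -= lam * s0 * dt  # one discrete step of Equation E-2

print(s0, math.exp(-lam * steps * dt))  # both approximately e^-1 = 0.3679
```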
For a single repairable component, the model can also return from state 1 to state 0 by repair at rate μ. The model will be in state 0 at the next instant if it stays in state 0 or if it is repaired from state 1:

$$S_0(t + \Delta t) = S_0(t)(1 - \lambda \Delta t) + S_1(t)(\mu \Delta t)$$

This can be rearranged as:

$$S_0(t + \Delta t) - S_0(t) = [-\lambda S_0(t) + \mu S_1(t)] \Delta t$$

$$\frac{dS_0(t)}{dt} = -\lambda S_0(t) + \mu S_1(t) \tag{E-9}$$
In a similar manner:

$$\frac{dS_1(t)}{dt} = \lambda S_0(t) - \mu S_1(t) \tag{E-10}$$
Equations E-9 and E-10 are first-order differential equations. Again, using the Laplace Transform solution method:

$$sS_0(s) - S_0(0) = -\lambda S_0(s) + \mu S_1(s)$$

Rearranging:

$$(s + \lambda)S_0(s) = \mu S_1(s) + S_0(0)$$

$$S_0(s) = \frac{\mu}{s + \lambda} S_1(s) + \frac{1}{s + \lambda} S_0(0) \tag{E-11}$$

In a similar manner:

$$S_1(s) = \frac{\lambda}{s + \mu} S_0(s) + \frac{1}{s + \mu} S_1(0) \tag{E-12}$$

Substituting Equation E-11 into Equation E-12:

$$S_1(s) = \frac{\lambda}{s + \mu} \cdot \frac{\mu}{s + \lambda} S_1(s) + \frac{\lambda}{s + \mu} \cdot \frac{1}{s + \lambda} S_0(0) + \frac{1}{s + \mu} S_1(0)$$

Collecting the S1(s) terms on the left:

$$\left[1 - \frac{\lambda \mu}{(s + \lambda)(s + \mu)}\right] S_1(s) = \frac{1}{s + \mu} S_1(0) + \frac{\lambda}{(s + \lambda)(s + \mu)} S_0(0)$$

Creating a common denominator for the left half of the equation yields:

$$\frac{(s + \lambda)(s + \mu) - \lambda \mu}{(s + \lambda)(s + \mu)} S_1(s) = \frac{1}{s + \mu} S_1(0) + \frac{\lambda}{(s + \lambda)(s + \mu)} S_0(0)$$

If both sides of the equation are divided by the first term, the S1(s) term is isolated:

$$S_1(s) = \frac{(s + \lambda)(s + \mu)}{(s + \lambda)(s + \mu) - \lambda \mu} \left[\frac{1}{s + \mu} S_1(0) + \frac{\lambda}{(s + \lambda)(s + \mu)} S_0(0)\right]$$

Multiplying out the denominator and canceling equal terms (the denominator reduces to s(s + λ + μ)):

$$S_1(s) = \frac{1}{s(s + \lambda + \mu)} [(s + \lambda)S_1(0) + \lambda S_0(0)] \tag{E-13}$$
To move further with the solution, we must arrange Equation E-13 into a form that will allow an inverse transform. A partial fraction expansion of S1(s) is used, where:

$$S_1(s) = \frac{A}{s} + \frac{B}{s + \lambda + \mu} \tag{E-14}$$

$$\frac{A}{s} + \frac{B}{s + \lambda + \mu} = \frac{1}{s(s + \lambda + \mu)} [(s + \lambda)S_1(0) + \lambda S_0(0)] \tag{E-15}$$

Multiplying both sides by s(s + λ + μ):

$$(s + \lambda)S_1(0) + \lambda S_0(0) = A(s + \lambda + \mu) + Bs$$
This relation holds true for all values of s. Therefore, to solve for A and B, we should pick a value of s that will simplify the algebra as much as possible. To solve for A, a value of s = 0 is the best choice. At s = 0,

$$\lambda S_1(0) + \lambda S_0(0) = A(\lambda + \mu)$$

Therefore:

$$A = \frac{\lambda}{\lambda + \mu} [S_1(0) + S_0(0)] \tag{E-16}$$

To solve for B, the best choice is

$$s = -(\lambda + \mu)$$

Substituting for s:

$$-\mu S_1(0) + \lambda S_0(0) = -B(\lambda + \mu)$$

Rearranging,

$$B = \frac{1}{\lambda + \mu} [\mu S_1(0) - \lambda S_0(0)] \tag{E-17}$$
Substituting A and B into Equation E-14:

$$S_1(s) = \frac{\lambda}{\lambda + \mu} \cdot \frac{1}{s} [S_1(0) + S_0(0)] + \frac{1}{\lambda + \mu} \cdot \frac{1}{s + \lambda + \mu} [\mu S_1(0) - \lambda S_0(0)]$$

In a similar manner:

$$S_0(s) = \frac{\mu}{\lambda + \mu} \cdot \frac{1}{s} [S_1(0) + S_0(0)] + \frac{1}{\lambda + \mu} \cdot \frac{1}{s + \lambda + \mu} [\lambda S_0(0) - \mu S_1(0)]$$

Taking the inverse Laplace Transform of each expression:

$$S_0(t) = \frac{\mu}{\lambda + \mu} [S_0(0) + S_1(0)] + \frac{e^{-(\lambda + \mu)t}}{\lambda + \mu} [\lambda S_0(0) - \mu S_1(0)]$$

and

$$S_1(t) = \frac{\lambda}{\lambda + \mu} [S_0(0) + S_1(0)] + \frac{e^{-(\lambda + \mu)t}}{\lambda + \mu} [\mu S_1(0) - \lambda S_0(0)]$$

Since the system starts in state 0,

$$S_0(0) = 1 \quad \text{and} \quad S_1(0) = 0$$

Substituting,

$$S_0(t) = \frac{\mu}{\lambda + \mu} + \frac{\lambda e^{-(\lambda + \mu)t}}{\lambda + \mu} \tag{E-18}$$

and

$$S_1(t) = \frac{\lambda}{\lambda + \mu} - \frac{\lambda e^{-(\lambda + \mu)t}}{\lambda + \mu} \tag{E-19}$$
As t approaches infinity, the exponential terms go to zero, leaving:

$$S_0(\infty) = \frac{\mu}{\lambda + \mu} \tag{E-20}$$

and

$$S_1(\infty) = \frac{\lambda}{\lambda + \mu} \tag{E-21}$$

The limiting state probability is the expected result at infinite time. Thus, Equations E-20 and E-21 provide this information.
These methods can be used to solve for analytical state probabilities for
more complex models; however, the mathematics become quite complex
for realistic models of several states. The use of numerical techniques in
combination with a discrete-time Markov model is rapidly becoming the
method of choice when time-dependent state probabilities are needed.
The limiting state probabilities can also be found directly from the discrete transition matrix P. The limiting state probability vector SL satisfies:

$$S^L P = S^L \tag{E-22}$$

For the single repairable component,

$$[S_0^L \;\; S_1^L] \begin{bmatrix} 1 - \lambda & \lambda \\ \mu & 1 - \mu \end{bmatrix} = [S_0^L \;\; S_1^L]$$

This yields:

$$(1 - \lambda)S_0^L + \mu S_1^L = S_0^L$$

and

$$\lambda S_0^L + (1 - \mu)S_1^L = S_1^L$$

Either equation reduces to:

$$S_1^L = \frac{\lambda}{\mu} S_0^L$$

Substituting this into the requirement that the probabilities sum to one,

$$S_0^L + S_1^L = 1$$

yields:

$$S_0^L + \frac{\lambda}{\mu} S_0^L = 1$$

Solving for the limiting state probabilities, the steady-state availability and unavailability are:

$$A(\infty) = S_0^L = \frac{\mu}{\lambda + \mu} \tag{E-23}$$

$$U(\infty) = S_1^L = 1 - S_0^L = \frac{\lambda}{\lambda + \mu} \tag{E-24}$$

Substituting

$$MTTF = \frac{1}{\lambda} \quad \text{and} \quad MTTR = \frac{1}{\mu}$$

the familiar expression

$$A(\infty) = \frac{MTTF}{MTTF + MTTR} \tag{E-25}$$

is obtained.
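Equation E-25 in a few lines, with assumed values for MTTF and MTTR:

```python
mttf = 8760.0  # assumed mean time to failure, hours (about one year)
mttr = 8.0     # assumed mean time to restore, hours

lam, mu = 1 / mttf, 1 / mttr

# Equations E-23 and E-25 give the same steady-state availability
print(mu / (lam + mu))       # 0.99908...
print(mttf / (mttf + mttr))  # identical by substitution
```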
Figure E-3. Markov Model for Single Component with Multiple Failure Modes

Again, assume that the model starts in state 0. With a safe failure rate λS and a dangerous failure rate λD (Figure E-3), the model will be in state 0 in the next time instant only if it stays in state 0. This can be expressed mathematically as:

$$S_0(t + \Delta t) = S_0(t)(1 - \lambda^S \Delta t - \lambda^D \Delta t)$$

$$S_0(t + \Delta t) - S_0(t) = -\lambda^S S_0(t) \Delta t - \lambda^D S_0(t) \Delta t$$

$$\frac{S_0(t + \Delta t) - S_0(t)}{\Delta t} = -\lambda^S S_0(t) - \lambda^D S_0(t) \tag{E-26}$$
The left side of Equation E-26 is the derivative with respect to time. Taking the limit as Δt approaches zero results in:

$$\frac{dS_0(t)}{dt} = -\lambda^S S_0(t) - \lambda^D S_0(t) \tag{E-27}$$

$$\frac{dS_1(t)}{dt} = \lambda^S S_0(t) \tag{E-28}$$

and

$$\frac{dS_2(t)}{dt} = \lambda^D S_0(t) \tag{E-29}$$
Equations E-27, E-28, and E-29 are first-order differential equations with constant coefficients. Using a Laplace Transform to convert from the time domain (t) to the frequency domain (s):

$$sS_0(s) - S_0(0) = -\lambda^S S_0(s) - \lambda^D S_0(s)$$

$$sS_1(s) - S_1(0) = \lambda^S S_0(s)$$

and

$$sS_2(s) - S_2(0) = \lambda^D S_0(s)$$

For the initial conditions S0(0) = 1, S1(0) = 0, and S2(0) = 0, the equations reduce to:

$$sS_0(s) - 1 = -\lambda^S S_0(s) - \lambda^D S_0(s) \tag{E-30}$$

$$sS_1(s) = \lambda^S S_0(s) \tag{E-31}$$

and

$$sS_2(s) = \lambda^D S_0(s) \tag{E-32}$$

Rearranging Equation E-30:

$$(s + \lambda^S + \lambda^D)S_0(s) = 1$$

Therefore:

$$S_0(s) = \frac{1}{s + \lambda^S + \lambda^D} \tag{E-33}$$

Substituting Equation E-33 into Equations E-31 and E-32 and solving:

$$sS_1(s) = \frac{\lambda^S}{s + \lambda^S + \lambda^D}$$

$$S_1(s) = \frac{\lambda^S}{s(s + \lambda^S + \lambda^D)} \tag{E-34}$$

and

$$S_2(s) = \frac{\lambda^D}{s(s + \lambda^S + \lambda^D)} \tag{E-35}$$
Taking the inverse transform of Equation E-33:

$$S_0(t) = e^{-(\lambda^S + \lambda^D)t} = R(t) \tag{E-36}$$
A partial fraction expansion is used for S1(s):

$$S_1(s) = \frac{A}{s} + \frac{B}{s + \lambda^S + \lambda^D} \tag{E-37}$$

$$\frac{A}{s} + \frac{B}{s + \lambda^S + \lambda^D} = \frac{\lambda^S}{s(s + \lambda^S + \lambda^D)}$$

Multiplying both sides by s(s + λS + λD):

$$\lambda^S = A(s + \lambda^S + \lambda^D) + Bs$$
This relation holds true for all values of s. Therefore, to solve for A and B, we should pick a value of s that will simplify the algebra as much as possible. To solve for A, a value of s = 0 is the best choice. At s = 0,

$$\lambda^S = A(\lambda^S + \lambda^D)$$

Therefore:

$$A = \frac{\lambda^S}{\lambda^S + \lambda^D} \tag{E-38}$$

To solve for B, the best choice is

$$s = -(\lambda^S + \lambda^D)$$

Substituting for s:

$$\lambda^S = -B(\lambda^S + \lambda^D)$$

Rearranging,

$$B = -\frac{\lambda^S}{\lambda^S + \lambda^D} \tag{E-39}$$
Substituting A and B into Equation E-37:

$$S_1(s) = \frac{\lambda^S}{s(\lambda^S + \lambda^D)} - \frac{\lambda^S}{(\lambda^S + \lambda^D)(s + \lambda^S + \lambda^D)}$$

Taking the inverse transform:

$$S_1(t) = \frac{\lambda^S}{\lambda^S + \lambda^D} - \frac{\lambda^S e^{-(\lambda^S + \lambda^D)t}}{\lambda^S + \lambda^D}$$

$$S_1(t) = \frac{\lambda^S}{\lambda^S + \lambda^D} \left(1 - e^{-(\lambda^S + \lambda^D)t}\right) \tag{E-40}$$

Similarly:

$$S_2(t) = \frac{\lambda^D}{\lambda^S + \lambda^D} \left(1 - e^{-(\lambda^S + \lambda^D)t}\right) \tag{E-41}$$
Figure E-4. Time Dependent Probabilities for Single Component with Multiple Failure Modes
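A short script evaluates Equations E-36, E-40, and E-41 over time, using assumed safe and dangerous failure rates:

```python
import math

lam_s, lam_d = 2e-5, 5e-6  # assumed safe and dangerous failure rates, per hour
total = lam_s + lam_d

for t in (0, 8760, 43800, 87600):        # 0, 1, 5, and 10 years, in hours
    s0 = math.exp(-total * t)            # Equation E-36: R(t)
    s1 = (lam_s / total) * (1 - s0)      # Equation E-40: safe failure probability
    s2 = (lam_d / total) * (1 - s0)      # Equation E-41: dangerous failure probability
    print(f"t={t:6d} h  S0={s0:.4f}  S1={s1:.4f}  S2={s2:.4f}")
```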
multiple failure modes, 94, 151, 153, 269, 272, 287, 290, 384, 450
N matrix, 169, 411-412
nonrepairable component, 124, 152, 441
NORAD, 41
normal distribution, 20, 22-25, 47, 49-50
on-line repair rate, 293
operational profile, 247-248, 250-251
output readback, 192-193
parallel system, 130-131
partial fraction, 446
path space, 230, 234-235, 237, 246
periodic inspection, 257, 265, 273, 275, 277, 383
periodic inspection interval, 256
periodic maintenance, 383
PFD average (PFDavg), 1, 78, 80-82, 109, 114-116, 184, 256, 258, 261, 265-268, 271, 273, 275, 277-278, 291, 301, 312, 315, 317, 361, 367-371, 373-375
physical failures, 38
physical separation, 213
plausibility assertions, 229
PLC input circuit, 95-97, 188
power system, 56, 107
pressure transmitter, 39, 110, 203
priority AND gate, 106
probability density function (pdf), 11-13, 47, 63-64, 72-73, 208, 435
probability of failing safely (PFS), 1, 78, 80-81, 84, 301, 313, 315, 317-321, 323, 325, 328-330, 334-335, 338, 341, 343, 345-346, 349, 351, 353-354, 396
probability of failure on demand (PFD), 1, 78, 80-82, 84, 113-114, 116, 151, 172-175, 184, 201, 256-257, 259-262, 265, 268, 271, 275-278, 290-291, 301, 311-312, 315-318, 320-322, 325-333, 338-340, 343-345, 349-350, 353-354, 361, 367, 383, 395-396
program flow control, 228
programmable electronic systems (PES), 3, 305
programmable logic controllers (PLC), 78-79, 81-83, 95-97, 110-111, 184-185, 188, 193-194, 202, 214-215, 223-224, 306-310, 312-313, 315, 318-319, 322-323, 333, 353-354
Q matrix, 169, 411-412
random failures, 35-37
random variable, 9-16, 18-20, 22, 26, 59-60, 63, 65
reference diagnostics, 84, 190, 228
reliability, 435
reliability network, 133
repair rate, 293
resulting fault, 105
risk, 360-361
    analysis, 365, 384
    cost, 1, 359, 365, 383, 395
    graph, 361-362
risk reduction factor (RRF), 1, 78, 184-186, 256-259, 261-262, 265-268, 271, 273, 275, 277, 361, 365-366, 433
roller coaster curve, 71
safe coverage factor, 84, 187
safety critical software, 228-229
safety functional verification, 382
safety instrumented system (SIS), 4, 78, 80-81, 154, 170, 173, 187, 195, 201-203, 256, 359-361, 363, 366, 368, 372, 383-384, 395-396, 431
safety integrity levels (SIL), 4, 224, 361-362, 364, 367-368, 370
safety PLC, 78, 95, 187, 354
sample mean, 19, 21
sample variance, 21
series system, 73, 307
shock, 43, 46, 53-54, 203, 207
simplification techniques, 300
simulation, 49, 51, 72, 152, 168
single-board controller, 37
single-loop controllers, 306
software
    common-cause strength, 213
    diagnostics, 227
    diversity, 212
    error, 213
    fault avoidance, 226
    maturity model, 227
    strength, 212, 227
    testing, 227, 248-249
Software Engineering Institute, 227
standard deviation, 47
standard normal distribution, 25, 401
starting state, 159, 164, 169
state diagrams, 149
state merging, 300
state space, 229, 237, 423
statistics, 2, 10, 59, 365, 413
steady-state availability, 76, 156, 162
steady-state probabilities, 158
stress rejecter, 234
stress rejection, 226-227, 229
variance, 21-22, 24
vibration, 43
virtual infinity, 227, 230, 246
voting circuit, 92, 193, 330